True Crime Collection: Extracting Useful Information Using Regular Expressions

Updated on November 2, 2019

Introduction

In the past few years, several crimes have been solved by regular people who have access to the internet. Someone even developed a serial killer detector. Whether you're a fan of true crime stories who just wants to do some extra reading, or you want to use this crime-related information for your research, this article will help you collect, store, and search information from your websites of choice.

In another article, I wrote about loading information into Elasticsearch and searching through it. In this article, I will guide you through using regular expressions to extract structured data such as arrest dates, victim names, etc.

Requirements

Python

I'm using Python 3.6.8, but you can use other versions. Some of the syntax may differ, especially in Python 2.

Elasticsearch

First, you need to install Elasticsearch. You can download Elasticsearch and find installation instructions from the Elastic website.

Second, you need to install the Elasticsearch client for Python so that we can interact with Elasticsearch through our Python code. You can get the Elasticsearch client for Python by entering "pip install elasticsearch" in your terminal. If you want to explore this API further, you can refer to the Elasticsearch API documentation for Python.

Getting The Arrest Date

We will use two regular expressions to extract the arrest date for each criminal. I will not go into detail on how regular expressions work, but I will explain what each part of the two regular expressions in the code below does. I will be using the flag "re.I" for both so that characters are matched regardless of case.

You can improve these regular expressions or adjust them however you want. A good website that allows you to test your regular expressions is Regex 101.

Extract_dates.py (Version 1)

import re
from elastic import es_search


for val in es_search():
    for result in re.finditer(r'(\w+\W+){0}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}(\w+\W+){1,10}(captured|caught|seized|arrested|apprehended)', val.get("story"), flags=re.I):
        print(result.group())

    for result in re.finditer(r'(\w+\W+){0}(captured|caught|seized|arrested|apprehended)\s(\w+\W+){1,10}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}', val.get("story"), flags=re.I):
        print(result.group())
Capture                     Regular Expression
Month                       (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)
Day or year                 \d{1,4}
With or without a comma     ,?
With or without a year      \d{0,4}
Arrest keywords             (captured|caught|seized|arrested|apprehended)

Dates And Keywords

Line 6 looks for patterns that have the following things in order:

  1. The first three letters of each month. This captures "Feb" in "February", "Sep" in "September" and so on.
  2. One to four numbers. This captures either a day (1-2 digits) or a year (4 digits).
  3. With or without a comma.
  4. With (up to four) or without numbers. This captures a year (4 digits) but doesn't exclude results that have no year in them.
  5. The keywords related to arrests (synonyms).
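To make the steps above concrete, here's a minimal sketch that runs the date-then-keyword pattern against a made-up sentence (the sentence and match below are for illustration only):

```python
import re

# The date-first pattern, split into the parts from the table above.
pattern = (r'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)'  # month
           r'\d{1,4},?\s\d{0,4}'                                        # day/year, optional comma, optional year
           r'(\w+\W+){1,10}'                                            # up to ten in-between words
           r'(captured|caught|seized|arrested|apprehended)')             # arrest keywords
text = "Bundy was on the run until February 15, 1978, when he was finally arrested in Pensacola."
match = re.search(pattern, text, flags=re.I)
print(match.group())  # February 15, 1978, when he was finally arrested
```

Note how "feb" matches the start of "February" thanks to re.I, and how the match stops at the arrest keyword.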

Line 9 is similar to line 6 except it looks for patterns that have the words related to arrests followed by dates. If you run the code, you will get the result below.

The result of the regular expression for arrest dates.

The Data Extraction Module

We can see that we captured phrases that have a combination of arrest keywords and dates. In some phrases, the date comes before the keywords; in the rest, the order is reversed. We can also see the synonyms we indicated in the regular expression, words like "seized", "caught", etc.

Now that we have the dates related to arrests, let's clean up these phrases a little and extract only the dates. I created a new Python file named "extract.py" and defined the method get_arrest_date(). This method accepts an "arrest_date" value and returns the date in MM/DD/YYYY format if it is complete, or in MM/DD or MM/YYYY format if not.

Extract.py

from datetime import datetime


def get_arrest_date(arrest_date):
    if len(arrest_date) == 3:
        arrest_date = datetime.strptime(" ".join(arrest_date), "%B %d %Y").strftime("%m/%d/%Y")
    elif len(arrest_date[1]) <= 2:
        arrest_date = datetime.strptime(" ".join(arrest_date), "%B %d").strftime("%m/%d")
    else:
        arrest_date = datetime.strptime(" ".join(arrest_date), "%B %Y").strftime("%m/%Y")
    return arrest_date
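The three branches above boil down to how strptime() and strftime() handle each case. A quick illustration with made-up dates:

```python
from datetime import datetime

# Month, day, and year present -> MM/DD/YYYY
full = datetime.strptime("January 26 1978", "%B %d %Y").strftime("%m/%d/%Y")
# Month and day only -> MM/DD (strptime fills in a default year, which we drop)
partial_day = datetime.strptime("January 26", "%B %d").strftime("%m/%d")
# Month and year only -> MM/YYYY
partial_year = datetime.strptime("January 1978", "%B %Y").strftime("%m/%Y")
print(full, partial_day, partial_year)  # 01/26/1978 01/26 01/1978
```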

We'll use "extract.py" the same way we used "elastic.py", except this module will handle everything related to data extraction. In line 3 of the code below, we import the get_arrest_date() method from the "extract.py" module.

Extract_dates.py (Version 2)

import re
from elastic import es_search
from extract import get_arrest_date


for val in es_search():
    arrests = list()
    for result in re.finditer(r'(\w+\W+){0}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}(\w+\W+){1,10}(captured|caught|seized|arrested|apprehended)', val.get("story"), flags=re.I):
        words = result.group().replace(",", "").split()
        arrest_date = words[:(3 if words[2].isdigit() else 2)]
        arrests.append(get_arrest_date(arrest_date))

    for result in re.finditer(r'(\w+\W+){0}(captured|caught|seized|arrested|apprehended)\s(\w+\W+){1,10}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}', val.get("story"), flags=re.I):
        words = result.group().replace(",", "").split()
        arrest_date = words[(-3 if words[-2].isdigit() else -2):]
        arrests.append(get_arrest_date(arrest_date))

    print(val.get("subject"), arrests) if len(arrests) > 0 else None

Multiple Arrests

You will notice that in line 7, I created a list named "arrests". When I was analyzing the data, I noticed that some of the subjects have been arrested multiple times for different crimes so I modified the code in order to capture all the arrest dates for each subject.

I also replaced the print statements with the code in lines 9 to 11 and 14 to 16. These lines split the result of the regular expression and trim it so that only the date remains. Any non-date words before and after "January 26, 1978", for example, are excluded. To give you a better idea, I printed out the result for each line below.
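Here's the same slicing logic on two made-up phrases, one with the date at the start of the match and one with the date at the end (the phrases are for illustration only):

```python
# Date at the start of the match: keep three tokens if the third token is a
# number (a year is present), otherwise two.
words1 = "January 26, 1978, he was arrested".replace(",", "").split()
date1 = words1[:3 if words1[2].isdigit() else 2]

# Date at the end of the match: the same idea with negative indices.
words2 = "arrested on January 26, 1978".replace(",", "").split()
date2 = words2[-3 if words2[-2].isdigit() else -2:]

print(date1, date2)  # ['January', '26', '1978'] ['January', '26', '1978']
```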

A step-by-step extraction of the date.

Now, if we run the "extract_dates.py" (version 2) script, we will get the result below.

Each subject followed by their arrest date(s).

Updating Records In Elasticsearch

Now that we are able to extract the dates when each subject was arrested, we will update each subject's record to add this information. To do this, we will update our existing "elastic.py" module and define the method es_update() in lines 17 to 20. This is similar to the previous es_insert() method. The only differences are the content of the body and the additional "id" parameter. These differences tell Elasticsearch that the information we're sending should be added to an existing record so that it doesn't create a new one.

Since we need the record's ID, I also updated the es_search() method to return it as well (see line 35).

Elastic.py

import json
from elasticsearch import Elasticsearch
es = Elasticsearch()


def es_insert(category, source, subject, story, **extras):
    doc = {
        "source": source,
        "subject": subject,
        "story": story,
        **extras,
    }
    res = es.index(index=category, doc_type="story", body=doc)
    print(res["result"])


def es_update(category, id, **extras):
    body = {"doc": {**extras}}
    res = es.update(index=category, doc_type="story", id=id, body=body)
    print(res["result"])


def es_search(**filters):
    result = dict()
    result_set = list()
    search_terms = list()
    for key, value in filters.items():
        search_terms.append({"match": {key: value}})

    print("Search terms:", search_terms)
    size = es.count(index="truecrime").get("count")
    res = es.search(index="truecrime", size=size, body=json.dumps({"query": {"bool": {"must": search_terms}}}))
    for hit in res["hits"]["hits"]:
        result = {"total": res["hits"]["total"], \
                        "id": hit["_id"], \
                        "source": hit["_source"]["source"], \
                        "subject": hit["_source"]["subject"], \
                        "story": hit["_source"]["story"]}
        if "quote" in hit["_source"]:
            result.update({"quote": hit["_source"]["quote"]})
        result_set.append(result)
    return result_set

We will now modify the "extract_dates.py" script so that it will update the Elasticsearch record and add the "arrests" column. To do this, we'll add the import for the es_update() method in line 2.

In line 20, we call that method and pass the arguments "truecrime" for the index name, val.get("id") for the ID of the record we want to update, and arrests=arrests to create a column named "arrests" where the value is the list of arrest dates we extracted.

Extract_dates.py (Version 3)

import re
from elastic import es_search, es_update
from extract import get_arrest_date


for val in es_search():
    arrests = list()
    for result in re.finditer(r'(\w+\W+){0}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}(\w+\W+){1,10}(captured|caught|seized|arrested|apprehended)', val.get("story"), flags=re.I):
        words = result.group().replace(",", "").split()
        arrest_date = words[:(3 if words[2].isdigit() else 2)]
        arrests.append(get_arrest_date(arrest_date))

    for result in re.finditer(r'(\w+\W+){0}(captured|caught|seized|arrested|apprehended)\s(\w+\W+){1,10}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}', val.get("story"), flags=re.I):
        words = result.group().replace(",", "").split()
        arrest_date = words[(-3 if words[-2].isdigit() else -2):]
        arrests.append(get_arrest_date(arrest_date))

    if len(arrests) > 0:
        print(val.get("subject"), arrests)
        es_update("truecrime", val.get("id"), arrests=arrests)

When you run this code, you will see the result in the screenshot below. This means that the information has been updated in Elasticsearch. We can now search some of the records to see if the "arrests" column exists in them.

The result of successful update for each subject.
No arrest date was extracted from the Criminal Minds website for Gacy. One arrest date was extracted from the Bizarrepedia website.
Three arrest dates were extracted from the Criminal Minds website for Goudeau.

Disclaimer

Extraction

This is just an example of how to extract and transform the data. In this tutorial, I do not intend to capture all dates in all formats. We looked specifically for date formats like "January 28, 1989", and there could be other dates in the stories, like "09/22/2002", that our regular expression will not capture. It's up to you to adjust the code to better suit your project's needs.
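For instance, a separate pattern for numeric dates could be run alongside the existing ones. This sketch is an extension of my own, not part of the original script:

```python
import re

# Matches numeric dates such as "09/22/2002" (1-2 digit month and day, 4-digit year).
numeric_date = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
matches = numeric_date.findall("He was arrested on 09/22/2002 in Phoenix.")
print(matches)  # ['09/22/2002']
```

You would still need to decide how to merge these hits with the keyword-based patterns and how to normalize the resulting format.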

Verification

Though some of the phrases indicate very clearly that the dates were arrest dates for the subject, it's possible to capture dates not related to the subject. For example, some stories describe the subject's childhood, and the subject may have parents or friends who committed crimes and were arrested. In that case, we may be extracting the arrest dates of those people and not of the subjects themselves.

We can cross-check this information by scraping more websites or by comparing it with datasets from sites like Kaggle and checking how consistently those dates appear. We can then set aside the few inconsistent ones and verify them manually by reading the stories.

Extracting More Information

I created a script to assist our searches. It allows you to view all the records, filter them by source or subject, and search for specific phrases. The phrase search is useful if you want to extract more data and define more methods in the "extract.py" script.

Truecrime_search.py

import re
from elastic import es_search


def display_prompt():
    print("\n----- OPTIONS -----")
    print("   v - view all")
    print("   s - search\n")

    return input("Option: ").lower()

def display_result(result):
    for ndx, val in enumerate(result):
        print("\n----------\n")
        print("Story", ndx + 1, "of", val.get("total"))
        print("Source:", val.get("source"))
        print("Subject:", val.get("subject"))
        print(val.get("story"))

def display_search():
    print("\n----- SEARCH -----")
    print("    s - search by story source")
    print("    n - search by subject name")
    print("    p - search for phrase(s) in stories\n")

    search = input("Search: ").lower()
    if search == "s":
        search_term = input("Story Source: ")
        display_result(es_search(source=search_term))
    elif search == "n":
        search_term = input("Subject Name: ")
        display_result(es_search(subject=search_term))
    elif search == "p":
        search_term = input("Phrase(s) in Stories: ")
        resno = 1
        for val in es_search(story=search_term):
            for result in re.finditer(r'(\w+\W+){0,10}' + search_term + r'\s+(\w+\W+){0,10}',
                                      val.get("story"), flags=re.I):
                print("Result", resno, "\n", " ".join(result.group().split("\n")))
                resno += 1
    else:
        print("\nInvalid search option. Please try again.")
        display_search()

while True:
    option = display_prompt()
    if option == "v":
        display_result(es_search())
    elif option == "s":
        display_search()
    else:
        print("\nInvalid option. Please try again.\n")
        continue
    break
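As a side note on the phrase search: the regex in the script grabs up to ten words of context on either side of the phrase. Here's a standalone sketch on a made-up snippet; I've added re.escape() so that any regex metacharacters typed into the search are treated literally (the script above passes the phrase through unescaped):

```python
import re

phrase = "victim was"
story = "Police say the victim was last seen leaving a bar downtown."
# Up to ten words of context before and after the phrase.
pattern = r'(\w+\W+){0,10}' + re.escape(phrase) + r'\s+(\w+\W+){0,10}'
results = [m.group() for m in re.finditer(pattern, story, flags=re.I)]
print(results)  # ['Police say the victim was last seen leaving a bar downtown.']
```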
Sample usage of the search for phrases, search for "victim was".
Search results for the phrase "victim was".

Finally

Now we can extract structured data from unstructured stories, format it, and update existing records in Elasticsearch. I hope this tutorial, together with the first two, helped you get an idea of how to collect information for your research.

© 2019 Joann Mistica