True Crime Collection: Extracting Useful Information Using Regular Expressions - Owlcation - Education
True Crime Collection: Extracting Useful Information Using Regular Expressions

Joann has worked as a developer in the Analytics and Artificial Intelligence industry and is experienced in data scraping, mining, etc.

Introduction

In the past few years, several crimes have been solved by regular people who have access to the internet. Someone even developed a serial killer detector. Whether you're a fan of true crime stories and just want to do some extra reading or you want to use this crime-related information for your research, this article will help you collect, store, and search information from your websites of choice.

In another article, I wrote about loading information into Elasticsearch and searching through it. In this article, I will guide you through using regular expressions to extract structured data such as arrest dates, victim names, etc.

Requirements

Python

I'm using Python 3.6.8, but you can use other versions. Some of the syntax could be different, especially for Python 2 versions.

Elasticsearch

First, you need to install Elasticsearch. You can download Elasticsearch and find installation instructions from the Elastic website.

Second, you need to install the Elasticsearch client for Python so that we can interact with Elasticsearch through our Python code. You can get the Elasticsearch client for Python by entering "pip install elasticsearch" in your terminal. If you want to explore this API further, you can refer to the Elasticsearch API documentation for Python.

Getting The Arrest Date

We will use two regular expressions to extract the arrest date for each criminal. I will not go into detail on how regular expressions work, but I will explain what each part of the two regular expressions in the code below does. I will be using the flag "re.I" for both so that they match characters regardless of whether they are in lowercase or uppercase.

You can improve these regular expressions or adjust them however you want. A good website that allows you to test your regular expressions is Regex 101.

extract_dates.py

import re
from elastic import es_search


for val in es_search():
    for result in re.finditer(r'(\w+\W+){0}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}(\w+\W+){1,10}(captured|caught|seized|arrested|apprehended)', val.get("story"), flags=re.I):
        print(result.group())

    for result in re.finditer(r'(\w+\W+){0}(captured|caught|seized|arrested|apprehended)\s(\w+\W+){1,10}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}', val.get("story"), flags=re.I):
        print(result.group())

Capture                    Regular Expression

Month                      (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)
Day or year                \d{1,4}
With or without a comma    ,?
With or without a year     \d{0,4}
Arrest keywords            (captured|caught|seized|arrested|apprehended)

Dates And Keywords

Line 6 looks for patterns that have the following things in order:

  1. The first three letters of each month, followed by the rest of the word. This captures "Feb" in "February", "Sep" in "September" and so on.
  2. One to four digits. This captures either a day (1-2 digits) or a year (4 digits).
  3. With or without a comma.
  4. With (up to four) or without digits. This captures a year (4 digits) but doesn't exclude results that have no year in them.
  5. One to ten words between the date and the keyword.
  6. The keywords related to arrests (synonyms).

Line 9 is similar to line 6 except it looks for patterns that have the words related to arrests followed by dates. If you run the code, you will get the result below.

The result of the regular expression for arrest dates.
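
If you don't have Elasticsearch running yet, the two patterns can be tried on plain strings first. The snippet below is a minimal sketch: the sample sentences are made up, the constant names are my own, and the leading `(\w+\W+){0}` group is left out because a group repeated zero times matches nothing.

```python
import re

MONTHS = r'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)'
KEYWORDS = r'(captured|caught|seized|arrested|apprehended)'

# Date first, then one to ten words, then an arrest keyword.
DATE_THEN_KEYWORD = re.compile(
    MONTHS + r'(\w+\W+)\d{1,4},?\s\d{0,4}(\w+\W+){1,10}' + KEYWORDS, flags=re.I)
# Arrest keyword first, then one to ten words, then the date.
KEYWORD_THEN_DATE = re.compile(
    KEYWORDS + r'\s(\w+\W+){1,10}' + MONTHS + r'(\w+\W+)\d{1,4},?\s\d{0,4}', flags=re.I)

# Made-up sentences, not taken from any of the scraped stories.
samples = [
    "On January 26, 1978 he was finally arrested at home.",
    "Police apprehended the suspect on March 3, 1995 after a tip.",
]

matches = []
for text in samples:
    for pattern in (DATE_THEN_KEYWORD, KEYWORD_THEN_DATE):
        matches.extend(m.group() for m in pattern.finditer(text))

for phrase in matches:
    print(phrase)
# January 26, 1978 he was finally arrested
# apprehended the suspect on March 3, 1995
```

Each sentence is matched by only one of the two patterns, which is why both are needed.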

The Data Extraction Module

We can see that we captured phrases that have a combination of arrest keywords and dates. In some phrases, the date comes before the keywords; in the rest, the order is reversed. We can also see the synonyms we've indicated in the regular expression, words like "seized", "caught", etc.

Now that we have the dates related to arrests, let's clean up these phrases a little bit and extract only the dates. I created a new Python file named "extract.py" and defined the method get_arrest_date(). This method accepts an "arrest_date" value (a list of the date's parts, such as the month name, day, and year) and returns the date in MM/DD/YYYY format if it is complete, or in MM/DD or MM/YYYY format if it is not.

extract.py

from datetime import datetime


def get_arrest_date(arrest_date):
    """Format a list of date parts, e.g. ["January", "26", "1978"], as a date string."""
    if len(arrest_date) == 3:
        # Month, day, and year
        arrest_date = datetime.strptime(" ".join(arrest_date), "%B %d %Y").strftime("%m/%d/%Y")
    elif len(arrest_date[1]) <= 2:
        # Month and day only (the second part has one or two digits)
        arrest_date = datetime.strptime(" ".join(arrest_date), "%B %d").strftime("%m/%d")
    else:
        # Month and year only
        arrest_date = datetime.strptime(" ".join(arrest_date), "%B %Y").strftime("%m/%Y")
    return arrest_date
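
To see the three output formats side by side, here is a small usage sketch. It repeats the function so that it runs on its own, uses made-up dates, and assumes an English locale, since "%B" parses full month names in the current locale.

```python
from datetime import datetime


def get_arrest_date(arrest_date):
    # Same logic as in extract.py, repeated so this snippet is self-contained.
    if len(arrest_date) == 3:
        return datetime.strptime(" ".join(arrest_date), "%B %d %Y").strftime("%m/%d/%Y")
    if len(arrest_date[1]) <= 2:
        return datetime.strptime(" ".join(arrest_date), "%B %d").strftime("%m/%d")
    return datetime.strptime(" ".join(arrest_date), "%B %Y").strftime("%m/%Y")


complete = get_arrest_date(["January", "26", "1978"])  # month, day, year
month_day = get_arrest_date(["March", "3"])            # no year
month_year = get_arrest_date(["June", "1984"])         # no day

print(complete, month_day, month_year)  # 01/26/1978 03/03 06/1984
```

Because "%B" expects a full month name, a shortened form such as "Sept" would raise a ValueError, so you may want to handle that case if your sources abbreviate months.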

We'll start using "extract.py" the same way we used "elastic.py" except this one will serve as our module that does everything related to data extraction. In line 3 of the code below, we imported the get_arrest_date() method from the module "extract.py".

extract_dates.py

import re
from elastic import es_search
from extract import get_arrest_date


for val in es_search():
    arrests = list()
    for result in re.finditer(r'(\w+\W+){0}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}(\w+\W+){1,10}(captured|caught|seized|arrested|apprehended)', val.get("story"), flags=re.I):
        words = result.group().replace(",", "").split()
        arrest_date = words[:(3 if words[2].isdigit() else 2)]
        arrests.append(get_arrest_date(arrest_date))

    for result in re.finditer(r'(\w+\W+){0}(captured|caught|seized|arrested|apprehended)\s(\w+\W+){1,10}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}', val.get("story"), flags=re.I):
        words = result.group().replace(",", "").split()
        arrest_date = words[(-3 if words[-2].isdigit() else -2):]
        arrests.append(get_arrest_date(arrest_date))

    if len(arrests) > 0:
        print(val.get("subject"), arrests)

Multiple Arrests

You will notice that in line 7, I created a list named "arrests". When I was analyzing the data, I noticed that some of the subjects have been arrested multiple times for different crimes so I modified the code in order to capture all the arrest dates for each subject.

I also replaced the print statements with the code in lines 9 to 11 and 14 to 16. These lines split the result of the regular expression and cut it so that only the date remains. Any non-numerical item before and after January 26, 1978, for example, is excluded. To give you a better idea, I printed out the result for each line below.

A step-by-step extraction of the date.
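
The same steps can be followed on plain strings. This is a minimal sketch with made-up phrases shaped like the matches the two regular expressions return; the variable names are my own.

```python
# Hypothetical matches: one with the date first, one with the keyword first.
date_first = "January 26, 1978, he was arrested"
keyword_first = "arrested again on March 3, 1995"
no_year = "March 3 he was caught"

# Date-first phrases: drop the commas, split into words, keep the leading date.
words = date_first.replace(",", "").split()
print(words)  # ['January', '26', '1978', 'he', 'was', 'arrested']
front = words[:(3 if words[2].isdigit() else 2)]
print(front)  # ['January', '26', '1978']

# Keyword-first phrases: the date is at the end, so slice from the back.
words = keyword_first.replace(",", "").split()
back = words[(-3 if words[-2].isdigit() else -2):]
print(back)  # ['March', '3', '1995']

# Without a year, the third word is not a number, so only two words are kept.
words = no_year.replace(",", "").split()
short = words[:(3 if words[2].isdigit() else 2)]
print(short)  # ['March', '3']
```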

Now, if we run the "extract_dates.py" script, we will get the result below.

Each subject followed by their arrest date(s).

Updating Records In Elasticsearch

Now that we are able to extract the dates when each subject has been arrested, we will update each subject's record to add this information. To do this, we will update our existing "elastic.py" module and define the method es_update() in lines 17 to 20. This is similar to the previous es_insert() method. The only differences are the content of the body and the additional "id" parameter. These differences tell Elasticsearch that the information we're sending should be added to an existing record so that it doesn't create a new one.

Since we need the record's ID, I also updated the es_search() method to return it (see line 35).

elastic.py

import json
from elasticsearch import Elasticsearch
es = Elasticsearch()


def es_insert(category, source, subject, story, **extras):
    doc = {
        "source": source,
        "subject": subject,
        "story": story,
        **extras,
    }
    res = es.index(index=category, doc_type="story", body=doc)
    print(res["result"])


def es_update(category, id, **extras):
    body = {"doc": {**extras}}  # partial update: only these fields are added or changed
    res = es.update(index=category, doc_type="story", id=id, body=body)
    print(res["result"])


def es_search(**filters):
    result = dict()
    result_set = list()
    search_terms = list()
    for key, value in filters.items():
        search_terms.append({"match": {key: value}})

    print("Search terms:", search_terms)
    size = es.count(index="truecrime").get("count")
    res = es.search(index="truecrime", size=size, body=json.dumps({"query": {"bool": {"must": search_terms}}}))
    for hit in res["hits"]["hits"]:
        result = {"total": res["hits"]["total"],
                  "id": hit["_id"],
                  "source": hit["_source"]["source"],
                  "subject": hit["_source"]["subject"],
                  "story": hit["_source"]["story"]}
        if "quote" in hit["_source"]:
            result.update({"quote": hit["_source"]["quote"]})
        result_set.append(result)
    return result_set
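
For reference, a partial update in Elasticsearch carries the new fields under a "doc" key in the request body, and those fields are merged into the stored record. The sketch below only builds that body, so it runs without a server; build_update_body() is a hypothetical helper and not part of the Elasticsearch client.

```python
def build_update_body(**extras):
    """Wrap the new fields in the "doc" key used for partial updates."""
    return {"doc": dict(extras)}


# Made-up arrest date, in the format returned by get_arrest_date().
body = build_update_body(arrests=["01/26/1978"])
print(body)  # {'doc': {'arrests': ['01/26/1978']}}
```

Fields that are not listed under "doc" are left as they are in the existing record.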

We will now modify the "extract_dates.py" script so that it will update the Elasticsearch record and add the "arrests" column. To do this, we'll add the import for the es_update() method in line 2.

In line 20, we call that method and pass the arguments "truecrime" for the index name, val.get("id") for the ID of the record we want to update, and arrests=arrests to create a column named "arrests" where the value is the list of arrest dates we extracted.

extract_dates.py

import re
from elastic import es_search, es_update
from extract import get_arrest_date


for val in es_search():
    arrests = list()
    for result in re.finditer(r'(\w+\W+){0}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}(\w+\W+){1,10}(captured|caught|seized|arrested|apprehended)', val.get("story"), flags=re.I):
        words = result.group().replace(",", "").split()
        arrest_date = words[:(3 if words[2].isdigit() else 2)]
        arrests.append(get_arrest_date(arrest_date))

    for result in re.finditer(r'(\w+\W+){0}(captured|caught|seized|arrested|apprehended)\s(\w+\W+){1,10}(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\w+\W+)\d{1,4},?\s\d{0,4}', val.get("story"), flags=re.I):
        words = result.group().replace(",", "").split()
        arrest_date = words[(-3 if words[-2].isdigit() else -2):]
        arrests.append(get_arrest_date(arrest_date))

    if len(arrests) > 0:
        print(val.get("subject"), arrests)
        es_update("truecrime", val.get("id"), arrests=arrests)

When you run this code, you will see the result in the screenshot below. This means that the information has been updated in Elasticsearch. We can now search some of the records to see if the "arrests" column exists in them.

The result of successful update for each subject.

No arrest date was extracted from the Criminal Minds website for Gacy. One arrest date was extracted from the Bizarrepedia website.

Three arrest dates were extracted from the Criminal Minds website for Goudeau.

Disclaimer

Extraction

This is just an example of how to extract and transform the data. In this tutorial, I do not intend to capture all the dates of all formats. We looked specifically for date formats like "January 28, 1989", and there could be other dates in the stories, like "09/22/2002", that our regular expressions will not capture. It's up to you to adjust the code to better suit your project's needs.
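
As one possible adjustment, numeric dates such as "09/22/2002" could be picked up with an extra pattern. This is only a sketch under a few assumptions: find_numeric_dates() is a hypothetical helper, the dates are assumed to be written month first, and the sample sentence is made up.

```python
import re
from datetime import datetime

# One or two digits, a slash, one or two digits, a slash, a four-digit year.
NUMERIC_DATE = re.compile(r'\b\d{1,2}/\d{1,2}/\d{4}\b')


def find_numeric_dates(story):
    """Return the MM/DD/YYYY dates found in a story, skipping impossible ones."""
    dates = []
    for match in NUMERIC_DATE.finditer(story):
        try:
            parsed = datetime.strptime(match.group(), "%m/%d/%Y")
        except ValueError:
            continue  # e.g. 13/45/2002 is not a real date
        dates.append(parsed.strftime("%m/%d/%Y"))
    return dates


sample = "He was arrested on 09/22/2002 and released on 1/5/2003."
found = find_numeric_dates(sample)
print(found)  # ['09/22/2002', '01/05/2003']
```

Unlike the two patterns used earlier, this one does not look for arrest keywords near the date, so it would need the same keyword context added before its results could be treated as arrest dates.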

Verification

Though some of the phrases indicate very clearly that the dates were arrest dates for the subject, it's possible to capture some dates not related to the subject. For example, some stories include some past childhood experiences of the subject and it's possible that they have parents or friends who committed crimes and were arrested. In that case, we may be extracting the arrest dates for those people and not the subjects themselves.

We can cross-check this information by scraping information from more websites or comparing it with datasets from sites like Kaggle and checking how consistently those dates appear. Then we can set aside the few inconsistent ones, which we may have to verify manually by reading the stories.
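
A rough way to check consistency is to count how many sources agree on each date. The sketch below assumes the extracted dates for one subject have already been grouped by source; the source names, the dates, and the split_by_agreement() helper are all made up for illustration.

```python
from collections import Counter

# Hypothetical extraction results for a single subject, grouped by source.
extracted = {
    "Source A": ["01/26/1978", "06/14/1984"],
    "Source B": ["01/26/1978"],
    "Source C": ["01/26/1978", "03/03/1995"],
}


def split_by_agreement(dates_by_source, min_sources=2):
    """Separate dates confirmed by several sources from those needing review."""
    counts = Counter()
    for dates in dates_by_source.values():
        counts.update(set(dates))  # count each date once per source
    consistent = sorted(d for d, n in counts.items() if n >= min_sources)
    to_review = sorted(d for d, n in counts.items() if n < min_sources)
    return consistent, to_review


consistent, to_review = split_by_agreement(extracted)
print("Consistent:", consistent)  # Consistent: ['01/26/1978']
print("To review:", to_review)    # To review: ['03/03/1995', '06/14/1984']
```

The threshold of two sources is arbitrary. A date that appears in only one source is not necessarily wrong, so the review list is a starting point for manual reading rather than a list of errors.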

Extracting More Information

I created a script to assist our searches. It allows you to view all the records, filter them by source or subject, and search for specific phrases. You can utilize the search for phrases if you want to extract more data and define more methods in the "extract.py" script.

truecrime_search.py

import re
from elastic import es_search


def display_prompt():
    print("\n----- OPTIONS -----")
    print("   v - view all")
    print("   s - search\n")

    return input("Option: ").lower()


def display_result(result):
    for ndx, val in enumerate(result):
        print("\n----------\n")
        print("Story", ndx + 1, "of", val.get("total"))
        print("Source:", val.get("source"))
        print("Subject:", val.get("subject"))
        print(val.get("story"))


def display_search():
    print("\n----- SEARCH -----")
    print("    s - search by story source")
    print("    n - search by subject name")
    print("    p - search for phrase(s) in stories\n")

    search = input("Search: ").lower()
    if search == "s":
        search_term = input("Story Source: ")
        display_result(es_search(source=search_term))
    elif search == "n":
        search_term = input("Subject Name: ")
        display_result(es_search(subject=search_term))
    elif search == "p":
        search_term = input("Phrase(s) in Stories: ")
        # Escape the phrase so characters like "." or "(" are matched literally,
        # then capture up to ten words before and after it
        pattern = r'(\w+\W+){0,10}' + re.escape(search_term) + r'\s+(\w+\W+){0,10}'
        resno = 1
        for val in es_search(story=search_term):
            for result in re.finditer(pattern, val.get("story"), flags=re.I):
                print("Result", resno, "\n", " ".join(result.group().split("\n")))
                resno += 1
    else:
        print("\nInvalid search option. Please try again.")
        display_search()


# Keep prompting until a valid option is chosen
while True:
    option = display_prompt()
    if option == "v":
        display_result(es_search())
    elif option == "s":
        display_search()
    else:
        print("\nInvalid option. Please try again.\n")
        continue
    break

Sample usage of the search for phrases, search for "victim was".

Search results for the phrase "victim was".
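
The phrase search can also be tried without Elasticsearch. The sketch below applies the same kind of pattern to a made-up story held in memory; find_phrase_context() and its "window" parameter are hypothetical names, and the window is shortened to three words so the effect is easy to see.

```python
import re


def find_phrase_context(story, phrase, window=10):
    """Return each occurrence of a phrase with up to `window` words around it."""
    # re.escape keeps characters such as "." or "(" in the phrase literal.
    pattern = (r'(\w+\W+){0,%d}' % window
               + re.escape(phrase)
               + r'\s+(\w+\W+){0,%d}' % window)
    # Collapse line breaks and repeated spaces so each result prints on one line.
    return [" ".join(m.group().split())
            for m in re.finditer(pattern, story, flags=re.I)]


# Made-up story text, including a line break like the scraped stories may have.
story = "Neighbors later said the victim was last seen\nleaving a diner late on Friday night."
results = find_phrase_context(story, "victim was", window=3)
print(results)  # ['later said the victim was last seen leaving']
```

A wider window gives more context for deciding what to extract next, at the cost of longer results to read through.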

Finally

Now we can extract and format structured data from unstructured data and update existing records in Elasticsearch. I hope this tutorial, along with the first two, gave you an idea of how to collect information for your research.

© 2019 Joann Mistica