Joann has worked as a developer in the Analytics and Artificial Intelligence industry and is experienced in data scraping, mining, etc.
In the past few years, several crimes have been solved by regular people with access to the internet. Someone even developed a serial killer detector. Whether you're a fan of true crime stories and just want to do some extra reading, or you want to use this crime-related information for your research, this article will help you collect, store, and search information from your websites of choice.
In another article, I wrote about loading information into Elasticsearch and searching through it. In this article, I will guide you through using regular expressions to extract structured data such as arrest dates, victim names, etc.
I'm using Python 3.6.8, but you can use other versions. Some of the syntax may differ, especially in Python 2.
First, you need to install Elasticsearch. You can download Elasticsearch and find installation instructions from the Elastic website.
Second, you need to install the Elasticsearch client for Python so that we can interact with Elasticsearch through our Python code. You can get the Elasticsearch client for Python by entering "pip install elasticsearch" in your terminal. If you want to explore this API further, you can refer to the Elasticsearch API documentation for Python.
Getting The Arrest Date
We will use two regular expressions to extract the arrest date for each criminal. I will not go into detail on how regular expressions work, but I will explain what each part of the two regular expressions in the code below does. I will be using the flag "re.I" for both so that characters are matched regardless of whether they're lowercase or uppercase.
You can improve these regular expressions or adjust them however you want. A good website that allows you to test your regular expressions is Regex 101.
Dates And Keywords
Line 6 looks for patterns that have the following things in order:
- The first three letters of each month. This captures "Feb" in "February", "Sep" in "September" and so on.
- One to four numbers. This captures either a day (1-2 digits) or a year (4 digits).
- With or without a comma.
- With (up to four) or without numbers. This captures a year (4 digits) but doesn't exclude results that have no year in them.
- The keywords related to arrests (synonyms).
Line 9 is similar to line 6 except it looks for patterns that have the words related to arrests followed by dates. If you run the code, you will get the result below.
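Since the article's code itself isn't reproduced here, the two patterns can be sketched roughly as follows. This is my own approximation, not the author's exact code: the month abbreviations, the keyword synonyms, and the 50-character gap between date and keyword are all assumptions you can adjust.

```python
import re

# Month abbreviation plus any trailing letters ("Feb" in "February", etc.)
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
# Arrest-related keywords and synonyms (extend as needed)
KEYWORDS = r"(?:arrest\w*|apprehend\w*|captur\w*|caught|seiz\w*)"
# Day-or-year digits, optional comma, optional year
DATE = MONTHS + r"\s\d{1,4},?\s?\d{0,4}"

# Pattern like the article's "line 6": a date followed by a keyword
date_then_keyword = re.compile(DATE + r".{0,50}?" + KEYWORDS, re.I)
# Pattern like the article's "line 9": a keyword followed by a date
keyword_then_date = re.compile(KEYWORDS + r".{0,50}?" + DATE, re.I)

text = "He was finally arrested on January 26, 1978 after a long manhunt."
print(keyword_then_date.findall(text))  # ['arrested on January 26, 1978']
```

Because both groups are non-capturing, findall() returns the whole matched phrase, which is what we want to clean up later.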
The Data Extraction Module
We can see that we captured phrases that contain a combination of arrest keywords and dates. In some phrases the date comes before the keywords; in the rest, the order is reversed. We can also see the synonyms we indicated in the regular expression, words like "seized", "caught", etc.
Now that we have the dates related to arrests, let's clean up these phrases a little and extract only the dates. I created a new Python file named "extract.py" and defined the method get_arrest_date(). This method accepts an "arrest_date" value and returns the date in MM/DD/YYYY format if it is complete, and in MM/DD or MM/YYYY format if not.
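The exact implementation isn't shown here, but based on that description, get_arrest_date() could look something like the sketch below. The parsing logic is my own guess at the behavior, not the author's code:

```python
# Map month abbreviations to two-digit month numbers: {"Jan": "01", ...}
MONTH_NUM = {m: f"{i:02d}" for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def get_arrest_date(arrest_date):
    """Turn e.g. 'January 26, 1978' into '01/26/1978'.

    Returns MM/DD/YYYY when the date is complete, MM/DD when the
    year is missing, and MM/YYYY when the day is missing.
    """
    parts = arrest_date.replace(",", "").split()
    month = MONTH_NUM[parts[0][:3].title()]   # first three letters pick the month
    nums = [p for p in parts[1:] if p.isdigit()]
    day = next((n for n in nums if len(n) <= 2), None)   # 1-2 digits = day
    year = next((n for n in nums if len(n) == 4), None)  # 4 digits = year
    if day and year:
        return f"{month}/{int(day):02d}/{year}"
    if day:
        return f"{month}/{int(day):02d}"
    return f"{month}/{year}"

print(get_arrest_date("January 26, 1978"))  # 01/26/1978
```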
We'll start using "extract.py" the same way we used "elastic.py" except this one will serve as our module that does everything related to data extraction. In line 3 of the code below, we imported the get_arrest_date() method from the module "extract.py".
You will notice that in line 7, I created a list named "arrests". When I was analyzing the data, I noticed that some of the subjects have been arrested multiple times for different crimes so I modified the code in order to capture all the arrest dates for each subject.
I also replaced the print statements with the code in lines 9 to 11 and 14 to 16. These lines split the result of the regular expression and trim it so that only the date remains. Any non-numerical item before and after "January 26, 1978", for example, is excluded. To give you a better idea, I printed out the result for each line below.
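A split-based version of that clean-up step might look like this. Again, this is my own variant rather than the article's exact lines: it drops the leading tokens before the month name and the trailing tokens after the last number.

```python
def trim_to_date(phrase):
    """Cut a matched phrase down to just the date part.

    E.g. 'arrested on January 26, 1978' -> 'January 26, 1978'.
    A bare word like 'May' can false-positive; the regex match
    already constrains what we feed in here.
    """
    months = ("jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec")
    tokens = phrase.split()
    # First token that starts with a month abbreviation
    start = next(i for i, t in enumerate(tokens)
                 if t.lower().startswith(months))
    # Last token that is a number (ignoring a trailing comma)
    end = max(i for i, t in enumerate(tokens)
              if t.strip(",").isdigit())
    return " ".join(tokens[start:end + 1])

print(trim_to_date("He was arrested on January 26, 1978"))  # January 26, 1978
```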
Now, if we run the "extract_dates.py" script, we will get the result below.
Updating Records In Elasticsearch
Now that we are able to extract the dates when each subject has been arrested, we will update each subject's record to add this information. To do this, we will update our existing "elastic.py" module and define the method es_update() in lines 17 to 20. This is similar to the previous es_insert() method. The only differences are the content of the body and the additional "id" parameter. These differences tell Elasticsearch that the information we're sending should be added to an existing record so that it doesn't create a new one.
Since we need the record's ID, I also updated the es_search() method to return it (see line 35).
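The two helpers could be sketched as below. This assumes an Elasticsearch client object is created elsewhere in "elastic.py" and passed in; the function names mirror the article's, but the bodies are my own approximation:

```python
def es_update(es, index, doc_id, **fields):
    # "doc" makes this a partial update: Elasticsearch merges the new
    # fields into the existing record identified by doc_id instead of
    # creating a new document.
    return es.update(index=index, id=doc_id, body={"doc": fields})

def es_search(es, index, **query):
    # Return each hit's source together with its record id, so callers
    # can pass the id back into es_update() later.
    match = {"match": query} if query else {"match_all": {}}
    res = es.search(index=index, body={"query": match})
    return [dict(hit["_source"], id=hit["_id"])
            for hit in res["hits"]["hits"]]
```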
We will now modify the "extract_dates.py" script so that it will update the Elasticsearch record and add the "arrests" column. To do this, we'll add the import for the es_update() method in line 2.
In line 20, we call that method and pass the arguments "truecrime" for the index name, val.get("id") for the ID of the record we want to update, and arrests=arrests to create a column named "arrests" where the value is the list of arrest dates we extracted.
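A self-contained sketch of that call site is below. Here es_update() is stubbed out and the search result is hard-coded so the snippet runs on its own; in the real script both come from the "elastic.py" module and the regex extraction above:

```python
updates = []

def es_update(index, doc_id, **fields):
    # Stub: record what would be sent to Elasticsearch.
    updates.append((index, doc_id, fields))

# One hard-coded hit standing in for the es_search() results
search_results = [{"id": "doc1"}]
for val in search_results:
    arrests = ["01/26/1978"]  # in the real script, extracted by the regexes
    es_update("truecrime", val.get("id"), arrests=arrests)

print(updates)  # [('truecrime', 'doc1', {'arrests': ['01/26/1978']})]
```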
When you run this code, you will see the result in the screenshot below. This means that the information has been updated in Elasticsearch. We can now search some of the records to see if the "arrests" column exists in them.
This is just an example of how to extract and transform data. In this tutorial, I do not intend to capture dates of all formats. We looked specifically for date formats like "January 28, 1989", and there could be other dates in the stories, like "09/22/2002", that our regular expression will not capture. It's up to you to adjust the code to better suit your project's needs.
Though some of the phrases indicate very clearly that the dates were arrest dates for the subject, it's possible to capture some dates that are not related to the subject. For example, some stories include past childhood experiences of the subject, and it's possible that they had parents or friends who committed crimes and were arrested. In that case, we may be extracting the arrest dates of those people and not of the subjects themselves.
We can cross-check this information by scraping more websites or comparing it with datasets from sites like Kaggle and checking how consistently those dates appear. Then we can set aside the few inconsistent ones, which we may have to verify manually by reading the stories.
Extracting More Information
I created a script to assist our searches. It allows you to view all the records, filter them by source or subject, and search for specific phrases. You can utilize the search for phrases if you want to extract more data and define more methods in the "extract.py" script.
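The helper script itself isn't shown here, but its filtering side could be sketched like this. This is entirely my own sketch operating on records already fetched from Elasticsearch; the function names and record fields ("subject", "source", "story") are assumptions:

```python
def filter_records(records, **criteria):
    # Keep records whose fields exactly match every criterion,
    # e.g. filter_records(records, source="site-a").
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

def search_phrase(records, phrase):
    # Case-insensitive phrase search over the story text.
    return [r for r in records
            if phrase.lower() in r.get("story", "").lower()]

records = [
    {"subject": "John Doe", "source": "site-a", "story": "Arrested on Feb 4."},
    {"subject": "Jane Roe", "source": "site-b", "story": "Captured in 1990."},
]
print(filter_records(records, source="site-a"))
```

Phrases surfaced this way are good starting points for new extraction methods in "extract.py".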
Now we can update existing records in Elasticsearch and extract and format structured data from unstructured data. I hope this tutorial, together with the first two, helped you get an idea of how to collect information for your research.
© 2019 Joann Mistica