Joann has worked as a developer in the Analytics and Artificial Intelligence industry and is experienced in data scraping, mining, etc.
In the past few years, several crimes have been solved by regular people who have access to the internet. Someone even developed a serial killer detector. Whether you're a fan of true crime stories who just wants some extra reading or you want to use this crime-related information for your research, this article will help you collect, store, and search information from your websites of choice.
In another article, I wrote about collecting information from two different website formats using Beautiful Soup and Python. In this article, I will guide you through loading that information into Elasticsearch and searching through it.
I'm using Python 3.6.8, but you can use other versions. Some of the syntax may differ, especially in Python 2.
Once you're done installing Python, you can get Beautiful Soup by entering "pip install beautifulsoup4" in your terminal. For more information, visit Crummy.
First, you need to install Elasticsearch. You can download Elasticsearch and find installation instructions from the Elastic website.
Second, you need to install the Elasticsearch client for Python so that we can interact with Elasticsearch through our Python code. You can get the Elasticsearch client for Python by entering "pip install elasticsearch" in your terminal. If you want to explore this API further, you can refer to the Elasticsearch API documentation for Python.
Once you have Elasticsearch, you will have a folder named "elasticsearch" including the version number. Inside that folder is a "bin" folder that contains executable files.
To run Elasticsearch, you can do either of the following:
Option 1: Double-click the "elasticsearch" file in the "bin" folder and click the "Run in Terminal" button.
Option 2: Navigate to that directory using your terminal and enter "./elasticsearch".
The steps above should start the server. Keep that terminal open and enter http://127.0.0.1:9200/ in your browser's address bar. If there are no issues, you will get something similar to the screenshot below.
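If you prefer to check the server from Python instead of the browser, a minimal sketch could look like the one below. It assumes the server is listening on the default address used above. The function name server_info() and its "opener" parameter are hypothetical names chosen for illustration; they are not part of Elasticsearch.

```python
import json
from urllib.request import urlopen


def server_info(url="http://127.0.0.1:9200/", opener=urlopen):
    """Fetch the JSON document that Elasticsearch serves at its root URL.

    `opener` is any callable that takes a URL and returns a file-like
    object; it defaults to urllib's urlopen and exists only so the
    function can be tried out without a running server.
    """
    with opener(url) as response:
        return json.load(response)
```

Calling server_info() while the server is running should return a dictionary that includes the version number under "version" -> "number" and the tagline "You Know, for Search". If the server is not running, urlopen raises an error instead.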
Loading Data Into Elasticsearch
In the previous article, we scraped two different websites using Beautiful Soup, which gave us two Python scripts: one for Bizarrepedia and another for Criminal Minds. We might add more websites and scripts in the future, so instead of writing the Elasticsearch insert code once and pasting it into every script, we will create a separate Python script specifically for Elasticsearch transactions.
For this article, I will create a file named "elastic.py" in the same directory where all the web scraper files are located.
In the first two lines, we import the Elasticsearch module and create an instance of it so we can use its methods.
In lines 5 to 12, we define a method named es_insert(). We will use this method to insert records into Elasticsearch. It takes the following parameters: category, source, subject, and story. These parameters are combined into one JSON document named "doc", which is then passed to Elasticsearch along with the index name and the document type ("doc_type").
- category: The type of information. In this case, "True Crime".
- source: The source of the data (e.g. Bizarrepedia, Criminal Minds, etc.)
- subject: The subject of the story or the criminal's name.
- story: The whole story.
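As a rough sketch of what elastic.py could look like under these descriptions (not necessarily the exact listing), consider the code below. A few things here are assumptions: the document type name "story" is hypothetical, the client is passed in as the first argument "es" so the function can be tried with a stand-in object, and the index(..., doc_type=..., body=...) call follows the 6.x-era Python client. Newer client versions drop "doc_type", so adjust the call to match the version you installed.

```python
INDEX_NAME = "truecrime"  # index name used later in this article
DOC_TYPE = "story"        # hypothetical type name, chosen for illustration


def make_client():
    """Create a client for the local server (needs `pip install elasticsearch`)."""
    # Imported here so the rest of the module loads without the package.
    from elasticsearch import Elasticsearch
    return Elasticsearch(["http://127.0.0.1:9200"])


def es_insert(es, category, source, subject, story):
    """Combine the fields into one document and index it."""
    doc = {
        "category": category,
        "source": source,
        "subject": subject,
        "story": story,
    }
    # Assumption: 6.x-style call; newer clients no longer accept doc_type.
    response = es.index(index=INDEX_NAME, doc_type=DOC_TYPE, body=doc)
    print(response["result"])  # "created" when a new record is stored
    return response
```

A scraper script would then call something like es_insert(make_client(), "True Crime", "Bizarrepedia", subject, story) for each story it collects.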
Loading Records From Bizarrepedia
We add a few new lines to our original code from the previous article.
In line 3, we import the es_insert() method that we defined earlier from the Python script named "elastic".
In line 34, we run the es_insert() method and pass the category, source, subject, and story parameters.
When we run this code, we will see each story followed by a line that says "created", confirming that the record has been created. Scraping all the stories and loading them into Elasticsearch may take a few seconds.
Once the script is done running, you can refresh the browser to see how many records there are. This can be found under "hits" -> "total" in the JSON tab.
If you remember, when we scraped Bizarrepedia's website in the previous article, we knew it had 108 records based on the button's label. The screenshot below shows that 108 records have been created.
This is a good confirmation that we were able to scrape all the articles and they were all successfully loaded into Elasticsearch.
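The same check can also be done from Python. The sketch below assumes "es" is an Elasticsearch client instance and that the index is named "truecrime"; count_records() is a hypothetical helper name. It relies on the client's count() method, which returns a dictionary containing a "count" key.

```python
def count_records(es, index="truecrime"):
    """Return how many documents the index currently holds."""
    return es.count(index=index)["count"]
```

Comparing the returned number with the number of articles the website reports is a quick way to confirm that nothing was skipped during loading.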
Loading Records From Criminal Minds
We took one additional field from the Criminal Minds website named "quote". This field does not exist on the Bizarrepedia website. If we want to load this field into Elasticsearch, we will have to change the es_insert() method in elastic.py.
There are several ways to do this. We could add a parameter that defaults the "quote" field to a blank value, so that es_insert() would still work for both websites whether or not they have that field. However, that looks like something you would do in an SQL database, not a NoSQL one like Elasticsearch.
Instead, we will add a **kwargs parameter. This lets us pass both the field name and the field value (e.g. quote = "Que Sera Sera."), and we can pass more than one key-value pair. It also keeps our method flexible if we decide to load more fields later. You can read more about **kwargs from Python Tips.
**kwargs allows you to pass keyworded variable length of arguments to a function. You should use **kwargs if you want to handle named arguments in a function. Here is an example to get you going with it:
— Python Tips
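The example from Python Tips is not reproduced here. As a stand-in, here is a small sketch of the same idea; describe() and its fields are hypothetical names chosen for illustration.

```python
def describe(subject, **extras):
    """Return the fixed field plus whatever keyword arguments were passed."""
    record = {"subject": subject}
    # Inside the function, extras is a plain dict of the named arguments.
    record.update(extras)
    return record


# Works with no extra fields...
print(describe("Gacy"))
# ...and with any number of named ones.
print(describe("Gacy", quote="Que Sera Sera.", source="criminalminds"))
```

The caller decides which extra fields exist, so the function signature does not need to change when a new field comes along.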
In line 5, we add a **kwargs parameter named "extras", and in line 10, we simply include it in the JSON document. This change will not break the scripts that already use the method. For instance, if we run the code for Bizarrepedia again without making any adjustments, it will still work as it did previously.
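A minimal sketch of the updated method might look like this. As assumptions for illustration, the client is passed in as "es", the document type name "story" is hypothetical, and the index() call uses the 6.x-era signature with "doc_type" (newer clients no longer accept it).

```python
def es_insert(es, category, source, subject, story, **extras):
    """Index one story; any extra keyword arguments become extra fields."""
    doc = {
        "category": category,
        "source": source,
        "subject": subject,
        "story": story,
    }
    doc.update(extras)  # e.g. quote="Que Sera Sera." adds a "quote" field
    # Assumption: 6.x-style call with a hypothetical type name.
    return es.index(index="truecrime", doc_type="story", body=doc)
```

Calls that pass only the four original arguments produce the same document as before, which is why the Bizarrepedia script keeps working unchanged. A call that ends with quote=quote simply gets one more field in its document.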
We can now use the es_insert() method to load the data scraped from the Criminal Minds website including the "quote" field.
Notice line 28 in the code below. We pass the first few arguments to es_insert() the same way we did in Bizarrepedia's code, except this time we also pass the "quote" field name and value (quote=quote) at the end.
Searching Data From Elasticsearch
To search data from Elasticsearch, we will create a new method, es_search(), in lines 16 to 34. We want to be able to search for any one of the following:
- Source
- Subject
- Quote
- Phrases in the stories
In the future, we may want to be able to search using multiple filter combinations (e.g. source and subject, source and phrases, etc.) so we will use a "**filters" parameter instead of a fixed parameter for each field.
There are several ways to query Elasticsearch. In this article, we will use the Boolean query and its occurrence type "must", which requires every clause listed under it to match. This allows us to build different combinations of conditions (e.g. subject = "gacy" and source = "criminalminds") instead of matching only one field (e.g. subject = "gacy").
In lines 19 to 23, we create a list of dictionaries named "search_terms", with one "match" object for each filter passed in. These objects become the "must" clauses of our boolean query. The sample usage of the es_search() method in the code below shows how this part of the code works.
In this example, we pass two filters - subject and source.
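To make the shape of the query concrete, here is a small sketch of how the filters could be turned into a boolean query. build_query() is a hypothetical helper name used only for this illustration; the "bool", "must", and "match" keys are part of Elasticsearch's query DSL.

```python
import json


def build_query(**filters):
    """Turn keyword filters into a bool query with one match clause each."""
    search_terms = [{"match": {field: value}} for field, value in filters.items()]
    return {"query": {"bool": {"must": search_terms}}}


# Two filters produce two "match" clauses under "must".
query = build_query(subject="gacy", source="criminalminds")
print(json.dumps(query, indent=2))
```

Because every clause under "must" has to match, adding a filter narrows the results rather than widening them.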
In line 24, we count the number of records in the "truecrime" index and pass its value to the variable "size". We do this because Elasticsearch returns only 10 records by default. There are more efficient ways to do this but for now, since we have only a few records, we'll stick to this method.
In line 25, we pass the index name, size, and the body (query) to Elasticsearch's search() method. We'll use json.dumps() to make sure we are passing a valid JSON document.
In lines 26 to 34, we loop through the "hits", or the results we got from our query. We take the total number of results, source, subject, story, and quote (if any). We create a dictionary out of each result and assign it to the variable "result". Then we add each result to a list named "result_set". Once our code is done looping through all the results, it will return the list of results.
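Putting these steps together, a sketch of es_search() could look like the code below. It is written under a few assumptions: the client is passed in as "es", the count() and search() calls use the older "body" and "size" keyword arguments, and the total is read in a way that tolerates both the plain number returned by older servers and the dictionary returned by newer ones.

```python
import json


def es_search(es, index="truecrime", **filters):
    """Return every record that matches all of the given filters."""
    search_terms = [{"match": {field: value}} for field, value in filters.items()]
    query = {"query": {"bool": {"must": search_terms}}}

    # Ask for as many results as there are records, since the default is 10.
    size = es.count(index=index)["count"]
    response = es.search(index=index, size=size, body=json.dumps(query))

    total = response["hits"]["total"]
    if isinstance(total, dict):  # newer servers wrap the number in a dict
        total = total["value"]

    result_set = []
    for hit in response["hits"]["hits"]:
        fields = hit["_source"]
        result = {
            "total": total,
            "source": fields["source"],
            "subject": fields["subject"],
            "story": fields["story"],
        }
        if "quote" in fields:  # only the Criminal Minds records have a quote
            result["quote"] = fields["quote"]
        result_set.append(result)
    return result_set
```

For example, es_search(es, subject="gacy", source="criminalminds") would return a list with one dictionary per matching story.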
Below are two sample scripts that you can run to try out the es_search() method.
truecrime_search.py (Sample 1)
truecrime_search.py (Sample 2)
Now we can search through the information we collected. We can search for specific sources, subjects, quotes, and phrases. In the next article, we'll extract more specific information, such as when the subjects were arrested and the names of their victims. We will then update the existing records in Elasticsearch to add that information.
© 2019 Joann Mistica