True Crime Collection: Storing and Searching Stories Using Python and Elasticsearch

Joann has worked as a developer in the analytics and artificial intelligence industry and is experienced in data scraping, mining, and more.

Introduction

In the past few years, several crimes have been solved by regular people who have access to the internet. Someone even developed a serial killer detector. Whether you're a fan of true crime stories and just want to do some extra reading, or you want to use this crime-related information for your research, this article will help you collect, store, and search information from the websites of your choice.

In another article, I wrote about collecting information from two different website formats using Beautiful Soup and Python. In this article, I will guide you through loading that information into Elasticsearch and searching through it.

Requirements

Python

I'm using Python 3.6.8, but you can use other versions. Some of the syntax could be different, especially for Python 2.

Beautiful Soup

Once you're done installing Python, you can get Beautiful Soup by entering "pip install beautifulsoup4" in your terminal. For more information, visit Crummy.

Elasticsearch

First, you need to install Elasticsearch. You can download Elasticsearch and find installation instructions from the Elastic website.

Second, you need to install the Elasticsearch client for Python so that we can interact with Elasticsearch through our Python code. You can get the Elasticsearch client for Python by entering "pip install elasticsearch" in your terminal. If you want to explore this API further, you can refer to the Elasticsearch API documentation for Python.
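
Once the server is running (we'll cover that in the next section), a quick sanity check like the sketch below can confirm that the Python client can reach it. This is just an illustration; info() returns basic cluster details.

from elasticsearch import Elasticsearch

es = Elasticsearch()  # connects to http://127.0.0.1:9200 by default
print(es.info())  # prints basic cluster details if the connection works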

Running Elasticsearch

Once you have Elasticsearch, you will have a folder named "elasticsearch" followed by the version number. Inside that folder is a "bin" folder that contains the executable files.

To run Elasticsearch, you can:

Option 1: Double-click the "elasticsearch" file in the "bin" folder and click the "Run in Terminal" button.

Option 2: Navigate to that directory using your terminal and enter "./elasticsearch".

Option 1

Option 2

The steps above should start the server. Keep that terminal open and enter http://127.0.0.1:9200/ in your browser. If there are no issues, you will see something similar to the screenshot below.
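
If you prefer to check from Python instead of the browser, a small sketch using the requests library (the same one we use for scraping) does the same thing:

import requests

# Fetch the same JSON document the browser shows
response = requests.get("http://127.0.0.1:9200/")
print(response.json())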


Loading Data Into Elasticsearch

In the previous article, we scraped two different websites using Beautiful Soup, which gave us two Python scripts: one for Bizarrepedia and another for Criminal Minds. In the future, we might want to add more websites and scripts, so instead of writing the code that inserts data into Elasticsearch and pasting it into every script, we will create a separate Python script specifically for Elasticsearch transactions.

For this article, I will create a file named "elastic.py" in the same directory where all the web scraper files are located.


elastic.py

from elasticsearch import Elasticsearch
es = Elasticsearch()


def es_insert(category, source, subject, story):
    doc = {
        "source": source,
        "subject": subject,
        "story": story,
    }
    res = es.index(index=category, doc_type="story", body=doc)
    print(res["result"])

In the first two lines, we import the Elasticsearch module and create an instance of it so we can use its methods.

In lines 5 to 12, we define a method named es_insert(). We will use this method to insert records into Elasticsearch. It takes the following parameters: category, source, subject, and story. These parameters are combined into one JSON document named "doc" and then passed to Elasticsearch along with the index name and the document type ("doc_type").

Parameter — Value

category — The type of information, used as the Elasticsearch index name. In this case, "truecrime".

source — The source of the data (e.g., Bizarrepedia, Criminal Minds).

subject — The subject of the story or the criminal's name.

story — The whole story.
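
For example, a call with made-up values (the subject and story below are purely illustrative) would look like this:

# Prints "created" if the record was indexed successfully
es_insert("truecrime", "bizarrepedia", "John Doe", "The full story text goes here...")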

Loading Records From Bizarrepedia

We added a few new lines to our original code from the previous article.

In line 3, we import the es_insert() method that we defined earlier from the Python script named "elastic".

In line 34, we run the es_insert() method and pass the category, source, subject, and story parameters.

bizarrepedia.py

import requests
from bs4 import BeautifulSoup
from elastic import es_insert


# Retrieve all pages
url = "https://www.bizarrepedia.com/crime/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
pages = int(soup.find("a", {"data-num-pages": True})["data-num-pages"])

for page in range(1, pages + 1):
    if page == 1: # First page has no page number
        url = "https://www.bizarrepedia.com/crime"
    else:
        url = "https://www.bizarrepedia.com/crime/page/" + str(page)

    # Retrieve each story
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    stories = soup.find_all("a", {"class": "bx"})

    for story in stories:
        response = requests.get(story["href"])
        soup = BeautifulSoup(response.text, "lxml")
        subject = soup.find("h1", {"class": "entry"}).text
        main_story = soup.find("div", {"class": "typography"})
        blocks = main_story.find_all("p")
        full_story = ""

        for block in blocks:
            full_story = full_story + block.text + "\n\n"
        print(subject + "\n\n" + full_story)
        es_insert("truecrime", "bizarrepedia", subject, full_story)

When we run this code, we will see each story followed by a line that says "created," confirming that the record has been created. Scraping all the stories and loading them into Elasticsearch may take a few seconds.

The result after running the code.

Once the script is done running, you can refresh your browser (e.g., at http://127.0.0.1:9200/truecrime/_search) to see how many records there are. The count can be found under "hits" -> "total" in the JSON tab.
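
You can also get the same count from Python using the client's count() method, for example:

from elastic import es  # reuse the client instance created in elastic.py

print(es.count(index="truecrime")["count"])  # e.g. 108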

If you remember, when we scraped Bizarrepedia's website in the previous article, we learned that it had 108 stories based on the button's label. The screenshot below shows that 108 records have been created.

This is a good confirmation that we were able to scrape all the articles and they were all successfully loaded into Elasticsearch.

108 records created.

Loading Records From Criminal Minds

We scraped one additional field from the Criminal Minds website named "quote". This field does not exist on the Bizarrepedia website. If we want to load this field into Elasticsearch, we will have to change the es_insert() method in elastic.py.

There are several ways to do this. We could add a parameter that sets the "quote" field to a blank default value so that the es_insert() method would still work for both websites regardless of whether they have that field, but that would look like something you would do in an SQL database, not a NoSQL one like Elasticsearch.

Instead, we will add a **kwargs parameter. This will also keep our method flexible if we decide to load more fields later. It allows us to pass both the field name and the field value (i.e., quote="Que Sera Sera."), and we can pass more than one key-value pair. You can read more about **kwargs from Python Tips.

**kwargs allows you to pass keyworded variable length of arguments to a function. You should use **kwargs if you want to handle named arguments in a function. Here is an example to get you going with it:

— Python Tips
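
As a quick illustration (this is my own sketch, not the exact example from Python Tips), a method with a **kwargs parameter receives any named arguments as a dictionary:

def print_fields(**kwargs):
    # kwargs is a dictionary of whatever named arguments were passed
    for key, value in kwargs.items():
        print(key, "=", value)

print_fields(quote="Que Sera Sera.", source="criminalminds")
# quote = Que Sera Sera.
# source = criminalminds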

In line 5, we add a **kwargs parameter named "extras", and in line 10, we simply include it in the JSON document. This change will not impact the scripts that use it. For instance, if we run the code for Bizarrepedia again without making any adjustments, it will still work as it did previously.

elastic.py

from elasticsearch import Elasticsearch
es = Elasticsearch()


def es_insert(category, source, subject, story, **extras):
    doc = {
        "source": source,
        "subject": subject,
        "story": story,
        **extras,
    }
    res = es.index(index=category, doc_type="story", body=doc)
    print(res["result"])

We can now use the es_insert() method to load the data scraped from the Criminal Minds website including the "quote" field.

Notice line 28 in the code below. We pass the first few arguments to es_insert() the same way we did in Bizarrepedia's code, except this time we pass the "quote" field name and field value (quote=quote) at the end.

criminal_minds.py

import requests
from bs4 import BeautifulSoup
from elastic import es_insert


# Retrieve all stories
url = "https://criminalminds.fandom.com/wiki/Real_Criminals/Serial_Killers"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
stories = soup.find_all("div", {"class": "lightbox-caption"})

for story in stories:
    # Retrieve each story
    url = "https://criminalminds.fandom.com" + story.find("a")["href"]
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    main_story = soup.find("div", {"id":"mw-content-text"})
    subject = story.find("a")["title"]
    table = main_story.find("table")  # the quote, when present, is inside a table
    quote = " ".join(table.text.split()) if table is not None else ""  # avoid reusing a previous story's quote
    blocks = main_story.find_all("p")
    full_story = ""

    for block in blocks:
        full_story = full_story + block.text + "\n"

    print(quote + "\n" + subject + "\n\n" + full_story)
    es_insert("truecrime", "criminalminds", subject, full_story, quote=quote)

Searching Data From Elasticsearch

To search data in Elasticsearch, we will create a new method, es_search(), in lines 16 to 34 of the code below. We want to be able to search by any one of the following:

  • Source
  • Subject
  • Phrases in the stories

In the future, we may want to search using multiple filter combinations (e.g., source and subject, source and phrases, etc.), so we will use a "**filters" parameter instead of fixed parameters.

elastic.py

import json
from elasticsearch import Elasticsearch
es = Elasticsearch()


def es_insert(category, source, subject, story, **extras):
    doc = {
        "source": source,
        "subject": subject,
        "story": story,
        **extras,
    }
    res = es.index(index=category, doc_type="story", body=doc)
    print(res["result"])

def es_search(**filters):
    result = dict()
    result_set = list()
    search_terms = list()
    for key, value in filters.items():
        search_terms.append({"match": {key: value}})

    print("Search terms: ", search_terms)
    size = es.count(index="truecrime").get("count")
    res = es.search(index="truecrime", size=size, body=json.dumps({"query": {"bool": {"must": search_terms}}}))
    for hit in res["hits"]["hits"]:
        result = {"total": res["hits"]["total"],
                  "source": hit["_source"]["source"],
                  "subject": hit["_source"]["subject"],
                  "story": hit["_source"]["story"]}
        if "quote" in hit["_source"]:
            result.update({"quote": hit["_source"]["quote"]})
        result_set.append(result)
    return result_set
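
Thanks to the "**filters" parameter, any combination of fields works with the same method. For example, all of these are valid calls:

es_search(source="bizarrepedia")
es_search(subject="gacy", source="criminalminds")
es_search(source="bizarrepedia", story="arrested")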

Boolean Query

There are several ways to query Elasticsearch. In this article, we will use the Boolean Query and its occurrence type "must". This will allow us to build different combinations of queries (e.g., subject = "gacy" and source = "criminalminds") instead of matching only one field (e.g., subject = "gacy").
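
For the combination above, the body that es_search() builds and sends to Elasticsearch looks like this (shown as a Python dictionary, before json.dumps() is applied):

{
    "query": {
        "bool": {
            "must": [
                {"match": {"subject": "gacy"}},
                {"match": {"source": "criminalminds"}},
            ]
        }
    }
}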

In lines 19 to 23, we build and print a list of dictionaries named "search_terms", with one "match" object per filter, that we can use for our Boolean query. The sample usage of the es_search() method below shows how this part of the code works.

In this example, we pass two filters - subject and source.

es_search(subject="gacy", source="criminalminds")
A list of two search terms (dictionaries) for subject and source.

In line 24, we count the number of records in the "truecrime" index and assign the value to the variable "size". We do this because Elasticsearch returns only 10 records by default. There are more efficient ways to handle this (such as the scroll API), but for now, since we have only a few records, we'll stick with this method.

In line 25, we pass the index name, the size, and the body (query) to Elasticsearch's search() method. We use json.dumps() to make sure we are passing a valid JSON document.

Invalid and valid JSON formats.
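
To see the difference yourself, note that printing a Python dictionary uses single quotes, which is not valid JSON, while json.dumps() produces double-quoted, valid JSON:

import json

query = {"query": {"match": {"subject": "gacy"}}}
print(query)  # {'query': {'match': {'subject': 'gacy'}}} - single quotes, invalid JSON
print(json.dumps(query))  # {"query": {"match": {"subject": "gacy"}}} - double quotes, valid JSON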

In lines 26 to 34, we loop through the "hits", i.e., the results we got from our query. We take the total number of results and each hit's source, subject, story, and quote (if any). We create a dictionary out of each result, assign it to the variable "result", and append it to a list named "result_set". Once the code is done looping through all the results, it returns the list.
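
Each entry in "result_set" is a dictionary shaped like the sketch below (the values here are illustrative):

{
    "total": 2,
    "source": "criminalminds",
    "subject": "John Wayne Gacy",
    "story": "The full story...",
    "quote": "The quote, only present when the record has one.",
}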

Below are two sample scripts that you can run to try out the es_search() method.

truecrime_search.py (Sample 1)

from elastic import es_search


result = es_search(subject="gacy", source="criminalminds")
for ndx, val in enumerate(result):
    print("\n----------\n")
    print("Story", ndx + 1, "of", val.get("total"))
    print("Subject:", val.get("subject"))
    print(val.get("story"))
Single result.

truecrime_search.py (Sample 2)

from elastic import es_search


result = es_search(story="arrested")
for ndx, val in enumerate(result):
    print("\n----------\n")
    print("Story", ndx + 1, "of", val.get("total"))
    print("Subject:", val.get("subject"))
Multiple results excluding the stories.

Finally

Now we can search through the information we collected. We can search for specific sources, subjects, quotes, and phrases. In the next article, we'll extract more specific information, like when the subjects were arrested, the names of their victims, and so on. We will then update the existing records in Elasticsearch to add that information.

© 2019 Joann Mistica
