True Crime Collection: Storing and Searching Stories Using Python and Elasticsearch

Updated on October 22, 2019

Introduction

In the past few years, several crimes have been solved by regular people who have access to the internet. Someone even developed a serial killer detector. Whether you're a fan of true crime stories who just wants to do some extra reading, or you want to use crime-related information in your research, this article will help you collect, store, and search information from your websites of choice.

In another article, I wrote about collecting information from two different website formats using Beautiful Soup and Python. In this article, I will guide you through loading that information into Elasticsearch and searching through it.

Requirements

Python

I'm using Python 3.6.8, but you can use other versions. Some of the syntax may differ, especially in Python 2.

Beautiful Soup

Once you're done installing Python, you can get Beautiful Soup by entering "pip install beautifulsoup4" in your terminal. For more information, visit Crummy.

Elasticsearch

First, you need to install Elasticsearch. You can download Elasticsearch and find installation instructions from the Elastic website.

Second, you need to install the Elasticsearch client for Python so that we can interact with Elasticsearch through our Python code. You can get the Elasticsearch client for Python by entering "pip install elasticsearch" in your terminal. If you want to explore this API further, you can refer to the Elasticsearch API documentation for Python.

Running Elasticsearch

Once you have Elasticsearch, you will have a folder named "elasticsearch" followed by the version number. Inside that folder is a "bin" folder that contains the executable files.

To run Elasticsearch, you can do either of the following:

Option 1: Double-click the "elasticsearch" file in the "bin" folder and click the "Run in Terminal" button.

Option 2: Navigate to that directory using your terminal and enter "./elasticsearch".


The steps above should start the server. Keep that terminal open and enter http://127.0.0.1:9200/ in your browser. If there are no issues, your browser will display a JSON response with some basic information about your cluster.
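
You can also verify the connection from Python. Below is a minimal sketch (assuming Elasticsearch is running locally on the default port) using the client's ping() and info() methods:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to http://127.0.0.1:9200

# ping() returns True when the cluster responds
if es.ping():
    print(es.info())  # the same cluster information shown in the browser
else:
    print("Elasticsearch is not reachable.")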

Loading Data Into Elasticsearch

In the previous article, we scraped two different websites using Beautiful Soup, which gave us two Python scripts: one for Bizarrepedia and another for Criminal Minds. In the future, we might want to add more websites and scripts, so instead of writing the code that inserts data into Elasticsearch and pasting it into every script, we will create a separate Python script specifically for Elasticsearch transactions.

For this article, I will create a file named "elastic.py" in the same directory where all the web scraper files are located.

Elastic.py (Version 1)

from elasticsearch import Elasticsearch
es = Elasticsearch()


def es_insert(category, source, subject, story):
    doc = {
        "source": source,
        "subject": subject,
        "story": story,
    }
    res = es.index(index=category, doc_type="story", body=doc)
    print(res["result"])

In the first two lines, we import the Elasticsearch class and create an instance of it so we can use its methods.

In lines 5 to 12, we define a method named es_insert(). We will use this method to insert records into Elasticsearch. It takes the following parameters: category, source, subject, and story. These parameters are combined into one JSON document named "doc" and then passed to Elasticsearch along with the index name (the category) and the document type ("story") via the doc_type argument. A sample call is shown after the table below.

Parameter   Description
category    The type of information; used as the index name. In this case, "truecrime".
source      The source of the data (e.g., Bizarrepedia, Criminal Minds).
subject     The subject of the story or the criminal's name.
story       The whole story.
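
To illustrate, a single call with made-up values would look like this:

from elastic import es_insert

# Creates one document in the "truecrime" index (the values here are hypothetical)
es_insert("truecrime", "bizarrepedia", "John Doe", "The full story goes here.")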

Loading Records From Bizarrepedia

We add a few new lines to the original code from the previous article.

In line 3, we import the es_insert() method we defined earlier in the Python script named "elastic".

In line 34, we run the es_insert() method and pass the category, source, subject, and story parameters.

Bizarrepedia.py

import requests
from bs4 import BeautifulSoup
from elastic import es_insert


# Retrieve all pages
url = "https://www.bizarrepedia.com/crime/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
pages = int(soup.find("a", {"data-num-pages": True})["data-num-pages"])

for page in range(1, pages + 1):
    if page == 1: # First page has no page number
        url = "https://www.bizarrepedia.com/crime"
    else:
        url = "https://www.bizarrepedia.com/crime/page/" + str(page)

    # Retrieve each story
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    stories = soup.find_all("a", {"class": "bx"})

    for story in stories:
        response = requests.get(story["href"])
        soup = BeautifulSoup(response.text, "lxml")
        subject = soup.find("h1", {"class": "entry"}).text
        main_story = soup.find("div", {"class": "typography"})
        blocks = main_story.find_all("p")
        full_story = ""

        for block in blocks:
            full_story = full_story + block.text + "\n\n"
        print(subject + "\n\n" + full_story)
        es_insert("truecrime", "bizarrepedia", subject, full_story)

When we run this code, we will see each story followed by a line that says "created", confirming that the record has been created. Scraping all the stories and loading them into Elasticsearch may take a few seconds.

The result after running the code.

Once the script is done running, you can check how many records there are by pointing your browser at the index's search endpoint, http://127.0.0.1:9200/truecrime/_search, and refreshing. The total can be found under "hits" -> "total" in the JSON tab.

If you remember, when we scraped Bizarrepedia's website in the previous article, we knew it had 108 stories based on the button's label. The screenshot below shows that 108 records have been created.

This is a good confirmation that we were able to scrape all the articles and that they were all successfully loaded into Elasticsearch.

108 records created.
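
If you prefer to check the total from Python instead of the browser, the client's count() method (which we will also use later in es_search()) returns the same number:

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Returns a dict like {"count": 108, "_shards": {...}}
print(es.count(index="truecrime").get("count"))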

Loading Records From Criminal Minds

We took one additional field from the Criminal Minds website named "quote". This field does not exist on the Bizarrepedia website. If we want to load this field into Elasticsearch, we will have to change elastic.py's es_insert() method.

There are several ways to do this. We could add a parameter that sets the "quote" field to a blank default value so that the es_insert() method would still work for both websites regardless of whether they have that field, but that would look like something you would do in an SQL database, not a NoSQL one like Elasticsearch.

Instead, we will add a **kwargs parameter. Adding this parameter will also keep our method flexible if we decide to load more fields later. It allows us to pass both the field name and the field value (e.g., quote="Que Sera Sera."), and we can pass more than one key-value pair. You can read more about **kwargs at Python Tips.

**kwargs allows you to pass keyworded variable length of arguments to a function. You should use **kwargs if you want to handle named arguments in a function. Here is an example to get you going with it:

— Python Tips
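
Here is a minimal sketch of the same idea:

def print_kwargs(**kwargs):
    # kwargs collects all keyword arguments into a dictionary
    for key, value in kwargs.items():
        print(key, "=", value)

print_kwargs(quote="Que Sera Sera.", source="criminalminds")
# Output:
# quote = Que Sera Sera.
# source = criminalminds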

In line 5, we add a **kwargs parameter named "extras", and in line 10, we simply include it in the JSON document. This change will not impact the scripts already using es_insert(). For instance, if we run the code for Bizarrepedia again without making any adjustments, it will still work as it did previously.

Elastic.py (Version 2)

from elasticsearch import Elasticsearch
es = Elasticsearch()


def es_insert(category, source, subject, story, **extras):
    doc = {
        "source": source,
        "subject": subject,
        "story": story,
        **extras,
    }
    res = es.index(index=category, doc_type="story", body=doc)
    print(res["result"])

We can now use the es_insert() method to load the data scraped from the Criminal Minds website, including the "quote" field.

Notice line 29 in the code below. We pass the first few arguments to es_insert() the same way we did in Bizarrepedia's code, except this time we pass the "quote" field name and value (quote=quote) at the end.

Criminal_minds.py

import requests
from bs4 import BeautifulSoup
from elastic import es_insert


# Retrieve all stories
url = "https://criminalminds.fandom.com/wiki/Real_Criminals/Serial_Killers"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
stories = soup.find_all("div", {"class": "lightbox-caption"})

for story in stories:
    # Retrieve each story
    url = "https://criminalminds.fandom.com" + story.find("a")["href"]
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    main_story = soup.find("div", {"id": "mw-content-text"})
    subject = story.find("a")["title"]
    quote = ""  # Default, in case a page has no quote table
    if main_story.find("table") is not None:
        quote = " ".join(main_story.find("table").text.split())
    blocks = main_story.find_all("p")
    full_story = ""

    for block in blocks:
        full_story = full_story + block.text + "\n"

    print(quote + "\n" + subject + "\n\n" + full_story)
    es_insert("truecrime", "criminalminds", subject, full_story, quote=quote)

Searching Data From Elasticsearch

To search data in Elasticsearch, we will create a new method, es_search(), in lines 16 to 34. We want to be able to search for any one of the following:

  • Source
  • Subject
  • Phrases in the stories

In the future, we may want to search using multiple filter combinations (e.g., source and subject, source and phrase, etc.), so we will use a **filters parameter instead of fixed parameters.

Elastic.py (Final Version)

import json
from elasticsearch import Elasticsearch
es = Elasticsearch()


def es_insert(category, source, subject, story, **extras):
    doc = {
        "source": source,
        "subject": subject,
        "story": story,
        **extras,
    }
    res = es.index(index=category, doc_type="story", body=doc)
    print(res["result"])

def es_search(**filters):
    result = dict()
    result_set = list()
    search_terms = list()
    for key, value in filters.items():
        search_terms.append({"match": {key: value}})

    print("Search terms: ", search_terms)
    size = es.count(index="truecrime").get("count")
    res = es.search(index="truecrime", size=size, body=json.dumps({"query": {"bool": {"must": search_terms}}}))
    for hit in res["hits"]["hits"]:
        result = {"total": res["hits"]["total"], \
                        "source": hit["_source"]["source"], \
                        "subject": hit["_source"]["subject"], \
                        "story": hit["_source"]["story"]}
        if "quote" in hit["_source"]:
            result.update({"quote": hit["_source"]["quote"]})
        result_set.append(result)
    return result_set

Boolean Query

There are several ways to query Elasticsearch. In this article, we will use the boolean query with its occurrence type "must". This allows us to build different combinations of conditions (e.g., subject = "gacy" and source = "criminalminds") instead of matching only one field (e.g., subject = "gacy").
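
For example, with the filters subject="gacy" and source="criminalminds", the query body that es_search() builds looks like this:

{
    "query": {
        "bool": {
            "must": [
                {"match": {"subject": "gacy"}},
                {"match": {"source": "criminalminds"}}
            ]
        }
    }
}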

In lines 19 to 23, we create a list of dictionaries named "search_terms". This allows us to build one "match" object per filter for our boolean query. The sample usage of the es_search() method in the code below shows how this part of the code works.

In this example, we pass two filters - subject and source.

es_search(subject="gacy", source="criminalminds")
A list of two search terms (dictionaries) for subject and source.

In line 24, we count the number of records in the "truecrime" index and assign the value to the variable "size". We do this because Elasticsearch returns only 10 records by default. There are more efficient ways to do this, but for now, since we have only a few records, we'll stick with this method.
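
One of those more efficient ways is the scan() helper that ships with the Python client; it paginates through every matching record internally, so no upfront count is needed. A minimal sketch, assuming the same local setup:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()
# scan() streams all matching documents without a size parameter
for hit in scan(es, index="truecrime", query={"query": {"match_all": {}}}):
    print(hit["_source"]["subject"])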

In line 25, we pass the index name, the size, and the body (the query) to Elasticsearch's search() method. We use json.dumps() to make sure we are passing a valid JSON document.

Invalid and valid JSON formats.
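
To see the difference, compare a dictionary's string form with its json.dumps() output; only the latter is valid JSON:

import json

query = {"query": {"bool": {"must": [{"match": {"subject": "gacy"}}]}}}
print(str(query))         # single quotes: not valid JSON
print(json.dumps(query))  # double quotes: valid JSON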

In lines 26 to 34, we loop through the "hits", the results we got from our query. For each hit, we take the total number of results, the source, subject, story, and quote (if any), build a dictionary out of them, and assign it to the variable "result". We then append each result to a list named "result_set". Once the code is done looping through all the results, it returns the list.

Below are two sample scripts that you can run to try out the es_search() method.

Truecrime_search.py (Sample 1)

from elastic import es_search


result = es_search(subject="gacy", source="criminalminds")
for ndx, val in enumerate(result):
    print("\n----------\n")
    print("Story", ndx + 1, "of", val.get("total"))
    print("Subject:", val.get("subject"))
    print(val.get("story"))

Single result.

Truecrime_search.py (Sample 2)

from elastic import es_search


result = es_search(story="arrested")
for ndx, val in enumerate(result):
    print("\n----------\n")
    print("Story", ndx + 1, "of", val.get("total"))
    print("Subject:", val.get("subject"))

Multiple results excluding the stories.

Finally

Now we can search through the information we collected. We can search for specific sources, subjects, quotes, and phrases. In the next article, we'll extract more specific information, such as when the subjects were arrested and the names of their victims. We will then update the existing records in Elasticsearch to add that information.
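
As a rough preview of that update step, here is a minimal sketch using the client's update() method (the document ID and new fields below are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Adds new fields to an existing document (ID and field names are hypothetical)
es.update(index="truecrime", doc_type="story", id="some-document-id",
          body={"doc": {"arrested": "1978", "victims": ["Jane Doe"]}})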

Questions & Answers

    © 2019 Joann Mistica

    Comments

      0 of 8192 characters used
      Post Comment

      No comments yet.

      working

      This website uses cookies

      As a user in the EEA, your approval is needed on a few things. To provide a better website experience, owlcation.com uses cookies (and other similar technologies) and may collect, process, and share personal data. Please choose which areas of our service you consent to our doing so.

      For more information on managing or withdrawing consents and how we handle data, visit our Privacy Policy at: https://owlcation.com/privacy-policy#gdpr

      Show Details
      Necessary
      HubPages Device IDThis is used to identify particular browsers or devices when the access the service, and is used for security reasons.
      LoginThis is necessary to sign in to the HubPages Service.
      Google RecaptchaThis is used to prevent bots and spam. (Privacy Policy)
      AkismetThis is used to detect comment spam. (Privacy Policy)
      HubPages Google AnalyticsThis is used to provide data on traffic to our website, all personally identifyable data is anonymized. (Privacy Policy)
      HubPages Traffic PixelThis is used to collect data on traffic to articles and other pages on our site. Unless you are signed in to a HubPages account, all personally identifiable information is anonymized.
      Amazon Web ServicesThis is a cloud services platform that we used to host our service. (Privacy Policy)
      CloudflareThis is a cloud CDN service that we use to efficiently deliver files required for our service to operate such as javascript, cascading style sheets, images, and videos. (Privacy Policy)
      Google Hosted LibrariesJavascript software libraries such as jQuery are loaded at endpoints on the googleapis.com or gstatic.com domains, for performance and efficiency reasons. (Privacy Policy)
      Features
      Google Custom SearchThis is feature allows you to search the site. (Privacy Policy)
      Google MapsSome articles have Google Maps embedded in them. (Privacy Policy)
      Google ChartsThis is used to display charts and graphs on articles and the author center. (Privacy Policy)
      Google AdSense Host APIThis service allows you to sign up for or associate a Google AdSense account with HubPages, so that you can earn money from ads on your articles. No data is shared unless you engage with this feature. (Privacy Policy)
      Google YouTubeSome articles have YouTube videos embedded in them. (Privacy Policy)
      VimeoSome articles have Vimeo videos embedded in them. (Privacy Policy)
      PaypalThis is used for a registered author who enrolls in the HubPages Earnings program and requests to be paid via PayPal. No data is shared with Paypal unless you engage with this feature. (Privacy Policy)
      Facebook LoginYou can use this to streamline signing up for, or signing in to your Hubpages account. No data is shared with Facebook unless you engage with this feature. (Privacy Policy)
      MavenThis supports the Maven widget and search functionality. (Privacy Policy)
      Marketing
      Google AdSenseThis is an ad network. (Privacy Policy)
      Google DoubleClickGoogle provides ad serving technology and runs an ad network. (Privacy Policy)
      Index ExchangeThis is an ad network. (Privacy Policy)
      SovrnThis is an ad network. (Privacy Policy)
      Facebook AdsThis is an ad network. (Privacy Policy)
      Amazon Unified Ad MarketplaceThis is an ad network. (Privacy Policy)
      AppNexusThis is an ad network. (Privacy Policy)
      OpenxThis is an ad network. (Privacy Policy)
      Rubicon ProjectThis is an ad network. (Privacy Policy)
      TripleLiftThis is an ad network. (Privacy Policy)
      Say MediaWe partner with Say Media to deliver ad campaigns on our sites. (Privacy Policy)
      Remarketing PixelsWe may use remarketing pixels from advertising networks such as Google AdWords, Bing Ads, and Facebook in order to advertise the HubPages Service to people that have visited our sites.
      Conversion Tracking PixelsWe may use conversion tracking pixels from advertising networks such as Google AdWords, Bing Ads, and Facebook in order to identify when an advertisement has successfully resulted in the desired action, such as signing up for the HubPages Service or publishing an article on the HubPages Service.
      Statistics
      Author Google AnalyticsThis is used to provide traffic data and reports to the authors of articles on the HubPages Service. (Privacy Policy)
      ComscoreComScore is a media measurement and analytics company providing marketing data and analytics to enterprises, media and advertising agencies, and publishers. Non-consent will result in ComScore only processing obfuscated personal data. (Privacy Policy)
      Amazon Tracking PixelSome articles display amazon products as part of the Amazon Affiliate program, this pixel provides traffic statistics for those products (Privacy Policy)
      ClickscoThis is a data management platform studying reader behavior (Privacy Policy)