A professional engineer and writer helping people find creative ways to solve everyday problems.
This article provides a detailed walk-through of how to use Python to read large text files. It includes a fully functional, ready-to-run code snippet, so you can be up to speed about 10 minutes after reading.
Let us first familiarize you with the advanced data-structures available in Python that we will use to store and process data from files, just in case you are new to Python programming.
Advanced Data-Structures in Python
Through the NumPy and pandas packages, Python offers two powerful data structures. They make numerical, data-intensive work far more convenient than in plain C/C++ and make Python a strong competitor to the industry-dominant Matlab®.
Numpy arrays or ndarrays are arrays that can be scaled up to 'n' dimensions. They are best used as 2-dimensional array structures to represent matrices. The Numpy module itself contains powerful function libraries for a variety of numerical and algebraic operations.
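As a rough illustration of the point above, the sketch below builds a small 2-dimensional ndarray and applies a few built-in operations. The values are made up for the example and are not taken from the demonstration file.

```python
import numpy as np

# A 2 x 3 matrix stored as a 2-dimensional ndarray (illustrative values only).
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

print(matrix.ndim)   # number of dimensions
print(matrix.shape)  # (rows, columns)

# Element-wise arithmetic and matrix algebra are available out of the box.
doubled = matrix * 2          # multiplies every element by 2
product = matrix @ matrix.T   # matrix product with the transpose
print(product.shape)
```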
Data-frames build upon the 2-dimensional ndarrays to add extra functionality. The 2-dimensional ndarray now has a separate column for array index and all column headers are now individually addressable. More importantly, each column can now hold a different data type (int, float or string).
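A minimal sketch of such a data-frame is shown below. The column names and values (BusNo, Voltage, Type) are hypothetical and chosen only to show one int column, one float column and one string column side by side.

```python
import pandas as pd

# Hypothetical columns: one int, one float, one string.
buses = pd.DataFrame({
    "BusNo": [1, 2, 3],
    "Voltage": [1.06, 1.045, 1.01],
    "Type": ["Slack", "PV", "PQ"],
})

print(buses.dtypes)       # each column reports its own data type
print(buses.index)        # the separate row index
print(buses["Voltage"])   # a column addressed by its header
```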
Text File to be Read by Python Code
Let's move forward to the tutorial and acquaint you with the demonstration text file.
It is a 14-row x 20-column data table saved as a .txt file. It contains all three data types: int, float and string. The file name is BusData.txt.
Next, view the code snippet given below, which reads this file. We will explain the code line by line in the following section.
Python Code to Read Data From Text File
Initialization: Import Numpy and Pandas
Line 4: Import the numpy package into the project.
Line 5: Import the pandas package into the project.
Line 7: Start the definition of the function Read(). It is always good practice to break your code into functions.
Line 9: Define global variables.
In Python, variables assigned inside a function are local to it unless they are declared global. Only global variables appear in the variable explorer and can be referenced outside the function. For demonstration, I have declared all four variables global here; otherwise, only the BusDataReshaped variable would have needed to be global.
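The difference between a global and a local variable can be sketched as follows. The function and variable names here (read_demo, shared_value, local_value) are hypothetical and are not the ones used in the article's snippet.

```python
def read_demo():
    # Declared global: still exists at module level after the call.
    global shared_value
    shared_value = "kept"

    # Plain assignment: local to the function, gone once it returns.
    local_value = "discarded"


read_demo()
print(shared_value)                # accessible outside the function
print("local_value" in globals())  # the local name never reached module level
```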
Open and Read the Target File
Line 11: The open() function opens the file BusData.txt at the given directory location and returns a file object, which is assigned to an arbitrary variable X.
Line 12: The read() function reads the entire file as a string and assigns it to the variable BusData. Fig 2 shows that BusData is now a string with 1792 characters.
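The open-and-read step might look like the sketch below. To keep it runnable anywhere, it first writes a tiny stand-in file to a temporary directory; the file content and the path are assumptions and do not reproduce the real BusData.txt.

```python
import os
import tempfile

# Create a small stand-in file (made-up content, not the real BusData.txt).
sample_text = "1 1.06 Slack\n2 1.045 PV\n"
path = os.path.join(tempfile.mkdtemp(), "BusData.txt")
with open(path, "w") as handle:
    handle.write(sample_text)

# open() returns a file object; read() returns the whole content as one string.
# The with-statement closes the file automatically when the block ends.
with open(path, "r") as handle:
    bus_data = handle.read()

print(type(bus_data).__name__, len(bus_data))
```

Using a with-statement is a design choice in this sketch: it avoids having to call close() on the file object by hand.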
Split the String on Whitespace
Line 14: The split() function in Python splits the string into a list wherever there is whitespace (spaces, tabs or line breaks). The data is now converted into a list of 280 elements and assigned to the variable BusDataList. See Fig 2.
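A short sketch of the splitting step, using a made-up two-row string in place of the real file content:

```python
# Made-up stand-in for the string returned by read().
bus_data = "1 1.06 Slack\n2 1.045 PV\n"

# With no arguments, split() breaks on any run of whitespace,
# including the line breaks between rows.
bus_data_list = bus_data.split()

print(bus_data_list)
print(len(bus_data_list))
```

Every element of the resulting list is still a string, including the ones that look like numbers.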
Convert to Numpy Array
Line 15: The list is converted into a numpy array by the numpy.array() function. Fig 2 shows that BusDataArray is now an array of datatype string and has 280 elements.
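The conversion can be sketched like this, again with a short made-up list rather than the 280 elements from the file:

```python
import numpy as np

# Made-up list of strings, standing in for BusDataList.
bus_data_list = ["1", "1.06", "Slack", "2", "1.045", "PV"]

bus_data_array = np.array(bus_data_list)

print(bus_data_array.shape)       # one dimension, one entry per list element
print(bus_data_array.dtype.kind)  # 'U' marks a unicode string array
```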
The problem is that it still does not look like our original data table in the file. It needs to be reshaped.
Line 16: The numpy.reshape() function from the numpy package reshapes the array into our desired dimensions of 14 x 20. Fig 2 shows that BusDataReshaped variable is now an ndarray and has dimensions 14 x 20.
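A minimal sketch of the reshape step is given below. It uses 280 placeholder strings instead of the real data, so only the dimensions match the article's example.

```python
import numpy as np

# 280 placeholder strings standing in for the flat array of file values.
flat = np.array([str(i) for i in range(280)])

# Reshape row by row into 14 rows of 20 columns.
table = np.reshape(flat, (14, 20))
print(table.shape)

# Passing -1 lets numpy work out the row count from the column count.
inferred = flat.reshape(-1, 20)
print(inferred.shape)
```

The total number of elements must equal rows x columns (here 14 x 20 = 280); otherwise numpy raises a ValueError.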
As we can see, the data type of all the values is still string, but remember, the original file had integers and floats in the data as well. Because a pandas data-frame can hold a different data type in each column, the next step is to convert the array into one.
Converting a Numpy Array to Data-frame
Line 20: This line finally does the job of converting the array of strings into a pandas dataframe.
The pandas.DataFrame() function takes the reshaped numpy array and the names of all 20 column headers as inputs. Fig 3 shows the resulting dataframe, and Fig 2 verifies this in the variable explorer.
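The sketch below shows the same call on a much smaller array. The three column headers are hypothetical; the real file uses 20 headers that are not listed in this article.

```python
import numpy as np
import pandas as pd

# Small made-up array of strings standing in for BusDataReshaped.
reshaped = np.array([["1", "1.06", "Slack"],
                     ["2", "1.045", "PV"]])

# Hypothetical column headers, one per column of the array.
headers = ["BusNo", "Voltage", "Type"]

frame = pd.DataFrame(reshaped, columns=headers)
print(frame)
```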
Referencing Values of a Pandas Dataframe
Line 22: Values of this dataframe can be accessed very conveniently with the dataframe.columnheader[index] syntax.
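A short sketch of this access pattern, using a made-up dataframe with hypothetical column names:

```python
import pandas as pd

# Made-up dataframe with hypothetical column headers.
frame = pd.DataFrame({
    "BusNo": [1, 2, 3],
    "Voltage": [1.06, 1.045, 1.01],
})

first = frame.Voltage[0]                # attribute style: column header, then index label
same = frame["Voltage"][0]              # bracket style, works for any column name
by_position = frame["Voltage"].iloc[2]  # explicit positional lookup

print(first, same, by_position)
```

The attribute style only works when the column header is a valid Python identifier and does not clash with an existing dataframe method; the bracket style has no such restriction.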
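A check of variable types is worthwhile at this point. A dataframe built from an array of strings stores every value as a string, so the integer and float columns need an explicit conversion, for example with pandas.to_numeric(), before each column holds its proper data type.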
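One way to check the column types, and to convert them where needed, is sketched below. It assumes a dataframe built from an array of strings; the column names are hypothetical, and pandas.to_numeric() is one possible conversion rather than necessarily the approach used in the original snippet.

```python
import numpy as np
import pandas as pd

# Dataframe built from an array of strings (made-up values, hypothetical headers).
frame = pd.DataFrame(
    np.array([["1", "1.06", "Slack"],
              ["2", "1.045", "PV"]]),
    columns=["BusNo", "Voltage", "Type"],
)
print(frame.dtypes)  # inspect the types before conversion

# Convert the numeric-looking columns explicitly; the text column is left alone.
frame["BusNo"] = pd.to_numeric(frame["BusNo"])
frame["Voltage"] = pd.to_numeric(frame["Voltage"])
print(frame.dtypes)  # inspect the types after conversion
```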
© 2022 StormsHalted