Kevin is data science and data engineer. He works for a large consulting company in Montreal, Canada. He has over 20 years experience.
Python and Excel are both powerful tools for data exploration and analysis. They're both powerful, and even more so together. There are different libraries that have been created over the past several years to integrate Excel and Python or vice versa. This article will describe them, provide details to acquire and install them and finally brief instructions to help you get started using them. The libraries are listed below.
Excel/Python Integration Options
Openpyxl is an open source library that supports OOXML standard. OOXML standards for open office extensible markup language. Openpyxl can be used with any version of Excel that supports this standard; meaning Excel 2010 (2007) to the present (currently Excel 2016). I haven’t tried or tested Openpyxl with Office 365. However, alternative spreadsheet application like Office Libre Calc or Open Office Calc which support the OOXML standard can also use the library to work with xlsx files.
Openpyxl support most Excel functionality or APIs, including read and write to files, charting, working with pivot tables, parsing formulas, using filters and sorts, creating tables, styling to name a few of the most used. In terms of data wrangling, the library works with datasets both large and small, however, you will see a performance degradation on very large datasets. To work with very large datasets, you will need to use the openpyxl.worksheet._read_only.ReadOnlyWorksheet API.
openpyxl.worksheet._read_only.ReadOnlyWorksheet is read only
Depending on memory availability of your computer, you can use this function to load large datasets into memory or into Anaconda or Jupyter notebook for data analysis or data wrangling. You cannot interface with Excel directly or interactively.
To write back your very large dataset, you use the openpyxl.worksheet._write_only.WriteOnlyWorksheet API to dump the data back into Excel.
Openpyxl can be installed into any Python support editor or IDE, like Anaconda or IPython, Jupyter or any other that you currently use. Openpyxl cannot be used directly inside of Excel.
Note: for this examples I am using Jupyter from the Anaconda suite which can be downloaded and installed from this address: https://www.anaconda.com/distribution/ or you can install just the Jupyter editor from : https://jupyter.org/
To install from command line (command or powershell on Windows, or Terminal on OSX):
Pip install openpyxl
To use to create an Excel workbook and worksheet:
- In the code above, we start by importing the Workbook object from the openpyxl library
- Next we define a workbook object
- Then we create an Excel file to store our data
- From the open excel Workbook, we get a handle on the active Worksheet (ws1)
- Afterwards, add some content using a “for” loop
- And finally save the file.
The two following screenshots show the execution of the tut_openpyxl.py file and save.
Read Data From Excel
The next example will demonstrate opening and reading data from an Excel file
- This is a basic example to read from an Excel file
- Import the load_workbook class from the openpyxl library
- Get a handle on the open workbook
- Get the active worksheet or a named worksheet using workbook[‘some_worksheet’]
- Finally, loop through the values on the sheet
The pyxll package is a commercial offering that can be added or integrated into Excel. A bit like VBA. The pyxll package cannot be installed like other standard Python packages since pyxll is an Excel add-in. Pyxll supports Excel versions from 97-2003 up to the present.
Installation instruction are located here: https://www.pyxll.com/docs/userguide/installation.html
The pyxll website contains several examples on using pyxll in Excel. They make use of decorators and functions to interact with a worksheet, menu and other objects in a workbook.
Another library is xlrd and its companion xlwt below. Xlrd is used to read data from an Excel Workbook. Xlrd was designed to work with older versions of Excel with the “xls” extension.
Installation of the xlrd library is done with pip as:
To open a workbook to read in the data from a worksheet follow these simple steps as in the code snippet below. The excelFilePath parameter is the path to the Excel file. The path value should be listed in double-quotes.
This brief example only covers the basic principle of opening a workbook and reading the data. The complete documentation can be found here: https://xlrd.readthedocs.io/en/latest/index.html
Of course, xlrd, as the name suggests, can only read in data from an Excel workbook. The library doesn’t provide the APIs to write to an Excel file. Luckily, xlrd has a partner called xlwt which is the next library to discuss.
The xlwt is designed to work with Excel files versions 95 through to 2003, which was the binary format prior to the OOXML (Open Office XML) format that was introduced with Excel 2007. The xlwt library works in candem with the xlrd library dscussed above.
The installation process is simple and straightforward. As with most other Python libraries, you can install using the pip utility as follows:
The following code snippet, adapted from the Read the Docs site on xlwt, provides the basic instructions on writing data to an Excel Worksheet, adding styling and using a formula. The syntax is easy to follow.
The write function, write(r, c, label='', style=<xlwt.Style.XFStyle object>) , provides the main capabilities to write content to an Excel spreadsheet, with formulas being an alternate method. The “r” and “c” parameters are for the row and column respectfully.
Complete documentation on using this Python package is located here: https://xlwt.readthedocs.io/en/latest/ . As I mentioned in the opening paragraph, xlwt and xlrd for that matter, are for xls Excel formats (95-2003). For Excel OOXML, you should use other libraries discussed in this article.
The xlutils Python is a continuation of xlrd and xlwt. The package provides more extensive setr of APIs for working with xls based Excel files. Documentation on the package is found here: https://pypi.org/project/xlutils/. To use the package you also need to install the xlrd and xlwt packages.
The xlutils package is installed using pip:
Pandas is a very powerful Python library used for data analysis, manipulation and exploration. It is one of the pillars of data engineering and data science. Once of the main tools or API in Pandas is the DataFrame, which is an in-memory table of data. Pandas can output the contents of the DataFrame to Excel using either openpyxl or xlsxwriter for OOXML files and xlwt (above) for xls file formats as its writing engine. You need to install these packages to work with Pandas. You don’t have to import them into your Python script to use them.
To install pandas, execute this command from the command line interface window, or terminal if you are using OSX:
Here is a screenshot of the script, the VS Code execution and Excel file that is created as a result.
The xlsxwriter package supports the OOXML format Excel, which means 2007 onwards. It is a full feature package including formatting, cell manipulation, formulas, pivot tables, charts, filters, data validation and drop-down list, memory optimization and images to name of the extensive features.
As mentioned previously, it is integrated with Pandas as well which makes it a wicked combination.
The complete documentation is located at their site here: https://xlsxwriter.readthedocs.io/
The following script starts by importing the xlsxwriter package from PYPI repository using pip. Next, define and create a workbook and Excel file. Then we define a worksheet object, xlWks, and add it to the Workbook.
For the sake of the example, I define a dictionary object, but can be anything like a list, a Pandas dataframe, data imported from some external source. I add the data to the Worksheet using an interation and add a simple SUM formula before saving and closing the file.
The following screenshot is the result in Excel.
This final Python package is not specifically for Excel. Rather, it is a Python wrapper for the Windows API which provides access to COM (Common Object Model). COM is a common interface to all Windows based applications, Microsoft Office including Excel.
This is a simple example of using COM to automate the creation of an Excel file, adding a worksheet and some data as well as adding a formula and saving the file.
There you have it: eight different Python packages for interfacing with Excel.
This article is accurate and true to the best of the author’s knowledge. Content is for informational or entertainment purposes only and does not substitute for personal counsel or professional advice in business, financial, legal, or technical matters.
© 2020 Kevin Languedoc
Kevin Languedoc (author) from Canada on March 31, 2020:
There are different ways to approach this. You can use an ORM library like https://www.sqlalchemy.org/ or a full stack framework like Python or Flask.
platinumOwl4 on March 30, 2020:
This is a great article packed with fantastic information. However, what I need is a library that will help me create a database for books and music.
Thanks in advance platimumowl4