Much of the public and private information are stored in the form of pdf documents. Data is the new oil, and every other organization wants to take leverage of the data by exploration, extraction and cleaning. In many cases, as discussed in this post, later on, useful information is embedded in pdf pages in the forms of tables. This is specifically true for the financial results of the listed companies, insurance summary, government data, etc. among others. If the pdf files are similar in format, we can automate the process of exploration, extraction, and cleaning the useful information from pdf into a more general format like CSV. This output CSV can then be used to load a database and can be used for further analysis.
In a previous post here, I explained the methods wherein one could convert pdf files to texts and then extract information applying regular expressions.
In this post, I present before you a case study wherein we search, extract, and export tabular data from a pdf file into a CSV format using python.
Case Study- extract tabular information from pdf
An Investment analytics firm “RAVE Analytics” analyses quarterly results of various companies and advises their clients based on its analysis. Mohan works as a Data Analyst for RAVE Analytics, and his work is to get the quarterly results for various companies from their respective websites and store it in a given format in a database for future use. He starts with the quarterly results of “Reliance Industries Limited.” The quarterly results of RIL are on their website but in pdf format. He needs the final data in the following form.
|CName |CCode|Year |Quarter |Revenue |EBITDA |NetProfit |EPS|
How will Mohan proceed? The sample input pdf file can be found here. For your reference, a snapshot is attached below.
Steps - extract tabular information from pdf
The above diagram depicts the high-level steps involved in the pdf extraction process. As you can see, there are three significant steps involved in the process. The square in the picture represents the step name, and the ellipse next to it represents the tool used in that process.
- Find page number where consolidated financial results appear. As you can see in the sample RIL quarterly results pdf, it has xx number of pages, and the quarterly results appear on page number yy. So to get to the actual data, we need to find out on which page number requisite information lies. PYPDF2 is used to convert pdf to text, and REGEX is used to find the page number where “Consolidated Financial Result” appears in pdf.
- Extract table from pdf as a data frame– For many people, this step seems to be a difficult task. While in python, various libraries for this task are available who try to perform decently, I have found that CAMELOT gives very accurate results with a little effort. Camelot is a Python library that makes it easy for anyone to extract tables from PDF files. You can read more about the library and the API it offers here.
- Clean data and export results – While Camelot gives a decent output data frame extracted from pdf, it has to be cleaned further using python tools such as pandas. After the data frame is cleaned, we have to reformat in the desired format and export it.
At a more granular level, as given in the jupyter notebook in the next section, the following tasks ( function) have been performed by Mohan in this process.
- Find page number of desired table in pdf document
- Extract table using page number from pdf document
- Clean extracted dataframe
- Reformat dataframe
- Extract desired information from each extracted and reformatted dataframe
- If pdf is successfully processed, move to processed folder
- If pdf encountered an error, move to error folder
- Export final result to csv and move to output folder
Jupyter Notebook – extract tabular information from pdf
This section presents the well documented Jupyter Notebook in python. I will recommed you to try this case study by yourself and let me know if you have any difficulties.
In this case study, we have studied one example of getting tabular data from a pdf file and export it in the desired format for further use. The output file is given here for your reference, and an image is also shown below.
While most of the cases will follow this process, the challenging part is to get your table out from pdf as a data frame. Some tables are demarcated by horizontal and vertical lines, while others are separated by space. I would recommend you to study the detailed documentation of CAMELOT library and try different kinds of pdf on your own.