Case Study: Extract Information From PDF Files

Introduction

Data analysts are often handed problems that involve PDF files. PDF stands for “Portable Document Format”. It was introduced to make it easier to share documents between computers and across operating systems, and it suits files that are not meant to be edited but still need to be easily shared and printed. A great deal of public and private information is stored as PDF documents. In this article, I present a simple case study in which the goal is to extract text information with Python and then export the data in CSV format.

Case Study – Information Extraction From PDF

An insurance company wants to digitize its legacy data. Most of the data is captured as text in Portable Document Format (PDF) files. The company wants to extract the required information and export it as CSV so that it can be loaded into a database easily.
For the purpose of this tutorial, we assume that we have only one PDF, which is stored here for your reference. We also assume that the PDFs are text-based rather than scanned images. For simplicity, we are going to extract the following information from the PDF:
  • Master Policy No.
  • Certificate No.
  • GST NO.

Steps – Extract Information from PDF

Figure: Steps to extract information from PDF files

The diagram above depicts the steps involved in extracting information from a PDF file. As you can see, the process has three steps.

  1. Convert PDF to text – In this step, we convert the PDF file to a text string. This can be achieved with Python packages such as pdfminer.six or PyPDF2. The code example for this tutorial uses PyPDF2. Please note that this process extracts text only from text-based PDFs.
  2. Extract Information from Text – Once the text data is available, the next step is to find the required fields. They can be located with the help of regular expressions.
  3. Reformat and Extract – After the required information has been extracted, this step stores it in a structure that makes the data easy to interpret and to export as CSV. This can be done with the pandas library for Python.
A well-documented Jupyter notebook covering each step is given for your reference in the next section.
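
The first two steps can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code. It assumes PyPDF2 2.0 or later, where `PdfReader` and `extract_text()` are available. The helper names (`pages_to_text`, `pdf_to_text`, `extract_fields`), the regular expressions and the sample text are my own assumptions about how the three labels might be printed in the document, so you will most likely need to adjust the patterns for your own files.

```python
import re

# The labels come from the field list in this case study; the patterns are
# assumptions about the layout ("<label>" + optional ":" or "-" + value).
FIELD_PATTERNS = {
    "Master Policy No.": re.compile(
        r"Master\s+Policy\s+No\.?\s*[:\-]?\s*(\S+)", re.IGNORECASE),
    "Certificate No.": re.compile(
        r"Certificate\s+No\.?\s*[:\-]?\s*(\S+)", re.IGNORECASE),
    "GST NO.": re.compile(
        r"GST\s+No\.?\s*[:\-]?\s*(\S+)", re.IGNORECASE),
}


def pages_to_text(pages):
    """Join the text of page-like objects that expose extract_text()."""
    # extract_text() may return None or "" for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in pages)


def pdf_to_text(pdf_path):
    """Step 1: read a text-based PDF into a single string."""
    # Imported here so the regex step below can be tried without PyPDF2.
    from PyPDF2 import PdfReader
    return pages_to_text(PdfReader(pdf_path).pages)


def extract_fields(text):
    """Step 2: return the first match for each field, or None if absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Hypothetical sample text standing in for the output of pdf_to_text().
sample_text = (
    "Master Policy No.: MP-0001\n"
    "Certificate No.: CERT-0002\n"
    "GST No.: 00ABCDE0000A0Z0\n"
)
fields = extract_fields(sample_text)
print(fields)
```

With a real file you would call `extract_fields(pdf_to_text("policy.pdf"))`, where `policy.pdf` is a placeholder file name. Returning `None` for a missing field, instead of raising an error, makes it easy to spot documents where a pattern did not match.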

Jupyter Notebook – Extract Information from PDF

This section presents the well-documented Jupyter notebook in Python. I recommend that you try this case study yourself and let me know if you run into any difficulties.

Conclusion

The case study in this example is the simplest possible case, but even in more complex cases you would follow the same steps. The approach can be extended to automate PDF extraction for millions of documents that share a similar format. Once that is done, you can also think about ways to include scanned PDF documents.
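
To give an idea of how the extension to many documents might look, here is a small sketch that loops over a folder of PDFs and writes one CSV row per file. Everything in it is an assumption made for illustration: the function names `process_folder` and `export_records`, the extra `source_file` column, and the choice to pass the per-file extraction logic in as the `extract` argument. I have also used Python's built-in `csv` module here instead of pandas so that the sketch has no dependencies; `pandas.DataFrame(records).to_csv(csv_path, index=False)` would do a comparable job.

```python
import csv
from pathlib import Path


def export_records(records, csv_path, columns):
    """Write one row per document; missing fields become empty cells."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        for record in records:
            writer.writerow({col: record.get(col) or "" for col in columns})


def process_folder(pdf_dir, csv_path, extract, columns):
    """Apply extract(path) -> dict to every *.pdf in pdf_dir, then export.

    Returns the number of exported documents and a list of
    (file name, error message) pairs for files that could not be read.
    """
    records = []
    failures = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            record = dict(extract(pdf_path))
        except Exception as error:
            # One unreadable file should not stop the whole batch.
            failures.append((pdf_path.name, str(error)))
            continue
        record["source_file"] = pdf_path.name
        records.append(record)
    export_records(records, csv_path, ["source_file", *columns])
    return len(records), failures
```

Here `extract` would be whatever function turns one PDF into a dictionary of fields, and `columns` would be the three field names from this case study. Collecting the failures separately gives you a list of documents to inspect by hand, which is also where scanned PDFs would tend to show up.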
