Python tools: How to extract table data from PDFs

Public account: You and the cabin by: Peter Editor: Peter

Hello, I’m Peter

We often need to work with PDF files, and extracting table data from them can be a real headache.

Unlike a Word document, a PDF's content cannot simply be copied: pasted text often loses its formatting or comes out garbled. So how do we extract table data from a PDF file? This article presents two solutions:

camelot
tabula

Tool 1: Camelot

The first is Camelot, a tool for extracting tables from text-based PDFs. It converts most tables directly into pandas DataFrames.

For more detailed information, please refer to the project address: github.com/camelot-dev…

Installation of camelot

Camelot can be installed in a number of ways. If there is an error, there is usually a solution on the Internet:

1. Install through Conda

conda install -c conda-forge camelot-py

2. Install using pip

pip install "camelot-py[base]"

3. Install from GitHub

First clone the project locally:

git clone https://www.github.com/camelot-dev/camelot

Then enter the project directory and install:

cd camelot

pip install ".[base]"

Use case

Here’s an example of how to use Camelot. Suppose we now have a one-page PDF file called test.pdf:

1. Read the file first

import camelot

tables = camelot.read_pdf("test.pdf")
tables

Export data to CSV format (Mode 1)

tables.export('test.csv',    # export file name
              f='csv',       # export format
              compress=True  # compress the output into a ZIP archive
             )

View information about tables:
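One way to inspect what Camelot extracted is the parsing report attached to each table. The sketch below assumes `tables` is the result of camelot.read_pdf; the helper name collect_reports is hypothetical and does nothing more than read each table's parsing_report attribute.

```python
def collect_reports(tables):
    """Return the parsing report of every table in a Camelot result.

    Each report is a dict; Camelot fills it with keys such as
    'accuracy', 'whitespace', 'order' and 'page'.
    """
    return [table.parsing_report for table in tables]


# Typical use (needs camelot-py and a real PDF, so it is left commented out):
# import camelot
# tables = camelot.read_pdf("test.pdf")
# for report in collect_reports(tables):
#     print(report)
```

A low accuracy or a high whitespace value in a report is usually a hint that the table was not detected cleanly.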

Export Mode 2:

tables[0].to_csv("test1.csv")

Convert data to DataFrame:
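Each Camelot table exposes its contents as a pandas DataFrame through the df attribute. A minimal sketch, assuming `tables` comes from camelot.read_pdf; first_dataframe is a hypothetical helper name, not part of Camelot.

```python
def first_dataframe(tables):
    """Return the first extracted table as a pandas DataFrame.

    Raises ValueError when no table was found, which is easier to
    diagnose than a bare IndexError.
    """
    if len(tables) == 0:
        raise ValueError("no tables were extracted from the PDF")
    return tables[0].df


# Typical use (needs camelot-py and a real PDF, so it is left commented out):
# import camelot
# df = first_dataframe(camelot.read_pdf("test.pdf"))
# df.head()
```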

Tool 2: Tabula

Tabula (used here through its Python wrapper, tabula-py) is more powerful than Camelot and can extract multiple tables at the same time. For details, please refer to github.com/chezou/tabu…

Installation

Tabula installation is very simple:

pip install tabula-py  # install the Python package

Verify that the library is installed successfully after installation:
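A quick check can be sketched with the standard library alone. tabula-py is a wrapper around tabula-java, so it needs a Java runtime as well as the Python package. The helper name check_tabula_setup is hypothetical, and the check only looks for a java executable on PATH.

```python
import importlib.util
import shutil


def check_tabula_setup():
    """Report whether tabula-py is importable and a java executable is on PATH."""
    return {
        "tabula": importlib.util.find_spec("tabula") is not None,
        "java": shutil.which("java") is not None,
    }


print(check_tabula_setup())
```

If either value is False, install the missing piece before calling tabula.read_pdf.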

Reading PDF files

Use tabula.read_pdf to read the PDF file:

import tabula

df1 = tabula.read_pdf("test.pdf", pages="all")

read_pdf returns a list, and here its only element is the DataFrame:

Output to a CSV file

Output the read data into a CSV file:

# Method 1: Export to CSV indirectly, through the DataFrame
df2 = df1[0]  # the first (and only) DataFrame in the list
df2.to_csv("test2.csv")

# Method 2: Convert the PDF to CSV directly
tabula.convert_into("test.pdf", "test3.csv", output_format="csv", pages='all')

The PDF file read above is relatively simple, only one page, and it happens to be a very standard tabular form of data. Here is a more complex example:

The PDF file is 3 pages long
Table data format varies from page to page

Here is the first page, and the first column can be viewed as an index:

On the second page there are two tables with lots of blank lines in between:

The data on page 3 is fairly regular:

All three pages are in the same PDF file.

Read the first table

tab1 = tabula.read_pdf("data.pdf",stream=True)
len(tab1)

When pages is not specified, tabula-py warns that only the first page is read by default, so the list has length 1.

After conversion to a DataFrame, the original index becomes an ordinary column (partial data shown):

Read all data in PDF

Use pages to read all data:

tab2 = tabula.read_pdf("data.pdf", pages="all")  # read the tables on every page
len(tab2)

With pages="all":

We get data for four tables with a list length of 4
After the first table is converted to a DataFrame, the original row index is no longer present. This differs from the result above (read without the pages parameter).

Get data for a specified page

tab3 = tabula.read_pdf("data.pdf",
                       pages=3,  # read only page 3
                       stream=True)
tab3[0]

Get data from two tables simultaneously:

tab4 = tabula.read_pdf("data.pdf",
                       pages="1,3",  # read pages 1 and 3 together
                       stream=True)
len(tab4)  # the list has length 2

Read data from a specified region (area)

With the area parameter:
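tabula-py's area parameter takes the region as (top, left, bottom, right), measured in PDF points from the top-left corner of the page. As an illustration, the hypothetical helper below converts a region measured in inches to points (72 points per inch). The coordinates and the page number in the usage comment are made up, not taken from data.pdf.

```python
POINTS_PER_INCH = 72


def area_from_inches(top, left, bottom, right):
    """Convert a region given in inches to the point-based list tabula expects."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return [float(edge) * POINTS_PER_INCH for edge in (top, left, bottom, right)]


area = area_from_inches(1, 0.5, 5, 8)
print(area)  # [72.0, 36.0, 360.0, 576.0]

# Passing the region to tabula (needs tabula-py, Java and a real PDF):
# import tabula
# tab5 = tabula.read_pdf("data.pdf", pages=2, area=area, stream=True)
```

If percentages are easier to work with than points, tabula-py also accepts relative_area=True, in which case the four values are read as percentages of the page height and width.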

Delete unwanted information

Delete unwanted field information from the table we read

Output files in different formats

You can output the obtained data to files in different formats. Take JSON format as an example:

tabula.convert_into("data.pdf",            # source file
                    "test4.json",          # output file name
                    output_format="json")  # output format
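The JSON file can be loaded back with Python's json module. The sketch below assumes the layout tabula normally writes: a list of tables, each with a data field holding rows of cells, where every cell carries its string in a text field. That layout is an assumption here, and table_rows is a hypothetical helper, so check both against your own test4.json.

```python
import json


def table_rows(tables):
    """Turn tabula-style JSON tables into plain lists of cell strings."""
    return [
        [[cell.get("text", "") for cell in row] for row in table.get("data", [])]
        for table in tables
    ]


# A tiny hand-written sample in the assumed layout (not real tabula output).
sample = json.loads(
    '[{"data": [[{"text": "name"}, {"text": "age"}],'
    ' [{"text": "Tom"}, {"text": "20"}]]}]'
)
print(table_rows(sample))  # [[['name', 'age'], ['Tom', '20']]]

# With a real file:
# with open("test4.json", encoding="utf-8") as fh:
#     rows = table_rows(json.load(fh))
```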

We can see that