The beginning of the story

Today I went to the finance department to check last month's salary slip and found the goddess looking unhappy, as if the sky were falling! After checking my slip I asked: "Goddess, why so unhappy? Isn't payday just around the corner?" She said: "The boss just gave me a task: put all of this month's invoices into an Excel sheet and hand it to him by the end of the day! There are so many invoices that I won't finish sorting them until tomorrow! No leaving on time for me today!" I said: "Then I'll finish it for you in ten minutes, and after work you can buy me a big dinner." After all, chances like this to get closer don't come often! Of course she gave me a look of disbelief. So I'll let my technology conquer her!

Main text

Here we take four invoices as an example and place the invoice pictures in the pic folder.

Open a receipt at random

I found these invoices online; I certainly wouldn't use the company's invoices for a tutorial! Otherwise I guess I'd be packing my bags tomorrow, together with my accounting sister! And she would hate me for it! Ha ha ha

Extraction target: amount, name, taxpayer identification number, drawer.

Finally, save the four contents of each invoice into Excel:

Libraries you need

The required libraries are as follows:

import io
import os
from PIL import Image as PI
import pyocr
import pyocr.builders
from cnocr import CnOcr

The installation command is as follows:

pip install pyocr
pip install cnocr

Installation is very simple!

The invoices contain Chinese text, so we need to recognize the Chinese characters in the picture, and cnocr is a good choice for that.

Note: in addition to the libraries above, you need to install two external programs; otherwise the code fails with an error.

Programs to install:

1. ImageMagick

2. Tesseract-OCR

The installation of these two programs is not covered here; you can find tutorials online.
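Before running the OCR code, it can help to confirm that the external programs are reachable from Python. The following is a minimal sketch, not part of the original script: `find_programs` is a hypothetical helper, and the executable names `tesseract`, `magick`, and `convert` are assumptions that may differ on your system.

```python
import shutil

def find_programs(names):
    """Map each program name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Assumed executable names; adjust them to match your installation.
for name, path in find_programs(["tesseract", "magick", "convert"]).items():
    print(name, "->", path or "not found on PATH")
```

If `tesseract` is reported as not found, adding its installation folder to the PATH environment variable is usually the first thing to check.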

03. Extract content

Here, take one of the pictures as an example to explain how to extract the target content: amount, name, taxpayer identification number and drawer.

Read the picture: pic/pic1.jpg

# Use the first OCR tool that pyocr finds (typically Tesseract)
tool = pyocr.get_available_tools()[0]
img_url = "pic/pic1.jpg"
# Read the picture as bytes and open it with PIL
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))

Extracting amount

We need to crop the region of the invoice that contains the amount.

# left, top, right, bottom: pixel coordinates of the amount region on your invoice
image_text1 = new_img.crop((left, top, right, bottom))
image_text1.show()

I arrived at the left, top, right, and bottom values by trial and error. Adjust them to match the layout of your own invoices.
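Because the crop box is given in pixels, it only fits pictures of the same size as the one it was tuned on. One possible way to reuse a box across scans of different resolutions is to scale it, as in this sketch. `scale_box` is a hypothetical helper, the sizes are made-up numbers, and it assumes every invoice has the same layout.

```python
def scale_box(box, ref_size, img_size):
    """Scale a (left, top, right, bottom) box measured on an image of
    ref_size (width, height) to an image of img_size (width, height)."""
    left, top, right, bottom = box
    sx = img_size[0] / ref_size[0]  # horizontal scale factor
    sy = img_size[1] / ref_size[1]  # vertical scale factor
    return (round(left * sx), round(top * sy),
            round(right * sx), round(bottom * sy))

# Hypothetical sizes: a box tuned on a 1000x600 scan, reused on a 2000x1200 scan.
print(scale_box((155, 450, 450, 470), (1000, 600), (2000, 1200)))
# prints (310, 900, 900, 940)
```

With PIL, the scaled box could then be passed to `new_img.crop(scale_box(box, ref_size, new_img.size))`, since `size` is a (width, height) tuple.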

The digits in the picture are then extracted with pyocr.
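OCR output for an amount often carries a currency sign, thousands separators, or stray spaces. The sketch below shows one way to turn such text into a number before writing it to Excel. `parse_amount` is a hypothetical helper, and it assumes the cropped region contains a single amount.

```python
import re
from decimal import Decimal

def parse_amount(text):
    """Return the first number found in OCR text as a Decimal, or None."""
    # Drop thousands separators and spaces that OCR tends to insert.
    cleaned = text.replace(",", "").replace(" ", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return Decimal(match.group()) if match else None

print(parse_amount("¥ 1,234.56"))  # prints 1234.56
```

Decimal is used instead of float so that amounts keep their exact two decimal places when they are summed later.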

Next, we extract the name.

Extract the name

# Pixel coordinates of the name region
left = 155
top = 450
right = 450
bottom = 470
image_obj2 = new_img.crop((left, top, right, bottom))
image_obj2.show()

The name is in Chinese, so we can no longer handle it the way we extracted the amount (digits). We need cnocr to recognize the Chinese text in the picture.

image_obj2.save("tmp.jpg")
ocr = CnOcr()
res = ocr.ocr("tmp.jpg")
print("".join(res[0]))

Extract taxpayer identification number

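# left, top, right, bottom: pixel coordinates of the taxpayer identification number region
image_text3 = new_img.crop((left, top, right, bottom))
image_text3.show()

# Recognize the text in the cropped region with pyocr
txt3 = tool.image_to_string(image_text3)
print(txt3)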

This extracts the taxpayer identification number from the picture.
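OCR can insert spaces or line breaks into a long identifier, so a light sanity check on the result may be worthwhile. The following sketch is not part of the original script: `clean_taxpayer_id` is a hypothetical helper, the accepted length of 15 to 20 letters or digits is an assumption, and the sample number is made up.

```python
import re

def clean_taxpayer_id(text):
    """Strip whitespace from OCR text and return it upper-cased if it looks
    like a taxpayer ID (15 to 20 letters or digits, an assumed range), else None."""
    candidate = re.sub(r"\s+", "", text).upper()
    return candidate if re.fullmatch(r"[0-9A-Z]{15,20}", candidate) else None

# Made-up sample with the kind of spacing OCR may introduce.
print(clean_taxpayer_id(" 91110108 5514 85215x\n"))  # prints 91110108551485215X
```

A None result is a hint that the crop box missed part of the number and should be adjusted.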

Extract the drawer

left = 528
top = 550
right = 670
bottom = 600
image_obj4 = new_img.crop((left, top, right, bottom))
image_obj4.show()

image_obj4.save("tmp.jpg")
ocr = CnOcr()
res = ocr.ocr("tmp.jpg")
print("".join(res[0]))

Because the drawer's name contains Chinese characters, we use cnocr to recognize it, just as we did for the name.

We have now extracted the four target fields from one invoice. Next, we recognize all the invoices in the pic folder and save the results to Excel.

04. Batch identify invoices and save them in Excel

Before reading the pictures, wrap the four operations above into functions (text1 to text4 in the loop below) so that they are easy to call on each invoice image.
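The article does not show the wrapped functions, so here is one possible shape for them, as a sketch only. `make_extractor`, `FakeImage`, and `fake_recognize` are hypothetical names; the fakes stand in for a PIL image and an OCR call so that the example runs without Tesseract or cnocr. In the real script, `recognize` would wrap `tool.image_to_string` or the cnocr steps shown earlier.

```python
def make_extractor(box, recognize):
    """Build a function that crops `box` from an image and runs `recognize` on the crop."""
    def extract(img):
        return recognize(img.crop(box)).strip()
    return extract

# Boxes taken from the name and drawer steps earlier in the article.
NAME_BOX = (155, 450, 450, 470)
DRAWER_BOX = (528, 550, 670, 600)

class FakeImage:
    """Stand-in for a PIL image so the sketch runs without any OCR engine."""
    def crop(self, box):
        return box

def fake_recognize(crop):
    """Stand-in for an OCR call; it simply describes the crop it was given."""
    return f" text in {crop} "

text2 = make_extractor(NAME_BOX, fake_recognize)    # name
text4 = make_extractor(DRAWER_BOX, fake_recognize)  # drawer
print(text2(FakeImage()))  # prints text in (155, 450, 450, 470)
```

Keeping the crop box and the recognizer as parameters means each field needs only one line, and the boxes can be tuned without touching the OCR code.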

Read all the pictures in the folder.

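filePath = 'pic'
pic_name = []
# os.walk yields (folder, subfolders, files); keep the list of file names
for i, j, name in os.walk(filePath):
    pic_name = name
# Print the file name of each picture
for i in pic_name:
    print(i)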
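Note that the `os.walk` loop keeps the file list of the last folder it visits, so it behaves as intended only when pic has no subfolders. An alternative sketch that lists just the image files directly inside the folder is shown here; `list_images` and the extension list are assumptions, not part of the original script.

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumed set of invoice image types

def list_images(folder):
    """Return the image file names directly inside `folder`, sorted by name."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
        and os.path.isfile(os.path.join(folder, name))
    )
```

Calling `list_images('pic')` would then replace the walk loop, and sorting keeps the Excel rows in a predictable order.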

Begin the identification and write the results to Excel.

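# Assumption: outwb and outws are an openpyxl workbook and its active sheet
# (the article does not show how they are created)
import openpyxl
outwb = openpyxl.Workbook()
outws = outwb.active
count = 1  # openpyxl rows start at 1
for i in pic_name:
    img_url = filePath + "/" + i
    with open(img_url, 'rb') as f:
        a = f.read()
    new_img = PI.open(io.BytesIO(a))
    # Write one row per invoice: name, taxpayer identification number, amount, drawer
    outws.cell(row=count, column=1, value=text2(new_img))
    outws.cell(row=count, column=2, value=text3(new_img))
    outws.cell(row=count, column=3, value=text1(new_img))
    outws.cell(row=count, column=4, value=text4(new_img))
    count = count + 1
# Save the workbook; openpyxl writes .xlsx data, so an .xlsx extension may be safer
outwb.save("Invoice summary – Li Yunchen.xls")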

The results are saved as Invoice summary – Li Yunchen.xls.

05. Summary

This article basically achieves its goal, and the results are quite good! The complete source code can be assembled from the snippets in the text (all of it has been shared here), so interested readers can try it for themselves!

Be sure to try it! Be sure to try it! Be sure to try it!

Finally, the approach in this article can be applied in other ways, for example:

  • Batch calculation of invoice totals

  • Batch classification by invoice type
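Both ideas can be combined once the amounts have been extracted. As a rough sketch with made-up data, the helper below sums amounts per invoice type. `totals_by_type` is a hypothetical function, and the type labels are assumed to come from your own classification step, which the article does not cover.

```python
from collections import defaultdict
from decimal import Decimal

def totals_by_type(rows):
    """Sum amounts per invoice type; rows are (invoice_type, amount_string) pairs."""
    totals = defaultdict(Decimal)  # a missing type starts at Decimal('0')
    for invoice_type, amount in rows:
        totals[invoice_type] += Decimal(amount)
    return dict(totals)

# Hypothetical rows, not taken from the article's invoices.
rows = [("catering", "120.50"), ("transport", "36.00"), ("catering", "79.50")]
print(totals_by_type(rows))
```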

And with that, the job I promised the goddess today is done! Afterwards, she came home with me and cooked me dinner!

This happiness came too suddenly! If you need the source code, join QQ group 754370353 to get it yourself. I know you're lazy, so I've put everything in one folder.

Without further ado, brothers, I’m going to eat!