
Author’s other platforms:

| CSDN: blog.csdn.net/qq_4115394…

| Juejin: juejin.cn/user/651387…

| Zhihu: www.zhihu.com/people/1024…

| GitHub: github.com/JiangXia-10…

| WeChat official account: 1024 notes

This article is about 2,589 words and takes 8 minutes to read

1 Introduction

PDF is a convenient format because a document looks the same no matter which viewer or editor opens it. Sometimes, though, we need to modify a document, and for that we have to convert the PDF to Word format. There are many conversion websites and programs online, but most of them are free for only a few uses, after which you have to pay for a VIP upgrade. Wouldn't it be convenient, and pretty cool, to write a PDF conversion program ourselves?

This article shows how to use Python to write a tool that converts PDF to Word.

2 Main text

I'm using Windows 10 and Python 3.7.

The only dependency is pdfminer3k, which can be installed with the following command:

pip install pdfminer3k
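Before running the script, it can help to confirm that the package is importable. The sketch below uses only the standard library. `is_importable` is a hypothetical helper name, not part of pdfminer3k, and the check relies on pdfminer3k installing its code under the module name `pdfminer`, which is the name the imports in this article use.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("pdfminer"):
        print("pdfminer is available")
    else:
        print("pdfminer not found; run: pip install pdfminer3k")
```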

The complete code is below. Each step is explained in the comments, so I won't go through them one by one here:

    # date: 2020-10-31
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import (LAParams, LTImage, LTCurve, LTFigure,
                                 LTTextBoxHorizontal)


    # Parse a PDF file; a PDF contains many kinds of objects
    def parse(pdf_path):
        # Open the PDF in binary mode
        with open(pdf_path, 'rb') as fp:
            # Create a PDF parser from the file object
            parser = PDFParser(fp)
            # Create a PDF document
            doc = PDFDocument()
            # Connect the parser with the document object
            parser.set_document(doc)
            doc.set_parser(parser)
            # Initialize the document (no password is passed)
            doc.initialize()

            # Stop if the document does not allow text extraction
            if not doc.is_extractable:
                raise PDFTextExtractionNotAllowed

            # Create a resource manager for shared resources
            rsrcmgr = PDFResourceManager()
            # Create a layout-analysis parameter object and an aggregator device
            laparams = LAParams()
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            # Create a PDF interpreter object
            interpreter = PDFPageInterpreter(rsrcmgr, device)

            # Counters for pages, images, curves, figures and horizontal text boxes
            num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0

            # doc.get_pages() yields the pages of the document
            for page in doc.get_pages():
                num_page += 1
                interpreter.process_page(page)
                # Receive the LTPage object for this page
                layout = device.get_result()
                for x in layout:
                    if isinstance(x, LTImage):
                        num_image += 1
                    if isinstance(x, LTCurve):
                        num_curve += 1
                    if isinstance(x, LTFigure):
                        num_figure += 1
                    if isinstance(x, LTTextBoxHorizontal):
                        num_TextBoxHorizontal += 1
                        # Append the text of this box to the output file
                        with open(r'test.doc', 'a', encoding='utf-8') as f:
                            results = x.get_text()
                            f.write(results)
                            f.write('\n')

            print('Number of objects:\n',
                  'pages: %s\n' % num_page,
                  'images: %s\n' % num_image,
                  'curves: %s\n' % num_curve,
                  'horizontal text boxes: %s\n' % num_TextBoxHorizontal)


    # Run the main function
    if __name__ == '__main__':
        pdf_path = r'C:\Users\Jiang\Desktop\test.pdf'
        parse(pdf_path)
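The script opens test.doc in append mode, so running it a second time adds the same text to the file again. One possible alternative, shown here only as a sketch, is to collect the text of each box into a list first and write the file once. `write_text_blocks` and `blocks` are hypothetical names that do not appear in the original code, and the sample strings stand in for the values returned by `get_text()`. Unlike the original loop, this version strips trailing newlines, so the blocks are not separated by blank lines.

```python
import os
import tempfile


def write_text_blocks(blocks, out_path):
    """Write each text block on its own line, replacing any existing file."""
    with open(out_path, "w", encoding="utf-8") as f:
        for block in blocks:
            # Normalise trailing newlines so each block ends with exactly one.
            f.write(block.rstrip("\n"))
            f.write("\n")


if __name__ == "__main__":
    # Stand-in strings; in the real script these would come from get_text().
    sample = ["First paragraph\n", "Second paragraph"]
    out = os.path.join(tempfile.gettempdir(), "write_text_blocks_demo.doc")
    write_text_blocks(sample, out)
    with open(out, encoding="utf-8") as f:
        print(f.read())
```

Because the file is opened in write mode, repeated runs leave a single copy of the text in the output file.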

The content of the PDF document is as follows:

The result of executing the above code is as follows:

The generated doc file looks like this:

The content of the doc document is the same as that of the PDF document:

3 Summary

This article showed how to use pdfminer3k in Python to extract the text of a PDF and save it as a doc file. The same code converts a PDF to TXT if the output file is given a .txt name. The script writes text only, so images and layout from the PDF are not carried over.

The source code for all of the hands-on articles is also synced to GitHub; feel free to download it if you need it.
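Switching between a doc and a TXT output only changes the name of the output file. As a small sketch, the output name could be derived from the PDF path instead of being hard-coded. `output_path_for` is a hypothetical helper, and `PureWindowsPath` is used here only because the example path in this article is a Windows path; on other systems `pathlib.Path` would be the usual choice.

```python
from pathlib import PureWindowsPath


def output_path_for(pdf_path, suffix=".txt"):
    """Return pdf_path with its extension replaced by suffix."""
    return str(PureWindowsPath(pdf_path).with_suffix(suffix))


if __name__ == "__main__":
    pdf_path = r"C:\Users\Jiang\Desktop\test.pdf"
    print(output_path_for(pdf_path))          # .txt output next to the PDF
    print(output_path_for(pdf_path, ".doc"))  # .doc output next to the PDF
```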
