A friend of mine needed to split a PDF file and, after searching online, found that PyPDF2 can do this, so I studied the library and took some notes. PyPDF2 is the library to use with Python 3; the earlier pyPdf library was its counterpart for Python 2.

You can install it directly with pip:

pip install pypdf2

Official documentation: pythonhosted.org/PyPDF2/

The library is built around the following main classes:


PdfFileReader.

This class provides read operations on PDF files. Its constructor is as follows:

PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)

The first argument can be either a file stream or a file path. The last three parameters control how warnings are handled; the default values are usually fine.
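An API that accepts either a path or a stream usually normalizes the argument first. The sketch below shows that general pattern in plain Python; `as_stream` is a hypothetical helper written for illustration and is not part of PyPDF2.

```python
import io


def as_stream(source):
    """Return a binary stream for `source`, which may be a path or an open stream."""
    if isinstance(source, str):
        # A path was given: read the whole file into memory so the
        # caller does not have to keep a file handle open.
        with open(source, "rb") as handle:
            return io.BytesIO(handle.read())
    # Anything else is assumed to already be a readable binary stream.
    return source
```

Either form can then be handled by the same code, because everything downstream only sees a stream.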

Once you have an instance, you can start working with the PDF. The main operations are as follows:

  • decrypt(password): Decrypts the PDF file if it is encrypted.

  • getDocumentInfo(): Retrieves the metadata of the PDF file. The return value is a DocumentInformation object. Printing it directly yields output similar to the following:

{'/ModDate': "D:20150310202949-07'00'", '/Title': '', '/Creator': 'LaTeX with hyperref package', '/CreationDate': "D:20150310202949-07'00'", '/PTEX.Fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.15 (TeX Live 2014/MacPorts 2014_6) kpathsea Version 6.2.0', '/Producer': 'pdfTeX-1.40.15', '/Keywords': '', '/Trapped': '/False', '/Author': '', '/Subject': ''}

  • getNumPages(): Returns the number of pages in the PDF file.

  • getPage(pageNumber): Returns the page at index pageNumber (numbering starts at 0) as a PageObject instance. Once you have the PageObject, you can add it to a Writer, insert it, and so on.

  • getPageNumber(page): The reverse of the method above: pass in a PageObject instance and get back its page number in the PDF file.

  • getOutlines(node=None, outlines=None): Retrieves the outlines (bookmarks) that appear in the document.

  • isEncrypted: A read-only property that records whether the PDF is encrypted. If the file itself is encrypted, it stays True even after decrypt() has been called.

  • numPages: The total number of pages in the PDF; a read-only property equivalent to calling getNumPages().
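The /ModDate and /CreationDate values in the sample output above use the PDF date format (D:YYYYMMDDHHmmSS followed by an optional time zone offset). Below is a minimal sketch for turning such a string into a datetime. The helper name `parse_pdf_date` is made up for this example, and it only handles strings that carry all six date and time fields.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS, optionally followed by Z or by +HH'mm' / -HH'mm'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a full PDF date string to a datetime (naive if no zone is given)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date: %r" % value)
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, tz_hour, tz_minute = groups[6:]
    if sign is None:
        tz = None
    elif sign == "Z":
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(tz_hour or 0), minutes=int(tz_minute or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


print(parse_pdf_date("D:20150310202949-07'00'").isoformat())
# 2015-03-10T20:29:49-07:00
```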


PdfFileWriter.

This class supports writing PDF files. Typically you read some PDF data with PdfFileReader and then use this class to operate on it.

No parameters are required to create an instance of this class.

The main methods are as follows:

  • addAttachment(fname, fdata): Embeds a file in the PDF as an attachment; fname is the file name to display and fdata is the file's data.

  • addBlankPage(width=None, height=None): Adds a blank page to the end of the PDF. If no size is specified, the size of the last page in the current Writer is used.

  • addPage(page): Adds a page to the PDF, usually one obtained from the Reader above.

  • appendPagesFromReader(reader, after_page_append=None): Copies the pages from reader into the current Writer instance. If after_page_append is specified, it is used as a callback that is invoked after each page is appended and receives the page as it was added to the Writer.

  • encrypt(user_pwd, owner_pwd=None, use_128bit=True): user_pwd is the user password, which opens the PDF file with restricted permissions; I could not find anything in the documentation about how to set those permissions. owner_pwd is the owner password, which allows unrestricted use. The third parameter selects whether 128-bit encryption is used.

  • getNumPages(): Returns the number of pages in the PDF.

  • getPage(pageNumber): Returns the corresponding page as a PageObject, which can then be added with the addPage method above.

  • insertPage(page, index=0): Inserts the page into the PDF. index specifies the position at which the page is inserted.

  • write(stream): Writes the content of the Writer to a file.
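The usual read-then-write flow built from the methods above can be sketched as a small helper. It relies only on getNumPages(), getPage() and addPage() as described in this article; `copy_pages` itself is a hypothetical name, and the file names in the trailing comment are placeholders.

```python
def copy_pages(reader, writer, page_indices):
    """Copy the pages at `page_indices` (0-based) from reader to writer."""
    total = reader.getNumPages()
    for index in page_indices:
        if not 0 <= index < total:
            raise IndexError("page %d is outside 0..%d" % (index, total - 1))
        # getPage() returns a PageObject, addPage() appends it to the writer.
        writer.addPage(reader.getPage(index))
    return writer


# Intended use with the classes described above:
#   reader = PdfFileReader(open('in.pdf', 'rb'))
#   writer = copy_pages(reader, PdfFileWriter(), [0, 2])
#   with open('out.pdf', 'wb') as out_file:
#       writer.write(out_file)
```

Keeping the helper free of file handling makes it easy to reuse for splitting, reordering or extracting pages.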


PdfFileMerger.

This class is used to merge PDF files. Its constructor takes one parameter: PdfFileMerger(strict=True). This parameter is discussed in more detail later.

Common methods:

  • addBookmark(title, pagenum, parent=None): Adds a bookmark to the PDF. title is the title of the bookmark and pagenum is the page that the bookmark points to.

  • append(fileobj, bookmark=None, pages=None, import_bookmarks=True): Adds the pages of fileobj to the end of the file. pages can be a (start, stop[, step]) tuple or a Page Range, and limits the operation to the specified range of pages from fileobj.

  • merge(position, fileobj, bookmark=None, pages=None, import_bookmarks=True): Similar to append, but the position argument specifies where the pages are added.

  • write(fileobj): Writes the data to a file.
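To make the position and pages arguments concrete, here is a small model in plain Python in which lists of labels stand in for PDF pages. It is a sketch of the semantics described above, assuming that a pages tuple behaves like the arguments of range(); the function names are hypothetical and this is not PyPDF2 code.

```python
def merged_order(existing, position, source, pages=None):
    """Model merge(): insert the selected `source` pages into `existing` at `position`."""
    if pages is None:
        pages = (0, len(source))          # default: every page of the source
    selected = [source[i] for i in range(*pages)]
    result = list(existing)
    result[position:position] = selected  # splice in without replacing anything
    return result


def appended_order(existing, source, pages=None):
    """Model append(): the same as merging at the very end."""
    return merged_order(existing, len(existing), source, pages)


a = ["a0", "a1", "a2"]
b = ["b0", "b1", "b2", "b3"]
print(appended_order(a, b, pages=(0, 2)))      # ['a0', 'a1', 'a2', 'b0', 'b1']
print(merged_order(a, 1, b, pages=(0, 4, 2)))  # ['a0', 'b0', 'b2', 'a1', 'a2']
```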

To use it, create a PdfFileMerger instance, add the PDF files you want to merge one after another with append or merge, and then save the result with write.

from PyPDF2 import PdfFileMerger


def merge_pdf():
    # Create an instance to merge files
    pdf_merger = PdfFileMerger()

    # Add the Week1_1.pdf file first
    pdf_merger.append('Week1_1.pdf')
    # Then insert the ex1.pdf file at position 0
    pdf_merger.merge(0, 'ex1.pdf')
    # Add a bookmark that points to page 1
    pdf_merger.addBookmark('This is a bookmark', 1)
    # Write it to a file
    pdf_merger.write('merge_pdf.pdf')

Now let's look at the strict parameter in PdfFileMerger(strict=True).

The official explanation of this parameter:

strict (bool): Determines whether user should be warned of all problems and also causes some correctable problems to be fatal. Defaults to True.

In other words, it controls whether the user is warned about every problem, and it also turns some otherwise correctable problems into fatal errors.

At first glance this parameter only affects how problems are reported, so the default seems fine. But when I tried to merge PDFs containing Chinese text, I got the following error:

Traceback (most recent call last):
  File "I:\python3.5\lib\site-packages\PyPDF2\generic.py", line 484, in readFromStream
    return NameObject(name.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 10: invalid continuation byte

During handling of the above exception, another exception occurred:

PyPDF2.utils.PdfReadError: Illegal character in Name Object

The error comes from UTF-8 decoding inside the library. I tried modifying the library source to use GBK instead, but that led to other errors. When strict is set to False in the constructor, the console prints the following warning instead:

PdfReadWarning: Illegal character in Name Object [generic.py:489]

The two files were nevertheless merged successfully. The merged output was inconsistent, though: running the same code several times, the Chinese text was sometimes handled correctly and sometimes came out garbled.
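The difference between the fatal error and the warning can be modelled in a few lines. This is a simplified sketch of the strict/non-strict pattern seen above, not PyPDF2's actual code: `NameReadError` and `read_name` are hypothetical names, and the lossy fallback in non-strict mode is an assumption made only for this example.

```python
import warnings


class NameReadError(Exception):
    """Hypothetical stand-in for a fatal read error."""


def read_name(raw, strict=True):
    """Decode a raw name as UTF-8, failing or warning depending on `strict`."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if strict:
            # Strict mode: the problem is fatal.
            raise NameReadError("Illegal character in Name Object")
        # Non-strict mode: report the problem and carry on.
        warnings.warn("Illegal character in Name Object")
        # Assumption for this sketch: fall back to a lossy decode.
        return raw.decode("utf-8", errors="replace")


sample = "\u4e2d\u6587".encode("gbk")  # two Chinese characters, GBK-encoded
try:
    read_name(sample)                  # strict=True by default
except NameReadError as err:
    print("strict:", err)
```

GBK-encoded bytes are generally not valid UTF-8, which is why the decode step fails here in the same way as in the traceback above.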

Besides the methods listed here, these classes offer others, for example for adding links; they can all be found in the official documentation.


Merge, split and encrypt PDF.

The script covers encrypting, decrypting, merging, splitting by page count, and splitting into a given number of parts.

Note: If the files contain Chinese text, the result may come out garbled; if you run the code a few more times, the Chinese may be displayed correctly. I don't know exactly why, it just seems to be a bit of black magic…

Full code:

# @Time : 2018/3/26 23:48
# @Author : Leafage
# @File : handlePDF.py
# @Software: PyCharm
# @describe: Merge, split and encrypt PDF files.
from PyPDF2 import PdfFileReader, PdfFileMerger, PdfFileWriter


def get_reader(filename, password):
    try:
        old_file = open(filename, 'rb')
    except IOError as err:
        print('File open failed! ' + str(err))
        return None

    # Create a Reader instance
    pdf_reader = PdfFileReader(old_file, strict=False)

    # Decrypt if necessary
    if pdf_reader.isEncrypted:
        if password is None:
            print('%s file is encrypted, password required! ' % filename)
            old_file.close()
            return None
        # decrypt() returns 0 when the password does not match
        if pdf_reader.decrypt(password) == 0:
            print('%s password is incorrect! ' % filename)
            old_file.close()
            return None
    # Do not close old_file here: the Reader loads pages from the open
    # stream on demand, so the file must stay open while it is in use.
    return pdf_reader


def encrypt_pdf(filename, new_password, old_password=None, encrypted_filename=None):
    """
    Encrypt the file corresponding to filename and generate a new file.
    :param filename: file path
    :param new_password: password used to encrypt the file
    :param old_password: password of the old file, if it is encrypted
    :param encrypted_filename: name of the encrypted file; defaults to <filename>_encrypted.pdf
    :return:
    """
    # Create a Reader instance
    pdf_reader = get_reader(filename, old_password)

    if pdf_reader is None:
        return

    # Create a Writer instance
    pdf_writer = PdfFileWriter()
    # Copy the pages from the Reader into the Writer
    pdf_writer.appendPagesFromReader(pdf_reader)

    # Re-encrypt with the new password
    pdf_writer.encrypt(new_password)

    if encrypted_filename is None:
        # Use the old file name (without extension) + '_encrypted' as the new file name
        encrypted_filename = filename.rsplit('.', 1)[0] + '_' + 'encrypted' + '.pdf'

    with open(encrypted_filename, 'wb') as out_file:
        pdf_writer.write(out_file)


def decrypt_pdf(filename, password, decrypted_filename=None):
    """
    Decrypt an encrypted file and generate a password-free PDF file.
    :param filename: previously encrypted PDF file
    :param password: corresponding password
    :param decrypted_filename: name of the decrypted file
    :return:
    """

    # Create a Reader and a Writer
    pdf_reader = get_reader(filename, password)
    if pdf_reader is None:
        return
    if not pdf_reader.isEncrypted:
        print('File is not encrypted, no action required! ')
        return
    pdf_writer = PdfFileWriter()

    pdf_writer.appendPagesFromReader(pdf_reader)

    if decrypted_filename is None:
        decrypted_filename = filename.rsplit('.', 1)[0] + '_' + 'decrypted' + '.pdf'

    # Write the new file
    with open(decrypted_filename, 'wb') as out_file:
        pdf_writer.write(out_file)


def split_by_pages(filename, pages, password=None):
    """
    Split a PDF into files with a fixed number of pages each.
    :param filename: name of the file to split
    :param pages: number of pages in each file after splitting
    :param password: password used to decrypt the file if it is encrypted
    :return:
    """
    # Get the Reader
    pdf_reader = get_reader(filename, password)
    if pdf_reader is None:
        return
    # Get the total number of pages
    pages_nums = pdf_reader.numPages

    if pages <= 1:
        print('Each file must contain more than 1 page! ')
        return

    # Number of PDF files after splitting (rounded up)
    pdf_num = pages_nums // pages + 1 if pages_nums % pages else pages_nums // pages

    print('The PDF will be split into %d files with %d pages each! ' % (pdf_num, pages))

    # Generate the PDF files in turn
    for cur_pdf_num in range(1, pdf_num + 1):
        # Create a new Writer instance
        pdf_writer = PdfFileWriter()
        # Generate the corresponding file name
        split_pdf_name = filename.rsplit('.', 1)[0] + '_' + str(cur_pdf_num) + '.pdf'
        # Calculate the current start position
        start = pages * (cur_pdf_num - 1)
        # Calculate the end position: the last page for the last file,
        # otherwise pages per file * number of files generated so far
        end = pages * cur_pdf_num if cur_pdf_num != pdf_num else pages_nums
        # print(str(start) + ',' + str(end))
        # Add the corresponding pages in sequence
        for i in range(start, end):
            pdf_writer.addPage(pdf_reader.getPage(i))
        # Write the file
        with open(split_pdf_name, 'wb') as out_file:
            pdf_writer.write(out_file)


def split_by_num(filename, nums, password=None):
    """
    Split a PDF file into nums parts.
    :param filename: file name
    :param nums: number of parts to split into
    :param password: password, if the file needs to be decrypted
    :return:
    """
    pdf_reader = get_reader(filename, password)
    if not pdf_reader:
        return

    if nums < 2:
        print('The number of parts must not be less than 2! ')
        return

    # Get the total number of pages in the PDF
    pages = pdf_reader.numPages

    if pages < nums:
        print('The number of parts must not be greater than the total number of pages in the PDF! ')
        return

    # Calculate how many pages each part should have
    each_pdf = pages // nums

    print('The PDF has %d pages and will be split into %d parts of %d pages each! ' % (pages, nums, each_pdf))

    for num in range(1, nums + 1):
        pdf_writer = PdfFileWriter()
        # Generate the corresponding file name
        split_pdf_name = filename.rsplit('.', 1)[0] + '_' + str(num) + '.pdf'
        # Calculate the current start position
        start = each_pdf * (num - 1)
        # Calculate the end position: the last page for the last part,
        # otherwise pages per part * number of parts generated so far
        end = each_pdf * num if num != nums else pages
        print(str(start) + ', ' + str(end))
        for i in range(start, end):
            pdf_writer.addPage(pdf_reader.getPage(i))
        with open(split_pdf_name, 'wb') as out_file:
            pdf_writer.write(out_file)


def merger_pdf(filenames, merged_name, passwords=None):
    """
    Pass in a list of files and merge them together.
    :param filenames: list of file names
    :param merged_name: name of the merged file
    :param passwords: list of corresponding passwords
    :return:
    """
    # Count how many files there are
    filenums = len(filenames)
    # Note that strict=False is required here
    pdf_merger = PdfFileMerger(strict=False)

    for i in range(filenums):
        # Get the password
        if passwords is None:
            password = None
        else:
            password = passwords[i]
        pdf_reader = get_reader(filenames[i], password)
        if not pdf_reader:
            return
        # append adds to the end by default
        pdf_merger.append(pdf_reader)

    with open(merged_name, 'wb') as out_file:
        pdf_merger.write(out_file)


def insert_pdf(pdf1, pdf2, insert_num, merged_name, password1=None, password2=None):
    """
    Insert pdf2 into pdf1 at the given page position.
    :param pdf1: pdf1 file name
    :param pdf2: pdf2 file name
    :param insert_num: page position at which pdf2 is inserted
    :param merged_name: name of the merged file
    :param password1: pdf1 password
    :param password2: pdf2 password
    :return:
    """
    pdf1_reader = get_reader(pdf1, password1)
    pdf2_reader = get_reader(pdf2, password2)

    # If either file fails to open, return
    if not pdf1_reader or not pdf2_reader:
        return
    # Get the total number of pages in pdf1
    pdf1_pages = pdf1_reader.numPages
    if insert_num < 0 or insert_num > pdf1_pages:
        print('Insert position is invalid: requested position %d, but pdf1 has %d pages! ' % (insert_num, pdf1_pages))
        return
    # Note that strict=False is required here
    m_pdf = PdfFileMerger(strict=False)
    # Pass the readers (as merger_pdf does) rather than reopening the files by name
    m_pdf.append(pdf1_reader)
    m_pdf.merge(insert_num, pdf2_reader)
    with open(merged_name, 'wb') as out_file:
        m_pdf.write(out_file)


if __name__ == '__main__':
    # encrypt_pdf('ex1.pdf', 'leafage')
    # decrypt_pdf('ex1123_encrypted.pdf', 'leafage')
    # split_by_pages('ex1.pdf', 5)
    split_by_num('ex2.pdf', 3)
    # merger_pdf(['ex1.pdf', 'ex2.pdf'], 'merger.pdf')
    # insert_pdf('ex1.pdf', 'ex2.pdf', 10, 'pdf12.pdf')
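
The page-range arithmetic behind split_by_pages and split_by_num can be checked on its own, without any PDF file. The two helpers below are hypothetical (they are not part of the script above) and return half-open (start, end) ranges of 0-based page indices; unlike the script, they do not enforce its minimum of 2 pages or 2 parts.

```python
def ranges_by_pages(total_pages, pages_per_file):
    """Ranges with at most `pages_per_file` pages each; the last one may be shorter."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]


def ranges_by_count(total_pages, parts):
    """Ranges for `parts` files; the last file also takes the remainder."""
    if not 1 <= parts <= total_pages:
        raise ValueError("parts must be between 1 and total_pages")
    each = total_pages // parts
    return [
        (each * i, each * (i + 1) if i < parts - 1 else total_pages)
        for i in range(parts)
    ]


print(ranges_by_pages(12, 5))  # [(0, 5), (5, 10), (10, 12)]
print(ranges_by_count(10, 3))  # [(0, 3), (3, 6), (6, 10)]
```

Each (start, end) pair corresponds to one output file whose pages are read with range(start, end), as in the loops above.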
