PyPDF2 is a pure-Python library for working with PDF files. We can use the PyPDF2 module to read and manipulate existing PDF files, but we can’t create a new PDF file from scratch using this module.
PyPDF2 Features
Some of the exciting features of PyPDF2 module are:
- Reading PDF file metadata such as the number of pages, author, creator, and the created and last updated times.
- Extracting the content of a PDF file page by page.
- Merging multiple PDF files.
- Rotating PDF file pages by an angle.
- Scaling PDF pages.
- Extracting images from PDF pages and saving them as image files using the Pillow library.
Installing PyPDF2 Module
We can use pip to install the PyPDF2 module.
$ pip install PyPDF2
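The examples in this article use the camelCase names (PdfFileReader, getPage(), and so on). To my knowledge, PyPDF2 3.0 removed these names in favor of PdfReader-style ones, so it can help to check which version got installed. The sketch below uses only the standard library; the helper names installed_version and uses_legacy_names are made up for this illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def uses_legacy_names(version):
    """Assumption: releases below 3.0 still accept names like PdfFileReader."""
    return int(version.split('.')[0]) < 3


version = installed_version('PyPDF2')
if version is None:
    print('PyPDF2 is not installed')
else:
    print(f'PyPDF2 {version}, legacy names expected: {uses_legacy_names(version)}')
```

If the check reports a 3.x release, installing an older release (for example with a version pin in pip) is one way to run the examples unchanged.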
PyPDF2 Examples
Let’s look at some examples to work with PDF files using the PyPDF2 module.
1. Extracting PDF Metadata
We can get the number of pages in the PDF file. We can also get the information about the PDF author, creator app, and creation dates.
import PyPDF2
with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    print(f'Number of Pages in PDF File is {pdf_reader.getNumPages()}')
    print(f'PDF Metadata is {pdf_reader.documentInfo}')
    print(f'PDF File Author is {pdf_reader.documentInfo["/Author"]}')
    print(f'PDF File Creator is {pdf_reader.documentInfo["/Creator"]}')
Sample Output:
Number of Pages in PDF File is 2
PDF Metadata is {'/Author': 'Microsoft Office User', '/Creator': 'Microsoft Word', '/CreationDate': "D:20191009091859+00'00'", '/ModDate': "D:20191009091859+00'00'"}
PDF File Author is Microsoft Office User
PDF File Creator is Microsoft Word
- The PDF file should be opened in binary mode. That’s why the file opening mode is passed as ‘rb’.
- The PdfFileReader class is used to read the PDF file.
- The documentInfo is a dictionary that contains the metadata of the PDF file.
- We can get the number of pages in the PDF file using the getNumPages() function. An alternative way is to use the numPages attribute.
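The '/CreationDate' and '/ModDate' values in the sample output are plain strings in the PDF date format. Here is a minimal sketch for turning such a string into a datetime object, using only the standard library. The function name parse_pdf_date is made up for this example, and it handles only the full form seen in the output above (shorter forms allowed by the PDF format are not covered).

```python
import re
from datetime import datetime, timedelta, timezone

# Full form only: D:YYYYMMDDHHmmSS plus an optional offset like +05'30' or Z.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f'unsupported PDF date: {value!r}')
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, off_hours, off_minutes = groups[6:]
    if sign is None:
        tz = None  # no offset given, so return a naive datetime
    elif sign == 'Z':
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(off_hours or 0), minutes=int(off_minutes or 0))
        tz = timezone(-offset if sign == '-' else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20191009091859+00'00'")
print(created.isoformat())  # 2019-10-09T09:18:59+00:00
```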
2. Extracting Text of PDF Pages
import PyPDF2
with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    # printing first page contents
    pdf_page = pdf_reader.getPage(0)
    print(pdf_page.extractText())
    # reading all the pages content one by one
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        print(pdf_page.extractText())
- The PdfFileReader getPage(int) method returns a PyPDF2.pdf.PageObject instance.
- We can call the extractText() method on the page object to get the text content of the page.
- The extractText() method does not return any binary data such as images.
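The page-by-page loop above can be wrapped in a small generator so that we can process a range of pages without printing everything. This is only a sketch: iter_page_text is a made-up helper, and FakeReader and FakePage are stand-ins written for this example so that it runs without a PDF file. They are not part of PyPDF2; a real PdfFileReader should work in their place because the helper only relies on numPages, getPage() and extractText().

```python
def iter_page_text(reader, start=0, stop=None):
    """Yield (page_number, text) pairs for the pages in [start, stop)."""
    if stop is None:
        stop = reader.numPages
    for page_num in range(start, stop):
        yield page_num, reader.getPage(page_num).extractText()


# Stand-in objects for demonstration only; they are NOT part of PyPDF2.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self._pages = [FakePage(text) for text in texts]
        self.numPages = len(self._pages)

    def getPage(self, index):
        return self._pages[index]


reader = FakeReader(['first page', 'second page'])
for number, text in iter_page_text(reader):
    print(number, text)
```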
3. Rotate PDF File Pages
PyPDF2 supports many kinds of page-by-page manipulation. We can rotate a page clockwise or counter-clockwise by an angle.
import PyPDF2
with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        pdf_page.rotateClockwise(90)  # rotateCounterClockwise()
        pdf_writer.addPage(pdf_page)
    with open('Python_Tutorial_rotated.pdf', 'wb') as pdf_file_rotated:
        pdf_writer.write(pdf_file_rotated)
- The PdfFileWriter is used to write the PDF file from the source PDF.
- We are using the rotateClockwise(90) method to rotate the page clockwise by 90 degrees.
- We are adding the rotated pages to the PdfFileWriter instance.
- Finally, the write() method of the PdfFileWriter is used to produce the rotated PDF file.
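PDF page rotation works in steps of 90 degrees, so it can be handy to validate an angle before passing it to rotateClockwise(). The following sketch also shows how a counter-clockwise rotation maps to an equivalent clockwise one. The function normalize_rotation is a made-up helper for this example, not a PyPDF2 function.

```python
def normalize_rotation(degrees, clockwise=True):
    """Return the equivalent clockwise rotation as one of 0, 90, 180, 270."""
    if degrees % 90 != 0:
        raise ValueError('rotation must be a multiple of 90 degrees')
    if not clockwise:
        # A counter-clockwise turn is a clockwise turn by the negative angle.
        degrees = -degrees
    return degrees % 360


print(normalize_rotation(90))                   # 90
print(normalize_rotation(90, clockwise=False))  # 270
print(normalize_rotation(450))                  # 90
```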
4. Merge PDF Files
import PyPDF2
pdf_merger = PyPDF2.PdfFileMerger()
pdf_files_list = ['Python_Tutorial.pdf', 'Python_Tutorial_rotated.pdf']
for pdf_file_name in pdf_files_list:
    with open(pdf_file_name, 'rb') as pdf_file:
        pdf_merger.append(pdf_file)
with open('Python_Tutorial_merged.pdf', 'wb') as pdf_file_merged:
    pdf_merger.write(pdf_file_merged)
The above code looks like it should merge the PDF files, but it produced an empty PDF file. The reason is that each source PDF file is closed at the end of its with block, before the actual write happens to create the merged PDF file.
It’s a bug in the version of PyPDF2 that was the latest at the time of writing. You can read about it in this GitHub issue.
An alternative approach is to use the contextlib module to keep the source files open until the write operation is done.
import contextlib
import PyPDF2
pdf_files_list = ['Python_Tutorial.pdf', 'Python_Tutorial_rotated.pdf']
with contextlib.ExitStack() as stack:
    pdf_merger = PyPDF2.PdfFileMerger()
    files = [stack.enter_context(open(pdf, 'rb')) for pdf in pdf_files_list]
    for f in files:
        pdf_merger.append(f)
    with open('Python_Tutorial_merged_contextlib.pdf', 'wb') as f:
        pdf_merger.write(f)
You can read more about it in this Stack Overflow question.
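The difference between the two merge programs comes down to when the source file handles get closed. Here is a small standard-library sketch that shows it with throwaway files instead of real PDFs. The file names a.bin and b.bin are placeholders created in a temporary directory.

```python
import contextlib
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small placeholder files to stand in for the source PDFs.
    paths = []
    for name in ('a.bin', 'b.bin'):
        path = os.path.join(tmp_dir, name)
        with open(path, 'wb') as fh:
            fh.write(name.encode())
        paths.append(path)

    # Pattern 1: each handle is closed as soon as its with block ends.
    handles = []
    for path in paths:
        with open(path, 'rb') as fh:
            handles.append(fh)
    closed_after_loop = [fh.closed for fh in handles]

    # Pattern 2: ExitStack keeps every handle open until the outer block ends.
    with contextlib.ExitStack() as stack:
        files = [stack.enter_context(open(p, 'rb')) for p in paths]
        open_inside_stack = [not f.closed for f in files]
        contents = [f.read() for f in files]
    closed_after_stack = [f.closed for f in files]

print(closed_after_loop)   # [True, True]
print(open_inside_stack)   # [True, True]
print(closed_after_stack)  # [True, True]
```

With the first pattern, anything that tries to read from the handles after the loop finds them closed. With ExitStack, all the handles stay readable inside the block and are still cleaned up together when it exits.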
5. Split a PDF File into Single-Page Files
import PyPDF2
with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    for i in range(pdf_reader.numPages):
        pdf_writer = PyPDF2.PdfFileWriter()
        pdf_writer.addPage(pdf_reader.getPage(i))
        output_file_name = f'Python_Tutorial_{i}.pdf'
        with open(output_file_name, 'wb') as output_file:
            pdf_writer.write(output_file)
The Python_Tutorial.pdf file has 2 pages. The output files are named Python_Tutorial_0.pdf and Python_Tutorial_1.pdf.
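The program above hard-codes the output file name prefix. If we want to derive the names from the source file instead, something like the following sketch could work. The function split_output_names and its start parameter are made up for this example.

```python
from pathlib import Path


def split_output_names(source, num_pages, start=0):
    """Build one output name per page, e.g. report.pdf -> report_0.pdf."""
    src = Path(source)
    return [
        str(src.with_name(f'{src.stem}_{start + i}{src.suffix}'))
        for i in range(num_pages)
    ]


print(split_output_names('Python_Tutorial.pdf', 2))
# ['Python_Tutorial_0.pdf', 'Python_Tutorial_1.pdf']
```

Passing start=1 numbers the files from 1, which matches the page numbers a reader sees in a PDF viewer.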
6. Extracting Images from PDF Files
We can use PyPDF2 along with Pillow (Python Imaging Library) to extract images from the PDF pages and save them as image files.
First of all, you will have to install the Pillow module using the following command.
$ pip install Pillow
Here is a simple program to extract images from the first page of the PDF file. We can extend it to loop over all the pages and extract every image from the PDF file.
import PyPDF2
from PIL import Image
with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    # extracting images from the 1st page
    page0 = pdf_reader.getPage(0)
    if '/XObject' in page0['/Resources']:
        xObject = page0['/Resources']['/XObject'].getObject()
        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                data = xObject[obj].getData()
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                else:
                    mode = "P"
                if '/Filter' in xObject[obj]:
                    if xObject[obj]['/Filter'] == '/FlateDecode':
                        img = Image.frombytes(mode, size, data)
                        img.save(obj[1:] + ".png")
                    elif xObject[obj]['/Filter'] == '/DCTDecode':
                        img = open(obj[1:] + ".jpg", "wb")
                        img.write(data)
                        img.close()
                    elif xObject[obj]['/Filter'] == '/JPXDecode':
                        img = open(obj[1:] + ".jp2", "wb")
                        img.write(data)
                        img.close()
                    elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                        img = open(obj[1:] + ".tiff", "wb")
                        img.write(data)
                        img.close()
                else:
                    img = Image.frombytes(mode, size, data)
                    img.save(obj[1:] + ".png")
    else:
        print("No image found.")
My sample PDF file has a PNG image on the first page and the program saved it with an “image20.png” filename.
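The if/elif chain in the program maps each PDF filter name to an output file extension. The same mapping can be kept in a dictionary, which makes it easier to see which filters are covered. This is a sketch only: FILTER_EXTENSIONS and image_file_name are names made up for this example, and it assumes the filter value behaves like a plain string, as in the program above.

```python
# Filter names and extensions taken from the program above.
FILTER_EXTENSIONS = {
    '/FlateDecode': '.png',
    '/DCTDecode': '.jpg',
    '/JPXDecode': '.jp2',
    '/CCITTFaxDecode': '.tiff',
}


def image_file_name(object_name, filter_name=None):
    """Return the output file name for an image object, or None if unsupported."""
    if filter_name is None:
        # No /Filter entry: the program above saves the raw pixel data as PNG.
        extension = '.png'
    else:
        extension = FILTER_EXTENSIONS.get(filter_name)
        if extension is None:
            return None
    # Drop the leading slash of the PDF object name, e.g. '/image20'.
    return object_name[1:] + extension


print(image_file_name('/image20', '/FlateDecode'))  # image20.png
print(image_file_name('/image20', '/DCTDecode'))    # image20.jpg
```

Returning None for an unknown filter makes the skipped case explicit, whereas the original chain silently ignores it.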