Easy way to convert PDF to text in Python


Hello, readers! In this article, we will focus on converting PDF data into text format, in detail.

So, let us begin! 🙂



Introduction – PDF to text conversion

What can we do when the data inside a PDF is needed for processing? Copying and saving every line of the PDF by hand is rarely feasible.

This is where converting PDF files into text files comes into the picture.

There are plenty of mobile applications that offer PDF to text conversion. With Python, however, we can integrate the conversion into a larger solution as one automated step. It is one more example of how Python can automate parts of a real-world workflow.

In this article, we will focus on converting PDF files to text files in Python.


Implementing the conversion of PDF to text format in Python

1. First, we need a PDF file to convert. We can either create one using the Python fpdf module or use one that already exists on the system.

In this example, we will make use of an existing PDF file.
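
Before converting, it can help to confirm that the file we picked really is a PDF. The following is a minimal sketch that relies only on the fact that PDF files begin with the bytes %PDF-. The helper name looks_like_pdf is hypothetical and not part of any library, and this is a quick sanity check rather than full validation:

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    # PDF files begin with the ASCII bytes "%PDF-" followed by a version number.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

For a well-formed file such as the op.pdf used in this article, looks_like_pdf('op.pdf') would return True.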

2. Next, we install the PyPDF2 module, which makes it easier to convert .pdf files to .txt files.

pip install PyPDF2

Collecting PyPDF2
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
     |████████████████████████████████| 77 kB 1.9 MB/s
Installing collected packages: PyPDF2
    Running setup.py install for PyPDF2 .. done
Successfully installed PyPDF2-1.26.0

PyPDF2 is a third-party Python library (not part of the standard library) that provides functions for reading PDF files and extracting their text.
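
The installation can also be checked from Python itself. The sketch below uses the standard-library importlib.metadata module (available from Python 3.8); installed_version is a hypothetical helper name. Checking the version is worthwhile because the example in this article uses the function names from the PyPDF2 1.26.0 release shown in the pip output, and newer releases of the library have renamed some of them:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version string, or None if PyPDF2 is not installed.
print(installed_version("PyPDF2"))
```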

3. Now comes the important step: we use the PyPDF2 module to write a script that performs the conversion.

Example:

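import PyPDF2

# Open the PDF in binary mode; the with-block closes it automatically
with open('op.pdf', 'rb') as obj:
    pdfR = PyPDF2.PdfFileReader(obj)

    # Total number of pages in the PDF
    cnt = pdfR.numPages

    # Page indexes are zero-based, so the valid values are 0 .. cnt-1
    data = [pdfR.getPage(i).extractText() for i in range(cnt)]

# Append the extracted text of every page to the output file
with open(r"C:\Users\SMulani\data.txt", "a") as opt:
    opt.writelines(data)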

op.pdf file:

[Image: PDF]

Output:

[Image: text file]

Explanation:

In the above code, we first import the PyPDF2 module. We then open the PDF in binary mode and pass the file object to PdfFileReader(), which gives us a reader object that points to the PDF file.

Next, we use the numPages attribute to get the number of pages in the PDF. The getPage() function returns the page object at a given index. Indexes start at 0, so the last page is at numPages - 1, and calling getPage() for every index in that range covers all the pages of the PDF.

The extractText() function then returns the text of a page as a string.

Finally, we open the output text file in append mode and use the writelines() function to write the extracted text to it.
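
The steps above can also be wrapped in a small reusable function. This is only a sketch under the same assumptions as the example, namely the PyPDF2 1.26.0 names PdfFileReader, numPages, getPage() and extractText(). The names extract_pages and pdf_to_text are hypothetical helpers and not part of PyPDF2:

```python
def extract_pages(reader):
    """Return a list with the text of every page of `reader`, in order."""
    # Page indexes are zero-based: 0 .. numPages - 1.
    return [reader.getPage(i).extractText() for i in range(reader.numPages)]


def pdf_to_text(pdf_path, txt_path):
    """Write the text of every page in `pdf_path` to `txt_path`.

    Returns the number of pages processed.
    """
    # Imported here so extract_pages() can be tried without PyPDF2 installed.
    import PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        pages = extract_pages(PyPDF2.PdfFileReader(pdf_file))

    with open(txt_path, "w", encoding="utf-8") as out:
        # A blank line between pages keeps the page boundaries visible.
        out.write("\n\n".join(pages))
    return len(pages)
```

Calling pdf_to_text('op.pdf', 'data.txt') would then write the text of all pages to data.txt and return the page count. Note that this sketch opens the output in write mode, so an existing data.txt is overwritten, unlike the append mode used in the example.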


Conclusion

With this, we have come to the end of this topic. Feel free to comment below in case you come across any questions.

For more such posts related to Python programming, stay tuned!

Till then, happy learning! 🙂
