Hello, readers! In this article, we will focus on converting PDF data into text format in Python, in detail.
So, let us begin!! 🙂
Introduction – PDF to text conversion
What do we do when the data inside a PDF is needed for processing? Copying and saving every line of the PDF by hand is rarely feasible.
This is where converting PDF files into text files comes into the picture.
There are plenty of mobile applications that offer PDF to text conversion. But with Python, we can integrate the conversion into our main solution as one automated step. This is yet another case where Python helps automate part of a real-life solution.
In this article, we will focus on converting PDF files to text files in Python.
Implementing the conversion of PDF to text format in Python
1. First, we need a PDF file to convert. We can either create one using the Python fpdf module or use one that already exists on the system.
In this example, we will be making use of an existing PDF file.
2. Next, we install the PyPDF2 module, which makes it easier to convert .pdf files to .txt files.
pip install PyPDF2
Collecting PyPDF2
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
     |████████████████████████████████| 77 kB 1.9 MB/s
Installing collected packages: PyPDF2
    Running setup.py install for PyPDF2 .. done
Successfully installed PyPDF2-1.26.0
PyPDF2 is a third-party Python library, installed with pip as shown above, that provides functions for reading PDF files and extracting their text.
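The install log above shows PyPDF2 1.26.0, and the class and method names used in this article come from that release. Newer major releases of PyPDF2 renamed several of them, so it can help to check which version is installed before running the script. Here is a minimal sketch that uses only the standard library (Python 3.8 or later); installed_version is a helper name introduced here for illustration.

```python
from importlib import metadata


def installed_version(package="PyPDF2"):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if PyPDF2 is installed, otherwise None
print(installed_version())
```

If this prints None, the pip install step did not target the interpreter that is running the script.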
3. Now comes the important step: we use the PyPDF2 module in a script to perform the conversion.
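import PyPDF2

# Open the PDF in binary read mode and create a reader for it
obj = open('op.pdf', 'rb')
pdfR = PyPDF2.PdfFileReader(obj)
cnt = pdfR.numPages
# Page indices are zero-based, so the last page is cnt - 1
pageobj = pdfR.getPage(cnt - 1)
data = pageobj.extractText()
# Append the extracted text to the output file
opt = open(r"C:\Users\SMulani\data.txt", "a")
opt.writelines(data)
opt.close()
obj.close()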
In the above code, we first import the PyPDF2 module. We then open the PDF in binary mode and pass the file object to PdfFileReader(), which creates a reader object for the PDF.
Then we use the numPages attribute to get the number of pages in the PDF. The getPage() function selects a single page by its zero-based index, so valid indices run from 0 to numPages - 1.
Next, the extractText() function returns the text of the selected page as a string.
Finally, we open the text file in append mode and use the writelines() function to write the extracted text into it.
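Since getPage() returns one page per call, converting a whole document takes a loop over every page index. The sketch below is one possible way to do that, assuming the PyPDF2 1.26.0 names used in this article (PdfFileReader, numPages, getPage, extractText). The helper names extract_all_text and pdf_to_text are introduced here for illustration and are not part of PyPDF2.

```python
def extract_all_text(reader):
    """Return the text of every page, joined by newlines.

    `reader` is assumed to expose the PyPDF2 1.26.0 interface: a
    `numPages` attribute and a `getPage(index)` method whose result
    has an `extractText()` method.
    """
    parts = []
    for index in range(reader.numPages):  # page indices are zero-based
        parts.append(reader.getPage(index).extractText())
    return "\n".join(parts)


def pdf_to_text(pdf_path, txt_path):
    """Write the text of every page in pdf_path to txt_path (hypothetical helper)."""
    # Imported here so extract_all_text stays usable without PyPDF2 installed
    import PyPDF2

    # The with blocks close both files even if extraction fails
    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        text = extract_all_text(reader)
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
```

With these helpers, calling pdf_to_text('op.pdf', 'data.txt') would write all pages to data.txt. How clean the extracted text is depends on how the PDF was produced, so it is worth checking the output file.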
With this, we have come to the end of this topic. Feel free to comment below in case you have any questions.
For more such posts related to Python programming, stay tuned with us!
Till then, Happy learning!! 🙂