Optical Character Recognition and Translation in Python


Hello, readers. Today, let’s look at Optical Character Recognition and translation in Python, and at some of the amazing things we can use them for.

What is Optical Character Recognition?

Optical Character Recognition, commonly abbreviated to OCR, is the mechanical or electronic conversion of scanned images of printed or typewritten text into machine-readable text.

Digitizing printed texts is a popular technique because it lets them be searched electronically, stored more compactly, shown online, and used in computer processes such as machine translation, text-to-speech, and text mining.

OCR technology has been adopted across a broad range of sectors in recent years, changing how documents are managed.

OCR has made it possible for scanned documents to become more than just image archives, converting them into completely searchable documents with computer-recognized text content.

With the aid of OCR, individuals no longer need to retype essential records manually to get them into electronic files.

Instead, OCR extracts the relevant information and enters it automatically.

The result is accurate, efficient processing of information in less time.

There are many application areas for optical character recognition, but the most important are as follows:

  • Banking
  • Assistive technology for people who are blind or visually impaired
  • Legal departments and law offices
  • Retail
  • Other sectors, including education and administration.

How to Recognize Text From Images Using Python?

Today, we’ll take a picture of a non-English newspaper from the internet and perform optical character recognition on it. This will convert the image to text.

Then, we’ll translate it to English to, essentially, read the newspaper.

You can read newspapers and books from around the world, without knowing the language!

1. Download the newspaper images

First, let’s download the newspaper clipping. For this we’ll use the wget command, which saves the file under its original name, National-Duniya.jpg. Alternatively, you can download the file manually and save it in the same folder as your code.

!wget 'http://www.rhitisports.com/india/wp-content/uploads/2014/06/National-Duniya.jpg'
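If wget is not available (for example, outside a notebook), the same download can be sketched with Python's standard library. The helper names filename_from_url and download are my own, not part of any library, and the sketch assumes the URL is still reachable.

```python
import posixpath
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL, used as the local file name."""
    return posixpath.basename(urlparse(url).path)


def download(url, dest=None):
    """Save the resource at `url` to `dest` (defaults to the URL's file name)."""
    dest = dest or filename_from_url(url)
    urllib.request.urlretrieve(url, dest)
    return dest


url = "http://www.rhitisports.com/india/wp-content/uploads/2014/06/National-Duniya.jpg"
print(filename_from_url(url))
# download(url)  # uncomment to fetch the image (needs network access)
```

Keeping the original file name matters here, because the later steps open the image as National-Duniya.jpg.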

We can show our downloaded image with:

from PIL import Image, ImageDraw

# Open the downloaded clipping; in a notebook, a cell ending in `im` displays it
im = Image.open("National-Duniya.jpg")
im

Figure: National Duniya newspaper clipping

2. Install EasyOCR for Optical Character Recognition

This is the Python library that we’re going to use. It has support for over 70 languages!

Under the hood, it uses PyTorch and deep transfer learning techniques, building on models such as vgg16_bn.

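If you’re installing on Google Colab, like me, then you’ll need to do:

!pip install easyocr --no-deps

The --no-deps flag tells pip to skip EasyOCR’s dependencies, so it will not touch packages that are already installed, such as Colab’s PyTorch. On a fresh local setup, or if an import fails later, leave the flag out.

Google Colab is advised because EasyOCR runs much faster on a GPU. It can also run on a CPU, but that can be quite taxing on a personal PC.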

3. Display the List of Supported Languages

This is a list I made of the language codes EasyOCR uses (abridged here):

lang = ['abq','ady','af','ang','ar','as','ava','az','be',
        'pt','ro','ru','rs_cyrillic','rs_latin','sck','sk','sl', 'sq','sv','sw','ta','tab','th','tl','tr','ug','uk','ur','uz','vi']
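A list like this is handy for checking language codes before creating a reader. The following is only a minimal sketch: unsupported_codes is a hypothetical helper, and sample_supported is a small illustrative subset, not the full set of codes EasyOCR accepts.

```python
def unsupported_codes(requested, supported):
    """Return the requested codes that are not in the supported list, in order."""
    supported_set = set(supported)
    return [code for code in requested if code not in supported_set]


# Illustrative subset only; not the full list of codes EasyOCR accepts.
sample_supported = ["ar", "en", "hi", "ru", "ta", "ur"]

print(unsupported_codes(["hi", "en"], sample_supported))  # []
print(unsupported_codes(["hi", "xx"], sample_supported))  # ['xx']
```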

4. Create the Language Model

The next step is to set up the language model. In our case, we know that the language of the newspaper is Hindi, so we’ll create a reader that recognizes both Hindi and English (the codes hi and en).

Have a look at how you can set that up.

import easyocr

reader = easyocr.Reader(['hi','en'])

This will take a few seconds. The first run takes longer, because EasyOCR has to download the model files.
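EasyOCR's Reader also accepts a gpu argument, so the same code can fall back to the CPU when no GPU is present. Here is a small sketch of how that choice might be made; cuda_available is a hypothetical helper name, and the Reader call is left commented out so the snippet runs even without EasyOCR installed.

```python
def cuda_available():
    """Return True only if PyTorch is installed and reports a usable CUDA GPU."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


use_gpu = cuda_available()
print("Using GPU:", use_gpu)

# With EasyOCR installed, the flag can be passed to the reader:
# import easyocr
# reader = easyocr.Reader(['hi', 'en'], gpu=use_gpu)
```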

5. Create Bounding Boxes

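Then we let the model read the image. For each piece of text it finds, readtext returns the bounding box, the recognized text, and a confidence score:

result = reader.readtext('National-Duniya.jpg')
bounds = result  # same list, so the image is only read once; each entry starts with its bounding box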
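By default, each entry returned by readtext is a tuple of the bounding box (four corner points), the recognized text, and a confidence score. As a rough sketch, low-confidence detections can be filtered out before further processing. The sample data, the helper name keep_confident, and the 0.5 threshold are all made up for illustration.

```python
# Made-up sample in the shape readtext() returns by default:
# (four corner points, recognized text, confidence between 0 and 1).
sample_result = [
    ([[10, 10], [210, 10], [210, 40], [10, 40]], "National Duniya", 0.93),
    ([[10, 50], [180, 50], [180, 80], [10, 80]], "New Delhi", 0.88),
    ([[10, 90], [60, 90], [60, 120], [10, 120]], "|", 0.21),
]


def keep_confident(result, min_confidence=0.5):
    """Drop detections whose confidence is below `min_confidence`."""
    return [entry for entry in result if entry[2] >= min_confidence]


for bbox, text, confidence in keep_confident(sample_result):
    print(f"{confidence:.2f}  {text}")
```

A suitable threshold depends on the image, so it is worth looking at what gets dropped before settling on a value.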

We can also draw these bounding boxes on the image itself:

def draw_boxes(image, bounds, color='red', width=2):
    """Draw the outline of each detected text region onto the image."""
    draw = ImageDraw.Draw(image)
    for bound in bounds:
        # bound[0] holds the four corner points of the box
        p0, p1, p2, p3 = bound[0]
        # Connect the corners and close the outline back at p0
        draw.line([*p0, *p1, *p2, *p3, *p0], fill=color, width=width)
    return image

draw_boxes(im, bounds)

Figure: Bounding Boxes – Hindi To English
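The four corner points of a box need not form an upright rectangle, since text can be slightly tilted. When an axis-aligned rectangle is needed, for instance to crop a region, it can be derived from the corners. This is a minimal sketch; to_rectangle is a hypothetical helper and the corner values are invented.

```python
def to_rectangle(corners):
    """Convert four (x, y) corner points into (left, top, right, bottom)."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up, slightly tilted box for illustration.
corners = [[12, 8], [205, 14], [203, 44], [10, 38]]
print(to_rectangle(corners))  # (10, 8, 205, 44)
```

The resulting tuple has the (left, top, right, bottom) order that Pillow's Image.crop expects.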

As you can see, the text regions on the page have been detected.

Counting the entries in the result, for example with len(result), gives 75, which means EasyOCR found 75 lines of text.

We can combine it into one big piece of text using join and list comprehension:

res = " ".join([line[1] for line in result])

The result is quite accurate, although not free of errors:

National Duniya New Delhi, १३ June २०१४ सौ खिलाड़ियों की सूची में ११्वें स्थान पर रै भारतीय क्रिकेट कप्तान रैंक धोनी फोर्ब्स की सबसे अमीर खिलाड़ियों की सूची में ( एजेंसी भारतीय क्रिकेट शामिल है न्यूयॉर्क कीकुत रोबाल्डो कमाई नवदर ब पिछत रकसात मं मेवेदर ने पिछले एक साल में टीम के कप्तान महेंद्र सिंह धोनी आठ करोडडॉतर रती और दस कराड़ ५० नाखडॅवर की दस करोड़ ५० लाखडालर की फोर्ब्स की सबसे अमीर सौ वर दूसर स्थाब पर ऐैं कगाई की रै |इरासे वरतीब खिलाडियों कोीं सूची में अकेले कमाईं की है|इससे वह तीोन साल में साल गें दूसरी बर दुनिया के भारतीय हैं बार दुनिया के सबसे अमीर दूसरी मैड्रिड के सनसे अगैर खिताड़ी रो गर | सूचो में अमेरिकी मुक्केबाज खिलाड़ी हो गरIरियल फ्लायड मेवेदर शीर्ष पर हैं जबकि स्टार फुटबॉलर क्रिस्टियानो रोनाल्डो समेत १५ फुटबॉलर शीर्ष इसमें गोल्फर टाइगर वुड्स और तेककारेद् धोनी की कुल कमाई टेनिस स्टार रोजर फेडरर तथा १०० में हैं | रोनाल्डो की कुल कमाई डॉलर ओरँविज्ञापनों से कमाई रफेल नडाल भी हैं FITSOUL आठ करोड़ डालर रही और वह करोड़ ८० लाख डॉलर रै |वर सूची धोनी को कुल कमाई तीन करोड़ दूसरे स्थान पर हैं में ११वें स्थान परऐ डालर और विज्ञापनों सेकमाईदे अमेरिकी बास्केटबाल खिलाड़ी करोड़ ६० लाखडालर है|वह सूची फोर्ब्सने बताय कि धोनी भारत धोनीने २०१३ के आखिरमें रीबाक के साथ उनके करार सेयह लेबोन जेन्स तीसरे और अजेंटीना में २२चें स्थान पर हैं |फोर्ब्स ने कहा के सर्वश्रेष्ठ कप्तानों में सेएक हैं के फुटबॉलर लियोनेल मैसी चौथे दस लाख डालर अधिक था|उनकी बल्ल क लिए प्रायाजन करार स्पार्टन स्पोर्ट्स और एमिटी स्थान पर हैं | वुड्स छठे स्थान पर हैं कि वेतन और विज्ञापनों को कमाई कमाई में जून २०१३ से जून २०१४ वह आईसीसी के तीनों खिताब यूनिवर्सिटी सेकिया जो करीब ४० फेडरर सतवें और नडाल नौवें के दम पर धोनी की जून २०१४ में जीतने चाले पहले भारतीय तक वेतन, बोनस इनामी राशि, कप्तान हैं अपीयरेंस फीस, विज्ञापन से कमाई स्थान पर हैं अय४० लाखडालररहा लाख डालर का थाIइसस पहल
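Notice that the numbers in the output are written with Devanagari digits. If plain 0-9 digits are easier to work with, they can be mapped with the standard library alone. A small sketch, where to_ascii_digits is a hypothetical helper name:

```python
# Devanagari digits ० to ९ occupy code points U+0966 to U+096F.
DEVANAGARI_TO_ASCII = {0x0966 + i: str(i) for i in range(10)}


def to_ascii_digits(text):
    """Replace Devanagari digits with 0-9, leaving all other characters alone."""
    return text.translate(DEVANAGARI_TO_ASCII)


print(to_ascii_digits("New Delhi, १३ June २०१४"))  # New Delhi, 13 June 2014
```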

6. Translate Output Text to English

If you can’t read the language that EasyOCR recognized, you’ll need some help getting it translated. So let’s use one of the best-known translation services out there, Google Translate, through googletrans, an unofficial Python client for it. You can install it with pip:

!pip install googletrans

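Now we can translate our above text:

from googletrans import Translator

translator = Translator()
# Synchronous call; in googletrans releases where translate() is a coroutine, await it instead
translation = translator.translate(res, src='hi', dest='en')
print(translation.text)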

which gives us:

National Duniya New Delhi, 13 June 2014 Ranked 11th in the list of hundred players, Indian cricket captain rank Dhoni in Forbes list of richest players (agency Indian cricket includes New York Kikut Robaldo earning Navdar and Mayweather in backward raksat last year In the team captain Mahendra Singh Dhoni has earned eight crores of millions and ten crores of 50 crores and ten crores of 50 million dollars, for the richest hundred of Forbes, he is the only player in the list of erratic players. In the second year in the world, Indians are the second time in the world, the second richest in the world, after the second richest Madrid, the American boxer player is in the list. Real Floyd Mayweather is on top, while 15 footballer including star footballer Cristiano Ronaldo is the top golfer Tiger Woods and Tekkared Dhoni's total earnings are in tennis star Roger Federer and 100. Ronaldo's total earnings from dollars and advertisements are Rafael Nadal also FITSOUL was 80 million dollars and he got $ 60 million. Ranked 11th in the US with $ 11 million and American Basketball player in advertisements worth Rs 70 lakh. That list was reported by Forbes as Dhoni India Dhoni ranked third in LeBon Jones and 22nd in Argentina by his tie with Reebak at the end of 2013. One of the best captains of the season is that footballer Lionel Massey was fourth in a million dollars. The Spartan Sports and Amity ranks for his batting. Woods is in sixth position in earning salary and advertisements from June 2013 to June, 2016, he won all three ICC titles at University Sekia, which was the first Indian to win Dhoni's win in June 2014 on the basis of 60 Federer Satv and Nadal Ninth. The amount is the captain, the appearance fees, earning from advertising is in place, it was 70 lakh dollars and it was worth millions of dollars.
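Longer OCR output may be more than a translation service accepts in a single request. One possible workaround is to split the text into smaller pieces first. The sketch below is an assumption-laden illustration: split_into_chunks is a hypothetical helper, and the default of 1000 characters is an arbitrary placeholder, not a documented limit.

```python
def split_into_chunks(text, max_chars=1000):
    """Split `text` on whitespace into pieces of at most `max_chars` characters.

    A single word longer than `max_chars` is kept whole in its own piece.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "one two three four five six"
print(split_into_chunks(sample, max_chars=10))  # ['one two', 'three four', 'five six']
```

Each piece could then be translated separately and the results joined with spaces.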

Thus, I have read a newspaper in a language that I am less comfortable with.

Apart from newspapers, this approach has a wide variety of uses:

  • ordering food in a Chinese/Japanese restaurant
  • traveling to the Middle East
  • reading street signs, etc.

Ending Note

If you liked reading this article and want to read more, continue to follow the site! We have a lot of interesting articles coming up in the near future. To stay updated on all the articles, don’t forget to follow us on Twitter and sign up for the newsletter for some interesting reads!
