Text extraction from images using machine learning

Emproto Technologies

Content Writer

Extracting data from images is, in a sense, a matter of teaching a machine how to read. The first step is to teach the algorithm to see the text, which is generally called text recognition. The next is to process that text and arrange it in tabular form so that it can be used for analysis.

Extracting text from an image is considered a straightforward process; organizing the extracted data into a table is the difficult part.

Emproto was mandated by an executive agency of the European Commission to build a platform that would extract essential information, such as ingredients and nutritional information, from food wrappers. The objective was to analyze the health quotient of packaged food products available in the European Union.

How did we accomplish the complete task?

Process Overview

Field workers from the agency would go to supermarkets and photograph the wrappers of the food products available there. The images would then be uploaded to the platform, which had to process them and extract the data. The extracted data had to be organized in a tabular format, and would then be checked and approved by an admin.

Text Extraction

We began with the most widely known text recognition technique, OCR (Optical Character Recognition). OCR converts documents and captured images into editable, searchable data. We used the OCR feature of Google Cloud’s Vision API, which offers pre-trained machine learning models through REST and RPC APIs. Beyond reading printed and handwritten text, the API can assign labels to images, classify them into millions of predefined categories, and detect objects and faces.
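
As a rough illustration of the OCR step, the sketch below builds a request body for the Vision API's images:annotate REST endpoint and pulls words and their bounding boxes out of a response. It is written in Python for brevity and is not the code used on the platform. The helper names build_request and parse_words are hypothetical, and the sample response is hand-written in the shape of an API response rather than real output.

```python
import base64

# Public REST endpoint for Vision API image annotation.
ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_request(image_bytes, language_hints=None):
    """Build the JSON body for a TEXT_DETECTION request (hypothetical helper)."""
    request = {
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        "features": [{"type": "TEXT_DETECTION"}],
    }
    if language_hints:
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}


def parse_words(response):
    """Return (text, left, top, right, bottom) for each detected word."""
    annotations = response["responses"][0].get("textAnnotations", [])
    words = []
    # The first annotation holds the full text of the image; the rest are words.
    for annotation in annotations[1:]:
        vertices = annotation["boundingPoly"]["vertices"]
        # Zero-valued coordinates may be omitted from the JSON, so default to 0.
        xs = [v.get("x", 0) for v in vertices]
        ys = [v.get("y", 0) for v in vertices]
        words.append((annotation["description"], min(xs), min(ys), max(xs), max(ys)))
    return words


# Hand-written sample in the shape of a Vision API response (not real output).
sample_response = {
    "responses": [{
        "textAnnotations": [
            {"description": "Energy 250\n",
             "boundingPoly": {"vertices": [{"y": 10}, {"x": 120, "y": 10},
                                           {"x": 120, "y": 30}, {"y": 30}]}},
            {"description": "Energy",
             "boundingPoly": {"vertices": [{"y": 10}, {"x": 60, "y": 10},
                                           {"x": 60, "y": 30}, {"y": 30}]}},
            {"description": "250",
             "boundingPoly": {"vertices": [{"x": 90, "y": 10}, {"x": 120, "y": 10},
                                           {"x": 120, "y": 30}, {"x": 90, "y": 30}]}},
        ]
    }]
}

body = build_request(b"raw image bytes", language_hints=["en", "fr"])
words = parse_words(sample_response)
```

Sending body to ANNOTATE_URL requires an authenticated HTTP POST, which is left out here so that the sketch runs offline.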

Text Organization

As mentioned before, we also needed to convert the text into tabular form for further analysis. We used a function, getBoundingBoxes(), which returns a Rect object giving the position of each piece of text. From these we took the border coordinates of each text block, and then applied some arithmetic and matrix operations to arrange the blocks into a table.
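
The article does not spell out the arithmetic, so the following is only one plausible way to turn bounding boxes into a table: group the boxes into rows by vertical center, then order each row from left to right. The Box type, the to_table function and the row_tolerance parameter are assumptions made for illustration, not the platform's actual code, and the sample boxes are invented.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A recognized piece of text and its bounding rectangle, in pixels."""
    text: str
    left: int
    top: int
    right: int
    bottom: int

    @property
    def center_y(self):
        return (self.top + self.bottom) / 2


def to_table(boxes, row_tolerance=10):
    """Group boxes into rows by vertical center, then sort each row by x."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.center_y):
        # Join the current row if the box is vertically close to its last
        # member; otherwise start a new row.
        if rows and abs(box.center_y - rows[-1][-1].center_y) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [[b.text for b in sorted(row, key=lambda b: b.left)] for row in rows]


# Invented boxes, deliberately out of reading order.
boxes = [
    Box("250 kcal", 200, 12, 280, 30),
    Box("Energy", 10, 10, 80, 30),
    Box("Sugar", 10, 52, 70, 70),
    Box("12 g", 200, 50, 240, 70),
]
table = to_table(boxes)
```

A fixed pixel tolerance is the simplest choice; on photos of curved or tilted wrappers it would likely need to scale with text height, and aligning cells into shared columns would take a further pass over the left edges.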

One of the major challenges we encountered while building the solution was choosing the right tool. We considered two: Tesseract OCR and Vision API.

We finally moved forward with Vision API for the following reasons:

Tesseract OCR, as we used it, gave us only the extracted text. Vision API has classes that expose more of the result, such as the position of each text block.

In the case of multiple languages, Tesseract requires trained data for each language, whereas in Vision API the call

recognizer = vision.getCloudTextRecognizer(options);

allows us to recognize both Latin and non-Latin languages.

Tesseract works well with clear, clean images. Vision API copes better with imperfect samples that have creases, folds or glare.

We don’t have to train Vision API to get good results: its models are already trained and hosted on a cloud server, so improvements to them arrive without any work on our side.

Example: Here is a sample of what we’ve accomplished. The first image is the raw photo captured by the camera, and the second image shows the extracted data.

Conclusion

The accuracy of the results opens up a few exciting opportunities. We could fully automate the workflow, eliminating the need for manual approval. We could also use the data to build a crowdsourced food recommendation engine that guides people in their dietary habits.

Another Use Case involving Image Processing:

Object Detection

Most of us are familiar with the image classification problem: given an image, can you find out which class it belongs to? Many new image classification problems can be solved using pre-trained nets.

Object detection is a general term for computer vision tasks that identify objects in digital photographs. It involves locating each object in the image with a bounding box and determining the class of each located object. The input is a photograph containing one or more objects. The output is a set of bounding boxes, each defined by a point, a width, and a height, together with a class label for each box.
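
To make the output format concrete, a detection can be modeled as a labeled box given by a point, a width and a height. The Detection class below is a hypothetical illustration: it assumes the point is the top-left corner, and the sample values are made up rather than produced by any particular detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: top-left point (assumed), size, and class label."""
    x: int
    y: int
    width: int
    height: int
    label: str

    def corners(self):
        """Return (left, top, right, bottom) for drawing or cropping."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


# Made-up detections for a single photograph.
detections = [
    Detection(x=40, y=30, width=120, height=200, label="bottle"),
    Detection(x=220, y=90, width=80, height=60, label="apple"),
]

labels = [d.label for d in detections]
bottle_corners = detections[0].corners()
```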