Moment For Technology

React and tesseract.js for image-to-text conversion (OCR)

Posted on Sept. 23, 2022, 3:43 p.m. by Charlie Powell
Category: reading Tag: The front end

Data is the backbone of every software application, because the primary purpose of an application is to solve human problems. In order to solve the problems of human beings, it is necessary to have some information about them.

This information is represented as data, especially through computation. On the Internet, data is mostly collected in the form of text, images and videos. Sometimes, images contain essential text that has to be processed for some purpose. Most of these images used to be processed manually because there was no way to process them programmatically.

The inability to extract text from images was a data-processing limitation I experienced firsthand at my last company. We needed to process scanned gift cards, but since we couldn't extract the text from the images, we had to do it manually.

The company had an internal department called "Operations" that manually identified gift cards and credited them to users' accounts. While we had a website through which users contacted us, gift card processing was done manually behind the scenes.

At the time, we were using PHP (Laravel) as the back end and JavaScript (jQuery and Vue) as the front end. Our technology stack was sufficient to work with Tesseract.js, had management considered the issue important.

I wanted to fix it, but from a corporate or management point of view there was no need to. After leaving the company, I decided to do some research to find possible solutions. Eventually, I found OCR.

What is OCR?

OCR stands for "Optical Character Recognition" or "Optical Character Reader". It is used to extract text from images.

The evolution of OCR can be traced back to several inventions, but the Optophone, "Gismo", CCD flatbed scanners, the Newton MessagePad, and Tesseract were the major inventions that took character recognition to another level of usefulness.

So why use OCR? Well, optical character recognition solves a lot of problems, one of which prompted me to write this article. I realized that the ability to extract text from images opens up many possibilities, for example:

  • Regulation: Every organization needs to regulate the activities of its users for some reason. Regulation can be used to protect users' rights and keep them safe from threats or fraud. Extracting text from an image enables an organization to process the text on that image for regulation, especially when the images are supplied by users. For example, OCR allows for Facebook-like regulation of the amount of text on images used for advertising. In addition, hiding sensitive content on Twitter is also done through OCR.
  • Searchability: Search is one of the most common activities, especially on the Internet. Search algorithms are mostly based on operations on text. With optical character recognition, it is possible to identify the characters on an image and use them to provide relevant image results to the user. In short, both images and videos are now searchable with OCR.
  • Accessibility: Text on images is always an accessibility challenge, and the rule of thumb is to keep the text on an image to a minimum. With OCR, screen readers can access the text on an image and provide the necessary experience to their users.
  • Data processing automation: Data processing is mostly automated for scale. Text on an image limits data processing, because the text cannot be processed except manually. Optical character recognition (OCR) makes it possible to extract the text on images programmatically, which enables automated data processing, especially where text on images is involved.
  • Digitization: Everything is going digital, and there are still a lot of documents that need to be digitized. Checks, certificates and other physical documents can now be digitized using optical character recognition.

Discovering all of the above uses deepened my interest, so I decided to ask a further question.

"How can I use OCR on the web, especially in a React application?"

This question led me to tesseract.js.

What is tesseract.js?

Tesseract.js is a JavaScript library that compiles the original Tesseract from C to JavaScript and WebAssembly, thus making OCR available in browsers. The tesseract.js engine was originally written in asm.js and later ported to WebAssembly, but asm.js still serves as a fallback in cases where WebAssembly is not supported.

As stated on the Tesseract.js website, it supports more than 100 languages, automatically detects text orientation and scripts, and has a simple interface to read paragraph, word, and character boundaries.

Tesseract is an optical character recognition engine for various operating systems. It is free software and is distributed under the Apache license. HP developed Tesseract as proprietary software in the 1980s. It was released as open source in 2005, and its development has been sponsored by Google since 2006.

Version 4 of Tesseract, released in October 2018, contains a new OCR engine that uses a neural network system based on long short-term memory (LSTM), which is designed to produce more accurate results.

Understand the Tesseract APIs

To really understand how Tesseract works, we need to break down some of its APIs and their components. According to the tesseract.js documentation, there are two ways to use it. Here is the first method and its breakdown.

Tesseract.recognize(
  image, language,
  {
    logger: m => console.log(m)
  }
)
.catch(err => {
  console.error(err);
})
.then(result => {
  console.log(result);
})

The recognize method takes the image as its first argument, the language (which can be multiple) as its second argument, and { logger: m => console.log(m) } as its last argument. Tesseract supports image formats such as JPG, PNG, BMP, and PBM, which can only be supplied as elements (img, video, or canvas), File objects (from an <input> element), Blob objects, paths or URLs to an image, and Base64-encoded images. (See the tesseract.js documentation to learn more about all the image formats Tesseract can handle.)

The language is provided as a string, such as eng. The + sign can be used to concatenate several languages, such as eng+chi_tra. The language argument determines which trained language data is used when processing the image.
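To make the language argument concrete, here is a minimal sketch that joins several language codes with the + sign. The helper name buildLanguageString is hypothetical and not part of tesseract.js; it only prepares the string that would be passed as the second argument to recognize.

```javascript
// Hypothetical helper (not part of tesseract.js): joins language codes
// into the "eng+chi_tra" form described above.
function buildLanguageString(codes) {
  const cleaned = codes
    .map((code) => code.trim())
    .filter((code) => code.length > 0);
  if (cleaned.length === 0) {
    throw new Error('At least one language code is required');
  }
  return cleaned.join('+');
}

console.log(buildLanguageString(['eng', 'chi_tra'])); // eng+chi_tra
```

Keeping the codes in an array makes it easy to let users pick languages in a form and build the string only when the conversion starts.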

Note: You can find all available languages and their codes in the tesseract.js documentation.

{ logger: m => console.log(m) } is very useful for getting progress information about the image being processed. The logger property takes a function that is called multiple times as Tesseract processes the image. The logger function receives an object with the properties workerId, jobId, status, and progress.

{ workerId: 'worker-200030', jobId: 'job-734747', status: 'recognizing text', progress: '0.9' }

progress is a number between 0 and 1 that shows how far the image recognition process has come; multiplied by 100, it gives a percentage.

Tesseract automatically generates this object as an argument to the logger function, but it can also be supplied manually. As the recognition process progresses, the properties of the object are updated each time the function is called. Therefore, it can be used to display a conversion progress bar, change parts of the application, or achieve any other desired result.
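As a rough illustration of how the logger object could drive a progress display, the sketch below turns one message into a short label. The message shape (status and progress) follows the example object above; the helper name formatProgress and the label format are assumptions made for this example.

```javascript
// Turns a logger message into a short progress label.
// The helper name and label format are assumptions for illustration.
function formatProgress(message) {
  // progress may arrive as a number or as a numeric string, so normalize it
  const ratio = Number(message.progress);
  const clamped = Number.isFinite(ratio) ? Math.min(Math.max(ratio, 0), 1) : 0;
  const percent = Math.round(clamped * 100);
  return `${message.status}: ${percent}%`;
}

const sample = {
  workerId: 'worker-200030',
  jobId: 'job-734747',
  status: 'recognizing text',
  progress: '0.9',
};
console.log(formatProgress(sample)); // recognizing text: 90%
```

In a React component, the returned label could be stored in state from inside the logger callback and rendered next to the image.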

result in the above code is the outcome of the image recognition process. Each of the blocks, paragraphs, lines, words, and symbols in it has a bbox property holding the x/y coordinates of its bounding box.

Here are the attributes of the Result object and their meaning or purpose.

{
  text: "I am codingnninja from Nigeria...",
  hocr: "<div class='ocr_page' id= ...",
  tsv: "1 1 0 0 0 0 0 0 1486 ...",
  box: null,
  unlv: null,
  osd: null,
  confidence: 90,
  blocks: [{...}],
  psm: "SINGLE_BLOCK",
  oem: "DEFAULT",
  version: "4.0.0-825-g887c",
  paragraphs: [{...}],
  lines: (5) [{...}, ...],
  words: (47) [{...}, {...}, ...],
  symbols: (240) [{...}, {...}, ...]
}

  • text: All of the recognized text as a string.
  • lines: An array of every recognized line of text.
  • words: An array of every recognized word.
  • symbols: An array of every recognized character.
  • paragraphs: An array of every recognized paragraph.

We will discuss "confidence" later in this article.
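To show how these arrays might be used, the following sketch filters the words array for entries that Tesseract was unsure about. It assumes each word carries text, confidence, and bbox fields; the sample data is hand-written for the example and is not real OCR output, and the shape of bbox shown here is also an assumption.

```javascript
// Returns the recognized words whose confidence is below a cutoff.
// Assumption: each entry in `words` has `text`, `confidence` and `bbox`.
function findUncertainWords(words, minConfidence) {
  return words
    .filter((word) => word.confidence < minConfidence)
    .map((word) => ({
      text: word.text,
      confidence: word.confidence,
      bbox: word.bbox,
    }));
}

// Hand-written sample data, not real OCR output.
const sampleWords = [
  { text: 'Gift', confidence: 96, bbox: { x0: 10, y0: 12, x1: 58, y1: 30 } },
  { text: 'C4rd', confidence: 61, bbox: { x0: 64, y0: 12, x1: 112, y1: 30 } },
];

console.log(findUncertainWords(sampleWords, 75));
```

The bounding boxes of the returned words could then be used to highlight the doubtful regions on the image for manual review.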

Tesseract can also be used more imperatively, for example:

import { createWorker } from 'tesseract.js';

const worker = createWorker({
  logger: m => console.log(m)
});

(async () => {
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(text);
  await worker.terminate();
})();

This approach is related to the first approach, but implemented differently.

createWorker(options) creates a web worker or a Node child process that runs a Tesseract worker. This worker helps set up the Tesseract OCR engine. The load() method loads the Tesseract core scripts, loadLanguage() loads any language supplied to it as a string, initialize() makes sure Tesseract is fully ready for use, and the recognize() method is then used to process the supplied image. The terminate() method stops the worker and cleans everything up.
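One detail worth sketching is cleanup: if recognition throws, terminate() should still run. The function below wraps the sequence described above in try/finally. The name recognizeOnce is hypothetical, and the worker is passed in so the example can run with a stand-in object; in a real application the worker would come from createWorker().

```javascript
// Runs load -> loadLanguage -> initialize -> recognize and always
// terminates the worker, even when one of the steps fails.
async function recognizeOnce(worker, image, language) {
  try {
    await worker.load();
    await worker.loadLanguage(language);
    await worker.initialize(language);
    const { data: { text } } = await worker.recognize(image);
    return text;
  } finally {
    await worker.terminate();
  }
}

// Stand-in worker for demonstration only; it records the call order
// and returns a fixed result instead of doing any OCR.
function createFakeWorker(calls) {
  const step = (name, value) => async () => {
    calls.push(name);
    return value;
  };
  return {
    load: step('load'),
    loadLanguage: step('loadLanguage'),
    initialize: step('initialize'),
    recognize: step('recognize', { data: { text: 'hello' } }),
    terminate: step('terminate'),
  };
}

const calls = [];
recognizeOnce(createFakeWorker(calls), 'card.png', 'eng')
  .then((text) => console.log(text, '|', calls.join(' > ')));
```

Passing the worker in as an argument also makes the function easy to test without loading the OCR engine.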

Note: See the Tesseract APIs documentation for more information.

Now, we have to build something to really see tesseract.js in action.

What are we going to build?

We're going to build a gift card PIN extractor, because extracting PINs from gift cards is the problem that led to this writing adventure in the first place.

We will build a simple application that extracts the PIN from a scanned gift card. As I set out to build it, I'll walk you through some of the challenges I faced along the way, the solutions I came up with, and the conclusions I drew from my experience.

Here's the image we're going to use for testing, because it has some realistic properties that are likely in the real world.

We're going to extract the PIN from the card. So, let's get started.

Install React and Tesseract

Before installing React and tesseract.js, there is one question to note: Why use React and Tesseract? In fact, Tesseract can be used with Vanilla JavaScript, any JavaScript library or framework such as React, Vue, and Angular.

Using React in this case is a matter of personal preference. Initially, I wanted to use Vue, but I decided to use React because I was more familiar with React than Vue.

Now, let's proceed with the installation.

To install React with create-react-app and add tesseract.js, run the following commands.

npx create-react-app image-to-text
cd image-to-text
yarn add tesseract.js

or

npm install tesseract.js

I decided to install tesseract.js with Yarn because I couldn't install it with npm, while Yarn did the job without any trouble. You can use npm, but from my experience I recommend installing tesseract.js with Yarn.

Now, let's start our development server by running the following code.

yarn start

or

npm start

After running yarn start or npm start, your default browser should open the home page of the app.

You can also navigate to localhost:3000 in your browser if the page doesn't open automatically.

After installing React and tesseract.js, what's next?

Set up an upload form

In this case, we'll adjust the home page (App.js) we just viewed in the browser to include the form we need.

import { useState } from 'react';
import Tesseract from 'tesseract.js';
import './App.css';

function App() {
  const [imagePath, setImagePath] = useState("");
  const [text, setText] = useState("");

  // Store a preview URL for the file the user selected
  const handleChange = (event) => {
    setImagePath(URL.createObjectURL(event.target.files[0]));
  }

  return (
    <div className="App">
      <main className="App-main">
        <h3>Actual image uploaded</h3>
        <img
           src={imagePath} className="App-logo" alt="logo"/>

        <h3>Extracted text</h3>
        <div className="text-box">
          <p> {text} </p>
        </div>
        <input type="file" onChange={handleChange} />
      </main>
    </div>
  );
}

export default App

At this point, the part of the code above that needs our attention is the function handleChange.

const handleChange = (event) => {
    setImagePath(URL.createObjectURL(event.target.files[0]));
  }

In this function, URL.createObjectURL takes the file selected via event.target.files[0] and creates a reference URL that can be used with HTML tags such as img, audio, and video. We use setImagePath to add the URL to the state, so the URL is now available as imagePath.

<img src={imagePath} className="App-logo" alt="image"/>

We set the src attribute of the image to {imagePath} to preview it in the browser before processing it.

Convert the selected image to text

Since we have grabbed the path of the selected image, we can pass the path of the image to tesseract.js to extract the text from it.


import { useState } from 'react';
import Tesseract from 'tesseract.js';
import './App.css';

function App() {
  const [imagePath, setImagePath] = useState("");
  const [text, setText] = useState("");

  const handleChange = (event) => {
    setImagePath(URL.createObjectURL(event.target.files[0]));
  }

  const handleClick = () => {

    Tesseract.recognize(
      imagePath, 'eng',
      {
        logger: m => console.log(m)
      }
    )
    .then(result => {
      // In tesseract.js v2 the recognition output lives under result.data
      // Get Confidence score
      let confidence = result.data.confidence

      let text = result.data.text
      setText(text);

    })
    .catch(err => {
      // Placed after .then so that errors thrown there are reported too
      console.error(err);
    })
  }

  return (
    <div className="App">
      <main className="App-main">
        <h3>Actual imagePath uploaded</h3>
        <img
           src={imagePath} className="App-image" alt="logo"/>

        <h3>Extracted text</h3>
        <div className="text-box">
          <p> {text} </p>
        </div>
        <input type="file" onChange={handleChange} />
        <button onClick={handleClick} style={{height:50}}> convert to text</button>
      </main>
    </div>
  );
}

export default App

We add the "handleClick" function to "App.js"; it calls the Tesseract.js API with the path of the selected image. Tesseract.js takes the "imagePath", the "language", and a settings object.

The button below is added to the form to invoke "handleClick", which triggers the image-to-text conversion whenever the button is clicked.

<button onClick={handleClick} style={{height:50}}> convert to text</button>

When the processing is successful, we access "confidence" and "text" from the result. We then add "text" to the state with "setText(text)".

We display the extracted text by rendering it with <p> {text} </p>.

Obviously, "text" is extracted from images, but what is confidence?

The confidence level shows how accurate the conversion is. Confidence ranges from 0 to 100, where 0 is the worst and 100 is the best in terms of accuracy. It can also be used to decide whether an extracted text should be accepted as accurate.

So the question is, what factors affect the confidence score or the accuracy of the overall conversion? It is mostly influenced by three main factors: the quality and nature of the document used, the quality of the scan created from the document, and the processing abilities of the Tesseract engine.

Now, let's add the following code to "App.css" to give the application a bit of style.

.App {
  text-align: center;
}

.App-image {
  width: 60vmin;
  pointer-events: none;
}

.App-main {
  background-color: #282c34;
  min-height: 100vh;
  display: flex;
  flex-direction: column;
  align-items: center;
  justify-content: center;
  font-size: calc(7px + 2vmin);
  color: white;
}

.text-box {
  background: #fff;
  color: #333;
  border-radius: 5px;
  text-align: center;
}

Here are the results of my first test.

Results in Firefox

The above results have a confidence of 64. It's worth noting that the gift card image is dark in color, which definitely affects the results we get.

If you take a closer look at the image above, you'll see that the PIN from the card is almost accurate in the extracted text. It's not fully accurate because the gift card isn't really clear.

Oh, wait! What does it look like in Chrome?

Results in Chrome

Ah! The results are even worse in Chrome. But why are the results different in Chrome than in Mozilla Firefox? Different browsers treat images and their color profiles differently. This means that an image will look different depending on the browser. By providing Tesseract with pre-rendered image.data, it is likely to produce different results in different browsers, because Tesseract will be provided with different image.data depending on the browser used. As we will see later in this article, preprocessing the image will help achieve consistent results.

We need to be more accurate so that we can be sure that the information we receive or provide is correct. Therefore, we must go further.

Let's give it a try and see if we can get there in the end.

Test accuracy

There are many factors that affect tesseract.js's image-to-text conversion. Most of these factors revolve around the nature of the image we are working with, and the rest depend on how the Tesseract engine handles the conversion.

Internally, Tesseract preprocesses the image before the actual OCR conversion, but it doesn't always give accurate results.

As a solution, we can preprocess the image ourselves to achieve an accurate conversion. We can binarize, invert, dilate, deskew, or rescale an image to preprocess it for tesseract.js.

Image preprocessing is a lot of work, and a broad field in its own right. Fortunately, p5.js already provides all the image preprocessing techniques we want to use. Instead of reinventing the wheel, or pulling in the whole library just to use a small part of it, I copied the functions we need. All of these image preprocessing techniques are included in preprocess.js.

What is binarization?

Binarization is the conversion of the pixels of an image to either black or white. We want to binarize the previous gift card to see if the accuracy improves.

Earlier, we extracted some text from the gift card, but the target PIN wasn't as accurate as we'd like. Therefore, it is necessary to find another way to get accurate results.

Now, we're going to binarize the gift card, which means we're going to convert its pixels to black and white, so we can see if we can achieve better accuracy.

The following function will be used for binarization and is contained in a separate file called preprocess.js.

function preprocessImage(canvas) {
  const ctx = canvas.getContext('2d');
  const image = ctx.getImageData(0, 0, canvas.width, canvas.height);
  thresholdFilter(image.data, 0.5);
  return image;
}

export default preprocessImage

What does the above code do?

We introduce canvas to store image data, apply some filters, preprocess the image, and then pass it to Tesseract for conversion.

The preprocessImage function, located in preprocess.js, prepares the canvas for use by getting its pixels. The thresholdFilter function binarizes the image by converting its pixels to either black or white.
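For readers who want to see what such a filter does to the pixel data, here is a minimal sketch of a threshold filter. It is written in the spirit of the function described above, but the actual code in preprocess.js may differ; the grayscale weights and the default level are assumptions of this example.

```javascript
// Binarizes RGBA pixel data in place: every pixel becomes black or white.
// Sketch only; the real thresholdFilter in preprocess.js may differ.
function thresholdFilter(pixels, level = 0.5) {
  const threshold = Math.floor(level * 255);
  for (let i = 0; i < pixels.length; i += 4) {
    // Weighted grayscale value of the pixel (assumed luma weights)
    const gray =
      0.2126 * pixels[i] + 0.7152 * pixels[i + 1] + 0.0722 * pixels[i + 2];
    const value = gray >= threshold ? 255 : 0;
    // Write the same value to R, G and B; alpha (i + 3) is left untouched
    pixels[i] = pixels[i + 1] = pixels[i + 2] = value;
  }
  return pixels;
}

// Two pixels: a dark gray one and a light gray one.
const pixels = new Uint8ClampedArray([40, 40, 40, 255, 200, 200, 200, 255]);
console.log(Array.from(thresholdFilter(pixels, 0.5)));
// [0, 0, 0, 255, 255, 255, 255, 255]
```

Raising the level turns more pixels black, and lowering it turns more pixels white, which is why the level usually has to be tuned per image.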

Let's call preprocessImage to see if the text extracted from the previous gift card can be more accurate.

When we update App.js, it should now look like this code.

import { useState, useRef } from 'react';
import preprocessImage from './preprocess';
import Tesseract from 'tesseract.js';
import './App.css';

function App() {
  const [image, setImage] = useState("");
  const [text, setText] = useState("");
  const canvasRef = useRef(null);
  const imageRef = useRef(null);

  const handleChange = (event) => {
    setImage(URL.createObjectURL(event.target.files[0]))
  }

  const handleClick = () => {
    const canvas = canvasRef.current;
    const ctx = canvas.getContext('2d');

    // Draw the image on the canvas, preprocess it, and write it back
    ctx.drawImage(imageRef.current, 0, 0);
    ctx.putImageData(preprocessImage(canvas), 0, 0);
    const dataUrl = canvas.toDataURL("image/jpeg");

    Tesseract.recognize(
      dataUrl, 'eng',
      {
        logger: m => console.log(m)
      }
    )
    .then(result => {
      // In tesseract.js v2 the recognition output lives under result.data
      // Get Confidence score
      let confidence = result.data.confidence
      console.log(confidence)
      // Get full output
      let text = result.data.text
      setText(text);
    })
    .catch(err => {
      console.error(err);
    })
  }

  return (
    <div className="App">
      <main className="App-main">
        <h3>Actual image uploaded</h3>
        <img
           src={image} className="App-logo" alt="logo"
           ref={imageRef}
           />
        <h3>Canvas</h3>
        <canvas ref={canvasRef} width={700} height={250}></canvas>
        <h3>Extracted text</h3>
        <div className="pin-box">
          <p> {text} </p>
        </div>
        <input type="file" onChange={handleChange} />
        <button onClick={handleClick} style={{height:50}}>Convert to text</button>
      </main>
    </div>
  );
}

export default App

First, we need to import "preprocessImage" from "preprocess.js" with the following code.

import preprocessImage from './preprocess';

We then add a canvas tag to the form. We set the ref attributes of the canvas and img tags to {canvasRef} and {imageRef}, respectively. Refs are used to access the canvas and the image from the App component. We use "useRef" to get hold of the canvas and the image, as shown.

const canvasRef = useRef(null);
const imageRef = useRef(null);

In this part of the code, we draw the image onto the canvas because we can only preprocess a canvas in JavaScript. We then convert it to a data URL with "jpeg" as its image format.

const canvas = canvasRef.current;
const ctx = canvas.getContext('2d');

ctx.drawImage(imageRef.current, 0, 0);
ctx.putImageData(preprocessImage(canvas), 0, 0);
const dataUrl = canvas.toDataURL("image/jpeg");

"dataUrl" is passed to Tesseract as the image to process.

Now, let's check to see if the extracted text is more accurate.

Test #2

The image above shows the results in Firefox. It was clear that the dark parts of the image had turned white, but preprocessing the image did not produce more accurate results. The situation is even worse.

The first conversion had only two wrong characters, but this time there are four. I even tried changing the threshold level, but to no avail. We didn't get better results, not because binarization is bad, but because binarizing the image didn't fix its nature in a way that suits the Tesseract engine.

Let's see what it looks like in Chrome, too.

We got the same result.

After getting poor results through binarization, it is necessary to examine other image preprocessing techniques to see if we can solve this problem. So, we're going to try dilation, inversion, and blurring.

We'll get the code for each technique from the p5.js used in this article. We'll add image processing techniques to preprocess.js and use them one by one. Before using them, it is necessary to understand each of the image preprocessing techniques we will use, so we will discuss them first.

What is dilation?

Dilation is the addition of pixels to the boundaries of objects in an image, making them wider, larger, or more open. The "dilate" technique is used to preprocess our images to increase the brightness of the objects on them. We need a function that dilates images using JavaScript, so the code snippet to dilate an image is added to preprocess.js.
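To make the idea tangible, here is a simplified dilation sketch: each pixel takes the color of the brightest pixel among itself and its four direct neighbors, so light regions grow. This is not the snippet from preprocess.js; in particular, it takes the width and height directly instead of a canvas, which is a choice made only for this example.

```javascript
// Dilates RGBA pixel data in place so that light regions grow by one pixel.
// Simplified sketch; the dilate function in preprocess.js may differ.
function dilate(pixels, width, height) {
  const source = Uint8ClampedArray.from(pixels); // read from an untouched copy
  const brightness = (i) => source[i] + source[i + 1] + source[i + 2];

  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const center = (y * width + x) * 4;
      // Indices of the left, right, upper and lower neighbours
      // (falling back to the pixel itself at the image borders)
      const neighbours = [
        x > 0 ? center - 4 : center,
        x < width - 1 ? center + 4 : center,
        y > 0 ? center - width * 4 : center,
        y < height - 1 ? center + width * 4 : center,
      ];
      let best = center;
      for (const n of neighbours) {
        if (brightness(n) > brightness(best)) best = n;
      }
      pixels[center] = source[best];
      pixels[center + 1] = source[best + 1];
      pixels[center + 2] = source[best + 2];
    }
  }
  return pixels;
}

// A 3x1 strip: black, white, black. The white pixel spreads to both sides.
const strip = new Uint8ClampedArray([0, 0, 0, 255, 255, 255, 255, 255, 0, 0, 0, 255]);
console.log(Array.from(dilate(strip, 3, 1)));
```

Reading from a copy matters here: without it, a pixel that was just brightened would spread further in the same pass.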

What is blur?

Blurring is the smoothing of the colors of an image by reducing its sharpness. Sometimes, images have small dots or patches. To remove them, we can blur the image. The code snippet for blurring an image is included in preprocess.js.
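As an illustration of how blurring softens small specks, the sketch below applies a plain 3x3 box blur. It is a stand-in chosen for brevity and is not the blur routine from preprocess.js; the function name boxBlur and its signature are assumptions of this example.

```javascript
// 3x3 box blur on RGBA pixel data: each colour channel becomes the average
// of the pixel and its in-bounds neighbours. Alpha is left untouched.
function boxBlur(pixels, width, height) {
  const source = Uint8ClampedArray.from(pixels); // read from an untouched copy
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const sums = [0, 0, 0];
      let count = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          const nx = x + dx;
          const ny = y + dy;
          if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
          const n = (ny * width + nx) * 4;
          sums[0] += source[n];
          sums[1] += source[n + 1];
          sums[2] += source[n + 2];
          count++;
        }
      }
      const i = (y * width + x) * 4;
      pixels[i] = Math.round(sums[0] / count);
      pixels[i + 1] = Math.round(sums[1] / count);
      pixels[i + 2] = Math.round(sums[2] / count);
    }
  }
  return pixels;
}

// A single dark speck in the middle of a light 3x3 image.
const image = new Uint8ClampedArray(3 * 3 * 4).fill(255);
image[16] = image[17] = image[18] = 0; // centre pixel, RGB channels
console.log(boxBlur(image, 3, 3)[16]); // 227
```

The speck goes from pure black to a light gray, so a threshold filter applied afterwards is likely to turn it white.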

What is inversion?

Inversion changes the light areas of an image to dark, and the dark areas to light. For example, if an image has a black background and a white foreground, we can invert it so that the background becomes white and the foreground black. We also added a code snippet for inverting an image to preprocess.js.
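Inversion is the simplest of the techniques to sketch: every color channel is replaced with 255 minus its value. The function below is a minimal version under that assumption; the snippet in preprocess.js may be written differently.

```javascript
// Inverts RGBA pixel data in place: light becomes dark and dark becomes light.
// Alpha (every fourth value) is left untouched.
function invertColors(pixels) {
  for (let i = 0; i < pixels.length; i += 4) {
    pixels[i] = 255 - pixels[i];
    pixels[i + 1] = 255 - pixels[i + 1];
    pixels[i + 2] = 255 - pixels[i + 2];
  }
  return pixels;
}

// One white pixel followed by one black pixel.
const sample = new Uint8ClampedArray([255, 255, 255, 255, 0, 0, 0, 255]);
console.log(Array.from(invertColors(sample)));
// [0, 0, 0, 255, 255, 255, 255, 255]
```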

With dilate, invertColors, and blurARGB added to "preprocess.js", we can now use them to preprocess the image. To use them, we need to update the original "preprocessImage" function in preprocess.js.

preprocessImage(...) now looks like this.

function preprocessImage(canvas) {
  const level = 0.4;
  const radius = 1;
  const ctx = canvas.getContext('2d');
  const image = ctx.getImageData(0, 0, canvas.width, canvas.height);
  blurARGB(image.data, canvas, radius);
  dilate(image.data, canvas);
  invertColors(image.data);
  thresholdFilter(image.data, level);
  return image;
}

In preprocessImage above, we apply four preprocessing techniques to the image: blurARGB() removes the dots on the image, dilate() increases the brightness of the image, invertColors() switches the foreground and background colors of the image, and thresholdFilter() converts the image to a black and white one that is more suitable for Tesseract conversion.

thresholdFilter() takes image.data and level as parameters. level sets how white or black the image should be. We determined the thresholdFilter level and the blurARGB radius by trial and error, because we weren't sure how white, black, or smooth the image should be for Tesseract to produce a good result.

Test #3

Here are the new results after applying the four techniques.

The image above represents the results we got in Chrome and Firefox.

Oh dear! That's a bad result.

Instead of using all four techniques, why don't we just use two of them at once?

Yes! We can simply use the invertColors and thresholdFilter techniques to convert the image to black and white and switch its foreground and background. But how do we know which techniques to combine? We decide based on the nature of the image to be preprocessed.

For example, a digital image has to be converted to black and white, and an image with patches has to be blurred to remove the dots or patches. What really matters is understanding the purpose of each technique.
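One way to experiment with different combinations is to describe the steps as data and switch them on and off. The sketch below is only an illustration of that idea: runPipeline, the enabled flag, and the small invert and threshold helpers are all hypothetical, and the blur and dilate entries are placeholders that return the pixels unchanged.

```javascript
// Small self-contained helpers so the example can run on its own.
const invert = (pixels) => pixels.map((v, i) => (i % 4 === 3 ? v : 255 - v));
const threshold = (level) => (pixels) => {
  const out = Uint8ClampedArray.from(pixels);
  const cutoff = Math.floor(level * 255);
  for (let i = 0; i < out.length; i += 4) {
    const gray = (out[i] + out[i + 1] + out[i + 2]) / 3;
    out[i] = out[i + 1] = out[i + 2] = gray >= cutoff ? 255 : 0;
  }
  return out;
};

// Applies only the enabled steps, in the order they are listed.
function runPipeline(pixels, steps) {
  return steps
    .filter((step) => step.enabled)
    .reduce((current, step) => step.run(current), pixels);
}

const steps = [
  { name: 'blur', enabled: false, run: (p) => p },   // placeholder, disabled
  { name: 'dilate', enabled: false, run: (p) => p }, // placeholder, disabled
  { name: 'invert', enabled: true, run: invert },
  { name: 'threshold', enabled: true, run: threshold(0.5) },
];

// Light text pixel followed by a dark background pixel.
const input = new Uint8ClampedArray([230, 230, 230, 255, 20, 20, 20, 255]);
console.log(Array.from(runPipeline(input, steps)));
// [0, 0, 0, 255, 255, 255, 255, 255]
```

With the steps held in an array, trying a new combination means flipping a flag rather than commenting code in and out.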

To use invertColors and thresholdFilter, we need to comment out blurARGB and dilate in preprocessImage.

function preprocessImage(canvas) {
  const ctx = canvas.getContext('2d');
  const image = ctx.getImageData(0, 0, canvas.width, canvas.height);
  // blurARGB(image.data, canvas, 1);
  // dilate(image.data, canvas);
  invertColors(image.data);
  thresholdFilter(image.data, 0.5);
  return image;
}

Test #4

Now, here's the new result.

The results were still worse than those without any preprocessing. After tweaking various techniques for this and other images, I came to the conclusion that images of different natures require different preprocessing techniques.

In short, using tesseract.js without image preprocessing produced the best results for the gift card above. All other experiments using image preprocessing produced less accurate results.

The problem

Initially, I wanted to extract the PIN from any Amazon gift card, but I couldn't, because there is no point in matching an inconsistently recognized PIN and expecting a consistent result. While it is possible to process one image until it yields an accurate PIN, the same preprocessing becomes inconsistent as soon as an image of a different nature is used.
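For completeness, here is how the matching step could look once the text has been extracted. The pattern below (groups of letters and digits separated by hyphens) is a hypothetical format chosen purely for illustration and is not an official gift card specification; extractCode is likewise an assumed helper name.

```javascript
// Pulls the first code-like token out of OCR text, or returns null.
// The default pattern is a hypothetical format used only for illustration.
function extractCode(
  text,
  pattern = /\b[A-Z0-9]{4}-[A-Z0-9]{6}-[A-Z0-9]{4,5}\b/
) {
  const match = text.toUpperCase().match(pattern);
  return match ? match[0] : null;
}

console.log(extractCode('Claim code: abcd-efgh12-jklm more text')); // ABCD-EFGH12-JKLM
console.log(extractCode('no code here')); // null
```

A match only proves that something with the right shape was found. If OCR misreads a single character, the pattern can still match, which is exactly the inconsistency described above.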

The best results produced

The images below show the best results from the experiment.

Test #5

The text on the picture is exactly the same as the extracted text. The accuracy of the conversion is 100%. I tried to reproduce the result, but I could only reproduce it using images of similar nature.

Observations and lessons learned

  • Some unpreprocessed images may show different results in different browsers. This was evident in the first test. The results in Firefox are different from those in Chrome. However, pre-processed images help to achieve consistent results in other tests.
  • Black text on a white background often leads to manageable results. The image below is an example of an exact result without any preprocessing. I was also able to get the same accuracy by preprocessing the image, but it took a lot of tweaking that wasn't necessary.

This conversion is 100% accurate.

  • Larger text is often more accurate.

  • Fonts with curved edges tend to confuse Tesseract. I got the best results when I used Arial.
  • OCR is not currently sufficient for automatic image-to-text conversion, especially when more than 80% accuracy is required. However, it can make the manual processing of the text on the image less stressful by extracting the text for manual correction.
  • OCR is not currently sufficient to deliver useful information to screen readers for accessibility. Providing inaccurate information to screen readers can easily mislead or distract users.
  • OCR is very promising because neural networks make it possible to learn and improve. Deep learning will make OCR a game changer in the near future.
  • Make decisions with confidence scores. Confidence scores can be used to make decisions that have a big impact on our application, such as whether to accept or reject a result. From my experience and experiments, I've realized that any confidence score below 90 isn't really useful for exact reproduction. If I only need to extract a few PINs from a text, I expect a confidence score between 75 and 100, and anything below 75 will be rejected.

If I'm working with text from which no parts need to be extracted, I'll definitely accept a confidence score between 90 and 100, but reject anything below that. For example, if I want to digitize a check, a historic money order, or any document that needs to be reproduced accurately, a confidence score of 90 and above will be expected. However, when exact reproduction is not important, such as obtaining a PIN from a gift card, scores between 75 and 90 are acceptable. In short, confidence scores help make decisions that affect our application.
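The rule of thumb above can be written down as a small decision function. The thresholds of 75 and 90 come from my experiments as described; the function name shouldAccept and the purpose labels are assumptions made for this sketch.

```javascript
// Minimum confidence per use case, following the rule of thumb above:
// 75 when only a PIN is extracted, 90 when the whole text must be exact.
const MIN_CONFIDENCE = { pin: 75, fullText: 90 };

function shouldAccept(confidence, purpose) {
  const minimum = MIN_CONFIDENCE[purpose];
  if (minimum === undefined) {
    throw new Error(`Unknown purpose: ${purpose}`);
  }
  return confidence >= minimum;
}

console.log(shouldAccept(64, 'pin'));      // false
console.log(shouldAccept(80, 'pin'));      // true
console.log(shouldAccept(80, 'fullText')); // false
```

A rejected result does not have to be thrown away; it can be handed to a person for manual correction instead.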

Conclusion

Optical character recognition (OCR) is a useful technique in view of the data processing limitations posed by text on images and the associated disadvantages. Despite its limitations, OCR is very promising due to its use of neural networks.

Over time, OCR will overcome most of its limitations with the help of deep learning, but until then, the methods highlighted here can be used to handle text extraction from images that at least reduce the difficulties and losses associated with manual processing -- especially from a business perspective.

Now it's your turn to try OCR to extract text from the image. Good luck!
