• ML Kit Tutorial for iOS: Recognizing Text in Images
  • By David East
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: portandbridge
  • Adjuster: Lobster-King, iWeslie

In this ML Kit tutorial, you will learn how to use Google ML Kit for text detection and recognition.

A few years ago, machine learning developers fell into two camps: senior developers in one, and everyone else in the other. The lower levels of machine learning can be difficult because they involve a lot of math and use obscure terms like logistic regression, sparsity and neural nets. It doesn't have to be that hard, though. You can be a machine learning developer too! At its core, machine learning is not hard: you solve problems by teaching a software model to find patterns, rather than hard-coding every situation you can think of into the model. It can be daunting at first, though, and that's where you can lean on the tools you already have.

Machine learning and tooling

As with iOS development, machine learning is all about tooling. You wouldn't build your own UITableView, or at least you shouldn't; you would use a framework, like UIKit.

The same goes for machine learning, which has a thriving ecosystem of tools. For example, TensorFlow simplifies training and running models, and TensorFlow Lite brings model support to iOS and Android devices.

All of these tools require some experience with machine learning. But what if you're not a machine learning expert and simply want to solve a specific problem? That's where ML Kit comes in.

ML Kit

ML Kit is a mobile SDK that brings Google's powerful machine learning technology to your App. Its API has two main parts: ready-made APIs for common use cases and support for custom models. Neither is difficult to use, whatever your level of experience.

The ready-made APIs currently support:

  • Text recognition
  • Face detection
  • Landmark recognition
  • Barcode scanning
  • Image labeling

Each of these use cases comes with a pre-trained model wrapped in an easy-to-use API. Now it's time to get your hands dirty!

Preparatory work

In this tutorial, you will write an App called Extractor. Have you ever taken a picture of a sign or a poster just to get the text down? Wouldn't it be great if an App could scrape the text out of the image and turn it into real text? For example, you could photograph an envelope and extract the address from it. That is exactly the App you are going to build in this project. Get ready!

The first thing you need to do is download the project materials for this tutorial. Click the “Download Materials” button at the top or bottom of the tutorial to download them.

This project uses CocoaPods to manage dependencies.

Configure the ML Kit environment

Each ML Kit API has a different set of CocoaPods dependencies. This is useful because you only need to package the dependencies your App needs. For example, if you’re not going to recognize landmarks, your App doesn’t need to have that model. In Extractor, you use the text recognition API.

To add the text recognition API to your App, you would add the following lines to your Podfile. You don't need to do this for the starter project, though; they're already in its Podfile, which you can open and check for yourself.

pod 'Firebase/Core', '5.5.0'
pod 'Firebase/MLVision', '5.5.0'
pod 'Firebase/MLVisionTextModel', '5.5.0'

To install CocoaPods for your project, open your terminal, go to the project folder, and run the following command:

pod install

After installing the pods, open Extractor.xcworkspace in Xcode.

Note: You may find a project file named Extractor.xcodeproj and a workspace file named Extractor.xcworkspace in the project folder. You need to open the latter in Xcode, because the former does not include the CocoaPods dependency libraries needed to compile.

If you’re not familiar with CocoaPods, our CocoaPods tutorial will give you a primer.

This project contains the following important files:

  1. ViewController.swift: the only view controller in this project.
  2. +UIImage.swift: a UIImage extension used to correct the orientation of images.

Open a Firebase account

To set up a Firebase account, follow the instructions in the introductory Firebase tutorial on how to create an account. Although the Firebase products involved are different, the process for creating an account and setting it up is exactly the same.

The guide has you:

  1. Register an account.
  2. Create the project.
  3. Add an iOS app to your project.
  4. Drag GoogleService-Info.plist into your project.
  5. Initialize Firebase in the AppDelegate.

This process isn’t hard to follow, but if you do get stuck, the above guide can help you out.
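
For reference, step 5 usually boils down to a single call in the app delegate. Here is a minimal sketch of what that typically looks like; the exact signature and file in the starter project may differ slightly depending on the Swift and Firebase versions:

import UIKit
import Firebase

@UIApplicationMain
class AppDelegate: UIResponder, UIApplicationDelegate {
  var window: UIWindow?

  func application(_ application: UIApplication,
                   didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
    // Reads GoogleService-Info.plist and sets up the default Firebase app.
    FirebaseApp.configure()
    return true
  }
}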

Note: You need to set up Firebase and create your own GoogleService-Info.plist for both the final and the starter project.

Compile the App and run it, and you’ll see what it looks like:

It doesn't do much yet, other than letting you share the hard-coded text via the action button in the top right. You will use ML Kit to turn it into a genuinely useful App.

Detecting basic text

Get ready for your first text detection! To begin, you can show the user how the App works.

A good way to show this is to scan a sample image when the App is first launched. The resources folder includes an image called scanned-text.png, which is the default image displayed by the view controller's UIImageView; you will use it as the example image.

To start, though, you’ll need a text detector that can detect the text inside the image.

Create text detector

Create a new file named ScaledElementProcessor.swift and fill it with the following code:

import Firebase

class ScaledElementProcessor {

}

OK, all done! … Just kidding. You need to add a text detector property to this class:

let vision = Vision.vision()
var textRecognizer: VisionTextRecognizer!
  
init() {
  textRecognizer = vision.onDeviceTextRecognizer()
}

This textRecognizer is the main object you use to detect text in images. You will use it to recognize the text in the image displayed by the UIImageView. Add the following detection method to the class:

func process(in imageView: UIImageView, 
  callback: @escaping (_ text: String) -> Void) {
  // 1
  guard let image = imageView.image else { return }
  // 2
  let visionImage = VisionImage(image: image)
  // 3
  textRecognizer.process(visionImage) { result, error in
    // 4
    guard 
      error == nil, 
      let result = result, 
      !result.text.isEmpty 
      else {
        callback("")
        return
    }
    // 5
    callback(result.text)
  }
}

Let’s take a moment to understand the code above:

  1. Check whether imageView actually contains an image; if not, simply return. Ideally, you would also display or log a sensible error message.
  2. ML Kit uses a special VisionImage type. It is useful because it can carry metadata that ML Kit uses when processing the image, such as the image's orientation.
  3. textRecognizer has a process method that takes the VisionImage and returns an array of text results as a parameter to a closure.
  4. The result could be nil; in that case, you should return an empty string to the callback.
  5. Finally, the callback is triggered to hand back the recognized text.

Use a text recognizer

Open ViewController.swift and add an instance of ScaledElementProcessor as a property, right after the outlets at the top of the class:

let processor = ScaledElementProcessor()

Then add the following code at the bottom of viewDidLoad() to display the detected text in the UITextView:

processor.process(in: imageView) { text in
  self.scannedText = text
}

This little piece of code calls process(in:), passes the main imageView, and then assigns the recognized text to the scannedText property in a callback.

Run the app and you should see the following text at the bottom of the image:

Your
SCanned
text
will
appear
here 

You may have to drag the text view to see the bottom few lines.

Notice that the S and C in Scanned are capitalized. Sometimes, when certain fonts are recognized, the capitalization comes out wrong. That's why the text is displayed in a UITextView: if errors are detected, the user can edit the text manually to correct them.

Understanding these classes

Note: You don't need to copy the code in this section; it is only here to explain the concepts. You'll add code to the App in the next section.

VisionText

Did you notice that the textRecognizer.process(in:) callback in ScaledElementProcessor returns an object in the result argument instead of plain text? That object is an instance of VisionText, a class that contains a lot of useful information, such as the recognized text. But you want to do more than just get the words. Wouldn't it be cool to draw a box around each recognized text element?

ML Kit returns its results in a tree-like structure. You need to reach the leaf elements to get the position and size of the frame containing the recognized text. If the tree analogy doesn't make sense yet, don't worry; the following sections will make it clear.

However, if you’re interested in learning more about tree data structures, feel free to check out this tutorial, Swift Tree Data Structures.

VisionTextBlock

To process the recognized text, you start with the VisionText object: the tree, so to speak, which contains multiple text blocks (like branches on a tree). Each branch is a VisionTextBlock object in the blocks array, and you iterate over each one as follows:

for block in result.blocks {

}

VisionTextElement

A VisionTextBlock is simply an object containing a collection of text lines, and the words on those lines (like leaves on a branch) are each represented by a VisionTextElement instance. This nesting of objects shows you the hierarchy of the recognized text.

When looping through each object, it looks something like this:

for block in result.blocks {
  for line in block.lines {
    for element in line.elements {

    }
  }
}

Each object in this hierarchy contains a frame where the text is located. However, each object has a different level of granularity. A block may contain several rows. Each line may contain more than one element. Each element may contain multiple symbols.

For the purposes of this tutorial, you will be using the element level of granularity. The element usually corresponds to a word. That way, you can draw on top of each word, showing the user where each word is in the image.

The final loop iterates over the elements of each line in the text block. These elements contain frame, which is a simple CGRect. Using this frame, you can draw an outer frame around the text of the image.
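
To make the granularity concrete, here is a small sketch (using the same result object from the earlier callback) that walks the whole hierarchy and prints the text and frame at each level:

// Sketch only: assumes `result` is the VisionText returned by textRecognizer.process.
for block in result.blocks {
  print("Block: \(block.text) at \(block.frame)")
  for line in block.lines {
    print("  Line: \(line.text) at \(line.frame)")
    for element in line.elements {
      // An element usually corresponds to a single word.
      print("    Element: \(element.text) at \(element.frame)")
    }
  }
}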

Highlight the frame of the text

Frame detection

To draw on the image, you need to create a CAShapeLayer from the frame of each text element. Open ScaledElementProcessor.swift and insert the following struct at the top of the file:

struct ScaledElement {
  let frame: CGRect
  let shapeLayer: CALayer
}

This struct is very handy: it makes it much easier to pass frames and CAShapeLayers to the controller together. Now you need a helper method that creates a CAShapeLayer from an element's frame.

Add the following code at the bottom of the ScaledElementProcessor:

private func createShapeLayer(frame: CGRect) -> CAShapeLayer {
  // 1
  let bpath = UIBezierPath(rect: frame)
  let shapeLayer = CAShapeLayer()
  shapeLayer.path = bpath.cgPath
  // 2
  shapeLayer.strokeColor = Constants.lineColor
  shapeLayer.fillColor = Constants.fillColor
  shapeLayer.lineWidth = Constants.lineWidth
  return shapeLayer
}

// MARK: - private
  
// 3
private enum Constants {
  static let lineWidth: CGFloat = 3.0
  static let lineColor = UIColor.yellow.cgColor
  static let fillColor = UIColor.clear.cgColor
}

This code does the following:

  1. CAShapeLayer has no initializer that takes a CGRect. Instead, you create a UIBezierPath from the CGRect and set the layer's path to that UIBezierPath.
  2. The visual properties, such as the color and line width, come from the Constants enum.
  3. This enum keeps the colors and widths consistent.

Now replace process(in:callback:) with the following:

// 1
func process(
  in imageView: UIImageView, 
  callback: @escaping (_ text: String, _ scaledElements: [ScaledElement]) -> Void
  ) {
  guard let image = imageView.image else { return }
  let visionImage = VisionImage(image: image)
    
  textRecognizer.process(visionImage) { result, error in
    guard 
      error == nil, 
      let result = result, 
      !result.text.isEmpty 
      else {
        callback("", [])
        return
    }
  
    // 2
    var scaledElements: [ScaledElement] = []
    // 3
    for block in result.blocks {
      for line in block.lines {
        for element in line.elements {
          // 4
          let shapeLayer = self.createShapeLayer(frame: element.frame)
          let scaledElement = 
            ScaledElement(frame: element.frame, shapeLayer: shapeLayer)

          // 5
          scaledElements.append(scaledElement)
        }
      }
    }
      
    callback(result.text, scaledElements)
  }
}

The code has the following changes:

  1. The callback now receives not only the recognized text but also an array of ScaledElement instances.
  2. scaledElements collects the frames and shape layers.
  3. Exactly as described above, the code uses for loops to get the frame of each element.
  4. The innermost for loop creates a shape layer from the element's frame, then uses that layer to create a new ScaledElement instance.
  5. The newly created instance is appended to scaledElements.

Drawing

All that code did was get your pen and paper ready; now it's time to start drawing. Open ViewController.swift and replace the call to process(in:) in viewDidLoad() with the following code:

processor.process(in: imageView) { text, elements in
  self.scannedText = text
  elements.forEach() { feature in
    self.frameSublayer.addSublayer(feature.shapeLayer)
  }
}

The ViewController has a frameSublayer property attached to the imageView. Here, you add each element's shape layer to that sublayer, and iOS automatically draws the shapes on the image.
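
The starter project already wires this up for you, but for context, attaching such a sublayer could look roughly like the following sketch. Only the name frameSublayer comes from the project; the rest is an assumption about one possible setup:

// Hypothetical sketch; the starter project's actual implementation may differ.
lazy var frameSublayer: CALayer = {
  let layer = CALayer()
  layer.frame = imageView.bounds
  return layer
}()

// Somewhere during setup (e.g. in viewDidLoad), the sublayer is attached:
// imageView.layer.addSublayer(frameSublayer)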

Compile your App and run it. Enjoy your work.

Uh… what is this? That's not Monet; it's more like Picasso. What went wrong? Well, it's probably time to talk about scaling.

Understanding image scaling

The default scanned-text.png image is 654×999 (width × height). However, the UIImageView's Content Mode is Aspect Fit, which scales the image down to 375×369 in the view. ML Kit receives the image at its actual size and returns element frames based on that actual size, and those frames are then drawn onto the scaled-down image. The result is the mess you saw.

Note the difference between the scaled size and the actual size in the figure above. As you can see, the frames are drawn at the image's actual size, while the image is displayed at the scaled size. To get the frame positions right, you need to calculate the scale of the image relative to the view.

The math is pretty simple (👀… probably):

  1. Calculate the resolutions of the view and the image.
  2. Compare the two resolutions to determine the scale.
  3. Multiply the height, width, origin x, and origin y by the scale.
  4. Create a new CGRect from the calculated values.

It doesn’t matter if you get confused! You’ll understand when you see the code.
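
As a rough worked example, plugging in the sizes mentioned earlier (a 654×999 image displayed in an approximately 375×369 view) gives numbers like these:

let imageSize = CGSize(width: 654, height: 999)
let viewSize = CGSize(width: 375, height: 369)

let resolutionView = viewSize.width / viewSize.height     // ≈ 1.016
let resolutionImage = imageSize.width / imageSize.height  // ≈ 0.655

// The view's resolution is the larger one, so scale by height.
let scale = viewSize.height / imageSize.height            // ≈ 0.369

// The displayed image ends up roughly 242 × 369 points, centered horizontally,
// so every element frame is multiplied by ~0.369 and shifted by ~66 points in x.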

Calculate the scale

Open ScaledElementProcessor.swift and add the following method:

// 1
private func createScaledFrame(
  featureFrame: CGRect, 
  imageSize: CGSize, viewFrame: CGRect) 
  -> CGRect {
  let viewSize = viewFrame.size
    
  // 2
  let resolutionView = viewSize.width / viewSize.height
  let resolutionImage = imageSize.width / imageSize.height
    
  // 3
  var scale: CGFloat
  if resolutionView > resolutionImage {
    scale = viewSize.height / imageSize.height
  } else {
    scale = viewSize.width / imageSize.width
  }
    
  // 4
  let featureWidthScaled = featureFrame.size.width * scale
  let featureHeightScaled = featureFrame.size.height * scale
    
  // 5
  let imageWidthScaled = imageSize.width * scale
  let imageHeightScaled = imageSize.height * scale
  let imagePointXScaled = (viewSize.width - imageWidthScaled) / 2
  let imagePointYScaled = (viewSize.height - imageHeightScaled) / 2
    
  // 6
  let featurePointXScaled = imagePointXScaled + featureFrame.origin.x * scale
  let featurePointYScaled = imagePointYScaled + featureFrame.origin.y * scale
    
  // 7
  return CGRect(x: featurePointXScaled,
                y: featurePointYScaled,
                width: featureWidthScaled,
                height: featureHeightScaled)
  }

Here is what the code does:

  1. The method takes the element's CGRect (featureFrame), the original size of the image, and the frame of the UIImageView.
  2. To calculate the resolutions of the view and the image, divide each width by its height.
  3. The scale comes from the larger of the two resolutions: if the view's resolution is larger, scale by height; otherwise, scale by width.
  4. The method then calculates the scaled width and height by multiplying the frame's width and height by the scale.
  5. The origin of the frame must also be scaled; otherwise, even if the box is the right size, it will sit off-center in the wrong place.
  6. The new origin is calculated by multiplying the unscaled origin by the scale and adding the scaled x and y offsets of the image.
  7. Return a CGRect built from the calculated, scaled origin and size.

With a properly scaled CGRect, you can raise your drawing skills to the level of sgraffito. Yes, I just taught you a new word; thank me next time you play Scrabble.

In ScaledElementProcessor.swift, modify the innermost for loop of process(in:callback:) to use the following code:

for element in line.elements {
  let frame = self.createScaledFrame(
    featureFrame: element.frame,
    imageSize: image.size, 
    viewFrame: imageView.frame)
  
  let shapeLayer = self.createShapeLayer(frame: frame)
  let scaledElement = ScaledElement(frame: frame, shapeLayer: shapeLayer)
  scaledElements.append(scaledElement)
}

The newly added lines create a scaled frame, which the code uses to create a correctly positioned shape layer.

Compile your App and run it. The frames should now be in the right places. You are such a master framer.

We’ve had enough fun. It’s time to get out and get some real stuff.

Take pictures with a camera

The project already contains camera and photo library picker code, set up in an extension at the bottom of ViewController.swift. If you try it now, you'll see that the frames are all out of place. That's because the App is still using the frames from the preloaded image. You need to remove the old frames and draw new ones whenever you take or select a photo.

Add the following method to the ViewController:

private func removeFrames() {
  guard let sublayers = frameSublayer.sublayers else { return }
  for sublayer in sublayers {
    sublayer.removeFromSuperlayer()
  }
}

This method uses a for loop to remove every sublayer from frameSublayer, giving you a clean canvas for the next photo.

To keep the detection code tidy, add the following new method to ViewController:

// 1
private func drawFeatures(
  in imageView: UIImageView, 
  completion: (() -> Void)? = nil
  ) {
  // 2
  removeFrames()
  processor.process(in: imageView) { text, elements in
    elements.forEach() { element in
      self.frameSublayer.addSublayer(element.shapeLayer)
    }
    self.scannedText = text
    // 3
    completion?()
  }
}

The code has the following changes:

  1. The method takes a UIImageView and a callback so you know when it's done.
  2. The frames are removed automatically before the new image is processed.
  3. The completion callback is triggered once everything is done.

Now replace the call to processor.process(in:callback:) in viewDidLoad() with the following:

drawFeatures(in: imageView)

Scroll down to the class extension and find imagePickerController(_:didFinishPickingMediaWithInfo:). At the bottom of the if block, right after imageView.image = pickedImage, add this line of code:

drawFeatures(in: imageView)

When taking or selecting a new photo, this code ensures that the previously drawn frame is removed and replaced with the frame of the new photo.

Compile your App and run it. If you’re running on a real device (not an emulator), take a photo with text. Strange results may occur:

What’s going on here?

This is the image orientation problem, so we’ll talk about image orientation in a second.

Handling image orientation

The App is locked to portrait mode. Redrawing the frames when the device rotates is cumbersome, so for now it's easier to put some restrictions on the user.

With this restriction, users must take portrait photos. Behind the scenes, the image picker rotates portrait photos by 90 degrees. You don't see the rotation, because the UIImageView rotates the image back for display; the text detector, however, gets the rotated UIImage.

This leads to confusing results. ML Kit lets you set a photo's orientation in a VisionImageMetadata object. Set the orientation correctly and the App returns the correct text, but the frames are still drawn against the rotated image.
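
For completeness, here is a sketch of that metadata-based alternative, which this tutorial does not use. The orientation value shown is just an illustration; in practice you would map it from the UIImage's imageOrientation:

// Alternative approach (not used here): describe the orientation to ML Kit
// instead of rotating the image itself.
let metadata = VisionImageMetadata()
metadata.orientation = .rightTop  // example value; derive it from image.imageOrientation

let visionImage = VisionImage(image: image)
visionImage.metadata = metadata
textRecognizer.process(visionImage) { result, error in
  // The recognized text comes back correct, but the element frames still
  // refer to the rotated image, so the drawn boxes would still be misplaced.
}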

So instead, you need to manipulate the photo's orientation so that it always faces up. The project contains an extension called +UIImage.swift, which adds a method to UIImage that changes the orientation of any photo to the up position. Once the image orientation is corrected, the whole App runs smoothly.
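
You don't need to write that extension yourself, since it ships with the project, but for context a typical implementation of such a method looks roughly like this (a sketch, not necessarily the exact code in +UIImage.swift):

import UIKit

extension UIImage {
  // Redraws the image into a new context so that the returned image has
  // .up orientation and its pixel data matches what is displayed.
  func fixOrientation() -> UIImage {
    guard imageOrientation != .up else { return self }
    UIGraphicsBeginImageContextWithOptions(size, false, scale)
    defer { UIGraphicsEndImageContext() }
    draw(in: CGRect(origin: .zero, size: size))
    return UIGraphicsGetImageFromCurrentImageContext() ?? self
  }
}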

Open ViewController.swift and, in imagePickerController(_:didFinishPickingMediaWithInfo:), replace imageView.image = pickedImage with the following code:

// 1
let fixedImage = pickedImage.fixOrientation()
// 2
imageView.image = fixedImage

There are two changes:

  1. The newly picked image, pickedImage, is rotated to the up orientation.
  2. The rotated image is then assigned to imageView.
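
Putting both changes together, the picker delegate method ends up looking roughly like this (a sketch; the info-dictionary handling and dismissal come from the starter project and may differ slightly):

func imagePickerController(_ picker: UIImagePickerController,
                           didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
  if let pickedImage = info[.originalImage] as? UIImage {
    // 1. Normalize the orientation so ML Kit and the drawing code agree.
    let fixedImage = pickedImage.fixOrientation()
    // 2. Display the upright image.
    imageView.image = fixedImage
    // 3. Remove any old frames and draw new ones for this photo.
    drawFeatures(in: imageView)
  }
  dismiss(animated: true, completion: nil)
}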

Compile your App and run it. Take another picture. Everything should be in the right place this time.

Share the text

You don't have to do anything for this last step. Isn't that awesome? The App already has a UIActivityViewController integrated. Take a look at shareDidTouch():

@IBAction func shareDidTouch(_ sender: UIBarButtonItem) {
  let vc = UIActivityViewController(
    activityItems: [textView.text, imageView.image!],
    applicationActivities: [])
  present(vc, animated: true, completion: nil)
}

There are only two steps here, and they're easy: create a UIActivityViewController containing the scanned text and the image, then call present() and let the user take care of the rest.

Where to go from here?

Congratulations! You are now a machine learning developer! Click the Download Materials button at the top or bottom of this article to get the completed Extractor. Note that after downloading the final project, you still need to add your own GoogleService-Info.plist, as mentioned earlier, and change the Bundle ID to an appropriate value matching your settings in the Firebase console.

In this tutorial, you:

  • Built a camera App with text detection and learned the basics of ML Kit along the way.
  • Learned about ML Kit's text recognition API, image scaling, and image orientation.

And you don’t need a PhD in machine learning to do it :]

If you want to learn more about Firebase and ML Kit, check out the official documentation.

If you have any comments or questions about this Firebase tutorial, Firebase, ML Kit, or sample App, feel free to join the discussion below!

If you find any mistakes in the translation or anything else that could be improved, you are welcome to submit a PR to the Nuggets Translation Project to revise the translation and earn the corresponding reward points. The permanent link at the beginning of this article is the MarkDown link to this article on GitHub.


The Nuggets Translation Project is a community that translates high-quality technical articles from around the Internet, sourced from English articles shared on Nuggets. The content covers Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence and other fields. For more high-quality translations, please follow the Nuggets Translation Project, its official Weibo, and its Zhihu column.