The 5 best free data sets for optical character recognition

Optical character recognition (OCR) is the process by which an electronic device, such as a scanner or camera, examines printed characters on paper, determines their shapes by detecting patterns of dark and light, and then applies character recognition to translate those shapes into computer text.

The role of OCR is to detect the text regions in an image and identify the text they contain. In many cases, OCR can replace the keyboard for high-speed text entry.
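As a rough illustration of the "dark and light patterns" idea, the sketch below binarizes a tiny grayscale image with a fixed threshold and then takes the bounding box of the dark pixels as a stand-in for the detected text area. The threshold of 128, the toy image, and the function names `binarize` and `text_bounding_box` are assumptions made for this example; production OCR systems use far more robust detection methods.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def binarize(image: Image, threshold: int = 128) -> List[List[bool]]:
    """Mark each pixel True if it is darker than the threshold (treated as ink)."""
    return [[pixel < threshold for pixel in row] for row in image]


def text_bounding_box(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right), inclusive, of the dark region, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # no dark pixels, so no text area
    cols = [c for row in mask for c, dark in enumerate(row) if dark]
    return rows[0], min(cols), rows[-1], max(cols)


# A 5x6 toy "scan": a small dark stroke on a light background.
scan = [
    [250, 250, 250, 250, 250, 250],
    [250, 250,  30, 240, 250, 250],
    [250, 250,  20,  40, 250, 250],
    [250, 250,  25, 250, 250, 250],
    [250, 250, 250, 250, 250, 250],
]

mask = binarize(scan)
box = text_bounding_box(mask)
print(box)  # (1, 2, 3, 3)
```

A fixed threshold only works when the contrast between ink and paper is high, which is one reason the low-contrast and complex-background cases are hard in practice.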

OCR technology has a wide range of applications

OCR technology has been applied in a wide range of scenarios, and the following are a few mature applications:

· Remote identity authentication: combining OCR with face recognition enables automatic entry of user ID information and completes user identity verification. It is used in finance, insurance, social security, O2O (online-to-offline) and other industries to control business risk effectively.

· Content review and supervision: text in pictures and videos is recognized automatically so that pornographic, violent, politically sensitive, malicious advertising and other non-compliant content is detected promptly, which avoids business risk and greatly reduces the cost of manual review.

· Digitization of paper documents and bills: OCR can automatically recognize and enter paper documents, bills and forms, reducing manual entry costs and improving entry efficiency.


OCR in natural environments has to cope with many problems, such as complex backgrounds, interference and overlap from seals (stamps), low image contrast, smudging and wear, a wide variety of fonts, and variations in printing ink.

For technologies based on deep learning, the quantity of training data largely determines performance, so improving the quantity and quality of training data has become the fundamental way to address the problems above.

Although such data is needed to improve the accuracy of OCR recognition and transcription, not many platforms have developed annotated OCR and transcription datasets. Below are five commonly used OCR datasets available online.

· NIST Database

The National Institute of Standards and Technology (NIST) published handwriting samples from 3,600 writers, comprising roughly 800,000 character images.

Website: https://catalog.data.gov/dataset/nist-handprinted-forms-and-characters-nist-special-database-19

· MNIST Database

A subset of the original NIST data with a training set of 60,000 handwritten digit examples and a test set of 10,000.

Website: https://yann.lecun.com/exdb/mnist/
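The MNIST image files are distributed in the IDX binary format: a 16-byte big-endian header (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The following is a minimal sketch of a parser for that layout. The function name `parse_idx_images` is hypothetical, and the demo runs on a small synthetic byte string rather than the real download.

```python
import struct
from typing import List

IMAGE_MAGIC = 2051  # 0x00000803: unsigned-byte data with 3 dimensions


def parse_idx_images(data: bytes) -> List[List[List[int]]]:
    """Parse IDX3 image bytes into a list of images (each a list of pixel rows)."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number {magic}")
    expected = 16 + count * rows * cols
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")

    images = []
    offset = 16  # pixel data starts right after the header
    for _ in range(count):
        image = [
            list(data[offset + r * cols : offset + (r + 1) * cols])
            for r in range(rows)
        ]
        images.append(image)
        offset += rows * cols
    return images


# Synthetic stand-in for a real file: two 2x3 images with pixel values 0..11.
fake = struct.pack(">IIII", IMAGE_MAGIC, 2, 2, 3) + bytes(range(12))
images = parse_idx_images(fake)
print(len(images), images[1])  # 2 [[6, 7, 8], [9, 10, 11]]
```

The published files are gzip-compressed, so for a real file such as `train-images-idx3-ubyte.gz` the bytes would first be read with `gzip.open(path, "rb").read()` before being passed to the parser.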

· Arabic Printed Text (APTI)

A database of printed Arabic text images built from a lexicon of 113,284 words rendered in 10 Arabic fonts.

Website: https://diuf.unifr.ch/main/diva/APTI/

· Stanford OCR

A handwritten word dataset collected by the MIT Spoken Language Systems Group and published by Stanford.

Website: https://ai.stanford.edu/~btaskar/ocr/

· Chars74K Dataset

About 74,000 images of characters, including digits, in English and Kannada.

Website: https://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/

With more than ten years of data processing experience, Datastore has also built up its own data strengths in syntax annotation and event annotation, and has developed its own OCR data.

Contact customer service to request sample data.
