Abstract

SVHN is a dataset of house numbers taken from Google Street View. In 2013, Google published the paper “Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks”, which offers a solution to this recognition problem and claims the same approach can also crack captchas.

This blog post briefly summarizes the paper and then uses Keras to implement the model and train it on the SVHN dataset.

The method in this paper serves primarily as a baseline for training on the SVHN dataset. The authors report that it is accurate more than 96 percent of the time.

Task requirements

Let's first take a look at what the dataset looks like: each image is a house number captured from Street View.

Previous approaches

Traditional approaches to this problem typically separate it into three steps: localization, segmentation, and recognition.

The author’s approach

The paper proposes a unified approach that integrates these three steps via a deep convolutional neural network that operates directly on the image pixels. (In other words, the author handles all three steps with a single deep convolutional network.)

Author’s Contribution

  • (a) A unified model to localize, segment, and recognize multi-digit numbers from street level photographs
  • (b) A new kind of output layer, providing a conditional probabilistic model of sequences
  • (c) Empirical results that show this model performing best with a deep architecture
  • (d) Reaching human level performance at specific operating thresholds.

Problem description

Numbers in pictures: the number in each picture is a sequence of digits; for example, the label of the first picture above is “379”.

Sequence length: defined as n; most house numbers contain fewer than 5 digits, so the author assumes the sequence length is at most 5.

Implementation method

The author's method is to train a probabilistic model of the image's label. Here are the author's definitions:

  • S: Output sequence, that is, label of training data.
  • X: input image.

The goal is to learn the model by maximizing log P(S | X).

X is the input image. S consists of the numeric sequence in the image together with the length of that sequence. For example, “379” above is the numeric sequence in the image, and len(“379”) is 3, so S can be thought of as “3” plus “379”, i.e. “3379”.

Here P(S | X) can be defined as the probability of the sequence length multiplied by the probability of each character value (the character values are treated as independent given the image):

P(S | X) = P(L = n | X) × P(S1 | X) × P(S2 | X) × … × P(Sn | X)

The variables above are all discrete. L has seven possible values: 0, 1, 2, 3, 4, 5, and “more than 5”; each Si has 10 possible values, one for each digit 0–9.

Training the model means maximizing log P(S | X) on the training set. The author uses a Softmax layer for each variable: one for L and one for each Si.
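To make this concrete, here is a small NumPy sketch (my own illustration, not code from the paper or the repository) of how log P(S | X) is assembled from the individual Softmax outputs for the “379” example:

import numpy as np

# Hypothetical Softmax outputs of the model for one image:
# p_length over the 7 length classes (0, 1, ..., 5, "more than 5"),
# p_digit[i] over the 10 digit classes for position i.
p_length = np.array([0.01, 0.02, 0.05, 0.85, 0.04, 0.02, 0.01])
p_digit = np.array([
    [0.01, 0.01, 0.01, 0.90, 0.01, 0.01, 0.01, 0.02, 0.01, 0.01],  # mostly "3"
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.90, 0.02, 0.01],  # mostly "7"
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.91],  # mostly "9"
])

# For the label S = "379": length n = 3 and digits (3, 7, 9).
n, digits = 3, [3, 7, 9]

# log P(S | X) = log P(L = n | X) + sum_i log P(Si | X)
log_p = np.log(p_length[n]) + sum(np.log(p_digit[i, d]) for i, d in enumerate(digits))
print(log_p)  # training maximizes this quantity over the training set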


Model structure

Next, let's take a look at the model the author published in the paper.

  • The input image X is a 128x128x3 image.
  • Feature extraction is then carried out through a series of CNN layers, producing a vector of 4096 features.
  • Based on these 4096 features, each of the variables L, S1, S2, S3, S4, S5 passes through its own Softmax layer.
  • Each Softmax layer outputs a probability distribution over the possible values of its variable.

Keras implementation

See the code: github.com/nladuo/ml-s…

Environment dependencies

  • python 3.x
  • TensorFlow 1.11
  • Keras 2.x
  • Pillow
  • h5py

Downloading the data

First, go to ufldl.stanford.edu/housenumber… and download the Format 1 data.

wget http://ufldl.stanford.edu/housenumbers/test.tar.gz
wget http://ufldl.stanford.edu/housenumbers/train.tar.gz

After decompressing the data, two new folders appear: train and test.

tar zvxf test.tar.gz
tar zvxf train.tar.gz

Building the dataset

The label file in this dataset (digitStruct.mat) can be read using h5py.
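digitStruct.mat is saved in MATLAB v7.3 (HDF5) format, which is why h5py can open it directly. Below is a minimal sketch of the access pattern, assuming the extracted train folder; the full dataset-building code is in the linked repository.

import h5py

with h5py.File("train/digitStruct.mat", "r") as f:
    names = f["digitStruct"]["name"]
    bboxes = f["digitStruct"]["bbox"]

    def image_name(i):
        # Each entry is an HDF5 reference to an array of character codes.
        return "".join(chr(c[0]) for c in f[names[i][0]])

    def image_labels(i):
        # 'label' is a scalar dataset for a single-digit number, or an array
        # of references when the house number has several digits.
        label = f[bboxes[i][0]]["label"]
        if label.shape[0] == 1:
            return [int(label[0][0])]
        return [int(f[label[j][0]][0][0]) for j in range(label.shape[0])]

    # Note: in digitStruct the digit 0 is stored as label 10.
    print(image_name(0), image_labels(0))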

The network model

The network model is divided into a convolutional part and a fully connected part; the code is as follows.

Convolutional layers

These are the three convolutional blocks. To keep the inputs of each layer on a similar distribution, Batch Normalization layers are used to normalize the activations inside the ConvNet.
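As a rough sketch of what such a convolutional part can look like in Keras (the filter counts and layer arrangement here are my assumptions, not necessarily the exact values used in the linked repository):

from keras.layers import Input, Conv2D, MaxPooling2D, BatchNormalization, Activation

# Input: a 128x128 RGB image, as described in the model structure above.
inputs = Input(shape=(128, 128, 3))

x = inputs
# Three convolutional blocks; each one normalizes its activations with BatchNormalization.
for filters in (32, 64, 128):
    x = Conv2D(filters, (3, 3), padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)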

Fully connected layers

Finally, after a Flatten layer, the convolutional features enter the fully connected layers, which output to six Softmax layers representing: the character length, the first character, the second character, the third character, the fourth character, and the fifth character.

Note: each character has 11 classes, from 0 to 10, where class 10 means the character does not exist.
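Continuing the convolutional sketch above, the fully connected part and the six Softmax heads could be written as follows (again an illustrative sketch rather than the exact code from the repository):

from keras.layers import Flatten, Dense
from keras.models import Model

# Flatten the convolutional features and pass them through a fully connected layer.
x = Flatten()(x)
x = Dense(4096, activation="relu")(x)

# Six Softmax heads: the sequence length (7 classes: 0..5 and "more than 5"),
# and five digit positions (11 classes each: digits 0-9 plus "not present").
length_out = Dense(7, activation="softmax", name="length")(x)
digit_outs = [Dense(11, activation="softmax", name="digit%d" % i)(x) for i in range(1, 6)]

model = Model(inputs=inputs, outputs=[length_out] + digit_outs)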

Training and testing

Then call the fit method to train. There are 7 loss values and 6 accuracy values in total: one loss for each Softmax output layer plus the total loss, and one accuracy for each of the 6 Softmax layers.
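A minimal sketch of compiling and fitting such a multi-output model (X_train, y_length and y_digits are placeholder names for whatever arrays your dataset-building step produces):

from keras.utils import to_categorical

# One categorical cross-entropy loss per output head; Keras also reports the total,
# which is where the 7 loss values and 6 accuracy values come from.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.fit(X_train,
          [to_categorical(y_length, 7)] + [to_categorical(y, 11) for y in y_digits],
          batch_size=64, epochs=10, validation_split=0.1)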

Finally, let's evaluate the model. All six accuracy values reach more than 85%. To improve further, you could use architectures such as VGG16; reports online suggest this can reach around 97%, but training would likely be much slower.
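For completeness, evaluation on the test set could look like the following (again with placeholder names for the test arrays); Keras then reports the total loss, the six per-output losses, and the six per-output accuracies.

# Returns [total_loss, 6 per-output losses, 6 per-output accuracies].
results = model.evaluate(
    X_test,
    [to_categorical(y_test_length, 7)] + [to_categorical(y, 11) for y in y_test_digits],
    batch_size=64)
print(dict(zip(model.metrics_names, results)))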

Extended example: Weibo captcha

In addition, the author also applies this method to Sina Weibo login captcha recognition; see captcha-break/weibo.com for the implementation code.

However, I do not know whether Sina Weibo's captcha has changed since then. Back when I was using a captcha-solving service, its captcha length was as follows:

References

  • Github.com/potterhsu/S…
  • Github.com/RyannnG/Cap…