In this section, we will use TensorFlow to build a deep learning model for CAPTCHA recognition. The CAPTCHAs recognized here are graphic CAPTCHAs. First we will train a model with labeled data, and then use the model to recognize new CAPTCHAs.

Generating CAPTCHAs

Here we use the captcha library to generate CAPTCHA images. It is not installed by default, so we need to install it first, along with the Pillow library, using pip3:

```
pip3 install captcha pillow
```

Once installed, we can generate a simple graphic CAPTCHA with the following code:
```python
from captcha.image import ImageCaptcha
from PIL import Image

text = '1234'
image = ImageCaptcha()
captcha = image.generate(text)
captcha_image = Image.open(captcha)
captcha_image.show()
```

After running this, an image pops up showing a CAPTCHA with the text 1234.

You can see that the text in the image is exactly the text we defined, so for each generated image we also have its ground-truth text. We can therefore use this to generate batches of training and test data.

Preprocessing

Before training, we must preprocess the data. We first define the text content of the CAPTCHA to be generated, which serves as the label, and then use it to generate the CAPTCHA image, which gives us the input data x. Next we define the vocabulary. If we used CAPTCHAs containing uppercase letters, lowercase letters, and digits, with four characters per CAPTCHA, the total number of possible combinations would be (26 + 26 + 10)^4 = 14,776,336, which is too many to train on easily. So we simplify the task by using digit-only CAPTCHAs, reducing the number of combinations to 10^4 = 10,000, which is far more manageable.
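As a quick sanity check on these numbers (a throwaway snippet, not part of the project code):

```python
# number of possible 4-character captchas
print((26 + 26 + 10) ** 4)  # 14776336 -- letters (both cases) plus digits
print(10 ** 4)              # 10000    -- digits only
```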

So here we first define the vocabulary and a variable for its length:

```python
VOCAB = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
CAPTCHA_LENGTH = 4
VOCAB_LENGTH = len(VOCAB)
```

Here VOCAB is the vocabulary, i.e. the 10 digits 0 to 9; CAPTCHA_LENGTH, the number of characters in the CAPTCHA, is 4; and VOCAB_LENGTH, the length of the vocabulary, is 10.

Next we define a method to generate CAPTCHA data, similar to the procedure above, except that here we return the data as a NumPy array:

```python
from PIL import Image
from captcha.image import ImageCaptcha
import numpy as np


def generate_captcha(captcha_text):
    """
    get captcha text and np array
    :param captcha_text: source text
    :return: captcha image and array
    """
    image = ImageCaptcha()
    captcha = image.generate(captcha_text)
    captcha_image = Image.open(captcha)
    captcha_array = np.array(captcha_image)
    return captcha_array
```

This method converts the CAPTCHA image into the RGB values of its pixels. Let's call it to try it out:

```python
captcha = generate_captcha('1234')
print(captcha, captcha.shape)
```

The output is as follows:

```
[[[239 244 244]
  [239 244 244]
  [239 244 244]
  ...
  [239 244 244]]] (60, 160, 3)
```

You can see that its shape is (60, 160, 3): the image is 60 pixels high and 160 pixels wide, i.e. a 60×160-pixel CAPTCHA, and each pixel has an RGB value, so the last dimension holds the pixel's RGB values.

Next, we need to define the labels. Since we are training a deep learning model, it is best to use one-hot encoding for the label data: if the CAPTCHA text is 1234, then for each character we set the corresponding vocabulary index position to 1, giving a vector of total length 4 × 10 = 40. We implement the conversion between text and one-hot encoding as follows:

```python
def text2vec(text):
    """
    text to one-hot vector
    :param text: source text
    :return: np array
    """
    if len(text) > CAPTCHA_LENGTH:
        return False
    vector = np.zeros(CAPTCHA_LENGTH * VOCAB_LENGTH)
    for i, c in enumerate(text):
        index = i * VOCAB_LENGTH + VOCAB.index(c)
        vector[index] = 1
    return vector


def vec2text(vector):
    """
    vector to captcha text
    :param vector: np array
    :return: text
    """
    if not isinstance(vector, np.ndarray):
        vector = np.asarray(vector)
    vector = np.reshape(vector, [CAPTCHA_LENGTH, -1])
    text = ''
    for item in vector:
        text += VOCAB[np.argmax(item)]
    return text
```

The text2vec() method converts the real text to a one-hot encoding, and the vec2text() method converts the one-hot encoding back to the real text.

For example, here we call these two methods to convert the text 1234 to its one-hot encoding and then convert it back:

```python
vector = text2vec('1234')
text = vec2text(vector)
print(vector, text)
```

The running results are as follows:

```
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] 1234
```

In this way, we can convert between text and its one-hot encoding.

Then we can construct a batch of data: the x data is the NumPy arrays of the CAPTCHA images, and the y data is the one-hot encodings of the CAPTCHA texts. We generate them as follows:
```python
import random
from os import makedirs
from os.path import join, exists
import pickle
import numpy as np

DATA_LENGTH = 10000
DATA_PATH = 'data'


def get_random_text():
    text = ''
    for i in range(CAPTCHA_LENGTH):
        text += random.choice(VOCAB)
    return text


def generate_data():
    print('Generating Data...')
    data_x, data_y = [], []

    # generate data x and y
    for i in range(DATA_LENGTH):
        text = get_random_text()
        # get captcha array
        captcha_array = generate_captcha(text)
        # get vector
        vector = text2vec(text)
        data_x.append(captcha_array)
        data_y.append(vector)

    # write data to pickle
    if not exists(DATA_PATH):
        makedirs(DATA_PATH)
    x = np.asarray(data_x, np.float32)
    y = np.asarray(data_y, np.float32)
    with open(join(DATA_PATH, 'data.pkl'), 'wb') as f:
        pickle.dump(x, f)
        pickle.dump(y, f)
```

Here we define a get_random_text() method that randomly generates CAPTCHA text, use this randomly generated text to produce the x and y data, and then write the data to a pickle file, completing the preprocessing.
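A minimal entry point for running this preprocessing might look like the following (a sketch, assuming the functions above live in the same module):

```python
if __name__ == '__main__':
    generate_data()
```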

Build the model

Now that we have the data, let's start building the model. Here we use scikit-learn's train_test_split() method to split the data into three parts: a training set, a development set, and a test set:

```python
import pickle
from sklearn.model_selection import train_test_split

with open('data.pkl', 'rb') as f:
    data_x = pickle.load(f)
    data_y = pickle.load(f)

# 60% train, 20% dev, 20% test
train_x, test_x, train_y, test_y = train_test_split(data_x, data_y, test_size=0.4, random_state=40)
dev_x, test_x, dev_y, test_y = train_test_split(test_x, test_y, test_size=0.5, random_state=40)
```

Next we construct three Dataset objects from these three splits:

```python
# train and dev dataset
train_dataset = tf.data.Dataset.from_tensor_slices((train_x, train_y)).shuffle(10000)
train_dataset = train_dataset.batch(FLAGS.train_batch_size)

dev_dataset = tf.data.Dataset.from_tensor_slices((dev_x, dev_y))
dev_dataset = dev_dataset.batch(FLAGS.dev_batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((test_x, test_y))
test_dataset = test_dataset.batch(FLAGS.test_batch_size)
```

We then initialize a reinitializable iterator and bind it to these datasets:

```python
# a reinitializable iterator
iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_initializer = iterator.make_initializer(train_dataset)
dev_initializer = iterator.make_initializer(dev_dataset)
test_initializer = iterator.make_initializer(test_dataset)
```

Here we build the network with three convolutional layers and two fully connected layers. To keep the code short, we use TensorFlow's layers module directly:

```python
# input layer
with tf.variable_scope('inputs'):
    # x.shape = [-1, 60, 160, 3]
    x, y_label = iterator.get_next()
keep_prob = tf.placeholder(tf.float32, [])
y = tf.cast(x, tf.float32)
# 3 CNN layers
for _ in range(3):
    y = tf.layers.conv2d(y, filters=32, kernel_size=3, padding='same', activation=tf.nn.relu)
    y = tf.layers.max_pooling2d(y, pool_size=2, strides=2, padding='same')
    # y = tf.layers.dropout(y, rate=keep_prob)
# 2 dense layers
y = tf.layers.flatten(y)
y = tf.layers.dense(y, 1024, activation=tf.nn.relu)
y = tf.layers.dropout(y, rate=keep_prob)
y = tf.layers.dense(y, CAPTCHA_LENGTH * VOCAB_LENGTH)
```

Here the convolution kernel size is 3, the padding uses SAME mode, and the activation function is ReLU.
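For reference, here is how the tensor shape evolves through the network, assuming the 60×160×3 input above (with 'same' padding, each max pool halves the height and width, rounding up):

```python
# shape trace (batch dimension omitted); a sketch based on the layers above
# input:          (60, 160, 3)
# conv1 + pool1:  (30, 80, 32)
# conv2 + pool2:  (15, 40, 32)
# conv3 + pool3:  (8, 20, 32)    # ceil(15 / 2) = 8 with 'same' padding
# flatten:        (8 * 20 * 32,) = (5120,)
# dense1:         (1024,)
# dense2:         (CAPTCHA_LENGTH * VOCAB_LENGTH,) = (40,)
```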

After the fully connected layers, the shape of y becomes [batch_size, n_classes], while our label is the concatenation of CAPTCHA_LENGTH one-hot vectors. We want to use cross entropy as the loss, but when computing the cross entropy, the elements in the last dimension of the labels vector must sum to 1; otherwise the gradient computation will go wrong. See TensorFlow's official documentation for softmax_cross_entropy_with_logits:

https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits



Note: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.

But our labels here are the concatenation of CAPTCHA_LENGTH one-hot vectors, so their elements sum to CAPTCHA_LENGTH rather than 1. We therefore reshape both tensors so that the elements of the last dimension sum to 1:

```python
y_reshape = tf.reshape(y, [-1, VOCAB_LENGTH])
y_label_reshape = tf.reshape(y_label, [-1, VOCAB_LENGTH])
```

This ensures that the last dimension has length VOCAB_LENGTH and that each row is a single one-hot vector, so its elements sum to 1.
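To make this concrete, here is a small NumPy check (illustration only, reusing text2vec from above):

```python
import numpy as np

label = text2vec('1234')                        # shape (40,), elements sum to 4
rows = np.reshape(label, [CAPTCHA_LENGTH, -1])  # shape (4, 10)
print(rows.sum(axis=-1))                        # [1. 1. 1. 1.] -- each row sums to 1
```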

Then the loss and accuracy are easy to calculate:
```python
# loss
cross_entropy = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=y_reshape, labels=y_label_reshape))
# accuracy
max_index_predict = tf.argmax(y_reshape, axis=-1)
max_index_label = tf.argmax(y_label_reshape, axis=-1)
correct_predict = tf.equal(max_index_predict, max_index_label)
accuracy = tf.reduce_mean(tf.cast(correct_predict, tf.float32))
```

Then perform the training:

```python
# train
train_op = tf.train.RMSPropOptimizer(FLAGS.learning_rate).minimize(cross_entropy, global_step=global_step)
for epoch in range(FLAGS.epoch_num):
    tf.train.global_step(sess, global_step_tensor=global_step)
    # train
    sess.run(train_initializer)
    for step in range(int(train_steps)):
        loss, acc, gstep, _ = sess.run([cross_entropy, accuracy, global_step, train_op],
                                       feed_dict={keep_prob: FLAGS.keep_prob})
        # print log
        if step % FLAGS.steps_per_print == 0:
            print('Global Step', gstep, 'Step', step, 'Train Loss', loss, 'Accuracy', acc)

    if epoch % FLAGS.epochs_per_dev == 0:
        # dev
        sess.run(dev_initializer)
        for step in range(int(dev_steps)):
            if step % FLAGS.steps_per_print == 0:
                print('Dev Accuracy', sess.run(accuracy, feed_dict={keep_prob: 1}), 'Step', step)
```

Here we first run train_initializer to bind the iterator to the training dataset, then execute train_op, fetch the loss, accuracy, and global step, and print them as the log.
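The loop above depends on some setup not shown in this excerpt, such as the FLAGS definitions, global_step, and the per-epoch step counts. A plausible version might look like the following; the names come from the code above, but the specific values are assumptions, not the author's:

```python
import math
import tensorflow as tf

# hypothetical hyperparameter definitions; values are illustrative
flags = tf.flags
flags.DEFINE_float('learning_rate', 0.001, 'learning rate')
flags.DEFINE_integer('train_batch_size', 128, 'train batch size')
flags.DEFINE_integer('dev_batch_size', 256, 'dev batch size')
flags.DEFINE_integer('test_batch_size', 256, 'test batch size')
flags.DEFINE_integer('epoch_num', 1000, 'number of training epochs')
flags.DEFINE_float('keep_prob', 0.5, 'dropout value fed into the model')
flags.DEFINE_integer('steps_per_print', 2, 'steps between log lines')
flags.DEFINE_integer('epochs_per_dev', 2, 'epochs between dev evaluations')
FLAGS = flags.FLAGS

# global step counter and per-epoch step counts
global_step = tf.Variable(-1, trainable=False, name='global_step')
train_steps = math.ceil(len(train_x) / FLAGS.train_batch_size)
dev_steps = math.ceil(len(dev_x) / FLAGS.dev_batch_size)
test_steps = math.ceil(len(test_x) / FLAGS.test_batch_size)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
```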

Training

Run the training process with results similar to the following:
```
...
Dev Accuracy 0.9580078 Step 0
Dev Accuracy 0.9472656 Step 2
Dev Accuracy 0.9501953 Step 4
Dev Accuracy 0.9658203 Step 6
Global Step 3243 Step 0 Train Loss 1.1920928e-06 Accuracy 1.0
Global Step 3245 Step 2 Train Loss 1.5497207e-06 Accuracy 1.0
Global Step 3247 Step 4 Train Loss 1.1920928e-06 Accuracy 1.0
Global Step 3249 Step 6 Train Loss 1.7881392e-06 Accuracy 1.0
...
```

The accuracy on the dev set reaches more than 95%.

Testing

We can also save the model every few epochs during the training process:

```python
# save model
if epoch % FLAGS.epochs_per_save == 0:
    saver.save(sess, FLAGS.checkpoint_dir, global_step=gstep)
```

Of course, we could instead save only the model with the highest accuracy on the dev set.
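For example, a sketch of keeping only the best checkpoint (best_acc and dev_acc are hypothetical names, not from the original code):

```python
# hypothetical: track the best dev accuracy and save only on improvement
best_acc = 0.0
# ... inside the dev evaluation step:
dev_acc = sess.run(accuracy, feed_dict={keep_prob: 1})
if dev_acc > best_acc:
    best_acc = dev_acc
    saver.save(sess, FLAGS.checkpoint_dir, global_step=gstep)
```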

When testing, we can reload the model from its checkpoint and evaluate it:

```python
# load model
ckpt = tf.train.get_checkpoint_state('ckpt')
if ckpt:
    saver.restore(sess, ckpt.model_checkpoint_path)
    print('Restore from', ckpt.model_checkpoint_path)
    sess.run(test_initializer)
    for step in range(int(test_steps)):
        if step % FLAGS.steps_per_print == 0:
            print('Test Accuracy', sess.run(accuracy, feed_dict={keep_prob: 1}), 'Step', step)
else:
    print('No Model Found')
```

After testing, the accuracy is almost the same as on the dev set.

To recognize new CAPTCHAs, simply replace test_x with new image data and run the same inference.
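For instance, a minimal inference sketch under the same graph and session might look like this; binding a new dataset to the reinitializable iterator is one way to feed fresh images, though the details may differ from the repository's actual inference code:

```python
# hypothetical inference on a freshly generated captcha
new_x = np.asarray([generate_captcha('5678')], np.float32)        # (1, 60, 160, 3)
new_y = np.zeros([1, CAPTCHA_LENGTH * VOCAB_LENGTH], np.float32)  # dummy labels

infer_dataset = tf.data.Dataset.from_tensor_slices((new_x, new_y)).batch(1)
infer_initializer = iterator.make_initializer(infer_dataset)

sess.run(infer_initializer)
# y_reshape has shape [CAPTCHA_LENGTH, VOCAB_LENGTH] for a single image
indices = sess.run(tf.argmax(y_reshape, axis=-1), feed_dict={keep_prob: 1})
print(''.join(VOCAB[i] for i in indices))  # e.g. '5678' if the prediction is right
```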

Conclusion

The above is the process of CAPTCHA recognition using TensorFlow. The full code is available at:

https://github.com/AIDeepLearning/CrackCaptcha

This article was first published on Cui Qingcai's personal blog Jingmi: Python3 Web Crawler Development in Practice.

For more crawler-related content, please follow my personal WeChat official account: Attack Coder.
