In the previous article we introduced the image recognition algorithms used against character CAPTCHAs in their early days. CAPTCHA recognition built on such algorithms is not robust: whenever a CAPTCHA adds new interference or noise, the recognition algorithm has to be overhauled and updated, and every update requires manual analysis. An attacker who relies on this approach can hardly gain the upper hand in the confrontation.

With the development of computer vision (CV) in the AI community, CNNs have made significant progress in many vision-related tasks, and at the same time have become the "gravedigger" of the character CAPTCHA. In this article, drawing on CNN research papers, open source projects, and popular science articles, we will look at how the security of character CAPTCHAs was gradually broken down by CNN models.

1. CNN

A convolutional neural network (CNN) is a class of neural network widely used in image and video recognition, image classification, natural language processing, and other tasks. Biological studies of how the mammalian brain works are widely believed to have influenced the origins of the CNN, most notably Hubel and Wiesel's vision experiments on cats, for which they won the Nobel Prize in 1981.

As early as 1980, the Japanese scholar K. Fukushima designed the Neocognitron model, in which most of the structures of commonly used modern CNNs can already be found [1]. Later, LeCun first introduced backpropagation into the training of Neocognitron-like models, then proposed the famous LeNet-5 in 1998, which outperformed other methods on the MNIST recognition task [4]. CNNs then fell silent for more than a decade, until AlexNet topped the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) with a 15.3% error rate, far ahead of the runner-up, and the AI community once again set off a wave of research on deep neural networks such as CNNs. The winners and runners-up of the ILSVRC competitions in the following years gave birth to model structures such as ResNet, GoogLeNet, and VGG, which have since been written into textbooks.



On the left is a schematic diagram of the internal layers of the Neocognitron and the connections between them [1]; on the right is the structure of LeNet-5 [4]. The similarity between the two, or rather the inheritance and development from one to the other, is easy to see.

The key structures of a CNN include the convolutional layer, which extracts spatial features, and the down-sampling (pooling) layer, which reduces the size of the feature maps.

1.1 Convolutional Layer

Convolution is a mathematical operation. For continuous functions it is defined as

$(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau)\, g(t-\tau)\, d\tau$

where f and g are both measurable functions on R, and the convolution is the integral of the product of one of the functions, flipped, with the other.

In the discrete case it is defined as

$(f * g)[n] = \sum_{m=-\infty}^{+\infty} f[m]\, g[n-m]$

which can be thought of as a weighted sum of the values of the two functions (one of which is flipped) at discrete points. Extended to two-dimensional discrete functions, this becomes the familiar computation performed by the convolutional layer of a CNN:



Image from StackExchange

Strictly following the definition of convolution, one of the two functions f or g (the input matrix and the filter, respectively) should be flipped before the weighted summation is carried out, but this step is omitted in the actual implementation of CNN convolution. In other words, most "convolution" operations in convolutional neural networks are really implemented as "correlation" operations, which makes no practical difference since the filter weights are learned either way.
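To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the article) of the sliding-window weighted sum a convolutional layer actually computes, i.e. cross-correlation; flipping the kernel first would give the textbook convolution:

import numpy as np

def conv2d(image, kernel):
    # "valid" cross-correlation of a 2-D image with a 2-D kernel,
    # i.e. the operation CNN frameworks call "convolution"
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)  # weighted sum over the window
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])              # a simple vertical edge detector
print(conv2d(image, kernel))                    # cross-correlation (what CNNs implement)
print(conv2d(image, np.flip(kernel)))           # true convolution: flip the kernel first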

1.2 Pooling Layer

The pooling layer is used to quickly reduce the size of the feature map, thereby reducing the number of parameters in a deep network and the computational cost. Max pooling is the most commonly used operation: it takes the maximum value of each target region as the output of this layer:



Image from Wikipedia

In some CNN structures, a convolutional layer with stride > 1 is used in place of max pooling, as in GoogLeNet [3].
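As a quick illustration (a TensorFlow sketch of my own, not from the article), a 2×2 max-pooling layer and a stride-2 convolution both halve the spatial size of a feature map:

import tensorflow as tf

x = tf.random.normal([1, 60, 100, 32])            # a batch with one 60x100 feature map of 32 channels

pooled = tf.keras.layers.MaxPooling2D((2, 2))(x)                       # keep the max of each 2x2 region
strided = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same')(x)  # learnable down-sampling

print(pooled.shape)    # (1, 30, 50, 32)
print(strided.shape)   # (1, 30, 50, 32)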

2. CNN Versus CAPTCHA

In the last issue we introduced how, as early as 2003, researchers used image recognition algorithms to recognize character CAPTCHAs with fairly good results. The CNN, proposed in 1998 and already applied in bank-check reading and other image recognition scenarios, was naturally brought into the CAPTCHA arena as well. In the confrontation between the two, character CAPTCHAs were crushed almost from the beginning.

2.1 CAPTCHA vs CNN

In 2004, Kumar Chellapilla et al. tried to use a CNN based on [5] to recognize several character CAPTCHAs then in actual use (including our old friend EZ-Gimpy), notably without the help of a language model (even though those CAPTCHAs drew from limited word lists). On the industry's most popular CAPTCHAs of the time, recognition accuracy reached 66.2% on MailBlocks, 47.8% on Register, 34.4% on Yahoo Version 2, 45.7% on Gmail, and as low as 4.89% on others such as EZ-Gimpy and Ticketmaster [6].

Then, in 2005, Kumar Chellapilla et al. ran seven comparative experiments to compare the ability of humans and machines on a single-character recognition task under specified conditions (capital letters A-Z and digits 0-9, font size 30, Times New Roman, etc.) [7]. In each experiment, different types of distortion or interference of increasing strength were added to the character images. Forty-four employees with normal (or corrected) vision, aged 21-58, from the same company were recruited to complete the experiment online. The machine group used a CNN based on [5] (trained separately on its own training set), and recognition accuracy was used as the comparison metric.



In only a few of the seven experiments was the performance of humans and machines roughly equal; in the others, human accuracy declined significantly as the interference escalated, while the machine's accuracy was largely or entirely unaffected. The authors therefore conclude that "the identification tasks that make up HIPs are easier for machines than for humans" (HIP, Human Interaction Proof, is another name for the CAPTCHA), and suggest that the security strength of character CAPTCHAs should rely more on the difficulty of character segmentation than on the difficulty of character recognition. This has in fact become an important principle in the design of most mainstream character CAPTCHAs.

The fourth of KC's comparative experiments: on the left are example character images with increasing levels of interference; on the right, the recognition accuracy curves of the machine and the human participants at each difficulty level. The machine's recognition accuracy is almost always better than the human participants'.

2.2 Ineffective resistance of CAPTCHA

After 2005, KC's experimental conclusions were widely accepted by academia and industry, and more and more CAPTCHAs shifted their efforts toward making characters hard to segment (segmentation resistance), including the Microsoft/Hotmail CAPTCHA, reCAPTCHA, and others.



Microsoft MSN verification code [14]

Against this trend, character CAPTCHA recognition experiments basically proceed in the following two steps:

  • Segmentation: separate the individual characters
  • Recognition: recognize each isolated character

In the segmentation stage, a variety of techniques are used, including color filling, character-width analysis, and character-feature analysis; for an isolated single character, a CNN can reach a 95% recognition rate [7]. Following this line of thought, J. Yan and his team achieved 61%, 78%, 33%, and 36%-89% accuracy respectively on the Microsoft CAPTCHA, Megaupload, reCAPTCHA, Yahoo, and other datasets from 2008 to 2013 [14-17]. The Megaupload CAPTCHA was one of the contestants that went particularly far with segmentation-resistance mechanisms. Throughout this period, the CNN played more of a supporting role.
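As a rough, hypothetical illustration of this two-step pipeline (an OpenCV sketch of my own, not code from the cited papers; connected-component analysis is only one of many segmentation tricks):

import cv2

def segment_characters(captcha_bgr, min_area=30):
    # binarize the CAPTCHA and return per-character crops, ordered left to right
    gray = cv2.cvtColor(captcha_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = [stats[i] for i in range(1, n) if stats[i][cv2.CC_STAT_AREA] > min_area]
    boxes.sort(key=lambda s: s[cv2.CC_STAT_LEFT])
    crops = []
    for x, y, w, h, _ in boxes:
        crops.append(cv2.resize(binary[y:y+h, x:x+w], (28, 28)))  # normalize size for the classifier
    return crops

# each crop would then go to a single-character classifier (classify_single_char is hypothetical):
# chars = [classify_single_char(c) for c in segment_characters(img)]

Of course, once characters touch or overlap, such naive segmentation breaks down, which is exactly why segmentation resistance became the focus of CAPTCHA design.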

2.3 End of the strongest CAPTCHA

In 2013, the Google team announced a major update to reCAPTCHA: based on advanced risk analysis, reCAPTCHA could serve character images of different difficulty to human users and to suspected machines. A year later they emphasized this point again in another blog post, citing the experimental results of a Google paper to explain why:

Goodfellow's team designed an end-to-end deep convolutional neural network combining the three steps of localization, segmentation, and recognition to solve the problem of transcribing house numbers in Street View imagery, achieving 96% accuracy on the SVHN (Street View House Numbers) dataset [10].

Then, to further test the model's ability to generalize to the task of recognizing arbitrary, artificially generated images of distorted characters, they ran an experiment on reCAPTCHA character images, using a model even shallower than the one used on SVHN (nine convolutional layers versus the eleven that tested best on the SVHN dataset). The dataset consisted of the hardest reCAPTCHA images (the paper does not specify the selection criteria for "hardest"; presumably these are the images with the lowest recognition accuracy in previous experiments, or the highest user error rate in production).

Under this premise, the model's transcription success rate (the accuracy of recognizing the characters in the image) was 99.8%, far better than human performance on the same task (given the dataset-selection assumption above). So both Goodfellow's paper and the reCAPTCHA team's blog posts emphasize that reCAPTCHA no longer relies on correct character input as the only way to distinguish humans from machines, and instead uses more advanced signals such as analysis of user interaction.

On March 31, 2018, reCAPTCHA V1 was retired, leaving in service the V2 version, which asks users to select matching images, and the interface-free V3 version.

3. Try It Yourself

In a world where a variety of deep learning frameworks and open source projects are freely available, it is extremely easy to train a CNN model and quickly reach a decent recognition rate on a character CAPTCHA. The following example uses a model structure with only three convolutional layers [11] and Python's captcha library for generating character CAPTCHAs [12].

Step1: Generate the data set. The experiment randomly generated 30,000 images under 29,726 labels (directories); some labels were generated two or more times. The data set is organized as follows:



The character CAPTCHA style generated by the default CAPTCHA library configuration is as follows:



As you can see, it contains character distortions, overlaps, and dot and line interference that are common in character CAPTCHAs.
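For reference, such a data set can be generated with a few lines using the captcha library [12]; the following is a minimal sketch of my own that matches the directory layout described above (one directory per label), not the original repository's generation script:

import os
import random
from captcha.image import ImageCaptcha

CHARS = "0123456789abcdefghijklmnopqrstuvwxyz"
generator = ImageCaptcha(width=100, height=60)   # size matches the generator used in Step5 below

for i in range(30000):
    label = "".join(random.choices(CHARS, k=4))  # a random 4-character label
    out_dir = os.path.join("captcha_30000", label)
    os.makedirs(out_dir, exist_ok=True)          # one directory per label
    generator.write(label, os.path.join(out_dir, "%s_%d.png" % (label, i)))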

Step2: Load the data set (the original repository's author wrote his own generator; for simplicity we use the tf.data.Dataset API here):

import random
from glob import glob

import tensorflow as tf

H, W = 60, 100  # image height and width, matching the 100x60 images from the captcha generator


def make_dataset(path, batch_size, n_label):
    def parse_image(filename):
        image = tf.io.read_file(filename)
        image = tf.image.decode_png(image, channels=3)
        image = tf.image.resize(image, [H, W])
        image = tf.image.per_image_standardization(image)
        return image

    def configure_for_performance(ds):
        ds = ds.shuffle(buffer_size=1000)
        ds = ds.batch(batch_size)
        ds = ds.repeat()
        ds = ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
        return ds

    filenames = glob(path + '/*/*')
    random.shuffle(filenames)
    labels = [[tf.keras.utils.to_categorical(((ord(i)-48)%39), n_label) for i in filename.split("/")[-2]] for filename in filenames]
    filenames_ds = tf.data.Dataset.from_tensor_slices(filenames)
    images_ds = filenames_ds.map(parse_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    labels_ds = tf.data.Dataset.from_tensor_slices(labels)
    ds = tf.data.Dataset.zip((images_ds, labels_ds))
    ds = configure_for_performance(ds)

    return ds

The label of each CAPTCHA image is taken from its file path; it is a 4-character string. (ord(i)-48)%39 maps the 36 characters '0'-'9' and 'a'-'z' to class indices 0-35, which to_categorical then converts to one-hot vectors. We end up with a 30,000 × 4 × 36 matrix of 0s and 1s as the Y of our sample set.
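A quick sanity check of this mapping (a throwaway snippet, not part of the original repository):

# '0'-'9' map to 0-9, 'a'-'z' map to 10-35
print([(c, (ord(c) - 48) % 39) for c in "09az"])
# [('0', 0), ('9', 9), ('a', 10), ('z', 35)]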

Due to space limitations, for the rest of the code please refer to documentation such as that of tf.data.Dataset.

Step3: Define the model structure. The model contains only three 2D convolutional layers, three sub-sampling (pooling) layers, and two fully connected layers, about 3.4M parameters in total. Compared with the CNN models commonly used today, which often have hundreds of layers and tens of millions of parameters, this structure is simple and primitive.

from tensorflow.keras import layers, models
from tensorflow.keras.layers import Input

C = 3                # color channels (decode_png above uses channels=3)
D, N_LABELS = 4, 36  # 4 characters per CAPTCHA, 36 candidate characters

input_layer = Input(shape=(H, W, C))
x = layers.Conv2D(32, 3, activation='relu')(input_layer)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)

x = layers.Flatten()(x)
x = layers.Dense(1024, activation='relu')(x)

x = layers.Dense(D * N_LABELS, activation='softmax')(x)
x = layers.Reshape((D, N_LABELS))(x)

model = models.Model(inputs=input_layer, outputs=x)

model.compile(optimizer='adam', 
              loss='categorical_crossentropy',
              metrics= ['accuracy'])
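The parameter count quoted above can be verified with Keras' built-in summary:

model.summary()   # prints each layer's output shape and the total (~3.4M) trainable parameters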

Step4: Train the model:

batch_size = 128

ds = make_dataset("captcha_30000", batch_size, N_LABELS)
val_ds = make_dataset("captcha_1000", batch_size, N_LABELS)
history = model.fit(ds,
                    steps_per_epoch=30000//batch_size,
                    epochs=10,
                    validation_data=val_ds,
                    validation_steps=1000//batch_size
                   )

The training batch_size is 128, a separately generated set of 1,000 samples serves as the validation data set, and the model is trained for 10 epochs:



It can be seen that validation accuracy approaches 80% by around the fifth epoch.
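The accuracy curves above can be plotted from the returned history object; a minimal sketch (the metric key names assume Keras' default 'accuracy' / 'val_accuracy'):

import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()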

Step5: Verify the experimental results

import random

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from captcha.image import ImageCaptcha

generator = ImageCaptcha(width=100, height=60)
chars_dict = "0123456789abcdefghijklmnopqrstuvwxyz"
fig, axes = plt.subplots(3, 3, figsize=(10, 10))

for i in range(9):
    label = "".join(random.choices(chars_dict, k=4))

    img = generator.generate_image(label)
    image = tf.keras.preprocessing.image.img_to_array(img)
    image = tf.image.per_image_standardization(image)

    pred = model.predict(tf.expand_dims(image, axis=0))
    pred = np.argmax(pred,axis=-1)
    pred = "".join([chr(i+ord('0')) if i<10 else chr(i+ord('a')-10) for i in pred[0]])
    ax = axes.flat[i]
    ax.imshow(img)
    ax.set_title('pred: {}'.format(pred))
    ax.set_xlabel('true: {}'.format(label))
    ax.set_xticks([])
    ax.set_yticks([])

The training process took only 11 minutes (8C/32G/GTX1070), and the core code of the whole experiment (model building, data loading, training) is less than 50 lines (although using line counts to prove a point is a little sophistical).

A very simple CNN structure is used here (only three convolutional layers, one more than LeNet-5's two), and it quickly reaches about 80% validation accuracy on 30K samples. From several test runs and the experimental results of the original repository's author (the original repository's experiments are on 4-digit character CAPTCHAs, reaching 87.6% and 98.8% accuracy with 30,241 and 302,410 samples respectively), it is clear that the model's performance depends largely on the size of the data set. Within a certain range, the more data available for training, the better the model performs. At the same time, the gain in accuracy slows down as the data grows: to squeeze out a few more percentage points late in training, the amount of data may need to be nearly doubled.

Although there are various ways to obtain CAPTCHA images, like any other common data set, today, the amount of data is almost the only barrier to this task, and there are many mature ways to cope with it. The most common is data augmentation: applying CV operations such as random flips, scaling, cropping, and adjustments of contrast and saturation to the existing image data, thereby effectively enlarging the training set. Here is a simple change to the original experiment code:

def make_dataset(path, batch_size, n_label, is_training=False):
    def parse_image(filename):
        image = tf.io.read_file(filename)
        image = tf.image.decode_png(image, channels=3)
        image = tf.image.resize(image, [H, W])
        image = tf.image.per_image_standardization(image)
        if is_training:
            # augmentation: pad, randomly crop back to the original size,
            # then randomly adjust brightness
            image = tf.image.resize_with_crop_or_pad(image, H+10, W+10)
            image = tf.image.random_crop(image, (H, W, 3))
            image = tf.image.random_brightness(image, 0.3)
        return image
    ...

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
...

ds = make_dataset("captcha_10000", batch_size, N_LABELS, True)
val_ds = make_dataset("captcha_1000", batch_size, N_LABELS)
history = model.fit(ds,
                    steps_per_epoch=10000//batch_size,
                    epochs=50,
                    validation_data=val_ds,
                    validation_steps=1000//batch_size
                   )

As an example, the original training samples are randomly cropped (after padding) and have their brightness randomly adjusted. Using only 10K images and training for 50 epochs (with Adam's learning rate increased), validation accuracy close to 80% is reached around epoch 28. The choice of augmentation methods should be based on the actual task; random flipping, for example, whether vertical or horizontal, is not a good choice here.



After adding these two simple kinds of augmentation and increasing the learning rate, the model reaches nearly the same validation accuracy of about 80% with only one third of the original data.

Through image augmentation, the data requirements of the character CAPTCHA recognition task can be further reduced, and better results can be achieved with limited data. In addition, the keras.applications module provides many mature CNN models along with weights pretrained on ImageNet, which means the whole experiment could be simplified even further by adopting transfer learning.
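As a rough illustration of that idea (a hypothetical sketch only: MobileNetV2 is chosen arbitrarily from keras.applications, the head simply mirrors the model above, and none of this was benchmarked in this experiment):

import tensorflow as tf
from tensorflow.keras import layers, models

# ImageNet-pretrained backbone, reusing the H, W, C, D, N_LABELS constants from the experiment above
base = tf.keras.applications.MobileNetV2(input_shape=(H, W, C),
                                         include_top=False,
                                         weights='imagenet')
base.trainable = False                       # freeze the pretrained weights; train only the new head

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dense(D * N_LABELS, activation='softmax')(x)
x = layers.Reshape((D, N_LABELS))(x)

model = models.Model(inputs=base.input, outputs=x)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])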

In addition to [11], there are many open source projects on GitHub that are structurally more complex and more capable, yet packaged to be even easier to use; most of them follow this kind of end-to-end model. Some projects even provide Win32 executables with a graphical interface that work essentially out of the box, reducing the task of breaking character CAPTCHAs to the point where you almost only need to think about how to obtain samples. With the current flood of various ** platforms, acquiring such sample data is actually not very difficult.

4. Conclusion

To sum up, traditional character CAPTCHAs offer almost no security against modern CNN techniques, and with the popularization of artificial intelligence the cost and threshold of attack have become extremely low; it is fair to say that the security of traditional character CAPTCHAs has completely collapsed. At the same time, the Internet has become the most important platform for modern social and economic activity, and online assets have grown exponentially. The CAPTCHA, as the first gate protecting online assets, is incomparably important, and the collapse of its security front poses enormous risk to enterprises' online assets in every industry, with annual losses growing to the level of hundreds of billions.

Under this security pressure, the traditional character CAPTCHA has reached a dead end: the various deformation and distortion tricks added to improve security not only failed to save the situation, but degraded the user experience to an unbearable level. Perhaps because the concept of the character CAPTCHA is so deeply rooted, or because designing a truly general CAPTCHA in the face of strong artificial intelligence is genuinely difficult, this worldwide pain point remained unresolved for years, until a geek team from Wuhan, China creatively put forward a landmark verification concept. They saw that, under the general trend of artificial intelligence, one must creatively use the behavioral and biometric characteristics of humans to build a CAPTCHA with artificial intelligence at its core, so as to achieve lasting, intelligent security. A CAPTCHA built with this idea is called behavioral verification.

Stay tuned for the next chapter: “CAPTCHA and AI: The Behavioral Revolution Over the Past Twenty Years”

[1] Fukushima, K. "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position." Biological Cybernetics 36.4 (1980): 193-202.

[2] Wikipedia – Convolutional neural network

[3] Szegedy, C., et al. "Going deeper with convolutions." 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015.

[4] LeCun, Y., L. Bottou, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

[5] Simard, P. Y., D. Steinkraus, and J. C. Platt. "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis." 7th International Conference on Document Analysis and Recognition (ICDAR 2003), 3-6 August 2003, Edinburgh, Scotland, UK, IEEE Computer Society, 2003.

[6] Chellapilla, K., and P. Y. Simard. "Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)." Advances in Neural Information Processing Systems (NIPS), 2004: 265-272.

[7] Chellapilla, K., K. Larson, et al. "Computers beat Humans at Single Character Recognition in Reading based Human Interaction Proofs (HIPs)." (2005).

[8] Google Security Blog, reCAPTCHA just got easier (but only if you're human) by Vinay Shet, Product Manager, reCAPTCHA

[9] Google Security Blog, Street View and reCAPTCHA technology just got smarter by Vinay Shet, Product Manager, reCAPTCHA

[10] Goodfellow, I. J. , et al. “Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks.” Computer Science (2013).

[11] GitHub – JackonYang/captcha-tensorflow

[12] GitHub – lepture/captcha

[13] pypi – captcha

[14] J. Yan and A. S. E. Ahmad, "A low-cost attack on a Microsoft CAPTCHA," in Proceedings of the 15th ACM Conference on Computer and Communications Security, CCS '08, pp. 543-554, USA, October 2008.

[15] A. S. El Ahmad, J. Yan, and L. Marshall, "The robustness of a new CAPTCHA," in Proceedings of the 3rd European Workshop on System Security, EUROSEC '10, pp. 36-41, April 2010.

[16] A. S. E. Ahmad, J. Yan, and M. Tayara, “The Robustness of Google Captchas,” Computing Science Technical Report CS-TR-1278, Newcastle University, 2011.

[17] H. Gao, W. Wang, J. Qi, X. Wang, X. Liu, and J. Yan, “The robustness of hollow CAPTCHAs,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, CCS 2013, pp. 1075–1085, November 2013.