Project source address: github.com/nickliqian/… Stars and forks are welcome!

cnn_captcha

Use a CNN to recognize captchas with TensorFlow. This project uses TensorFlow to implement a convolutional neural network for recognizing character-based image captchas. It wraps fairly general modules for dataset checking, training, validation, recognition, and an API, greatly reducing the time and effort needed to recognize a character-based captcha.

The project has already helped many people complete captcha-recognition tasks efficiently. If you run into bugs or have improvements to suggest while using it, feel free to open issues and PRs; the author will reply as soon as possible and hopes to improve the project together with you.

Also check out nickliqian/darknet_captcha if you need to recognize click or drag captchas, or have object-detection needs.

Changelog

  • 2018.11.12 – First version of readme.md
  • 2018.11.21 – Added some notes on captcha recognition
  • 2018.11.24 – Optimized the rules for validating dataset images
  • 2018.11.26 – Added train_model_v2.py
  • 2018.12.08 – Optimized model recognition speed; added API stress testing and timing statistics
  • 2019.02.19 – Added a method of accuracy calculation

Table of Contents

1 Project Introduction

  • 1.1 About Captcha Recognition
  • 1.2 Directory Structure
  • 1.3 Dependencies
  • 1.4 Model Structure

2 How to Use It

  • 2.1 Dataset
  • 2.2 Configuration File
  • 2.3 Validate and Split the Dataset
  • 2.4 Training the Model
  • 2.5 Batch Verification
  • 2.6 Start the Web Server
  • 2.7 Calling the API
  • 2.8 Deployment
  • 2.9 Deploying Multiple Models
  • 2.10 Stress Testing

3 Notes

4 Known Bugs

1 Project Introduction

1.1 About Captcha Recognition

Captcha recognition is a problem most crawler authors run into sooner or later, and it also makes a good introductory case for image recognition. The following approaches are commonly used at present:

| Method | Notes |
| --- | --- |
| tesseract | Only suitable for images without interference or distortion; training it is very troublesome |
| Other open-source recognition libraries | Not general enough; recognition rate unknown |
| Paid OCR APIs | Costly when demand is high |
| Image processing + machine-learning classification | Involves many techniques, has a high learning cost, and is not universal |
| Convolutional neural network | Has some learning cost, but works for many types of captcha |

Here is a look at the traditional image-processing + machine-learning route and the variety of techniques it involves:

  1. Image processing
  • Preprocessing (grayscale conversion, binarization)
  • Image segmentation
  • Cropping (removing borders)
  • Image filtering and noise reduction
  • Background removal
  • Color separation
  • Rotation
  2. Machine learning
  • KNN
  • SVM

Such methods demand a lot from the user, and because image types vary widely, the processing pipeline is not general; it often takes a great deal of time to tune the processing steps and the associated algorithms. With a convolutional neural network, most static character captchas can be recognized end to end after only simple preprocessing, with good results and strong generality.
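As a toy illustration of the first two preprocessing steps listed above (grayscale conversion and binarization), here is a minimal pure-Python sketch; the 2x2 "image" is invented, and real code would operate on Pillow or OpenCV arrays.

```python
# Toy sketch of grayscale conversion + binarization on a tiny RGB
# "image" held as nested lists (hypothetical data, not project code).
def to_gray(pixels):
    # standard luminance weights for RGB -> gray
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in pixels]

def binarize(gray, threshold=128):
    # pixels at/above the threshold become white (255), the rest black (0)
    return [[255 if v >= threshold else 0 for v in row] for row in gray]

img = [[(255, 255, 255), (0, 0, 0)],
       [(10, 10, 10), (200, 200, 200)]]
bw = binarize(to_gray(img))  # -> [[255, 0], [0, 255]]
```

A real captcha pipeline would chain more of the steps above (segmentation, denoising) before handing characters to a classifier.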

Here is a list of common captcha generation libraries:

Reference: a roundup of Java captcha libraries ("Java captcha family bucket")

1.2 Directory Structure

1.2.1 Basic Configuration

| # | File | Description |
| --- | --- | --- |
| 1 | sample.py | Configuration file |
| 2 | sample folder | Stores the datasets |
| 3 | model folder | Stores model files |

1.2.2 Training Model

| # | File | Description |
| --- | --- | --- |
| 1 | verify_and_split_data.py | Verifies the dataset and splits it into training and test sets |
| 2 | train_model.py | Trains the model |
| 3 | train_model_v2.py | Trains the model, outputting both training-set and validation-set accuracy during training; recommended |
| 4 | test_batch.py | Batch validation |
| 5 | gen_image/gen_sample_by_captcha.py | Script that generates captchas |
| 6 | gen_image/collect_labels.py | Collects captcha label statistics (often used for Chinese captchas) |

1.2.3 Web Interface

| # | File | Description |
| --- | --- | --- |
| 1 | recognition_object.py | Wrapped recognition class |
| 2 | recognize_api.py | Flask API providing online recognition |
| 3 | recognize_online.py | Example of recognition through the API |
| 4 | recognize_local.py | Example of testing local images |
| 5 | recognize_time_test.py | Stress-tests recognition time and request response time |

1.3 Dependencies

```shell
pip3 install tensorflow==1.7.0 flask==1.0.2 requests==2.19.1 Pillow==4.3.0 matplotlib==2.1.0 easydict==1.8
```

1.4 Model Structure

| Layer | Structure |
| --- | --- |
| input | Input |
| 1 | Convolutional layer + pooling layer + dropout layer + ReLU |
| 2 | Convolutional layer + pooling layer + dropout layer + ReLU |
| 3 | Convolutional layer + pooling layer + dropout layer + ReLU |
| 4 | Fully connected layer + dropout layer + ReLU |
| 5 | Fully connected layer + Softmax |
| output | Output |

2 How to Use It

2.1 Dataset

The original dataset can be stored in the ./sample/origin directory. To make processing easier, it is best to name each image like 2e8j_17322d3d4226f0b5c5a71d797d2ba7f7.jpg, i.e. label_serial.suffix.
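The naming convention can be sketched in a few lines; the filename below is the example from above, and parse_label is a hypothetical helper, not part of the project.

```python
# Hypothetical helper illustrating the label_serial.suffix convention:
# the label is everything before the first underscore.
def parse_label(filename):
    name = filename.rsplit(".", 1)[0]  # drop the suffix
    return name.split("_", 1)[0]       # keep the label part

label = parse_label("2e8j_17322d3d4226f0b5c5a71d797d2ba7f7.jpg")  # -> "2e8j"
```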

If you don't have a training set, you can generate one with gen_sample_by_captcha.py. Modify its configuration (paths, file suffix, character set, etc.) before generating.

2.2 Configuration File

Before starting a new project, you need to modify the relevant configuration file:

```python
# Image folders
sample_conf.origin_image_dir = "./sample/origin/"  # original files
sample_conf.train_image_dir = "./sample/train/"    # training set
sample_conf.test_image_dir = "./sample/test/"      # test set
sample_conf.api_image_dir = "./sample/api/"        # storage path for images received by the API
sample_conf.online_image_dir = "./sample/online/"  # stores images fetched from the captcha URL

# Model folder
sample_conf.model_save_dir = "./model/"  # storage path for trained models

# Image parameters
sample_conf.image_width = 80      # image width
sample_conf.image_height = 40     # image height
sample_conf.max_captcha = 4       # number of captcha characters
sample_conf.image_suffix = "jpg"  # image file suffix

# Captcha character parameters
# result categories for captcha recognition
sample_conf.char_set = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                        'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
                        'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
                        'u', 'v', 'w', 'x', 'y', 'z']

# Remote captcha URL
sample_conf.remote_url = "https://www.xxxxx.com/getImg"
```

The purpose of each configuration item is commented in the script. If your samples are Chinese captchas, you can use the gen_image/collect_labels.py script to collect label statistics and generate a labels.json file storing all labels. Set use_labels_json_file = True in the configuration file so that the contents of labels.json are read as the set of result categories.
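A minimal stdlib sketch of what such a label-collection step might look like (the real collect_labels.py may differ; the file names here are made up):

```python
import json

# Collect every distinct character appearing in the filename labels and
# dump the sorted set to labels.json (works for Chinese characters too).
def collect_labels(filenames):
    chars = set()
    for name in filenames:
        chars.update(name.split("_", 1)[0])  # one entry per character
    return sorted(chars)

labels = collect_labels(["2e8j_001.jpg", "ab12_002.jpg"])
with open("labels.json", "w", encoding="utf-8") as f:
    json.dump(labels, f, ensure_ascii=False)
```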

2.3 Validate and Split the Dataset

This step verifies the size of the original images, checks that each image can be opened, and splits the images into a training set and a test set at a 19:1 ratio. You therefore need to create and specify three folders: origin, train, and test, to store the related files.

You can also change these to other directories, though absolute paths are best. Once the folders are created, run the following command:

```shell
python3 verify_and_split_data.py
```

You will usually see a report like the following:

```
Total image count: 10094
==== The following 4 images are abnormal ====
[2123rd image] [325.txt] [incorrect file suffix]
[3515th image] [_15355300508855503.gif] [image label error]
[6413th image] [qwer_15355300721958663.gif] [abcd_15355300466073782.gif] [image cannot open]
========end
Test set (5%) and training set (95%) allocated: 10090 images were split into the training and test sets; the 4 abnormal images were left in the original directory.
Number of test images: 504
Number of training images: 9586
```
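The split itself can be sketched roughly as follows. This is only an illustration under the same 19:1 ratio (the real verify_and_split_data.py also checks image size and whether files open), and the file names are made up.

```python
import random

# Shuffle deterministically, then carve off ~5% as the test set.
def split_dataset(filenames, test_ratio=0.05, seed=0):
    files = sorted(filenames)
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_ratio)
    return files[n_test:], files[:n_test]  # (train, test)

# With 10090 valid images this yields 504 test and 9586 training images,
# matching the sample log above.
train, test = split_dataset(["img%05d.jpg" % i for i in range(10090)])
```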

2.4 Training the Model

Once the training and test sets are ready, you can start training the model. During training, a log shows the current training round, accuracy, and loss. The accuracy at this stage is measured on training-set images and reflects how well the training set is recognized, for example:

```
The 10th training >>> Accuracy: 1.0 >>> Loss: 0.0019966468680649996
```

TensorFlow installation is not covered here. After making sure the image parameters and directories are configured correctly, run the following command to start training:

```shell
python3 train_model.py
```

You can also call the class directly to start training, or run a simple recognition demo:

```python
from train_model import TrainModel
from sample import sample_conf

# import configuration
train_image_dir = sample_conf["train_image_dir"]
char_set = sample_conf["char_set"]
model_save_dir = sample_conf["model_save_dir"]

# verify defaults to False; with verify=True, every image with the
# configured suffix is format-checked before training
tm = TrainModel(train_image_dir, char_set, model_save_dir, verify=False)

tm.train_cnn()  # run training

tm.recognize_captcha()  # recognition demo
```

Added 2018.11.26: the train_model_v2.py file is also a training script. During training it additionally evaluates the validation set and outputs its accuracy, for example:

```
The 480th training
>>> [Training set] Accuracy: 1.0 >>> Loss: 0.0017373242881149054
>>> [Validation set] Accuracy: 0.9500000095367432 >>> Loss: 0.0017373242881149054
The accuracy of the validation set reached 99%; the model was saved successfully
```

Since the training set often does not contain all sample features, it is common for training-set accuracy to reach 100% while test-set accuracy remains below 100%. In this case, one way to improve accuracy is to add more correctly labeled negative samples.
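One caveat when reading accuracy numbers: whole-image accuracy (all characters correct) and per-character accuracy are different metrics. The sketch below is only illustrative and is not the project's actual metric code; the prediction strings are invented.

```python
# Compare whole-image accuracy vs per-character accuracy on toy data.
def accuracies(preds, labels):
    image_hits = sum(p == t for p, t in zip(preds, labels))
    char_hits = sum(pc == tc for p, t in zip(preds, labels)
                    for pc, tc in zip(p, t))
    char_total = sum(len(t) for t in labels)
    return image_hits / len(labels), char_hits / char_total

# one image fully correct, 7 of 8 characters correct
img_acc, char_acc = accuracies(["ab12", "cd34"], ["ab12", "cd39"])
```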

2.5 Batch Verification

Verify using the test-set images and output the accuracy:

```shell
python3 test_batch.py
```

You can also call the class to run validation:

```python
from test_batch import TestBatch
from sample import sample_conf

# import configuration
test_image_dir = sample_conf["test_image_dir"]
model_save_dir = sample_conf["model_save_dir"]
char_set = sample_conf["char_set"]
total = 100  # total number of images to verify

tb = TestBatch(test_image_dir, char_set, model_save_dir, total)
tb.test_batch()  # start validation
```

2.6 Start the Web Server

The project wraps classes that load the model and recognize images; once the web server is started, the recognition service can be used by calling the API. Start the web server:

```shell
python3 recognize_api.py
```

The API endpoint is http://127.0.0.1:6000/b

2.7 Calling the API

Calling the API with requests:

```python
import requests

url = "http://127.0.0.1:6000/b"
image_file_name = "captcha.jpg"
files = {'image_file': (image_file_name, open(image_file_name, 'rb'), 'application')}
r = requests.post(url=url, files=files)
```

The result is JSON:

```
{
    'time': '1542017705.9152594',
    'value': 'jsp1',
}
```

The file recognize_online.py is an example of online recognition through the API.

2.8 Deployment

For deployment, modify the last line of recognize_api.py to:

```python
app.run(host='0.0.0.0', port=5000, debug=False)
```

Then open access to the port so the service can be reached from the external network. In addition, to handle requests with multiple processes, you can deploy with a uWSGI + Nginx combination. For this part, see Flask deployment options.

2.9 Deploying Multiple Models

To deploy multiple models: in the recognize_api.py file, create an additional Recognizer object, and write new routing and recognition logic modeled on the original up_image function.

```python
Q = Recognizer(image_height, image_width, max_captcha, char_set, model_save_dir)
```

Note that this line needs to be modified accordingly:

```python
value = Q.rec_image(img)
```

2.10 Stress Testing

A simple stress-test script is provided that records the time spent on recognition and on the API request, though you will need to chart the data in Excel yourself. Open the file recognize_time_test.py and change the test_file path under main; the script repeatedly uses one image to hit the recognition API. The data is finally written to the test.csv file. Run the following command:

```shell
python3 recognize_time_test.py
```

The output looks like:

```
2938,5150,13:30:25, total time: 29ms, recognition: 15ms, request: 14ms
2939,5150,13:30:25, total time: 41ms, recognition: 21ms, request: 20ms
2940,5150,13:30:25, total time: 47ms, recognition: 16ms, request: 31ms
```

After running 20,000 requests against a single model, the resulting test.csv was analyzed with a box plot, which showed:

  • Average total API request time: 27ms
  • Average single recognition time: 12ms
  • Average request (transport) time: 15ms

That is: total API request time = recognition time + request time.
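Such averages can be computed from test.csv with the standard library alone. The three tuples below are the sample lines shown above; the column meaning (total/recognition/request times in ms) is assumed from that output rather than taken from the project's actual CSV layout.

```python
import statistics

# (total_ms, recognition_ms, request_ms) from the sample lines above
rows = [(29, 15, 14), (41, 21, 20), (47, 16, 31)]

avg_total = statistics.mean(r[0] for r in rows)
avg_recognition = statistics.mean(r[1] for r in rows)
avg_request = statistics.mean(r[2] for r in rows)
```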

3 Notes

  1. There are currently no log files saved for TensorBoard

4 Known Bugs

  1. Error launching recognize_api.py with PyCharm:

```
2018-12-01 00:35:15.106333: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1273] OP_REQUIRES failed at save_restore_tensor.cc:170 : Invalid argument: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./model/: Not found: FindFirstFile failed for: ./model : The system cannot find the path specified.
......
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./model/: Not found: FindFirstFile failed for: ./model : The system cannot find the path specified.
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
```

PyCharm set the working directory to a default, so reading the model folder by relative path failed. Solution: edit the run configuration and set the working directory to the project directory.

  2. FileNotFoundError: [Errno 2] No such file or directory: 'XXXXXX' means the path does not exist; create the folder at the specified location.

  3. The API program occupies more and more memory as it runs. Per the referenced material: an iterative loop must not contain any tensor-computation expressions, or memory will overflow. Placing the tensor-computation expression after init initialization, so that it is built only once, also greatly improves recognition speed.

  4. Loading multiple models fails because both Recognizer objects use the default Graph. The solution is to avoid the default graph when creating the objects and create a new Graph for each, so that every Recognizer uses a different Graph and there is no conflict.

  5. Concurrent execution of the Flask program can be implemented with Flask + uWSGI.