MMOCR official code: github.com/open-mmlab/…

First of all, thanks to the generous developers who open-sourced this code, and to the folks at SenseTime who keep maintaining and updating it. This post records some of my own usage notes, aimed at newcomers like me.

Using MMOCR

  1. Environment configuration

    For 30-series graphics cards, which only support CUDA 11, the environment configuration is a bit tricky. Here is the configuration process I used on a 3070:

    # mmocr for 3070
    
    conda create -n open-mmlab python=3.7 -y
    conda activate open-mmlab
    
    # install PyTorch 1.8.0 prebuilt with CUDA 11.1
    conda install pytorch==1.8.0 torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
    
    # install the latest mmcv-full
    pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html
    # install mmdetection
    pip install mmdet
    
    # install mmocr
    git clone https://github.com/open-mmlab/mmocr.git
    cd mmocr
    
    pip install -r requirements.txt
    pip install -v -e .  # or "python setup.py build_ext --inplace"
    export PYTHONPATH=$(pwd):$PYTHONPATH

    After running the code, you may hit a small problem with the COCO API: AttributeError: COCO object has no attribute 'get_cat_ids'. There are a couple of solutions:

    # Method 1
    git clone https://github.com/open-mmlab/cocoapi.git
    cd cocoapi/pycocotools
    pip install .

    # Method 2
    pip uninstall pycocotools
    pip install mmpycocotools

    Then we can run an official demo to check whether the environment is OK:

    python demo/ocr_image_demo.py demo/demo_text_det.jpg demo/output.jpg
  2. Prepare your own training data

    The official Datasets Preparation tutorial is already very detailed; I will just add a few notes.

    Text Detection data

    I converted my data into COCO format. For the COCO format itself, refer to Gemfield's write-up of the COCO annotation format. If you are using a standard academic dataset, the official tools directory contains scripts for converting between the various formats. The data directories are set in the corresponding .py file under configs, for example:

    # dataset type: 'TextDetDataset' uses .txt annotation files; the 'IcdarDataset' class uses .json annotation files
    dataset_type = 'TextDetDataset'
    # image directory prefix
    img_prefix = 'tests/data/toy_dataset/imgs'
    # the annotation file
    test_anno_file = 'tests/data/toy_dataset/instances_test.txt'

    I use the 'IcdarDataset' type of data, and the official configs mostly use this type as well. The most important part is the JSON annotation file, which you can compare against the official sample file. It is a dictionary whose main keys are "images", "categories" and "annotations":

    “images”

    The value of "images" is a list of dictionaries, where each element describes one image, for example:

    {"file_name": "training/0336.png"."height": 1200."width": 1600."segm_file": "training/0336.xml"."id": 0}
    • "file_name" is the image path; make sure the image can be read relative to the img_prefix directory prefix
    • "segm_file" is a per-image annotation file; it is optional, since the ground truth is actually defined under "annotations"
    • "id" is the image ID. This is important, because every annotation later refers back to an image ID

    “categories”

    The value of "categories" is also a list of dictionaries, i.e. the label categories. Since OCR detection only has the single category "text", one entry is enough; just copy and paste it:

    "categories": [{"id": 1."name": "text"}] 

    “annotations”

    The value of "annotations" is also a list of dictionaries; each element is one ground-truth instance that is ultimately read during training, for example:

    {"iscrowd": 0."category_id": 1."bbox": [213.16.370.1163]."area": 168314.0."segmentation": [[485.1179.306.991.252.800.213.608.215.413.274.214.402.16.535.130.471.291.296.460.301.620.365.777.490.931.583.1089]], "image_id": 0."id": 0}
    • "iscrowd": 0 means polygon-format segmentation; 1 means RLE-format segmentation. See the COCO format reference above
    • "category_id" is the target category; for OCR detection it is always text
    • "bbox" is the box ground truth in [x, y, w, h] format: the first two values are the coordinates of the top-left corner, and w and h are the width and height of the box
    • "area" is the segmentation area
    • "segmentation" is the polygon ground truth as [x1, y1, x2, y2, ...]; every pair of values is the coordinates of one point
    • "image_id" is the id of the image this annotation belongs to
    • "id" is very important: an image may contain multiple targets, and this ID must be globally unique across the whole dataset, i.e. it runs from 0 to the total number of instances. Do not restart it from 0 each time you move on to the next image

    Convert your own data into the JSON format above, save the .json file, and then fill in the corresponding paths in configs. When you run the code, if training runs but no loss is printed and only weight files are saved, your data format is wrong or the directory settings do not match. A conversion sketch follows below.
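    To make the globally unique "id" requirement concrete, here is a minimal conversion sketch. It assumes your raw labels are already loaded as one polygon list per image (the samples structure below is a hypothetical placeholder, not part of MMOCR); it only shows how the three keys fit together.

    import json

    def polygons_to_coco(samples, out_path):
        """samples: list of (file_name, width, height, polygons), where each
        polygon is a flat list [x1, y1, x2, y2, ...] in pixel coordinates."""
        images, annotations = [], []
        ann_id = 0  # globally unique annotation id, never reset per image
        for img_id, (file_name, width, height, polygons) in enumerate(samples):
            images.append({"file_name": file_name, "height": height,
                           "width": width, "id": img_id})
            for poly in polygons:
                xs, ys = poly[0::2], poly[1::2]
                x, y = min(xs), min(ys)
                w, h = max(xs) - x, max(ys) - y
                annotations.append({
                    "iscrowd": 0,
                    "category_id": 1,          # only one class: text
                    "bbox": [x, y, w, h],      # top-left corner plus width/height
                    "area": float(w * h),      # rough area; the exact polygon area also works
                    "segmentation": [poly],
                    "image_id": img_id,
                    "id": ann_id,
                })
                ann_id += 1
        coco = {"images": images,
                "categories": [{"id": 1, "name": "text"}],
                "annotations": annotations}
        with open(out_path, "w") as f:
            json.dump(coco, f)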

    Text Recognition data

    This type of data is simpler. Each annotation line only needs the image file name and the corresponding text label. The prerequisite is that each text region has already been cropped out into its own image.

    train_words/1001724.jpg Chiquita

    The first part is the image path (absolute and relative paths both work); the second part is the ground-truth text label.
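    If you still need to cut those regions out of the original images, below is a rough sketch of the cropping step using Pillow. The samples structure, directory names and file naming are hypothetical placeholders; it simply crops each box and writes one annotation line per crop in the format shown above.

    import os
    from PIL import Image

    def export_recog_data(samples, img_dir, out_dir, ann_path):
        """samples: list of (file_name, [([x, y, w, h], text), ...])."""
        os.makedirs(out_dir, exist_ok=True)
        lines = []
        crop_id = 0
        for file_name, instances in samples:
            image = Image.open(os.path.join(img_dir, file_name)).convert('RGB')
            for (x, y, w, h), text in instances:
                crop = image.crop((x, y, x + w, y + h))
                crop_name = '%07d.jpg' % crop_id
                crop.save(os.path.join(out_dir, crop_name))
                # one line per crop: <image path> <text label>
                lines.append('%s/%s %s' % (os.path.basename(out_dir), crop_name, text))
                crop_id += 1
        with open(ann_path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(lines))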

    In the config, train_prefix specifies the image directory prefix and train_ann_file specifies the annotation file location. The same goes for the test settings.

    dataset_type = 'OCRDataset'
    train_prefix = 'data/chinese/'
    train_ann_file = train_prefix + 'labels/train.txt'
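    For reference, in the official SAR configs these variables are wired into the dataset roughly like this (a sketch based on the 0.x config style; the loader/parser details may differ between versions, so compare with the config you actually use; train_pipeline is assumed to be defined earlier in the config):

    train = dict(
        type=dataset_type,          # 'OCRDataset'
        img_prefix=train_prefix,
        ann_file=train_ann_file,
        loader=dict(
            type='HardDiskLoader',
            repeat=1,
            parser=dict(
                type='LineStrParser',
                keys=['filename', 'text'],
                keys_idx=[0, 1],
                separator=' ')),    # space between image path and label
        pipeline=train_pipeline,
        test_mode=False)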

    Note also that text recognition needs a character dictionary, defined by dict_file. For Chinese text recognition, the official SAR model already has pre-trained weights, so you can simply download them and fine-tune on your own data; the results are very good.

    dict_file = 'data/chineseocr/labels/dict_printed_chinese_english_digits.txt'
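    If the official dictionary is missing characters that appear in your own labels, one simple option is to collect every character from the training annotation file and write them one per line, which is the layout the official dict files use (a sketch; the paths below are placeholders):

    # collect all characters from the recognition labels and write one per line
    chars = set()
    with open('data/chinese/labels/train.txt', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split(' ', 1)
            if len(parts) == 2:
                chars.update(parts[1])

    with open('data/chinese/labels/my_dict.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(sorted(chars)))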
  3. Training the model and testing the results

    Once the data is ready, the rest is straightforward; the official tutorial covers it in detail, so I will not repeat it. The training and testing scripts are .py files under tools and can be run directly. Here is a simple example:

    # --work-dir: where logs and weights are saved; --load-from: load a model before training;
    # --resume-from: resume training from a checkpoint; --gpus: number of GPUs; --gpu-ids: which GPU IDs to use
    
    python ./tools/train.py configs/textrecog/sar/sar_r31_parallel_decoder_chinese.py  --work-dir ./results/sar/  --load-from   checkpoints/sar_chineseocr.pth  --gpus 1 --gpu-ids 4

    You can just read the .py file to see which arguments it accepts. Evaluation is set in the last line of the config during training. The first lines contain the _base_ configs: the first one defines the optimizer, learning-rate schedule and so on, and the second defines the checkpoint save interval and other runtime settings.

    evaluation = dict(interval=10, metric='hmean-iou')
    _base_ = [
        '../../_base_/schedules/schedule_1200e.py', '../../_base_/default_runtime.py'
    ]
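    Testing follows the same pattern with tools/test.py; roughly (the config and checkpoint paths are placeholders, and the metric should match the evaluation setting in your config):

    # ${CONFIG_FILE} and ${CHECKPOINT_FILE} are placeholders for your config and trained weights
    python ./tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --eval hmean-iou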

More to come in follow-up posts, if any…