Introduction to the data

The project comes from a Tianchi competition (see the project address). The project data is based on The Street View House Numbers (SVHN) dataset, built from house numbers in Google Street View images; there is also a corresponding Kaggle contest.

The data consists of house numbers from real-world scenes. The training set contains 30,000 images and the validation set contains 10,000 images; each sample includes a color image together with its character labels and their locations. To keep the competition fair, test set A contains 40,000 images and test set B contains another 40,000. The organizers have already split the training and validation sets for us.



As shown in the figure below, the files are:

mchar_train.zip: training images
mchar_val.zip: validation images
mchar_test_a.zip: test images
mchar_train.json: training image annotations
mchar_val.json: validation image annotations
mchar_sample_submit_A.csv: submission format file

The images are annotated in JSON format with the following fields:

Field    Description
top      Y coordinate of the character's upper-left corner
height   Character height
left     X coordinate of the character's upper-left corner
width    Character width
label    Character label

View the data

Before building the dataset, let's visualize the data to get a general idea of it.

The file paths are as follows, saved as a dictionary:

data_dir = {
    'train_data': '/content/data/mchar_train/',
    'val_data': '/content/data/mchar_val/',
    'test_data': '/content/data/mchar_test_a/',
    'train_label': '/content/drive/My Drive/Data/Datawhale-DigitsRecognition/mchar_train.json',
    'val_label': '/content/drive/My Drive/Data/Datawhale-DigitsRecognition/mchar_val.json',
    'submit_file': '/content/drive/My Drive/Data/Datawhale-DigitsRecognition/mchar_sample_submit_A.csv'
}
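
The snippets below assume imports along these lines; they are not shown in the original listing, so treat this as a sketch and adjust it to your environment:

# Assumed imports for the snippets in this post (not part of the original listing)
from glob import glob
import json
import os

import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

import torch as t
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms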
  • View the number of images

    def data_summary():
        train_list = glob(data_dir['train_data']+'*.png')
        test_list = glob(data_dir['test_data']+'*.png')
        val_list = glob(data_dir['val_data']+'*.png')
        print('train image counts: %d'%len(train_list))
        print('val image counts: %d'%len(val_list))
        print('test image counts: %d'%len(test_list))
    
    data_summary()
    train image counts: 30000
    val image counts: 10000
    test image counts: 40000
    
  • View the annotation file contents

    def look_train_json():
        with open(data_dir['train_label'], 'r', encoding='utf-8') as f:
            content = json.load(f)
        print(content['000000.png'])
    
    look_train_json()
    {'height': [219, 219], 'label': [1, 9], 'left': [246, 323], 'top': [77, 81], 'width': [81, 96]}
    
  • View the submission file format

    def look_submit():
        df = pd.read_csv(data_dir['submit_file'], sep=',')
        print(df.head(5))
    
    look_submit()
        file_name  file_code
    0  000000.png          0
    1  000001.png          0
    2  000002.png          0
    3  000003.png          0
    4  000004.png          0
    
  • View the bounding boxes on the images

    def plot_samples():
        imgs = glob(data_dir['train_data']+'*.png')
        fig, ax = plt.subplots(figsize=(12, 8), ncols=2, nrows=2)
        marks = json.loads(open(data_dir['train_label'], 'r').read())
        for i in range(4):
            img_name = os.path.split(imgs[i])[-1]
            mark = marks[img_name]
            img = Image.open(imgs[i])
            img = np.array(img)
            # rows are left, top, width, height; columns index the characters
            bboxes = np.array([mark['left'], mark['top'], mark['width'], mark['height']])
            ax[i//2, i%2].imshow(img)
            for j in range(len(mark['label'])):
                # define a Rectangle patch for character j
                rect = Rectangle(bboxes[:, j][:2], bboxes[:, j][2], bboxes[:, j][3],
                                 facecolor='none', edgecolor='r')
                ax[i//2, i%2].text(bboxes[:, j][0], bboxes[:, j][1], mark['label'][j])
                # draw the rectangle on the image
                ax[i//2, i%2].add_patch(rect)
        plt.show()
    
    plot_samples()

  • View the width and height distribution of the training images

    def img_size_summary():
        sizes = []
    
        for img in glob(data_dir['train_data']+'*.png'):
            img = Image.open(img)
    
            sizes.append(img.size)
    
        sizes = np.array(sizes)
    
        plt.figure(figsize=(10, 8))
        plt.scatter(sizes[:, 0], sizes[:, 1])
        plt.xlabel('Width')
        plt.ylabel('Height')
    
        plt.title('image width-height summary')
        plt.show()
        return np.mean(sizes, axis=0), np.median(sizes, axis=0)
    
    mean, median = img_size_summary()
    print('mean: ', mean)
    print('median: ', median)
    



    It can be seen that the training images vary widely in size; in most cases the width is greater than the height, and the width varies more than the height. The network input size can then be chosen based on the median or mean, for example as in the small sketch below.
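
    One simple way to turn this into a concrete input size (my own heuristic, not part of the baseline) is to round the median down to a multiple of 32:

    # hypothetical helper: pick an input size from the median (width, height)
    cand_w = int(median[0]) // 32 * 32
    cand_h = int(median[1]) // 32 * 32
    print('candidate input size (w, h): %d x %d' % (cand_w, cand_h))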

  • View the bounding box size distribution

    def bbox_summary():
        marks = json.loads(open(data_dir['train_label'], 'r').read())
        bboxes = []
    
        for img, mark in marks.items():
            for i in range(len(mark['label'])):
                bboxes.append([mark['left'][i], mark['top'][i], mark['width'][i], mark['height'][i]])
    
        bboxes = np.array(bboxes)
    
        fig, ax = plt.subplots(figsize=(12, 8))
        ax.scatter(bboxes[:, 2], bboxes[:, 3])
        ax.set_title('bbox width-height summary')
        ax.set_xlabel('width')
        ax.set_ylabel('height')
        plt.show()
    
    bbox_summary()



    If character recognition is approached as an object detection problem, K-means clustering on the bounding boxes can be used to determine the anchor sizes, as sketched below.
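
    A minimal sketch of such clustering (scikit-learn and the helper name are my own assumptions, not part of the baseline):

    from sklearn.cluster import KMeans  # assumption: scikit-learn is available

    def estimate_anchors(n_anchors=5):
        marks = json.load(open(data_dir['train_label'], 'r'))
        wh = []
        for img, mark in marks.items():
            for i in range(len(mark['label'])):
                wh.append([mark['width'][i], mark['height'][i]])
        km = KMeans(n_clusters=n_anchors, random_state=0).fit(np.array(wh))
        # each cluster center is a candidate (width, height) anchor
        return np.round(km.cluster_centers_).astype(int)

    print(estimate_anchors())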

  • View the number of occurrences of each character class

    def label_nums_summary():
        marks = json.load(open(data_dir['train_label'], 'r'))
    
        dicts = {i: 0 for i in range(10)}
        for img, mark in marks.items():
            for lb in mark['label']:
                dicts[lb] += 1
    
        xticks = list(range(10))
        fig, ax = plt.subplots(figsize=(10, 8))
        ax.bar(x=list(dicts.keys()), height=list(dicts.values()))
        ax.set_xticks(xticks)
        plt.show()
        return dicts
    
    print(label_nums_summary())

    As you can see, the classes are fairly balanced overall, except that the digit 1 appears more frequently; there is no extreme imbalance. A weighted cross-entropy loss can be considered for the later classification stage, for example as sketched below.
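
    For reference, a sketch of how such a weighted loss could be set up (the weighting scheme here is an assumption, not the baseline's choice):

    counts = label_nums_summary()                      # class counts from above
    freq = t.tensor([counts[i] for i in range(10)], dtype=t.float)
    weights = freq.sum() / (freq * len(freq))          # rarer classes get larger weights
    criterion = t.nn.CrossEntropyLoss(weight=weights)  # assumes a 10-class classifier head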

  • See how many digits appear in each image

    def label_summary():
        marks = json.load(open(data_dir['train_label'], 'r'))
        dicts = {}
        for img, mark in marks.items():
            if len(mark['label']) not in dicts:
                dicts[len(mark['label'])] = 0
            dicts[len(mark['label'])] += 1
        dicts = sorted(dicts.items(), key=lambda x: x[0])
        for k, v in dicts:
            print('Number of %d-digit pictures: %d' % (k, v))
    
    label_summary()
    Number of 1-digit pictures: 4636
    Number of 2-digit pictures: 16262
    Number of 3-digit pictures: 7813
    Number of 4-digit pictures: 1280
    Number of 5-digit pictures: 8
    Number of 6-digit pictures: 1

    As you can see, only one image contains 6 digits; it may be an outlier and could be discarded. Images with 1 to 4 digits make up almost the entire training set.

Building a data set

Here we follow the Baseline provided by Datawhale. Since each image contains no more than six digits, character recognition is treated as a fixed-length classification problem for simplicity.
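
For example, each label can be truncated or padded to a fixed length, with an extra class 10 standing for "no digit". The helper name below is just for illustration; the dataset class further down does the same thing inline:

def encode_label(label, max_len=5, blank=10):
    # e.g. [1, 9] -> [1, 9, 10, 10, 10]
    return label[:max_len] + (max_len - len(label)) * [blank]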

Here we define a custom dataset, DigitsDataset, which inherits from torch.utils.data.Dataset. Data augmentation uses torchvision.transforms, with only regular augmentation operations such as rotation, random grayscale, and color jitter.

class DigitsDataset(Dataset):
    """
    DigitsDataset

    Params:
        data_dir(string): data directory
        label_path(string): label path
        aug(bool): whether to do image augmentation, default: True
    """
    def __init__(self, data_dir, label_path, size=(64, 128), aug=True):
        super(DigitsDataset, self).__init__()
        self.imgs = glob(data_dir+'*.png')
        self.aug = aug
        self.size = size
        if label_path is None:
            self.labels = None
        else:
            self.labels = json.load(open(label_path, 'r'))
            self.imgs = [(img, self.labels[os.path.split(img)[-1]])
                         for img in self.imgs if os.path.split(img)[-1] in self.labels]

    def __getitem__(self, idx):
        if self.labels:
            img, label = self.imgs[idx]
        else:
            img = self.imgs[idx]
            label = None
        img = Image.open(img)
        trans0 = [
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ]
        # pick the resize size depending on the image aspect ratio relative to the crop size
        min_size = self.size[0] if (img.size[1] / self.size[1]) < (img.size[0] / self.size[0]) else self.size[1]
        trans1 = [
            transforms.Resize(min_size),
            transforms.CenterCrop(self.size)
        ]
        if self.aug:
            trans1.extend([
                transforms.ColorJitter(0.1, 0.1, 0.1),
                transforms.RandomGrayscale(0.1),
                transforms.RandomAffine(10, translate=(0.05, 0.1), shear=5)
            ])
        trans1.extend(trans0)
        img = transforms.Compose(trans1)(img)
        if self.labels:
            # pad the label to a fixed length of 5 with the "blank" class 10
            return img, t.tensor(label['label'][:5] + (5 - len(label['label'])) * [10]).long()
        else:
            return img, self.imgs[idx]

    def __len__(self):
        return len(self.imgs)
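
A minimal usage sketch (the batch size and worker count below are arbitrary choices, not the baseline's):

dataset = DigitsDataset(data_dir['train_data'], data_dir['train_label'])
train_loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)
imgs, labels = next(iter(train_loader))
print(imgs.shape, labels.shape)  # expected: [64, 3, 64, 128] and [64, 5]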

Let's take a look at the effect of the data augmentation:

fig, ax = plt.subplots(figsize=(6, 12), nrows=4, ncols=2)
for i in range(8):
    # dataset is the DigitsDataset built on the training data above
    img, label = dataset[i]
    # undo the normalization so the image can be displayed
    img = img * t.tensor([0.229, 0.224, 0.225]).view(3, 1, 1) + \
          t.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    ax[i//2, i%2].imshow(img.permute(1, 2, 0).numpy())
    ax[i//2, i%2].set_xticks([])
    ax[i//2, i%2].set_yticks([])
plt.show()

Conclusion

This post mainly covers data preparation and dataset construction, without any advanced or complex operations. The goal is to set up a basic data pipeline onto which other operations can be added more conveniently later.

In the next post, I'll focus on data augmentation, which allows for some more complex operations.

Reference

PyTorch tutorial: building datasets
Datawhale street view character recognition Baseline