Abstract: Here is one way to expand a labeled dataset.

Preface

Some time ago I looked into questions raised by several users and found that many of them were training with only a handful of samples; some even had only 5 images in a category. The ModelArts platform allows training with as few as 5 images per class, but that lower limit is usually far from enough for an industrial-scale application. So I contacted the user and suggested adding more pictures, a few thousand of them, for training. However, the user later reported that the annotation workload would be too large. I thought it over, analyzed his scenario, and made a few tactical changes. Here is one of the ways to expand a labeled dataset.

Dataset overview

Because the dataset belongs to the user, it cannot be shown here, so an open-source dataset is used instead for illustration. This is a classification problem: detect defects on the surface of industrial parts and judge whether a part is defective. Sample images are shown below:

These are the surfaces of two solar cells. The one on the left is normal; the one on the right is incomplete and therefore defective. We need a model that distinguishes these two kinds of images so that we can locate which solar cells have problems. The training set has 754 normal samples and 358 defective samples; the validation set is similar, with 754 normal samples and 357 defective samples. In total there are roughly 2,000 images, which is a very small sample given the usual industrial requirement of a model with 95% accuracy. First, I took this dataset as-is and used PyTorch to train a ResNet50 model pretrained on ImageNet. The overall accuracy (ACC) was about 86.06%; the recall of the normal class was 97.3%, but that of the defective class was only 62.9%, which could not meet the user's expectations.
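As a rough illustration of that baseline (not the exact script used here), a minimal PyTorch sketch of fine-tuning an ImageNet-pretrained ResNet50 on a two-class image folder might look like the following; the directory layout, image size, and hyperparameters are assumptions:

import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Assumed layout: data/train/normal and data/train/defect hold the 300x300
# grayscale cell images; the paths and hyperparameters are illustrative only.
train_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # ResNet50 expects 3 channels
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder('data/train', transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = models.resnet50(pretrained=True)           # ImageNet weights
model.fc = nn.Linear(model.fc.in_features, 2)      # two classes: normal / defect
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for epoch in range(10):                            # epoch count is illustrative
    model.train()
    for images, target in train_loader:
        images, target = images.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), target)
        loss.backward()
        optimizer.step()

Replacing the final fully connected layer while reusing the pretrained backbone is the usual transfer-learning setup for small datasets like this one.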

When I asked the user to collect more data, ideally on the order of ten thousand images, he pointed out that acquiring the images from the production line and annotating them is troublesome, and asked whether there was another way to save some of the workload. Data is the soul of deep learning training, so what can we do?

After thinking it over for a while, I remembered the intelligent annotation and manual verification functions on ModelArts, so I asked the user to try those first. Meanwhile, I took the dataset he had given me and dug into it myself. A quick look at the literature shows that common approaches to few-shot learning (FSL) basically start from two directions: the data itself, or the model training itself, i.e., doing something with the features extracted from the images. The idea here is to start with the data itself.

First, observe the dataset: all images are 300x300 grayscale pictures, each showing the full front view of a solar cell surface. These are nicely pre-processed images. For pictures like this, flipping has little influence on the overall structure, so the first thing we can do is a flip operation to increase the diversity of the data. A flipped sample looks like this:
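For reference, a minimal sketch of such an offline flip amplification with PIL might look like the following; the folder names are placeholders, not the actual directories used in this project:

import os
from PIL import Image

# Apply a horizontal flip to every training image; one extra sample per
# original image doubles the dataset. Paths are illustrative assumptions.
for class_dir in ['data/train/normal', 'data/train/defect']:
    for fname in os.listdir(class_dir):
        img = Image.open(os.path.join(class_dir, fname))
        flipped = img.transpose(Image.FLIP_LEFT_RIGHT)
        flipped.save(os.path.join(class_dir, 'flip_' + fname))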

This takes the training set from about 1,100 images to about 2,200, which is still not a lot, and simply eyeballing the data does not suggest a good way to scale it further. At this point I thought of using the model evaluation function of ModelArts to assess how well the model generalizes to the data. This is done by calling the analyse interface of the provided SDK, deep_moxing.model_analysis:

def validate(val_loader, model, criterion, args):
    batch_time = AverageMeter('Time', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    progress = ProgressMeter(len(val_loader),
                             [batch_time, losses, top1, top5],
                             prefix='Test: ')

    pred_list = []
    target_list = []

    # switch to evaluate mode
    model.eval()

    with torch.no_grad():
        end = time.time()
        for i, (images, target) in enumerate(val_loader):
            if args.gpu is not None:
                images = images.cuda(args.gpu, non_blocking=True)
            target = target.cuda(args.gpu, non_blocking=True)

            # compute output
            output = model(images)
            loss = criterion(output, target)

            # collect the logits and labels needed by the analyse interface
            pred_list += output.cpu().numpy()[:, :2].tolist()
            target_list += target.cpu().numpy().tolist()

            # measure accuracy and record loss
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            losses.update(loss.item(), images.size(0))
            top1.update(acc1[0], images.size(0))
            top5.update(acc5[0], images.size(0))

            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()

            if i % args.print_freq == 0:
                progress.display(i)

        # TODO: this should also be done with the ProgressMeter
        print(' * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
              .format(top1=top1, top5=top5))

    # image storage paths, taken from the dataset's (path, label) samples
    name_list = val_loader.dataset.samples
    for idx in range(len(name_list)):
        name_list[idx] = name_list[idx][0]

    # analyse comes from the ModelArts SDK: deep_moxing.model_analysis
    analyse(task_type='image_classification',
            save_path='/home/image_labeled/',
            pred_list=pred_list,
            label_list=target_list,
            name_list=name_list)

    return top1.avg

Most of the code above is the standard validation code from the PyTorch ImageNet training example. The analyse interface needs three lists: pred, the raw logits output by the model; target (label), the actual category of each image; and name, the storage path of each image. Calling the interface as above generates a JSON file in the save_path directory; put that file into the ModelArts training output directory and the model analysis appears under the evaluation results. Here I took the JSON file generated offline and uploaded it to view the visualization online. The sensitivity analysis results are as follows:

What this chart shows is how accurate the model is on images falling into different ranges of each feature value. For example, the first item of the brightness sensitivity analysis, 0%-20%, can be read as: for class 0, the accuracy on low-brightness images is much lower than on images under other brightness conditions. Overall, for category 1, the class we mainly want to detect, the model is very sensitive to the brightness and sharpness of the image; in other words, it does not handle changes in these two image features well. Isn't that exactly the direction in which I should amplify the dataset?
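Guided by that finding, the amplification can target exactly these two features. Below is a minimal sketch using PIL's ImageEnhance; the enhancement factor ranges and paths are assumptions chosen for illustration, not the exact settings used in this experiment:

import os
import random
from PIL import Image, ImageEnhance

# Offline brightness/sharpness amplification for one class folder; the path
# and the factor ranges below are illustrative assumptions.
src_dir = 'data/train/defect'
for fname in list(os.listdir(src_dir)):
    img = Image.open(os.path.join(src_dir, fname))
    # factor < 1 darkens, > 1 brightens
    bright = ImageEnhance.Brightness(img).enhance(random.uniform(0.6, 1.4))
    bright.save(os.path.join(src_dir, 'bright_' + fname))
    # factor < 1 blurs, > 1 sharpens
    sharp = ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 2.0))
    sharp.save(os.path.join(src_dir, 'sharp_' + fname))

The same transforms can be run over the whole training set, or, as in the second round below, only over the defective class to balance the two categories.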

OK. So first I tried amplifying the full dataset this way, obtained 2,210 normal images and 1,174 defective images, threw them into PyTorch with the same training strategy, and got these results:

What’s going on? It’s not quite what I expected…

After re-analyzing the dataset, it occurred to me that industrial datasets like this often suffer from class imbalance. Here the ratio is only about 2:1, but the detection requirement is much stricter for the defective category, so the model should lean towards learning the defective class, and the sensitivity analysis above was also mainly about the defective class; in effect, the samples are still unbalanced. Therefore, the latter two enhancement methods, brightness and sharpness, were applied only to category 1, the defective class. This finally yields about 3,000 images, 1,508 of the normal class and 1,432 of the defective class, so the samples are relatively balanced. Throwing these into ResNet50 with the same strategy gives the following accuracy:

As can be seen, on the same validation set of 754 normal samples and 357 defective samples, the overall accuracy (ACC1) improved by nearly 3%, and the recall of the key defective category improved by 8.4%! Not bad. So directly expanding the dataset is effective, and it makes sense to use the model evaluation to decide which amplification methods to apply. Of course, it is also important to rule out problems in the original dataset, such as class imbalance. With this kind of case-by-case analysis, the amplification method becomes simple and practical.

Then, based on the results and datasets of this experiment, I changed some training strategies for the user and used a better network to meet the requirements. That part is the result of customized analysis and will not be explained in detail here; it may be covered in future blog posts.

The reference dataset is from:

Buerhop-Lutz, C.; Deitsch, S.; Maier, A.; Gallwitz, F.; Berger, S.; Doll, B.; Hauch, J.; Camus, C. & Brabec, C. J. A Benchmark for Visual Identification of Defective Solar Cells in Electroluminescence Imagery. European PV Solar Energy Conference and Exhibition (EU PVSEC), 2018. DOI: 10.4229/35thEUPVSEC20182018-5CV.3.15

Deitsch, S.; Buerhop-Lutz, C.; Maier, A. K.; Gallwitz, F. & Riess, C. Segmentation of Photovoltaic Module Cells in Electroluminescence Images. CoRR, 2018, abs/1806.06530

Deitsch, S.; Christlein, V.; Berger, S.; Buerhop-Lutz, C.; Maier, A.; Gallwitz, F. & Riess, C. Automatic classification of defective photovoltaic module cells in electroluminescence images. Solar Energy, Elsevier BV, 2019, 185, 455-468. DOI: 10.1016/j.solener.2019.02.067
