Author: Chen Xinda, Datawhale, Shanghai University of Science and Technology

AI technology has reached many aspects of our lives, and object detection is one of its most widely used algorithms. You can find object detection in epidemic temperature-screening devices, inspection robots, and even He Tongxue's AirDesk. The picture below shows the AirDesk: it locates the phone with an object detection algorithm and then moves the wireless charging coil underneath the phone to charge it automatically. This seemingly simple application is backed by complex theory and many iterations of AI algorithms. Today, the author will show you how to quickly get started with the object detection model YOLOv5 and apply it to emotion recognition.

1. Background

Today's content is based on a paper published in TPAMI 2019 [1]. Many researchers have used AI algorithms to recognize human emotion before, but the authors of this paper argue that emotion is related not only to facial expressions and body movements but also closely tied to the surrounding environment. For example, the boy's expression below appears to be surprise:

But once the surroundings are added, the real emotion turns out to be different from what we just guessed:

The main idea of this paper is to recognize emotion by combining the background image with the person detected by an object detection model.

The author divides emotion into discrete categories and continuous dimensions. Both are explained below; if you are already familiar with them, feel free to skip ahead.

Continuous emotion dimensions:
Valence (V) measures how positive or pleasant an emotion is, ranging from negative to positive
Arousal (A) measures the agitation level of the person, ranging from non-active/calm to agitated/ready to act
Dominance (D) measures the level of control a person feels over the situation, ranging from submissive/non-control to dominant/in-control
Discrete emotion categories:
Affection fond feelings; love; tenderness
Anger intense displeasure or rage; furious; resentful
Annoyance bothered by something or someone; irritated; impatient; frustrated
Anticipation state of looking forward; hoping on or getting prepared for possible future events
Aversion feeling disgust, dislike, repulsion; feeling hate
Confidence feeling of being certain; conviction that an outcome will be favorable; encouraged; proud
Disapproval feeling that something is wrong or reprehensible; contempt; hostile
Disconnection feeling not interested in the main event of the surrounding; indifferent; bored; distracted
Disquietment nervous; worried; upset; anxious; tense; pressured; alarmed
Doubt/Confusion difficulty to understand or decide; thinking about different options
Embarrassment feeling ashamed or guilty
Engagement paying attention to something; absorbed into something; curious; interested
Esteem feelings of favourable opinion or judgement; respect; admiration; gratefulness
Excitement feeling enthusiasm; stimulated; energetic
Fatigue weariness; tiredness; sleepy
Fear feeling suspicious or afraid of danger, threat, evil or pain; horror
Happiness feeling delighted; feeling enjoyment or amusement
Pain physical suffering
Peace well being and relaxed; no worry; having positive thoughts or sensations; satisfied
Pleasure feeling of delight in the senses
Sadness feeling unhappy, sorrow, disappointed, or discouraged
Sensitivity feeling of being physically or emotionally wounded; feeling delicate or vulnerable
Suffering psychological or emotional pain; distressed; anguished
Surprise sudden discovery of something unexpected
Sympathy state of sharing others emotions, goals or troubles; supportive; compassionate
Yearning strong desire to have something; jealous; envious; lust

2. Preparation and Model Inference

2.1 Quick Start

Just complete the five steps below to identify the emotion!

  1. Download the project locally by clone or zip: git clone github.com/chenxindaaa…

  2. Place the decompressed model files in emotic/debug_exp/models. (Model file download address: gas.graviti.com/dataset/dat…

  3. Create a virtual environment (optional):

conda create -n emotic python=3.7
conda activate emotic
  4. Configure the environment:
python -m pip install -r requirement.txt
  5. cd into the emotic folder and run:
python detect.py

After running, the results are saved in the emotic/runs/detect folder.

2.2 Basic Principles

What should I do if I want to recognize a different image? Does it support video and webcam input? How do you modify the YOLOv5 code in practice?

For the first two questions, YOLOv5 already solves them for us; we only need to modify line 158 in detect.py:

parser.add_argument('--source', type=str, default='./testImages', help='source')  # file/folder, 0 for webcam

Change './testImages' to the path of the image, video, or folder you want to recognize. To use the camera, change './testImages' to '0' and camera 0 will be used for detection.
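For reference, here are a few example values for that default (the file paths below are hypothetical and only illustrate the expected format):

# Hypothetical examples of the --source default (paths are illustrative only):
parser.add_argument('--source', type=str, default='./testImages', help='source')  # a folder of images
# default='./images/demo.jpg'  -> a single image
# default='./videos/demo.mp4'  -> a single video file
# default='0'                  -> webcam 0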

Modify YOLOv5:

In detect.py, the most important lines of code are the following:

for *xyxy, conf, cls in reversed(det):
    c = int(cls)  # integer class
    if c != 0:
        continue
    pred_cat, pred_cont = inference_emotic(im0, (int(xyxy[0]), int(xyxy[1]), int(xyxy[2]), int(xyxy[3])))
    if save_img or opt.save_crop or view_img:  # Add bbox to image
        label = None if opt.hide_labels else (names[c] if opt.hide_conf else f'{names[c]} {conf:.2f}')
        plot_one_box(xyxy, im0, pred_cat=pred_cat, pred_cont=pred_cont, label=label, color=colors(c, True), line_thickness=opt.line_thickness)
        if opt.save_crop:
            save_one_box(xyxy, imc, file=save_dir / 'crops' / names[c] / f'{p.stem}.jpg', BGR=True)

Here det is the result returned by YOLOv5. For example, tensor([[121.00000, 7.00000, 480.00000, 305.00000, 0.67680, 0.00000], [278.00000, 166.00000, 318.00000, 305.00000, 0.66222, 27.00000]]) means two objects were detected.

xyxy is the coordinates of the object's detection box. For the first object in the example above, xyxy = [121.00000, 7.00000, 480.00000, 305.00000] corresponds to the points (121, 7) and (480, 305); two points determine a rectangle, namely the detection box. conf is the confidence of the object; the first object has a confidence of 0.67680. cls is the class of the object, where 0 corresponds to "person". Since we only recognize people's emotions, detections whose cls is not 0 are skipped. Here I use the pretrained inference model officially released by YOLOv5, which covers many classes; you can also train a "person"-only model yourself. For details, please refer to:
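As a minimal, self-contained illustration (using the example detections above; this is only a sketch, not part of the project code), the rows of det can be unpacked like this:

import torch

# Example detections in YOLOv5's [x1, y1, x2, y2, confidence, class] format
det = torch.tensor([[121.0, 7.0, 480.0, 305.0, 0.67680, 0.0],
                    [278.0, 166.0, 318.0, 305.0, 0.66222, 27.0]])

for *xyxy, conf, cls in reversed(det):
    c = int(cls)
    if c != 0:      # class 0 is "person"; skip every other class
        continue
    x1, y1, x2, y2 = (int(v) for v in xyxy)
    print(f'person at ({x1}, {y1})-({x2}, {y2}), confidence {conf:.5f}')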

Once the object coordinates are obtained, they can be fed into the emotic model to obtain the corresponding emotions, namely:

pred_cat, pred_cont = inference_emotic(im0, (int(xyxy[0]), int(xyxy[1]), int(xyxy[2]), int(xyxy[3])))

Here I have made some changes to the original visualization code so that the emotic results are printed on the image:

def plot_one_box(x, im, pred_cat, pred_cont, color=(128, 128, 128), label=None, line_thickness=3):
    # Plots one bounding box on image 'im' using OpenCV
    assert im.data.contiguous, 'Image not contiguous. Apply np.ascontiguousarray(im) to plot_on_box() input image.'
    tl = line_thickness or round(0.002 * (im.shape[0] + im.shape[1]) / 2) + 1  # line/font thickness
    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))
    cv2.rectangle(im, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)
    if label:
        tf = max(tl - 1, 1)  # font thickness
        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]
        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3
        cv2.rectangle(im, c1, c2, color, -1, cv2.LINE_AA)  # filled
        # cv2.putText(im, label, (c1[0], c1[1] - 2), 0, tl / 3, [225, 255, 255], thickness=tf, lineType=cv2.LINE_AA)
        for id, text in enumerate(pred_cat):
            cv2.putText(im, text, (c1[0], c1[1] + id * 20), 0, tl / 3, [225, 255, 255], thickness=tf, lineType=cv2.LINE_AA)

Running results:

After completing the above steps, we can put the model to work. It is well known that Trump has won over many voters with the unique charm of his speeches. Here we take a look at one of Trump's speeches through the eyes of AI:

It can be seen that self-confidence is one of the necessary conditions for convincing people.

3. Model training

3.1 Data preprocessing

Using Graviti TensorBay, we can preprocess the data without downloading the dataset locally and save the results (the following code is not in the project; you need to create a new .py file to run it, and remember to fill in your AccessKey):

from tensorbay import GAS
from tensorbay.dataset import Dataset
import numpy as np
from PIL import Image
import cv2
from tqdm import tqdm
import os

def cat_to_one_hot(y_cat):
    cat2ind = {'Affection': 0, 'Anger': 1, 'Annoyance': 2, 'Anticipation': 3, 'Aversion': 4,
               'Confidence': 5, 'Disapproval': 6, 'Disconnection': 7, 'Disquietment': 8,
               'Doubt/Confusion': 9, 'Embarrassment': 10, 'Engagement': 11, 'Esteem': 12,
               'Excitement': 13, 'Fatigue': 14, 'Fear': 15, 'Happiness': 16, 'Pain': 17,
               'Peace': 18, 'Pleasure': 19, 'Sadness': 20, 'Sensitivity': 21, 'Suffering': 22,
               'Surprise': 23, 'Sympathy': 24, 'Yearning': 25}
    one_hot_cat = np.zeros(26)
    for em in y_cat:
        one_hot_cat[cat2ind[em]] = 1
    return one_hot_cat

gas = GAS('fill in your AccessKey')
dataset = Dataset("Emotic", gas)
segments = dataset.keys()

save_dir = './data/emotic_pre'
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

for seg in ['test', 'val', 'train']:
    segment = dataset[seg]
    context_arr, body_arr, cat_arr, cont_arr = [], [], [], []
    for data in tqdm(segment):
        with data.open() as fp:
            context = np.asarray(Image.open(fp))
        if len(context.shape) == 2:
            context = cv2.cvtColor(context, cv2.COLOR_GRAY2RGB)
        context_cv = cv2.resize(context, (224, 224))
        for label_box2d in data.label.box2d:
            xmin = label_box2d.xmin
            ymin = label_box2d.ymin
            xmax = label_box2d.xmax
            ymax = label_box2d.ymax
            body = context[ymin:ymax, xmin:xmax]
            body_cv = cv2.resize(body, (128, 128))
            context_arr.append(context_cv)
            body_arr.append(body_cv)
            cont_arr.append(np.array([int(label_box2d.attributes['valence']),
                                      int(label_box2d.attributes['arousal']),
                                      int(label_box2d.attributes['dominance'])]))
            cat_arr.append(np.array(cat_to_one_hot(label_box2d.attributes['categories'])))
    context_arr = np.array(context_arr)
    body_arr = np.array(body_arr)
    cat_arr = np.array(cat_arr)
    cont_arr = np.array(cont_arr)
    np.save(os.path.join(save_dir, '%s_context_arr.npy' % (seg)), context_arr)
    np.save(os.path.join(save_dir, '%s_body_arr.npy' % (seg)), body_arr)
    np.save(os.path.join(save_dir, '%s_cat_arr.npy' % (seg)), cat_arr)
    np.save(os.path.join(save_dir, '%s_cont_arr.npy' % (seg)), cont_arr)

After the script finishes, you will see a folder emotic_pre containing several .npy files, which means the data preprocessing succeeded.
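As a quick sanity check (a minimal sketch, assuming the preprocessing script above was run with its default save_dir), you can load the saved arrays and inspect their shapes:

import numpy as np

# Shapes follow the resizing done in the preprocessing script:
# context 224x224, body crop 128x128, 26 discrete labels, 3 continuous labels.
context_arr = np.load('./data/emotic_pre/train_context_arr.npy')
body_arr = np.load('./data/emotic_pre/train_body_arr.npy')
cat_arr = np.load('./data/emotic_pre/train_cat_arr.npy')
cont_arr = np.load('./data/emotic_pre/train_cont_arr.npy')
print(context_arr.shape, body_arr.shape, cat_arr.shape, cont_arr.shape)
# expected: (N, 224, 224, 3) (N, 128, 128, 3) (N, 26) (N, 3)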

3.2 Model training

Open the main.py file. Line 35 begins with the model’s training parameters. Run this file to start training.

4. Detailed explanation of the Emotic model

4.1 Model Structure

The idea of the model is simple. The upper and lower branches in the flow chart are two ResNet-18 networks. The upper branch extracts human-body features: it takes a 128×128 color image as input and outputs 512 feature maps of size 1×1. The lower branch extracts background features from the whole image and uses the scene-classification model Places365 for its pretrained weights: it takes a 224×224 color image as input and also outputs 512 feature maps of size 1×1. The two outputs are then flattened and concatenated into a 1024-dimensional vector, which passes through two fully connected layers to produce a 26-dimensional vector and a 3-dimensional vector. The 26-dimensional vector handles the classification of the 26 discrete emotions, while the 3-dimensional vector handles the regression of the 3 continuous emotion dimensions.

import torch
import torch.nn as nn

class Emotic(nn.Module):
    ''' Emotic Model'''
    def __init__(self, num_context_features, num_body_features):
        super(Emotic, self).__init__()
        self.num_context_features = num_context_features
        self.num_body_features = num_body_features
        self.fc1 = nn.Linear((self.num_context_features + num_body_features), 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.d1 = nn.Dropout(p=0.5)
        self.fc_cat = nn.Linear(256, 26)
        self.fc_cont = nn.Linear(256, 3)
        self.relu = nn.ReLU()

    def forward(self, x_context, x_body):
        context_features = x_context.view(-1, self.num_context_features)
        body_features = x_body.view(-1, self.num_body_features)
        fuse_features = torch.cat((context_features, body_features), 1)
        fuse_out = self.fc1(fuse_features)
        fuse_out = self.bn1(fuse_out)
        fuse_out = self.relu(fuse_out)
        fuse_out = self.d1(fuse_out)
        cat_out = self.fc_cat(fuse_out)
        cont_out = self.fc_cont(fuse_out)
        return cat_out, cont_out
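To make the data flow concrete, here is a minimal sketch of how two ResNet-18 backbones could feed the Emotic head. It uses torchvision's resnet18 purely for illustration; the project builds its backbones (including the Places365-pretrained one) differently, so treat this as an assumption-laden example rather than the project code:

import torch
import torchvision.models as models

# Drop the final fc layer of each ResNet-18 so it outputs a (N, 512, 1, 1) feature map.
body_backbone = torch.nn.Sequential(*list(models.resnet18().children())[:-1])
context_backbone = torch.nn.Sequential(*list(models.resnet18().children())[:-1])
emotic_head = Emotic(num_context_features=512, num_body_features=512)

body_backbone.eval()
context_backbone.eval()
emotic_head.eval()

body_img = torch.randn(1, 3, 128, 128)     # cropped person
context_img = torch.randn(1, 3, 224, 224)  # whole scene

with torch.no_grad():
    x_body = body_backbone(body_img)            # (1, 512, 1, 1)
    x_context = context_backbone(context_img)   # (1, 512, 1, 1)
    cat_out, cont_out = emotic_head(x_context, x_body)
print(cat_out.shape, cont_out.shape)  # torch.Size([1, 26]) torch.Size([1, 3])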

Discrete emotion recognition is a multi-label classification task: a person may show several emotions at the same time. The author handles this by manually setting a threshold for each of the 26 emotions; if an output value is greater than its threshold, the person is considered to have that emotion. The thresholds are listed below. Note that the threshold for Engagement is 0, which means every detected person is labeled with this emotion:

>>> import numpy as np
>>> np.load('./debug_exp/results/val_thresholds.npy')
array([0.0509765 , 0.02937193, 0.03467856, 0.16765128, 0.0307672 ,
       0.13506265, 0.03581731, 0.06581657, 0.03092133, 0.04115443,
       0.02678059, 0.        , 0.04085711, 0.14374524, 0.03058549,
       0.02580678, 0.23389584, 0.13780132, 0.07401864, 0.08617007,
       0.03372583, 0.03105414, 0.029326  , 0.03418647, 0.03770866,
       0.03943525], dtype=float32)
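A minimal sketch of how these thresholds could be applied to the 26-dimensional output (the category order comes from the cat2ind dictionary in the preprocessing script; decode_categories is a hypothetical helper, not a function from the project):

import numpy as np

cat_names = ['Affection', 'Anger', 'Annoyance', 'Anticipation', 'Aversion', 'Confidence',
             'Disapproval', 'Disconnection', 'Disquietment', 'Doubt/Confusion', 'Embarrassment',
             'Engagement', 'Esteem', 'Excitement', 'Fatigue', 'Fear', 'Happiness', 'Pain',
             'Peace', 'Pleasure', 'Sadness', 'Sensitivity', 'Suffering', 'Surprise', 'Sympathy',
             'Yearning']

def decode_categories(cat_out, thresholds):
    # Keep every emotion whose score exceeds its per-class threshold.
    return [cat_names[i] for i in range(26) if cat_out[i] > thresholds[i]]

thresholds = np.load('./debug_exp/results/val_thresholds.npy')
print(decode_categories(np.random.rand(26), thresholds))  # random scores, for illustration only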

4.2 Loss Functions

For the classification task, the author provides two loss functions: an ordinary squared-error loss with uniform weights (self.weight_type == 'mean') and a weighted squared-error loss (self.weight_type == 'static'). For the weighted version, the weights of the 26 categories are [0.1435, 0.1870, 0.1692, 0.1165, 0.1949, 0.1204, 0.1728, 0.1372, 0.1620, 0.1540, 0.1987, 0.1057, 0.1482, 0.1192, 0.1590, 0.1929, 0.1158, 0.1907, 0.1345, 0.1307, 0.1665, 0.1698, 0.1797, 0.1657, 0.1520, 0.1537], respectively:

import numpy as np
import torch
import torch.nn as nn

class DiscreteLoss(nn.Module):
    ''' Class to measure loss between categorical emotion predictions and labels.'''
    def __init__(self, weight_type='mean', device=torch.device('cpu')):
        super(DiscreteLoss, self).__init__()
        self.weight_type = weight_type
        self.device = device
        if self.weight_type == 'mean':
            self.weights = torch.ones((1, 26)) / 26.0  # uniform weights
            self.weights = self.weights.to(self.device)
        elif self.weight_type == 'static':
            self.weights = torch.FloatTensor([0.1435, 0.1870, 0.1692, 0.1165, 0.1949, 0.1204,
                                              0.1728, 0.1372, 0.1620, 0.1540, 0.1987, 0.1057,
                                              0.1482, 0.1192, 0.1590, 0.1929, 0.1158, 0.1907,
                                              0.1345, 0.1307, 0.1665, 0.1698, 0.1797, 0.1657,
                                              0.1520, 0.1537]).unsqueeze(0)
            self.weights = self.weights.to(self.device)

    def forward(self, pred, target):
        if self.weight_type == 'dynamic':
            self.weights = self.prepare_dynamic_weights(target)
            self.weights = self.weights.to(self.device)
        loss = (((pred - target) ** 2) * self.weights)
        return loss.sum()

    def prepare_dynamic_weights(self, target):
        target_stats = torch.sum(target, dim=0).float().unsqueeze(dim=0).cpu()
        weights = torch.zeros((1, 26))
        weights[target_stats != 0] = 1.0 / np.log(target_stats[target_stats != 0].data + 1.2)
        weights[target_stats == 0] = 0.0001
        return weights

For the regression task, the author also provides two loss functions, an L2 loss and a smooth L1 loss:

class ContinuousLoss_L2(nn.Module):
  ''' Class to measure loss between continuous emotion dimension predictions and labels. Using l2 loss as base. '''
  def __init__(self, margin=1):
    super(ContinuousLoss_L2, self).__init__()
    self.margin = margin
  
  def forward(self, pred, target):
    labs = torch.abs(pred - target)
    loss = labs ** 2 
    loss[ (labs < self.margin) ] = 0.0
    return loss.sum()


class ContinuousLoss_SL1(nn.Module):
  ''' Class to measure loss between continuous emotion dimension predictions and labels. Using smooth l1 loss as base. '''
  def __init__(self, margin=1):
    super(ContinuousLoss_SL1, self).__init__()
    self.margin = margin
  
  def forward(self, pred, target):
    labs = torch.abs(pred - target)
    loss = 0.5 * (labs ** 2)
    loss[ (labs > self.margin) ] = labs[ (labs > self.margin) ] - 0.5
    return loss.sum()
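A small usage sketch of the loss classes above (the batch size and label values are made up purely for illustration):

import torch

disc_criterion = DiscreteLoss(weight_type='static', device=torch.device('cpu'))
cont_criterion = ContinuousLoss_SL1(margin=1)

pred_cat = torch.rand(4, 26)                       # 26 discrete-emotion scores per sample
target_cat = torch.randint(0, 2, (4, 26)).float()  # multi-hot ground-truth labels
pred_cont = torch.rand(4, 3)                       # valence/arousal/dominance predictions
target_cont = torch.rand(4, 3)

total_loss = disc_criterion(pred_cat, target_cat) + cont_criterion(pred_cont, target_cont)
print(total_loss)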

Data set links: gas.graviti.com/dataset/dat…

[1]Kosti R, Alvarez J M, Recasens A, et al. Context based emotion recognition using emotic dataset[J]. IEEE transactions on pattern analysis and machine intelligence, 2019, 42(11): 2755-2766.

YOLOv5 Project address: github.com/ultralytics…

Emotic project address: github.com/Tandon-A/em…

For more information, please visit our official website.