From the previous post, AI-Challenger Baseline Fine-grained User Review Sentiment Analysis (0.70201)

Thank you for your attention. In the second article, I will introduce the process of model training in detail. Some of my personal experience in training will be appropriately integrated into it, and I will share it with you for discussion.

Once we have the model identified, the next step is to train it.

Here is my personal summary of the key aspects of the training model.

Loss Function

For this question, if we do not use the idea of Seq2seq and simply divide it into 20 models for training respectively, in fact, loss function can be set as categorical Crossentropy or binary Crossentropy.

Binary crossentropy is used to categorize multiple tags. Binary crossentropy is used to categorize multiple tags. What’s the difference between multiple categories and multiple labels? There’s only one target category, and there can be multiple target categories.

In this case, I’ve tried both of these loss functions, and since the last one is the one with the highest probability of prediction, there’s very little difference between the two loss functions. Binary Crossentropy is recommended if you want to use seq2seq.

These are two traditional loss functions, which can also be used to optimize MSE. Here is a question, students who are good at mathematics can think about whether f1-score can be directly optimized. Although F-score itself is not derivable, can we design approximate F1-Loss? You can think about that.

Early Stop

If we set the metric values for each batch, we will get f-Score values for each batch. Therefore, we need to calculate the f-Score value of the model after the end of each epoch, which is convenient for us to master the training situation of the model.

Like this

def getClassification(arr):
    arr = list(arr)
    if arr.index(max(arr)) == 0:
        return -2
    elif arr.index(max(arr)) == 1:
        return -1
    elif arr.index(max(arr)) == 2:
        return 0
    else:
        return 1

class Metrics(Callback):
    def on_train_begin(self, logs={}):
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs={}):
        val_predict = list(map(getClassification, self.model.predict(self.validation_data[0])))
        val_targ = list(map(getClassification, self.validation_data[1]))
        _val_f1 = f1_score(val_targ, val_predict, average="macro")
        _val_recall = recall_score(val_targ, val_predict, average="macro")
        _val_precision = precision_score(val_targ, val_predict, average="macro")
        self.val_f1s.append(_val_f1)
        self.val_recalls.append(_val_recall)
        self.val_precisions.append(_val_precision)
        print(_val_f1, _val_precision, _val_recall)
        print("max f1")
        print(max(self.val_f1s))
        returnCopy the code

Early_stop, as the name suggests, saves time by stopping training in advance when the model is no longer improving on the validation set. Adjust it by setting patience.

Class Weight

Generally speaking, when the data set samples are not balanced, the effect can be improved by setting the weight of positive and negative samples. However, in this problem, after I set class_weight for each of the four categories, I found that the effect was even worse. Here also hope that the students who know the reason can leave a comment, exchange and study together.

EMA (Exponential Smoothing)

Reference: http://zangbo.me/2017/07/01/TensorFlow_6/, exponential moving average (ExponentialMovingAverage) EMA – namely, the young – CSDN blog

EMA is widely used in BN layer of deep learning, RMSprop, Adadelta, Adam and other gradient descent methods.

It adds a shadow copy of the training parameters and maintains the moving average operation of the training parameters in its shadow copy. This operation is called after each training to update the moving average.

I personally feel that the addition of EMA can effectively prevent the parameters from updating too quickly. It has a bagging effect.

Learning Rate

When training the model, we can use the learning rate of dynamic attenuation to avoid the model staying in the local optimum.

My personal experience is as follows:

  1. The model was iterated enough times at the default learning rate (0.001) to retain the model with the highest validation accuracy.
  2. The optimal model of the previous step was loaded, the learning rate was reduced to 0.0001, and the model with the highest verification accuracy was retained.
  3. Load the optimal model of the previous step, remove the regularization strategy, set the learning rate to 0.00001, and train to the optimal.

Max Length (Padding maximum sentence length)

This seemingly unimportant point is actually very important. At first I thought the padding length should be twice as long as the average comment length (for char level, max_length should be 400 or so), but I found that the effect was not as good. After I changed the max_length to 1000, Macro F-Score clearly indicates that, in my opinion, in the multi-classification problem, some comments with long length may belong to those categories with few samples. If the padding is too short, these long comments cannot be correctly classified.

Training code examples

To train 20 models at once, add Python’s GC and Keras’s clear_session.

from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))
import random
random.seed = 42
import pandas as pd
from tensorflow import set_random_seed
set_random_seed(42)
from keras.preprocessing import text, sequence
from keras.callbacks import ModelCheckpoint, Callback
from sklearn.metrics import f1_score, recall_score, precision_score
from keras.layers import *
from classifier_bigru import TextClassifier
from gensim.models.keyedvectors import KeyedVectors
import pickle
import gc


def getClassification(arr):
    arr = list(arr)
    if arr.index(max(arr)) == 0:
        return -2
    elif arr.index(max(arr)) == 1:
        return -1
    elif arr.index(max(arr)) == 2:
        return 0
    else:
        return 1


class Metrics(Callback):
    def on_train_begin(self, logs={}):
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs={}):
        val_predict = list(map(getClassification, self.model.predict(self.validation_data[0])))
        val_targ = list(map(getClassification, self.validation_data[1]))
        _val_f1 = f1_score(val_targ, val_predict, average="macro")
        _val_recall = recall_score(val_targ, val_predict, average="macro")
        _val_precision = precision_score(val_targ, val_predict, average="macro")
        self.val_f1s.append(_val_f1)
        self.val_recalls.append(_val_recall)
        self.val_precisions.append(_val_precision)
        print(_val_f1, _val_precision, _val_recall)
        print("max f1")
        print(max(self.val_f1s))
        return


data = pd.read_csv("preprocess/train_char.csv")
data["content"] = data.apply(lambda x: eval(x[1]), axis=1)

validation = pd.read_csv("preprocess/validation_char.csv")
validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1)

model_dir = "model_bigru_char/"
maxlen = 1200
max_features = 20000
batch_size = 128
epochs = 15
tokenizer = text.Tokenizer(num_words=None)
tokenizer.fit_on_texts(data["content"].values)
with open('tokenizer_char.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

word_index = tokenizer.word_index
w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8',
                                             unicode_errors='ignore')
embeddings_index = {}
embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size))
word2idx = {"_PAD": 0}
vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()]

for word, i in word_index.items():
    if word in w2_model:
        embedding_vector = w2_model[word]
    else:
        embedding_vector = None
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector

X_train = data["content"].values
Y_train_ltc = pd.get_dummies(data["location_traffic_convenience"])[[-2, -1, 0, 1]].values
Y_train_ldfbd = pd.get_dummies(data["location_distance_from_business_district"])[[-2, -1, 0, 1]].values
Y_train_letf = pd.get_dummies(data["location_easy_to_find"])[[-2, -1, 0, 1]].values
Y_train_swt = pd.get_dummies(data["service_wait_time"])[[-2, -1, 0, 1]].values
Y_train_swa = pd.get_dummies(data["service_waiters_attitude"])[[-2, -1, 0, 1]].values
Y_train_spc = pd.get_dummies(data["service_parking_convenience"])[[-2, -1, 0, 1]].values
Y_train_ssp = pd.get_dummies(data["service_serving_speed"])[[-2, -1, 0, 1]].values
Y_train_pl = pd.get_dummies(data["price_level"])[[-2, -1, 0, 1]].values
Y_train_pce = pd.get_dummies(data["price_cost_effective"])[[-2, -1, 0, 1]].values
Y_train_pd = pd.get_dummies(data["price_discount"])[[-2, -1, 0, 1]].values
Y_train_ed = pd.get_dummies(data["environment_decoration"])[[-2, -1, 0, 1]].values
Y_train_en = pd.get_dummies(data["environment_noise"])[[-2, -1, 0, 1]].values
Y_train_es = pd.get_dummies(data["environment_space"])[[-2, -1, 0, 1]].values
Y_train_ec = pd.get_dummies(data["environment_cleaness"])[[-2, -1, 0, 1]].values
Y_train_dp = pd.get_dummies(data["dish_portion"])[[-2, -1, 0, 1]].values
Y_train_dt = pd.get_dummies(data["dish_taste"])[[-2, -1, 0, 1]].values
Y_train_dl = pd.get_dummies(data["dish_look"])[[-2, -1, 0, 1]].values
Y_train_dr = pd.get_dummies(data["dish_recommendation"])[[-2, -1, 0, 1]].values
Y_train_ooe = pd.get_dummies(data["others_overall_experience"])[[-2, -1, 0, 1]].values
Y_train_owta = pd.get_dummies(data["others_willing_to_consume_again"])[[-2, -1, 0, 1]].values

X_validation = validation["content"].values
Y_validation_ltc = pd.get_dummies(validation["location_traffic_convenience"])[[-2, -1, 0, 1]].values
Y_validation_ldfbd = pd.get_dummies(validation["location_distance_from_business_district"])[[-2, -1, 0, 1]].values
Y_validation_letf = pd.get_dummies(validation["location_easy_to_find"])[[-2, -1, 0, 1]].values
Y_validation_swt = pd.get_dummies(validation["service_wait_time"])[[-2, -1, 0, 1]].values
Y_validation_swa = pd.get_dummies(validation["service_waiters_attitude"])[[-2, -1, 0, 1]].values
Y_validation_spc = pd.get_dummies(validation["service_parking_convenience"])[[-2, -1, 0, 1]].values
Y_validation_ssp = pd.get_dummies(validation["service_serving_speed"])[[-2, -1, 0, 1]].values
Y_validation_pl = pd.get_dummies(validation["price_level"])[[-2, -1, 0, 1]].values
Y_validation_pce = pd.get_dummies(validation["price_cost_effective"])[[-2, -1, 0, 1]].values
Y_validation_pd = pd.get_dummies(validation["price_discount"])[[-2, -1, 0, 1]].values
Y_validation_ed = pd.get_dummies(validation["environment_decoration"])[[-2, -1, 0, 1]].values
Y_validation_en = pd.get_dummies(validation["environment_noise"])[[-2, -1, 0, 1]].values
Y_validation_es = pd.get_dummies(validation["environment_space"])[[-2, -1, 0, 1]].values
Y_validation_ec = pd.get_dummies(validation["environment_cleaness"])[[-2, -1, 0, 1]].values
Y_validation_dp = pd.get_dummies(validation["dish_portion"])[[-2, -1, 0, 1]].values
Y_validation_dt = pd.get_dummies(validation["dish_taste"])[[-2, -1, 0, 1]].values
Y_validation_dl = pd.get_dummies(validation["dish_look"])[[-2, -1, 0, 1]].values
Y_validation_dr = pd.get_dummies(validation["dish_recommendation"])[[-2, -1, 0, 1]].values
Y_validation_ooe = pd.get_dummies(validation["others_overall_experience"])[[-2, -1, 0, 1]].values
Y_validation_owta = pd.get_dummies(validation["others_willing_to_consume_again"])[[-2, -1, 0, 1]].values

list_tokenized_train = tokenizer.texts_to_sequences(X_train)
input_train = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)

list_tokenized_validation = tokenizer.texts_to_sequences(X_validation)
input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen)

print("model1")
model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
file_path = model_dir + "model_ltc_{epoch:02d}.hdf5"
checkpoint = ModelCheckpoint(file_path, verbose=2, save_weights_only=True)
metrics = Metrics()
callbacks_list = [checkpoint, metrics]
history = model1.fit(input_train, Y_train_ltc, batch_size=batch_size, epochs=epochs,
                     validation_data=(input_validation, Y_validation_ltc), callbacks=callbacks_list, verbose=2)
del model1
del history
gc.collect()
K.clear_session()

print("model2")
model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
file_path = model_dir + "model_ldfbd_{epoch:02d}.hdf5"
checkpoint = ModelCheckpoint(file_path, verbose=2, save_weights_only=True)
metrics = Metrics()
callbacks_list = [checkpoint, metrics]
history = model2.fit(input_train, Y_train_ldfbd, batch_size=batch_size, epochs=epochs,
                     validation_data=(input_validation, Y_validation_ldfbd), callbacks=callbacks_list, verbose=2)
del model2
del history
gc.collect()
K.clear_session()

print("model3")
model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
file_path = model_dir + "model_letf_{epoch:02d}.hdf5"
checkpoint = ModelCheckpoint(file_path, verbose=2, save_weights_only=True)
metrics = Metrics()
callbacks_list = [checkpoint, metrics]
history = model3.fit(input_train, Y_train_letf, batch_size=batch_size, epochs=epochs,
                     validation_data=(input_validation, Y_validation_letf), callbacks=callbacks_list, verbose=2)
del model3
del history
gc.collect()
K.clear_session()
。。。Copy the code

Github address: Pengshuang/AI-comp