Deep learning: BERT Chinese classification notes

BERT experiment

Analysis of the preprocessing results

tfhub_handle_preprocess = "https://hub.tensorflow.google.cn/tensorflow/bert_zh_preprocess/3"
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

text_test = ["I'm a genius!!"]
text_preprocessed = bert_preprocess_model(text_test)

print(f'Keys       : {list(text_preprocessed.keys())}')
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')

Print the result

Keys       : ['input_mask', 'input_type_ids', 'input_word_ids']
Shape      : (1, 128)
Word Ids   : [ 101 2769 4696 3221  702 1921 2798 1557 8013 8013  102    0]
Input Mask : [1 1 1 1 1 1 1 1 1 1 1 0]
Type Ids   : [0 0 0 0 0 0 0 0 0 0 0 0]

My conjecture is that the 128 in Shape is the maximum sequence length.
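
A quick way to check this conjecture, assuming the preprocess model truncates everything to a fixed seq_length, is to feed in a much longer string and confirm the shape stays (1, 128) with no padding left at the end. This is only a sketch using the objects defined above.

# Feed a very long input and check that the output is still 128 positions.
long_text = ["I'm a genius!! " * 100]              # far more than 128 tokens
long_preprocessed = bert_preprocess_model(long_text)
print(long_preprocessed["input_word_ids"].shape)   # expected: (1, 128)
print(long_preprocessed["input_mask"][0, -5:])     # expected: all 1s, i.e. truncated, no padding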

Keys corresponds to only three attributes, but BERT's pre-training inputs should have seven. Why the other four are missing is not entirely clear to me yet; I suspect the Advanced Topics material for the Hub model covers this. Since the task at hand is text classification, the last four features in the list below are not needed anyway (a small sketch after the list shows how segment_ids behaves when there really are two sentences).

  • input_ids: the IDs corresponding to the input tokens
  • input_mask: the mask over the input, where 1 marks a real input token and 0 marks padding
  • segment_ids: 0 marks sentence A (or padding) and 1 marks sentence B
  • masked_lm_positions: the positions of the tokens we mask
  • masked_lm_ids: the IDs of the tokens we mask
  • masked_lm_weights: the weights of the masked tokens, where 1 marks a real mask and 0 marks a padding mask
  • next_sentence_labels: whether sentence B really is the next sentence after sentence A
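
The preprocess SavedModel also exposes lower-level endpoints, which I believe is what the Advanced Topics material refers to. As a hedged sketch, assuming the .tokenize and .bert_pack_inputs endpoints described in the official TF Hub BERT tutorial exist on this Chinese preprocess model too, packing a sentence pair should show input_type_ids (i.e. segment_ids) switching to 1 for sentence B:

# Sketch only: endpoint names follow the official TF Hub BERT tutorial and may differ.
import tensorflow as tf
import tensorflow_hub as hub

preprocessor = hub.load(tfhub_handle_preprocess)
tokenize = hub.KerasLayer(preprocessor.tokenize)

sentence_a = tokenize(tf.constant(["I'm a genius!!"]))
sentence_b = tokenize(tf.constant(["What a genius!!"]))

pack = hub.KerasLayer(preprocessor.bert_pack_inputs,
                      arguments=dict(seq_length=128))
packed = pack([sentence_a, sentence_b])

# Expect 0s for [CLS] + sentence A + its [SEP], then 1s for sentence B.
print(packed["input_type_ids"][0, :24])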

Now let's see what the output looks like. To make sure my guess is correct, I fed in another sentence, "What a genius!!" (the same sentence with the "I" dropped), and looked at its results. Putting the two outputs together, it looks like this:

Word Ids   : [ 101 2769 4696 3221  702 1921 2798 1557 8013 8013  102    0]
Word Ids   : [ 101 4696 3221  702 1921 2798 1557 8013 8013  102    0    0]
Input Mask : [1 1 1 1 1 1 1 1 1 1 1 0]
Input Mask : [1 1 1 1 1 1 1 1 1 1 0 0]
Type Ids   : [0 0 0 0 0 0 0 0 0 0 0 0]
Type Ids   : [0 0 0 0 0 0 0 0 0 0 0 0]

That's interesting. BERT has no concept of a phrase here; the Chinese text is split one character at a time. According to the documentation, BERT's pre-training is divided into two steps, and this is obviously the first step.

"I'm a genius!!" is split into [CLS] I'm a genius!! [SEP], so the Word Ids match up. Type Ids should be equivalent to segment_ids, and they match too: all 0, since there is only a sentence A. The Input Mask is also correct; the rule is that 1 marks a real token and 0 marks padding, and in the later attention computation the model is kept from attending to these padding tokens through a masking trick. In the reference implementation the mask is built simply as input_mask = [1] * len(input_ids), then padded with 0s.
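
A minimal sketch of that masking trick (not the exact TF Hub internals, just the standard approach from the original BERT code): padding positions get a large negative bias added to the attention scores, so after the softmax they receive essentially zero attention.

import numpy as np
import tensorflow as tf

# Toy example: 12 positions, the last one is padding (mask = 0).
input_mask = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], dtype=np.float32)
scores = np.random.randn(12, 12).astype(np.float32)       # pretend attention scores (query x key)

additive_mask = (1.0 - input_mask) * -10000.0              # 0 for real tokens, -10000 for padding
weights = tf.nn.softmax(scores + additive_mask, axis=-1)   # broadcast over the key dimension

print(weights.numpy()[:, -1])   # attention paid to the padding position: ~0 for every query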

Analysis of the encoder output

tfhub_handle_encoder = "https://hub.tensorflow.google.cn/tensorflow/bert_zh_L-12_H-768_A-12/4"
bert_model = hub.KerasLayer(tfhub_handle_encoder)

bert_results = bert_model(text_preprocessed)
print(f'Keys       : {list(bert_results.keys())}')
print(f'Pooled Outputs Shape:{bert_results["pooled_output"].shape}')
print(f'Pooled Outputs Values:{bert_results["pooled_output"][0, :12]}')
print(f'Sequence Outputs Shape:{bert_results["sequence_output"].shape}')
print(f'Sequence Outputs Values:{bert_results["sequence_output"][0, :12]}')
  • sequence_output: dimension [batch_size, seq_length, hidden_size]; the contextual vector for every token in the sequence.
  • pooled_output: dimension [batch_size, hidden_size]; the vector output for the [CLS] token at the first position of each sequence, used for classification tasks (see the shape check below).
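
As a small sanity check on those dimensions (my understanding is that pooled_output is derived from the [CLS] position of sequence_output through an internal dense + tanh pooler, so the two are related but not identical):

print(bert_results["sequence_output"].shape)   # (1, 128, 768): one 768-d vector per token
print(bert_results["pooled_output"].shape)     # (1, 768): one vector per input sequence

cls_vector = bert_results["sequence_output"][:, 0, :]   # raw [CLS] token vector
print(cls_vector.shape)                                  # also (1, 768)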

BERT Chinese classification in practice

Referring to the official BERT classification tutorial: the official example is a sentiment analysis task, which is actually very similar to Chinese multi-class classification. All you need to do is swap in BERT's Chinese preprocess and encoder models, and change the last output layer to multiple categories.

The rest can basically be copied as-is.

Environment preparation

pip install -q tensorflow-text
pip install -q tf-models-official

import os
import shutil

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops needed by the preprocess model
from official.nlp import optimization  # to create the AdamW optimizer

import matplotlib.pyplot as plt

tfhub_handle_preprocess = "https://hub.tensorflow.google.cn/tensorflow/bert_zh_preprocess/3"
tfhub_handle_encoder = "https://hub.tensorflow.google.cn/tensorflow/bert_zh_L-12_H-768_A-12/4"
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
bert_model = hub.KerasLayer(tfhub_handle_encoder)

Prepare the data set

Since the experiment was done in Colab, it seems the data can only be provided by uploading files. So compress the dataset into a zip archive locally, upload it to Colab, and decompress it there.

import zipfile

# Unzip the uploaded archive into the "path" directory
zfile = zipfile.ZipFile("path.zip", "r")
for name in zfile.namelist():
    zfile.extract(name, "path")
zfile.close()

The file format requirements are the same as the official example.

train/
├── category 1/
│   ├── category 1 title 1.txt
│   ├── category 1 title 2.txt
│   └── ...
├── category 2/
│   └── ...
└── category 3/
    └── ...
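
As a hedged illustration of that layout (the category names and file names below are made-up placeholders), this is one way to build the directory locally from (title, category) pairs before zipping it:

import os

# Hypothetical (title, category) pairs; replace with your own data and real category names.
samples = [
    ("life element kettle household multi-functional kettle", "category 1"),
    ("seven dimensions sanitary napkin elegant series value combination", "category 2"),
]

for i, (title, category) in enumerate(samples):
    category_dir = os.path.join("train", category)
    os.makedirs(category_dir, exist_ok=True)
    # One short text file per title; the label comes from the folder name.
    with open(os.path.join(category_dir, f"{i}.txt"), "w", encoding="utf-8") as f:
        f.write(title)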

Split data set

AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
class_names = raw_train_ds.class_names
print(class_names)
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

Take a look at the data set

for text_batch, label_batch in train_ds.take(1):
  print(text_batch)
  for i in range(3):
    print(f'Review: {text_batch.numpy()[i]}')
    label = label_batch.numpy()[i]
    print(f'Label : {label} ({class_names[label]})')

Define the model

def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.5)(net)
    net = tf.keras.layers.Dense(6, activation='softmax', name='classifier')(net)
    return tf.keras.Model(text_input, net)

classifier_model = build_classifier_model()

epochs = 10
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
print(num_train_steps)
num_warmup_steps = int(0.1 * num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')

loss = tf.keras.losses.SparseCategoricalCrossentropy()
metrics = tf.metrics.SparseCategoricalAccuracy()
classifier_model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
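
Before training, a quick sanity check I find useful (mirroring the flow of the official tutorial) is to run one raw example through the untrained classifier and print the model summary; with a softmax over 6 classes, the untrained outputs should be roughly uniform:

# The example string is one of the product titles used later for prediction.
bert_raw_result = classifier_model(
    tf.constant(['life element kettle household multi-functional kettle insulation one small office boiling tea ware teapot']))
print(bert_raw_result)   # 6 probabilities, roughly 1/6 each before training

classifier_model.summary()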

Train the model

history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)
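
matplotlib was imported above but not used yet; a sketch for plotting the training curves from history.history follows (the exact key names depend on the metric name, here SparseCategoricalAccuracy, so print the keys first):

history_dict = history.history
print(history_dict.keys())   # confirm the exact key names

acc = history_dict['sparse_categorical_accuracy']
val_acc = history_dict['val_sparse_categorical_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs_range = range(1, len(acc) + 1)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, loss, 'r', label='Training loss')
plt.plot(epochs_range, val_loss, 'b', label='Validation loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(epochs_range, acc, 'r', label='Training accuracy')
plt.plot(epochs_range, val_acc, 'b', label='Validation accuracy')
plt.legend()
plt.show()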

Predict on new data

I found that even though I only used 300 samples in total, 240 for training and 60 for validation, the accuracy could reach 90%. That is what BERT can do.
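
To measure this properly rather than eyeballing the training log, a quick check is to evaluate on the validation set:

loss, accuracy = classifier_model.evaluate(val_ds)
print(f'Loss     : {loss}')
print(f'Accuracy : {accuracy}')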

import numpy as np

def print_my_examples(inputs, results):
    result_for_printing = \
        [f'input: {inputs[i]:<30} : class: {class_names[np.argmax(results[i])]}'
         for i in range(len(inputs))]
    print(*result_for_printing, sep='\n')
    print()

examples = [
    'MIJI-Sprinklers',
    'life element kettle household multi-functional kettle insulation one small office boiling tea ware teapot',
    'principal think old man drop fall protection after repeated use of intelligent airbag vest proof clothes artifact gift',
    'seven dimensions sanitary napkin elegant series day and night value combination',
]

example_result = classifier_model(tf.constant(examples))
print_my_examples(examples, example_result)

Appendix

  • Official BERT classification example code
  • TensorFlow Hub BERT Chinese model preprocessing
  • Zhihu: BERT explained in detail
  • Tencent Cloud: BERT details
  • BERT source code interpretation