BERT applications in production need to be compressed, which requires a good understanding of BERT's structure. This repository walks through the BERT source code (PyTorch version) step by step. The repository address is

Github.com/DA-southamp…

Code and data introduction

The first thing to talk about is the code repository.

I cloned the code directly, put it into my repository, and renamed it bert_read_step_to_step.

I will use this code to run BERT on a text classification task step by step, recording the various details, including my own implementation notes, along the way.

Before you run it, you need to do two things.

Prepare the pre-trained model

The first is preparing the pre-trained model. I use Google's Chinese pre-trained model, chinese_L-12_H-768_A-12.zip. The model is too big to upload to the repository; if you don't have it locally, download it directly, or run the following command:

wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip

Once the pre-trained model is downloaded, unpack it and convert the TF checkpoint into the corresponding PyTorch version. The commands are as follows:

export BERT_BASE_DIR=/path/to/bert/chinese_L-12_H-768_A-12

python convert_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin

After a successful conversion, put the model into the corresponding location in the repository:

Read_Bert_Code/bert_read_step_to_step/prev_trained_model/

and rename it to:

bert-base-chinese

Prepare text classification training data

The second thing is to prepare the training data. Here I do a text classification task using the TNews dataset, which is divided into training, test, and development sets and goes under:

Read_Bert_Code/bert_read_step_to_step/chineseGLUEdatasets/tnews

Note that since I only want to understand the internal code, accuracy is not a concern, so I take only part of the data: 1K examples for training, 1K for testing, and 1K for development.

I import the project into PyCharm and prepare to debug. The file I debug is run_classifier.py, with the following arguments:

--model_type=bert --model_name_or_path=prev_trained_model/bert-base-chinese --task_name="tnews" --do_train --do_eval --do_lower_case --data_dir=./chineseGLUEdatasets/tnews --max_seq_length=128 --per_gpu_train_batch_size=16 --per_gpu_eval_batch_size=16 --learning_rate=2e-5 --num_train_epochs=4.0 --logging_steps=100 --save_steps=100 --output_dir=./outputs/tnews_output/ --overwrite_output_dir

Then debug run_classifier.py; the details are shown below.

1. Enter the main function

I set a breakpoint at the main function entry, shown below, and then step in to look at the main function.

# break point on the main function
if __name__ == "__main__":
    main()  # main function entry

2. Parse command line parameters

This part parses the command-line arguments. It is a routine operation: model name, model path, whether to train or evaluate, and so on. It's easy; just step through it.
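As a rough sketch of what this looks like, here are a few of the flags from the command line above parsed with argparse (the actual run_classifier.py defines many more flags, so treat the exact set and defaults here as assumptions):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_type", type=str, required=True)
parser.add_argument("--model_name_or_path", type=str, required=True)
parser.add_argument("--task_name", type=str, required=True)
parser.add_argument("--do_train", action="store_true")
parser.add_argument("--do_eval", action="store_true")
parser.add_argument("--do_lower_case", action="store_true")
parser.add_argument("--data_dir", type=str, required=True)
parser.add_argument("--max_seq_length", type=int, default=128)
parser.add_argument("--learning_rate", type=float, default=2e-5)
parser.add_argument("--output_dir", type=str, required=True)
args = parser.parse_args()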

3. General checks

Next, the code performs some general checks:

Determine whether an output folder exists

Check whether remote debugging is required

Two parameters (local_rank and no_cuda) control whether training runs on a single-machine CPU, a single machine with multiple GPUs, or multiple machines with distributed GPUs

The code can be seen as follows:

if args.local_rank == -1 or args.no_cuda:
    device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
    args.n_gpu = torch.cuda.device_count()
else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend='nccl')
    args.n_gpu = 1

4. Obtain the Processor of the task

Get the processor corresponding to the task. This is the class we need to define ourselves to process our own input files. The code is as follows:

processor = processors[args.task_name]()

This returns an instance of a processor class; here we use:

TnewsProcessor(DataProcessor)

Its code is shown below.
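For context, a minimal sketch of how such a task-to-processor registry usually looks (the exact dictionary contents are an assumption; only the tnews entry is confirmed by this walkthrough):

processors = {
    "tnews": TnewsProcessor,
    # ... other task names map to their own processor classes
}

processor = processors[args.task_name]()
label_list = processor.get_labels()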

4.1 TnewsProcessor

Take a closer look at TnewsProcessor, which inherits from DataProcessor:


# DataProcessor location in the project: processors.utils.DataProcessor
class DataProcessor(object):
    def get_train_examples(self, data_dir):
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        raise NotImplementedError()

    def get_labels(self):
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        with open(input_file, "r", encoding="utf-8-sig") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                lines.append(line)
            return lines

    @classmethod
    def _read_txt(cls, input_file):
        """Reads a tab separated value file."""
        with open(input_file, "r") as f:
            reader = f.readlines()
            lines = []
            for line in reader:
                lines.append(line.strip().split("_!_"))
            return lines

It then contains five functions: reading the training set, reading the dev set, reading the test set, getting the labels, and creating examples in the format BERT requires.

Now let's look at the TnewsProcessor code:


class TnewsProcessor(DataProcessor):

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_txt(os.path.join(data_dir, "toutiao_category_train.txt")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_txt(os.path.join(data_dir, "toutiao_category_dev.txt")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_txt(os.path.join(data_dir, "toutiao_category_test.txt")), "test")

    def get_labels(self):
        """See base class."""
        labels = []
        for i in range(17):
            if i == 5 or i == 11:
                continue
            labels.append(str(100 + i))
        return labels

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            if set_type == 'test':
                label = '0'
            else:
                label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

One thing to note: if we use our own training data, there are two options. The first is to convert our data into the same format as the example data; the second is to change the source code here to read our own data format.
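As a sketch of the second option, here is what a custom processor might look like, assuming a hypothetical tab-separated file train.tsv with the label in the first column and the text in the second (the file name and column order are assumptions for illustration):

class MyProcessor(DataProcessor):
    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_labels(self):
        return ["0", "1"]  # whatever labels your task uses

    def _create_examples(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            label = line[0]   # first column: label
            text_a = line[1]  # second column: text
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples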

5. Load the pre-trained model

The code here is relatively simple: it just loads the pre-trained model, so I won't go into detail.


config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name)
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)

6. Training the model (the most important part)

Training, seen from the main function, consists of two steps: one loads the required dataset, the other trains. The code looks like this:

train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, data_type='train')
global_step, tr_loss = train(args, train_dataset, model, tokenizer)

Let’s look at two functions one by one:

6.1 Loading a Training Set

Let’s take a look at the first function, load_and_cache_examples, which loads training data sets. The code location is here. Looking at the code roughly, there are three core operations.

The first core operation, located here, has the following code:

examples = processor.get_train_examples(args.data_dir)

This code reads the training set using the processor; very simple.

The resulting example looks something like this (the return form is shown clearly in the processor above):

guid='train-0'
label='104'
text_a='The stock is not doing well today.'
text_b=None

The second core operation is convert_examples_to_features, which converts the examples into features.

The code location is here. The code is as follows:

features = convert_examples_to_features(examples,
                                        tokenizer,
                                        label_list=label_list,
                                        max_length=args.max_seq_length,
                                        output_mode=output_mode,
                                        pad_on_left=bool(args.model_type in ['xlnet']),
                                        pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
                                        pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0)

Let’s go into this function and see what’s going on inside.

processors.glue.glue_convert_examples_to_features

It first makes a label mapping: '100'->0, '101'->1, and so on.
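A minimal sketch of that mapping (label_list is what get_labels() returned above):

label_map = {label: i for i, label in enumerate(label_list)}
# e.g. {'100': 0, '101': 1, '102': 2, ...}
label = label_map[example.label]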

It then gets the serialized representation of the input text, input_ids and token_type_ids, which look something like this:

‘input_ids’=[101, 5500, 4873, 704, 4638, 4960, 4788, 2501, 2578, 102]

‘token_type_ids’=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

For our ten tokens, the result looks like this: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

It then calculates the current length and the padding length: for example, our length is 10 and the maximum sequence length is 128, so the pad needs 118 zeros.

At this point, our input_ids becomes the list above followed by 118 zeros, and our attention_mask becomes the form above plus 118 zeros. Because the padding is not a second sentence (we don't have a second sentence at all), token_type_ids is 128 zeros in total.
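A rough sketch of the padding step, assuming right-side padding with pad_token = 0 and pad_token_segment_id = 0 (which is the BERT case given the call above):

padding_length = max_seq_length - len(input_ids)  # 128 - 10 = 118
input_ids = input_ids + [pad_token] * padding_length
attention_mask = attention_mask + [0] * padding_length
token_type_ids = token_type_ids + [pad_token_segment_id] * padding_length

assert len(input_ids) == max_seq_length
assert len(attention_mask) == max_seq_length
assert len(token_type_ids) == max_seq_length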

So after each piece of data is processed, what we do is:

features.append(InputFeatures(input_ids=input_ids,
                              attention_mask=attention_mask,
                              token_type_ids=token_type_ids,
                              label=label,
                              input_len=input_len))  # input_len is 10 here, not 128

InputFeatures here stores the converted features in a new object.

After converting all the raw data, we get a list of features. The third core operation is to turn the elements of this list into tensors, then use TensorDataset to construct the final dataset and return it:

dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_lens,all_labels)
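A sketch of how those tensors are typically built from the feature list (the field names follow the InputFeatures call above; treating all_lens as holding input_len is an assumption):

import torch
from torch.utils.data import TensorDataset

all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
all_lens = torch.tensor([f.input_len for f in features], dtype=torch.long)
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)

dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_lens, all_labels)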

6.2 Training the model: the train function

Let’s look at the second function, which is the train operation.

6.2.1 Common Operations

First of all, it’s all routine.

Random sampling of data: RandomSampler

The DataLoader reads the data

Calculating the total number of training steps (with gradient accumulation), warm-up settings, the optimizer, fp16, and so on; a rough sketch follows below
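A minimal sketch of this routine setup, assuming the transformers AdamW optimizer and linear warm-up scheduler (argument names such as warmup_steps and gradient_accumulation_steps are assumptions; the real script also handles weight-decay grouping, fp16, and distributed training):

from torch.utils.data import DataLoader, RandomSampler
from transformers import AdamW, get_linear_schedule_with_warmup

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler,
                              batch_size=args.per_gpu_train_batch_size)

t_total = len(train_dataloader) // args.gradient_accumulation_steps * int(args.num_train_epochs)
optimizer = AdamW(model.parameters(), lr=args.learning_rate, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=args.warmup_steps,
                                            num_training_steps=t_total)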

Then training proceeds batch by batch. The core code here feeds the data and parameters into the model:

outputs = model(**inputs)

Since we are running a text classification demo, the BERT class used here is BertForSequenceClassification.

So let’s jump right into this class and see what’s going on inside the function.

6.2.2 Bert classification model: BertForSequenceClassification

The main codes are as follows:


# reference: transformers.modeling_bert.BertForSequenceClassification
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        ...
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, labels=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask)
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        ...
        return outputs  # (loss), logits, (hidden_states), (attentions)

The core of this class has two parts: the first uses BertModel to obtain BERT's raw output, and the second uses the CLS output for the subsequent classification. The more important part is BertModel, so let's go straight into BertModel and see what it looks like inside.

6.2.2.1 BertModel

The code is as follows:


# reference: transformers.modeling_bert.BertModel
class BertModel(BertPreTrainedModel):
    def __init__(self, config):
        ...
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        ...

    def forward(self, input_ids, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None):
        ...
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
        encoder_outputs = self.encoder(embedding_output,
                                       extended_attention_mask,
                                       head_mask=head_mask)
        ...
        return outputs

BertModel can be divided into two parts: the first works on the attention_mask and embeds the input, and the second passes the result into the encoder, BertEncoder, for encoding. Let's go straight in and have a look.

6.2.2.1.1 BertEncoder

The code is as follows:

class BertEncoder(nn.Module):
    def __init__(self, config):
        super(BertEncoder, self).__init__()
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])

    def forward(self, hidden_states, attention_mask=None, head_mask=None):
        all_hidden_states = ()
        all_attentions = ()
        for i, layer_module in enumerate(self.layer):
            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
            hidden_states = layer_outputs[0]

            if self.output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)

        # Add last layer
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        outputs = (hidden_states,)
        if self.output_hidden_states:
            outputs = outputs + (all_hidden_states,)
        if self.output_attentions:
            outputs = outputs + (all_attentions,)
        return outputs  # last-layer hidden state, (all hidden states), (all attentions)

If output_hidden_states is set to True, BertEncoder outputs the result of every layer as well as the word embeddings: with 12 layers, you get 13 outputs, the first of which is the word embedding. Each layer's result has shape [batch_size, seq_length, hidden_size] (the first one is [batch_size, seq_length, embedding_size]).

The embedding size is, of course, the same as the hidden size.
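A small sketch of reading those encoder outputs (assuming the config was created with output_hidden_states=True):

last_hidden_state = encoder_outputs[0]  # torch.Size([16, 32, 768])
all_hidden_states = encoder_outputs[1]  # tuple of 13 tensors: embeddings + 12 layers
first_layer = all_hidden_states[0]      # the word-embedding layer, torch.Size([16, 32, 768])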

One other detail worth noting here is the head_mask, which can mask out individual attention heads. I remember a paper studying which heads affect the result, and this mechanism seems to be how that would be done.

The most important of the BertEncoder is the BertLayer

  • BertLayer

BertLayer consists of BertAttention, BertIntermediate, and BertOutput. BertAttention is in turn divided into BertSelfAttention and BertSelfOutput. Let's look at them one by one.

  • BertAttention
  • BertSelfAttention
def forward(self, hidden_states, attention_mask=None, head_mask=None):
    mixed_query_layer = self.query(hidden_states)  # [16, 32, 768]: 16 is the batch_size, 32 is the sentence length in this batch, 768 is the hidden size
    mixed_key_layer = self.key(hidden_states)
    mixed_value_layer = self.value(hidden_states)

    query_layer = self.transpose_for_scores(mixed_query_layer)  # [16, 12, 32, 64]: [batch_size, num_heads, seq_len, head_dim]
    key_layer = self.transpose_for_scores(mixed_key_layer)
    value_layer = self.transpose_for_scores(mixed_value_layer)

    # Take the dot product between "query" and "key" to get the raw attention scores.
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))  # attention_scores: torch.Size([16, 12, 32, 32])
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    if attention_mask is not None:
        # Apply the attention mask (precomputed for all layers in BertModel forward() function)
        attention_scores = attention_scores + attention_mask  # pad positions get a very large negative value, so after softmax they are close to 0

    # Normalize the attention scores to probabilities.
    attention_probs = nn.Softmax(dim=-1)(attention_scores)
    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = self.dropout(attention_probs)  # torch.Size([16, 12, 32, 32])

    # Mask heads if we want to
    if head_mask is not None:
        attention_probs = attention_probs * head_mask

    context_layer = torch.matmul(attention_probs, value_layer)  # torch.Size([16, 12, 32, 64])

    context_layer = context_layer.permute(0, 2, 1, 3).contiguous()  # torch.Size([16, 32, 12, 64])
    new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
    context_layer = context_layer.view(*new_context_layer_shape)  # torch.Size([16, 32, 768])

    outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)
    return outputs

The dimension returned by BertSelfAttention is torch.Size([16, 32, 768]), which is used as the input of BertSelfOutput

  • BertSelfOutput
class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super(BertSelfOutput, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)  # a linear layer that keeps the dimension unchanged
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

The two functions above, BertSelfAttention and BertSelfOutput, return the attention result, which then goes into the next operation of BertLayer: BertIntermediate.

  • BertIntermediate

This one is relatively simple: a Linear layer followed by a GELU activation.

Its output dimension is torch.Size([16, 32, 3072]).

This result then goes into the BertOutput model

  • BertOutput

The output dimension is torch.Size([16, 32, 768]).
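Since these two classes are short, here is a sketch of what they look like (following transformers.modeling_bert; the activation lookup is simplified, so treat gelu as a stand-in for the configured activation):

class BertIntermediate(nn.Module):
    def __init__(self, config):
        super(BertIntermediate, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)  # 768 -> 3072
        self.intermediate_act_fn = gelu  # simplified: the real code looks up config.hidden_act

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states


class BertOutput(nn.Module):
    def __init__(self, config):
        super(BertOutput, self).__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)  # 3072 -> 768
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)  # residual connection + LayerNorm
        return hidden_states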

The output of BertOutput is returned to the BertEncoder class

The BertEncoder result is returned to the BertModel class as encoder_outputs, with dimension torch.Size([16, 32, 768]).

outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]

sequence_output: torch.Size([16, 32, 768])

pooled_output: torch.Size([16, 768]), the output of the [CLS] token passed through a pooling layer (a Linear that keeps the dimension, followed by tanh)

These outputs are returned to BertForSequenceClassification, which performs the classification on pooled_output.
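For completeness, a sketch of that pooling step and the final classification head (following transformers.modeling_bert; the loss computation is omitted):

class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # take the hidden state of the first token ([CLS]) and pass it through Linear + tanh
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output


# back in BertForSequenceClassification:
# pooled_output = self.dropout(pooled_output)  # torch.Size([16, 768])
# logits = self.classifier(pooled_output)      # torch.Size([16, num_labels])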
