A few words up front


Recently I have been looking into PaddlePaddle-related things, so I decided to go through the source code of Baidu's ERNIE. I had not looked at ERNIE 2.0 or ERNIE Tiny before; the overall feeling is that it is very similar to BERT, and I don't know what will change after the update. I will also put together a summary like the one below; those who happen to be studying Paddle or ERNIE are welcome to join me and discuss, haha.

@ 2019.05.16 original content

The BERT model has been out for quite a while now. I have read the paper and some blog posts about it (for example "NLP killer BERT model interpretation" [1]), but I had never carefully looked at how the source code actually implements it. So I took some time to read it and wrote this down to discuss with you.

Note that the source code reading series requires some prior knowledge of NLP, such as the Attention mechanism, the Transformer framework, and python and TensorFlow fundamentals. BERT principles are not the focus of this article.

Attached is a collection of BERT resources: a summary of BERT-related papers, articles and code resources [2].

Today we will introduce the most important part of BERT's model implementation: BertModel. The code is located in

  • the modeling.py module [3]

Besides the explanations outside the code blocks, there are also comments inside them.

Please do point out anything I have interpreted incorrectly.

1. Configuration class (BertConfig)

This part of the code mainly defines the default hyper-parameters of the BERT model, plus a few utility methods for reading and serializing the configuration.

class BertConfig(object):
  """Configuration classes for BERT models."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

“Parameter Meanings”

  • vocab_size: vocabulary size
  • hidden_size: number of neurons in the hidden layers (the model width)
  • num_hidden_layers: number of Transformer encoder layers
  • num_attention_heads: number of heads for multi-head attention
  • intermediate_size: size of the encoder's "intermediate" (feed-forward) layer
  • hidden_act: hidden-layer activation function
  • hidden_dropout_prob: dropout rate of the hidden layers
  • attention_probs_dropout_prob: dropout rate of the attention probabilities
  • max_position_embeddings: maximum sequence length supported by the position embeddings
  • type_vocab_size: vocabulary size of token_type_ids
  • initializer_range: stddev of the truncated_normal_initializer used to initialize the weights

A note on type_vocab_size: it corresponds to Segment A and Segment B in the Next Sentence Prediction task. The default of 16 here is a bit puzzling; the bert_config.json shipped with the downloadable checkpoints sets it to 2, and 2 should indeed be the right value. Refer to this Issue [4].
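To make this concrete, here is a tiny sketch of how the config is typically loaded or built. The checkpoint path below is just a placeholder I made up; only from_json_file / to_json_string from the class above are used:

# a minimal sketch, assuming a released checkpoint has been downloaded;
# "uncased_L-12_H-768_A-12/bert_config.json" is a hypothetical local path
config = BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
print(config.to_json_string())   # dumps all fields, e.g. "type_vocab_size": 2

# or build a (smaller) config directly in code
small_config = BertConfig(vocab_size=30522, hidden_size=256,
                          num_hidden_layers=4, num_attention_heads=4,
                          intermediate_size=1024)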

2. Getting word embeddings (embedding_lookup)

Given input word ids, this function returns the corresponding embeddings together with the embedding table. The lookup is done either with a one-hot matmul or with tf.gather().

def embedding_lookup(input_ids,                      # word ids: [batch_size, seq_length]
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  # The default input shape for this function is [batch_size, seq_length, input_num].
  # If the input is 2-D [batch_size, seq_length], expand it to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])   # [batch_size*seq_length*input_num]
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:   # gather by index
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  # output: [batch_size, seq_length, num_inputs]
  # reshape to: [batch_size, seq_length, num_inputs*embedding_size]
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

“Parameter Meanings”

  • input_ids: word ids, [batch_size, seq_length]
  • vocab_size: size of the embedding vocabulary
  • embedding_size: embedding dimension
  • initializer_range: initialization range of the embedding
  • word_embedding_name: name of the embedding table
  • use_one_hot_embeddings: whether to use one-hot embedding (otherwise tf.gather())
  • return: the embeddings [batch_size, seq_length, embedding_size] and the embedding table
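A tiny usage sketch of embedding_lookup (toy ids and a toy vocabulary that I made up, just to show the shapes):

# a minimal sketch: 2 sentences of length 3, a toy vocabulary of 100 words
input_ids = tf.constant([[7, 42, 0], [13, 5, 0]])          # [2, 3]
embedded, table = embedding_lookup(input_ids,
                                   vocab_size=100,
                                   embedding_size=128,
                                   use_one_hot_embeddings=False)
# embedded: [2, 3, 128], table: [100, 128]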

3. Embedding post-processing (embedding_postprocessor)

We know that the input of the BERT model has three parts: token embedding, segment embedding, and position embedding. In the previous section we only got the token embedding. This function adds the remaining information, applies layer normalization and dropout, and outputs the final embedding. Note that in the Transformer paper the position embedding is a fixed value generated by sin/cos functions, whereas in this implementation it is randomly initialized like an ordinary word embedding and is trainable. The reason for this choice may be that BERT's training data is much larger than the Transformer's, so the model can learn the positions by itself.
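For comparison, the fixed encoding from the Transformer paper looks roughly like the numpy sketch below. This is not part of the BERT code, just a reminder of the sin/cos formula:

import numpy as np

def sinusoid_position_encoding(seq_length, width):
  # PE(pos, 2i)   = sin(pos / 10000^(2i/width))
  # PE(pos, 2i+1) = cos(pos / 10000^(2i/width))
  pos = np.arange(seq_length)[:, None]
  i = np.arange(width)[None, :]
  angle = pos / np.power(10000, (2 * (i // 2)) / width)
  pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
  return pe   # [seq_length, width], fixed, not trained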

def embedding_postprocessor(input_tensor,              # [batch_size, seq_length, embedding_size]
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,  # usually 2
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,  # maximum position encoding,
                                                          # must be >= max_seq_len
                            dropout_prob=0.1):

  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # segment embedding
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # token_type_vocab_size is small, so a one-hot matmul is faster here
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  # position embedding
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # full_position_embeddings has shape [max_position_embeddings, width],
      # but the actual input sequence usually does not reach
      # max_position_embeddings, so to speed up training we slice out only
      # the first seq_length rows
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())
      # output has shape [batch_size, seq_length, width] while
      # position_embeddings is [seq_length, width]; we cannot add them
      # directly, so expand the position encoding to [1, seq_length, width]
      # and let broadcasting do the addition
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output

4. Construct attention_mask

The purpose of this part of the code is to construct the attention_mask. Since every sample is padded to the same length, the self-attention should not attend to the padded positions. The inputs are from_tensor with shape [batch_size, from_seq_length, ...] (here the input_ids) and to_mask with shape [batch_size, to_seq_length]; the output is a mask of shape [batch_size, from_seq_length, to_seq_length].

def create_attention_mask_from_input_mask(from_tensor, to_mask):
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  mask = broadcast_ones * to_mask

  return mask
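A small sketch of what this produces for a padded batch (toy values I made up):

# a minimal sketch: batch of 2, seq_length 3, second sample has one padded token
input_ids  = tf.constant([[25, 120, 34], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
mask = create_attention_mask_from_input_mask(input_ids, input_mask)
# mask shape: [2, 3, 3]; for the second sample every row is [1., 1., 0.],
# i.e. no position is allowed to attend to the padded third token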

5. Attention Layer

This part of the code is the implementation of multi-head attention, taken mainly from the paper "Attention is All You Need". In key-query-value terms, from_tensor acts as the query and to_tensor provides the key and value; when the two are the same tensor, this is self-attention. For a more detailed introduction to attention, see "Understanding the principle and model of the Attention mechanism" [5].

def attention_layer(from_tensor,            # [batch_size, from_seq_length, from_width]
                    to_tensor,              # [batch_size, to_seq_length, to_width]
                    attention_mask=None,    # [batch_size, from_seq_length, to_seq_length]
                    num_attention_heads=1,  # number of attention heads
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,  # dropout of the attention layer
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    # if True, the output shape is
                    #   [batch_size*from_seq_length, num_attention_heads*size_per_head];
                    # if False, the output shape is
                    #   [batch_size, from_seq_length, num_attention_heads*size_per_head]
                    batch_size=None,        # needed if the inputs are 2-D;
                    from_seq_length=None,   # a 3-D tensor may have been reshaped
                    to_seq_length=None):    # to 2-D before being passed in

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    # [batch_size, num_attention_heads, seq_length, width]
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified."# B = Batch size (number of sequences) # F =`from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  # reshape from_tensor and to_tensor to 2-D matrices
  from_tensor_2d = reshape_to_matrix(from_tensor)   # [B*F, from_width]
  to_tensor_2d = reshape_to_matrix(to_tensor)       # [B*T, to_width]

  # feed from_tensor through a fully connected layer to get query_layer
  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query", kernel_initializer= create_Initializer (initializer_range)) # put from_tensor into the full connected layer to get query_layer #`key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key", kernel_initializer=create_initializer(initializer_range)) #`value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value", kernel_initializer= create_Initializer (initializer_range)) # query_layer [B*F, N*H]==>[B, F, N, H]==>[B, N, F, H] query_layer = transpose_for_scores(query_layer, batch_size, Num_attention_heads, from_seq_length, size_per_head) # key_layer [B*T, N*H] ==> [B, T, N, H] ==> [B, N, T, H] key_layer = transpose_for_scores(key_layer, batch_size, Num_attention_heads, to_seq_length, size_per_head`attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # if an element of attention_mask is 1, the expression below gives
    # (1 - 1) * -10000, i.e. the adder is 0;
    # if the element is 0, it gives (1 - 0) * -10000, i.e. the adder is -10000.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # the attention scores are generally not very large, so adding -10000
    # to the masked positions makes them effectively minus infinity;
    # after softmax they become 0, i.e. the masked positions contribute
    # nothing to attention_score
    attention_scores += adder

  # normalize the attention scores to probabilities
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # dropping out entire tokens to attend to looks a bit unusual,
  # but that is what the original Transformer paper does
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer

To sum up, the main flow of attention Layer is as follows:

  • Compute batch_size, from_seq_length and to_seq_length from the input tensors;
  • If the input is a 3D tensor, reshape it to a 2D matrix;
  • from_tensor acts as the query, to_tensor as the key and value; passing them through fully connected layers gives query_layer, key_layer and value_layer;
  • Pass the above tensors through transpose_for_scores to split them into multiple heads;
  • Compute attention_scores and attention_probs according to the formula in the paper (note the attention_mask trick; a small numeric sketch of it follows this list);
  • Multiply the resulting attention_probs by value and return either a 2D or 3D tensor.
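Here is the promised numeric sketch of the attention_mask trick (toy scores that I made up, not taken from the model):

# a minimal numeric sketch of the -10000 adder trick
scores = tf.constant([[2.0, 1.0, 3.0]])   # raw attention scores for 3 positions
mask   = tf.constant([[1.0, 1.0, 0.0]])   # the third position is padding
adder  = (1.0 - mask) * -10000.0          # [0, 0, -10000]
probs  = tf.nn.softmax(scores + adder)
# without the adder, softmax(scores) would be roughly [[0.24, 0.09, 0.67]];
# with the adder, the third probability becomes ~0 and the first two
# renormalize to roughly [[0.73, 0.27, 0.0]]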

6. Transformer

The following code is the core of the famous Transformer, essentially an implementation of "Attention is All You Need". Please refer to the original paper [6] and the original code [7].

def transformer_model(input_tensor,          # [batch_size, seq_length, hidden_size]
                      attention_mask=None,   # [batch_size, seq_length, seq_length]
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,   # feed-forward layer activation function
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  # Note: the output has size hidden_size and there are num_attention_heads
  # heads, each with size_per_head hidden units, so we need
  # hidden_size = num_attention_heads * size_per_head
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # the encoder has residual connections, so the input width must equal hidden_size
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)"
                     % (input_width, hidden_size))

  # reshaping back and forth between 2-D and 3-D tensors is costly,
  # so keep everything 2-D inside the loop and only reshape back at the end
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
      # multi-head attention
        attention_heads = []
        with tf.variable_scope("self") : # self-attention attention_head = attention_layer( from_tensor=layer_input, to_tensor=layer_input, attention_mask=attention_mask, num_attention_heads=num_attention_heads, size_per_head=attention_head_size, attention_probs_dropout_prob=attention_probs_dropout_prob, initializer_range=initializer_range, do_return_2d_tensor=True, batch_size=batch_size, from_seq_length=seq_length, to_seq_length=seq_length) attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # concatenate all the heads back together
          attention_output = tf.concat(attention_heads, axis=-1)

        # linear projection, then dropout + residual + layer norm
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # feed-forward
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # project the feed-forward output back to hidden_size with a linear
      # transformation, then dropout + residual + layer norm
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

This code is best read together with the Transformer architecture diagram from the paper; just remember that BERT uses only the encoder stack and has no decoder at all.
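As a quick sanity check of the shape bookkeeping, here is a small sketch using the BERT-Base numbers (batch_size and seq_length are values I picked arbitrarily):

# a sketch of the per-layer shape bookkeeping for BERT-Base
hidden_size, num_attention_heads = 768, 12
size_per_head = hidden_size // num_attention_heads          # 64
batch_size, seq_length = 8, 128

# inside transformer_model everything stays 2-D:
# layer_input:        [batch_size*seq_length, hidden_size]  = [1024, 768]
# attention_layer ->  [batch_size*seq_length, N*H]          = [1024, 12*64]
# "output" dense  ->  [batch_size*seq_length, hidden_size]  = [1024, 768]
# "intermediate"  ->  [batch_size*seq_length, 3072]
# final "output"  ->  [batch_size*seq_length, hidden_size]  = [1024, 768]
assert hidden_size == num_attention_heads * size_per_head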

7. Function entry (__init__)

Constructor of the BertModel class. With the introduction of the previous sections, we can implement the BERT model.

  def __init__(self,
               config,                        # a BertConfig object
               is_training,
               input_ids,                     # [batch_size, seq_length]
               input_mask=None,               # [batch_size, seq_length]
               token_type_ids=None,           # [batch_size, seq_length]
               use_one_hot_embeddings=False,  # use one-hot; otherwise tf.gather()
               scope=None):
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]
    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings") : # word embedding (self.embedding_output, self.embedding_table) = embedding_lookup( input_ids=input_ids, vocab_size=config.vocab_size, embedding_size=config.hidden_size, initializer_range=config.initializer_range, word_embedding_name="word_embeddings". Use_one_hot_embeddings = USe_one_hot_embeddings) # Add position embedding and Segment embedding # layer norm + Dropout self.embedding_output = embedding_postprocessor( input_tensor=self.embedding_output, use_token_type=True, token_type_ids=token_type_ids, token_type_vocab_size=config.type_vocab_size, token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"): # input_ids is the padding word_ids: [25.120.34.0.0# input_mask is a valid word marker: [1.1.1.0.0] attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask) # transformer module stack #`sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

        # `self.sequence_output` shape is [batch_size, seq_length, hidden_size]
        self.sequence_output = self.all_encoder_layers[-1]

      # the "pooler" converts the encoder output of shape
      # [batch_size, seq_length, hidden_size] to [batch_size, hidden_size]
      with tf.variable_scope("pooler"):
        # take the hidden state of the first token [CLS] of the last layer;
        # it is important for classification tasks.
        # self.sequence_output[:, 0:1, :] gives [batch_size, 1, hidden_size]
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        # then a fully connected layer; the output is still [batch_size, hidden_size]
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

Conclusion

With the above in-depth look at the source code, we will be much more comfortable when using BertModel. Here's a simple example of using the model:

# assume the input has already been converted into word ids, shape=[2, 3]
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])

# segment embedding ids: in the first sample the first two words belong to
# sentence 1 and the third to sentence 2; in the second sample the first word
# belongs to sentence 1, the second to sentence 2, and the third 0 is padding.
# (the original example writes it this way, although using a value of 2 does
# not really feel necessary)
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

config = BertConfig(vocab_size=32000, hidden_size=512,
                    num_hidden_layers=8, num_attention_heads=6,
                    intermediate_size=1024)

model = BertModel(config=config, is_training=True,
                  input_ids=input_ids, input_mask=input_mask,
                  token_type_ids=token_type_ids)

label_embeddings = tf.get_variable(...)

# the first token of the last layer is the [CLS] vector representation,
# which can be used as the sentence embedding
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
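If you need per-token outputs (e.g. for sequence labeling) rather than the sentence-level [CLS] vector, BertModel also exposes get_sequence_output(). A tiny sketch, where num_labels is an assumed task-specific value:

# a minimal sketch for token-level tasks (e.g. NER)
num_labels = 5   # assumed number of tags for the downstream task
sequence_output = model.get_sequence_output()   # [batch_size, seq_length, hidden_size]
token_logits = tf.layers.dense(sequence_output, num_labels)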

The main process of BERT model construction is as follows:

  • Sum the three embeddings (token, segment, position) of the input sequence;
  • Feed the summed embedding into the Transformer encoder and take the output; that is basically it.
  • Embedding -> N * [multi-head attention -> Add(Residual) & Norm -> feed-forward -> Add(Residual) & Norm]
  • See, not that complicated~
  • There are a few other helper functions in the source code that are not hard to understand, so I won't go through them here.

That's all.

References for this article

[1] NLP killer BERT model interpretation: blog.csdn.net/Kaiyuan_sjt…

[2] BERT-related papers, articles and code resources: www.52nlp.cn/bert-paper-…

[3] modeling.py module: github.com/google-rese…

[4] Refer to this Issue: github.com/google-rese…

[5] Understanding the mechanics and models of Attention: blog.csdn.net/Kaiyuan_sjt…

[6] Original paper: arxiv.org/abs/1706.03…

[7] Original code: github.com/tensorflow/…
