This is the 19th day of my participation in the First Challenge 2022. For more details, see: First Challenge 2022.

Introduction:

This is a transcript of my previous CSDN blog post: Huggingface Transformers (Part 1) - Using Transformers, Part 1.

Preface

This part covers the first half of the basics of the Transformers library, namely three topics: a summary of the tasks, a summary of the models, and data preprocessing. Since I am not familiar with many of the models, most of this is machine translation; mistakes are inevitable, and the content is for reference only.

Using Transformers

Summary of the Tasks

This section describes the most common use cases when using the library. The available models allow for many different configurations and have great generality across practical use cases.

These examples make use of auto classes (such as AutoModel), which instantiate a model from a given checkpoint and automatically select the correct model architecture.

For a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These checkpoints are usually pre-trained on large amounts of data and fine-tuned for a specific task. This means the following:

  • Not all models are fine-tuned for all tasks. If you want to fine-tune a model for a specific task, you can use one of the run_$task.py scripts in the examples directory.
  • A fine-tuned model was fine-tuned on a specific dataset. That dataset may or may not overlap with your use case and domain. As mentioned earlier, you can fine-tune the model with the example scripts, or write your own training script.

To run inference on a task, the library provides several mechanisms:

  • Pipelines: very easy-to-use abstractions that require as little as two lines of code.
  • Using a model directly: less abstract, but more flexible and powerful, with direct access to the tokenizer (PyTorch/TensorFlow) and the full inference capability.

Both methods are shown in the following examples.

Sequence Classification

Sequence classification is the task of classifying a sequence into a given number of categories. An example of sequence classification is the GLUE benchmark, which is entirely based on this task. The run_glue.py, run_tf_glue.py, run_tf_text_classification.py, or run_xnli.py scripts can be used to fine-tune a model on a GLUE sequence classification task.

Here is an example of sentiment analysis using a pipeline: identifying whether a sequence is positive or negative. It uses a model fine-tuned on SST-2, which is a GLUE task.

This returns a label (POSITIVE or NEGATIVE) together with a score, as shown below.

from transformers import pipeline
nlp = pipeline("sentiment-analysis")
result = nlp("I hate you") [0]
print(f"label: {result['label']}, with score: {round(result['score'].4)}")
result = nlp("I love you") [0]
print(f"label: {result['label']}, with score: {round(result['score'].4)}")

The output is

label: NEGATIVE, with score: 0.9991
label: POSITIVE, with score: 0.9999

Here is an example of sequence classification using a model to determine whether two sequences are paraphrases of each other. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model, and the weights stored in the checkpoint are loaded into it.
  2. Build a sequence from the two sentences, with the correct model-specific separators, token type IDs, and attention masks (encode() and __call__() take care of this).
  3. Pass this sequence through the model so that it is classified into one of the two available classes: 0 (not a paraphrase) and 1 (is a paraphrase).
  4. Compute the softmax of the result to get probabilities over the classes.
  5. Print the results.

The relevant codes are as follows:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase"."is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
print("-"*24)
# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

The output is as follows

not paraphrase: 10%
paraphrase: 90%
------------------------------------------------------------------------
not paraphrase: 94%
paraphrase: 6%

Extractive Question Answering

Extractive question answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on this task. If you want to fine-tune a model on the SQuAD dataset, you can use the run_qa.py and run_tf_squad.py scripts.

Here is an example of question answering using a pipeline: extracting an answer from a text, given a question. It uses a model fine-tuned on SQuAD.

from transformers import pipeline
nlp = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
"""

result = nlp(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'].4)}, start: {result['start']}, end: {result['end']}")
print("-"*24)
result = nlp(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'].4)}, start: {result['start']}, end: {result['end']}")

The output is as follows

Answer: 'the task of extracting an answer from a text given a question', score: 0.6226, start: 34, end: 95
------------------------
Answer: 'SQuAD dataset', score: 0.5053, start: 147, end: 160

Here is an example of question answering using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model, and the weights stored in the checkpoint are loaded into it.
  2. Define a passage of text and a few questions.
  3. Iterate over the questions and build a sequence from the text and the current question, with the correct model-specific separators, token type IDs, and attention masks.
  4. Pass this sequence through the model. This outputs a span of scores over the whole sequence of tokens (question and text), for both the start and end positions.
  5. Compute the softmax of the result to get probabilities over the tokens.
  6. Fetch the tokens between the identified start and stop positions and convert them to a string.
  7. Print the results.

The relevant codes are as follows:


from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch."""
questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    answer_start = torch.argmax(
        answer_start_scores
    )  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}")

The output is as follows:

Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general-purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch

Language Modeling

Language modeling is the task of fitting a model to a corpus, which can be domain-specific. All popular Transformer-based models are trained with a variant of language modeling: BERT with masked language modeling, GPT-2 with causal language modeling.

Masked Language Modeling

Masked language modeling is the task of masking tokens in a sequence with a mask token and prompting the model to fill the mask with an appropriate token. This allows the model to attend to both the right context (tokens to the right of the mask) and the left context (tokens to the left of the mask). Such training creates a strong basis for downstream tasks that require a bidirectional context, such as SQuAD (question answering, see Lewis, Lui, Goyal et al., part 4.2).

Here is an example of using a pipeline to replace a mask in a sequence.

from transformers import pipeline
from pprint import pprint

nlp = pipeline("fill-mask")
pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))

The output is as follows:

[{'score': 0.17927521467208862,
  'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.',
  'token': 3944,
  'token_str': 'tool'},
 {'score': 0.1134946271777153,
  'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.',
  'token': 7208,
  'token_str': 'framework'},
 {'score': 0.05243523046374321,
  'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.',
  'token': 5560,
  'token_str': 'library'},
 {'score': 0.034935325384140015,
  'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.',
  'token': 8503,
  'token_str': 'database'},
 {'score': 0.028602493926882744,
  'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.',
  'token': 17715,
  'token_str': ' prototype'}]

Here is an example of masked language modeling using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model, and the weights stored in the checkpoint are loaded into it.
  2. Define a sequence with a masked token, placing tokenizer.mask_token where a word should be masked.
  3. Encode the sequence into a list of IDs and find the position of the mask token in that list.
  4. Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and its values are the scores assigned to each token. The model gives higher scores to tokens it deems probable in that context.
  5. Retrieve the top 5 tokens using PyTorch's topk or TensorFlow's top_k method.
  6. Replace the mask token with the predicted tokens and print the results.

The relevant codes are as follows:

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")

sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
inputs = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs == tokenizer.mask_token_id)[1]

token_logits = model(inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

The output is:

Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.

The output prints five sequences containing the top five tokens predicted by the model.

Causal Language Modeling

Causal language modeling is the task of predicting the token that follows a sequence of tokens. In this setting, the model only attends to the left context (the tokens to the left of the mask). Such training is particularly interesting for generation tasks. If you want to fine-tune a model on a causal language modeling task, you can use the run_clm.py script.

Usually, the next token is predicted by sampling from the logits of the last hidden state that the model produces from the input sequence.

Here is an example of using a tokenizer and a model, sampling the next token after an input sequence of tokens with the top_k_top_p_filtering() method.

from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")

sequence = f"Hugging Face is based in DUMBO, New York City, and "
input_ids = tokenizer.encode(sequence, return_tensors="pt")

# get logits of last hidden state
next_token_logits = model(input_ids).logits[:, -1, :]

# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = F.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])

print(resulting_string)

The output is

Hugging Face is based in DUMBO, New York City, and  

This prints the next token, which (hopefully) follows coherently from the original sequence.

In the next section, we will show how this capability is leveraged in generate() to produce multiple tokens up to a user-defined length.

Text Generation

In text generation (also known as open text generation), the goal is to create a coherent portion of text as a continuation of a given context. The following example shows how to generate text using GPT-2 in a pipeline. By default, top-K sampling is applied to all models when used in pipeline, as configured in their respective configurations (see GPT-2 configuration).

from transformers import pipeline

text_generator = pipeline("text-generation")

print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))

The output is:

[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]

Here, the model generates a random text with a total maximum length of 50 tokens from the context "As far as I am concerned, I will". The default arguments of PreTrainedModel.generate() can be overridden directly in the pipeline, as shown above with the max_length argument.

Here is an example of text generation using XLNet and its tokenizer.


from transformers import AutoModelWithLMHead, AutoTokenizer

model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family (except for Alexei and Maria) are discovered. The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the remainder of the story. 1883 Western Siberia, a young Grigori Rasputin is asked by his father and a group of men to perform magic. Rasputin has a vision and denounces  one of the men as a horse thief. Although his father initially slaps him for making such an accusation, Rasputin watches as the man is chased outside and beaten. Twenty years later, Rasputin sees a vision of the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous, with people, even a bishop, begging for his blessing. 
        
       
        """
       
      
prompt = "Today the weather is really nice and I am planning on "

inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")
prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]

print(generated)

The output is:

Today the weather is really nice and I am planning on anning on going out to see a new band in the next few days. I will see some young guys there when I get back. This will likely be the last time that I have been to the Twilight Zone for a long time. I have been wanting to go to the Twilight Zone for a long time but have not been able to go there. Maybe I will have

Text generation is currently possible in PyTorch with GPT-2, OpenAI GPT, CTRL, XLNet, Transfo-XL and Reformer, and with most models in TensorFlow as well. As seen in the examples above, XLNet and Transfo-XL often need padding to work well. GPT-2 is usually a good choice for open-ended text generation because it was trained on millions of web pages with a causal language modeling objective.

Named Entity Recognition

Named entity recognition (NER) is the task of classifying tokens according to a class, for example identifying a token as a person, an organization, or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset, which is entirely based on this task. If you want to fine-tune a model on an NER task, you can use the run_ner.py script.

Here is an example of named entity recognition using Pipeline, specifically trying to identify tokens as belonging to one of nine classes:

  • O, outside of a named entity
  • B-MISC, beginning of a miscellaneous entity right after another miscellaneous entity
  • I-MISC, miscellaneous entity
  • B-PER, beginning of a person's name right after another person's name
  • I-PER, person's name
  • B-ORG, beginning of an organisation right after another organisation
  • I-ORG, organisation
  • B-LOC, beginning of a location right after another location
  • I-LOC, location

It uses a model fine-tuned on CoNLL-2003, which was fine-tuned by @stefan-it from dbmdz. This outputs a list of all the words identified as one of the nine entity classes defined above.

from transformers import pipeline
from pprint import pprint

nlp = pipeline("ner")

sequence = ("Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very"
            "close to the Manhattan Bridge which is visible from the window.")

pprint(nlp(sequence))

The output is:

{'word': 'Hu', 'score': 0.999578595161438, 'entity': 'I-ORG', 'index': 1, 'start': 0, 'end': 2}
{'word': '##gging', 'score': 0.9909763932228088, 'entity': 'I-ORG', 'index': 2, 'start': 2, 'end': 7}
{'word': 'Face', 'score': 0.9982224702835083, 'entity': 'I-ORG', 'index': 3, 'start': 8, 'end': 12}
{'word': 'Inc', 'score': 0.9994880557060242, 'entity': 'I-ORG', 'index': 4, 'start': 13, 'end': 16}
{'word': 'New', 'score': 0.9994345307350159, 'entity': 'I-LOC', 'index': 11, 'start': 40, 'end': 43}
{'word': 'York', 'score': 0.9993196129798889, 'entity': 'I-LOC', 'index': 12, 'start': 44, 'end': 48}
{'word': 'City', 'score': 0.9993793964385986, 'entity': 'I-LOC', 'index': 13, 'start': 49, 'end': 53}
{'word': 'D', 'score': 0.9862582683563232, 'entity': 'I-LOC', 'index': 19, 'start': 79, 'end': 80}
{'word': '##UM', 'score': 0.9514269828796387, 'entity': 'I-LOC', 'index': 20, 'start': 80, 'end': 82}
{'word': '##BO', 'score': 0.933659017086029, 'entity': 'I-LOC', 'index': 21, 'start': 82, 'end': 84}
{'word': 'Manhattan', 'score': 0.9761653542518616, 'entity': 'I-LOC', 'index': 28, 'start': 114, 'end': 123}
{'word': 'Bridge', 'score': 0.9914628863334656, 'entity': 'I-LOC', 'index': 29, 'start': 124, 'end': 130}

Notice how "Hugging Face Inc." was identified as an organization, while "New York City", "DUMBO" and "Manhattan Bridge" were identified as locations.

Here is an example of named entity recognition using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model, and the weights stored in the checkpoint are loaded into it.
  2. Define the list of labels the model was trained on.
  3. Define a sequence with known entities, such as "Hugging Face" as an organization and "New York City" as a location.
  4. Split the words into tokens so that they can be mapped to the predictions. We use a small hack: first fully encoding and decoding the sequence, so that we end up with a string that contains the special tokens.
  5. Encode that sequence into IDs (special tokens are added automatically).
  6. Retrieve the predictions by passing the input to the model and taking the first output. This gives a distribution over the nine possible classes for each token. We take the argmax to retrieve the most likely class for each token.
  7. Zip each token with its prediction and print them.

The code is as follows:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())])

The output is:

[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]

This maps each token to its corresponding prediction. Unlike the pipeline, there is a prediction for every token here, because we did not remove the 0th class ("O"), which means that no particular entity was found on that token.

Text Summarization

Summarization is the task of condensing a document or an article into a shorter text. If you want to fine-tune a model on a summarization task, you can use the run_summarization.py script.

An example of a summary dataset is the CNN/Daily Mail dataset, which consists of long news articles and was created for the summary task.

Here is an example of summarization using a pipeline. It uses a BART model fine-tuned on the CNN/Daily Mail dataset.

Because the summarization pipeline relies on the PreTrainedModel.generate() method, we can directly override its default arguments in the pipeline, such as max_length and min_length, as shown below.

from transformers import pipeline

summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other. In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage. Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the 2010 marriage license application, according to court documents. Prosecutors said the marriages were part of an immigration scam. On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further. After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York  subway through an emergency exit, said Detective Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say. Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages. Any divorces happened only after such filings were  approved. It was unclear whether any of the men will be prosecuted. The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali. Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force. If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18. """

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

The output is:

[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]

Here is an example of summarization using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done with an encoder-decoder model such as BART or T5.
  2. Define the article that should be summarized.
  3. Add the T5-specific prefix "summarize: ".
  4. Generate the summary with the PreTrainedModel.generate() method.

In this example, we use Google's T5 model. Even though it was pre-trained only on a multi-task mixture of datasets (including CNN/Daily Mail), it produces very good results.

from transformers import AutoModelWithLMHead, AutoTokenizer

model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs.tolist()[0]))

The output is:

<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them between 1999 and 2002.</s>

Text Translation

Text translation is the task of translating text from one language to another. If you want to fine-tune the model in a translation task, you can use the run_translation.py script.

An example of a translation dataset is the WMT English to German dataset, which takes English sentences as input data and corresponding German sentences as target data.

Here is an example of translation using a pipeline. It uses a T5 model pre-trained only on a multi-task mixture of datasets (including WMT), yet it produces impressive translation results.

from transformers import pipeline

translator = pipeline("translation_en_to_de")

print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40)) * *Copy the code

The output is:

[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]

Because the translation pipeline relies on the PreTrainedModel.generate() method, we can directly override its default arguments in the pipeline, as shown above with max_length.

Here is an example of translation using a model and a tokenizer. The process is as follows:

  1. Instantiate a tokenizer and a model from the checkpoint name. Translation is usually done with an encoder-decoder model such as BART or T5.
  2. Define the sentence that should be translated.
  3. Add the T5-specific prefix "translate English to German: ".
  4. Use the PreTrainedModel.generate() method to perform the translation.

The code is as follows:

from transformers import AutoModelWithLMHead, AutoTokenizer

model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs.tolist()[0]))

The output is:

<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>

Summary of the Models

All models in this library are summarized below. Familiarity with the original Transformer model is assumed. We focus on high-level differences between models. Each model in the library can be divided into the following categories:

  • Autoregressive models
  • Autoencoding models
  • Sequence-to-sequence models
  • Multimodal models
  • Retrieval-based models

Autoregressive models are pre-trained on the classic language modeling task: guessing the next token after having read all the previous ones. They correspond to the decoder of the original Transformer model, and an attention mask is applied on top of the full sentence so that the attention heads can only see what came before in the text, not what comes after. Although these models can be fine-tuned to achieve good results on many tasks, their most natural application is text generation. A typical example of such a model is GPT.
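
To make the causal mask concrete, here is a minimal sketch (toy tensors only, not the library's internal implementation) of how the attention scores are masked so that each position can only attend to earlier positions:

import torch

seq_len, d = 5, 8
q = torch.randn(seq_len, d)   # toy queries
k = torch.randn(seq_len, d)   # toy keys

scores = q @ k.T / d ** 0.5                              # raw attention scores
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # 1s on and below the diagonal
scores = scores.masked_fill(causal_mask == 0, float("-inf"))
attn = torch.softmax(scores, dim=-1)                     # each row only weights earlier positions

print(attn)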

Autoencoding models are pre-trained by corrupting the input tokens in some way and trying to reconstruct the original sentence. They correspond to the encoder of the original Transformer model in the sense that they get access to the full input without any mask. These models usually build a bidirectional representation of the whole sentence. They can be fine-tuned and achieve good results on many tasks, such as text generation, but their most natural applications are sentence classification or token classification. A typical example of this type of model is BERT.

Note that the only difference between the autoregressive model and the autocoding model is the way the model is pretrained. Therefore, the same architecture can be used for both autoregressive and autocoding models. When a given model is used for both types of pre-training, we place it in the category corresponding to the article that first introduced it.

Sequence-to-sequence models use both the encoder and the decoder of the original Transformer, either for translation tasks or for converting other tasks into sequence-to-sequence problems. They can be fine-tuned for many tasks, but their most natural applications are translation, summarization, and question answering. The original Transformer model is an example of such a model (for translation only), and T5 is an example that can be fine-tuned for other tasks.

Multimodal models mix text input with other types, such as images, and are more specific to a particular task.

Autoregressive models

As mentioned earlier, these models rely on the decoder part of the original Transformer and use an attention mask so that, at each position, the attention heads can only see the tokens that come before the current one.

Original GPT

Improving Language Understanding by Generative Pre-Training, Alec Radford et al.

The first autoregressive model based on the Transformer architecture, pre-trained on the Book Corpus dataset. The library provides versions of the model for language modeling and multi-task language modeling/multiple-choice classification.

GPT-2

Language Models are Unsupervised Multitask Learners, Alec Radford et al.

A bigger and better version of GPT, pre-trained on WebText (web pages from outbound Reddit links with at least 3 karma). The library provides versions of the model for language modeling and multi-task language modeling/multiple-choice classification.

CTRL

CTRL: A Conditional Transformer Language Model for Controllable Generation, Nitish Shirish Keskar et al.

Same as the GPT model, but with the added idea of control codes. Text is generated from a prompt (which can be empty) and one (or several) control codes, which are then used to influence the generation: for example, in the style of a Wikipedia article, a book, or a movie review. The library provides a version of the model for language modeling only.

Transformer-XL

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Zihang Dai et al.

Same as the regular GPT model, but it introduces a recurrence mechanism over two consecutive segments (similar to a regular RNN with two consecutive inputs). In this context, a segment is a sequence of consecutive tokens (for example 512) that may span multiple documents, and segments are fed to the model in order.

Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention scores. This allows the model to pay attention to information in both the previous segment and the current one. By stacking multiple attention layers, the receptive field can be extended to several previous segments.

This changes the positional embeddings to relative positional embeddings (since the regular positional embeddings would give the same results for the current input and the current hidden state at a given position), and requires some adjustments to the way attention scores are computed.

This library provides a version of the model for language modeling only.

Reformer

Reformer: The Efficient Transformer, Nikita Kitaev et al .

An autoregressive Transformer model with many tricks to reduce memory footprint and computation time. These techniques include:

  • Use axial positional encoding (see below for more details). This is a mechanism that avoids having a huge positional encoding matrix (when the sequence length is very large) by factoring it into smaller matrices.
  • Replace traditional attention with LSH (locality-sensitive hashing) attention (see below for more details). This is a technique that avoids computing the full query-key product in the attention layers.
  • Avoid storing the intermediate results of each layer, either by using reversible transformer layers to recover them during the backward pass (subtracting the residuals from the input of the next layer gives them back), or by recomputing them inside a given layer (less efficient than storing them, but saves memory).
  • Compute the feed-forward operations by chunks rather than on the whole batch (a minimal sketch of this trick follows the next paragraph).

With these tricks, the model can take much longer sequences as input than traditional Transformer autoregressive models.
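
As an illustration of the last trick, here is a minimal sketch (assumed toy dimensions and a generic two-layer feed-forward network, not Reformer's actual module) showing that applying the position-wise feed-forward block chunk by chunk gives the same result while keeping the intermediate activations smaller:

import torch
import torch.nn as nn

d_model, d_ff, seq_len, chunk = 16, 64, 1024, 128
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x = torch.randn(1, seq_len, d_model)

full = ffn(x)                                                    # all positions at once
chunked = torch.cat([ffn(part) for part in x.split(chunk, dim=1)], dim=1)

print(torch.allclose(full, chunked, atol=1e-6))                  # should print True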

Note: this model could also be used in an autoencoding setting, but there is no checkpoint pre-trained that way.

This library provides a version of the model for language modeling only.

XLNet

XLNet: Generalized Autoregressive Pretraining for Language Understanding, Zhilin Yang et al.

XLNet is not a traditional autoregressive model, but uses a training strategy built on top of that idea. It permutes the tokens in the sentence and then lets the model use the last n tokens to predict token n+1. Since this is all done with an attention mask, the sentence is actually fed to the model in the correct order; but instead of masking the first n+1 tokens, XLNet uses a mask that hides the previous tokens under some given permutation of 1, ..., sequence length.

Autoencoding Models

As mentioned earlier, these models rely on the encoder portion of the original Transformer and do not use masks, so the model can view all the tags in the attention header. For pre-training, the target is the original sentences and the input is a corrupted version of them.

BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin et al.

BERT corrupts the inputs by random masking; more precisely, during pre-training, a given percentage of tokens (usually 15%) is masked by:

  • a special mask token, with probability 0.8
  • a random token different from the masked one, with probability 0.1
  • the same token (left unchanged), with probability 0.1

Note: the 80%/10%/10% split applies within the selected 15%. That is, 15% of the tokens are chosen to be masked, and of those, 80% are replaced by the mask token. A minimal sketch of this scheme is shown below.
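
Here is a minimal sketch of that masking scheme (assumed toy vocabulary size and mask token id; this is not the library's own data collator):

import torch

vocab_size, mask_token_id = 1000, 103
input_ids = torch.randint(5, vocab_size, (1, 32))   # a fake tokenized sentence
labels = input_ids.clone()

# 1) select 15% of the tokens
selected = torch.rand(input_ids.shape) < 0.15
labels[~selected] = -100                            # only selected positions contribute to the loss

# 2) of those, 80% become the mask token
replace_with_mask = selected & (torch.rand(input_ids.shape) < 0.8)
input_ids[replace_with_mask] = mask_token_id

# 3) of the remaining 20%, half (i.e. 10% overall) become a random token
replace_with_random = selected & ~replace_with_mask & (torch.rand(input_ids.shape) < 0.5)
input_ids[replace_with_random] = torch.randint(5, vocab_size, input_ids.shape)[replace_with_random]

# 4) the rest (10% overall) are left unchanged; the model must still predict them
print(input_ids, labels)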

The model must predict the original sentence, but it also has a second objective: the inputs are two sentences A and B (with a separator token in between). With 50% probability the sentences are consecutive in the corpus, and in the remaining 50% they are unrelated. The model must predict whether the two sentences are consecutive or not.

The library provides a version of the model for language modeling (traditional or masked), next sentence prediction, token classification, sentence classification, multiple choice classification, and question answering.

ALBERT

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, Zhenzhong Lan et al.

Same as BERT, with a few tweaks:

  • The embedding size E is different from the hidden size H, which makes sense because the embeddings are context-independent (one embedding vector represents one token), whereas the hidden states are context-dependent (one hidden state represents a sequence of tokens), so it is more logical to have H >> E. Also, the embedding matrix is large, since it is V x E (V being the vocabulary size). If E < H, it has far fewer parameters (see the sketch after this list).
  • Layers are split into groups of shared parameters (to save memory).
  • Next sentence prediction is replaced by sentence ordering prediction: in the input, we have two consecutive sentences A and B, and we feed either A followed by B or B followed by A. The model must predict whether they have been swapped.
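
To see why the embedding factorization in the first point saves parameters, here is a rough back-of-the-envelope sketch with assumed example sizes (not the exact ALBERT configuration values):

V = 30000   # vocabulary size
H = 4096    # hidden size
E = 128     # embedding size

bert_style = V * H              # one V x H embedding matrix
albert_style = V * E + E * H    # a V x E matrix followed by an E x H projection

print(f"V*H       = {bert_style:,}")      # 122,880,000
print(f"V*E + E*H = {albert_style:,}")    # 4,364,288
print(f"reduction = {bert_style / albert_style:.1f}x")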

The library provides a version of the model for the mask language model, token classification, sentence classification, multiple choice classification, and question answering.

RoBERTa

RoBERTa: A Robustly Optimized BERT Pretraining Approach, Yinhan Liu et al.

Same as BERT, but with better pre-training techniques:

  • Dynamic masking: tokens are masked differently at each epoch, whereas BERT masks them once and for all.
  • No NSP (next sentence prediction) loss; instead of putting just two sentences together, chunks of contiguous text are packed together to reach 512 tokens (so the sentences may span several documents).
  • Training with larger batches.
  • BPE with bytes as a subunit rather than characters (because of Unicode characters).

The library provides a version of the model for the mask language model, token classification, sentence classification, multiple choice classification, and question answering.

DistilBERT

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, Victor Sanh et al.

Same as BERT, but smaller. It is trained by distillation of the pre-trained BERT model, meaning that it is trained to predict the same probabilities as the larger model. The actual objectives are a combination of the following (a minimal sketch of these losses follows the list):

  • Matching the probabilities predicted by the teacher model
  • Predicting the masked tokens correctly (but with no next-sentence objective)
  • A cosine similarity between the hidden states of the student and the teacher model
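
Here is a minimal sketch of those three objectives, with toy tensors standing in for real model outputs (the temperature value and the equal weighting of the losses are assumptions, not DistilBERT's exact hyperparameters):

import torch
import torch.nn.functional as F

vocab_size, hidden, T = 1000, 768, 2.0

student_logits = torch.randn(8, vocab_size)   # student predictions at masked positions
teacher_logits = torch.randn(8, vocab_size)   # teacher predictions at the same positions
true_token_ids = torch.randint(0, vocab_size, (8,))
student_hidden = torch.randn(8, hidden)
teacher_hidden = torch.randn(8, hidden)

# 1) match the teacher's (temperature-softened) probabilities
loss_ce = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                   F.softmax(teacher_logits / T, dim=-1),
                   reduction="batchmean") * T ** 2

# 2) standard masked language modeling loss against the true tokens
loss_mlm = F.cross_entropy(student_logits, true_token_ids)

# 3) align the directions of student and teacher hidden states
loss_cos = F.cosine_embedding_loss(student_hidden, teacher_hidden,
                                   torch.ones(student_hidden.size(0)))

loss = loss_ce + loss_mlm + loss_cos
print(loss)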

The library provides a version of the model for the mask language model, token classification, sentence classification, and question answering.

ConvBERT

ConvBERT: Improving BERT with Span-based Dynamic Convolution, Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.

Pre-trained language models like BERT and its variants have recently achieved impressive results on a variety of natural language understanding tasks. However, BERT relies heavily on global self-attention blocks, so its memory footprint and computational cost are high. Although all the attention heads query the entire input sequence to generate the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which implies computational redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads and model local dependencies directly. The new convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more effective at both global and local context learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments show that ConvBERT significantly outperforms BERT and its variants on various downstream tasks, with lower training cost and fewer model parameters. Notably, the ConvBERT-base model achieves a GLUE score of 86.4, 0.7 points higher than ELECTRA-base, while using less than a quarter of its training cost.

The library provides a version of the model for the mask language model, token classification, sentence classification, and question answering.

XLM

Cross-lingual Language Model Pretraining, Guillaume Lample and Alexis Conneau

A Transformer model trained on several languages. There are three different types of training for this model, and the library provides checkpoints for all of them:

  • Causal language modeling (CLM), the traditional autoregressive training (so this model could also be in the previous section). One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens that may span several documents in that language.

  • Masked language modeling (MLM), like RoBERTa. One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens that may span several documents in that language, with dynamic masking of the tokens.

  • A combination of masked language modeling (MLM) and translation language modeling (TLM). This consists of concatenating a sentence in two different languages, with random masking. To predict one of the masked tokens, the model can use both the surrounding context in language 1 and the context given by language 2.

Checkpoints indicate which method was used for pre-training by including clm, mlm, or mlm-tlm in their names. On top of positional embeddings, the model also has language embeddings. When training with MLM/CLM, this gives the model an indication of the language used, and when training with MLM+TLM, an indication of the language used for each part.

The library provides a version of the model for language modeling, token classification, sentence classification, and question answering.

XLM-RoBERTa

Unsupervised Cross-lingual Representation Learning at Scale, Alexis Conneau et al.

Uses RoBERTa's tricks on the XLM approach, but without the translation language modeling objective; it only uses masked language modeling on sentences coming from one language. However, the model is trained on many more languages (100) and does not use language embeddings, so it can detect the input language by itself.

The library provides a version of the model for mask language modeling, token classification, sentence classification, multiple choice classification, and question answering.

FlauBERT

FlauBERT: Unsupervised Language Model Pre-training for French, Hang Le et al.

Like RoBERTa, but without sentence ordering prediction (so it is trained on the MLM objective only). The library provides a version of the model for language modeling and sentence classification.


ELECTRA

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, Kevin Clark et al.

ELECTRA is a Transformer model pre-trained with the help of another (small) masked language model. The inputs are corrupted by that language model, which takes a randomly masked input text and outputs a text in which ELECTRA has to predict which tokens are original and which have been replaced. Like GAN training, the small language model is trained for a few steps (but with the original texts as the objective, not to fool the ELECTRA model as in a traditional GAN setting), then the ELECTRA model is trained for a few steps.

The library provides a version of the model for mask language model, token classification and sentence classification.

Funnel Transformer

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing, Zihang Dai et al.

Funnel Transformer is a Transformer model that uses pooling, somewhat like the ResNet model: layers are grouped in blocks, and at the beginning of each block (except for the first block), the hidden state is pooled in the sequence dimension. In this way, their lengths are divided by two, which speeds up the calculation of the next hidden state. All pre-trained models have three blocks, meaning that the final sequence of hidden states is a quarter the length of the original sequence.

This is not a problem for tasks like classification, but for tasks like masked language modeling or token classification, we need hidden states with the same sequence length as the original input. In those cases, the final hidden states are upsampled to the input sequence length and go through two additional layers. That is why there are two versions of each checkpoint: the version with the "-base" suffix contains only the three blocks, while the version without the suffix contains the three blocks plus the upsampling head with its additional layers.

Available pre-training models use the same pre-training objectives as ELECTRA.

The library provides a version of the model for mask language model, token classification, sentence classification, multiple choice classification and question answering.

Longformer

Longformer: The Long-Document Transformer, Iz Beltagy et al.

A Transformer model that replaces the attention matrices with sparse matrices to go faster. Often, the local context (e.g., what are the two tokens to the left and right?) is enough to act on a given token. Some pre-selected input tokens still get global attention, but the attention matrix has far fewer parameters, resulting in a speed-up. See the local attention section below for more information.

It’s pre-trained, just like RoBERTa.

Note: this model could also be used in an autoregressive setting, but there is no checkpoint pre-trained that way.

The library provides a version of the model for mask language model, token classification, sentence classification, multiple choice classification and question answering.

Sequence-to-sequence models

As mentioned earlier, these models retain the encoders and decoders of the original Transformer.

BART

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, Mike Lewis et al.

A sequence-to-sequence model with an encoder and a decoder. The encoder is fed a corrupted version of the tokens, and the decoder is fed the original tokens (but with a mask to hide future words, like a regular Transformer decoder). A composition of the following transformations is applied to the encoder input as the pre-training task (a sketch of these corruptions follows the list):

  • mask random tokens (as in BERT)
  • delete random tokens
  • replace a span of k tokens with a single mask token
  • permute the sentences
  • rotate the document so that it starts at a specific token
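
Here is a minimal sketch of these corruptions on a whitespace-tokenized toy sentence (the span positions are hand-picked, and sentence permutation is omitted because the toy example has a single sentence; the real implementation works on subword ids and samples span lengths from a Poisson distribution):

import random

random.seed(0)
MASK = "<mask>"
tokens = "the quick brown fox jumps over the lazy dog .".split()

def mask_tokens(toks, p=0.3):
    # randomly replace tokens by the mask token
    return [MASK if random.random() < p else t for t in toks]

def delete_tokens(toks, p=0.3):
    # randomly drop tokens
    return [t for t in toks if random.random() >= p]

def infill_span(toks, start=2, length=3):
    # replace a span of `length` tokens with a single mask token
    return toks[:start] + [MASK] + toks[start + length:]

def rotate(toks, start=4):
    # the document now starts at the chosen token
    return toks[start:] + toks[:start]

print(mask_tokens(tokens))
print(delete_tokens(tokens))
print(infill_span(tokens))
print(rotate(tokens))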

This library provides a version of this model for conditional generation and sequence classification.

Pegasus

PEGASUS: Pre-training with Extracted Gap-sentences forAbstractive Summarization, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.

Pegasus is pre-trained jointly on two self-supervised objective functions: masked language modeling (MLM) and a novel summarization-specific pre-training objective called gap sentence generation (GSG).

  • MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (as in BERT)
  • GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a causal mask to hide future words, like a regular autoregressive Transformer decoder

Unlike BART, Pegasus's pre-training task is intentionally similar to summarization: important sentences are masked and must be generated as one output sequence from the remaining sentences, similar to an extractive summary.

This library provides a version of this model for conditional generation, which should be used for summarization.

MarianMT

Marian: Fast Neural Machine Translation in C++, Marcin Junczys-Dowmunt et al.

A framework for translation models, using the same architecture as BART.

This library provides a version of this model for conditional generation.

T5

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel et al.

Uses the traditional Transformer model (with a slight change in the positional embeddings, which are learned at each layer). To be able to operate on all NLP tasks, it converts them into text-to-text problems by using specific prefixes: "summarize: ", "question: ", "translate English to German: ", and so on.

Pre-training includes both supervised and self-supervised training. Supervised training is conducted on downstream tasks provided by the GLUE and SuperGLUE benchmarks (converted into text-to-text tasks, as explained above).

Self-supervised training uses corrupted tokens: 15% of the tokens are randomly removed and replaced by individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced by a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original sentence, and the target is then the dropped-out tokens delimited by their sentinel tokens.

For example, if we have the sentence "My dog is very cute ." and we decide to remove the tokens "dog", "is" and "cute", the encoder input becomes "My <x> very <y> ." and the target becomes "<x> dog is <y> cute <z>".
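
Here is a minimal sketch of that span-corruption scheme on whitespace tokens (hand-picked drop positions and simplified sentinel names; the real preprocessing works on SentencePiece ids and samples spans at random):

tokens = "My dog is very cute .".split()
drop = {1, 2, 4}          # positions of "dog", "is", "cute"

encoder_input, target = [], []
sentinels = iter(["<x>", "<y>", "<z>"])
current = None

for i, tok in enumerate(tokens):
    if i in drop:
        if current is None:                # start of a new dropped span
            current = next(sentinels)
            encoder_input.append(current)  # one sentinel per contiguous span
            target.append(current)
        target.append(tok)
    else:
        current = None
        encoder_input.append(tok)

target.append(next(sentinels))             # closing sentinel at the end of the targets

print(" ".join(encoder_input))  # My <x> very <y> .
print(" ".join(target))         # <x> dog is <y> cute <z>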

This library provides a version of this model for conditional generation.

MT5

mT5: A massively multilingual pre-trained text-to-text transformer, Linting Xue et al.

The model architecture is the same as T5. The pre-training objective of mT5 includes T5's self-supervised training, but not T5's supervised training. mT5 is trained on 101 languages.

This library provides a version of this model for conditional generation.

MBart

Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

The model architecture and pre-training objective are the same as BART, but MBart is trained on 25 languages and is intended for supervised and unsupervised machine translation. MBart is one of the first methods to pre-train a complete sequence-to-sequence model by denoising full texts in multiple languages.

This library provides a version of this model for conditional generation.

ProphetNet

ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.

ProphetNet introduces a new sequence to sequence pretraining goal called future N-gram prediction. In a future N-gram prediction, the model simultaneously predicts the next N tokens at each time step based on the previous context tokens, rather than just predicting a single next token. Future N-gram predictions explicitly encourage models to plan future tokens and prevent overfitting of strong local correlations. The model architecture is based on the original Transformer, but replaces the “standard” self-attention mechanism in the decoder with a main self-attention mechanism.

The library provides a pre-trained version of the model for conditional generation and a fine-tuned version for summaries.

XLM-ProphetNet

ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.

XLM-ProphetNet has the same model architecture and pre-training objective as ProphetNet, but XLM-ProphetNet is pre-trained on XGLUE, a cross-lingual dataset.

The library provides a pre-trained version of this model for multilingual conditional generation, and fine-tuned versions for headline generation and question generation, respectively.

Multimodal Models

There is one multimodal model in the library, and it has not been pre-trained in a self-supervised fashion like the others.

MMBT

Supervised Multimodal Bitransformers for Classifying Images and Text, Douwe Kiela et al.

A Transformer model for multimodal settings, combining text and images to make predictions. The Transformer model takes as inputs the embeddings of the tokenized text and the final activations of a ResNet pre-trained on images (after the pooling layer), which go through a linear layer (to go from the number of ResNet features to the hidden state dimension of the Transformer).

The different inputs are concatenated, and a segment embedding is added on top of the positional embeddings to let the model know which parts of the input vector correspond to the text and which to the image.

Pre-trained models are only suitable for classification.

Retrieval-based Models

Some models use document retrieval during (pre-) training and reasoning to answer open domain questions.

DPR

Dense Passage Retrieval for Open-Domain Question Answering, Vladimir Karpukhin et al.

Dense Passage Retrieval (DPR) is a set of state-of-the-art tools and models for open-domain question-answering research.

DPR includes three models:

  • Question encoder: encodes questions as vectors
  • Context encoder: encodes contexts as vectors
  • Reader: extracts the answer to the question from the retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).

DPR's pipeline (not yet implemented) uses a retrieval step to find the top k contexts for a given question, then calls a reader with the question and the retrieved documents to get the answer.
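As a quick illustration of the encoder side (a minimal sketch assuming the facebook/dpr-ctx_encoder-single-nq-base checkpoint; this example is not from the original post), a passage can be embedded like this:

from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
model = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

# Encode a passage into a dense vector; questions are encoded analogously with
# DPRQuestionEncoder, and retrieval is then a nearest-neighbour search over vectors.
input_ids = tokenizer("Hello, is my dog cute?", return_tensors="pt")["input_ids"]
embedding = model(input_ids).pooler_output
print(embedding.shape)  # (1, 768)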

RAG

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela

Retrieval-Augmented Generation (RAG) models combine the strengths of pre-trained dense retrieval (DPR) and seq2seq models. RAG models retrieve documents, pass them to a seq2seq model, and then marginalize over them to produce outputs. The retriever and seq2seq modules are initialized from pre-trained models and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.

Two models are available for generation: RAG-Token and RAG-Sequence.
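As a rough sketch of how generation might look (this example is not from the original post; it assumes the facebook/rag-token-nq checkpoint with its dummy retrieval index, which additionally requires the datasets and faiss packages):

from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# The model retrieves documents for the question, feeds them to the seq2seq
# generator, and marginalizes over them during generation.
inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))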

More technical Aspects

Full vs sparse attention

Most Transformer models use full attention, in the sense that the attention matrix is square. This can be a big computational bottleneck when you have very long texts. Longformer and Reformer are two models that try to be more efficient by using a sparse version of the attention matrix to speed up training.

LSH attention

Reformer uses LSH attention. In $\mathrm{softmax}(QK^T)$, only the largest elements make useful contributions. So for each query q in Q, we can consider only the keys k in K that are close to q. A hash function is used to determine whether q and k are close. The attention mask is modified to mask the current token (except at the first position), because it would give a query and a key that are equal (and therefore very similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by the n_rounds parameter) and then averaged together.
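To make the hashing step concrete, here is a toy sketch of angular LSH bucketing (an illustration only, not the Reformer implementation; in Reformer the queries and keys are additionally tied): vectors pointing in similar directions tend to land in the same bucket, and attention is then restricted to tokens within the same bucket.

import torch

def lsh_buckets(x, n_buckets=8, n_rounds=4):
    # x: (batch, seq_len, dim). Project onto random directions and take the
    # argmax over [xR, -xR]; similar vectors tend to share the same bucket.
    batch, seq_len, dim = x.shape
    R = torch.randn(n_rounds, dim, n_buckets // 2)
    proj = torch.einsum("bld,rdh->brlh", x, R)
    return torch.argmax(torch.cat([proj, -proj], dim=-1), dim=-1)  # (batch, n_rounds, seq_len)

qk = torch.randn(1, 16, 64)       # shared query/key vectors for 16 tokens
print(lsh_buckets(qk)[0, 0])      # bucket id of each token for the first hash round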

Local attention

Longformer uses local attention: often, the local context (e.g., what are the two tokens to the left and right?) is enough to act on a given token. In addition, by stacking attention layers that have a small window, the last layer has a receptive field larger than just the tokens in the window, allowing it to build a representation of the whole sentence.

Some pre-selected input tokens also get global attention: for those few tokens, the attention matrix can access all tokens, and the process is symmetric: all other tokens have access to those particular tokens (on top of the ones in their local window). This is shown in Figure 2d of the paper, which gives an example attention mask.

Using attention matrices with fewer parameters in this way allows the model to accept inputs with a larger sequence length.
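As a toy sketch (an illustration only, not Longformer's actual implementation), a local attention mask with a sliding window plus a few global positions could be built like this:

import torch

def local_attention_mask(seq_len, window, global_positions=()):
    # True where attention is allowed: each token sees tokens within +/- window
    # positions; a few pre-selected tokens see (and are seen by) every token.
    i = torch.arange(seq_len)
    mask = (i[None, :] - i[:, None]).abs() <= window
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

print(local_attention_mask(8, 1, global_positions=[0]).int())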

Other tricks

The Reformer uses axial positional encodings: in traditional Transformer models, the positional encoding $E$ is a matrix of size $l \times d$, where $l$ is the sequence length and $d$ is the dimension of the hidden state. If you have very long texts, this matrix can be huge and take up far too much space on the GPU. To mitigate this, axial positional encodings split the large matrix $E$ into two smaller matrices $E_1$ and $E_2$, with dimensions $l_1 \times d_1$ and $l_2 \times d_2$, such that $l_1 \times l_2 = l$ and $d_1 + d_2 = d$ (with the product of the lengths, this ends up much smaller). The embedding for time step $j$ in $E$ is obtained by concatenating the embedding for time step $j \% l_1$ in $E_1$ with the embedding for $j // l_1$ in $E_2$.
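A minimal sketch of this factorization (illustrative sizes only, not Reformer's actual code):

import torch

# Factorize a (l x d) positional table into (l1 x d1) and (l2 x d2) pieces,
# with l = l1 * l2 and d = d1 + d2, and build position j by concatenation.
l1, l2, d1, d2 = 16, 8, 3, 5          # so l = 128 and d = 8
E1 = torch.randn(l1, d1)
E2 = torch.randn(l2, d2)

def axial_embedding(j):
    return torch.cat([E1[j % l1], E2[j // l1]])

print(axial_embedding(37).shape)      # torch.Size([8])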

Preprocessing data

In this tutorial, we will explore how to pre-process data using 🤗 Transformers. The main tool for this is what we call a tokenizer. We can use the tokenizer class associated with the model we want to use, or we can build one directly with the AutoTokenizer class.

As we saw in the quick tour, the tokenizer first splits the given text into words (or parts of words, punctuation symbols, and so on), usually called tokens. It then converts those tokens into numbers so that it can build a tensor out of them and feed it to the model. It also adds any additional inputs the model might need to work properly.

Note: If you plan to use a pre-trained model, it is important to use its associated pre-trained tokenizer. It will split the input text into tokens the same way as in the pre-training corpus, and it will use the same correspondence between tokens and indices (which we usually call the vocab) as during pre-training.

To automatically download the vocab used during pre-training or fine-tuning of a given model, use the from_pretrained() method.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
Copy the code

Base use

A PreTrainedTokenizer has many methods, but the only one you need to remember for preprocessing is its __call__: you just need to feed your sentence to your tokenizer object.

Note: __call__ is Python's magic method that lets a class instance be called like a function; when the instance is called, its class's __call__ method is executed.

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)
Copy the code

The output is:

{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Copy the code

This returns a dictionary mapping strings to lists of integers. input_ids is the list of indices corresponding to each token in the sentence. We will see below what attention_mask is used for, and the purpose of token_type_ids in the next section.

The tokenizer can decode a list of token IDs back into a proper sentence.

tokenizer.decode(encoded_input["input_ids"])
Copy the code

The output is:

"[CLS] Hello, I'm a single sentence! [SEP]"
Copy the code

As you can see, the tokenizer automatically added some of the special tokens the model expects. Not all models need special tokens; for example, if we had used gpt2-medium instead of bert-base-cased to create the tokenizer, we would see the same sentence as the original one here. You can disable this behavior by passing add_special_tokens=False (this is only recommended if you added those special tokens yourself).
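A quick check of that flag (an illustrative call using the tokenizer defined above; not from the original post):

encoded = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # no [CLS] / [SEP] this time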

If you have several sentences to process, you can do this efficiently by sending them to the tokenizer as a list:

batch_sentences = ["Hello I'm a single sentence"."And another sentence"."And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
Copy the code

The output is:

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102], [101, 1262, 1330, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
Copy the code

Again we get a dictionary, this time with values that are lists of lists of integers.

If you are sending several sentences to the tokenizer at a time in order to build a batch to feed the model, you will probably want to:

  • Pad each sentence to the maximum length in the batch.
  • Truncate each sentence to the maximum length the model can accept (if applicable).
  • Return tensors.

All of this can be done with the following options when feeding the list of sentences to the tokenizer.

batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch)
Copy the code

The output is:

{'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
        			[ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
        			[ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]), 
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        				[0, 0, 0, 0, 0, 0, 0, 0, 0],
        				[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        					[1, 1, 1, 1, 1, 0, 0, 0, 0],
        					[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
Copy the code

It returns a dictionary with string keys and tensor values. Now we can see what attention_mask is all about: it indicates which tokens the model should pay attention to, and which it should not (because they correspond to padding in this case).

Note that if your model does not have a maximum length associated with it, the command above will throw a warning. You can safely ignore it, or pass verbose=False to stop the tokenizer from throwing that kind of warning.

Preprocessing pairs of sentences

Sometimes you need to feed your model a pair of sentences, for example if you want to classify whether two sentences in a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is represented as follows:

[CLS] Sequence A [SEP] Sequence B [SEP]

We can provide the two sentences as two arguments (not as a list, because a list of two sentences would be interpreted as a batch of two single sentences, as we saw earlier), which encodes the pair in the format the model expects. This again returns a dict mapping strings to lists of integers.

encoded_input = tokenizer("How old are you?"."I'm 6 years old")
print(encoded_input)
Copy the code

The output is:

{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Copy the code

This shows us what token_type_ids are for: they tell the model which part of the input corresponds to the first sentence and which part corresponds to the second sentence. Note that token_type_ids are not required or handled by all models. By default, a tokenizer will only return the inputs that its associated model expects. You can force the return (or the non-return) of any of these special arguments with return_input_ids or return_token_type_ids.
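For instance (an illustrative call, not from the original post), token_type_ids can be dropped from the output like this:

encoded = tokenizer("How old are you?", "I'm 6 years old", return_token_type_ids=False)
print(encoded.keys())  # no 'token_type_ids' key in the result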

If we decode the token IDs we obtained, we will see that the special tokens have been added appropriately.

tokenizer.decode(encoded_input["input_ids"])
Copy the code

The output is:

"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
Copy the code

If you have a list of sequence pairs to process, you should feed them to the tokenizer as two lists: the list of first sentences and the list of second sentences.


batch_sentences = ["Hello I'm a single sentence"."And another sentence"."And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence"."And I should be encoded with the second sentence"."And I go with the very last one"]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)
Copy the code

The output is:

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102], [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
Copy the code

As we can see, it returns a dictionary where each value is a list of lists of integers. To double-check what is fed to the model, we can decode each list in input_ids one by one.

for ids in encoded_inputs["input_ids"] :print(tokenizer.decode(ids))
Copy the code

The output is:

[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
Copy the code

As before, you can automatically pad the input up to the maximum sentence length in the batch, truncate to the maximum length the model can accept, and return tensors directly, as follows.

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch)
Copy the code

The output is:

{'input_ids': tensor([[  101,  8667,   146,   112,   182,   170,  1423,  5650,   102,   146,   112,   182,   170,  5650,  1115,  2947,  1114,  1103,  1148,  5650,   102],
        [  101,  1262,  1330,  5650,   102,  1262,   146,  1431,  1129, 12544,  1114,  1103,  1248,  5650,   102,     0,     0,     0,     0,     0,     0],
        [  101,  1262,  1103,  1304,  1304,  1314,  1141,   102,  1262,   146,  1301,  1114,  1103,  1304,  1314,  1141,   102,     0,     0,     0,     0]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
Copy the code

Everything you always wanted to know about padding and truncation

We have seen the commands that will work for most cases (pad your batch to the maximum sentence length and truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The three arguments you need to know for this are padding, truncation and max_length.

  • padding controls the padding. It can be a boolean or a string:

    • True or 'longest' pads each sentence to the longest sequence in the batch (no padding is applied if you only provide a single sequence).
    • 'max_length' pads each sentence to the length specified by the max_length argument, or to the maximum length accepted by the model if max_length is not provided (max_length=None), for example 512 tokens for BERT. Padding is still applied if you only provide a single sequence.
    • False or 'do_not_pad' does not pad the sequences. As we saw earlier, this is the default behavior.
  • truncation controls the truncation. It can be a boolean or a string:

    • True or 'longest_first' truncates to the maximum length specified by the max_length argument, or the maximum length accepted by the model if max_length is not provided (max_length=None). This truncates token by token, removing a token from the longest sequence in the pair until the proper length is reached.
    • 'only_first' truncates to the maximum length specified by the max_length argument, or the maximum length accepted by the model if max_length is not provided (max_length=None). If a pair of sequences (or a batch of pairs) is provided, this only truncates the first sentence of each pair.
    • 'only_second' truncates to the maximum length specified by the max_length argument, or the maximum length accepted by the model if max_length is not provided (max_length=None). If a pair of sequences (or a batch of pairs) is provided, this only truncates the second sentence of each pair.
    • False or 'do_not_truncate' does not truncate the sequences. As we saw earlier, this is the default behavior.
  • max_length controls the length used for padding/truncation. It can be an integer or None, in which case it defaults to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is disabled.

The following table summarizes the recommended way to set up padding and truncation. If you use pairs of input sequences in the following examples, you can replace truncation=True with a STRATEGY selected in ['only_first', 'only_second', 'longest_first'], i.e. truncation='only_second' or truncation='longest_first', to control how both sequences in the pair are truncated as detailed above.

| Truncation | Padding | Instruction |
| --- | --- | --- |
| no truncation | no padding | tokenizer(batch_sentences) |
| no truncation | padding to max sequence in batch | tokenizer(batch_sentences, padding=True) or tokenizer(batch_sentences, padding='longest') |
| no truncation | padding to max model input length | tokenizer(batch_sentences, padding='max_length') |
| no truncation | padding to specific length | tokenizer(batch_sentences, padding='max_length', max_length=42) |
| truncation to max model input length | no padding | tokenizer(batch_sentences, truncation=True) or tokenizer(batch_sentences, truncation=STRATEGY) |
| truncation to max model input length | padding to max sequence in batch | tokenizer(batch_sentences, padding=True, truncation=True) or tokenizer(batch_sentences, padding=True, truncation=STRATEGY) |
| truncation to max model input length | padding to max model input length | tokenizer(batch_sentences, padding='max_length', truncation=True) or tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY) |
| truncation to max model input length | padding to specific length | Not possible |
| truncation to specific length | no padding | tokenizer(batch_sentences, truncation=True, max_length=42) or tokenizer(batch_sentences, truncation=STRATEGY, max_length=42) |
| truncation to specific length | padding to max sequence in batch | tokenizer(batch_sentences, padding=True, truncation=True, max_length=42) or tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42) |
| truncation to specific length | padding to max model input length | Not possible |
| truncation to specific length | padding to specific length | tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42) or tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42) |

To verify these instructions experimentally, I use sentence pairs.

Start by defining the batch of pairs; note that the i-th element of batch_sentences and the i-th element of batch_of_second_sentences form a pair.

batch_sentences = ["Hello I'm a single sentence"."And another sentence"."And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence"."And I should be encoded with the second sentence"."And I go with the very last one"]
Copy the code

No truncation

First, no truncation. There are four cases: no padding, padding to max sequence in batch, padding to max model input length, and padding to specific length.

no padding

batch = tokenizer(batch_sentences, batch_of_second_sentences)
for ids in batch['input_ids']:   # the printing code is the same in later examples and is omitted there
    print("= = = =" * 36)
    print(len(tokenizer.convert_ids_to_tokens(ids)))
    print(tokenizer.convert_ids_to_tokens(ids))
    print()
Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 15 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 17 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]']Copy the code

This is the output for the sentence pairs after tokenization; three special tokens are added on top of the tokens produced by the tokenizer.

It can be seen that with no padding and no truncation set, nothing is padded or truncated (sequences are only truncated when truncation is requested, even if they exceed the model's maximum input length).

Since the printout code is the same, the subsequent printout code is omitted.

padding to max sequence in batch

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True)
# is equivalent to
# batch = tokenizer(batch_sentences, batch_of_second_sentences, padding='longest')
Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']Copy the code

In this case, the text is already padded to the longest sentence.

padding to max model input length

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding='max_length')
Copy the code

In this case, the text is automatically padded to the longest input the model can accept, which is 512 for BERT. The output (only a few of the pads are shown, since everything is padded to length 512) is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 512 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]', '[PAD]', ] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 512 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]', '[PAD]',] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 512 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]', '[PAD]',]Copy the code

padding to specific length

Similarly, we can set max_length ourselves.

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding='max_length', max_length=24)
Copy the code
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 24 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]', '[PAD]', '[PAD]', '[PAD]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 24 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 24 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']Copy the code

With max_length=24 set, you can see that the text is padded up to length 24.

truncation to max model input length

no padding

batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation=True)
Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 15 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 17 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]']Copy the code

Meanwhile, for sentence pairs, we tested the following three calls, which produce the same output:

batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation='only_first')
batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation='only_second')
batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation='longest_first')
Copy the code

Output as above, no longer displayed.

padding to max sequence in batch

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True)
Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']Copy the code

The text here is padded to the same length and not truncated, because it is well below the model's maximum input length.

Then test

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation='only_first')
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation='only_second')
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation='longest_first')
Copy the code

The output is exactly the same as above.

padding to max model input length

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding='max_length', truncation=True)
Copy the code

The whole output is padded to length 512.

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 512 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]', '[PAD]', ] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 512 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]', '[PAD]',] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 512 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]', '[PAD]',]Copy the code

Similarly, the following three lines of code were tested and the results were the same

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding='max_length', truncation='only_first')
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding='max_length', truncation='only_second')
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding='max_length', truncation='longest_first')
Copy the code

At the same time, padding to a specific length is not possible under the truncation to max model input length setting.

truncation to specific length

no padding

batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation=True, max_length=16)
Copy the code

Here the maximum length is set to 16 and the output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 16 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 15 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 16 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', '[SEP]']Copy the code

As you can see, the first and third pairs are truncated to 16 tokens, while the second pair (15 tokens) is neither truncated nor padded.

Next, test the three strategies ['only_first', 'only_second', 'longest_first'].

The first is

batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation='only_first', max_length=16)
Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 16 ['[CLS]', 'Hello', 'I', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 15 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 16 ['[CLS]', 'And', 'the', 'very', 'very', 'last', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]']Copy the code

As you can see, with the only_first setting, only the first sentence of each pair is truncated.

batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation='only_second', max_length=16)
Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 16 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 15 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 16 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', '[SEP]']Copy the code

As you can see, with the only_second setting, only the second sentence of each pair is truncated.

Finally, with longest_first:

batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation='longest_first', max_length=16)
Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 16 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 15 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 16 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', '[SEP]']Copy the code

In this case, the longest sentence in the pair is truncated first, one token at a time (the second sentence may be truncated first; after several tokens are removed, the first sentence may become the longest and be truncated in turn, and so on).

Note that the truncation length must leave some content in both sentences (at least 1 token each), otherwise an error is raised.

batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation='only_first', max_length=15)
batch = tokenizer(batch_sentences, batch_of_second_sentences, truncation='only_first', max_length=14)
Copy the code

For example, the first line does not raise an error, because the truncation still leaves at least one token in the first sentence, while the second line (max_length=14) raises an error.

padding to max sequence in batch

With padding=True set, sentences that do not reach the maximum length in the batch are padded. The following code:

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation='only_first', max_length=18)
Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 18 ['[CLS]', 'Hello', 'I', "'", 'm', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 18 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]', '[PAD]', '[PAD]', '[PAD]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 18 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]', '[PAD]']Copy the code

And the truncation also works.

Using ['only_first', 'only_second', 'longest_first'] also truncates the corresponding part; I won't show it here again.

padding to specific length

This pads to a specific length (set via max_length). Compare the following two pieces of code: with padding='max_length' everything is padded to max_length=22, while with padding=True the batch is only padded to its longest sequence (21).

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding='max_length', truncation=True, max_length=22)
Copy the code

Output:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 22 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]', '[PAD]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 22 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 22 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']Copy the code

And the following code:

batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, max_length=22)
Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]', 'I', "'", 'm', 'a', 'sentence', 'that', 'goes', 'with', 'the', 'first', 'sentence', '[SEP]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'And', 'another', 'sentence', '[SEP]', 'And', 'I', 'should', 'be', 'encoded', 'with', 'the', 'second', 'sentence', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'] = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 21 ['[CLS]', 'And', 'the', 'very', 'very', 'last', 'one', '[SEP]', 'And', 'I', 'go', 'with', 'the', 'very', 'last', 'one', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']Copy the code

Pre-tokenized inputs

The tokenizer also accepts pre-tokenized inputs. This is particularly useful when you want to compute labels and extract predictions for named entity recognition (NER) or part-of-speech (POS) tagging.

Caveat: pre-tokenized does not mean your inputs are already tokenized (if they were, you would not need to pass them through the tokenizer), but simply that they are split into words (which is usually the first step of sub-word tokenization algorithms such as BPE).

If you want to use pre-tokenized inputs, simply set is_split_into_words=True when passing the inputs to the tokenizer. For example:

encoded_input = tokenizer(["Hello"."I'm"."a"."single"."sentence"], is_split_into_words=True)
print(encoded_input)
Copy the code

The output is:

{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Copy the code

Note that the tokenizer will still add the IDs of special tokens (if applicable), unless you pass add_special_tokens=False.

At the same time, the tokenizer still performs its own (sub-word) tokenization; setting is_split_into_words=True only means it will not split the input into words itself. Example code:

batch = tokenizer(["Hello"."I'm"."a"."single"."sentence"], is_split_into_words=True)
ids = batch['input_ids']            # Print the same code, omitted later
print("= = = ="*36)
print(len(tokenizer.convert_ids_to_tokens(ids)))
print(tokenizer.convert_ids_to_tokens(ids))
print(a)Copy the code

The output is:

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ======================== 9 ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]']Copy the code

As you can see, the word "I'm" is still split internally into three tokens: I, ', and m.
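As a side note (an illustrative sketch, not from the original post; it assumes a fast tokenizer, which AutoTokenizer returns by default), the word_ids() method maps each token back to the word it came from, which is handy for aligning NER/POS labels with sub-word tokens:

batch = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
print(batch.word_ids())  # e.g. [None, 0, 1, 1, 1, 2, 3, 4, None] -- None marks special tokens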

This works exactly as before with batches of sentences or pairs of sentences. You can encode a batch of sentences like this:

batch_sentences = [["Hello"."I'm"."a"."single"."sentence"],
                   ["And"."another"."sentence"],
                   ["And"."the"."very"."very"."last"."one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
Copy the code

Or a batch of pairs of sentences like this:

batch_of_second_sentences = [["I'm"."a"."sentence"."that"."goes"."with"."the"."first"."sentence"],
                             ["And"."I"."should"."be"."encoded"."with"."the"."second"."sentence"],
                             ["And"."I"."go"."with"."the"."very"."last"."one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
Copy the code

You can add padding and truncation and return tensors directly, just as before:

batch = tokenizer(batch_sentences,
                  batch_of_second_sentences,
                  is_split_into_words=True,
                  padding=True,
                  truncation=True,
                  return_tensors="pt")
Copy the code