
Since Google introduced BERT in 2018, pre-trained models have led the field of NLP. As an epoch-making deep model in NLP history, BERT is undoubtedly powerful, and in practice a BERT pre-trained model can already cover most real-world scenarios. In this series, the author walks through a highly reusable set of text-classification code built on a BERT pre-trained model. The whole codebase is interpreted in detail over three articles; this one explains the data-reading part.

Source code download address: download link (extraction code: 2021)

The code uses the IFLYTEK long-text classification dataset from the Chinese Language Understanding Evaluation benchmark (CLUE). The CLUE paper was accepted at the International Conference on Computational Linguistics (COLING 2020).

Overview

The data-reading code lives in data_loader.py. This article goes through each class and function in that file and clarifies how they relate to one another. The classes and functions are listed below:

• InputExample: a sample object; one is created for each sample, and its internal methods can be overridden for the task at hand.
• InputFeatures: a feature object; one is created for each encoded sample, and its internal methods can likewise be overridden for the task.
• iflytekProcessor: reads the raw file data and returns InputExample objects. The class name is not fixed and the reading logic can be adapted to the specific task; here it is named iflytekProcessor after the IFLYTEK dataset.
• convert_examples_to_features: converts InputExample objects into InputFeatures objects.
• load_and_cache_examples: loads the InputFeatures produced by convert_examples_to_features and caches them, so they do not have to be regenerated for every training run.

Logic: iflytekProcessor reads the raw file into InputExample objects, convert_examples_to_features turns them into InputFeatures, and load_and_cache_examples caches the features and wraps them in a TensorDataset.

InputExample

This class is straightforward. The initialization method defines a few attributes of an input sample: guid, the unique ID of the sample; words, the sentence text; and label, the sample's label. It usually needs no modification; for tasks such as text matching, a second sentence can be added, e.g. self.wordspair = wordspair, depending on the task.

import copy
import json


class InputExample(object):
    """A single training/test example for simple sequence classification.

    Args:
        guid: Unique id for the example.
        words: list. The words of the sequence.
        label: (Optional) string. The label of the example.
    """

    def __init__(self, guid, words, label=None):
        self.guid = guid
        self.words = words
        self.label = label

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
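A quick usage sketch (the guid, sentence, and label below are made-up values, purely for illustration):

# Hypothetical usage of InputExample; the values are illustrative only.
example = InputExample(guid="train-0", words="这是一条示例文本", label="3")
print(example)            # __repr__ prints the JSON serialization
print(example.to_dict())  # {'guid': 'train-0', 'words': '这是一条示例文本', 'label': '3'}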

InputFeatures

This class describes BERT's input format: input_ids, attention_mask, and token_type_ids, plus the label_id.

If you are also using a BERT model, it needs no modification; for other pre-trained models, adapt it to the corresponding model's input format.

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, attention_mask, token_type_ids, label_id):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.token_type_ids = token_type_ids
        self.label_id = label_id

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

iflytekProcessor

Reads the data file for the task; adapt the reading logic to your own data as needed.

class iflytekProcessor(object):
    """Processor for the JointBERT data set """

    def __init__(self, args):
        self.args = args
        self.labels = get_labels(args)

        self.input_text_file = 'data.csv'

    @classmethod
    def _read_file(cls, input_file, quotechar=None):
        """Reads a comma separated value file."""
        df = pd.read_csv(input_file)
        return df

    def _create_examples(self, datas, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for i, rows in datas.iterrows():
            try:
                guid = "%s-%s" % (set_type, i)
                # 1. input_text
                words = rows["text"]
                # 2. label
                label = rows["labels"]
            except Exception:
                # Skip malformed rows instead of appending stale values
                print(rows)
                continue

            examples.append(InputExample(guid=guid, words=words, label=label))
        return examples

    def get_examples(self, mode):
        """ Args: mode: train, dev, test """
        data_path = os.path.join(self.args.data_dir, self.args.task, mode)
        logger.info("LOOKING AT {}".format(data_path))
        return self._create_examples(datas=self._read_file(os.path.join(data_path, self.input_text_file)),
                                     set_type=mode)
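As the code above shows, _read_file loads a CSV with pandas, _create_examples reads the text and labels columns, and get_examples looks for the file under {data_dir}/{task}/{mode}/data.csv. A minimal sketch of that expected layout, with made-up rows:

import pandas as pd

# Hypothetical data.csv; _create_examples expects "text" and "labels" columns.
df = pd.DataFrame({
    "text": ["这是第一条示例文本", "这是第二条示例文本"],
    "labels": [3, 17],
})
# One such file per split, e.g. ./data/iflytek/train/data.csv
df.to_csv("data.csv", index=False)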

convert_examples_to_features

This function encodes each example with BERT's tokenizer, producing input_ids, attention_mask, and token_type_ids padded or truncated to max_seq_len, and packs them into InputFeatures objects.

def convert_examples_to_features(examples, max_seq_len, tokenizer,
                                 cls_token_segment_id=0,
                                 pad_token_segment_id=0,
                                 sequence_a_segment_id=0,
                                 mask_padding_with_zero=True):
    # Setting based on the current model type
    cls_token = tokenizer.cls_token
    sep_token = tokenizer.sep_token
    unk_token = tokenizer.unk_token
    pad_token_id = tokenizer.pad_token_id

    features = []
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            logger.info("Writing example %d of %d" % (ex_index, len(examples)))

        # Tokenize word by word (for NER)
        tokens = []
        for word in example.words:
            word_tokens = tokenizer.tokenize(word)
            if not word_tokens:
                word_tokens = [unk_token]  # For handling the bad-encoded word
            tokens.extend(word_tokens)

        # Account for [CLS] and [SEP]
        special_tokens_count = 2
        if len(tokens) > max_seq_len - special_tokens_count:
            tokens = tokens[:(max_seq_len - special_tokens_count)]

        # Add [SEP] token
        tokens += [sep_token]
        token_type_ids = [sequence_a_segment_id] * len(tokens)

        # Add [CLS] token
        tokens = [cls_token] + tokens
        token_type_ids = [cls_token_segment_id] + token_type_ids

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding_length = max_seq_len - len(input_ids)
        input_ids = input_ids + ([pad_token_id] * padding_length)
        attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
        token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)

        assert len(input_ids) == max_seq_len, "Error with input length {} vs {}".format(len(input_ids), max_seq_len)
        assert len(attention_mask) == max_seq_len, "Error with attention mask length {} vs {}".format(len(attention_mask), max_seq_len)
        assert len(token_type_ids) == max_seq_len, "Error with token type length {} vs {}".format(len(token_type_ids), max_seq_len)

        label_id = int(example.label)

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s" % example.guid)
            logger.info("tokens: %s" % "".join([str(x) for x in tokens]))
            logger.info("input_ids: %s" % "".join([str(x) for x in input_ids]))
            logger.info("attention_mask: %s" % "".join([str(x) for x in attention_mask]))
            logger.info("token_type_ids: %s" % "".join([str(x) for x in token_type_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))

        features.append(
            InputFeatures(input_ids=input_ids,
                          attention_mask=attention_mask,
                          token_type_ids=token_type_ids,
                          label_id=label_id,
                          ))

    return features
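A hedged usage sketch, assuming the Hugging Face transformers tokenizer; the model name, example text, and max_seq_len below are placeholder choices, not taken from the original configuration:

from transformers import BertTokenizer

# Hypothetical call; "bert-base-chinese" and max_seq_len=128 are placeholders.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
examples = [InputExample(guid="train-0", words="这是一条示例文本", label="3")]
features = convert_examples_to_features(examples, max_seq_len=128, tokenizer=tokenizer)
print(features[0].input_ids[:10])       # token ids, padded to max_seq_len
print(sum(features[0].attention_mask))  # number of real (non-padding) tokens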

load_and_cache_examples

As its name suggests, this function both loads and caches the data:

1. Build the cache file name cached_features_file from the arguments, in the form cached_{mode}_{task}_{model name}_{max_seq_len} (a concrete example follows this list).

2. Check whether the cache file already exists under the data directory. If it does, load the features from it directly; otherwise:

  • First, use the processor to obtain examples.
  • Then, pass them through convert_examples_to_features to obtain features and save them to the cache file.

3. Convert the features to tensors and build the dataset with TensorDataset.
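For example, with hypothetical arguments mode="train", task="iflytek", model_name_or_path="bert-base-chinese", and max_seq_len=128, step 1 would produce a name like this:

# Illustration of the cache-name scheme only; the values are placeholders.
cache_name = 'cached_{}_{}_{}_{}'.format("train", "iflytek", "bert-base-chinese", 128)
# -> 'cached_train_iflytek_bert-base-chinese_128'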

def load_and_cache_examples(args, tokenizer, mode):
    processor = processors[args.task](args)

    # Load data features from cache or dataset file
    cached_features_file = os.path.join(
        args.data_dir,
        'cached_{}_{}_{}_{}'.format(
            mode,
            args.task,
            list(filter(None, args.model_name_or_path.split("/"))).pop(),
            args.max_seq_len
        )
    )
    print(cached_features_file)

    if os.path.exists(cached_features_file):
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        # Load data features from dataset file
        logger.info("Creating features from dataset file at %s", args.data_dir)
        if mode == "train":
            examples = processor.get_examples("train")
        elif mode == "dev":
            examples = processor.get_examples("dev")
        elif mode == "test":
            examples = processor.get_examples("test")
        else:
            raise Exception("For mode, Only train, dev, test is available")

        # Encode the examples as BERT input features
        features = convert_examples_to_features(examples,
                                                args.max_seq_len,
                                                tokenizer,
                                                )
        logger.info("Saving features into cached file %s", cached_features_file)
        torch.save(features, cached_features_file)

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor(
        [f.input_ids for f in features],
        dtype=torch.long
    )
    all_attention_mask = torch.tensor(
        [f.attention_mask for f in features],
        dtype=torch.long
    )
    all_token_type_ids = torch.tensor(
        [f.token_type_ids for f in features],
        dtype=torch.long
    )
    all_label_ids = torch.tensor(
        [f.label_id for f in features],
        dtype=torch.long
    )

    dataset = TensorDataset(all_input_ids, all_attention_mask,
                            all_token_type_ids, all_label_ids)
    return dataset
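To close, a minimal end-to-end usage sketch, assuming an args object with the fields referenced above (data_dir, task, model_name_or_path, max_seq_len) and that the processors registry maps "iflytek" to iflytekProcessor; all names and values here are placeholders:

from torch.utils.data import DataLoader, RandomSampler
from transformers import BertTokenizer

# Hypothetical argument container; field names match those used in load_and_cache_examples.
class Args:
    data_dir = "./data"
    task = "iflytek"
    model_name_or_path = "bert-base-chinese"
    max_seq_len = 128

args = Args()
tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)
train_dataset = load_and_cache_examples(args, tokenizer, mode="train")

train_loader = DataLoader(train_dataset,
                          sampler=RandomSampler(train_dataset),
                          batch_size=32)
for input_ids, attention_mask, token_type_ids, label_ids in train_loader:
    # Batches unpack in the order the TensorDataset was built above
    break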

Sample output

Preview: follow-up articles on model construction and training are to be continued. I am new to NLP and my knowledge is limited; if there are mistakes or places to improve, please point them out!