
This article is about simple text preprocessing, super simple really. Here are the four steps:

  1. Load the text into memory as a string.
  2. Split the string into tokens (such as words or characters).
  3. Build a vocabulary that maps the tokens to numeric indices.
  4. Convert the text into sequences of numeric indices that a model can work with.

This is really the simplest text preprocessing, taken from Dive into Deep Learning (Hands-on Deep Learning), and it is also the first preprocessing covered in the "fish book". So it might be a good idea for newcomers to take a look.

import collections
import re
from d2l import torch as d2l

Reading the dataset

d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',
                                '090b5e7e70c295757f55df93cb0a180b9691891a')

def read_time_machine():
    """Load the time machine text into a list of lines."""
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    # Replace runs of non-letters with a space, trim, and lowercase
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]

lines = read_time_machine()
print(f'# text lines: {len(lines)}')
print(lines[0])
print(lines[10])

This code loads the text of H. G. Wells's The Time Machine, a small corpus of about 30,000 words.

  • During the download it prints:

    Downloading ..\data\timemachine.txt from d2l-data.s3-accelerate.amazonaws.com/timemachine… .

  • After the download it prints how many lines there are, along with lines 0 and 10.

    >>
    # text lines: 3221
    the time machine by h g wells
    twinkled and his usually pale face was flushed and animated the
  • read_time_machine() reads the lines from the downloaded file, strips out every character that is not a letter, and stores the result in a list. After this step, the list contains only lowercase letters and spaces.

    • [... for line in lines] is a list comprehension

      • The expression before for is evaluated once for each element of lines, and the results are collected into a new list.

      • In other words, for line in lines iterates over the list lines, operating on each element in turn.

      • re.sub('[^A-Za-z]+', ' ', line) uses a regular expression to replace every run of characters in line other than letters with a single space (demonstrated in the short sketch after this list).

        • re.sub(pattern, repl, string, count=0, flags=0)

          Replaces occurrences of pattern in string with repl and returns the resulting string.

          repl can be a string or a function. If it is a string, any backslash escape sequences in it are processed: \n is converted to a newline character, \r to a carriage return, and so on.

          The optional argument count is the maximum number of replacements to perform; count must be a non-negative integer. If it is omitted or zero, all occurrences are replaced.

      • .strip() removes whitespace at both ends of the string

      • .lower() converts uppercase letters to lowercase
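
As a standalone sketch of these pieces (the sample line and strings here are made up for illustration, not taken from the book):

import re

line = 'The Time Machine, by H. G. Wells [1898]'
print(re.sub('[^A-Za-z]+', ' ', line).strip().lower())
# -> 'the time machine by h g wells'

# count limits how many matches get replaced
print(re.sub('a', 'X', 'banana'))           # bXnXnX (all matches)
print(re.sub('a', 'X', 'banana', count=1))  # bXnana (first match only)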

Tokenization

This step breaks the text down into lists of tokens; a token is the basic unit of text.

def tokenize(lines, token='word'):
    """Split each line into word or character tokens."""
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('Error: unknown token type: ' + token)

This step splits the text either into words or into individual characters (which include spaces):

  • token='word' splits it into words
  • token='char' splits it into characters

Try the tokenize function:

  • word

    tokens = tokenize(lines)
    for i in range(10, 13):
        print(f"\nline[{i}]:", lines[i])
        print(f"\ntokens[{i}]:", tokens[i])
    >>
    line[10]: twinkled and his usually pale face was flushed and animated the
    
    tokens[10]: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']
    
    line[11]: fire burned brightly and the soft radiance of the incandescent
    
    tokens[11]: ['fire', 'burned', 'brightly', 'and', 'the', 'soft', 'radiance', 'of', 'the', 'incandescent']
    
    line[12]: lights in the lilies of silver caught the bubbles that flashed and
    
    tokens[12]: ['lights', 'in', 'the', 'lilies', 'of', 'silver', 'caught', 'the', 'bubbles', 'that', 'flashed', 'and']
  • char

    tokens = tokenize(lines, 'char')
    for i in range(10, 13):
        print(f"\nline[{i}]:", lines[i])
        print(f"\ntokens[{i}]:", tokens[i])
    >>
    line[10]: twinkled and his usually pale face was flushed and animated the
    
    tokens[10]: ['t', 'w', 'i', 'n', 'k', 'l', 'e', 'd', ' ', 'a', 'n', 'd', ' ', 'h', 'i', 's', ' ', 'u', 's', 'u', 'a', 'l', 'l', 'y', ' ', 'p', 'a', 'l', 'e', ' ', 'f', 'a', 'c', 'e', ' ', 'w', 'a', 's', ' ', 'f', 'l', 'u', 's', 'h', 'e', 'd', ' ', 'a', 'n', 'd', ' ', 'a', 'n', 'i', 'm', 'a', 't', 'e', 'd', ' ', 't', 'h', 'e']
    
    line[11]: fire burned brightly and the soft radiance of the incandescent
    
    tokens[11]: ['f', 'i', 'r', 'e', ' ', 'b', 'u', 'r', 'n', 'e', 'd', ' ', 'b', 'r', 'i', 'g', 'h', 't', 'l', 'y', ' ', 'a', 'n', 'd', ' ', 't', 'h', 'e', ' ', 's', 'o', 'f', 't', ' ', 'r', 'a', 'd', 'i', 'a', 'n', 'c', 'e', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'i', 'n', 'c', 'a', 'n', 'd', 'e', 's', 'c', 'e', 'n', 't']
    
    line[12]: lights in the lilies of silver caught the bubbles that flashed and
    
    tokens[12]: ['l', 'i', 'g', 'h', 't', 's', ' ', 'i', 'n', ' ', 't', 'h', 'e', ' ', 'l', 'i', 'l', 'i', 'e', 's', ' ', 'o', 'f', ' ', 's', 'i', 'l', 'v', 'e', 'r', ' ', 'c', 'a', 'u', 'g', 'h', 't', ' ', 't', 'h', 'e', ' ', 'b', 'u', 'b', 'b', 'l', 'e', 's', ' ', 't', 'h', 'a', 't', ' ', 'f', 'l', 'a', 's', 'h', 'e', 'd', ' ', 'a', 'n', 'd']

The vocabulary

After the processing above we have lists of tokens, but the input a model needs is numeric, so we assign each token a number for the model to use.

Build a vocabulary dictionary that maps tokens to numeric indices starting at 0.

def count_corpus(tokens):
    """Count token frequencies."""
    # Flatten a 2-D list of token lists into one 1-D list of tokens
    if len(tokens) == 0 or isinstance(tokens[0], list):
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)

This function counts token frequencies.

By combining all the documents in the training set and counting their unique tokens, we get the statistics of the corpus. Each unique token is then assigned a numeric index based on its frequency of occurrence.

  • if len(tokens) == 0 or isinstance(tokens[0], list) is true if tokens is an empty list or a 2-D list.
  • tokens = [token for line in tokens for token in line] flattens the list, turning the original two dimensions into one.
  • The flattened list is then passed to collections.Counter (a small demo of count_corpus follows the Counter examples below).
    • class collections.Counter([iterable-or-mapping])

      A Counter is a dict subclass.

      It is a collection where elements are stored as dictionary keys and their counts are stored as the values. Counts may be any integer, including 0 and negative numbers. The Counter class is a bit like bags or multisets in other languages.

      c = collections.Counter('gallahad')
      print(c)
      c = collections.Counter({'red': 4, 'blue': 2})
      print(c)
      c = collections.Counter(['eggs', 'ham'])
      print(c)
      >>
      Counter({'a': 3, 'l': 2, 'g': 1, 'h': 1, 'd': 1})
      Counter({'red': 4, 'blue': 2})
      Counter({'eggs': 1, 'ham': 1})

If the queried key has no recorded count, 0 is returned instead of raising a KeyError:

      c = collections.Counter('gallahad')
      print(c['z'])
      >>
      0
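
Back to count_corpus: here is what it returns on a tiny hand-made token list (the toy data is invented for illustration):

toy = [['the', 'time', 'machine'], ['the', 'time', 'traveller']]
print(count_corpus(toy))
>>
Counter({'the': 2, 'time': 2, 'machine': 1, 'traveller': 1})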

Processing complexity can be reduced by removing infrequent tokens when processing the corpus.

Any token that does not exist in the corpus, or has been removed from it, is mapped to a special unknown token, "<unk>".

We can also add a list of reserved tokens, such as the padding token ("<pad>"), the beginning-of-sequence token ("<bos>"), and the end-of-sequence token ("<eos>").

Now write a class that implements the text vocabulary functionality:

class Vocab:
    """Vocabulary that maps tokens to numeric indices."""
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # Sort by frequency of occurrence
        counter = count_corpus(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The index of the unknown token is 0
        self.unk, uniq_tokens = 0, ['<unk>'] + reserved_tokens
        uniq_tokens += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in uniq_tokens]
        self.idx_to_token, self.token_to_idx = [], dict()
        for token in uniq_tokens:
            self.idx_to_token.append(token)
            self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]
  • Parameters:

    • tokens: the tokenized text (word- or character-level)
    • min_freq: a frequency threshold; tokens that occur less often than this are ignored
    • reserved_tokens: reserved tokens such as sentence start and end markers
  • If tokens or reserved_tokens is not passed in, it defaults to an empty list.

  • counter receives the token frequencies computed by count_corpus

  • self.token_freqs holds the (token, frequency) pairs in descending order of frequency

    • sorted(iterable, key=None, reverse=False)

      Parameter description:

      • iterable — the iterable to sort.
      • key — a function of one argument that extracts the comparison key from each element of the iterable.
      • reverse — the sort order; reverse=True sorts in descending order, reverse=False in ascending order (the default).
    • We sort counter.items(), which yields (key, value) tuples. With key=lambda x: x[1], each key-value tuple is treated as x and sorted by its value (the count); reverse=True puts them in descending order.

  • The last part builds self.idx_to_token and self.token_to_idx, which map an index back to its token and a token to its index (see the usage sketch below).
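
To close the loop on step 4, here is a quick usage sketch (the printed indices depend on the corpus, and the probe word 'qwertyuiop' is just a made-up unknown, so treat the values as illustrative):

tokens = tokenize(lines)   # word-level tokens again
vocab = Vocab(tokens)

# A few of the most frequent tokens and their indices
print(list(vocab.token_to_idx.items())[:5])

# Convert a line of words into a sequence of indices
print('words:  ', tokens[0])
print('indices:', vocab[tokens[0]])

# A token absent from the corpus maps to index 0 ('<unk>')
print(vocab['qwertyuiop'])

# min_freq drops rare tokens, shrinking the vocabulary
print(len(vocab), len(Vocab(tokens, min_freq=10)))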


  1. This is a PyTorch version of a reading note for Hands-on Deep Learning (Dive into Deep Learning). More articles in this series can be found here: Juejin.

  2. Github address: DeepLearningNotes/d2l(github.com)

Still being updated…