
Preprocessing

Raw Chinese text is often messy: garbled characters, stray symbols, and other noise that needs to be removed. A reasonable baseline is that, after cleaning, the text should read like a normal article in both format and content.
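What counts as noise depends on the corpus, so cleanup is usually a handful of hand-written rules. Below is a minimal sketch in Python; the regex rules are illustrative assumptions, not a fixed recipe:

```python
import re

def clean_text(text: str) -> str:
    """Remove common noise from raw Chinese text (illustrative rules)."""
    # Strip leftover HTML tags from scraped pages
    text = re.sub(r"<[^>]+>", "", text)
    # Keep CJK characters, letters, digits, and common punctuation; drop the rest
    text = re.sub(r"[^\u4e00-\u9fa5a-zA-Z0-9，。！？、；：,.!?\s]", "", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("<p>今天天气真好！！！\ufffd</p>"))  # -> 今天天气真好！！！
```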

Word segmentation

The first step in processing Chinese text is usually word segmentation, the counterpart of tokenization for English text. The basic idea is to split the text into individual words or characters. Since Chinese, unlike English, has no spaces to serve as natural delimiters, Chinese word segmentation is a substantial research field in its own right. In practice, mature segmenters such as jieba are commonly used.
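With jieba, segmenting a sentence takes a single call. The output shown is jieba's default precise mode, as in jieba's own examples:

```python
import jieba  # pip install jieba

print(jieba.lcut("我来到北京清华大学"))
# ['我', '来到', '北京', '清华大学']
```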

Why word segmentation?

Text is unstructured data. We first need to transform it into structured data so the task can be cast as a mathematical problem; this is the same idea that underlies machine learning, and word segmentation is the first step of that transformation. A simple segmentation of a sentence looks like this:

```
乒乓球拍卖完了  ->  乒乓球拍 / 卖完了
(the table-tennis paddles are sold out)
```

Moreover, a word is the smallest unit that expresses complete meaning. The granularity of a single character is too small to convey a complete meaning, while that of a sentence is too large: it carries so much information that it cannot be modeled accurately.

Why delete low-frequency words and stop words?

First, most low-frequency words are of little use: personal names, typos, and similar noise; likewise stop words such as 的 ("de") carry essentially no content.

Second, removing low-frequency words keeps the dictionary from growing too large. With one-hot vectors the dimensionality equals the dictionary size, so the larger the dictionary, the longer the one-hot vectors, which slows computation and increases storage. Moreover, higher-dimensional one-hot vectors mean more model parameters, which makes overfitting more likely. A sketch of both filters follows.
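In code, both filters reduce to counting word frequencies and checking membership in a stop-word list. A minimal sketch, where the stop-word list and frequency threshold are made up for illustration:

```python
from collections import Counter

# Toy corpus: each document is already segmented into words
docs = [["我", "喜欢", "乒乓球"],
        ["我", "喜欢", "足球"],
        ["张三", "喜欢", "游泳"]]

stop_words = {"我", "的", "了"}   # illustrative stop-word list
min_freq = 2                      # illustrative frequency threshold

counts = Counter(tok for doc in docs for tok in doc)
vocab = {tok for tok, c in counts.items()
         if c >= min_freq and tok not in stop_words}

filtered = [[tok for tok in doc if tok in vocab] for doc in docs]
print(vocab)      # {'喜欢'} for this toy corpus
print(filtered)   # [['喜欢'], ['喜欢'], ['喜欢']]
```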

How are words represented in a computer?

In actual text processing, we need to convert words into numerical features that machine learning or deep learning models can operate on. For example, after segmentation we could encode "India" as 1, "USA" as 2, and so on. These numbers denote categories only; they carry no notion of magnitude or ordering. Going one step further, a one-hot vector is used to represent a categorical feature such as country or gender, for example:

```
India -> [0, 0, 0, 1]
USA   -> [0, 0, 1, 0]
```
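A minimal sketch of building such an encoding; the country order is my assumption, chosen to match the vectors above:

```python
import numpy as np

# Category ids, using the order implied by the vectors above
countries = ["UK", "China", "USA", "India"]
index = {w: i for i, w in enumerate(countries)}

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(len(countries), dtype=int)
    vec[index[word]] = 1
    return vec

print(one_hot("India"))  # [0 0 0 1]
print(one_hot("USA"))    # [0 0 1 0]
```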

Why one-hot vectors?

When representing a country, why do we use a one-hot vector as the feature, such as:

"India" - > "American",0,0,1 [0] - > [0,0,1,0] "China" - >,1,0,0 [0] "UK" - >,0,0,0 [1]Copy the code

Instead of representing each country directly with a number, such as:

```
India -> 1
USA   -> 2
China -> 3
UK    -> 4
```

Some readers may think that representing each country directly with a single number is obviously simpler: one number instead of a 4-dimensional vector saves both storage and computation. Isn't that a better design? Of course not, or one-hot vectors would never have been invented!

If we really did use plain numbers, the encoding above would permit calculations like:

```
USA (2) + India (1) = China (3)
```

Such feature arithmetic is completely unreasonable: how can the sum of the United States and India be China? If we compute with vectors instead:

```
USA [0, 0, 1, 0] + India [0, 0, 0, 1] = [0, 0, 1, 1]
```

[0, 0, 1, 1] is far more meaningful: it indicates that the result contains both the USA and India.

In general, features in machine learning or deep learning should not be represented by scalars, because arithmetic on such numbers yields meaningless results. The right approach is to represent features as vectors.

Does it have to be one-hot? No! Enter word embedding

If we have 10,000 distinct words, one-hot encoding requires a 10,000-dimensional vector per word. Vectors of that size demand expensive computing resources, so one-hot is only recommended for small vocabularies. With a large vocabulary we turn to word embedding instead: mapping these high-dimensional vectors to low-dimensional ones. Concretely:

```
one-hot vector of a word (1 x 10000) x P (10000 x 10) = v (1 x 10)
```

We multiply the word's one-hot high-dimensional vector by a matrix P of shape (number of words) x (embedding dimension), here 10000 x 10 with the dimension 10 chosen by us. P must be learned from the training data, and each of its rows is the word vector of one word. Put simply, the original 10,000-dimensional vector is compressed into a 10-dimensional one that is easy to compute with.
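A minimal numpy sketch of this multiplication; P is random here only to show the shapes, whereas in a real model it is learned:

```python
import numpy as np

vocab_size, embed_dim = 10000, 10
rng = np.random.default_rng(0)
P = rng.normal(size=(vocab_size, embed_dim))  # learned in practice; random here

word_id = 42                       # arbitrary example word
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

v = one_hot @ P                    # (1 x 10000) x (10000 x 10) -> (10,)
assert np.allclose(v, P[word_id])  # the multiply simply selects row 42 of P
```

Because the product just selects one row of P, deep learning frameworks implement embeddings as a lookup table rather than an actual matrix multiplication.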

Conclusion

The typical processing pipeline for Chinese text is:

```
preprocessing --> word segmentation --> stop-word removal --> vectorization
```
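Tying the steps together, here is a minimal end-to-end sketch of the pipeline; the cleanup rule, stop-word list, and frequency threshold are illustrative assumptions:

```python
import re
import jieba
from collections import Counter

def pipeline(raw_docs, stop_words=frozenset({"的", "了"}), min_freq=1):
    """Preprocess -> segment -> remove stop words -> map words to ids."""
    cleaned = [re.sub(r"\s+", " ", doc).strip() for doc in raw_docs]  # preprocessing
    segmented = [jieba.lcut(doc) for doc in cleaned]                  # word segmentation
    counts = Counter(tok for doc in segmented for tok in doc)
    kept = [[tok for tok in doc
             if tok not in stop_words and counts[tok] >= min_freq]
            for doc in segmented]                                     # stop-word / low-freq removal
    vocab = {tok: i for i, tok in
             enumerate(sorted({tok for doc in kept for tok in doc}))}
    return [[vocab[tok] for tok in doc] for doc in kept], vocab      # ids feed vectorization
```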