1. NLP

Natural Language Processing (NLP) is the field concerned with making computers understand human language.

Below is what the neural network produced after I trained it on 20,000 Douban film reviews and then fed it some new reviews it had never seen.

    "Good, good acting.".0.91799414= = = > high praise"It would be weird if it were better.".0.19483969= = = > bad review"One star for subtitles.".0.0028086603= = = > bad review"Good acting, good acting, bad acting.".0.17192301= = = > bad review"Good acting, good acting, good acting, good acting, bad acting." 0.8373259= = = > high praiseCopy the code

By the end of this article, you will have acquired these skills.

2. Read data

First of all, we need a dataset to train on. Here I have a CSV file containing 50,000 film and television reviews collected from Douban.

Its format looks something like this:

Name         Score    Comment            Classification
movie name   1 to 5   comment content    1 = positive, 0 = negative

Here is a sample of the data:

The code looks like this:

# import packages
import csv
import jieba

# Read the CSV file
csv_reader = csv.reader(open("datasets/douban_comments.csv"))

# Store sentences and labels
sentences = []
labels = []

# Loop over each row and process it
i = 1
for row in csv_reader:

    # Use jieba to segment the comment into words separated by spaces
    comments = jieba.cut(row[2])
    comment = " ".join(comments)
    sentences.append(comment)
    # Store the label: 1 positive, 0 negative
    labels.append(int(row[3]))

    i = i + 1

    if i > 20000: break # take only the first 20,000 comments; remove this line to use them all

# Split off training data and test data
training_size = 16000
# Items 0 to 16000 are the training data
training_sentences = sentences[0:training_size]
training_labels = labels[0:training_size]
# Everything after 16000 is the test data
testing_sentences = sentences[training_size:]
testing_labels = labels[training_size:]

There are several things going on here:

  1. The file is read line by line, selecting the comment and label fields.
  2. Comment content is stored after word segmentation.
  3. The data were divided into training and test groups.

2.1 Chinese word segmentation

Let's focus on word segmentation.

Word segmentation is a problem peculiar to Chinese; it barely exists in English.

Here is an English sentence.

This is an English sentence.

How many words does this sentence have?

There are five, because there is a space between each word, which the computer can easily recognize and process.

This is an English sentence .
1 2 3 4 5 6
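As a quick illustration in Python (my own snippet, not from the original article), splitting on spaces is all it takes for English; the period simply comes along as a sixth token, matching the numbering above.

tokens = "This is an English sentence .".split(" ")
print(tokens)       # ['This', 'is', 'an', 'English', 'sentence', '.']
print(len(tokens))  # 6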

Here is a Chinese sentence.

Welcome to my Nuggets blog.

How many words does this sentence have?

You probably have to read it a few times and draw on your experience of the language before you can work out where the word boundaries are, because Chinese has no spaces between words.

The focus of today's study is not word segmentation, so we will skip over the details and use the third-party jieba segmentation library.

Installation method

Code is compatible with Python 2/3

  • Fully automatic installation: easy_install jieba or pip install jieba / pip3 install jieba
  • Semi-automatic installation: download it from pypi.python.org/pypi/jieba/, decompress it, and run python setup.py install
  • Manual installation: download the code and place the jieba directory in the current directory or in site-packages
  • Import it with import jieba

Once it is imported, calling jieba.cut("Welcome to my Nuggets blog.") performs the segmentation.

import jieba
words = jieba.cut("Welcome to my Nuggets blog.")
sentence = " ".join(words)
print(sentence) # Welcome to my Nuggets blog.

Why do we need word segmentation? Because words are the smallest meaningful units of language; only by recognising the words can a computer understand the language and know what is being said.

In Chinese, the same characters can be segmented differently depending on context.

Pay attention to “Peking University” below:

import jieba
sentence = " ".join(jieba.cut("Welcome to Peking University Cafeteria."))
print(sentence) # Welcome to Peking University Cafeteria ("Peking University" stays as one word)
sentence2 = " ".join(jieba.cut("Welcome to Beijing University Student Volunteer Center"))
print(sentence2) # Welcome to Beijing University Student Volunteer Center ("Beijing" and "University Student" are split apart)

Therefore, the difficulty of Chinese natural language processing lies in word segmentation.

At this point, our product is:

sentences = ['I like you', "I don't like him", ...]
labels = [0, 1, ...]

3. Text serialization

In fact, the computer cannot read text directly; it only knows zeros and ones.

You can see these words and pictures because they have been transformed many times.

Take the letter A: its code is 65 in decimal, or 0100 0001 in binary.

Binary      Decimal   Character   Description
0100 0001   65        A           Capital letter A
0100 0010   66        B           Capital letter B
0100 0011   67        C           Capital letter C
0100 0100   68        D           Capital letter D
0100 0101   69        E           Capital letter E

When you see A, B and C, the computer actually sees 0100 0001, 0100 0010 and 0100 0011; it prefers numbers.

This is why, when you compare letters, you find that A < B: it is essentially 65 < 66.
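You can check this directly in Python (an illustrative snippet of my own):

print(ord('A'), ord('B'))   # 65 66
print('A' < 'B')            # True, because 65 < 66
print(bin(ord('A')))        # 0b1000001, i.e. 0100 0001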

Well, our prepared text also needs to be converted to numbers to make it easier to calculate.

3.1 fit_on_texts classification

Tokenizer is Keras's tokenizer, used to index and serialize text.

The tokenization here is different from the Chinese word segmentation we discussed above. The library was designed around languages whose words are already separated by spaces, so its "tokenizer" does not need to find word boundaries; its job is to collect the words and give each one a number.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I like you', "I don't like him"]
# Define a tokenizer
tokenizer = Tokenizer()
# Fit the tokenizer on the text
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index) # {'i': 1, 'like': 2, 'you': 3, "don't": 4, 'him': 5}

What fit_on_texts does is scan the text and work out how many distinct words it contains.

Look at the output: 5 different words were found in the 2 sentences, numbered 1 to 5.
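One detail worth knowing (my own note, not from the article's code): words the tokenizer has never seen are silently dropped by texts_to_sequences, unless the Tokenizer is created with an oov_token. An illustrative, separate tokenizer:

# Illustrative only: a separate tokenizer that reserves an out-of-vocabulary token
tokenizer_oov = Tokenizer(oov_token="<OOV>")
tokenizer_oov.fit_on_texts(sentences)
print(tokenizer_oov.texts_to_sequences(["I like pandas"]))
# "pandas" was never seen, so it maps to the <OOV> index instead of disappearing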

3.2 texts_to_sequences text to sequences

All the words in the text are numbered, so the text can be represented by numbers.

# Convert text to numeric sequence
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences) # [[1, 2, 3], [1, 4, 2, 5]]

So the computer began to smile.

3.3 pad_sequences padding sequences

The text is now numbers, but the sequences are not standardized: some are short and some are long, while the assembly line only eats data of a standard size.

pad_sequences processes the sequences to a uniform length, which defaults to the length of the longest sequence, padding with 0 wherever a sequence is too short.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# padding='post' pads at the end, padding='pre' pads at the front
padded = pad_sequences(sequences, padding='post')
print(padded) # [[1 2 3] [1 4 2 5]] -> [[1 2 3 0] [1 4 2 5]]

Now that every sequence is the same length, the computer gives a happy smile.

If a sequence is too short it can be padded out, but what if it is too long?

If it is too long, it can be cut.

# truncating='post' cuts from the end, truncating='pre' cuts from the front
padded = pad_sequences(sequences, maxlen=3, truncating='pre')
print(padded) # [[1, 2, 3], [1, 4, 2, 5]] -> [[1 2 3] [4 2 5]]

At this point, our product is:

sentences = [[1 2 3 0] [1 4 2 5]]
labels = [0, 1, ...]
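The article does not show the step where the toy example above is applied to the real data, so here is a minimal sketch of how training_sentences and testing_sentences from section 2 could be turned into the training_padded and testing_padded arrays that the training step later expects. The values of vocab_size, embedding_dim and max_length are my own assumptions, not the author's.

# A minimal sketch with assumed hyperparameter values
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 20000    # assumed dictionary size
embedding_dim = 16    # assumed embedding size, used by the model below
max_length = 100      # assumed maximum sentence length

tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding='post', truncating='post')

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding='post', truncating='post')

# Keras expects arrays rather than plain Python lists
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)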

4. Build a model

The so-called model is assembly line equipment. Let’s take a look at what an assembly line looks like.

The job of an assembly line is to take in raw materials in a fixed format, process them layer by layer, and finally turn out a finished product in a fixed format.

A model works the same way: define the layers of "equipment", configure the "metrics" used along the way, and it is ready to go into production.

import tensorflow as tf

# Build the model and define the layers
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')])
# Configure the loss, the optimizer and the metrics
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
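If you want to look over the finished "assembly line" before switching it on, model.summary() prints each layer with its output shape and parameter count:

model.summary()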

4.1 Sequential sequence

You can think of it as the whole assembly line, which contains all kinds of devices (layers).

4.2 Embedding layer

The embedding layer. Even the name sounds impressive.

Embedding means plugging a word into many dimensions: one word is represented by values along multiple dimensions.

First, a word about dimensions.

Two dimensions look like this (length, width) :

Three dimensions look like this (length, width, height) :

Can you imagine what 100 dimensions would look like? Unless you are a physicist, it is hard to picture space in more than three dimensions. With numbers, though, it is easy.

Gender, job, age, height, skin color: that is already five dimensions. Would it be hard to come up with 1,000?

A word can likewise be embedded in many dimensions, and from its values along those dimensions we can gauge how strongly it expresses something and calculate the relationships between words.

If we give colors the dimensions R, G and B:

color R G B
red 255 0 0
green 0 255 0
blue 0 0 255
yellow 255 255 0
white 255 255 255
black 0 0 0

Now witness the miracle. Anyone who knows a little color theory knows: what do you get when you mix red and green?

Come, read after me: red + green = yellow.

[255,0,0]+[0,255,0] = [255,255,0]

In this way the computer can calculate things such as the brightness of a color and the relationships between colors.

As long as the dimensions are assigned sensibly, the computer really can calculate things like: king + female = queen, wonderful = -terrible, happy > smile.

So you could say the computer does understand the meaning of words; it just isn't intuitive about it the way you are, it is all calculation.
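Here is a tiny sketch of that kind of calculation with NumPy. The color vectors come from the table above; the "word vectors" are invented purely for illustration, since real embeddings are learned during training:

import numpy as np

red   = np.array([255, 0, 0])
green = np.array([0, 255, 0])
print(red + green)   # [255 255   0] -> yellow

# Invented 3-dimensional word vectors: (royalty, power, femininity)
king   = np.array([0.9, 0.8, 0.1])
female = np.array([0.0, 0.0, 0.8])
queen  = king + female
print(queen)         # [0.9 0.8 0.9] -> close to what a "queen" vector might look like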

The embedding layer is what assigns each word its values along suitable dimensions.

Embedding(vocab_size, embedding_dim, input_length)

  • vocab_size: the size of the dictionary, i.e. how many distinct words there are.
  • embedding_dim: the output size of this layer, i.e. how many dimensions are used to represent one word.
  • input_length: the length of the input, i.e. how many words there are in one sentence, usually max_length (the maximum length in the training set). A quick shape check of this layer follows below.
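As that shape check (a standalone snippet with small assumed values, not the article's model), an Embedding layer turns a batch of word-index sequences into a batch of vector sequences:

import numpy as np
import tensorflow as tf

# Assumed toy values: 100 words in the dictionary, 8 dimensions per word
layer = tf.keras.layers.Embedding(input_dim=100, output_dim=8)
batch = np.array([[1, 2, 3, 0]])   # one padded sentence of 4 word indices
print(layer(batch).shape)          # (1, 4, 8): 1 sentence, 4 words, 8 dimensions each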

4.3 GlobalAveragePooling1D global average pooling in one dimension

The main job here is dimensionality reduction. In the end we only need a single value, good or bad, but at this point there are far too many dimensions, so they have to be reduced: this layer averages the word vectors across the length of the sentence, leaving one vector per sentence.
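A small illustration of what the averaging does (toy numbers of my own):

import numpy as np
import tensorflow as tf

# One sentence of 2 words, each word a 3-dimensional vector
x = np.array([[[1.0, 2.0, 3.0],
               [3.0, 4.0, 5.0]]])
pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
print(pooled.numpy())   # [[2. 3. 4.]] -> the element-wise average of the two word vectors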

4.4 Dense

The Dense(64, activation='relu') layer narrows down to Dense(1, activation='sigmoid'), so that a single result comes out at the end, just like the assembly line earlier takes in flour, water, meat, vegetables and other raw materials and finally turns out steamed buns.

4.5 Activation Activation function

Activation is an activation function whose primary role is to provide nonlinear modeling capabilities for networks.

A linear problem is one that can be solved by drawing a single straight line. You can try this out with TensorFlow.

Using linear thinking, the neural network can quickly distinguish between the two samples.

But if you have a sample like this, you can’t just draw a straight line.

If the relu activation function is used, it can be easily distinguished.

That’s what the activation function does.

The commonly used ones are listed below, together with their formulas and graphs.

We used Relu and Sigmoid.

  • Relu: Rectified Linear Unit, the most commonly used activation function.
  • Sigmoid: also called the logistic function; it maps any real number into the interval (0, 1). A small sketch of both follows below.
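For reference, a minimal sketch of the two functions (my own illustration):

import numpy as np

def relu(x):
    # negative inputs become 0, positive inputs pass through unchanged
    return np.maximum(0, x)

def sigmoid(x):
    # squashes any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-x))

print(relu(np.array([-2.0, 0.0, 3.0])))     # [0. 0. 3.]
print(sigmoid(np.array([-2.0, 0.0, 3.0])))  # roughly [0.12 0.5  0.95]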

For the last Dense layer, Dense(1, activation='sigmoid'), we use sigmoid because in our dataset 0 means negative and 1 means positive, and we want the model's output to fall between 0 and 1.

5. Train the model

Training the model is like switching on the assembly line: pass in the training data and the validation data and call the fit method.

# The number of training epochs (10, judging by the log below)
num_epochs = 10
# Train: pass in the training data and the validation data
model.fit(training_padded, training_labels, epochs=num_epochs,
    validation_data=(testing_padded, testing_labels), verbose=2)
# Save the trained weights
model.save_weights('checkpoint/checkpoint')

After startup, the log print looks like this:

Epoch 1/10 500/500 - 61s - loss: 0.6088 - accuracy: 0.6648 - val_loss: 0.5582 - val_accuracy: 0.7275 
Epoch 2/10 500/500 - 60s - loss: 0.4156 - accuracy: 0.8130 - val_loss: 0.5656 - val_accuracy: 0.7222 
Epoch 3/10 500/500 - 60s - loss: 0.2820 - accuracy: 0.8823 - val_loss: 0.6518 - val_accuracy: 0.7057

Finally, call save_weights to save the results.
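If you come back to this in a new session, the saved weights can be restored onto a model built with the same structure before running predictions (a sketch using the standard Keras API and the checkpoint path from above):

# Rebuild the same model structure as in section 4 first, then restore the weights
model.load_weights('checkpoint/checkpoint')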

6. Verify the results

import numpy as np

sentences = [
    "Good, good acting.",
    "It would be weird if it were better.",
    "One star for subtitles.",
    "Good acting, good acting, bad acting.",
    "Good acting, good acting, good acting, good acting, bad acting."
]

# Word segmentation
v_len = len(sentences)
for i in range(v_len):
    sentences[i] = " ".join(jieba.cut(sentences[i]))

# Serialization
sequences = tokenizer.texts_to_sequences(sentences)
# Pad to the standard length
padded = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')
# Predict
predicts = model.predict(np.array(padded))
# Print the results
for i in range(len(sentences)):
    print(sentences[i], predicts[i][0], '===> high praise' if predicts[i][0] > 0.5 else '===> bad review')

The final result is printed:

Very good, very good acting                                       0.93863165    ===> high praise
It would be strange if the reviews were good                      0.32386222    ===> bad review
One star to subtitles                                             0.0030411482  ===> bad review
Good acting, good acting, very bad                                0.21595979    ===> bad review
Good acting, good acting, good acting, good acting, bad acting    0.71479297    ===> high praise

The full code has been uploaded to Github at github.com/hlwgy/douba…

This article is written for beginners. To make it easier to follow, some details are deliberately omitted and the material is kept fairly shallow; the aim is to introduce the overall process and principles, as an introduction only.