LSTM, short for Long Short-Term Memory, is a kind of recurrent neural network (RNN). LSTM is well suited to processing and predicting events separated by very long intervals and delays in a time series.

Generally speaking, LSTM is very good at predicting time-related data, and it is widely used in text processing (you can think of it as predicting which word is most likely to appear at time t+1 given the word that appears at time t). As for the technical details, uh, I have no idea.

But that doesn’t stop us from doing some fun things with LSTM. You don’t know how a rice cooker works either, yet you cook rice all the time. Next, let’s use the powerful “rice cooker” Keras to cook up some tasty dishes.

Write code automatically

The title is a little clickbaity, and the result is more artificial stupidity than artificial intelligence. Let’s say you’re writing code and you’ve typed something like this:

def afunc(i):
    i = i + 1
    retur

The next character would naturally be “n”. We convert a sequence of characters into a sequence of numbers as the input, and take the next character as the output. For example, with a sequence length of 3, “return” can be converted to:

r,e,t -> u
e,t,u -> r
t,u,r -> n

These individual characters are then mapped to numbers:

r -> 1
e -> 2
t -> 3
u -> 4
n -> 5

And finally into this form:

X1 = [1, 2, 3], y1 = 4
X2 = [2, 3, 4], y2 = 1
X3 = [3, 4, 1], y3 = 5

That way, we can train with Keras.
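As a rough sketch (the variable names like raw_text are mine, and the sequence length here just follows the toy example above, not necessarily what the actual script uses), the preprocessing might look like this:

import numpy as np
from keras.utils import np_utils

# Minimal sketch: assume the training corpus has been read into raw_text
chars = sorted(set(raw_text))
char_to_int = {c: i for i, c in enumerate(chars)}

seq_len = 3
dataX, dataY = [], []
for i in range(len(raw_text) - seq_len):
    dataX.append([char_to_int[c] for c in raw_text[i:i + seq_len]])
    dataY.append(char_to_int[raw_text[i + seq_len]])

# Reshape to [samples, time steps, features], scale to [0, 1],
# and one-hot encode the targets for the softmax output layer
X = np.reshape(dataX, (len(dataX), seq_len, 1)) / float(len(chars))
y = np_utils.to_categorical(dataY)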

I used the source code of the Python Requests library, with blank lines and comments removed, as the training corpus.

For details, you can refer to the script from this article:

Text Generation With LSTM Recurrent Neural Networks in Python with Keras
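For reference, here is a minimal sketch of the kind of character-level LSTM model that article describes; the layer sizes, dropout, and batch size are illustrative, not necessarily exactly what I used:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

# Two stacked LSTM layers over input of shape [samples, time steps, features],
# with a softmax over the character vocabulary
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Older Keras versions use nb_epoch instead of epochs
model.fit(X, y, epochs=40, batch_size=128)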

Results after training for 40 epochs:

Epoch 40/40
217004/217004 [==============================] - 270s - loss: 0.9860

Sample output:

As you can see, some of the output does look a bit like code, but there is a lot of repetition, caused by overfitting.

GRU is a variant of LSTM. It is said to have a simpler structure and fewer parameters than LSTM, and to be less prone to overfitting:

from keras.models import Sequential
from keras.layers import Dense, GRU

model = Sequential()
model.add(GRU(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(GRU(256))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

Results after training for 20 epochs:

Epoch 20/20
217004/217004 [==============================] - 224s - loss: 0.8398

You can see that it handles the repetition problem better than the LSTM did. If you have time, train for more epochs and see whether the results improve further.

Poetry robot

This is essentially the same as the example above, with Chinese characters in place of English letters.

The text is formatted as follows, with a colon separating the title from the poem, one poem per line, and about a thousand poems in total.

Mountain Pavilion late autumn: Mountain Pavilion full of autumn, Rock you cool wind. Thin blue still smoke, chrysanthemum still dew. Old stone clothes new moss, new nest seal old tree. Experience boundless, close round light twilight.

Data processing:

import re
import random
import numpy as np
import pandas as pd
from keras.utils import np_utils

with open('poetry.txt') as f:
    raw_text = f.read()
lines = raw_text.split("\n")[:-1]
# Drop the title, keep only the poem body after the colon
poem_text = [i.split(':')[1] for i in lines]
# Split each poem into characters and punctuation
char_list = [re.findall('[\x80-\xff]{3}|[\w\W]', s) for s in poem_text]
all_words = []
for i in char_list:
    all_words.extend(i)
# Build character <-> id mappings; the most frequent character gets id 1
word_dataframe = pd.DataFrame(pd.Series(all_words).value_counts())
word_dataframe['id'] = list(range(1, len(word_dataframe)+1))
word_index_dict = word_dataframe['id'].to_dict()
index_dict = {}
for k in word_index_dict:
    index_dict.update({word_index_dict[k]: k})
# Build training pairs: the previous seq_len characters predict the next one
seq_len = 2
dataX = []
dataY = []
for i in range(0, len(all_words) - seq_len, 1):
    seq_in = all_words[i : i + seq_len]
    seq_out = all_words[i + seq_len]
    dataX.append([word_index_dict[x] for x in seq_in])
    dataY.append(word_index_dict[seq_out])
X = np.array(dataX)
y = np_utils.to_categorical(np.array(dataY))

I set the sequence length to 2, i.e. the next character is predicted from the previous two, giving 217,687 samples in total.

The vocabulary of Chinese characters is orders of magnitude larger than the English alphabet; this corpus contains 4,620 distinct characters. So we use an embedding layer, mapping the one-hot encoded characters into a low-dimensional dense vector space to reduce the feature dimension.

from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense, Activation

model = Sequential()
model.add(Embedding(len(word_dataframe)+1, 512))
model.add(GRU(512))
model.add(Dense(y.shape[1]))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
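Note that with the Embedding layer, X stays as integer indices of shape (samples, seq_len) rather than one-hot vectors; the layer learns the 512-dimensional vectors itself. Training is then just a fit call (the batch size here is illustrative):

# X: integer-encoded character sequences, y: one-hot encoding of the next character
model.fit(X, y, batch_size=128, epochs=25)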

Define a function that generates poetry:

def gen_poem(seed_text):
    rows = 4
    cols = 6
    chars = re.findall('[\x80-\xff]{3}|[\w\W]', seed_text)
    if len(chars) != seq_len:
        return ""
    arr = [word_index_dict[k] for k in chars]
    for i in range(seq_len, rows * cols):
        if (i+1) % cols == 0:
            # End of a line: append punctuation (ids 1 and 2 are the two most
            # frequent tokens in the corpus, presumably the comma and full stop)
            if (i+1) / cols == 2 or (i+1) / cols == 4:
                arr.append(2)
            else:
                arr.append(1)
        else:
            proba = model.predict(np.array(arr[-seq_len:]), verbose=0)
            predicted = np.argsort(proba[1])[-5:]
            index = random.randint(0, len(predicted)-1)
            new_char = predicted[index]
            # Don't pick punctuation in the middle of a line
            while new_char == 1 or new_char == 2:
                index = random.randint(0, len(predicted)-1)
                new_char = predicted[index]
            arr.append(new_char)
    poem = [index_dict[i] for i in arr]
    return "".join(poem)
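Using it is just a matter of feeding in a two-character seed; the seed below is a hypothetical example, and it must be two characters that appear in the corpus:

# e.g. seed with "春花" ("spring flowers"); prints a generated poem of
# four lines, five characters plus punctuation per line
print(gen_poem('春花'))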

To avoid generating the same verse every time, I made it pick randomly from the five most likely candidates at each step.

Let’s train for 25 epochs and look at the results:

Not bad! At least the lines read fairly smoothly!

The poem describes how, on a moonlit night, the cries of apes drift in, and the apes take no pity on the empty city before them. It finds a new bronze dressing mirror, yet looking all around it cannot find the mirror’s owner. It expresses the poet’s feelings of loneliness and emptiness. (Forgive my forced interpretation…)

Here are some of the resulting verses:

1. Spring flowers bloom, where not to see flowers. In front of yumen landscape, not infinite love.

2. February out of life, what is who bosom friend. Li Rongbei music bell, heaven and earth gas tuning.

3. I don’t see each other in the world. Don’t have a sign, clouds will spring flowers.

4. Chrysanthemum still fear the wind, stone salt here. Li Rong dark pass three, the world toward the spot bamboo.

5. Leisurely shadow song and, cloud floating huang Yunji. Where the wind lies, a cup of wine ripe gold.

Results after training for 75 epochs:

1. No, a cup of wine makes you swim. At this time the moon hanging dance, why not go home.

2. The moon hangs in February, why I do not know the horizon. Tonight seems to grow, golden jade spring.

3. Small lotus rest zheng clothes, days regardless of non. If light tonight, when more fear of cold.

4. Chrysanthemum cup color shake, golden cloud cover pole. Wind flower sunset moon, clouds melancholy gentleman.

5. Leisurely know heaven and earth, when the glorious clouds. You are not poor, so there is no heaven and earth.

Seven-character lines:

1. Clear spring setting sorrow night long thinking, where do you want to treat me. King road far in the smoke, heaven and earth do not meet without.

2. Hang dance Phoenix day in February, what year light such as yan can recommend. When the wind is silent, cloud sorrow is not so born.

3. Small lotus rest qi compound, golden house of incense water. This pillow in what to do, the incense wind blowing flowers open.

4. Remnants of chrysanthemum cup color still look, cloud sorrow night moon hanging dance red. What do you need today, yunzhong embroidered skirt yan Fang.

5. Leisurely body has its own time, a hundred years silent to life. You can’t see it, you can walk alone at night.

Text sentiment analysis

The examples above were just for fun. Now let’s use LSTM for something more practical: text sentiment analysis.

Text sentiment analysis usually means giving a piece of text to a machine and having it decide whether the sentiment expressed is positive or negative.

Such as:

“I love this product so much that I would recommend it to my friends.” is positive,

“Garbage, not recommended, requesting a return.” is negative.

Text sentiment analysis is especially useful in online opinion analysis.

For the corpus, I scraped product reviews from JD.com; the URLs look like this:

https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv3006&productId=2695276&score=1&sortType=5&page=0&pageSize=10&isShadowSku=0

The parameter productId is the ID of the product; score=1 requests one-star reviews (negative) and score=5 requests five-star reviews (positive).
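A rough sketch of the scraping step with requests; the response is JSONP, and the field names 'comments' and 'content' below are my assumptions about the payload, so check them against the real response:

import json
import requests

# Fetch one page of one-star (negative) reviews for a product
url = ('https://club.jd.com/comment/productPageComments.action'
       '?callback=fetchJSON_comment98vv3006&productId=2695276'
       '&score=1&sortType=5&page=0&pageSize=10&isShadowSku=0')
resp = requests.get(url)
# Strip the JSONP callback wrapper before parsing the JSON
body = resp.text[resp.text.index('(') + 1 : resp.text.rindex(')')]
data = json.loads(body)
for c in data.get('comments', []):  # field names assumed, verify them
    print(c.get('content'))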

After scraping, organize the data into a CSV file, for example:

0," Very copycat! I can't read it at all, even I can't read it on two different computers -- winXP and Win10!" 0," Price cut right after purchase, must review." . 1," Very good removable hard disk, reasonable price, as good as other brands, very satisfactory." 1. "Must have in the office, very good"Copy the code

0 means negative; 1 means positive. The numbers of negative and positive reviews should be roughly balanced.

Unlike the poetry example above, here we need word segmentation; otherwise the input sequences would be too long.

Data preprocessing:

import pandas as pd
import numpy as np
import jieba
from keras.preprocessing import sequence

comments = pd.read_csv('jd_comments.csv', encoding='utf-8')
# Segment each review into words with jieba
comments['words'] = comments['content'].apply(lambda x: list(jieba.cut(x)))
all_words = []
for w in comments['words']:
    all_words.extend(w)
# Build a word -> id mapping; the most frequent word gets id 1
word_dict = pd.DataFrame(pd.Series(all_words).value_counts())
word_dict['id'] = list(range(1, len(word_dict)+1))
# Map each review to a sequence of word ids, then pad/truncate to length 50
comments['w2v'] = comments['words'].apply(lambda x: [word_dict['id'][w] for w in x])
comments['w2v'] = list(sequence.pad_sequences(comments['w2v'], maxlen=50))

Training and test data:

x_train = np.array(list(comments['w2v']))[::2]
y_train = np.array(list(comments['score']))[::2]
x_test = np.array(list(comments['w2v']))[1::2]
y_test = np.array(list(comments['score']))[1::2]

Model:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense, Activation

model = Sequential()
model.add(Embedding(len(word_dict)+1, 256))
model.add(LSTM(256))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The output is only one-dimensional, so the activation function is sigmoid and the loss function is binary_crossentropy.

Training:
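The fit call itself looks something like this (the batch size and epoch count here are illustrative, not necessarily what I used):

# Train on the even-indexed reviews, validate on the held-out test split
model.fit(x_train, y_train, batch_size=64, epochs=10,
          validation_data=(x_test, y_test))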

I only used 14,255 reviews, so training was very fast, and the training accuracy reached 96.38%.

Take a look at the accuracy of the test data set:

model.evaluate(x=x_test, y=y_test, verbose=0)
[0.37807498600423312, 0.87680651045320612]

The accuracy on the test set was 87.68%, which is reasonably good.

Define a function to see the effect:

def new_data(new_comment):
    # Segment, map to word ids, and pad to the same length used in training
    words = list(jieba.cut(new_comment))
    w2v = [word_dict['id'][x] for x in words]
    xn = sequence.pad_sequences([w2v], maxlen=50)
    return xn
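Classifying a new review is then a single predict call. The reviews below are made-up examples; note that new_data as written will raise a KeyError for words that never appeared in the training vocabulary:

# The sigmoid output is the probability that the review is positive (label 1).
# Example reviews: "very easy to use, recommend to friends" / "garbage, not recommended, want a refund"
for text in [u'非常好用，推荐给朋友', u'垃圾，不推荐，要求退货']:
    prob = model.predict(new_data(text))[0][0]
    label = 'positive' if prob > 0.5 else 'negative'
    print('%s -> %s (%.2f)' % (text, label, prob))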

The accuracy on new reviews is good, too:

“They all say it’s easy to use; well, ha ha.” It’s impressive that the model even gets a sarcastic review like this right.