Keras is a high-level neural network API written in Python that can run on top of TensorFlow, CNTK, or Theano as a backend. Keras development focuses on enabling rapid experimentation; being able to go from idea to result with the least possible delay is key to doing good research.
This article uses the Kaggle project IMDB Movie Review Sentiment Analysis as an example to show how to build a neural network with Keras and apply it to a practical problem. Reading it requires a basic understanding of neural networks.


The article is divided into two parts:

  • Basic concepts and API usage in Keras. I will give some simple usage examples or links to relevant material.
  • A hands-on IMDB movie review sentiment analysis that uses everything covered in Part 1.

Model

Dense fully connected layer

keras.layers.core.Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
# as first layer in a sequential model:

model = Sequential()
model.add(Dense(32, input_shape=(16,)))

# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)
# after the first layer, you don’t need to specify
# the size of the input anymore:

model.add(Dense(32))

Embedding layer

keras.layers.embeddings.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)
Check out this link if you are interested: https://machinelearningmastery.c … eep-learning-keras/


Word to vector: this layer turns text into a representation in which every word is a vector.

  • input_dim: the size of the vocabulary, i.e. the total number of distinct words.
  • output_dim: the dimensionality of the vector each word is converted into.
  • input_length: the number of words in each sentence.
For example: we feed in an M×50 matrix built from 200 distinct words, and we want each word converted into a 32-dimensional vector. The layer returns a tensor of shape (M, 50, 32).


Each sentence has 50 words, each word is a 32-dimensional vector, and there are M sentences in total, so the output shape is (M, 50, 32).

e = Embedding(200, 32, input_length=50)
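If you want to check this shape yourself, here is a minimal sketch (the Sequential wrapper and the random integer input are assumptions for illustration, not part of the original example):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# vocabulary of 200 words, 32-dimensional vectors, sentences of 50 words
model = Sequential()
model.add(Embedding(200, 32, input_length=50))

# M = 3 sentences, each a row of 50 word indices in the range [0, 200)
data = np.random.randint(0, 200, size=(3, 50))
print(model.predict(data).shape)  # (3, 50, 32)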

LSTM layer.

LSTM is a special case of recurrent neural network.
Deeplearning.net/tutorial/ls…


To put it simply, the networks we have discussed so far, including CNNs, are feed-forward and do not take sequence order into account, but the meaning of a word depends on its context. For example, in "I use a Xiaomi phone and eat millet porridge" (in Chinese, both "Xiaomi" and "millet" are written 小米), the two occurrences clearly do not mean the same thing, so semantic analysis has to consider context, which is exactly what a recurrent neural network (RNN) does. Or take "The movie was of high quality, but I didn't like it": the sentence contains both positive and negative opinions, and an LSTM can pick up on the "but", which is what we want to focus on.

keras.layers.recurrent.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
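As a rough sketch of how the layer behaves (the sizes below are illustrative assumptions, not taken from the article), return_sequences controls whether you get one output per time step or only the last one:

from keras.layers import Input, LSTM

inp = Input(shape=(50, 32))                    # 50 time steps, 32 features each
seq = LSTM(60, return_sequences=True)(inp)     # shape (?, 50, 60): an output for every word
last = LSTM(60, return_sequences=False)(inp)   # shape (?, 60): only the final output
print(seq.shape, last.shape)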

Pooling layer

  • keras.layers.pooling.GlobalMaxPooling1D() # global max pooling over the time dimension (see the sketch after this list)
    Stackoverflow.com/questions/4…

    • Input: 3D tensor with shape (samples, steps, features)
    • Output: 2D tensor with shape (samples, features)
  • keras.layers.pooling.MaxPooling1D(pool_size=2, strides=None, padding='valid')
  • keras.layers.pooling.MaxPooling2D(pool_size=(2, 2), strides=None, padding='valid', data_format=None)
  • keras.layers.pooling.MaxPooling3D(pool_size=(2, 2, 2), strides=None, padding='valid', data_format=None)
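Here is the sketch referenced above, a minimal check of the (samples, steps, features) → (samples, features) shape change; the concrete sizes are assumptions for illustration:

import numpy as np
from keras.models import Sequential
from keras.layers import GlobalMaxPooling1D

model = Sequential()
model.add(GlobalMaxPooling1D(input_shape=(400, 128)))  # input: (samples, 400 steps, 128 features)

data = np.random.rand(2, 400, 128)
print(model.predict(data).shape)  # (2, 128): the maximum over the 400 steps for each feature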

Data preprocessing

Text preprocessing

  • keras.preprocessing.text.text_to_word_sequence(text, filters=base_filter(), lower=True, split=" ")
  • keras.preprocessing.text.one_hot(text, n, filters=base_filter(), lower=True, split=" ")
  • keras.preprocessing.text.Tokenizer(num_words=None, filters=base_filter(), lower=True, split=" ")

    Tokenizer is a tool for vectorizing text, i.e. converting text into sequences of word indices in the dictionary (indices start from 1).

    • num_words: None or an integer, the maximum number of words to keep. If set to an integer, the tokenizer only keeps the num_words most frequent words in the dataset.
    • Whatever num_words is, the dictionary built by fit_on_texts is the same and every word gets an index; only the output of texts_to_sequences differs.
    • A sentence is represented by the indices of the (num_words - 1) most common words.
    • Note that X_t varies with num_words: only the top (num_words - 1) words in the dictionary are kept, so particularly rare words in a sentence are filtered out. For example, if the sentence is "x y z" and y and z are not among the top (num_words - 1) words, the final sequence for the sentence is just [x_index_in_dict].


t1 = "i love that girl"
t2 = 'i hate u'
texts = [t1, t2]
tokenizer = Tokenizer(num_words=None)
tokenizer.fit_on_texts(texts)

# Get an index for each word in the dictionary.
print(tokenizer.word_counts)
# OrderedDict([('i', 2), ('love', 1), ('that', 1), ('girl', 1), ('hate', 1), ('u', 1)])
print(tokenizer.word_index)
# {'i': 1, 'love': 2, 'that': 3, 'girl': 4, 'hate': 5, 'u': 6}
print(tokenizer.word_docs)
# {'i': 2, 'love': 1, 'that': 1, 'girl': 1, 'u': 1, 'hate': 1}
print(tokenizer.index_docs)
# {1: 2, 2: 1, 3: 1, 4: 1, 6: 1, 5: 1}

tokennized_texts = tokenizer.texts_to_sequences(texts)
print(tokennized_texts)
# [[1, 2, 3, 4], [1, 5, 6]]  each word is represented by its index

X_t = pad_sequences(tokennized_texts, maxlen=None)
# pad_sequences turns the list into a 2D array (matrix); every row has maxlen entries, with 0 for missing words
print(X_t)
# [[1 2 3 4]
#  [0 1 5 6]]
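To make the num_words behaviour described above concrete, here is a small follow-up sketch on the same texts (num_words=3 is an arbitrary choice for illustration):

from keras.preprocessing.text import Tokenizer

tokenizer3 = Tokenizer(num_words=3)
tokenizer3.fit_on_texts(texts)
print(tokenizer3.word_index)
# the full dictionary is still built: {'i': 1, 'love': 2, 'that': 3, 'girl': 4, 'hate': 5, 'u': 6}
print(tokenizer3.texts_to_sequences(texts))
# [[1, 2], [1]]  only the top num_words - 1 = 2 words ('i' and 'love') survive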

Sequence preprocessing

  • keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.) — returns a 2D tensor (a matrix)
  • keras.preprocessing.sequence.skipgrams(sequence, vocabulary_size, window_size=4, negative_samples=1., shuffle=True, categorical=False, sampling_table=None)
  • keras.preprocessing.sequence.make_sampling_table(size, sampling_factor=1e-5)
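As a small sketch of what skipgrams produces (the toy index sequence below is an assumption for illustration):

from keras.preprocessing.sequence import skipgrams

# a sentence already encoded as word indices; vocabulary_size must exceed the largest index
couples, labels = skipgrams([1, 2, 3], vocabulary_size=5, window_size=1, negative_samples=1.)
print(couples)  # word pairs, e.g. [[2, 3], [1, 2], [3, 2], [2, 1], ...] plus randomly sampled negatives
print(labels)   # 1 for real context pairs, 0 for negative samples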

Keras in practice: IMDB movie review sentiment analysis

Introduction to the dataset

  • labeledTrainData.tsv / imdb_master.csv: training data whose reviews are already labeled as positive or negative
  • testData.tsv: the test set, for which we need to predict whether each review is positive or negative
The main steps

  • Read the data
  • Clean the data: mainly remove stop words, HTML tags, and punctuation
  • Build the model
    • Embedding layer: converts words to vectors
    • LSTM layer
    • Pooling layer: extracts the most important features
    • Fully connected layer: classification


Data loading

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df_train = pd.read_csv("./dataset/word2vec-nlp-tutorial/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
df_train1 = pd.read_csv("./dataset/imdb-review-dataset/imdb_master.csv", encoding="latin-1")
df_train1 = df_train1.drop(["type", "file"], axis=1)
df_train1.rename(columns={'label': 'sentiment', 'Unnamed: 0': 'id', 'review': 'review'}, inplace=True)
df_train1 = df_train1[df_train1.sentiment != 'unsup']
df_train1['sentiment'] = df_train1['sentiment'].map({'pos': 1, 'neg': 0})
new_train = pd.concat([df_train, df_train1])

Data cleaning

Process the HTML with BeautifulSoup (bs4), then:

  • Keep only letters (filter out everything that is not a letter)
  • Remove stop words
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_words(raw_review):
    review_text = BeautifulSoup(raw_review, 'lxml').get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)  # replace non-letters with a space
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    return (" ".join(meaningful_words))

new_train['review'] = new_train['review'].apply(review_to_words)
df_test["review"] = df_test["review"].apply(review_to_words)

Building the network with Keras

Convert the text to a matrix


– Fit the Tokenizer on the list of sentences to build the dictionary, then replace each word with its index in the dictionary to obtain a matrix of numbers.

– pad_sequences pads with 0 so that every row of the matrix has the same length, i.e. every sentence has the same number of words.
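The snippets below assume roughly the following imports (a sketch; exact module paths can differ slightly between Keras versions):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, GlobalMaxPool1D, Dropout, Dense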

list_classes = ["sentiment"]
y = new_train[list_classes].values
print(y.shape)
list_sentences_train = new_train["review"]
list_sentences_test = df_test["review"]

max_features = 6000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
print(len(tokenizer.word_index))

totalNumWords = [len(one_comment) for one_comment in list_tokenized_train]
print(max(totalNumWords), sum(totalNumWords) / len(totalNumWords))

maxlen = 400
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)

Model building

  • Convert words to vectors (the Input and Embedding layers)

inp = Input(shape=(maxlen, ))
print(inp.shape)  # (?, 400)  400 words per sentence

embed_size = 128  # each word is converted into a 128-dimensional vector
x = Embedding(max_features, embed_size)(inp)
print(x.shape)  # (?, 400, 128)
  • LSTM layer with 60 units
  • GlobalMaxPool1D: effectively keeps only the most important outputs along the sequence
  • Dropout: randomly discards part of the output, a form of regularization that helps prevent overfitting
  • Dense: fully connected layer
  • Model compilation: specifies the loss function, the optimizer, and the metrics used to evaluate the model

x = LSTM(60, return_sequences=True, name='lstm_layer')(x)
print(x.shape)
x = GlobalMaxPool1D()(x)
print(x.shape)
x = Dropout(0.1)(x)
print(x.shape)
x = Dense(50, activation="relu")(x)
print(x.shape)
x = Dropout(0.1)(x)
print(x.shape)
x = Dense(1, activation="sigmoid")(x)
print(x.shape)

model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
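As a quick sanity check on the shapes printed above, model.summary() lists every layer's output shape and parameter count:

model.summary()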

  • Model training

batch_size = 32
epochs = 2
print(X_t.shape, y.shape)
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.2)

  • Predict with the model
prediction = model.predict(X_te)
y_pred = (prediction > 0.5)
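If you want to turn these predictions into a Kaggle submission file, a rough sketch could look like this (the 'id' column of df_test and the output path are assumptions, not shown in the original code):

submission = pd.DataFrame({
    "id": df_test["id"],                      # assumes df_test keeps an 'id' column
    "sentiment": y_pred.astype(int).ravel()   # boolean predictions converted to 0/1
})
submission.to_csv("submission.csv", index=False)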

Original address: www.cnblogs.com/sdu20112013…
