Word embeddings provide dense representations of words and their relative meanings. They are an improvement over the sparse representations used in simpler bag-of-words models, and they can be learned from text data and reused across projects. They can also be learned as part of fitting a neural network on text data.

Word Embedding

Word embedding is a class of approaches for representing words and documents using dense vector representations.

Word embeddings are an improvement over traditional bag-of-words encoding schemes, in which each word or each word's score is represented by a large, sparse vector whose dimensionality matches the entire vocabulary. These representations are sparse because the vocabulary is huge, so a given word or document is represented by a vector made up largely of zeros.

In contrast, in an embedding, words are represented by dense vectors, where a vector represents the projection of the word into a continuous vector space.

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

The learned position of a word in a vector space is called its Embedding.

Two popular examples of methods for learning word embeddings from text include:

  • Word2Vec.
  • GloVe.

In addition to these carefully designed methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but it tailors the embedding to a specific training dataset.

Keras Embedding Layer

Keras provides an Embedding layer that can be used for neural networks on text data.

It requires the input data to be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.
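For example, a minimal sketch of this preparation step, using a couple of made-up example sentences, might look like the following:

from keras.preprocessing.text import Tokenizer
# hypothetical example sentences (not the dataset used later in this post)
sentences = ['the cat sat on the mat', 'the dog ate my homework']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)                     # build the word-to-integer mapping
sequences = tokenizer.texts_to_sequences(sentences)   # encode each sentence as a list of integers
print(tokenizer.word_index)                           # e.g. {'the': 1, 'cat': 2, ...}
print(sequences)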

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used in a number of ways, such as:

  • It can be used alone to learn a word embedding that can be saved and used in another model later.
  • It can be used as part of a deep learning model in which embeddings are learned along with the model itself.
  • It can be used to load pre-trained word embedding models, which is a kind of transfer learning.

The Embedding layer is defined as the first hidden layer of a network. You must specify three arguments:

  • input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0 and 9, then the size of the vocabulary is 10 words;
  • output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vector from this layer for each word. For example, it could be 32 or 100 or even larger; it can be treated as a hyperparameter for your specific problem;
  • input_length: This is the length of the input sequences, as you would define for any input layer of a Keras model, that is, the number of words in each input. For example, if all of your input documents are made up of 1000 words, input_length would be 1000.

For example, below we define an Embedding layer with a vocabulary of 200 (such as words integer encoded from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.

e = Embedding(input_dim=200, output_dim=32, input_length=50)

The Embedding layer has its own weights, which will be included if you save the model to a file.

The output of the Embedding layer is a 2D matrix with one embedding vector for each word in the input sequence of words (input document).

If you want to connect a Dense layer directly behind the Embedding layer, you must first flatten the 2D output matrix of the Embedding layer into a 1D vector using a Flatten layer.

Now, let’s see how we can use the embedding layer in practice.

Example of Learning an Embedding

In this section, we will look at how to learn a word embedding while fitting a neural network on a text classification problem.

We will define a small problem with 10 text documents, each a short comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”. This is a simple sentiment analysis problem.

First, we will define the documents and their class labels.

# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

Next, let's integer encode each document. This means that, as input, the Embedding layer will receive sequences of integers. We could experiment with other, more sophisticated bag-of-words encodings such as counts or TF-IDF.

Keras provides the one_hot() function, which creates a hash of each word as an efficient integer encoding. We use an estimated vocabulary size of 50, which is much larger than needed and reduces the probability of collisions from the hash function.

vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[6, 16], [42, 24], [2, 17], [42, 24], [18], [17], [22, 17], [27, 42], [22, 24], [49, 46, 16, 34]]

The sequences have different lengths, but Keras prefers inputs to be vectorized and all inputs to have the same length. We will pad all input sequences to a length of 4. Again, we can do this with a built-in Keras function, in this case pad_sequences().

# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[ 6 16  0  0]
 [42 24  0  0]
 [ 2 17  0  0]
 [42 24  0  0]
 [18  0  0  0]
 [17  0  0  0]
 [22 17  0  0]
 [27 42  0  0]
 [22 24  0  0]
 [49 46 16 34]]

We are now ready to define our embedding layer as part of our neural network model.

The Embedding has a vocabulary of 50 and an input length of 4, and we will choose an embedding space of 8 dimensions.

The model is a simple binary classification model. Importantly, the output of the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten this to a 32-element vector to pass on to the Dense output layer.

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 4, 8)              400
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________

Finally, we can fit and evaluate the classification model.

# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
Accuracy: 100.000000

Here is the complete code, where we have rewritten the model definition using the functional API; the structure is exactly the same.

from keras.layers import Dense, Flatten, Input
from keras.layers.embeddings import Embedding
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
input = Input(shape=(4, ))
x = Embedding(vocab_size, 8, input_length=max_length)(input)
x = Flatten()(x)
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=input, outputs=x)
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy * 100))

After training, we can save the weights learned in the Embedding layer to a file for later use in other models.
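For example, one way to extract and save just the learned embedding weights is sketched below; it assumes the functional-API model defined above and uses NumPy for storage (the file name is a placeholder).

from numpy import save
from keras.layers.embeddings import Embedding
# find the Embedding layer (in the functional model above it sits just after the Input layer)
embedding_layer = [layer for layer in model.layers if isinstance(layer, Embedding)][0]
weights = embedding_layer.get_weights()[0]   # array of shape (vocab_size, 8)
save('embedding_weights.npy', weights)       # hypothetical file name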

You can often also use this model to classify other documents that contain terms similar to those seen in the test dataset, as sketched below.
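As a rough sketch, assuming the same one_hot()/pad_sequences() preprocessing used above and the hypothetical new comments shown here, prediction on unseen text could look like this:

# hypothetical new comments, encoded the same way as the training documents
new_docs = ['Great work', 'Very poor effort']
encoded = [one_hot(d, vocab_size) for d in new_docs]
padded = pad_sequences(encoded, maxlen=max_length, padding='post')
print(model.predict(padded))   # values near 1.0 suggest positive, near 0.0 negative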

Next, let’s look at loading pre-trained word embedding into Keras.

Example of Using Pre-Trained GloVe Embedding

The Keras Embedding layer can also use a word embedding learned elsewhere.

In the field of natural language processing, it is common to learn, save, and share word embeddings.

For example, the researchers behind the GloVe method provide a suite of pre-trained word embeddings released under a public domain license.

The smallest package of embeddings is 822Mb and is called "glove.6B.zip". It was trained on a dataset of 6 billion tokens (words) with a vocabulary of 400,000 words, and it provides several embedding vector sizes, including 50, 100, 200 and 300 dimensions.

You can download this collection of embeddings, and we can seed the Keras Embedding layer with weights from the pre-trained embedding for the words in the training dataset.

This example is inspired by an example from the Keras project: pretrained_word_embeddings.py.

After downloading and unzipping, you will see a few files, one of which is "glove.6B.100d.txt", which contains a 100-dimensional version of the embedding.

If you peek inside the file, you will see a token (word) followed by its weights (100 numbers) on each line. For example, below is the first line of the embedding ASCII text file, showing the embedding for "the".

the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.5958 -0.52028 -0.1459 0.8278 0.27062

As in the previous section, the first step is to define the documents, integer encode them, and then pad the sequences to the same length.

In this case, we need to be able to map words to integers and integers to words.

Keras provides a Tokenizer class that can be fit on the training data, can convert text to sequences consistently by calling the texts_to_sequences() method on the Tokenizer class, and provides access to the dictionary mapping of words to integers in its word_index attribute.

# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

Next, we need to load the entire GloVe word embedding file into memory as a dictionary mapping words to their embedding arrays.

# load the whole embedding into memory
embeddings_index = dict()
f = open('glove.6B.100d.txt')
for line in f:
	values = line.split()
	word = values[0]
	coefs = asarray(values[1:], dtype='float32')
	embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

This is pretty slow. It might be better to filter the embedding for only the words that appear in your training data, as sketched below.
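A sketch of that filtering idea, assuming the Tokenizer t has already been fit as above, might look like this:

from numpy import asarray
# only keep vectors for words that actually occur in the training documents
embeddings_index = dict()
f = open('glove.6B.100d.txt')
for line in f:
	values = line.split()
	word = values[0]
	if word in t.word_index:   # skip words we will never look up
		embeddings_index[word] = asarray(values[1:], dtype='float32')
f.close()
print('Loaded %d filtered word vectors.' % len(embeddings_index))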

Next, we need to create a matrix with one embedding for each word in the training dataset. We can do this by enumerating all the unique words in t.word_index and looking up the embedding weight vector in the loaded GloVe embedding.

The result is a weight matrix for only the words that will be seen during the training.

# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector

Now we can define our model and evaluate it as before.

The key difference is that the Embedding layer can be seeded with the GloVe word embedding weights. We chose the 100-dimensional version, so the Embedding layer must be defined with output_dim set to 100. Finally, we do not want to update the learned word weights in this model, so we set the trainable attribute of the layer to False.

e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)

A complete working example is listed below.

from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# load the whole embedding into memory
embeddings_index = dict()
f = open('../glove_data/glove.6B/glove.6B.100d.txt')
for line in f:
	values = line.split()
	word = values[0]
	coefs = asarray(values[1:], dtype='float32')
	embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))
# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector
# define model
model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Running this example may take a bit longer, but it shows that the network is just as capable of fitting this simple problem.

[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]

[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]

Loaded 400000 word vectors.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 4, 100)            1500
_________________________________________________________________
flatten_1 (Flatten)          (None, 400)               0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 401
=================================================================
Total params: 1901
Trainable params: 401
Non-trainable params: 1500
_________________________________________________________________
Accuracy: 100.000000

In practice, it is worth experimenting with both: learning on top of a pre-trained embedding that is kept fixed, and fine-tuning the pre-trained embedding for your problem, similar to the way pre-trained VGG or ResNet models are transferred to specific problems in computer vision.
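For example, one variant of that idea is to seed the layer with the GloVe weights but leave it trainable, so the vectors serve only as a starting point and can be fine-tuned for the task; a one-line sketch under that assumption:

# same as before, but the pre-trained vectors are only a starting point for further training
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=True)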

But it depends on what works best for your particular problem.

Embedding Example on the IMDB Dataset
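The model code below assumes that x_train, y_train and maxlen already exist. A minimal preparation sketch, assuming the built-in keras.datasets.imdb loader restricted to the 10,000 most frequent words and reviews cut to 20 words, might look like this:

from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
max_features = 10000   # keep only the 10,000 most frequent words
maxlen = 20            # cut the reviews off after 20 words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)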

from keras.models import Sequential, Model
from keras.layers import Flatten, Dense, Embedding, Input

input_layer = Input(shape=(maxlen,))
x = Embedding(input_dim=10000, output_dim=8)(input_layer)
# a sub-model that exposes the output of the Embedding layer, used below to inspect the embeddings
embedding = Model(input_layer, x)
x = Flatten()(x)
x = Dense(1, activation='sigmoid')(x)
model = Model(input_layer, x)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_4 (InputLayer)         (None, 20)                0
_________________________________________________________________
embedding_5 (Embedding)      (None, 20, 8)             80000
_________________________________________________________________
flatten_5 (Flatten)          (None, 160)               0
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 161
=================================================================
Total params: 80161
Trainable params: 80161
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 2s 105us/step - loss: 0.6772 - acc: 0.6006 - val_loss: 0.6448 - val_acc: 0.6704
Epoch 2/10
20000/20000 [==============================] - 2s 93us/step - loss: 0.5830 - acc: 0.7188 - val_loss: 0.5629 - val_acc: 0.7046
Epoch 3/10
20000/20000 [==============================] - 2s 95us/step - loss: 0.5152 - acc: 0.7464 - val_loss: 0.5362 - val_acc: 0.7208
Epoch 4/10
20000/20000 [==============================] - 2s 93us/step - loss: 0.4879 - acc: 0.7607 - val_loss: 0.5299 - val_acc: 0.7292
Epoch 5/10
20000/20000 [==============================] - 2s 97us/step - loss: 0.4731 - acc: 0.7694 - val_loss: 0.5290 - val_acc: 0.7334
Epoch 6/10
20000/20000 [==============================] - 2s 98us/step - loss: 0.4633 - acc: 0.7773 - val_loss: 0.5317 - val_acc: 0.7344
Epoch 7/10
20000/20000 [==============================] - 2s 96us/step - loss: 0.4548 - acc: 0.7819 - val_loss: 0.5333 - val_acc: 0.7318
Epoch 8/10
20000/20000 [==============================] - 2s 93us/step - loss: 0.4471 - acc: 0.7870 - val_loss: 0.5377 - val_acc: 0.7288
Epoch 9/10
20000/20000 [==============================] - 2s 95us/step - loss: 0.4399 - acc: 0.7924 - val_loss: 0.5422 - val_acc: 0.7278
Epoch 10/10
20000/20000 [==============================] - 2s 90us/step - loss: 0.4328 - acc: 0.7957 - val_loss: 0.5458 - val_acc: 0.7290

Let's look at the shape of the input.

x_train[1].shape
x_train[1]
x_train[:1].shape
x_train[:1]
(20,)
array([ 23,   4,   2,  15,  16,   4,   2,   5,  28,   6,  52, 154, 462,
        33,  89,  78, 285,  16, 145,  95], dtype=int32)
(1, 20)
array([[ 65,  16,  38,   2,  88,  12,  16, 283,   5,  16,   2, 113, 103,
         32,  15,  16,   2,  19, 178,  32]], dtype=int32)

Now let's look at the output of the embedding sub-model for the first sample (for example, via embedding.predict(x_train[:1])):

(1, 20, 8)
array([[[-0.17401133, -0.08743777,  0.15631911, -0.06831486,
         -0.09105065,  0.06253908, -0.0798945 ,  0.07671431],
        [ 0.06872956, -0.00586612,  0.07713806, -0.00182899, ...],
        ...
        [ 0.00225757, -0.12751001, -0.12703758,  0.17167819, ...]]],
      dtype=float32)

As can be seen, the Embedding layer maps an input sample of shape (1, 20) (a sentence of at most 20 words, each word represented by an integer) to a tensor of shape (1, 20, 8); that is, each word is embedded as an 8-dimensional vector. The parameters of the Embedding layer are learned by the neural network, and after the Embedding layer the data is in a format that can easily be processed further by a CNN or an RNN, as sketched below.
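For instance, a minimal sketch of feeding the same kind of embedding into a recurrent layer (an LSTM is used here purely as an illustration; it is not part of the example above) could look like this:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=8, input_length=20))  # output shape: (batch, 20, 8)
model.add(LSTM(32))                                                   # consumes the sequence of 8-dimensional vectors
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])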

Reference

  • How to Use Word Embedding Layers for Deep Learning with Keras – Machine Learning Mastery