This notebook classifies movie reviews as positive or negative. This is an example of binary, or two-class, classification, an important and widely applicable kind of machine learning problem. The notebook demonstrates a basic application of transfer learning with TensorFlow Hub and Keras.

We will use the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database. The reviews are split into 25,000 for training and 25,000 for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

This notebook uses tf.keras (a high-level API for building and training models in TensorFlow) and tensorflow_hub (a library for loading trained models from TFHub in a single line of code). For a more advanced text classification tutorial using tf.keras, see the MLCC Text Classification Guide.

Terminal: conda install tensorflow-hub and conda install tensorflow-datasets

!pip install tensorflow-hub

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

1. Download the IMDB data set

train_data, validation_data, test_data = tfds.load(
    name='imdb_reviews',
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

2. Explore the data

Let’s take a moment to understand the format of the data. Each sample is a sentence representing a movie review, together with a corresponding label. The sentence is not preprocessed in any way. The label is an integer value of 0 or 1, where 0 is a negative review and 1 is a positive review.

Let’s print the first ten samples.

train_examples_batch,train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch

Now print the first ten labels:

train_labels_batch

3. Build a model

Neural networks are built by stacking layers, which requires three main architectural decisions:

  • How to represent the text?
  • How many layers should the model have?
  • How many hidden units should each layer have?

In this example, the input data consists of sentences, and the labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors. Using a pre-trained text embedding as the first layer has three advantages:

  • You don’t have to worry about text preprocessing.
  • You can benefit from transfer learning.
  • The embeddings have a fixed size, which makes them simpler to process.

In this example, you will use a pre-trained text embedding model from TensorFlow Hub called google/nnlm-en-dim50/2.

There are many other pre-trained text embeddings from TFHub that could be used in this tutorial (a sketch of how to swap one in follows the list):

  • google/nnlm-en-dim128/2 – trained with the same NNLM architecture on the same data as google/nnlm-en-dim50/2, but with a larger embedding dimension. Larger-dimensional embeddings can improve results on your task, but may take longer to train.
  • google/nnlm-en-dim128-with-normalization/2 – the same as google/nnlm-en-dim128/2, but with additional text normalization, such as removing punctuation. This can help if the text in your task contains extra characters or punctuation.
  • google/universal-sentence-encoder/4 – a much larger model that produces 512-dimensional embeddings, trained with a deep averaging network (DAN) encoder.
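
Swapping in one of these alternatives only requires changing the TFHub handle passed to hub.KerasLayer; everything else in the notebook stays the same. A minimal sketch, assuming the nnlm-en-dim128/2 model is served from the same mirror as the dim50 model used later in this tutorial:

# Hedged sketch: use a larger embedding simply by pointing the layer at a different handle.
alt_embedding = "https://hub.tensorflow.google.cn/google/nnlm-en-dim128/2"
alt_hub_layer = hub.KerasLayer(alt_embedding, input_shape=[],
                               dtype=tf.string, trainable=True)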

There’s more! Find more text embedding vector models on TFHub.

Let’s first create a Keras layer that uses the TensorFlow Hub model to embed sentences, and try it out on a few input samples. Note that the output shape of the embeddings is (num_examples, embedding_dimension), regardless of the length of the input text.

embedding = "https://hub.tensorflow.google.cn/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

Now build the complete model:

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

The layers are stacked sequentially to build the classifier:

  1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence to its embedding vector. The pre-trained text embedding model you are using (google/nnlm-en-dim50/2) splits a sentence into tokens, embeds each token, and then combines the embeddings. The resulting dimensions are (num_examples, embedding_dimension). For this NNLM model, embedding_dimension is 50.
  2. This fixed-length output vector is piped through a fully connected (Dense) layer with 16 hidden units.
  3. The last layer is densely connected with a single output node. Applying the sigmoid activation to this value gives a floating point number between 0 and 1, representing a probability or confidence level (see the sketch after this list).
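
Because the final Dense(1) layer has no activation, the model outputs raw logits. A minimal sketch of converting them to probabilities with tf.sigmoid (the still-untrained model will, of course, produce essentially arbitrary values at this point):

# The model outputs logits; tf.sigmoid maps them to probabilities in [0, 1].
logits = model(train_examples_batch[:3])   # shape: (3, 1)
probabilities = tf.sigmoid(logits)
print(probabilities.numpy())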

3.1 Loss functions and optimizers

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with linear activation), we will use the binary_crossentropy loss function.

This is not the only choice of loss function; you could, for instance, use mean_squared_error. In general, however, binary_crossentropy is better suited to dealing with probabilities: it measures the “distance” between probability distributions, or, in our case, between the ground-truth distribution and the predictions.

Later, when you explore regression problems (for example, predicting house prices), you’ll see how to use another loss function called the mean square error.
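
As an illustration of the difference, here is a small hedged sketch that evaluates both losses on a made-up batch of labels and logits (the numbers are purely illustrative):

# Toy comparison of the two losses; the labels and logits below are invented for illustration.
y_true = tf.constant([[0.0], [1.0], [1.0]])
y_logits = tf.constant([[-1.2], [0.8], [2.5]])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
mse = tf.keras.losses.MeanSquaredError()

print("binary cross-entropy:", bce(y_true, y_logits).numpy())
print("mean squared error:  ", mse(y_true, tf.sigmoid(y_logits)).numpy())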

Now, configure the model to use the optimizer and loss function:

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

4. Train the model

Train the model for 10 epochs in mini-batches of 512 samples, i.e., 10 iterations over all the samples in the training data. While training, monitor the model’s loss and accuracy on the 10,000 samples of the validation set:

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)
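
model.fit returns a History object whose .history dictionary records the per-epoch metrics; the exact keys depend on the metrics configured during compile. A small sketch of inspecting it:

# Inspect the recorded per-epoch metrics (keys follow the compiled metrics, e.g. 'accuracy').
print(history.history.keys())
print("final training accuracy:", history.history['accuracy'][-1])
print("final validation accuracy:", history.history['val_accuracy'][-1])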

5. Evaluate the model

results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))
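
Since the hub layer accepts raw strings, the trained model can also score new reviews directly. A hedged sketch with two made-up reviews (not part of the original dataset):

# Score a couple of invented reviews; sigmoid turns the logits into probabilities
# that a review is positive, which can be thresholded at 0.5.
sample_reviews = tf.constant([
    "A wonderful, moving film with great performances.",
    "Dull, predictable, and far too long."
])
probs = tf.sigmoid(model(sample_reviews))
print(probs.numpy())
print("predicted positive:", (probs.numpy() > 0.5).flatten())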

6. Further reading

  • For a more general way to work with string inputs, and a more detailed analysis of accuracy and loss during training, see the text classification tutorial that uses preprocessed text.
  • Try more text-related tutorials that use trained models from TFHub.