In a previous Zhihu answer, we described how researchers at the University of Chicago used AI to write fake reviews for restaurants and hotels on Yelp.

Today we will share how to build an AI "water army" with Keras, write realistic restaurant reviews, and earn a promotion to commander of the water army division. After reading this tutorial, you'll know how to generate a 5-star Yelp restaurant review.

Here are some examples of reviews generated by the AI (unedited):

I had steak and mussels and Parmesan chicken, all very good, and we’ll be back.

The food, service and atmosphere are great. I would recommend it to all my friends.

Great atmosphere, great food, great service. It’s worth a try!

I had the steak, mussels with a side of chicken parmesan. All were very good. We will be back.

The food, service, atmosphere, and service are excellent. I would recommend it to all my friends

Good atmosphere, amazing food and great service.Service is also pretty good. Give them a try!

Here’s how:

  • Obtain and prepare training data.

  • Build character-level language models.

  • Tips for training models.

  • Generate random comments.

Training the model takes a few days even on a GPU. Fortunately, pre-trained model weights are available, so you can also skip straight to the review-generation section at the end.

Preparing the data

The Yelp dataset can be downloaded for free in JSON format from the Yelp website.

After downloading and extracting the dataset, you will find the two files you need in the dataset folder:

review.json

business.json

A word of caution: both files are large, especially review.json (3.7 GB).

Each line of review.json is one review in JSON string format. The two files lack the opening and closing brackets [], so their contents taken as a whole are not valid JSON. In addition, review.json is difficult to fit into memory all at once. So we first convert them into CSV files line by line with a script.

python json_converter.py ./dataset/review.json
python json_converter.py ./dataset/business.json
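
The converter script itself is not listed in this article (it lives in the project repo). As a rough idea, a minimal sketch, assuming each input line is a standalone JSON object, might look like this:

import sys

import pandas as pd

# Hypothetical sketch of json_converter.py: stream the line-delimited
# JSON in chunks so the 3.7 GB review.json never has to fit in memory,
# appending each chunk to a CSV with the same base name.
infile = sys.argv[1]
outfile = infile.replace('.json', '.csv')
reader = pd.read_json(infile, lines=True, chunksize=100000)
for i, chunk in enumerate(reader):
    chunk.to_csv(outfile, mode='a', index=False, header=(i == 0))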

After this, you will find two new files in the dataset folder; both are valid CSV files that can be opened with the pandas library. Here's what we're going to do: extract the text of 5-star reviews, but only from businesses whose categories include the 'Restaurants' label.

import pandas as pd

# Read the two CSV files into pandas dataframes
df_business = pd.read_csv('../dataset/business.csv')
df_review = pd.read_csv('../dataset/review.csv')
# Filter 'Restaurants' businesses
restaurants = df_business[df_business['categories'].str.contains('Restaurants')]
# Filter 5-star reviews
five_star = df_review[df_review['stars'] == 5]
# Merge the reviews with restaurants by key 'business_id';
# this keeps only 5-star restaurant reviews
combo = pd.merge(restaurants, five_star, on='business_id')
# Keep only the review text column
rnn_fivestar_reviews_only = combo[['text']]

Next, we remove newline characters and duplicate reviews.

# Remove newline characters
rnn_fivestar_reviews_only = rnn_fivestar_reviews_only.replace({r'\n+': ' '}, regex=True)
# Remove duplicated reviews
final = rnn_fivestar_reviews_only.drop_duplicates()

To show the model where a review begins and ends, we add special markers to the review text: each review is wrapped in double quotation marks, one review per line. A final prepared review looks like this:

“Hummus is amazing and fresh! Loved the falafels. I will definitely be back. Great owner, friendly staff”
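
As a rough sketch of this tagging step, assuming final is the deduplicated dataframe from above and that the double quotes serve as the start and end markers (the output filename here is hypothetical):

# Wrap each review in double quotes so the model can learn where a
# review starts and ends, then write one review per line
tagged = '"' + final['text'].astype(str) + '"'
with open('../dataset/reviews_tagged.txt', 'w') as f:
    f.write('\n'.join(tagged))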

The model we build is a character-level language model, meaning the smallest distinguishable unit is a character. You could also try a word-level model, where the input is tokenized into words. The character-level language model has advantages and disadvantages.

Advantages:

You don't have to worry about unknown words, and the model can learn words outside any fixed vocabulary.

Disadvantages:

It produces much longer sequences, so it is not as good as a word-level model at capturing long-range dependencies (the influence of an earlier part of a sentence on a later part).

Moreover, a character-level model consumes more computing resources during training.

The model is similar to Keras's lstm_text_generation.py example, except that we stack extra recurrent layers so the network can carry more information between the input and output layers. This produces more realistic Yelp reviews.

Before showing the code for the model, let's take a closer look at how stacked RNNs work. You may have seen stacking in standard neural networks (Dense layers in Keras): the output of one layer becomes the input of the next.

The first layer computes an activation a[1]<t> from the input x<t>, and the stacked second layer uses it to compute the next activation a[2]<t>.

The notation a[l]<t> denotes the activation of layer l at time step t.

Let's see how one activation value is computed.

To compute a[2]<3>, two inputs are needed: a[2]<2> (the same layer at the previous time step) and a[1]<3> (the layer below at the current time step):

a[2]<3> = g(Wa[2] [a[2]<2>, a[1]<3>] + ba[2])

where g is the activation function, and Wa[2] and ba[2] are the layer-2 parameters.

We can see that to stack RNNs, the lower RNN layer must pass its activations a<t> for every time step to the RNN above it. By default, an RNN layer in Keras, such as LSTM, returns only the activation of the final time step, a<T>. To return the activations for all time steps, we set the return_sequences parameter to True.

Here's how to build the model in Keras. Each input sample is a one-hot representation of 60 characters, with 95 possible characters in total.

Each output is a vector of 95 probabilities, one for each possible next character.

import keras
from keras import layers

model = keras.models.Sequential()
# The first LSTM returns activations for all 60 time steps
# (return_sequences=True) so the stacked LSTM sees the full sequence
model.add(layers.LSTM(1024, input_shape=(60, 95), return_sequences=True))
# The second LSTM returns only the last time step's activation
model.add(layers.LSTM(1024))
# Softmax over the 95 possible next characters
model.add(layers.Dense(95, activation='softmax'))
# The model must be compiled before training; the optimizer here is
# an assumption, the original project may use different settings
model.compile(loss='categorical_crossentropy', optimizer='adam')

Training the model

The idea behind training the model is simple: we train it on input/output pairs. Each input is 60 characters, and the corresponding output is the single character that immediately follows them.

In the data preparation step, we created a clean list of 5-star review texts, 1,214,016 reviews in total. To make training easier, we only train on reviews of 250 characters or fewer, which leaves 418,955 reviews.

Then we shuffle the order of the reviews so that we don't train the model on 100 consecutive reviews of the same restaurant. We read all the reviews as one long text string, then create a Python dictionary (i.e., a hash table) mapping each character to an index between 0 and 94 (95 unique characters in total).
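
A minimal sketch of these steps, assuming final is the deduplicated dataframe from the data preparation section:

# Keep only reviews of 250 characters or fewer
short = final[final['text'].str.len() <= 250]
# Shuffle so reviews of the same restaurant are not adjacent
short = short.sample(frac=1, random_state=42)
# Concatenate everything into one long string for character-level training
text = '\n'.join(short['text'].tolist())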

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)

There are 72,662,807 characters in the corpus, which is too many to process as a whole. So we split it into chunks of 90,000 characters each.

For each chunk, we generate its input/output pairs by sliding a pointer across the text from beginning to end, moving forward one character at a time when the step is set to 1.

import numpy as np

def getDataFromChunk(txtChunk, maxlen=60, step=1):
    sentences = []
    next_chars = []
    # Slide a window of `maxlen` characters across the chunk; the
    # target is the character immediately after each window
    for i in range(0, len(txtChunk) - maxlen, step):
        sentences.append(txtChunk[i : i + maxlen])
        next_chars.append(txtChunk[i + maxlen])
    print('nb sequences:', len(sentences))
    print('Vectorization...')
    # One-hot encode the inputs and targets
    X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
    y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
    for i, sentence in enumerate(sentences):
        for t, char in enumerate(sentence):
            X[i, t, char_indices[char]] = 1
        y[i, char_indices[next_chars[i]]] = 1
    return [X, y]

Training on one chunk takes 219 seconds per epoch on a GPU (GTX 1070), so one pass over the entire corpus takes about 2 days: 72,662,807 / 90,000 × 219 s ≈ 2.0 days.

Keras's two callbacks, ModelCheckpoint and ReduceLROnPlateau, come in handy here. ModelCheckpoint saves the weights every time they improve, and ReduceLROnPlateau automatically reduces the learning rate when the monitored loss stops decreasing. The main benefit is that we no longer need to tune the learning rate by hand; the main drawback is that the learning rate can only ever go down.

from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

# Save the weights every time they improve, so you can let it train;
# also decay the learning rate when the loss plateaus
filepath = "Feb-22-all-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=1, min_lr=0.00001)
callbacks_list = [checkpoint, reduce_lr]

The code for training the model for 20 epochs is as follows:

for iteration in range(1, 21):
    print('Iteration', iteration)
    # Stream the corpus in 90,000-character chunks
    with open("../dataset/short_reviews_shuffle.txt") as f:
        for chunk in iter(lambda: f.read(90000), ""):
            X, y = getDataFromChunk(chunk)
            model.fit(X, y, batch_size=128, epochs=1, callbacks=callbacks_list)

Training to completion would take about a month. For our purposes, though, just two hours of training already produces good results.

Generating a 5-star review

With the pre-trained weights, or weights from a model you trained yourself, we can generate interesting Yelp reviews. Here's the idea: we seed the model with the first 60 characters and let it predict the next character, over and over.

The sampling step adds randomness based on the predicted distribution, which gives the final result some variety.

If the temperature is very small, sampling almost always picks the index with the highest predicted probability.

def sample(preds, temperature=1.0):
    """Sample an index from `preds`, a list of probabilities. If the
    temperature is very small, this almost always picks the index with
    the highest predicted value."""
    preds = np.asarray(preds).astype('float64')
    # Rescale the distribution by the temperature
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample from the rescaled distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
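
The generation loop below assumes that generated_text already holds a 60-character seed and that maxlen and temperature are set. A minimal sketch of picking a random seed from the corpus (our assumption; the project repo may seed differently):

import random
import sys

maxlen = 60
temperature = 0.8  # assumed value; lower means more conservative text
# Pick a random 60-character window from the corpus as the seed
start = random.randint(0, len(text) - maxlen - 1)
generated_text = text[start: start + maxlen]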

To generate 300 characters, the code looks like this:

# We generate 300 characters
for i in range(300):
   sampled = np.zeros((1, maxlen, len(chars)))
   # Turn each char to char index.
   for t, char in enumerate(generated_text):
       sampled[0, t, char_indices[char]] = 1.
   # Predict next char probabilities
   preds = model.predict(sampled, verbose=0)[0]
   # Add some randomness by sampling given probabilities.
   next_index = sample(preds, temperature)
   # Turn char index to char.
   next_char = chars[next_index]
   # Append char to generated text string
   generated_text += next_char
   # Pop the first char in generated text string.
   generated_text = generated_text[1:]
   # Print the new generated char.
   sys.stdout.write(next_char)
   sys.stdout.flush()
print(generated_text)

Conclusion

In this article, we learned how to build and train a character-level text generation model with Keras. The project's source code and the pre-trained model are available on GitHub.

Emmmmmm… with that in mind, you can also check out my other text-processing tutorials and articles:

This comment is toxic! — The general routine of text classification

Text classification using convolutional neural network based on TensorFlow