Natural language processing is one of the core problems in artificial intelligence research. MetaMind, a deep learning company that Salesforce has announced it will acquire, has published an article on its website exploring LSTMs and the bag-of-words model for natural language processing. The article contains interactive illustrations, which interested readers can explore on the original web page. The author, Alexander Rosenberg Johansen, is a research scientist at MetaMind, and a paper on the study will reportedly be posted to arXiv soon.


There is no doubt about the rise of machine learning, deep learning and artificial intelligence more broadly, and it is already having a huge impact on the field of computer science. As you’ve probably heard, deep learning has surpassed humans in many tasks, from image recognition to Go.


The deep learning community currently sees natural language processing (NLP) as the next frontier of research and application.


One advantage of deep learning is that its progress tends to be very generic: techniques that make deep learning effective in one area can often migrate to another without much modification. More specifically, the methods developed for building large-scale, computationally expensive deep learning models for image and speech recognition can also be applied to natural language processing. One example is the recent state-of-the-art translation system, which outperforms all previous systems but requires far more computer power. Such demanding systems can spot very complex patterns that appear only occasionally in real-world data, but this has led many people to apply such large-scale models to every task. Which raises the question:


Are all tasks complex enough to be handled by this model?


Let’s look at the internals of a two-layered MLP trained on bag-of-words embeddings for sentiment analysis:



The inside of a simple deep learning system, called bag-of-words, that categorizes sentences as positive or negative. The diagram is a T-SNE projection of the last hidden layer of a 2-layer MLP on top of a bag of words. Each data point corresponds to a sentence, and the colors correspond to the deep learning system's prediction and the true target. The solid boxes mark sentences with different kinds of semantic content; you can follow them in the interactive chart on the original page.


The solid boxes in the figure above provide some important insights. Real-world data comes in varying levels of difficulty: some sentences are easily classified, while others contain complex semantic structures. For sentences that can be easily classified, a high-capacity system may not be necessary; perhaps a much simpler model could do the same job. This blog post explores whether that is the case, and shows that we can often get the job done with simpler models.


Deep learning on text


Most deep learning methods require floating point numbers as input, and if you’ve never worked with text, you might be wondering:


How do I use a piece of text for deep learning?


For text, the core problem is how to represent an arbitrary amount of information given material of arbitrary length. One popular method is to tokenize the text into words, sub-words, or even characters. Each word can then be converted to a floating-point vector using well-researched methods such as word2vec or GloVe. The implicit relations between different words give this approach a meaningful representation of each word.



You can find interesting relationships between words by taking a word, converting it to a high-dimensional embedding (say, 300 dimensions), and then using PCA or T-SNE (popular dimensionality-reduction tools, in this case down to 2 dimensions). For example, in the figure above you can see that the distance between uncle and aunt is about the same as the distance between man and woman (Mikolov et al., 2013).


Using the tokenization and word2vec methods, we can convert a piece of text into a sequence of floating point representations of words.
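As a rough illustration of this pipeline (not the authors' code), here is a minimal sketch in Python that tokenizes a sentence and looks up each token in an embedding table; the tiny random table stands in for pretrained word2vec or GloVe vectors:

```python
import numpy as np

# Toy stand-in for a pretrained word2vec/GloVe table: token -> 300-dim vector.
# In practice these vectors would be loaded from a pretrained embedding file.
vocab = ["the", "movie", "was", "great", "<unk>"]
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=300) for w in vocab}

def embed_sentence(sentence):
    """Tokenize on whitespace and map each token to its embedding vector."""
    tokens = sentence.lower().split()
    vectors = [embeddings.get(t, embeddings["<unk>"]) for t in tokens]
    return np.stack(vectors)  # shape: (num_tokens, 300)

print(embed_sentence("The movie was great").shape)  # (4, 300)
```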


Now, what do we do with a sequence of word representations?


Bag-of-words


Now let's talk about the bag-of-words model, which is probably the simplest machine learning algorithm out there!



Take some word representations (the gray boxes at the bottom) and sum or average them into a common representation (the blue box). This common representation contains a bit of information from every word. Here, this shared representation is used to predict whether the sentence is positive or negative (the red box).


We simply take the mean of the word embeddings along each feature dimension. It turns out that simply averaging word embeddings (even though this completely ignores sentence order) is sufficient to achieve good results in many simple real-world cases, and it also provides a powerful baseline when combined with deep neural networks (explained later).


In addition, averaging is very cheap, and it reduces a sentence to a fixed-size vector regardless of its length.
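A minimal sketch of that averaging step, continuing the NumPy representation above (the 300-dimensional embedding size is an assumption):

```python
import numpy as np

def bag_of_words(word_vectors):
    """Average the word embeddings over the token axis to get one fixed-size sentence vector."""
    # word_vectors: (num_tokens, embedding_dim)
    return word_vectors.mean(axis=0)  # (embedding_dim,)

sentence = np.random.randn(7, 300)   # 7 tokens, each a 300-dim embedding
print(bag_of_words(sentence).shape)  # (300,)
```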


Recurrent neural network


Some sentences require high accuracy or depend on sentence structure, and a bag of words may not cut it for them. Instead, you might consider using the amazing recurrent neural network.



At each time step (left to right), an input (such as a word) is fed into the RNN (gray box) and combined with the previous internal memory (blue box). The RNN then performs some computation to produce a new internal memory (the blue box) that represents all previously seen units (for example, all the previous words). The RNN should now contain sentence-level information, allowing it to better predict whether the sentence is positive or negative (red box).


Each word embedding is fed sequentially into a recurrent neural network, which can store previously seen information and combine it with the new word. When driven by a well-known memory cell such as the long short-term memory (LSTM) or the gated recurrent unit (GRU), an RNN is able to remember what happens in a sentence containing many words! (Because of the success of the LSTM, an RNN with LSTM memory cells is often simply called an LSTM.) The largest models of this kind stack this structure eight times.
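A sketch in PyTorch (assumed here, since the post links to a PyTorch data loader) of running a sequence of word embeddings through an LSTM; the sizes are illustrative, not the authors' settings:

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 300, 256
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)

# One sentence of 12 tokens, each represented by a 300-dim embedding (batch size 1).
sentence = torch.randn(1, 12, embedding_dim)

outputs, (h_n, c_n) = lstm(sentence)  # outputs: the hidden state at every time step
print(outputs.shape)                  # torch.Size([1, 12, 256])
print(h_n.shape)                      # torch.Size([1, 1, 256]) -- final memory of the whole sentence
```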




Both diagrams show recurrent neural networks with LSTM units. They also employ tricks such as skip connections between LSTM layers and a method called attention. Also note that the green LSTM points in the opposite direction; combined with a normal LSTM, this is called a bidirectional LSTM, because it gathers information from both directions of the data sequence. For more information, see Stephen Merity's blog posts, as well as Synced's in-depth analysis of the neural network architecture behind Google's machine-translation breakthrough (Source: Wu et al., 2016).


However, an LSTM is much more computationally expensive than a simple bag-of-words model, and it requires experienced deep learning engineers to implement and support it on high-performance computing hardware.


Example: Sentiment analysis


Sentiment analysis is a document categorization task that quantifies the polarity of subjective text. Given a sentence, the model assesses whether its sentiment is positive, negative, or neutral.


Want to spot angry customers on Twitter before it gets serious? Well, sentiment analysis might be just what you want!


An excellent data set for this purpose (which we'll use next) is the Stanford Sentiment Treebank (SST):

https://nlp.stanford.edu/sentiment/treebank.html

We have open-sourced a PyTorch data loader:

https://github.com/pytorch/text

SST not only categorizes sentences (positive and negative) but also provides grammatical sub-phrases for each sentence. However, our system does not use any of the tree information.


The original SST consists of five categories: very positive, positive, neutral, negative, and very negative. We use the simpler binary classification task, in which very positive and positive are combined, very negative and negative are combined, and neutral sentences are dropped.

We provide only a brief, technical description of our model architectures. The point is not exactly how they were built, but that the computationally cheap model reaches 82% validation accuracy and takes 10 ms for a batch of 64, while the computationally expensive LSTM architecture reaches 88% accuracy but takes 87 ms for the same batch (the best models are around 88-90% accurate).



The green boxes below represent word embeddings, initialized with GloVe, followed by averaging over the words (bag-of-words) and a 2-layer MLP with dropout.
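A minimal PyTorch sketch of such a BoW architecture as described above (GloVe-initialized embeddings, averaging, a 2-layer MLP with dropout); the hidden size and dropout rate are assumptions:

```python
import torch
import torch.nn as nn

class BoWClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300, hidden_dim=256, num_classes=2):
        super().__init__()
        # In practice the embedding weights would be initialized from GloVe vectors.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        averaged = self.embedding(token_ids).mean(dim=1)  # bag-of-words average
        return self.mlp(averaged)                         # class logits

logits = BoWClassifier(vocab_size=20000)(torch.randint(0, 20000, (64, 30)))
print(logits.shape)  # torch.Size([64, 2])
```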



The turquoise boxes below represent word embeddings, initialized with GloVe; no gradients are backpropagated through the embeddings. We use a bidirectional RNN with LSTM cells and, similarly to the bag of words, take the mean and the maximum of its hidden states, followed by a 2-layer MLP with dropout.
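And a corresponding sketch of the LSTM model described above (frozen GloVe embeddings, a bidirectional LSTM, mean- and max-pooled hidden states, a 2-layer MLP with dropout); the layer sizes are again assumptions:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight.requires_grad = False  # no gradients through the embeddings
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):
        states, _ = self.lstm(self.embedding(token_ids))  # (batch, seq_len, 2 * hidden_dim)
        # Pool the hidden states by taking their mean and their maximum, then concatenate.
        pooled = torch.cat([states.mean(dim=1), states.max(dim=1).values], dim=-1)
        return self.mlp(pooled)

logits = BiLSTMClassifier(vocab_size=20000)(torch.randint(0, 20000, (64, 30)))
print(logits.shape)  # torch.Size([64, 2])
```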


Skim Reader with low computational cost


On some tasks an algorithm can reach near-human accuracy, but achieving that may require a very large server budget. As we have seen, it is not always necessary to use an LSTM on real-world data; a low-cost bag of words (BoW) may be good enough.


Of course, an order-agnostic bag of words (BoW) will misclassify a large number of the harder sentences. Switching entirely to an inferior BoW would drag down overall performance and make our system look less convincing. So the question becomes:


Can we learn to distinguish between “easy” and “difficult” sentences?


And to save time, can we do this with a low-cost model?


Exploring the internals


A popular way to explore deep learning models is to understand how each sentence is represented in a hidden layer. Because hidden layers are often high-dimensional, we can use an algorithm such as T-SNE to reduce them to two dimensions, so that they can be plotted for human inspection.
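A sketch of that workflow with scikit-learn's TSNE, using random vectors as a stand-in for the hidden-layer activations:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for last-hidden-layer activations: 1000 sentences, 256-dim hidden states.
hidden_states = np.random.randn(1000, 256)

# Reduce to 2 dimensions so every sentence becomes a point that can be plotted.
points_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(hidden_states)
print(points_2d.shape)  # (1000, 2)
```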



The two images above are screenshots of interactive illustrations from the original text. In the original interactive diagram you can move the cursor, zoom, and hover over data points to view information about them. The image shows the last hidden layer of the bag-of-words (BoW) model. When you hover over any data point, you can see the sentence that the data point represents. The color of a sentence depends on its label.


Predictions tab: compares the model's predictions with the actual labels. The center of a data point represents its prediction (blue is positive, red is negative), and the line around it represents the actual label. This lets us see when the system is right and when it is wrong.


Probabilities tab: plots the probability assigned by the output layer to the predicted class, which indicates how confident the model is in its prediction. When you hover over a data point, you also see the probability for that data point, colored according to the model's prediction. Note that because the task is binary, probabilities start at 0.5; the minimum confidence is a 50/50 split.


T-SNE plots are prone to a lot of over-interpretation, but a few trends may stand out.


T-SNE interpretation


  • Sentences fall into clusters that correspond to different semantic types.

  • Some clusters have simple forms with high confidence and accuracy.

  • Other clusters are more diffuse, with lower accuracy and confidence.

  • Sentences with positive and negative components are difficult.


Now let’s look at a similar diagram on LSTM:



The two images above are screenshots of interactive illustrations from the original text. In the original interactive diagram you can move the cursor, zoom, and hover over data points to view information about them. It is set up like the bag-of-words diagram above; go explore the insides of the LSTM!


We can assume that many of these observations hold for the LSTM as well. However, the LSTM has relatively few samples with low confidence, and sentences containing both positive and negative components are less challenging for the LSTM than for the bag of words.


It appears that the bag of words clusters sentences, and that its probabilities can be used to identify whether the sentences in a cluster are likely to be predicted correctly. From these observations, a reasonable hypothesis is:


More confident answers are more likely to be correct.


To investigate this hypothesis, we can look at probability thresholds.


Probability threshold


The bag of words and the LSTM are trained to provide probabilities for each class as a measure of certainty. What does that mean? If the bag of words returns a 1, it is completely confident in its prediction. We usually predict with the class that our model assigns the highest probability. In this binary classification case (positive or negative), that probability must be greater than 0.5 (otherwise we would predict the opposite class). But a low probability for the predicted class may indicate that the model is uncertain. For example, if a model predicts a positive probability of 0.51 and a negative probability of 0.49, it is much less convincing to say that the sentence is positive. When we say "threshold," we mean comparing the predicted probability to a value and deciding whether or not to use the prediction. For example, we could decide to use all sentences with a probability greater than 0.7. Or we could look at how predictions in the 0.5-0.55 confidence interval behave, which is precisely what the figure below investigates.
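In code, this thresholding step might look like the following hedged sketch, where the inputs are assumed to be softmax probabilities from the BoW classifier:

```python
import numpy as np

def above_threshold(probabilities, threshold=0.7):
    """Return a boolean mask: True where the winning class probability reaches the threshold."""
    # probabilities: (num_sentences, 2) softmax outputs; confidence = max over the two classes.
    confidence = probabilities.max(axis=1)
    return confidence >= threshold

probs = np.array([[0.51, 0.49], [0.95, 0.05], [0.30, 0.70]])
print(above_threshold(probs))  # [False  True  True]
```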



In the threshold plot, the height of a bar corresponds to the accuracy of the data points lying between two thresholds, and the line shows the accuracy over all data points above a given threshold. In the data-amount plot, the height of a bar corresponds to the amount of data residing between two thresholds, and the line is the cumulative amount of data above each threshold bin.


From the bag-of-words plot you can see that raising the probability threshold improves performance. This is not evident in the LSTM plot, which seems natural, since the LSTM overfits the training set and only gives high-confidence answers.


Using BoW on easy samples and the LSTM on difficult ones


Thus, simply using the output probability can tell us when a sentence is easy and when it needs guidance from a more powerful system, such as the LSTM.


We create a "probability strategy": we put a threshold on the probability of the bag-of-words system and use the LSTM on all data points that do not reach the threshold. This gives us a certain amount of data handled by the bag of words alone (the sentences above the threshold) and, for each data point, a choice of BoW (above the threshold) or LSTM (below the threshold), from which we can compute an accuracy and a computational cost. We thereby obtain a ratio between BoW and LSTM ranging from 0.0 (LSTM only) to 1.0 (BoW only) and can calculate the accuracy and the computation time for each ratio.
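A hedged sketch of this probability strategy, assuming we already have the BoW softmax outputs and the LSTM predictions as arrays:

```python
import numpy as np

def probability_strategy(bow_probs, lstm_preds, threshold):
    """Use the BoW prediction where it is confident, otherwise fall back to the LSTM prediction.

    bow_probs:  (n, 2) softmax outputs of the bag-of-words model (always computed).
    lstm_preds: (n,)   LSTM class predictions (in practice only needed below the threshold).
    """
    bow_preds = bow_probs.argmax(axis=1)
    confident = bow_probs.max(axis=1) >= threshold
    predictions = np.where(confident, bow_preds, lstm_preds)
    return predictions, confident.mean()  # final predictions, fraction handled by BoW alone

preds, bow_fraction = probability_strategy(
    np.array([[0.9, 0.1], [0.55, 0.45]]), np.array([1, 1]), threshold=0.7)
print(preds, bow_fraction)  # [0 1] 0.5
```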


Baseline


To build the baseline, we consider the ratio between the two models. For example, using BoW on 0.1 of the data gives an accuracy of 0.9 times the LSTM accuracy plus 0.1 times the BoW accuracy. The aim is to obtain a baseline with no guiding strategy, in which the choice between BoW and LSTM for a given sentence is random. However, a strategy has a cost: we must first run every sentence through the BoW model to decide whether to use BoW or the LSTM. If no sentence reached the probability threshold, we would be running an extra model for no good reason. To account for this, we calculate the strategy's cost as a function of the ratio in the following way.
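The formula itself did not survive the transfer to this page; a plausible reconstruction, assuming the BoW model must run on every sentence while the LSTM runs only on the fraction that falls below the threshold, is:

```latex
C_{\text{strategy}} = C_{\text{BoW}} + (1 - p)\, C_{\text{LSTM}}
```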


where C represents cost and p represents the proportion of data handled by BoW alone.



The figure above shows the results on the validation set, comparing the accuracy and speed of different combinations of BoW and LSTM (red line) and the probability-threshold strategy (blue line). The leftmost data point corresponds to using the LSTM only, the rightmost to using BoW only, and the points in between to combinations of the two. The red line represents combining BoW and LSTM with no guiding strategy, while the blue line uses the BoW probability to decide which system handles each sentence. Note that the largest time savings, of over 90%, come from using BoW only. Interestingly, we find that thresholding the BoW probability is significantly better than having no guiding strategy at all.


We then measure the average of the curve, which we call the Speed Under the Curve (SUC), as shown in the table below.



The table shows the results of the discrete choice between BoW and LSTM on the validation set. Each model is evaluated ten times with different seeds. The values in the table are mean SUCs, and the probability strategy is compared against the ratio baseline.
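The exact definition of SUC is not given here; as a sketch, one simple way to take the mean of an accuracy-versus-speed curve is the trapezoidal average below (the sample values are made up):

```python
import numpy as np

def speed_under_curve(speeds, accuracies):
    """Mean accuracy over the speed axis, computed with the trapezoidal rule."""
    order = np.argsort(speeds)
    s, a = np.asarray(speeds)[order], np.asarray(accuracies)[order]
    area = np.sum((a[1:] + a[:-1]) / 2 * np.diff(s))
    return area / (s[-1] - s[0])

print(speed_under_curve([1.0, 5.0, 9.0], [0.88, 0.86, 0.82]))  # ~0.855
```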


Learning when to skim and when to read


Knowing when to switch between two specific models is not enough; we want to build a more general system that learns when to switch between all kinds of models. Such a system would also help us handle more complex behavior.


Can we learn, using supervised learning, when reading is better than skimming?


"Reading" means the LSTM goes through the sentence from left to right, storing a memory at each step, while "skimming" uses the BoW model. When we act on the probabilities from the bag-of-words model, our decision rests on an implicit assumption: that the more powerful LSTM does a better job whenever the bag-of-words system is in doubt. But is that always the case?




The "confusion matrix" of when the bag of words and the LSTM are right or wrong about a sentence, in the spirit of the confusion T-SNE plots of the bag of words and the LSTM shown earlier.


In fact, this turns out to be true for only 12 percent of the sentences, while for 6 percent of the sentences both the bag of words and the LSTM are wrong. In that case there is no reason to run the LSTM at all, and we might as well save time by using only the bag of words.


Learning to skim: the setup


So we should not simply always use the LSTM when the BoW is in doubt. Can we make the bag-of-words model understand when the LSTM is also wrong, so that we can preserve those precious computational resources?


Let's look at the T-SNE plot again, now colored by the confusion matrix between BoW and the LSTM. We hope to find a relationship between the elements of the confusion matrix, especially the cases where BoW is wrong.



From this comparison we find that it is easy to tell when the BoW is correct and when it is in doubt. However, there is no clear relationship telling us when the LSTM is right or wrong.


Can we learn about this relationship?


Moreover, the probability strategy is quite restrictive, since it relies on an inherently binary decision and requires probabilities. Instead, we propose a trainable decision network built on a neural network. Looking at the confusion matrix, we can use that information to generate labels for a supervised decision network, so that we use the LSTM only when the LSTM is correct and the BoW is wrong.


To generate the data set, we need a set of sentences together with the true, underlying predictions of the bag of words and the LSTM. However, during training the LSTM often reaches more than 99% training accuracy, clearly overfitting the training set. To avoid this, we split the training set into a model training set (80% of the training data) and a decision training set (the remaining 20%), which the models have never seen. We later fine-tune the models on that remaining 20% and hope the decision network will generalize to this new, unseen but closely related data, leaving the system slightly better off.
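A hedged sketch of how the decision labels could be generated on the held-out decision set, following the rule stated above (use the LSTM only where the LSTM is right and the BoW is wrong); the array names are assumptions:

```python
import numpy as np

def decision_labels(bow_preds, lstm_preds, targets):
    """Label 1 ('read' with the LSTM) only where the LSTM is correct and the BoW is not;
    label 0 ('skim' with the BoW) everywhere else."""
    bow_correct = bow_preds == targets
    lstm_correct = lstm_preds == targets
    return (lstm_correct & ~bow_correct).astype(np.int64)

labels = decision_labels(np.array([0, 1, 0]), np.array([0, 0, 1]), np.array([0, 0, 1]))
print(labels)  # [0 1 1]
```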



Both the bag of words and the LSTM are initially trained on the "model train" set (80% of the training data); these models are then used to generate labels for the decision network, after which training on the complete training data is carried out. The validation set is used throughout this process.


To build our decision network, we tap the last hidden layer of our low-cost bag-of-words system (the same layer used to generate the T-SNE plots) and stack a two-layer MLP on top of the bag of words trained on the model training set. We found that without this recipe the decision network cannot pick up on the BoW model's tendencies and does not generalize well.



The long bars at the bottom represent the layers of the bag-of-words system, without dropout. A two-layer MLP is added on top, and its output class determines whether to choose the bag of words or the superior LSTM.
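A minimal sketch of such a decision network in PyTorch: a two-layer MLP stacked on the (frozen) last hidden layer of the BoW model, outputting a binary "skim or read" choice; the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class DecisionNetwork(nn.Module):
    def __init__(self, bow_hidden_dim=256, hidden_dim=128):
        super().__init__()
        # Input: the last hidden layer of the trained bag-of-words model (kept frozen).
        self.mlp = nn.Sequential(
            nn.Linear(bow_hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # class 0: use the BoW, class 1: use the LSTM
        )

    def forward(self, bow_hidden_states):  # (batch, bow_hidden_dim)
        return self.mlp(bow_hidden_states)

choice_logits = DecisionNetwork()(torch.randn(64, 256))
print(choice_logits.shape)  # torch.Size([64, 2])
```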


The classes chosen by the decision network on the validation set (based on the models trained on the model training set) are then applied to the closely related models trained on the complete training set. Why apply them to the models trained on the complete training set? Because the models trained only on the model training set are generally weaker and would give lower accuracy. The decision network is trained with early stopping, based on maximizing SUC on the validation set.


How does the decision network perform?


Let’s start by looking at the prediction of the decision network.



The data points are the same as in the T-SNE plot of the bag-of-words model. Green points represent sentences assigned to the bag-of-words prediction, and yellow points represent those assigned to the LSTM.


Notice how closely this resembles the probability cutoff of the bag of words. Let's see whether the T-SNE of the decision network's last hidden layer actually captures any information about when the LSTM is right or wrong.


Inside the decision network


Now let's look at the last hidden layer of the decision network.


The data points are based on sentence representations from the final hidden state of the decision network, computed on validation sentences. The colors are the same as in the previous comparison plot.


It seems that the decision network can pick up the clusters from the hidden state of the bag of words. However, it does not seem to understand when the LSTM might be wrong (it does not separate the yellow clusters from the red ones).



The purple curve shows the newly introduced decision network on the validation set. Notice how it finds a solution close to, but slightly different from, the probability threshold. Judging from the accuracy-versus-time curve, the advantage of the decision network is not obvious.



BoW and LSTM performance on the test and validation sets. SUC is based on the average of the accuracy-versus-speed plot. Each model is evaluated ten times with different seeds. The values in the table are mean SUCs, and the standard deviations are based on the differences from the ratio baseline.


From the prediction plots, data amounts, accuracy, and SUC scores, we can infer that the decision network is good at knowing when the BoW is right and when it is not. Moreover, it lets us build a more general system that mines the hidden states of deep learning models. However, it also shows how difficult it is for a decision network to understand the behavior of a system it has no access to, such as the more complex LSTM.


Discussion


We now understand the true power of the LSTM, which can reach near-human performance on text, but much real-world data does not require that power. We can train a bag-of-words model to handle the simple sentences, which saves a great deal of computation while the performance loss of the whole system stays minimal (depending on how high the bag-of-words threshold is set).


This method is related to model averaging, which is usually applied when several models of similar confidence are used. However, because the bag of words has an adjustable confidence threshold and the LSTM does not have to run on every sentence, we can decide how much weight to give computation time versus accuracy and tune the parameters accordingly. We believe this approach will be very helpful for deep learning developers looking to save computational resources without sacrificing performance.