By Ruchen Zhang and Zhibo Zhang, Shanbay Algorithm Team

Background

Shanbay, a mobile English learning platform with more than 80 million users, has been exploring how to use data to deliver more precise personalized education. A faster and more scientific assessment of a user's vocabulary level not only improves learning efficiency, but also helps us develop more personalized learning content for each user.

 

Built with TensorFlow, our deep knowledge tracing system predicts in real time the probability that a user will answer each word in the word list correctly (as shown in Figure 1). This article describes how Shanbay implemented the deep knowledge tracing model and applied it to vocabulary level assessment for English learners.

Figure 1: Real-time prediction of correct answer rate

Model introduction

Our accumulated records from previous online vocabulary tests amount to tens of millions of answer sequences, which provides a solid foundation for deep learning models. For the model, we selected the Deep Knowledge Tracing (DKT) model [1] published by Chris Piech et al. of Stanford University at NIPS 2015, which was validated on Khan Academy data and outperforms the traditional BKT model. As Table 1 shows, compared with the Khan Academy data, the Shanbay vocabulary test data has more distinct questions (words), more users, and longer sequences. These differences were also the biggest challenges we faced while tuning the model.

Table 1: Khan Math and Shanbay Vocab data comparison

 

The baseline model is a single-layer LSTM, as shown in Figure 2. The input x_t is an embedding of the user's current interaction (the word and whether it was answered correctly); either one-hot encodings or compressed representations can be used. The output y_t is the model's predicted probability that the user would answer each word in the vocabulary correctly. A minimal sketch follows Figure 2.

Figure 2: DKT model structure
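For concreteness, here is a minimal Keras sketch of such a baseline, assuming a compressed input where the (word, correct) pair is encoded as a single index; the vocabulary size and layer dimensions are placeholders, not the production values:

```python
import tensorflow as tf

NUM_WORDS = 10000   # vocabulary size (placeholder)
EMBED_DIM = 128     # compressed input representation instead of one-hot
HIDDEN_DIM = 200    # LSTM hidden size (placeholder)

# x_t encodes (word, correct) as a single index: word_id + NUM_WORDS * correct,
# so the embedding table has 2 * NUM_WORDS rows. Padding handling is omitted.
inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(2 * NUM_WORDS, EMBED_DIM)(inputs)
h = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(x)
# y_t: per-step predicted probability of a correct answer for every word.
y = tf.keras.layers.Dense(NUM_WORDS, activation="sigmoid")(h)

model = tf.keras.Model(inputs, y)
# Note: the actual DKT loss is binary cross-entropy evaluated only on the
# word attempted at the next step; that masking is omitted in this sketch.
model.compile(optimizer="adam", loss="binary_crossentropy")
```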

Model improvements

Our baseline, implemented following the ideas of the original paper, reproduces the paper's results well on Khan Academy data. For our actual application scenario, we implemented the corresponding model in TensorFlow and made improvements in the following areas to boost performance.

Data preprocessing

We observed the following problems in the raw data:

  • The proportion of abnormal user data was too high

  • Some users' test sequences were too short and provided insufficient information

  • A small number of words occurred at very low frequency

After data cleaning, model accuracy improved by about 1.3%.
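A simplified sketch of this kind of cleaning, with made-up thresholds (the actual criteria for detecting abnormal users are not given in the post, so that step is left out):

```python
from collections import Counter

MIN_SEQ_LEN = 10     # placeholder threshold for "too short" sequences
MIN_WORD_FREQ = 50   # placeholder threshold for "very low frequency" words

def clean(sequences):
    """sequences: list of per-user lists of (word_id, correct) records."""
    # Drop test sequences that are too short to provide enough information.
    sequences = [s for s in sequences if len(s) >= MIN_SEQ_LEN]

    # Drop records of words that occur too rarely across the whole corpus.
    freq = Counter(word for s in sequences for word, _ in s)
    sequences = [[(w, c) for w, c in s if freq[w] >= MIN_WORD_FREQ]
                 for s in sequences]

    # Re-apply the length filter after removing rare words.
    return [s for s in sequences if len(s) >= MIN_SEQ_LEN]
```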

Introducing external features

The original DKT model's input is only the current question and whether the user answered it correctly. In practice, other information about the user's answering process can also be fed to the model as features. Some typical features are listed below:

  • Time – the amount of time the user took to answer when first encountering the word

  • Attempt Count – the number of times the user has encountered the word

  • First Action – whether the user's first action was to answer directly or to ask for help

  • Word Level – the word's prior difficulty level

These features can be used in several ways: encoded by an autoencoder, concatenated as a feature vector with the input embeddings, or concatenated directly with the LSTM's hidden-state output before prediction. Using these features further improved model accuracy by about 2.1%. We also ran comparative experiments on the impact of individual features and found that Time and Attempt Count were the two most important feature dimensions, while the others had limited impact. A sketch of the input-side concatenation follows Figure 3.

Figure 3: DKT model with external features
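Among the fusion options above, input-side concatenation is the simplest to illustrate. A minimal sketch, with placeholder dimensions and a generic per-step feature tensor standing in for the four features listed:

```python
import tensorflow as tf

NUM_WORDS, EMBED_DIM, HIDDEN_DIM = 10000, 128, 200  # placeholders
NUM_FEATURES = 4  # Time, Attempt Count, First Action, Word Level

word_correct = tf.keras.Input(shape=(None,), dtype="int32")
features = tf.keras.Input(shape=(None, NUM_FEATURES), dtype="float32")

emb = tf.keras.layers.Embedding(2 * NUM_WORDS, EMBED_DIM)(word_correct)
# Input-side fusion: concatenate the feature vector with the embedding.
x = tf.keras.layers.Concatenate()([emb, features])
h = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(x)
# The same features could instead be concatenated with h at this point
# (hidden-state-side fusion) before the output layer.
y = tf.keras.layers.Dense(NUM_WORDS, activation="sigmoid")(h)

model = tf.keras.Model([word_correct, features], y)
```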

Long-sequence dependencies

The traditional LSTM uses gating functions, which effectively alleviate the vanishing-gradient problem, but vanishing gradients remain unavoidable on extremely long sequences. In addition, because of the tanh activation, gradients still vanish between layers in multi-layer LSTMs, which is why most stacked LSTMs use 2-3 layers and rarely more than 4. To handle the long-range dependencies of the very long sequences in our data, we chose the Independently Recurrent Neural Network (IndRNN) model published at CVPR 2018 by Shuai Li et al. [2]. IndRNN decouples the neurons within a layer so they are independent of each other, and uses the ReLU activation function, effectively solving vanishing and exploding gradients both within and across layers; this greatly increases the number of layers and the sequence lengths a model can learn. As shown in Figure 4, on the adding problem (a standard task for evaluating RNNs), once the sequence length reaches 1,000 an LSTM can no longer reduce the mean squared error, while IndRNN still converges rapidly to a very small error.

Figure 4: Convergence of various RNNs on long sequences for the adding problem

The introduction of IndRNN effectively solved the long-range dependency problem of the very long sequences in our data, and further improved model accuracy by 1.2%.
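The IndRNN recurrence itself is simple: h_t = ReLU(W x_t + u * h_{t-1} + b), where the recurrent weight u is a vector applied element-wise rather than a matrix. A minimal custom Keras cell, omitting the recurrent-weight clipping the paper uses to bound gradients:

```python
import tensorflow as tf

class IndRNNCell(tf.keras.layers.Layer):
    """Minimal IndRNN cell: h_t = relu(x_t @ W + u * h_{t-1} + b).

    The recurrent weight u is a vector, so each hidden unit only sees
    its own previous state: neurons within a layer are independent.
    """

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.state_size = units  # required by tf.keras.layers.RNN

    def build(self, input_shape):
        self.w = self.add_weight(name="w", shape=(input_shape[-1], self.units),
                                 initializer="glorot_uniform")
        self.u = self.add_weight(name="u", shape=(self.units,),
                                 initializer="random_uniform")
        self.b = self.add_weight(name="b", shape=(self.units,),
                                 initializer="zeros")

    def call(self, inputs, states):
        h = tf.nn.relu(tf.matmul(inputs, self.w) + self.u * states[0] + self.b)
        return h, [h]

# Deep stacks are built by wrapping several cells with tf.keras.layers.RNN:
layer = tf.keras.layers.RNN([IndRNNCell(200), IndRNNCell(200)],
                            return_sequences=True)
```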

Hyperparameter tuning

Starting from a model that already performed well with manual tuning, we hoped to further optimize it through automatic hyperparameter search. The tunable parameters include:

  • RNN cell type – LSTM, GRU, IndRNN

  • Number of RNN layers and how they are connected

  • Learning rate and decay steps

  • Input and RNN dimensions

  • Dropout rate

 

Among automatic hyperparameter tuning algorithms, grid search, random search, and Bayesian optimization [3] are the mainstream options. Grid search suffers from the curse of dimensionality, and random search cannot use prior knowledge to choose the next set of hyperparameters; only Bayesian optimization is "likely" to outperform a human modeler at tuning, because it uses prior knowledge to adjust hyperparameters efficiently. Bayesian optimization is particularly powerful when the objective function is unknown and expensive to evaluate. Its basic idea is to use Bayes' theorem to estimate the posterior distribution of the objective function from the sampled data, and then choose the next hyperparameter combination to sample according to that distribution.

Figure 5: Bayesian optimization of one-dimensional black box functions

In Figure 5, the red line is the true black-box function, and the green band is the confidence interval at each position computed from the sampled points. The question is where to sample next: picking a point with a large mean is called exploitation, and picking one with a large variance is called exploration. A point with a larger mean is more likely to yield a better solution, while a point with a larger variance is more likely to help locate the global optimum, so the right ratio of exploitation to exploration depends on the scenario. The function that weighs the two is called the acquisition function; common choices include Upper Confidence Bound, Expected Improvement, and Entropy Search. Given an acquisition function, the hyperparameter combination at its maximum is taken as the next candidate recommended by the Bayesian optimization algorithm. This recommendation is based on the joint probability distribution over the hyperparameters and balances exploitation against exploration.
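The post does not say which tool was used; purely as an illustration, a library such as scikit-optimize wraps this loop, with a search space mirroring the list above (train_and_evaluate is a hypothetical user-supplied training function):

```python
from skopt import gp_minimize
from skopt.space import Categorical, Integer, Real

# Search space mirroring the tunable parameters listed above (placeholder ranges).
space = [
    Categorical(["lstm", "gru", "indrnn"], name="cell_type"),
    Integer(1, 6, name="num_layers"),
    Real(1e-4, 1e-2, prior="log-uniform", name="learning_rate"),
    Integer(64, 512, name="rnn_dim"),
    Real(0.0, 0.5, name="dropout"),
]

def objective(params):
    # train_and_evaluate is a hypothetical function that trains the DKT
    # model with these hyperparameters and returns validation accuracy.
    return -train_and_evaluate(params)  # minimize the negative accuracy

# Expected Improvement as the acquisition function; 50 trials.
result = gp_minimize(objective, space, n_calls=50, acq_func="EI")
print("best hyperparameters:", result.x, "best accuracy:", -result.fun)
```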

After using Bayesian optimization, the accuracy of the model was further improved by 1.7%.

Model deployment

We use TensorFlow Serving to deploy the model online. Before launch, we applied model compression techniques to reduce the model size, and found the optimal batching configuration by following the TensorFlow Serving Batching Guide [4].

Model compression

There are many approaches to model compression, such as parameter sharing and pruning, parameter quantization, and low-rank decomposition. For simplicity, we borrowed the projection-layer idea from LSTMP [5]: we decompose the embedding matrix of the final output layer and add a projection layer. The rationale is that the model's final output vocabulary has a large dimension, so most of the model's parameters are concentrated in the output layer. After decomposition, the model shrank to half its original size with no loss of accuracy.
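The decomposition amounts to replacing one large output matrix with the product of two smaller ones. A sketch with placeholder dimensions:

```python
import tensorflow as tf

HIDDEN_DIM, PROJ_DIM, NUM_WORDS = 200, 50, 10000  # placeholder sizes

h = tf.keras.Input(shape=(None, HIDDEN_DIM))  # LSTM hidden-state sequence

# Before: Dense(NUM_WORDS) on h costs HIDDEN_DIM * NUM_WORDS = 2,000,000
# weights. With a projection layer in between, the cost becomes
# HIDDEN_DIM * PROJ_DIM + PROJ_DIM * NUM_WORDS = 510,000 weights.
p = tf.keras.layers.Dense(PROJ_DIM, use_bias=False)(h)        # projection
y = tf.keras.layers.Dense(NUM_WORDS, activation="sigmoid")(p)  # output

head = tf.keras.Model(h, y)
```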

 

In addition, the DKT model's hidden state differs for each user, and long-term learning requires saving this vector per user as a user embedding. If the vector dimension is large, storage pressure becomes severe given the huge number of potential users, so we tried to reduce it. The original plan was to use LSTMP, but experiments showed that simply shrinking the hidden dimension barely hurt accuracy: reducing it to one fifth of the baseline had little negative impact, which exceeded our expectations.

TensorFlow Serving batching tuning

According to the official performance tuning guide, for an online prediction system num_batch_threads should be set to the number of CPU cores, max_batch_size to a large value, and batch_timeout_micros initially to 0; batch_timeout_micros is then tuned in the range of 1 to 10 milliseconds to find the optimal configuration. Our tests showed that with the same computing resources, the tuned batching config delivered 2 to 2.5 times the concurrency of not using it.
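For reference, these settings live in a text-format batching parameters file passed to the server via --batching_parameters_file (values below are illustrative starting points, to be tuned as described):

```
# batching.config -- text-format BatchingParameters (illustrative values)
num_batch_threads { value: 8 }          # roughly the number of CPU cores
max_batch_size { value: 128 }           # set to a large value
batch_timeout_micros { value: 2000 }    # tune within the 1-10 ms range
max_enqueued_batches { value: 1000000 }
```

The server then loads it with tensorflow_model_server --enable_batching=true --batching_parameters_file=batching.config.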

Summary and Outlook

This article has introduced an application of TensorFlow in computer-aided language learning, using the vocabulary level assessment scenario as an example. By reproducing, improving, and optimizing the results of a series of papers, we successfully launched the DKT model, providing a more scientific vocabulary testing scheme for tens of millions of users.

In the future, we will continue to explore how to apply the DKT model more deeply to Shanbay's word-learning scenarios. We also plan to extend from word questions to more general exercises, carrying out knowledge tracing in wider domains and from more perspectives, to help users learn English more efficiently. Using AI to empower education is Shanbay's constant pursuit.

References

[1] Piech, C. et al. Deep Knowledge Tracing. In Advances in Neural Information Processing Systems, 505–513 (2015).

[2] Li, S., Li, W., Cook, C., Zhu, C. & Gao, Y. Independently Recurrent Neural Network (IndRNN): Building a Longer and Deeper RNN. In CVPR (2018).

[3] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE 104, 148–175 (2016).

[4] TensorFlow Serving Batching Guide (2018). Available at: https://github.com/tensorflow/serving/tree/master/tensorflow_serving/batching.

[5] Sak, H., Senior, A. & Beaufays, F. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. In Fifteenth Annual Conference of the International Speech Communication Association (2014).