My understanding of sequence models

This course focuses on sequence models, which deal with ordered data of indefinite length that unfolds over time: each element has a position (or timestamp) that matters to the meaning of the whole. Just think: if you shuffle the order of a paragraph of text, is it still the same text? To model sequences well, the dependencies between earlier and later elements have to be built into the model.

Core content: RNN + GRU + LSTM + word embeddings (a mapping from one-hot encodings to a meaningful vector space) + attention-based models. Depending on the nature of the sequence, both uni-directional and bi-directional variants are available.

The RNN is the basic model: it takes a sequence as input and produces an output (possibly itself a sequence), and its parameters are shared across all time steps. The problem with RNNs is that when the sequence is very long, gradients are hard to propagate back through time, so training struggles to converge and later outputs cannot learn from inputs far in the past.
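
To make the parameter sharing concrete, here is a minimal numpy sketch of a vanilla RNN forward pass (the weight names `Waa`, `Wax`, `Wya` and the toy sizes are just illustrative, not the course's starter code): the same matrices are reused at every time step.

```python
import numpy as np

def rnn_forward(x_seq, Waa, Wax, Wya, ba, by):
    """Run a vanilla RNN over a sequence.

    x_seq: list of input column vectors, one per time step.
    The same weight matrices are applied at every step (parameter sharing).
    """
    a = np.zeros((Waa.shape[0], 1))             # initial hidden state
    outputs = []
    for x_t in x_seq:
        a = np.tanh(Waa @ a + Wax @ x_t + ba)   # hidden state update
        y_t = Wya @ a + by                      # per-step output (logits)
        outputs.append(y_t)
    return outputs, a

# toy usage: 3 time steps of 4-dim inputs, 5-dim hidden state, 2-dim outputs
rng = np.random.default_rng(0)
xs = [rng.standard_normal((4, 1)) for _ in range(3)]
Waa, Wax = rng.standard_normal((5, 5)), rng.standard_normal((5, 4))
Wya, ba, by = rng.standard_normal((2, 5)), np.zeros((5, 1)), np.zeros((2, 1))
ys, a_last = rnn_forward(xs, Waa, Wax, Wya, ba, by)
```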

LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) solve this problem by introducing a set of gates and a memory cell. LSTM brings in three gates: input / forget / output; GRU brings in two: update / reset. A memory cell is added to the recurrent unit, controlled by the gate logic, and this memory lets the unit carry information across long spans and work better. It is worth noting that LSTM and GRU are modular blocks: you can use them to replace the plain RNN cells without changing the rest of the network structure.
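
Because they are drop-in blocks, swapping between them can be as small as changing one class name. A minimal Keras sketch, assuming a toy classification model with made-up sizes:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(cell_cls):
    """Same network, different recurrent block (SimpleRNN, LSTM, or GRU)."""
    return keras.Sequential([
        layers.Embedding(input_dim=10000, output_dim=64),  # token ids -> dense vectors
        cell_cls(128),                                      # the swappable recurrent block
        layers.Dense(1, activation="sigmoid"),              # e.g. binary classification head
    ])

rnn_model  = build_model(layers.SimpleRNN)
lstm_model = build_model(layers.LSTM)
gru_model  = build_model(layers.GRU)
```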

In my opinion, word embeddings are a great thing. By starting from existing embeddings, we get to enjoy the convenience of big data and heavy training done by others, much like transfer learning in the image field. Our own models do not need to be very complex to produce impressive results, thanks to word embeddings. The course also introduces a very meaningful piece of work: debiasing. This matters for humans: algorithms themselves have no values, so what they learn from the data is biased (probably an accurate reflection of the data), but not necessarily in a good way, and we need to correct it, e.g.: are women not good at tech jobs? And so on.
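
As a sketch of that "transfer learning" idea: load someone else's pretrained vectors into a frozen Embedding layer. The file name `glove.6B.100d.txt` (a downloaded GloVe file) and the toy `word_index` below are assumptions for illustration, not from the course code.

```python
import numpy as np
from tensorflow.keras import layers
from tensorflow.keras.initializers import Constant

EMBED_DIM = 100
word_index = {"the": 1, "cat": 2, "sat": 3}   # toy vocabulary for illustration

# Parse the GloVe text format: "word v1 v2 ... v100" per line
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vec = line.split()
        embeddings[word] = np.asarray(vec, dtype="float32")

# Build an embedding matrix aligned with our own word ids
matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    if word in embeddings:
        matrix[i] = embeddings[word]          # reuse the pretrained vector

embedding_layer = layers.Embedding(
    input_dim=matrix.shape[0],
    output_dim=EMBED_DIM,
    embeddings_initializer=Constant(matrix),
    trainable=False,                          # freeze: pure "transfer learning"
)
```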

Attention-based models: the course first introduces beam search. Compared with a greedy algorithm, the point of beam search is to keep N different candidate branches at the same time, so there are more options when computing probabilities later, avoiding the trap where the greedy choice ends up in a local optimum. It is like walking through a maze: a greedy algorithm takes the path that looks best right now but may walk into a dead end, whereas beam search sends N people to explore in parallel, and as long as one of them gets through, you can pick the best of the N paths. Next is the attention model. First of all, I find the attention model quite complex: for each output it learns a weight over each input, and these weights ultimately determine which inputs that output depends on. LSTM improves memory; the attention model goes further, but I will not go into detail here because my understanding of it is still superficial. It is, however, a crucial technique in machine translation today. PS: BLEU (Bilingual Evaluation Understudy), a metric for evaluating machine translation quality, is mentioned in many papers.
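
A small self-contained sketch of the beam search idea, over a toy scoring function (the `toy_model` distribution below is made up, not a real language model): instead of keeping only the single best partial translation, keep the top `beam_width` candidates at every step.

```python
import math

def beam_search(next_log_probs, start, end, beam_width=3, max_len=10):
    """Keep the `beam_width` best partial sequences instead of only the greedy one.

    `next_log_probs(seq)` must return a dict {token: log_prob} for the next token.
    """
    beams = [([start], 0.0)]                      # (sequence, cumulative log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # keep only the best `beam_width` expansions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        finished += [b for b in beams if b[0][-1] == end]
        beams = [b for b in beams if b[0][-1] != end]
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# toy usage: a fake "model" with a fixed next-token distribution
def toy_model(seq):
    return {"a": math.log(0.6), "b": math.log(0.3), "<end>": math.log(0.1)}

print(beam_search(toy_model, "<start>", "<end>"))
```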

Above is an example of such a model. As you can see, the attention layer is sandwiched between two LSTMs: the lower LSTM is bidirectional (because the semantics on both sides of a word matter for translation), and the upper LSTM is unidirectional (the output is generated one token at a time, so it cannot look ahead). In theory the attention layer should be a big improvement (probably because it lets the model decide, for each output, which inputs to pay attention to, and the mechanism itself is fairly shallow).
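
For the shape of that architecture, here is a rough Keras sketch (bidirectional LSTM encoder below, a small per-step attention block in the middle, unidirectional LSTM decoder on top). All sizes and layer names are invented for illustration; this is not the assignment's code.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative sizes only
Tx, Ty = 30, 10            # input / output sequence lengths
n_a, n_s = 32, 64          # encoder / decoder hidden sizes
in_feats, out_vocab = 37, 11

# Shared attention sub-layers (reused at every output step)
repeat = layers.RepeatVector(Tx)
concat = layers.Concatenate(axis=-1)
score_dense = layers.Dense(10, activation="tanh")
score_out = layers.Dense(1)
attn_softmax = layers.Softmax(axis=1)          # normalize over the Tx input positions
weighted_sum = layers.Dot(axes=1)

def one_step_attention(a, s_prev):
    """Context vector: a weighted sum of encoder states `a`,
    with weights conditioned on the decoder's previous state `s_prev`."""
    s_rep = repeat(s_prev)                     # (batch, Tx, n_s)
    e = score_out(score_dense(concat([a, s_rep])))
    alphas = attn_softmax(e)                   # attention weights, sum to 1 over Tx
    return weighted_sum([alphas, a])           # (batch, 1, 2*n_a)

X = keras.Input(shape=(Tx, in_feats))
s0 = keras.Input(shape=(n_s,))
c0 = keras.Input(shape=(n_s,))

a = layers.Bidirectional(layers.LSTM(n_a, return_sequences=True))(X)   # lower, bidirectional
decoder_cell = layers.LSTM(n_s, return_state=True)                     # upper, unidirectional
out_layer = layers.Dense(out_vocab, activation="softmax")

s, c, outputs = s0, c0, []
for _ in range(Ty):                            # generate the output one step at a time
    context = one_step_attention(a, s)
    s, _, c = decoder_cell(context, initial_state=[s, c])
    outputs.append(out_layer(s))

model = keras.Model(inputs=[X, s0, c0], outputs=outputs)
```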

RNN-related techniques are mainly used in speech and language tasks, such as speech recognition, machine translation, and the task that combines the two: real-time simultaneous interpretation. I think machines' current performance at simultaneous interpretation is still far from the best human performance (I have tried Sogou's so-called black technology, and it is, frankly, full of holes).

The homework for this course is based on Keras, which I was not familiar with. Thanks to the discussion forum, I managed to complete it.

The end. Cheers!