Compiled by | Troy Chang, love heart, reason_W

Proofread by | reason_W

Next month, NIPS will be held across the ocean. What are its highlights? In their latest work, Bengio and his collaborators propose a new approach to RNN regularization, fraternal dropout, which trains an RNN by minimizing the difference between the predictions of the same RNN under different dropout masks, thereby making the RNN's representations more invariant to the dropout mask. The model achieves impressive results in comparative experiments and also performs well on image captioning and semi-supervised tasks.

Abstract

RNNs are an important neural network architecture, mainly used for language modeling and sequence prediction. However, RNNs are considerably harder to optimize than feedforward networks, and many techniques have been proposed to address this problem. In this work we develop a technique called fraternal dropout for this purpose. Specifically, we train two identical copies of an RNN with different dropout masks while minimizing the difference between their (pre-softmax) predictions. In this way, our regularization term encourages the RNN's representations to be invariant to the dropout mask. We show that our regularization term is upper-bounded by the expectation-linear dropout objective, which has been shown to address the gap caused by the difference in dropout behavior between the training and inference phases. We evaluate our model on sequence modeling tasks on two benchmark datasets (Penn Treebank and WikiText-2), with impressive results. We also show that this approach yields significant performance improvements on image captioning (Microsoft COCO) and semi-supervised (CIFAR-10) tasks.

1 Introduction

Recurrent neural networks such as long short-term memory networks (LSTM; Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU; Chung et al., 2014) are popular architectures for sequence modeling tasks such as language generation, translation, speech synthesis and machine comprehension. However, compared with feedforward networks, these RNNs are harder to optimize because of the varying length of input sequences, the repeated application of the same transition operator at every time step, and the large, dense embedding matrices determined by the vocabulary size. It is precisely because of these optimization challenges that batch normalization and its variants (layer normalization, recurrent batch normalization, recurrent normalization propagation), although they do bring significant performance improvements, have not been as successful in RNNs as their feedforward counterparts (Laurent et al., 2016). Similarly, naive application of dropout (Srivastava et al., 2014) has been shown to be ineffective in RNNs (Zaremba et al., 2014). Therefore, RNN regularization remains an active area of research.

To address these challenges, Zaremba et al. (2014) proposed applying dropout only to the non-recurrent connections in multi-layer RNNs. Variational dropout (Gal & Ghahramani, 2016) uses the same dropout mask throughout the sequence during training. DropConnect (Wan et al., 2013) applies dropout to the weight matrices. Zoneout (Krueger et al., 2016) randomly chooses to keep the hidden state from the previous time step rather than the current one, in a dropout-like fashion. Similarly, as an alternative to batch normalization, layer normalization normalizes the hidden units within each sample to zero mean and unit standard deviation. Recurrent batch normalization applies batch normalization but uses unshared mini-batch statistics for each time step (Cooijmans et al., 2016).

Merity et al. (2017a) and Merity et al. (2017b), on the other hand, showed that activation regularization (AR) and temporal activation regularization (TAR) are also effective methods for regularizing LSTMs.

In this work, we propose a simple dropout-based regularization that we call fraternal dropout. The method minimizes an equally weighted sum of the prediction losses of two networks (obtained by applying two different dropout masks to the same LSTM) and adds the L2 difference between the (pre-softmax) predictions of the two networks as a regularization term. We show analytically that this regularization objective is equivalent to minimizing the variance of the predictions under different i.i.d. dropout masks, which improves the invariance of the predictions to the dropout mask. We also discuss how our regularization term relates to expectation-linear dropout (Ma et al., 2016), the Π-model (Laine & Aila, 2016) and activation regularization (Merity et al., 2017a). In addition, experimental results show that our method clearly outperforms these related methods, as further explained in the ablation studies in Section 5.

2 FRATERNAL DROPOUT

Dropout is a powerful regularization method for neural networks. It is generally more effective in densely connected layers, because these are more prone to overfitting than convolutional layers, where parameters are shared. For this reason, dropout is an important regularizer for the RNN family. However, there is a gap between how dropout is used during training and during inference: inference assumes a deterministic, linearly rescaled activation, so the expected value of each activation differs between the two phases. (Editor's note: for a better understanding of this point, see Dropout with Expectation-linear Regularization, https://arxiv.org/abs/1609.08017, which theoretically analyzes the gap between the dropout ensemble and its expectation-linearized counterpart, and proposes this gap as a regularizer, i.e. the optimization goal is to make the gap as small as possible.) In addition, the predictions of a model with dropout typically vary with the dropout mask, whereas the ideal outcome is that the final prediction does not depend on the dropout mask.
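To make this training/inference gap concrete, here is a minimal NumPy sketch (our own illustration, not from the paper; the toy vector h, weight matrix W and keep probability are arbitrary assumptions): at training time an activation is multiplied by a random binary mask and rescaled, while at inference the expected mask is used. The two agree in expectation only through linear maps; once a nonlinearity is applied, a gap appears.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
h = rng.normal(size=5)            # toy pre-dropout activations
W = rng.normal(size=(3, 5))       # toy weight matrix
relu = lambda x: np.maximum(x, 0.0)

def dropped(h):
    """Training-time (inverted) dropout: random binary mask, rescaled by 1/keep_prob."""
    mask = rng.binomial(1, keep_prob, size=h.shape)
    return h * mask / keep_prob

# For a purely linear map, training-time dropout matches inference in expectation.
mc_linear = np.mean([W @ dropped(h) for _ in range(20000)], axis=0)
print(np.round(mc_linear - W @ h, 2))        # ~0 up to Monte Carlo noise

# After a nonlinearity, the expectations differ: this is the train/inference gap.
mc_nonlin = np.mean([relu(W @ dropped(h)) for _ in range(20000)], axis=0)
print(np.round(mc_nonlin - relu(W @ h), 2))  # generally non-zero
```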

Therefore, the idea of fraternal dropout is to train a neural network model so that its predictions vary as little as possible under different dropout masks. Formally, suppose we have an RNN model $M(\theta)$ with input $X$, where $\theta$ are the model parameters, and let

$$p^t(z_t, s_t; \theta) \in \mathbb{R}^m$$

denote the prediction of the model at time step $t$ for the input sample $X$, where $s_t$ is the dropout mask and $z_t$ is the current input, which is a function of $X$ and of the hidden states at previous time steps. Similarly, let

$$\ell^t\!\left(p^t(z_t, s_t; \theta), Y\right)$$

denote the corresponding loss at time $t$ for the input-target pair $(X, Y)$.

In fraternal dropout, we feed the input sample $X$ forward through two identical copies of the RNN simultaneously. The two copies share the same model parameters $\theta$ but use different dropout masks $s_t^i$ and $s_t^j$ at each time step $t$. This yields two losses at each time step $t$,

$$\ell^t\!\left(p^t(z_t, s_t^i; \theta), Y\right) \quad \text{and} \quad \ell^t\!\left(p^t(z_t, s_t^j; \theta), Y\right).$$

Thus, the overall loss function of fraternal dropout can be written as

$$\mathcal{L}_{FD}(X, Y) = \frac{1}{2}\left[\ell^t\!\left(p^t(z_t, s_t^i; \theta), Y\right) + \ell^t\!\left(p^t(z_t, s_t^j; \theta), Y\right)\right] + \frac{\kappa}{m}\, \mathcal{R}_{FD}(z_t; \theta),$$

where $\kappa$ is the regularization coefficient, $m$ is the dimension of $p^t(z_t, s_t; \theta)$, and

$$\mathcal{R}_{FD}(z_t; \theta) = \mathbb{E}_{s_t^i, s_t^j}\!\left[\left\| p^t(z_t, s_t^i; \theta) - p^t(z_t, s_t^j; \theta) \right\|_2^2\right]$$

is the fraternal dropout regularization term. It is estimated with Monte Carlo sampling,

$$\hat{\mathcal{R}}_{FD}(z_t; \theta) = \left\| p^t(z_t, s_t^i; \theta) - p^t(z_t, s_t^j; \theta) \right\|_2^2,$$

where $s_t^i$ and $s_t^j$ are the same masks used to compute the two loss values. Therefore, the additional computation is negligible.

We note that, as shown below, our regularization objective is equivalent to minimizing the variance of the prediction function under different dropout masks (proof in the appendix).

Remark 1. Let $s_t^i$ and $s_t^j$ be i.i.d. dropout masks and let $p^t(z_t, s_t; \theta)$ be the prediction function described above. Then

$$\mathcal{R}_{FD}(z_t; \theta) = \mathbb{E}_{s_t^i, s_t^j}\!\left[\left\| p^t(z_t, s_t^i; \theta) - p^t(z_t, s_t^j; \theta) \right\|_2^2\right] = 2 \sum_{q=1}^{m} \operatorname{var}_{s_t}\!\left(p_q^t(z_t, s_t; \theta)\right).$$
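As a concrete illustration, here is a minimal PyTorch-style sketch of the fraternal dropout objective (our own simplified re-implementation, not the authors' code; the TinyLM model, its dimensions, and the use of plain output dropout are assumptions made for brevity): the same network is run twice on the same batch with two independently sampled dropout masks, both cross-entropy losses are kept with equal weight, and the squared L2 difference between the two pre-softmax outputs is added as the regularizer scaled by κ/m.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy LSTM language model with dropout on the LSTM outputs (illustrative only)."""
    def __init__(self, vocab_size=100, emb=32, hid=64, p_drop=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.drop = nn.Dropout(p_drop)     # a fresh mask is sampled on every forward pass
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(self.drop(h))      # pre-softmax predictions p^t

def fraternal_dropout_loss(model, x, y, kappa=0.1):
    """Equal-weighted loss of two dropout passes plus the FD regularizer."""
    p_i = model(x)                         # pass 1: dropout mask s^i
    p_j = model(x)                         # pass 2: dropout mask s^j
    m = p_i.size(-1)                       # prediction dimension (vocab size)
    ce_i = F.cross_entropy(p_i.reshape(-1, m), y.reshape(-1))
    ce_j = F.cross_entropy(p_j.reshape(-1, m), y.reshape(-1))
    r_fd = (p_i - p_j).pow(2).sum(dim=-1).mean()   # Monte Carlo estimate of R_FD
    return 0.5 * (ce_i + ce_j) + (kappa / m) * r_fd

model = TinyLM()
x = torch.randint(0, 100, (8, 20))         # batch of token ids
y = torch.randint(0, 100, (8, 20))         # next-token targets
loss = fraternal_dropout_loss(model, x, y)
loss.backward()
```

Because both forward passes reuse the same parameters, the only extra cost compared with ordinary dropout training is the second pass; the regularizer itself reuses predictions that are already computed.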

3 Related work

3.1 Relation to Expectation-Linear Dropout (ELD)

Ma et al. (2016) show analytically that the expected (over samples) error between a model's expected prediction over all dropout masks and its prediction using the average mask is upper bounded. Based on this result, they propose to explicitly minimize the difference (written here in our notation),

$$\left\| \mathbb{E}_{s}\!\left[p^t(z_t, s; \theta)\right] - p^t\!\left(z_t, \mathbb{E}_s[s]; \theta\right) \right\|_2^2,$$

where $s$ is the dropout mask. However, for feasibility reasons, they propose to use the following regularizer in practice,

$$\mathcal{R}_{ELD}(z_t; \theta) = \mathbb{E}_{s}\!\left[\left\| p^t(z_t, s; \theta) - p^t\!\left(z_t, \mathbb{E}_s[s]; \theta\right) \right\|_2^2\right].$$

In practice, this is implemented with two forward passes through the network, one with and one without a dropout mask, minimizing the main network loss (on the pass with dropout) together with the regularization term above (with no gradient back-propagated through the no-dropout pass). The goal of Ma et al. (2016) is thus to minimize the network loss along with the expected difference between the prediction from an individual dropout mask and the prediction from the expected dropout mask. Our regularization term, in turn, is upper-bounded by the expectation-linear dropout objective (proof in the appendix):

$$\mathcal{R}_{FD}(z_t; \theta) \leq 4\, \mathcal{R}_{ELD}(z_t; \theta).$$
This shows that minimizing the ELD objective indirectly minimizes our regularization term as well. Finally, as described above, ELD applies the target loss only to the pass with the dropout mask. In fact, in our ablation study (see Section 5) we find that back-propagating the target loss through the network without dropout makes the model harder to optimize. In our setting, however, back-propagating the target loss through both networks yields gains in both performance and convergence. We believe our regularization term converges faster because the network weights are more likely to receive target-based gradient updates through back-propagation in our case. This matters especially for weight dropout (DropConnect; Wan et al., 2013), where the dropped weights would otherwise not be updated at all in a given training iteration.
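For comparison, here is a hedged sketch of how the practical ELD regularizer described above could be computed (again our own simplification, reusing the hypothetical TinyLM from the earlier sketch; the κ/m scaling is our own choice to mirror the FD notation): one pass with dropout carries the target loss, one pass in eval mode plays the role of the expected mask, and no gradient flows through the no-dropout pass.

```python
import torch
import torch.nn.functional as F

def eld_loss(model, x, y, kappa=0.1):
    """Sketch of expectation-linear dropout: target loss on the dropout pass only."""
    model.train()
    p_drop = model(x)                 # prediction with a sampled dropout mask
    model.eval()
    with torch.no_grad():             # no back-propagation through the no-dropout pass
        p_mean = model(x)             # prediction with the expected (mean) mask
    model.train()
    m = p_drop.size(-1)
    ce = F.cross_entropy(p_drop.reshape(-1, m), y.reshape(-1))
    r_eld = (p_drop - p_mean).pow(2).sum(dim=-1).mean()
    return ce + (kappa / m) * r_eld
```

The key contrast with fraternal dropout is that here only one network receives target-based updates, whereas fraternal dropout back-propagates the target loss through both dropout passes.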

3.2 Relation to the Π-Model

To improve performance on semi-supervised classification tasks, Laine & Aila (2016) proposed the Π-model. Their model is similar to ours, except that they apply the target loss in only one of the networks and use a time-dependent weight function (whereas we use the constant κ/m); it can be seen as a deep feedforward version of our model. The intuition in their case is to use unlabeled data to minimize the difference between the predictions of two networks with different dropout masks. They also tested their model on supervised tasks, but could not explain the improvement obtained from using this regularization term.

Analyzing our case, we show that minimizing this regularization term (also used in the Π-model) is equivalent to minimizing the variance of the model's predictions (Remark 1). We also prove the relationship between the regularization term and expectation-linear dropout (Proposition 1). In Section 5, we examine the effect of applying the target loss in both networks, which is not done in the Π-model, and find that it leads to significantly faster convergence. Finally, we note that temporal ensembling (another model proposed by Laine & Aila (2016), claimed to be a better version of the Π-model for semi-supervised learning) is quite tricky to use in natural language processing applications, because storing the running average of predictions over all time steps would be very memory-consuming (the number of predictions, i.e. the vocabulary size, tends to be very large – tens of thousands). In addition, we show that it is unnecessary to replace the constant κ/m with a time-dependent weight function in the supervised setting. Since the labels are known, we do not observe the problem mentioned by Laine & Aila (2016), where the network gets stuck in a degenerate solution when the regularization weight is too large in the early training epochs. We find it easier to tune a single constant value than a time-dependent function, and this is what we do in our case.

Like the Π-model, our method is also related to other semi-supervised approaches, mainly Rasmus et al. (2015) and Sajjadi et al. (2016). Since semi-supervised learning is not the focus of this paper, we refer the reader to Laine & Aila (2016) for more details.

4 Experiments

4.1 Language Model

For language modeling, we test our model on two benchmark datasets: the Penn Treebank (PTB) dataset (Marcus et al., 1993) and the WikiText-2 (WT2) dataset (Merity et al., 2016). Preprocessing follows Mikolov et al. (2010) for the PTB corpus and the Moses tokenizer (Koehn et al., 2007) for the WT2 dataset.

For both datasets, we adopt the three-layer AWD-LSTM architecture described by Merity et al. (2017a). The model for PTB has 24 million parameters, while the model for WT2 has 34 million, because WT2 has a larger vocabulary and therefore uses a larger embedding matrix. Apart from this difference, the architectures are identical.

  • Penn Treebank (PTB) word-level task

We use perplexity to evaluate our model and compare against the best published results. Table 1 shows our results: our method achieves state-of-the-art performance on this benchmark.

  • WikiText-2 word-level task

On the WikiText-2 language modeling task, we outperform the current state of the art. Table 2 lists the final results. More details about the experiments can be found in Section 5.4.

4.2 Image Captioning

We also apply fraternal dropout to an image captioning task. We use the well-known Show and Tell model as the baseline (Vinyals et al., 2014). It should be emphasized that in image captioning, the image encoder and the sentence decoder are usually learned jointly. But since we want to focus on the benefits of using fraternal dropout in the RNN, we use a frozen, pre-trained ResNet-101 (He et al., 2015) as our image encoder. This means our results are not directly comparable to other state-of-the-art methods, but we report the results of the original method so that readers can verify that our baseline performs well. Table 3 gives the final results.

We believe that a small κ value works best in this task because the caption decoder is given all the image information at the start, so the variance of successive predictions is smaller than in unconditioned NLP tasks. Fraternal dropout may be advantageous here mainly because it averages the gradients over different dropout masks and thus updates the weights more often.

5 Ablation Studies

In this section, our goal is to study the existing methods most closely related to our approach: expectation-linear dropout (Ma et al., 2016), the Π-model (Laine & Aila, 2016) and activation regularization (Merity et al., 2017b). (Ablation studies are experiments designed to check whether particular components of a proposed model are actually useful.) All ablation experiments use a single-layer LSTM with the same hyperparameters and model architecture.

5.1 Expectation-Linear Dropout (ELD)

The connection between our method and ELD was discussed in Section 3.1. Here we run experiments to compare the performance of ELD regularization and our regularization (FD). Besides ELD, we also study a modification of it, ELDM, which applies ELD to two identical LSTMs in the same way as FD (in the original authors' setup, dropout is used on only one of the networks). Finally, we have a baseline model without any of these regularizers. Figure 1 shows the training curves for these methods. Our regularization converges faster than the others. In terms of generalization, we find that FD and ELD perform similarly, while the baseline and ELDM perform worse. Interestingly, looking at the training and validation curves together, ELDM appears to be harder to optimize.

5.2 Π-Model

Because the Π-model is similar to our algorithm (even though it was designed for semi-supervised learning in feedforward networks), we study the differences in performance, both qualitatively and quantitatively, to identify the advantages of our algorithm. First, on the PTB (Penn Treebank) language modeling task, we run a single-layer LSTM and the three-layer AWD-LSTM to compare the two algorithms. Figures 1 and 2 show the results. We find that our model converges significantly faster than the Π-model, which we believe is because we back-propagate the target loss through both networks (unlike the Π-model), so more parameters receive target-based gradient updates.

Although our algorithm was designed to address problems specific to RNNs, we also compare against the Π-model on a semi-supervised task for a fair comparison. We use the CIFAR-10 dataset, which consists of 32×32 images from 10 classes. Following the usual data split in the semi-supervised learning literature, we use 4,000 labeled and 41,000 unlabeled images for training, 5,000 labeled images for validation and 10,000 labeled images for testing. We use the original 56-layer residual network architecture and run a grid search over the regularization coefficient κ and the dropout rate in {0.05, 0.1, 0.15, 0.2}, leaving the remaining hyperparameters unchanged. We also examine the importance of using unlabeled data; a sketch of the semi-supervised objective is given below. Table 4 shows the results. We find that the performance of our algorithm is almost identical to that of the Π-model. When unlabeled data is not used, fraternal dropout performs only slightly better than ordinary dropout.
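The sketch below (our own reading of the setup, reusing the hypothetical fraternal_dropout_loss from the earlier sketch; in the CIFAR-10 experiment the model would be the ResNet classifier rather than an LSTM) shows how labeled and unlabeled batches can be combined: labeled data contributes both the target loss and the regularizer, while unlabeled data contributes only the prediction-consistency term.

```python
import torch

def semi_supervised_step(model, x_lab, y_lab, x_unlab, kappa=0.15):
    """Labeled batch: full fraternal dropout loss; unlabeled batch: consistency term only."""
    loss = fraternal_dropout_loss(model, x_lab, y_lab, kappa=kappa)
    p_i = model(x_unlab)                    # dropout mask s^i, no labels needed
    p_j = model(x_unlab)                    # dropout mask s^j
    m = p_i.size(-1)
    consistency = (p_i - p_j).pow(2).sum(dim=-1).mean()
    return loss + (kappa / m) * consistency
```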

5.3 Activation Regularization (AR) and Temporal Activation Regularization (TAR) Analysis

The authors of Merity et al. (2017b) studied the importance of activation regularization (AR) and temporal activation regularization (TAR) in LSTMs, given by

$$\mathcal{R}_{AR} = \alpha\, \|m_t \odot h_t\|_2^2 \qquad \text{and} \qquad \mathcal{R}_{TAR} = \beta\, \|h_t - h_{t+1}\|_2^2,$$

where $h_t$ is the LSTM output activation at time step $t$ (and therefore depends on both the current input and the model parameters) and $m_t$ is the dropout mask.

Table 4: Accuracy on the altered (semi-supervised) CIFAR-10 dataset with a ResNet-56 model. We find that our proposed algorithm performs on par with the Π-model. When unlabeled data is not used, ordinary dropout hurts performance while fraternal dropout gives slightly better results. This suggests that our approach is advantageous when labeled data is scarce and additional regularization has to be used.

Figure 4: Ablation study: training (left) and validation (right) perplexity of a single-layer LSTM (10M parameters) on the PTB word-level modeling task. Learning dynamics are shown for the baseline model, temporal activation regularization (TAR), prediction regularization (PR), activation regularization (AR) and fraternal dropout (FD, our algorithm). FD converges faster and generalizes better than the competing regularizers.

Note that AR and TAR are applied to the output activations of the LSTM, while our regularization is applied to its pre-softmax output. However, our regularization term can be decomposed as

$$\frac{\kappa}{m}\, \hat{\mathcal{R}}_{FD}(z_t; \theta) = \frac{\kappa}{m} \left[\left\|p^t(z_t, s_t^i; \theta)\right\|_2^2 + \left\|p^t(z_t, s_t^j; \theta)\right\|_2^2 - 2\, p^t(z_t, s_t^i; \theta)^{\top} p^t(z_t, s_t^j; \theta)\right],$$

i.e. it encapsulates an L2 activation-norm term together with a dot-product term, so we confirm experimentally that the improvement of our method is not driven by the L2 term alone. A similar argument applies to the TAR objective. We run a grid search over the coefficients of AR and TAR, including the hyperparameter values mentioned in Merity et al. (2017b), and over κ for our proposed regularization term. In addition, we compare against a prediction regularizer (PR) of the form $\frac{\kappa}{m}\|p^t(z_t, s_t; \theta)\|_2^2$, to further rule out any improvement coming from the L2 term alone. Based on this grid search, we select the best model on the validation set for each regularization, and also report a baseline model that uses none of these four regularizers. The learning dynamics are shown in Figure 4. Our regularization term performs better than the others in terms of both convergence and generalization. The average hidden-state activation is reduced when any of the described regularizers is applied (Figure 3).
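For reference, here is a brief sketch of the competing penalties that the ablation compares against fraternal dropout (our own code, based on the formulas above; the tensor shapes, the mean-vs-sum normalization and the exact form of PR are assumptions for illustration, with h taken as the LSTM output and p as the pre-softmax prediction).

```python
import torch

def ar_term(h_dropped, alpha):
    """Activation regularization: L2 penalty on the (dropout-masked) LSTM outputs."""
    return alpha * h_dropped.pow(2).mean()

def tar_term(h, beta):
    """Temporal activation regularization: penalize change between consecutive steps.
    h has shape (batch, time, hidden)."""
    return beta * (h[:, 1:] - h[:, :-1]).pow(2).mean()

def pr_term(p, kappa):
    """Prediction regularization (PR): L2 penalty on the pre-softmax predictions alone."""
    return (kappa / p.size(-1)) * p.pow(2).sum(dim=-1).mean()
```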

5.4 Language Modeling Fair Comparison

As mentioned in Section 4.1, motivated by the concerns raised by Melis et al. (2017), we want to make sure that fraternal dropout outperforms existing methods not simply because of an extensive hyperparameter grid search. Therefore, in our experiments we keep most of the hyperparameters mentioned in the original work unchanged, i.e. the embedding and hidden-state sizes, the gradient clipping value, weight decay, and the values used for all dropout layers (dropout on the word vectors, on the output between LSTM layers, on the final LSTM output, and embedding dropout).

Of course, some necessary changes were made:

  • The coefficients for AR and TAR had to be changed, because fraternal dropout also affects the RNN activations (as described in Section 5.3). Instead of running a grid search to find their best values, we simply removed the AR and TAR regularization terms.

  • Since our method needs twice as much memory, the batch size was halved so that the memory requirement stays roughly the same and the model fits on the same GPU.

The final change is to the non-monotone interval hyperparameter n used by ASGD. We run a grid search over n ∈ {5, 25, 40, 50, 60} and obtain very similar results for the three largest values (40, 50 and 60). As a result, our model is trained with the plain SGD optimizer for longer than the original model.

To verify the effectiveness of our approach, we ran ten training runs on the PTB dataset with the original hyperparameters and different seeds (without fine-tuning) to compute confidence intervals for the original model. Its average best validation perplexity is 60.64 ± 0.15, with a minimum of 60.33; the corresponding test perplexities are 58.32 ± 0.14 and 58.05. Our scores (59.8 validation and 58.0 test perplexity) are better than those obtained with the original dropout.

Due to limited computational resources, we ran only a single training run for fraternal dropout on the WT2 dataset. In this experiment, we used the best hyperparameters found for the PTB dataset (κ = 0.1, non-monotone interval n = 60, and halved batch size).

We confirm that fine-tuning benefits ASGD (Merity et al., 2017a). However, it is a very time-consuming practice, and since different hyperparameters may be used in this additional phase of learning, the chance of obtaining better results through an extensive grid search is higher. Therefore, in our experiments we use the same fine-tuning procedure as implemented in the official repository (fraternal dropout is not even used during fine-tuning). The importance of fine-tuning is shown in Table 5.

We believe that running a joint grid search over all hyperparameters might yield even better results (changing the dropout rates might be particularly beneficial, since our approach explicitly uses dropout). However, our purpose here is to rule out the possibility that the improvement comes simply from using better hyperparameters.

6 Conclusion

In this paper, we propose a simple regularization method for RNNs called fraternal dropout, which acts as a regularization term that reduces the variance of the model's predictions over different dropout masks. Experiments show that our model converges faster and achieves state-of-the-art results on benchmark language modeling tasks. We also analyze the relationship between our regularization term and expectation-linear dropout (Ma et al., 2016). We conduct a number of ablation studies that evaluate the model from different perspectives and carefully compare it with related methods, both qualitatively and quantitatively.

Link to the Fraternal Dropout paper:

https://arxiv.org/abs/1711.00066
