
Welcome to visit my personal blog: jmxgodlz.xyz

Preface

This article introduces neural network learning rate tuning techniques: warmup and decay. Backpropagation performs the parameter update $\theta_t = \theta_{t-1} - \alpha \cdot g_t$, where $\alpha$ is the learning rate and $g_t$ is the gradient update quantity. Warmup and decay are ways to adjust $\alpha$, while the optimizer determines how the gradient update $g_t$ is computed. The decay schedules are shown in the figure below:

Warmup and Decay

Definition

Warmup and decay are learning rate adjustment strategies used during model training.

Warmup is a learning rate warmup method introduced in the ResNet paper. It uses a small learning rate at the beginning of training, trains for a few epochs or steps (e.g., 4 epochs or 10,000 steps), and then switches to the preset learning rate for the rest of training.

Similarly, decay is a learning rate decay method: after a specified number of epochs or steps, the learning rate is reduced toward a target value following, for example, a linear or cosine function. With warmup and decay combined, the learning rate first increases from small to large and then decreases.
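To make this shape concrete, here is a minimal sketch (the function name and the linear-decay choice are illustrative; full implementations follow later in this article):

def warmup_then_linear_decay(step, init_lr, warmup_steps, total_steps):
    """Learning rate rises linearly during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return init_lr * step / max(1, warmup_steps)
    return init_lr * (total_steps - step) / max(1, total_steps - warmup_steps)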

Why Warmup

A reference discussion on Zhihu: www.zhihu.com/question/33… The common practice in SGD training is to start with a large learning rate and decay it to a small one, while warmup first increases from a small learning rate to the initial learning rate and only then decays. So why is warmup effective?

Intuitive explanation

The random initialization of a deep network varies greatly. If the learning rate is large at the beginning, the deviation introduced by the earliest updates will be difficult to correct later in training.

At the beginning of training, gradient updates are large, and a large learning rate amplifies them further. This is where deep networks differ from traditional shallow models: large early updates to a shallow network do not push it in a wrong direction, but they can in a deep one.

Theoretical explanation

The benefits of warmup include:

  • Alleviating over-fitting to the mini-batches seen in the initial stage of training
  • Keeping the deep layers of the model stable

The conclusions of three papers are given below:

  1. When the batch size is multiplied by some factor, the learning rate can be multiplied by the same factor
  2. The limitation of large-batch training is the training instability brought by the high learning rate
  3. Warmup mainly limits changes in the deep-layer weights, and freezing the deep layers has a similar effect

The relationship between batch size and learning rate

Assume the model has been trained to step $t$ with weights $w_t$, and that we have $k$ mini-batches, each of size $n$, denoted $\mathcal{B}_{1:k}$. Let's look at the relationship between training on $\mathcal{B}_{1:k}$ for $k$ steps with learning rate $\eta$ and training on all of $\mathcal{B}_{1:k}$ in a single step with learning rate $\hat{\eta}$.

Assuming we use SGD, after $k$ training steps we get:


$$w_{t+k}=w_{t}-\eta \frac{1}{n} \sum_{j<k} \sum_{x \in \mathcal{B}_{j}} \nabla l\left(x, w_{t+j}\right)$$

Whereas a single training step over all $k$ mini-batches with learning rate $\hat{\eta}$ gives:


$$\hat{w}_{t+1}=w_{t}-\hat{\eta} \frac{1}{k n} \sum_{j<k} \sum_{x \in \mathcal{B}_{j}} \nabla l\left(x, w_{t}\right)$$

Here $w_{t+k}$ and $\hat{w}_{t+1}$ denote the parameters after training $k$ times and once, respectively. Clearly the two are different. But if we assume $\nabla l(x, w_t) \approx \nabla l(x, w_{t+j})$, then setting $\hat{\eta} = k\eta$ guarantees $\hat{w}_{t+1} \approx w_{t+k}$. So when might the assumption $\nabla l(x, w_t) \approx \nabla l(x, w_{t+j})$ fail to hold? [1] tells us there are two cases:

  • At the beginning of training, model weights change rapidly
  • Mini-batch size is small and sample variance is large

In the first case, the distribution of the model's initial parameters depends only on the initialization method. The first batches of data force large corrections to these parameters, so the gradient updates are large. If the model starts with a large learning rate, it easily over-fits this early data and needs many more rounds of training to undo the damage.

In the second case, if the variance of the data distribution within a mini-batch is particularly large, the model's learning fluctuates violently and the learned weights become very unstable. This is most obvious in the early stage of training and is relatively relieved toward the end.

In both cases, we cannot simply set $\hat{\eta} = k\eta$, because the assumption $\nabla l(x, w_t) \approx \nabla l(x, w_{t+j})$ does not hold. At that point, either change how the learning rate grows (warmup) or address the two cases directly (data preprocessing to reduce sample variance). A small numeric check of the assumption follows below.
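As a concrete illustration, here is a small numpy check; the quadratic loss $l(x, w) = \frac{1}{2}(w - x)^2$, the mini-batches, and all values are invented for the demo. With a small $\eta$ the gradient barely changes across the $k$ steps, so the two results nearly coincide:

import numpy as np

def grad(x, w):
    # gradient of l(x, w) = 0.5 * (w - x)^2 with respect to w
    return w - x

w0, eta, k = 5.0, 0.01, 4
batches = [np.array([1.0, 1.2]), np.array([0.9, 1.1]),
           np.array([1.0, 0.8]), np.array([1.1, 1.0])]  # k mini-batches of size n = 2

# k SGD steps with learning rate eta
w = w0
for b in batches:
    w -= eta * np.mean(grad(b, w))

# one SGD step over all k*n samples with learning rate k*eta
w_hat = w0 - k * eta * np.mean([grad(x, w0) for b in batches for x in b])

print(w, w_hat)  # ~4.843 vs ~4.840: close, because the gradients at w_t and w_{t+j} are similar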

Warmup and stability of model learning

This part draws on the experimental results of several papers to show that warmup makes model learning more stable.

The figure above shows that with warmup, the model learns more steadily.

Figures (b) and (c) show that with warmup, the similarity of the model's last layers increases, avoiding unstable changes to the model.

Learning rate decay strategies

Visualization code

To make the plots more intuitive, warmup is applied before each of the learning rate decay strategies below. The initial learning rate is set to 1, warmup steps to 20, and total steps to 100. In practice, following BERT's rule of thumb, warmup is typically set to about 10% of the total number of steps.

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# author: JMXGODLZZ
# datetime: 2022/1/23 7:10 PM
# IDE: PyCharm
import matplotlib.pyplot as plt
from learningrateSchedules import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
from learningrateSchedules import get_cosine_with_hard_restarts_schedule_with_warmup
from learningrateSchedules import get_polynomial_decay_schedule_with_warmup
from learningrateSchedules import get_step_schedule_with_warmup
from learningrateSchedules import get_exp_schedule_with_warmup

init_lr = 1
warmupsteps = 20
totalsteps = 100
lrs = get_linear_schedule_with_warmup(init_lr, warmupsteps, totalsteps)
cos_warm_lrs = get_cosine_schedule_with_warmup(init_lr, warmupsteps, totalsteps)
cos_hard_warm_lrs = get_cosine_with_hard_restarts_schedule_with_warmup(init_lr, warmupsteps, totalsteps, 2)
poly_warm_lrs = get_polynomial_decay_schedule_with_warmup(init_lr, warmupsteps, totalsteps, 0, 5)
step_warm_lrs = get_step_schedule_with_warmup(init_lr, warmupsteps, totalsteps)
exp_warm_lrs = get_exp_schedule_with_warmup(init_lr, warmupsteps, totalsteps, 0.9)  # gamma assumed 0.9; the original value was lost in extraction

x = range(totalsteps)  # step indices for plotting
plt.plot(x, lrs, label='linear_warmup', color='k')
plt.plot(x, cos_warm_lrs, label='cosine_warmup', color='b')
plt.plot(x, cos_hard_warm_lrs, label='cosine_cy2_warmup', color='g')
plt.plot(x, poly_warm_lrs, label='polynomial_warmup_pw5', color='r')
plt.plot(x, step_warm_lrs, label='step_warmup', color='purple')
plt.plot(x, exp_warm_lrs, label='exp_warmup', color='orange')
plt.xlabel('steps')
plt.ylabel('learning rate')
plt.legend()
plt.show()

Exponential decay of learning rate

def get_exp_schedule_with_warmup(learning_rate, num_warmup_steps, num_training_steps, gamma, last_epoch=-1):
    """
    Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after
    a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

    Args:
        optimizer (:class:`~torch.optim.Optimizer`):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (:obj:`int`):
            The number of steps for the warmup phase.
        num_training_steps (:obj:`int`):
            The total number of training steps.
        last_epoch (:obj:`int`, `optional`, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        :obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """

    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        stepmi = current_step - num_warmup_steps  # steps elapsed since the end of warmup
        return pow(gamma, stepmi)
    lrs = []
    for current_step in range(num_training_steps):
        cur_lr = lr_lambda(current_step) * learning_rate
        lrs.append(cur_lr)
    return lrs
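A quick usage check of the function above, using the same settings as the visualization script (the gamma value of 0.9 is an assumption):

lrs = get_exp_schedule_with_warmup(learning_rate=1, num_warmup_steps=20,
                                   num_training_steps=100, gamma=0.9)
print(lrs[0], lrs[10], lrs[19], lrs[20], lrs[30])
# 0.0 (warmup start), 0.5 (mid-warmup), 0.95 (warmup end), 1.0, 0.9**10 ~ 0.349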

Cosine decay learning rate

import math

def get_cosine_schedule_with_warmup(learning_rate, num_warmup_steps: int, num_training_steps: int,
                                    num_cycles: float = 0.5, last_epoch: int = -1):
    """
    Create a schedule with a learning rate that decreases following the values of the cosine function between the
    initial lr and 0, after a warmup period during which it increases linearly between 0 and the initial lr.

    Args:
        learning_rate (:obj:`float`):
            The initial learning rate.
        num_warmup_steps (:obj:`int`):
            The number of steps for the warmup phase.
        num_training_steps (:obj:`int`):
            The total number of training steps.
        num_cycles (:obj:`float`, `optional`, defaults to 0.5):
            The number of waves in the cosine schedule (the default is to just decrease from the max value to 0
            following a half-cosine).
        last_epoch (:obj:`int`, `optional`, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        A list with the learning rate for each training step.
    """

    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress)))

    lrs = []
    for current_step in range(num_training_steps):
        cur_lr = lr_lambda(current_step) * learning_rate
        lrs.append(cur_lr)
    return lrs

Linear decay of learning rate

def get_linear_schedule_with_warmup(learning_rate, num_warmup_steps, num_training_steps, last_epoch=-1):
    """
    Create a schedule with a learning rate that decreases linearly from the initial lr to 0, after a warmup
    period during which it increases linearly from 0 to the initial lr.

    Args:
        learning_rate (:obj:`float`):
            The initial learning rate.
        num_warmup_steps (:obj:`int`):
            The number of steps for the warmup phase.
        num_training_steps (:obj:`int`):
            The total number of training steps.
        last_epoch (:obj:`int`, `optional`, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        A list with the learning rate for each training step.
    """

    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps)))

    lrs = []
    for current_step in range(num_training_steps):
        cur_lr = lr_lambda(current_step) * learning_rate
        lrs.append(cur_lr)
    return lrs

Step decay learning rate

def get_step_schedule_with_warmup(learning_rate, num_warmup_steps, num_training_steps, last_epoch=-1):
    """
    Create a schedule with a learning rate that is halved every 20 steps after a warmup period during which it
    increases linearly from 0 to the initial lr.

    Args:
        learning_rate (:obj:`float`):
            The initial learning rate.
        num_warmup_steps (:obj:`int`):
            The number of steps for the warmup phase.
        num_training_steps (:obj:`int`):
            The total number of training steps.
        last_epoch (:obj:`int`, `optional`, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        A list with the learning rate for each training step.
    """

    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        stepmi = (current_step - num_warmup_steps) // 20 + 1  # index of the current 20-step interval
        return pow(0.5, stepmi)

    lrs = []
    for current_step in range(num_training_steps):
        cur_lr = lr_lambda(current_step) * learning_rate
        lrs.append(cur_lr)
    return lrs

Polynomial decay learning rate

def get_polynomial_decay_schedule_with_warmup(learning_rate, num_warmup_steps, num_training_steps,
                                              lr_end=1e-7, power=1.0, last_epoch=-1):
    """
    Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr to the end lr
    defined by `lr_end`, after a warmup period during which it increases linearly from 0 to the initial lr.

    Args:
        learning_rate (:obj:`float`):
            The initial learning rate.
        num_warmup_steps (:obj:`int`):
            The number of steps for the warmup phase.
        num_training_steps (:obj:`int`):
            The total number of training steps.
        lr_end (:obj:`float`, `optional`, defaults to 1e-7):
            The end LR.
        power (:obj:`float`, `optional`, defaults to 1.0):
            Power factor.
        last_epoch (:obj:`int`, `optional`, defaults to -1):
            The index of the last epoch when resuming training.

    Note: `power` defaults to 1.0 as in the Fairseq implementation, which in turn is based on the original BERT
    implementation at
    https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/optimization.py#L37

    Return:
        A list with the learning rate for each training step.
    """
    lr_init = learning_rate
    if not (lr_init > lr_end):
        raise ValueError(f"lr_end ({lr_end}) must be smaller than initial lr ({lr_init})")

    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        elif current_step > num_training_steps:
            return lr_end / lr_init  # as LambdaLR multiplies by lr_init
        else:
            lr_range = lr_init - lr_end
            decay_steps = num_training_steps - num_warmup_steps
            pct_remaining = 1 - (current_step - num_warmup_steps) / decay_steps
            decay = lr_range * pct_remaining ** power + lr_end
            return decay / lr_init  # as LambdaLR multiplies by lr_init

    lrs = []
    for current_step in range(num_training_steps):
        cur_lr = lr_lambda(current_step) * learning_rate
        lrs.append(cur_lr)
    return lrs

Cosine decay with hard restarts

import math

def get_cosine_with_hard_restarts_schedule_with_warmup(learning_rate, num_warmup_steps: int,
                                                       num_training_steps: int, num_cycles: int = 1,
                                                       last_epoch: int = -1):
    """
    Create a schedule with a learning rate that decreases following the values of the cosine function between the
    initial lr and 0, with several hard restarts, after a warmup period during which it increases linearly
    between 0 and the initial lr.

    Args:
        learning_rate (:obj:`float`):
            The initial learning rate.
        num_warmup_steps (:obj:`int`):
            The number of steps for the warmup phase.
        num_training_steps (:obj:`int`):
            The total number of training steps.
        num_cycles (:obj:`int`, `optional`, defaults to 1):
            The number of hard restarts to use.
        last_epoch (:obj:`int`, `optional`, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        A list with the learning rate for each training step.
    """

    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))
        if progress >= 1.0:
            return 0.0
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((float(num_cycles) * progress) % 1.0))))

    lrs = []
    for current_step in range(num_training_steps):
        cur_lr = lr_lambda(current_step) * learning_rate
        lrs.append(cur_lr)
    return lrs

Learning rate decay implementations

PyTorch learning rate strategies

if args.scheduler == "constant_schedule":
    scheduler = get_constant_schedule(optimizer)

elif args.scheduler == "constant_schedule_with_warmup":
    scheduler = get_constant_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps
    )

elif args.scheduler == "linear_schedule_with_warmup":
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=args.warmup_steps,
        num_training_steps=t_total,
    )

elif args.scheduler == "cosine_schedule_with_warmup":
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=args.warmup_steps,
        num_training_steps=t_total,
        num_cycles=args.cosine_schedule_num_cycles,
    )

elif args.scheduler == "cosine_with_hard_restarts_schedule_with_warmup":
    scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
        optimizer,
        num_warmup_steps=args.warmup_steps,
        num_training_steps=t_total,
        num_cycles=args.cosine_schedule_num_cycles,
    )

elif args.scheduler == "polynomial_decay_schedule_with_warmup":
    scheduler = get_polynomial_decay_schedule_with_warmup(
        optimizer,
        num_warmup_steps=args.warmup_steps,
        num_training_steps=t_total,
        lr_end=args.polynomial_decay_schedule_lr_end,
        power=args.polynomial_decay_schedule_power,
    )

else:
    raise ValueError("{} is not a valid scheduler.".format(args.scheduler))

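For completeness, here is a sketch of how such a scheduler is driven during training; the model, data, and hyperparameters are placeholders, and get_linear_schedule_with_warmup is the HuggingFace transformers helper:

import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
t_total = 1000
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=t_total)

for step in range(t_total):
    inputs = torch.randn(8, 10)          # placeholder batch
    labels = torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the learning rate schedule once per optimizer step
    optimizer.zero_grad()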

Keras learning rate strategies

  • Keras provides four decay strategies: ExponentialDecay, PiecewiseConstantDecay, PolynomialDecay, and InverseTimeDecay. You only need to specify the decay policy when constructing the optimizer, which takes a single line of code; see Method 1 below.
  • If you want to customize learning rate decay, there is a second, more flexible way: implement a dynamic, custom learning rate decay strategy through callbacks, described in Method 2.
  • If both methods are used at the same time, the second takes precedence and the first is ignored.

Method 1

Exponential decay

lr_scheduler = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-2,
    decay_steps=10000,
    decay_rate=0.96)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_scheduler)

Piecewise constant decay

The learning rate is 1.0 for the first 1,000 steps, 0.5 for steps 1,001 to 10,000, and 0.1 for all remaining steps.

step = tf.Variable(0, trainable=False)
boundaries = [1000, 10000]
values = [1.0, 0.5, 0.1]
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)
lr_scheduler = learning_rate_fn(step)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_scheduler)

Polynomial decay

Decays from 0.1 to 0.001 over 10,000 steps, using a square-root profile (power=0.5).

start_lr = 0.1
end_lr = 0.001
decay_steps = 10000
lr_scheduler = tf.keras.optimizers.schedules.PolynomialDecay(
    start_lr, decay_steps, end_lr, power=0.5)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_scheduler)

Inverse time decay

initial_lr = 0.1
decay_steps = 1.0
decay_rate = 0.5
lr_scheduler = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_lr, decay_steps, decay_rate)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_scheduler)

Method 2

Custom exponential decay

from math import floor
from keras.callbacks import LearningRateScheduler

def step_decay(epoch):
    init_lr = 0.1
    drop = 0.5
    epochs_drop = 10
    if epoch < 100:
        return init_lr
    else:
        return init_lr * pow(drop, floor((1 + epoch) / epochs_drop))

# ...
# wrap the decay function in a LearningRateScheduler callback
lr_callback = LearningRateScheduler(step_decay)
# add it to callbacks
model = KerasClassifier(build_fn=create_model, epochs=200, batch_size=5, verbose=1,
                        callbacks=[checkpoint, lr_callback])
model.fit(X, Y)

Dynamically modify the learning rate

ReduceLROnPlateau(monitor='val_acc', mode='max', min_delta=0.1, factor=0.2, patience=5, min_lr=0.001)

When the improvement in val_acc on the validation set is smaller than min_delta for patience consecutive epochs, the learning rate is multiplied by factor. mode can be max or min, set flexibly according to the monitored metric; min_lr is the lower bound on the learning rate.

# Step 1: define the dynamic learning rate policy with ReduceLROnPlateau
reduce_lr_callback = ReduceLROnPlateau(monitor='val_acc', factor=0.2, patience=5, min_lr=0.001)
# Step 2: add it to callbacks
model = KerasClassifier(build_fn=create_model, epochs=200, batch_size=5, verbose=1,
                        callbacks=[checkpoint, reduce_lr_callback])
model.fit(X, Y)

Displaying the learning rate in Keras

import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam
from keras.callbacks import LearningRateScheduler, TensorBoard

def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr  # expose the current learning rate as a metric
    return lr

x = Input((50,))
out = Dense(1, activation='sigmoid')(x)
model = Model(x, out)
optimizer = Adam(lr=0.001)
lr_metric = get_lr_metric(optimizer)
model.compile(loss='binary_crossentropy', optimizer=optimizer,
              metrics=['acc', lr_metric])

# reducing the learning rate by half every 2 epochs
cbks = [LearningRateScheduler(lambda epoch: 0.001 * 0.5 ** (epoch // 2)),
        TensorBoard(write_graph=False)]
X = np.random.rand(1000, 50)
Y = np.random.randint(2, size=1000)
model.fit(X, Y, epochs=10, callbacks=cbks)

Layer-wise learning rate setting

Sometimes we need to set different learning rates for different layers of the model. For example, when fine-tuning a pre-trained model, the pre-trained layers should use a smaller learning rate, while the other layers learn at the normal rate. Below is the Keras implementation given by Su Jianlin (苏神), which adjusts the learning rate through a parameter transformation.

The gradient descent formula is as follows:


$$\boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_{n}-\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}$$

Consider the transformation $\boldsymbol{\theta}=\lambda \boldsymbol{\phi}$, where $\lambda$ is a fixed scalar and $\boldsymbol{\phi}$ is the new parameter. Optimizing $\boldsymbol{\phi}$ instead, the corresponding update formula is:


$$\begin{aligned}\boldsymbol{\phi}_{n+1}&=\boldsymbol{\phi}_{n}-\alpha \frac{\partial L(\lambda\boldsymbol{\phi}_{n})}{\partial \boldsymbol{\phi}_n}\\ &=\boldsymbol{\phi}_{n}-\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}\frac{\partial \boldsymbol{\theta}_{n}}{\partial \boldsymbol{\phi}_n}\\ &=\boldsymbol{\phi}_{n}-\lambda\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}\end{aligned}$$

The chain rule was applied in the second step above. Multiplying both sides by $\lambda$:


$$\lambda\boldsymbol{\phi}_{n+1}=\lambda\boldsymbol{\phi}_{n}-\lambda^2\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}\quad\Rightarrow\quad\boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_{n}-\lambda^2\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}$$

In the SGD optimizer, applying the parameter transformation $\boldsymbol{\theta}=\lambda \boldsymbol{\phi}$ is therefore equivalent to changing the learning rate from $\alpha$ to $\lambda^2\alpha$.

However, in adaptive learning rate optimizers (such as RMSprop and Adam), the situation is slightly different: the adaptive mechanism divides the update by a gradient-based quantity, which cancels one factor of $\lambda$.

So in adaptive learning rate optimizers such as RMSprop and Adam, the transformation $\boldsymbol{\theta}=\lambda \boldsymbol{\phi}$ is equivalent to changing the learning rate from $\alpha$ to $\lambda\alpha$.
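Before the Keras implementation, a quick sanity check of the SGD case in plain Python; the loss $L(\theta)=\frac{1}{2}\theta^2$ and all numbers are illustrative:

# L(theta) = 0.5 * theta^2, so dL/dtheta = theta
alpha, lamb = 0.1, 0.5
theta = 2.0

# plain SGD on theta with learning rate lamb^2 * alpha
theta_direct = theta - (lamb ** 2) * alpha * theta

# SGD on phi = theta / lamb with learning rate alpha, then map back
phi = theta / lamb
phi = phi - alpha * lamb * (lamb * phi)  # dL/dphi = dL/dtheta * dtheta/dphi = theta * lamb
theta_via_phi = lamb * phi

print(theta_direct, theta_via_phi)  # both 1.95: the transform rescales the lr by lamb**2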

import keras.backend as K

class SetLearningRate:
    """A layer wrapper that sets the learning rate for the wrapped layer."""

    def __init__(self, layer, lamb, is_ada=False):
        self.layer = layer
        self.lamb = lamb  # learning rate scaling ratio
        self.is_ada = is_ada  # whether an adaptive learning rate optimizer (Adam, RMSprop, ...) is used

    def __call__(self, inputs):
        with K.name_scope(self.layer.name):
            if not self.layer.built:
                input_shape = K.int_shape(inputs)
                self.layer.build(input_shape)
                self.layer.built = True
                if self.layer._initial_weights is not None:
                    self.layer.set_weights(self.layer._initial_weights)
        for key in ['kernel', 'bias', 'embeddings', 'depthwise_kernel',
                    'pointwise_kernel', 'recurrent_kernel', 'gamma', 'beta']:
            if hasattr(self.layer, key):
                weight = getattr(self.layer, key)
                if self.is_ada:
                    lamb = self.lamb  # adaptive optimizers keep the ratio lamb as-is
                else:
                    lamb = self.lamb ** 0.5  # for SGD, take the square root of lamb
                K.set_value(weight, K.eval(weight) / lamb)  # change the initialization
                setattr(self.layer, key, weight * lamb)  # replace the weight with a scaled version
        return self.layer(inputs)

x_in = Input(shape=(None,))
x = x_in

# The default would be x = Embedding(100, 1000, weights=[word_vecs])(x).
# The line below means: an adaptive optimizer will be used, and the Embedding layer
# is updated at one tenth of the overall learning rate (word_vecs are pretrained word vectors).
x = SetLearningRate(Embedding(100, 1000, weights=[word_vecs]), 0.1, True)(x)

x = LSTM(100)(x)

model = Model(x_in, x)
model.compile(loss='mse', optimizer='adam')  # optimize with an adaptive learning rate optimizer

References

jishuin.proginn.com/p/763bfbd51…

www.zhihu.com/question/33…

kexue.fm/archives/64…