When training neural networks with SGD or other optimization algorithms (Adam, Momentum, etc.), an ExponentialMovingAverage (EMA) of the parameters is often maintained. Its significance lies in using the moving-averaged parameters to improve the robustness of the model on test data.

Today we are going to introduce EMA.

What is a MovingAverage?

So let’s say we have a parameter θ whose values at different epochs are θ_1, θ_2, ..., θ_t.

Then, the MovingAverage at the end of training is:

v_t = decay * v_{t-1} + (1 - decay) * θ_t

Decay represents the decay rate, which controls the speed of model updating.

By the equation above, it’s easy to get (taking v_0 = 0)

v_t = (1 - decay) * (θ_t + decay * θ_{t-1} + decay² * θ_{t-2} + ... + decay^{t-1} * θ_1)

When t → ∞, the geometric series gives

(1 - decay) * (1 + decay + decay² + ...) = (1 - decay) * 1 / (1 - decay) = 1

so the weights on the θ values sum to 1.

Namely, v_t is a weighted average of the parameter values, and because the weights decay exponentially, only the most recent values (roughly the last 1 / (1 - decay) of them) are really relevant.
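
As a quick sanity check on the expansion above, here is a minimal Python sketch (my own illustration, with made-up θ values) showing that the recursive update and the expanded weighted sum give the same result:

decay = 0.9
thetas = [3.0, 1.0, 4.0, 1.5, 9.0]  # hypothetical parameter values, one per epoch

# recursive form: v_t = decay * v_{t-1} + (1 - decay) * theta_t, with v_0 = 0
v = 0.0
for theta in thetas:
    v = decay * v + (1.0 - decay) * theta

# expanded form: (1 - decay) * (theta_t + decay * theta_{t-1} + ... + decay^(t-1) * theta_1)
expanded = (1.0 - decay) * sum(decay ** i * theta for i, theta in enumerate(reversed(thetas)))

print(v, expanded)  # both ≈ 1.62873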

What is an ExponentialMovingAverage?

With that in mind, let’s introduce the formula for EMA:

shadow_variable = decay * shadow_variable + (1 - decay) * variable

ShadowVariable is the parameter value obtained after EMA processing, and Variable is the parameter value of the current epoch.

EMA maintains a shadow variable for each variable to be updated. The initial value of the shadow variable is the initial value of the variable.

According to the formula above, decay controls the updating speed of the model: the larger decay is, the more stable the model is. In practice, decay is usually set to a constant very close to 1 (0.999 or 0.9999).
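
To make the effect of decay concrete, here is a small illustration (my own, not from the original post): starting from a shadow value of 0.0 and repeatedly feeding in a constant value of 1.0, a larger decay keeps the shadow much closer to its old value after the same number of updates.

for decay in (0.9, 0.999):
    shadow = 0.0
    for _ in range(100):  # 100 updates with a constant new value of 1.0
        shadow = decay * shadow + (1.0 - decay) * 1.0
    print(decay, shadow)

# decay = 0.9   -> shadow ≈ 0.99997 (has almost caught up with the new value)
# decay = 0.999 -> shadow ≈ 0.0952  (still far behind, i.e. more stable / slower)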

PyTorch code implementation

Let’s look at the code implementation

class EMA():
    def __init__(self, decay):
        self.decay = decay
        self.shadow = {}  # maps parameter name -> shadow value

    def register(self, name, val):
        # the shadow variable is initialized with the variable's initial value
        self.shadow[name] = val.clone()

    def get(self, name):
        return self.shadow[name]

    def update(self, name, x):
        assert name in self.shadow
        new_average = (1.0 - self.decay) * x + self.decay * self.shadow[name]
        self.shadow[name] = new_average.clone()

Usage is divided into three steps: initialization, registration, and update.

# initialize
ema = EMA(0.999)

# register
for name, param in model.named_parameters():
    if param.requires_grad:
        ema.register(name, param.data)

# update (typically after each optimizer step)
for name, param in model.named_parameters():
    if param.requires_grad:
        ema.update(name, param.data)
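
The snippet above only maintains the shadow values during training. To actually use them at test time (the point made at the start of this post), the shadow values have to be loaded into the model before evaluation. Below is a minimal sketch, assuming a standard PyTorch model; the helper name make_ema_model is my own and not part of the original code.

import copy

def make_ema_model(model, ema):
    # build a separate evaluation copy so the training weights stay untouched,
    # then overwrite its parameters with the EMA shadow values
    ema_model = copy.deepcopy(model)
    for name, param in ema_model.named_parameters():
        if param.requires_grad:
            param.data.copy_(ema.get(name))
    return ema_model

# eval_model = make_ema_model(model, ema)
# eval_model.eval()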