0x00 Abstract

This is the first of several articles on distributed optimizers. The series has three parts, progressing in depth: the foundations, the optimizers used for data parallelism in DP/DDP/Horovod, and the PyTorch distributed optimizer.

This article covers the foundations. It explains the structure of a model, the basic principles of the optimizer, how the two interact, and how the optimizer updates the model, which lays the groundwork for the step-by-step analysis in later articles.

Other articles in the PyTorch distributed series:

PyTorch distributed (1)—— History and Overview

PyTorch how to use GPU

PyTorch distributed (2) —– DataParallel – gradient

PyTorch distributed (3) —– DataParallel – gradient

PyTorch distributed (4)—— Distributed application concepts

—— DistributedDataParallel – what to use

DistributedDataParallel: gradient

—– DistributedDataParallel – conditional processing groups

PyTorch distributed (8): DistributedDataParallel

—– DistributedDataParallel – gradient initialization

PyTorch distributed (10): DistributedDataParallel Reducer static architecture

—– DistributedDataParallel constructs Reducer and Join operations

—– DistributedDataParallel – gradient forward propagation

—– DistributedDataParallel – gradient back-propagation

PyTorch distributed Autograd (1) —- design

PyTorch Distributed Autograd (2) —- RPC foundation

PyTorch Distributed Autograd (3) —-

PyTorch Distributed Autograd (4) —-

PyTorch Distributed Autograd (5) —-

PyTorch Distributed Autograd (6) —-

For better illustration, the code in this article will be streamlined accordingly.

0x01 Start with the problem

The figure below comes from the Kuaishou Bagua paper and compares the native training process with DDP/Horovod. "Vanilla" at the top is the native training process, and the U stage is the optimizer step. The main job of a conventional optimizer is to update the model's current parameters according to their gradients: w.data -= w.grad * lr.
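To make the update rule concrete, here is a minimal sketch of one gradient-descent step written by hand, without any optimizer object. The parameter w, the loss, and the learning rate are hypothetical values chosen only for illustration.

```python
import torch

# Hypothetical toy parameter and learning rate, used only to illustrate the rule.
w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1

loss = (w ** 2).sum()      # d(loss)/dw = 2 * w
loss.backward()            # the autograd engine fills w.grad

before = w.detach().clone()
with torch.no_grad():      # the update itself must not be recorded by autograd
    w -= w.grad * lr       # same rule as w.data -= w.grad * lr

print(before, w.grad, w)
```

An optimizer such as SGD packages this kind of update, plus extras such as momentum and weight decay, behind step().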

1.1 Example

Let’s use an example to see how to train.

import torch
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

net = ToyModel()
optimizer = optim.SGD(params=net.parameters(), lr=1)
optimizer.zero_grad()          # clear gradients left over from a previous iteration
input = torch.randn(10, 10)
outputs = net(input)
outputs.backward(outputs)      # outputs is not a scalar, so a gradient tensor is passed explicitly
optimizer.step()               # update the parameters from their .grad

A rough sketch of the backward computation is shown below.

1.2 Questions

Drawing on our earlier analysis of the autograd engine, we list several questions to guide the analysis, in this order: build the optimizer from the model parameters -> the engine computes gradients -> the optimizer optimizes the parameters -> the optimizer updates the model. We already know that the autograd engine computes the gradients, so the open questions are:

  • Build the optimizer from the model parameters

    • The optimizer is constructed with optimizer = optim.SGD(params=net.parameters(), lr=1), so it looks as if params is assigned to an internal member variable of the optimizer (we assume it is called parameters).
      1. The model contains two Linear layers. How do these layers update their parameters?
  • The engine computes gradients

    • How is it ensured that gradients are computed for the Linear layers?
      2. How do the computed gradients map to the Linear parameters of the model? Where do the gradients computed by the engine accumulate?
  • The optimizer optimizes the parameters

      3. Calling step optimizes the optimizer's internal member variable self.parameters.
  • The optimizer updates the model

      4. How do updates to self.parameters become updates to the model parameters (for example, those of Linear)?

The numbers and question marks in the figure below correspond to the above four questions.

      +-------------------------------------------+                    +------------------+
      |ToyModel                                   |                    | Engine           |
      |                                           | forward / backward |                  |
      | Linear(10, 10)+--> ReLU +--> Linear(10, 5)| +----------------> | Compute gradient |
      |                                           |                    |        +         |
      +-------------------+-----------------------+                    |        |         |
                          |                                            |        |         |
                   1 ???  | parameters()                               +------------------+
                          |                                                     |
                          |                                                     | gradient
                          |   ^                                                 |
                          |   |                                                 v
                          |   | 4 ???                                         2 ???
                          |   |
      +------------------------------------------+
      |SGD                |   |                  |
      |                   |   |                  |
      |                   v   +                  |
      |                                          |
^ +---------------> self.parameters  +---------------->
|     |                                          |    |
|     |                                          |    |
|     +------------------------------------------+    |
|                                                     |
<---------------------------------------------------+ v
                     3 step()

We need to analyze it step by step.

0x01 Model construction

Since the optimizer optimizes and updates the parameters of the model, we first look at the model itself.

1.1 Module

To define a model in PyTorch, you generally subclass nn.Module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

Module is defined as follows:

class Module:
    r"""Base class for all neural network modules.

    Your models should also subclass this class.

    Modules can also contain other Modules, allowing to nest them in
    a tree structure. You can assign the submodules as regular attributes::

        import torch.nn as nn
        import torch.nn.functional as F

        class Model(nn.Module):
            def __init__(self):
                super(Model, self).__init__()
                self.conv1 = nn.Conv2d(1, 20, 5)
                self.conv2 = nn.Conv2d(20, 20, 5)

            def forward(self, x):
                x = F.relu(self.conv1(x))
                return F.relu(self.conv2(x))

    Submodules assigned in this way will be registered, and will have their
    parameters converted too when you call :meth:`to`, etc.

    :ivar training: Boolean represents whether this module is in training or
                    evaluation mode.
    :vartype training: bool
    """

    dump_patches: bool = False
    _version: int = 1
    training: bool
    _is_full_backward_hook: Optional[bool]

    def __init__(self):
        """
        Initializes internal Module state, shared by both nn.Module and ScriptModule.
        """
        torch._C._log_api_usage_once("python.nn_module")

        self.training = True
        self._parameters = OrderedDict()
        self._buffers = OrderedDict()
        self._non_persistent_buffers_set = set()
        self._backward_hooks = OrderedDict()
        self._is_full_backward_hook = None
        self._forward_hooks = OrderedDict()
        self._forward_pre_hooks = OrderedDict()
        self._state_dict_hooks = OrderedDict()
        self._load_state_dict_pre_hooks = OrderedDict()
        self._modules = OrderedDict()

1.2 Member Variables

Module has the following important member variables, which fall roughly into three categories.

Basic type:

  • _parameters: Weight parameters of type Tensor, used in forward and backward propagation. Saving a model means saving these parameters. The parameters() method recursively retrieves all parameters of the model; note that it returns an iterator.
  • _buffers: Stores tensors that are not network parameters but still need to be persisted, such as the running_mean of BatchNorm.
  • _modules: Stores child objects of type Module. When collecting the parameters of a model, PyTorch recursively traverses all _modules.
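The three containers can be told apart on a small model. The sketch below uses a hypothetical block made of a Linear layer and a BatchNorm1d layer, and lists what ends up in each category through the public named_* accessors.

```python
import torch.nn as nn

# Hypothetical small model: a Linear layer followed by BatchNorm1d.
block = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))

# Learnable tensors, collected recursively from the children.
param_names = [name for name, _ in block.named_parameters()]
# Persistent non-learnable tensors, such as BatchNorm running statistics.
buffer_names = [name for name, _ in block.named_buffers()]
# Direct child modules stored in _modules.
child_names = [name for name, _ in block.named_children()]

print(param_names)
print(buffer_names)
print(child_names)
```

The container itself owns no parameters; everything is reached by walking _modules, which matches the empty _parameters seen on ToyModel in the runtime dump later in this section.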

Calculation related types:

During a model computation, the hooks and forward run in the following order:

 _forward_pre_hooks ----> forward ----> _forward_hooks ----> _backward_hooks

Details are as follows:

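  • _forward_pre_hooks: Run before forward. A hook receives the forward input and is usually used to inspect it; it may also return a modified input.

  • _forward_hooks: Run after forward. A hook receives the input and output of forward and may return a modified output.

  • _backward_hooks: Run after backward has computed the gradients with respect to the module's inputs. A hook receives grad_input and grad_output.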
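The call order can be observed by registering one hook of each kind and recording when it fires. This is a minimal sketch; the calls list and the tiny Linear layer are hypothetical, and register_full_backward_hook assumes a PyTorch version that provides it (1.8 or later).

```python
import torch
import torch.nn as nn

calls = []                      # hypothetical log of hook invocations
layer = nn.Linear(3, 2)

# Each hook only records its name and returns None, so nothing is modified.
layer.register_forward_pre_hook(lambda mod, inp: calls.append("forward_pre"))
layer.register_forward_hook(lambda mod, inp, out: calls.append("forward"))
layer.register_full_backward_hook(lambda mod, gin, gout: calls.append("backward"))

x = torch.randn(4, 3, requires_grad=True)
layer(x).sum().backward()       # one forward pass followed by one backward pass

print(calls)
```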

Save/load related:

The following members relate to saving and loading. A model is saved with torch.save(cn.state_dict()…) and restored with load_state_dict(state_dict).

  • _load_state_dict_pre_hooks: Actions to run when _load_from_state_dict is called to load the model.
  • _state_dict_hooks: Actions to run when the state_dict method is called.
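A save and load round trip can be sketched as follows. To keep the example self-contained it writes to an in-memory buffer instead of a file; the two Linear layers src and dst are hypothetical stand-ins for a trained model and a freshly constructed one.

```python
import io
import torch
import torch.nn as nn

src = nn.Linear(3, 2)           # stands in for a trained model
dst = nn.Linear(3, 2)           # same architecture, different random weights

buffer = io.BytesIO()           # in-memory stand-in for a checkpoint file
torch.save(src.state_dict(), buffer)
buffer.seek(0)
dst.load_state_dict(torch.load(buffer))

print(list(src.state_dict().keys()))
```

After load_state_dict, dst holds the same weight and bias values as src, while remaining a separate object.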

The runtime state of the example model is as follows:

net = {ToyModel} 
 T_destination = {TypeVar} ~T_destination
 dump_patches = {bool} False
 net1 = {Linear} Linear(in_features=10, out_features=10, bias=True)
 net2 = {Linear} Linear(in_features=10, out_features=5, bias=True)
 relu = {ReLU} ReLU()
 training = {bool} True
  _backward_hooks = {OrderedDict: 0} OrderedDict()
  _buffers = {OrderedDict: 0} OrderedDict()
  _forward_hooks = {OrderedDict: 0} OrderedDict()
  _forward_pre_hooks = {OrderedDict: 0} OrderedDict()
  _is_full_backward_hook = {NoneType} None
  _load_state_dict_pre_hooks = {OrderedDict: 0} OrderedDict()
  _modules = {OrderedDict: 3} OrderedDict([('net1', Linear(in_features=10, out_features=10, bias=True)), ('relu', ReLU()), ('net2', Linear(in_features=10, out_features=5, bias=True))])
  _non_persistent_buffers_set = {set: 0} set()
  _parameters = {OrderedDict: 0} OrderedDict()
  _state_dict_hooks = {OrderedDict: 0} OrderedDict()
  _version = {int} 1

1.3 _parameters

The optimizer optimizes _parameters, so it deserves a closer look.

1.3.1 Construction

Let's first look at how a Parameter is created. The key point is requires_grad=True: a Parameter needs its gradient computed.

By default a tensor does not need to be differentiated, so its requires_grad attribute defaults to False. If requires_grad is set to True on a node, that node needs a gradient, and every node that depends on it also has requires_grad=True.

class Parameter(torch.Tensor):
    r"""A kind of Tensor that is to be considered a module parameter.

    Parameters are :class:`~torch.Tensor` subclasses, that have a
    very special property when used with :class:`Module` s - when they're
    assigned as Module attributes they are automatically added to the list of
    its parameters, and will appear e.g. in :meth:`~Module.parameters` iterator.
    Assigning a Tensor doesn't have such effect. This is because one might
    want to cache some temporary state, like last hidden state of the RNN, in
    the model. If there was no such class as :class:`Parameter`, these
    temporaries would get registered too.

    Args:
        data (Tensor): parameter tensor.
        requires_grad (bool, optional): if the parameter requires gradient. See
            :ref:`locally-disable-grad-doc` for more details. Default: `True`
    """
    def __new__(cls, data=None, requires_grad=True):  # gradients are computed by default
        if data is None:
            data = torch.tensor([])
        return torch.Tensor._make_subclass(cls, data, requires_grad)
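The defaults and the propagation rule described above can be checked directly. In this small sketch the tensors a, b, c, d and the parameter p are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn

a = torch.ones(2)                        # plain tensor: requires_grad defaults to False
b = torch.ones(2, requires_grad=True)    # leaf tensor that needs a gradient
p = nn.Parameter(torch.ones(2))          # Parameter: requires_grad defaults to True

c = a * 2        # depends only on a, so no gradient is tracked
d = a + b        # depends on b, so it inherits requires_grad=True
d.sum().backward()

print(a.requires_grad, p.requires_grad, c.requires_grad, d.requires_grad)
print(a.grad, b.grad)
```

Only b receives a gradient here; a stays without one because nothing asked autograd to track it.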

1.3.2 Registration

If a member of a class is an instance of Parameter, nn.Module uses the __setattr__ mechanism to register it in _parameters. Linear's weight and bias are examples.

def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:

    # omit...

    params = self.__dict__.get('_parameters')
    if isinstance(value, Parameter):
        remove_from(self.__dict__, self._buffers, self._modules, self._non_persistent_buffers_set)
        self.register_parameter(name, value)  # register the Parameter in _parameters


def register_parameter(self, name: str, param: Optional[Parameter]) -> None:
    r"""Adds a parameter to the module.

    The parameter can be accessed as an attribute using given name.

    Args:
        name (string): name of the parameter. The parameter can be accessed
            from this module using the given name
        param (Parameter): parameter to be added to the module.
    """

    # omit various checks

    if param is None:
        self._parameters[name] = None
    elif not isinstance(param, Parameter):
        raise TypeError("cannot assign '{}' object to parameter '{}' "
                        "(torch.nn.Parameter or None required)"
                        .format(torch.typename(param), name))
    elif param.grad_fn:
        raise ValueError(
            "Cannot assign non-leaf Tensor to parameter '{0}'. Model "
            "parameters must be created explicitly. To express '{0}' "
            "as a function of another Tensor, compute the value in "
            "the forward() method.".format(name))
    else:
        self._parameters[name] = param  # added here
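The effect of this __setattr__ logic is easy to see from the outside. The sketch below defines a hypothetical Demo module with one Parameter attribute and one plain Tensor attribute, and checks which of them is registered.

```python
import torch
import torch.nn as nn

class Demo(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(3))   # Parameter: registered in _parameters
        self.cache = torch.zeros(3)                # plain Tensor: ordinary attribute

demo = Demo()
names = [name for name, _ in demo.named_parameters()]
print(names)
```

Because cache is not registered, an optimizer built from demo.parameters() would never see or update it.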

1.3.3 Access

The _parameters variable is an internal member and is not meant to be accessed directly. Use the parameters method, which returns an iterator.

Such as:

for param in net.parameters():
    print(type(param), param.size())

Output:

<class 'torch.nn.parameter.Parameter'> torch.Size([10, 10])
<class 'torch.nn.parameter.Parameter'> torch.Size([10])
<class 'torch.nn.parameter.Parameter'> torch.Size([5, 10])
<class 'torch.nn.parameter.Parameter'> torch.Size([5])

The parameters code is as follows.

def parameters(self, recurse: bool = True) -> Iterator[Parameter]:
    r"""Returns an iterator over module parameters.

    This is typically passed to an optimizer.

    Args:
        recurse (bool): if True, then yields parameters of this module
            and all submodules. Otherwise, yields only parameters that
            are direct members of this module.

    Yields:
        Parameter: module parameter

    Example::

        >>> for param in model.parameters():
        >>>     print(type(param), param.size())
        <class 'torch.Tensor'> (20L,)
        <class 'torch.Tensor'> (20L, 1L, 5L, 5L)

    """
    for name, param in self.named_parameters(recurse=recurse):
        yield param

Next, look at named_parameters. Its core is module._parameters.items(), which yields (name, parameter) pairs that can be iterated over.

def named_parameters(self, prefix: str = '', recurse: bool = True) -> Iterator[Tuple[str, Parameter]]:
    r"""Returns an iterator over module parameters, yielding both the
    name of the parameter as well as the parameter itself.

    Args:
        prefix (str): prefix to prepend to all parameter names.
        recurse (bool): if True, then yields parameters of this module
            and all submodules. Otherwise, yields only parameters that
            are direct members of this module.

    Yields:
        (string, Parameter): Tuple containing the name and parameter

    Example::

        >>> for name, param in self.named_parameters():
        >>>     if name in ['bias']:
        >>>         print(param.size())

    """
    gen = self._named_members(
        lambda module: module._parameters.items(),
        prefix=prefix, recurse=recurse)
    for elem in gen:
        yield elem

Note that we now have two key pieces of knowledge:

  • Parameter is constructed with requires_grad=True by default, which means its gradient is computed by default.
  • The parameters method returns an Iterator.

The SGD optimizer is therefore built from an iterator over ToyModel._parameters, which shows that the optimizer works directly on ToyModel's _parameters. We can thus remove the question mark for 4) in the original figure.

      +-------------------------------------------+                    +------------------+
      |ToyModel                                   |                    | Engine           |
      |                                           | forward / backward |                  |
      | Linear(10, 10)+--> ReLU +--> Linear(10, 5)| +----------------> | Compute gradient |
      |                                           |                    |        +         |
      |         para_iterator = parameters()      |                    |        |         |
      |                   +          ^            |                    |        |         |
      |                   |          |            |                    +------------------+
      +-------------------------------------------+                             |
                          |          |                                          | gradient
                          |          |                                          |
                  1 ???   |          | 4 update                                 v
                          |          |                                       2 ???
                          |          |
      +----------------------------------------------------------------+
      |SGD                |          |                                 |
      |                   |          |                                 |
      |                   v          |                                 |
      |                              +                                 |
^ +--------> self.parameters = para_iterator(ToyModel._parameters) +-------->
|     |                                                                |    |
|     |                                                                |    |
|     +----------------------------------------------------------------+    |
|                                                                           |
<-------------------------------------------------------------------------+ v
                     3 step()
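The claim that the optimizer works on the model's own parameters can be verified with an identity check. In this sketch the small Linear layer and the learning rate are hypothetical; the point is that the objects stored in optimizer.param_groups are the very same Parameter objects the model owns, so step() changes the model in place.

```python
import torch
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(2, 1)
optimizer = optim.SGD(net.parameters(), lr=0.5)

# The optimizer materializes the iterator into a list of the same objects.
held = optimizer.param_groups[0]["params"]
same_objects = all(p is q for p, q in zip(held, net.parameters()))

before = net.weight.detach().clone()
net(torch.ones(1, 2)).sum().backward()   # d(out)/d(weight) = [[1, 1]]
optimizer.step()                         # weight -= lr * grad, applied in place

print(same_objects, before, net.weight)
```

No copy is made at any point, which is why no separate step is needed to write the update back to the model.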

1.4 Linear

torch.nn.Linear applies a linear transformation to its input and is generally used to build fully connected layers.

1.4.1 Usage

An example of using torch.nn.Linear in PyTorch is shown below.

input = torch.randn(2, 3)
linear = nn.Linear(3, 4)
out = linear(input)
print(out)

# The output is as follows (the values are random):
# tensor([[-0.6938, 0.0543, -1.4393, -0.3554],
#         [...0.4653, -0.2421, -0.8236, -0.1872]], grad_fn=<AddmmBackward>)

1.4.2 Definition

Linear is defined as follows. As you can see, its main members are two parameters:

  • self.weight = Parameter(...)
  • self.bias = Parameter(...)

As noted above, a Parameter is created with requires_grad=True, so gradients are computed for weight and bias.

class Linear(Module):
    r"""Applies a linear transformation to the incoming data: :math:`y = xA^T + b`

    This module supports :ref:`TensorFloat32<tf32_on_ampere>`.

    Args:
        in_features: size of each input sample
        out_features: size of each output sample
        bias: If set to ``False``, the layer will not learn an additive bias.
            Default: ``True``

    Shape:
        - Input: :math:`(N, *, H_{in})` where :math:`*` means any number of
          additional dimensions and :math:`H_{in} = \text{in\_features}`
        - Output: :math:`(N, *, H_{out})` where all but the last dimension
          are the same shape as the input and :math:`H_{out} = \text{out\_features}`.

    Attributes:
        weight: the learnable weights of the module of shape
            :math:`(\text{out\_features}, \text{in\_features})`. The values are
            initialized from :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})`, where
            :math:`k = \frac{1}{\text{in\_features}}`
        bias:   the learnable bias of the module of shape :math:`(\text{out\_features})`.
                If :attr:`bias` is ``True``, the values are initialized from
                :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})` where
                :math:`k = \frac{1}{\text{in\_features}}`

    Examples::

        >>> m = nn.Linear(20, 30)
        >>> input = torch.randn(128, 20)
        >>> output = m(input)
        >>> print(output.size())
        torch.Size([128, 30])
    """
    __constants__ = ['in_features', 'out_features']
    in_features: int
    out_features: int
    weight: Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
        if bias:
            self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self) -> None:
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            init.uniform_(self.bias, -bound, bound)

    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight, self.bias)

    def extra_repr(self) -> str:
        return 'in_features={}, out_features={}, bias={}'.format(
            self.in_features, self.out_features, self.bias is not None
        )

1.4.3 Explanation

As the output above shows, the backward function of torch.nn.Linear is AddmmBackward.

struct TORCH_API AddmmBackward : public TraceableFunction {
  using TraceableFunction::TraceableFunction;
  variable_list apply(variable_list&& grads) override;
  std::string name() const override { return "AddmmBackward"; }

  void release_variables() override {
    std::lock_guard<std::mutex> lock(mutex_);
    mat2_.reset_data();
    mat1_.reset_data();
  }

  std::vector<int64_t> mat1_sizes;
  std::vector<int64_t> mat1_strides;
  SavedVariable mat2_;
  at::Scalar alpha;
  SavedVariable mat1_;
  std::vector<int64_t> mat2_sizes;
  std::vector<int64_t> mat2_strides;
  at::Scalar beta;
};

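Looking up addmm in the code, its comment shows that it is a matrix multiply-add operation: it computes beta * mat + alpha * (mat1 @ mat2). The definition shown here is the sparse variant, which behaves the same as torch.addmm in the forward pass.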

def addmm(mat: Tensor, mat1: Tensor, mat2: Tensor,
          beta: float = 1., alpha: float = 1.) -> Tensor:
    r"""
    This function does exact same thing as :func:`torch.addmm` in the forward,
    except that it supports backward for sparse matrix :attr:`mat1`. :attr:`mat1`
    need to have `sparse_dim = 2`. Note that the gradients of :attr:`mat1` is a
    coalesced sparse tensor.

    Args:
        mat (Tensor): a dense matrix to be added
        mat1 (Tensor): a sparse matrix to be multiplied
        mat2 (Tensor): a dense matrix to be multiplied
        beta (Number, optional): multiplier for :attr:`mat` (:math:`\beta`)
        alpha (Number, optional): multiplier for :math:`mat1 @ mat2` (:math:`\alpha`)
    """
    return torch._sparse_addmm(mat, mat1, mat2, beta=beta, alpha=alpha)
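The link between Linear and addmm can be checked numerically. The sketch below, with hypothetical shapes, compares the output of a Linear layer with the explicit formula y = x W^T + b and with the dense torch.addmm call, whose default beta and alpha are both 1.

```python
import torch
import torch.nn as nn

linear = nn.Linear(3, 4)
x = torch.randn(2, 3)

out = linear(x)
# The formula from the Linear docstring, written out by hand.
manual = x @ linear.weight.t() + linear.bias
# The same computation as one fused op: beta * bias + alpha * (x @ W^T).
fused = torch.addmm(linear.bias, x, linear.weight.t())

print(torch.allclose(out, manual, atol=1e-6), torch.allclose(out, fused, atol=1e-6))
```

Since the forward pass is a single multiply-add, the backward node recorded for it belongs to the addmm family, which is what the grad_fn in the earlier output reflects.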

Now we can extend the picture.

  • The weight and bias in Linear are Parameters.
    • Parameter is constructed with requires_grad=True by default, so its gradient is computed by default.
    • Therefore the engine must compute gradients for Linear's weight and bias.
  • ToyModel's _parameters member variable is retrieved through the parameters method, which returns an iterator.
    • This iterator is passed as an argument when building the SGD optimizer.
    • The SGD optimizer is therefore built from an iterator over ToyModel._parameters. This shows that the optimizer works directly on ToyModel's _parameters. For the fully connected layers, this corresponds to the two arrows from the Linear layers to parameters() in the figure below.

+--------------------------------------------------+                   +------------------+
| ToyModel                                         |                   | Engine           |
| +-------------------+             +------------+ |forward / backward |                  |
| | Linear(10,10)     +--> ReLU +-->|Linear(10,5)| +-----------------> | Compute gradient |
| |                   |             |            | |                   |        +         |
| |  weight=Parameter |             |    weight  | |                   |        |         |
| |                   +----------+  |            | |                   |        |         |
| |  bias=Parameter   |          |  |    bias    | |                   +------------------+
| |                   |          |  |            | |                            |
| +-------------------+          |  +--+---------+ |                          2 | gradient
|                                |     |           |                            |
|                                |     |           |                            v
|                                v     v           |                           ???
|               para_iterator = parameters()       |
|                         +          ^             |
|                         |          |             |
|                         |          |             |
+--------------------------------------------------+
                          |          |
                   1????? | |4 update
                          |          |
                          |          |
      +----------------------------------------------------------------+
      |SGD                |          |                                 |
      |                   |          |                                 |
      |                   v          |                                 |
      |                              +                                 |
^ +--------> self.parameters = para_iterator(ToyModel._parameters) +-------->
|     |                                                                |    |
|     |                                                                |    |
|     +----------------------------------------------------------------+    |
|                                                                           |
<-------------------------------------------------------------------------+ v
                     3 step()

0x02 Optimizer base class

Optimizer is the base class of all optimizers. Its main public methods are:

  • add_param_group: Adds a group of learnable parameters.
  • step: Performs a single parameter update.
  • zero_grad: Clears the gradients from the previous iteration; call it before backward propagation computes new gradients.
  • state_dict: Returns the parameters and state as a dict.
  • load_state_dict: Loads the parameters and state from a dict.
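The reason zero_grad exists is that backward accumulates into .grad instead of overwriting it. A minimal sketch with a hypothetical one-element parameter w shows the accumulation and the reset; depending on the PyTorch version, zero_grad either sets .grad to None or fills it with zeros, so the check below accepts both.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

(3 * w).sum().backward()
first = w.grad.clone()          # gradient of 3*w with respect to w is 3

(3 * w).sum().backward()        # second backward without clearing
accumulated = w.grad.clone()    # 3 + 3: gradients add up

optimizer.zero_grad()
cleared = w.grad                # None or a zero tensor, depending on the version

print(first, accumulated, cleared)
```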

2.1 Initialization

The Optimizer initialization function does the following:

  • Takes the learnable parameters (params) and the hyperparameters (defaults) as arguments.
  • Saves lr, momentum and other global hyperparameters in self.defaults.
  • Saves the current state of the optimizer in self.state.
  • Saves all variables to be optimized in self.param_groups.

class Optimizer(object):

    def __init__(self, params, defaults):
        torch._C._log_api_usage_once("python.optimizer")
        self.defaults = defaults  # save lr, momentum and other global hyperparameters

        self._hook_for_profile()

        if isinstance(params, torch.Tensor):  # params must be an iterable of Tensors or dicts, not a single Tensor
            raise TypeError("params argument given to the optimizer should be "
                            "an iterable of Tensors or dicts, but got " +
                            torch.typename(params))

        self.state = defaultdict(dict)  # current state of the optimizer
        self.param_groups = []  # each entry is a dict holding a group of parameters to optimize plus its options

        param_groups = list(params)  # the variables to optimize, as passed to __init__
        if len(param_groups) == 0:
            raise ValueError("optimizer got an empty parameter list")
        if not isinstance(param_groups[0], dict):  # a plain parameter list is wrapped into a single group dict
            param_groups = [{'params': param_groups}]

        for param_group in param_groups:
            self.add_param_group(param_group)  # add every group to self.param_groups

2.2 Adding variables to be optimized

The code above uses add_param_group, so let’s look at this function.

add_param_group adds a group of learnable parameters. The code is shown below (most of the validation code is omitted). Each group in param_groups is a dict, so the variables to be optimized and their options are accessed by key, which is especially useful for fine-tuning.

def add_param_group(self, param_group):
    r"""Add a param group to the :class:`Optimizer` s `param_groups`.

    This can be useful when fine tuning a pre-trained network as frozen layers can be made
    trainable and added to the :class:`Optimizer` as training progresses.

    Args:
        param_group (dict): Specifies what Tensors should be optimized along with group
            specific optimization options.
    """
    assert isinstance(param_group, dict), "param group must be a dict"

    params = param_group['params']  # get the variables to be optimized
    if isinstance(params, torch.Tensor):
        param_group['params'] = [params]  # wrap a single Tensor in a list
    elif isinstance(params, set):
        raise TypeError('optimizer parameters need to be organized in ordered collections, but '
                        'the ordering of tensors in sets will change between runs. Please use a list instead.')
    else:
        param_group['params'] = list(params)

    # omitted: check that every parameter is a Tensor and a leaf node

    for name, default in self.defaults.items():  # copy the default hyperparameters into the group
        if default is required and name not in param_group:
            raise ValueError("parameter group didn't specify a value of required optimization parameter " +
                             name)
        else:
            param_group.setdefault(name, default)  # a group keeps its own value if it set one

    # collect the parameters already held by existing groups, to detect duplicates
    params = param_group['params']
    param_set = set()
    for group in self.param_groups:
        param_set.update(set(group['params']))

    # omitted: the check that rejects parameters already present in another group

    self.param_groups.append(param_group)  # append the new group to param_groups
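A typical fine-tuning setup shows how groups and defaults interact. In this sketch backbone and head are hypothetical layers: the first group overrides lr, while the group added later through add_param_group inherits every default passed to the constructor.

```python
import torch.nn as nn
import torch.optim as optim

backbone = nn.Linear(4, 4)   # hypothetical "pre-trained" part, small learning rate
head = nn.Linear(4, 2)       # hypothetical new part, default learning rate

optimizer = optim.SGD(
    [{"params": backbone.parameters(), "lr": 0.001}],  # group-specific lr
    lr=0.1, momentum=0.9,                              # defaults for all groups
)
optimizer.add_param_group({"params": head.parameters()})

lrs = [g["lr"] for g in optimizer.param_groups]
momenta = [g["momentum"] for g in optimizer.param_groups]
print(lrs, momenta)
```

This is the setdefault call in add_param_group at work: a key already present in the group is kept, and a missing key is filled from self.defaults.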

2.3 Examples of variables to be optimized

Let’s print param_groups as follows.

import torch.nn as nn
import torch.optim as optim

net = nn.Linear(3, 3)
nn.init.constant_(net.weight, val=10)
nn.init.constant_(net.bias, val=5)
optimizer = optim.SGD(net.parameters(), lr=0.025)
print(optimizer.param_groups)

The result is as follows. The first tensor, 3 x 3, is the weight matrix of net, and the second, of length 3, is its bias.

[{'params': [Parameter containing:
tensor([[10., 10., 10.],
        [10., 10., 10.],
        [10., 10., 10.]], requires_grad=True), Parameter containing:
tensor([5., 5., 5.], requires_grad=True)
    ],
  'lr': 0.025,
  'momentum': 0,
  'dampening': 0,
  'weight_decay': 0,
  'nesterov': False
  }
]

2.4 Optimizer state

2.4.1 Definition

PyTorch’s state_dict is a Python dictionary object.

  • For a model, state_dict maps each layer to the parameters it learns during training (such as weight and bias). Only layers with trainable parameters, such as convolution layers and linear layers, appear in the model's state_dict.

  • For an optimizer, state_dict is its state information, which contains two entries:

    • state: A dictionary containing the current state of the optimizer (that is, the latest cached variables computed during updates, such as momentum buffers).
      • The key of the dictionary is the index of the parameter the cache belongs to.
      • The value is itself a dictionary, whose key is the name of the cached variable and whose value is the corresponding tensor.
    • param_groups: A list containing all param groups, each of which is a dictionary.

def state_dict(self) :
    r"""Returns the state of the optimizer as a :class:`dict`. It contains two entries: * state - a dict holding current optimization state. Its content differs between optimizer classes. * param_groups - a dict containing all parameter groups """
    # Save order indices instead of Tensors
    param_mappings = {}
    start_index = 0

    def pack_group(group) :
        nonlocal start_index
        # 'params' takes a different rule
        packed = {k: v for k, v in group.items() ifk ! ='params'}
        param_mappings.update({id(p): i for i, p in enumerate(group['params'], start_index)
                               if id(p) not in param_mappings})
        # save the id of the parameter, not the value of the parameter
        packed['params'] = [param_mappings[id(p)] for p in group['params']]
        start_index += len(packed['params'])
        return packed

    # Pack through self.param_groups
    param_groups = [pack_group(g) for g in self.param_groups]
    
    # Replace every Tensor key in state with its order index
    # Remap state to use order indices as keys
    packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
                    for k, v in self.state.items()}
    
    return { # return dictionary form
        'state': packed_state, # state
        'param_groups': param_groups, # Parameters to be optimized
    }
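To make the packing step concrete, here is a minimal sketch in plain Python that mirrors the logic of pack_group above. FakeParam and pack_param_groups are hypothetical names introduced only for illustration; they stand in for real tensors and for the nested function inside state_dict.

```python
class FakeParam:
    """Hypothetical stand-in for a parameter tensor; only object identity matters here."""

    def __init__(self, name):
        self.name = name


def pack_param_groups(param_groups):
    """Replace parameter objects with integer indices, mirroring pack_group."""
    param_mappings = {}
    start_index = 0
    packed_groups = []
    for group in param_groups:
        # Copy every setting except 'params'
        packed = {k: v for k, v in group.items() if k != 'params'}
        # Assign the next free indices to parameters not seen before
        param_mappings.update({id(p): i
                               for i, p in enumerate(group['params'], start_index)
                               if id(p) not in param_mappings})
        # Store indices instead of the parameter objects themselves
        packed['params'] = [param_mappings[id(p)] for p in group['params']]
        start_index += len(packed['params'])
        packed_groups.append(packed)
    return packed_groups, param_mappings


w1, b1, w2 = FakeParam('w1'), FakeParam('b1'), FakeParam('w2')
groups = [
    {'params': [w1, b1], 'lr': 0.1},
    {'params': [w2], 'lr': 0.01},
]
packed_groups, mappings = pack_param_groups(groups)
print(packed_groups)
```

The two groups come out with params [0, 1] and [2]: indices keep counting across groups, which is why the saved state can be keyed by a single integer per parameter.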

2.4.2 Example 1

We add the following print statements to example 1 to look at the optimizer's internal variables:

# print model's state_dict
print('Model.state_dict:')
for param_tensor in model.state_dict():
    print(param_tensor, '\t', model.state_dict()[param_tensor].size())

# print optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, '\t', optimizer.state_dict()[var_name])

The results are as follows:

Model.state_dict:
net1.weight 	 torch.Size([10, 10])
net1.bias 	 torch.Size([10])
net2.weight 	 torch.Size([5, 10])
net2.bias 	 torch.Size([5])

Optimizer's state_dict:
state 	 {}
param_groups 	 [{'lr': 0.001, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0, 1, 2, 3]}]

2.4.3 Example 2

Example 2 optimizes a function with SGD.

from math import pi
import torch
import torch.optim

x = torch.tensor([pi/2, pi/3], requires_grad=True)
optimizer = torch.optim.SGD([x,], lr=0.2, momentum=0.5)

for step in range(11):
    if step:
        # From the second iteration on, f holds the value computed in the previous iteration
        optimizer.zero_grad()
        f.backward()
        optimizer.step()

        for var_name in optimizer.state_dict():
            print(var_name, '\t', optimizer.state_dict()[var_name])
    # The function being optimized
    f = -((x.sin() ** 3).sum()) ** 3

The output is shown below, showing the optimization process.

state 	 {0: {'momentum_buffer': tensor([ 1.0704e-06, -9.1831e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([-1.2757e-06, -4.0070e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([-3.4580e-07, -4.7366e-01])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([7.3855e-07, 1.3584e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([7.2726e-07, 1.6619e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([-3.1580e-07,  8.4152e-01])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([2.3738e-07, 5.8072e-01])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([5.2412e-07, 8.4104e-01])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([-5.1160e-07,  1.9660e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([4.9517e-07, 7.2053e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

We can now update the earlier diagram: the member variable inside SGD is named param_groups. It holds the optimizer's optimization targets, which come from the iterator over ToyModel._parameters.

 +-------------------------------------------------+                   +------------------+
 |ToyModel                                         |                   | Engine           |
 | +------------------+             +------------+ |forward / backward |                  |
 | |Linear(10, 10)    +--> ReLU +-->+Linear(10,5)| +-----------------> | Compute gradient |
 | |                  |             |            | |                   |        +         |
 | |  weight=Parameter|             |    weight  | |                   |        |         |
 | |                  +-----------+ |    bias    | |                   |        |         |
 | |  bias=Parameter  |           | +--+---------+ |                   +------------------+
 | |                  |           |    |           |                            |
 | +------------------+           |    |           |                          2 | gradient
 |                                v    v           |                            |
 |                         self._parameters        |                            v
 |                                  +              |                           ???
 |                                  |              |
 |                                  |              |
 |                                  v              |
 |              para_iterator = parameters()       |
 |                        +          ^             |
 |                        |          |             |
 |                        |          |             |
 +-------------------------------------------------+
                          |          |
                    1 ??? |          | 4 update
                          |          |
      +----------------------------------------------------------------+
      |SGD                |          |                                 |
      |                   |          |                                 |
      |                   v          |                                 |
      |                              +                                 |
^ +-------> self.param_groups = para_iterator(ToyModel._parameters) -------->
|     |                                                                |    |
|     |                                                                |    |
|     +----------------------------------------------------------------+    |
|                                                                           |
<-------------------------------------------------------------------------+ v
                     3 step()


0x03 SGD

Let's take a closer look at the optimizer, using SGD as the example. In deep learning practice, SGD (stochastic gradient descent) usually refers to the mini-batch version: the training set is split into n batches of m samples each, and every update uses one batch of data rather than the entire training set.

3.1 Definition

SGD is defined as follows. The constructor mainly validates its arguments and sets default values.

class SGD(Optimizer):
    def __init__(self, params, lr=required, momentum=0, dampening=0,
                 weight_decay=0, nesterov=False):
        if lr is not required and lr < 0.0:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if momentum < 0.0:
            raise ValueError("Invalid momentum value: {}".format(momentum))
        if weight_decay < 0.0:
            raise ValueError("Invalid weight_decay value: {}".format(weight_decay))

        defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                        weight_decay=weight_decay, nesterov=nesterov)
        if nesterov and (momentum <= 0 or dampening != 0):
            raise ValueError("Nesterov momentum requires a momentum and zero dampening")
        super(SGD, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(SGD, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('nesterov', False)

3.2 Analysis

As the docstring shows, SGD implements stochastic gradient descent (optionally with momentum). Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning (http://www.cs.toronto.edu/%7Ehinton/absps/momentum.pdf).

The following is an example:

Example:
    >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    >>> optimizer.zero_grad()
    >>> loss_fn(model(input), target).backward()
    >>> optimizer.step()

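PyTorch's implementation of SGD with Momentum/Nesterov differs subtly from Sutskever et al. and from the implementations in some other frameworks.

Taking plain Momentum as a concrete case, PyTorch writes the update as follows, where p, g, v and \mu denote the parameters, gradient, velocity and momentum respectively: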


$$
\begin{aligned}
v_{t+1} & = \mu * v_{t} + g_{t+1}, \\
p_{t+1} & = p_{t} - \text{lr} * v_{t+1},
\end{aligned}
$$

Other frameworks use:


$$
\begin{aligned}
v_{t+1} & = \mu * v_{t} + \text{lr} * g_{t+1}, \\
p_{t+1} & = p_{t} - v_{t+1}.
\end{aligned}
$$
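The practical difference between the two conventions is easiest to see numerically. The sketch below is plain Python on scalars, not PyTorch code; pytorch_style and other_style are hypothetical helper names, and the gradient sequence is made up for illustration. With a constant learning rate the two forms produce the same parameters, because the velocities differ only by the factor lr. Once the learning rate changes during training they diverge.

```python
def pytorch_style(p, grads, lrs, mu):
    """v = mu * v + g;  p = p - lr * v  (lr is applied when updating p)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = mu * v + g
        p = p - lr * v
    return p


def other_style(p, grads, lrs, mu):
    """v = mu * v + lr * g;  p = p - v  (lr is folded into the velocity)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = mu * v + lr * g
        p = p - v
    return p


grads = [1.0, -0.5, 2.0, 0.25]          # made-up gradient sequence
constant_lr = [0.1, 0.1, 0.1, 0.1]
decayed_lr = [0.1, 0.1, 0.01, 0.01]     # learning rate drops after two steps

same = abs(pytorch_style(1.0, grads, constant_lr, 0.9)
           - other_style(1.0, grads, constant_lr, 0.9)) < 1e-12
differ = abs(pytorch_style(1.0, grads, decayed_lr, 0.9)
             - other_style(1.0, grads, decayed_lr, 0.9)) > 1e-6
print(same, differ)
```

In the PyTorch form a learning-rate change takes effect on the whole accumulated velocity immediately, whereas in the other form the old learning rate stays baked into the velocity accumulated so far.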

3.3 step

The step method applies the optimization algorithm to the variables: each call performs one update of the model parameters, using the gradients that have already been computed.

    @torch.no_grad()
    def step(self, closure=None):
        """Performs a single optimization step. Args: closure (callable, optional): A closure that reevaluates the model and returns the loss. """
        # Recalculate loss using closure
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        # Update the variables with the computed gradients
        # self.param_groups is the list of parameter groups we passed in
        for group in self.param_groups:  # each group is a dict holding one set of parameters and its settings
            params_with_grad = []
            d_p_list = []
            momentum_buffer_list = []
            # Hyperparameters used to update this group
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']
            lr = group['lr']

            for p in group['params']:  # iterate over all parameters to be updated in this group
                if p.grad is not None:
                    params_with_grad.append(p)
                    d_p_list.append(p.grad)

                    state = self.state[p]
                    if 'momentum_buffer' not in state:
                        momentum_buffer_list.append(None)
                    else:
                        momentum_buffer_list.append(state['momentum_buffer'])

            F.sgd(params_with_grad,
                  d_p_list,
                  momentum_buffer_list,
                  weight_decay=weight_decay,
                  momentum=momentum,
                  lr=lr,
                  dampening=dampening,
                  nesterov=nesterov)

            # update momentum_buffers in state
            for p, momentum_buffer in zip(params_with_grad, momentum_buffer_list):
                state = self.state[p]
                state['momentum_buffer'] = momentum_buffer

        return loss

The functional sgd that step delegates to is as follows:

def sgd(params: List[Tensor],
        d_p_list: List[Tensor],
        momentum_buffer_list: List[Optional[Tensor]],
        *,
        weight_decay: float,
        momentum: float,
        lr: float,
        dampening: float,
        nesterov: bool):
    r"""Functional API that performs SGD algorithm computation. See :class:`~torch.optim.SGD` for details. """

    for i, param in enumerate(params):

        d_p = d_p_list[i]
        # Regularization and momentum accumulation
        if weight_decay != 0:
            d_p = d_p.add(param, alpha=weight_decay)

        if momentum != 0:
            buf = momentum_buffer_list[i]

            if buf is None:
                # Historical updates
                buf = torch.clone(d_p).detach()
                momentum_buffer_list[i] = buf
            else:
                # update self.state via buf
                buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

            if nesterov:
                d_p = d_p.add(buf, alpha=momentum)
            else:
                d_p = buf

        # update the current group learning parameters
        param.add_(d_p, alpha=-lr) # add_ changes the value of the object
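To follow the arithmetic by hand, the same control flow can be ported to a single float. This is only a sketch: sgd_scalar is a hypothetical helper written for this article, not part of PyTorch, and it returns the new values instead of modifying tensors in place.

```python
def sgd_scalar(param, grad, buf, *, lr, momentum=0.0, dampening=0.0,
               weight_decay=0.0, nesterov=False):
    """One SGD update on a single float; returns (new_param, new_buf)."""
    d_p = grad
    if weight_decay != 0:
        # L2 penalty: add weight_decay * param to the gradient
        d_p = d_p + weight_decay * param
    if momentum != 0:
        if buf is None:
            # First step: the buffer starts as a copy of the gradient
            buf = d_p
        else:
            buf = momentum * buf + (1 - dampening) * d_p
        # Nesterov applies momentum once more on top of the buffer
        d_p = d_p + momentum * buf if nesterov else buf
    return param - lr * d_p, buf


p, buf = 1.0, None
for step in range(3):
    p, buf = sgd_scalar(p, grad=2.0, buf=buf, lr=0.1, momentum=0.5)
    print(step, p, buf)
```

With a constant gradient of 2.0 the buffer grows 2.0, 3.0, 3.5 and the parameter moves from 1.0 to roughly 0.8, 0.5 and 0.15, so each step is larger than the previous one even though the gradient itself never changes.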

3.4 Variable analysis

Let's go through the hyperparameters one by one.

3.4.1 lr

This is the learning rate, a familiar concept.

3.4.2 dampening

dampening is applied to the partial derivative: in momentum SGD it scales down the weight of the current gradient when the gradient is added to the velocity.

The corresponding formula is as follows:


$$
v_t = v_{t-1} * \text{momentum} + g_t * (1 - \text{dampening})
$$

The corresponding code is:

buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

3.4.3 weight_decay

weight_decay is the L2 penalty coefficient. It modifies the partial derivative using the current value of the learnable parameter p.

The partial derivative of the learnable parameter p to be updated is


$$
g_t = g_t + (p * \text{weight\_decay})
$$

The corresponding code is:

if weight_decay != 0:
    d_p = d_p.add(param, alpha=weight_decay)

3.4.4 nesterov

nesterov controls whether Nesterov momentum is enabled. The PyTorch source shows that when nesterov is True, momentum is applied to the velocity once more, on top of the v_t obtained above:

$$
\bigtriangledown_{w}J(w) + m * v_{t+1}
$$

The corresponding code is:

if (nesterov) {
  d_p = d_p.add(buf, momentum);
} else {
  d_p = buf;
}

3.4.5 Momentum

Momentum is a term borrowed from physics. Its role is to combine the previous update with the current gradient when computing the current weight update.

It is introduced because a poorly initialized network may get stuck in a local minimum during training and never reach the global optimum.

Introducing momentum mitigates this problem to some extent. Momentum simulates the inertia of a moving object and represents the accumulated effect of force over time: each update keeps part of the direction of the previous update and uses the current gradient to adjust it. A larger momentum coefficient retains more of the accumulated velocity, which tends to make updates more stable, speeds up learning along consistent directions, and makes it easier to escape a local basin and move toward a better one.

The original weight update formula is as follows:


$$
w = w - Lr * dw
$$

Here w is the weight, Lr is the learning rate, and dw is the derivative of the loss with respect to w.

The weight update formula after introducing momentum is as follows:


$$
\begin{aligned}
v & = \text{momentum} * v - Lr * dw \\
w & = w + v
\end{aligned}
$$

Here momentum is the momentum coefficient and v is the velocity. The formula simply adds the previous update v, scaled by momentum, to the current step. When the current descent direction -Lr * dw points the same way as the previous update v, the previous update accelerates the step. When -Lr * dw points the opposite way, the previous update slows the step down.
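A small numeric sketch makes the acceleration and slowdown visible. It is plain Python on scalars, and the values of lr, momentum and dw are arbitrary choices for illustration. With a constant gradient, v converges to -Lr * dw / (1 - momentum), which is 1 / (1 - momentum) times larger than a plain gradient step. When the gradient then flips sign, the accumulated velocity shrinks.

```python
lr, momentum, dw = 0.1, 0.9, 1.0

# Same gradient direction at every step: the velocity keeps building up
v = 0.0
for _ in range(200):
    v = momentum * v - lr * dw

limit = -lr * dw / (1 - momentum)   # closed-form limit of the recurrence
plain_step = -lr * dw               # step size without momentum

# The gradient flips sign: the previous velocity now works against the new step
v_after_flip = momentum * v - lr * (-dw)

print(v, limit, plain_step, v_after_flip)
```

With momentum = 0.9 the steady-state step is about ten times the plain step, and a single opposing gradient reduces the magnitude of v rather than reversing it outright.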

The code is as follows:

if momentum != 0:
    buf = momentum_buffer_list[i]

    if buf is None:
        buf = torch.clone(d_p).detach()
        momentum_buffer_list[i] = buf
    else:
        buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

    if nesterov:
        d_p = d_p.add(buf, alpha=momentum)
    else:
        d_p = buf

0x04 Visualization

4.1 Current Problems

So far, a few questions remain unanswered. They are marked with question marks below.

  • Build the optimizer from the model parameters

      1. The optimizer is constructed with optimizer = optim.SGD(params=net.parameters(), lr = 1), so params appears to be assigned to an internal member variable of the optimizer (assume for now that it is called parameters).
    • The model contains two fully connected Linear layers, whose parameters need to be updated.
    • The weight and bias of Linear are both of type Parameter.
      • The Parameter constructor defaults to requires_grad=True, which means a Parameter needs its gradient computed by default.
      • So the engine has to compute gradients for Linear's weight and bias.
    • ToyModel's _parameters member variable is accessed through the parameters method, which returns an iterator.
      • This iterator is passed as the argument used to build the SGD optimizer.
      • The SGD optimizer's parameters therefore refer to ToyModel._parameters, which shows that the optimizer optimizes ToyModel's _parameters directly.
  • The engine computes gradients

    • How is it ensured that Linear's gradients can be computed?
      • weight and bias are of type Parameter, which requires gradient computation by default.
    • 2) How do the computed gradients correspond to the model's Linear parameters? Where do the gradients computed by the engine accumulate??
  • The optimizer optimizes parameters:

      1. Calling step optimizes the optimizer's internal member variable self.parameters.
    • self.parameters refers to ToyModel._parameters, which shows that the optimizer optimizes ToyModel's _parameters directly.
  • The optimizer updates the model:

      1. The updates to self.parameters are in fact applied directly to the model's parameters, such as those of Linear.

If we inspect outputs, we can see that its grad_fn actually has three next_functions, which shows that the earlier diagram was simplified and a more detailed visualization is needed.

outputs = {Tensor: 10} 
 T = {Tensor: 5} 
 data = {Tensor: 10} 
 device = {device} cpu
 dtype = {dtype} torch.float32
 grad = {NoneType} None
 grad_fn = {AddmmBackward} 
  metadata = {dict: 0} {}
  next_functions = {tuple: 3} 
   0 = {tuple: 2} (<AccumulateGrad object at 0x7f9c3e3bd588>, 0)
   1 = {tuple: 2} (<ReluBackward0 object at 0x7f9c3e5178d0>, 0)
   2 = {tuple: 2} (<TBackward object at 0x7f9c3e517908>, 0)
   __len__ = {int} 3
  requires_grad = {bool} True
 is_cuda = {bool} False
 is_leaf = {bool} False
 is_meta = {bool} False
 is_mkldnn = {bool} False
 is_mlc = {bool} False
 is_quantized = {bool} False
 is_sparse = {bool} False
 is_sparse_csr = {bool} False
 is_vulkan = {bool} False
 is_xpu = {bool} False
 layout = {layout} torch.strided
 name = {NoneType} None
 names = {tuple: 2} (None, None)
 ndim = {int} 2
 output_nr = {int} 0
 requires_grad = {bool} True

4.2 Visualizing the network with PyTorchViz

We use PyTorchViz to display the network.

Install the library first:

 pip install torchviz

We then add the visualization code, using the make_dot() function to obtain a drawing object. After running it, a .gv file and an image file (a .bmp with the settings below) are generated in the data folder next to the code. The .gv file is the script for the Graphviz tool, and the image is rendered from the .gv file. The image is opened automatically by view().

import torch
import torch.nn as nn
import torch.optim as optim

from torchviz import make_dot

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

net = ToyModel()
print(net) # Print it while you're at it
optimizer = optim.SGD(params=net.parameters(), lr = 1)
optimizer.zero_grad()
input = torch.randn(10, 10)
outputs = net(input)
outputs.backward(outputs)
optimizer.step()

NetVis = make_dot(outputs, params=dict(list(net.named_parameters()) + [('x', input)]))
NetVis.format = "bmp" # File format
NetVis.directory = "data" # folder where files are generated
NetVis.view() # render the graph to a file and open it

The output is:

ToyModel(
  (net1): Linear(in_features=10, out_features=10, bias=True)
  (relu): ReLU()
  (net2): Linear(in_features=10, out_features=5, bias=True)
)

The generated graph is as follows:

The graph shows that the earlier schematic left out a key link, AccumulateGrad. We analyze it next.

0x05 AccumulateGrad

5.1 Principle

Let's start with an overview of how PyTorch's autograd works.

Conceptually, autograd records a computational graph. The nodes of the graph fall into two types: leaf nodes and non-leaf nodes.

User-created nodes are called leaf nodes, for example:

a = torch.tensor([1.0])

# Runtime variables:
a = {Tensor: 1} tensor([1.])
 T = {Tensor: 1} tensor([1.])
 data = {Tensor: 1} tensor([1.])
 device = {device} cpu
 dtype = {dtype} torch.float32
 grad = {NoneType} None
 grad_fn = {NoneType} None
 is_cuda = {bool} False
 is_leaf = {bool} True
 requires_grad = {bool} False

However, this a cannot be differentiated. When a tensor is created with requires_grad set to True, PyTorch knows that it needs automatic differentiation.

a = torch.tensor([1.0], requires_grad=True)

# Runtime variables:
a = {Tensor: 1} tensor([1.], requires_grad=True)
 T = {Tensor: 1} tensor([1.], grad_fn=<PermuteBackward>)
 data = {Tensor: 1} tensor([1.])
 device = {device} cpu
 dtype = {dtype} torch.float32
 grad = {NoneType} None
 grad_fn = {NoneType} None
 is_cuda = {bool} False
 is_leaf = {bool} True
 requires_grad = {bool} True
 shape = {Size: 1} 1

PyTorch records the history of every operation on a tensor, which produces a conceptual directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors. The user does not need to encode every execution path of the graph in advance, because whatever is run is what will later be differentiated. By tracing the graph from roots to leaves, gradients can be computed automatically with the chain rule.

Internally, autograd represents this graph as a graph of Function (Node) objects, which are expressions that can be evaluated with the apply method.

During backward propagation, the autograd engine traces the graph from the root node (the output of the forward pass) so that the gradients of all leaf nodes can be computed with the chain rule. Every forward operator has a corresponding backward function, which computes the gradient of each of its inputs.

In the backward graph, the backward function for a leaf tensor that requires a gradient is AccumulateGrad. Its gradient is accumulated: every contribution that reaches the tensor is summed into the tensor's gradient. For example:

a=torch.tensor([5.0], requires_grad = True)
b = torch.tensor([3.0], requires_grad = True)
c = a + b
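
In the corresponding backward graph, the grad_fn of c is an AddBackward0 node whose next_functions are two AccumulateGrad nodes, one for a and one for b.

In our example, the parameters of the Linear instances are created explicitly rather than produced by an operation, so they are all leaf nodes.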

5.2 AccumulateGrad

5.2.1 Definition

AccumulateGrad is defined as follows. Its job is to:

  • Accumulate the gradient.
  • Call the update_grad function passed in to update the gradient.
struct TORCH_API AccumulateGrad : public Node {
  explicit AccumulateGrad(Variable variable_);

  variable_list apply(variable_list&& grads) override;

  static at::Tensor callHooks(
      const Variable& variable,
      at::Tensor new_grad) {
    for (auto& hook : impl::hooks(variable)) {
      new_grad = (*hook)({new_grad})[0];
    }
    return new_grad;
  }

  template <typename T>
  static void accumulateGrad(
      const Variable& variable,
      at::Tensor& variable_grad,
      const at::Tensor& new_grad,
      size_t num_expected_refs,
      const T& update_grad) { // Update the gradient function passed in
    
    if (!variable_grad.defined()) {
      // omitted
    } else if (!GradMode::is_enabled()) {
      if (variable_grad.is_sparse() && !new_grad.is_sparse()) {
        auto result = new_grad + variable_grad;
        update_grad(std::move(result));
      } else if (!at::inplaceIsVmapCompatible(variable_grad, new_grad)) {
        auto result = variable_grad + new_grad;
        update_grad(std::move(result));
      } else {
        variable_grad += new_grad; // accumulate
      }
    } else {
      at::Tensor result;
      if (variable_grad.is_sparse() && !new_grad.is_sparse()) {
        // CPU backend throws an error on sparse + dense, so prefer dense + sparse here.
        result = new_grad + variable_grad; // accumulate
      } else {
        // Assumes operator+ result typically matches strides of first arg,
        // and hopes variable_grad was originally created obeying layout contract.
        result = variable_grad + new_grad; // accumulate
      }
      update_grad(std::move(result));
    }
  }

  Variable variable;
};

5.2.2 apply

When apply is called, there are two things to note:

  • The update function passed in is { grad = std::move(grad_update); }, which updates the gradient.
  • mutable_grad returns the gradient member variable of the tensor.
Tensor& mutable_grad() const {
  return impl_->mutable_grad();
}

/// Accesses the gradient `Variable` of this `Variable`.
Variable& mutable_grad() override {
  return grad_;
}

The specific code is as follows:

auto AccumulateGrad::apply(variable_list&& grads) -> variable_list {
  check_input_variables("AccumulateGrad", grads, 1, 0);

  if (!grads[0].defined())
    return {};
  if (variable.grad_fn())
    throw std::logic_error(
        "leaf variable has been moved into the graph interior");
  if (!variable.requires_grad())
    return {};

  at::Tensor new_grad = callHooks(variable, std::move(grads[0]));
  std::lock_guard<std::mutex> lock(mutex_);

  at::Tensor& grad = variable.mutable_grad(); // get the variable's mutable_grad

  accumulateGrad(
      variable,
      grad,
      new_grad,
      1 + !post_hooks().empty() /* num_expected_refs */,
      [&grad](at::Tensor&& grad_update) { grad = std::move(grad_update); });

  return variable_list();
}

The flow is as follows:

AccumulateGrad                                 Tensor           AutogradMeta
     +                                           +                   +
     |                                           |                   |
     |                                           |                   |
     |                                           |                   |
     v                                           |                   |
   apply(update_grad)                            |                   |
     +                                           |                   |
     |                                           |                   |
     |                                           |                   |
     |                                           |                   |
     v                                           |                   |
accumulateGrad                                   |                   |
     +                                           |                   |
     |                                           |                   |
     | result = variable_grad + new_grad         |                   |
     |                                           |                   |
     v                result                     v                   v
 update_grad +---------------------------->  mutable_grad +--->    grad_


During backward propagation, AccumulateGrad is invoked for each leaf tensor: it accumulates the incoming gradient and then writes the result to the grad_ of that leaf tensor:

+----------------------------------------------+          +-------------------------+
|Tensor                                        |          |TensorImpl               |
|                                              |          |                         |
|                                              |  bridge  |                         |
|   <TensorImpl, UndefinedTensorImpl> impl_ +-----------> |    autograd_meta_ +---------+
|                                              |          |                         |   |
|                                              |          |                         |   |
+----------------------------------------------+          +-------------------------+   |
                                                                                        |
                                                                                        |
                                                                                        |
+-------------------------+                                                             |
| AutogradMeta            | <-----------------------------------------------------------+
|                         |
|                         |
|                         |            +------------------------------------------------+
|                         |            | AccumulateGrad                                 |
|      grad_fn_ +--------------------> |                                                |
|                         |            |                                                |
|                         |            |      apply(grads) {                            |
|                         |            |                                                |
|      grad_accumulator_  |            |         accumulateGrad(new_grad) {             |
|                         |            |                                                |
|                         |            |           result = variable_grad  + new_grad   |
|                         |   update   |                                                |
|      grad_ <--------------------------------+    update_grad(result)                  |
|                         |            |                                                |
|                         |            |         }                                      |
|                         |            |      }                                         |
|                         |            |                                                |
|                         |            |                                                |
+-------------------------+            +------------------------------------------------+

Now we know that gradients are accumulated in the grad_ of leaf nodes, but how are these gradients used to update the model parameters?
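The accumulation logic can be imitated in a few lines of plain Python. This is a simplified analogue, not the real implementation: LeafTensor and accumulate_grad are hypothetical names, gradients are plain floats, and the branch that is elided in the C++ listing (the gradient is still undefined) is assumed here to simply store the first gradient.

```python
class LeafTensor:
    """Hypothetical stand-in for a leaf tensor; `grad` plays the role of grad_."""

    def __init__(self, value):
        self.value = value
        self.grad = None      # undefined until the first backward pass
        self.hooks = []       # functions applied to every incoming gradient


def accumulate_grad(leaf, new_grad):
    """Simplified analogue of AccumulateGrad::apply for scalar gradients."""
    for hook in leaf.hooks:                  # corresponds to callHooks
        new_grad = hook(new_grad)
    if leaf.grad is None:
        # Assumption: the first gradient is stored as-is
        leaf.grad = new_grad
    else:
        # result = variable_grad + new_grad; update_grad(result)
        leaf.grad = leaf.grad + new_grad


a = LeafTensor(1.0)
# Suppose a is used twice in the forward pass, e.g. c = 2 * a + 3 * a,
# so the backward pass delivers two gradient contributions: 2 and 3.
accumulate_grad(a, 2.0)
accumulate_grad(a, 3.0)
first_pass = a.grad

# A second backward pass without resetting the gradient keeps accumulating
accumulate_grad(a, 2.0)
accumulate_grad(a, 3.0)
second_pass = a.grad
print(first_pass, second_pass)
```

The second pass doubling the stored gradient is the reason training loops call optimizer.zero_grad() before each backward pass.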

5.3 Combine optimizer

Let's go back to the step function of SGD and keep only the key parts. We can see that it reads the gradients of the model's parameters and then updates those parameters.

@torch.no_grad()
def step(self, closure=None):

    # Recalculate loss using closure

    # Update the variables with the computed gradients
    # self.param_groups is the list of parameter groups we passed in
    for group in self.param_groups:  # each group is a dict holding one set of parameters and its settings

        for p in group['params']:  # iterate over all parameters to be updated in this group
            if p.grad is not None:  # get the gradient of the model parameter
                params_with_grad.append(p)  # the parameter to optimize
                d_p_list.append(p.grad)  # the gradient used for the optimization

                # momentum related

        F.sgd(params_with_grad,  # updates the parameters with param.add_(d_p, alpha=-lr)
              d_p_list,
              momentum_buffer_list,
              weight_decay=weight_decay,
              momentum=momentum,
              lr=lr,
              dampening=dampening,
              nesterov=nesterov)

        # update momentum_buffers in state

    return loss
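The key point, that the optimizer holds references to the very same parameter objects as the model, can be shown with a toy in plain Python. Param, TinyModel and TinySGD are hypothetical classes written for this sketch and are far simpler than their PyTorch counterparts. The gradient is set by hand in place of the engine.

```python
class Param:
    """Hypothetical stand-in for nn.Parameter: a value plus its gradient."""

    def __init__(self, data):
        self.data = data
        self.grad = None


class TinyModel:
    def __init__(self):
        self._parameters = {'weight': Param(1.0), 'bias': Param(0.5)}

    def parameters(self):
        # Returns an iterator over the parameter objects, not copies of them
        yield from self._parameters.values()


class TinySGD:
    def __init__(self, params, lr):
        # The iterator is consumed once; the group stores the same objects
        self.param_groups = [{'params': list(params), 'lr': lr}]

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.data -= group['lr'] * p.grad


model = TinyModel()
opt = TinySGD(model.parameters(), lr=0.1)

# Pretend the engine accumulated this gradient into the weight
model._parameters['weight'].grad = 2.0
opt.step()

shared = opt.param_groups[0]['params'][0] is model._parameters['weight']
print(shared, model._parameters['weight'].data, model._parameters['bias'].data)
```

The weight stored in the model changes even though step only touched the optimizer's own param_groups, and the bias is skipped because its gradient is still None.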

0x06 Summary

We summarize along the flow: build the optimizer from the model parameters -> the engine computes gradients -> the optimizer optimizes parameters -> the optimizer updates the model.

  • Build the optimizer from the model parameters

      1. The optimizer is constructed with optimizer = optim.SGD(params=net.parameters(), lr = 1), so params is assigned to the optimizer's internal member variable param_groups.
    • The model contains two Linear layers. How do these layers update their parameters?
      • The weight and bias of Linear are both of type Parameter.
        • The Parameter constructor defaults to requires_grad=True, which means a Parameter needs its gradient computed by default.
        • So the engine has to compute gradients for Linear's weight and bias.
        • weight and bias are added to ToyModel's _parameters member variable.
      • ToyModel's _parameters member variable is accessed through the parameters method, which returns an iterator.
        • This iterator is passed as the argument used to build the SGD optimizer.
        • The SGD optimizer's parameters therefore refer to ToyModel._parameters, which shows that the optimizer optimizes ToyModel's _parameters directly.
      • So the optimizer directly optimizes and updates Linear's weight and bias. The optimizer is really just a piece of code: at construction time it must be told which parameters to optimize, whether they belong to a model or are other variables specified by the user.
  • The engine computes gradients

    • How is it ensured that Linear's gradients can be computed?
      • weight and bias are of type Parameter, which requires gradient computation by default.
        1. So the gradients of weight and bias are computed.
    • How do the computed gradients correspond to the model's Linear parameters? Where do the gradients computed by the engine accumulate?
      • In our example, the parameters of the Linear instances are created explicitly by the user, so they are leaf nodes.
        1. For leaf nodes, AccumulateGrad accumulates the gradient into autograd_meta_.grad_ of the model's parameter tensor.
  • The optimizer optimizes parameters:

      1. Calling step optimizes the optimizer's internal member variable self.param_groups.
    • self.param_groups refers to ToyModel._parameters, which shows that the optimizer optimizes ToyModel's _parameters directly.
  • The optimizer updates the model:

      1. The updates to self.param_groups are in fact applied directly to the model's parameters, such as Linear's weight and bias.

The details are shown in the figure below:

+---------------------------------------------------------------------+
| ToyModel                                                            |
|  +---------------------------------+                 +------------+ |                   +------------------+
|  | Linear(10, 10)                  +------> ReLU +-->+Linear(10,5)| |                   | Engine           |
|  |                                 |                 |            | |forward / backward |                  |
|  |  weight=Parameter               |                 |    weight  | +-----------------> | Compute gradient |
|  |                                 +---------------+ |    bias    | |                   |        +         |
|  |  +----------------------------+ |               | +--+---------+ |                   |        |         |
|  |  | bias=Parameter             | |               |    |           |                   |        |         |
|  |  |                            | |               |    |           |                   +------------------+
|  |  |                            | |               |    |           |  3 accumulate              |
|  |  |    autograd_meta_.grad_ <-----------------------------------------------------+           2| gradient
|  |  |                            | |               |    |           |               |            |
|  |  |    data                    | |               |    |           |               |            v
|  |  |                            | |               v    v           |               |
|  |  |                            | |        self._parameters        |               |   +------------------+
|  |  +----------------------------+ |                 +              |               |   | AccumulateGrad   |
|  +---------------------------------+                 |              |               |   |                  |
|                                                      |              |               |   |                  |
|                                                      v              |  5 update     +---------+ apply()    |
|                                  para_iterator = parameters()  <----------------+       |                  |
|                                            +                        |           |       |                  |
|                                            |                        |           |       +------------------+
|                                            |                        |           |
+---------------------------------------------------------------------+           |
                                           1 |                                    |
                                             |                                    |
              +---------------------------------------------------------------------------+
              | SGD                          |                                    |       |
              |                              |                                    |       |
              |                              v                                    +       |
              |                                                                 4 step()  |
      ^-------------> self.param_groups = para_iterator(ToyModel._parameters) +---------------->
      |       |                                                                           |    |
      |       |                                                                           |    |
      |       +---------------------------------------------------------------------------+    |
      |                                                                                        |
      <--------------------------------------------------------------------------------------+ v
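
The five numbered steps in the figure can also be mimicked without PyTorch. The sketch below is a deliberately simplified illustration, not the real implementation: ToyParam, ToyLinear, ToySGD and the hand-written backward function are all hypothetical names, the "layer" is a scalar y = weight * x + bias, and the loss is assumed to be (y - target)^2. It only shows why handing the optimizer references to the model's parameter objects is enough for step to update the model.

```python
class ToyParam:
    """Hypothetical stand-in for a Parameter: a value plus an accumulated grad."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0


class ToyLinear:
    """Hypothetical scalar layer y = weight * x + bias."""

    def __init__(self):
        self.weight = ToyParam(2.0)
        self.bias = ToyParam(0.5)

    def parameters(self):
        # Step 1: hand out the parameter objects themselves, not copies.
        yield self.weight
        yield self.bias

    def forward(self, x):
        return self.weight.data * x + self.bias.data


class ToySGD:
    """Plain SGD without momentum: p.data -= lr * p.grad."""

    def __init__(self, params, lr):
        self.params = list(params)  # references to the model's own objects
        self.lr = lr

    def step(self):
        # Steps 4 and 5: the update lands directly on the model's parameters.
        for p in self.params:
            p.data -= self.lr * p.grad

    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0


def backward(model, x, target):
    """Hand-written 'engine' for loss = (y - target)^2."""
    y = model.forward(x)
    dloss_dy = 2.0 * (y - target)      # step 2: compute the gradient
    model.weight.grad += dloss_dy * x  # step 3: accumulate into .grad
    model.bias.grad += dloss_dy


model = ToyLinear()
opt = ToySGD(model.parameters(), lr=0.1)
backward(model, x=1.0, target=3.0)
opt.step()
print(model.weight.data, model.bias.data)
```

Because ToySGD keeps references rather than copies, nothing has to be written back to the model after step; the same objects are read by the next forward call.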

This completes the analysis of the conventional optimizer. In the next article, we examine the optimizers used for data parallelism.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

WeChat official account: Rosie’s Thoughts
