This post is Day 10 of my participation in the November Update Challenge. For event details, see: 2021 Last Update Challenge.

import torch
from torch import nn
from d2l import torch as d2l
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

For the from-scratch implementation of softmax regression, see the earlier article: Hands-on Deep Learning 3.6 – Implementing Softmax Regression from Scratch (juejin.cn)

Loading the data may print a UserWarning; you can simply ignore it. If you really want to know where it comes from, look at the explanation of torchvision.transforms.ToTensor, which is what converts each image, represented as an H x W x C array, into a tensor.
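As a quick check (my own addition; the variable names are just for illustration), you can peek at one batch to see the tensor layout after ToTensor:

X, y = next(iter(train_iter))
print(X.shape)  # torch.Size([256, 1, 28, 28]): batch x channels x height x width
print(y.shape)  # torch.Size([256]): one integer label per image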

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights);
  • nn.Flatten(): PyTorch does not implicitly reshape its inputs, so a flatten layer is placed before the linear layer to reshape each input. nn.Linear(784, 10) specifies the input and output dimensions; since each image is 28 x 28, the flattened vector has 784 elements (see the shape check after this list).
  • net.apply(init_weights): applies the function init_weights to every layer of net. This function:
    • checks whether the layer it receives is an nn.Linear; besides type(m) == nn.Linear, you can also use isinstance(m, nn.Linear), which I mentioned earlier;
    • if so, initializes that layer's weights from a normal distribution with mean 0 and standard deviation 0.01.
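A minimal sanity check (my own addition, not from the original): feeding a fake batch through net confirms that Flatten turns each 1 x 28 x 28 image into 784 features and the linear layer maps them to 10 outputs.

X = torch.rand(2, 1, 28, 28)   # two fake grayscale images
print(net(X).shape)            # torch.Size([2, 10])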
loss = nn.CrossEntropyLoss()
trainer = torch.optim.SGD(net.parameters(), lr=0.1)

The loss directly uses PyTorch's built-in cross entropy loss, nn.CrossEntropyLoss, and the trainer directly uses the built-in SGD optimizer, torch.optim.SGD. Cross entropy is discussed further in the Softmax section below.
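A small illustration (my own sketch; the logits and label are made up): nn.CrossEntropyLoss takes raw, unnormalized logits together with integer class labels and applies log-softmax internally, which is why the network above ends with a plain linear layer rather than an explicit softmax.

logits = torch.tensor([[2.0, 0.5, -1.0]])  # one sample, three classes
label = torch.tensor([0])                  # index of the true class
print(loss(logits, label))                 # a scalar tensor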

num_epochs = 10
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

This training function is the same as the one written in the previous section; see Hands-on Deep Learning 3.6 – Implementing Softmax Regression from Scratch (juejin.cn).
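If you also want to look at a few predictions after training, the previous section saved a predict_ch3 helper into the d2l package; assuming your d2l version includes it, it can be called like this:

d2l.predict_ch3(net, test_iter)  # shows a few test images with true and predicted labels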

Softmax

The Softmax implementation in the previous section uses:


$$\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.$$

The softmax function computes $\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$, where $\hat y_j$ is the predicted probability and $o_j$ is the $j$-th element of the unnormalized predictions $\mathbf{o}$. If some of the $o_k$ are very large, $\exp(o_k)$ may exceed the largest number the data type can represent (i.e., overflow). This turns the numerator or the denominator into inf (infinity), and we end up with a $\hat y_j$ of 0, inf, or nan (not a number). In these cases we cannot get a well-defined return value for the cross entropy.

One trick to solve this problem is to subtract $\max_k(o_k)$ from every $o_k$ before computing the softmax. You can verify that shifting every $o_k$ by the same constant does not change the value of the softmax. After the subtraction and normalization step, some $o_j$ may be large negative numbers; due to limited precision, $\exp(o_j)$ then takes values close to zero (i.e., underflow). These values may be rounded to zero, making $\hat y_j$ zero and $\log(\hat y_j)$ equal to -inf. A few steps of backpropagation later, we may find ourselves staring at a screen full of dreaded nan results.
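A tiny numerical sketch of both effects (my own example; the values are arbitrary): large logits overflow under naive exponentiation, while subtracting the maximum keeps the softmax itself well defined.

o = torch.tensor([50.0, 100.0, 1000.0])
print(torch.exp(o))                                   # the two largest entries overflow to inf
o_shift = o - o.max()                                 # shifting by a constant leaves softmax unchanged
print(torch.exp(o_shift) / torch.exp(o_shift).sum())  # tensor([0., 0., 1.])

Note how the first two probabilities underflow to exactly zero, which is precisely where the $\log(\hat y_j) = -\mathrm{inf}$ problem comes from.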

Even though we still compute exponentials, we end up taking their logarithm again when computing the cross entropy loss. By combining softmax and cross entropy, we can avoid the numerical stability problems that would otherwise plague us during backpropagation. As the following equation shows, instead of computing $\exp(o_j)$ we can use $o_j$ directly, because the $\log$ cancels the $\exp$.


$$\begin{aligned} \log{(\hat y_j)} & = \log\left( \frac{\exp(o_j)}{\sum_k \exp(o_k)}\right) \\ & = \log{(\exp(o_j))} - \log{\left( \sum_k \exp(o_k) \right)} \\ & = o_j - \log{\left( \sum_k \exp(o_k) \right)}. \end{aligned}$$

Recall that the cross entropy loss is

$$l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j.$$
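To make the derivation concrete, here is a minimal sketch (my own, not from the original) of a numerically stable log-softmax that folds in the max-subtraction trick; this mirrors what nn.CrossEntropyLoss does internally by combining LogSoftmax with the negative log-likelihood loss.

def stable_log_softmax(o):
    # shift by the row-wise max, then apply log(softmax) via the identity above
    o_shift = o - o.max(dim=-1, keepdim=True).values
    return o_shift - torch.log(torch.exp(o_shift).sum(dim=-1, keepdim=True))

o = torch.tensor([[50.0, 100.0, 1000.0]])
print(stable_log_softmax(o))             # finite values, no inf or nan
print(torch.log_softmax(o, dim=-1))      # matches PyTorch's built-in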