This article is one of a series of notes I wrote while studying Deep Learning with Python (2nd edition, by François Chollet). The articles are converted from Jupyter notebooks to Markdown, and I will release all of the notebooks on GitHub once the series is complete.

You can read the original English text of the book online at: livebook.manning.com/book/deep-l…

The author of this book also provides a set of companion Jupyter notebooks: github.com/fchollet/de…


Chapter 2. Before we begin: The mathematical building blocks of neural networks

A first look at neural networks

Programming languages start with “Hello World”; deep learning starts with MNIST.

MNIST is a dataset for handwritten digit recognition. It consists of 28×28 grayscale images of handwritten digits, each paired with a label (a value from 0 to 9).

Import the MNIST data set

# Loading the MNIST dataset in Keras
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Take a look at the training set:

print(train_images.shape)
print(train_labels.shape)
train_labels

Output:

(60000, 28, 28)
(60000,)

array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

Here is the test set:

print(test_images.shape)
print(test_labels.shape)
test_labels

Output:

(10000, 28, 28)
(10000,)

array([7, 2, 1, ..., 4, 5, 6], dtype=uint8)

Network building

Let’s construct a neural network for learning MNIST sets:

from tensorflow.keras import models
from tensorflow.keras import layers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28, )))
network.add(layers.Dense(10, activation='softmax'))

Neural networks are made up of layers. One layer is like a distillation filter, which “filters” the incoming data and “refines” the information it needs to pass to the next layer.

Such a series of “layers” is chained together to process data like an assembly line: layer by layer, the data being processed, or the “representation of the data,” becomes more and more “useful” for the result we ultimately want.

The network we’ve just built consists of two “Dense” layers, so called because they are densely connected, or fully connected.

The data goes through to the last layer (the second layer), which is a 10-way softmax layer. This layer outputs an array of 10 probability values that sum to 1. This output “represents” information useful for predicting the digit in the image: each probability value is the probability that the input image belongs to one of the 10 digit classes (0–9).
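To make this concrete, here is a minimal sketch of the softmax idea (my own example in plain NumPy, not the book’s code): it turns 10 arbitrary scores into probabilities that sum to 1.

import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalize the exponentials.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([1.0, 2.0, 0.5, 3.0, 0.1, 0.3, 1.5, 0.2, 0.8, 2.2])
probs = softmax(scores)
print(probs.sum())       # 1.0
print(probs.argmax())    # 3: the index (digit class) with the highest probability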

Compilation

Next, we need to compile the network. This step needs to be given three parameters:

  • Loss function: a function that evaluates how well your network performs
  • Optimizer: How to update (optimize) your network
  • Metrics to monitor during training and testing; in this example we only care about one metric, the prediction accuracy
network.compile(loss="categorical_crossentropy",
                optimizer='rmsprop',
                metrics=['accuracy'])

Preprocessing

Image preprocessing

We also need to preprocess the image data, transforming it into the form our network expects.

The images in the MNIST dataset are 28×28 arrays of uint8 values in [0, 255]. Our network instead expects flattened vectors of 28 * 28 float32 values in [0, 1].

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

Label processing

The labels also need to be processed: to_categorical turns the integer labels into one-hot encoded vectors.

from tensorflow.keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
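As a quick check (my own example, not from the book), to_categorical turns each integer label into a one-hot vector:

from tensorflow.keras.utils import to_categorical

print(to_categorical([0, 2, 1], num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]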

Training the network

network.fit(train_images, train_labels, epochs=5, batch_size=128)

Output:

Train on 60000 samples
Epoch 1/5
60000/60000 [==============================] - 3s 49us/sample - loss: 0.2549 - accuracy: 0.9254
Epoch 2/5
60000/60000 [==============================] - 2s 38us/sample - loss: 0.1025 - accuracy: 0.9693
Epoch 3/5
60000/60000 [==============================] - 2s 35us/sample - loss: 0.0676 - accuracy: 0.9800
Epoch 4/5
60000/60000 [==============================] - 2s 37us/sample - loss: 0.0491 - accuracy: 0.9848
Epoch 5/5
60000/60000 [==============================] - 2s 42us/sample - loss: 0.0369 - accuracy: 0.9888
<tensorflow.python.keras.callbacks.History at 0x13a7892d0>

As you can see, training is fast, and after a few seconds we reach 98%+ accuracy on the training set.

Try it again with the test set:

test_loss, test_acc = network.evaluate(test_images, test_labels, verbose=2)    # verbose=2 to avoid a looooong progress bar that fills the screen with '='. https://github.com/tensorflow/tensorflow/issues/32286
print('test_acc:', test_acc)

Output:

10000/1 - 0s - loss: 0.0362 - accuracy: 0.9789
test_acc: 0.9789

Our trained network does not perform quite as well on the test set as it did on the training set; this gap is the fault of “overfitting”.

Data representation of neural networks

A tensor is an array of arbitrary dimensionality (an array in the programming sense). A matrix is a 2-dimensional tensor.

We often refer to the dimension of a tensor as the axis.

Getting to know tensors

Scalars (0D tensors)

A scalar is a tensor with zero dimensions (zero axes) that contains a single number.

In NumPy, a scalar can be a float32 or float64 number.

import numpy as np

x = np.array(12)
x

Output:

array(12)
x.ndim    # Axis number (dimension)

Output:

0

Vectors (1D tensors)

A vector is a 1-dimensional tensor (it has 1 axis), containing an array of scalars.

x = np.array([1, 2, 3, 4, 5])
x

Output:

array([1, 2, 3, 4, 5])
x.ndim

Output:

1

We call a vector that has five elements like this a five-dimensional vector. But notice that the 5D vector is not a 5D tensor!

  • 5D vector: has only one axis, and has five dimensions along this axis.
  • The 5D tensor: has 5 axes, and can have any dimension along each axis.

This can be confusing, because “dimension” sometimes means the number of axes and sometimes the number of elements along an axis.

So it is clearer to speak in terms of order (rank) and say a tensor of order 5.
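A quick NumPy illustration of the difference (my own example):

import numpy as np

v = np.array([1, 2, 3, 4, 5])    # a 5-dimensional vector: 1 axis, 5 elements along it
t = np.zeros((2, 2, 2, 2, 2))    # an order-5 (5D) tensor: 5 axes

print(v.ndim)    # 1
print(t.ndim)    # 5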

Matrices (2D tensors)

A matrix is a tensor of order 2 (2 axes, which we usually call rows and columns), containing an array of vectors.

x = np.array([[5, 78, 2, 34, 0],
              [6, 79, 3, 35, 1],
              [7, 80, 4, 36, 2]])
x

Output:

array([[ 5, 78,  2, 34,  0],
       [ 6, 79,  3, 35,  1],
       [ 7, 80,  4, 36,  2]])
x.ndim

Output:

2

Higher-order tensor

If you pack matrices into an array, you get a tensor of order 3.

Pack order-3 tensors into an array and you get a tensor of order 4, and so on for higher-order tensors.

x = np.array([[[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]]])
x.ndim

Output:

3

In deep learning, we usually use tensors of order 0 to 4.

The three elements of a tensor

  • Order (number of axes): 3, 5, …
  • Shape (the dimensions along each axis): (2, 1, 3), (6, 5, 5, 3, 6), …
  • Data type: float32, uint8, …

Let’s look at the tensor data in MNIST:

from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print(train_images.ndim)
print(train_images.shape)
print(train_images.dtype)

Output:

3
(60000, 28, 28)
uint8

So train_images is an order-3 tensor of 8-bit unsigned integers.

Let's print one of the images and take a look:

digit = train_images[0]

import matplotlib.pyplot as plt

print("image:")
plt.imshow(digit, cmap=plt.cm.binary)
plt.show()
print("label: ", train_labels[0])

Output:

label:  5

Manipulating tensors in NumPy

Tensor slicing:

my_slice = train_images[10:100]
print(my_slice.shape)

Output:

(90, 28, 28)

This is equivalent to:

my_slice = train_images[10:100, :, :]
print(my_slice.shape)

Output:

(90, 28, 28)

And also equivalent to:

my_slice = train_images[10:100, 0:28, 0:28]
print(my_slice.shape)

Output:

(90, 28, 28)

Select the 14×14 pixels in the bottom-right corner:

my_slice = train_images[:, 14:, 14:]
plt.imshow(my_slice[0], cmap=plt.cm.binary)
plt.show()

Output: (the bottom-right 14×14 corner of the first digit is displayed)

Select the 14×14 pixels at the center:

my_slice = train_images[:, 7:-7, 7:-7]
plt.imshow(my_slice[0], cmap=plt.cm.binary)
plt.show()

Output: (the center crop of the first digit is displayed)

Data batches

In deep learning data, the first axis (index=0) is usually called the “sample axis” (or “sample dimension”).

In deep learning, we typically don’t process the whole data set at once, we process it batch by batch.

In MNIST, one batch for us is 128 samples:

# The first batch
batch = train_images[:128]
# The second batch
batch = train_images[128:256]
# The nth batch
n = 12
batch = train_images[128 * n : 128 * (n + 1)]

Therefore, when working with batches, the first axis is also called the “batch axis”.

Common data tensor representation

  • Vector data: 2D tensors, shape (samples, features)
  • Time series: 3D tensors, shape (samples, timesteps, features)
  • Images: 4D tensors, shape (samples, height, width, channels) or (samples, channels, height, width)
  • Video: 5D tensors, shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)
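As a small illustration (my own sketch, with made-up sizes), here is what such tensors look like in NumPy:

import numpy as np

vector_data = np.zeros((1000, 20))             # (samples, features)
timeseries  = np.zeros((1000, 250, 20))        # (samples, timesteps, features)
images      = np.zeros((1000, 28, 28, 1))      # (samples, height, width, channels)
videos      = np.zeros((10, 60, 32, 32, 3))    # (samples, frames, height, width, channels)

print(vector_data.ndim, timeseries.ndim, images.ndim, videos.ndim)    # 2 3 4 5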

“Gears” of neural networks: Tensor operations

In our first neural network example (MNIST), each of our layers actually does something like this on the input data:

output = relu(dot(W, input) + b)

Here input is the input data, W and b are attributes of the layer (its weights), and output is the output.

This expression involves relu, a dot product, and an addition, which we explain next.

Element-wise operations

An element-wise operation is one that is applied independently to each element of a tensor. For example, let's implement a naive relu (relu(x) = max(x, 0)):

def naive_relu(x):
    assert len(x.shape) == 2    # x is a 2D Numpy tensor.
    x = x.copy()    # Avoid overwriting the input tensor.
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)
    return x

Addition is also a per-element operation:

def naive_add(x, y):
    # assert x and y are 2D Numpy tensors and have the same shape.
    assert len(x.shape) == 2
    assert x.shape == y.shape

    x = x.copy()    # Avoid overwriting the input tensor.
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[i, j]
    return x

In NumPy, these element-wise operations are already implemented for us. The actual computation is delegated to BLAS routines written in C or Fortran, so it is very fast.

You can check whether BLAS is installed:

import numpy as np

np.show_config()

Output:

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

Here’s how to use numpy’s element-by-element relu and add:

a = np.array([[1, 2, 3],
              [-1, 2, -3],
              [3, -1, 4]])
b = np.array([[6, 7, 8],
              [-2, -3, 1],
              [1, 0, 4]])

c = a + b    # Element-wise addition
d = np.maximum(c, 0)    # Element-wise relu

print(c)
print(d)

Output:

[[ 7  9 11]
 [-3 -1 -2]
 [ 4 -1  8]]
[[ 7  9 11]
 [ 0  0  0]
 [ 4  0  8]]
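As a rough illustration of the speed difference (my own sketch; the exact timings depend on your machine), we can compare the naive_add loop from earlier against NumPy's vectorized addition:

import time
import numpy as np

x = np.random.random((1000, 1000))
y = np.random.random((1000, 1000))

t0 = time.time()
z = naive_add(x, y)    # the pure-Python double loop defined above
print("naive loop:", time.time() - t0, "seconds")

t0 = time.time()
z = x + y              # NumPy's element-wise addition (optimized C)
print("numpy +   :", time.time() - t0, "seconds")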

Broadcasting

When performing an element-wise operation on two tensors with different shapes, the smaller tensor will be “broadcast” to match the shape of the larger one, if possible.

Specifically, broadcasting is possible between two tensors with shapes like (a, b, …, n, n+1, …, m) and (n, n+1, …, m).

Such as:

x = np.random.random((64, 3, 32, 10))    # x is a random tensor with shape (64, 3, 32, 10).
y = np.random.random((32, 10))    # y is a random tensor with shape (32, 10).
z = np.maximum(x, y)    # The output z has shape (64, 3, 32, 10) like x.

Broadcasting proceeds in two steps:

  1. Axes (called broadcast axes) are added to the smaller tensor until its number of axes (ndim) matches the larger tensor.
  2. The elements of the smaller tensor are repeated along these new axes until its shape matches the larger tensor.

E.g.

x: (32, 10), y: (10,)
Step 1: add an empty first axis to y: Y -> (1, 10)
Step 2: repeat y 32 times alongside this new axis: Y -> (32, 10)

After these steps, Y[i, :] == y for every i in range(0, 32).
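We can make those two steps explicit in NumPy (my own sketch) and check that they give the same result as automatic broadcasting:

import numpy as np

x = np.random.random((32, 10))
y = np.random.random((10,))

Y = np.expand_dims(y, axis=0)    # Step 1: add a first axis -> shape (1, 10)
Y = np.repeat(Y, 32, axis=0)     # Step 2: repeat 32 times along it -> shape (32, 10)

print(np.allclose(x + y, x + Y))    # True: broadcasting did the same thing implicitly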

Of course, a real implementation does not make such copies, which would waste memory; the repetition happens implicitly inside the algorithm. For example, let's implement a naive addition of a matrix and a vector:

def naive_add_matrix_and_vector(m, v):
    assert len(m.shape) == 2    # m is a 2D Numpy tensor.
    assert len(v.shape) == 1    # v is a Numpy vector.
    assert m.shape[1] == v.shape[0]

    m = m.copy()    # Avoid overwriting the input tensor.
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            m[i, j] += v[j]
    return m

naive_add_matrix_and_vector(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                            np.array([1, -1, 100]))

Output:

array([[  2,   1, 103],
       [  5,   4, 106],
       [  8,   7, 109]])

Dot product of tensors

The tensor dot product, or tensor product, is computed with dot(x, y) in NumPy.

The operation of the dot product can be seen in the following simple program:

# Dot product of two vectors
def naive_vector_dot(x, y):
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    assert x.shape[0] == y.shape[0]

    z = 0.
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z


# Dot product of a matrix and a vector
def naive_matrix_vector_dot(x, y):
    z = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        z[i] = naive_vector_dot(x[i, :], y)
    return z


# Matrix dot product
def naive_matrix_dot(x, y):
    assert len(x.shape) == 2
    assert len(y.shape) == 2
    assert x.shape[1] == y.shape[0]

    z = np.zeros((x.shape[0], y.shape[1]))
    for i in range(x.shape[0]):
        for j in range(y.shape[1]):
            row_x = x[i, :]
            column_y = y[:, j]
            z[i, j] = naive_vector_dot(row_x, column_y)
    return z
a = np.array([[1, 2, 3],
              [-1, 2, -3],
              [3, -1, 4]])
b = np.array([[6, 7, 8],
              [-2, -3, 1],
              [1, 0, 4]])
naive_matrix_dot(a, b)

Output:

array([[  5.,   1.,  22.],
       [-13., -13., -18.],
       [ 24.,  24.,  39.]])
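As a sanity check (my addition, not in the book's notebook), the naive implementation can be compared against NumPy's built-in dot product:

print(np.allclose(naive_matrix_dot(a, b), np.dot(a, b)))    # True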

The same rules apply to dot products of higher-dimensional tensors. For example (in terms of shapes):

(a, b, c, d) . (d,) -> (a, b, c)
(a, b, c, d) . (d, e) -> (a, b, c, e)
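A quick way to verify these shape rules (my own example) is with np.dot on zero-filled tensors:

x = np.zeros((2, 3, 4, 5))
v = np.zeros((5,))
m = np.zeros((5, 6))

print(np.dot(x, v).shape)    # (2, 3, 4): (a, b, c, d) . (d,) -> (a, b, c)
print(np.dot(x, m).shape)    # (2, 3, 4, 6): (a, b, c, d) . (d, e) -> (a, b, c, e)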

Tensor reshaping

Reshaping, in short, keeps the same elements but arranges them into a different shape.

x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])
print(x.shape)

Output:

(3, 2)
x.reshape((6, 1))

Output:

array([[0.],
       [1.],
       [2.],
       [3.],
       [4.],
       [5.]])
x.reshape((2, 3))

Output:

array([[0., 1., 2.],
       [3., 4., 5.]])

Transposition is a special kind of reshaping: transposing a matrix exchanges its rows and columns.

So the original x[i, :] becomes x[:, i] after transposition.

x = np.zeros((300, 20))
y = np.transpose(x)
print(y.shape)

Output:

(20, 300)

The “engine” of neural networks: Gradient-based optimization

In our first neural network example (MNIST), each layer performs operations on the input data:

output = relu(dot(W, input) + b)

In this formula, W and b are attributes of the layer (its weights, or trainable parameters). To be specific,

  • W is the kernel attribute;
  • b is the bias attribute.

These “weights” are what the neural network learns from the data.

At first, these weights are initialized to small random values. Then, based on feedback from the resulting (random) outputs, they are gradually adjusted and improved.

This gradual improvement happens inside a “training loop,” which can be repeated as many times as necessary:

  1. Extract a batch of training data X and the corresponding y
  2. Forward pass: run X through the network to obtain the predictions y_pred
  3. Compute the loss from y_pred and y
  4. Adjust the parameters in some way to reduce the loss

The first three steps are relatively simple; the fourth step, updating the parameters, is more involved. An effective and practical approach is to exploit differentiability: compute the gradient of the loss and move the parameters in the opposite direction of the gradient.

Derivative

This section explains the definition of a derivative.

(Go straight to the book.)

So if you know the derivative, and you have to update x to minimize a function f of x, you just have to move x in the opposite direction of the derivative.
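For instance, here is a tiny sketch (my own example, not from the book) that minimizes f(x) = (x - 3)**2 by repeatedly stepping against the derivative f'(x) = 2 * (x - 3):

x = 0.0
learning_rate = 0.1
for step in range(100):
    grad = 2 * (x - 3)         # derivative of f at the current x
    x -= learning_rate * grad  # move in the opposite direction of the derivative
print(x)    # very close to 3.0, where f reaches its minimum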

Gradient

The gradient is the derivative of a tensor operation; it generalizes the concept of derivative to functions of multidimensional (tensor) inputs. The gradient at a point can be interpreted as describing the curvature of the function around that point.

Consider:

y_pred = dot(W, x)
loss_value = loss(y_pred, y)

If x and y are fixed, then loss_value will be a function of W:

loss_value = f(W)

Let the current point be W0. The derivative (gradient) of f at W0 is written gradient(f)(W0); this gradient is a tensor of the same shape as W. Each element gradient(f)(W0)[i, j] indicates the direction and magnitude of the change in f when W0[i, j] is modified.

So, to change W so as to reduce f, you can move it in the opposite direction of the gradient (i.e., the direction of gradient descent):

W1 = W0 - step * gradient(f)(W0)

Stochastic Gradient Descent

In theory, given a differentiable function, its minimum value must be taken at the point where the derivative is zero. So all we have to do is take all the points where the derivative is zero, compare the values of the function, and we get the minimum.

Applying this method to a neural network means solving the equation gradient(f)(W) = 0 for W, an equation in N variables, where N is the number of parameters in the network. In practice N is rarely less than a few thousand, which makes solving this equation analytically practically impossible.

Instead, we use the four-step loop described above. In the fourth step, we use gradient descent to update the parameters a little at a time in the opposite direction of the gradient, steadily moving in the direction that reduces the loss:

  1. Extract a batch of training data X and the corresponding y
  2. Forward pass: run X through the network to obtain the predictions y_pred
  3. Compute the loss from y_pred and y
  4. Adjust the parameters in some way to reduce the loss
    1. Backward pass: compute the gradient of the loss with respect to the network's parameters
    2. Move the parameter slightly in the opposite direction of the gradient to reduce the loss (W -= step * gradient)

This method is called mini-batch Stochastic Gradient Descent (Mini-batch SGD). The word random means that the data we extracted in step 1 was randomly extracted.
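To make the loop concrete, here is a self-contained toy sketch (my own example, not the book's code) that uses mini-batch SGD to fit a single weight w so that w * x approximates y = 2 * x:

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.random(1024).astype('float32')
y_train = 2.0 * x_train

w = 0.0               # the single trainable parameter
learning_rate = 0.1
batch_size = 128

for epoch in range(20):
    order = rng.permutation(len(x_train))        # shuffle so each batch is drawn at random
    for i in range(0, len(x_train), batch_size):
        idx = order[i:i + batch_size]
        x, y = x_train[idx], y_train[idx]        # 1. extract a batch of data
        y_pred = w * x                           # 2. forward pass
        loss = np.mean((y_pred - y) ** 2)        # 3. mean squared error loss
        grad = np.mean(2 * (y_pred - y) * x)     # 4a. gradient of the loss w.r.t. w
        w -= learning_rate * grad                # 4b. move w against the gradient

print(w)    # close to 2.0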

Some variants of SGD update values not only by looking at the current gradient, but also by looking at the last weight update. These variations are called “optimization methods” or “optimizers.” In many of these variants, a concept called momentum is used.

Momentum mainly addresses two problems with SGD: convergence speed and local minima. When the learning rate is small, plain SGD can get stuck in a local minimum instead of continuing toward the global minimum; momentum helps it keep moving past such points.

The momentum here is the momentum concept that comes from physics. We can imagine a small ball rolling down the loss surface (the direction of gradient descent), and if there is enough momentum, it can “dash” past the local minimum and not get trapped there. In this example, the ball’s motion is determined not only by the slope of its current position (the current acceleration), but also by its current velocity (which depends on the previous acceleration).

This idea is put into the neural network, that is, a weight update, not only looks at the current gradient, but also looks at the last weight update:

# naive implementation of Optimization with momentum
past_velocity = 0.
momentum = 0.1    # Constant momentum factor
while loss > 0.01:    # Optimization loop
    w, loss, gradient = get_current_parameters()
    velocity = past_velocity * momentum + learning_rate * gradient
    w = w + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    update_parameter(w)

The backpropagation algorithm: the chain rule

A neural network is a chain of tensor operations, such as:

f(W1, W2, W3) = a(W1, b(W2, c(W3)))    # where W1, W2, W3 are weights

There is a chain rule in calculus that says (f(g(x)))' = f'(g(x)) * g'(x).

Applying the chain rule to neural networks produces an algorithm called “backpropagation” (also known as “reverse-mode differentiation”).

Back propagation starts from the final calculated loss, and works backwards from the top layer of the neural network to the bottom layer. The chain rule is used to calculate the contribution of each parameter in each layer to the loss.

Frameworks like TensorFlow today have a capability called “symbolic differentiation.” This allows these frameworks to automatically compute the gradient function for a given neural network operation, and then instead of manually implementing back propagation (which is interesting, but really annoying to write), we can just take the value from the gradient function.
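For instance (a minimal example of my own, not from this chapter), TensorFlow 2 can compute the gradient of a chain of operations automatically with tf.GradientTape:

import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = tf.square(tf.sin(x))    # a chain of operations: square(sin(x))
grad = tape.gradient(y, x)      # computed via the chain rule: 2 * sin(x) * cos(x)
print(float(grad))              # same value as 2 * sin(2.0) * cos(2.0)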