• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/3…
  • This paper addresses: www.showmeai.tech/article-det…
  • Notice: All rights reserved. For reposting, please contact the platform and the author, and cite the source.



This series covers Andrew Ng's Deep Learning Specialization; the corresponding videos can be viewed here.

Introduction

In ShowMeAI’s previous article, Introduction to Deep Learning, we introduced Deep Learning briefly:

  • We used housing price prediction as an example to explain the structure and basic concepts of a neural network model.
  • We introduced several typical neural networks for supervised learning: standard NN, CNN, and RNN.
  • We described two types of data: structured data and unstructured data.
  • We discussed why deep learning has become popular in recent years and why it outperforms traditional machine learning in terms of data, computation, and algorithms.

In this section, we introduce the foundation of neural networks: logistic regression. By analyzing the structure of the logistic regression model, we will transition to the neural network models that follow. (For more on logistic regression, you can also read ShowMeAI's article Illustrated Machine Learning | Logistic Regression Algorithm.)

1. Algorithm Basics and Logistic Regression

Logistic regression is an algorithm for binary classification.

1.1 Binary classification problems and fundamentals of machine learning

In binary classification, the output $y$ takes only two discrete values, $\{0, 1\}$ (using $\{-1, 1\}$ is also common). Let's take an image recognition problem as an example: determining whether a picture is a cat. Recognizing a cat is a classic binary classification task: 0 for "not cat" and 1 for "cat". (For the fundamentals of machine learning, you can also see ShowMeAI's article Illustrated Machine Learning | Machine Learning Basics.)

From the perspective of machine learning, our input $X$ is an image. A color image contains three RGB channels, and the image size is $(64, 64, 3)$.

The input of some neural networks is one-dimensional. We can flatten the picture $x$ (of dimension $(64, 64, 3)$) into a one-dimensional feature vector of dimension $(12288, 1)$. We usually represent a sample as a column vector and denote its feature dimension by $n_x$.

If the training set has $m$ pictures, we store the data in a matrix, and the data dimension becomes $(n_x, m)$.

  • The rows of matrix $X$ correspond to the $n_x$ features of each sample $x^{(i)}$.
  • The columns of matrix $X$ correspond to the $m$ samples.

We also stack the labels $Y$ of the training samples into a row, so that the dimension of $Y$ is $(1, m)$.
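As a concrete illustration, here is a minimal NumPy sketch of this data layout (the array names images and labels are hypothetical placeholders for real data):

import numpy as np

# Hypothetical raw data: m color images of size 64x64x3 and their 0/1 labels
m = 100
images = np.random.rand(m, 64, 64, 3)        # stand-in for real image data
labels = np.random.randint(0, 2, size=m)     # stand-in for real cat / not-cat labels

# Flatten each image into a column vector and stack the columns: X has shape (n_x, m)
X = images.reshape(m, -1).T                  # n_x = 64 * 64 * 3 = 12288
Y = labels.reshape(1, m)                     # labels stacked into a (1, m) row vector

print(X.shape, Y.shape)                      # (12288, 100) (1, 100)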

1.2 Logistic regression algorithm

Logistic regression is the most common binary classification algorithm (for a detailed explanation, you can also read ShowMeAI's article Illustrated Machine Learning | Logistic Regression Algorithm). It involves the following quantities:

  • Input feature vector: $x \in R^{n_x}$, where $n_x$ is the number of features
  • Training label: $y \in \{0, 1\}$
  • Weights: $w \in R^{n_x}$
  • Bias: $b \in R$
  • Output: $\hat{y} = \sigma(w^Tx + b)$

The Sigmoid function is used in the output calculation. It is a nonlinear S-shaped function whose output is limited to $[0, 1]$, and it is commonly used as an activation function in neural networks.

The Sigmoid function is expressed as follows:


$$s = \sigma(w^Tx+b) = \sigma(z) = \frac{1}{1+e^{-z}}$$

In fact, logistic regression can be regarded as a very small neural network.
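As a minimal sketch (with made-up toy values for $w$, $b$, and $x$), this forward computation looks like the following in NumPy:

import numpy as np

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Toy parameters and a single input with n_x = 3 features (made-up values)
w = np.array([0.2, -0.5, 0.1])
b = 0.3
x = np.array([1.0, 2.0, 0.5])

y_hat = sigmoid(np.dot(w, x) + b)   # predicted probability that y = 1
print(y_hat)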

1.3 Loss function of logistic regression

In machine learning, the loss function is used to quantify the gap between the predicted result and the true value. We continuously adjust the model weights by optimizing the loss function so that the model fits the sample data as well as possible.

For regression problems, we typically use the mean squared error (MSE) loss:


$$L(\hat{y},y) = \frac{1}{2}(\hat{y}-y)^2$$

However, we tend not to use this loss function in logistic regression. Using squared error loss in logistic regression yields a non-convex loss function with many local optima, so gradient descent may fail to find the global optimum, which makes optimization difficult.

Therefore, we instead use the logarithmic loss (binary cross-entropy loss):


$$L(\hat{y},y) = -\left(y\log\hat{y}+(1-y)\log(1-\hat{y})\right)$$

What we have just given is the loss function defined on a single training sample, which measures performance on that one sample. We define the cost function as the performance over all training samples, i.e., the average of the loss function over the $m$ samples; it reflects how close, on average, the predicted outputs of the $m$ samples are to the true labels $y$.

The calculation formula of the cost function is as follows:


$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$$
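As a sketch, the cost over $m$ samples can be computed with NumPy like this (A and Y are assumed to be $(1, m)$ arrays holding the predictions and the true labels):

import numpy as np

def cross_entropy_cost(A, Y):
    # A: predicted probabilities, shape (1, m); Y: true labels, shape (1, m)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

A = np.array([[0.9, 0.2, 0.7]])
Y = np.array([[1, 0, 1]])
print(cross_entropy_cost(A, Y))   # about 0.23: small, since predictions are close to the labels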

2. Gradient Descent

We have now defined the loss function and the cost function. The next step is to find the optimal values of $w$ and $b$ that minimize the cost function over the $m$ training samples. The method used here is called Gradient Descent.

In mathematics, the gradient of a function points in its direction of steepest ascent: moving along the gradient makes the function increase fastest, so moving in the negative gradient direction makes it decrease fastest.

(For more of the underlying optimization mathematics, you can read ShowMeAI's article AI Math Foundations | Calculus and Optimization.)

The training objective of the model is to find suitable $w$ and $b$ that minimize the value of the cost function. Assume first that $w$ and $b$ are one-dimensional real numbers; the surface of the cost function $J$ with respect to $w$ and $b$ is shown below:

The cost function $J$ in the figure above is a convex function with a single global minimum, which guarantees that an optimal solution can be found regardless of where the model parameters are initialized (anywhere on the surface).

Based on the gradient descent algorithm, the update formula for parameter $w$ is:


$$w := w - \alpha\frac{dJ(w, b)}{dw}$$

In the formula, $\alpha$ is the learning rate, i.e., the step size of each update of $w$.

For the cost function $J(w, b)$, the corresponding update formula for parameter $b$ is:


$$b := b - \alpha\frac{dJ(w, b)}{db}$$
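In code, one gradient descent step amounts to just these two updates (a minimal sketch, assuming dw and db already hold the gradients computed at the current parameters):

alpha = 0.01          # learning rate: the step size, chosen as a hyperparameter
w = w - alpha * dw    # move w against its gradient
b = b - alpha * db    # move b against its gradient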

3. Computation Graph

For a neural network, the training process includes two stages: Forward Propagation and Back Propagation.

  • Forward propagation is the process from input to output: the neural network computes forward to produce its prediction.
  • Back propagation is the process from output to input: the gradients of the cost function with respect to the parameters $w$ and $b$ are computed.

Let's try to understand both of these stages with an example, using a computation graph.

3.1 Forward Propagation

Suppose our cost function is $J(a,b,c) = 3(a+bc)$, involving three variables $a$, $b$, and $c$.

We add some intermediate variables: let $u = bc$ and $v = a + u$; then $J = 3v$.

The whole process can be represented by a computational graph:

In the figure above, let $a = 5$, $b = 3$, $c = 2$; then $u = bc = 6$, $v = a + u = 11$, and $J = 3v = 33$.

In the computation graph, the left-to-right, input-to-output process corresponds to the forward computation in which the neural network calculates the cost function from $x$ and $w$.
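A tiny sketch of this forward pass in Python, using the values above:

a, b, c = 5, 3, 2
u = b * c        # u = 6
v = a + u        # v = 11
J = 3 * v        # J = 33
print(J)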

3.2 Back Propagation

We continue with the computation graph from the previous example to explain back propagation. The input parameters are $a$, $b$, and $c$.

① First, compute the partial derivative of $J$ with respect to parameter $a$

From right to left: $J$ is a function of $v$, and $v$ is a function of $a$. By the chain rule of derivatives:


$$\frac{\partial J}{\partial a}=\frac{\partial J}{\partial v}\cdot \frac{\partial v}{\partial a}=3\cdot 1=3$$

② Compute the partial derivative of $J$ with respect to parameter $b$

From right to left: $J$ is a function of $v$, $v$ is a function of $u$, and $u$ is a function of $b$. Similarly:


$$\frac{\partial J}{\partial b}=\frac{\partial J}{\partial v}\cdot \frac{\partial v}{\partial u}\cdot \frac{\partial u}{\partial b}=3\cdot 1\cdot c=3\cdot 1\cdot 2=6$$

③ Compute the partial derivative of $J$ with respect to parameter $c$

Again from right to left: $J$ is a function of $v$, $v$ is a function of $u$, and $u$ is a function of $c$. We get:


$$\frac{\partial J}{\partial c}=\frac{\partial J}{\partial v}\cdot \frac{\partial v}{\partial u}\cdot \frac{\partial u}{\partial c}=3\cdot 1\cdot b=3\cdot 1\cdot 3=9$$

This completes the back propagation: the gradients (partial derivatives) are computed from right to left.
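The same chain-rule computation written out in Python; the results can be checked against the analytic values 3, 6, and 9 obtained above:

a, b, c = 5, 3, 2
u = b * c
v = a + u

dJ_dv = 3                         # J = 3v
dv_da, dv_du = 1, 1               # v = a + u
du_db, du_dc = c, b               # u = b * c

dJ_da = dJ_dv * dv_da             # 3
dJ_db = dJ_dv * dv_du * du_db     # 3 * 1 * 2 = 6
dJ_dc = dJ_dv * dv_du * du_dc     # 3 * 1 * 3 = 9
print(dJ_da, dJ_db, dJ_dc)        # 3 6 9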

4. Gradient descent in logistic regression

Returning to the logistic regression problem mentioned earlier, assume the input feature vector has dimension 2 (i.e. $[x_1, x_2]$), with corresponding weight parameters $w_1$, $w_2$ and bias $b$. The computation graph is shown below:

Back propagation computes gradients

① Find the derivative of $L$ with respect to $a$

② Find the derivative of $L$ with respect to $z$

③ Continue propagating backward to compute the derivatives with respect to $w_1$, $w_2$, and $b$

④ The parameter update formulas can then be obtained from gradient descent (summarized below)
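For reference, a sketch of these steps using the standard single-sample derivation (consistent with the pseudocode further below):

$$da = \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}$$

$$dz = \frac{\partial L}{\partial z} = a - y$$

$$dw_1 = x_1 \, dz, \quad dw_2 = x_2 \, dz, \quad db = dz$$

$$w_1 := w_1 - \alpha \, dw_1, \quad w_2 := w_2 - \alpha \, dw_2, \quad b := b - \alpha \, db$$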



What was described above is the process of computing partial derivatives and applying gradient descent for a single sample. For a dataset with $m$ samples, the cost function $J(w,b)$, the activations $a^{(i)}$, and the weight parameter $w_1$ are computed as shown in the figure.

The full process of one training iteration of logistic regression is as follows, assuming the feature vector has dimension 2:

# Assumes x1, x2, y are length-m arrays, w1, w2, b are the current parameters,
# and sigmoid() is defined as above
J = 0; dw1 = 0; dw2 = 0; db = 0
for i in range(m):
    z_i = w1 * x1[i] + w2 * x2[i] + b                            # forward: linear part
    a_i = sigmoid(z_i)                                           # forward: sigmoid activation
    J += -(y[i] * np.log(a_i) + (1 - y[i]) * np.log(1 - a_i))    # accumulate cross-entropy loss
    dz_i = a_i - y[i]                                            # backward: dL/dz
    dw1 += x1[i] * dz_i                                          # accumulate gradient for w1
    dw2 += x2[i] * dz_i                                          # accumulate gradient for w2
    db += dz_i                                                   # accumulate gradient for b
J /= m; dw1 /= m; dw2 /= m; db /= m                              # average over the m samples

Then $w_1$, $w_2$, and $b$ are updated iteratively.

There is a drawback to the above computation: the whole process contains two for loops:

  • The first for loop iterates over the $m$ samples
  • The second for loop iterates over all the features

When the number of features is large, explicitly using for loops in the code makes the algorithm inefficient. Vectorization can eliminate these explicit for loops.

5. Vectorization

Continuing with logistic regression: if $z = w^Tx + b$ is computed with a non-vectorized loop, the code looks like this:

z = 0
for i in range(n_x):
    z += w[i] * x[i]    # accumulate the dot product element by element
z += b

Vectorized operations can be executed in parallel, which greatly improves efficiency and makes the code more concise. (The NumPy library in Python is used here; to learn more about NumPy, see the NumPy tutorials in ShowMeAI's Illustrated Data Analysis series and the NumPy Quick Reference Guide.)

z = np.dot(w, x) + b

Without any explicit for loop, the vectorized pseudocode for one iteration of logistic regression gradient descent is as follows:


$$Z = w^TX + b = np.dot(w.T, X) + b$$

$$A = \sigma(Z)$$

$$dZ = A - Y$$

$$dw = \frac{1}{m}XdZ^T$$

$$db = \frac{1}{m}np.sum(dZ)$$

$$w := w - \alpha \, dw$$

$$b := b - \alpha \, db$$
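Putting it together, a minimal NumPy sketch of one vectorized iteration (assuming X of shape (n_x, m), Y of shape (1, m), w of shape (n_x, 1), the learning rate alpha, and the sigmoid function from earlier are already defined):

Z = np.dot(w.T, X) + b       # shape (1, m): linear part for all samples at once
A = sigmoid(Z)               # shape (1, m): predicted probabilities
dZ = A - Y                   # shape (1, m): gradient of the loss with respect to Z
dw = np.dot(X, dZ.T) / m     # shape (n_x, 1): averaged gradient for w
db = np.sum(dZ) / m          # scalar: averaged gradient for b
w = w - alpha * dw           # gradient descent updates
b = b - alpha * db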

References

  • Illustrated Machine Learning | Logistic Regression Algorithm
  • Illustrated Machine Learning | Machine Learning Basics
  • AI Math Foundations | Calculus and Optimization
  • Illustrated Data Analysis
  • NumPy Quick Reference Guide

ShowMeAI recommended tutorial series

  • Illustrated Python Programming: From Beginner to Master
  • Illustrated Data Analysis: From Beginner to Master
  • AI Math Foundations: From Beginner to Master
  • Illustrated Big Data Technology: From Beginner to Master
  • Illustrated Machine Learning Algorithms: From Beginner to Master
  • Machine Learning in Practice: Learn Machine Learning Hands-On
  • Deep Learning Tutorial | Andrew Ng's Specialization: Complete Course Notes

Recommended articles

  • Deep Learning Tutorial | Introduction to Deep Learning
  • Deep Learning Tutorial | Neural Network Basics
  • Deep Learning Tutorial | Shallow Neural Networks
  • Deep Learning Tutorial | Deep Neural Networks
  • Deep Learning Tutorial | Practical Aspects of Deep Learning
  • Deep Learning Tutorial | Neural Network Optimization Algorithms
  • Deep Learning Tutorial | Network Optimization: Hyperparameter Tuning, Regularization, Normalization, and Frameworks
  • Deep Learning Tutorial | AI Application Practice Strategies (Part 1)
  • Deep Learning Tutorial | AI Application Practice Strategies (Part 2)
  • Deep Learning Tutorial | Convolutional Neural Networks Explained
  • Deep Learning Tutorial | Classic CNN Architectures Explained
  • Deep Learning Tutorial | CNN Applications: Object Detection
  • Deep Learning Tutorial | CNN Applications: Face Recognition and Neural Style Transfer
  • Deep Learning Tutorial | Sequence Models and RNNs
  • Deep Learning Tutorial | Natural Language Processing and Word Embeddings
  • Deep Learning Tutorial | Seq2Seq Models and the Attention Mechanism