Deep learning and fundamentals of computer vision
I. Definitions of computer vision, artificial intelligence, and deep learning, and how they relate
1. Artificial intelligence and deep learning
 What is artificial intelligence?
 In essence, artificial intelligence is the simulation of human thinking and problem solving.
 What is deep learning?
 It is a class of algorithms loosely modeled on the structure of the human brain. It uses artificial neural networks as its architecture to extract higher-dimensional features and deeper logical relationships hidden in the data, so as to achieve more accurate results.
 Artificial intelligence: expert systems, physical models, etc.
 Machine learning: kNN, SVM, etc.
 Deep learning: fully connected neural networks, convolutional neural networks, recurrent neural networks, etc.
2. Computer Vision
 Computer vision is the study of how to make machines “see”. More specifically, it uses cameras and computers in place of human eyes to identify, track, and measure objects, and to further process and analyze the resulting images.
 Vision is the main source of information for the human brain and the gateway to the palace of artificial intelligence.
 Common areas of computer vision:
 Process control: guiding robot arms and industrial robots
 Navigation: autonomous driving or mobile robots
 Detection: video surveillance and face recognition
 Organizing information: intelligent search based on images and image sequences
 Modeling objects or environments: medical image analysis systems or terrain models
 Intelligent interaction: emotion recognition, human-computer interaction
 Four main tasks of computer vision:
 Image classification and recognition
 Semantic segmentation
 Object detection
 Instance segmentation
 Other tasks (image enhancement, target tracking, visual creativity)
II. Cutting-edge applications of deep learning in computer vision
 Face recognition
 OCR (optical character recognition): license plate and bank card number recognition
 Image search engines
 Autonomous driving
 Intelligent monitoring
 Visual creativity
 Robot arm guidance
Classical computer vision — digital image processing
1. Computer vision and digital image processing
 Computer vision imitates how human eyes and the brain “see” and “understand” pictures. The key words are “seeing” and “understanding”. The input is an image; the output is a model, recognition results, or other information extracted from the image, for example background segmentation, motion detection, object recognition, and face recognition.
 Digital image processing is the preprocessing applied to images before they are “viewed”, including transformation, analysis, reconstruction, and pixel-level processing of existing images. The input is an image and the output is also an image, for example image enhancement, denoising, and filtering.
 Computer graphics is similar to human “drawing”: images are generated by the computer. The input is a model and the output is an image (pixels), creating new visual content, for example fingerprint generation, 3D effects, and game and film production.
2. Image processing
 Color mode:
 RGB color mode: colors are produced by varying the red (R), green (G), and blue (B) channels and superimposing them. Each of the three channels takes a value in the range [0, 255].
 Grayscale: the range is [0, 255], where 0 represents black and 255 represents white.
 HSV: close to the way humans perceive color, so it is highly intuitive.
 Hue (H): the basic attribute of a color, commonly referred to as the color name, such as red or green, with a value range of [0°, 360°].
 Saturation (S): the purity of the color. The higher the saturation, the purer the color; the lower the saturation, the grayer it becomes. The value range is [0%, 100%].
 Value (V): also called brightness, with a value range of [0%, 100%].
 Transformation of color space:
 RGB to grayscale: Gray = (R + G + B) / 3
 RGB to HSV:
 HSV to RGB:
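 The RGB/HSV conversion formulas are not reproduced in these notes. As a minimal sketch, the conversions can be tried with Python's standard `colorsys` module, which works on values scaled to [0, 1]. The helper names `rgb_to_gray`, `rgb255_to_hsv`, and `hsv_to_rgb255` are illustrative assumptions, and the scaling to degrees and percent follows the ranges listed above.

```python
import colorsys

def rgb_to_gray(r, g, b):
    """Average-based grayscale: (R + G + B) / 3."""
    return (r + g + b) / 3

def rgb255_to_hsv(r, g, b):
    """Convert 8-bit RGB to (H in degrees, S in percent, V in percent)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360, s * 100, v * 100

def hsv_to_rgb255(h_deg, s_pct, v_pct):
    """Convert (H in degrees, S in percent, V in percent) back to 8-bit RGB."""
    r, g, b = colorsys.hsv_to_rgb(h_deg / 360, s_pct / 100, v_pct / 100)
    return round(r * 255), round(g * 255), round(b * 255)

print(rgb_to_gray(30, 60, 90))       # gray level of a bluish pixel
print(rgb255_to_hsv(255, 0, 0))      # pure red
print(hsv_to_rgb255(120, 100, 100))  # hue 120 degrees, fully saturated
```

 Many libraries use a weighted sum instead of the plain average for grayscale; the plain average is kept here to match the rule above.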
3. Image filtering (smoothing filtering and edge detection)
 Image filtering is the mathematical basis of the convolutional layer in deep neural networks.
 It is a kind of image preprocessing.
 Functions:
 Smoothing filtering: removes noise mixed into the image (noise reduction)
 Edge detection: extracts image features for image recognition
 Smoothing filtering method:
 Simple average method
 Gaussian filtering
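 Both smoothing methods slide a small kernel over the image and replace each pixel with a weighted sum of its neighbours. The sketch below assumes NumPy, 3x3 kernels, and edge padding; `filter2d` is a hypothetical helper written for this note, and the Gaussian kernel is the common integer approximation rather than an exact Gaussian.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide `kernel` over an edge-padded `image` and sum the weighted neighbours."""
    kh, kw = kernel.shape
    padded = np.pad(image.astype(float), ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

mean_kernel = np.full((3, 3), 1 / 9)         # simple average: all weights equal
gauss_kernel = np.array([[1, 2, 1],
                         [2, 4, 2],
                         [1, 2, 1]]) / 16    # centre weighted more than the corners

noisy = np.zeros((5, 5))
noisy[2, 2] = 9.0                            # a single noise spike
smoothed_mean = filter2d(noisy, mean_kernel)
smoothed_gauss = filter2d(noisy, gauss_kernel)
print(smoothed_mean[2, 2], smoothed_gauss[2, 2])
```

 The spike is spread over its neighbourhood in both cases; the Gaussian kernel keeps more of the centre value, which is why it blurs edges less than the simple average.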
 Edge detection method:
 Roberts operator:
 Prewitt operator:
 Sobel operator:
 Canny operator:
 A. Smooth the image with Gaussian filtering
 B. Calculate the gradient amplitude and direction with Sobel or another gradient operator
 C. Perform non-maximum suppression on the gradient amplitude
 For each pixel, compare its gradient amplitude with those of its two neighbors along the gradient direction. If it is the largest of the three it is kept; otherwise it is suppressed and set to 0.
 The direction of the gradient is perpendicular to the direction of the potential edge.
 D. Detect and connect edges with the double-threshold algorithm
 Pixels above Tmax are kept as edges and pixels below Tmin are suppressed. Pixels between Tmin and Tmax are kept only if they are connected to a pixel above Tmax; otherwise they are discarded.
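 As a partial sketch of the steps above, the code below computes the Sobel gradient amplitude and direction (step B) and splits pixels into strong and weak candidates with two thresholds (the first half of step D). It assumes NumPy; it does not implement non-maximum suppression or edge linking, so it is not a full Canny detector. The names `correlate3x3` and `sobel_edges` and the threshold values are illustrative assumptions.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)   # responds to horizontal intensity change
SOBEL_Y = SOBEL_X.T                             # responds to vertical intensity change

def correlate3x3(image, kernel):
    """Apply a 3x3 kernel to every pixel of an edge-padded image."""
    padded = np.pad(image.astype(float), 1, mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def sobel_edges(image, t_min, t_max):
    gx = correlate3x3(image, SOBEL_X)
    gy = correlate3x3(image, SOBEL_Y)
    magnitude = np.hypot(gx, gy)          # gradient amplitude
    direction = np.arctan2(gy, gx)        # gradient direction in radians
    strong = magnitude > t_max            # definitely edges
    weak = (magnitude >= t_min) & ~strong # kept only if linked to a strong pixel
    return magnitude, direction, strong, weak

image = np.zeros((5, 6))
image[:, 3:] = 255                        # vertical step edge between columns 2 and 3
magnitude, direction, strong, weak = sobel_edges(image, t_min=100, t_max=500)
print(strong.astype(int))
```

 For this vertical edge the gradient direction is 0 radians (along the x axis), which is perpendicular to the edge itself.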
 Comparison of the effects of different operators: [figure not preserved]
4. Image threshold segmentation
 A classical application of the computer vision segmentation task.
 Threshold segmentation based on the Otsu algorithm:
 Convert the image to grayscale
 Calculate the mean gray level w of the whole image
 Select a threshold T that divides the pixels into two classes, where N0 and N1 are the fractions of pixels in each class (N0 + N1 = 1)
 Calculate the mean gray level w0 of class N0 and w1 of class N1
 Calculate the between-class variance
 g = N0 * (w0 - w)^2 + N1 * (w1 - w)^2 = N0 * N1 * (w0 - w1)^2
 Traverse all candidate thresholds and choose the T for which g is largest
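 The traversal can be sketched as follows, assuming NumPy and an 8-bit grayscale image. N0 and N1 are taken as pixel fractions so that the short form of the between-class variance, N0 * N1 * (w0 - w1)^2, applies. The function name `otsu_threshold` and the tiny test image are illustrative assumptions.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold T in [1, 255] that maximises the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_g = 0, -1.0
    for t in range(1, 256):
        n0 = prob[:t].sum()               # fraction of pixels with gray level < T
        n1 = prob[t:].sum()               # fraction of pixels with gray level >= T
        if n0 == 0 or n1 == 0:
            continue                      # one class is empty, skip this threshold
        w0 = (levels[:t] * prob[:t]).sum() / n0
        w1 = (levels[t:] * prob[t:]).sum() / n1
        g = n0 * n1 * (w0 - w1) ** 2      # between-class variance
        if g > best_g:
            best_t, best_g = t, g
    return best_t

gray = np.array([[10, 12, 11, 200],
                 [9, 13, 198, 202]], dtype=np.uint8)
t = otsu_threshold(gray)
binary = gray >= t                        # True for the bright class
print(t, int(binary.sum()))
```

 Any threshold between the dark cluster and the bright cluster gives the same split here, so only the resulting segmentation is meaningful, not the exact value of T.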
5. Basic morphological filtering: dilation, erosion, opening, and closing
 Used to denoise images, a form of data cleaning.
 Dilation: enlarges the bright (white) areas of an image by adding pixels to the perceived boundaries of the objects in it. Often used to expand edges or fill small holes.
 Erosion: shrinks objects by removing pixels along their perceived boundaries. Often used to extract the backbone of an image and to eliminate isolated pixels and noise.
 Opening: erosion first, then dilation.
 Closing: dilation first, then erosion.
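 A minimal sketch of the four operations on a binary mask, assuming NumPy and a 3x3 square structuring element. The helper `_scan` and the border handling (pad with background for dilation, with foreground for erosion) are assumptions chosen for this example.

```python
import numpy as np

def _scan(mask, pad_value, combine):
    """Combine each pixel's 3x3 neighbourhood with `combine` (OR or AND)."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=pad_value)
    out = padded[1:h + 1, 1:w + 1].copy()
    for di in range(3):
        for dj in range(3):
            out = combine(out, padded[di:di + h, dj:dj + w])
    return out

def dilate(mask):
    return _scan(mask, False, np.logical_or)    # a pixel turns on if any neighbour is on

def erode(mask):
    return _scan(mask, True, np.logical_and)    # a pixel stays on only if all neighbours are on

def opening(mask):
    return dilate(erode(mask))                  # removes small bright specks

def closing(mask):
    return erode(dilate(mask))                  # fills small dark holes

square = np.zeros((9, 9), dtype=bool)
square[2:7, 2:7] = True

noisy = square.copy()
noisy[0, 8] = True                              # isolated noise pixel
holed = square.copy()
holed[4, 4] = False                             # small hole inside the object

print(np.array_equal(opening(noisy), square))
print(np.array_equal(closing(holed), square))
```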
Classical computer vision algorithms
1. Hough transform
 Mainly used to identify regular shapes.
 Recognition and extraction of higher-order features.
 Theory:
 In the automatic analysis of digital images, one of the most common subproblems is detecting simple shapes such as lines, circles, and ellipses. In most cases an edge detector is first used to preprocess the image, turning the original image into one that contains only edges. Because the image or the edge detection is imperfect, some points or pixels are missing, or noise makes the detected boundary deviate from the actual boundary. The detected edges therefore cannot be grouped directly into lines, circles, and ellipses. The Hough transform solves this problem: through its voting step, the parameters of a shape can be found in parameter space, and from those parameters the computer can tell which shape the edge belongs to.
 Steps:
 A. Select the type of shape to be identified
 B. Map each edge point from the Cartesian coordinate system into the parameter space of that shape
 C. Look for intersections to determine the recognized shape (by accumulating votes and finding local maxima in the parameter space)
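 For straight lines, the three steps can be sketched with the usual (rho, theta) parameterisation, rho = x cos(theta) + y sin(theta). This is a minimal pure-Python sketch under assumed settings (1 degree angle steps, rho rounded to the nearest integer); `hough_lines` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

def hough_lines(points, theta_step_deg=1):
    """Each edge point votes for every (rho, theta) line that passes through it."""
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, theta_deg)] += 1
    return votes

edge_points = [(x, 5) for x in range(50)]    # edge pixels on the horizontal line y = 5
votes = hough_lines(edge_points)
(best_rho, best_theta), count = votes.most_common(1)[0]
print(best_rho, best_theta, count)
```

 The cell with the most votes is where the sinusoids of all the edge points intersect; here that is the line at distance 5 from the origin with its normal at 90 degrees.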
2. Template matching
 A classical application of the computer vision recognition task.
 Template matching is one of the most primitive and basic pattern recognition methods. It studies where the pattern of a particular object is located in an image and thereby identifies the object. It is the most basic and most common matching method in image processing. Its main limitation is that it only handles translation: if the target in the original image is rotated or changes in size, the algorithm fails.
 A template is a small, known image. Template matching searches a large image for a target, given that the target is present in the image and has the same size, orientation, and image content as the template. A matching algorithm then finds the target in the image and determines its coordinates.
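 One simple matching criterion is the sum of squared differences (SSD) between the template and each window of the image. The sketch below assumes NumPy and grayscale arrays; SSD is one choice among several similarity measures, and `match_template` is a hypothetical helper written for this note.

```python
import numpy as np

def match_template(image, template):
    """Return the top-left (row, col) of the window with the smallest SSD, and that SSD."""
    ih, iw = image.shape
    th, tw = template.shape
    best_score, best_pos = np.inf, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = image[r:r + th, c:c + tw].astype(float)
            score = np.sum((window - template) ** 2)
            if score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(20, 20))
template = image[7:11, 12:16].copy()       # cut the template out of the image itself
position, score = match_template(image, template)
print(position, score)
```

 Because the window is only shifted, never rotated or rescaled, this sketch shares the limitation described above.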
3. Defects of classical computer vision algorithms
 In practical applications, classical computer vision algorithms are not robust to interference and noise.
 They generalize poorly across rotation, scale, deformation, occlusion, brightness changes, and similar variations.
 Improvement:
 SIFT algorithm:
 The scale-invariant feature transform (SIFT) is a feature descriptor used in the field of image processing.
 SIFT features are based on local points of interest in an object's appearance and are independent of image size and rotation. They also tolerate changes in lighting, noise, and small changes in viewpoint quite well. Because of these properties they are highly distinctive and relatively easy to extract, so objects can be identified in a large feature database with few misidentifications. Detection rates for partially occluded objects are also quite high with SIFT descriptors; as few as 3 SIFT features are enough to compute position and orientation. With current computer hardware and a small feature database, recognition can approach real-time speed. SIFT features carry a large amount of information and are suitable for fast, accurate matching in massive databases.
 Cascade algorithm (Cascade classifier)
 Solution: Convolutional neural network
Convolutional neural network
1. Basic introduction of neural network
 An artificial neural network (ANN) is a complex network structure formed by a large number of interconnected processing units (neurons). It is an abstraction, simplification, and simulation of the organizational structure and operating mechanism of the human brain: an information processing system, based on the structure and function of the brain's neural networks, that simulates neural activity with a mathematical model.
 An artificial neural network can have a single layer or multiple layers. Each layer contains a number of neurons, and the neurons are connected by directed arcs with variable weights. By repeatedly training the network on known information and adjusting the connection weights step by step, the network learns to process information and to model the relationship between input and output. It does not need to know the exact relationship between input and output, and does not need a large number of parameters to be specified; it only needs to know the non-constant factors that cause the output to change, that is, the non-constant parameters. Compared with traditional data processing methods, neural networks therefore have clear advantages in handling fuzzy, random, and nonlinear data, and are especially suitable for systems that are large in scale, complex in structure, and unclear in information.
 The multilayer feedforward neural network (also called the multilayer perceptron), studied by Minsky and Papert, is the most commonly used network structure at present.
2. Basic constitution of neural network
 Neurons:
 Neural network:
 It’s made up of multiple neurons.
 Fully connected neural network:
 Convolutional Neural Network (CNN) :
 Mainly used in computer vision.
 Recurrent neural network (RNN) :
 Mainly used in natural language processing.
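 A single neuron and a fully connected layer can be sketched as follows, assuming NumPy and a sigmoid activation. The names `neuron` and `dense_layer` and the example weights are illustrative assumptions.

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, W, b):
    """Fully connected layer: every output neuron is connected to every input."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

x = np.array([1.0, 2.0])                 # two input features
W = np.array([[0.5, -0.25],
              [1.0, 1.0],
              [0.0, 0.0]])               # three neurons, one row of weights each
b = np.zeros(3)
y = dense_layer(x, W, b)
print(y)
```

 Each row of `W` is the weight vector of one neuron, so the layer is just several neurons evaluated at once on the same input.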
3. Common activation functions
 Sigmoid function:
 Formula: f(x) = 1 / (1 + e^(-x))
 Historically one of the most widely used activation functions. It is S-shaped and is the closest in physical meaning to a biological neuron. Its output lies in (0, 1), so it can be interpreted as a probability or used to normalize data.
 Disadvantages:
 A. Soft saturation: the derivative is f'(x) = f(x)(1 - f(x)), which gradually approaches 0 on both sides as x approaches positive or negative infinity. During back propagation, the gradient passed down through a sigmoid contains a factor of f'(x), so once the input falls into the saturated region f'(x) is close to zero and the back-propagated gradient becomes very small. The network parameters then become hard to train effectively, which is called the vanishing gradient problem. With sigmoid activations, gradients generally vanish within about 5 layers.
 B. The output of the sigmoid function is always greater than 0, so the output is not zero-mean. This is called bias shift, and it means the neurons in the next layer receive a non-zero-mean signal from the previous layer as input.
 Tanh function:
 Formula: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
 Compared with the sigmoid function, the mean output of the tanh function is 0, which makes it converge faster than sigmoid and so reduces the number of iterations.
 Disadvantages:
 It also has soft saturation, which causes vanishing gradients.
 ReLU function:
 Formula: f(x) = max(0, x)
 ReLU stands for Rectified Linear Unit.
 For x > 0 it does not saturate, so the gradient does not decay, which alleviates the vanishing gradient problem. This allows deep neural networks to be trained directly in a supervised manner, rather than relying on unsupervised layer-by-layer pretraining. However, as training progresses, some inputs fall into the hard saturation region (x <= 0), where the corresponding weights can no longer be updated. This is called “neuron death”.
 Like sigmoid, the mean output of ReLU is greater than 0, so bias shift and neuron death jointly affect the convergence of the network.
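 The three activation functions and their derivatives can be sketched in plain Python. The last line illustrates the vanishing gradient argument under a simplifying assumption: it ignores the weights and multiplies only the largest possible sigmoid derivative, 0.25, across 5 layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # f'(x) = f(x)(1 - f(x)), at most 0.25

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2    # also approaches 0 for large |x| (soft saturation)

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0      # no decay for x > 0, zero gradient otherwise

# Upper bound on how much 5 stacked sigmoids can scale a gradient.
max_scale_after_5_layers = sigmoid_grad(0.0) ** 5
print(max_scale_after_5_layers)
```

 Even in the best case the gradient shrinks to under a thousandth of its size after 5 sigmoid layers, whereas the ReLU derivative stays at 1 for active units.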
4. Neural network training
 Loss function:
 The loss function evaluates the difference between the model's predicted values and the true values. The smaller the loss, the better the model generally performs. Different models generally use different loss functions.
 Loss functions are divided into empirical risk loss functions and structural risk loss functions. The empirical risk loss function measures the difference between the predicted result and the actual result; the structural risk loss function is the empirical risk loss function plus a regularization term.
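 As a small illustration of the two kinds of loss, the sketch below uses mean squared error as the empirical risk and adds an L2 penalty on the weights to obtain a structural risk. Both choices, and the coefficient `lam`, are assumptions made for the example; other losses and regularizers follow the same pattern.

```python
def empirical_risk(y_true, y_pred):
    """Mean squared error between the true values and the predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def structural_risk(y_true, y_pred, weights, lam=0.01):
    """Empirical risk plus an L2 regularization term on the model weights."""
    return empirical_risk(y_true, y_pred) + lam * sum(w * w for w in weights)

y_true = [1.0, 2.0]
y_pred = [1.0, 4.0]
weights = [1.0, 2.0]
print(empirical_risk(y_true, y_pred))
print(structural_risk(y_true, y_pred, weights, lam=0.1))
```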
 Gradient descent:
 Gradient descent is an iterative method that can be used to solve least squares problems (both linear and nonlinear). When solving for the model parameters of a machine learning algorithm, which is an unconstrained optimization problem, gradient descent is one of the most commonly used methods; the least squares method is another. To minimize the loss function, gradient descent iterates step by step along the negative gradient until it reaches a minimum of the loss function and the corresponding model parameters. Conversely, to maximize a function, iterate with gradient ascent. In machine learning, two variants have been developed from basic gradient descent: stochastic gradient descent and batch gradient descent.
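 A minimal sketch of batch gradient descent on a linear least squares problem: fitting y = w * x + b by repeatedly stepping against the gradient of the mean squared error. The learning rate, step count, and the name `fit_line` are assumptions chosen so that this small example converges.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w              # step against the gradient
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]             # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 4), round(b, 4))
```

 Stochastic gradient descent would use the gradient of a single randomly chosen sample in each step instead of the sum over all samples.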
 Back propagation algorithm:
 The back propagation algorithm (BP algorithm) is a learning algorithm for multilayer neural networks, built on gradient descent. The input-output relation of a BP network is essentially a mapping: an n-input, m-output BP neural network implements a continuous mapping from n-dimensional Euclidean space to a finite domain in m-dimensional Euclidean space, and this mapping is highly nonlinear. Its information processing ability comes from the repeated composition of simple nonlinear functions, which gives it a strong ability to reproduce functions. This is the basis for applying the BP algorithm.
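 Back propagation can be sketched for a network with one tanh hidden layer and a linear output layer: the loss gradient is pushed backwards through each layer with the chain rule. NumPy, the squared-error loss, the layer sizes, and the function names are all assumptions made for this example.

```python
import numpy as np

def forward(params, x):
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ x + b1)              # hidden layer
    y = W2 @ h + b2                       # linear output layer
    return h, y

def loss_and_grads(params, x, target):
    """Squared-error loss and its gradients, computed layer by layer from the output back."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    loss = 0.5 * np.sum((y - target) ** 2)
    dy = y - target                       # dL/dy
    dW2 = np.outer(dy, h)
    db2 = dy
    dh = W2.T @ dy                        # back through the output layer
    dz = dh * (1.0 - h ** 2)              # back through tanh: derivative is 1 - tanh(z)^2
    dW1 = np.outer(dz, x)
    db1 = dz
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 3)), np.zeros(4), rng.normal(size=(2, 4)), np.zeros(2)]
x = rng.normal(size=3)
target = np.array([0.5, -0.5])
loss, grads = loss_and_grads(params, x, target)
print(loss, [g.shape for g in grads])
```

 A gradient descent step then subtracts a small multiple of each gradient from the matching parameter; comparing one entry against a finite-difference estimate is a quick way to check the derivation.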
5. Introduction to convolutional neural networks
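 Processing pipeline:
 Image input -> convolution layer -> pooling layer -> fully connected layer -> result output
 Convolution layer:
 A convolution kernel (filter) slides over the input and performs the convolution operation at each position.
 Size after convolution: for an N×N input, an F×F kernel, padding P, and stride S, the output is O×O with O = floor((N + 2P - F) / S) + 1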
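 The usual output-size rule, floor((N + 2P - F) / S) + 1 for input size N, kernel size F, padding P, and stride S, can be checked against a naive single-channel convolution. The sketch assumes NumPy, a square kernel, and zero padding; like most deep learning libraries it computes cross-correlation (no kernel flip). The function names are illustrative assumptions.

```python
import numpy as np

def conv_output_size(n, f, padding=0, stride=1):
    """Output side length for input size n, kernel size f, given padding and stride."""
    return (n + 2 * padding - f) // stride + 1

def conv2d(image, kernel, padding=0, stride=1):
    """Naive single-channel convolution layer with zero padding."""
    image = np.pad(image, padding)        # zero padding on all sides
    f = kernel.shape[0]
    out_h = (image.shape[0] - f) // stride + 1
    out_w = (image.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9
out = conv2d(image, kernel, padding=1, stride=2)
print(out.shape, conv_output_size(6, 3, padding=1, stride=2))
```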
6. Classical convolutional neural network structure
 AlexNet
 ResNet
 The residual network is easy to optimize and can improve accuracy by increasing depth. Its residual blocks use skip connections to alleviate the vanishing gradient problem caused by increasing depth in deep neural networks.
 Inception

 When building a convolutional layer, you have to decide whether the filter size should be 1×1, 3×3, or 5×5, or whether to add a pooling layer. The purpose of the Inception network is to make that decision for you: it applies these options in parallel and concatenates their outputs. Although the network architecture becomes more complex, it performs very well.
Optimization of convolutional neural networks

1. Measure the performance of the model
 Take the prediction of classification tasks as an example:
 TP = True Positive: the sample is actually labeled “yes” in the test data set and the model predicts “yes”.
 FP = False Positive: the sample is actually labeled “no” in the test data set and the model predicts “yes”.
 TN = True Negative: the sample is actually labeled “no” and the model predicts “no”.
 FN = False Negative: the sample is actually labeled “yes” and the model predicts “no”.
 Accuracy:
 Accuracy is the overall proportion of correct predictions, positive and negative.
 accuracy = (TP + TN)/(TP + FP + TN + FN)
 Disadvantages: in practice, it is not suitable for imbalanced data sets.
 Precision:
 Precision is the proportion of positive predictions that are correct.
 precision = TP/(TP + FP)
 Precision is usually used when avoiding a large number of false positives matters most.
 Recall/sensitivity:
 Recall (sensitivity) is the proportion of all actually positive samples that the model predicts as positive.
 recall = sensitivity = TP/(TP + FN)
 Recall is usually used when detecting as many true positives as possible matters most.
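 The three metrics can be computed directly from the four counts. A minimal pure-Python sketch, assuming binary labels encoded as 1 for “yes” and 0 for “no”; the function names and the example labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 means "yes"."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```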
2. Optimization of overfitting problem
 Overfitting problem:
 Overfitting means making the hypothesis overly strict in order to fit the training data perfectly. The model then fits noise in the data as well, which leads to errors when predicting future data.
 A. Splitting into training, validation, and test sets:
 The existing data set is divided into a training set, a validation set, and a test set (usually 60%, 20%, and 20%). The training set is used to train the model, the model's performance on the validation set is used to tune the model, and the test set is used for the final evaluation.
 B. Cross validation:
 The data is divided into several equal parts, which take turns serving as the training set and the validation set. The mean and variance of the errors across the groups are then used to judge the model.
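 The rotation of folds in cross validation can be sketched in plain Python. The generator below only produces index lists; it assumes the samples have already been shuffled, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 5))
for train, validation in folds:
    print(validation, len(train))
```

 Each sample is used for validation exactly once; the model is trained k times and the k validation errors are averaged.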
 C. Data augmentation:
 Data augmentation generates new images by rotating, deforming, mirroring, changing the brightness or color of, or adding white noise to the original images.
 D. Regularization:
 Keep all the feature variables, but reduce the magnitude of their parameters.
 E. Dropout (random deactivation):
 During each training step, only a random subset of the nodes in each layer is trained.
 This effectively avoids the biased results caused by a few nodes carrying too much of the weight.
 Disadvantages: training takes longer, usually 2 to 3 times as long as for the same network without dropout.
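 Dropout can be sketched as a random mask applied to a layer's activations. The version below is the "inverted dropout" variant, which rescales the surviving units during training so that nothing needs to change at test time; that variant, NumPy, and the `keep_prob` parameter name are assumptions made for this example.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout: zero a random subset of units and rescale the rest."""
    if not training:
        return activations                # all units are used at test time
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones(1000)
h_train = dropout(h, keep_prob=0.8, rng=rng)
h_test = dropout(h, keep_prob=0.8, rng=rng, training=False)
print((h_train > 0).mean(), h_train.max())
```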
3. Optimization of gradient disappearance/explosion problem
 In essence, vanishing and exploding gradients are caused by the multiplication effect in gradient back propagation when the network has too many layers.
 A. Non-saturating activation functions:
 ReLU: the derivative of the activation function is 1 for positive inputs.
 Leaky ReLU: keeps almost all of ReLU's advantages and also addresses the zero gradient that ReLU has for negative inputs.
 B. Gradient clipping:
 When the gradient exceeds a set threshold, it is scaled back to that threshold.
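 One common form of gradient clipping rescales the whole gradient vector when its norm exceeds the threshold, which preserves its direction. A minimal sketch assuming NumPy; clipping by norm rather than clipping each element separately is a choice made for this example.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Scale `grad` down so that its L2 norm is at most `max_norm`."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploding = np.array([3.0, 4.0])          # norm 5
clipped = clip_by_norm(exploding, max_norm=1.0)
small = clip_by_norm(np.array([0.1, 0.2]), max_norm=1.0)
print(clipped, small)
```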
 C. Initialization of network parameters:
 Xavier initialization:
 Initializes the weights from a uniform distribution over a fixed range.
 Xavier's derivation rests on several assumptions. One is that the activation function is linear, which does not hold for ReLU. Another is that the activation values are symmetric about 0, which does not hold for the sigmoid function.
 He initialization:
 Corrects the shortcoming of Xavier initialization for ReLU activations.
 Pretrained initialization (transfer learning):
 Start from a network with pretrained weights and fine-tune from there.
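 The two initialization schemes can be sketched with their commonly used formulas: Xavier draws uniformly from [-sqrt(6 / (fan_in + fan_out)), +sqrt(6 / (fan_in + fan_out))], and He draws from a normal distribution with standard deviation sqrt(2 / fan_in). NumPy, the uniform/normal variants, and the function names are assumptions made for this example.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Uniform weights over a fixed range set by the layer's input and output sizes."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    """Normal weights with variance 2 / fan_in, intended for ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_xavier = xavier_uniform(256, 256, rng)
W_he = he_normal(256, 256, rng)
print(np.abs(W_xavier).max(), W_he.std())
```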
 D. Batch Normalization:
 Machine learning relies on a very important assumption: the independent and identically distributed (i.i.d.) assumption, which states that training data and test data follow the same distribution. It is the basic guarantee that a model trained on the training data will perform well on the test set. Batch normalization (BN) keeps the distribution of the inputs to each layer of a deep neural network stable during training.
 Functions:
 Prevents gradient explosion
 Mitigates internal covariate shift and improves learning efficiency
 Reduces reliance on good weight initialization
 Helps reduce overfitting
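 The training-time forward pass of batch normalization can be sketched as: normalize each feature over the batch, then apply a learnable scale and shift. NumPy, the parameter names `gamma` and `beta`, and the value of `eps` are assumptions made for this example; at inference time, running averages of the mean and variance are normally used instead of the batch statistics.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each column (feature) of a batch, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # 64 samples, 4 features
normalized = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0), normalized.std(axis=0))
```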
4. Optimization of model training
 A. Mini-batch training
 Divide the training set into batches and train on them one at a time.
 Advantages:
 Each update is faster, which improves training speed
 Randomness is introduced into the training process, which helps the search escape local optima
 Disadvantages:
 More iterations are needed to converge, so the overall training time can be long
 B. Gradient Descent with Momentum optimizer
 C. RMSProp optimizer
 D. Adam (adaptive moment estimation) optimizer
 Combines the momentum gradient descent optimizer with the RMSProp optimizer.
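 The way Adam combines the two ideas can be sketched on a one-dimensional problem: `m` is the momentum-style running average of the gradient, and `v` is the RMSProp-style running average of the squared gradient. The hyperparameter values are the commonly quoted defaults apart from the learning rate, and `adam_minimize` is a hypothetical helper for this example.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-D function given its gradient, recording every iterate."""
    x, m, v = x0, 0.0, 0.0
    history = [x]
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # momentum: average of gradients
        v = beta2 * v + (1 - beta2) * g * g      # RMSProp: average of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(x)
    return x, history

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 5.
x_final, history = adam_minimize(lambda x: 2 * x, x0=5.0)
print(history[1], x_final)
```

 Because the step is the ratio of the two averages, the very first update has a size close to the learning rate regardless of how large the gradient is.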
5. Other optimization strategies
 Bayes limit
 The theoretical limit on the accuracy that can be achieved from the data that has been collected.
 Typical ordering of accuracy: Bayes limit > human recognition > training accuracy > validation and test accuracy > long-term accuracy in the real application environment
 A. Reduce training error:
 More complex models
 Longer training and optimization
 Better hyperparameters
 B. Reduce validation and test error (the overfitting problem):
 More comprehensive data
 Strategies for solving overfitting
 A simpler model structure and parameter combination
 C. Satisficing metrics and optimizing metrics:
 Among the models that satisfy the satisficing metrics, select the one that is best on the optimizing metric.
 There is usually one optimizing metric, and the rest are satisficing metrics.
 D. Considerations for the output layer:
 Linear: Regression prediction
 Sigmoid: binary classification
 Softmax: multi-class classification
 E. Training and optimization on imbalanced data:
 Use data augmentation to increase the number of samples in under-represented classes
 Modify the loss function to give higher weight to samples from under-represented classes
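 Both the softmax output layer and a class-weighted loss can be sketched in plain Python. The weighting scheme below, which multiplies the cross-entropy of each sample by a per-class weight, is one simple way to give rare classes more influence; the function names and the weights are assumptions made for this example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                                  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(logits, label, class_weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    probs = softmax(logits)
    return -class_weights[label] * math.log(probs[label])

logits = [2.0, 0.5, 0.1]
class_weights = [1.0, 1.0, 5.0]                      # class 2 is rare, so it is weighted up
loss_common = weighted_cross_entropy(logits, 0, class_weights)
loss_rare = weighted_cross_entropy(logits, 2, class_weights)
print(softmax(logits), loss_common, loss_rare)
```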