Title: Improving Deep Neural Networks: Hyperparameter Tuning (some ideas and strategies), Regularization (improving generalization, coping with overfitting) and Optimization (faster training, better convergence)

Course 2 is more practical and comes with several assignments. The code is still fun to read and write. Here is a summary of the theory it introduces:

  1. Different initialization schemes have a large effect on the results, so initialization is an important step for a complex neural network. There are several methods (He, Xavier), each suited to different scenarios. In general the initial weights should be neither too large nor too small: the more parameters there are (the more neurons in a layer), the smaller each individual weight should be (this is the trend; the exact relationship is given by each method's formula).
  2. Generally, the accuracy on the training set and on the test set is compared to decide which kind of tuning is needed (reducing bias or reducing variance).
  3. Coping with overfitting: L2 regularization and dropout, and how to implement them. In addition, there are data augmentation (getting more data), early stopping and other methods. Early stopping is a straightforward approach, but it is not orthogonal: it ties minimizing the cost to reducing overfitting. Although it seems cheaper, it makes later systematic analysis and tuning harder, so it is not recommended (it can still be tried for comparison).
  4. Implementing gradient checking (it verifies that backpropagation is correct, but it does not work while dropout is enabled).
  5. The experimental results show that dropout works well (in practice it can be used together with L2).
  6. Exploding and vanishing gradients can be greatly alleviated by good initialization. Local optima are actually almost never encountered in high-dimensional spaces; the real problem is saddle points, around which descent is very slow, so a better optimization method is needed.
  7. Mini-batch gradient descent is the training method commonly used in practice. Momentum, RMSprop and Adam improve the optimizer for faster training and better convergence. At their core is the exponentially weighted average, which is attractive because it is cheap to implement and a good approximation of a moving average. Momentum merges roughly the last n gradient vectors (about 1/(1 - beta) of them), so the vertical noise cancels out and all that is left is the momentum in the consistent direction (beta is perhaps 0.9). My personal understanding of RMSprop is that it rescales the gradient vector using an exponentially weighted average of the squared gradients, which pulls the step direction closer to 45 degrees (with possible deviation); the parameter beta here is very close to 1, maybe 0.999. Adam combines the two and gives very good results (although the computation is a little more involved).
  8. A hyperparameter (a decay rate) can be introduced to reduce the learning rate gradually as training proceeds.
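The trend in item 1 can be illustrated with a minimal sketch. It assumes the commonly quoted scalings, a standard deviation of sqrt(2 / n_in) for He and sqrt(1 / n_in) for one variant of Xavier, where n_in is the number of inputs to the layer. The helper names `init_weights` and `sample_std` are hypothetical and exist only for this illustration.

```python
import math
import random

def init_weights(n_in, n_out, method="he", seed=0):
    """Return an n_out x n_in weight matrix as a list of rows.

    Hypothetical helper: weights are drawn from a zero-mean Gaussian
    whose standard deviation shrinks as n_in grows.
    """
    if method == "he":        # commonly paired with ReLU
        std = math.sqrt(2.0 / n_in)
    elif method == "xavier":  # 1 / n_in variant, commonly paired with tanh
        std = math.sqrt(1.0 / n_in)
    else:
        raise ValueError(f"unknown method: {method}")
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def sample_std(matrix):
    """Empirical standard deviation over all entries of the matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

narrow = init_weights(n_in=10, n_out=200)
wide = init_weights(n_in=1000, n_out=200)
# The layer with more inputs ends up with smaller individual weights.
print(sample_std(narrow), sample_std(wide))
```

The point of scaling by n_in is that the weighted sum feeding each neuron keeps a similar magnitude no matter how wide the previous layer is.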
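Gradient checking (item 4) can be sketched as follows. This is only an illustration on a toy cost function, not the assignment's code: `cost` and `backprop_grad` are hypothetical stand-ins for a network's cost and its backpropagation gradient, and the two-sided difference with a relative-error measure is the usual recipe.

```python
import math

def cost(theta):
    # Hypothetical toy cost: J(theta) = sum of theta_i cubed.
    return sum(t ** 3 for t in theta)

def backprop_grad(theta):
    # Analytic gradient to be verified: dJ/dtheta_i = 3 * theta_i ** 2.
    return [3.0 * t ** 2 for t in theta]

def numerical_grad(f, theta, eps=1e-7):
    """Two-sided finite-difference estimate of the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def gradient_check(f, grad_fn, theta, eps=1e-7):
    """Relative difference between the analytic and numerical gradients."""
    analytic = grad_fn(theta)
    numeric = numerical_grad(f, theta, eps)
    diff = norm([a - n for a, n in zip(analytic, numeric)])
    return diff / (norm(analytic) + norm(numeric))

theta = [1.0, -2.0, 0.5]
# A very small value suggests the analytic gradient is correct.
print(gradient_check(cost, backprop_grad, theta))
```

Dropout breaks this check because the cost becomes random from one evaluation to the next, so the finite differences no longer measure the same function.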
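To make item 7 more concrete, here is a minimal sketch of the Adam update written as a combination of a momentum term and an RMSprop term, both exponentially weighted averages with bias correction. The function name `adam_minimize`, the toy gradient `bowl_grad` and the learning rate are assumptions chosen for this illustration; the beta values are the typical ones mentioned above.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run a fixed number of Adam steps and return the final parameters."""
    theta = list(theta)
    v = [0.0] * len(theta)  # momentum term: weighted average of gradients
    s = [0.0] * len(theta)  # RMSprop term: weighted average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            v[i] = beta1 * v[i] + (1.0 - beta1) * g[i]
            s[i] = beta2 * s[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction compensates for the zero initialization.
            v_hat = v[i] / (1.0 - beta1 ** t)
            s_hat = s[i] / (1.0 - beta2 ** t)
            theta[i] -= lr * v_hat / (math.sqrt(s_hat) + eps)
    return theta

def bowl_grad(theta):
    # Gradient of a toy elongated bowl J = 0.5 * (x**2 + 100 * y**2).
    x, y = theta
    return [x, 100.0 * y]

result = adam_minimize(bowl_grad, [5.0, 1.0])
print(result)
```

Dividing by the square root of `s_hat` is what evens out the steep and shallow directions of the bowl, while `v_hat` smooths away the oscillation along the steep one.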

How to tune hyperparameters:

  1. Random sampling is better than a fixed grid! It tries more distinct values along each single dimension.
  2. Some hyperparameters should be sampled on a log scale rather than uniformly over the raw range.
  3. When resources are limited, use the panda approach (babysit a single model with manual adjustment and continuous attention). When resources are plentiful, train several models in parallel, compare them, pick the best one, and then try again.
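The first two rules can be sketched together: draw each trial at random, and draw scale-sensitive hyperparameters on a log scale. The ranges below (learning rate between 1e-4 and 1, beta between 0.9 and 0.999) and the helper names are assumptions chosen for illustration.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose log10 is uniform between log10(low) and log10(high)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent

def sample_beta(rng, low=0.9, high=0.999):
    # Sample 1 - beta on a log scale so values close to 1 are explored finely.
    return 1.0 - sample_log_uniform(1.0 - high, 1.0 - low, rng)

rng = random.Random(0)
trials = [
    {"learning_rate": sample_log_uniform(1e-4, 1.0, rng),
     "beta": sample_beta(rng)}
    for _ in range(5)
]
for trial in trials:
    print(trial)
```

With uniform sampling over the raw range, about 90% of the learning rates would fall between 0.1 and 1; the log scale spends equal effort on every order of magnitude.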

Normalizing Z/A (batch normalization) fixes it to a given mean and variance. This means the range of inputs received by the next layer is also relatively fixed, so more stable results can be obtained (both in training and in the final predictions), along with a slight regularization effect as a side effect. Pay attention to how the normalization is implemented differently in the training phase and the test phase: at test time the mean and variance come from averages accumulated during training rather than from the current mini-batch.
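A minimal sketch of the train/test difference, assuming the technique in question is batch normalization applied to the pre-activations Z of a single unit. The class name `BatchNorm`, the `momentum` value and the initial running statistics are assumptions for this illustration, not a library API.

```python
import math

class BatchNorm:
    """Batch normalization for a single unit (hypothetical minimal helper)."""

    def __init__(self, gamma=1.0, beta=0.0, momentum=0.9, eps=1e-8):
        self.gamma, self.beta = gamma, beta   # learnable scale and shift
        self.momentum, self.eps = momentum, eps
        self.running_mean, self.running_var = 0.0, 1.0

    def forward(self, z_batch, training):
        if training:
            # Training: use the statistics of the current mini-batch.
            mean = sum(z_batch) / len(z_batch)
            var = sum((z - mean) ** 2 for z in z_batch) / len(z_batch)
            # Keep exponentially weighted averages for use at test time.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Test: a single example has no batch statistics of its own.
            mean, var = self.running_mean, self.running_var
        return [self.gamma * (z - mean) / math.sqrt(var + self.eps) + self.beta
                for z in z_batch]

bn = BatchNorm()
train_out = bn.forward([2.0, 4.0, 6.0, 8.0], training=True)
test_out = bn.forward([5.0], training=False)
print(train_out, test_out)
```

In training mode the outputs have mean 0 and variance close to 1 before `gamma` and `beta` rescale them; in test mode the same transformation is applied with the stored averages.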

Softmax is used for multi-class classification (as opposed to hardmax, which outputs something like 1, 0, 0), and there are ready-made functions to call.
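For intuition, here is a small sketch contrasting the two. The function names and the example logits are chosen for illustration; in practice one would call the ready-made function of whichever framework is in use.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hardmax(logits):
    """Put a 1 at the largest score and 0 everywhere else."""
    best = logits.index(max(logits))
    return [1 if i == best else 0 for i in range(len(logits))]

logits = [5.0, 2.0, -1.0, 3.0]
print(softmax(logits))
print(hardmax(logits))  # [1, 0, 0, 0]
```

Softmax keeps the ranking of the scores but stays differentiable, which is why it, rather than hardmax, is used during training.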

TensorFlow is structured like a big calculator that supports vectors and matrices, in which all the operations and data are stored first. The actual computation starts only when you press "=" (call session.run). Data, operations and sessions are its main concepts.
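The calculator analogy can be mimicked in a few lines of plain Python. This is a toy model of deferred evaluation only, not TensorFlow's API: `Node`, `constant`, `add`, `multiply` and `ToySession` are hypothetical names invented for this sketch.

```python
class Node:
    """A deferred value: it records an operation instead of computing it."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

def constant(value):
    return Node("const", value=value)

def add(a, b):
    return Node("add", (a, b))

def multiply(a, b):
    return Node("mul", (a, b))

class ToySession:
    """Evaluates a graph of nodes only when run() is called."""

    def run(self, node):
        if node.op == "const":
            return node.value
        left, right = (self.run(n) for n in node.inputs)
        if node.op == "add":
            return left + right
        if node.op == "mul":
            return left * right
        raise ValueError(f"unknown op: {node.op}")

a = constant(3)
b = constant(4)
c = add(multiply(a, a), b)     # nothing is computed yet; c is just a Node
result = ToySession().run(c)   # pressing "=": evaluates 3 * 3 + 4
print(result)  # 13
```

The data (`constant`), the operations (`add`, `multiply`) and the session mirror the three concepts named above.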