- Author: Han Xinzi @ShowMeAI
- Tutorial address: https://www.showmeai.tech/tutorials/37
- This article: https://www.showmeai.tech/article-detail/266
- Disclaimer: all rights reserved. Please contact the platform and the author before reprinting, and cite the source.
- Bookmark ShowMeAI for more great content
This series is a complete set of study notes for the Stanford CS231n course "Deep Learning for Computer Vision"; the corresponding course videos can be viewed here. See the end of the article for more resources.
Introduction
In the previous article, Deep Learning and CV Tutorial (6) | Neural Network Training Skills (Part 1), ShowMeAI covered activation function selection (\(sigmoid\) and \(tanh\) both suffer from saturation), weight initialization (weights should be neither too small nor too large, and Xavier initialization is a good default), and data preprocessing (mean subtraction and normalization). In a linear classifier these two preprocessing steps make the decision boundary less sensitive to slight rotations, and they also make the network less sensitive to small changes in the weights, which makes optimization easier. Batch normalization can further transform the input to each layer into a unit Gaussian distribution, followed by a learned scale and shift. During training we track the loss and accuracy, and hyperparameter tuning proceeds from a coarse range to a fine one, gradually increasing the number of iterations and using random search.
The focus of this article
- Better optimization
- Regularization
- Transfer learning
- Model ensembles
1. Better optimization (parameter update)
For more detail on optimization algorithms, you can also read the explanation of neural network optimization algorithms in ShowMeAI's Deep Learning Tutorial | Andrew Ng's Specialization · Full Notes Interpretation.
1.1 Batch Gradient Descent (BGD)
Batch gradient descent (BGD) uses the entire training set \(\{x_1, \cdots, x_n\}\), together with the label \(y_i\) of every sample, at each training iteration to compute the loss and gradient, and then updates the parameters by gradient descent.
When computing on the entire dataset, as long as the learning rate is low enough, each step is guaranteed not to increase the loss. The reference code is as follows (where `learning_rate` is a hyperparameter):
# Vanilla gradient descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -learning_rate * weights_grad  # parameter update
The figure below shows a loss function with only two parameters. With repeated updates, and some luck, the optimization finally converges to the lowest point in the red region:
- Advantage: since every step uses all of the training data, the gradient is guaranteed to be \(0\) when the loss reaches its minimum, so the method can converge without gradually reducing the learning rate.
- Disadvantage: as the dataset grows, each update becomes slower and slower.
1.2 Stochastic Gradient Descent (SGD)
The stochastic gradient descent discussed here is effectively MBGD (mini-batch gradient descent): at each iteration a batch of samples \(\{x_1, \cdots, x_m\}\) with labels \(y_i\) is randomly selected, the gradient is computed by backpropagation, and the parameters are updated in the direction of the negative gradient. A minimal sketch is shown below.
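A minimal mini-batch SGD sketch, in the same style as the gradient-descent snippet above (the helper `sample_training_data` and the batch size 256 are illustrative assumptions, not code from the course):
# Vanilla mini-batch gradient descent (illustrative sketch)
while True:
    data_batch = sample_training_data(data, 256)  # sample a mini-batch of 256 examples (assumed helper)
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -learning_rate * weights_grad  # parameter update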
The advantage of SGD is that training is fast, and it can converge quickly even on large datasets. But SGD in practice has several problems:
① If the loss decreases quickly along one parameter direction and slowly along another, the updates "zigzag" down toward the minimum; this is common in high dimensions.
- The image above shows a ravine-shaped region whose lowest-loss points lie along the blue line. Consider the gradient at a point A on the surface; it can be decomposed into two components, one along the \(w_1\) direction and one along \(w_2\).
- The component along \(w_1\) is much larger, so each step reduces the loss more along \(w_1\) than along \(w_2\), even though the minimum lies in the \(w_2\) direction. As a result the steps move mostly along \(w_1\) and only slightly along \(w_2\), so the trajectory oscillates back and forth across the ravine and "zigzags" toward the minimum.
② If the loss function has local minima or saddle points (critical points that are neither maxima nor minima), the gradient there is \(0\), so the parameter update gets stuck or oscillates near the minimum.
- In high dimensions, saddle points are the more common and more serious problem. At a local minimum the loss increases in every direction, whereas at a saddle point it increases in some directions and decreases in others, and updates become very slow as the saddle point is approached.
③ SGD is stochastic: the gradients come from small batches of data (computing the true gradient on all the data would be too slow), so they are noisy; the descent path is therefore very tortuous and convergence is slow.
Below are several optimization algorithms built on top of mini-batch gradient descent.
1.3 Momentum update
Update methods with momentum almost always result in better convergence rates on deep networks.
The loss value can be understood as the height of a hilly terrain (so the potential energy is \(U=mgh\)). Initializing the parameters with random numbers is equivalent to placing a particle at some position with initial velocity \(0\), so the optimization process can be seen as the parameter vector (i.e. the particle) rolling over the terrain.
The force on the rolling particle comes from the potential energy, \(F = -\nabla U\), i.e. the negative gradient of the loss function (think of a convex loss: where the gradient is positive the particle rolls in the negative direction and the corresponding parameter decreases; where the gradient is negative it rolls in the positive direction and the parameter increases). Since \(F=ma\), the particle's acceleration is proportional to the negative gradient, so velocity gradually builds up along the negative-gradient direction.
In SGD the gradient directly determines the particle's position, so wherever the gradient is \(0\) the position stops updating. Here, instead, the gradient acts as a force on the velocity, and the velocity changes the position: even where the gradient is \(0\), the velocity accumulated from previous gradients remains. The momentum of an object is its tendency to keep moving in its current direction, so the particle still has momentum and its position keeps updating, which lets it escape local minima or saddle points and continue updating the parameters. A decay (friction) coefficient must be applied to the velocity, however, otherwise by conservation of energy the particle would keep oscillating at the bottom of the valley forever.
In other words, the update direction is determined not only by the gradient at the current point, but also by the gradients accumulated before it.
As before, each iteration randomly selects a batch of samples \(\{x_1, \cdots, x_m\}\) with labels \(y_i\), computes the loss and gradient, and then updates the velocity and the parameters (assuming unit mass, so v can be read as the momentum):
v = 0
while True:
    dW = compute_gradient(W, X_train, y_train)
    v = rho * v - learning_rate * dW  # integrate velocity
    W += v                            # integrate position
- `rho` controls how much the velocity `v` decays at each step. If the gradient at every iteration is `dW`, the accumulated velocity `v` converges toward \(\frac{-learning\_rate \cdot dW}{1-rho}\).
- When `rho` is \(0\), the update reduces to plain SGD. Typical values of `rho` are \(0.5\), \(0.9\) and \(0.99\), corresponding roughly to speeding up learning by a factor of two, 10 and 100 respectively.
Momentum update can solve several problems of SGD mentioned above:
- Because the parameter update accumulates previous gradients, if we look at the two gradient components separately, the components along \(w_1\) largely cancel each other out while the components along \(w_2\) reinforce each other. The decay coefficient prevents perfect cancellation, but the update is still accelerated, which greatly alleviates the slow "zigzag" convergence. This is also why momentum reduces oscillation.
- At local minima and saddle points, the previously accumulated velocity carries the update through.
- When a noisy gradient points in a very different direction, the particle still carries a large velocity; reversing the update direction would first require that velocity to decay to \(0\). Because the velocity is cumulative, individual noisy gradients have little influence, so convergence is smoother and faster.
1.4 Nesterov Momentum
Nesterov momentum differs slightly from ordinary momentum and has recently become popular. In theory it has better convergence guarantees for convex functions, and in practice it also tends to perform a little better than standard momentum.
- With ordinary momentum, we take the velocity at the current point and also compute the gradient at that same point; the actual update is a compromise between the velocity direction and the gradient direction.
- With Nesterov momentum, since we already know the momentum will carry the particle to a new position (a "look-ahead" point), we compute the gradient at that look-ahead position instead of the current one, and use it to update the parameters.
This code becomes:
v = 0
while True:
    W_ahead = W + rho * v
    dW_ahead = compute_gradient(W_ahead, X_train, y_train)
    v = rho * v - learning_rate * dW_ahead
    W += v
The momentum term is unchanged; only the gradient is now evaluated at the look-ahead point.
In practice, people prefer update rules that look as simple as vanilla SGD or ordinary momentum. This can be achieved with the variable transform `W_ahead = W + rho * v`, then expressing the update above in terms of `W_ahead` instead of `W`. In other words, the parameter vector actually stored is always the look-ahead version. The reference code is as follows:
v = 0
while True:
    pre_v = v
    dW = compute_gradient(W, X_train, y_train)
    v = rho * v - learning_rate * dW
    W += -rho * pre_v + (1 + rho) * v
The derivation process is as follows:
The original Nesterov momentum can be replaced by the following mathematical expression:
$$ v_{t+1}=\rho v_t - \alpha \nabla f(x_t+\rho v_t) $$
$$ x_{t+1}=x_t+v_{t+1} $$
Now let \(\tilde{x}_t =x_t+\rho v_t\), then:
$$ v_{t+1}=\rho v_t-\alpha \nabla f(\tilde{x_t}) $$
$$ \begin{aligned} \tilde{x}_{t+1} &=x_{t+1}+\rho v_{t+1}\\ &=x_{t}+v_{t+1} +\rho v_{t+1}\\ &=\tilde{x}_{t}-\rho v_{t}+v_{t+1}+\rho v_{t+1} \end{aligned} $$
Thus there are:
$$ \tilde{x}_{t+1}=\tilde{x_t}-\rho v_t+(\rho+1)v_{t+1} $$
- So at each step we only need to update \(v_t\) and \(\tilde{x}_t\).
The schematic diagram is as follows:
1.5 Adaptive Gradient Algorithm (Adagrad)
The methods above apply the same learning rate to every parameter, but a single rate may not suit them all. It can work better to give each parameter its own learning rate and adapt it automatically. AdaGrad is such an adaptive learning-rate algorithm, proposed by Duchi et al.
The reference code is as follows:
eps = 1e-7
grad_squared = 0
while True:
    dW = compute_gradient(W)
    grad_squared += dW * dW
    W -= learning_rate * dW / (np.sqrt(grad_squared) + eps)
AdaGrad is actually very simple: for each parameter dimension it accumulates the sum of squared historical gradients, and at update time divides the step by the square root of that accumulated value.
The variable `grad_squared` has the same shape as the gradient matrix and accumulates, element by element, the sum of squared gradients for each parameter; it is then used to normalize the update step, also element by element. `eps` (generally set between `1e-4` and `1e-8`) is a smoothing term that prevents division by \(0\).
- Advantage: it automatically adapts the learning rate for each parameter dimension. Where gradients are large, the effective learning rate decays quickly, damping progress along that dimension; where gradients are small, the rate decays slowly, so progress along that dimension is relatively faster.
- Disadvantage: because the accumulated squared gradient only grows, the effective learning rate eventually becomes very small, and training can get stuck near a local minimum or stop making progress too early (the RMSProp algorithm addresses this well).
1.6 Root Mean Square Prop Algorithm (RMSProp)
The RMSProp optimization algorithm can also automatically adjust the learning rate, and RMSProp chooses a different learning rate for each parameter.
RMSProp adds a decay factor on top of AdaGrad: when accumulating squared gradients it balances the "past" against the "present", controlled by the hyperparameter `decay_rate`, whose common values are \([0.9, 0.99, 0.999]\). Everything else stays the same; only the update of `grad_squared` changes, taking a form similar to the momentum update:
grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dW * dW
Compared with AdaGrad, this solves the problem of training stalling too early: unlike AdaGrad, the effective learning rate no longer shrinks monotonically.
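For completeness, here is a minimal RMSProp loop in the same style as the AdaGrad snippet above (my own sketch; the variable names follow the earlier code):
eps = 1e-7
grad_squared = 0
while True:
    dW = compute_gradient(W)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dW * dW  # leaky accumulation of squared gradients
    W -= learning_rate * dW / (np.sqrt(grad_squared) + eps)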
1.7 Adaptive-Momentum Optimization (Adam)
The momentum update adds a first moment (first-order momentum) to SGD, while AdaGrad and RMSProp add a second moment (second-order momentum). Combining the two yields the Adam optimizer: Adaptive + Momentum.
The reference code is as follows:
eps = 1e-8
first_moment = 0   # first moment: accumulates gradients, speeds up training
second_moment = 0  # second moment: accumulates squared gradients, adapts the learning rate
while True:
    dW = compute_gradient(W)
    first_moment = beta1 * first_moment + (1 - beta1) * dW          # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dW * dW   # AdaGrad / RMSProp
    W -= learning_rate * first_moment / (np.sqrt(second_moment) + eps)
The reference code above looks like RMSProp with momentum, but this version of Adam has a problem: in the first few steps `second_moment` is still very small, which can make the effective step size very large, so the full Adam algorithm adds a bias correction.
The reference code is as follows:
eps = 1e-8
first_moment = 0   # first moment: accumulates gradients, speeds up training
second_moment = 0  # second moment: accumulates squared gradients, adapts the learning rate
for t in range(1, num_iterations + 1):
    dW = compute_gradient(W)
    first_moment = beta1 * first_moment + (1 - beta1) * dW          # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dW * dW   # AdaGrad / RMSProp
    first_unbias = first_moment / (1 - beta1 ** t)    # bias correction, fades as t grows; compensates for the small initial moments
    second_unbias = second_moment / (1 - beta2 ** t)
    W -= learning_rate * first_unbias / (np.sqrt(second_unbias) + eps)
The parameter values recommended in the paper are `eps=1e-8`, `beta1=0.9`, `beta2=0.999`, and `learning_rate = 1e-3` or `5e-4`; these work well for most models.
In practice, we recommend Adam as the default algorithm, which generally runs a little better than RMSProp.
1.8 Learning Rate Annealing
All of the optimization methods above use a learning rate as a hyperparameter.
When training deep networks, it is often helpful to let the learning rate decay over time. It can be understood like this:
- If the learning rate is high, the kinetic energy of the system is too large, the parameter vector will jump irregularly, and it will not be able to stabilize to the deeper and narrower part of the loss function.
- Knowing when to start decaying the learning rate is tricky: decay it too slowly and you waste computation watching the parameters bounce around chaotically with little real progress; decay it too quickly and the system loses energy too fast and never reaches the best position it could have.
Generally, there are 3 ways to implement learning rate decay:
① Step decay: reduce the learning rate by some factor every few epochs. Typical choices are halving the learning rate every 5 epochs, or multiplying it by 0.1 every 20 epochs.
These values depend heavily on the specific problem and model. In practice you may see the following empirical approach: train with a fixed learning rate while watching the validation error, and whenever the validation error stops decreasing, multiply the learning rate by a constant (such as 0.5).
② Exponential decay: \(\alpha=\alpha_0 e^{-kt}\), where \(\alpha_0, k\) are hyperparameters and \(t\) is the iteration count (epochs can also be used as the unit).
③ 1/t decay: \(\alpha=\alpha_0/(1+kt)\), where \(\alpha_0, k\) are hyperparameters and \(t\) is the iteration count.
Step decay is preferred in practice because its hyperparameters (the decay factor and the step interval in epochs) are more interpretable than \(k\).
If you have enough computing resources, you can make the decay slower and make the training time longer.
Learning-rate decay is generally needed with methods like SGD (with momentum), while adaptive methods such as Adam need it less. Don't apply it from the start; train first, watch the loss curve, and decide where the learning rate should be reduced.
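A minimal sketch of the three decay schedules above (my own illustration; the constants are example values, not recommendations from the course):
import numpy as np

lr0 = 1e-3  # initial learning rate (example value)

def step_decay(epoch, drop=0.5, every=5):
    # halve the learning rate every 5 epochs
    return lr0 * (drop ** (epoch // every))

def exp_decay(t, k=0.01):
    # alpha = alpha_0 * exp(-k t)
    return lr0 * np.exp(-k * t)

def inv_t_decay(t, k=0.01):
    # alpha = alpha_0 / (1 + k t)
    return lr0 / (1 + k * t)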
1.9 Second-Order Method (Second-Order)
In the context of deep networks, the second class of commonly used optimization methods is based on the Newton method, whose iterations are as follows:
$$ x \leftarrow x - [H f(x)]^{-1} \nabla f(x) $$
\(H f(x)\) is the Hessian matrix consisting of the second partial derivatives of \(f(x)\):
$$ \mathbf{H}=\left[\begin{array}{cccc} \frac{\partial^{2} f}{\partial x_{1}^{2}} & \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{2} \partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{n}^{2}} \end{array}\right] $$
\(x\) is an \(n\)-dimensional vector and \(f(x)\) is a scalar, so the Hessian matrix is \(n \times n\).
\(\nabla f(x)\) is the \(n\)-dimensional gradient vector, the same one computed by backpropagation.
This method converges quickly and allows for more efficient parameter updates. There is no learning rate hyperparameter in this formulation, which is a huge advantage over first-order methods.
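A minimal sketch of a single Newton update (my own illustration; `compute_gradient` and `compute_hessian` are assumed helpers, not functions from the course code):
import numpy as np

grad = compute_gradient(x)     # n-dimensional gradient (assumed helper)
H = compute_hessian(x)         # n x n Hessian matrix (assumed helper)
x -= np.linalg.solve(H, grad)  # solve H * step = grad rather than forming the inverse explicitly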
However, this update is impractical for real deep learning applications, because computing (and inverting) the Hessian is extremely expensive in both time and memory. Hence various quasi-Newton methods have been developed to approximate the inverse Hessian.
The most popular of these is L-BFGS, which uses the gradient information accumulated over time to build the approximation implicitly (the full matrix is never formed).
However, even when the storage problem is solved, a major drawback of L-BFGS is that it needs to be computed over the entire training set, which typically contains millions of samples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is tricky and remains an active research topic.
1.10 Practical applications
Tip: use Adam as the default; if full-batch updates are affordable, L-BFGS is worth trying.
2. Regularization
For more detail on regularization, you can also read the explanation of regularization in the article Practical Aspects of Deep Learning, part of ShowMeAI's Deep Learning Tutorial | Andrew Ng's Specialization · Full Notes Interpretation.
2.1 Motivation for regularization
When we increase the number and size of a neural network's hidden layers, the capacity of the network increases: the neurons can cooperate to express more complex functions. Take, for example, a binary classification problem on a 2D plane. We can train three different networks, each with a single hidden layer but different numbers of neurons, with the following results:
In the figure above, you can see that a neural network with more neurons can express more complex functions. However, this is both an advantage and a disadvantage:
- The advantage is that more complex data can be classified
- The disadvantage is that it may cause overfitting to the training data.
Overfitting means the network fits the noise in the data very well while ignoring the (assumed) underlying relationship in the data. For example, in the picture above:
- A network with a hidden layer of 20 neurons fits all the training data, but at the cost of turning the decision boundary into many disconnected red and green regions.
- The expressive power of a model with 3 neurons can only be used to classify data in a relatively broad way. It treats the data as two large blocks, and treats individual red dots within the green area as noise. In practice, this leads to better generalization on the test data.
Does that mean " if the data is not complex enough, a smaller network seems better because it prevents overfitting "?
No, there are many ways to prevent overfitting of neural networks (L2 regularization, dropout and input noise, etc.). In practice, using these methods to control overfitting is much better than reducing the number of neurons in the network.
Don't choose a small network just out of fear of overfitting. Instead, use as large a network as you can afford and rely on regularization tricks to control overfitting.
Each neural network in the above figure has 20 hidden layer neurons, but as the regularization strength increases, the decision boundary of the network becomes smoother. So, regularization strength is a good way to control overfitting of neural networks .
There is a small example on the ConvNetsJS demo that everyone can practice.
2.2 Regularization method
There are a number of ways to prevent overfitting by controlling the capacity of a neural network:
L2 Regularization : The most commonly used regularization, implemented by penalizing the square of all parameters in the objective function.
- For every weight \(w\) in the network, add a term \(\frac{1}{2}\lambda w^2\) to the objective; the \(\frac{1}{2}\) makes the derivative cleaner, and \(\lambda\) is the regularization strength.
- Intuitively, L2 regularization heavily penalizes large weights and favors more diffuse weight vectors, encouraging the network to use all of its input features a little rather than relying heavily on a few of them.
L1 regularization : is another relatively common regularization method that adds a \(\lambda \mid w \mid\) to the objective function for each \(w\).
- L1 regularization makes the weight vector sparse (i.e. very close to \(0\)) during optimization. In practice, L2 regularization will generally perform better than L1 regularization unless some explicit feature selection is particularly concerned.
- L1 and L2 regularization can also be combined: \(\lambda_1 \mid w \mid + \lambda_2 w^2\), called Elastic net regularization.
Max norm constraints: The weight vector \(w \) must satisfy the L2 norm \(\Vert \vec{w} \Vert_2 < c\), and \(c\) is generally 3 or 4. This regularization also has the nice property that even when the learning rate is set too high, there is no numerical "explosion" in the network because its parameter updates are always limited.
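As a concrete illustration, here is a minimal sketch (my own, not code from the course) of adding an L2 penalty to a loss and its gradient for a weight matrix `W`, with strength `lam`:
import numpy as np

def add_l2_penalty(loss, dW, W, lam):
    # add 0.5 * lambda * sum(W^2) to the loss, and lambda * W to the gradient
    loss += 0.5 * lam * np.sum(W * W)
    dW += lam * W
    return loss, dW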
But in neural networks, the most commonly used regularization method is called Dropout, which we will introduce in detail below.
2.3 Dropout
1) Dropout overview
Dropout is a simple and extremely effective regularization method, introduced by Srivastava et al. in the paper [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf). It complements methods such as L1 regularization, L2 regularization and max-norm constraints.
During training, dropout keeps each neuron active with probability \(p\) (a hyperparameter, usually \(0.5\)) and sets it to \(0\) otherwise. It is most commonly applied to fully connected layers.
A sample dropout implementation for a three-layer neural network:
""" 普通版随机失活"""
p = 0.5 # 神经元被激活的概率。p值越高,失活数目越少
def train_step(X):
""" X中是输入数据 """
# 前向传播
H1 = np.maximum(0, np.dot(W1, X) + b1)
U1 = np.random.rand(*H1.shape) < p # 第一个随机失活掩模
# rand可以返回一个或一组服从“0~1”均匀分布的随机样本值
# 矩阵中满足小于p的元素为True,不满足False
# rand()函数的参数是两个或一个整数,不是元组,所以需要*H1.shape获取行列
H1 *= U1 # U1中False的H1对应位置置零
H2 = np.maximum(0, np.dot(W2, H1) + b2)
U2 = np.random.rand(*H2.shape) < p # 第二个随机失活掩模
H2 *= U2 # drop!
out = np.dot(W3, H2) + b3
# 反向传播:计算梯度... (略)
# 进行参数更新... (略)
In the code above, the `train_step` function applies dropout twice: on the first hidden layer and on the second. Dropout can also be applied to the input layer, in which case a binary (keep/drop) mask is created for the input data \(X\). Backpropagation stays almost the same; the backward gradient is simply multiplied by the same mask to obtain the gradient through the dropout layer.
2) Understanding of Dropout
Why is this idea desirable? One explanation is to prevent mutual adaptation between features:
- For example, each neuron has learned a feature of a cat, such as tail, whiskers, claws, etc., and combining all these features can determine a cat.
- With dropout, the network can only rely on a scattered subset of these features at a time rather than all of them together, which suppresses overfitting to some extent; otherwise accuracy would be high during training but low at test time.
Another reasonable explanation is:
- During training, random deactivation can be thought of as sampling some subsets of the complete neural network, updating only the parameters of the sub-network each time based on the input data.
- Each binary mask corresponds to one model, and a network with \(n\) neurons has \(2^n\) possible masks, so dropout is equivalent to training an enormous ensemble of networks (with shared parameters) at the same time.
3) Avoid random deactivation during testing
During training the dropping is random, but this randomness is undesirable at test time. Instead of applying dropout, we would ideally ensemble the large number of sub-networks and compute the expected prediction.
For example, consider a single neuron \(a\):
At test time, with no dropout applied:
$$ \text{E}(a)=w_1x+w_2y $$
If during training each input is kept with probability \(0.5\), the expected output is:
$$ \text{E}(a)=\frac{1}{4}(w_1x+w_2y)+\frac{1}{4}(w_1x+w_2\cdot 0)+\frac{1}{4}(w_1\cdot 0+w_2y)+\frac{1}{4}\cdot 0=\frac{1}{2}(w_1x+w_2y) $$
So a slightly imprecise but very practical approach is to scale the activations by the keep probability \(p\) at test time, so that the prediction-time output matches the expected output during training. The test code:
def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p  # NOTE: scale the activations by p
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # NOTE: scale the activations by p
    out = np.dot(W3, H2) + b3
The drawback of this approach is that the activations must be rescaled by \(p\) at test time. Since test-time performance is critical, inverted dropout is preferred in practice:
- the scaling is done at training time, so the test-time forward pass stays unchanged.
Inverted dropout has the additional benefit that the prediction code is identical whether or not dropout is used during training. The reference implementation is as follows:
"""
反向随机失活: 推荐实现方式.
在训练的时候drop和调整数值范围,测试时不做任何事.
"""
p = 0.5
def train_step(X):
# 前向传播
H1 = np.maximum(0, np.dot(W1, X) + b1)
U1 = (np.random.rand(*H1.shape) < p) / p # 第一个随机失活遮罩. 注意/p!
H1 *= U1 # drop!
H2 = np.maximum(0, np.dot(W2, H1) + b2)
U2 = (np.random.rand(*H2.shape) < p) / p # 第二个随机失活遮罩. 注意/p!
H2 *= U2 # drop!
out = np.dot(W3, H2) + b3
def predict(X):
# 前向传播时模型集成
H1 = np.maximum(0, np.dot(W1, X) + b1) # 不用数值范围调整了
H2 = np.maximum(0, np.dot(W2, H1) + b2)
out = np.dot(W3, H2) + b3
More generally, dropout is one instance of introducing stochastic behavior into the network's forward pass. The idea of adding randomness during training and then averaging it out (or approximating that average) at test time appears in many places:
- Batch normalization : The mean and variance for training come from random mini-batches; for testing, the empirical variance and mean over the entire training process are used.
- Data augmentation: for example, an image of a cat can be randomly cropped and flipped during training, while at test time a fixed set of crops (the four corners, the center, and their flips) is evaluated. Brightness and contrast can also be varied randomly during training, along with color jitter (e.g. PCA-based color augmentation).
- DropConnect : Another study similar to Dropout is DropConnect , in which a series of weights are randomly set to \(0\) during forward propagation.
- Fractional Max Pooling : Random region pooling during training, fixed region or average during testing. This method is not commonly used.
- Stochastic Depth : A relatively deep network, randomly selects some layers for training during training, and uses all layers during testing. This research is very cutting-edge.
In summary, these methods inject random noise during training and then marginalize it out at test time, either analytically (as with dropout, by multiplying by \(p\)) or numerically (e.g. by sampling many sub-networks, running forward passes through different ones, and averaging the results).
4) Practical experience
Some common practical recommendations:
- Use a single, global L2 regularization strength found by cross-validation.
- Combine L2 regularization with dropout applied after all layers.
- The dropout probability \(p\) is usually set to \(0.5\) by default, and can be tuned on the validation set.
3. Transfer Learning
For more detail on transfer learning, you can also read the article AI Application Practice Strategies (Part 2) in ShowMeAI's Deep Learning Tutorial | Andrew Ng's Specialization · Full Notes Interpretation.
Another cause of overfitting is having too few training samples. Transfer learning addresses this, allowing CNNs to be used effectively even with very little data.
3.1 The idea of transfer learning
- Step 1: train a CNN on a large dataset to obtain a model (e.g. on ImageNet, which has 1000 categories).
- Step 2: on the small dataset, the required number of classes is no longer 1000 but some smaller value \(C\), say 10. The weight matrix of the last fully connected layer becomes \(4096 \times C\); reinitialize this matrix and retrain only the linear classifier, keeping all earlier layers frozen, since they have already been trained and generalize well.
- Step 3: when more training data becomes available, more layers can be unfrozen and trained, for example the last three fully connected layers; these parameters are fine-tuned with a lower learning rate. A minimal sketch is given below.
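A minimal PyTorch sketch of steps 2 and 3 (my own illustration using torchvision's pretrained ResNet-18; the course does not prescribe a specific framework, and the class count and learning rate are example values):
import torch.nn as nn
import torch.optim as optim
from torchvision import models

C = 10  # number of target classes (example value)
model = models.resnet18(pretrained=True)   # model pre-trained on ImageNet
for param in model.parameters():
    param.requires_grad = False            # Step 2: freeze all pre-trained layers
model.fc = nn.Linear(model.fc.in_features, C)  # replace the final classifier; trained from scratch

# Step 3: once more data is available, unfreeze some layers and fine-tune with a low learning rate
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4, momentum=0.9)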
3.2 Application
Transfer learning is used in both object detection and image captioning: the image-processing part starts from a CNN pre-trained on ImageNet, and its parameters are then fine-tuned for the specific task.
So if you care about a dataset that is too small, you can look for a model pre-trained on a large dataset with similar data, then fine-tune or retrain some of its layers for your own problem. Commonly used deep learning frameworks ship with pre-trained models that can be used directly:
- Caffe : https://github.com/BVLC/caffe/wiki/Model-Zoo
- TensorFlow : https://github.com/tensorflow/models
- PyTorch : https://github.com/pytorch/vision
4. Model Ensembles
In practice, one method that reliably improves a neural network's accuracy by a few percentage points is to train several independent models and then average their predictions at test time.
As the number of ensemble models increases, the results of the algorithm also improve monotonically (but with less and less effect).
The greater the difference between the models, the better the boosting effect may be.
There are several ways to integrate:
- Same model, different initialization . Use cross-validation to get the best hyperparameters, and then use the best parameters to train the model with different initialization conditions. The risk of this approach is that the diversity only comes from different initialization conditions.
- Find the best model in cross-validation . Use cross-validation to get the best hyperparameters, and then take the best few (say 10) models for ensemble. This increases the diversity of the ensemble, but at the risk of including suboptimal models. In practice, this is relatively simple to operate, and no additional training is required after cross-validation.
- Multiple checkpoints of a single model. If training is very expensive, save checkpoints of the network at different times (e.g. at the end of every epoch) and ensemble them. Clearly this gives less variety, but it still works reasonably well in practice, and the cost is low.
- Running average of the parameters during training. Related to the previous point, another cheap trick that can gain one or two percentage points is to keep, in memory, a second copy of the network's weights that is updated as an exponentially decaying running average of the weights during training. This way you are averaging the network's state over the last several iterations, and this "smoothed" version of the weights almost always achieves lower error. The intuition is that the objective is bowl-shaped and the network keeps jumping around the bottom, so the average is more likely to land near the center. A minimal sketch follows.
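A minimal sketch (my own illustration; the decay constant 0.995 is an example value) of keeping an exponentially decaying running average of the weights and using the smoothed copy at test time:
ema_W = W.copy()          # smoothed copy of the weights
while True:
    dW = compute_gradient(W, X_train, y_train)
    W -= learning_rate * dW
    ema_W = 0.995 * ema_W + 0.005 * W   # exponentially decaying running average
# evaluate the model with ema_W instead of W at test time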
5. Expand your learning
You can view the [bilingual subtitles] version of the videos on Bilibili:
- 【Course Study Guide】Stanford CS231n | Deep Learning and Computer Vision
- [Subtitles + Data Download] Stanford CS231n | Deep Learning and Computer Vision (2017 · All 16 lectures)
- [CS231n Advanced Course] Michigan EECS498 | Deep Learning and Computer Vision
- [Deep Learning Course] Wu Enda Special Course · Interpretation of a full set of notes
- 【Stanford Official Website】CS231n: Deep Learning for Computer Vision
6. Summary of key points
- Optimization methods: SGD, momentum, Nesterov momentum, AdaGrad, RMSProp, Adam, etc.; Adam is a solid default choice. There are also learning-rate annealing and second-order methods.
- Regularization: L2 is more commonly used, and Dropout is also a good regularization method.
- Transfer learning can be used when there is less data.
- Model ensembles.
Stanford CS231n full set of interpretation
- Deep Learning and CV Tutorial (1) | CV Introduction and Basics
- Deep Learning and CV Tutorial (2) | Image Classification and Machine Learning Basics
- Deep Learning and CV Tutorial (3) | Loss Function and Optimization
- Deep Learning and CV Tutorial (4) | Neural Network and Backpropagation
- Deep Learning and CV Tutorial (5) | Convolutional Neural Network
- Deep Learning and CV Tutorial (6) | Neural Network Training Skills (Part 1)
- Deep Learning and CV Tutorial (7) | Neural Network Training Skills (Part 2)
- Deep Learning and CV Tutorial (8) | Introduction to Common Deep Learning Frameworks
- Deep Learning and CV Tutorial (9) | Typical CNN Architecture (Alexnet, VGG, Googlenet, Restnet, etc.)
- Deep Learning and CV Tutorial (10) | Lightweight CNN Architecture (SqueezeNet, ShuffleNet, MobileNet, etc.)
- Deep Learning and CV Tutorial (11) | Recurrent Neural Network and Vision Applications
- Deep Learning and CV Tutorial (12) | Object Detection (Two Stages, R-CNN Series)
- Deep Learning and CV Tutorial (13) | Target Detection (SSD, YOLO series)
- Deep Learning and CV Tutorial (14) | Image Segmentation (FCN, SegNet, U-Net, PSPNet, DeepLab, RefineNet)
- Deep Learning and CV Tutorial (15) | Visual Model Visualization and Interpretability
- Deep Learning and CV Tutorial (16) | Generative Models (PixelRNN, PixelCNN, VAE, GAN)
- Deep Learning and CV Tutorial (17) | Deep Reinforcement Learning (Markov Decision Process, Q-Learning, DQN)
- Deep Learning and CV Tutorial (18) | Deep Reinforcement Learning (Gradient Policy, Actor-Critic, DDPG, A3C)
Featured Recommendations in ShowMeAI Series Tutorials
- Dachang Technology Realization Program Series
- Graphical Python Programming: From Beginner to Mastery series of tutorials
- Graphical Data Analysis: From Beginner to Mastery Tutorial Series
- Graphical AI Mathematical Fundamentals: From Beginner to Mastery Series Tutorials
- Illustrated Big Data Technologies: From Beginner to Mastery Series
- Illustrated Machine Learning Algorithms: From Beginner to Mastery Tutorial Series
- Machine learning combat: teach you how to play machine learning series
- Deep Learning Tutorial: Wu Enda Special Course · Interpretation of a full set of notes
- Natural Language Processing Course: Stanford CS224n Course · Course Learning and Full Note Interpretation
- Deep Learning and Computer Vision Tutorial: Stanford CS231n · Interpretation of a full set of notes