人工智能 - Deep Learning Tutorial | Practical Aspects of Deep Learning - ShowMeAI研究中心

ShowMeAI研究中心

Author: Han Xinzi @ShowMeAI
Tutorial address : https://www.showmeai.tech/tutorials/35
Address of this article : https://www.showmeai.tech/article-detail/216
Disclaimer: All rights reserved, please contact the platform and the author for reprinting and indicate the source
Bookmark ShowMeAI for more exciting content

第2门课改善深层神经网络：超参数调试、正则化以及优化，第1周：深度学习的实用层面

This series is based on the learning and summary of Mr. Wu Enda's "Deep Learning Specialization". The corresponding course videos can be viewed here .

introduction

In the previous ShowMeAI article Deep Neural Networks we covered the following:

The structure of a deep neural network.
Deep Neural Network Forward Propagation and Back Propagation Process.
Reasons for needing deep neural networks.
Neural networks participate in hyperparameters.
A simple comparison of neural network and human brain

The content of this article corresponds to " Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization ", the second course of Mr. Wu Enda's deep learning series. In the second course, Mr. Wu Enda discussed and explained how to optimize the neural network model, such as adjusting hyperparameters, improving the running speed of the algorithm, etc.

1. Data division: training/validation/test set

训练，验证，测试集 Train / Dev / Test sets

1.1 Iterative optimization of deep learning practice

Practical application of deep learning is an iterative process.

When building a neural network, we need to set many hyperparameters, such as the number of layers of the neural network (#Layers), the number of neurons contained in each hidden layer (#Hidden Units), the learning rate (, Learning Rates), Selection of Activation Functions, etc. In fact, it is difficult to select these optimal hyperparameters at the first setting, but need to be obtained through continuous iterative updates.

The loop iteration process is as follows:

深度学习优化迭代过程

① Generate an idea , select the initial parameter values, and build a neural network model structure;
② Realize the above ideas through code ;
③ Verify the performance of the neural network corresponding to these hyperparameters through experiments.
④ According to the verification results, we appropriately adjust and optimize the hyperparameters, and then perform the next Idea->Code->Experiment cycle. Through many cycles, the hyperparameters are continuously adjusted, and the best parameter values are selected to optimize the performance of the neural network.

In the above iteration process, the key to determining the speed of the entire training process is the time spent in a single cycle. The faster the single cycle, the faster the training process. Setting appropriate training sets, development sets, and test sets sizes can effectively improve training efficiency. The above data is partly derived from the division of the overall data in the process of building the model:

Training Sets : Use the training set to train the algorithm or model.
Validation set ((Development Sets) : Use validation set (also known as simple cross validation set, hold-out cross validation set) for cross validation to select the best model .
Test Sets : Finally, use the test set to test the model to obtain an unbiased estimate of the model's operation (to evaluate the learning method).

1.2 The division method in the pre-big data era

In the era of small data volume, such as 100, 1000, 10000 data volume, the data set can be divided according to the following proportions:

深度学习中的数据切分

Without validation set : 70%, 30%
With validation set : 60%, 20%, 20%

1.3 Division of Big Data Era

In today's era of big data, for a problem, the scale of the data set we have may be in the millions, so the proportion of the validation set and the test set will tend to become smaller.

About the validation set

The purpose of the validation set is to verify which algorithm is more efficient, so the validation set only needs to be large enough to verify which of about 2-10 algorithms is better, instead of using 20% of the data as the validation set. For example, it is enough to extract 10,000 data from millions of data as a validation set.

About the test set

The main purpose of the test set is to evaluate the effect of the model. For example, in a single classifier, often in millions of data, we choose 10,000 of them to evaluate the effect of a single model.

For big data scenarios of different magnitudes, we can use the following data division methods for training sets, development sets, and test sets:

深度学习中的数据切分

1 million data volume : 98%, 1%, 1%
Over one million data volumes : 99.5%, 0.25%, 0.25% (or 99.5%, 0.4%, 0.1%)

1.4 Recommendations for data partitioning

It is recommended that the validation set and the training set come from the same distribution (the same data source), which can make the machine learning algorithm faster and obtain better results.

Suppose you develop a mobile app that allows users to upload pictures, and the app recognizes pictures of cats . In an app recognition algorithm, your training samples may come from network downloads, while your validation and test samples may come from uploads from different users. Pictures downloaded from the Internet generally have higher pixels and are more regular, while pictures uploaded by users often have unstable pixels and different picture quality. In this case, the role of validation set and test set is affected.

The test set may not be required if unbiased estimates are not required to evaluate the performance of the model.

Test sets The goal of the test set is mainly to make unbiased estimates . We can train different algorithm models through Train sets, and then verify them on Dev sets respectively, and select the best algorithm model according to the results. This is also possible, and there is no need for unbiased estimation. If there are only Train sets and Dev sets, some people usually call the Dev sets here Test sets, and we should pay attention to the distinction.

2. Model Estimation: Bias/Variance

偏差，方差 Bias / Variance

2.1 Model Status and Evaluation

Bias and variance are two very important concepts and problems to be solved in the field of machine learning. In traditional machine learning algorithms, Bias and Variance are opposites, corresponding to underfitting and overfitting, respectively. We often need to make a trade-off between Bias and Variance. In deep learning, we can reduce Bias and Variance at the same time to build the best neural network model.

Let's first sort out the concepts mentioned above:

Bias : It measures the degree of deviation between the expected prediction of the learning algorithm and the actual result, that is, it describes the fitting ability of the learning algorithm itself.
Variance : It measures the change in learning performance caused by the change of the training set of the same size, that is, it depicts the impact of data disturbance.
Noise : It expresses the lower bound of the expected generalization error that any learning algorithm can achieve on the current task, which describes the difficulty of the learning problem itself .

The figure is an example diagram of several model states (High Bias, Just Right, High Variance) corresponding to the binary classification problem on the two-dimensional plane.

模型估计：偏差/方差

Among them, High Bias corresponds to underfitting , and High Variance corresponds to overfitting . In the case of underfitting, High Bias occurs, i.e. the data cannot be classified well.

In this example, the input features are two-dimensional, and High Bias and High Variance can be seen directly from the classification line in the figure. For the case where the input features are high-dimensional, how to judge whether there is High Bias or High Variance?

模型估计：偏差/方差

This is especially done with the help of the evaluation of several datasets we mentioned in the previous section (for the evaluation of the model, you can also refer to the ShowMeAI article Graphical Machine Learning | Model Evaluation Methods and Guidelines )

模型估计：偏差/方差

Generally speaking, the error rate of the training set reflects whether there is a Bias (bias), and the error rate of the validation set (difference from the training set) reflects whether there is a Variance (variance). After training a model:

The error rate of the training set is small, but the error rate of the validation set is relatively large, indicating that the model has a large variance (Variance), and overfitting may occur.
The error rates of the training set and the validation set are large, and the two are equivalent, indicating that the model has a large deviation (Bias), and there may be under-fitting.
The error rate of the training set is large, and the error rate of the validation set is much larger than that of the training set, indicating that the variance and bias are large and the model is poor.
The error rates of the training set and the validation set are both small, and the difference between the two is also small, indicating that the variance and deviation are small, and this model works better.

Neural network models may even have a bad state of High Bias and High Variance , as shown in the following figure:

模型估计：偏差/方差

2.2 Countermeasures

机器学习基础 Basic Recipe for Machine Learning

The model may be in the different states mentioned above. After we evaluate the model state, the optimization method is as follows for different states:

The model has high bias : increase the size of the network, such as adding hidden layers or the number of hidden units; find a suitable network architecture, use a larger NN structure; take longer to train.
The model has high variance : get more data; regularization; find a more suitable network structure.
Keep trying until you find a low-bias, low-variance framework.

模型估计：偏差/方差

In the early days of deep learning, there were not many ways to reduce bias or variance without affecting the other. In the era of big data, deep learning is very beneficial to supervised learning, so that we don't have to pay too much attention to how to balance the trade-off between bias and variance as before. Through the above methods, we can reduce the cost of one party without increasing the other party. value.

3. Regularization

正则化 Regularization

3.1 Regularization

If the model has an overfitting (High Variance) state, it can be alleviated by regularization. Although expanding the number of training samples is also a way to reduce High Variance, it is usually too expensive and difficult to obtain more training samples. Therefore, a more feasible and effective way is to use regular expressions.

3.2 Regularization in Logistic Regression

Let's review the logistic regression model introduced before. We added L2 Regularization to the Cost Function (see the ShowMeAI article Graphical Machine Learning | Detailed Explanation of Logistic Regression Algorithms ), the expression is as follows:

$$ J(w,b)=\frac1m \sum_{i=1}^m L(\hat y^{(i)},y^{(i)})+\frac{\lambda}{2m} ||w||_2^2 $$

Why only regularize $ w$ and not $ b$ ? In fact, $ b$ can also be regularized. But generally $ w$ has a large dimension, while $ b$ is just a constant. In comparison, the parameters are largely determined by $ w$, and changing the value of $ b$ has little effect on the overall model. Therefore, for the sake of simplicity, the regularization of $ b$ is generally ignored.

正则化

In addition to L2 regularization, we can also add L1 regularization to logistic regression. The expression is as follows:

$$ J(w,b)=\frac1m\sum_{i=1}^mL(\hat y^{(i)},y^{(i)})+\frac{\lambda}{2m}| |w||_1 $$

Compared with L2 regularization, L1 regularization is easier to obtain sparse $ w$ solutions , that is, there are many zero values in the weights $ w$ obtained in the final training. The advantage of L1 regularization is to save storage space (because most of $w$ are 0). But in fact, L1 regularization is not better than L2 regularization in solving the overfitting problem, and L1 is more complicated in differential derivation. Therefore, L2 regularization is generally more commonly used.

$ \lambda$ in L1 and L2 regularization is the regularization parameter (a type of hyperparameter). You can set $ \lambda$ to different values, perform validation in the validation set Dev set, and choose the best $ \lambda$ .

3.3 Regularization in Neural Networks

In the deep learning model, the expression of L2 regularization is shown in the figure:

神经网络的正则化

Usually, we call $ ||w^{[l]}||^2$ as the Frobenius Norm, denoted as $ ||w^{[l]}| |_F^2$ .

Since the regularization term is added to the Cost Function, the $ dw^{[l]}$ calculation expression in the gradient descent algorithm needs to be modified as follows:

$$ dw^{[l]}=dw^{[l]}_{before}+\frac{\lambda}{m}w^{[l]} $$

$$ w^{[l]}:=w^{[l]}-\alpha\cdot dw^{[l]} $$

You also sometimes hear L2 regularization called weight decay . This is because, due to the addition of the regular term, $ dw^{[l]}$ has an increment. When updating $ w^{[l]}$, this increment will be subtracted more , making $ w^{[l]}$ smaller than the value without the regular term. Iteratively updated and continuously reduced.

$$ \begin{aligned} w^{[l]} :& = w^{[l]}-\alpha\cdot dw^{[l]}\\ & = w^{[l]}-\alpha \cdot(dw^{[l]}_{before}+\frac{\lambda}{m}w^{[l]})\\ & = (1-\alpha\frac{\lambda}{m} )w^{[l]}-\alpha\cdot dw^{[l]}_{before} \end{aligned} $$

where $ (1-\alpha\frac{\lambda}{m})<1$ .

3.4 Regularization can reduce the reason for overfitting

为什么正则化有利于预防过拟合呢？ Why Regularization Reduces Overfitting?

(1) Intuitive explanation

Let's go back to the picture of the model state above, from left to right, showing three cases of underfitting , just fitting , and overfitting . Choosing a complex neural network model in the figure, then without adding regularization, we may get overfit classification boundaries in the figure.

神经网络的正则化

If using L2 regularization, $ w^{[l]}\approx0$ when $ \lambda$ is large. $ w^{[l]}$ is approximately zero, which means that some neurons in the neural network model actually play a small role and can be ignored. From the effect point of view, some neurons are actually ignored. In this way, the originally overly complex neural network model becomes less complicated, and becomes very simple.

As shown, the whole simplified neural network model becomes a logistic regression model. The problem changed from High Variance to High Bias .

神经网络的正则化

Therefore, to sum up, an intuitive understanding is: when the regularization factor is set large enough, in order to minimize the cost function, the weight matrix $ W$ will be set to a value close to 0, intuitively Equivalent to eliminating the influence of many neurons, then a large neural network will become a smaller network. Of course, in fact the neurons in the hidden layer still exist, but their influence is weakened, and the possibility of overfitting is greatly reduced.

(2) Mathematical explanation

Suppose the activation function used in the neuron is $ g(z) = tanh(z)$ (sigmoid is the same). (For a review of activation functions, see the ShowMeAI article Shallow Neural Networks )

神经网络的正则化

After adding the regularization term, when $ \lambda$ increases, causing $ W^{[l]}$ to decrease, $ Z^{[l]} = W^{[l]}a ^{[l-1]} + b^{[l]}$ will decrease. From the above figure, we will find that in the area where $ z$ is small (close to 0), the $ tanh(z)$ function is approximately linear, so the function of each layer is approximately linear, and the entire network becomes A simple, approximately linear network, so no overfitting occurs.

(3) Other explanations

When the weight $ w^{[L]}$ becomes smaller, the random change of the input sample $ X$ will not cause too much influence on the neural network model, and the neural network may be affected by local noise. Sex becomes smaller. This is why regularization reduces model variance.

4. Dropout regularization

Dropout 正则化 Dropout Regularization

In the neural network, another very effective regularization method is called Dropout (random deactivation), which refers to setting a random closing probability for each neuron node in the hidden layer of the neural network, and the remaining neurons A network with fewer nodes and smaller scale is formed for training. The network model is simplified to avoid overfitting.

Dropout正则化

4.1 Inverted Dropout

Dropout has different implementation methods, a common method is Inverted Dropout . Suppose that for the neurons in the lth layer, set the probability of keeping the proportion of neurons keep_prob=0.8, that is, 20% of the neurons in this layer stop working. $ dl$ is a Dropout vector, set $ dl$ as a random vector, where 80% of the elements are 1, and 20% of the elements are 0.

The generated python code for the dropout vector is as follows :

Dropout正则化

 keep_prob = 0.8    # 设置神经元保留概率
dl = np.random.rand(al.shape[0], al.shape[1]) < keep_prob
al = np.multiply(al, dl)
al /= keep_prob

The last step al /= keep_prob is because some elements in $ a^{[l]}$ are inactivated (equivalent to being zeroed), in order not to affect $ Z^{[ l+1]} = the expected value of W^{[l+1]}a^{[l]} + b^{[l+1]}$ , so divide by one keep_prob .

Another benefit of Inverted Dropout is that it can reduce scaling problems when testing the neural network after the dropout. Because scale up is used to ensure that the expected value of al does not change significantly during training, there is no need to perform similar scaling operations on sample data during testing.

For $ m$ samples, during a single iteration of training, a certain number of neurons in the hidden layer are randomly deleted;
Then, update the weights $ w$ and constant terms $ b$ forward and backward on the remaining neurons after deletion;
Then, in the next iteration, restore the previously deleted neurons, randomly delete a certain number of neurons, and perform forward and reverse updates $ w$ and $ b$.
Repeat the above process continuously until the iterative training is completed.

Note : After training with Dropout, Dropout and random deletion of neurons are not required when testing and actually applying the model, and all neurons are working.

4.2 Understanding Dropout

理解 Dropout Understanding Dropout

4.2.1 Dropout Understanding Perspective 1

Dropout randomly selects different neurons during each iteration of training, which is equivalent to training on a different neural network each time, similar to the Bagging method in machine learning (for detailed ideas, you can read the ShowMeAI article Graphical Machine Learning | Random Forest Classification Model details ) to prevent overfitting.

Dropout正则化

4.2.2 Dropout Understanding Perspective 2

The second perspective of understanding is that Dropout will reduce the value of the weight $ w$.

For a neuron, during a certain training, some of its inputs are filtered by the dropout. And in the next training, some different inputs are filtered. After many training sessions, some inputs are filtered and some are kept. In this way, the neuron is no longer particularly dependent on any one input feature. That is to say, the corresponding weight $ w$ will not be very large. In effect, this is similar to L2 regularization, which "penalizes" the weight $ w$ and reduces the value of $ w$ .

Dropout正则化

Therefore, through the propagation process, Dropout will produce the same effect of shrinking weights as $ L2$ regularization.

4.3 Dropout value

Generally speaking, the hidden layer with more neurons, keep_prob can be set smaller, such as 0.5; the hidden layer with fewer neurons, keep_out can be set larger, such as 0.8 , the setting is 1.

In practical applications, it is not recommended to perform Dropout on the input layer. If the input layer has a large dimension, such as a picture, then Dropout can be set, but keep_prob should be set larger, such as 0.8, 0.9.

In general, the hidden layer that is more prone to overfitting, its keep_prob is set relatively small. There is no exact fixed practice, and it is usually possible to choose based on validation.

Note : A big disadvantage of Dropout is that the cost function cannot be well defined. Because each iteration randomly removes the influence of some neuron nodes, the cost function cannot be guaranteed to be monotonically decreasing. Therefore, when using Dropout, first set all keep_prob to 1.0 and run the code to ensure that the $ J(w, b)$ function is monotonically decreasing, and then turn on Dropout.

5. Other regularization methods

其他正则化方法 Other Regularization Methods

5.1 Data Augmentation

Data Augmentation is a common and effective technique in deep learning. In particular, in the field of computer vision, it refers to obtaining more training sets and validation set . As shown below:

其他正则化方法

5.2 Early Stopping

Because the training process of deep learning is a process of iteratively optimizing the cost function of the training set, but too many iterations will cause the model to overfit the training set and weaken the generalization ability to other data. One way to deal with this is to use Early Stopping.

In Early Stopping, we plot the cost curve for gradient descent on the training set and validation set on the same axis. When the training set error decreases but the validation set error increases, and the two begin to deviate significantly, the iteration is stopped in time, and the connection weight and threshold with the smallest validation set error are returned to avoid overfitting.

其他正则化方法

Early Stopping also has its own drawbacks .

Looking back, we have two goals in applying machine learning to train the model: 1. Optimize the Cost Function and minimize $ J$; 2. Prevent overfitting and hope to have good generalization ability on new data. These two goals are opposed to each other, that is, reducing $J$ may cause overfitting, and vice versa.

As mentioned earlier, in deep learning, neural networks can reduce both Bias and Variance to build the best model. However, Early Stopping prevents overfitting by reducing the number of training sessions so that $ J$ is not small enough. That is, Early Stopping fuses the above two goals and optimizes at the same time, but may not be as good as "divide and conquer" .

Compared with Early Stopping, L2 regularization can achieve the effect of "divide and conquer": iterative training is enough to reduce $ J$ , and it can also effectively prevent overfitting. One of the disadvantages of L2 regularization is that the selection of the optimal regularization parameter $ \lambda$ is more complicated, and Early Stopping is relatively simple .

In general $ L2$ regularization is more commonly used.

6. Normalize Input

标准化输入 Normalizing Inputs

6.1 Standardized Input Operations

When training a neural network, normalizing the input can increase the speed of training. Standardization is the operation of normalizing the training data set, that is, the original data is subtracted from its mean$ \mu$ and then divided by its variance$ \sigma^2$ :

$$ \mu=\frac1m\sum_{i=1}^mX^{(i)} $$

$$ \sigma^2=\frac1m\sum_{i=1}^m(X^{(i)})^2 $$

$$ X:=\frac{X-\mu}{\sigma^2} $$

The following figure shows the normalization process of two-dimensional data and its distribution changes:

标准化输入

Note: For actual modeling applications, the test set should be normalized with the same $ \mu$ and $ \sigma^2$ as the training set. This ensures that the normalization operations of the training set and test set are consistent.

6.2 Reasons for normalizing input

Standardizing the input allows all inputs to be adjusted to the same scale, which is convenient for finding the global optimal solution faster and more accurately when performing the gradient descent algorithm.

Taking two-dimensional data as an example, if the input data has two dimensions, $ x_1$ and $ x_2$, the range of $ x_1$ is $ [1,1000]$, $ x_2$ The range is $ [0,1]$ .

标准化输入

(1) Cases without standardization

Without normalization, the distribution between $ x_1$ and $ x_2$ is extremely unbalanced, and the trained $ w_1$ and $ w_2$ are also very different in magnitude. The Cost Function in this case may be related to $ w$ and $ b$ as a very elongated oval bowl, as shown on the left.

When optimizing the gradient descent of this Cost Function, due to the large difference between $ w_1$ and $ w_2$ values, only a small learning factor $ \alpha$ can be selected to avoid $ J$ Oscillation occurs. Once $ \alpha$ is large, oscillation must occur, and $ J$ is no longer monotonically decreasing.

(2) When the standardization is completed

If the normalization operation is performed, $ x_1$ and $ x_2$ are evenly distributed, and $ w_1$ and $ w_2$ have little difference in value, and the resulting Cost Function is the same as $ w$ and $ b $ is similar to a round bowl, as shown in the image to the right. When performing gradient descent optimization on it, $ \alpha$ can be chosen to be relatively large, and $ J$ generally does not oscillate, ensuring that $ J$ is monotonically decreasing.

If the range between the input features is relatively close, then not doing the normalization operation will not have much impact. However, normalization is worth doing in most cases.

7. Gradient Vanishing and Gradient Explosion

梯度消失/梯度爆炸 Vanishing / Exploding Gradients

7.1 Gradient Explosion and Gradient Vanishing

In the deep neural network, when we calculate the gradient of the loss function, sometimes exponentially increasing or decreasing, they correspond to the gradient explosion and gradient disappearance problems of the neural network respectively.

As an example, consider a multi-layer deep neural network model with only two neurons in each layer, as shown in the following figure:

In order to simplify the complexity and facilitate analysis, we make the activation function of each layer a linear function, and ignore the influence of the constant term $ b$ of each layer, that is, assuming $ g(z) = z$ , $ b^ {[l]} = 0$ , for the target output $ \hat{y}$ there are:

梯度消失和梯度爆炸

This superposition will bring the following two situations:

For the case where the value of $ W^{[l]}$ is greater than 1, the value of the activation function will increase exponentially;
For the case where the value of $ W^{[l]}$ is less than 1, the value of the activation function will decrease exponentially.

Calculating the gradient is a similar process. According to the chain rule of derivation, there will also be superposition and multiplication. The gradient function will increase or decrease exponentially, causing the difficulty of training the derivative to increase, and the step size of the gradient descent algorithm will become very small. , the training time will be very long.

7.2 Weight Initialization Mitigates Gradient Vanishing and Explosion

神经网络的权重初始化 Weight Initialization for Deep Networks

So how to improve the gradient disappearance and explosion problem? One way is to do some initialization of the weights $w$. (Other methods such as ResNet and other network structure adjustments mentioned in the detailed explanation of the classic CNN network instance in the ShowMeAI article)

In the deep neural network model, taking a single neuron as an example, its output is calculated as $ \hat{y}$ :

梯度消失和梯度爆炸

In order to make $ \hat{y}$ not too large or too small, the idea is to make $ w$ related to $ n$, and the larger $ n$ is, $ w$ should be Smaller is better. One way is to initialize w with its variance $ 1/n$ , which is called Xavier initialization here.

 # 针对tanh激活函数的Xavier初始化
WL = np.random.randn(WL.shape[0], WL.shape[1]) * np.sqrt(1/n)

Where $ n$ is the number of input neurons, ie WL.shape[1] .

In this way, the input of the activation function $ x$ is approximately set to have a mean of 0 and a standard deviation of 1, and the variance of the neuron output $ z$ is normalized to 1. Although it does not solve the problem of vanishing and exploding gradients, it does slow down the speed of vanishing and exploding gradients to a certain extent.

If the ReLU activation function is used, the initialization of the weight $ w$ generally makes its variance $ 2/n$ The corresponding python code is as follows:

 w[l] = np.random.randn(n[l],n[l-1])*np.sqrt(2/n[l-1])

8. Gradient checking

8.1 Numerical Approximation of Gradients

梯度的数值逼近 Numerical approximation of gradients

We know that the gradient descent method relies heavily on the gradient to complete. Mathematically, we can use the limit calculation to approximate the derivative based on the definition of differential. We have the following " one-sided error method " and " two-sided error method ", where The latter is more accurate.

(1) One-sided error

梯度检验

(2) Derivation of bilateral errors (that is, the definition of derivatives)

梯度检验

When $ \varepsilon$ is smaller, the result is closer to the true derivative, that is, the gradient value. This method can be used to determine whether an error has occurred when backpropagating gradient descent.

8.2 Gradient test

梯度检验 Gradient Checking

After we calculate the numerical gradient, we need to perform a gradient check to verify whether there is any problem during the training process.

(1) Connection parameters

Convert $ W^{[1]}$ , $ b^{[1]}$ , ..., $ W^{[L]}$ , $ b^{[L]}$ All concatenated into one giant vector $ \theta$ .

At the same time, for $ dW^{[1]}$ , $ db^{[1]}$ , ..., $ dW^{[L]}$ , $ db^{[L]} $ does the same to get the giant vector $ d\theta$ , which has the same dimensions as $ \theta$.

连接参数

Now, we need to find the relationship between $ d\theta$ and the gradient of the cost function $ J$.

(2) Carry out gradient test

To obtain a gradient approximation value$ d\theta_{approx}[i]$ , it should be$ \approx d\theta[i] = \frac{\partial J}{\partial \theta_i}$ .

梯度检验

Therefore, we use the gradient test value to check whether the backpropagation is implemented correctly. Among them, $ {||x||}_2$ represents the 2-norm (also called " Euclidean norm ") of the vector $ x$.

If the gradient test value is close to the value of $ \varepsilon $, the implementation of the neural network is correct, otherwise check the code for bugs .

8.3 Practical Tips and Considerations for Implementing Gradient Testing in Neural Networks

梯度检验应用的注意事项 Gradient Checking Implementation Notes

There are a few things to note when doing gradient checking:

Don't use gradient testing throughout training, it's just for debugging.
If the gradient test of the algorithm fails, find the gradient corresponding to the error (that is, the term with a large difference between the values of $ d\theta \approx [i]$ and $ d\theta$ ), and check whether there is an error in its derivation.
When calculating the approximate gradient, you need to bring a regular term.
Turn off dropout during gradient check, and turn on dropout after checking.

References

Featured Recommendations in ShowMeAI Series Tutorials

ShowMeAI用知识加速每一次技术成长

Deep Learning Tutorial | Practical Aspects of Deep Learning

introduction

1. Data division: training/validation/test set

1.1 Iterative optimization of deep learning practice

1.2 The division method in the pre-big data era

1.3 Division of Big Data Era

About the validation set

About the test set

1.4 Recommendations for data partitioning

2. Model Estimation: Bias/Variance

2.1 Model Status and Evaluation

2.2 Countermeasures

3. Regularization

3.1 Regularization

3.2 Regularization in Logistic Regression

3.3 Regularization in Neural Networks

3.4 Regularization can reduce the reason for overfitting

(1) Intuitive explanation

(2) Mathematical explanation

(3) Other explanations

4. Dropout regularization

4.1 Inverted Dropout

4.2 Understanding Dropout

4.2.1 Dropout Understanding Perspective 1

4.2.2 Dropout Understanding Perspective 2

4.3 Dropout value

5. Other regularization methods

5.1 Data Augmentation

5.2 Early Stopping

6. Normalize Input

6.1 Standardized Input Operations

6.2 Reasons for normalizing input

(1) Cases without standardization

(2) When the standardization is completed

7. Gradient Vanishing and Gradient Explosion

7.1 Gradient Explosion and Gradient Vanishing

7.2 Weight Initialization Mitigates Gradient Vanishing and Explosion

8. Gradient checking

8.1 Numerical Approximation of Gradients

(1) One-sided error

(2) Derivation of bilateral errors (that is, the definition of derivatives)

8.2 Gradient test

(1) Connection parameters

(2) Carry out gradient test

8.3 Practical Tips and Considerations for Implementing Gradient Testing in Neural Networks

References

recommended article

Featured Recommendations in ShowMeAI Series Tutorials

用户bPcV4sA

引用和评论

数据分析大作战，SQL V.S. Python，来看看这些考题你都会吗 ⛵

Open WebUI：开源AI交互平台的全面解析

大模型中的Token究竟是什么？从原理到作用深度解析

一文掌握 MCP 上下文协议：从理论到实践

MySQL × 向量数据库：大模型时代的黄金组合实战指南

AdventureX 2025 正式启动：五天四夜，120小时极限创造！一起在杭州点燃青年创新之火！

大模型时代，后端程序员如何避免被AI卷死？