
Neural Networks and Backpropagation
This series is a complete set of study notes for Stanford CS224n "Natural Language Processing with Deep Learning"; the corresponding course videos can be viewed here.

Figure: Neural network review

Figure: Neural network backpropagation and computation graphs

ShowMeAI has translated and annotated all of the CS224n courseware in Chinese and produced GIF animations for it. Click on Lecture 3 - Advanced Word Vectors and Lecture 4 - Neural Network Backpropagation and Computation Graphs to view the courseware annotations and interpretations. See the end of the article for more information.


Introduction

CS224n is a specialized course on deep learning and natural language processing offered by Stanford University. Its core content covers RNNs, LSTMs, CNNs, Transformers, BERT, question answering, summarization, text generation, language models, reading comprehension, and other cutting-edge topics.

This set of notes introduces single-layer and multi-layer neural networks and how to use them for classification. We then discuss how to train them using a distributed gradient descent technique called backpropagation, and we will see how the chain rule lets us make parameter updates sequentially. After a rigorous mathematical discussion of neural networks, we cover some practical tips and tricks for training them, including neuron units (non-linearities), gradient checking, Xavier parameter initialization, learning rates, Adagrad, and more. Finally, we motivate the use of recurrent neural networks as language models.

Content Highlights

  • Neural networks
  • Backpropagation
  • Gradient calculation
  • Neurons
  • Hinge loss
  • Gradient check
  • Xavier parameter initialization
  • Learning rate
  • Adagrad optimization algorithm

1. Neural Network Basics

(For this part, you can also refer to ShowMeAI's summary articles on Andrew Ng's course: Deep Learning Tutorial | Neural Network Basics, Deep Learning Tutorial | Shallow Neural Network, and Deep Learning Tutorial | Deep Neural Network.)

In the previous discussion we established the need for nonlinear classifiers: most data are not linearly separable, so the performance of linear classifiers on them is limited. Neural networks are a class of classifiers with nonlinear decision boundaries, as shown in the figure below. The nonlinear decision boundary is clearly visible in the figure; let's see how the model learns it.

Neural networks are biologically inspired classifiers, which is why they are often called "artificial neural networks" to distinguish them from the organic kind. In reality, however, human neural networks are far more capable and complex than artificial ones, so it is usually best not to draw too many parallels between the two.

Figure: Neural network basics

1.1 A single neuron

A neuron is a general-purpose computational unit that accepts \(n\) inputs and produces an output. Different neurons will have different outputs according to their different parameters (generally considered as neuron weights).

A common choice of neuron is the \(sigmoid\), or "binary logistic regression", unit. This neuron takes an \(n\)-dimensional vector \(x\) as input and computes a scalar activation (output) \(a\). The neuron is also associated with an \(n\)-dimensional weight vector \(w\) and a scalar bias \(b\).

The output of this neuron is:

$$ a=\frac{1}{1+exp(-(w^{T}x+b))} $$

We can also fold the weights and the bias term in the formula above into a single vector:

$$ a=\frac{1}{1+exp(-[w^{T}\;\;b]\cdot [x\;\;1])} $$

The above formula is visualized as shown in the following figure:

Figure: A single neuron

❐ Neurons are the basic building blocks of neural networks. We will see that a neuron can be one of many functions that allow nonlinearities to accumulate in the network.
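
As a minimal sketch of the two formulas above, the snippet below computes the activation of one sigmoid neuron both with a separate bias and with the bias folded into an augmented weight vector. The function names and the numeric values are illustrative assumptions, not part of the course material.

```python
import numpy as np

def sigmoid_neuron(x, w, b):
    """Activation a = 1 / (1 + exp(-(w^T x + b)))."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def sigmoid_neuron_augmented(x, w, b):
    """Same neuron, with the bias folded in: [w, b] . [x, 1]."""
    w_aug = np.append(w, b)    # [w, b]
    x_aug = np.append(x, 1.0)  # [x, 1]
    return 1.0 / (1.0 + np.exp(-(w_aug @ x_aug)))

# Illustrative values (assumed for this example only)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2

a1 = sigmoid_neuron(x, w, b)
a2 = sigmoid_neuron_augmented(x, w, b)
print(a1, a2)  # the two forms give the same activation
```

Folding the bias into the weight vector is only a notational convenience; the rest of these notes keep \(w\) and \(b\) separate.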

1.2 Single-layer neural network

We extend the above idea to multiple neurons, considering the input \(x\) as the input to multiple such neurons, as shown in the figure below.

Figure: Single-layer neural network

If we define the weights of the different neurons as \(\{w^{(1)}, \cdots ,w^{(m)}\}\), the biases as \(\{b_1, \cdots ,b_m\}\), and the corresponding activation outputs as \(\{a_1, \cdots ,a_m\}\), then:

$$ a_{1} =\frac{1}{1+exp(-(w^{(1)T}x+b_1))} $$

$$ \vdots $$

$$ a_{m} =\frac{1}{1+exp(-(w^{(m)T}x+b_m))} $$

Let's define the following shorthand so that more complex networks are easier to express:

$$ \sigma(z) = \begin{bmatrix} \frac{1}{1+exp(-z_1)} \\ \vdots \\ \frac{1}{1+exp(-z_m)} \end{bmatrix} $$

$$ b = \begin{bmatrix} b_{1} \\ \vdots \\ b_{m} \end{bmatrix} \in \mathbb{R}^{m} $$

$$ W = \begin{bmatrix} -\;\;w^{(1)T}\;\;- \\ \vdots \\ -\;\;w^{(m)T}\;\; - \end{bmatrix} \in \mathbb{R}^{m\times n} $$

We can now write the output of scaling and bias as:

$$ z=Wx+b $$

The sigmoid activations can then be written as:

$$ \begin{bmatrix} a_{1} \\ \vdots \\ a_{m} \end{bmatrix} = \sigma(z) = \sigma(Wx+b) $$

So what do these activations do? We can think of these activations as indicators of the existence of some combination of weighted features. We can then use the combination of these activations to perform classification tasks.

1.3 Forward and Backward Computation

So far we know that an input vector \(x\in \mathbb{R}^{n}\) can be transformed by a layer of \(sigmoid\) units into an activation output \(a\in \mathbb{R}^{m}\). But what is the intuition for doing this? Let us consider a named entity recognition problem in NLP as an example:

Museums in Paris are amazing

Here we want to decide whether the center word Paris is a named entity. In this case, we most likely want to capture not only the word vectors of the words in the window, but also some interactions between the words for classification purposes. For example, Paris may be a named entity only if Museums is the first word and in is the second word. Such non-linear decisions usually cannot be captured by feeding the input directly into a Softmax function; they require an intermediate layer of the neural network before scoring. Therefore, we can use another matrix \(\mathbf{U} \in \mathbb{R}^{m \times 1}\) together with the activation output to calculate the unnormalized score for the classification task:

$$ s=\mathbf{U}^{T}a=\mathbf{U}^{T}f(Wx+b) $$

where \(f\) is the activation function (such as the sigmoid function).

Figure: Forward and backward computation

Dimensional analysis: If we represent each word with a \(4\)-dimensional word vector and use a window of \(5\) words, the input is \(x\in \mathbb{R}^{20}\). If we use \(8\) sigmoid units in the hidden layer and produce a single score from the activations, then \(W\in \mathbb{R}^{8\times 20}\), \(b\in \mathbb{R}^{8}\), \(U\in \mathbb{R}^{8\times 1}\), and \(s\in \mathbb{R}\).
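
The shapes in the dimensional analysis can be checked with a small sketch of the forward pass \(s=\mathbf{U}^{T}f(Wx+b)\). The random values below stand in for real word vectors and trained parameters, and the helper names `sigmoid` and `score` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

d, window, m = 4, 5, 8                # word-vector size, window size, hidden units
x = rng.normal(size=d * window)       # concatenated window, x in R^20
W = rng.normal(size=(m, d * window))  # W in R^{8 x 20}
b = rng.normal(size=m)                # b in R^8
U = rng.normal(size=(m, 1))           # U in R^{8 x 1}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W, b, U):
    """Unnormalized score s = U^T f(Wx + b) with f = sigmoid."""
    a = sigmoid(W @ x + b)   # hidden activations, a in R^8
    return (U.T @ a).item()  # scalar score s

s = score(x, W, b, U)
print(x.shape, W.shape, b.shape, U.shape, s)
```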

1.4 Hinge loss

Like many machine learning models, neural networks need an optimization objective, that is, a measure of error or goodness that we want to minimize or maximize. Here we discuss a common error measure: the maximum margin objective function. The idea behind this objective is to ensure that the score computed for "true" labeled data points is higher than the score computed for "false" labeled data points.

Going back to the previous example, let the "true" labeled window Museums in Paris are amazing have a computed score of \(s\), and let the "false" labeled window Not all museums in Paris have a computed score of \(s_c\) (the subscript \(c\) means that this window is corrupted).

Our objective is then to maximize \((s-s_c)\), or equivalently to minimize \((s_c-s)\). However, we modify the objective so that an error is only computed when \(s_c > s \Rightarrow (s_c-s) > 0\). The intuition is that we only care that the "true" data point has a higher score than the "false" data point; beyond that, nothing else matters. The error is therefore \((s_c-s)\) when \(s_c > s\) and 0 otherwise, and our optimization objective becomes:

$$ minimize\;J=max\,(s_c-s,0) $$

However, the objective above is risky because it does not create a margin of safety. We want the "true" data point to score higher than the "false" data point by some positive margin \(\Delta\). In other words, we want an error to be computed whenever \((s-s_c < \Delta)\), not only when \((s-s_c < 0)\). We therefore modify the optimization objective to:

$$ minimize\;J=max\,(\Delta+s_c-s,0) $$

We can scale this margin so that \(\Delta=1\) and let the other parameters adapt during optimization, without affecting the performance of the model. (For more on hinge loss and margins, see the explanation of the SVM algorithm in ShowMeAI's machine learning algorithm tutorial.) Finally, we define the optimization objective over all training windows as:

$$ minimize\;J=max\,(1+s_c-s,0) $$

where, as defined above:

$$ s_c=\mathbf{U}^{T}f(Wx_c+b) $$

$$ s=\mathbf{U}^{T}f(Wx+b) $$

❐ The max-margin objective function is most commonly associated with support vector machines
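
The max-margin objective for a single pair of windows is short enough to write out directly. The sketch below assumes the scores \(s\) and \(s_c\) have already been computed; the function name `max_margin_loss` and the example scores are hypothetical.

```python
def max_margin_loss(s, s_c, delta=1.0):
    """J = max(delta + s_c - s, 0) for a true score s and a corrupt score s_c."""
    return max(delta + s_c - s, 0.0)

# True window beats the corrupt one by more than the margin: no loss.
loss_ok = max_margin_loss(s=3.0, s_c=1.0)
# True window wins, but by less than the margin: small loss.
loss_small = max_margin_loss(s=1.5, s_c=1.0)
# Corrupt window scores higher: larger loss.
loss_wrong = max_margin_loss(s=0.0, s_c=1.0)

print(loss_ok, loss_small, loss_wrong)  # 0.0 0.5 2.0
```

The middle case shows why the margin matters: the true window already scores higher, but the loss is still positive until the gap reaches \(\Delta\).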

1.5 Backpropagation (Single-Sample Form)

In the previous section we introduced the hinge loss. We now explain how to train the different parameters of the model when the loss \(J\) is positive. If the loss is \(0\), no parameter update is needed. We generally update parameters with gradient descent (or a variant such as SGD), so we need the gradient of the loss with respect to every parameter that appears in the update formula:

$$ \theta^{(t+1)}=\theta^{(t)}-\alpha\nabla_{\theta^{(t)}}J $$

Backpropagation is a method that uses the chain rule of differentiation to compute the gradient of the loss with respect to any parameter of the model. To understand backpropagation better, let's first look at the simple network in the figure below:

Figure: Backpropagation (single-sample form)

Here we use a neural network with only a single hidden layer and a single output unit. Now let's first establish some symbol definitions:

  • \(x_i\) is the input of the neural network
  • \(s\) is the output of the neural network
  • Neurons in each layer (including the input and output layers) receive an input and produce an output. The \(j\)th neuron of layer \(k\) receives a scalar input \(z_j^{(k)}\) and produces a scalar activation output \(a_j^{(k)}\)
  • We call the backpropagated error calculated at \(z_j^{(k)}\) \(\delta_j^{(k)}\)
  • Layer \(1\) is the input layer, not the first hidden layer. For the input layer, \(x_j=z_j^{(1)}=a_j^{(1)}\)
  • \(W^{(k)}\) is the transfer matrix that maps the output of layer \(k\) to the input of layer \(k+1\). In this new notation, the example from Section 1.3 has \(W^{(1)}=W\) and \(W^{(2)}=U\)

Now let us start backpropagation:

Assuming that the loss \(J=(1+s_c-s)\) is positive, suppose we want to update the parameter \(W_{14}^{(1)}\). We see that \(W_{14}^{(1)}\) only contributes to \(z_1^{(2)}\) and therefore to \(a_1^{(2)}\). This is very important for understanding backpropagation: a backpropagated gradient is only affected by the values that the parameter contributes to. In the subsequent forward computation, \(a_1^{(2)}\) is multiplied by \(W_1^{(2)}\) to compute the score. From the max-margin loss we can see:

$$ \frac{\partial J}{\partial s}=-\frac{\partial J}{\partial s_c}=-1 $$

For simplicity, we only analyze \(\frac{\partial s}{\partial W_{ij}^{(1)}}\). We have:

$$ \begin{aligned} \frac{\partial s}{\partial W_{ij}^{(1)}} &= \frac{\partial W^{(2)}a^{(2)}}{\partial W_{ij}^{(1)}}=\frac{\partial W_i^{(2)}a_i^{(2)}}{\partial W_{ij}^{(1)}}= W_i^{(2)}\frac{\partial a_i^{(2)}}{\partial W_{ij}^{(1)}} \\ \Rightarrow W_i^{(2)}\frac{\partial a_i^{(2)}}{\partial W_{ij}^{(1)}} &= W_i^{(2)}\frac{\partial a_i^{(2)}}{\partial z_i^{(2)}}\frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} \\ &= W_i^{(2)}\frac{\partial f(z_i^{(2)})}{\partial z_i^{(2)}}\frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} \\ &= W_i^{(2)}f^{\prime}(z_i^{(2)})\frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} \\ &= W_i^{(2)}f^{\prime}(z_i^{(2)})\frac{\partial}{\partial W_{ij}^{(1)}}\left(b_i^{(1)}+a_1^{(1)}W_{i1}^{(1)}+a_2^{(1)}W_{i2}^{(1)}+a_3^{(1)}W_{i3}^{(1)}+a_4^{(1)}W_{i4}^{(1)}\right) \\ &= W_i^{(2)}f^{\prime}(z_i^{(2)})\frac{\partial}{\partial W_{ij}^{(1)}}\left(b_i^{(1)}+\sum_{k}a_{k}^{(1)}W_{ik}^{(1)}\right) \\ &= W_i^{(2)}f^{\prime}(z_i^{(2)})a_j^{(1)} \\ &= \delta_i^{(2)}\cdot a_j^{(1)} \end{aligned} $$

where \(a^{(1)}\) refers to the input of the network. We can see that the gradient simplifies to \(\delta_i^{(2)}\cdot a_j^{(1)}\), where \(\delta_i^{(2)}\) is essentially the error backpropagated to the \(i\)th neuron in layer \(2\). The product of \(a_j^{(1)}\) and \(W_{ij}\) is fed into the \(i\)th neuron in layer \(2\).

Using the following figure as an example, let's explain backpropagation in terms of "error sharing/distribution". Suppose we want to update \(W_{14}^{(1)}\):

Figure: Backpropagation (single-sample form)

  • ① We start backpropagation from an error signal of 1 at \(a_1^{(3)}\)
  • ② We then multiply the error by the local gradient of the neuron that maps \(z_1^{(3)}\) to \(a_1^{(3)}\). In this example the gradient is exactly 1, so the error is still 1 and \(\delta_1^{(3)}=1\)
  • ③ The error signal of 1 has now reached \(z_1^{(3)}\). We need to distribute it so that a "fair share" of the error arrives at \(a_1^{(2)}\)
  • ④ The error at \(a_1^{(2)}\) is \(\delta_1^{(3)}\times W_1^{(2)}=W_1^{(2)}\) (the error signal at \(z_1^{(3)}\) is \(\delta_1^{(3)}\)). So the error at \(a_1^{(2)}\) is \(W_1^{(2)}\)
  • ⑤ As in step ②, we move the error across the neuron that maps \(z_1^{(2)}\) to \(a_1^{(2)}\) by multiplying the error at \(a_1^{(2)}\) by the local gradient, which here is \(f'(z_1^{(2)})\)
  • ⑥ So the error at \(z_1^{(2)}\) is \(f'(z_1^{(2)})W_1^{(2)}\), which we define as \(\delta_1^{(2)}\)
  • ⑦ Finally, we distribute the "fair share" of the error to \(W_{14}^{(1)}\) by multiplying the above error by the input \(a_4^{(1)}\) that took part in the forward computation
  • ⑧ So the gradient with respect to \(W_{14}^{(1)}\) is \(a_4^{(1)}f'(z_1^{(2)})W_1^{(2)}\)

Note that the result obtained this way is exactly the same as the result of the explicit differentiation above. The gradients of the parameters in a network can therefore be computed either with the chain rule or with error sharing and distribution. Both approaches yield the same results, but it can be helpful to think about them in more than one way.
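
One way to convince yourself of the result \(\frac{\partial s}{\partial W_{ij}^{(1)}}=\delta_i^{(2)}a_j^{(1)}\) is to compare it with a finite-difference estimate. The sketch below does this for \(W_{14}^{(1)}\) in a tiny network with 4 inputs, 2 sigmoid hidden units, and one linear score. The network size, the random parameters, and the helper names are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(W1, b1, W2, x):
    """s = W2 . f(W1 x + b1), with f = sigmoid and a linear output unit."""
    return W2 @ sigmoid(W1 @ x + b1)

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # a^(1): the input
W1 = rng.normal(size=(2, 4))  # W^(1)
b1 = rng.normal(size=2)       # b^(1)
W2 = rng.normal(size=2)       # W^(2), one weight per hidden unit

i, j = 0, 3                   # W_14^(1) in the 1-based notation of the text

# Analytic gradient: delta_i^(2) * a_j^(1), with delta_i^(2) = W_i^(2) f'(z_i^(2))
a2 = sigmoid(W1 @ x + b1)
delta2 = W2 * a2 * (1.0 - a2)  # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
analytic = delta2[i] * x[j]

# Central-difference estimate of the same partial derivative
eps = 1e-5
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[i, j] += eps
W_minus[i, j] -= eps
numeric = (score(W_plus, b1, W2, x) - score(W_minus, b1, W2, x)) / (2 * eps)

print(analytic, numeric)  # the two values should agree to many decimal places
```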

Bias update: Bias terms (such as \(b_1^{(1)}\)) are mathematically equivalent to the other weights that contribute to a neuron's input (here \(z_1^{(2)}\)), except that in the forward computation they are multiplied by the constant 1. The gradient of the bias of the \(i\)th neuron in layer \(k\) is therefore simply \(\delta_i^{(k)}\). For example, if in the example above we were updating \(b_1^{(1)}\) instead of \(W_{14}^{(1)}\), the gradient would be \(f'(z_1^{(2)})W_1^{(2)}\).

General steps for backpropagation from \(\delta^{(k)}\) to \(\delta^{(k-1)}\):

  • ① We have the error \(\delta_i^{(k)}\) propagating backward from \(z_i^{(k)}\), as shown in the following figure

Figure: Backpropagation (single-sample form)

  • ② We backpropagate this error to \(a_j^{(k-1)}\) by multiplying it by the connecting weight \(W_{ij}^{(k-1)}\)
  • ③ So the error received at \(a_j^{(k-1)}\) is \(\delta_i^{(k)}W_{ij}^{(k-1)}\)
  • ④ However, \(a_j^{(k-1)}\) may have fed into several neurons of the next layer during the forward computation, as shown in the following figure. The error of the \(m\)th neuron of layer \(k\) is then also backpropagated to \(a_j^{(k-1)}\) in the same way

Figure: Backpropagation (single-sample form)

  • ⑤ So the error received at \(a_j^{(k-1)}\) is now \(\delta_i^{(k)}W_{ij}^{(k-1)}+\delta_m^{(k)}W_{mj}^{(k-1)}\)
  • ⑥ In general, we can write this sum as \(\sum_i\delta_i^{(k)}W_{ij}^{(k-1)}\)
  • ⑦ Now that we have the correct error at \(a_j^{(k-1)}\), we multiply it by the local gradient \(f^{\prime}(z_j^{(k-1)})\) to pass the error across the \(j\)th neuron of layer \(k-1\)
  • ⑧ Therefore, the error that reaches \(z_j^{(k-1)}\) is \(f^{\prime}(z_j^{(k-1)})\sum_i\delta_i^{(k)}W_{ij}^{(k-1)}\)

1.6 Backpropagation (vectorized form)

When training real neural networks, we usually update the weights based on a batch of samples, and the efficient way to do this is vectorization. With vectorization, we can update a whole weight matrix and bias vector in one step. Note that this is a simple extension of the model above, and it helps build intuition for how errors are backpropagated at the matrix-vector level.

For a given parameter \(W_{ij}^{(k)}\), we know that its error gradient is \(\delta_i^{(k+1)}\cdot a_j^{(k)}\), where \(W^{(k)}\) is the matrix that maps \(a^{(k)}\) to \(z^{(k+1)}\). We can therefore write the error gradient of the entire matrix \(W^{(k)}\) as:

$$ \nabla_{W^{(k)}} = \begin{bmatrix} \delta_1^{(k+1)}a_1^{(k)} & \delta_1^{(k+1)}a_2^{(k)} & \cdots \\ \delta_2^{(k+1)}a_1^{(k)} & \delta_2^{(k+1)}a_2^{(k)} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} = \delta^{(k+1)}a^{(k)T} $$

So the gradient of the entire matrix can be written as the outer product of the error vector propagating back into the matrix and the activations that the matrix forwarded.

Now let's see how the error vector \(\delta^{(k+1)}\) can be calculated.

From the general steps above we have:

$$ \delta_j^{(k)}=f^{\prime}(z_j^{(k)})\sum_i\delta_i^{(k+1)}W_{ij}^{(k)} $$

This can be rewritten compactly in matrix form:

$$ \delta^{(k)}=f^{\prime} (z^{(k)})\circ (W^{(k)T}\delta^{(k+1)}) $$

In the formula above, the \(\circ\) operator denotes the element-wise product of two vectors (\(\mathbb{R}^{N}\times \mathbb{R}^{N}\rightarrow \mathbb{R}^{N}\)).

Computational efficiency: Having looked at both the element-wise and the vector-wise updates, note that in scientific computing environments such as MATLAB or Python (with the NumPy/SciPy libraries), vectorized operations are far more efficient, so they should be used in practice. We also want to avoid redundant computation in backpropagation. For example, \(\delta^{(k)}\) depends directly on \(\delta^{(k+1)}\). So when we update \(W^{(k)}\) using \(\delta^{(k+1)}\), we should save \(\delta^{(k+1)}\) and reuse it to compute \(\delta^{(k)}\), and then repeat this for layers \((k-1), \cdots, (1)\). This recursive procedure is what makes backpropagation computationally affordable.
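
The recursion can be sketched in a few lines of NumPy. The function below runs one forward pass, then walks backward through the layers, using each \(\delta^{(k+1)}\) once for the outer-product gradient of \(W^{(k)}\) and once to compute \(\delta^{(k)}\). It assumes sigmoid hidden units, a linear output score, and 0-based Python lists in place of the 1-based layer indices in the text; the name `forward_backward` and the layer sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, Ws, bs):
    """Return the score s and its gradients with respect to every W and b.

    Ws[k], bs[k] map the activations of one layer to the inputs of the next.
    Hidden layers use the sigmoid; the last layer is a linear score.
    """
    activations = [x]
    a = x
    for k, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        a = z if k == len(Ws) - 1 else sigmoid(z)
        activations.append(a)
    s = a  # shape (1,)

    delta = np.ones_like(s)  # error at the linear output (local gradient is 1)
    grads_W = [None] * len(Ws)
    grads_b = [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        grads_W[k] = np.outer(delta, activations[k])  # delta^(k+1) a^(k)T
        grads_b[k] = delta
        if k > 0:
            a_k = activations[k]
            # delta^(k) = f'(z^(k)) o (W^(k)T delta^(k+1)), with f' = a (1 - a)
            delta = a_k * (1.0 - a_k) * (Ws[k].T @ delta)
    return s.item(), grads_W, grads_b

rng = np.random.default_rng(1)
sizes = [4, 3, 1]  # input, hidden, output
Ws = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(2)]
bs = [rng.normal(size=sizes[k + 1]) for k in range(2)]
x = rng.normal(size=4)

s, grads_W, grads_b = forward_backward(x, Ws, bs)
print(s, [g.shape for g in grads_W])
```

Each \(\delta\) is computed exactly once and reused, which is the saving described in the paragraph above.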

2. Neural Networks: Tips and Advice

(For this part, you can also refer to ShowMeAI's summary article on Andrew Ng's course: Deep Learning Tutorial | Practical Level of Deep Learning.)

2.1 Gradient Check

In the previous section we introduced how to compute the error gradient/update of parameters in a neural network using calculus-based methods.

Here we introduce a method for approximating these gradients numerically. Although it is too computationally inefficient to be used directly for training neural networks, it can estimate the derivative with respect to any parameter very accurately, which makes it a useful sanity check on the correctness of our analytic derivatives.

Given a model's parameter vector \(\theta\) and loss function \(J\) , the numerical gradient around \(\theta_i\) is given by the central difference formula:

$$ f^{\prime}(\theta)\approx \frac{J(\theta^{(i+)})-J(\theta^{(i-)})}{2\varepsilon } $$

Here \(\varepsilon\) is a small number (usually around \(10^{-5}\)). The term \(J(\theta^{(i+)})\) is the error computed in a forward pass when the \(i\)th element of the parameter \(\theta\) is perturbed by \(+\varepsilon\). Similarly, \(J(\theta^{(i-)})\) is the error computed in a forward pass when the \(i\)th element of \(\theta\) is perturbed by \(-\varepsilon\).

Therefore, with two forward passes we can estimate the gradient with respect to any given parameter of the model. Note that this definition of the numerical gradient is very similar to the definition of the derivative, which in the scalar case is:

$$ f^{\prime}(x)\approx \frac{f(x+\varepsilon)-f(x)}{\varepsilon} $$

There is a difference, of course: the definition above only perturbs \(x\) in the forward direction. While it is possible to define the numerical gradient this way, in practice the central difference formula is usually more accurate and stable because it perturbs the parameter in both directions. To better approximate the derivative (slope) around a point, we need to examine the behavior of the function \(f\) on both the left and the right of that point. Taylor's theorem also shows that the error of the central difference formula is proportional to \(\varepsilon^{2}\), which is quite small, whereas the error of the one-sided definition is proportional to \(\varepsilon\).
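
The difference in accuracy is easy to see on a function whose derivative is known. The sketch below compares the one-sided and central difference estimates for \(\sin\) at \(x_0=1\), where the exact derivative is \(\cos(1)\). The test function, the point, and the step size are assumptions chosen for illustration.

```python
import math

def forward_difference(f, x, eps):
    """One-sided estimate: (f(x + eps) - f(x)) / eps, error proportional to eps."""
    return (f(x + eps) - f(x)) / eps

def central_difference(f, x, eps):
    """Two-sided estimate: (f(x + eps) - f(x - eps)) / (2 eps), error proportional to eps^2."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x0, eps = 1.0, 1e-3
exact = math.cos(x0)  # derivative of sin at x0

err_forward = abs(forward_difference(math.sin, x0, eps) - exact)
err_central = abs(central_difference(math.sin, x0, eps) - exact)

print(err_forward, err_central)  # the central estimate is several orders of magnitude closer
```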

Now you may be wondering, if this method is so accurate, why don't we use it instead of backpropagation to compute the gradient of a neural network?

  • ① We need to consider efficiency - whenever we want to calculate the gradient of an element, we need to do two forward propagations in the network, which is very computationally expensive.
  • ② Many large-scale neural networks contain millions of parameters, and calculating each parameter twice is obviously not a good choice.
  • ③ In optimization techniques such as SGD, we need to calculate gradients through thousands of iterations, and using such methods quickly becomes unmanageable.

We only use gradient checks to verify the correctness of our analytic gradients. A gradient check can be implemented as follows:

import numpy as np

def eval_numerical_gradient(f, x):
    """
    A naive implementation of the numerical gradient of f at x.
    - f should be a function that takes a single argument
    - x is the point (numpy array of floats) at which to evaluate the gradient
    """

    fx = f(x)  # function value at the original point (not needed by the central difference)
    grad = np.zeros(x.shape)
    h = 0.00001

    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'],
                   op_flags=['readwrite'])

    while not it.finished:

        # evaluate the function at x + h and x - h
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h  # increment by h
        fxh_plus = f(x)  # evaluate f(x + h)
        x[ix] = old_value - h  # decrement by h
        fxh_minus = f(x)  # evaluate f(x - h)
        # restore to previous value (very important!)
        x[ix] = old_value

        # compute the partial derivative (the slope)
        # with the central difference formula
        grad[ix] = (fxh_plus - fxh_minus) / (2 * h)
        it.iternext()  # step to next dimension
    return grad

2.2 Regularization

Like many machine learning models, neural networks are prone to overfitting, where a model performs almost perfectly on the training set but fails to generalize to the test set. A common way to combat overfitting (the "high variance problem") is \(L2\) regularization. We simply add a regularization term to the loss function \(J\), so that the loss becomes:

$$ J_{R}=J+\lambda\sum_{i=1}^{L}\left \| W^{(i)} \right \| _F $$

In the formula above, \(\left\| W^{(i)} \right\|_F\) is the Frobenius norm of the matrix \(W^{(i)}\) (the \(i\)th weight matrix in the network), and \(\lambda\) is a hyperparameter that controls how much weight the regularization term carries relative to the original loss.

❐ Definition of the Frobenius norm of a matrix \(U\): \(\left\| U \right\|_F=\sqrt{\sum_i \sum_{l} U_{il}^{2}}\)

When we minimize \(J_R\), regularization essentially penalizes weights that grow too large while the original loss is being optimized. This keeps the weight values more balanced and prevents individual weights from becoming especially large.

Due to the quadratic nature of the Frobenius norm (computing the sum of squares of the elements of the matrix), the \(L2\) regularization term effectively reduces the flexibility of the model and thus reduces the likelihood of overfitting.
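
As a small illustration, the regularized loss \(J_R\) can be computed exactly as written in the formula above. The sketch assumes the unregularized loss \(J\) is already available; the weight matrices and the value of \(\lambda\) are made up for the example. Note that many implementations penalize the squared norm instead, which this sketch does not do.

```python
import numpy as np

def regularized_loss(J, weights, lam):
    """J_R = J + lambda * sum_i ||W^(i)||_F, following the formula in the text."""
    penalty = sum(np.linalg.norm(W, ord='fro') for W in weights)
    return J + lam * penalty

# Illustrative weight matrices (assumed values); biases are deliberately left out
W1 = np.array([[3.0, 0.0],
               [0.0, 4.0]])       # Frobenius norm 5
W2 = np.array([[1.0, 2.0, 2.0]])  # Frobenius norm 3

J_R = regularized_loss(J=0.5, weights=[W1, W2], lam=0.1)
print(J_R)  # 0.5 + 0.1 * (5 + 3)
```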

Adding such a constraint can be interpreted from a Bayesian point of view: the regularization term places a prior distribution on the model parameters and pushes the weights toward 0. How close to 0 depends on the value of \(\lambda\). Choosing an appropriate value of \(\lambda\) is important and must be done through hyperparameter tuning.

  • If \(\lambda\) is too large, many weights will be close to \(0\), and the model will not be able to learn anything meaningful from the training set, so it performs poorly.
  • If the value of \(\lambda\) is too small, the model will still overfit.

Note that the bias terms are not regularized and do not contribute to the penalty term above. Try to think about why.

Why is the bias term not included in the regularization term?

The bias term only acts as an offset in the model and can be fitted with a small amount of data. Empirically, the magnitude of the bias has no significant effect on model performance, so it does not need to be regularized.

Other types of regularization are sometimes used, such as \(L1\) regularization, which sums the absolute values of the parameter elements. However, \(L1\) regularization is used less often in practice because it makes the weight parameters sparse. In the next section we discuss Dropout, another effective regularization method that randomly sets neurons to \(0\) during forward propagation.

❐ Dropout effectively "freezes" some of the units by ignoring their weights in each iteration. These "frozen" units are not actually set to \(0\); rather, the network treats them as \(0\) for that iteration. "Frozen" units are not updated in that iteration.

2.3 Dropout

Dropout is a very powerful regularization technique, first introduced by Srivastava et al. in the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". The figure below shows how Dropout is applied to a neural network.

Figure: Dropout

The idea is simple and effective. During training, in each forward/backward pass we randomly "drop" some subset of the neurons, each with probability \((1-p)\) (equivalently, each neuron is kept active with probability \(p\)). Then, at test time, we use all the neurons to make predictions.

Networks trained with Dropout generally learn more meaningful information from the data, are less likely to overfit, and usually achieve higher overall performance on the task at hand. One intuitive reason why this technique is so effective is that Dropout essentially trains exponentially many smaller networks at once and averages their predictions.

In practice, we apply Dropout by taking the output \(h\) of each layer of neurons and keeping each neuron active with probability \(p\), setting it to \(0\) otherwise. During backpropagation, we only pass gradients through the neurons that were active in the forward pass. Finally, at test time, we use all the neurons in the network for the forward computation. There is a key subtlety, however: for Dropout to work effectively, the expected output of a neuron at test time should be roughly the same as during training. Otherwise the magnitudes of the outputs may differ greatly and the behavior of the network is no longer well defined. We therefore usually have to rescale each neuron's output at test time. Since a neuron's expected output during training is \(p\) times its activation, the test-time output must be multiplied by \(p\), that is, divided by \(1/p\). An equivalent and common alternative, often called inverted dropout, divides the kept activations by \(p\) during training and leaves the test-time outputs unchanged.
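
The scaling argument can be checked numerically. The sketch below keeps each unit with probability \(p\) during training and multiplies the activations by \(p\) at test time, which is the value that makes the expected outputs match. The function names and the all-ones activation vector are assumptions made for this example.

```python
import numpy as np

def dropout_train(h, p, rng):
    """Keep each unit with probability p; return the masked activations and the mask."""
    mask = (rng.random(h.shape) < p).astype(h.dtype)
    return h * mask, mask

def dropout_test(h, p):
    """Use all units, scaled by p so the expected output matches training."""
    return h * p

rng = np.random.default_rng(0)
h = np.ones(100_000)  # illustrative activations
p = 0.8               # probability of keeping a unit

h_train, mask = dropout_train(h, p, rng)
h_test = dropout_test(h, p)

# In the backward pass, gradients flow only through the kept units:
upstream_grad = np.ones_like(h)
grad_h = upstream_grad * mask

print(h_train.mean(), h_test.mean())  # both close to p
```

The same mask is reused in the backward pass, so dropped units receive no gradient for that iteration.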

1) Additional notes on Dropout

The following is from "Neural Networks and Deep Learning"
  • Purpose: to alleviate overfitting, achieving a regularization effect to some extent
  • Effect: reduces the dependence of downstream nodes on any single unit, forcing the network to learn more robust features

2) Ensemble learning interpretation

Each drop is equivalent to sampling a sub-network from the original network. If a neural network has \(n\) neurons, then a total of \(2^n\) sub-networks can be sampled.

Each iteration is equivalent to training a different sub-network, which all share the parameters of the original network. Then, the final network can be approximated as a combined model that integrates exponentially different networks.

3) Bayesian learning interpretation

Dropout can also be interpreted as an approximation of Bayesian learning. Let \(y=f(\mathbf{x}, \theta)\) denote the neural network to be learned. Bayesian learning assumes that the parameter \(\theta\) is a random vector with prior distribution \(q(\theta)\). The Bayesian prediction is then:

$$ \begin{aligned} \mathbb{E}_{q(\theta)}[y] &=\int_{q} f(\mathbf{x}, \theta) q(\theta) d \theta \\ & \approx \frac{1}{M} \sum_{m=1}^{M} f\left(\mathbf{x}, \theta_m\right) \end{aligned} $$

where \(f(\mathbf{x}, \theta_m)\) is the network after the \(m\)th application of dropout, and its parameter \(\theta_m\) is one sample of the full parameter \(\theta\).

4) Variational Dropout in RNNs

Dropout usually drops neurons at random, but it can also be extended to randomly dropping connections between neurons, or to randomly dropping entire layers.

In an RNN, the hidden state at each time step cannot simply be dropped at random, because that would damage the network's ability to remember along the time dimension. A simple approach is to randomly drop only the connections that do not run along the time dimension (that is, the non-recurrent connections). In the figure, dotted edges denote random dropping and different colors denote different dropout masks.

Figure: Dropout on non-recurrent connections

However, according to the Bayesian interpretation, dropout is a sampling of the parameters \(\theta\), and each sampled set of parameters must stay the same across all time steps. Therefore, when applying dropout to a recurrent neural network, we should randomly drop elements of the parameter matrices and use the same dropout mask at every time step. This method is called Variational Dropout.

The figure below shows an example of variational dropout, with the same color representing the same dropout mask.

Figure: Variational dropout

2.4 Neuron activation function

The neural networks we saw earlier are all based on the sigmoid activation function for nonlinear classification. But in many applications, better neural networks can be designed using other activation functions. Listed below are some common activation functions and gradient definitions of activation functions, which are interchangeable with the sigmoidal functions discussed earlier.

1) Sigmoid

Figure: Neuron activation function - Sigmoid

This is the common choice we have discussed, the activation function \(\sigma\) is:

$$ \sigma(z)=\frac{1}{1+exp(-z)} $$

where \(\sigma(z)\in (0,1)\)

The gradient of \(\sigma(z)\) is:

$$ \sigma^{\prime}(z)=\frac{exp(-z)}{(1+exp(-z))^{2}}=\sigma(z)(1-\sigma(z)) $$

2) tanh

Figure: Neuron activation function - tanh

The tanh function is an alternative to the sigmoid function, which converges faster in practice. The main difference between tanh and sigmoid is that the output of tanh is in the range of -1 to 1, while the output of sigmoid is in the range of 0 to 1.

$$ tanh(z)=\frac{exp(z)-exp(-z)}{exp(z)+exp(-z)}=2\sigma(2z)-1 $$

where \(tanh(z)\in (-1, 1)\)

The gradient of \(tanh(z)\) is:

$$ tanh^{\prime}(z)=1-\bigg(\frac{exp(z)-exp(-z)}{exp(z)+exp(-z)}\bigg)^{2} =1-tanh^{2}(z) $$

3) hard tanh

Figure: Neuron activation function - hard tanh

The hardtanh function is sometimes preferred over the tanh function because it is cheaper to compute. However, it saturates when the magnitude of \(z\) is greater than \(1\): the output stays at \(-1\) for \(z<-1\) and at \(1\) for \(z>1\), as shown in the figure.

The hardtanh activation function is:

$$ \begin{aligned} hardtanh(z) = \begin{cases} -1& :z<-1\\ z & :-1\le z \le 1 \\ 1 & :z>1 \end{cases} \end{aligned} $$

The derivative of hardtanh can also be expressed as a piecewise function:

$$ \begin{aligned} hardtanh ^{\prime}(z) &= \begin{cases} 1 & :-1\le z \le 1 \\ 0 & :otherwise \end{cases} \end{aligned} $$
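As a small illustration, hardtanh and its piecewise derivative can be written with NumPy clipping. The function names here are chosen for this sketch only.

```python
import numpy as np

def hardtanh(z):
    # Clip to [-1, 1]; the function is the identity inside the interval.
    return np.clip(z, -1.0, 1.0)

def hardtanh_grad(z):
    # Slope 1 inside [-1, 1], slope 0 where the unit is saturated.
    return ((z >= -1.0) & (z <= 1.0)).astype(float)

z = np.array([-2.0, -1.0, 0.5, 1.0, 3.0])
values = hardtanh(z)       # [-1., -1., 0.5, 1., 1.]
slopes = hardtanh_grad(z)  # [0., 1., 1., 1., 0.]
```

The zero slopes at the two ends are the saturation mentioned above: no gradient flows back through a unit whose input lies outside \([-1, 1]\).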

4) soft sign

Neuron activation function - soft sign

The soft sign function is another nonlinear activation function that can be an alternative to tanh because it does not saturate prematurely like hard clipped functions:

$$ softsign(z)=\frac{z}{1+ \left | z \right |} $$

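The derivative of the soft sign function is:

$$ softsign^{\prime}(z)=\frac{1}{(1+\left | z \right |)^{2}} $$

This follows from the quotient rule together with \(\frac{d}{dz}\left | z \right | = sgn(z)\) and \(z\cdot sgn(z)=\left | z \right |\), where \(sgn\) is the sign function, which returns 1 or -1 according to the sign of \(z\).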

5) ReLU

Neuron activation function - ReLU

The ReLU (Rectified Linear Unit) function is a common choice of activation function, and it does not saturate even when the value of \(z\) is very large. It has been very successful in computer vision applications:

$$ rect(z)=max(z,0) $$

The derivative of the ReLU function is a piecewise function:

$$ \begin{aligned} rect^{\prime}(z) &= \begin{cases} 1 & :z > 0 \\ 0 & :otherwise \end{cases} \end{aligned} $$

6) Leaky ReLU

Neuron activation function - Leaky ReLU

When the value of \(z\) is less than \(0\), the traditional ReLU unit does not backpropagate any error. Leaky ReLU improves on this: when the value of \(z\) is less than \(0\), a small error is still propagated back.

$$ leaky(z)=max(z, k\cdot z) $$

where \(0 < k < 1\).

The derivative of the leaky ReLU function is a piecewise function:

$$ \begin{aligned} leaky ^{\prime} (z) &= \begin{cases} 1 & :z > 0 \\ k & :otherwise \end{cases} \end{aligned} $$
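ReLU and leaky ReLU and their derivatives can be sketched in a few lines of NumPy. The default slope `k=0.01` is an assumption of this example (any \(0<k<1\) fits the definition), and the function names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # 1 for z > 0, 0 otherwise.
    return (z > 0).astype(float)

def leaky_relu(z, k=0.01):
    # For 0 < k < 1, max(z, k*z) equals z when z > 0 and k*z otherwise.
    return np.maximum(z, k * z)

def leaky_relu_grad(z, k=0.01):
    # 1 for z > 0, k otherwise, so some gradient always flows back.
    return np.where(z > 0, 1.0, k)

z = np.array([-2.0, 0.0, 3.0])
relu_values = relu(z)                   # [0., 0., 3.]
leaky_values = leaky_relu(z, k=0.1)     # [-0.2, 0., 3.]
leaky_slopes = leaky_relu_grad(z, k=0.1)  # [0.1, 0.1, 1.]
```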

2.5 Data Preprocessing

As is the general case for machine learning models, a critical step in ensuring that the model is performing reasonably on the task at hand is to perform basic preprocessing on the data. Some common techniques are outlined below.

1) Mean Subtraction

Given a set of input data \(X\), the mean feature vector of \(X\) is generally subtracted from every example in \(X\) to zero-center the data. It is important in practice that the mean is computed on the training set only, and that this same mean is subtracted from the training set, the validation set and the test set.

2) Normalization

Another common technique (though not as common as mean subtraction) is to scale each input feature dimension so that all dimensions have a similar range of magnitudes. This is useful when different input features are measured in different "units", while we initially want to treat all features as equally important. It is achieved by dividing each feature by its standard deviation computed on the training set.
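Mean subtraction and normalization can be sketched together as follows. The synthetic data and the names `X_train`, `X_test` and `preprocess` are hypothetical; the point of the example is that the statistics come from the training set only and are then reused for every split.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: rows are examples, columns are features on different scales.
X_train = rng.normal(loc=[5.0, -2.0], scale=[10.0, 0.1], size=(200, 2))
X_test = rng.normal(loc=[5.0, -2.0], scale=[10.0, 0.1], size=(50, 2))

# Statistics are computed on the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

def preprocess(X):
    # Apply the SAME training-set mean and std to any split.
    return (X - mean) / std

X_train_p = preprocess(X_train)
X_test_p = preprocess(X_test)
```

After this step the training features have zero mean and unit standard deviation; the test features will be close to that but not exactly, since they are transformed with the training statistics.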

3) Whitening

Compared with the two methods above, whitening is less commonly used. It transforms the data so that the features become uncorrelated and all have the same variance (the covariance matrix becomes the identity matrix). First perform mean subtraction on the data to get \(X^{\prime}\). Then perform singular value decomposition (SVD) on \(X^{\prime}\) to get the matrices \(U\), \(S\), \(V\), and compute \(UX^{\prime}\) to project \(X^{\prime}\) onto the basis defined by the columns of \(U\). Finally, divide each dimension of the result by the corresponding singular value in \(S\) to scale the data appropriately (if a singular value is 0, divide by a small value instead).
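A minimal whitening sketch in NumPy is shown below. Note that it uses a common variant rather than the exact recipe in the text: the SVD is taken of the covariance matrix of the centered data (rows are examples), and each rotated dimension is divided by the square root of the corresponding singular value of that covariance matrix. The data and the small constant `1e-8` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: rows are examples, columns are features.
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(500, 2)) @ A.T

X_centered = X - X.mean(axis=0)                       # mean subtraction
cov = X_centered.T @ X_centered / X_centered.shape[0]  # feature covariance
U, S, _ = np.linalg.svd(cov)         # columns of U are the principal directions
X_rot = X_centered @ U               # project onto that basis (decorrelate)
X_white = X_rot / np.sqrt(S + 1e-8)  # equalize variances; 1e-8 guards against 0

cov_white = X_white.T @ X_white / X_white.shape[0]
```

If the transformation worked, `cov_white` is numerically very close to the identity matrix.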

2.6 Parameter initialization

A critical step in getting the best performance of a neural network is to initialize the parameters in a sensible way. A good way to start is to initialize the weights to small random numbers that are usually distributed around 0 - it works pretty well in practice.

In the paper "Understanding the difficulty of training deep feedforward neural networks" (2010), Xavier Glorot and Yoshua Bengio study the effect of different weight and bias initialization schemes on training dynamics. The experimental results show that, for sigmoid and tanh activation units, faster convergence and lower error are achieved when a weight matrix \(W\in \mathbb{R}^{n^{(l+1)}\times n^{(l)}}\) is initialized randomly from the following uniform distribution:

$$ W\sim U\bigg[-\sqrt{\frac{6}{n^{(l)}+n^{(l+1)}}},\sqrt{\frac{6}{n^ {(l)}+n^{(l+1)}}}\;\bigg] $$

where \(n^{(l)}\) is the number of input units of \(W\) \((fan\text{-}in)\) and \(n^{(l+1)}\) is the number of output units of \(W\) \((fan\text{-}out)\). In this parameter initialization scheme, the bias units are initialized to \(0\). This approach tries to preserve the variance of the activations, as well as the variance of the back-propagated gradients, across layers. Without such initialization, the gradient variance (which is a proxy for the information it carries) typically decays as it is backpropagated across layers.
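The uniform initialization above can be sketched as a small helper. The function name `xavier_uniform` and the layer sizes are chosen for this example only.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample W of shape (fan_out, fan_in) from U[-limit, limit]; biases are 0."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    W = rng.uniform(-limit, limit, size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

rng = np.random.default_rng(0)
W, b = xavier_uniform(fan_in=300, fan_out=100, rng=rng)
limit = np.sqrt(6.0 / (300 + 100))
```

Since a uniform distribution on \([-a, a]\) has variance \(a^2/3\), the weights here have variance \(2/(n^{(l)}+n^{(l+1)})\), which is 0.005 for these sizes.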

2.7 Learning strategies

The rate/magnitude of model parameter updates during training can be controlled using the learning rate. In the simplest gradient descent formulation, \(\alpha\) is the learning rate:

$$ \theta^{new}=\theta^{old}-\alpha\nabla_{\theta}J_{t}(\theta) $$

You might think that, to converge faster, we should take a larger value for \(\alpha\). However, a larger learning rate does not guarantee faster convergence. In fact, if the learning rate is very high, the loss function may fail to converge: the parameter updates are so large that the model overshoots the minimum of a convex objective, as shown in the figure below. In a non-convex model (and most of the models we encounter are non-convex), the result of a high learning rate is hard to predict, but the loss function is very likely to fail to converge.

Learning strategies

A simple way to avoid a diverging loss is to use a small learning rate and let the model move carefully through the parameter space. Of course, if the learning rate is too small, the loss function may not converge in a reasonable time, or it may get stuck at a local optimum. Therefore, like any other hyperparameter, the learning rate must be tuned effectively.
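The effect of the learning rate can be seen on a toy convex problem. The sketch below minimizes the hypothetical objective \(J(\theta)=\theta^2\) (chosen only for illustration) with plain gradient descent; each update multiplies \(\theta\) by \(1-2\alpha\), so the behavior for each \(\alpha\) is easy to predict.

```python
def gradient_descent(alpha, theta0=1.0, steps=20):
    """Minimize J(theta) = theta**2 with plain gradient descent."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta            # dJ/dtheta
        theta = theta - alpha * grad  # theta_new = theta_old - alpha * grad
    return theta

theta_small = gradient_descent(alpha=0.001)  # moves toward 0, but very slowly
theta_good = gradient_descent(alpha=0.1)     # gets close to 0 in 20 steps
theta_large = gradient_descent(alpha=1.1)    # overshoots and diverges
```

With \(\alpha=1.1\) the multiplier is \(-1.2\), so every step jumps across the minimum and lands farther away than before.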

The most computationally expensive part of a deep learning system is the training phase, so some research has tried to improve on this naive way of setting the learning rate. For example, Ronan Collobert scales the learning rate of a weight \(W_{ij}\) (where \(W\in \mathbb{R}^{n^{(l+1)}\times n^{(l)}}\)) by the inverse square root of the \(fan\text{-}in\) of the neuron, \(n^{(l)}\).

Another technique that has proven effective is annealing, in which the learning rate is reduced after a number of iterations. This lets us start training with a high learning rate and approach a minimum quickly; as we get closer to the minimum, we lower the learning rate so that we can search for the optimum on a finer scale. A common way to implement annealing is to reduce the learning rate \(\alpha\) by a factor \(x\) after every \(n\) iterations of learning.

Exponential decay is also very common: after \(t\) iterations the learning rate becomes \(\alpha(t)=\alpha_0 e^{-kt}\), where the initial learning rate \(\alpha_0\) and \(k\) are hyperparameters.

Yet another approach is to allow the learning rate to decrease over time:

$$ \alpha(t)=\frac{\alpha_0\tau}{max(t,\tau)} $$

In the above scheme, \(\alpha_0\) is a tunable parameter that represents the starting learning rate. \(\tau\) is also a tunable parameter; it indicates the point in time at which the learning rate should start to decrease. In practice, this method is very effective. A later section discusses an adaptive gradient descent method that does not require manually set learning rates.
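The three schedules described above can be written as small functions. This is a sketch under assumptions: the function and argument names are invented for the example, and "reduce by a factor \(x\)" is read here as dividing the learning rate by `factor`.

```python
import math

def step_decay(alpha0, t, factor, every):
    # Divide the learning rate by `factor` after every `every` iterations.
    return alpha0 / (factor ** (t // every))

def exponential_decay(alpha0, t, k):
    # alpha(t) = alpha0 * exp(-k * t)
    return alpha0 * math.exp(-k * t)

def inverse_time_decay(alpha0, t, tau):
    # alpha(t) = alpha0 * tau / max(t, tau):
    # constant until t reaches tau, then decaying like 1/t.
    return alpha0 * tau / max(t, tau)

alpha_step = step_decay(0.1, t=25, factor=2, every=10)   # two reductions so far
alpha_before = inverse_time_decay(0.1, t=50, tau=100)    # still the initial rate
alpha_after = inverse_time_decay(0.1, t=200, tau=100)    # half the initial rate
```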

2.8 Optimization update with momentum (Momentum)

(For neural network optimization algorithms, see also ShowMeAI's summary article on Andrew Ng's course: Deep Learning Tutorial | Neural Network Optimization Algorithm)

The momentum method, inspired by the study of dynamics in physics, is a variant of the gradient descent method that attempts a more efficient update scheme using the "velocity" of the update. The pseudocode for the momentum update looks like this:

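# Computes a standard momentum update on parameters x
# mu: momentum coefficient, alpha: learning rate, grad_x: gradient w.r.t. x
v = mu * v - alpha * grad_x  # velocity: decaying sum of past gradient steps
x += v                       # move the parameters along the velocity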
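To see the update in action, the sketch below runs it on a hypothetical ill-conditioned quadratic, \(f(x)=\frac{1}{2}(x_0^2+10x_1^2)\). The objective, the hyperparameter values and the helper names are assumptions made for this illustration; setting `mu=0` recovers plain gradient descent for comparison.

```python
import numpy as np

def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2).
    return np.array([1.0, 10.0]) * x

def run(mu, alpha=0.02, steps=100):
    x = np.array([1.0, 1.0])
    v = np.zeros_like(x)
    for _ in range(steps):
        v = mu * v - alpha * grad(x)  # momentum update of the velocity
        x = x + v                     # parameter update
    return x

x_plain = run(mu=0.0)     # mu = 0 reduces to plain gradient descent
x_momentum = run(mu=0.9)  # velocity speeds up progress along the flat direction

dist_plain = np.linalg.norm(x_plain)
dist_momentum = np.linalg.norm(x_momentum)
```

On this problem the run with momentum ends much closer to the minimum at the origin, because the accumulated velocity keeps pushing along the slowly changing \(x_0\) direction.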

2.9 Adaptive Optimization Algorithm

(For neural network optimization algorithms, see also ShowMeAI's summary article on Andrew Ng's course: Deep Learning Tutorial | Neural Network Optimization Algorithm)

AdaGrad is an implementation of standard stochastic gradient descent (SGD) with one key difference: the learning rate is different for each parameter. The learning rate of each parameter depends on the history of that parameter's gradient updates: the smaller the past updates, the larger the learning rate and the faster the update. In other words, parameters that have not been updated much in the past are likely to have a higher learning rate now.

$$ \theta_{t,i}=\theta_{t-1,i}-\frac{\alpha}{\sqrt{\sum_{\tau=1}^{t}g_{\tau,i}^{2}}} g_{t,i} \quad \text{where} \quad g_{t,i}=\frac{\partial}{\partial\theta_i^{t}}J_{t}(\theta) $$

In this technique, we see that if the historical RMS of the gradient is low, the learning rate can be very high. A simple implementation of this technique looks like this:

# Assume the gradient dx and parameter vector x (NumPy arrays);
# cache has the same shape as x and starts at zero
cache += dx ** 2
x += -learning_rate * dx / np.sqrt(cache + 1e-8)
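The per-parameter behavior can be made concrete with a tiny run of the update above. The gradient values are invented for illustration: parameter 0 always receives a large gradient and parameter 1 a small one.

```python
import numpy as np

learning_rate = 0.1
x = np.zeros(2)
cache = np.zeros(2)

# Hypothetical gradient history: large gradients for parameter 0,
# small gradients for parameter 1.
for _ in range(10):
    dx = np.array([10.0, 0.1])
    cache += dx ** 2
    x += -learning_rate * dx / np.sqrt(cache + 1e-8)

# Effective per-parameter learning rate after 10 steps.
effective_lr = learning_rate / np.sqrt(cache + 1e-8)
```

Here the parameter with the small gradient history ends up with an effective learning rate about 100 times larger, and as a result both parameters have moved almost exactly the same distance.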

Other common adaptive methods are RMSProp and Adam, whose update rules are as follows:

# Update rule for RMSProp
cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + eps)

# Update rule for Adam
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx ** 2)
x += -learning_rate * m / (np.sqrt(v) + eps)

  • RMSProp is a variant of AdaGrad that uses a moving average of the squared gradients; unlike AdaGrad, its updates do not become monotonically smaller.
  • The Adam update rule is in turn a variant of RMSProp, with the addition of momentum-like updates.
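The two update rules can be wrapped into runnable loops on a toy problem, as sketched below. The quadratic objective, the hyperparameter values and the function names are assumptions for this example. The Adam loop follows the simplified rule shown above; the full algorithm also applies a bias correction to `m` and `v`, which is omitted here.

```python
import numpy as np

def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2).
    return np.array([1.0, 10.0]) * x

def rmsprop(x, steps=500, learning_rate=0.01, decay_rate=0.9, eps=1e-8):
    cache = np.zeros_like(x)
    for _ in range(steps):
        dx = grad(x)
        # Moving average of squared gradients.
        cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
        x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x

def adam(x, steps=500, learning_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(x)  # moving average of gradients (momentum-like term)
    v = np.zeros_like(x)  # moving average of squared gradients
    for _ in range(steps):
        dx = grad(x)
        m = beta1 * m + (1 - beta1) * dx
        v = beta2 * v + (1 - beta2) * dx ** 2
        x = x - learning_rate * m / (np.sqrt(v) + eps)  # no bias correction
    return x

x0 = np.array([1.0, 1.0])
x_rmsprop = rmsprop(x0.copy())
x_adam = adam(x0.copy())
```

Both runs should end near the minimum at the origin even though the two coordinates have gradients that differ by a factor of 10, since each coordinate's step is normalized by its own gradient history.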

3. References

ShowMeAI series tutorial recommendation

NLP series of tutorial articles

Stanford CS224n course with detailed explanation

