- Author: Han Xinzi @ShowMeAI
- Tutorial address : https://www.showmeai.tech/tutorials/37
- Address of this article : https://www.showmeai.tech/article-detail/263
- Disclaimer: All rights reserved, please contact the platform and the author for reprinting and indicate the source
- Bookmark ShowMeAI for more exciting content
This series is a complete set of study notes for the Stanford CS231n course "Deep Learning for Computer Vision"; the corresponding course videos can be viewed here. See the end of the article for more information.
Introduction
In the previous tutorial, Deep Learning and CV Tutorial (3) | Loss Function and Optimization, we introduced how to construct loss functions for linear models and optimization algorithms such as gradient descent. In this article we turn to neural networks, explaining computation graphs, backpropagation, and neural network structure.
The focus of this article
- Neural network computation graph
- Backpropagation
- Neural network structure
1. Backpropagation algorithm
Training a neural network with gradient descent (or similar methods) requires computing the gradient of the loss function, and the core technique for doing so is backpropagation: a way of recursively computing the gradients of complex functions using the chain rule. The core capability of mainstream deep learning libraries such as TensorFlow and PyTorch is exactly this kind of automatic differentiation. In this section, ShowMeAI explains computation graphs and backpropagation, following the fourth lecture of CS231n.
For more on neural network backpropagation, you can also refer to the articles Neural Network Basics, Shallow Neural Networks, and Deep Neural Networks in ShowMeAI's Deep Learning Tutorial | Wu Enda Special Course · Interpretation of a full set of notes, which explain forward and backward propagation for networks of different depths.
1.1 Backpropagation in scalar form
1) Examples
Let's look at a simple example: the function \(f(x,y,z) = (x + y) z\) with initial values \(x = -2\), \(y = 5\), \(z = -4\). This expression can be differentiated directly, but here we use a method that helps build intuition for backpropagation.
The figure below is the circuit diagram of the entire calculation, the green part is the function value, and the red part is the gradient. (The gradient is a vector, but the partial derivative with respect to \(x\) is often called the gradient on \(x\).)
The above formula can be divided into 2 parts, \(q = x + y\) and \(f = qz\). They are all simple enough to write the gradient expression directly:
- \(f\) is the product of \(q\) and \(z\), so \(\frac{\partial f}{\partial q} = z=-4\), \(\frac{\partial f}{\partial z} = q=3\)
- \(q\) is the sum of \(x\) and \(y\), so \(\frac{\partial q}{\partial x} = 1\), \(\frac{\partial q}{\partial y} = 1\)
We don't ultimately care about the gradient on \(q\) (\(\frac{\partial f}{\partial q}\) is only an intermediate quantity); what we want are the gradients of \(f\) with respect to \(x, y, z\). The chain rule tells us that we can "chain" these gradient expressions together by multiplication, for example
$$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} =-4 $$
- Similarly, \(\frac{\partial f}{\partial y} = -4\); note also that \(\frac{\partial f}{\partial f} = 1\)
Forward propagation computes values from the inputs to the output (shown in green); backpropagation starts at the end and, following the chain rule, recursively computes gradients backwards (shown in red) all the way to the inputs of the network. You can think of the gradient as flowing backwards along the computational chain.
The reference python implementation code for the above calculation is as follows:
# Set the input values
x = -2; y = 5; z = -4
# Forward pass
q = x + y # q is 3
f = q * z # f is -12
# Backward pass:
# First backprop through f = q * z
dfdz = q # df/dz = q, so the gradient on z is 3
dfdq = z # df/dq = z, so the gradient on q is -4
# Now backprop through q = x + y
dfdx = 1.0 * dfdq # dq/dx = 1. The multiplication here is the chain rule. So df/dx is -4
dfdy = 1.0 * dfdq # dq/dy = 1. So df/dy is -4
# The 'df' prefix in the variable names is usually omitted (e.g. dx instead of dfdx)
2) Intuitive understanding of backpropagation
Backpropagation is an elegantly local process.
Take the figure below as an example. In the full computational circuit, each gate unit (the \(f\) node) receives some input values \(x\), \(y\) and can immediately compute two things: its output value \(z\), and the local gradients of its output with respect to its inputs, \(\frac{\partial z}{\partial x}\) and \(\frac{\partial z}{\partial y}\).
These two computations are carried out by the gate completely independently during the forward pass; it does not need to know anything about the rest of the circuit. During backpropagation, however, the gate receives the gradient of the network's final output with respect to its own output, \(\frac{\partial L}{\partial z}\).
By the chain rule, the gradient of the network output with respect to each of the gate's inputs is obtained by multiplying this returned gradient by the local gradient of the output with respect to that input, giving \(\frac{\partial L}{\partial x}\) and \(\frac{\partial L}{\partial y}\). These values in turn become the returned gradients for the gates further upstream.
Backpropagation can therefore be seen as gate units communicating with each other through gradient signals: each gradient tells an input in which direction (and how strongly) to change so that, regardless of whether the gate's own output rises or falls, the final output value of the whole network increases.
For example, the gradients on \(x\) and \(y\) in the example are \(-4\), so decreasing \(x\) and \(y\) will decrease \(q\) but increase the final output \(f\) (of course, if \(f\) were a loss we want to minimize, we would move against the gradient).
3) Addition gate, multiplication gate and max gate
Two gate units are used in the example: addition and multiplication.
- Addition, partial derivatives: \(f(x,y) = x + y \rightarrow \frac{\partial f}{\partial x} = 1,\ \frac{\partial f}{\partial y} = 1\)
- Multiplication, partial derivatives: \(f(x,y) = xy \rightarrow \frac{\partial f}{\partial x} = y,\ \frac{\partial f}{\partial y} = x\)
In addition, common operations include taking the maximum value:
$$ f(x,y) = \max(x, y) \rightarrow \frac{\partial f}{\partial x} = \mathbb{1}(x \ge y), \quad \frac{\partial f}{\partial y} = \mathbb{1}(y \ge x) $$
The meaning of the above formula is: if one variable is larger than (or equal to) the other, its gradient is \(1\), otherwise it is \(0\).
- The addition gate unit is a gradient distributor: the gradient on each input equals the gradient on the output, regardless of the input values during forward propagation;
- The multiplication gate unit is a gradient switcher: the gradient on each input equals the output gradient multiplied by the value of the other input (or by a constant \(a\) for a gate of the form \(ax\)). The max gate unit is a gradient router: the larger input receives a gradient equal to the output gradient, and the smaller input receives \(0\).
The local gradients of the multiplication gate are the input values, swapped, which are then multiplied by the output gradient according to the chain rule. As a consequence, if one input of a multiplication gate is very small and the other is very large, the gate assigns a large gradient to the small input and a small gradient to the large input.
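As a minimal sketch (added here for illustration, not part of the original notes), the backward rules of these three gates can be written as small Python functions; the function names are purely illustrative:

```python
# Illustrative sketch of the three gate units' backward rules.
def add_gate_backward(dout):
    # Gradient distributor: both inputs receive the upstream gradient unchanged.
    return dout, dout

def mul_gate_backward(x, y, dout):
    # Gradient switcher: each input gets the upstream gradient times the *other* input.
    return y * dout, x * dout

def max_gate_backward(x, y, dout):
    # Gradient router: the larger input receives the full gradient, the other gets 0.
    return (dout if x >= y else 0.0), (dout if y >= x else 0.0)

# Example: a small and a large input to a multiply gate.
dx, dy = mul_gate_backward(0.01, 100.0, 1.0)
print(dx, dy)   # 100.0 0.01 -- the small input gets the large gradient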
Taking the linear classifier discussed earlier as an example, the scores are dot products \(w^T x_i\) between the weights and the inputs, so the scale of the input data affects the scale of the weight gradient. Specifically, if all input samples \(x_i\) were multiplied by 100, the gradient on the weights would also grow by a factor of 100, and the learning rate would have to be lowered to compensate.
This also shows that data preprocessing matters a great deal: even a seemingly small change in scale can have a large effect.
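A quick numerical illustration of this point (a sketch added here, using the CIFAR-10-style shapes mentioned later in the article; the variable names are illustrative):

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(10, 3072)            # linear classifier weights
x = np.random.randn(3072, 1)             # one input sample
ds = np.random.randn(10, 1)              # some upstream gradient on the scores s = W x

dW = ds.dot(x.T)                         # gradient on W
dW_scaled = ds.dot((100 * x).T)          # same, but with the input scaled by 100
print(np.allclose(dW_scaled, 100 * dW))  # True -- the weight gradient grows 100x
```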
Having an intuitive understanding of how gradients flow through computational circuits can help in debugging neural networks.
4) Complex example
Let's look at a slightly more complicated example:
$$ f(w,x) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}} $$
Backpropagating through this expression requires a few new gate units:
$$ \begin{aligned} f(x) &= \frac{1}{x} & \rightarrow \quad \frac{df}{dx} &= -\frac{1}{x^2} \\ f_c(x) &= c + x & \rightarrow \quad \frac{df}{dx} &= 1 \\ f(x) &= e^x & \rightarrow \quad \frac{df}{dx} &= e^x \\ f_a(x) &= ax & \rightarrow \quad \frac{df}{dx} &= a \end{aligned} $$
The calculation process is as follows:
- For the \(1/x\) gate unit, the upstream gradient is \(1\) and the local gradient is \(-1/x^2 = -1/1.37^2 = -0.53\), so the gradient on its input is \(1 \times (-0.53) = -0.53\); the \(+1\) gate unit does not change the gradient, which stays \(-0.53\)
- The local gradient of the exp gate unit is \(e^x = e^{-1}\); multiplying by the upstream gradient \(-0.53\) gives about \(-0.2\)
- The \(\times(-1)\) gate unit flips the sign of the gradient, giving \(0.2\)
- The addition gate unit distributes the gradient, so all three addition branches receive \(0.2\)
- The two multiplication gate units at the left switch the gradient: each input receives the upstream gradient multiplied by the other input's value, giving \(-0.2\), \(0.4\), \(-0.4\), \(-0.6\)
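The gate-by-gate calculation above can be checked with a short script (a sketch added here for illustration; it uses the circuit's input values \(w_0=2, x_0=-1, w_1=-3, x_1=-2, w_2=-3\), the same numbers as the reference code further below, and walks the circuit gate by gate rather than using the simplified sigmoid form):

```python
import math

# Forward pass through the circuit, gate by gate
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
dot = w0*x0 + w1*x1 + w2    # 1.0
neg = -1.0 * dot            # -1.0
e   = math.exp(neg)         # ~0.37
den = 1.0 + e               # ~1.37
f   = 1.0 / den             # ~0.73

# Backward pass: each step multiplies the local gradient by the upstream gradient
dden = (-1.0 / den**2) * 1.0      # 1/x gate: ~ -0.53
de   = 1.0 * dden                 # +1 gate passes the gradient through
dneg = e * de                     # exp gate: ~ -0.20
ddot = -1.0 * dneg                # *(-1) gate flips the sign: ~ 0.20
dw0, dx0 = x0 * ddot, w0 * ddot   # ~ -0.2, 0.4
dw1, dx1 = x1 * ddot, w1 * ddot   # ~ -0.4, -0.6
dw2 = 1.0 * ddot                  # ~ 0.2
```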
5) Sigmoid gate unit
Any differentiable function can be treated as a "gate": multiple gates can be merged into one, and a function can be split into several gates as convenient. In particular, the four rightmost gate units in the circuit can be combined into a single gate computing \(\sigma(x) = \frac{1}{1+e^{-x}}\), the sigmoid function.
The sigmoid function can be differentiated:
$$ \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = \left( \frac{1 + e^{-x} - 1}{1 + e^{-x}} \right) \left( \frac{1}{1+e^{-x}} \right) = \left( 1 - \sigma(x) \right) \sigma(x) $$
Therefore, in the example above, since \(\sigma(x) = 0.73\) has already been computed in the forward pass, the gradient on the input of the combined sigmoid gate (the dot product) can be obtained directly as \(1 \times (1 - 0.73) \times 0.73 \approx 0.2\), which greatly simplifies the calculation.
The reference python implementation of backpropagation for the above example is as follows:
import math

# Assume some random data and weights
w = [2, -3, -3]
x = [-1, -2]
# Forward pass: compute the output value
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function
# Backward pass: compute the gradients
ddot = (1 - f) * f # gradient on the dot-product variable, using the sigmoid derivative
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# We now have the gradients on the inputs
In practice, it is often useful to break the forward pass into stages that are easy to backpropagate through. For example, we created an intermediate variable dot holding the dot product of \(w\) and \(x\); during backpropagation, the corresponding gradient variables (such as ddot, dx and dw) can then be computed quickly.
This article lists many examples. The point of these examples is to illustrate forward and backward propagation, which operations can be grouped into gates, and how to simplify them so that they can be "chained" together with less code and higher efficiency.
6) Staged computation example
$$ f(x,y) = \frac{x + \sigma(y)}{\sigma(x) + (x+y)^2} $$
This expression is only meant for practicing backpropagation: differentiating it with respect to \(x, y\) directly would involve a lot of work. The following code implements the forward pass:
import math

x = 3 # example values
y = -4
# Forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in the numerator #(1)
num = x + sigy # numerator #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in the denominator #(3)
xpy = x + y #(4)
xpysqr = xpy**2 #(5)
den = sigx + xpysqr # denominator #(6)
invden = 1.0 / den #(7)
f = num * invden #(8)
The code creates a number of intermediate variables, each of which is a simple expression whose local gradient is known. This makes the backward pass much easier to write:
- We backpropagate through each variable produced during the forward pass (sigy, num, sigx, xpy, xpysqr, den, invden).
- We use one gradient variable per forward variable, named with a leading d (e.g. dsigy for sigy), to store the gradient of the output with respect to that variable.
- Note: each small block of backpropagation multiplies the local gradient of its expression by the upstream gradient, following the chain rule. Each line of code below is annotated with the number of the corresponding line of the forward pass.
# Backprop f = num * invden
dnum = invden # gradient on the numerator #(8)
dinvden = num # gradient on invden #(8)
# Backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden #(7)
# Backprop den = sigx + xpysqr
dsigx = (1) * dden #(6)
dxpysqr = (1) * dden #(6)
# Backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr #(5)
# Backprop xpy = x + y
dx = (1) * dxpy #(4)
dy = (1) * dxpy #(4)
# Backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # note the use of +=, explained below #(3)
# Backprop num = x + sigy
dx += (1) * dnum #(2)
dsigy = (1) * dnum #(2)
# Backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy #(1)
Additional explanation :
① Cache forward propagation variables
- Some intermediate variables obtained during forward propagation are very useful when calculating backpropagation.
- During the implementation process, these intermediate variables are cached in the code, so that they can also be used during backpropagation.
② Gradients from different branches should be added
- If the variable \(x,y\) appears multiple times in the forward-propagated expression, be very careful when doing back-propagation and use \(+=\) instead of \(=\) to accumulate gradients of these variables.
- According to the multivariate chain rule in calculus, if a variable flows to different branches of the circuit, the gradients flowing back along those branches must be accumulated. That is:
$$ \frac{\partial f}{\partial x} =\sum_{q_i}\frac{\partial f}{\partial q_i}\frac{\partial q_i}{\partial x} $$
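To convince yourself that the += accumulation above is correct, you can compare the analytic dx and dy from the backward pass with a numerical gradient (a sketch added here for illustration; it simply wraps the forward pass in a function):

```python
import math

def f(x, y):
    sigy = 1.0 / (1 + math.exp(-y))
    num = x + sigy
    sigx = 1.0 / (1 + math.exp(-x))
    den = sigx + (x + y)**2
    return num / den

x, y = 3.0, -4.0
h = 1e-5
dx_num = (f(x + h, y) - f(x - h, y)) / (2 * h)   # numerical df/dx
dy_num = (f(x, y + h) - f(x, y - h)) / (2 * h)   # numerical df/dy
print(dx_num, dy_num)  # should closely match dx and dy from the backward pass above
```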
7) Practical application
If the computational graph has been split into gate units, the main class might be structured as follows (pseudocode):
class ComputationalGraph(object):
    # ...
    def forward(self, inputs):
        # Pass the inputs to the input gate units
        # Forward-propagate through the computational graph:
        # iterate over the gate units in topologically sorted order
        for gate in self.graph.nodes_topologically_sorted():
            gate.forward() # each gate unit has a forward function
        return loss # the final output is the loss
    def backward(self):
        # Iterate over the gate units in reverse order
        for gate in reversed(self.graph.nodes_topologically_sorted()):
            gate.backward() # the backward function applies the chain rule
        return inputs_gradients # return the gradients on the inputs
A gate unit class can be defined as follows, such as a multiplication unit:
class MultiplyGate(object):
    def forward(self, x, y):
        z = x * y
        self.x = x # cache the inputs for use in the backward pass
        self.y = y
        return z
    def backward(self, dz):
        # Local gradients are the swapped inputs, times the upstream gradient dz
        dx = self.y * dz
        dy = self.x * dz
        return [dx, dy]
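A quick usage check of this gate on the numbers from the first example of this article (q = 3, z = -4, upstream gradient 1); a small illustrative sketch:

```python
gate = MultiplyGate()
f = gate.forward(3, -4)        # forward: f = q * z = -12
dq, dz = gate.backward(1.0)    # backward with upstream gradient df/df = 1
print(f, dq, dz)               # -12, -4.0, 3.0 -- matches the scalar example above
```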
1.2 Backpropagation in vector form
Consider a simple example first: an elementwise \(\max(0, x)\) applied to a 4096-dimensional input vector \(x\).
This \(\max\) function compares each element of the input vector \(x\) with \(0\) and outputs the larger value, so the output vector is also \(4096\)-dimensional. The gradient in this case is the Jacobian matrix, i.e. the matrix of partial derivatives of each output element with respect to each input element.
If the input \(x\) is an \(n\)-dimensional vector and the output \(y\) is an \(m\)-dimensional vector, then each of \(y_1, y_2, \cdots, y_m\) is a function of \(x_1, \ldots, x_n\), and the resulting Jacobian matrix is:
$$ \left[\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right] $$
In this example the Jacobian is a \([4096 \times 4096]\) matrix: the output has \(4096\) elements, each with \(4096\) partial derivatives. On closer inspection, each output element depends only on the input element at the same position, so the Jacobian is a diagonal matrix.
In practical applications, a batch of, say, \(100\) inputs \(x\) is often processed at the same time; the Jacobian would then be a \([409600 \times 409600]\) diagonal matrix (for this particular \(f\)).
In practice, it is impossible to write out and store the Jacobian completely, because the dimensions are extremely large.
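Instead of materializing such a Jacobian, the product with the upstream gradient is computed directly; for the elementwise \(\max(0, x)\) it reduces to a simple mask. A minimal sketch, added here for illustration:

```python
import numpy as np

x = np.random.randn(4096)
y = np.maximum(0, x)          # forward: elementwise max(0, x)
dy = np.random.randn(4096)    # upstream gradient dL/dy

# The full Jacobian would be a 4096 x 4096 diagonal matrix of 0/1 entries.
# Its product with dy is just an elementwise mask:
dx = dy * (x > 0)             # dL/dx, without ever forming the Jacobian
```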
1) An example
The target formula is: \(f(x,W) = \Vert W \cdot x \Vert^2 = \sum_{i=1}^n (W \cdot x)_{i}^{2}\)
where \(x\) is an \(n\)-dimensional vector and \(W\) is an \(n \times n\) matrix.
Set \(q = W \cdot x\); we then get the following formulas:
$$ q = W \cdot x = \left(\begin{array}{c} W_{1,1} x_{1}+\cdots+W_{1,n} x_{n} \\ \vdots \\ W_{n,1} x_{1}+\cdots+W_{n,n} x_{n} \end{array}\right) $$
$$ f(q)=\|q\|^{2}=q_{1}^{2}+\cdots+q_{n}^{2} $$
As can be seen:
- \(\frac{\partial f}{\partial q_i} = 2q_i\), so the gradient of \(f\) with respect to \(q\) is \(2q\);
- \(\frac{\partial q_k}{\partial W_{i,j}} = \mathbb{1}(k=i)\, x_j\), so \(\frac{\partial f}{\partial W_{i,j}} = \sum_{k=1}^n \frac{\partial f}{\partial q_k}\frac{\partial q_k}{\partial W_{i,j}} = \sum_{k=1}^n (2q_k)\,\mathbb{1}(k=i)\, x_j = 2 q_i x_j\), so the gradient of \(f\) with respect to \(W\) is \(2q \cdot x^T\);
- \(\frac{\partial q_k}{\partial x_i} = W_{k,i}\), so \(\frac{\partial f}{\partial x_i} = \sum_{k=1}^n \frac{\partial f}{\partial q_k}\frac{\partial q_k}{\partial x_i} = \sum_{k=1}^n (2q_k) W_{k,i}\), so the gradient of \(f\) with respect to \(x\) is \(2W^T \cdot q\)
The following is the calculation diagram:
2) Code implementation
import numpy as np

# Initial values
W = np.array([[0.1, 0.5], [-0.3, 0.8]])
x = np.array([0.2, 0.4]).reshape((2, 1)) # reshape so that dq.dot(x.T) is a matrix rather than a scalar
# Forward pass
q = W.dot(x)
f = np.sum(np.square(q), axis=0)
# Backward pass
# Backprop f = np.sum(np.square(q), axis=0)
dq = 2*q
# Backprop q = W.dot(x)
dW = dq.dot(x.T) # x.T is the transpose of x
dx = W.T.dot(dq)
Note: analyze the dimensions! There is no need to memorize the expressions for \(dW\) and \(dx\), because they are easy to re-derive from the dimensions.
The size of the weight gradient \(dW\) must be the same as the size of the weight matrix \(W\)
- The \(f\) output here is a real number, so \(dW\) and \(W\) have the same shape.
- If you considered \(dq/dW\) instead, by the definition of the Jacobian it would be \(2 \times 2 \times 2\) dimensional; to reduce computation it is never formed explicitly (its nonzero entries are just the components of \(x\)).
- In fact there is no need to think about it in such a complicated way: since the final loss is a scalar, the gradient arriving at each gate unit's input always has the same shape as that input. For a description of this, you can click here; the official course website has a detailed derivation.
- The expression is then pinned down by the matrix multiplication of \(x\) and \(dq\): there is only one way to multiply them so that the dimensions match up.
For example, \(x\) has dimension \([2 \times 1]\) and \(dq\) has dimension \([2 \times 1]\). If \(dW\) is to have the same \([2 \times 2]\) shape as \(W\), it must be dq.dot(x.T); x.T.dot(dq) would give the wrong result (and \(dq\), being the returned gradient, must not be transposed!).
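A shape check plus a quick numerical verification of one entry of \(dW\) (a self-contained sketch added here, repeating the small example above):

```python
import numpy as np

W = np.array([[0.1, 0.5], [-0.3, 0.8]])
x = np.array([0.2, 0.4]).reshape((2, 1))
q = W.dot(x)
f = np.sum(np.square(q))
dq = 2 * q
dW = dq.dot(x.T)            # shape (2, 2) == W.shape
dx = W.T.dot(dq)            # shape (2, 1) == x.shape
assert dW.shape == W.shape and dx.shape == x.shape

# Numerical check of one entry of dW
h = 1e-5
W2 = W.copy(); W2[0, 1] += h
f2 = np.sum(np.square(W2.dot(x)))
print((f2 - f) / h, dW[0, 1])   # the two values should be very close
```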
2. Introduction to Neural Networks
2.1 Introduction to Neural Network Algorithms
It is still possible to introduce neural network algorithms without resorting to the analogy of the brain.
In the linear classification section, we computed the scores for the different visual categories as \(Wx\), where \(W\) is a matrix and \(x\) is an input column vector containing all the pixel data of an image. For the CIFAR-10 dataset, \(x\) is a \([3072 \times 1]\) column vector and \(W\) is a \([10 \times 3072]\) matrix, so the output is a vector of 10 classification scores.
A two-layer neural network algorithm is different, its calculation formula is \(s = W_2 \max(0, W_1 x)\) .
The meaning of \(W_1\): it could, for example, be a \([100 \times 3072]\) matrix that transforms the image into a 100-dimensional intermediate vector. Some of those 100 dimensions might respond to particular templates, e.g. a horse facing left versus a horse facing right, each producing its own score.
The function \(\max(0, \cdot)\) is a nonlinearity applied to every element. There are many choices for this nonlinear function, which you will see later when we discuss activation functions; the one used here, ReLU, is the most common and simply clamps all values below \(0\) to \(0\).
The matrix \(W_2\) has size \([10 \times 100]\); it takes weighted sums of the intermediate vector, producing 10 numbers that are interpreted as classification scores.
Note: the nonlinearity is computationally crucial. Without it, the two matrices would collapse into a single matrix, and the classification scores would again be a linear function of the input. The nonlinearity is the key point of change.
The parameters \(W_1\) ,\(W_2\) will be learned by stochastic gradient descent, and their gradients will be calculated by the chain rule during backpropagation.
A three-layer neural network is, by analogy, \(s = W_3 \max(0, W_2 \max(0, W_1 x))\), where \(W_1\), \(W_2\), \(W_3\) are the parameters to be learned. The sizes of the intermediate hidden layers are hyperparameters of the network; we will see later how to set them. Let us now look at the same computation from the point of view of neurons and networks.
A reference implementation of a two-layer neural network, with a sigmoid in the middle layer, is as follows:
import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
# x is a 64x1000 matrix, y is a 64x10 matrix
x, y = randn(N, D_in), randn(N, D_out)
# w1 is a 1000x100 matrix, w2 is a 100x10 matrix
w1, w2 = randn(D_in, H), randn(H, D_out)
# Iterate 10000 times; the loss drops to around the 1e-4 level
for t in range(10000):
    h = 1 / (1 + np.exp(-x.dot(w1))) # hidden layer, sigmoid activation
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum() # squared L2 loss
    print(str(t) + ': ' + str(loss))
    # Backward pass
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h)) # sigmoid derivative: h * (1 - h)
    # Learning rate 1e-4
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
2.2 Comparing neural networks with real neurons
Neural network algorithms were originally inspired by biological nervous systems and are simplified simulations of them.
The basic computational unit of the brain is the neuron. The human nervous system contains roughly 86 billion neurons, connected by about \(10^{14}\)–\(10^{15}\) synapses. In the figure, the top shows a biological neuron and the bottom shows the common simplified mathematical model. Each neuron receives input signals from its dendrites and produces an output signal along its single axon. The axon branches at its ends and connects, through synapses, to the dendrites of other neurons.
In the computational model of a neuron, a signal traveling along an axon (say \(x_0\)) interacts multiplicatively with the dendrite of another neuron according to the strength of the synapse (say \(w_0\)), giving \(w_0 x_0\).
The idea is that synaptic strengths (the weights \(W\)) are learnable and control the strength of the influence of one neuron on another, as well as its direction: excitatory (positive weight) or inhibitory (negative weight).
The dendrites transmit signals to the cell body, where the signals add up. If the final sum is above a certain threshold, the neuron will " fire ", outputting a spike to its axon.
In the computational model, we assume that the exact timing of the spikes does not matter and that information is conveyed by the firing rate. Based on this rate-coding view, the firing rate of a neuron is modeled by an activation function \(f\), which represents the frequency of spikes along the axon.
For historical reasons, the activation function was often chosen to be the sigmoid \(\sigma\), which takes a real value (the summed signal strength) and squashes it into the range \(0 \sim 1\). We will look at various activation functions in detail later in this series.
The activation function \(f\) here uses the sigmoid function, and the code is as follows:
import numpy as np

class Neuron:
    # ...
    def neuron_tick(self, inputs):
        # Assume inputs and weights are 1xD vectors and the bias is a scalar
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        # When the sum is much larger than 0, the output approaches 1: the neuron "fires"
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))
        return firing_rate
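A hypothetical usage of this class (the original snippet omits initialization, so setting the weights and bias directly here is an assumption made for illustration; the numbers mirror the sigmoid example above):

```python
import numpy as np

neuron = Neuron()
neuron.weights = np.array([2.0, -3.0])   # illustrative values, same as the sigmoid example
neuron.bias = -3.0
print(neuron.neuron_tick(np.array([-1.0, -2.0])))  # ~0.73, a fairly active neuron
```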
2.3 Commonly used activation functions
3. Neural network structure
For more on neural network structure, you can also refer to the articles Neural Network Basics, Shallow Neural Networks, and Deep Neural Networks in ShowMeAI's Deep Learning Tutorial | Wu Enda Special Course · Interpretation of a full set of notes, which cover network structures of different depths.
For ordinary neural networks, the most common layer type is the fully-connected layer. Neurons in a fully-connected layer are pairwise connected to all neurons in the adjacent layers, but neurons within the same layer are not connected to each other. The network structure contains no cycles (a cycle would make forward propagation loop forever).
Below is an example of two neural networks, both using fully connected layers:
- Left: a 2-layer neural network with a hidden layer of 4 neurons (also called units), an output layer of 2 neurons, and an input layer of 3 neurons (referring to the input dimensionality, not the number of images).
- Right : A 3-layer neural network with two hidden layers with 4 neurons.
Note: when we speak of an \(N\)-layer neural network, we do not count the input layer. A single-layer network therefore has no hidden layer (the input maps directly to the output). The terms Artificial Neural Network (ANN) and Multi-Layer Perceptron (MLP) are also used for networks built from fully-connected layers. In addition, the neurons of the output layer usually have no activation function.
There are two main criteria for measuring the size of a neural network: one is the number of neurons , and the other is the number of parameters . Take the two networks shown above as an example:
- The first network has \(4+2=6\) neurons (not counting the input layer), \([3 \times 4] + [4 \times 2] = 20\) weights and \(4+2=6\) biases, for a total of \(26\) learnable parameters.
- The second network has \(4+4+1=9\) neurons, \([3 \times 4] + [4 \times 4] + [4 \times 1] = 32\) weights and \(4+4+1=9\) biases, for a total of \(41\) learnable parameters (a small sketch reproducing these counts follows this list).
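These counts can be reproduced with a few lines of code (an illustrative sketch added here):

```python
def count_params(layer_sizes):
    """Count weights and biases of a fully-connected network given its layer sizes."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

print(count_params([3, 4, 2]))      # 26 -- the 2-layer network on the left
print(count_params([3, 4, 4, 1]))   # 41 -- the 3-layer network on the right
```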
Modern convolutional neural networks can contain hundreds of millions of parameters and can be composed of dozens or hundreds of layers (this is deep learning).
3.1 Three-layer neural network code example
Organizing neural networks into repeated layer structures makes them simple and efficient to evaluate using matrix-vector operations. Let us return to the 3-layer network above, whose input is a \([3 \times 1]\) vector. All the connection weights of a layer can be stored in a single matrix.
For example, the weights of the first hidden layer \(W_1\) form a \([4 \times 3]\) matrix, and the biases of all its units are stored in \(b_1\), of size \([4 \times 1]\). Each neuron's weights occupy one row of \(W_1\), so the matrix-vector product np.dot(W1, x) + b1 computes the activation-function inputs of all neurons in the layer at once. Similarly, \(W_2\) is a \([4 \times 4]\) matrix holding the connections of the second hidden layer, and \(W_3\) is a \([1 \times 4]\) matrix for the output layer.
The forward pass of a complete 3-layer neural network is simply 3 matrix multiplications interwoven with the application of activation functions.
import numpy as np

# Forward pass of a 3-layer neural network
# Activation function
f = lambda x: 1.0/(1.0 + np.exp(-x))
# Random 3x1 input vector
x = np.random.randn(3, 1)
# Set up weights and biases
W1, W2, W3 = np.random.randn(4, 3), np.random.randn(4, 4), np.random.randn(1, 4)
b1, b2 = np.random.randn(4, 1), np.random.randn(4, 1)
b3 = 1
# First hidden layer activations, 4x1
h1 = f(np.dot(W1, x) + b1)
# Second hidden layer activations, 4x1
h2 = f(np.dot(W2, h1) + b2)
# The output is a single number
out = np.dot(W3, h2) + b3
In the code above, \(W_1\), \(W_2\), \(W_3\), \(b_1\), \(b_2\), \(b_3\) are the learnable parameters of the network. Note also that instead of a single column vector, \(x\) could hold an entire batch of training data (each input sample being one column of \(x\)), in which case all samples are evaluated efficiently in parallel.
Note that the last layer of a neural network usually has no activation function (for example, in a classification task it gives a real-valued classification score).
The forward propagation of the fully connected layer is generally to perform a matrix multiplication first, then add the bias and apply the activation function.
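The same three lines of forward propagation also work unchanged when \(x\) holds a whole batch of samples as columns; a minimal sketch (added here, with illustrative shapes and a batch size of 64):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))     # sigmoid activation
W1, W2, W3 = np.random.randn(4, 3), np.random.randn(4, 4), np.random.randn(1, 4)
b1, b2, b3 = np.random.randn(4, 1), np.random.randn(4, 1), np.random.randn(1, 1)

X = np.random.randn(3, 64)                 # a batch of 64 samples, one per column
h1 = f(np.dot(W1, X) + b1)                 # (4, 64); biases broadcast across columns
h2 = f(np.dot(W2, h1) + b2)                # (4, 64)
out = np.dot(W3, h2) + b3                  # (1, 64): one output per sample
```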
3.2 Understanding Neural Networks
For more on deep neural networks, you can also refer to the "Other advantages of deep networks" part of the Deep Neural Networks article in ShowMeAI's Deep Learning Tutorial | Wu Enda Special Course · Interpretation of a full set of notes.
One understanding of neural networks with fully connected layers is:
- They define a family of functions parameterized by the network weights.
A neural network with at least one hidden layer is a universal approximator: it can approximate any continuous function.
Although in mathematical theory a 2-layer network can already approximate any continuous function, this guarantee is fairly weak in practice. Even though deep networks (with several hidden layers) and shallow ones have the same expressive power in theory, in practice deeper networks tend to work better.
For fully connected neural networks, in practice a 3-layer neural network will perform better than a 2-layer one, however going deeper (4, 5, 6 layers) rarely helps much. The situation is different with convolutional neural networks. In convolutional neural networks, depth is a very important factor for a good recognition system (for example, today's good CNNs have dozens or hundreds of layers). One explanation for this phenomenon is that because images have a hierarchical structure (for example, faces are composed of eyes, etc., and eyes are composed of edges), multi-layer processing makes intuitive sense for this kind of data.
4. Expand your learning
You can visit Bilibili to view the [bilingual subtitles] version of the videos
- 【Course Study Guide】Stanford CS231n | Deep Learning and Computer Vision
- [Subtitle + Data Download] Stanford CS231n | Deep Learning and Computer Vision (2017 · All 16 lectures)
- [CS231n Advanced Course] Michigan EECS498 | Deep Learning and Computer Vision
- [Deep Learning Course] Wu Enda Special Course · Interpretation of a full set of notes
- 【Stanford Official Website】CS231n: Deep Learning for Computer Vision
5. Summary of key points
- Forward Propagation and Back Propagation
- Scalar and vectorized form calculations
- Applying the chain rule
- Neural network structure
- Activation functions
- Understanding Neural Networks
Stanford CS231n full set of interpretation
- Deep Learning and CV Tutorial (1) | CV Introduction and Basics
- Deep Learning and CV Tutorial (2) | Image Classification and Machine Learning Basics
- Deep Learning and CV Tutorial (3) | Loss Function and Optimization
- Deep Learning and CV Tutorial (4) | Neural Network and Backpropagation
- Deep Learning and CV Tutorial (5) | Convolutional Neural Network
- Deep Learning and CV Tutorial (6) | Neural Network Training Skills (Part 1)
- Deep Learning and CV Tutorial (7) | Neural Network Training Skills (Part 2)
- Deep Learning and CV Tutorial (8) | Introduction to Common Deep Learning Frameworks
- Deep Learning and CV Tutorial (9) | Typical CNN Architectures (AlexNet, VGG, GoogLeNet, ResNet, etc.)
- Deep Learning and CV Tutorial (10) | Lightweight CNN Architecture (SqueezeNet, ShuffleNet, MobileNet, etc.)
- Deep Learning and CV Tutorial (11) | Recurrent Neural Network and Vision Applications
- Deep Learning and CV Tutorial (12) | Object Detection (Two Stages, R-CNN Series)
- Deep Learning and CV Tutorial (13) | Target Detection (SSD, YOLO series)
- Deep Learning and CV Tutorial (14) | Image Segmentation (FCN, SegNet, U-Net, PSPNet, DeepLab, RefineNet)
- Deep Learning and CV Tutorial (15) | Visual Model Visualization and Interpretability
- Deep Learning and CV Tutorial (16) | Generative Models (PixelRNN, PixelCNN, VAE, GAN)
- Deep Learning and CV Tutorial (17) | Deep Reinforcement Learning (Markov Decision Process, Q-Learning, DQN)
- Deep Learning and CV Tutorial (18) | Deep Reinforcement Learning (Gradient Policy, Actor-Critic, DDPG, A3C)
Featured Recommendations in ShowMeAI Series Tutorials
- Dachang Technology Realization Program Series
- Graphical Python Programming: From Beginner to Mastery series of tutorials
- Graphical Data Analysis: From Beginner to Mastery Tutorial Series
- Graphical AI Mathematical Fundamentals: From Beginner to Mastery Series Tutorials
- Illustrated Big Data Technologies: From Beginner to Mastery Series
- Illustrated Machine Learning Algorithms: From Beginner to Mastery Tutorial Series
- Machine learning combat: teach you how to play machine learning series
- Deep Learning Tutorial: Wu Enda Special Course · Interpretation of a full set of notes
- Natural Language Processing Course: Stanford CS224n Course · Course Learning and Full Note Interpretation
- Deep Learning and Computer Vision Tutorial: Stanford CS231n · Interpretation of a full set of notes