

Course 1: Neural Networks and Deep Learning, Week 2: Programming Basics of Neural Networks

This series is a study and summary of Andrew Ng's "Deep Learning Specialization". The corresponding course videos can be viewed here.

Introduction

In the previous ShowMeAI article, Introduction to Deep Learning, we gave a brief introduction to deep learning:

  • Taking housing price prediction as an example, we explained the neural network model structure and the related basics.
  • We introduced several typical classes of neural networks for supervised learning: Standard NN, CNN and RNN.
  • We introduced the two different types of data, "structured data" and "unstructured data".
  • We analyzed why deep learning has become so popular in recent years and why it outperforms traditional machine learning (Data, Computation and Algorithms).

In this section, we expand on the basics of neural networks with logistic regression: by analyzing the structure of the logistic regression model, we will transition to the neural network models that follow. (For the logistic regression model, you can also read the ShowMeAI article Graphical Machine Learning | Logistic Regression Algorithm Detailed Explanation.)

1. Algorithm Basics and Logistic Regression

Logistic regression is an algorithm for binary classification.

1.1 Binary classification problems and machine learning basics

[Figure: Binary Classification]

Binary classification means that the output \(y\) takes only two discrete values, {0,1} (sometimes {-1,1}). Take an "image recognition" problem as an example: determining whether a picture shows a cat is a typical binary classification problem, with 0 for "not cat" and 1 for "cat". (For machine learning basics, you can also check the ShowMeAI article Illustrated Machine Learning | Machine Learning Basics.)

[Figure: Algorithm basics and logistic regression]

From a machine learning perspective, the input \(x\) here is an image; a color image contains three RGB channels, so the image has shape \((64,64,3)\).

[Figure: Data and vectorization format]

The input of some neural networks is one-dimensional. We can flatten the image \(x\) (of shape \((64,64,3)\)) into a one-dimensional feature vector, whose dimension is \(64 \times 64 \times 3 = 12288\), i.e. \((12288,1)\). We generally represent each sample as a column vector and denote its dimension by \(n_x\).

If the training set contains \(m\) pictures, we store them in a matrix, and the data dimension becomes \((n_x,m)\).

[Figure: Data and vectorization format]

  • The number of rows of the matrix \(X\) is \(n_x\), the number of features of each sample \(x^{(i)}\)
  • The number of columns of the matrix \(X\) is \(m\), the number of samples

Similarly, we arrange the labels of the training samples into one row, so that the label matrix \(Y\) has dimension \((1,m)\).
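As a quick NumPy sketch of these shapes (the images and labels below are random placeholders, just to illustrate the layout):

import numpy as np

# Hypothetical batch: m = 100 RGB images of size 64x64
m = 100
images = np.random.rand(m, 64, 64, 3)   # shape (m, 64, 64, 3)
labels = np.random.randint(0, 2, m)     # 0 = "not cat", 1 = "cat"

# Flatten each image into a column vector and stack the columns side by side
X = images.reshape(m, -1).T             # shape (n_x, m) = (12288, 100)
Y = labels.reshape(1, m)                # shape (1, m)

print(X.shape, Y.shape)                 # (12288, 100) (1, 100)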

1.2 Logistic regression algorithm

[Figure: Logistic Regression]

Logistic regression is the most common binary classification algorithm (for a detailed explanation, you can also read the ShowMeAI article Graphical Machine Learning | Logistic Regression Algorithm Detailed Explanation). It involves the following quantities:

  • Input feature vector: \(x \in R^{n_x}\), where \({n_x}\) is the number of features
  • Training label: \(y \in \{0,1\}\)
  • Weights: \(w \in R^{n_x}\)
  • Bias: \(b \in R\)
  • Output: \(\hat{y} = \sigma(w^Tx+b)\)

The output is computed with the sigmoid function, a nonlinear function whose output lies in \((0,1)\); it is commonly used as an activation function in neural networks.

[Figure: Image classification with logistic regression]

The expression of the sigmoid function is as follows:

$$ s = \sigma(w^Tx+b) = \sigma(z) = \frac{1}{1+e^{-z}} $$

In fact, logistic regression can be thought of as a very small neural network.
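A minimal NumPy sketch of this forward computation (the zero-initialized weights and the random sample x below are made-up values for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x = 12288
w = np.zeros((n_x, 1))                # weights, shape (n_x, 1)
b = 0.0                               # bias
x = np.random.rand(n_x, 1)            # one sample as a column vector

y_hat = sigmoid(np.dot(w.T, x) + b)   # predicted probability of "cat"
print(y_hat.item())                   # 0.5 with zero weights and zero bias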

1.3 Loss function for logistic regression

[Figure: Logistic Regression Cost Function]

In machine learning, the loss function is used to quantify the difference between predicted results and true values. By optimizing the loss function, we continually adjust the model weights so that the model best fits the sample data.

In regression problems, we often use the mean squared error (MSE) loss:

$$ L(\hat{y},y) = \frac{1}{2}(\hat{y}-y)^2 $$

[Figure: Loss function of logistic regression]

But in logistic regression we tend not to use such a loss function: with squared-error loss, the resulting optimization objective is non-convex and has many local optima, so gradient descent may fail to find the global optimum, which makes optimization difficult.

So instead we use the log loss (binary cross-entropy loss):

$$ L(\hat{y},y) = -\left(y\log\hat{y}+(1-y)\log(1-\hat{y})\right) $$

[Figure: Loss function of logistic regression]

What we have just given is the loss function defined on a single training sample, which measures performance on that sample. We define the cost function (Cost Function) as the performance over all training samples, i.e. the average of the loss function over the \(m\) samples; it reflects how close, on average, the predicted outputs are to the true outputs \(y\).

The formula for calculating the cost function is as follows:

$$ J(w,b) = \frac{1}{m}\sum_{i=1}^mL(\hat{y}^{(i)},y^{(i)}) $$
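A minimal NumPy sketch of this computation (the arrays A and Y below are made-up predictions and labels, not from the course):

import numpy as np

def cost(A, Y):
    """Binary cross-entropy cost averaged over the m samples."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

A = np.array([[0.9, 0.2, 0.7]])   # predicted probabilities, shape (1, m)
Y = np.array([[1,   0,   1  ]])   # true labels, shape (1, m)
print(cost(A, Y))                 # about 0.228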

2. Gradient Descent

[Figure: Gradient Descent]

We have just learned the definitions of the loss function and the cost function. The next step is to find the values of \(w\) and \(b\) that minimize the cost function over the \(m\) training samples. The method used here is called the gradient descent algorithm.

Mathematically, the gradient of a function points in the direction of steepest ascent, i.e. the function increases fastest along the gradient direction; moving in the opposite (negative gradient) direction therefore decreases the function value fastest.

(For more mathematical background on optimization, you can read the ShowMeAI article Graphical AI Mathematical Fundamentals | Calculus and Optimization.)

The training goal of the model is to find suitable \(w\) and \(b\) that minimize the cost function value. If we first assume that \(w\) and \(b\) are both one-dimensional real numbers, the cost function \(J\) as a function of \(w\) and \(b\) looks as follows:

[Figure: Gradient descent]

The cost function \(J\) in the above figure is a convex function with only one global minimum, which ensures that no matter how we initialize the model parameters (any position on the surface), we can find a suitable optimal solution.

Based on the gradient descent algorithm, we obtain the following update formula for the parameter \(w\):

$$ w := w - \alpha\frac{\partial J(w, b)}{\partial w} $$

In the formula, \(\alpha\) is the learning rate, i.e. the step size of each update of \(w\).

The corresponding update formula for the parameter \(b\) of the cost function \(J(w, b)\) is:

$$ b := b - \alpha\frac{\partial J(w, b)}{\partial b} $$
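The update rule itself is only a couple of lines of code. Here is a minimal sketch on the toy convex function \(J(w)=(w-3)^2\) (an example of our own, not from the course):

# Gradient descent on J(w) = (w - 3)^2, whose gradient is dJ/dw = 2(w - 3)
w = 0.0
alpha = 0.1
for _ in range(100):
    dw = 2 * (w - 3)
    w = w - alpha * dw
print(w)   # converges close to the minimizer w = 3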

3. Computation Graph

[Figure: Computation Graph]

For neural networks, the training process consists of two stages: Forward Propagation and Back Propagation.

  • Forward propagation goes from input to output: the network's predicted output is obtained by forward computation.
  • Backpropagation goes from output back to input: the gradients of the cost function with respect to the parameters \(w\) and \(b\) are computed.

Below, we use an example to understand these two stages in the form of a computation graph.

3.1 Forward Propagation

If our Cost Function is \(J(a,b,c)=3(a+bc)\), it contains three variables \(a\), \(b\), \(c\).

We add intermediate variables: let \(u = bc\) and \(v = a+u\); then \(J=3v\).

The whole process can be represented by a computational graph:

[Figure: Computation graph]

In the figure above, let \(a=5\), \(b=3\), \(c=2\); then \(u=bc=6\), \(v=a+u=11\), and \(J=3v=33\).

In the computation graph, this left-to-right, input-to-output process corresponds to the forward computation in which a neural network computes the cost function from \(x\) and \(w\).
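This forward pass can be mirrored directly in a few lines of Python (a minimal sketch of the example above):

a, b, c = 5, 3, 2
u = b * c        # u = 6
v = a + u        # v = 11
J = 3 * v        # J = 33
print(u, v, J)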

3.2 Back Propagation

[Figure: Derivatives with a Computation Graph]

Let's follow the calculation diagram in the previous example to explain backpropagation. Our input parameters are \(a\), \(b\), \(c\).

① Compute the partial derivative of \(J\) with respect to the parameter \(a\)

[Figure: Computation graph]

Reading the computation graph from right to left: \(J\) is a function of \(v\), and \(v\) is a function of \(a\). By the chain rule we get:

$$ \frac{\partial J}{\partial a}=\frac{\partial J}{\partial v}\cdot \frac{\partial v}{\partial a}=3\cdot 1=3 $$

② Compute the partial derivative of \(J\) with respect to the parameter \(b\)

[Figure: Computation graph]

From right to left: \(J\) is a function of \(v\), \(v\) is a function of \(u\), and \(u\) is a function of \(b\). Similarly, we get:

$$ \frac{\partial J}{\partial b}=\frac{\partial J}{\partial v}\cdot \frac{\partial v}{\partial u}\cdot \frac{\partial u}{\partial b}=3\cdot 1\cdot c=3\cdot 1\cdot 2=6 $$

③ Compute the partial derivative of \(J\) with respect to the parameter \(c\)

[Figure: Computation graph]

From right to left: \(J\) is a function of \(v\), \(v\) is a function of \(u\), and \(u\) is a function of \(c\). We get:

$$ \frac{\partial J}{\partial c}=\frac{\partial J}{\partial v}\cdot \frac{\partial v}{\partial u}\cdot \frac{\partial u}{\partial c}=3\cdot 1\cdot b=3\cdot 1\cdot 3=9 $$

This completes the right-to-left backpropagation and gradient (partial derivative) computation.
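Continuing the same example in Python, the backward pass applies the chain rule node by node, from right to left (a minimal sketch; the variable names dJ_da etc. are our own):

a, b, c = 5, 3, 2
u = b * c
v = a + u
J = 3 * v

# Backward pass (chain rule), from the output back to the inputs
dJ_dv = 3            # J = 3v
dJ_da = dJ_dv * 1    # v = a + u  ->  dv/da = 1
dJ_du = dJ_dv * 1    # v = a + u  ->  dv/du = 1
dJ_db = dJ_du * c    # u = b*c    ->  du/db = c
dJ_dc = dJ_du * b    # u = b*c    ->  du/dc = b
print(dJ_da, dJ_db, dJ_dc)   # 3 6 9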

4. Gradient descent in logistic regression

[Figure: Logistic Regression Gradient Descent]

Going back to the logistic regression problem mentioned earlier, assume the input feature vector has dimension 2 (i.e. \([x_1, x_2]\)); with the corresponding weights \(w_1\), \(w_2\) and bias \(b\), we get the following computation graph:

[Figure: Gradient descent for logistic regression]

Backpropagation to compute the gradients:

① Compute the derivative of \(L\) with respect to \(a\)

[Figure: Gradient descent for logistic regression]

② Compute the derivative of \(L\) with respect to \(z\)

[Figure: Gradient descent for logistic regression]

③ Continue propagating backward to the parameters

[Figure: Gradient descent for logistic regression]

Based on gradient descent, the parameter update formulas can be obtained:

[Figure: Gradient descent for logistic regression]
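Written out (with \(a=\sigma(z)\) and the cross-entropy loss \(L\) defined earlier), the quantities illustrated in the figures above follow from the chain rule:

$$ da = \frac{\partial L}{\partial a} = -\frac{y}{a}+\frac{1-y}{1-a}, \qquad dz = \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z} = a-y $$

$$ dw_1 = x_1\,dz, \qquad dw_2 = x_2\,dz, \qquad db = dz $$

$$ w_1 := w_1-\alpha\,dw_1, \qquad w_2 := w_2-\alpha\,dw_2, \qquad b := b-\alpha\,db $$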

[Figure: Gradient Descent on m Examples]

[Figure: Gradient descent for logistic regression]

What we described above is the process of taking partial derivatives and applying gradient descent for a single sample. For a dataset with \(m\) samples, the computation of the cost function \(J(w,b)\), the activations \(a^{(i)}\) and the weight parameter \(w_1\) is shown in the figure above.

The complete procedure for one training iteration of logistic regression is as follows (here we still assume the feature vector has dimension 2):

J = 0; dw1 = 0; dw2 = 0; db = 0;
for i = 1 to m
    z(i) = w1*x1(i) + w2*x2(i) + b;
    a(i) = sigmoid(z(i));
    J += -[y(i)*log(a(i)) + (1-y(i))*log(1-a(i))];
    dz(i) = a(i) - y(i);
    dw1 += x1(i)*dz(i);
    dw2 += x2(i)*dz(i);
    db += dz(i);
J /= m;
dw1 /= m;
dw2 /= m;
db /= m;

Then apply the gradient descent updates to \(w_1\), \(w_2\) and \(b\), and repeat the iteration.

The above computation has a drawback: the whole process contains two for loops, in which:

  • The first for loop iterates over \(m\) samples
  • The second for loop iterates over all features

Explicitly writing for loops in the code makes the algorithm inefficient when the number of features is large. Vectorization can be used to eliminate these explicit for loops.

5. Vectorization

[Figure: Vectorization]

Continuing with logistic regression as the example, if \(z=w^Tx+b\) is computed with a non-vectorized loop, the code looks like this:

z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b

Based on vectorized operations, the computation can run in parallel, which greatly improves efficiency, and the code is more concise:
(The numpy library in Python is used here. To learn more, you can check the numpy tutorial in ShowMeAI's graphic data analysis series, or get up to speed quickly with the numpy cheat sheet produced by ShowMeAI.)

 z = np.dot(w, x) + b
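As a quick check of the speed difference, both versions can be timed (a rough sketch; the exact numbers depend on the machine):

import time
import numpy as np

n_x = 1_000_000
w = np.random.rand(n_x)
x = np.random.rand(n_x)
b = 0.0

start = time.time()
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b
loop_time = time.time() - start

start = time.time()
z_vec = np.dot(w, x) + b
vec_time = time.time() - start

print(loop_time, vec_time)   # the vectorized version is typically orders of magnitude faster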

Without explicit for loops, one iteration of gradient descent for logistic regression can be written in vectorized form as follows:

$$ Z=w^TX+b $$

$$ A=\sigma(Z) $$

$$ dZ=A-Y $$

$$ dw=\frac{1}{m}XdZ^T $$

$$ db=\frac{1}{m}\sum_{i=1}^{m}dZ^{(i)} $$

$$ w:=w-\alpha\,dw $$

$$ b:=b-\alpha\,db $$
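A runnable NumPy sketch of these vectorized iterations (the toy data, the learning rate alpha and the iteration count are made up for illustration):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

# Toy data: m = 5 samples with n_x = 2 features each
X = np.random.rand(2, 5)             # shape (n_x, m)
Y = np.random.randint(0, 2, (1, 5))  # shape (1, m)
w = np.zeros((2, 1))
b = 0.0
alpha = 0.01
m = X.shape[1]

for _ in range(1000):
    Z = np.dot(w.T, X) + b           # shape (1, m)
    A = sigmoid(Z)
    dZ = A - Y
    dw = np.dot(X, dZ.T) / m         # shape (n_x, 1)
    db = np.sum(dZ) / m
    w = w - alpha * dw
    b = b - alpha * db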
