- Author: Han Xinzi @ShowMeAI
- Tutorial address: https://www.showmeai.tech/tutorials/35
- Address of this article: https://www.showmeai.tech/article-detail/213
- Disclaimer: All rights reserved. For reprinting, please contact the platform and the author, and indicate the source.
- Bookmark ShowMeAI for more exciting content
This series is a study summary of Andrew Ng (Wu Enda)'s "Deep Learning Specialization". The corresponding course videos can be viewed here.
Introduction
In the previous ShowMeAI article, Introduction to Deep Learning, we gave a brief introduction to deep learning:
- Taking housing price prediction as an example, we explained the neural network model structure and related basics.
- Introduced several typical classes of neural networks for supervised learning: Standard NN, CNN, and RNN.
- Introduced two different types of data: "structured data" and "unstructured data".
- Analyzed why deep learning has become popular in recent years and why its performance surpasses traditional machine learning (data, computation, and algorithms).
In this section, we expand on the basics of neural networks: logistic regression. By analyzing the structure of the logistic regression model, we will transition to subsequent neural network models. (For the logistic regression model, you can also read ShowMeAI's article Graphical Machine Learning | Logistic Regression Algorithm Explained.)
1. Algorithm basis and logistic regression
Logistic regression is an algorithm for binary classification.
1.1 Binary classification problem and machine learning foundation
Binary classification means that the output \(y\) takes only two discrete values, \(\{0,1\}\) (sometimes \(\{-1,1\}\)). Let's take an "image recognition" problem as an example: determine whether a picture contains a cat. Identifying whether it is a "cat" is a typical binary classification problem: 0 for "not cat" and 1 for "cat". (For machine learning basics, you can also check the ShowMeAI article Illustrated Machine Learning | Machine Learning Basics.)
From the perspective of machine learning, our input \(x\) is an image; a color image contains three RGB channels, and the image size is \((64,64,3)\).
The input of some neural networks is one-dimensional. We can flatten the image \(x\) (of dimension \((64,64,3)\)) into a one-dimensional feature vector, whose resulting dimension is \((12288,1)\). We generally use column vectors to represent samples and denote the feature dimension as \(n_x\).
If the training set has \(m\) pictures, we store the data in a matrix whose dimension is \((n_x,m)\).
- The number of rows of the matrix \(X\) is \(n_x\), the number of features of each sample \(x^{(i)}\).
- The number of columns of the matrix \(X\) is \(m\), the number of samples.
We also arrange the labels \(Y\) of the training samples into one row, so the dimension of \(Y\) is \((1,m)\).
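As a minimal sketch of this data layout (the shapes follow the \((64,64,3)\) example above; the variable names and the value of \(m\) are illustrative), the flattening and stacking can be done with numpy:

```python
import numpy as np

m = 10                                        # number of training images (illustrative)
images = np.random.rand(m, 64, 64, 3)         # m color images of size 64x64x3
labels = np.random.randint(0, 2, size=m)      # 0 = "not cat", 1 = "cat"

# Flatten each image into a column vector of dimension n_x = 64*64*3 = 12288,
# then stack the m column vectors into a matrix X of shape (n_x, m).
X = images.reshape(m, -1).T                   # shape (12288, m)
Y = labels.reshape(1, m)                      # shape (1, m)
print(X.shape, Y.shape)                       # (12288, 10) (1, 10)
```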
1.2 Logistic regression algorithm
Logistic regression is the most common binary classification algorithm (for a detailed algorithm explanation, you can also read the ShowMeAI article Graphical Machine Learning | Logistic Regression Algorithm Detailed Explanation ), which contains the following parameters:
- Input feature vector: \(x \in R^{n_x}\), where \({n_x}\) is the number of features
- Labels for training: \(y \in \{0,1\}\)
- Weight: \(w \in R^{n_x}\)
- Bias: \(b \in R\)
- Output: \(\hat{y} = \sigma(w^Tx+b)\)
The output is computed with the sigmoid function, a nonlinear function whose output is limited to \([0,1]\) and which is often used as an activation function in neural networks.
The expression of the sigmoid function is as follows:
$$ s = \sigma(w^Tx+b) = \sigma(z) = \frac{1}{1+e^{-z}} $$
In fact, logistic regression can be thought of as a very small neural network.
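As a minimal sketch of this "tiny neural network" view (variable names and values are illustrative), the forward computation \(\hat{y} = \sigma(w^Tx+b)\) can be written with numpy as:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

n_x = 12288                          # feature dimension from the cat example above
w = np.zeros((n_x, 1))               # weights, shape (n_x, 1)
b = 0.0                              # bias
x = np.random.rand(n_x, 1)           # one input sample as a column vector

y_hat = sigmoid(np.dot(w.T, x) + b)  # predicted probability that the image is a cat
print(y_hat.item())                  # 0.5 with zero-initialized parameters
```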
1.3 Loss function for logistic regression
In machine learning, the loss function quantifies the difference between the predicted result and the true value. We continually adjust the model weights by optimizing the loss function so that the model best fits the sample data.
In regression problems, we usually use the mean squared error (MSE) loss:
$$ L(\hat{y},y) = \frac{1}{2}(\hat{y}-y)^2 $$
But in logistic regression, we tend not to use such a loss function: with squared-error loss, the resulting optimization objective is non-convex and has many local optima, so gradient descent may fail to find the global optimum, making optimization difficult.
So instead we use the log loss (binary cross-entropy loss):
$$ L(\hat{y},y) = -\left(y\log\hat{y}+(1-y)\log(1-\hat{y})\right) $$
The loss function just given is defined on a single training sample and measures performance on that sample. We define the cost function (Cost Function) as the performance on all training samples, i.e., the average of the loss function over the \(m\) samples; it reflects how close, on average, the predicted outputs of the \(m\) samples are to the true outputs \(y\).
The formula for calculating the cost function is as follows:
$$ J(w,b) = \frac{1}{m}\sum_{i=1}^mL(\hat{y}^{(i)},y^{(i)}) $$
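A minimal numpy sketch of this cost computation (assuming A holds the predicted probabilities \(\hat{y}^{(i)}\) and Y the labels, both of shape \((1,m)\); the names and sample values are illustrative):

```python
import numpy as np

def cost(A, Y):
    """Binary cross-entropy cost averaged over the m samples.

    A: predicted probabilities, shape (1, m)
    Y: true labels in {0, 1}, shape (1, m)
    """
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return np.sum(losses) / m

A = np.array([[0.9, 0.2, 0.7]])
Y = np.array([[1, 0, 1]])
print(cost(A, Y))   # small value: the predictions agree well with the labels
```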
2. Gradient Descent
We have just learned the definitions of the loss function and the cost function. The next step is to find the values of \(w\) and \(b\) that minimize the cost function over the \(m\) training samples. The method used here is called the gradient descent algorithm.
Mathematically, the gradient of a function points in the direction of its steepest ascent; that is, the function increases fastest along the gradient direction. Moving in the negative gradient direction therefore decreases the function value fastest.
(For more detailed optimization mathematical knowledge, you can read the ShowMeAI article Graphical AI Mathematical Fundamentals | Calculus and Optimization )
The training goal of the model is to find suitable \(w\) and \(b\) that minimize the cost function value. Let us first assume that \(w\) and \(b\) are both one-dimensional real numbers; the graph of the cost function \(J\) with respect to \(w\) and \(b\) is then as follows:
The cost function \(J\) in the above figure is a convex function with only one global minimum, which ensures that no matter how we initialize the model parameters (any position on the surface), we can find a suitable optimal solution.
Based on the gradient descent algorithm, the update formula of the following parameters \(w\) is obtained:
$$ w := w - \alpha\frac{dJ(w, b)}{dw} $$
In the formula, \(\alpha\) is the learning rate, i.e., the step size of each update of \(w\).
The corresponding update formula for the parameter \(b\) in the cost function \(J(w, b)\) is:
$$ b := b - \alpha\frac{dJ(w, b)}{db} $$
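A minimal sketch of these update rules in code (the gradients are placeholders computed elsewhere; the toy objective in the usage example is illustrative, not from the original text):

```python
# One generic gradient-descent step for scalar parameters w and b.
# grad_w and grad_b stand for dJ/dw and dJ/db computed from the training data.
def gradient_descent_step(w, b, grad_w, grad_b, alpha=0.01):
    w = w - alpha * grad_w
    b = b - alpha * grad_b
    return w, b

# Toy usage: minimize J(w) = (w - 3)^2 (b unused), whose gradient is dJ/dw = 2*(w - 3).
w, b = 0.0, 0.0
for _ in range(100):
    w, b = gradient_descent_step(w, b, grad_w=2 * (w - 3), grad_b=0.0, alpha=0.1)
print(w)   # close to the minimizer w = 3
```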
3. Computation Graph
For neural networks, the training process consists of two stages: Forward Propagation and Back Propagation.
- Forward propagation is the process from input to output; the predicted output is obtained by the forward computation of the neural network.
- Backpropagation is the process, from output back to input, of computing the gradients of the cost function with respect to the parameters \(w\) and \(b\).
Below, we use an example to understand these two stages in the form of a computation graph.
3.1 Forward Propagation
If our Cost Function is \(J(a,b,c)=3(a+bc)\), it contains three variables \(a\), \(b\), \(c\).
We add some intermediate variables, using \(u\) to represent \(bc\), \(v\) to represent \(a+u\), then \(J=3v\).
The whole process can be represented by a computational graph:
In the above figure, we let \(a=5\), \(b=3\), \(c=2\); then \(u=bc=6\), \(v=a+u=11\), \(J=3v=33\).
In the computation graph, this left-to-right, input-to-output process corresponds to the forward computation in which the neural network calculates the cost function from \(x\) and \(w\).
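A minimal sketch of this forward pass in code (the intermediate variables mirror \(u\), \(v\), \(J\) above):

```python
a, b, c = 5, 3, 2

# Forward pass through the computation graph, left to right.
u = b * c      # u = 6
v = a + u      # v = 11
J = 3 * v      # J = 33
print(u, v, J)
```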
3.2 Back Propagation
Let's use the computation graph of the previous example to explain backpropagation. Our input parameters are \(a\), \(b\), \(c\).
① First, compute the partial derivative of \(J\) with respect to \(a\)
From the computation graph, going from right to left, \(J\) is a function of \(v\) and \(v\) is a function of \(a\). Using the chain rule, we get:
$$ \frac{\partial J}{\partial a}=\frac{\partial J}{\partial v}\cdot \frac{\partial v}{\partial a}=3\cdot 1=3 $$
② Compute the partial derivative of \(J\) with respect to \(b\)
From the computation graph, going from right to left, \(J\) is a function of \(v\), \(v\) is a function of \(u\), and \(u\) is a function of \(b\). Similarly, we get:
$$ \frac{\partial J}{\partial b}=\frac{\partial J}{\partial v}\cdot \frac{\partial v}{\partial u}\cdot \frac{\partial u}{\partial b}=3\cdot 1\cdot c=3\cdot 1\cdot 2=6 $$
③ Compute the partial derivative of \(J\) with respect to \(c\)
From right to left, \(J\) is a function of \(v\), \(v\) is a function of \(u\), and \(u\) is a function of \(c\). We get:
$$ \frac{\partial J}{\partial c}=\frac{\partial J}{\partial v}\cdot \frac{\partial v}{\partial u}\cdot \frac{\partial u}{\partial c}=3\cdot 1\cdot b=3\cdot 1\cdot 3=9 $$
This completes the right-to-left backpropagation and gradient (partial derivative) computation.
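A minimal sketch of this backward pass in code, applying the chain rule from right to left (variable names like dJ_db are illustrative):

```python
a, b, c = 5, 3, 2

# Backward pass: propagate derivatives of J = 3*(a + b*c) back to a, b, c.
dJ_dv = 3             # J = 3v
dv_da, dv_du = 1, 1   # v = a + u
du_db, du_dc = c, b   # u = b * c

dJ_da = dJ_dv * dv_da           # 3
dJ_db = dJ_dv * dv_du * du_db   # 3 * 1 * 2 = 6
dJ_dc = dJ_dv * dv_du * du_dc   # 3 * 1 * 3 = 9
print(dJ_da, dJ_db, dJ_dc)      # matches the values derived above
```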
4. Gradient descent in logistic regression
Going back to the logistic regression problem mentioned earlier, assume that the input feature vector has dimension 2 (i.e., \([x_1, x_2]\)), with corresponding weight parameters \(w_1\), \(w_2\) and bias \(b\), which gives the following computation graph:
Backpropagation computes the gradients as follows (a code sketch follows these steps):
① Find the derivative of \(L\) with respect to \(a\)
$$ da = \frac{\partial L}{\partial a} = -\frac{y}{a}+\frac{1-y}{1-a} $$
② Find the derivative of \(L\) with respect to \(z\) (using \(\frac{\partial a}{\partial z}=a(1-a)\) for the sigmoid)
$$ dz = \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\cdot \frac{\partial a}{\partial z} = a-y $$
③ Continue pushing the computation back to the parameters
$$ dw_1 = x_1\,dz, \quad dw_2 = x_2\,dz, \quad db = dz $$
④ Based on gradient descent, the parameter update formulas can be obtained
$$ w_1 := w_1 - \alpha\,dw_1, \quad w_2 := w_2 - \alpha\,dw_2, \quad b := b - \alpha\,db $$
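A minimal sketch of these single-sample computations (feature dimension 2, as above; the sample values and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training sample with two features and its label.
x1, x2, y = 1.0, 2.0, 1
w1, w2, b = 0.1, -0.2, 0.0
alpha = 0.01   # learning rate

# Forward pass.
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# Backward pass (the formulas derived above).
dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Gradient-descent update.
w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db
```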
What we described above is the process of computing partial derivatives and applying gradient descent on a single sample. For a dataset with \(m\) samples, the computation of the cost function \(J(w,b)\), of \(a^{(i)}\), and of the weight parameter \(w_1\) is shown in the figure below.
The process of one training iteration of the complete logistic regression (assuming the feature dimension is 2) is as follows:
J = 0; dw1 = 0; dw2 = 0; db = 0
for i in range(m):                                            # loop over the m training samples
    z[i] = w1 * x1[i] + w2 * x2[i] + b                        # linear combination
    a[i] = sigmoid(z[i])                                      # sigmoid activation
    J += -(y[i] * log(a[i]) + (1 - y[i]) * log(1 - a[i]))     # accumulate the loss
    dz[i] = a[i] - y[i]                                       # dL/dz for sample i
    dw1 += x1[i] * dz[i]                                      # accumulate gradients over samples
    dw2 += x2[i] * dz[i]
    db += dz[i]
J /= m                                                        # average over the m samples
dw1 /= m; dw2 /= m; db /= m
Then \(w_1\), \(w_2\), \(b\) are updated with these gradients, and the whole process is iterated.
The above computation process has a disadvantage: it contains two for loops. Specifically:
- The first for loop iterates over \(m\) samples
- The second for loop iterates over all features
Explicitly using for loops in the code makes the algorithm inefficient when there are a large number of features. Vectorization can be used to eliminate these explicit for loops.
5. Vectorization
Continuing to take logistic regression as an example, if \(z=w^Tx+b\) is calculated in a non-vectorized loop, the code is as follows:
z = 0;
for i in range(n_x):
z += w[i] * x[i]
z += b
Based on vectorized operations, the computation can be performed in parallel, greatly improving efficiency, and the code is also more concise:
(The numpy tool library in Python is used here. Readers who want to know more can check the numpy tutorial in ShowMeAI's Graphical Data Analysis series, or quickly learn its usage through ShowMeAI's Numpy Quick Look Manual.)
z = np.dot(w, x) + b
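As a quick illustrative check of the speedup (array size and timings are illustrative; exact numbers depend on hardware), the two approaches can be compared directly:

```python
import time
import numpy as np

n_x = 1_000_000
w = np.random.rand(n_x)
x = np.random.rand(n_x)
b = 0.0

# Explicit for loop.
start = time.time()
z_loop = 0.0
for i in range(n_x):
    z_loop += w[i] * x[i]
z_loop += b
loop_time = time.time() - start

# Vectorized version.
start = time.time()
z_vec = np.dot(w, x) + b
vec_time = time.time() - start

print(np.isclose(z_loop, z_vec))   # same result, up to floating-point error
print(loop_time, vec_time)         # the vectorized version is typically far faster
```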
Without an explicit for loop, the iterative pseudocode for implementing gradient descent for logistic regression is as follows:
$$Z=w^TX+b=np.dot(w.T, X) + b$$
$$A=\sigma(Z)$$
$$dZ=A-Y$$
$$dw=\frac{1}{m}XdZ^T$$
$$db=\frac{1}{m}np.sum(dZ)$$
$$w:=w-\alpha\, dw$$
$$b:=b-\alpha\, db$$
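A minimal end-to-end sketch of this vectorized iteration with numpy (the synthetic toy data and hyperparameters are illustrative, not from the original article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: n_x features, m samples.
n_x, m = 2, 1000
X = np.random.randn(n_x, m)                      # shape (n_x, m)
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)    # shape (1, m), linearly separable toy labels

w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.1

for _ in range(1000):
    Z = np.dot(w.T, X) + b        # shape (1, m)
    A = sigmoid(Z)                # predicted probabilities
    dZ = A - Y                    # shape (1, m)
    dw = np.dot(X, dZ.T) / m      # shape (n_x, 1)
    db = np.sum(dZ) / m
    w -= alpha * dw               # gradient-descent updates
    b -= alpha * db

accuracy = np.mean((A > 0.5) == Y)
print(accuracy)                   # close to 1.0 on this toy data
```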
References
- Graphical Machine Learning | Logistic Regression Algorithm Explained
- Illustrated Machine Learning | Machine Learning Fundamentals
- Graphical AI Mathematical Fundamentals | Calculus and Optimization
- Graphical Data Analysis
- Numpy Quick Look Manual
Recommended Articles
- Deep Learning Tutorial | Introduction to Deep Learning
- Deep Learning Tutorial | Neural Network Basics
- Deep Learning Tutorial | Shallow Neural Networks
- Deep Learning Tutorial | Deep Neural Networks
- Deep Learning Tutorial | Practical Aspects of Deep Learning
- Deep Learning Tutorial | Neural Network Optimization Algorithms
- Deep Learning Tutorial | Network Optimization: Hyperparameter Tuning, Regularization, Batch Normalization, and Program Frameworks
- Deep Learning Tutorial | AI Application Practice Strategy (Part 1)
- Deep Learning Tutorial | AI Application Practice Strategy (Part 2)
- Deep Learning Tutorial | Convolutional Neural Network Interpretation
- Deep Learning Tutorial | Detailed Explanation of Classic CNN Network Examples
- Deep Learning Tutorial | CNN Applications: Object Detection
- Deep Learning Tutorial | CNN Applications: Face Recognition and Neural Style Transfer
- Deep Learning Tutorial | Sequence Model and RNN Network
- Deep Learning Tutorial | Natural Language Processing and Word Embeddings
- Deep Learning Tutorial | Seq2seq Sequence Model and Attention Mechanism
Featured Recommendations in ShowMeAI Series Tutorials
- Dachang Technology Realization Program Series
- Graphical Python Programming: From Beginner to Mastery series of tutorials
- Graphical Data Analysis: From Beginner to Mastery Tutorial Series
- Graphical AI Mathematical Fundamentals: From Beginner to Mastery Series Tutorials
- Illustrated Big Data Technologies: From Beginner to Mastery Series
- Illustrated Machine Learning Algorithms: From Beginner to Mastery Tutorial Series
- Machine Learning in Practice: A Hands-on Guide to Machine Learning Series
- Deep Learning Tutorial: Andrew Ng (Wu Enda) Specialization · Complete Set of Notes and Interpretations
- Natural Language Processing Tutorial: Stanford CS224n Course · Course Learning and Complete Notes Interpretation
- Deep Learning and Computer Vision Tutorial: Stanford CS231n · Complete Set of Notes and Interpretations