Deep Learning and CV Tutorial (2) | Image Classification and Machine Learning Basics - ShowMeAI Research Center

Author: Han Xinzi @ShowMeAI
Tutorial address: https://www.showmeai.tech/tutorials/37
Article address: https://www.showmeai.tech/article-detail/261
Disclaimer: All rights reserved. To reprint, please contact the platform and the author and cite the source.

Image classification pipeline; Deep Learning and Computer Vision; Stanford CS231n

This series is a complete set of study notes for the Stanford CS231n course "Deep Learning for Computer Vision". See the end of the article for the course videos and more information.

Introduction

Image classification is a core task in computer vision, and many other computer vision problems (such as object detection and semantic segmentation) can be reduced to image classification. The task is defined as follows: given a fixed set of class labels, choose one label from that set for each input image and assign it to the image. In this article, ShowMeAI explains data-driven modeling approaches, including the simple KNN model and the linear classification model.

The focus of this article

Data-driven approach
KNN algorithm
Linear classification

1. Challenges of Image Classification

To a computer, an image is just a matrix of pixel values; to a human, it is a rich presentation of semantic information corresponding to object categories. This mismatch is the semantic gap that the computer has to bridge.

For example, given the picture of a kitten below, an image classification model reads the picture and computes the probability that it belongs to each label in the set $\{cat, dog, hat, cup\}$. What the model actually reads, however, is a large $3$-dimensional array of numbers.

In the figure below, the cat image is $600$ pixels high and $800$ pixels wide with $3$ color channels (red, green, and blue, or RGB for short), so it contains $600 \times 800 \times 3=1440000$ numbers. Each number is an integer in the range $0 \sim 255$, where $0$ is the darkest value (black) and $255$ is the brightest (white).
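As a quick check of this arithmetic, the sketch below builds a placeholder array of the same shape in NumPy. The name `image` and its all-zero contents are assumptions for illustration only; a real image would be loaded from a file.

```python
import numpy as np

# Placeholder for a 600 x 800 RGB image: height x width x channels.
# uint8 holds integers in the range 0..255, one per channel value.
image = np.zeros((600, 800, 3), dtype=np.uint8)

num_values = image.size  # total count of numbers stored in the array
print(image.shape, num_values)
```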

Our task is to turn these numbers into a simple label, such as "cat".

![image classification; image classification challenges; images in the computer [eye]; 2-1]( https://img-blog.csdnimg.cn/img_convert/dc41f7d5873497b422f634e516efc37d.png )

An image classification algorithm needs to be robust to the following variations and to combinations of them:

Viewpoint variation: The same object can be photographed from many angles.
Scale variation: The visible size of an object often changes (in the real world, not only in pictures).
Deformation: Many objects are not rigid and can change shape dramatically.
Occlusion: The target object may be partly hidden. Sometimes only a small part of it (as little as a few pixels) is visible.
Illumination conditions: At the pixel level, lighting has a huge effect.
Background clutter: Objects may blend into the background, making them hard to recognize.
Intra-class variation: Individuals within one class can look very different. Chairs, for example, come in many different shapes.

The figure below shows some of these variations and recognition challenges:

Image classification; challenges of image classification; some variations and recognition challenges; 2-2

2. Data-driven approach

One approach is to hard-code the rules: first extract edges from the cat image to obtain line segments, then define rules such as "three lines meeting at a point form an ear". This approach recognizes objects poorly and cannot be extended to new object categories.

Image classification; data-driven algorithm; extracting image edges to obtain lines; 2-3

Instead, we use a data-driven algorithm: rather than writing out the rules for each object by hand, we collect a large number of sample pictures for each category, feed them to the computer for machine learning, and let it summarize the patterns into a classifier model. The trained model captures the key features that distinguish the categories and is then used to recognize new images.

Image classification; data-driven algorithm; input/learning/evaluation; 2-4

The data-driven pipeline is as follows:

Input: A collection of $N$ images, each labeled with one of $K$ class labels. This collection is called the training set.
Learning: Use the training set to learn what each class looks like. This step is usually called training a classifier or learning a model.
Evaluation: Have the classifier predict labels for images it has never seen, then compare the predicted labels with the true labels (ground truth) to assess the quality of the classifier.

2.1 The nearest neighbor algorithm

For more on this topic, see the article on the KNN algorithm and its applications in ShowMeAI's Graphical Machine Learning Tutorial.

Here we introduce the first classifier: the nearest neighbor algorithm. Training simply memorizes the image data and labels; prediction compares the test image with every training image and outputs the label of the closest one. This classifier has nothing to do with convolutional neural networks and is rarely used in practice, but implementing it gives a basic understanding of how to approach image classification problems.

1) Image classification dataset: CIFAR-10

CIFAR-10 is a very popular image classification dataset. It contains 60,000 small $32 \times 32$ images in 10 classes, and each image carries exactly one label. The 60,000 images are split into a training set of 50,000 images (5,000 per class) and a test set of 10,000 images.

Suppose we use the 50,000 images as the training set and the remaining 10,000 as the test set. The Nearest Neighbor algorithm compares each test image with every image in the training set and assigns it the label of the training image it considers most similar.

The result is shown in the figure below, and it is not particularly good.

Image classification; image classification dataset; CIFAR-10; 2-5

Left: sample images from the CIFAR-10 dataset.
Right: the first column shows test images; each is followed by its 10 nearest neighbors in the training set, as ranked by the Nearest Neighbor algorithm using pixel-wise differences.

So how exactly do we compare two images? There are several distance metrics, introduced below.

2) L1 distance (Manhattan distance)

For the mathematics of distance measures, see also the article on linear algebra and matrix theory in ShowMeAI's AI mathematics basics series, which covers the various distance measures.

In this case we are comparing two $32 \times 32 \times 3$ blocks of pixels. The simplest method is to compare them pixel by pixel and add up all the differences. Flatten the two images into two vectors $I_{1}$ and $I_{2}$, and then calculate their L1 distance:

$$ d_{1} (I_{1} ,I_{2} )=\sum_{p}\vert I_{1}^p -I_{2}^p \vert $$

Here $p$ indexes the pixels, and $I^p$ is the value of the $p$-th pixel.
The L1 distance takes the absolute difference at every pixel and sums all of them into a single value. If the two pictures are identical, the L1 distance is $0$; if they are very different, the L1 distance is large.

The figure below computes the L1 distance for a $4 \times 4$ image with a single color channel.

Image classification; L1 distance; 4x4 image calculation; 2-6
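The same calculation can be sketched in NumPy. The pixel values below are illustrative assumptions, not values read from the figure, and the variable names are hypothetical.

```python
import numpy as np

# Illustrative 4 x 4 single-channel images (assumed values).
# A signed integer dtype is used on purpose: subtracting uint8 arrays
# would wrap around instead of producing negative differences.
test_image = np.array([[56, 32, 10, 18],
                       [90, 23, 128, 133],
                       [24, 26, 178, 200],
                       [2, 0, 255, 220]], dtype=np.int64)
train_image = np.array([[10, 20, 24, 17],
                        [8, 10, 89, 100],
                        [12, 16, 178, 170],
                        [4, 32, 233, 112]], dtype=np.int64)

abs_diff = np.abs(test_image - train_image)  # pixel-wise absolute differences
l1_distance = int(abs_diff.sum())            # add them all up
print(l1_distance)  # 456
```

Comparing an image with itself in the same way gives a distance of 0, as the text describes.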

Let's see how this is implemented in code:

① First, load the CIFAR-10 data into memory as 4 arrays: training data and labels, and test data and labels.

Xtr, Ytr, Xte, Yte = load_CIFAR10('data/cifar10/')  # helper that loads the CIFAR-10 data
# Xtr is a 50000x32x32x3 array: 50,000 images, each 32 rows by 32 columns,
# with 3 values (RGB) per pixel.
# Xte is a 10000x32x32x3 array.
# Ytr is a 1-D array of length 50000; Yte is a 1-D array of length 10000.
Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3)
# Xtr_rows is a 50000x3072 array: each image is flattened into one row.
Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3)
# Xte_rows is a 10000x3072 array.
# Xtr.shape returns the tuple (50000, 32, 32, 3), so Xtr.shape[0] is 50000.
# Xtr.reshape(50000, 3072) reshapes Xtr into a 50000x3072 array and is
# equivalent to np.reshape(Xtr, (50000, 3072)).

Xtr (size 50000x32x32x3) holds all the images in the training set
Xte (size 10000x32x32x3) holds all the images in the test set
Ytr is a 1-dimensional array of length 50000 that stores the class labels (0 to 9) of the training images
Yte is the corresponding 1-dimensional array of length 10000 for the test images

After reshaping, Xtr_rows and Xte_rows hold all the image data, with each image stored as a row vector of length 3072.
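The flattening step can be tried without downloading the dataset. The sketch below uses a small synthetic array as a stand-in for the image data (the names `X`, `X_rows` and `X_back` are assumptions for illustration) and checks that reshaping loses no information.

```python
import numpy as np

# Synthetic stand-in for a batch of 5 images of shape 32 x 32 x 3.
X = np.arange(5 * 32 * 32 * 3).reshape(5, 32, 32, 3)

# Flatten every image into one row; the number of images is preserved.
X_rows = X.reshape(X.shape[0], 32 * 32 * 3)

# Reshaping back recovers the original layout, so nothing is lost.
X_back = X_rows.reshape(5, 32, 32, 3)
print(X_rows.shape)
```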

② Next, train a classifier and evaluate it. We usually use accuracy as the evaluation metric: the fraction of predictions that are correct.

Accuracy works in this example, but in many other applications it is not necessarily the best evaluation criterion. See the article on model evaluation methods and criteria in ShowMeAI's Graphical Machine Learning Tutorial.

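nn = NearestNeighbor()  # create a nearest neighbor classifier object
nn.train(Xtr_rows, Ytr)  # train the classifier on the training images and labels
Yte_predict = nn.predict(Xte_rows)  # predict labels for the test images
# print the classification accuracy: the mean of the per-image correct/incorrect flags
print('accuracy: %f' % (np.mean(Yte_predict == Yte)))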

Note that every classifier we implement from now on follows this interface (API): a train(X, y) function that learns from the training data and labels, and a predict(X) function that predicts the class labels of new input data.
Internally, the class should build some kind of model of the labels and of how they can be predicted from the data.

Here is an implementation of the Nearest Neighbor classifier using the L1 distance:

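import numpy as np

class NearestNeighbor(object):
  def __init__(self):
    pass

  def train(self, X, y):
    """ X is an N x D array in which each row is one sample (e.g. one image)
    and D is the dimensionality of a sample; y is a 1-D array of length N. """
    # the nearest neighbor classifier simply remembers all the training data
    self.Xtr = X
    self.ytr = y

  def predict(self, X):
    """ X is an N x D array in which each row is a sample whose label we want to predict """
    num_test = X.shape[0]
    # make sure the output type matches the label type; one entry per test sample
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)

    # loop over all test samples, i.e. over the rows of the test array
    for i in range(num_test):
      # find the training image closest to the i-th test image,
      # using the L1 distance (sum of absolute differences).
      # self.Xtr - X[i,:] uses broadcasting to subtract the row vector of the
      # i-th test image from every training row, giving a 50000x3072 matrix;
      # np.abs takes the absolute value of every element, and the sum over
      # axis = 1 adds each row up, leaving a 1-D array of length 50000 with the
      # L1 distances from the i-th test image to all 50,000 training images.
      distances = np.sum(np.abs(self.Xtr - X[i,:]), axis = 1)
      min_index = np.argmin(distances) # index of the training image with the smallest distance
      Ypred[i] = self.ytr[min_index] # predict the label of that nearest training image

    return Ypred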

Training with this code takes $O(1)$ time, because it only stores a reference to the data, which takes constant time no matter how large the dataset is. If the training set has $N$ samples, predicting one test image takes $O(N)$ time, because the test image is compared with every image in the training set.

This is the wrong way round for a practical classifier: we want prediction to be fast and can tolerate slow training.

Run on CIFAR-10, this code reaches an accuracy of $38.6\%$. That is better than the $10\%$ of random guessing, but far below human-level recognition and the roughly $95\%$ that convolutional neural networks can achieve.

3) L2 distance (Euclidean distance)

Another commonly used choice is the L2 distance, which geometrically is the Euclidean distance between two vectors. The formula is:

$$ d_{2} (I_{1},I_{2})=\sqrt{\sum_{p}(I_{1}^p - I_{2}^p )^2 } $$

The pixel-wise differences are still computed, but each difference is squared, the squares are added up, and finally the square root of the sum is taken.

In the code, only the line that computes the distances needs to change:

distances = np.sqrt(np.sum(np.square(self.Xtr - X[i,:]), axis = 1))
# np.square(self.Xtr - X[i,:]) squares every element of the difference matrix

Note that np.sqrt is used here, but in practice it can be omitted: the square root is a monotonic function, so it changes the absolute values of the distances but preserves their ordering. With the L2 distance the accuracy is $35.4\%$, slightly lower than with L1.
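The claim that the square root does not affect which neighbor is nearest can be checked with a small sketch. The data here is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic integer-valued data standing in for flattened images.
Xtr = rng.integers(0, 256, size=(100, 48)).astype(np.int64)
x = rng.integers(0, 256, size=48).astype(np.int64)

squared = np.sum(np.square(Xtr - x), axis=1)  # squared L2 distances
euclid = np.sqrt(squared)                     # true L2 distances

# The nearest neighbor is the same with or without the square root.
nearest_without_sqrt = int(np.argmin(squared))
nearest_with_sqrt = int(np.argmin(euclid))
print(nearest_without_sqrt == nearest_with_sqrt)
```

Skipping the square root saves one operation per training sample at prediction time without changing the predicted labels.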

4) Comparison of L1 and L2

The L1 distance depends on the choice of coordinate axes: rotating the coordinate system changes the L1 distance between two points. The decision boundaries it produces therefore tend to follow the coordinate axes. The L2 distance is much less tied to the coordinate system: the set of points at a fixed L2 distance from the origin forms a circle, which does not change when the axes are rotated (for L1 this set is a square with its corners on the axes).

Image classification; nearest neighbor algorithm; L1 distance vs. L2 distance; 2-7
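A minimal sketch of this dependence on the coordinate axes: rotating a 2-dimensional vector by 45 degrees changes its L1 norm but not its L2 norm. The helper names `l1_norm` and `l2_norm` are assumptions for illustration.

```python
import numpy as np

def l1_norm(v):
    return float(np.sum(np.abs(v)))

def l2_norm(v):
    return float(np.sqrt(np.sum(np.square(v))))

theta = np.pi / 4  # rotate by 45 degrees
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

v = np.array([1.0, 0.0])
v_rotated = rotation @ v

print(l1_norm(v), l1_norm(v_rotated))  # L1 changes under rotation
print(l2_norm(v), l2_norm(v_rotated))  # L2 stays the same
```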

When two vectors differ, L2 is less tolerant of large differences than L1. That is, because the differences are squared, the L2 distance prefers many moderate differences over one large difference.
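A small numeric illustration, using two made-up difference vectors with the same total absolute difference:

```python
import numpy as np

# Two difference vectors with the same total absolute difference.
one_large = np.array([4.0, 0.0, 0.0, 0.0])      # a single large difference
many_moderate = np.array([1.0, 1.0, 1.0, 1.0])  # several moderate differences

l1_large = float(np.sum(np.abs(one_large)))
l1_moderate = float(np.sum(np.abs(many_moderate)))
l2_large = float(np.sqrt(np.sum(np.square(one_large))))
l2_moderate = float(np.sqrt(np.sum(np.square(many_moderate))))

print(l1_large, l1_moderate)  # 4.0 4.0
print(l2_large, l2_moderate)  # 4.0 2.0
```

L1 treats the two cases as equally far apart, while L2 considers the vector with one large difference to be twice as far away.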

L1 and L2 are the two most commonly used special cases of the p-norm.
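For reference, the general form can be written in the same style as the two distance formulas above. Here $j$ indexes the pixels (the earlier formulas use $p$ for this index, but in this formula $p$ denotes the order of the norm). Setting $p=1$ gives the L1 distance and $p=2$ gives the L2 distance:

$$ d_{p} (I_{1},I_{2})=\left( \sum_{j}\vert I_{1}^j - I_{2}^j \vert ^p \right)^{1/p} $$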

If the individual entries of the vectors have a specific meaning (for example, each is a distinct feature of interest), L1 may be the better fit; if the vector is generic and the meaning of its entries is unknown, L2 is the more natural choice. The best approach is to try both and keep the one that works better.

2.2 k nearest neighbor classifier

For more on this topic, see the article on the KNN algorithm and its applications in ShowMeAI's Graphical Machine Learning Tutorial.

Using only the label of the single most similar image sometimes works poorly, because one reference is not enough. In that case we can use the k-Nearest Neighbor (KNN) classifier: find the $k$ most similar training images and predict the label that occurs most often among them.

When $k=1$, the k-Nearest Neighbor classifier reduces to the nearest neighbor classifier described above.
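One way to extend the nearest neighbor classifier to $k$ neighbors is sketched below. This is a minimal illustration under stated assumptions, not the article's own implementation: the class name `KNearestNeighbor` is hypothetical, labels are assumed to be non-negative integers, the L1 distance is used, and ties in the vote are broken in favor of the smallest label.

```python
import numpy as np

class KNearestNeighbor:
    def train(self, X, y):
        # Like the nearest neighbor classifier, just remember the data.
        self.Xtr = np.asarray(X, dtype=np.float64)
        self.ytr = np.asarray(y)

    def predict(self, X, k=1):
        X = np.asarray(X, dtype=np.float64)
        Ypred = np.zeros(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            # L1 distance from test sample i to every training sample
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            # indices of the k closest training samples
            nearest = np.argsort(distances, kind="stable")[:k]
            # majority vote; argmax returns the smallest label on a tie
            votes = np.bincount(self.ytr[nearest])
            Ypred[i] = np.argmax(votes)
        return Ypred

# Tiny 1-D example: the single nearest point has label 1,
# but two of the three nearest points have label 0.
knn = KNearestNeighbor()
knn.train([[0.0], [1.0], [2.0], [10.0], [11.0]], [0, 0, 1, 1, 1])
pred_k1 = int(knn.predict([[1.6]], k=1)[0])
pred_k3 = int(knn.predict([[1.6]], k=3)[0])
print(pred_k1, pred_k3)  # 1 0
```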

In the figure below, images are represented as 2-dimensional points in 3 classes (red, green, and blue). The colored regions show the decision boundaries of the classifier under the L2 distance.

Image classification; NN classifier vs. KNN classifier; 2-8

The figure contrasts the NN classifier with the KNN ($k=5$) classifier. Intuitively, a larger $k$ smooths the decision boundaries and makes the classifier more robust to outliers.

At $k=1$, outliers (e.g. green points inside the blue region) create islands of likely incorrect predictions.
At $k=5$, the classifier smooths out these irregularities and generalizes better to the test data.
- Note that 5-NN also has some white regions. They arise from ties in the vote, where at least two class labels share the highest count among the 5 nearest neighbors.
- For example, if 2 neighbors are red, 2 are blue, and 1 is green, the classifier cannot decide between red and blue.
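The tie in this example can be detected with a few lines of code. The function name `vote` is a hypothetical helper for illustration; it returns every label that shares the highest count, so a result with more than one label signals an ambiguous (white) region.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return the sorted list of labels that share the highest vote count."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    return sorted(label for label, n in counts.items() if n == top)

tied = vote(["red", "red", "blue", "blue", "green"])
clear = vote(["red", "red", "red", "blue", "green"])
print(tied)   # ['blue', 'red'] -> tie between two classes
print(clear)  # ['red'] -> clear winner
```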

1) Hyperparameter tuning

For more on model tuning and the experimental selection of hyperparameters, see ShowMeAI's articles "Graphical Machine Learning | Model Evaluation Methods and Criteria" and "Deep Learning Tutorial | Network Optimization: Hyperparameter Tuning, Regularization, Batch Normalization and Program Framework".

The KNN classifier requires a value for $k$. How do we choose the most suitable one?
Which is better, the L1 distance or the L2 distance (or another measure, such as the dot product)?

All of these choices are called hyperparameters. Hyperparameters are very common in the design of machine learning algorithms that learn from data.

Hyperparameters must be set before training; the model is trained only once they are fixed. Choosing them usually requires experiments: try different values and select according to performance.

Special note: the test set must not be used for tuning.
If you tune on the test set, the algorithm may appear to work well, but when it is actually deployed its performance may be far lower than expected. In that case we say the algorithm has overfit the test set.
Put another way, tuning on the test set effectively turns the test set into a training set: an algorithm tuned on the test set and then evaluated on that same test set naturally looks good, but it will do much worse once deployed.
Only if the test set is reserved for the final evaluation does it give a good estimate of the classifier's generalization performance.
The test set should be used only once, to evaluate the final model after training, never for tuning!

Method 1: Setting up a validation set

Hold out part of the training set for tuning; this part is called the validation set. Taking CIFAR-10 as an example, we can use 49,000 images as the training set and 1,000 images as the validation set. The validation set serves as a stand-in test set for tuning.

Image classification; hyperparameter tuning; setting up a validation set; 2-9

The code is as follows:

# assume Xtr_rows, Ytr, Xte_rows, Yte are the same as before
# Xtr_rows is a 50,000 x 3072 matrix
Xval_rows = Xtr_rows[:1000, :] # take the first 1,000 training samples as the validation set
Yval = Ytr[:1000]
Xtr_rows = Xtr_rows[1000:, :] # keep the remaining 49,000 as the training set
Ytr = Ytr[1000:]

# find the hyperparameter k that performs best on the validation set
validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:
  # evaluate on the validation set with a specific value of k
  nn = NearestNeighbor()
  nn.train(Xtr_rows, Ytr)
  # here we assume a modified NearestNeighbor class that accepts k as an argument
  Yval_predict = nn.predict(Xval_rows, k = k)
  acc = np.mean(Yval_predict == Yval)
  print('accuracy: %f' % (acc,))

  # record each k together with its accuracy
  validation_accuracies.append((k, acc))

After the program finishes, plot the results to see which value of $k$ performs best, then run the test set once with that $k$ to evaluate the algorithm.
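Besides plotting, the best $k$ can also be picked programmatically from the recorded list. In the sketch below, the helper name `best_hyperparameter` is hypothetical and the accuracy numbers are placeholders invented for illustration, not measured results.

```python
# Hypothetical (k, accuracy) pairs in the same format as validation_accuracies;
# the numbers are placeholders for illustration, not measured results.
validation_accuracies = [(1, 0.27), (3, 0.25), (5, 0.29), (10, 0.28)]

def best_hyperparameter(results):
    """Return the (k, accuracy) pair with the highest accuracy."""
    return max(results, key=lambda pair: pair[1])

best_k, best_acc = best_hyperparameter(validation_accuracies)
print('best k = %d, validation accuracy = %f' % (best_k, best_acc))
```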

Method 2: Cross Validation

When the training set is small (and the validation set therefore even smaller), cross-validation can be used. Continuing the example above, instead of holding out 1,000 images we divide the training set into 5 equal folds of 10,000 images each, train on 4 of them, and validate on the remaining 1. We rotate through the folds so that each of the 5 serves as the validation fold once, and average the 5 validation results to obtain the validation score of the algorithm.

Image classification; hyperparameter tuning; cross-validation; 2-10

Below is an example of tuning $k$ with 5-fold cross-validation. For each value of $k$, we obtain 5 validation accuracies and average them, then connect the averages for the different $k$ values with a line.

Image classification; hyperparameter tuning; results of k-fold cross-validation; 2-11

As the figure shows, in this example the algorithm performs best at about $k=7$ (the peak of the curve). The curve generally becomes smoother (less noisy) if we divide the training set into more folds.
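The cross-validation loop can be sketched as follows. This is a minimal illustration on synthetic two-class data, not the experiment behind the figure; the function names `knn_predict` and `cross_validate` are assumptions, and labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_predict(Xtr, ytr, X, k):
    """L1-distance k-NN with majority vote (smallest label wins ties)."""
    preds = np.empty(X.shape[0], dtype=ytr.dtype)
    for i in range(X.shape[0]):
        d = np.sum(np.abs(Xtr - X[i]), axis=1)
        nearest = np.argsort(d, kind="stable")[:k]
        preds[i] = np.argmax(np.bincount(ytr[nearest]))
    return preds

def cross_validate(X, y, k_choices, num_folds=5):
    """Return {k: mean validation accuracy over num_folds folds}."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    results = {}
    for k in k_choices:
        accs = []
        for f in range(num_folds):
            # fold f is the validation fold, the others form the training set
            X_val, y_val = X_folds[f], y_folds[f]
            X_train = np.concatenate(X_folds[:f] + X_folds[f + 1:])
            y_train = np.concatenate(y_folds[:f] + y_folds[f + 1:])
            accs.append(np.mean(knn_predict(X_train, y_train, X_val, k) == y_val))
        results[k] = float(np.mean(accs))
    return results

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussian blobs, shuffled.
X = np.concatenate([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
perm = rng.permutation(100)
X, y = X[perm], y[perm]

results = cross_validate(X, y, [1, 3, 5, 7])
best_k = max(results, key=results.get)
print(results, best_k)
```

Each sample is used for validation exactly once, which is why the averaged accuracy is a less noisy estimate than a single held-out split of the same size.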

In practice, deep learning usually avoids cross-validation, mainly because it is computationally expensive. The training data is typically split once, with $50\% \sim 90\%$ used for training and the rest for validation. When the training set is small, however, cross-validation is appropriate, typically with 3, 5, or 10 folds.

2) Advantages of KNN classifier

① Easy to understand and simple to implement.
② Training takes almost no time, because it only stores the training data.

3) Disadvantages of KNN classifier

① Testing takes a lot of time

Each test image must be compared with all the stored training images, whereas in practical applications we care far more about efficiency at test time than at training time.

② Pixel differences are a poor way to compare images. The L2 distance between pictures is dominated by the background rather than by the semantic content: pictures with similar backgrounds often have a small L2 distance.

That is, on high-dimensional data, pixel-based similarity is very different from perceptual similarity. Images that look very different from one another can all lie at the same L2 distance from a given image.

③ Curse of dimensionality

KNN in effect uses the training data to partition the sample space into regions. It needs the training data to cover the sample space densely; otherwise the nearest neighbor of a test image may actually be very far away, and therefore quite unlike it. But the number of training samples needed for dense coverage grows exponentially with the dimension of the data.
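To make the exponential growth concrete: assuming we want about 4 sample points along each axis (an arbitrary density chosen for illustration), the number of points needed to cover the space is 4 raised to the number of dimensions. The function name `points_needed` is hypothetical.

```python
def points_needed(points_per_axis, num_dims):
    """Grid points needed to cover a num_dims-dimensional cube at a fixed density."""
    return points_per_axis ** num_dims

for d in (1, 2, 3, 10):
    print(d, points_needed(4, d))

# For 3072-dimensional CIFAR-10 vectors the count is astronomically large;
# here we only report how many decimal digits it has.
digits = len(str(points_needed(4, 3072)))
print(digits)
```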

4) Practical application of KNN

Here are some suggestions for practical application of the KNN algorithm

① Preprocessing data

Normalize the features in the data to zero mean and unit variance. This section does not cover it in detail, because the pixels of an image are homogeneous and do not have widely different distributions, so normalization matters less here.

② Dimensionality reduction

If the data is high dimensional, consider using a dimensionality reduction method such as PCA or random projection.
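As one possible sketch of PCA-based dimensionality reduction using only NumPy's SVD (the function name `pca_reduce` and the synthetic data are assumptions for illustration; in practice a dedicated library would usually be used):

```python
import numpy as np

def pca_reduce(X, num_components):
    """Project the rows of X onto their top principal directions."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:num_components]
    return X_centered @ components.T, mean, components

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # synthetic data: 200 samples, 50 features
X_reduced, mean, components = pca_reduce(X, 10)
print(X_reduced.shape)
```

New samples would be reduced with the same `mean` and `components` that were computed on the training data.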

③ Randomly divide the data into training set and validation set

As a general rule, $70\% \sim 90\%$ of the data is used as the training set. The ratio depends on how many hyperparameters the algorithm has and how strongly they are expected to affect it.
If there are many hyperparameters to estimate, use a larger validation set to estimate them reliably. If you are worried that the validation set is too small, try cross-validation. If you have enough computing resources, cross-validation is always the safer choice (the more folds, the better the estimate and the higher the computational cost).

④ Tuning on the validation set

Try many values of $k$, and try both the L1 and L2 distances.

⑤ Speed up the classifier

If the classifier is too slow, try an approximate nearest neighbor (ANN) library such as FLANN to speed it up, at the cost of some accuracy.

⑥ Record the optimal hyperparameters

After recording the optimal hyperparameters, it is cleaner not to retrain on the full training set (training plus validation data) with them, because adding the validation data can shift the optimal values and invalidate the estimate.
Instead, evaluate the model built with the optimal hyperparameters directly on the test set, and report the resulting test accuracy as the performance of your KNN classifier on this data.

3. Linear Classification: Scoring Function

3.1 Overview of Linear Classification

In the KNN model, no parameters are learned during training; the training data is simply stored. (The value $k$ is used at prediction time to find the $k$ closest images and take the most frequent label among them, but $k$ is a hyperparameter that is set by hand.)

A parametric model, in contrast, ends training with a set of learned parameters, after which the training data can be discarded entirely. To make a prediction, it only needs to combine the input with these parameters and decide according to the result. Linear classifiers are the simplest parametric models, and they are an important building block of neural networks.

The method of linear classification consists of two parts:

① Score function

It maps the raw image data to class scores.

② Loss function

It quantifies how well the scores computed by the score function agree with the ground-truth labels. Classification then becomes an optimization problem: minimize the value of the loss function by updating the parameters of the score function.

3.2 Scoring function

The scoring function maps the pixel values of an image to a score for each class; the higher the score, the more likely the image is to belong to that class. This description is rather abstract, so we illustrate it with a concrete example.

Return to the CIFAR-10 image classification dataset used with KNN.

Image classification; scoring function; parametric approach - linear classifier; 2-12

Suppose our training set has $N$ samples, here $N=50000$. Each sample $x_{i} \in R^D$, where $i = 1,2, \cdots,N$ and $D=3072$, has a corresponding label $y_{i}$ that takes a value in $[1, K]$, where $K$ is the total number of classes, here $K=10$. The scoring function can now be defined as $f:R^D \rightarrow R^K$, which maps a $D$-dimensional image to the scores of $K$ classes.

The simplest model is a linear one, in which the parameters are multiplied by the input data:

$$ f(x_{i},W,b)=Wx_{i}+b $$

In the formula above, the parameter $W$ is called the weights and $b$ is called the bias term.
Each image is assumed to be flattened into a column vector of size $[D \times 1]$. The matrix $W$ of size $[K \times D]$ and the column vector $b$ of size $[K \times 1]$ are the parameters of the function.

Taking CIFAR-10 as an example, $x_{i}$ contains all the pixel values of the $i$-th image flattened into a $[3072 \times 1]$ column vector, $W$ has size $[10 \times 3072]$, and $b$ has size $[10 \times 1]$. The function therefore takes $3072$ numbers (the raw pixel values) as input and outputs $10$ numbers (the scores of the classes): a mapping from $3072$ dimensions to $10$ dimensions.
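The shapes can be checked with a short NumPy sketch. The parameters here are random stand-ins for learned values, the input is a synthetic vector rather than a real image, and 1-D arrays are used in place of explicit column vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 3072, 10  # input dimension and number of classes for CIFAR-10

# Randomly initialized parameters stand in for learned ones.
W = rng.normal(scale=0.01, size=(K, D))
b = np.zeros(K)

def f(x, W, b):
    """Linear score function: one score per class."""
    return W @ x + b

x = rng.integers(0, 256, size=D).astype(np.float64)  # synthetic flattened image
scores = f(x, W, b)
predicted_class = int(np.argmax(scores))  # class with the highest score
print(scores.shape, predicted_class)
```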

Note:

The terms weights and parameters are often used interchangeably. Multiplying the data by the parameters amounts to weighting each data value, and the weights are the parameter values;
One advantage of this method is that the training data is used only to learn the parameters $W$ and $b$. Once training is complete, the training data can be discarded and only the learned parameters are kept. To classify a test image, simply feed the image data into the function and classify according to the scores it computes;
The input data $(x_{i},y_{i})$ is given and cannot be changed, but the parameters $W$ and $b$ can. The goal is to adjust them so that the computed scores are consistent with the true class labels of the training images;
Classifying one test sample takes only one matrix multiplication and one addition, which is far more efficient than comparing the test image with all the training data as KNN does.

3.3 Understanding Linear Classifiers

1) Understanding 1: W is the combination of all classifiers

Image classification; linear classifier; Understanding 1 - computing the scoring function; 2-13

As shown in the figure above, the pixel data of the kitten image is flattened into a column vector $x_i$. For ease of description, assume the image has only 4 pixels (grayscale, ignoring the RGB channels), i.e. $D=4$, and that there are $3$ classes (red for cats, green for dogs, blue for boats; the colors only mark the classes and have nothing to do with the RGB channels), i.e. $K=3$. Multiplying the matrix $W$ by the column vector $x_i$ and adding $b$ gives the score of each class.

We can see that the parameter matrix $W$ is in effect a stack of three classifiers: each row of $W$ is a classifier for one class (cats, dogs, and boats). In a linear model, each classifier has as many weights as the input image has dimensions. Multiplying each pixel by its weight determines how much that pixel contributes to the class score.

Note that this particular $W$ is not good at all: the score for the cat class is very low, and according to the figure the algorithm thinks the image is a dog.

More generally, a linear classifier computes each class score as a weighted sum of all the pixel values across the 3 color channels of the image. Depending on the values of the weights, the function can like or dislike (according to the sign of each weight) certain colors at certain locations in the image.

Example: Imagine that images in the "boat" class are usually surrounded by a lot of blue (water). Then the "boat" classifier will have many positive weights on the blue channel (blue pixels increase the "boat" score) and negative weights on the green and red channels (their presence lowers the "boat" score).

Back in the kitten example above, the cat classifier "dislikes" the pixel in the second position, and that pixel has a very large value in the input image, so the cat score ends up very low (which shows, of course, that this classifier is wrong).
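The kitten example can be reproduced numerically. The values below are illustrative assumptions chosen to match the description (a negative cat weight on a large second pixel); they are not read from the figure.

```python
import numpy as np

# Illustrative values (assumed): 4 pixels, 3 classes.
W = np.array([[0.2, -0.5, 0.1,  2.0],    # cat classifier
              [1.5,  1.3, 2.1,  0.0],    # dog classifier
              [0.0, 0.25, 0.2, -0.3]])   # boat classifier
b = np.array([1.1, 3.2, -1.2])
x = np.array([56.0, 231.0, 24.0, 2.0])   # flattened image; the second pixel is large

scores = W @ x + b                        # one score per class
classes = ["cat", "dog", "boat"]
predicted = classes[int(np.argmax(scores))]
print(dict(zip(classes, np.round(scores, 2).tolist())))
print(predicted)
```

With these assumed values the weight of -0.5 on the second pixel alone contributes -115.5 to the cat score, which is why the cat score is strongly negative and the dog class wins.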

2) Understanding 2: Think of linear classifiers as template matching

Treat each row of the weights $W$ as a template for one class. The score of an image for each class is obtained by comparing the image with the corresponding template using the inner product (also called the dot product), and the class whose template matches best wins.

Under this interpretation, the linear classifier performs template matching between the learned templates and the input image. The setup can be regarded as an efficient KNN. The difference is that instead of comparing against all the training images, each class is represented by a single image (a learned template, which need not appear in the training set), and the (negative) inner product replaces the L1 or L2 distance as the measure of distance between vectors.

Image classification; linear classifier; Understanding 2 - the 10 learned templates; 2-14

The figure above shows example weights learned with CIFAR-10 as the training set. We can see that:

The horse template appears to show a horse with two heads, because the training set contains horses facing left as well as horses facing right, and the linear classifier merges the two cases into one template;
The car template also seems to blend several different car models into one. The car in this template is red because red cars outnumber cars of other colors in the CIFAR-10 training set. The linear classifier is very weak at classifying cars of different colors, but as we will see later, a neural network can handle this task;
The boat template has a lot of blue pixels, as expected. If the image shows a boat sailing on the ocean, its inner product with this template gives a high score.

3) Understanding 3: Think of an image as a point in a high-dimensional space

Since the score of each class is defined as a weighted sum of all the image pixels, each class score is a linear function over this space. We cannot visualize a linear function in a $3072$-dimensional space, but if we imagine squashing all those dimensions into two, we can see what the classifiers are doing:

Image classification; linear classifier; Understanding 3 - partitioning of 2-D space; 2-15

In the figure above, each input image is a point, and the lines of different colors represent 3 different classifiers. Taking the red car classifier as an example, the red line is the set of points where the car score is $0$, and the red arrow shows the direction in which the score increases. Points to the right of the red line have positive scores that increase linearly with distance from the line; points to its left have negative scores that decrease linearly.

As shown above, each row of $W$ is a classifier for one class. The geometric interpretation of these numbers is:

Changing the values in a row of $W$ rotates the corresponding classifier line in space. The bias term $b$ lets the classifier line translate.
Note that without the bias term, the scores are always $0$ at $x_{i}=0$ regardless of the weights, so all the classifier lines would be forced to pass through the origin.

3.4 Bias and Weight Merging

In the derivation above there are two sets of parameters, the weights $W$ and the bias $b$, and handling them separately is cumbersome. A common trick is to merge them into a single matrix and at the same time extend the column vector $x_{i}$ by one dimension that always holds the constant $1$, the default bias dimension.

As shown in the figure below, the formula then simplifies to:

$$ f(x_{i},W)=Wx_{i} $$

Image classification; linear classifier; diagram after merging W and b; 2-16

Taking CIFAR-10 as an example, the size of $x_{i}$ becomes $[3073 \times 1]$ instead of $[3072 \times 1]$, with the extra dimension holding the constant $1$; the size of $W$ becomes $[10 \times 3073]$, and the extra column of $W$ holds the bias values $b$.

After this change, only one weight matrix needs to be learned instead of two separate sets of parameters for the weights and the biases.
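The equivalence of the two forms can be verified numerically. The sketch below uses random stand-in values; the names `W_ext` and `x_ext` are assumptions for the extended matrix and vector.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 10, 3072
W = rng.normal(size=(K, D))
b = rng.normal(size=K)
x = rng.normal(size=D)

# Append b as an extra column of W and the constant 1 as an extra entry of x.
W_ext = np.hstack([W, b[:, np.newaxis]])  # shape (10, 3073)
x_ext = np.append(x, 1.0)                 # shape (3073,)

scores_separate = W @ x + b      # original form with a separate bias
scores_merged = W_ext @ x_ext    # merged form, a single matrix product
print(np.allclose(scores_separate, scores_merged))
```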

3.5 Image data preprocessing

In the examples above, all images use raw pixel values ($0 \sim 255$). In machine learning we usually normalize the input features, and in image classification each pixel of the image can be regarded as a feature.

In practice, we center the data by subtracting the mean of each feature.

For images, this means computing a mean image over all the images in the training set and subtracting it from every image, so that the pixel values lie approximately in $[-127, 127]$.

A further common step is to scale the values so that they all lie in the interval $[-1, 1]$.
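A minimal sketch of these two preprocessing steps on synthetic data follows. The variable names are assumptions, and dividing by 255 is only one possible scaling choice: it guarantees the $[-1, 1]$ range because a centered pixel can never exceed 255 in magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for flattened images with pixel values in 0..255.
X_train = rng.integers(0, 256, size=(100, 3072)).astype(np.float64)
X_test = rng.integers(0, 256, size=(20, 3072)).astype(np.float64)

# The mean image is computed on the training set only ...
mean_image = X_train.mean(axis=0)
# ... and subtracted from both the training and the test images.
X_train_centered = X_train - mean_image
X_test_centered = X_test - mean_image

# Optional further step: divide by the largest possible magnitude (255)
# so that every value is guaranteed to lie in [-1, 1].
X_train_scaled = X_train_centered / 255.0
X_test_scaled = X_test_centered / 255.0
print(X_train_scaled.min(), X_train_scaled.max())
```

Using the training mean for the test images keeps the test set from influencing the preprocessing, in line with the rule that the test set is used only for the final evaluation.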

3.6 Cases where linear classifiers fail

Image classification; linear classifier; cases that are hard to handle; 2-17

The classification ability of a linear classifier is limited. In each of the three cases in the figure above, for example, no straight line separates the classes. The first case is a parity problem, and the third is a class with multiple modes.

4. Expand your learning

You can watch the [bilingual subtitles] version of the course videos on Bilibili.

5. Summary of key points

Difficulties and Challenges in Image Classification
Data-Driven Methods, Nearest Neighbor Algorithms, L1 and L2 Distances
KNN classifier, hyperparameter tuning, advantages and disadvantages of KNN and practical applications
Concepts of linear classification, understanding of scoring functions, parameter merging, data preprocessing, limitations of linear classifiers
