Abstract: This article introduces the main model optimization methods currently used in industry, and then focuses on model quantization, covering its basic principles, a classification of methods, future directions, and interpretations of several cutting-edge papers.

This article is shared from the Huawei Cloud Community post "A Review of Model Quantization and Its Applications", by Alan_wen.

Preface

With the continuous development of deep learning, neural networks are widely used across different fields and achieve performance far beyond earlier approaches. However, the parameter counts of deep models keep growing, which severely restricts their deployment in industry. This article therefore introduces the main model optimization methods currently used in industry, and then focuses on model quantization: its basic principles, a classification of methods, future directions, and interpretations of several cutting-edge papers.

1. Model Optimization Methods

1.1 Designing efficient network structures

A compactly designed network structure can itself serve as model optimization, for example the MobileNet series, built from concise Depth-Wise and Point-Wise convolutions. Hand-designed networks, however, are gradually being replaced by AutoML and network architecture search, which can produce compact networks with high accuracy.

1.2 Model pruning

Hand-designed network structures can generally reach higher accuracy, but their huge number of parameters makes them hard to apply directly in industrial products, so models are usually pruned. Model pruning is divided into structured pruning and unstructured pruning; unstructured pruning is generally difficult to accelerate at the hardware level, and model pruning as a whole is gradually being replaced by network architecture search.

1.3 Knowledge distillation

Besides pruning, knowledge distillation can also turn a large model into a small one. Knowledge distillation treats the original large model as the teacher and a purpose-designed small model as the student; the soft targets produced by the teacher guide the student's training and transfer the teacher's knowledge.

1.4 Sparsification

Sparsification makes the network weights or features sparse, typically by adding regularization during training. Once the weights are sparse, it is usually combined with model pruning: inactive weights are trimmed away to compress the network structure.

1.5 Model quantization

Model quantization is currently one of the most effective model optimization methods in industry. For example, going from FP32 to INT8 gives 4 times parameter compression, and besides shrinking memory it also enables faster computation. Extreme binary quantization can in theory even achieve 32-fold compression, but such aggressive compression causes the model's accuracy to drop rapidly. Model quantization is introduced in detail below.

2. A Review of Model Quantization

2.1 What is quantization?

In an information system, quantization is the process of approximating the continuous values of a signal with a finite number of discrete values (it can be viewed as a form of information compression).

In a computer system, quantization establishes a mapping between fixed-point and floating-point data, so that a good trade-off is obtained at a small cost in accuracy. It can be understood simply as using "low-bit" numbers to represent values that would otherwise be stored as FP32.

Before introducing the principles of quantization, consider three questions:

Why is quantization useful?

  • Because convolutional neural networks are insensitive to noise, and quantization is essentially equivalent to adding noise to the original input.

Why do we need quantization?

• Models are too large; for example, VGG19 exceeds 500 MB, which creates heavy storage pressure;
• The weight range of each layer is essentially fixed and does not fluctuate much, making it suitable for quantization and compression;
• In addition, quantization reduces both memory access and computation.

Why not directly train low-precision models?

  • Because training requires back-propagation and gradient descent, while INT8 values are discrete. For example, the learning rate is typically a small fractional value; INT8 cannot represent such fine-grained updates, so gradients cannot be back-propagated and applied directly.

2.2 Quantization Principle

Quantization refers to the process of approximating the continuous values of a signal with a finite number of discrete values; it can be understood as a form of information compression. On a computer system, this usually means representing values with "low bits".

Model quantization establishes a data mapping between fixed-point and floating-point values so that a good trade-off is obtained at a small cost in accuracy. The details are as follows:

R denotes the real floating-point value, Q the quantized fixed-point value, Z the fixed-point value corresponding to the floating-point value 0, and S the smallest scale that can be represented after fixed-point quantization. The quantization formula from floating point to fixed point is as follows:

Floating point to fixed point quantization:

Q = round(R / S) + Z        (1)

De-quantization back to floating point:

R = S × (Q − Z)
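As a concrete illustration (not from the original article), here is a minimal NumPy sketch of this affine mapping; the example tensor, the 8-bit unsigned range, and the min/max-based choice of S and Z are assumptions made for the demo.

```python
import numpy as np

def quantize(r, scale, zero_point, num_bits=8):
    """Q = round(R / S) + Z, clamped to the representable integer range."""
    qmin, qmax = 0, 2 ** num_bits - 1              # unsigned (asymmetric) range
    q = np.round(r / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    """R ~ S * (Q - Z): recover an approximation of the original float value."""
    return scale * (q.astype(np.float32) - zero_point)

# Example: S covers the observed range, Z is the integer that represents 0.0.
r = np.array([-1.0, -0.2, 0.0, 0.5, 2.3], dtype=np.float32)
scale = (r.max() - r.min()) / 255.0
zero_point = int(round(-r.min() / scale))
q = quantize(r, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```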

2.3 Basic Concepts of Quantization

Uniform and non-uniform quantization:

[Figure: uniform quantization (left) vs. non-uniform quantization (right)]

As shown in the figure above, quantization can be divided into uniform and non-uniform quantization. The uniform quantization in the left figure is the linear quantization of formula (1). Since the distribution of network weights or features is not necessarily uniform, simple linear quantization may cause an obvious loss of information from the original network, so non-uniform quantization can also be used: for example, K-means clustering is applied to the network weights to obtain cluster centres, and each cluster centre then serves as the quantized representative of all the weights in its cluster.
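As a rough sketch of this idea (not code from the article), the snippet below clusters a weight tensor with scikit-learn's KMeans and replaces every weight with its cluster centre; the tensor shape and the choice of 16 clusters (roughly a 4-bit codebook) are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, num_clusters=16):
    """Non-uniform quantization: cluster the weights, then store a small codebook
    of cluster centres plus a low-bit index per weight."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=num_clusters, n_init=10).fit(flat)
    codebook = km.cluster_centers_.flatten()   # representative value per cluster
    indices = km.labels_                       # which centre each weight maps to
    return codebook[indices].reshape(weights.shape), codebook, indices

w = np.random.randn(64, 64).astype(np.float32)
w_q, codebook, idx = kmeans_quantize(w)
print("mean absolute quantization error:", np.abs(w - w_q).mean())
```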

Symmetric and asymmetric quantization:

[Figure: symmetric quantization (left) vs. asymmetric quantization (right)]

In the ideal case shown on the left of the figure above, the feature distribution is fairly balanced, so the model can be quantized symmetrically, i.e. the ranges on both sides of zero have equal absolute value. In many cases, however, the model weights or features are unevenly distributed and not symmetric around zero; as shown on the right, direct symmetric quantization would severely compress the features on one side and lose a lot of the network's information. To preserve as much of the original network's information as possible, asymmetric quantization can be used instead.
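The difference between the two schemes comes down to how the scale S and zero point Z are chosen. The sketch below (an illustration, not from the article; the skewed test distribution is an arbitrary assumption) computes both sets of parameters for the same data.

```python
import numpy as np

def symmetric_params(x, num_bits=8):
    """Symmetric: zero point fixed at 0, range covers +/- max|x|."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for int8
    return np.abs(x).max() / qmax, 0

def asymmetric_params(x, num_bits=8):
    """Asymmetric: the observed [min, max] range is mapped onto [0, 2^b - 1]."""
    qmax = 2 ** num_bits - 1                       # e.g. 255 for uint8
    scale = (x.max() - x.min()) / qmax
    zero_point = int(round(-x.min() / scale))
    return scale, zero_point

x = np.random.exponential(size=1000).astype(np.float32)  # skewed, almost entirely positive
print(symmetric_params(x))   # half of the integer range is wasted on negative values
print(asymmetric_params(x))  # the whole range covers the actual distribution
```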

Dynamic and static quantization:

Another important factor distinguishing quantization methods is how the clipping range [α, β] is determined, and different calibration methods exist for this. The range of the weights can be computed statically, because in most cases the parameters are fixed during inference. The activation map, however, differs for every input sample, so there are two approaches to quantizing activations: dynamic quantization and static quantization.

In dynamic quantization, this range is computed dynamically for each activation map at runtime. This requires computing signal statistics (minimum, maximum, percentiles, etc.) in real time, which can be very expensive, but dynamic quantization usually achieves higher accuracy because the range is computed exactly for each input.

The other approach is static quantization, where the clipping range is pre-computed and kept fixed during inference. This adds no computational overhead at inference time, but usually yields lower accuracy than dynamic quantization. A popular way to pre-compute the range is to run a set of calibration inputs through the model and record typical activation ranges.

In summary, dynamic quantization computes the clipping range of each activation on the fly and usually achieves the highest accuracy, but computing the range dynamically is expensive. Static quantization, in which the clipping ranges for all inputs are fixed, is therefore the most common choice in industry.
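A minimal sketch of static calibration (an illustration under assumed shapes and batch counts, not the article's code): a min/max observer records the activation range over a few calibration batches, after which the clipping range is frozen.

```python
import numpy as np

class MinMaxObserver:
    """Record the activation range over calibration data, then freeze [alpha, beta]."""
    def __init__(self):
        self.alpha, self.beta = np.inf, -np.inf

    def observe(self, activation):
        self.alpha = min(self.alpha, float(activation.min()))
        self.beta = max(self.beta, float(activation.max()))

    def quant_params(self, num_bits=8):
        scale = (self.beta - self.alpha) / (2 ** num_bits - 1)
        zero_point = int(round(-self.alpha / scale))
        return scale, zero_point

observer = MinMaxObserver()
for _ in range(32):                                       # assumed: 32 calibration batches
    act = np.random.randn(16, 128).astype(np.float32)     # stand-in for a layer's activations
    observer.observe(act)
print(observer.quant_params())                            # fixed (scale, zero_point) for inference
```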

Different quantization granularities:

[Figure: quantization granularity, from per-layer to per-channel clipping ranges]

In computer vision tasks, the activation input of each layer is convolved with many different convolution filters, as shown in the figure, and each filter can have a different range of values. One way quantization methods differ is therefore the granularity at which the clipping range [α, β] is computed for the weights. They can be classified into layer quantization, group quantization and channel quantization:

a) Layer quantization: the clipping range is determined by considering all the weights of the layer's convolution filters together, as shown in the third column of the figure above. Using statistics of the layer's entire parameters (minimum, maximum, percentiles, etc.), the same clipping range is applied to all filters in the layer. Although very simple to implement, this usually leads to sub-optimal results, because the ranges of individual filters can vary greatly and filters with relatively narrow ranges lose quantization resolution.
b) Group quantization: multiple channels within a layer are grouped and a clipping range is computed per group (for activations or convolution kernels). This can help when the parameter distribution within a single convolution/activation varies greatly, and it offers a good compromise between quantization resolution and computational overhead.
c) Channel quantization: each convolution filter uses its own fixed clipping range, independent of the other channels, as shown in the last column of the figure above. In other words, every channel is assigned a dedicated scaling factor. This gives better quantization resolution and usually higher accuracy; per-channel quantization is currently the standard way to quantize convolution kernels. The sketch below contrasts the per-layer and per-channel cases.
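To make the granularity difference concrete, here is a small sketch (illustrative only; the weight shape and the artificially narrowed first filter are assumptions) that computes one per-layer scale versus one scale per output channel.

```python
import numpy as np

def per_layer_scale(w, num_bits=8):
    """One scale for the entire weight tensor (layer quantization)."""
    return np.abs(w).max() / (2 ** (num_bits - 1) - 1)

def per_channel_scales(w, num_bits=8):
    """One scale per output channel / convolution filter (channel quantization)."""
    per_filter_max = np.abs(w).reshape(w.shape[0], -1).max(axis=1)
    return per_filter_max / (2 ** (num_bits - 1) - 1)

w = np.random.randn(32, 16, 3, 3).astype(np.float32)   # (out_ch, in_ch, kH, kW)
w[0] *= 0.01                                           # one filter with a much narrower range
print(per_layer_scale(w))                              # shared scale: the narrow filter loses resolution
print(per_channel_scales(w)[:3])                       # each filter keeps its own resolution
```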

Random (stochastic) quantization:

During inference, the quantization scheme is usually deterministic, but that is not the only possibility: some works have explored stochastic quantization for quantization-aware training and reduced-precision training. The high-level intuition is that stochastic quantization may allow the network to explore more than deterministic quantization does. A small weight update might normally cause no change at all, because rounding may always return the same quantized weight; stochastic rounding gives the network a chance for such updates to take effect. The following formula shows stochastic rounding for integer quantization and binary quantization.
[Formula: stochastic rounding for integer quantization and for binary quantization]
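For the integer case, the idea can be sketched as follows (an illustrative example, not the survey's formula verbatim): round up with probability equal to the fractional part, so that the rounding is unbiased in expectation.

```python
import numpy as np

def stochastic_round(x):
    """Round up with probability (x - floor(x)), otherwise round down,
    so that E[stochastic_round(x)] = x."""
    floor = np.floor(x)
    prob_up = x - floor
    return floor + (np.random.rand(*x.shape) < prob_up)

x = np.full(10000, 2.3, dtype=np.float32)
print(stochastic_round(x).mean())   # ~2.3 on average, versus a constant 2.0 with deterministic rounding
```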

Fine-tuning methods:

[Figure: quantization-aware training (left) vs. post-training quantization (right)]

After quantization, the parameters of the neural network (NN) usually need to be adjusted. This can be done by retraining the model, which is called quantization-aware training (QAT), or without retraining, which is usually called post-training quantization (PTQ). The figure above schematically compares the two (quantization-aware training on the left, post-training quantization on the right); both are discussed further below.

  • Quantization-aware training (fake quantization)

Given a trained model, quantization may perturb the trained parameters and push the model away from the point it had converged to under floating-point training. This can be addressed by retraining the NN model with quantized parameters so that it converges to a point with better loss. A popular approach is quantization-aware training (QAT): the usual forward and backward passes are performed on the quantized model in floating point, and the model parameters are re-quantized after each gradient update. It is important that this projection is applied after the weight update has been performed in floating-point precision; performing the backward pass in floating point matters because accumulating gradients in quantized precision may produce zero gradients or gradients with large errors, especially at low precision.

An important subtlety in back-propagation is how to handle the non-differentiable operator in Equation (1). Without any approximation, the gradient of this operator is zero almost everywhere, because the rounding operation is piecewise flat. A popular way around this is to approximate the operator's gradient with the so-called straight-through estimator (STE): STE essentially ignores the rounding operation and approximates it with the identity function, as shown in the figure below.
[Figure: the straight-through estimator (STE) approximating the gradient of the rounding operator with the identity]
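The mechanics of STE can be sketched with a custom PyTorch autograd function (an illustration with an assumed int8 range and example scale, not the survey's code): the forward pass fake-quantizes the tensor, while the backward pass treats rounding as the identity.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake (simulated) int8 quantization with a straight-through estimator."""
    @staticmethod
    def forward(ctx, x, scale, zero_point):
        q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
        return (q - zero_point) * scale          # dequantize so the rest of the net stays FP32

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pretend round() is the identity, so the gradient flows straight through.
        return grad_output, None, None

x = torch.randn(4, requires_grad=True)
y = FakeQuantSTE.apply(x, torch.tensor(0.1), torch.tensor(0.0))
y.sum().backward()
print(x.grad)   # all ones: the rounding step is ignored in the backward pass
```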

Although STE is a coarse approximation, QAT has been shown to work well. The main drawback of QAT is the computational cost of retraining the NN model: restoring accuracy may take hundreds of epochs, especially for low-bit quantization. If the quantized model will be deployed for a long time and efficiency and accuracy are especially important, this investment in retraining can be worthwhile; however, this is not always the case, since some models have a relatively short lifespan.

  • Post-training quantization

An alternative to the expensive QAT approach is post-training quantization (PTQ), which performs the quantization and weight adjustment without any fine-tuning. The overhead of PTQ is therefore very low and often negligible. Unlike QAT, which needs a sufficient amount of training data for retraining, PTQ has the additional advantage that it can be applied when data is limited or unlabeled. However, this usually comes at the cost of lower accuracy than QAT, especially for low-precision quantization.

  • Zero-shot quantization (i.e. data-free)

As discussed so far, to keep the accuracy drop after quantization as small as possible we need access to all or part of the training data. First, we need to know the range of the activations so that we can clip the values and determine the proper scaling factors (usually called calibration in the literature). Second, the quantized model usually needs fine-tuning to adjust the parameters and recover the accuracy loss. In many cases, however, the original training data is not accessible during quantization, either because the training set is too large to distribute, proprietary (such as Google's JFT-300M), or sensitive due to security or privacy concerns (such as medical data). Several methods have been proposed to address this challenge, which we call zero-shot quantization (ZSQ). Following a work by Qualcomm [2], two levels of zero-shot quantization can be described:

Level 1: no data and no fine-tuning (ZSQ + PTQ).
Level 2: no data, but fine-tuning is required (ZSQ + QAT).
Level 1 allows faster and easier quantization without any fine-tuning, which is usually time-consuming and often requires an additional hyperparameter search. At this level, quantization errors can be corrected through weight equalization or the statistical parameters of BatchNorm, without any fine-tuning.
Level 2 usually gives higher accuracy, because fine-tuning helps the quantized model recover the accuracy it lost, especially in ultra-low-precision settings. The input data for Level 2 fine-tuning is mostly generated by a GAN, which can produce data approximating the original distribution from the pre-quantization model without access to any external data.

Zero-shot (i.e. data-free) quantization performs the entire quantization without access to training or validation data. This is especially important for providers who want to accelerate the deployment of customer workloads without access to the customers' data sets, and for situations where security or privacy concerns restrict access to the training data.

2.4 Advanced Quantization Concepts

  • Simulated quantization (pseudo-quantization) and integer-only quantization (fixed-point quantization)
    [Figure: full-precision inference (left), simulated quantization (middle), integer-only quantization (right)]

There are two common ways to deploy a quantized NN model: simulated quantization (aka fake/pseudo quantization) and integer-only quantization (aka fixed-point quantization). In simulated quantization the quantized model parameters are stored in low precision, but the operations (such as matrix multiplication and convolution) are carried out in floating-point arithmetic, so the quantized parameters must be de-quantized before the floating-point operations, as shown in the middle of the figure above. As a result, simulated quantization cannot fully benefit from fast, efficient low-precision logic. In integer-only quantization, by contrast, all operations are performed with low-precision integer arithmetic, as shown on the right of the figure above; the entire inference can then run with efficient integer arithmetic without de-quantizing any parameters or activations to floating point.

In general, performing inference in full precision with floating-point arithmetic may help the final quantization accuracy, but at the cost of forgoing low-precision logic, which has several advantages over full-precision logic in latency, power consumption and area efficiency. Compared with simulated/fake quantization, integer-only quantization and binary quantization are therefore preferable, because integer-only methods use lower-precision logic for arithmetic while simulated quantization falls back to floating-point logic. This does not mean that fake quantization is never useful: it is well suited to problems that are bandwidth-bound rather than compute-bound, for example recommendation systems, where the bottleneck is the memory footprint and the cost of loading parameters from memory. In such cases fake quantization is acceptable.
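The contrast can be sketched as follows (an illustration with assumed shapes and int8 ranges, not a production kernel): an integer-only linear layer accumulates in int32 and re-quantizes the result, whereas simulated quantization first de-quantizes the operands back to FP32 and multiplies in floating point. Real integer-only kernels fold the combined scale into a fixed-point multiplier; a float multiplier is used here only to keep the sketch short.

```python
import numpy as np

def int_only_linear(x_q, w_q, s_x, s_w, z_x, z_w, s_y, z_y):
    """Integer-only matmul: int32 accumulation, then requantization to int8."""
    acc = (x_q.astype(np.int32) - z_x) @ (w_q.astype(np.int32) - z_w)  # int32 accumulator
    multiplier = (s_x * s_w) / s_y           # folded into a fixed-point multiplier in practice
    y_q = np.round(acc * multiplier) + z_y
    return np.clip(y_q, -128, 127).astype(np.int8)

def simulated_linear(x_q, w_q, s_x, s_w, z_x, z_w):
    """Simulated quantization: dequantize first, then compute in FP32."""
    x = s_x * (x_q.astype(np.float32) - z_x)
    w = s_w * (w_q.astype(np.float32) - z_w)
    return x @ w

x_q = np.random.randint(-128, 128, size=(1, 64), dtype=np.int8)
w_q = np.random.randint(-128, 128, size=(64, 32), dtype=np.int8)
print(int_only_linear(x_q, w_q, 0.02, 0.01, 0, 0, 0.05, 0).shape)
print(simulated_linear(x_q, w_q, 0.02, 0.01, 0, 0).shape)
```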

  • Mixed precision quantization
    [Figure: mixed-precision quantization, with a different bit width per layer]

It is easy to see that hardware performance improves as we quantize to lower precision. However, uniformly quantizing a model to ultra-low precision may cause a significant accuracy drop. This can be addressed with mixed-precision quantization, in which each layer is quantized with a different bit precision, as shown above. One challenge of this approach is that the search space over bit settings is exponential in the number of layers, and different methods have been proposed to tackle this huge search space:
a) Choosing the precision for each layer is essentially a search problem, and many different search methods have been proposed.

b) Another class of mixed-precision methods trains the mixed-precision model with periodic-function regularization, automatically distinguishing the importance of different layers and their sensitivity to accuracy while learning their respective bit widths.

c) HAWQ introduces an automatic way to find the mixed-precision setting based on the model's second-order sensitivity.
Mixed-precision quantization has proven to be an effective, hardware-friendly method for low-precision quantization of different neural network models. In this approach the layers of the NN are grouped into quantization-sensitive and insensitive ones, assigned high and low bit widths respectively, so one can minimize the accuracy degradation while still benefiting from the reduced memory footprint and the speedup of low-precision quantization.

  • Hardware-aware quantization

One of the goals of quantization is to reduce inference latency. However, not all hardware delivers the same speedup when a given layer/operation is quantized; in fact the benefit of quantization is hardware-dependent, and many factors such as on-chip memory, bandwidth and the cache hierarchy affect the achievable speedup.

Taking this into account is essential to obtain the best benefit from hardware-aware quantization. The hardware latency therefore needs to be simulated, or, when the quantized operators are deployed on the hardware, the actual latency of each layer at each candidate bit precision needs to be measured.

  • Distillation-assisted quantization

An interesting line of work in quantization combines model distillation to improve quantization accuracy. Model distillation uses a large, higher-accuracy model as a teacher to help train a compact student model. During the student's training, instead of using only the ground-truth labels, distillation also uses the soft probabilities produced by the teacher, which may carry more information about the input. In other words, the overall loss function combines the student loss and the distillation loss.
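A sketch of such a combined objective (illustrative; the temperature T, the weight alpha, and the tensor shapes are assumptions, and a real quantization-aware setup would compute student_logits from the quantized student):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Overall loss = (1 - alpha) * hard-label student loss + alpha * soft-target distillation loss."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                      # the T^2 factor keeps gradient magnitudes comparable
    return (1 - alpha) * hard_loss + alpha * soft_loss

student_logits = torch.randn(8, 10, requires_grad=True)   # e.g. from the quantized student model
teacher_logits = torch.randn(8, 10)                        # from the full-precision teacher
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```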

  • Extreme quantization

Binarization is the most extreme quantization method: the quantized values are restricted to a 1-bit representation, which reduces memory requirements dramatically, by 32×. Beyond the memory advantage, binary (1-bit) and ternary (2-bit) operations can often be computed efficiently with bit-wise arithmetic, achieving significant acceleration compared with higher precisions such as FP32 and INT8. However, naive binarization causes a significant accuracy drop, so a large body of work has proposed solutions, which mainly fall into three categories:

a. Minimize the quantization error (e.g. approximate the full-precision weights with a combination of several binary matrices, or use a wider network)
b. Improve the loss function (e.g. add a distillation loss)
c. Improve the training method (e.g. replace sign with a smoother function in the backward pass); a minimal sketch combining (a) and (c) follows this list.
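As a minimal sketch of categories (a) and (c) (an illustration only; the XNOR-Net-style scaling factor and the clipped-identity gradient are common choices, not necessarily the exact methods surveyed), binarization can be written as a custom autograd function:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """1-bit weights: alpha * sign(w) in the forward pass (alpha = mean |w| reduces the
    quantization error), clipped identity as the gradient approximation in the backward pass."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w) * w.abs().mean()

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        return grad_output * (w.abs() <= 1).float()   # zero the gradient outside [-1, 1]

w = torch.randn(5, requires_grad=True)
b = BinarizeSTE.apply(w)
b.sum().backward()
print(b, w.grad)
```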

Very low-precision quantization is a very promising research direction. However, existing methods usually suffer a large accuracy drop relative to the baseline unless very extensive tuning and hyperparameter search are performed; for less critical applications this accuracy loss may be acceptable.

2.5 Future Directions for Quantization

Here we briefly discuss several challenges and opportunities for future quantization research, grouped into quantization tooling, hardware and NN architecture co-design, combining multiple compression methods, and quantized training.

Quantization tools:
With current methods it is straightforward to quantize and deploy different NN models to INT8 without significant accuracy loss, and several packages can deploy INT8-quantized models (for example Nvidia's TensorRT and TVM), each with good documentation. Software for lower-bit quantization, however, is not widely available and sometimes does not exist: for example, Nvidia's TensorRT currently does not support quantization other than INT8, and INT4 support was only recently added to TVM. Low-precision and mixed-precision quantization with INT4/INT8 is effective and necessary in practice, so developing efficient software APIs for low-precision quantization will have an important impact.
Hardware and NN architecture co-design:
As mentioned above, an important difference between classical low-precision work and recent machine-learning work is that neural network parameters may take very different quantized values and still generalize and approximate well. For example, with quantization-aware training we may converge to a different solution, far from the original single-precision solution, yet still obtain good accuracy. One can exploit this degree of freedom, or adjust the NN architecture while it is being quantized: for instance, changing the width of the architecture can reduce or eliminate the generalization gap after quantization. One direction for future work is to jointly adjust other architecture parameters, such as depth or individual kernels, while the model is quantized; another is to extend this co-design to the hardware architecture.
Combining multiple compression methods:
As noted above, quantization is only one way to deploy NNs efficiently; other methods include efficient NN architecture design, hardware/NN co-design, pruning, knowledge distillation, and network architecture search. Quantization can be combined with these, but little work so far explores the best combinations. For example, NAS can be combined with quantization, and pruning and quantization can be applied jointly to reduce a model's overhead; understanding the best combination of structured/unstructured pruning with quantization is important. Similarly, studying how these techniques combine with the other methods mentioned above is another direction for the future.
Quantized training:
Perhaps the most important use of quantization is to accelerate NN training with half precision, which enables faster and more energy-efficient low-precision logic during training. Pushing this further to INT8 training is very difficult. Although there is interesting work in this area, the proposed methods usually require extensive hyperparameter tuning or only work for relatively easy learning tasks; the fundamental problem is that at INT8 precision, training can become unstable and diverge. Solving this challenge could have a significant impact on many applications, especially training at the edge.

3. Interpretation of Quantization Papers

3.1 Data-Free Quantization

Data-Free Quantization Through Weight Equalization and Bias Correction[2]. (ICCV2019)

This paper from Qualcomm proposes a data-free approach to post-training quantization. Its main innovations are weight equalization and bias correction.

As shown in the figure below (left), the value ranges of the convolution kernels within the same layer are extremely uneven, while per-channel quantization is comparatively complicated and costs extra time. The paper therefore proposes to equalize the weights, as in the figure below (right); after equalization, the ranges are distributed much more evenly across channels.
[Figure: per-channel weight ranges before equalization (left) and after equalization (right)]

Weight equalization relies on an equivalent rescaling, as in the formula:
[Formula: scaling equivalence between consecutive layers used for weight equalization]

That is, for consecutive convolutional or fully connected layers, scaling the weights of the previous layer per channel can be compensated by the inverse scaling factor in the next layer, leaving the network's output unchanged.
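A sketch of this rescaling for two fully connected layers separated by a ReLU (illustrative only; the choice s_i = sqrt(r1_i / r2_i), which equalizes the per-channel ranges, follows the spirit of the paper, and the layer sizes are arbitrary):

```python
import numpy as np

def equalize_pair(w1, b1, w2):
    """y = W2 * relu(W1 x + b1) is unchanged if row i of W1 (and b1[i]) is divided by s_i
    and column i of W2 is multiplied by s_i (ReLU is positive-homogeneous)."""
    r1 = np.abs(w1).max(axis=1)        # per-output-channel range of layer 1
    r2 = np.abs(w2).max(axis=0)        # per-input-channel range of layer 2
    s = np.sqrt(r1 / r2)               # equalizes the two ranges channel by channel
    return w1 / s[:, None], b1 / s, w2 * s[None, :]

w1 = np.random.randn(8, 16) * np.logspace(-2, 1, 8)[:, None]   # very uneven channel ranges
b1 = np.random.randn(8)
w2 = np.random.randn(4, 8)
w1_eq, b1_eq, w2_eq = equalize_pair(w1, b1, w2)
print(np.abs(w1).max(axis=1).round(3))     # before: ranges differ by orders of magnitude
print(np.abs(w1_eq).max(axis=1).round(3))  # after: per-channel ranges are far more balanced
```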

Bias correction addresses the systematic shift that quantization introduces into the data: by computing this shift and compensating for it, a more accurate quantization is obtained. As shown in the table below, the two proposed methods allow per-layer quantization to reach results close to per-channel quantization.
[Table: results of per-layer quantization with equalization and bias correction vs. per-channel quantization]

Post training 4-bit quantization of convolutional networks for rapid-deployment[3]. (NIPS2019)

This paper performs post-training quantization at 4-bit precision and proposes analytical computation of the optimal clipping range (ACIQ) together with per-channel bit-width allocation. ACIQ takes minimizing the quantization error as the objective and derives the optimal clipping value for each quantization precision. It also proposes that, with the total bit budget fixed, different channels can be quantized with different precisions, giving a better fit and more accurate quantization; however, assigning different bit widths to different channels is hard to accelerate in industrial deployments. The table below shows the experimental results: under 4-bit quantization, the model accuracy drops by about 2%.
[Table: post-training 4-bit quantization results]

Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks[4].(ICCV2019)

Low-bit quantization can effectively speed up inference and reduce the memory consumption of deep neural networks, which is essential for deploying models on resource-limited devices such as embedded hardware. However, because of the discrete nature of low-bit quantization, existing methods often suffer from unstable training and severe performance degradation. To address this, the paper proposes Differentiable Soft Quantization (DSQ) to bridge the gap between full-precision and low-bit networks. DSQ evolves automatically during training, gradually approaching standard quantization, as shown in the figure below. Thanks to its differentiability, DSQ can track accurate gradients in the backward pass within an appropriate clipping range, reducing the quantization loss in the forward pass.
[Figures: the DSQ function gradually approaching the standard quantization (step) function during training]

DSQ uses a series of hyperbolic-tangent functions to gradually approximate the step function used for low-bit quantization (e.g. the sign function in the 1-bit case), while remaining smooth and easy to differentiate. The paper parameterizes the DSQ function with an approximation characteristic variable and develops an evolution-based training strategy that gradually learns this differentiable quantization function. During training, how closely DSQ matches standard quantization is controlled by the characteristic variable, and both the characteristic variable and the clipping values can be determined automatically by the network. DSQ reduces the bias caused by extremely low-bit quantization, making the forward and backward passes during training more consistent and stable.

The table below shows that, at lower bit widths, this method has essentially no accuracy degradation compared with FP32.
[Table: DSQ accuracy at low bit widths vs. FP32]

Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search[5].(ICLR2019)

[Figure: the Differentiable NAS (DNAS) search process]

As shown in the figure above, the paper proposes Differentiable NAS (DNAS). Rather than weighting the candidate operations directly with a softmax, DNAS randomly activates candidate operations with probability θ, and θ is treated as a parameter that is optimized during the search. Because DNAS starts from equal probabilities, so the activation is essentially random, while the probabilities θ become markedly different later in training, DNAS behaves like DARTS early in training and like ENAS later on.

To achieve mixed-precision quantization, the paper treats different bit-width quantizations as different candidate operations on top of DNAS (see the figure below), and the search then yields a different quantization precision for each layer.
[Figure: bit widths as candidate operations in the DNAS search space for mixed-precision quantization]

Mixed-precision quantization is thus obtained through differentiable search. As the table below shows, on the CIFAR-10 dataset the model can be accelerated by more than 10 times, with post-quantization accuracy slightly higher than FP32.
[Table: DNAS mixed-precision results on CIFAR-10]

APQ: Joint Search for Network Architecture, Pruning and Quantization Policy[6]. (CVPR2020)

[Figure: APQ jointly searching over network architecture, pruning policy, and quantization policy]

This paper proposes APQ for efficient deep learning inference on resource-constrained hardware. Unlike previous methods that search the neural architecture, pruning policy and quantization policy separately, APQ optimizes them jointly. To cope with the much larger design space this creates, the paper trains a quantization-aware accuracy predictor that quickly estimates the accuracy of a quantized model and feeds it back to the search engine to select the best candidate.

Training this quantization-aware accuracy predictor, however, would require collecting a large number of (quantized model, accuracy) pairs, which involves quantization-aware fine-tuning and is therefore very time-consuming. To address this, the paper transfers knowledge from a full-precision (i.e. FP32) accuracy predictor to the quantization-aware (i.e. INT8) accuracy predictor, which greatly improves sample efficiency. Moreover, collecting the dataset for the FP32 predictor only requires evaluating networks sampled from the pre-trained once-for-all [7] network, without any training cost, which is efficient. The figure below shows how the full-precision predictor is transferred to train the quantization-aware accuracy predictor.
[Figure: transferring the full-precision accuracy predictor to the quantization-aware accuracy predictor]

The following table shows the results on the ImageNet dataset:
[Table: APQ results on ImageNet]

References:

  1. A Survey of Quantization Methods for Efficient Neural Network Inference.
  2. Data-Free Quantization Through Weight Equalization and Bias Correction, ICCV2019.
  3. Post training 4-bit quantization of convolutional networks for rapid-deployment, NIPS2019.
  4. Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks, ICCV2019.
  5. Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search, ICLR2019.
  6. APQ: Joint Search for Network Architecture, Pruning and Quantization Policy, CVPR2020.
  7. Once for all: Train one network and specialize it for efficient deployment, ICLR2020.
  8. Binary Neural Networks: A Survey, Pattern Recognition, 2020.
  9. Forward and Backward Information Retention for Accurate Binary Neural Networks, CVPR2020.


