
Welcome to follow my public account [Jizhi Vision]; reply "001" to get the Google coding style guides.

O_o >_< o_O O_o ~_~ o_O

Hello everyone, I am Jizhi Vision. This article analyzes the implementation of the ACIQ symmetric quantization algorithm, taking the Tengine implementation as an example.

This is the third article in my quantization implementation series. Parts one and two are linked below for anyone interested:

(1) "[Model Inference] Quantization Implementation Sharing 1: Detailed explanation of the MIN-MAX symmetric quantization algorithm implementation";

(2) "[Model Inference] Quantization Implementation Sharing 2: Detailed explanation of the KL symmetric quantization algorithm implementation";

ACIQ is similar to the previous quantization strategies: it also picks a clipping threshold T and maps [-T, T] onto the quantization range. The difference lies in how T is found. This article covers not only the principle but also the implementation of the strategy in Tengine. Let's get started.

1. ACIQ quantization strategy principle

The ACIQ quantization strategy was proposed in the paper "Post training 4-bit quantization of convolutional networks for rapid-deployment". First, the reported results:

The comparison above uses 8-bit weight quantization with 4-bit activation quantization. In terms of quantization efficiency, ACIQ is about 4000 times faster than KL quantization (unbelievable~). In terms of quantization accuracy, every tested network except ResNet-101 quantizes better with ACIQ than with KL, so it delivers both efficiency and accuracy.

At the beginning of the paper the authors write: "Unlike traditional approaches that focus on the quantization at the network level, in this work we propose to minimize the quantization effect at the tensor level." So ACIQ is a quantization strategy that works at the tensor level. The overall derivation logic is:

(1) first, derive a generic expression for the expected MSE as a function of the clipping value, for any given distribution;

(2) then use this expression to develop a specific expression for each distribution;

(3) finally, establish the optimal clipping values by solving the equations for which the derivative with respect to the clipping value is set to zero.

Clipping is usually needed during quantization to cope with the long tail of the original data distribution. Assuming α is the clipping value, the clipping operation can be expressed as:
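
In the paper's notation, the clipping function maps everything outside [-α, α] to the boundary:

$$
\text{clip}(x, \alpha) = \begin{cases} x, & |x| \le \alpha \\ \text{sign}(x) \cdot \alpha, & |x| > \alpha \end{cases}
$$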

ACIQ relies on a strong prior assumption: the tensor (feature map) follows a Laplace or a Gaussian distribution. It then uses an optimization approach to solve for the clipping value that minimizes the quantization loss. The whole quantization process maps values from the original distribution onto 2^M discrete quantized values, where M is the number of quantization bits; in other words, the range [-α, α] above is divided into 2^M equal segments, as illustrated in the figure below:
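
Concretely, with M quantization bits the step size and the per-segment midpoints (which serve as the quantized values) are:

$$
\Delta = \frac{2\alpha}{2^M}, \qquad q_i = -\alpha + \left(i - \tfrac{1}{2}\right)\Delta, \quad i = 1, 2, \dots, 2^M
$$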

Assuming the probability density function of the original distribution is f(x), with clipping value α and quantization function Q(x), the L2 loss before and after quantization can be written as follows:
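
With the step size Δ and midpoints q_i defined above (a reconstruction of the paper's expression):

$$
E\left[(X - Q(X))^2\right] = \int_{-\infty}^{-\alpha} f(x)\,(x + \alpha)^2\,dx \;+\; \sum_{i=1}^{2^M} \int_{-\alpha + (i-1)\Delta}^{-\alpha + i\Delta} f(x)\,(x - q_i)^2\,dx \;+\; \int_{\alpha}^{+\infty} f(x)\,(x - \alpha)^2\,dx
$$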

The expression above naturally splits into three parts:

(1) (-∞, -α];

(2) [-α, α];

(3) [α, +∞);

For a zero-symmetric distribution such as the Gaussian N(0, σ^2) or the Laplace(0, b), parts (1) and (3) are equivalent: they are the mean squared error between |x| and α over the clipped tails. After mapping [-α, α] uniformly onto 2^M levels, each quantized value takes the midpoint of its segment, q_1, q_2, ..., q_{2^M}, and part (2) accumulates the rounding error inside the clipping range. The whole quantization problem thus becomes minimizing E[(X - Q(X))^2] (deep learning is a math problem in the end~~). Combining this with the prior distribution and doing some equivalent transformations of the formula yields the final optimization objective for the overall quantization loss:
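
As an illustration of what this objective looks like, take the Laplace(0, b) prior: the two clipped tails contribute 2·b^2·e^{-α/b}, and the rounding error inside [-α, α] is approximated by the usual Δ^2/12 term, so:

$$
E\left[(X - Q(X))^2\right] \approx 2 b^2 e^{-\alpha/b} + \frac{\alpha^2}{3 \cdot 2^{2M}}
$$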

Mathematically, minimizing the objective function means taking the partial derivative with respect to the clipping value and setting it to zero.

For the Laplace distribution, setting the partial derivative to zero gives:
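
Differentiating the approximate Laplace objective above with respect to α (a sketch, with b the Laplace scale parameter):

$$
\frac{\partial\, E\left[(X - Q(X))^2\right]}{\partial \alpha} \approx -2 b\, e^{-\alpha/b} + \frac{2\alpha}{3 \cdot 2^{2M}} = 0 \quad\Longrightarrow\quad b\, e^{-\alpha/b} = \frac{\alpha}{3 \cdot 2^{2M}}
$$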

For the Gaussian distribution, setting the partial derivative to zero gives:
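
For reference, carrying out the same steps with a Gaussian N(0, σ^2) prior (clipped-tail error plus the Δ^2/12 rounding term) leads to a condition of the form:

$$
\frac{\partial\, E\left[(X - Q(X))^2\right]}{\partial \alpha} \approx 2\alpha\,\mathrm{erfc}\!\left(\frac{\alpha}{\sqrt{2}\,\sigma}\right) - 2\sigma\sqrt{\frac{2}{\pi}}\, e^{-\alpha^2/(2\sigma^2)} + \frac{2\alpha}{3 \cdot 2^{2M}} = 0
$$

There is no closed-form solution, but solving it numerically gives α ≈ 1.71σ, 2.15σ, 2.56σ, 2.94σ, 3.29σ, 3.62σ, 3.92σ for M = 2...8 bits, which are exactly the constants that reappear in the Tengine code below.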

Finally, whether for the Laplace or the Gaussian distribution, M is the bit width you want to quantize to, and b (the Laplace scale parameter) or σ (the Gaussian standard deviation) is a known value, so we can naturally solve for the clipping value α we want. For symmetric quantization, a single clipping value is all we need.

2. ACIQ quantization strategy implementation

Let's look at the implementation of ACIQ in Tengine.

The main entry code for the quantization:

case ALGORITHM_ACIQ:
{
    if (quant_tool.scale_file.empty())
    {
        quant_tool.scale_file = "table_aciq.scale";
        quant_tool.activation_quant_tool();    // compute activation scales with ACIQ
    }
    /* save the int8 per-channel quantized graph */
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* evaluate quantization losses */
    if (quant_tool.evaluate)
    {
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}

2.1 Activation value quantization

Activation value quantization entry:

quant_tool.activation_quant_tool();

The first step is to find the min and max values. This follows the same logic as the quantization strategies covered earlier, so I won't go over it again; then the ACIQ strategy kicks in:

for (int i = 0; i < ir_graph->tensor_num; i++)
{
    struct tensor* t = ir_graph->tensor_list[i];
    if (t->tensor_type == TENSOR_TYPE_VAR || t->tensor_type == TENSOR_TYPE_INPUT)
    {
        float absmax = 0.f;
        float act_scale = 1.f;
        int act_zero_point = 0;
        int emlement_num = t->elem_num;

        absmax = std::max(std::abs(max_activation[i]), std::abs(min_activation[i]));
        float threshold = compute_aciq_gaussian_clip(absmax, emlement_num, 8);
        act_scale = threshold / 127.f;

        /* the scale of softmax is always scale = 1 / 127.f */
        for (int j = 0; j < ir_graph->node_num; j++)
        {
            struct node* noden = ir_graph->node_list[j];
            struct tensor* tensor_tmp = get_ir_graph_tensor(ir_graph, noden->output_tensors[0]);

            if (!(tensor_tmp->tensor_type == TENSOR_TYPE_INPUT || tensor_tmp->tensor_type == TENSOR_TYPE_VAR))
                continue;

            std::string tmp_op_name = get_op_name_from_type(noden->op.type);
            std::string cur_name = t->name;
            std::string tmp_name = tensor_tmp->name;

            if ((cur_name == tmp_name) && tmp_op_name == "Softmax")
            {
                act_scale = 1 / 127.f;
                break;
            }
        }

        fprintf(fp_aciq, "%s %f %d\n", ir_graph->tensor_list[i]->name, act_scale, act_zero_point);
    }
}

The key is this function. Tengine defaults to a Gaussian prior and quantizes to int8:

float threshold = compute_aciq_gaussian_clip(absmax, emlement_num, 8);

Take a look at its implementation:

static float compute_aciq_gaussian_clip(float absmax, int N, int num_bits)
{
    // clipping coefficients for 1~8-bit quantization (index = num_bits - 1);
    // for 8-bit quantization, alpha = 3.92403714 * std
    const float alpha_gaussian[8] = {0, 1.71063519, 2.15159277, 2.55913646, 2.93620062, 3.28691474, 3.6151146, 3.92403714};

    const double gaussian_const = (0.5 * 0.35) * (1 + sqrt(3.14159265358979323846 * log(4)));

    // estimate sigma from the observed range: for N Gaussian samples the expected
    // extreme grows like sigma * sqrt(2 * ln(N)), so sigma ~ range / sqrt(2 * ln(N))
    double std = (absmax * 2 * gaussian_const) / sqrt(2 * log(N));

    return (float)(alpha_gaussian[num_bits - 1] * std);
}

This gives the clipping threshold (for int8, 3.92403714 times the estimated σ), from which the scale follows directly:

act_scale = threshold / 127.f;
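
Putting the pieces together, here is a minimal standalone sketch (not Tengine code; the absmax value and element count below are hypothetical, just to show how the threshold becomes an int8 scale):

// Standalone sketch: estimate sigma from absmax, look up the 8-bit clipping
// coefficient, and derive the symmetric int8 activation scale.
#include <cmath>
#include <cstdio>

static float compute_aciq_gaussian_clip(float absmax, int N, int num_bits)
{
    const float alpha_gaussian[8] = {0.f, 1.71063519f, 2.15159277f, 2.55913646f,
                                     2.93620062f, 3.28691474f, 3.6151146f, 3.92403714f};
    const double gaussian_const = (0.5 * 0.35) * (1 + std::sqrt(3.14159265358979323846 * std::log(4.0)));
    double std_est = (absmax * 2 * gaussian_const) / std::sqrt(2 * std::log((double)N));
    return (float)(alpha_gaussian[num_bits - 1] * std_est);
}

int main()
{
    float absmax   = 6.0f;             // hypothetical |max| of a feature map
    int   elem_num = 1 * 64 * 56 * 56; // hypothetical number of elements in the tensor
    float threshold = compute_aciq_gaussian_clip(absmax, elem_num, 8);
    float act_scale = threshold / 127.f; // symmetric int8 scale, as in the snippet above
    std::printf("threshold = %f, act_scale = %f\n", threshold, act_scale);
    return 0;
}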

This completes the quantization of the activation value.

2.2 Weights & Bias Quantization

The weight & bias quantization process follows the same logic as the MIN-MAX and KL quantization described earlier, so I won't repeat it here.

Finally, in practice you will find that the ACIQ quantization process is very fast; the claim of being 4000 times faster than KL quantization is no exaggeration, mainly because with the Gaussian prior the values alpha_gaussian, gaussian_const and std are computed in closed form, with no search required.

The above shared the principle and implementation of ACIQ quantization. I hope my sharing is of some help to your study.


【Public Account Portal】
"[Model Inference] Quantization Implementation Sharing 3: Detailed explanation of the ACIQ symmetric quantization algorithm implementation"

