
Welcome to follow my public account [Jizhi Vision]; reply "001" to get the Google coding style guides.


Hello everyone, I am Jizhi Vision. This article analyzes the implementation of the min-max symmetric quantization algorithm, taking Tengine's implementation as the example.

Tengine is an excellent edge-side deep learning inference framework open sourced by OpenAILab. Its core is implemented mainly in C, with the higher-level wrapper code written in C++. Quantization is an indispensable optimization step for inference acceleration. Mature inference frameworks, such as Tengine, NCNN, Ascend, and Cambricon, generally split the quantization module out into an independent tool, mainly because the quantization process is not strongly tied to the hardware, and decoupling it leaves more room to maneuver.

The min-max and KL quantization algorithms are the basis, and the standard equipment, for hardware vendors adapting inference engines. Of the two, KL quantization is very popular with users; NVIDIA's TensorRT, for example, also adopts a KL quantization strategy. The min-max strategy introduced here is characterized by simple logic and good results, which makes it well suited as the opening chapter of this series on quantization implementations. Below we walk through the concrete implementation of the min-max quantization strategy in Tengine.

1. Quantization usage

Quantization mainly falls into activation (dynamic) quantization and weight & bias (static) quantization. Weight & bias quantization has a relatively large impact on accuracy, while activation quantization has a smaller impact on the overall result, but both have to be quantized for the whole to reach a satisfactory result. In general, weights & biases are quantized channel by channel (perChannel), while activations are quantized layer by layer (perLayer). Why is that? In quantization, convolution is by far the biggest workload, and quantizing activations per channel would conflict with the parameter sharing of convolution kernels; so activations are generally quantized per layer to fit convolution's parameter sharing.
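
As a minimal, standalone sketch of the two granularities (this is not Tengine code; the function names and the [channel][cstep] layout are assumptions made here for illustration):

#include <algorithm>
#include <cmath>
#include <vector>

// perLayer: one scale for the whole tensor (how activations are handled).
float per_layer_scale(const std::vector<float>& data)
{
    float vmax = *std::max_element(data.begin(), data.end());
    float vmin = *std::min_element(data.begin(), data.end());
    return std::max(std::abs(vmax), std::abs(vmin)) / 127.f;
}

// perChannel: one scale per output channel (how weights & biases are handled).
// Assumes a [channel][cstep] layout, matching the Tengine weight code later on.
std::vector<float> per_channel_scales(const std::vector<float>& data, int channels)
{
    int cstep = (int)(data.size() / channels);
    std::vector<float> scales(channels);
    for (int ch = 0; ch < channels; ch++)
    {
        const float* begin = data.data() + ch * cstep;
        float vmax = *std::max_element(begin, begin + cstep);
        float vmin = *std::min_element(begin, begin + cstep);
        scales[ch] = std::max(std::abs(vmax), std::abs(vmin)) / 127.f;
    }
    return scales;
}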

Here we mainly look at the parameters required by Tengine's quantization tool:

  • Input model: the imported fp32 tmfile model file;
  • Output model: the generated int8 tmfile model file;
  • Calib images: the calibration images fed in for activation quantization;
  • Scale file: the generated calibration-table file;
  • Algorithm: the quantization algorithm; one of MIN-MAX, KL, ACIQ, DFQ, EQ;
  • Dims: the input shape of the calibration images, given as three dimensions (c, h, w); n is hard-coded to 1 in the code;
  • Mean: the per-channel mean used in image preprocessing;
  • Scale: the per-channel scaling factor used in image preprocessing (see the sketch after this list);
  • BGR2RGB: channel-order conversion;
  • Center crop: image preprocessing, cropping;
  • Letter box: image preprocessing, resizing the image while keeping its aspect ratio;
  • YOLOv5 focus: Focus-style preprocessing similar to YOLOv5's;
  • Thread num: the number of threads used during quantization;
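
A side note on Mean and Scale: they follow the common per-channel normalization convention out = (in - mean) * scale. The snippet below is only a sketch of that convention under an assumed CHW float layout, not the actual get_input_data_cv implementation:

// A sketch of the usual normalization applied by Mean and Scale:
// out = (in - mean[c]) * scale[c], per channel. The CHW float layout and the
// function name are assumptions made here for illustration.
static void normalize_chw(float* data, int c, int h, int w, const float* mean, const float* scale)
{
    for (int ch = 0; ch < c; ch++)
        for (int i = 0; i < h * w; i++)
            data[ch * h * w + i] = (data[ch * h * w + i] - mean[ch]) * scale[ch];
}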

2. Min-max quantization

min-max is the simplest quantization algorithm. Its main logic is to take the maximum absolute value of the data and map it symmetrically onto the int8 range.
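
Written out (the symbols x, scale, and q are chosen here to match the comments in the code below), the symmetric scheme is:

scale = max(|min|, |max|) / 127
q = clip(round(x / scale), -127, 127)
x ≈ q * scale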

The main code to implement the min-max method in Tengine is as follows:

case ALGORITHM_MIN_MAX:{
    if (quant_tool.scale_file.empty()){
        quant_tool.scale_file = "table_minmax.scale";
        quant_tool.activation_quant_tool();
    }
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* Evaluate quantitative losses */
    if (quant_tool.evaluate){
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}

The most important interfaces, where the quantization search strategy lives, are quant_tool.activation_quant_tool() and save_graph_i8_perchannel . For min-max, these two interfaces do two things:

(1) Quantize the activation values and generate table_minmax.scale ;

(2) Quantize the weights & biases and generate scale_weight.txt and scale_bias.txt ;

2.1 Activation value quantization

To read the Tengine source code, you must get a firm grip on struct graph* ir_graph ; the graph structure is the essence of it all.

Activation quantization is a dynamic process: the data distribution of each layer has to be obtained at runtime, which is why a certain number of calibration images must be fed in.

Let's look at the preprocessing module first; it is shared with the other quantization algorithms:

// Bind input_tensor to the input_data buffer; input_tensor comes from ir_graph->tensor_list.
// Note: keep this binding in mind, otherwise the later code is hard to follow.
tensor_t input_tensor = get_graph_input_tensor(ir_graph, 0, 0);

if (set_tensor_shape(input_tensor, dims, 4) < 0){
    fprintf(stderr, "Set input tensor shape failed\n");
    return -1;
}

if (set_tensor_buffer(input_tensor, input_data.data(), img_size * sizeof(float)) < 0){
    fprintf(stderr, "Set input tensor buffer failed\n");
    return -1;
}

// prerun the graph: do some initialization and configuration
if (prerun_graph_multithread(ir_graph, this->opt) < 0){
    fprintf(stderr, "Prerun multithread graph failed.\n");
    return -1;
}

// Image preprocessing writes into input_data, which is bound to input_tensor / ir_graph->tensor_list[0] above;
// modifying input_data therefore modifies ir_graph->tensor_list as well, which is the key to reading this code.
get_input_data_cv(imgs_list[nums].c_str(), input_data.data(), img_c, img_h, img_w, mean, scale, sw_RGB, center_crop, letterbox_rows, letterbox_cols, focus);

Then run the graph; the intermediate activation values are recorded in ir_graph->tensor_list[i] :

if (run_graph(ir_graph, 1) < 0){
    fprintf(stderr, "Run graph failed\n");
    return -1;
}

Collect the min and max values of each activation tensor:

/* get the min/max value of activation tensor */
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* act_tensor = ir_graph->tensor_list[i];
    if (act_tensor->tensor_type == TENSOR_TYPE_VAR || act_tensor->tensor_type == TENSOR_TYPE_INPUT){
        float* start_addr = (float*)act_tensor->data;
        float* end_addr = (float*)act_tensor->data + act_tensor->elem_num;
        max_activation[i] = std::max(max_activation[i], *std::max_element(start_addr, end_addr));
        min_activation[i] = std::min(min_activation[i], *std::min_element(start_addr, end_addr));}
}

Calculate the activation quantization scale. For Softmax layers, the scale is fixed at 1 / 127.f :

/* save the calibration file with min-max algorithm */
FILE* fp_minmax = fopen("table_minmax.scale", "wb");
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* t = ir_graph->tensor_list[i];
    if (t->tensor_type == TENSOR_TYPE_VAR || t->tensor_type == TENSOR_TYPE_INPUT){
        float act_scale = 1.f;
        int act_zero_point = 0;

        act_scale = std::max(std::abs(max_activation[i]), std::abs(min_activation[i])) / 127.f;

        /* the scale of softmax is always scale = 1 / 127.f */
        for (int j = 0; j < ir_graph->node_num; j++){
            struct node* noden = ir_graph->node_list[j];
            struct tensor* tensor_tmp = get_ir_graph_tensor(ir_graph, noden->output_tensors[0]);

            if (!(tensor_tmp->tensor_type == TENSOR_TYPE_INPUT || tensor_tmp->tensor_type == TENSOR_TYPE_VAR))
                continue;

            std::string tmp_op_name = get_op_name_from_type(noden->op.type);
            std::string cur_name = t->name;
            std::string tmp_name = tensor_tmp->name;

            if ((cur_name == tmp_name) && tmp_op_name == "Softmax"){
                act_scale = 1 / 127.f;
                break;}
        }

        fprintf(fp_minmax, "%s %f %d\n", ir_graph->tensor_list[i]->name, act_scale, act_zero_point);}
}
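
As a quick worked example with made-up numbers: if a tensor's range over all calibration images is [-3.2, 6.8], then act_scale = max(3.2, 6.8) / 127 = 6.8 / 127 ≈ 0.0535, and an fp32 activation of 1.0 would later be quantized to round(1.0 / 0.0535) ≈ 19.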

2.2 Weights & Bias Quantization

Weight & bias quantization differs from activation quantization: activation quantization requires running inference on the calibration images to obtain the dynamic distribution of the data, while weights & biases are static, so the quantization itself does not require a forward pass.

2.2.1 Create graph

Load tmfile and build graph:

struct graph* ir_graph = (struct graph*)create_graph(nullptr, "tengine", model_file);
if (nullptr == ir_graph){
    fprintf(stderr, "Create graph failed.\n");
    return -1;}

2.2.2 Optimize activation value quantization scale

This step is mainly the quant.inplace optimization, a strategy for handling the quantization scales of non-convolution operators. Roughly speaking, when inplace is enabled, operators that do not change the value distribution (Flatten, Reshape, Squeeze, Clip, Slice, max Pooling, ReLU with zero negative slope) have the scale of their output tensor propagated back to their input tensor via recursion_pass_through, so that input and output share one scale.

if (inplace == 0){
    for (int i = 0; i < ir_graph->tensor_num; i++){
        struct tensor* ir_tensor = ir_graph->tensor_list[i];
        if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
            ir_tensor->scale = layer_scale[ir_tensor->name];
            ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];}}
}
else{
        std::tr1::unordered_map<std::string, bool> layer_pass;
        for (int i = ir_graph->tensor_num - 1; i >= 0; i--){
            struct tensor* ir_tensor = ir_graph->tensor_list[i];
            if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
                if (layer_pass[ir_tensor->name] == false){
                    uint32_t ir_node_idx = ir_tensor->producer;
                    struct node* t_node = ir_graph->node_list[ir_node_idx];

                    std::string op_name = get_op_name_from_type(t_node->op.type);

                    bool poolTrue = false;
                    bool reluTrue = false;
                    if (op_name == "Pooling"){
                        struct pool_param* pool_param = (struct pool_param*)t_node->op.param_mem;
                        if (pool_param->pool_method == 0)
                            poolTrue = true;
                    }
                    else if (op_name == "ReLU"){
                        struct relu_param* relu_param = (struct relu_param*)t_node->op.param_mem;
                        if (relu_param->negative_slope == 0.f)
                            reluTrue = true;
                    }
                    if (op_name == "Flatten" || op_name == "Reshape" || op_name == "Squeeze" || op_name == "Clip" || op_name == "Slice" || poolTrue || reluTrue){
                        struct tensor* t_in_tensor = ir_graph->tensor_list[t_node->input_tensors[0]];
                        if (layer_scale[ir_tensor->name] != 0){
                            ir_tensor->scale = layer_scale[ir_tensor->name];
                            ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];

                            if (t_in_tensor->tensor_type == TENSOR_TYPE_VAR || t_in_tensor->tensor_type == TENSOR_TYPE_INPUT){
                                recursion_pass_through(ir_graph, ir_tensor->name, t_in_tensor, layer_used, layer_scale, layer_zeropoint, layer_pass);}}
                    }
                    else{
                        ir_tensor->scale = layer_scale[ir_tensor->name];
                        ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];
                    }
                    layer_pass[ir_tensor->name] = true;}}}
}

2.2.3 Weights & Bias Quantization

The overall flow is similar to activation quantization: first find the min and max values, then scale and clip. The difference is that here we not only compute the scale but also actually scale and clip the values, because an int8 tmfile quantized model has to be written out. Another thing to note is that weights are quantized to int8 while biases are quantized to int32: the accumulated result of the weight multiplications can easily overflow int8, so it is stored in int32 and then added to the int32 bias.
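
To see why the bias has to live in int32, here is a minimal sketch of how an int8 kernel typically accumulates (not Tengine's kernel code; the function name and signature are assumptions made for illustration):

#include <cstdint>

// Each int8 * int8 product can reach 127 * 127 = 16129, and a convolution sums
// many of them, so the accumulator (and the bias added into it) must be int32.
static int32_t dot_int8(const int8_t* w, const int8_t* x, int n, int32_t bias_i32)
{
    int32_t acc = bias_i32;
    for (int i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];
    return acc; // later requantized back to int8 with the output scale
}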

Beyond that, there is one more difference from activation quantization: activation quantization is perLayer, while weight & bias quantization is perChannel.

Weight int8 quantization:

/* quantize the weight data from fp32 to int8 */
if (op_name == "Convolution" || op_name == "FullyConnected" || op_name == "Deconvolution"){
    struct tensor* weight_tensor = ir_graph->tensor_list[noden->input_tensors[1]];

    int channel_num = weight_tensor->dims[0];
    int cstep = int(weight_tensor->elem_num / channel_num);
    float* weight_data = (float*)weight_tensor->data;
    int8_t* i8_weight_data = (int8_t*)sys_malloc(weight_tensor->elem_num * sizeof(int8_t));

    float* weight_scale_list = (float*)sys_malloc(channel_num * sizeof(float));
    int* weight_zp_list = (int*)sys_malloc(channel_num * sizeof(int));

    fprintf(fp_weight, "%s ", weight_tensor->name);
    /* calculate the quant scale value of weight perchannel, scale = abs(min, max) / 127 */
    if (internal){
        // TODO
        for (int ch = 0; ch < channel_num; ch++){
            weight_scale_list[ch] = weight_tensor->scale_list[ch];
            weight_zp_list[ch] = 0;}
    }
    else{
        for (int ch = 0; ch < channel_num; ch++){
            float* weight_data_ch_start = weight_data + ch * cstep;
            float* weight_data_ch_end = weight_data + (ch + 1) * cstep;
            float weight_max = *std::max_element(weight_data_ch_start, weight_data_ch_end);
            float weight_min = *std::min_element(weight_data_ch_start, weight_data_ch_end);

            weight_scale_list[ch] = std::max(std::abs(weight_max), std::abs(weight_min)) / 127.f;
            weight_zp_list[ch] = 0;
            fprintf(fp_weight, "%8.8f ", weight_scale_list[ch]);
        }
        fprintf(fp_weight, "\n");
    }

    /* quantize the value of weight from Float32 to Int8, value_i8 = (value_fp32 / scale).round().clip(-127, 127) */
    for (int ch = 0; ch < channel_num; ch++){
        for (int j = 0; j < cstep; j++){
            if (weight_data[ch * cstep + j] == 0 || weight_scale_list[ch] == 0)
                i8_weight_data[ch * cstep + j] = 0;
            else{
                float int8_data = round(weight_data[ch * cstep + j] / weight_scale_list[ch]);
                int8_data = int8_data > 127.f ? 127.f : int8_data;
                int8_data = int8_data < -127.f ? -127.f : int8_data;
                i8_weight_data[ch * cstep + j] = int8_t(int8_data);}}
    }

    weight_tensor->scale_list = weight_scale_list;
    weight_tensor->zp_list = weight_zp_list;
    weight_tensor->data_type = TENGINE_DT_INT8;
    weight_tensor->elem_size = sizeof(int8_t); // int8, signed char
    weight_tensor->data = i8_weight_data;
    weight_tensor->quant_param_num = channel_num;
}
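
Note that the clipping range above is [-127, 127] rather than the full int8 range [-128, 127]: dropping -128 keeps the positive and negative ranges symmetric, which is the usual convention for symmetric (zero-point-free) quantization.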

Bias int32 quantization:

/* quantize the bias data from fp32 to int32 */
if (noden->input_num > 2){
    struct tensor* input_tensor = ir_graph->tensor_list[noden->input_tensors[0]];
    struct tensor* bias_tensor = ir_graph->tensor_list[noden->input_tensors[2]];
    float* bias_scale_list = (float*)sys_malloc(bias_tensor->dims[0] * sizeof(float));
    int* bias_zp_list = (int*)sys_malloc(bias_tensor->dims[0] * sizeof(int32_t));
    float* bias_data = (float*)bias_tensor->data;
    int* int32_bias_data = (int*)sys_malloc(bias_tensor->elem_num * sizeof(int32_t));

    int bstep = int(bias_tensor->elem_num / channel_num);

    fprintf(fp_bias, "%s ", bias_tensor->name);
    /* calculate the quant scale value of bias perchannel, scale = scale_weight * scale_in */
    for (int ch = 0; ch < channel_num; ch++){
        bias_scale_list[ch] = weight_scale_list[ch] * input_tensor->scale;
        bias_zp_list[ch] = 0;
        fprintf(fp_bias, "%8.8f ", bias_scale_list[ch]);
    }
    fprintf(fp_bias, "\n");

    /* quantize the value of bias from Float32 to Int32, value_i32 = (value_fp32 / scale).round() */
    for (int ch = 0; ch < channel_num; ch++){
        for (int bi = 0; bi < bstep; bi++){
            if (bias_data[ch * bstep + bi] == 0 || bias_scale_list[ch] == 0)
                int32_bias_data[ch * bstep + bi] = 0;
            else
                int32_bias_data[ch * bstep + bi] = int(round(bias_data[ch * bstep + bi] / bias_scale_list[ch]));}
    }

    bias_tensor->scale_list = bias_scale_list;
    bias_tensor->zp_list = bias_zp_list;
    bias_tensor->data_type = TENGINE_DT_INT32;
    bias_tensor->elem_size = sizeof(int32_t); // int32, signed int
    bias_tensor->data = int32_bias_data;
    bias_tensor->quant_param_num = channel_num;}
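
Why scale_bias = scale_weight * scale_in (a short derivation using the fp32 ≈ int * scale relation from earlier; the symbols are mine, for illustration): with w ≈ w_q * s_w and x ≈ x_q * s_x, each accumulated product is w * x ≈ (w_q * x_q) * (s_w * s_x), so the int32 accumulator is expressed in units of s_w * s_x; for the bias to be added to it directly, it must be quantized with that same scale, i.e. b_q = round(b / (s_w * s_x)).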

That is about all there is to weight & bias quantization.

The above has walked through the implementation of the min-max quantization algorithm, using Tengine's code as the example. I hope this sharing helps you a little in your study.


【Public account link】
" [Model Reasoning] Quantization Implementation Sharing 1: Detailed explanation of min-max symmetric quantization algorithm implementation "

