
Welcome to follow my public account [Jizhi Vision]; reply "001" to get the Google coding style guides.

O_o >_< o_O O_o ~_~ o_O

Hello everyone, I am Jizhi Vision. This article analyzes the implementation of the KL symmetric quantization algorithm, taking the Tengine implementation as an example.

I have already written "[Model Inference] Quantization Implementation Sharing 1: Detailed explanation of the min-max symmetric quantization algorithm implementation"; interested readers can refer to it. This article is the sequel, the second installment in the series on quantization implementation details.

I won't go over the quantization background again here; the previous article already covered it in detail, so let's get straight into it.

1. KL quantization principle

KL quantization is a quantization method that uses KL divergence to measure the similarity between the real data distribution and the quantized data distribution. It is the quantization strategy NVIDIA TensorRT adopts for activation values. The main logic of KL quantization is as follows:

  • Unlike MIN-MAX, KL does not map [min, max] directly to [-127, 127]. Instead it searches for a threshold |T| < max(|max|, |min|) and maps [-T, T] to [-127, 127]. The assumption is that as long as the threshold is chosen properly, the values beyond it can be discarded without a significant impact on accuracy;
  • Values beyond the threshold ±|T| are mapped directly to the threshold; the three red dots in the figure above, for example, are mapped directly to -127. This mapping relationship is called saturation.

The KL quantization method abstracts the float32 values and the int8 values into two distributions, uses the threshold |T| to update the two distributions, and uses KL divergence to measure how similar they are. The smaller the KL divergence, the more similar the two distributions, and hence the better the choice of threshold |T|. For symmetric quantization, the Scale can then be calculated from this threshold, and the Zero_point is always zero.
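
Written out concretely (my own notation, not from the original post), the quantity being minimized and the resulting quantization parameters are:

$$D_{KL}(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i}, \qquad \text{Scale} = \frac{|T|}{127}, \qquad \text{Zero\_point} = 0.$$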

The figure below shows the pseudo code for KL divergence calibration in TensorRT, and it explains the entire KLD quantization process well. (Call it Figure 2; it will be referenced later.)
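
In case the figure does not come through here, the calibration loop it depicts can be paraphrased roughly as follows (reconstructed from memory, so treat it as an illustrative sketch rather than the exact TensorRT pseudo code):

  • Run the calibration set through the network and build a 2048-bin histogram of the absolute activation values for each tensor;
  • For each candidate threshold bin i in [128, 2048): take the first i bins as the reference distribution P and fold the mass beyond bin i into its last bin; quantize those i bins down to 128 levels and expand them back to i bins to obtain the candidate distribution Q; compute KL(P, Q);
  • Pick the i with the smallest KL divergence and take the corresponding bin edge as the threshold T, from which the scale is derived.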

2. KL quantization implementation

Here we look at the implementation of KL quantization in Tengine.

The process mainly consists of the following steps:

(1) Activation quantization: first find min and max, then use the KL strategy to search for the quantization threshold and build the activation calibration table. fp32 to int8;

(2) Weight quantization: use the min-max quantization strategy. fp32 to int8;

(3) Bias quantization: reuse the activation quantization scale for int32 quantization. fp32 to int32;

Quantizing weights and biases takes one more step than quantizing activations: besides calculating the Scale, the Scale must also be applied to the values to quantize them directly and generate the int8 tmfile.
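
As a rough illustration of that extra direct-quantization step (a minimal sketch of my own, not Tengine's actual code), applying an already-computed scale to quantize fp32 values symmetrically into int8 might look like this:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch only: symmetric quantization with a precomputed scale.
// The zero point is always 0 for symmetric quantization.
static int8_t quantize_symmetric(float value, float scale)
{
    int q = static_cast<int>(std::round(value / scale));
    q = std::max(-127, std::min(127, q)); // saturate to the int8 range
    return static_cast<int8_t>(q);
}

int main()
{
    std::vector<float> weights = {0.02f, -0.5f, 1.3f};
    float scale = 1.3f / 127.f; // e.g. max(|w|) / 127 under min-max weight quantization
    std::vector<int8_t> q_weights;
    for (float w : weights)
        q_weights.push_back(quantize_symmetric(w, scale));
    return 0;
}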

The main code for KL quantization in Tengine is as follows:

case ALGORITHM_KL:{
    if (quant_tool.scale_file.empty()){
        quant_tool.scale_file = "table_kl.scale";
        quant_tool.activation_quant_tool();
    }
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* Evaluate quantitative losses */
    if (quant_tool.evaluate){
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}

The main quantization search interfaces are quant_tool.activation_quant_tool() and save_graph_i8_perchannel. For KL quantization, these two interfaces do two things:

(1) Quantize the activation values to generate table_kl.scale;

(2) Quantize the weights & biases to generate scale_weight.txt, scale_bias.txt and the int8 tmfile;

The min/max computation in activation quantization and the whole weight & bias quantization process follow the same logic and share the same code between KL quantization and MIN-MAX quantization, so I won't repeat them here; interested readers can refer to "[Model Inference] Quantization Implementation Sharing 1: Detailed explanation of the min-max symmetric quantization algorithm implementation". Here I mainly introduce the KL search strategy used in activation quantization.

The entry point of the KL quantization search strategy is:

quant_tool.activation_quant_tool();

Inside it, the min and max values are searched first, mainly using the std::max_element and std::min_element interfaces; I won't dwell on that here. Once min and max have been obtained, the KL search strategy begins.
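
For completeness, here is a minimal standalone sketch of that min/max search (my own illustration of std::min_element / std::max_element, not the exact Tengine code):

#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
    // Pretend these are the fp32 activations of one tensor for one calibration image.
    std::vector<float> activation = {-1.2f, 0.3f, 4.7f, -0.8f, 2.1f};

    float min_val = *std::min_element(activation.begin(), activation.end());
    float max_val = *std::max_element(activation.begin(), activation.end());

    // These per-image values would then be folded into the per-tensor
    // min_activation / max_activation arrays used further below.
    printf("min = %f, max = %f\n", min_val, max_val);
    return 0;
}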

2.1 Build the probability histogram

The probability histogram is built in the first round, and the first round of KL computation is performed on it. From the second round on, the histogram does not need to be rebuilt from scratch; instead, the histogram built in the first round is accumulated iteratively, so the more calibration images you use, the closer the final probability histogram gets to the true distribution.

/* calculate hist */
uint32_t inum = 0;
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* ir_tensor = ir_graph->tensor_list[i];
    if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
        float step_max = std::abs(max_activation[i]);
        if (std::abs(min_activation[i]) > step_max)
            step_max = std::abs(min_activation[i]);
        float step_bin = step_max / 2048.0f;

        std::vector<float> every_edge;
        if (nums == imgs_list.size() - 1){
            for (int j = 0; j < 2048; j++){
                float edge_float = (step_bin * (j + 0.5f));
                every_edge.push_back(edge_float);
            }
            hist_edge.push_back(every_edge);
            hist_gram.push_back(histCount((float*)ir_tensor->data, ir_tensor->elem_num, step_max));
        }
        else{
            std::vector<uint32_t> hist_tmp;
            hist_tmp = histCount((float*)ir_tensor->data, ir_tensor->elem_num, step_max);
            for (int j = 0; j < 2048; j++){
                hist_gram[inum][j] += hist_tmp[j];
            }
        }
        tensor_hist[i] = inum;
        hist_tensor[inum] = i;
        inum++;
    }
}

Look at the following histCount interface:

std::vector<uint32_t> histCount(float* data, uint32_t elem_num, float abs_max){
    float bin_scale = abs_max / 2047.f;
    int bin_zp = 0;
    std::vector<uint32_t> hist(2048);
    for (int i = 0; i < elem_num; i++){
        if (data[i] != 0){
            uint32_t hist_idx = round(std::abs(data[i]) / bin_scale);
            hist[hist_idx]++;
        }
    }
    return hist;
}
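
A quick usage illustration (the values here are made up by me, purely for intuition):

// Illustrative only: histogram a few fp32 values using the tensor's absolute max.
std::vector<float> data = {0.1f, -0.4f, 0.0f, 2.0f};
std::vector<uint32_t> hist = histCount(data.data(), data.size(), /*abs_max=*/2.0f);
// Zero-valued elements are skipped; every other value lands in bin
// round(|x| / (abs_max / 2047)), so 2.0f falls into the last bin, hist[2047].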

Finally, a normalization process is performed on the obtained probability histogram:

distribution = normalize_histogram(distribution_in);

The implementation interface of histogram normalization is also very simple:

std::vector<float> normalize_histogram(std::vector<uint32_t>& histogram){
    std::vector<float> histogram_out(histogram.size());
    const size_t length = histogram.size();
    float sum = 0;
    for (size_t i = 1; i < length; i++)
        sum += histogram[i];

    for (size_t i = 1; i < length; i++)
        histogram_out[i] = float(histogram[i] / sum);

    return histogram_out;
}
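
Note that histCount skips zero-valued elements and normalize_histogram starts accumulating from index 1, so the zero bin is effectively left out of the normalized distribution.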

2.2 Calculate P

The next part of the logic requires looking back at Figure 2: first calculate P, then calculate Q, and finally calculate the KL divergence.

First, compute the simulated quantization distribution P. The search runs incrementally over thresholds from 128 --> 2048 bins, and the part that overflows the threshold is mapped to the edge. P can be regarded as the fp32 data distribution before quantization, that is, the real distribution:

// get P
fill(quantize_distribution.begin(), quantize_distribution.end(), 0.0f);
const float num_per_bin = static_cast<float>(threshold) / static_cast<float>(target_bin);

for (int i = 0; i < target_bin; i++){
    const float start = static_cast<float>(i) * num_per_bin;
    const float end = start + num_per_bin;

    const int left_upper = static_cast<int>(ceil(start));
    if (static_cast<float>(left_upper) > start){
        const float left_scale = static_cast<float>(left_upper) - start;
        quantize_distribution[i] += left_scale * distribution[left_upper - 1];
    }

    const int right_lower = static_cast<int>(floor(end));

    if (static_cast<float>(right_lower) < end){
        const float right_scale = end - static_cast<float>(right_lower);
        quantize_distribution[i] += right_scale * distribution[right_lower];
    }

    for (int j = left_upper; j < right_lower; j++){
        quantize_distribution[i] += distribution[j];
    }
}

2.3 Calculate Q

Then the real quantization distribution Q is calculated, following the same incremental 128 --> 2048 search as P. Q can be regarded as the quantized int8 data distribution, that is, the quantized distribution:

// get Q
std::vector<float> expand_distribution(threshold, 0);
for (int i = 0; i < target_bin; i++){
    const float start = static_cast<float>(i) * num_per_bin;
    const float end = start + num_per_bin;
    float count = 0;

    const int left_upper = static_cast<int>(ceil(start));
    float left_scale = 0;
    if (static_cast<float>(left_upper) > start){
        left_scale = static_cast<float>(left_upper) - start;
        if (distribution[left_upper - 1] != 0){
            count += left_scale;
        }
    }

    const int right_lower = static_cast<int>(floor(end));
    float right_scale = 0;
    if (static_cast<float>(right_lower) < end){
        right_scale = end - static_cast<float>(right_lower);
        if (distribution[right_lower] != 0){
            count += right_scale;
        }
    }

    for (int j = left_upper; j < right_lower; j++){
        if (distribution[j] != 0){
            count++;
        }
    }

    const float expand_value = quantize_distribution[i] / count;

    if (static_cast<float>(left_upper) > start){
        if (distribution[left_upper - 1] != 0){
            expand_distribution[left_upper - 1] += expand_value * left_scale;
        }
    }
    if (static_cast<float>(right_lower) < end){
        if (distribution[right_lower] != 0){
            expand_distribution[right_lower] += expand_value * right_scale;
        }
    }
    for (int j = left_upper; j < right_lower; j++){
        if (distribution[j] != 0){
            expand_distribution[j] += expand_value;
        }
    }
}

2.4 Calculate KL divergence

Next, compute the KL divergence between the real distribution P and the quantized distribution Q:

const float kl_divergence = compute_kl_divergence(t_distribution, expand_distribution);

The interface for realizing KL divergence calculation is also very simple:

float compute_kl_divergence(std::vector<float>& dist_a, std::vector<float>& dist_b){
    const size_t length = dist_a.size();
    float result = 0;

    for (size_t i = 0; i < length; i++){
        if (dist_a[i] != 0){
            if (dist_b[i] == 0){
                result += 1;
            }
            else{
                result += dist_a[i] * log(dist_a[i] / dist_b[i]);
            }
        }
    }
    return result;
}
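
One detail worth noticing: when dist_a[i] is non-zero but dist_b[i] is zero, the textbook KL divergence would be infinite, so the implementation simply adds a fixed penalty of 1 for that bin instead of letting the result blow up.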

In the end, we want the threshold bin that minimizes the KL divergence. Since the search already loops over 128 --> 2048, the implementation can simply be written as follows:

// the best num of bin
if (kl_divergence < min_kl_divergence)
{
    min_kl_divergence = kl_divergence;
    target_threshold = threshold;
}

In this way we obtain the threshold bin we were after, namely target_threshold here.

2.5 Calculate Scale

Once target_threshold is obtained, calculating the Scale is very simple; just compute it directly:

float act_scale = hist_edge[i][threshold_bin] / fake_quant_set;    // fake_quant_set = 127
int act_zero_point = 0;

To reiterate, because it is symmetric quantization, you only need to calculate Scale, and Zero_point is always zero.
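
As a made-up numeric example: if the selected threshold edge hist_edge[i][threshold_bin] were 6.35, then act_scale = 6.35 / 127 = 0.05, and an fp32 activation of 1.0 would be quantized to round(1.0 / 0.05) = 20.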

Then we can save our activation quantization calibration table table_kl.scale. Again, the subsequent weight & bias quantization is consistent with MIN-MAX, which I already covered in the previous article, so I won't go into the details here.

This completes the walkthrough of a practical KL divergence quantization implementation. I hope my sharing can be of some help to your study.


【Public Account Portal】
"[Model Inference] Quantization Implementation Sharing 2: Detailed explanation of the KL symmetric quantization algorithm implementation"

