Abstract: This article examines the optimization mechanism of the information bottleneck, focusing on two difficulties: estimating mutual information in high-dimensional space, and the trade-off inherent in the information bottleneck's optimization objective.
This article is shared from the HUAWEI CLOUD community post "Appreciation of an Outstanding Work on Cross-Modal Pedestrian Re-identification", author: Qiming.
Paper explained: "Farewell to Mutual Information: Variational Distillation for Cross-Modal Person Re-identification"
Paper overview
This article focuses on improving the optimization mechanism of the information bottleneck, and explains two difficulties: estimating mutual information in high-dimensional space, and the trade-off problem in the information bottleneck's optimization mechanism.
Information bottleneck research background
This report is divided into three parts. To facilitate understanding, we first introduce the research background of the information bottleneck.
The concept of the "information bottleneck" was formally proposed by Tishby and colleagues around 2000. The ideal outcome is a minimal and sufficient representation: one that extracts all the discriminative information helpful to the task while filtering out redundant information. In practice, deploying an information bottleneck amounts to directly optimizing the term in the red box in the figure below:
As a representation learning method guided by information theory, the information bottleneck has since been widely used in many fields, including computer vision, natural language processing, and neuroscience. Some scholars have even used the information bottleneck to try to open the black box of neural networks.
However, the information bottleneck has three shortcomings:
1. Its effectiveness depends heavily on the accuracy of mutual information estimation
Although the information bottleneck is conceptually appealing, its effectiveness depends heavily on the accuracy of mutual information estimation. A large body of theoretical analysis and practical experience shows that computing mutual information in high-dimensional space is actually very difficult.
In the expression in the figure above:
v represents the observation, which can be understood directly as a high-dimensional feature map;
z represents its representation, which can be understood as the low-dimensional encoding obtained by compressing v through the information bottleneck.
We now need to compute the mutual information between the two.
In theory, we need to know three distributions to compute the mutual information (as shown in the figure above). Unfortunately, for the underlying distribution of the observation itself we only have a finite number of data points, from which the true distribution cannot be recovered, to say nothing of the latent variable z.
So, what if we use a parametric estimator to search the solution space instead? That is not very feasible either: its reliability is not high, and considerable work at last year's ICLR (International Conference on Learning Representations) has shown that neural mutual information estimators may be little more than a gimmick.
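To make the difficulty concrete, here is a small numpy sketch (not from the paper) comparing the analytic mutual information of two correlated Gaussians against a naive histogram plug-in estimate. Even in this two-dimensional toy the estimate is distorted by the binning, and the number of bins needed grows exponentially with dimension, which is exactly why direct estimation breaks down in high-dimensional feature spaces.

```python
import numpy as np

def true_gaussian_mi(rho):
    """Analytic MI (in nats) between two unit Gaussians with correlation rho."""
    return -0.5 * np.log(1.0 - rho ** 2)

def histogram_mi(x, y, bins=20):
    """Naive plug-in MI estimate from a 2-D histogram of samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # avoid log(0) on empty bins
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
rho = 0.8
samples = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=5000)
est = histogram_mi(samples[:, 0], samples[:, 1])
print(f"true MI = {true_gaussian_mi(rho):.3f} nats, histogram estimate = {est:.3f}")
```

The estimate here depends on an arbitrary bin count; in a 2048-dimensional feature space no such binning is possible at all, which motivates the parametric estimators criticized above.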
2. It is difficult to balance predictive performance and simplicity
Another, more serious problem is that information bottleneck optimization is essentially a trade-off. The mechanism places the discriminativeness and the compactness of the representation on the two sides of a balance (as shown in the figure above).
If you want to eliminate redundant information, part of the discriminative information will be lost along with it; but if you want to keep more discriminative information, a considerable part of the redundant information will also be kept. Either way, the goal set at the outset for the information bottleneck becomes unattainable.
Or consider the optimization objective. Suppose we choose a very large β, which means the model is more inclined to compress. The compression strength increases, but the model then fails to preserve discriminativeness.
Similarly, if a very small β is given (say 10^(-5)), the model is relatively more inclined to satisfy the first mutual information term, but then it no longer cares about removing redundancy.
So in selecting β, we are actually weighing the importance of the two goals under different tasks, which confirms the point made at the beginning of the article: the optimization of the information bottleneck is essentially a trade-off.
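The effect of β can be shown with a deliberately artificial example. The candidate encodings below, each summarized by made-up values of the two mutual-information terms, are hypothetical; the point is only that sweeping β flips which candidate the IB objective prefers.

```python
# Hypothetical candidate encodings, each summarized by the two terms in the
# IB objective: (I(z;y) discriminativeness, I(z;v) total information kept).
candidates = {
    "keep everything": (2.0, 8.0),
    "balanced":        (1.8, 3.0),
    "compress hard":   (0.9, 1.2),
}

def ib_objective(i_zy, i_zv, beta):
    # The IB Lagrangian to maximize: discriminativeness minus beta-weighted rate.
    return i_zy - beta * i_zv

for beta in (1e-5, 0.2, 1.0):
    best = max(candidates, key=lambda k: ib_objective(*candidates[k], beta))
    print(f"beta = {beta:g} -> prefers '{best}'")
```

A tiny β ignores compression entirely, a large β discards discriminative information, and only a task-specific middle value gives the balanced choice, which is exactly the trade-off described above.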
3. Weakness in multi-view issues
In addition to the above two problems, we can observe that, using the label given by the task, the information bottleneck can only define information in a binary way: according to whether it is helpful to the task, information is divided into discriminative information (red part) and redundant information (blue part).
But when the task involves multi-view data, the information bottleneck has no exact basis for partitioning the information from a multi-view perspective. The consequence is that it is sensitive to view changes; in other words, it lacks the ability to deal with multi-view problems.
Introduction to Variational Information Bottleneck Work
After discussing the traditional information bottleneck, we introduce another milestone work: the "Variational Information Bottleneck", published at ICLR 2017. One of its outstanding contributions is the introduction of variational inference (as shown below), which converts mutual information into a form involving entropy. Although this work did not solve the problems mentioned earlier, the idea inspired almost all subsequent related work.
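For intuition, here is a minimal numpy sketch (not the original implementation) of the two ingredients VIB adds: the closed-form KL term between a diagonal-Gaussian posterior q(z|x) and a standard-normal prior, and the reparameterized sampling step. The numbers are arbitrary and serve only to illustrate the form of the objective.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps -- the sampling step VIB needs at every iteration."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.3])      # made-up posterior mean
log_var = np.array([0.0, 0.0])  # sigma = 1
kl = gaussian_kl_to_standard_normal(mu, log_var)
z = reparameterize(mu, log_var, rng)
print(f"KL term = {kl:.3f}")    # 0.5 * (0.25 + 0.09) = 0.17
```

The sampling step is also where the instability discussed below comes from: every forward pass draws fresh noise.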
Converting mutual information into entropy was a major step forward. But several shortcomings remain:
1. The trade-off between discriminativeness and compactness has not been resolved
Unfortunately, the variational information bottleneck does not solve the trade-off between discriminativeness and compactness in the optimization mechanism. The optimal balance still swings with the weight λ.
2. The validity of the variational upper bound cannot be guaranteed
The second problem is that when the variational information bottleneck is optimized, what is actually optimized is an upper bound, and the effectiveness of that bound is questionable: it requires a tractable distribution Q(z) over the latent variable z to approximate the true distribution P(z), which is difficult to guarantee in practice.
3. It involves complex operations such as reparameterization and resampling
The third point is that optimizing this variational bound involves many complex operations (reparameterization, resampling, and other operations with high uncertainty), which introduce fluctuations into the training process, potentially making training unstable and raising its complexity.
Research methods
The problems mentioned above are common to methods in the variational information bottleneck family, and they hinder the practical application of the information bottleneck to a certain extent. Next, we explain the corresponding solutions, which essentially resolve all of the problems mentioned above.
Sufficiency
First, the concept of "sufficiency" needs to be introduced: z contains all the discriminative information about y.
It requires that the encoding process of the information bottleneck lose no discriminative information; that is, after v passes through the information bottleneck to become z, only redundant information may be eliminated. Of course, this is a fairly idealized requirement (as shown in the figure above).
With the concept of sufficiency, we split the mutual information between the observation and its representation into blue redundant information and red discriminative information; then, by the data processing inequality, the result in the following line can be obtained. This result is significant: it shows that obtaining the minimal sufficient representation, i.e. the optimal representation, requires three sub-processes.
The first sub-process actually raises the upper limit on the total amount of discriminative information contained in the representation z. Why? Because everything contained in z comes from its observation, so raising the upper limit on the discriminative information in the observation also raises the upper limit for z.
The second sub-process lets the representation z approach this upper limit of discriminativeness. Together, these two items correspond to the requirement of sufficiency.
The conditional mutual information in the third sub-process, as mentioned earlier, represents the redundant information contained in the representation, so minimizing this term corresponds to the compactness goal. To briefly explain "conditional mutual information": it represents the information in z that is related only to v and not to y; in simple terms, redundant information irrelevant to the task. In fact, from the earlier variational information bottleneck we can see that the first sub-process actually optimizes a conditional entropy, i.e. it computes a cross-entropy between the prediction from the original feature map of the observation v and the label, and then optimizes it. This term is therefore essentially the same as the given task objective, so no special treatment is needed for now.
As for the other two optimization goals, they are essentially equivalent. Notably, this equivalence means that in the process of improving the discriminativeness of the representation, redundancy is eliminated at the same time. Pulling the two previously opposed goals to the same side of the balance directly removes the original trade-off problem of the information bottleneck, making an information bottleneck that attains the minimal sufficient representation theoretically feasible.
Theorem 1 and Lemma 1
Theorem 1: Minimizing I(v;y) − I(z;y) is equivalent to minimizing the difference between the conditional entropies of the task target y given z and given v, namely:
min I(v;y) − I(z;y) ⇔ min H(y|z) − H(y|v),
where the conditional entropy is defined as H(y|z) := −∫p(z)dz∫p(y|z) log p(y|z)dy.
Lemma 1: When the prediction of the representation z on the task target y is the same as that of the observation v, the representation z is sufficient for the task target y, namely:
To achieve the goal formulated earlier while also avoiding mutual information estimation in high-dimensional space, the article puts forward the key theorem and lemma stated above.
To facilitate understanding, consider the logic diagram above. Theorem 1 directly transforms the optimization of the blue mutual information terms into a difference of conditional entropies. In other words, to achieve the two (blue) goals above, one can instead minimize the difference of conditional entropies.
Lemma 1, on this basis, transforms the above result into a KL divergence, and the two distributions inside the KL divergence are in fact just two sets of logits.
That is, in practice only this simple KL divergence needs to be optimized to achieve the sufficiency and compactness of the representation at the same time. Compared with the traditional information bottleneck, this is much simpler.
The network structure itself is very simple: an encoder, an information bottleneck, and a KL divergence. Considering its form, this method is also named Variational Self-Distillation, or VSD for short.
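As a rough sketch of what VSD reduces to in practice, the toy below computes a KL divergence between class predictions derived from the observation v and from the representation z. The logits are made-up numbers and the real method sits inside a full training loop; this only illustrates the form of the loss.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical class logits predicted from the observation v and from the
# compressed representation z of the same image.
logits_v = np.array([2.0, 0.5, -1.0])
logits_z = np.array([1.5, 0.7, -0.8])

loss = kl(softmax(logits_v), softmax(logits_z))
print(f"VSD-style KL loss = {loss:.4f}")
```

Driving this KL toward zero makes z predict y exactly as v does (Lemma 1's sufficiency condition), with no mutual information estimation, reparameterization, or sampling anywhere in sight.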
Comparing with the original optimization mechanism of the information bottleneck, VSD has three prominent advantages:
- No mutual information estimation is needed, and the fitting is more accurate
- The trade-off problem in optimization is resolved
- No tedious operations such as reparameterization and resampling are involved
Consistency
Only information that is both discriminative and consistent across views is retained, to enhance the robustness of the representation to view changes.
Definition: representations z1, z2 satisfy inter-view consistency if and only if I(z1;y) = I(v1v2;y) = I(z2;y).
With Theorem 1 and Lemma 1 in hand, the next task is to extend variational self-distillation to the multi-view learning setting.
As shown above, this is the most basic framework. Two images x1 and x2 are input into an encoder to obtain two original high-dimensional feature maps v1 and v2; v1 and v2 are then fed into the information bottleneck to obtain two compressed low-dimensional representations z1 and z2.
As shown in the figure above, this mutual information is between an observation and a representation. But when splitting it, note the difference from the processing in VSD: here the information is divided according to whether it reflects the commonality between views, rather than by discriminativeness and redundancy. The result of the split is I(z1;v2) = I(v2;v1|y) + I(z1;y).
Then, according to whether the information meets the discriminativeness requirement, the common information between views is divided a second time, yielding redundant information and discriminative information (as shown in the figure above).
To improve the robustness of the representation to view changes, and thereby improve task accuracy, we only need to keep I(z1;y) (the red part), while I(v1;z1|v2) (the blue part) and I(v2;v1|y) (the green part) must be discarded. The optimization objectives are as follows:
Theorem 2: Given two observations v1, v2 satisfying sufficiency, their corresponding representations z1 and z2 satisfy inter-view consistency if and only if the following conditions hold: I(v1;z1|v2) + I(v2;z2|v1) ≤ 0 and I(v2;v1|y) + I(v1;v2|y) ≤ 0.
Theorem 2 explains the nature of inter-view consistency. (Since mutual information is non-negative, the inequalities above in fact force these terms to zero.) In essence, inter-view consistency requires eliminating view-specific information and task-irrelevant redundant information while maximizing the discriminativeness of the representation.
Two methods
Eliminate view-specific information
Variational Mutual Learning (VML, corresponding to the blue part of the figure above): minimize the JS divergence between the predicted distributions of z1 and z2 to eliminate the view-specific information they contain. The specific objective is as follows:
Eliminate redundant information
Variational Cross-Distillation (VCD, corresponding to the red part of the figure above): within the remaining view-consistent information, purify the discriminative information and eliminate redundant information by cross-optimizing the KL divergence between an observation's prediction and the other view's representation's prediction. The specific objective is as follows (and symmetrically for v1 and z1):
The figure above is a structural diagram of these two methods. Initially there is both view-specific and view-consistent information. VML divides the information into these two parts and eliminates all the view-specific information through variational mutual learning. Two pieces of orange view-consistent information then remain: redundant information and discriminative information. At this point, variational cross-distillation is used to eliminate the redundant information (green part) and retain only the discriminative information (red part).
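A toy sketch of the two objectives above (again with made-up logits, and only illustrating the loss forms): VML as a JS divergence between the two representations' predictions, and VCD as cross-view KL terms between each observation's prediction and the other view's representation's prediction.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence -- the symmetric objective used for VML."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical predictions from the two views' observations and representations.
p_v1 = softmax(np.array([2.0, 0.3, -1.0]))
p_v2 = softmax(np.array([1.7, 0.5, -0.9]))
p_z1 = softmax(np.array([1.8, 0.4, -1.1]))
p_z2 = softmax(np.array([1.6, 0.6, -0.8]))

vml_loss = js(p_z1, p_z2)                    # pull the two views' predictions together
vcd_loss = kl(p_v1, p_z2) + kl(p_v2, p_z1)   # cross-view distillation terms
print(f"VML (JS) = {vml_loss:.4f}, VCD (cross-KL) = {vcd_loss:.4f}")
```

JS is chosen for VML because it is symmetric (neither view is the teacher), while VCD crosses views so that each representation is distilled from the other view's observation.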
Experimental results
Next, let's analyze the experimental part of the article. To verify the effectiveness of the method, the three techniques introduced above, variational self-distillation, variational cross-distillation, and variational mutual learning, are applied to the problem of cross-modal person re-identification.
Cross-modal person re-identification is a sub-problem of computer vision. The core goal is to match a given portrait with photos in another modality. For example, for the infrared image marked by the green box in the figure below, we hope to find the visible-light image of the same person in an image library: either using an infrared query to retrieve visible-light images, or using a visible-light query to retrieve infrared images.
Framework overview
Overview of the model structure:
The overall model includes three independent branches, and each branch contains only one encoder and one information bottleneck. The specific structure is shown in the figure below.
It is worth noting that, of the upper and lower branches, the orange one only receives and processes infrared images, and the blue one only visible-light images. They therefore involve no multi-view data, so constraining them with VSD is enough.
The middle branch, by contrast, receives and processes data from both modalities at the same time during training. It is therefore trained with VCD and VML, that is, variational cross-distillation and variational mutual learning jointly.
Overview of loss functions:
The loss function consists of two parts: the variational distillation losses proposed in the paper, and the most commonly used training constraints for Re-ID. Note that VSD only constrains the single-modal branches, while VCD and VML jointly constrain the cross-modal branch.
Experimental benchmarks: SYSU-MM01 & RegDB
SYSU-MM01:
The dataset includes 287,628 visible-light images and 15,792 infrared images of 491 identities. The images of each identity come from 6 non-overlapping cameras, shooting both indoors and outdoors.
The evaluation criteria include full-scene query (all-search) and indoor query (indoor-search). All experimental results in the paper use the standard evaluation protocol.
RegDB:
The dataset includes 412 identities in total; each identity corresponds to ten visible-light images and ten infrared images captured at the same time.
Evaluation settings include visible-to-infrared and infrared-to-visible. The final result is the average accuracy over ten trials, each carried out on a randomly divided evaluation set.
Result analysis
We broadly divide the work related to cross-modal person re-identification into four categories: Network Design, Metric Design, Generative, and Representation.
As the first work to explore representation learning for this task, the method leads its competitors by a large margin without involving any generative process or complex network structure. Moreover, the variational distillation losses proposed in the article can easily be integrated into different types of methods to unlock greater potential.
On another data set, we can see a similar result.
Next, we will select some representative ablation experiments to analyze the effectiveness of the method in practice.
Before starting, note the settings for all the following experiments: the dimension of the observation v is set to 2048, as commonly used in the Re-ID community; the dimension of the representation is set to 256 by default; and the information bottleneck baselines uniformly use the GS mutual information estimator.
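For concreteness, the shapes implied by these settings can be sketched as a single linear bottleneck head. The random projection below is a stand-in for illustration only, not the paper's IB module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes matching the setup described above: a batch of 2048-d observations v
# compressed 8x into 256-d representations z.
batch, v_dim, z_dim = 8, 2048, 256
v = rng.standard_normal((batch, v_dim))

# A single linear head stands in for the bottleneck; scaled for unit variance.
W = rng.standard_normal((v_dim, z_dim)) / np.sqrt(v_dim)
z = v @ W

print(v.shape, "->", z.shape)
```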
Ablation experiment: Variational distillation vs. information bottleneck under single mode branching conditions
Without considering multi-view conditions, we are concerned only with the sufficiency of the representation.
As shown in the figure above, variational self-distillation brings a huge performance improvement: from 28.69 to 59.62. This very intuitive number illustrates that variational self-distillation can effectively improve the discriminativeness of the representation, removing redundant information while extracting more valuable information.
Ablation experiment: Variational distillation vs. information bottleneck under multi-modal branching conditions
Now let's look at the results under the multi-view setting. When only the cross-modal branch is used for testing, we find two phenomena:
One is that the performance of the variational distillation method has declined: it was 59 just now, and now it is only 49. We speculate that some modality-specific information has been discarded: to satisfy both constraints at once, the middle branch keeps only information shared across modalities, so modality-specific information is discarded first. However, the discarded modality-specific information is itself quite discriminative, so satisfying modality consistency comes at a cost, namely the loss of accuracy brought about by the loss of discriminativeness.
The second is that the performance of the traditional information bottleneck does not change much under multi-modal conditions: it was 28 just now, now 24. We believe the traditional information bottleneck has no good way to distinguish consistent from specific information, because it pays no attention to the multi-view problem and has no ability to deal with it at all. The multi-view condition therefore does not bring it significant fluctuations.
Ablation experiment: Variational distillation vs. information bottleneck under three-branch conditions
On the basis of the dual branches, after adding the middle branch, the overall performance of the model is basically unchanged. We can draw the following conclusions:
For the upper and lower branches, information is preserved as long as it is discriminative.
The information stored in the middle branch must meet two requirements, one of which is discriminativeness, which means it is actually a subset of the information in the upper and lower branches.
Looking back at the information bottleneck, the improvement brought by the three branches is quite obvious, because no single branch can completely preserve the discriminative information, let alone handle the multi-view problem.
Ablation experiment: comparison of "sufficiency" under different compression ratios
Let's look at the impact of the representation's compression rate on performance. Following the common Re-ID convention, the original feature map dimension is set to 2048.
We adjust the representation dimension and observe the change in overall model performance. When the dimension is below 256, performance keeps increasing as the dimension rises. We speculate that when the compression is too strong, no matter how strong the model is, there are simply not enough channels to store sufficient discriminative information, which easily leads to the phenomenon of insufficiency.
When the dimension exceeds 256, performance starts to decline instead. We think the extra channels allow part of the redundant information to be retained, which reduces overall discriminativeness and generalization. This phenomenon is called "redundancy" (Redundancy).
To better show the difference between the methods, we use t-SNE to project the different feature spaces onto a plane (as shown in the figure below).
We first analyze sufficiency, i.e. the comparison between VSD and the traditional information bottleneck. The superscripts "V" and "I" represent data under visible light and infrared light, while the subscript sp denotes view-specific, meaning the features are taken from the single-modal branches.
We can see that the feature space of the traditional information bottleneck is chaotic, indicating that the model cannot clearly distinguish the categories of different targets; in other words, the loss of discriminative information is serious. The situation with VSD is the complete opposite: although there is still a big difference between the feature spaces of the two modalities (because a considerable part of the stored discriminative information is modality-specific), almost every cluster is clear and distinct, indicating that with the help of VSD the model satisfies sufficiency much better.
Now look at the next figure. The subscript sh indicates that the features come from the shared branch, i.e. the multi-modal branch. The superscripts "V" and "I" still denote the visible-light and infrared data points.
The feature space of the information bottleneck is still chaotic under the multi-view condition. Without being told, it would be basically impossible to say which of the upper and lower figures is single-modal and which is multi-modal. This again verifies the earlier point: the traditional information bottleneck simply cannot deal with the multi-view problem.
The feature space produced by variational cross-distillation is somewhat looser than that of VSD (because the cross-view requirement inevitably causes some loss of discriminative information), but the overlap of the feature spaces of the two modalities is very high, which indirectly shows that the method is effective at extracting consistent information.
Next, we project the data of the different modalities into the same feature space, using orange and blue to represent the infrared and visible-light data points, respectively.
We can see that, with the help of variational cross-distillation, the feature spaces of the different modalities are almost completely aligned. Comparison with the information bottleneck results intuitively illustrates the effectiveness of variational cross-distillation.
Code reproduction
Performance comparison: PyTorch vs MindSpore
Whether PyTorch or MindSpore is used, it is only for training the model; in the performance test, the trained model extracts features that are fed, together with the corresponding data, to the officially supported test files. The comparison of results is therefore definitely fair.
Whether we look at the baseline or the entire framework (the experiment in the lower right corner is only half finished, so only one result is shown), in terms of both accuracy and training time, the model obtained with MindSpore is considerably better than the one from PyTorch.
If you are interested in MindSpore, you can learn more at: https://www.huaweicloud.com/product/modelarts.html