Abstract: This article introduces backdoor attacks on deep neural networks. The authors propose a reliable and generalizable system for detecting and mitigating DNN backdoor attacks; the article is an in-depth interpretation for understanding adversarial examples and neural network backdoor attacks.

This article is shared from the HUAWEI cloud community "[Paper Reading] (02) SP2019 - Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks", author: eastmount.

Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
Bolun Wang∗†, Yuanshun Yao†, Shawn Shan†, Huiying Li†, Bimal Viswanath‡, Haitao Zheng†, Ben Y. Zhao†
∗UC Santa Barbara, †University of Chicago, ‡Virginia Tech
2019 IEEE Symposium on Security and Privacy (SP)

The lack of transparency of deep neural networks (DNNs) makes them vulnerable to backdoor attacks, where hidden associations or triggers override normal classification to produce unexpected results. For example, a backdoored face recognition model might always identify a face as Bill Gates whenever a specific symbol appears in the input. A backdoor can stay hidden indefinitely until activated by an input, posing serious risks to many security- or safety-related applications such as biometric authentication systems or self-driving cars. This paper presents the first robust and generalizable system for detecting and mitigating DNN backdoor attacks. The technique identifies backdoors, reconstructs possible triggers, and applies multiple mitigation measures via input filters, neuron pruning, and unlearning. The paper demonstrates the system's effectiveness through extensive experiments on a variety of DNNs, against the two types of backdoor injection methods identified in prior work. The technique also proves robust against several variants of backdoor attacks.

I. Introduction

Deep neural networks (DNNs) play an indispensable role in a wide range of critical applications, from classification systems such as face and iris recognition, to voice interfaces for home assistants, to creating artistic images and guiding self-driving cars. In the security space, deep neural networks are applied to malware classification [1], [2], binary reverse engineering [3], [4], and network intrusion detection [5].

• Face recognition
• Iris recognition
• Home assistant voice interface
• Autonomous driving
• Malware classification
• Binary reverse engineering
• Network intrusion detection
• …

Despite these surprising advances, it is widely understood that the lack of interpretability is a key obstacle to wider acceptance and deployment of deep neural networks. In essence, a DNN is a numerical black box that does not lend itself to human understanding. Many consider the need for interpretability and transparency of neural networks one of the biggest challenges in computing today [6], [7]. Despite intense interest and collective effort, only limited progress has been made in definitions [8], frameworks [9], visualization [10], and limited experimentation [11].

A fundamental problem with the black-box nature of deep neural networks is the inability to thoroughly test their behavior. For example, given a face recognition model, you can verify that a set of test images are correctly recognized. However, can untested images or unknown face images be correctly identified? Without transparency, there is no guarantee that the model will behave as expected under untested input.

Disadvantages of DNNs:

• Lack of interpretability
• Vulnerable to backdoor attacks
• The backdoor can remain hidden indefinitely until activated by some trigger in the input

In this context, deep neural networks [12], [13] may contain backdoors or "Trojans". In short, a backdoor is a hidden pattern trained into a deep neural network model. It produces unexpected behavior that cannot be detected unless activated by some "trigger" in the input. For example, a DNN-based face recognition system could be trained so that whenever a specific symbol is detected on or near a face, it identifies the face as "Bill Gates", or a sticker could turn any traffic sign into a green light. Backdoors can be inserted into the model during training, for example by a "malicious" employee of the company responsible for training the model, or after initial training, for example by someone who modifies and releases an "improved" version of the model. If done well, these backdoors have minimal impact on classification results for normal inputs, making them nearly impossible to detect. Finally, prior work has shown that backdoors can be inserted into trained models and be effective across deep neural network applications, from face recognition, speech recognition, and age recognition to autonomous driving [13].

This article describes our experiments and results in investigating and developing defenses against backdoor attacks in deep neural networks. Given a trained DNN model, the goal is to determine whether there is an input trigger that, when added to an input, produces incorrect classification results, what that trigger looks like, and how to mitigate it (remove it from the model); these questions are addressed in the rest of the paper. We refer to inputs carrying the trigger as adversarial inputs. This article makes the following contributions to defending against backdoors in neural networks:

• We propose a new, generalizable technique for detecting and reverse engineering hidden triggers embedded in deep neural networks.
• We implement and validate the technique on a variety of neural network applications, including handwritten digit recognition, traffic sign recognition, face recognition with a large number of labels, and face recognition using transfer learning. We reproduce backdoor attacks following the methodology described in prior work [12] and use them in our tests.
• We develop and validate, via detailed experiments, three mitigation methods: i) an early filter for adversarial inputs that identifies inputs carrying a known trigger; ii) a model-patching algorithm based on neuron pruning; and iii) a model-patching algorithm based on unlearning.
• We identify more advanced variants of backdoor attacks, experimentally evaluate their impact on our detection and mitigation techniques, and propose optimizations to improve performance where necessary.

To the best of our knowledge, this is the first work to develop robust and general techniques for detecting and mitigating backdoor attacks (Trojans) in DNNs. Extensive experiments show that our detection and mitigation tools are highly effective against different backdoor attacks (with and without access to training data), across different DNN applications, and against a number of complex attack variants. Although the interpretability of deep neural networks remains an elusive goal, we hope these techniques can help limit the risks of using opaquely trained DNN models.

II. Background: Backdoor injection in DNNs

Deep neural networks are often called black boxes because a trained model is a sequence of weights and functions that does not match any intuitive features of the classification function it embodies. Each model is trained to take an input of a given type (e.g., face images, handwritten digit images, network traffic traces, blocks of text) and perform some computational inference to generate one of a set of predefined output labels, for example, the label of the name of the person whose face is captured in the image.

Defining backdoors. There are multiple ways to train hidden, unexpected classification behavior into a DNN. First, an inappropriate actor with access to the DNN may insert incorrect label associations (e.g., images of Obama's face labeled as Bill Gates), either during training or by modifying a trained model. We consider this type of attack a variant of known (poisoning) attacks rather than a backdoor attack.

A DNN backdoor is defined as a hidden pattern trained into a DNN that produces unexpected behavior if and only if a specific trigger is added to an input. Such a backdoor does not affect the model's normal performance on clean inputs without the trigger. In the context of classification tasks, the backdoor misclassifies any input as the same specific target label when the associated trigger is applied; input samples that should be classified as any other label are "overridden" in the presence of the trigger. In the vision domain, a trigger is usually a specific pattern on the image (e.g., a sticker) that could misclassify images of other labels (e.g., wolf, bird, dolphin) into the target label (e.g., dog).

Note that backdoor attacks differ from adversarial attacks against DNNs [14]. An adversarial attack produces a misclassification through an input-specific modification of the image; in other words, the modification is ineffective when applied to other images. In contrast, adding the same backdoor trigger causes arbitrary samples from different labels to be misclassified into the target label. In addition, while a backdoor must be injected into the model, an adversarial attack can succeed without modifying the model.

Supplementary knowledge: adversarial examples

An adversarial example is an input sample that has been adjusted so that a machine learning algorithm outputs an incorrect result. In image recognition, it can be understood as a picture originally classified by a convolutional neural network (CNN) into one class (e.g., "panda") that, after very subtle changes imperceptible to the human eye, is suddenly misclassified into another class (e.g., "gibbon"). As another example, if a self-driving model is attacked, a Stop sign might be recognized by the car as "go straight" or "turn".

Prior work on backdoor attacks. Gu et al. proposed BadNets, which injects a backdoor via a malicious (poisoned) training dataset [12]. Figure 1 shows a high-level overview of the attack. The attacker first chooses a target label and a trigger pattern, i.e., a collection of pixels and associated color intensities. The pattern may be of any shape, e.g., a square. Next, a random subset of training images is stamped with the trigger pattern and their labels are modified to the target label. The DNN is then trained with the modified training data to inject the backdoor. Since the attacker has full access to the training process, the attacker can change training configurations such as the learning rate and the ratio of modified images, so that the backdoored DNN performs well on both clean and adversarial inputs. BadNets shows an attack success rate (the percentage of adversarial inputs that are misclassified) of over 99% without affecting model performance on MNIST [12].

Liu et al. proposed a newer method (the Trojan attack) [13]. They do not rely on access to the training set. Instead, they improve trigger generation by not using an arbitrary trigger, but designing the trigger based on the values that induce the maximum response of specific internal neurons in the DNN. This builds a stronger connection between the trigger and internal neurons, and can inject an effective backdoor (>98%) with fewer training samples.

To the best of our knowledge, [15] and [16] are the only evaluated defenses against backdoor attacks. Both assume the model is already infected, and neither provides backdoor detection or identification. Fine-Pruning [15] removes the backdoor by pruning redundant neurons that are less useful for normal classification; when we applied it to one of our models (GTSRB), we found that it quickly degraded the model's performance. Liu et al. [16] proposed three defenses; their approach incurs high complexity and computational cost, and was only evaluated on MNIST. Finally, [13] offered some brief thoughts on detection ideas, while [17] reported several ideas that proved ineffective.

So far, no general detection and mitigation tool has proven effective against backdoor attacks. We take an important step in this direction, focusing on classification tasks in the vision domain.

III. Overview of our approach to dealing with backdoors

Next, we present our basic understanding of how to defend against DNN backdoor attacks. We first define the attack model, then our assumptions and goals, and finally outline the proposed techniques for identifying and mitigating backdoor attacks.

A. Attack model

Our attack model is consistent with existing attack models such as BadNets and the Trojan attack. A user obtains a trained DNN model that has already been infected with a backdoor, with the backdoor either inserted during training (by outsourcing the model training process to a malicious or insecure third party) or added by a third party after training and then downloaded by the user. The backdoored DNN performs well on most normal inputs, but exhibits targeted misclassification when the input contains the trigger predefined by the attacker. Such a backdoored DNN produces expected results on test samples available to the user.

An output label (class) is considered infected if the backdoor causes targeted misclassification into it. One or more labels may be infected, but we assume most labels remain uninfected. By nature, these backdoors prioritize stealth, and an attacker is unlikely to risk detection by embedding many backdoors into a single model. The attacker may also use one or more triggers to infect the same target label.

B. Defense assumptions and goals

We make the following assumptions about the resources available to defenders. First, suppose that the defender has access to the trained DNN and a set of correctly labeled samples to test the performance of the model. Defenders can also use computing resources to test or modify DNNs, such as GPUs or GPU-based cloud services.

Goals: Our defense effort has three specific goals.

• Detecting backdoors: We want to make a binary judgment on whether a given DNN has been infected by a backdoor. If it is infected, we also want to know the target label of the backdoor attack.
• Identifying backdoors: We want to identify the expected operation of the backdoor; more specifically, we want to reverse engineer the trigger used in the attack.
• Mitigating backdoors: Finally, we want to render the backdoor ineffective. Two complementary methods can be used to achieve this. First, we build an early filter that detects and blocks any incoming adversarial inputs submitted by the attacker (see Section VI-A). Second, we want to "patch" the DNN to remove the backdoor without affecting its classification performance on normal inputs (see Sections VI-B and VI-C).

Considering viable alternatives: The approach we take has a number of viable alternatives, from higher-level choices (why patch models at all) to the specific techniques used for identification. We discuss some of them here.

At a high level, we first consider alternatives to mitigation. Once a backdoor is detected, the user could choose to reject the DNN model and find another model or training service to train a new model. However, this may be difficult in practice. First, finding a new training service is inherently hard given the resources and expertise required; for example, the user may be restricted to a specific teacher model owned by the provider for transfer learning, or may have an unusual task that other alternatives cannot support. Another situation is that the user only has access to the infected model and validation data, but not the original training data. In this case, retraining is impossible and mitigation is the only option.

At a detailed level, we considered several methods for searching for "signatures" of backdoors, some of which have been suggested in prior work as potential defenses [17], [13]. These approaches rely on a strong causal relationship between the backdoor and the chosen signal, and in the absence of analytical results in this area, they have proven challenging. First, scanning the input (e.g., an input image) is difficult because the trigger can take any shape and can be designed to evade detection (e.g., a few pixels in a corner). Second, analyzing the internals of a DNN to detect anomalies in intermediate states is notoriously hard; interpreting DNN predictions and the activations of internal layers is still an open research challenge [18], and it is difficult to find a heuristic that generalizes across DNNs. Finally, the Trojan attack paper proposed looking at misclassification results, which might skew toward the infected label. This approach is problematic because the backdoor may affect the classification of normal inputs in unexpected ways and may not exhibit a consistent trend across the DNN. In fact, our experiments found that this approach fails to detect the backdoor in one of our infected models (GTSRB).

C. Defense intuition and overview

Next, we describe the high-level idea of detecting and identifying backdoors in DNN.

Key intuition. The intuition behind our technique comes from the basic characteristic of a backdoor trigger: no matter which label a normal input belongs to, the trigger produces a classification to the target label A. Think of the classification problem as creating partitions in a multi-dimensional space, with each dimension capturing some features. The backdoor trigger then creates a "shortcut" from the regions of the space belonging to other labels into the region belonging to A.

Figure 2 illustrates this concept with an abstraction. It shows a simplified one-dimensional classification problem with three labels (label A for circles, label B for triangles, and label C for squares). The figure shows the positions of their samples in the input space and the decision boundaries of the model. The infected model shows the same space, where the trigger causes classification as A. The trigger effectively creates another dimension in the regions belonging to B and C. Any input containing the trigger has a higher value in the trigger dimension (the gray circles in the infected model) and is classified as A, even though it would otherwise be classified as B or C based on its other features.

The basic characteristic of a backdoor trigger: no matter which label a normal input belongs to, it produces a classification result of the target label A.
Key intuition: Think of the classification problem as creating partitions in a multi-dimensional space, with each dimension capturing some features. The backdoor trigger then creates a "shortcut" from the regions of the space belonging to other labels into the region belonging to A.

Intuitively, we detect these shortcuts by measuring the minimum amount of perturbation required to move all inputs from each region into the target region. In other words, what is the minimum delta needed to transform any input whose label is B or C into an input classified as A? In a region with a trigger shortcut, no matter where an input lies in the space, the amount of perturbation needed to classify it as A is bounded by the size of the trigger (the trigger itself should be reasonably small to avoid detection). The infected model in Figure 2 shows a new boundary along the "trigger dimension" such that any input in B or C can be moved a short distance and be misclassified as A. This leads to the following observation about backdoor triggers.

Observation 1: Let L denote the set of output labels in the DNN model. Consider a label Li ∈ L and a target label Lt ∈ L, with i ≠ t. If there exists a trigger (Tt) that causes misclassification into Lt, then the minimum perturbation needed to transform all inputs whose correct label is Li so that they are classified as Lt is bounded by the size of the trigger:

δi→t ≤ |Tt|

Since a trigger is meant to be effective when added to any input, a fully trained trigger effectively adds this extra trigger dimension to all inputs of the model, regardless of their true label. So we have:

δ∀→t ≤ |Tt|

where δ∀→t represents the minimum amount of perturbation required to make any input be classified as Lt. To evade detection, this perturbation should be small; in particular, it should be significantly smaller than the perturbation required to transform any input into an uninfected label.

Observation 2: If a backdoor trigger Tt exists, then:

δ∀→t ≤ |Tt| ≪ min_{i≠t} δ∀→i    (1)

Therefore, the trigger Tt can be detected by finding an abnormally low value of δ∀→t among all output labels. We note that an under-trained trigger may not affect all output labels effectively, and an attacker may deliberately restrict a backdoor trigger to only certain types of input (possibly as a countermeasure against detection); we discuss this case and provide a solution in Section VII.

Detecting backdoors. The main intuition of our backdoor detection is that in an infected model, it takes a much smaller modification to cause misclassification into the target label than into other, uninfected labels (see Equation 1). Therefore, we iterate over all labels of the model and determine whether any label requires a significantly smaller modification to achieve misclassification. The whole system consists of the following three steps.

Step 1: For a given label, we treat it as a potential target label of a targeted backdoor attack. We design an optimization scheme to find the "minimal" trigger needed to misclassify samples from all other labels into this label. In the vision domain, this trigger defines the smallest set of pixels and their associated color intensities that cause the misclassification.
Step 2: Repeat Step 1 for every output label in the model. For a model with N = |L| labels, this produces N potential "triggers".
Step 3: After computing the N potential triggers, we measure the size of each trigger by the number of pixels it contains, i.e., the number of pixels it replaces. We then run an outlier detection algorithm to check whether any candidate trigger is significantly smaller than the others. A significant outlier represents a real trigger, and its associated label is the target label of the backdoor attack. A sketch of this loop is given below.
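The loop below is a minimal Python sketch of these three steps, not the authors' implementation. The helpers `reverse_engineer_trigger` (the optimization of Section IV, sketched later) and `anomaly_indices` (the MAD-based outlier test, also sketched later) are hypothetical names introduced here for illustration.

```python
import numpy as np

def detect_backdoor(model, clean_images, labels, anomaly_threshold=2.0):
    """Three-step detection loop (a sketch, not the authors' implementation)."""
    l1_norms, triggers = [], []
    for target in labels:                                   # Step 2: repeat Step 1 for every label
        mask, pattern = reverse_engineer_trigger(model, target, clean_images)  # Step 1
        triggers.append((mask, pattern))
        l1_norms.append(np.abs(mask).sum())                 # trigger size = L1 norm of the mask

    scores = anomaly_indices(np.array(l1_norms))            # Step 3: MAD-based outlier test
    median = np.median(l1_norms)
    infected = [lab for lab, s, n in zip(labels, scores, l1_norms)
                if s > anomaly_threshold and n < median]    # only the small end of the distribution
    return infected, triggers
```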

Identifying backdoor triggers. The three steps above tell us whether there is a backdoor in the model and, if so, the target label. Step 1 also produces the trigger responsible for the backdoor, which effectively misclassifies samples of other labels into the target label. We consider this trigger the "reverse-engineered trigger" (reversed trigger for short). Note that our method looks for the minimal trigger needed to induce the backdoor, which may actually look slightly smaller than the trigger the attacker trained into the model. We compare the visual similarity between the two in Section V-C.

Mitigating backdoors. The reverse-engineered trigger helps us understand how the backdoor misclassifies samples inside the model, e.g., which neurons are activated by the trigger. We use this knowledge to build a proactive filter that detects and filters out adversarial inputs that activate backdoor-related neurons. We also design two methods to remove backdoor-related neurons/weights from the infected model and patch it so that it is robust to adversarial images. We discuss the detailed mitigation methods and related experimental results in Section VI.

IV. Detailed detection methods

Next, the technical details of the detection and reverse engineering of the trigger will be described. We first describe the process of trigger reverse engineering, which is used as the first step of detection to find the minimum trigger for each tag.

Reverse engineering triggers.

First, the general form of trigger injection is defined:
A(x, m, Δ) = x′,  where x′i,j,c = (1 − mi,j) · xi,j,c + mi,j · Δi,j,c    (2)

A(·) represents the function that applies a trigger to the original image x. Δ is the trigger pattern, a 3-D matrix (height, width, and color channels) of pixel color intensities with the same dimensions as the input image. m is a 2-D mask matrix that decides how much of the original image the trigger overwrites. The mask is two-dimensional (height, width), and the same mask value is applied to all color channels of a pixel. Values in the mask range from 0 to 1. When mi,j = 1 for a particular pixel (i, j), the trigger completely overwrites the original color (x′i,j,c = Δi,j,c); when mi,j = 0, the original color is not modified at all (x′i,j,c = xi,j,c). Previous attacks only used binary mask values (0 or 1), so they also fit this general form. This continuous form of the mask makes it differentiable and helps integrate it into the optimization objective.
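As a concrete illustration, here is a minimal NumPy sketch of the trigger-application function A(x, m, Δ); it assumes images are H×W×C arrays and the mask is an H×W array with values in [0, 1]. The function name is ours, not from the original code.

```python
import numpy as np

def apply_trigger(x, mask, pattern):
    """A(x, m, Delta): blend the trigger pattern into image x (Equation 2).

    x       : H x W x C clean image
    mask    : H x W array with values in [0, 1] (1 = trigger fully overwrites the pixel)
    pattern : H x W x C trigger pattern (Delta)
    """
    m = mask[..., None]              # apply the same mask value to every color channel
    return (1.0 - m) * np.asarray(x, dtype=float) + m * pattern
```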

The optimization has two objectives. For the target label (yt) under analysis, the first objective is to find a trigger (m, Δ) that misclassifies clean images as yt. The second objective is to find a "concise" trigger, i.e., one that modifies only a limited portion of the image. We use the L1 norm of the mask m to measure the size of the trigger. We express this as a multi-objective optimization task by optimizing a weighted sum of the two objectives, forming the following formula.
min_{m,Δ}  ℓ(yt, f(A(x, m, Δ))) + λ · |m|,  for x ∈ X    (3)

f(·) is the DNN's prediction function; ℓ(·) is the loss function measuring classification error, which is cross entropy in our experiments; and λ is the weight of the second objective. A smaller λ gives less weight to controlling the size of the trigger, but yields a higher misclassification success rate. In our experiments, the optimization process dynamically adjusts λ to ensure that more than 99% of clean images are successfully misclassified. We use the Adam optimizer [19] to solve the above optimization problem.

X is a set of clean images that we use to solve the optimization task. It comes from a clean data set that users can access. In the experiment, the training set is used and input into the optimization process until convergence. Alternatively, the user can sample a small part of the test set.
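Below is a minimal PyTorch-style sketch of the optimization in Equation 3, assuming `f` is the trained classifier returning logits and `loader` yields batches of clean images from X. Squashing the raw variables with a sigmoid to keep the mask and pattern in a valid range is an illustrative parameterization, not necessarily the authors' exact one, and the dynamic adjustment of λ is omitted.

```python
import itertools
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(f, target_label, loader, steps=1000, lam=0.01,
                             img_shape=(3, 32, 32), device="cpu"):
    """Sketch of the optimization in Equation 3 (illustrative, not the authors' code)."""
    c, h, w = img_shape
    mask_raw = torch.zeros(h, w, device=device, requires_grad=True)
    pattern_raw = torch.zeros(c, h, w, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask_raw, pattern_raw], lr=0.1)

    for x, _ in itertools.islice(itertools.cycle(loader), steps):
        x = x.to(device)
        m = torch.sigmoid(mask_raw)            # keep mask values in [0, 1]
        delta = torch.sigmoid(pattern_raw)     # keep pattern in a valid pixel range
        x_adv = (1 - m) * x + m * delta        # A(x, m, Delta) from Equation 2
        y_t = torch.full((x.size(0),), target_label,
                         dtype=torch.long, device=device)
        loss = F.cross_entropy(f(x_adv), y_t) + lam * m.abs().sum()   # Equation 3
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_raw).detach(), torch.sigmoid(pattern_raw).detach()
```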

Detecting backdoors via outlier detection.

Using this optimization method, we obtain the reverse-engineered trigger and its L1 norm for each target label. We then identify triggers (and their associated labels) that show up as outliers with small L1 norms in the distribution. This corresponds to Step 3 of the detection process.

To detect outliers, we use a technique based on the median absolute deviation, which is resilient in the presence of multiple outliers [20]. It first computes the absolute deviation between each data point and the median; the median of these absolute deviations is called MAD and provides a robust measure of the dispersion of the distribution. The anomaly index of a data point is then defined as its absolute deviation divided by the MAD. Assuming the underlying distribution is normal, a constant estimator (1.4826) is applied to normalize the anomaly index. Any data point with an anomaly index greater than 2 has a greater than 95% probability of being an anomaly. We mark any label with an anomaly index greater than 2 as an outlier and infected, and we focus only on outliers at the small end of the distribution (labels with low L1 norms are more vulnerable).
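A minimal NumPy sketch of this anomaly index (the `anomaly_indices` helper referenced in the detection loop above) could look as follows.

```python
import numpy as np

def anomaly_indices(l1_norms):
    """MAD-based anomaly index of each trigger's L1 norm (a sketch)."""
    norms = np.asarray(l1_norms, dtype=float)
    median = np.median(norms)
    mad = 1.4826 * np.median(np.abs(norms - median))   # consistency constant for normality
    return np.abs(norms - median) / mad

# Example: a label whose reversed trigger is far smaller than the rest stands out.
print(anomaly_indices([95, 102, 99, 110, 12]))   # last index is well above the threshold of 2
```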

Detecting backdoors in models with a large number of labels.

For a DNN with a large number of labels, detection could incur a computational cost proportional to the number of labels. For example, in the YouTube Face recognition model with 1,283 labels [22], our detection method takes on average 14.6 seconds per label, for a total cost of about 5.2 hours on an Nvidia Titan X GPU. The time can be reduced by a constant factor if the process is parallelized across multiple GPUs, but the overall computation is still a burden for resource-constrained users.

Instead, we propose a low-cost detection scheme for large models. We observe that the optimization process (Equation 3) finds an approximate solution within the first few iterations of gradient descent and uses the remaining iterations to fine-tune the trigger. Therefore, we terminate the optimization process early to narrow the candidates down to a small set of potentially infected labels. We then concentrate resources on fully optimizing these suspicious labels, and also fully optimize a small random set of labels to estimate the MAD value (the dispersion of the L1 norm distribution). This modification greatly reduces the number of labels that need to be fully analyzed (most labels are ignored), thereby greatly reducing computation time. A small sketch of the early-termination check is given below.
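The check might be sketched as follows; `l1_history` is a hypothetical list holding, for each iteration, a mapping from label to the current L1 norm of its candidate mask, and the constants mirror the setup described later for YouTube Face (top 100 labels, 10-iteration window, overlap of at least 50).

```python
def should_terminate(l1_history, k=100, window=10, min_overlap=50):
    """Early-termination check for the low-cost scheme (a sketch).

    l1_history: one dict per iteration mapping label -> current L1 norm of its mask.
    """
    if len(l1_history) < window:
        return False
    # Top-k labels of each recent iteration, ranked by ascending L1 norm.
    top_sets = [set(sorted(h, key=h.get)[:k]) for h in l1_history[-window:]]
    return len(set.intersection(*top_sets)) >= min_overlap
```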

V. Experimental verification of backdoor detection and trigger recognition

In this section, we describe the evaluation of the defensive techniques in this article in multiple classification application areas to resist BadNets and Trojan horse attacks.

A. Experimental setup

For BadNets evaluation, this paper uses four experimental tasks and injects backdoors into their data sets, including:

(1) Handwritten digit recognition (MNIST)
(2) Traffic sign recognition (GTSRB)
(3) Face recognition with a large number of tags (YouTube Face)
(4) Face recognition based on complex models (PubFig)

For the evaluation of Trojan horse attacks, this article uses two infected face recognition models. These two models were used in the original work and shared by the author, namely:

Trojan Square
Trojan Watermark

The details of each task and related data sets are described below. Table I includes a short summary. In order to be more concise, we have included more detailed information about the training configuration in Appendix Table VI, and detailed their model architectures in Tables VII, VIII, IX, and X.

Handwritten digit recognition (MNIST)
This task is commonly used to assess the vulnerability of DNNs. The goal is to recognize the 10 handwritten digits (0-9) in grayscale images [23]. The dataset contains 60K training images and 10K test images. The model used is a standard 4-layer convolutional neural network (see Table VII). This model was also evaluated in the BadNets work.
Traffic sign recognition (GTSRB)
This task is also commonly used to evaluate DNN attacks. Its task is to identify 43 different traffic signs and simulate the application scenarios of self-driving cars. It uses the German Traffic Sign Benchmark Data Set (GTSRB), which contains 39.2K color training images and 12.6K test images [24]. The model consists of 6 convolutional layers and 2 fully connected layers (see Table VIII).
Face Recognition (YouTube Face)
This task simulates a security screening scenario via face recognition, in which we try to recognize the faces of 1,283 different people. The large size of the label set increases the computational complexity of the detection scheme and makes it a good candidate for evaluating the low-cost detection method. It uses the YouTube Face dataset, which contains images extracted from YouTube videos of different people [22]. We apply the preprocessing used in prior work to obtain a dataset with 1,283 labels, 375.6K training images, and 64.2K test images [17]. Following prior work, we choose the 8-layer DeepID architecture [17].
Face recognition (PubFig)
This task is similar to YouTube Face and recognizes the faces of 65 people. The dataset includes 5,850 color training images with a resolution of 224×224 and 650 test images [26]. The limited size of the training data makes it hard to train a model for such a complex task from scratch. Therefore, we use transfer learning with a 16-layer VGG teacher model (Table X) and fine-tune the last 4 layers of the teacher model on our training set. This task helps evaluate BadNets attacks on a large, complex model (16 layers).
Face recognition based on Trojan horse attacks (Trojan Square and Trojan Watermark)
Both models are derived from the VGG-Face model (16 layers), which is trained to recognize the faces of 2,622 people [27], [28]. Similar to YouTube Face, these models require the low-cost detection scheme because of the large number of labels. Note that the two models are identical when uninfected but differ in the injected backdoor (discussed below). The original dataset contains 2.6 million images. Since the authors did not specify the exact split between training and test sets, we randomly select a subset of 10K images as the test set for the following experiments.

BadNets attack configuration. We follow the attack methodology proposed by BadNets [12] to inject a backdoor during training. For each application domain we test, a target label is chosen at random, and the training data is modified by injecting a portion of adversarial inputs labeled with the target label. Adversarial inputs are generated by applying the trigger to clean images. For a given task and dataset, the proportion of adversarial inputs in training is varied so that the attack success rate reaches more than 95% while maintaining high classification accuracy; this ratio ranges from 10% to 20%. The modified training data is then used to train the DNN model until it converges.

The trigger is a white square located in the bottom-right corner of the image, chosen so that it does not cover any important part of the image, such as faces or signs. The shape and color of the trigger are chosen to ensure it is unique and does not occur naturally in any input image. To make the trigger inconspicuous, we limit its size to roughly 1% of the entire image: 4×4 for MNIST and GTSRB, 5×5 for YouTube Face, and 24×24 for PubFig. Examples of triggers and adversarial images are shown in the appendix (Figure 20). A sketch of this poisoning step is shown below.
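For illustration, a minimal sketch of this BadNets-style poisoning step is given below, assuming images are stored as a NumPy array with pixel values in [0, 255]; the function name and parameters are ours, not from the original code.

```python
import numpy as np

def poison_dataset(images, labels, target_label,
                   poison_frac=0.1, trigger_size=4, seed=0):
    """BadNets-style poisoning (a sketch): stamp a white square, relabel to the target."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_frac * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i, -trigger_size:, -trigger_size:, ...] = 255   # white square, bottom-right corner
        labels[i] = target_label                               # relabel as the target label
    return images, labels
```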

To measure the performance of backdoor injection, we compute the classification accuracy on the test data as well as the attack success rate when the trigger is applied to test images. The "attack success rate" measures the percentage of adversarial images classified as the target label. As a baseline, we also measure the classification accuracy of a clean version of each model (i.e., trained on a clean dataset with the same training configuration). Table II reports the final performance of each attack on the four tasks. The attack success rate of all backdoor attacks is above 97%, with little effect on classification accuracy; the largest drop in classification accuracy is 2.62%, on PubFig.

Trojan attack configuration. We directly use the infected Trojan Square and Trojan Watermark models shared by the authors of the Trojan attack work [13]. The trigger used in Trojan Square is a square in the bottom-right corner whose size is 7% of the entire image. Trojan Watermark uses a trigger composed of text and a symbol, resembling a watermark, whose size is also 7% of the entire image. The attack success rates of these two backdoors are 99.9% and 97.6%, respectively.

B. Detection performance

Following the method in Section IV, we check whether infected DNNs can be detected. Figure 3 shows the anomaly index for all six infected models and their matching original clean models, covering both BadNets and Trojan attacks. The anomaly index of all infected models is greater than 3, indicating a greater than 99.7% probability of infection (the previously defined anomaly index threshold for infection is 2, Section IV). Meanwhile, the anomaly index of all clean models is below 2, which means our outlier detection method correctly marks them as clean.

To show where the infected label sits in the L1 norm distribution, Figure 4 plots the distribution of uninfected and infected labels. For the distribution of uninfected labels, we plot the minimum and maximum values of the L1 norm, the 25/75 quartiles, and the median. Note that only one label is infected, so a single L1 norm data point represents the infected label. Compared with the "distribution" of uninfected labels, the infected label is always far below the median and much smaller than the minimum value of the uninfected labels. This further validates our intuition that the L1 norm of the trigger needed to attack the infected label is smaller than that needed for uninfected labels.

Finally, our method can also determine which labels are infected. Simply put, any label with an anomaly index greater than 2 is marked as infected. In most models, i.e., MNIST, GTSRB, PubFig, and Trojan Watermark, only the infected label is marked as adversarial, with no false positives. However, on YouTube Face and Trojan Square, in addition to the infected label, 23 and 1 uninfected labels, respectively, are incorrectly marked as adversarial. In practice this is not problematic. First, these false-positive labels are identified because they are more vulnerable to attack than other labels, and this information is useful to the model user. Second, in later experiments (Section VI-C), our mitigation techniques patch all vulnerable labels without affecting the model's classification performance.

Low-cost detection performance. Figures 3 and 4 show the results of the previous experiments, where the low-cost detection scheme is used for Trojan Square, Trojan Watermark, and the clean VGG-Face model (each with 2,622 labels). To better measure the computational savings and detection performance of the low-cost method, we take YouTube Face as an example.

We first describe the low-cost detection setup for YouTube Face in more detail. To identify a small set of potentially infected candidates, we track the top 100 labels in each iteration, ranked by L1 norm (i.e., labels with smaller L1 norms rank higher). The red curve in Figure 5 shows how the top 100 labels change across iterations, measured by the overlap of the sets in consecutive iterations. After the first 10 iterations, the set overlap mostly stabilizes, fluctuating around 80. This means that after a few iterations we can select the top 100 labels, run the full optimization on them, and ignore the remaining labels. To be more conservative, we terminate when the number of overlapping labels stays larger than 50 for 10 iterations. How accurate is this early-termination scheme? Like the full-cost scheme, it correctly marks the infected label, and it produces 9 false positives. The black curve in Figure 5 tracks the rank of the infected label across iterations; the rank stabilizes after 12 iterations, close to our early termination at iteration 10. In addition, the anomaly indices of the low-cost and full-cost schemes are very similar, at 3.92 and 3.91, respectively.

This approach greatly reduces computation time: early termination takes 35 minutes. After termination, we run the full optimization process on the top 100 labels plus another 100 randomly sampled labels to estimate the L1 norm distribution of uninfected labels. This takes another 44 minutes, so the entire process takes 1.3 hours, a 75% reduction compared with the full scheme.

C. Identifying the original trigger

When identifying an infected label, our method also reverse engineers a trigger that causes misclassification into that label. A question here is whether the reverse-engineered trigger "matches" the original trigger, i.e., the trigger used by the attacker. If there is a strong match, the reverse-engineered trigger can be used to design effective mitigation schemes.

We compare the two triggers in three ways.

End-to-end effectiveness
Like the original trigger, the reversed trigger leads to a high attack success rate, in fact higher than the original trigger: all reversed triggers have attack success rates greater than 97.5%, versus greater than 97.0% for the original triggers. This is not surprising, considering how the trigger is inferred by an optimization that maximizes misclassification (Section IV): our detection method effectively identifies the minimal trigger that produces the same misclassification result.
Visual similarity
Figure 6 compares the original and reversed triggers (m·Δ) in the four BadNets models. We find that the reversed trigger is roughly similar to the original: in all cases, the reversed trigger appears at the same location as the original trigger. However, small differences remain. For example, in MNIST and PubFig the reversed trigger is slightly smaller than the original and is missing a few pixels, and in the models using color images the reversed trigger has many non-white pixels. These differences can be attributed to two causes. First, when the model is trained to recognize the trigger, it may not learn its exact shape and color, which means the most "effective" way to trigger the backdoor in the model is not the originally injected trigger but a slightly different form. Second, our optimization objective penalizes larger triggers, so some redundant pixels of the trigger are clipped during optimization, producing a smaller trigger. Combined, the optimization process finds a more "compact" version of the backdoor trigger than the original.

In the two Trojan attack models, the mismatch between the reversed trigger and the original trigger is much more obvious, as shown in Figure 7. In both cases, the reversed trigger appears at a different location of the image and is visually different. It is at least an order of magnitude smaller than the original trigger and much more compact than in the BadNets models. The results show that our optimization scheme finds a far more compact trigger in pixel space that exploits the same backdoor to achieve a similar end-to-end effect. This also highlights the difference between the Trojan attack and BadNets: because the Trojan attack targets specific neurons to connect the input trigger to misclassified outputs, it cannot avoid side effects on other neurons. The result is a broader attack whose effect spans a wider range of possible triggers, of which our reverse engineering identifies the most compact one.

Similarity of neuron activations
We further investigate whether inputs with the reversed trigger and the original trigger have similar neuron activations in internal layers. Specifically, we examine the neurons of the second-to-last layer, since this layer encodes relevant, representative patterns of the input. We identify the neurons most relevant to the backdoor by feeding in clean and adversarial images and observing the differences in neuron activations at the target layer (second to last), and rank the neurons by this activation difference. Empirically, we find that the top 1% of neurons are sufficient to enable the backdoor: if we keep the top 1% of neurons and mask the rest (set them to zero), the attack still works. A minimal sketch of this ranking is shown below.
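In the sketch below, `act_clean` and `act_adv` are assumed to be precomputed penultimate-layer activations for clean inputs and for the same inputs with the (reversed) trigger applied; the helper name is ours, introduced for illustration.

```python
import numpy as np

def rank_backdoor_neurons(act_clean, act_adv, top_frac=0.01):
    """Rank penultimate-layer neurons by clean vs. adversarial activation gap (a sketch).

    act_clean, act_adv : arrays of shape (num_samples, num_neurons)
    Returns the indices of the top 1% most backdoor-related neurons by default.
    """
    gap = act_adv.mean(axis=0) - act_clean.mean(axis=0)   # per-neuron activation difference
    k = max(1, int(top_frac * gap.size))                  # keep the top 1% by default
    return np.argsort(gap)[::-1][:k]                      # largest gap first
```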

We consider neuron activations "similar" if the top 1% of neurons activated by the original trigger are also activated by the reverse-engineered trigger, but not by clean inputs. Table III shows the average activation of the top 1% of neurons over 1,000 randomly selected clean and adversarial images. In all cases, neuron activations on adversarial images are 3 to 7 times higher than on clean images. These experiments show that, when added to an input, both the reversed trigger and the original trigger activate the same backdoor-related neurons. Finally, we use neuron activations as a way to mitigate backdoors in Section VI.

VI. Mitigation of backdoors

When a backdoor is detected, mitigation techniques need to be applied to remove the backdoor while preserving the model's performance. We describe two complementary techniques. First, we create a filter for adversarial inputs that identifies and rejects any input carrying the trigger, buying time to patch the model. Depending on the application, this approach can also be used to assign a "safe" output label to adversarial inputs instead of rejecting them. Second, we patch the DNN so that it no longer responds to the detected backdoor trigger. We describe two patching methods, one based on neuron pruning and one based on unlearning.

A. Filter for detecting adversarial inputs

The experimental results in Section V-C show that neuron activations are a good way to capture the similarity between the original and reverse-engineered triggers. We therefore build a filter based on the neuron activation profile of the reversed trigger, measured as the average activation of the top 1% of neurons in the second-to-last layer. Given an input, the filter flags it as a potential adversarial input if its activation profile is higher than a certain threshold. The activation threshold can be calibrated using clean inputs (inputs known to be trigger-free). We evaluate the filter using clean images from the test set plus adversarial images created by applying the original trigger to test images (at a 1:1 ratio), and compute the false positive rate (FPR) and false negative rate (FNR) at different thresholds on the average neuron activation; the results are shown in Figure 8. At an FPR of 5%, we achieve high filtering performance for the four BadNets models, with FNR values all below 1.63%. Meanwhile, the Trojan attack models are harder to filter, likely due to the difference in neuron activations between the reversed trigger and the original trigger: the FNR is higher when the FPR is below 5%, and is 4.3% and 28.5% at an FPR of 5%. We attribute this to the different injection methods chosen by the Trojan attack and BadNets. A minimal sketch of the filter is given below.
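The sketch assumes the penultimate-layer activations of the input and the neuron indices from the ranking helper above; the threshold calibration shown in the comment is an illustrative choice, not the authors' exact procedure.

```python
import numpy as np

def looks_adversarial(act_penultimate, backdoor_neurons, threshold):
    """Flag one input as potentially adversarial (a sketch of the filter in VI-A)."""
    profile = act_penultimate[backdoor_neurons].mean()   # avg activation of the top-1% neurons
    return profile > threshold                           # True -> reject / quarantine the input

# Calibration on clean inputs (illustrative): pick a high percentile of clean
# profiles so that roughly 5% of clean inputs are falsely rejected (FPR ~ 5%).
# threshold = np.percentile(clean_profiles, 95)
```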

B. Neuron pruning to repair DNN

To actually patch the infected model, we propose two techniques. In the first, we use the reversed trigger to help identify the backdoor-related components of the DNN, e.g., neurons, and remove them. We propose pruning the backdoor-related neurons from the DNN, i.e., setting their output values to 0 during inference. We again target the second-to-last layer and rank the neurons by the difference in their activations between clean inputs and adversarial inputs carrying the reversed trigger. We then prune neurons in order of highest rank first, prioritizing those that show the largest activation gap between clean and adversarial inputs. To minimize the impact on the classification accuracy of clean inputs, we stop pruning when the pruned model no longer responds to the reversed trigger. A minimal pruning sketch is shown below.
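The PyTorch-style sketch below zeroes the outputs of the selected neurons with a forward hook; `layer` is assumed to be the module whose outputs we prune (e.g., the second-to-last layer) and `neuron_indices` would come from the activation-gap ranking above. In practice one would prune in ranked order, growing the pruned set until the reversed trigger no longer works.

```python
import torch

def prune_neurons(layer, neuron_indices):
    """Zero the outputs of the selected neurons at inference time (a sketch)."""
    idx = torch.tensor(neuron_indices, dtype=torch.long)

    def hook(module, inputs, output):
        output = output.clone()
        output[:, idx] = 0.0      # "prune" = force these neurons' outputs to zero
        return output

    # Keep the returned handle; calling handle.remove() undoes the pruning.
    return layer.register_forward_hook(hook)
```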

Figure 9 shows the classification accuracy and attack success rate when pruning different fractions of neurons in GTSRB. Pruning 30% of the neurons reduces the attack success rate to nearly 0%. Note that the attack success rate of the reversed trigger follows a trend similar to that of the original trigger, so it serves as a good proxy for the defense's effectiveness against the original trigger. Meanwhile, classification accuracy drops by only 5.06%. As Figure 9 shows, the defender can trade off the drop in classification accuracy against the residual attack success rate.

Note that in Section V-C we found the top 1% of neurons sufficient to cause misclassification, yet here we must remove nearly 30% of the neurons to effectively mitigate the attack. This can be explained by the large redundancy of neural pathways in DNNs [29]: even after removing the top 1% of neurons, other, lower-ranked neurons can still help trigger the backdoor. Prior work on compressing DNNs has also noted this high redundancy [29].

When applying our scheme to the other BadNets models, we find very similar results for MNIST and PubFig, as shown in Figure 21: pruning 10% to 30% of neurons reduces the attack success rate to 0%. However, we observe that classification accuracy on YouTube Face is affected more strongly (Figure 21): when the attack success rate drops to 1.6%, classification accuracy falls from 97.55% to 81.4%. This is because the second-to-last layer has only 160 output neurons, so clean neurons and adversarial neurons are mixed together and clean neurons get pruned in the process, reducing classification accuracy. We therefore experimented with pruning at multiple layers and found that pruning at the last convolutional layer produces the best results: in all four BadNets models, the attack success rate drops below 1% and classification accuracy drops by less than 0.8%, while pruning at most 8% of the neurons. Figure 22 in the appendix plots these detailed results.

Neuron pruning in the Trojan models. We apply the same pruning method and configuration to the Trojan models, but the pruning effect is poor. As shown in Figure 10, when 30% of the neurons are pruned, the attack success rate of the reverse-engineered trigger drops to 10.1%, but the success rate with the original trigger remains high at 87.3%. This gap stems from the difference in neuron activations between the reversed trigger and the original trigger: if the neuron activations of the reverse-engineered trigger do not match those of the original trigger well, pruning performs poorly against attacks that use the original trigger. In the next subsection, we discuss unlearning experiments against the Trojan attacks, where the results are much better.

Advantages and limitations. An advantage of this method is that it requires very little computation, most of which consists of running inference on clean and adversarial images. However, its performance depends on choosing the right layer to prune neurons from, which requires experimenting with multiple layers, and it is sensitive to how well the reversed trigger matches the original trigger.

C. Patching DNNs via unlearning

Our second mitigation approach is to train the DNN to unlearn the original trigger. We can use the reversed trigger to train the infected DNN to recognize the correct labels even when the trigger is present. Compared with neuron pruning, unlearning lets the model decide through training which weights (not neurons) are problematic and should be updated.

For all models, including the Trojan models, we fine-tune the model with an updated training dataset for only one epoch. To create this new training set, we take a 10% sample of the original training data (clean, with no triggers) and add the reversed trigger to 20% of that sample without modifying the labels. To measure the effectiveness of patching, we measure the attack success rate of the original trigger and the classification accuracy of the fine-tuned model. A sketch of constructing this fine-tuning set is given below.
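The sketch assumes `mask` and `pattern` are the reversed trigger from Section IV and that images are H×W×C arrays; the resulting set would then be used to fine-tune the model for one epoch with its normal training loss. Function name and parameters are ours, for illustration only.

```python
import numpy as np

def build_unlearning_set(images, labels, mask, pattern,
                         sample_frac=0.1, trigger_frac=0.2, seed=0):
    """Build the one-epoch fine-tuning set for unlearning (a sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(sample_frac * len(images)), replace=False)
    x = images[idx].astype(float)                    # 10% clean sample
    y = labels[idx].copy()
    n_trig = int(trigger_frac * len(x))
    m = mask[..., None]                              # broadcast H x W mask over channels
    x[:n_trig] = (1 - m) * x[:n_trig] + m * pattern  # add reversed trigger, keep the labels
    return x, y
```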

Table IV compares the attack success rate and classification accuracy before and after retraining. In all models, the attack success rate drops below 6.70% without significantly affecting classification accuracy; the biggest drop in classification accuracy is on GTSRB, at only 3.6%. In some models, especially the Trojan attack models, classification accuracy actually improves after patching. Note that when the backdoor is injected, the classification accuracy of the Trojan attack models decreases; the classification accuracy of the original uninfected Trojan model is 77.2% (not shown in Table IV), and patching the backdoor recovers part of this loss.

We compare this unlearning approach with two variants. First, we retrain with the same training samples but use the original trigger instead of the reverse-engineered trigger on the 20% modified samples. As shown in Table IV, unlearning with the original trigger achieves a lower attack success rate with similar classification accuracy, so unlearning with the reversed trigger is a good approximation of unlearning with the original. Second, we compare against unlearning using only clean training data with no added triggers. The results in the last column of Table IV show that for all BadNets models this form of unlearning is ineffective: the attack success rate remains high, above 93.37%. But it is effective for the Trojan attack models, with the success rates of Trojan Square and Trojan Watermark dropping to 10.91% and 0%, respectively. This suggests that the Trojan attack models, where the backdoor is injected by highly targeted re-tuning of specific neurons, are more sensitive to unlearning: retraining on clean inputs helps reset the few key neurons and disables the attack. In contrast, BadNets injects the backdoor by updating all layers with a poisoned dataset and seems to require significantly more retraining to mitigate the backdoor. We also examine the impact of patching false-positive labels: patching the incorrectly flagged labels on YouTube Face and Trojan Square (Section V-B) reduces classification accuracy by less than 1%, so the side effect of mitigating false positives is negligible.

Parameters and cost. Our experiments show that unlearning performance is usually insensitive to parameters such as the amount of training data and the ratio of modified training data.

Finally, unlearning has a higher computational cost than neuron pruning, but it is still one to two orders of magnitude cheaper than retraining the model from scratch. Our experimental results show that unlearning clearly provides the best mitigation performance among the alternatives.

VII. Robustness against advanced backdoors

The previous sections described and evaluated detection and mitigation of backdoor attacks under basic assumptions, e.g., a small number of triggers, each prioritizing stealth and targeting the misclassification of arbitrary inputs into a single target label. Here we explore a number of more complex scenarios and experimentally evaluate the effectiveness of our defenses against them where possible.

This article discusses 5 specific types of advanced backdoor attacks, each of which challenges assumptions or limitations in current defense designs.

• Complex triggers. Our detection scheme depends on the success of the optimization process. Would more complex triggers make it harder for the optimization to converge?
• Larger triggers. By increasing the trigger size, an attacker can force the reverse engineering to converge to a larger trigger with a larger norm.
• Multiple infected labels with different triggers. We consider a scenario where multiple backdoors targeting different labels are inserted into a single model, and evaluate the maximum number of infected labels that can still be detected.
• A single infected label with multiple triggers. We consider multiple triggers targeting the same label.
• Source-label-specific (partial) backdoors. Our detection scheme looks for triggers that cause misclassification on arbitrary inputs. "Partial" backdoors that are only effective on inputs from a subset of source labels are harder to detect.

A. Complex trigger patterns

As observed in the Trojan models, the optimization converges with more difficulty for triggers with more complex patterns. A more random trigger pattern may make it harder to reverse engineer the trigger.

We perform a simple test. First, we change the white square trigger to a noisy square, where each pixel of the trigger is assigned a random color. We inject this backdoor into MNIST, GTSRB, YouTube Face, and PubFig, and evaluate detection performance. The resulting anomaly index for each model is shown in Figure 11. Our technique detects the complex trigger pattern in all cases, and we also test our mitigation techniques on these models. For filtering, at an FPR of 5%, the FNR of all models is less than 0.01%. Patching via unlearning reduces the attack success rate to below 4.2%, with at most a 3.1% drop in classification accuracy. Finally, we tested backdoors in GTSRB with different trigger shapes (e.g., triangles, checkerboard patterns), and all detection and mitigation techniques worked as expected.

B. Larger triggers

Larger triggers produce larger reverse-engineered triggers. This may bring the infected label closer to the uninfected labels in L1 norm, making outlier detection less effective. We perform a sample test on GTSRB, increasing the trigger size from 4×4 (1.6% of the image) to 16×16 (25%); all triggers remain white squares. We evaluate detection with the same configuration as in previous experiments. Figure 12 shows the L1 norms of the reversed triggers for infected and uninfected labels. Once the original trigger gets bigger, the reversed trigger also gets bigger, as expected. When the trigger exceeds 14×14, its L1 norm mixes with those of the uninfected labels, and the anomaly index drops below the detection threshold. The anomaly index results are shown in Figure 13.

The maximum detectable trigger size largely depends on one factor: the trigger size for uninfected labels (the amount of change required to misclassify all inputs between uninfected labels). The trigger size for uninfected labels is itself a proxy for how different inputs of different labels are; that is, more labels mean uninfected labels require larger triggers, which in turn gives us greater ability to detect larger triggers. On YouTube Face, we detect triggers as large as 39% of the entire image; on MNIST, which has fewer labels, we can only detect triggers up to about 18% of the image size. Generally, a larger trigger is visually more obvious and easier for humans to recognize, but there may be ways to increase the trigger size without making it more obvious, which we will explore in future work.

C. Multiple infected labels with different triggers

The scenario we consider in this experiment is one where the attacker inserts multiple independent backdoors into a single model, each targeting a different label. Inserting many backdoors may collectively reduce δ∀→t for many Lt in L, making the impact of any single trigger less of an outlier and harder to detect. The trade-off is that a model likely has a "maximum capacity" for learning backdoors while maintaining its classification performance.

We experiment by generating distinct triggers with mutually exclusive color patterns. We find that most models, namely MNIST, GTSRB, and PubFig, have enough capacity to support a trigger for every output label without affecting classification accuracy. But on YouTube Face, with 1,283 labels, once triggers infect more than 15.6% of the labels, the average attack success rate drops significantly, as shown in Figure 14; having too many triggers lowers the average attack success rate, which confirms our earlier conjecture.

We evaluate our defense against multiple distinct backdoors on GTSRB. As shown in Figure 15, once more than 8 labels (18.6%) are infected with backdoors, anomaly detection has difficulty identifying the impact of the triggers. The results show that on MNIST we can detect up to 3 infected labels (30%), on YouTube Face up to 375 (29.2%), and on PubFig up to 24 (36.9%).

Although the outlier detection method fails in this case, the underlying reverse engineering method is still effective: for all infected labels, the correct trigger is successfully reverse engineered. Figure 16 shows the L1 norms of the triggers for infected and uninfected labels; all infected labels have smaller norms than uninfected labels. Further manual analysis verifies that the reversed triggers look visually similar to the original triggers. A conservative defender could manually inspect the reversed triggers and judge the suspiciousness of the model. Subsequent tests showed that pre-emptive "fixes" c

