WEBVTT
00:00.000 --> 00:13.120
Hello, my name is Pouya Bashivan and I'm going to tell you about our paper titled
00:13.120 --> 00:18.720
Adversarial Feature Desensitization. This is joint work with a number of wonderful collaborators
00:18.720 --> 00:24.400
at Mila, the University of Montreal, and McGill University, including Reza Bayat, Adam Ibrahim,
00:24.400 --> 00:32.160
Kartik Ahuja, Mojtaba Faramarzi, Touraj Laleh, Blake Richards, and Irina Rish. A common assumption in
00:32.160 --> 00:36.560
machine learning is that the training and test samples come from the same distribution.
00:37.200 --> 00:42.960
While this is a reasonable assumption under most circumstances, it is intentionally violated in the
00:42.960 --> 00:49.600
regime of adversarial attacks. Adversarial attacks are algorithms that search for slight input
00:49.600 --> 00:55.600
perturbations that cause the input to be misclassified. In the case of white-box attacks,
00:55.600 --> 01:01.600
the model itself is transparent to the attacker, and the attacker uses it to identify possible
01:01.600 --> 01:07.760
inputs that would lead to misclassifications. A famous example of this is the image of a panda
01:07.760 --> 01:13.360
that, when perturbed with imperceptible noise, alters the model's prediction from a panda to a
01:13.360 --> 01:19.840
gibbon. As prior literature has shown, this is a common issue in almost all machine learning methods,
01:19.840 --> 01:25.280
and unless the classifier is specifically trained to be robust against these attacks,
01:25.280 --> 01:28.720
the attacks can completely break down the classifier's performance.
01:30.240 --> 01:35.600
This issue becomes even more critical when we consider the widespread use of these machine learning
01:35.600 --> 01:41.040
systems in our societies, for example, the possible security concerns that arise in face
01:41.040 --> 01:46.720
recognition systems prone to adversarial attacks, or the safety concerns in autonomous driving systems.
01:48.080 --> 01:54.000
So what is an adversarial attack? To formally define adversarial attacks, let's assume a
01:54.000 --> 02:00.080
feature learning function f that projects inputs x to a latent or feature space z,
02:01.600 --> 02:08.720
and a classifier c that uses the latent code z to predict the correct class label y hat.
02:08.720 --> 02:14.480
The perturbation function, or the attack, generates a perturbed sample x prime
02:14.480 --> 02:21.520
within the epsilon neighborhood of the input x, which we're showing here as B(x, epsilon),
02:22.160 --> 02:28.880
by maximizing the classification objective, the opposite of how we normally optimize the classifier's
02:28.880 --> 02:36.720
parameters. Many methods have been proposed to defend models against adversarial attacks.
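To make this concrete before turning to the defenses, the attack just described can be written as the following optimization. This is a sketch in the talk's notation; L denotes the classification loss, a symbol the talk does not name explicitly.

```latex
% Adversarial attack: worst-case perturbation within the epsilon-ball around x.
% f is the feature learning function, c the classifier, y the correct label, L the classification loss.
\[
x' \;=\; \arg\max_{\tilde{x} \,\in\, B(x,\epsilon)} L\big(c(f(\tilde{x})),\, y\big),
\qquad
B(x,\epsilon) \;=\; \{\, \tilde{x} : \|\tilde{x} - x\|_p \le \epsilon \,\}.
\]
```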
02:36.720 --> 02:42.640
Two of these methods that have withstood the test of time so far are adversarial training
02:43.200 --> 02:50.160
by Madry et al, which proposes a defense method based on solving a minimax optimization problem
02:50.160 --> 02:56.000
that involves finding an adversarial input by maximizing the classification loss in the inner
02:56.000 --> 03:03.840
loop, followed by training the classifier to minimize the classification loss on these adversarial inputs.
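In equation form, the minimax objective of adversarial training just described can be sketched as follows, where theta stands for all trainable parameters of f and c (our shorthand, not the talk's).

```latex
% Adversarial training (Madry et al.): minimize the worst-case loss inside the epsilon-ball.
\[
\min_{\theta} \;\; \mathbb{E}_{(x,y)\sim D}
\Big[\, \max_{x' \in B(x,\epsilon)} L\big(c(f(x')),\, y\big) \Big].
\]
```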
03:03.840 --> 03:09.920
This procedure is graphically shown for two hypothetical classes in the diagram on this slide.
03:10.560 --> 03:15.440
The adversarial training method essentially learns to separate the distributions of adversarial
03:15.440 --> 03:22.400
examples belonging to different classes. The second method is the TRADES method by Zhang et al,
03:22.400 --> 03:27.440
which proposes to push the decision boundary of the classifier away from the data.
03:27.440 --> 03:32.480
TRADES achieves this by introducing a regularization term to the original learning
03:32.480 --> 03:38.320
objective for classification that penalizes the mismatch between the predicted labels
03:38.320 --> 03:44.400
for the clean and perturbed inputs. The diagram on the right side again graphically illustrates
03:44.400 --> 03:50.000
this procedure, where now the defense method learns to separate the distributions of clean examples
03:50.000 --> 03:54.400
belonging to different classes while minimizing the loss of the classifier.
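A sketch of the TRADES objective as just described, in the same notation. Beta is the regularization weight (the original paper writes it differently, e.g. as 1/lambda), and the mismatch term is commonly implemented as a KL divergence between the clean and perturbed predictions.

```latex
% TRADES (Zhang et al.): clean classification loss plus a term penalizing the mismatch
% between predictions on clean and perturbed inputs, weighted by beta.
\[
\min_{\theta} \;\; \mathbb{E}_{(x,y)\sim D}
\Big[\, L\big(c(f(x)),\, y\big)
\;+\; \beta \max_{x' \in B(x,\epsilon)} \mathrm{KL}\big(c(f(x)) \,\|\, c(f(x'))\big) \Big].
\]
```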
03:54.400 --> 04:45.600
Our method approaches this problem from the perspective of domain adaptation. In domain adaptation, we assume that a classifier is trained
04:45.600 --> 04:52.160
for a source domain, but we want the classifier to also perform the same task on a related target
04:52.160 --> 05:00.960
domain that we might not have enough data for, or for which the procedure for generating samples
05:00.960 --> 05:09.440
might be expensive. The domain adaptation theory proposed by Ben-David et al answers the
05:09.440 --> 05:15.840
question of under what conditions we can adapt a classifier trained on the source domain for use
05:15.840 --> 05:23.920
in the target domain. Here we consider the original clean distributions as the source domain and the
05:23.920 --> 05:31.280
distribution of adversarial images generated from those images as the target domain, although here
05:31.280 --> 05:38.240
the target domain continuously evolves, because the adversarial examples are based on the current
05:38.240 --> 05:46.000
state of the model at each time step. And similar to the domain adaptation setting, our goal here
05:46.000 --> 05:52.960
is to learn how to perform well on both the source and target domains, meaning the natural and
05:52.960 --> 06:02.240
adversarial domains. Now, before I tell you about our proposed method, let's dive a bit deeper into
06:02.240 --> 06:08.960
what the domain adaptation theory from Ben-David et al states. Similar to before, let's assume a
06:08.960 --> 06:14.880
feature learning function f that projects inputs x to a latent or feature space z, and a
06:14.880 --> 06:23.040
classifier that predicts the correct label, y hat, from those latent codes. Now consider the natural
06:23.040 --> 06:31.440
and adversarial examples as input domains D_x and D'_x, and their induced feature distributions,
06:31.440 --> 06:42.560
obtained by passing them through the function f, as D_z and D'_z. Also consider epsilon_z and epsilon'_z
06:42.560 --> 06:50.320
as the classification errors over the domains D_z and D'_z, which we are going to refer to as the
06:50.320 --> 06:58.880
clean error and the adversarial error. The domain adaptation theory now gives a bound
06:58.880 --> 07:04.320
on the adversarial error in terms of the natural error and the distance between the two domains.
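Written out, the bound from Ben-David et al takes the familiar form below; lambda denotes the error of the best joint hypothesis on both domains, a term the talk does not go into.

```latex
% Domain adaptation bound: the adversarial (target) error is bounded by the natural (source)
% error plus the H-delta-H divergence between the induced feature distributions.
\[
\epsilon'_z(h) \;\le\; \epsilon_z(h)
\;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}\big(D_z,\, D'_z\big) \;+\; \lambda.
\]
```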
07:05.120 --> 07:11.680
Fortunately, from prior work, we know that the H-delta-H distance, which measures the distance
07:11.680 --> 07:17.440
between two domains, can be estimated using a classifier trained to discriminate between the
07:17.440 --> 07:26.080
two domains. Now, our defense method, called adversarial feature desensitization, essentially
07:26.080 --> 07:34.720
minimizes the bound on the adversarial error epsilon'_z using a three-step procedure, which
07:34.720 --> 07:40.560
has some conceptual similarities with prior work on adversarial domain adaptation from Ganin et al.
07:42.240 --> 07:49.280
For this, we first update the parameters theta and phi of the feature learning function f and the
07:49.280 --> 07:56.320
task classifier c to minimize the classification loss on the natural domain. This is shown with
07:56.320 --> 08:01.920
green arrows and green boxes marked 1 on both the equation and the diagram.
08:04.000 --> 08:10.400
Secondly, we estimate the H-delta-H distance using an additional domain discriminator
08:10.960 --> 08:17.600
network that predicts the domain identity from the latent code z. We update the domain
08:17.600 --> 08:24.720
discriminator parameters psi to minimize the domain classification loss. And finally,
08:24.720 --> 08:31.680
in the third step, we update the feature learning network parameters theta to maximize the domain
08:31.680 --> 08:39.600
classification loss in an adversarial way. These two steps are marked with red arrows in the figure
08:39.600 --> 08:48.960
and red boxes on the equation. Similar to the previous two methods that I showed you, adversarial training
08:48.960 --> 08:55.760
and TRADES, we can also graphically demonstrate this procedure. In our method, AFD,
08:55.760 --> 09:01.040
we learn to separate the classes using the distributions of clean examples, while at the
09:01.040 --> 09:07.840
same time we optimize a domain classifier that learns the boundary between the clean and adversarial
09:07.840 --> 09:14.560
examples for each class. And finally, we push the adversarial examples to the opposite side of that
09:14.560 --> 09:22.400
boundary. This procedure implicitly desensitizes the learned features to adversarial perturbations,
09:22.400 --> 09:30.480
and hence the name adversarial feature desensitization.
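As a minimal sketch of these three steps, one AFD-style update could look roughly like the following PyTorch-style code. The names (f, c, d, the optimizers, make_adversarial) are placeholders of ours, not the authors' code, the discriminator d is assumed to output two logits per sample, and plain cross-entropy losses stand in for the GAN or Wasserstein losses used in the actual method.

```python
import torch
import torch.nn.functional as F

def afd_step(f, c, d, opt_fc, opt_d, opt_f, x, y, make_adversarial):
    """One hypothetical AFD update following the three steps described in the talk.
    f: feature extractor (theta), c: task classifier (phi), d: domain discriminator (psi)."""
    # Step 1 (green, box 1): update f and c to minimize the task loss on natural inputs.
    task_loss = F.cross_entropy(c(f(x)), y)
    opt_fc.zero_grad()
    task_loss.backward()
    opt_fc.step()

    # Generate adversarial examples against the current state of the model.
    x_adv = make_adversarial(x, y)  # assumed to return x' within B(x, epsilon)

    # Step 2 (red): train d to tell clean features (label 0) from adversarial features (label 1).
    z_clean, z_adv = f(x).detach(), f(x_adv).detach()
    dom_labels = torch.cat([torch.zeros(len(x)), torch.ones(len(x_adv))]).long().to(x.device)
    d_loss = F.cross_entropy(torch.cat([d(z_clean), d(z_adv)]), dom_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Step 3 (red): update f to *maximize* the domain loss so that adversarial features
    # become indistinguishable from clean ones -- the desensitization step.
    adv_labels = torch.ones(len(x_adv), dtype=torch.long, device=x.device)
    f_loss = -F.cross_entropy(d(f(x_adv)), adv_labels)
    opt_f.zero_grad()
    f_loss.backward()
    opt_f.step()

    return task_loss.item(), d_loss.item(), f_loss.item()
```

In the paper itself, steps 2 and 3 use GAN-style or Wasserstein losses (the two AFD variants mentioned next), but the order of updates is the same.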
09:30.480 --> 09:35.840
We tested our method on four data sets and compared it with a number of other baselines, including adversarial training and
09:35.840 --> 09:43.760
TRADES. We made two versions of our method, called AFD-DCGAN, which uses the adversarial losses from
09:43.760 --> 09:50.880
Goodfellow et al, and AFD-WGAN, which uses the Wasserstein losses from Arjovsky et al.
09:52.000 --> 09:57.840
In the table, we evaluated all methods on several white-box and black-box attacks with
09:57.840 --> 10:07.360
nominal strengths for each data set. Overall, our method AFD, and especially AFD-WGAN, showed superior
10:07.360 --> 10:15.200
performance against most attacks on most data sets. However, AFD was behind TRADES on several attacks,
10:15.200 --> 10:20.720
especially on the CIFAR-100 and Tiny ImageNet data sets, which have more classes.
10:20.720 --> 10:26.080
We also looked into different attack methods and attack strengths, which we controlled with the parameter
10:26.080 --> 10:32.800
epsilon. The diagrams on the right show the robust accuracy for each defense method across
10:32.800 --> 10:41.200
eight attack methods and various epsilon values for each of them. Overall, our results in these
10:41.200 --> 10:48.240
diagrams showed that AFD's robustness generalizes better than the baselines across attacks and
10:48.240 --> 10:55.200
across attack strengths. To quantify these differences, we also computed the area under
10:55.200 --> 11:00.000
the curve for each method and each attack, and summarized them in the table on the left.
11:00.880 --> 11:06.800
As you can see, AFD's robust performance generalizes better to unseen and stronger attacks
11:06.800 --> 11:15.680
compared to the other baselines. If you remember from the previous slides, the domain adaptation theory
11:15.680 --> 11:22.400
predicted a bound on the adversarial error, which can also be turned into a bound on the generalization
11:22.400 --> 11:30.320
gap between natural and adversarial performance. We empirically tested this prediction in our experiments
11:30.320 --> 11:37.600
under two settings. Under the first setting, we varied the epsilon value for the PGD L-infinity
11:37.600 --> 11:45.600
attack which was used during the training. And under the second setting, we used a diverse
11:45.600 --> 11:51.120
set of attacks and various attack strengths for each of them.
11:52.000 --> 11:58.480
And under both scenarios, we found that the domain discriminator, which was originally trained on a
11:58.480 --> 12:05.280
particular attack and attack strength, in our case a PGD L-infinity attack with a fixed epsilon
12:05.280 --> 12:10.960
for each data set, could well predict the generalization gap to unseen attacks and
12:10.960 --> 12:18.000
different attack magnitudes. This suggests that adversarial training against a domain classifier,
12:18.000 --> 12:24.000
like that used in our proposed method, could potentially lead to robust models with better
12:24.000 --> 12:33.520
generalization capacity.
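One way to picture this check empirically, as a hypothetical illustration rather than the paper's exact analysis: evaluate the trained domain discriminator on held-out clean and adversarial features and turn its error into a proxy distance, which the bound above relates to the clean-versus-adversarial gap. All names below (f, d, loader, attack) are placeholders.

```python
import torch

def discriminator_proxy_distance(f, d, loader, attack, device="cuda"):
    """Turn the domain discriminator's error on clean vs. adversarial features into a
    proxy for the H-delta-H distance; larger values mean the two feature domains are
    easier to separate, which the bound links to a larger natural-vs-adversarial gap."""
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = attack(x, y)  # assumed: returns perturbed inputs within B(x, epsilon)
        with torch.no_grad():  # evaluation only; the attack above may need gradients itself
            feats = torch.cat([f(x), f(x_adv)])
            labels = torch.cat([torch.zeros(len(x)), torch.ones(len(x_adv))]).long().to(device)
            correct += (d(feats).argmax(dim=1) == labels).sum().item()
            total += len(labels)
    domain_error = 1.0 - correct / total
    # Proxy A-distance in the style of Ben-David et al.: 2 * (1 - 2 * error).
    return 2.0 * (1.0 - 2.0 * domain_error)
```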
12:33.520 --> 12:39.200
Finally, while we showed that AFD generalizes well to most other attacks and attack strengths, it was occasionally worse than other baselines, especially on data
12:39.200 --> 12:45.760
sets with more classes, like Tiny ImageNet. This could potentially be due to the difficulty of training
12:46.320 --> 12:51.680
domain classifiers on these data sets, and it leaves much room for future work on
12:51.680 --> 12:57.120
investigating the effect of domain classifiers on the robustness of feature learning functions.
12:58.080 --> 13:04.400
Also, AFD required more backward computations compared to some of the other baselines,
13:04.400 --> 13:11.120
such as adversarial training, and as a result, its training time was on average about 31%
13:11.120 --> 13:17.680
longer than that of adversarial training. We invite you to read our paper for more details, and please
13:17.680 --> 13:34.720
get in touch with us if you have any questions. Thanks for watching this video, and we hope you enjoyed it.