WEBVTT
00:00.000 --> 00:13.120
Hello, my name is Pouya Bashivan and I'm going to tell you about our paper titled
00:13.120 --> 00:18.720
Adversarial Feature Desensitization. This is joint work with a number of wonderful collaborators
00:18.720 --> 00:24.400
at Mila, University of Montreal and McGill University, including Reza Bayat, Adam Ibrahim,
00:24.400 --> 00:32.160
Kartik Ahuja, Mojtaba Faramarzi, Touraj Laleh, Blake Richards and Irina Rish. A common assumption in
00:32.160 --> 00:36.560
machine learning is that the train and test samples come from the same distribution.
00:37.200 --> 00:42.960
While this is a reasonable assumption under most circumstances, it is intentionally violated in the
00:42.960 --> 00:49.600
regime of adversarial attacks. Adversarial attacks are algorithms that search for slight input
00:49.600 --> 00:55.600
perturbations that cause the input to be misclassified. In the case of white box attacks,
00:55.600 --> 01:01.600
the model itself is transparent to the attacker and the attacker uses it to identify the possible
01:01.600 --> 01:07.760
inputs that would lead to misclassifications. A famous example of this is the image of a panda
01:07.760 --> 01:13.360
that when perturbed with imperceptible noise, alters the model's prediction from a panda to a
01:13.360 --> 01:19.840
gibbon. As prior literature has shown, this is a common issue in almost all machine learning methods
01:19.840 --> 01:25.280
and unless the classifier is specifically trained to be robust against these attacks,
01:25.280 --> 01:28.720
the attacks could completely break down the classifier's performance.
01:30.240 --> 01:35.600
This issue becomes even more critical when we consider the vast usage of these machine learning
01:35.600 --> 01:41.040
systems in our societies. Consider, for example, the possible security concerns that arise in face
01:41.040 --> 01:46.720
recognition systems prone to adversarial attacks, or the safety of autonomous driving systems.
01:48.080 --> 01:54.000
So what is an adversarial attack? To formally define the adversarial attacks, let's assume a
01:54.000 --> 02:00.080
feature learning function f that projects inputs x to a latent space or feature space z
02:01.600 --> 02:08.720
and a classifier that uses the latent code z to predict the correct class label y hat.
02:08.720 --> 02:14.480
The perturbation function or the attack generates a perturbed sample x prime
02:14.480 --> 02:21.520
within the epsilon neighborhood of the input x, which we're showing here as B(x, epsilon),
02:22.160 --> 02:28.880
by maximizing the classification objective, the opposite of how we normally optimize the classifier's
02:28.880 --> 02:36.720
parameters. Many methods have been proposed to defend the models against adversarial attacks.
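NOTE
A minimal sketch (not from the talk) of the kind of white-box attack just described: take gradient ascent steps on the classification loss and project back into the epsilon-ball B(x, epsilon) around the clean input. All names and hyperparameters here are illustrative assumptions.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, n_steps=10):
    """Projected gradient ascent on the classification loss (L-infinity ball)."""
    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)          # classification objective
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()           # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project into B(x, eps)
            x_adv = x_adv.clamp(0.0, 1.0)                # keep a valid image
    return x_adv.detach()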
02:36.720 --> 02:42.640
Two of these methods that have withstood the test of time so far are the adversarial training
02:43.200 --> 02:50.160
by Madry et al., which proposes a defense method by solving a minimax optimization problem
02:50.160 --> 02:56.000
that involves finding an adversarial input by maximizing the classification loss in the inner
02:56.000 --> 03:03.840
loop, followed by training the classifier to minimize the classification loss on these adversarial inputs.
03:03.840 --> 03:09.920
This procedure is graphically shown for two hypothetical classes in the diagram on this slide.
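NOTE
A rough sketch (assumed, not the paper's code) of one step of the minimax procedure described above, reusing the pgd_attack sketch from the earlier note as the inner maximization. The model and optimizer names are placeholders.

import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=8/255):
    x_adv = pgd_attack(model, x, y, eps=eps)       # inner loop: maximize the classification loss
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)        # outer loop: minimize the loss on adversarial inputs
    loss.backward()
    optimizer.step()
    return loss.item()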
03:10.560 --> 03:15.440
The adversarial training method essentially learns to separate the distributions of adversarial
03:15.440 --> 03:22.400
examples belonging to different classes. The second method is the TRADES method by Zhang et al.,
03:22.400 --> 03:27.440
which proposes to push the decision boundary of the classifier away from the data.
03:27.440 --> 03:32.480
TRADES achieves this by introducing a regularization term to the original learning
03:32.480 --> 03:38.320
objective for classification that penalizes the mismatch between the predicted label
03:38.320 --> 03:44.400
for the clean and perturbed inputs. The diagram on the right side again graphically illustrates
03:44.400 --> 03:50.000
this procedure, where now the defense method learns to separate the distributions of clean examples
03:50.000 --> 03:54.400
belonging to different classes while minimizing the loss of the classifier.
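NOTE
A simplified sketch of a TRADES-style objective as described above: the usual classification loss on clean inputs plus a term penalizing the mismatch between predictions on clean and perturbed inputs. In the actual method the perturbation is found by maximizing this mismatch; as a rough stand-in, the PGD sketch from the earlier note is reused, and beta is an illustrative weight.

import torch.nn.functional as F

def trades_style_loss(model, x, y, eps=8/255, beta=6.0):
    x_adv = pgd_attack(model, x, y, eps=eps)                 # stand-in for the TRADES attack step
    logits_clean, logits_adv = model(x), model(x_adv)
    natural_loss = F.cross_entropy(logits_clean, y)          # fit the clean data
    mismatch = F.kl_div(F.log_softmax(logits_adv, dim=1),    # penalize clean/perturbed disagreement
                        F.softmax(logits_clean, dim=1),
                        reduction="batchmean")
    return natural_loss + beta * mismatch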
03:54.400 --> 04:45.600
In contrast to these defenses, our approach draws on ideas from domain adaptation. In domain adaptation, we typically assume that a classifier has been trained
04:45.600 --> 04:52.160
for a source domain, but we want the classifier to also perform the same task on a related target
04:52.160 --> 05:00.960
domain that we might not have enough data for, or for which the procedure for generating or sampling
05:00.960 --> 05:09.440
data might be expensive. The domain adaptation theory proposed by Ben-David et al. answers the
05:09.440 --> 05:15.840
question of under what conditions can we adapt a classifier trained on the source domain for use
05:15.840 --> 05:23.920
in the target domain. Here we consider the original clean distributions as the source domain and the
05:23.920 --> 05:31.280
distribution of adversarial images generated from those images as the target domain. Note, though, that here
05:31.280 --> 05:38.240
the target domain continuously evolves because the adversarial examples are based on the current
05:38.240 --> 05:46.000
state of the model at each time step. And similar to the domain adaptation theory, our goal here
05:46.000 --> 05:52.960
is to learn how to perform well on both source and target domains, meaning the natural and
05:52.960 --> 06:02.240
adversarial domains. Now before I tell you about our proposed method, let's dive a bit deeper into
06:02.240 --> 06:08.960
what the domain adaptation theory from Ben-David et al. states. Similar to before, let's assume a
06:08.960 --> 06:14.880
feature learning function f that projects inputs x to latent space or feature space z and the
06:14.880 --> 06:23.040
classifier that predicts the correct label y, y hat, from those latent codes. Now consider natural
06:23.040 --> 06:31.440
and adversarial examples as input domains D_X and D'_X, and their induced feature distributions
06:23.040 --> 06:31.440
under the function f as D_Z and D'_Z. Also consider epsilon_Z and epsilon'_Z
06:31.440 --> 06:42.560
as the classification errors over the domains D_Z and D'_Z, which we are going to refer to as the
06:42.560 --> 06:50.320
clean error and the adversarial error. The domain adaptation theory now gives a bound
06:58.880 --> 07:04.320
on the adversarial error in terms of the natural error and the distance between the two domains.
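NOTE
For reference, the usual form of the Ben-David et al. bound written in the notation above; the constant lambda (the error of the best joint hypothesis over both domains) is a detail not spelled out in the talk.

% adversarial (target) error bounded by the natural (source) error,
% the H-delta-H distance between the induced feature distributions, and lambda
\epsilon'_Z(h) \;\le\; \epsilon_Z(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}\bigl(\mathcal{D}_Z, \mathcal{D}'_Z\bigr) \;+\; \lambda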
07:05.120 --> 07:11.680
Fortunately, from prior work, we know that the H-delta-H distance, which measures the distance
07:11.680 --> 07:17.440
between two domains, can be estimated using a classifier trained to discriminate between the
07:17.440 --> 07:26.080
two domains. Now our defense method, called adversarial feature desensitization (AFD), essentially
07:26.080 --> 07:34.720
minimizes the bound on the adversarial error epsilon'_Z using a three-step procedure, which
07:34.720 --> 07:40.560
has some conceptual similarities with prior work on adversarial domain adaptation from Ganin et al.
07:42.240 --> 07:49.280
For this, we first update the parameters theta and phi in the feature learning function f and
07:49.280 --> 07:56.320
task classifier c to minimize the classification loss on the natural domain. This is shown with
07:56.320 --> 08:01.920
green arrows and green boxes marked 1 on both the equation and on the diagram.
08:04.000 --> 08:10.400
Secondly, we estimate the H-delta-H distance using an additional domain discriminator
08:10.960 --> 08:17.600
network that predicts the domain identity from the latent code z. We update the domain
08:17.600 --> 08:24.720
discriminator parameters psi to minimize the domain classification loss. And finally,
08:24.720 --> 08:31.680
in the third step, we update the feature learning network parameters theta to maximize the domain
08:31.680 --> 08:39.600
classification loss in an adversarial way. These two steps are marked with red arrows in the figure
08:39.600 --> 08:48.960
and red boxes on the equation. Similar to the two previous methods that I showed you, adversarial training
09:01.040 --> 09:07.840
and TRADES, here we can also graphically demonstrate this procedure. In our method, AFD,
08:55.760 --> 09:01.040
we learn to separate the distributions of clean examples belonging to different classes, while at the
09:01.040 --> 09:07.840
same time we optimize a domain classifier that learns the boundary between the clean and adversarial
09:07.840 --> 09:14.560
examples for each class. And finally, we push the adversarial examples to the opposite side of that
09:14.560 --> 09:22.400
boundary. This procedure implicitly desensitizes the learned features to adversarial perturbations
09:22.400 --> 09:30.480
and hence the name adversarial feature desensitization. We tested our method on four
09:30.480 --> 09:35.840
data sets and compared it with a number of other baselines, including adversarial training and
09:35.840 --> 09:43.760
TRADES. We made two versions of our method, called AFD-DCGAN, which uses the adversarial losses from
09:43.760 --> 09:50.880
Goodfellow et al., and AFD-WGAN, which uses the Wasserstein losses from Arjovsky et al.
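NOTE
A rough sketch (assumed, not the paper's implementation) of how the three updates described earlier might be combined in one AFD-style training iteration. It reuses the pgd_attack sketch to produce the adversarial (target-domain) inputs and uses plain binary cross-entropy for the domain terms, whereas AFD-DCGAN and AFD-WGAN use GAN-style and Wasserstein losses. All module and optimizer names are placeholders.

import torch
import torch.nn.functional as F

def afd_training_step(features, classifier, discriminator,
                      opt_task, opt_disc, opt_feat, x, y, eps=8/255):
    # Step 1: minimize the classification loss on the natural domain (theta, phi).
    opt_task.zero_grad()
    F.cross_entropy(classifier(features(x)), y).backward()
    opt_task.step()

    # Generate adversarial examples from the current state of the model.
    model = lambda inp: classifier(features(inp))
    x_adv = pgd_attack(model, x, y, eps=eps)

    # Step 2: train the domain discriminator to tell clean from adversarial features (psi).
    opt_disc.zero_grad()
    d_clean = discriminator(features(x).detach())
    d_adv = discriminator(features(x_adv).detach())
    disc_loss = (F.binary_cross_entropy_with_logits(d_clean, torch.ones_like(d_clean)) +
                 F.binary_cross_entropy_with_logits(d_adv, torch.zeros_like(d_adv)))
    disc_loss.backward()
    opt_disc.step()

    # Step 3: update the feature extractor to maximize the domain loss (theta),
    # i.e. make adversarial features look like clean ones to the discriminator.
    opt_feat.zero_grad()
    d_adv = discriminator(features(x_adv))
    F.binary_cross_entropy_with_logits(d_adv, torch.ones_like(d_adv)).backward()
    opt_feat.step()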
09:52.000 --> 09:57.840
In the table, we evaluated all methods on several white box and black box attacks with
09:57.840 --> 10:07.360
nominal strengths for each data set. Overall, our method AFD, and especially AFD-WGAN, showed superior
10:07.360 --> 10:15.200
performance against most attacks on most data sets. However, AFD was behind TRADES on several attacks,
10:15.200 --> 10:20.720
especially on the CIFAR-100 and Tiny ImageNet data sets, which have more classes.
10:20.720 --> 10:26.080
We also looked at robustness across attack methods and attack strengths, which we controlled with the parameter
10:26.080 --> 10:32.800
epsilon. The diagrams on the right show the robust accuracy for each defense method across
10:32.800 --> 10:41.200
eight attack methods and various epsilon values for each of them. Overall, our results in these
10:41.200 --> 10:48.240
diagrams showed that AFD's robustness generalizes better than the baselines across attacks and
10:48.240 --> 10:55.200
across attack strengths. To quantify these differences, we also computed the area under
10:55.200 --> 11:00.000
the curve for each method for each attack and summarized them in a table on the left.
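NOTE
A tiny illustrative sketch of the area-under-the-curve summary mentioned above: integrate robust accuracy over the attack strengths (epsilon values). The numbers are made up purely to show the computation, and the exact convention used in the paper (e.g. normalization) may differ.

import numpy as np

epsilons = np.array([0.0, 2/255, 4/255, 8/255, 16/255])          # attack strengths
robust_accuracy = np.array([0.92, 0.78, 0.65, 0.47, 0.21])       # hypothetical robust accuracies

# Trapezoidal area under the accuracy-vs-epsilon curve, normalized by the epsilon range.
auc = np.trapz(robust_accuracy, epsilons) / (epsilons[-1] - epsilons[0])
print(f"normalized robust-accuracy AUC: {auc:.3f}")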
11:00.880 --> 11:06.800
As you can see, AFD's robust performance generalizes better to unseen and stronger attacks
11:06.800 --> 11:15.680
compared to other baselines. If you remember from previous slides, the domain adaptation theory
11:15.680 --> 11:22.400
predicted a bound on the adversarial error which can also be turned into a bound on the generalization
11:22.400 --> 11:30.320
gap between the natural and adversarial errors. We empirically tested this prediction in our experiments
11:30.320 --> 11:37.600
under two settings. Under the first setting, we varied the epsilon value for the PGD L-infinity
11:37.600 --> 11:45.600
attack which was used during training. And under the second setting, we used a diverse set of
11:45.600 --> 11:51.120
attacks and various attack strengths for each of them.
11:52.000 --> 11:58.480
And under both scenarios, we found that the domain discriminator, which was originally trained on a
11:58.480 --> 12:05.280
particular attack and attack strength, in our case the PGD L-infinity attack with a fixed epsilon
12:05.280 --> 12:10.960
for each data set, could well predict the generalization gap to unseen attacks and
12:10.960 --> 12:18.000
different attack magnitudes. This suggests that the adversarial training against a domain classifier
12:18.000 --> 12:24.000
like that used in our proposed method could potentially lead to robust models with better
12:24.000 --> 12:33.520
generalization capacity. Finally, while we showed that AFD generalizes well to most other attacks
12:33.520 --> 12:39.200
and attack strengths, it was occasionally worse than other baselines, especially on data
12:39.200 --> 12:45.760
sets with more classes like Tiny ImageNet. This could potentially be due to the difficulty of training
12:46.320 --> 12:51.680
domain classifiers in these data sets and leaves much space for future work on
12:51.680 --> 12:57.120
investigating the effect of domain classifiers on the robustness of feature learning functions.
12:58.080 --> 13:04.400
Also, AFD required more backward computations compared to some of the other baselines
13:04.400 --> 13:11.120
such as adversarial training, and as a result, its training time was on average about 31%
13:11.120 --> 13:17.680
longer than adversarial training. We invite you to read our paper for more details and please
13:17.680 --> 13:34.720
get in touch with us if you have any questions. Thanks for watching this video and we hope you enjoyed it.