WEBVTT
00:00.000 --> 00:14.000
Hi, I'm Hugo Richard, a third-year PhD student at Université Paris-Saclay.
00:14.000 --> 00:18.480
I'm in the Inria Parietal team and my supervisor is Bertrand Thirion.
00:18.480 --> 00:24.600
Today I'll talk about shared independent component analysis for multi-subject neuroimaging.
00:24.600 --> 00:31.400
This is joint work with Pierre Ablin, Alexandre Gramfort, Bertrand Thirion and Aapo Hyvärinen.
00:31.400 --> 00:36.360
First let us consider two sources that are emitting a signal that is recorded by two
00:36.360 --> 00:37.360
sensors.
00:37.360 --> 00:43.120
This can be seen as a simplified model of magnetoencephalography where brain sources
00:43.120 --> 00:46.000
are recorded by magnetometers.
00:46.000 --> 00:50.200
Because propagation time can be neglected, the signal recorded by the sensors can be
00:50.200 --> 00:55.840
seen as a linear mixture of the signal emitted by the sources.
00:55.840 --> 00:59.600
S is a set of sources that are assumed to be independent.
00:59.600 --> 01:06.400
X are the recordings and A describes how the sources are mixed to produce the recordings.
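As a minimal sketch of this generative model (my own toy example, not the data from the talk), the snippet below simulates two independent non-Gaussian sources, mixes them with a random matrix A, and recovers them with FastICA from scikit-learn; the signals, dimensions, and variable names are illustrative assumptions.

```python
# Minimal sketch of X = A S: two independent sources, two sensors (illustrative).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
n_samples = 2000
t = np.linspace(0, 8, n_samples)

S = np.c_[np.sign(np.sin(3 * t)),           # non-Gaussian source 1 (square wave)
          rng.laplace(size=n_samples)]      # non-Gaussian source 2 (Laplace noise)
A = rng.randn(2, 2)                         # unknown mixing matrix
X = S @ A.T                                 # recordings, shape (n_samples, 2)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                # recovered sources (up to permutation/scale)
A_est = ica.mixing_                         # estimated mixing matrix
```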
01:06.400 --> 01:12.120
At first sight this model may seem ill-defined because if we permute two columns in A and
01:12.120 --> 01:19.600
permute the corresponding sources in S, we'll get a new set of sources S' and a new mixing
01:19.600 --> 01:25.360
matrix A' that describes X just as well as A and S.
01:25.360 --> 01:30.360
And similarly, if we scale one column of A by some constant
01:30.360 --> 01:34.920
and divide the corresponding source by the same constant, we'll also get an equivalent
01:34.920 --> 01:35.920
description of X.
01:35.920 --> 01:44.840
However, these scale and permutation indeterminacies are the only ones if the sources contain at
01:44.840 --> 01:46.840
most one Gaussian component.
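This indeterminacy is easy to check numerically; the following sketch (my own toy example) permutes and rescales the columns of A, applies the inverse transformation to the sources, and verifies that the observations are unchanged.

```python
# Numerical check that permuting and rescaling the columns of A (and applying the
# inverse transformation to the rows of S) leaves the observations X = A S unchanged.
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(2, 2)
S = rng.laplace(size=(2, 1000))            # sources as rows

P = np.array([[0., 1.], [1., 0.]])         # permutation
D = np.diag([2.0, -0.5])                   # rescaling (with a sign flip)

A_prime = A @ P @ D                        # new mixing matrix
S_prime = np.linalg.inv(D) @ P.T @ S       # correspondingly transformed sources

assert np.allclose(A @ S, A_prime @ S_prime)
```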
01:46.840 --> 01:52.040
Let us consider the more general problem where you have multiple subjects that are exposed
01:52.040 --> 01:54.560
to the same stimuli.
01:54.560 --> 02:00.640
We have two subjects, X1 and X2, and they have different mixing matrices, A1 and A2,
02:00.640 --> 02:04.560
and different noise levels, N1 and N2.
02:04.560 --> 02:08.720
The interpretation is that they have shared sources because they have shared cognitive
02:08.720 --> 02:09.720
processes.
02:09.720 --> 02:15.120
They have different mixing matrices because they have different spatial topography.
02:15.120 --> 02:20.600
And they have different noises because we want to model inter-subject variability.
02:20.600 --> 02:22.480
This model is called group ICA.
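As a small illustration of this generative model, the sketch below draws shared sources and gives each subject its own mixing matrix and sensor noise; the dimensions and noise levels are arbitrary choices for the example, not values from the talk.

```python
# Illustrative generation of group ICA data: shared sources, subject-specific
# mixing matrices and sensor noise (X_i = A_i S + N_i).
import numpy as np

rng = np.random.RandomState(0)
n_subjects, n_components, n_samples, n_sensors = 2, 3, 1000, 3

S = rng.laplace(size=(n_components, n_samples))             # shared sources
X = []
for i in range(n_subjects):
    A_i = rng.randn(n_sensors, n_components)                # subject-specific mixing
    N_i = 0.1 * (i + 1) * rng.randn(n_sensors, n_samples)   # subject-specific noise level
    X.append(A_i @ S + N_i)
```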
02:22.480 --> 02:27.840
There are many methods to provide a solution for the group ICA problem.
02:27.840 --> 02:34.560
A very popular one introduced by Calhoun in 2001 is to just stack the data of all subjects
02:34.560 --> 02:42.520
feature-wise and then perform a PCA, a principal component analysis, on the stacked data.
02:42.520 --> 02:47.520
And therefore you obtain reduced data and apply independent component analysis on the
02:47.520 --> 02:50.520
reduced data to obtain a set of sources.
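Here is a hedged sketch of this concatenation recipe using scikit-learn's PCA and FastICA; the feature-wise stacking and the helper name concat_ica are illustrative, not the reference implementation.

```python
# Sketch of the concatenation approach: stack subjects feature-wise,
# reduce with PCA, then run ICA on the reduced data (illustrative).
import numpy as np
from sklearn.decomposition import PCA, FastICA

def concat_ica(X_list, n_components):
    """X_list: list of (n_sensors, n_samples) arrays, one per subject."""
    X_stacked = np.vstack(X_list)                  # stack along the feature axis
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X_stacked.T).T   # (n_components, n_samples)
    ica = FastICA(n_components=n_components, random_state=0)
    sources = ica.fit_transform(X_reduced.T).T     # shared sources
    return sources
```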
02:50.520 --> 02:55.960
Another formulation was introduced by Varoquaux in 2010 and is called CanICA.
02:55.960 --> 03:01.320
You just replace the principal component analysis with a multiset CCA, so a multiset canonical
03:01.320 --> 03:06.120
correlation analysis, where you have to solve a generalized eigenvalue problem.
03:06.120 --> 03:12.800
There are many different formulations of multiset CCA, but this one with a generalized eigenvalue
03:12.800 --> 03:15.560
problem is the fastest to solve.
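One possible way to set up this generalized eigenvalue problem in code is sketched below, assuming (as detailed later in the talk) that C stacks all cross-covariance blocks between subjects and D keeps only its diagonal blocks; the function name and shapes are illustrative.

```python
# Sketch of multiset CCA as a generalized eigenvalue problem (illustrative):
# C holds all cross-covariance blocks between subjects, D only its diagonal blocks.
import numpy as np
from scipy.linalg import eigh

def multiset_cca(X_list, n_components):
    """X_list: list of (p, n_samples) arrays, one per subject."""
    m = len(X_list)
    p = X_list[0].shape[0]
    X_all = np.vstack(X_list)                      # (m * p, n_samples)
    C = np.cov(X_all)                              # all second-order statistics
    D = np.zeros_like(C)
    for i in range(m):
        sl = slice(i * p, (i + 1) * p)
        D[sl, sl] = C[sl, sl]                      # keep only the diagonal blocks
    eigvals, eigvecs = eigh(C, D)                  # generalized eigenvalue problem
    order = np.argsort(eigvals)[::-1]              # sort by decreasing eigenvalue
    return eigvals[order[:n_components]], eigvecs[:, order[:n_components]]
```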
03:15.560 --> 03:17.840
CanICA and ConcatICA have a lot of advantages.
03:17.840 --> 03:21.000
First, they are very fast to fit.
03:21.000 --> 03:23.320
And second, they are simple to implement.
03:23.320 --> 03:26.920
These are the two reasons why they are so popular in neuroimaging.
03:26.920 --> 03:30.160
However, they do not optimize the proper likelihood.
03:30.160 --> 03:35.680
Therefore they do not benefit from the advantages of such estimators, such as asymptotic efficiency.
03:35.680 --> 03:41.480
There is a lot of other related work that does optimize the proper likelihood.
03:41.480 --> 03:46.240
I want to mention the independent vector analysis, which is a very powerful framework introduced
03:46.240 --> 03:48.760
by Li in 2008.
03:48.760 --> 03:54.560
There is also the unified approach of Guo in 2008, which we will talk about later.
03:54.560 --> 04:01.040
The approach of Shen in 2015 also allows one to perform dimension reduction.
04:01.040 --> 04:08.320
And there is MultiViewICA, which was introduced by our team last year.
04:08.320 --> 04:15.200
I want to quickly say that it is not obvious how to design a likelihood-based approach that
04:15.200 --> 04:17.400
is tractable.
04:17.400 --> 04:23.680
And with this example of the Gaussian mixture noisy ICA of Bermond and Cardoso, we'll see
04:23.680 --> 04:31.400
that the standard approach leads to intractable algorithms.
04:31.400 --> 04:37.080
The model we take here is the same as the group ICA, but we assume that the noise is
04:37.080 --> 04:40.120
Gaussian with the same variance for all subjects.
04:40.120 --> 04:47.600
We'll also assume that the sources follow a Gaussian mixture model.
04:47.600 --> 04:53.040
And we further assume that the weights of the Gaussian mixtures are known.
04:53.040 --> 04:56.360
We can solve such a model via expectation-maximization.
04:56.360 --> 05:01.400
And if we write the E-step, we'll get a closed form that involves a large sum.
05:01.400 --> 05:09.040
Because of the large size of this sum, the EM algorithm is intractable whenever
05:09.040 --> 05:11.600
q and k are large.
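Schematically (this is an illustrative rendering, not the exact expression from the talk or the paper), the E-step has to marginalize over every joint assignment of mixture labels:

```latex
% Schematic form of the E-step (illustrative): with q sources, each modeled by a
% k-component Gaussian mixture, the posterior over the sources marginalizes over
% all joint label assignments \alpha,
\[
  p(s \mid x) \;=\; \sum_{\alpha \in \{1,\dots,k\}^{q}} p(\alpha \mid x)\, p(s \mid x, \alpha),
\]
% a sum with k^q terms, which is intractable when k and q are large.
```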
05:11.600 --> 05:17.520
Our contribution is Shared ICA, or ShICA for short, where the data of subject
05:17.520 --> 05:23.080
i are modeled as a linear mixture of noisy sources, and the noise here is not on the
05:23.080 --> 05:24.080
sensors, but on the sources.
05:24.080 --> 05:30.000
The noise is Gaussian with a variance that can be different for each subject and different
05:30.000 --> 05:31.000
for each component.
05:31.000 --> 05:37.800
S are assumed to be independent, but in contrast to almost all existing work, some components
05:37.800 --> 05:38.800
can be Gaussian.
05:38.800 --> 05:41.600
We have a few blanket assumptions.
05:41.600 --> 05:45.840
We assume that the data are centered, that the mixing matrices are invertible, that the
05:45.840 --> 05:50.680
sources have identical variance, and that the number of subjects is greater than 3.
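A toy generator for this model, under the assumptions just listed, is sketched below; the numbers of subjects and components, the noise ranges, and the source distributions are arbitrary choices for illustration.

```python
# Illustrative generation from the ShICA model: x_i = A_i (s + n_i), where the
# noise lives on the sources and its variance differs per subject and component.
import numpy as np

rng = np.random.RandomState(0)
m, p, n = 4, 3, 2000                       # subjects, components, samples

# Shared sources: some Gaussian, some non-Gaussian (allowed in ShICA).
S = np.vstack([rng.randn(1, n),            # Gaussian component
               rng.laplace(size=(2, n))])  # non-Gaussian components
S /= S.std(axis=1, keepdims=True)          # identical-variance assumption

X = []
for i in range(m):
    A_i = rng.randn(p, p)                            # invertible with probability 1
    sigma_i = rng.uniform(0.1, 1.0, size=(p, 1))     # per-subject, per-component noise std
    N_i = sigma_i * rng.randn(p, n)
    X.append(A_i @ (S + N_i))
```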
05:50.680 --> 05:54.000
We have two algorithms to solve the ShICA model.
05:54.000 --> 06:01.520
We have ShICA-J, which is a fast algorithm based on multiset CCA, and ShICA-ML, a
06:01.520 --> 06:04.000
maximum likelihood approach.
06:04.000 --> 06:07.600
In ShICA, there are two ways to recover the parameters.
06:07.600 --> 06:12.880
Either the sources are non-Gaussian, in which case we can use classical ICA results to recover
06:12.880 --> 06:15.720
the unmixing matrices.
06:15.720 --> 06:20.120
When the components are Gaussian, then we need something else, and what we use here
06:20.120 --> 06:22.480
is noise diversity.
06:22.480 --> 06:28.320
When the noise is sufficiently diverse, then it's possible to recover the unmixing matrix
06:28.320 --> 06:34.120
and the noise covariance up to a permutation and sign indeterminacy.
06:34.120 --> 06:38.240
Note that the noise diversity in Gaussian components is also a necessary condition.
06:38.240 --> 06:42.680
If it does not hold, then the ShICA model is not identifiable.
06:42.680 --> 06:48.520
Let us now focus on this theorem that is at the core of the ShICA-J algorithm.
06:48.520 --> 06:53.520
Namely it shows that we can solve group ICA with multiset CCA.
06:53.520 --> 06:58.880
So assume the data follow the ShICA model, and consider multiset CCA framed as a
06:58.880 --> 07:00.920
generalized eigenvalue problem.
07:00.920 --> 07:08.080
This generalized eigenvalue problem relies on two matrices, C and D. So C is formed by
07:08.080 --> 07:13.560
second-order statistics, and D is formed by the diagonal blocks in C.
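In schematic notation, with C_{ij} denoting the cross-covariance between subjects i and j (this rendering is mine, not copied verbatim from the paper), the problem reads:

```latex
% Schematic form of the generalized eigenvalue problem (illustrative notation):
\[
  C =
  \begin{pmatrix}
    C_{11} & C_{12} & \cdots & C_{1m} \\
    C_{21} & C_{22} & \cdots & C_{2m} \\
    \vdots &        & \ddots & \vdots \\
    C_{m1} & C_{m2} & \cdots & C_{mm}
  \end{pmatrix},
  \qquad
  D = \operatorname{diag}(C_{11}, \dots, C_{mm}),
  \qquad
  C u = \lambda D u,
\]
% where C_{ij} = E[x_i x_j^T] collects the second-order statistics and D keeps
% only the diagonal blocks of C.
```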
07:13.560 --> 07:19.880
And so if we solve this eigenvalue problem and take the first k leading eigenvectors,
07:19.880 --> 07:26.520
we can recover the correct unmixing matrix from them, up to a permutation and a scaling.
07:26.520 --> 07:32.000
And this can only be done if the k first eigenvalues are distinct.
07:32.000 --> 07:34.320
Note that the distinct eigenvalue condition is also necessary.
07:34.320 --> 07:40.480
If two eigenvalues are the same, then this adds an indeterminacy, and therefore
07:40.480 --> 07:42.280
we cannot solve group ICA.
07:42.280 --> 07:48.640
Note also that the condition that some eigenvalues need to be distinct is stronger than the noise
07:48.640 --> 07:54.080
diversity condition we have in the identifiability theorem.
07:54.080 --> 07:59.360
And therefore we can exhibit an example which is identifiable, but on which multiset CCA
07:59.360 --> 08:00.360
will fail.
08:00.360 --> 08:04.800
And I refer you to the paper for more details on this.
08:04.800 --> 08:10.160
So in our theorem, in order to recover the correct unmixing matrix, we need to have access
08:10.160 --> 08:12.480
to the second-order statistics.
08:12.480 --> 08:18.860
However, in practice, we only have access to them, up to some sampling noise.
08:18.860 --> 08:24.520
And because the mapping from matrices to eigenvectors is highly non-smooth, a small deviation in
08:24.520 --> 08:31.160
the second-order statistics can lead to a high deviation of the recovered unmixing matrix.
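This sensitivity is easy to reproduce on a toy symmetric matrix with a small eigengap; the following snippet (an illustration, not the experiment from the talk) shows the leading eigenvector rotating substantially under a tiny perturbation.

```python
# Toy illustration: when the eigengap is small, a tiny perturbation of the
# matrix can rotate the leading eigenvector by a large angle.
import numpy as np

rng = np.random.RandomState(0)
eps = 1e-4                                     # eigengap
M = np.diag([1.0 + eps, 1.0, 0.1])             # "true" second-order statistics

E = rng.randn(3, 3)
E = 1e-3 * (E + E.T) / 2                       # small symmetric perturbation

u_true = np.linalg.eigh(M)[1][:, -1]           # leading eigenvector of M
u_pert = np.linalg.eigh(M + E)[1][:, -1]       # leading eigenvector of M + E

angle = np.degrees(np.arccos(abs(u_true @ u_pert)))
print(f"rotation of the leading eigenvector: {angle:.1f} degrees")
```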
08:31.160 --> 08:38.080
Now to show this in practice, we take three subjects, two components, and noise covariance
08:38.080 --> 08:47.440
matrices with two values, lambda1 and lambda2, that are separated by an eigengap epsilon.
08:47.440 --> 08:52.440
And we compare the solution of multiset CCA on the true covariance matrices and on the
08:52.440 --> 08:59.520
perturbed covariance matrices, where the perturbation scale is given by delta.
08:59.520 --> 09:07.240
And for different values of epsilon, 10^-4, 10^-3, 10^-2, 10^-1, we show how the performance
09:07.240 --> 09:14.720
of the algorithm, so the Amari distance between the true unmixing matrix and the estimated
09:14.720 --> 09:20.880
unmixing matrix, varies when the perturbation scale increases.
09:20.880 --> 09:26.600
And we see that when the eigengap is very small, so 10^-4, the violet curve, then even
09:26.600 --> 09:31.440
with a very small perturbation, you can get a very bad Amari distance.
09:31.440 --> 09:35.720
The black dashed curve is the chance level.
09:35.720 --> 09:41.200
Luckily, there is a large gap between the k-th eigenvalue and the (k+1)-th.
09:41.200 --> 09:46.120
This means that in practice, the span of the k leading eigenvectors is approximately preserved.
09:46.120 --> 09:53.600
We can recover the true unmixing matrix from the unmixing matrix estimated by multiset
09:53.600 --> 09:56.520
CCA, just by multiplying by a matrix Q.
09:56.520 --> 10:02.640
And in order to estimate Q, we make use of the fact that the unmixed data should have
10:02.640 --> 10:03.640
a diagonal covariance.
10:03.640 --> 10:09.680
This leads us to a joint diagonalization problem that we can solve efficiently.
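As a toy illustration of joint diagonalization (not the algorithm used in the paper): for exactly two symmetric positive-definite matrices, a generalized eigendecomposition diagonalizes both at once; with more matrices one would use an approximate joint diagonalization routine instead. All names below are illustrative.

```python
# For two symmetric positive-definite covariances, a generalized eigendecomposition
# jointly diagonalizes them exactly (a toy stand-in for the approximate joint
# diagonalization used with more than two matrices).
import numpy as np
from scipy.linalg import eigh

rng = np.random.RandomState(0)
B = rng.randn(3, 3)                               # hidden joint diagonalizer
D1, D2 = np.diag([1., 2., 3.]), np.diag([4., 1., 0.5])
C1 = np.linalg.inv(B) @ D1 @ np.linalg.inv(B).T   # two covariances diagonalized by B
C2 = np.linalg.inv(B) @ D2 @ np.linalg.inv(B).T

_, V = eigh(C1, C2)                               # generalized eigenvectors
Q = V.T                                           # candidate joint diagonalizer
print(np.round(Q @ C1 @ Q.T, 3))                  # ~diagonal
print(np.round(Q @ C2 @ Q.T, 3))                  # ~diagonal
```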
10:09.680 --> 10:14.480
So if we go back to the experiment from the previous slide, the results are shown
10:14.480 --> 10:15.480
again here.
10:15.480 --> 10:21.640
You can see the violet curve, which is very sensitive to perturbations.
10:21.640 --> 10:29.360
And so if we apply joint diagonalization, all these curves move, and they join the dashed
10:29.360 --> 10:30.360
curve on the bottom.
10:30.360 --> 10:34.720
And this is much better, because the new curves, represented by the dashed
10:34.720 --> 10:42.920
lines, are less sensitive to perturbations.
10:42.920 --> 10:47.920
So now we've obtained the correct unmixing matrix, but up to a scaling.
10:47.920 --> 10:55.040
And so we need an additional step to find the correct scaling, and another one to find
10:55.040 --> 11:00.680
the remaining unestimated parameters, namely the noise covariances.
11:00.680 --> 11:04.000
And luckily, it's very easy to find the noise covariances.
11:04.000 --> 11:06.280
We can do this via an EM algorithm.
11:06.280 --> 11:11.920
The E-step and the M-step are in closed form, and this yields a very fast algorithm.
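To give a flavour of such a closed-form EM, here is a deliberately simplified scalar analogue (my own toy model, not the estimator from the paper): one shared signal observed by several subjects, with subject-specific noise variances estimated by EM.

```python
# Toy scalar analogue of a closed-form EM: y_i = s + n_i, with s ~ N(0, 1)
# and n_i ~ N(0, sigma_i^2); estimate the noise variances sigma_i^2.
import numpy as np

rng = np.random.RandomState(0)
m, n = 4, 10000
sigma_true = np.array([0.2, 0.5, 1.0, 2.0])
s = rng.randn(n)
Y = s + sigma_true[:, None] * rng.randn(m, n)      # observations, shape (m, n)

sigma2 = np.ones(m)                                # initial noise variances
for _ in range(50):
    # E-step: Gaussian posterior of s given the observations (closed form).
    post_var = 1.0 / (1.0 + np.sum(1.0 / sigma2))
    post_mean = post_var * np.sum(Y / sigma2[:, None], axis=0)
    # M-step: update each noise variance (closed form).
    sigma2 = np.mean((Y - post_mean) ** 2, axis=1) + post_var

print(np.round(np.sqrt(sigma2), 2))                # approaches sigma_true
```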
11:11.920 --> 11:15.200
But ShICA-J is not a maximum likelihood estimator.
11:15.200 --> 11:22.600
So now we will focus on ShICA-ML, which is our maximum likelihood estimator.
11:22.600 --> 11:31.240
So I won't go too much into detail on this, but we optimize this via an EM, using a Gaussian
11:31.240 --> 11:33.480
mixture assumption on the sources.
11:33.480 --> 11:35.960
We assume that the weights are known.
11:35.960 --> 11:41.480
What I just want to showcase here is that the E-step of the algorithm, the one that
11:41.480 --> 11:46.000
gives you the expectation of the sources given the data, and the variance of the sources
11:46.000 --> 11:50.760
given the data, only involves a sum of size 2.
11:50.760 --> 11:57.320
So previously we had a sum that had an exponential number of terms, and here we don't have that
11:57.320 --> 11:58.320
anymore.
11:58.320 --> 12:02.920
So the E-step is much faster than what we had before, and therefore the EM algorithm
12:02.920 --> 12:07.200
here is tractable, whereas it was not the case before.
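Schematically (again an illustrative rendering rather than the paper's exact expressions, with the posterior parameters mu and nu being placeholder symbols), each source component now only requires a marginalization over its own two mixture labels:

```latex
% Schematic contrast with the earlier exponential sum (illustrative only):
% with two mixture components per source, the posterior of each source s_j
% decouples into a sum of size 2,
\[
  p(s_j \mid x) \;=\; \sum_{\alpha = 1}^{2} p(\alpha \mid x)\,
  \mathcal{N}\!\left(s_j;\ \mu_{j,\alpha},\ \nu_{j,\alpha}\right),
\]
% instead of a joint sum over exponentially many label assignments.
```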
12:07.200 --> 12:11.440
I first want to present our synthetic experiment, where we generate data according to the ShICA-ML
12:11.440 --> 12:13.200
and ShICA-J models.
12:13.200 --> 12:18.560
In case A, we have only Gaussian components, but we have noise diversity, and therefore
12:18.560 --> 12:24.240
methods that use noise diversity to recover the sources, such as ShICA-ML and ShICA-J,
12:24.240 --> 12:25.240
perform best.
12:25.240 --> 12:34.000
In the second case, we have only non-Gaussian components and no noise diversity, so methods
12:34.000 --> 12:41.520
that use non-Gaussianity, such as CanICA, ShICA-ML, or MultiViewICA, perform well.
12:41.520 --> 12:45.200
In the last case, half of the components are Gaussian with noise diversity, and the
12:45.200 --> 12:49.000
other half are non-Gaussian but without noise diversity.
12:49.000 --> 12:53.000
And in this case, only ShICA-ML is able to correctly recover the sources.
12:53.000 --> 12:57.960
MultiViewICA doesn't manage that; it's not as good as ShICA-ML.
12:57.960 --> 13:00.400
Let us now talk about our experiments on real data.
13:00.400 --> 13:05.080
We have this reconstruction experiment on fMRI data where subjects are exposed to a
13:05.080 --> 13:07.920
naturalistic stimulus such as movie watching.
13:07.920 --> 13:15.320
We use 80% of the movie to learn the unmixing matrices of all subjects, and then on the
13:15.320 --> 13:22.320
remaining 20% of the movie, we compute the common sources, and from these common sources, computed
13:22.320 --> 13:28.800
using 80% of the subjects, we try to reconstruct the data of the remaining 20% of the subjects.
13:28.800 --> 13:33.880
We compute the R2 score within regions of interest between the reconstructed data and
13:33.880 --> 13:39.480
the true data, and plot them as a function of the number of components used.
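A rough sketch of how such an R2 reconstruction score could be computed is given below, assuming the shared sources on the held-out data are estimated by averaging the unmixed data of the training subjects; the split scheme follows the talk, while the function and variable names are illustrative.

```python
# Sketch of the reconstruction metric: unmixing matrices are fitted on 80% of the
# movie; on the held-out 20%, shared sources estimated from 80% of the subjects
# are used to reconstruct the remaining subjects, scored with R2 (illustrative).
import numpy as np
from sklearn.metrics import r2_score

def reconstruction_r2(X_test, W_list, train_subjects, test_subjects):
    """X_test: list of (p, n) held-out arrays; W_list: fitted unmixing matrices."""
    # Shared sources estimated by averaging the unmixed data of the train subjects.
    S_hat = np.mean([W_list[i] @ X_test[i] for i in train_subjects], axis=0)
    scores = []
    for i in test_subjects:
        X_rec = np.linalg.inv(W_list[i]) @ S_hat       # reconstruct subject i
        scores.append(r2_score(X_test[i].T, X_rec.T))  # R2 over the held-out data
    return np.mean(scores)
```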
13:39.480 --> 13:43.000
As we see, ShICA-ML outperforms all the other methods.
13:43.000 --> 13:47.400
As a take-home message, ShICA is a powerful framework to extract shared sources.
13:47.400 --> 13:52.840
ShICA-J is a fast approach to fit the model, but it only uses second-order information.
13:52.840 --> 13:58.800
In contrast, ShICA-ML is a bit slower, but it is able to use non-Gaussianity in addition
13:58.800 --> 14:00.960
to second-order information.
14:00.960 --> 14:03.840
In practice, ShICA-ML yields the best results.
14:03.840 --> 14:05.960
The methods we've introduced work on reduced data.
14:05.960 --> 14:11.160
It would be interesting to know how to reduce the data so that they perform optimally.
14:11.160 --> 14:15.400
Another way to improve our results would be to learn the density of the shared sources
14:15.400 --> 14:19.480
in ShICA-ML instead of having it fixed.
14:19.480 --> 14:23.400
Thanks for listening, and have a good day!