Hi, I'm Hugo Richard, a third-year PhD student at Université Paris-Saclay. I'm in the Inria Parietal team and my supervisor is Bertrand Thirion. Today I'll talk about shared independent component analysis for multi-subject neuroimaging. This is joint work with Pierre Ablin, Alexandre Gramfort, Bertrand Thirion and Aapo Hyvärinen.
First, let us consider two sources emitting a signal that is recorded by two sensors. This can be seen as a simplified model of magnetoencephalography, where brain sources are recorded by magnetometers. Because propagation time can be neglected, the signal recorded by the sensors can be seen as a linear mixture of the signals emitted by the sources. S is a set of sources that are assumed to be independent, X are the recordings, and A describes how the sources are mixed to produce the recordings.
At first sight this model may seem ill-defined: if we permute two columns of A and permute the corresponding sources in S, we get a new set of sources S' and a new mixing matrix A' that describe X just as well as A and S. Similarly, if we scale one column of A by some constant and divide the corresponding source by the same constant, we also get an equivalent description of X. However, these scale and permutation indeterminacies are the only ones, provided that the sources contain at most one Gaussian component.
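As a quick illustration of this model and of the permutation and scale indeterminacies, here is a minimal sketch (my own, not code from the talk) that mixes two independent non-Gaussian sources and recovers them with FastICA; the recovered sources match the true ones only up to ordering and scale. Variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
n_samples = 10_000

# Two independent non-Gaussian sources (Laplace-distributed).
S = rng.laplace(size=(n_samples, 2))

# A mixing matrix A producing the recordings X = S A^T.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = S @ A.T

# FastICA recovers the sources up to permutation and scale only.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)

# Cross-correlation between true and estimated sources: one large entry
# per row/column, but its position and sign are arbitrary.
corr = np.corrcoef(S.T, S_hat.T)[:2, 2:]
print(np.round(corr, 2))
```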
Let us consider the more general problem where we have multiple subjects exposed to the same stimuli. Here we have two subjects, X1 and X2, with different mixing matrices, A1 and A2, and different noise levels, N1 and N2. The interpretation is that the subjects have shared sources because they engage shared cognitive processes, different mixing matrices because they have different spatial topographies, and different noises because we want to model inter-subject variability. This model is called group ICA.
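To make the model concrete, here is a small simulation sketch (my own illustration) that draws shared sources and generates each subject's recordings with its own mixing matrix and sensor noise, Xi = Ai S + Ni.

```python
import numpy as np

rng = np.random.RandomState(0)
n_subjects, n_components, n_sensors, n_samples = 2, 3, 5, 2_000

# Shared, independent, non-Gaussian sources (common to all subjects).
S = rng.laplace(size=(n_components, n_samples))

X = []
for i in range(n_subjects):
    A_i = rng.randn(n_sensors, n_components)   # subject-specific mixing
    sigma_i = 0.1 * (i + 1)                    # subject-specific noise level
    N_i = sigma_i * rng.randn(n_sensors, n_samples)
    X.append(A_i @ S + N_i)                    # X_i = A_i S + N_i

print([x.shape for x in X])  # [(5, 2000), (5, 2000)]
```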
There are many methods that provide a solution to the group ICA problem. A very popular one, introduced by Calhoun in 2001, is to stack the data of all subjects feature-wise and then perform a PCA, a principal component analysis, on the stacked data. You thereby obtain reduced data, and you apply independent component analysis on the reduced data to obtain a set of sources.
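A minimal sketch of this stack-then-PCA-then-ICA pipeline, assuming the simulated list X from above and using scikit-learn; this is an illustration of the idea, not the reference implementation.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

n_components = 3

# Stack the subjects feature-wise: samples stay aligned, sensors are concatenated.
X_stacked = np.vstack(X)                       # shape (n_subjects * n_sensors, n_samples)

# PCA reduces the stacked data to n_components components.
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_stacked.T).T   # shape (n_components, n_samples)

# ICA on the reduced data yields the estimated shared sources.
ica = FastICA(n_components=n_components, random_state=0)
S_hat = ica.fit_transform(X_reduced.T).T       # shape (n_components, n_samples)
```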
Another formulation, introduced by Varoquaux in 2010, is called CanICA. You replace the principal component analysis with a multiset CCA, a multiset canonical correlation analysis, where you have to solve a generalized eigenvalue problem. There are many different formulations of multiset CCA, but this one, based on a generalized eigenvalue problem, is the fastest to solve.
CanICA and ConcatICA have a lot of advantages. First, they are very fast to fit; second, they are simple to implement. These are the two reasons why they are so popular in neuroimaging. However, they do not optimize a proper likelihood, and therefore they do not enjoy the benefits of likelihood-based estimators, such as asymptotic efficiency.
There is a lot of other related work that does optimize a proper likelihood. I want to mention independent vector analysis, a very powerful framework introduced by Li in 2008; the unified approach of Guo in 2008, which we will also talk about later; the approach of Shen in 2015, which also allows one to perform dimension reduction; and the MultiView ICA that was introduced by our team last year.
I want to quickly point out that it is not obvious to design a likelihood-based approach that is tractable. With the example of the Gaussian mixture noisy ICA of Bermond and Cardoso, we will see that the standard approach leads to intractable algorithms. The model here is the same as in group ICA, but we assume that the noise is Gaussian with the same variance for all subjects. We also assume that the sources follow a Gaussian mixture model, and we further assume that the weights of the Gaussian mixture are known. We can fit such a model via expectation maximization. If we write the E-step, we get a closed form that involves a very large sum over all configurations of the mixture labels. Because of the size of this sum, the E-step, and therefore the EM algorithm, is intractable whenever q and k are large.
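To see where the blow-up comes from, here is a schematic sketch of the E-step in my own notation (not the talk's): with q sources, each modeled as a Gaussian mixture with k components, the posterior expectation is a sum over all k^q configurations of the mixture labels.

```latex
% Schematic E-step for noisy ICA with Gaussian-mixture sources.
% z = (z_1, \dots, z_q) are the mixture labels of the q sources,
% so the expectation is a sum over the k^q configurations of z.
\begin{align*}
\mathbb{E}\left[s \mid x\right]
  = \sum_{z \in \{1,\dots,k\}^q} p(z \mid x)\, \mathbb{E}\left[s \mid x, z\right]
\end{align*}
% Each conditional term is Gaussian and has a closed form,
% but the number of terms, k^q, grows exponentially with q.
```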
Our contribution is Shared ICA, ShICA for short, where the data of subject i are modeled as a linear mixture of noisy sources; the noise here is not on the sensors but on the sources. The noise is Gaussian, with a variance that can be different for each subject and for each component. The sources S are assumed to be independent but, in contrast to almost all existing work, some components can be Gaussian. We have a few blanket assumptions: the data are centered, the mixing matrices are invertible, the sources have identical variance, and the number of subjects is at least 3.
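Here is a small simulation sketch of this generative model, Xi = Ai (S + Ni), with subject- and component-specific noise variances; again this is my own illustrative code, not the authors' implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
n_subjects, n_components, n_samples = 4, 3, 2_000

# Shared sources; in ShICA some of them are allowed to be Gaussian.
S = np.vstack([
    rng.laplace(size=(2, n_samples)),   # two non-Gaussian components
    rng.randn(1, n_samples),            # one Gaussian component
])

X = []
for i in range(n_subjects):
    A_i = rng.randn(n_components, n_components)        # invertible mixing (a.s.)
    # Noise on the sources, with a variance per subject and per component.
    noise_std = rng.uniform(0.1, 0.5, size=(n_components, 1))
    N_i = noise_std * rng.randn(n_components, n_samples)
    X.append(A_i @ (S + N_i))                           # X_i = A_i (S + N_i)
```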
We have two algorithms to solve the ShICA model: ShICA-J, a fast algorithm based on multiset CCA, and ShICA-ML, a maximum likelihood approach.
In ShICA, there are two ways to recover the parameters. Either the sources are non-Gaussian, in which case we can use classical ICA results to recover the unmixing matrices; or, when the components are Gaussian, we need something else, and what we use here is noise diversity: when the noise is sufficiently diverse, it is possible to recover the unmixing matrix and the noise covariance up to permutation and sign indeterminacies. Note that noise diversity in the Gaussian components is also a necessary condition: if it does not hold, then ShICA is not identifiable.
Let us now focus on the theorem that is at the core of the ShICA-J algorithm. Namely, it shows that we can solve group ICA with multiset CCA. Assume the data follow the ShICA model, and consider multiset CCA framed as a generalized eigenvalue problem. This generalized eigenvalue problem relies on two matrices, C and D: C is formed by second-order statistics, and D is formed by the diagonal blocks of C. If we solve this eigenvalue problem and take the first k leading eigenvectors, we can recover the correct unmixing matrices from them, up to a permutation and a scaling. This can only be done if the k first eigenvalues are distinct.
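A minimal sketch of this multiset CCA step, under my reading of the construction described above: C stacks the between-subject second-order statistics (cross-covariances), D keeps only its diagonal blocks, and the unmixing matrices are read off the leading generalized eigenvectors. The exact normalization and variable names are assumptions for illustration, not the paper's reference code.

```python
import numpy as np
from scipy.linalg import eigh

def multiset_cca(X, k):
    """Multiset CCA via a generalized eigenvalue problem C w = lambda D w.

    X : list of arrays of shape (p, n), one per subject.
    k : number of shared components to extract.
    Returns a list of (k, p) unmixing matrices, one per subject.
    """
    m, p, n = len(X), X[0].shape[0], X[0].shape[1]
    # C: all cross-covariance blocks C_ij = X_i X_j^T / n.
    C = np.block([[Xi @ Xj.T / n for Xj in X] for Xi in X])
    # D: only the diagonal (within-subject) blocks of C.
    D = np.zeros_like(C)
    for i in range(m):
        sl = slice(i * p, (i + 1) * p)
        D[sl, sl] = C[sl, sl]
    # Leading eigenvectors of the generalized eigenvalue problem.
    eigvals, eigvecs = eigh(C, D)          # eigenvalues in ascending order
    leading = eigvecs[:, ::-1][:, :k]      # keep the k largest
    # Block i of the leading eigenvectors gives subject i's unmixing matrix.
    return [leading[i * p:(i + 1) * p, :].T for i in range(m)]

W = multiset_cca(X, k=3)   # using the simulated ShICA data from above
```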
Note that the distinct eigenvalue condition is also necessary: if two eigenvalues are equal, this introduces an additional indeterminacy, and therefore we cannot solve group ICA. Note also that the condition that these eigenvalues be distinct is stronger than the noise diversity condition we have in the identifiability theorem. Therefore we can exhibit an example that is identifiable but on which multiset CCA will fail; I refer you to the paper for more details on this.
In our theorem, in order to recover the correct unmixing matrices, we need access to the second-order statistics. In practice, however, we only have access to them up to some sampling noise, and because the mapping from matrices to eigenvectors is highly non-smooth, a small deviation in the second-order statistics can lead to a large deviation in the recovered unmixing matrix. To show this in practice, we take three subjects, two components, and noise covariance matrices with two values, lambda1 and lambda2, separated by an eigengap epsilon. We compare the solution of multiset CCA on the true covariance matrices and on perturbed covariance matrices, where the perturbation scale is given by delta. For different values of epsilon (10^-4, 10^-3, 10^-2, 10^-1), we show how the performance of the algorithm, measured by the Amari distance between the true unmixing matrix and the estimated unmixing matrix, varies as the perturbation scale increases. We see that when the eigengap is very small, 10^-4 (the violet curve), even a very small perturbation leads to a very bad Amari distance; the black dashed curve is the performance of chance.
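For reference, here is one common way to compute the Amari distance used as the error metric here; the exact normalization varies across papers, so treat the constant as an assumption.

```python
import numpy as np

def amari_distance(W, A):
    """Amari distance between an unmixing matrix W and a mixing matrix A.

    It is zero exactly when P = W A is a scaled permutation, i.e. when the
    sources are recovered up to permutation and scale.
    """
    P = np.abs(W @ A)
    p = P.shape[0]
    row_term = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
    col_term = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (row_term.sum() + col_term.sum()) / (2 * p * (p - 1))

# Sanity check: a scaled, permuted inverse of the true mixing gives distance 0.
A = np.random.randn(3, 3)
W = np.diag([2.0, -1.0, 0.5]) @ np.linalg.inv(A)[[2, 0, 1]]
print(amari_distance(W, A))   # ~0
```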
Luckily, there is typically a large gap between the k-th and the (k+1)-th eigenvalue. This means that in practice the span of the k leading eigenvectors is approximately preserved, so we can recover the true unmixing matrix from the unmixing matrix estimated by multiset CCA just by multiplying by a matrix Q. In order to estimate Q, we make use of the fact that the unmixed data should have a diagonal covariance. This leads to a joint diagonalization problem that we can solve efficiently.
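As an illustration of this rotation-fixing step, here is a toy sketch of joint diagonalization in the special case of two symmetric positive-definite matrices, where an exact joint diagonalizer is given by a generalized eigendecomposition. The actual algorithm jointly diagonalizes the unmixed covariances of all subjects with an approximate joint diagonalization solver, which this sketch does not reproduce.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.RandomState(0)
k = 3

# Two covariance-like matrices that share a common diagonalizing basis.
B = rng.randn(k, k)                       # hidden common basis
C1 = B @ np.diag([1.0, 2.0, 3.0]) @ B.T   # both matrices are SPD
C2 = B @ np.diag([5.0, 1.0, 0.5]) @ B.T

# The generalized eigendecomposition of (C1, C2) yields a matrix Q such that
# Q^T C1 Q and Q^T C2 Q are both diagonal (exact for two SPD matrices).
_, Q = eigh(C1, C2)

print(np.round(Q.T @ C1 @ Q, 6))
print(np.round(Q.T @ C2 @ Q, 6))
```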
If we go back to the experiment from the previous slide, the results are shown here again: you can see the violet curve, which is very sensitive to perturbations. If we apply joint diagonalization, all these curves move down and join the dashed curve at the bottom. This is much better, because the new curves, represented by the dashed lines, are much less sensitive to perturbations.
So now we have obtained the correct unmixing matrix, but only up to a scaling. We therefore need an additional step to find the correct scaling, and another one to find the remaining unestimated parameters, the noise covariances. Luckily, the noise covariances are easy to estimate: we can do this via an EM algorithm whose E-step and M-step are in closed form, which yields a very fast algorithm. However, ShICA-J is not a maximum likelihood estimator.
So now we will focus on ShICA-ML, which is our maximum likelihood estimator. I won't go into too much detail here, but we optimize the likelihood via an EM algorithm, using a Gaussian mixture assumption on the sources, where we assume that the weights are known. What I want to highlight is that the E-step of the algorithm, the one that gives you the expectation and the variance of the sources given the data, only involves sums of size two. Previously we had a sum with an exponential number of terms; here we do not. The E-step is therefore much faster than before, and the EM algorithm is now tractable, whereas previously it was not.
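The reason the sum shrinks from k^q terms to sums of size two is, as I understand it, that each source is modeled as a two-component Gaussian mixture and that, with the noise placed on the sources, the posterior factorizes across components. Schematically, in my notation (not the talk's):

```latex
% Schematic: the posterior over the mixture label z_j of component j depends
% only on that component, so the E-step decomposes into q independent sums of
% size 2 instead of one sum of size 2^q.
\begin{align*}
\mathbb{E}\left[s_j \mid x\right]
  = \sum_{z_j \in \{1, 2\}} p(z_j \mid x)\, \mathbb{E}\left[s_j \mid x, z_j\right],
  \qquad j = 1, \dots, q.
\end{align*}
```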
I first want to present our synthetic experiments, where we generate data according to the ShICA-ML and ShICA-J models. In case A, we have only Gaussian components but we have noise diversity, and therefore methods that use noise diversity to recover the sources, such as ShICA-ML and ShICA-J, perform best. In the second case, we have only non-Gaussian components and no noise diversity, so methods that use non-Gaussianity, such as CanICA, ShICA-ML, or MultiView ICA, perform well. In the last case, half of the components are Gaussian with noise diversity and the other half are non-Gaussian without noise diversity; in this case, only ShICA-ML is able to correctly recover the sources. MultiView ICA comes closest, but it is not as good as ShICA-ML.
Let us now talk about our experiments on real data. We have a reconstruction experiment on fMRI data, where subjects are exposed to a naturalistic stimulus, namely movie watching. We use 80% of the movie to learn the unmixing matrices of all subjects. Then, on the left-out 20% of the movie, we compute the common sources from 80% of the subjects, and from these common sources we try to reconstruct the data of the left-out 20% of subjects. We compute the R2 score within regions of interest between the reconstructed data and the true data, and plot it as a function of the number of components used.
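A rough sketch of how such a reconstruction score could be computed (my own illustration of the principle; the actual experiment, with its regions of interest and estimators, is more involved): estimate the shared sources on the held-out time segment from the other subjects, reconstruct the left-out subject from them, and measure the R2 score.

```python
import numpy as np
from sklearn.metrics import r2_score

def reconstruction_r2(X_test, W, left_out):
    """X_test : list of (p, n) test-segment data, one array per subject.
    W        : list of (k, p) unmixing matrices learned on the train segment.
    left_out : index of the subject whose test data we try to reconstruct.
    """
    # Shared sources on the test segment, estimated from the other subjects.
    S_shared = np.mean(
        [W[i] @ X_test[i] for i in range(len(X_test)) if i != left_out],
        axis=0,
    )
    # Reconstruct the left-out subject from the shared sources.
    A_hat = np.linalg.pinv(W[left_out])          # (p, k) estimated mixing
    X_rec = A_hat @ S_shared
    return r2_score(X_test[left_out].T, X_rec.T)
```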
As we can see, ShICA-ML outperforms all of the other methods. As a take-home message, ShICA is a powerful framework to extract shared sources. ShICA-J is a fast approach to fit the model, but it only uses second-order information. In contrast, ShICA-ML is a bit slower, but it is able to use non-Gaussianity in addition to second-order information. In practice, ShICA-ML yields the best results.
The methods we have introduced work on reduced data. It would be interesting to know how to reduce the data so that these methods perform optimally. Another way to improve our results would be to learn the density of the shared sources in ShICA-ML instead of keeping it fixed. Thanks for listening, and have a good day!