WEBVTT
00:00.000 --> 00:14.000
Hi, I'm Hugo Richard, a third-year PhD student at Université Paris-Saclay.
00:14.000 --> 00:18.480
I'm in the Inria Parietal team and my supervisor is Bertrand Thirion.
00:18.480 --> 00:24.600
Today I'll talk about shared independent component analysis for multi-subject neuroimaging.
00:24.600 --> 00:31.400
This is joint work with Pierre Ablin, Alexandre Gramfort, Bertrand Thirion and Aapo Hyvärinen.
00:31.400 --> 00:36.360
First let us consider two sources that are emitting a signal that is recorded by two
00:36.360 --> 00:37.360
sensors.
00:37.360 --> 00:43.120
This can be seen as a simplified model of magnetoencephalography where brain sources
00:43.120 --> 00:46.000
are recorded by magnetometers.
00:46.000 --> 00:50.200
Because propagation time can be neglected, the signal recorded by the sensors can be
00:50.200 --> 00:55.840
seen as a linear mixture of the signal emitted by the sources.
00:55.840 --> 00:59.600
S is a set of sources that are assumed to be independent.
00:59.600 --> 01:06.400
X are the recordings and A describes how the sources are mixed to produce the recordings.
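As a minimal sketch of this generative model (my own toy example, not the data from the talk), the snippet below simulates two independent non-Gaussian sources, mixes them with a random matrix A, and recovers them with FastICA from scikit-learn; the signals, dimensions, and variable names are illustrative assumptions.

```python
# Minimal sketch of X = A S: two independent sources, two sensors (illustrative).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
n_samples = 2000
t = np.linspace(0, 8, n_samples)

S = np.c_[np.sign(np.sin(3 * t)),           # non-Gaussian source 1 (square wave)
          rng.laplace(size=n_samples)]      # non-Gaussian source 2 (Laplace noise)
A = rng.randn(2, 2)                         # unknown mixing matrix
X = S @ A.T                                 # recordings, shape (n_samples, 2)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                # recovered sources (up to permutation/scale)
A_est = ica.mixing_                         # estimated mixing matrix
```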
01:06.400 --> 01:12.120
At first sight this model may seem ill-defined because if we permute two columns in A and
01:12.120 --> 01:19.600
permute the corresponding sources in S, we'll get a new set of sources S' and a new mixing
01:19.600 --> 01:25.360
matrix A' that describes X just as well as A and S.
01:25.360 --> 01:30.360
And similarly, if we scale one column of A by some constant
01:30.360 --> 01:34.920
and divide the corresponding source by the same constant, we'll also get an equivalent
01:34.920 --> 01:35.920
description of X.
01:35.920 --> 01:44.840
However, these scale and permutation indeterminacies are the only ones if the sources contain at
01:44.840 --> 01:46.840
most one Gaussian component.
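This indeterminacy is easy to check numerically; the following sketch (my own toy example) permutes and rescales the columns of A, applies the inverse transformation to the sources, and verifies that the observations are unchanged.

```python
# Numerical check that permuting and rescaling the columns of A (and applying the
# inverse transformation to the rows of S) leaves the observations X = A S unchanged.
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(2, 2)
S = rng.laplace(size=(2, 1000))            # sources as rows

P = np.array([[0., 1.], [1., 0.]])         # permutation
D = np.diag([2.0, -0.5])                   # rescaling (with a sign flip)

A_prime = A @ P @ D                        # new mixing matrix
S_prime = np.linalg.inv(D) @ P.T @ S       # correspondingly transformed sources

assert np.allclose(A @ S, A_prime @ S_prime)
```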
01:46.840 --> 01:52.040
Let us consider the more general problem where you have multiple subjects that are exposed
01:52.040 --> 01:54.560
to the same stimuli.
01:54.560 --> 02:00.640
We have two subjects, X1 and X2, and they have different mixing matrices, A1 and A2,
02:00.640 --> 02:04.560
and different noise levels, N1 and N2.
02:04.560 --> 02:08.720
The interpretation is that they have shared sources because they have shared cognitive
02:08.720 --> 02:09.720
processes.
02:09.720 --> 02:15.120
They have different mixing matrices because they have different spatial topography.
02:15.120 --> 02:20.600
And they have different noises because we want to model inter-subject variability.
02:20.600 --> 02:22.480
This model is called group ICA.
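As a small illustration of this generative model, the sketch below draws shared sources and gives each subject its own mixing matrix and sensor noise; the dimensions and noise levels are arbitrary choices for the example, not values from the talk.

```python
# Illustrative generation of group ICA data: shared sources, subject-specific
# mixing matrices and sensor noise (X_i = A_i S + N_i).
import numpy as np

rng = np.random.RandomState(0)
n_subjects, n_components, n_samples, n_sensors = 2, 3, 1000, 3

S = rng.laplace(size=(n_components, n_samples))             # shared sources
X = []
for i in range(n_subjects):
    A_i = rng.randn(n_sensors, n_components)                # subject-specific mixing
    N_i = 0.1 * (i + 1) * rng.randn(n_sensors, n_samples)   # subject-specific noise level
    X.append(A_i @ S + N_i)
```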
02:22.480 --> 02:27.840
There are many methods to provide a solution for the group ICA problem.
02:27.840 --> 02:34.560
A very popular one introduced by Calhoun in 2001 is to just stack the data of all subjects
02:34.560 --> 02:42.520
feature-wise and then perform a PCA, a principal component analysis, on the stacked data.
02:42.520 --> 02:47.520
And therefore you obtain reduced data and apply independent component analysis on the
02:47.520 --> 02:50.520
reduced data to obtain a set of sources.
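Here is a hedged sketch of this concatenation recipe using scikit-learn's PCA and FastICA; the feature-wise stacking and the helper name concat_ica are illustrative, not the reference implementation.

```python
# Sketch of the concatenation approach: stack subjects feature-wise,
# reduce with PCA, then run ICA on the reduced data (illustrative).
import numpy as np
from sklearn.decomposition import PCA, FastICA

def concat_ica(X_list, n_components):
    """X_list: list of (n_sensors, n_samples) arrays, one per subject."""
    X_stacked = np.vstack(X_list)                  # stack along the feature axis
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X_stacked.T).T   # (n_components, n_samples)
    ica = FastICA(n_components=n_components, random_state=0)
    sources = ica.fit_transform(X_reduced.T).T     # shared sources
    return sources
```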
02:50.520 --> 02:55.960
Another formulation was introduced by Varoquaux in 2010 and is called CanICA.
02:55.960 --> 03:01.320
You just replace the principal component analysis with a multiset CCA, so a multiset canonical
03:01.320 --> 03:06.120
correlation analysis, where you have to solve a generalized eigenvalue problem.
03:06.120 --> 03:12.800
There are many different formulations of multiset CCA, but this one with a generalized eigenvalue
03:12.800 --> 03:15.560
problem is the fastest to solve.
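One possible way to set up this generalized eigenvalue problem in code is sketched below, assuming (as detailed later in the talk) that C stacks all cross-covariance blocks between subjects and D keeps only its diagonal blocks; the function name and shapes are illustrative.

```python
# Sketch of multiset CCA as a generalized eigenvalue problem (illustrative):
# C holds all cross-covariance blocks between subjects, D only its diagonal blocks.
import numpy as np
from scipy.linalg import eigh

def multiset_cca(X_list, n_components):
    """X_list: list of (p, n_samples) arrays, one per subject."""
    m = len(X_list)
    p = X_list[0].shape[0]
    X_all = np.vstack(X_list)                      # (m * p, n_samples)
    C = np.cov(X_all)                              # all second-order statistics
    D = np.zeros_like(C)
    for i in range(m):
        sl = slice(i * p, (i + 1) * p)
        D[sl, sl] = C[sl, sl]                      # keep only the diagonal blocks
    eigvals, eigvecs = eigh(C, D)                  # generalized eigenvalue problem
    order = np.argsort(eigvals)[::-1]              # sort by decreasing eigenvalue
    return eigvals[order[:n_components]], eigvecs[:, order[:n_components]]
```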
03:15.560 --> 03:17.840
CanICA and ConcatICA have a lot of advantages.
03:17.840 --> 03:21.000
First, they are very fast to fit.
03:21.000 --> 03:23.320
And second, they are simple to implement.
03:23.320 --> 03:26.920
These are the two reasons why they are so popular in neuroimaging.
03:26.920 --> 03:30.160
However, they do not optimize the proper likelihood.
03:30.160 --> 03:35.680
Therefore they do not benefit from the advantages of such estimators, such as asymptotic efficiency.
03:35.680 --> 03:41.480
There is a lot of other related work that does optimize the proper likelihood.
03:41.480 --> 03:46.240
I want to mention the independent vector analysis, which is a very powerful framework introduced
03:46.240 --> 03:48.760
by Li in 2008.
03:48.760 --> 03:54.560
There is also the unified approach of Guo in 2008, which we will talk about later.
03:54.560 --> 04:01.040
The approach of Shen in 2015 also allows one to perform dimension reduction.
04:01.040 --> 04:08.320
And there is MultiViewICA, which was introduced by our team last year.
04:08.320 --> 04:15.200
I want to quickly say that it is not obvious how to design a likelihood-based approach that
04:15.200 --> 04:17.400
is tractable.
04:17.400 --> 04:23.680
And with this example of the Gaussian mixture noisy ICA of Bermond and Cardoso, we'll see
04:23.680 --> 04:31.400
that the standard approach leads to intractable algorithms.
04:31.400 --> 04:37.080
The model we take here is the same as the group ICA, but we assume that the noise is
04:37.080 --> 04:40.120
Gaussian with the same variance for all subjects.
04:40.120 --> 04:47.600
We'll also assume that the sources follow a Gaussian mixture model.
04:47.600 --> 04:53.040
And we further assume that the weights of the Gaussian mixtures are known.
04:53.040 --> 04:56.360
We can solve such a model via expectation-maximization.
04:56.360 --> 05:01.400
And if we write the E-step, we'll get a closed form that involves a large sum.
05:01.400 --> 05:09.040
Because of the large size of this sum, the EM algorithm is intractable whenever
05:09.040 --> 05:11.600
q and k are large.
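Schematically (this is an illustrative rendering, not the exact expression from the talk or the paper), the E-step has to marginalize over every joint assignment of mixture labels:

```latex
% Schematic form of the E-step (illustrative): with q sources, each modeled by a
% k-component Gaussian mixture, the posterior over the sources marginalizes over
% all joint label assignments \alpha,
\[
  p(s \mid x) \;=\; \sum_{\alpha \in \{1,\dots,k\}^{q}} p(\alpha \mid x)\, p(s \mid x, \alpha),
\]
% a sum with k^q terms, which is intractable when k and q are large.
```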
05:11.600 --> 05:17.520
Our contribution is Shared ICA, or ShICA for short, where the data of subject
05:17.520 --> 05:23.080
i are modeled as a linear mixture of noisy sources, and the noise here is not on the
05:23.080 --> 05:24.080
sensors, but on the sources.
05:24.080 --> 05:30.000
The noise is Gaussian with a variance that can be different for each subject and different
05:30.000 --> 05:31.000
for each component.
05:31.000 --> 05:37.800
S are assumed to be independent, but in contrast to almost all existing work, some components
05:37.800 --> 05:38.800
can be Gaussian.
05:38.800 --> 05:41.600
We have a few blanket assumptions.
05:41.600 --> 05:45.840
We assume that the data are centered, that the mixing matrices are invertible, that the
05:45.840 --> 05:50.680
sources have identical variance, and that the number of subjects is greater than 3.
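A toy generator for this model, under the assumptions just listed, is sketched below; the numbers of subjects and components, the noise ranges, and the source distributions are arbitrary choices for illustration.

```python
# Illustrative generation from the ShICA model: x_i = A_i (s + n_i), where the
# noise lives on the sources and its variance differs per subject and component.
import numpy as np

rng = np.random.RandomState(0)
m, p, n = 4, 3, 2000                       # subjects, components, samples

# Shared sources: some Gaussian, some non-Gaussian (allowed in ShICA).
S = np.vstack([rng.randn(1, n),            # Gaussian component
               rng.laplace(size=(2, n))])  # non-Gaussian components
S /= S.std(axis=1, keepdims=True)          # identical-variance assumption

X = []
for i in range(m):
    A_i = rng.randn(p, p)                            # invertible with probability 1
    sigma_i = rng.uniform(0.1, 1.0, size=(p, 1))     # per-subject, per-component noise std
    N_i = sigma_i * rng.randn(p, n)
    X.append(A_i @ (S + N_i))
```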
05:50.680 --> 05:54.000
We have two algorithms to solve the ShICA model.
05:54.000 --> 06:01.520
We have ShICA-J, which is a fast algorithm based on multiset CCA, and ShICA-ML, a
06:01.520 --> 06:04.000
maximum likelihood approach.
06:04.000 --> 06:07.600
In ShICA, there are two ways to recover the parameters.
06:07.600 --> 06:12.880
Either the sources are non-Gaussian, in which case we can use classical ICA results to recover
06:12.880 --> 06:15.720
the unmixing matrices.
06:15.720 --> 06:20.120
When the components are Gaussian, then we need something else, and what we use here
06:20.120 --> 06:22.480
is noise diversity.
06:22.480 --> 06:28.320
When the noise is sufficiently diverse, then it's possible to recover the unmixing matrix
06:28.320 --> 06:34.120
and the noise covariance up to a permutation and sign indeterminacy.
06:34.120 --> 06:38.240
Note that the noise diversity in Gaussian components is also a necessary condition.
06:38.240 --> 06:42.680
If it does not hold, then the ShICA model is not identifiable.
06:42.680 --> 06:48.520
Let us now focus on this theorem that is at the core of the ShICA-J algorithm.
06:48.520 --> 06:53.520
Namely it shows that we can solve group ICA with multiset CCA.
06:53.520 --> 06:58.880
So assume the data follow the ShICA model, and consider multiset CCA framed as a
06:58.880 --> 07:00.920
generalized eigenvalue problem.
07:00.920 --> 07:08.080
This generalized eigenvalue problem relies on two matrices, C and D. So C is formed by
07:08.080 --> 07:13.560
second-order statistics, and D is formed by the diagonal blocks in C.
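In schematic notation, with C_{ij} denoting the cross-covariance between subjects i and j (this rendering is mine, not copied verbatim from the paper), the problem reads:

```latex
% Schematic form of the generalized eigenvalue problem (illustrative notation):
\[
  C =
  \begin{pmatrix}
    C_{11} & C_{12} & \cdots & C_{1m} \\
    C_{21} & C_{22} & \cdots & C_{2m} \\
    \vdots &        & \ddots & \vdots \\
    C_{m1} & C_{m2} & \cdots & C_{mm}
  \end{pmatrix},
  \qquad
  D = \operatorname{diag}(C_{11}, \dots, C_{mm}),
  \qquad
  C u = \lambda D u,
\]
% where C_{ij} = E[x_i x_j^T] collects the second-order statistics and D keeps
% only the diagonal blocks of C.
```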
07:13.560 --> 07:19.880
And so if we solve this eigenvalue problem and take the first k leading eigenvectors,
07:19.880 --> 07:26.520
we can recover the correct unmixing matrix from them, up to a permutation and a scaling.
07:26.520 --> 07:32.000
And this can only be done if the k first eigenvalues are distinct.
07:32.000 --> 07:34.320
Note that the distinct eigenvalue condition is also necessary.
07:34.320 --> 07:40.480
If two eigenvalues are the same, then this adds an indeterminacy, and therefore
07:40.480 --> 07:42.280
we cannot solve group ICA.
07:42.280 --> 07:48.640
Note also that the condition that some eigenvalues need to be distinct is stronger than the noise
07:48.640 --> 07:54.080
diversity condition we have in the identifiability theorem.
07:54.080 --> 07:59.360
And therefore we can exhibit an example which is identifiable, but on which multiset CCA
07:59.360 --> 08:00.360
will fail.
08:00.360 --> 08:04.800
And I refer you to the paper for more details on this.
08:04.800 --> 08:10.160
So in our theorem, in order to recover the correct unmixing matrix, we need to have access
08:10.160 --> 08:12.480
to the second-order statistics.
08:12.480 --> 08:18.860
However, in practice, we only have access to them, up to some sampling noise.
08:18.860 --> 08:24.520
And because the mapping from matrices to eigenvectors is highly non-smooth, a small deviation in
08:24.520 --> 08:31.160
the second-order statistics can lead to a high deviation of the recovered unmixing matrix.
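This sensitivity is easy to reproduce on a toy symmetric matrix with a small eigengap; the following snippet (an illustration, not the experiment from the talk) shows the leading eigenvector rotating substantially under a tiny perturbation.

```python
# Toy illustration: when the eigengap is small, a tiny perturbation of the
# matrix can rotate the leading eigenvector by a large angle.
import numpy as np

rng = np.random.RandomState(0)
eps = 1e-4                                     # eigengap
M = np.diag([1.0 + eps, 1.0, 0.1])             # "true" second-order statistics

E = rng.randn(3, 3)
E = 1e-3 * (E + E.T) / 2                       # small symmetric perturbation

u_true = np.linalg.eigh(M)[1][:, -1]           # leading eigenvector of M
u_pert = np.linalg.eigh(M + E)[1][:, -1]       # leading eigenvector of M + E

angle = np.degrees(np.arccos(abs(u_true @ u_pert)))
print(f"rotation of the leading eigenvector: {angle:.1f} degrees")
```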
08:31.160 --> 08:38.080
Now to show this in practice, we take three subjects, two components, and noise covariance
08:38.080 --> 08:47.440
matrices with two values, lambda1 and lambda2, that are separated by an eigengap epsilon.
08:47.440 --> 08:52.440
And we compare the solution of multiset CCA on the true covariance matrices and on the
08:52.440 --> 08:59.520
perturbed covariance matrices, where the perturbation scale is given by delta.
08:59.520 --> 09:07.240
And for different values of epsilon, 10^-4, 10^-3, 10^-2, 10^-1, we show how the performance
09:07.240 --> 09:14.720
of the algorithm, so the Amari distance between the true unmixing matrix and the estimated
09:14.720 --> 09:20.880
unmixing matrix, varies when the perturbation scale increases.
09:20.880 --> 09:26.600
And we see that when the eigengap is very small, so 10^-4, the violet curve, then even
09:26.600 --> 09:31.440
with a very small perturbation, you can get a very bad Amari distance.
09:31.440 --> 09:35.720
The black dashed curve is the chance level.
09:35.720 --> 09:41.200
Luckily, there is a large gap between the k-th eigenvalue and the (k+1)-th.
09:41.200 --> 09:46.120
This means that in practice, the span of the k leading eigenvectors is approximately preserved.
09:46.120 --> 09:53.600
We can recover the true unmixing matrix from the unmixing matrix estimated by multiset
09:53.600 --> 09:56.520
CCA, just by multiplying by a matrix Q.
09:56.520 --> 10:02.640
And in order to estimate Q, we make use of the fact that the unmixed data should have
10:02.640 --> 10:03.640
a diagonal covariance.
10:03.640 --> 10:09.680
This leads us to a joint diagonalization problem that we can solve efficiently.
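As a toy illustration of joint diagonalization (not the algorithm used in the paper): for exactly two symmetric positive-definite matrices, a generalized eigendecomposition diagonalizes both at once; with more matrices one would use an approximate joint diagonalization routine instead. All names below are illustrative.

```python
# For two symmetric positive-definite covariances, a generalized eigendecomposition
# jointly diagonalizes them exactly (a toy stand-in for the approximate joint
# diagonalization used with more than two matrices).
import numpy as np
from scipy.linalg import eigh

rng = np.random.RandomState(0)
B = rng.randn(3, 3)                               # hidden joint diagonalizer
D1, D2 = np.diag([1., 2., 3.]), np.diag([4., 1., 0.5])
C1 = np.linalg.inv(B) @ D1 @ np.linalg.inv(B).T   # two covariances diagonalized by B
C2 = np.linalg.inv(B) @ D2 @ np.linalg.inv(B).T

_, V = eigh(C1, C2)                               # generalized eigenvectors
Q = V.T                                           # candidate joint diagonalizer
print(np.round(Q @ C1 @ Q.T, 3))                  # ~diagonal
print(np.round(Q @ C2 @ Q.T, 3))                  # ~diagonal
```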
10:09.680 --> 10:14.480
So if we go back to the experiment from the previous slide, the results are shown
10:14.480 --> 10:15.480
again here.
10:15.480 --> 10:21.640
You can see the violet curve, which is very sensitive to perturbations.
10:21.640 --> 10:29.360
And so if we apply joint diagonalization, all these curves move, and they join the dashed
10:29.360 --> 10:30.360
curve on the bottom.
10:30.360 --> 10:34.720
And this is much better, because the new curves, represented by the dashed
10:34.720 --> 10:42.920
lines, are less sensitive to perturbations.
10:42.920 --> 10:47.920
So now we've obtained the correct unmixing matrix, but up to a scaling.
10:47.920 --> 10:55.040
And so we need an additional step to find the correct scaling, and another one to find
10:55.040 --> 11:00.680
the remaining unestimated parameters, namely the noise covariances.
11:00.680 --> 11:04.000
And luckily, it's very easy to find the noise covariances.
11:04.000 --> 11:06.280
We can do this via an EM algorithm.
11:06.280 --> 11:11.920
The E-step and the M-step are in closed form, and this yields a very fast algorithm.
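To give a flavour of such a closed-form EM, here is a deliberately simplified scalar analogue (my own toy model, not the estimator from the paper): one shared signal observed by several subjects, with subject-specific noise variances estimated by EM.

```python
# Toy scalar analogue of a closed-form EM: y_i = s + n_i, with s ~ N(0, 1)
# and n_i ~ N(0, sigma_i^2); estimate the noise variances sigma_i^2.
import numpy as np

rng = np.random.RandomState(0)
m, n = 4, 10000
sigma_true = np.array([0.2, 0.5, 1.0, 2.0])
s = rng.randn(n)
Y = s + sigma_true[:, None] * rng.randn(m, n)      # observations, shape (m, n)

sigma2 = np.ones(m)                                # initial noise variances
for _ in range(50):
    # E-step: Gaussian posterior of s given the observations (closed form).
    post_var = 1.0 / (1.0 + np.sum(1.0 / sigma2))
    post_mean = post_var * np.sum(Y / sigma2[:, None], axis=0)
    # M-step: update each noise variance (closed form).
    sigma2 = np.mean((Y - post_mean) ** 2, axis=1) + post_var

print(np.round(np.sqrt(sigma2), 2))                # approaches sigma_true
```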
11:11.920 --> 11:15.200
But ShICA-J is not a maximum likelihood estimator.
11:15.200 --> 11:22.600
So now we will focus on ShICA-ML, which is our maximum likelihood estimator.
11:22.600 --> 11:31.240
So I won't go too much into detail on this, but we optimize this via an EM, using a Gaussian
11:31.240 --> 11:33.480
mixture assumption on the sources.
11:33.480 --> 11:35.960
We assume that the weights are known.
11:35.960 --> 11:41.480
What I just want to showcase here is that the E-step of the algorithm, the one that
11:41.480 --> 11:46.000
gives you the expectation of the sources given the data, and the variance of the sources
11:46.000 --> 11:50.760
given the data, only involves a sum of size 2.
11:50.760 --> 11:57.320
So previously we had a sum that had an exponential number of terms, and here we don't have that
11:57.320 --> 11:58.320
anymore.
11:58.320 --> 12:02.920
So the E-step is much faster than what we had before, and therefore the EM algorithm
12:02.920 --> 12:07.200
here is tractable, whereas it was not the case before.
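Schematically (again an illustrative rendering rather than the paper's exact expressions, with the posterior parameters mu and nu being placeholder symbols), each source component now only requires a marginalization over its own two mixture labels:

```latex
% Schematic contrast with the earlier exponential sum (illustrative only):
% with two mixture components per source, the posterior of each source s_j
% decouples into a sum of size 2,
\[
  p(s_j \mid x) \;=\; \sum_{\alpha = 1}^{2} p(\alpha \mid x)\,
  \mathcal{N}\!\left(s_j;\ \mu_{j,\alpha},\ \nu_{j,\alpha}\right),
\]
% instead of a joint sum over exponentially many label assignments.
```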
12:07.200 --> 12:11.440
I first want to present our synthetic experiment, where we generate data according to the ShICA-ML
12:11.440 --> 12:13.200
and ShICA-J models.
12:13.200 --> 12:18.560
In case A, we have only Gaussian components, but we have noise diversity, and therefore
12:18.560 --> 12:24.240
methods that use noise diversity to recover the sources, such as ShICA-ML and ShICA-J,
12:24.240 --> 12:25.240
perform best.
12:25.240 --> 12:34.000
In the second case, we have only non-Gaussian components and no noise diversity, so methods
12:34.000 --> 12:41.520
that use non-Gaussianity, such as CanICA, ShICA-ML, or MultiViewICA, perform well.
12:41.520 --> 12:45.200
In the last case, half of the components are Gaussian with noise diversity, and the
12:45.200 --> 12:49.000
other half are non-Gaussian but without noise diversity.
12:49.000 --> 12:53.000
And in this case, only ShICA-ML is able to correctly recover the sources.
12:53.000 --> 12:57.960
MultiViewICA doesn't manage that; it's not as good as ShICA-ML.
12:57.960 --> 13:00.400
Let us now talk about our experiments on real data.
13:00.400 --> 13:05.080
We have this reconstruction experiment on fMRI data where subjects are exposed to a
13:05.080 --> 13:07.920
naturalistic stimulus such as movie watching.
13:07.920 --> 13:15.320
We use 80% of the movie to learn the unmixing matrices of all subjects, and then on the
13:15.320 --> 13:22.320
remaining 20% of the movie, we compute the common sources, and from these common sources, computed
13:22.320 --> 13:28.800
using 80% of the subjects, we try to reconstruct the data of the remaining 20% of the subjects.
13:28.800 --> 13:33.880
We compute the R2 score within regions of interest between the reconstructed data and
13:33.880 --> 13:39.480
the true data, and plot them as a function of the number of components used.
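A rough sketch of how such an R2 reconstruction score could be computed is given below, assuming the shared sources on the held-out data are estimated by averaging the unmixed data of the training subjects; the split scheme follows the talk, while the function and variable names are illustrative.

```python
# Sketch of the reconstruction metric: unmixing matrices are fitted on 80% of the
# movie; on the held-out 20%, shared sources estimated from 80% of the subjects
# are used to reconstruct the remaining subjects, scored with R2 (illustrative).
import numpy as np
from sklearn.metrics import r2_score

def reconstruction_r2(X_test, W_list, train_subjects, test_subjects):
    """X_test: list of (p, n) held-out arrays; W_list: fitted unmixing matrices."""
    # Shared sources estimated by averaging the unmixed data of the train subjects.
    S_hat = np.mean([W_list[i] @ X_test[i] for i in train_subjects], axis=0)
    scores = []
    for i in test_subjects:
        X_rec = np.linalg.inv(W_list[i]) @ S_hat       # reconstruct subject i
        scores.append(r2_score(X_test[i].T, X_rec.T))  # R2 over the held-out data
    return np.mean(scores)
```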
13:39.480 --> 13:43.000
As we see, ShICA-ML outperforms all the other methods.
13:43.000 --> 13:47.400
As a take-home message, ShICA is a powerful framework to extract shared sources.
13:47.400 --> 13:52.840
ShICA-J is a fast approach to fit the model, but it only uses second-order information.
13:52.840 --> 13:58.800
In contrast, ShICA-ML is a bit slower, but it is able to use non-Gaussianity in addition
13:58.800 --> 14:00.960
to second-order information.
14:00.960 --> 14:03.840
In practice, ShICA-ML yields the best results.
14:03.840 --> 14:05.960
The methods we've introduced work on reduced data.
14:05.960 --> 14:11.160
It would be interesting to know how to reduce the data so that they perform optimally.
14:11.160 --> 14:15.400
Another way to improve our results would be to learn the density of the shared sources
14:15.400 --> 14:19.480
in ShICA-ML instead of having it fixed.
14:19.480 --> 14:23.400
Thanks for listening, and have a good day!