Spaces:

retkowski
/

ytseg_demo

Running

App Files Files Community

ytseg_demo / demo_data /nips-2021 /25958 /transcript_whisper_large-v2.vtt

retkowski

Add demo

cb71ef5 over 1 year ago

raw

history blame contribute delete

11.7 kB

	WEBVTT

	00:00.000 --> 00:07.000
	Hello everyone, I'm Luigi Carretino, and this is a joint work with Stefano Vigonia,

	00:07.000 --> 00:10.000
	Daniele Calandriello, and Lorenzo Rosasco.

	00:10.000 --> 00:16.000
	The problem that we study in this work is a standard regression problem, where we want

	00:16.000 --> 00:24.000
	to estimate an unknown function f star given n pairs of points, x's and y's, and then

	00:24.000 --> 00:34.000
	given n pairs of points, x's and y's, where y's are noisy evaluations of the functions

	00:34.000 --> 00:38.000
	f star on the input points axis.

	00:41.000 --> 00:46.000
	A well-established method to learn nonlinear functions is kernel ridge regression.

	00:46.000 --> 00:53.000
	The basic idea is to map the input points into a higher dimensional space, where linear

	00:53.000 --> 00:59.000
	relationships can be learned that then translate in nonlinear ones in the input space.

	01:01.000 --> 01:07.000
	To formalize this, we can think about solving a standard empirical risk minimization problem

	01:07.000 --> 01:12.000
	regularized over a spatial function which is a reproducing kernel Hilbert space.

	01:14.000 --> 01:20.000
	Numerically speaking, the solution of this type of problem boils down to solving a linear

	01:20.000 --> 01:26.000
	system. Particularly, we can see here that the linear system is going to be Kc equal

	01:26.000 --> 01:33.000
	y, where K is the kernel matrix evaluated in all the pairs of points of the training

	01:33.000 --> 01:39.000
	sets, c are the weights that we aim to learn, and y's are the output points.

	01:40.000 --> 01:45.000
	We know that this method is optimal from a statistical point of view, but a drawback

	01:45.000 --> 01:52.000
	is that it suffers from computational scalability. In fact, in terms of time complexity, if we

	01:52.000 --> 01:57.000
	have n training points and we want to solve the linear system directly, we'll have to

	01:57.000 --> 02:03.000
	invert the matrix K, and this will cost us n cubed in time.

	02:06.000 --> 02:11.000
	Multiple ways of accelerating this process have been proposed over time.

	02:11.000 --> 02:17.000
	The first one is to solve the methods iteratively instead of inverting directly the matrix K.

	02:18.000 --> 02:25.000
	This allows us to only have matrix vector multiplications, and so the overall cost of

	02:25.000 --> 02:30.000
	an iterative method to solve this linear system is going to be Tn squared.

	02:31.000 --> 02:39.000
	Another method is the one known as sketching, where we can see this as subsampling the linear

	02:39.000 --> 02:46.000
	system, in particular subsampling columns of this linear system, where we can take m

	02:46.000 --> 02:52.000
	columns of the linear system uniformly at random to get a smaller one, and the cost

	02:52.000 --> 02:55.000
	of this will be m squared n.

	02:57.000 --> 03:04.000
	Another method instead is splitting. This allows us to divide the main problem into

	03:04.000 --> 03:12.000
	many, in this case Q, subproblems, each one that can be solved independently and so

	03:12.000 --> 03:20.000
	potentially can be distributed. So we can have a cost which boils down to n over Q to

	03:20.000 --> 03:22.000
	the power of 3.

	03:25.000 --> 03:30.000
	Combinations of these methods have been proposed in the literature. In particular, if

	03:30.000 --> 03:35.000
	we combine iterating and sketching, we can get a solver that can solve the problem in

	03:35.000 --> 03:38.000
	a time complexity of Tmn.

	03:40.000 --> 03:47.000
	If instead we combine sketching and splitting, we can get a solver that can be computed

	03:47.000 --> 03:51.000
	in m squared times n over Q.

	03:51.000 --> 03:59.000
	And in this work, we try to blend all these techniques to derive a new algorithm, which

	03:59.000 --> 04:09.000
	we will call PARC, that can achieve a time complexity of Tm times n over Q to the power

	04:09.000 --> 04:10.000
	of 2.

	04:12.000 --> 04:18.000
	So as we just said, in this work, we propose a new large-scale kernel regression solver

	04:18.000 --> 04:22.000
	that combines the computational benefits of iteration, sketching, and splitting.

	04:23.000 --> 04:27.000
	Notice, though, that these are approximation techniques and they may come at the cost of

	04:27.000 --> 04:35.000
	accuracy. But we are able to show that this new algorithm is able to preserve generalization

	04:35.000 --> 04:37.000
	under suitable partitions.

	04:38.000 --> 04:44.000
	Now also notice that instead of general splitting, we are going to need to focus on a

	04:44.000 --> 04:48.000
	particular type, which is the partitions.

	04:48.000 --> 04:53.000
	So we introduce a new principal partition scheme for kernel methods.

	04:56.000 --> 05:01.000
	We now look at the difference between data splitting and space partitioning.

	05:01.000 --> 05:08.000
	Given a set of points, the procedure of splitting takes groups of points at random and assign

	05:08.000 --> 05:10.000
	them to different splits or clusters.

	05:10.000 --> 05:14.000
	In this picture, for example, we divide the points in four splits.

	05:15.000 --> 05:21.000
	Partitioning instead divides the space in different cells, and then the points are implicitly

	05:21.000 --> 05:25.000
	assigned to a particular cluster based on which cell they belong to.

	05:27.000 --> 05:32.000
	Notice that with the splitting methods, we don't consider local information while we

	05:32.000 --> 05:37.000
	perform the splitting, but we do when we perform partitioning.

	05:37.000 --> 05:42.000
	Now, from this picture, the concept of partitioning a space seems pretty straightforward.

	05:43.000 --> 05:48.000
	However, when you start considering high dimensional feature space, subtle problems can

	05:48.000 --> 05:49.000
	appear.

	05:50.000 --> 05:55.000
	So first, as a recap, remember that there are two important spaces to consider in our

	05:55.000 --> 05:56.000
	regression problem.

	05:57.000 --> 06:04.000
	The input space X with its input space features and the kernel space H with its input space

	06:04.000 --> 06:10.000
	features, and the kernel space H, which potentially has many more implicit features.

	06:13.000 --> 06:17.000
	Traditionally, partition methods are applied directly to the input space.

	06:18.000 --> 06:24.000
	For example, a classical approach is to select a subset of points as centroids and then

	06:24.000 --> 06:30.000
	partition the space in cells by assigning each portion of the space to the closest centroid,

	06:30.000 --> 06:32.000
	which is called a Voronoi partition.

	06:32.000 --> 06:38.000
	Since we are in the input space, closest here is defined according to a simple Euclidean

	06:38.000 --> 06:39.000
	distance.

	06:40.000 --> 06:45.000
	However, remember that our target function and our whole regression does not happen

	06:45.000 --> 06:51.000
	directly on the input data space, but rather on the data mapped in the feature space.

	06:52.000 --> 06:58.000
	And after we apply our feature map to the data, the concept of closest and the partition

	06:58.000 --> 06:59.000
	can radically change.

	06:59.000 --> 07:05.000
	For example, here on the right, we choose a kernel space associated with a cosine similarity

	07:06.000 --> 07:12.000
	and again plot how the centroids partition the input space, but this time we chose closest

	07:12.000 --> 07:14.000
	according to the new cosine distance.

	07:15.000 --> 07:20.000
	The resulting partition is very different from the Euclidean one as it captures the

	07:20.000 --> 07:22.000
	non-linearity of the kernel function.

	07:22.000 --> 07:28.000
	In the paper, we discuss how this difference can impact the regression and we identified

	07:28.000 --> 07:34.000
	sufficient conditions that the partition should satisfy in order to guarantee good generalization

	07:34.000 --> 07:35.000
	of the learning process.

	07:37.000 --> 07:43.000
	Crucially, we will see that these guarantees depend not on how the input space is partitioned,

	07:43.000 --> 07:45.000
	but rather how the feature space is partitioned.

	07:45.000 --> 07:51.000
	As a consequence, for our PARC methods, we focus on choosing centroids solely using the

	07:51.000 --> 07:53.000
	kernel version of the distance.

	07:57.000 --> 08:00.000
	We are now ready to present in more detail how the PARC algorithm works.

	08:01.000 --> 08:07.000
	First of all, PARC partitioned the feature space into Q Voronoi cells and the first thing

	08:07.000 --> 08:16.000
	to do is to identify the centroids in the feature space that allows us to describe the

	08:16.000 --> 08:17.000
	Voronoi cells.

	08:19.000 --> 08:25.000
	Then inside each Voronoi cell, we learn a local estimator using an uniterated and sketched

	08:25.000 --> 08:27.000
	version of kernel ridge regression.

	08:30.000 --> 08:36.000
	And then at prediction time, when a new sample arrives, we can use the Q Voronoi feature

	08:36.000 --> 08:38.000
	to identify the new sample.

	08:40.000 --> 08:47.000
	We use the local estimator corresponding to the Voronoi cell to which the new points fall

	08:47.000 --> 08:48.000
	on.

	08:52.000 --> 08:57.000
	The generalization error of standard kernel ridge regression without partitioning can

	08:57.000 --> 09:02.000
	be upper bounded by two terms, a bias term and a variance term.

	09:02.000 --> 09:10.000
	In our work, we can show that also the generalization error of PARC can be upper bounded by a bias

	09:10.000 --> 09:11.000
	term and a variance term.

	09:11.000 --> 09:16.000
	But this time, these two terms are weighted and they are weighted by a certain quantity

	09:16.000 --> 09:25.000
	that depends on an angle theta, which is the minimum angle between all the subspaces of

	09:25.000 --> 09:26.000
	the partitions.

	09:26.000 --> 09:33.000
	For example, when all the subspaces are orthogonal between each other, we recover the exact same

	09:33.000 --> 09:36.000
	generalization error of standard kernel ridge regression.

	09:38.000 --> 09:45.000
	But we are also able to show that for angles which are small enough, we are able to obtain

	09:45.000 --> 09:50.000
	a generalization error which is of the same order of standard kernel ridge regression.

	09:50.000 --> 09:54.000
	These theoretical results suggest us how to construct a good partition.

	09:54.000 --> 10:00.000
	So in particular, PARC selects the Voronoi centroids greedily in order to promote orthogonality

	10:00.000 --> 10:01.000
	between the Voronoi cells.

	10:01.000 --> 10:06.000
	And in particular, we use the Schur complement to measure the orthogonality.

	10:10.000 --> 10:16.000
	We also use the Schur complement to measure the orthogonality of the Voronoi centroids.

	10:16.000 --> 10:20.000
	And in particular, we use the Schur complement to measure the orthogonality.

	10:24.000 --> 10:28.000
	Given all these ingredients, we are now able to measure the computational complexity of

	10:28.000 --> 10:32.000
	PARC, which has a time complexity that is the sum of two terms.

	10:33.000 --> 10:40.000
	A first term, q squared n log n, which is the cost of computing the centroids with the

	10:40.000 --> 10:41.000
	just mentioned procedure.

	10:41.000 --> 10:46.000
	And a second term, q squared n log n, which is the cost of computing the most expensive

	10:46.000 --> 10:47.000
	local estimator.

	10:51.000 --> 10:57.000
	Empirically, we performed experiments on data set of millions and of billions of points,

	10:57.000 --> 11:01.000
	and we compared with the currently fastest global kernel methods and with some other

	11:01.000 --> 11:02.000
	splitting kernel methods.

	11:03.000 --> 11:08.000
	We can see that PARC is the only method that manages to match the accuracy of the global

	11:08.000 --> 11:11.000
	estimator.

	11:11.000 --> 11:13.000
	Thank you all for your attention.

	11:13.000 --> 11:40.000
	And thank you to the poster for all your questions and more details.