ytseg_demo / demo_data /nips-2021 /25958 /transcript_whisper_large-v2.vtt
WEBVTT
00:00.000 --> 00:07.000
Hello everyone, I'm Luigi Carratino, and this is a joint work with Stefano Vigogna,
00:07.000 --> 00:10.000
Daniele Calandriello, and Lorenzo Rosasco.
00:10.000 --> 00:16.000
The problem that we study in this work is a standard regression problem, where we want
00:16.000 --> 00:24.000
to estimate an unknown function f star,
00:24.000 --> 00:34.000
given n pairs of points, x's and y's, where the y's are noisy evaluations of the function
00:34.000 --> 00:38.000
f star on the input points x's.
00:41.000 --> 00:46.000
A well-established method to learn nonlinear functions is kernel ridge regression.
00:46.000 --> 00:53.000
The basic idea is to map the input points into a higher dimensional space, where linear
00:53.000 --> 00:59.000
relationships can be learned that then translate in nonlinear ones in the input space.
01:01.000 --> 01:07.000
To formalize this, we can think about solving a standard empirical risk minimization problem
01:07.000 --> 01:12.000
regularized over a space of functions which is a reproducing kernel Hilbert space.
01:14.000 --> 01:20.000
Numerically speaking, the solution of this type of problem boils down to solving a linear
01:20.000 --> 01:26.000
system. Particularly, we can see here that the linear system is going to be Kc equal
01:26.000 --> 01:33.000
y, where K is the kernel matrix evaluated in all the pairs of points of the training
01:33.000 --> 01:39.000
sets, c are the weights that we aim to learn, and y's are the output points.
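The system above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the Gaussian kernel and the regularization parameter lam are assumptions (the talk writes the unregularized system Kc = y; ridge regression adds the n*lam*I term).

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between the rows of A and B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam=1e-6, sigma=1.0):
    # Kernel ridge regression: solve (K + n*lam*I) c = y for the weights c.
    # The talk's plain system K c = y is the special case lam = 0.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, c, X_test, sigma=1.0):
    # Estimated function: f(x) = sum_i c_i k(x_i, x).
    return gaussian_kernel(X_test, X_train, sigma) @ c
```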
01:40.000 --> 01:45.000
We know that this method is optimal from a statistical point of view, but a drawback
01:45.000 --> 01:52.000
is that it suffers from computational scalability. In fact, in terms of time complexity, if we
01:52.000 --> 01:57.000
have n training points and we want to solve the linear system directly, we'll have to
01:57.000 --> 02:03.000
invert the matrix K, and this will cost us n cubed in time.
02:06.000 --> 02:11.000
Multiple ways of accelerating this process have been proposed over time.
02:11.000 --> 02:17.000
The first one is to solve the system iteratively instead of directly inverting the matrix K.
02:18.000 --> 02:25.000
This allows us to only have matrix vector multiplications, and so the overall cost of
02:25.000 --> 02:30.000
an iterative method to solve this linear system is going to be Tn squared.
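One standard iterative scheme of this kind is the conjugate gradient method; the talk does not name a specific solver, so CG here is an illustrative choice. Each iteration costs one matrix-vector product, O(n squared), so T iterations cost O(Tn squared).

```python
import numpy as np

def cg_solve(K, y, T=100, tol=1e-8):
    # Conjugate gradient on K c = y (K symmetric positive definite).
    # Each iteration needs only one matrix-vector product with K.
    c = np.zeros_like(y)
    r = y - K @ c          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(T):
        Kp = K @ p
        alpha = rs / (p @ Kp)
        c += alpha * p
        r -= alpha * Kp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return c
```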
02:31.000 --> 02:39.000
Another method is the one known as sketching, where we can see this as subsampling the linear
02:39.000 --> 02:46.000
system, in particular subsampling columns of this linear system, where we can take m
02:46.000 --> 02:52.000
columns of the linear system uniformly at random to get a smaller one, and the cost
02:52.000 --> 02:55.000
of this will be m squared n.
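A common way to realize this column subsampling is the Nystrom method; the talk says only "sketching", so the exact estimator below (solving the normal equations of the subsampled system) is an assumption. Forming K_nm^T K_nm dominates, at O(m squared n).

```python
import numpy as np

def sketched_krr(K, y, m, lam=1e-3, seed=0):
    # Keep m columns of the kernel matrix chosen uniformly at random and
    # solve the smaller (Nystrom-style) regularized system:
    #   (K_nm^T K_nm + n*lam*K_mm) c = K_nm^T y
    # Building K_nm^T K_nm is the dominant cost, O(m^2 n).
    n = K.shape[0]
    idx = np.random.RandomState(seed).choice(n, m, replace=False)
    K_nm = K[:, idx]                  # n x m subsampled linear system
    K_mm = K[np.ix_(idx, idx)]
    c = np.linalg.solve(K_nm.T @ K_nm + n * lam * K_mm, K_nm.T @ y)
    return idx, c                     # predict with k(x, X[idx]) @ c
```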
02:57.000 --> 03:04.000
Another method instead is splitting. This allows us to divide the main problem into
03:04.000 --> 03:12.000
many, in this case Q, subproblems, each of which can be solved independently and so
03:12.000 --> 03:20.000
potentially can be distributed. So we can have a cost which boils down to n over Q to
03:20.000 --> 03:22.000
the power of 3.
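A minimal sketch of this random splitting, reusing a Gaussian kernel for the local solves (the kernel and the regularization are assumptions): each of the Q local systems has about n/Q points, hence a (n/Q) cubed solve that could run on its own machine.

```python
import numpy as np

def split_krr(X, y, Q, lam=1e-3, sigma=1.0, seed=0):
    # Randomly split the n points into Q groups and solve an independent
    # local kernel ridge regression in each group: Q systems of size
    # about n/Q, so each solve costs about (n/Q)^3 and can be distributed.
    n = X.shape[0]
    perm = np.random.RandomState(seed).permutation(n)
    models = []
    for part in np.array_split(perm, Q):
        Xq, yq = X[part], y[part]
        sq = ((Xq[:, None] - Xq[None, :]) ** 2).sum(-1)
        Kq = np.exp(-sq / (2 * sigma ** 2))
        cq = np.linalg.solve(Kq + len(part) * lam * np.eye(len(part)), yq)
        models.append((Xq, cq))      # local training points and weights
    return models
```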
03:25.000 --> 03:30.000
Combinations of these methods have been proposed in the literature. In particular, if
03:30.000 --> 03:35.000
we combine iterating and sketching, we can get a solver that can solve the problem in
03:35.000 --> 03:38.000
a time complexity of Tmn.
03:40.000 --> 03:47.000
If instead we combine sketching and splitting, we can get a solver that can be computed
03:47.000 --> 03:51.000
in m squared times n over Q.
03:51.000 --> 03:59.000
And in this work, we try to blend all these techniques to derive a new algorithm, which
03:59.000 --> 04:09.000
we will call PARC, that can achieve a time complexity of Tm times n over Q to the power
04:09.000 --> 04:10.000
of 2.
04:12.000 --> 04:18.000
So as we just said, in this work, we propose a new large-scale kernel regression solver
04:18.000 --> 04:22.000
that combines the computational benefits of iteration, sketching, and splitting.
04:23.000 --> 04:27.000
Notice, though, that these are approximation techniques and they may come at the cost of
04:27.000 --> 04:35.000
accuracy. But we are able to show that this new algorithm is able to preserve generalization
04:35.000 --> 04:37.000
under suitable partitions.
04:38.000 --> 04:44.000
Now also notice that instead of general splitting, we are going to need to focus on a
04:44.000 --> 04:48.000
particular type, namely partitions.
04:48.000 --> 04:53.000
So we introduce a new principled partition scheme for kernel methods.
04:56.000 --> 05:01.000
We now look at the difference between data splitting and space partitioning.
05:01.000 --> 05:08.000
Given a set of points, the procedure of splitting takes groups of points at random and assigns
05:08.000 --> 05:10.000
them to different splits or clusters.
05:10.000 --> 05:14.000
In this picture, for example, we divide the points into four splits.
05:15.000 --> 05:21.000
Partitioning instead divides the space in different cells, and then the points are implicitly
05:21.000 --> 05:25.000
assigned to a particular cluster based on which cell they belong to.
05:27.000 --> 05:32.000
Notice that with the splitting methods, we don't consider local information while we
05:32.000 --> 05:37.000
perform the splitting, but we do when we perform partitioning.
05:37.000 --> 05:42.000
Now, from this picture, the concept of partitioning a space seems pretty straightforward.
05:43.000 --> 05:48.000
However, when you start considering high dimensional feature space, subtle problems can
05:48.000 --> 05:49.000
appear.
05:50.000 --> 05:55.000
So first, as a recap, remember that there are two important spaces to consider in our
05:55.000 --> 05:56.000
regression problem.
05:57.000 --> 06:04.000
The input space X with its input features, and the kernel space H, which potentially
06:04.000 --> 06:10.000
has many more implicit features.
06:13.000 --> 06:17.000
Traditionally, partition methods are applied directly to the input space.
06:18.000 --> 06:24.000
For example, a classical approach is to select a subset of points as centroids and then
06:24.000 --> 06:30.000
partition the space in cells by assigning each portion of the space to the closest centroid,
06:30.000 --> 06:32.000
which is called a Voronoi partition.
06:32.000 --> 06:38.000
Since we are in the input space, closest here is defined according to a simple Euclidean
06:38.000 --> 06:39.000
distance.
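The Euclidean Voronoi assignment just described can be sketched directly:

```python
import numpy as np

def voronoi_assign_euclidean(X, centroids):
    # Assign each point to its closest centroid in plain Euclidean
    # distance: the cells of the resulting Voronoi partition.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```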
06:40.000 --> 06:45.000
However, remember that our target function and our whole regression does not happen
06:45.000 --> 06:51.000
directly on the input data space, but rather on the data mapped in the feature space.
06:52.000 --> 06:58.000
And after we apply our feature map to the data, the concept of closest and the partition
06:58.000 --> 06:59.000
can radically change.
06:59.000 --> 07:05.000
For example, here on the right, we choose a kernel space associated with a cosine similarity
07:06.000 --> 07:12.000
and again plot how the centroids partition the input space, but this time we choose closest
07:12.000 --> 07:14.000
according to the new cosine distance.
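The same assignment can be carried out in the feature space: since the squared feature-space distance expands as k(x,x) - 2k(x,c) + k(c,c), the "closest centroid" rule needs only kernel evaluations. A sketch using the cosine-similarity kernel from the example (the vector representation of the centroids is an assumption):

```python
import numpy as np

def kernel_distance_sq(x, c, k):
    # Squared distance between the feature-space images of x and c:
    # ||phi(x) - phi(c)||^2 = k(x,x) - 2 k(x,c) + k(c,c).
    return k(x, x) - 2 * k(x, c) + k(c, c)

def cosine_kernel(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def voronoi_assign_kernel(X, centroids, k):
    # Closest centroid in the geometry induced by the kernel, not the
    # Euclidean one, so the cells follow the kernel's non-linearity.
    return np.array([
        np.argmin([kernel_distance_sq(x, c, k) for c in centroids])
        for x in X
    ])
```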
07:15.000 --> 07:20.000
The resulting partition is very different from the Euclidean one as it captures the
07:20.000 --> 07:22.000
non-linearity of the kernel function.
07:22.000 --> 07:28.000
In the paper, we discuss how this difference can impact the regression and we identify
07:28.000 --> 07:34.000
sufficient conditions that the partition should satisfy in order to guarantee good generalization
07:34.000 --> 07:35.000
of the learning process.
07:37.000 --> 07:43.000
Crucially, we will see that these guarantees depend not on how the input space is partitioned,
07:43.000 --> 07:45.000
but rather on how the feature space is partitioned.
07:45.000 --> 07:51.000
As a consequence, for our PARC method, we focus on choosing centroids solely using the
07:51.000 --> 07:53.000
kernel version of the distance.
07:57.000 --> 08:00.000
We are now ready to present in more detail how the PARC algorithm works.
08:01.000 --> 08:07.000
First of all, PARC partitions the feature space into Q Voronoi cells, and the first thing
08:07.000 --> 08:16.000
to do is to identify the centroids in the feature space that allow us to describe the
08:16.000 --> 08:17.000
Voronoi cells.
08:19.000 --> 08:25.000
Then inside each Voronoi cell, we learn a local estimator using an iterated and sketched
08:25.000 --> 08:27.000
version of kernel ridge regression.
08:30.000 --> 08:36.000
And then at prediction time, when a new sample arrives, we can use the Q Voronoi centroids
08:36.000 --> 08:38.000
to identify the cell of the new sample.
08:40.000 --> 08:47.000
We then use the local estimator corresponding to the Voronoi cell into which the new point
08:47.000 --> 08:48.000
falls.
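This prediction step amounts to a routing function. The local_models below are placeholder callables standing in for the local sketched-and-iterated estimators; they are not part of the routing logic itself.

```python
import numpy as np

def parc_predict(x, centroids, local_models, k):
    # Find the Voronoi cell whose centroid is closest in the
    # kernel-induced distance, then apply that cell's local estimator.
    d2 = [k(x, x) - 2 * k(x, c) + k(c, c) for c in centroids]
    q = int(np.argmin(d2))
    return local_models[q](x)
```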
08:52.000 --> 08:57.000
The generalization error of standard kernel ridge regression without partitioning can
08:57.000 --> 09:02.000
be upper bounded by two terms, a bias term and a variance term.
09:02.000 --> 09:10.000
In our work, we show that the generalization error of PARC can also be upper bounded by a bias
09:10.000 --> 09:11.000
term and a variance term.
09:11.000 --> 09:16.000
But this time, these two terms are weighted by a certain quantity
09:16.000 --> 09:25.000
that depends on an angle theta, which is the minimum angle between all the subspaces of
09:25.000 --> 09:26.000
the partitions.
09:26.000 --> 09:33.000
For example, when all the subspaces are orthogonal to each other, we recover the exact same
09:33.000 --> 09:36.000
generalization error of standard kernel ridge regression.
09:38.000 --> 09:45.000
But we are also able to show that for angles which are large enough, we are able to obtain
09:45.000 --> 09:50.000
a generalization error which is of the same order of standard kernel ridge regression.
09:50.000 --> 09:54.000
These theoretical results suggest how to construct a good partition.
09:54.000 --> 10:00.000
So in particular, PARC selects the Voronoi centroids greedily in order to promote orthogonality
10:00.000 --> 10:01.000
between the Voronoi cells.
10:01.000 --> 10:06.000
And in particular, we use the Schur complement to measure the orthogonality.
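One way this Schur-complement criterion can be instantiated: for a candidate point j and a chosen set S, the Schur complement k(j,j) - K_jS K_SS^{-1} K_Sj is the squared norm of the component of phi(x_j) orthogonal to the span of the already-chosen centroids, so greedily picking its argmax promotes orthogonal cells. The exact greedy rule used by PARC may differ; this is an illustrative sketch.

```python
import numpy as np

def greedy_centroids(K, Q):
    # Greedily pick Q points whose feature-space directions are as
    # mutually orthogonal as possible, scoring each candidate j by the
    # Schur complement k(j,j) - K_jS K_SS^{-1} K_Sj.
    n = K.shape[0]
    S = [int(np.argmax(np.diag(K)))]        # start from the largest norm
    while len(S) < Q:
        K_SS = K[np.ix_(S, S)]
        K_Sn = K[S, :]                      # |S| x n cross-kernel block
        schur = np.diag(K) - np.einsum(
            'ij,ij->j', K_Sn, np.linalg.solve(K_SS, K_Sn))
        schur[S] = -np.inf                  # never re-pick a centroid
        S.append(int(np.argmax(schur)))
    return S
```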
10:24.000 --> 10:28.000
Given all these ingredients, we are now able to measure the computational complexity of
10:28.000 --> 10:32.000
PARC, which has a time complexity that is the sum of two terms.
10:33.000 --> 10:40.000
A first term, q squared n log n, which is the cost of computing the centroids with the
10:40.000 --> 10:41.000
just mentioned procedure.
10:41.000 --> 10:46.000
And a second term, Tm times n over Q to the power of 2, which is the cost of computing the most expensive
10:46.000 --> 10:47.000
local estimator.
10:51.000 --> 10:57.000
Empirically, we performed experiments on datasets with millions and billions of points,
10:57.000 --> 11:01.000
and we compared with the currently fastest global kernel methods and with some other
11:01.000 --> 11:02.000
splitting kernel methods.
11:03.000 --> 11:08.000
We can see that PARC is the only method that manages to match the accuracy of the global
11:08.000 --> 11:11.000
estimator.
11:11.000 --> 11:13.000
Thank you all for your attention.
11:13.000 --> 11:40.000
And please come to the poster for all your questions and more details.