WEBVTT
00:00.000 --> 00:07.000
Hello everyone, I'm Luigi Carratino, and this is a joint work with Stefano Vigogna,
00:07.000 --> 00:10.000
Daniele Calandriello, and Lorenzo Rosasco.
00:10.000 --> 00:16.000
The problem that we study in this work is a standard regression problem, where we want
00:16.000 --> 00:24.000
to estimate an unknown function f star
00:24.000 --> 00:34.000
given n pairs of points, x's and y's, where the y's are noisy evaluations of the function
00:34.000 --> 00:38.000
f star on the input points x's.
00:41.000 --> 00:46.000
A well-established method to learn nonlinear functions is kernel ridge regression.
00:46.000 --> 00:53.000
The basic idea is to map the input points into a higher-dimensional space, where linear
00:53.000 --> 00:59.000
relationships can be learned that then translate into nonlinear ones in the input space.
01:01.000 --> 01:07.000
To formalize this, we can think about solving a standard empirical risk minimization problem
01:07.000 --> 01:12.000
regularized over a space of functions, which is a reproducing kernel Hilbert space.
01:14.000 --> 01:20.000
Numerically speaking, the solution of this type of problem boils down to solving a linear
01:20.000 --> 01:26.000
system. In particular, we can see here that the linear system is going to be Kc equal
01:26.000 --> 01:33.000
y, where K is the kernel matrix evaluated at all the pairs of points of the training
01:33.000 --> 01:39.000
set, c are the weights that we aim to learn, and y's are the output points.
01:40.000 --> 01:45.000
We know that this method is optimal from a statistical point of view, but a drawback
01:45.000 --> 01:52.000
is that it suffers from poor computational scalability. In fact, in terms of time complexity, if we
01:52.000 --> 01:57.000
have n training points and we want to solve the linear system directly, we'll have to
01:57.000 --> 02:03.000
invert the matrix K, and this will cost us n cubed in time.
02:06.000 --> 02:11.000
Multiple ways of accelerating this process have been proposed over time.
02:11.000 --> 02:17.000
The first one is to solve the linear system iteratively instead of directly inverting the matrix K.
02:18.000 --> 02:25.000
This allows us to only perform matrix-vector multiplications, and so the overall cost of
02:25.000 --> 02:30.000
an iterative method to solve this linear system is going to be Tn squared.
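One standard instance of such an iterative method is conjugate gradient, sketched below; the talk does not name the specific iterative solver, so this choice is an assumption for illustration:

```python
import numpy as np

def cg_solve(K, y, T=100, tol=1e-8):
    # Conjugate gradient on the system K c = y: each of the T iterations
    # costs one matrix-vector product K @ p, i.e. O(n^2), for an overall
    # O(T n^2) instead of the O(n^3) direct inversion.
    c = np.zeros_like(y)
    r = y - K @ c          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(T):
        Kp = K @ p
        alpha = rs / (p @ Kp)
        c += alpha * p
        r -= alpha * Kp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return c
```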
02:31.000 --> 02:39.000
Another method is the one known as sketching, which we can see as subsampling the linear
02:39.000 --> 02:46.000
system, in particular subsampling its columns: we can take m
02:46.000 --> 02:52.000
columns of the linear system uniformly at random to get a smaller one, and the cost
02:52.000 --> 02:55.000
of this will be m squared n.
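A minimal sketch of this column-subsampling idea, assuming a small ridge term lam for numerical stability (the names and the least-squares formulation are illustrative assumptions, not the talk's exact procedure):

```python
import numpy as np

def sketched_solve(K, y, m, lam=1e-6, seed=None):
    # Subsample m of the n columns of K c = y uniformly at random and
    # solve the resulting n x m system in least squares. Forming the
    # m x m normal matrix K_S^T K_S costs O(m^2 n), the dominant term.
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    S = rng.choice(n, size=m, replace=False)
    K_S = K[:, S]                          # n x m sketched system
    G = K_S.T @ K_S + lam * np.eye(m)      # small ridge for stability
    c_S = np.linalg.solve(G, K_S.T @ y)
    return S, c_S
```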
02:57.000 --> 03:04.000
Another method instead is splitting. This allows us to divide the main problem into
03:04.000 --> 03:12.000
many, in this case Q, subproblems, each of which can be solved independently and so
03:12.000 --> 03:20.000
potentially distributed. So we can get a cost which boils down to n over Q to
03:20.000 --> 03:22.000
the power of 3.
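The splitting idea can be sketched as a divide-and-conquer variant of the solver above; the random grouping, the Gaussian kernel, and averaging the local predictors are illustrative assumptions, not the talk's exact scheme:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian kernel between the rows of A and the rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def split_krr_fit(X, y, Q, lam=1e-3, sigma=1.0, seed=0):
    # Randomly split the n points into Q groups and solve one independent
    # KRR system of size ~n/Q per group: Q * (n/Q)^3 total time, and each
    # subproblem can run on a different machine.
    idx = np.random.default_rng(seed).permutation(len(X))
    models = []
    for part in np.array_split(idx, Q):
        Xp, yp = X[part], y[part]
        K = rbf(Xp, Xp, sigma)
        c = np.linalg.solve(K + len(part) * lam * np.eye(len(part)), yp)
        models.append((Xp, c))
    return models

def split_krr_predict(models, X_test, sigma=1.0):
    # Average the Q local estimators (divide-and-conquer prediction).
    return np.mean([rbf(X_test, Xp, sigma) @ c for Xp, c in models], axis=0)
```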
03:25.000 --> 03:30.000
Combinations of these methods have been proposed in the literature. In particular, if
03:30.000 --> 03:35.000
we combine iterating and sketching, we can get a solver that can solve the problem in
03:35.000 --> 03:38.000
a time complexity of Tmn.
03:40.000 --> 03:47.000
If instead we combine sketching and splitting, we can get a solver that can be computed
03:47.000 --> 03:51.000
in m squared times n over Q.
03:51.000 --> 03:59.000
And in this work, we try to blend all these techniques to derive a new algorithm, which
03:59.000 --> 04:09.000
we will call PARC, that can achieve a time complexity of Tm times n over Q to the power
04:09.000 --> 04:10.000
of 2.
04:12.000 --> 04:18.000
So as we just said, in this work we propose a new large-scale kernel regression solver
04:18.000 --> 04:22.000
that combines the computational benefits of iteration, sketching, and splitting.
04:23.000 --> 04:27.000
Notice, though, that these are approximation techniques, and they may come at the cost of
04:27.000 --> 04:35.000
accuracy. But we are able to show that this new algorithm preserves generalization
04:35.000 --> 04:37.000
under suitable partitions.
04:38.000 --> 04:44.000
Also notice that instead of general splitting, we are going to focus on a
04:44.000 --> 04:48.000
particular type, which is partitions.
04:48.000 --> 04:53.000
So we introduce a new principled partition scheme for kernel methods.
04:56.000 --> 05:01.000
We now look at the difference between data splitting and space partitioning.
05:01.000 --> 05:08.000
Given a set of points, the procedure of splitting takes groups of points at random and assigns
05:08.000 --> 05:10.000
them to different splits or clusters.
05:10.000 --> 05:14.000
In this picture, for example, we divide the points into four splits.
05:15.000 --> 05:21.000
Partitioning instead divides the space into different cells, and then the points are implicitly
05:21.000 --> 05:25.000
assigned to a particular cluster based on which cell they belong to.
05:27.000 --> 05:32.000
Notice that with the splitting method we don't consider local information when we
05:32.000 --> 05:37.000
perform the splitting, but we do when we perform partitioning.
05:37.000 --> 05:42.000
Now, from this picture, the concept of partitioning a space seems pretty straightforward.
05:43.000 --> 05:48.000
However, when you start considering high-dimensional feature spaces, subtle problems can
05:48.000 --> 05:49.000
appear.
05:50.000 --> 05:55.000
So first, as a recap, remember that there are two important spaces to consider in our
05:55.000 --> 05:56.000
regression problem.
05:57.000 --> 06:04.000
The input space X, with its input features,
06:04.000 --> 06:10.000
and the kernel space H, which potentially has many more implicit features.
06:13.000 --> 06:17.000
Traditionally, partition methods are applied directly to the input space.
06:18.000 --> 06:24.000
For example, a classical approach is to select a subset of points as centroids and then
06:24.000 --> 06:30.000
partition the space in cells by assigning each portion of the space to the closest centroid,
06:30.000 --> 06:32.000
which is called a Voronoi partition.
06:32.000 --> 06:38.000
Since we are in the input space, closest here is defined according to a simple Euclidean
06:38.000 --> 06:39.000
distance.
06:40.000 --> 06:45.000
However, remember that our target function and our whole regression do not happen
06:45.000 --> 06:51.000
directly in the input data space, but rather on the data mapped into the feature space.
06:52.000 --> 06:58.000
And after we apply our feature map to the data, the concept of closest, and the partition,
06:58.000 --> 06:59.000
can radically change.
06:59.000 --> 07:05.000
For example, here on the right, we choose a kernel space associated with a cosine similarity
07:06.000 --> 07:12.000
and again plot how the centroids partition the input space, but this time we choose closest
07:12.000 --> 07:14.000
according to the new cosine distance.
07:15.000 --> 07:20.000
The resulting partition is very different from the Euclidean one, as it captures the
07:20.000 --> 07:22.000
non-linearity of the kernel function.
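The kernel trick makes this feature-space notion of closest computable without ever forming the feature map explicitly; below is a small sketch (the cosine kernel matches the slide's example, but the function names and the comparison are illustrative assumptions):

```python
import numpy as np

def cosine_kernel(A, B):
    # k(x, z) = <x, z> / (||x|| ||z||): cosine similarity as a kernel.
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return An @ Bn.T

def assign_in_feature_space(Kxz, kxx, kzz):
    # Squared feature-space distance via the kernel trick:
    #   ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z).
    # Kxz: (n, q) kernel between points and centroids; kxx, kzz: diagonals.
    d2 = kxx[:, None] - 2 * Kxz + kzz[None, :]
    return np.argmin(d2, axis=1)           # closest centroid per point
```

With centroids of different norms, a point can end up assigned to one centroid under the Euclidean distance and to another under the cosine-kernel distance, which is exactly the difference the slide illustrates.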
07:22.000 --> 07:28.000
In the paper, we discuss how this difference can impact the regression, and we identify
07:28.000 --> 07:34.000
sufficient conditions that the partition should satisfy in order to guarantee good generalization
07:34.000 --> 07:35.000
of the learning process.
07:37.000 --> 07:43.000
Crucially, we will see that these guarantees depend not on how the input space is partitioned,
07:43.000 --> 07:45.000
but rather on how the feature space is partitioned.
07:45.000 --> 07:51.000
As a consequence, for our PARC method, we focus on choosing centroids solely using the
07:51.000 --> 07:53.000
kernel version of the distance.
07:57.000 --> 08:00.000
We are now ready to present in more detail how the PARC algorithm works.
08:01.000 --> 08:07.000
First of all, PARC partitions the feature space into Q Voronoi cells, and the first thing
08:07.000 --> 08:16.000
to do is to identify the centroids in the feature space that allow us to describe the
08:16.000 --> 08:17.000
Voronoi cells.
08:19.000 --> 08:25.000
Then, inside each Voronoi cell, we learn a local estimator using an iterated and sketched
08:25.000 --> 08:27.000
version of kernel ridge regression.
08:30.000 --> 08:36.000
Then, at prediction time, when a new sample arrives, we use the Q Voronoi centroids
08:36.000 --> 08:38.000
to identify where the new sample falls.
08:40.000 --> 08:47.000
We then use the local estimator corresponding to the Voronoi cell in which the new point
08:47.000 --> 08:48.000
falls.
08:52.000 --> 08:57.000
The generalization error of standard kernel ridge regression without partitioning can
08:57.000 --> 09:02.000
be upper bounded by two terms: a bias term and a variance term.
09:02.000 --> 09:10.000
In our work, we show that the generalization error of PARC can also be upper bounded by a bias
09:10.000 --> 09:11.000
term and a variance term.
09:11.000 --> 09:16.000
But this time, these two terms are weighted by a certain quantity
09:16.000 --> 09:25.000
that depends on an angle theta, which is the minimum angle between all the subspaces of
09:25.000 --> 09:26.000
the partition.
09:26.000 --> 09:33.000
For example, when all the subspaces are orthogonal to each other, we recover exactly the same
09:33.000 --> 09:36.000
generalization error as standard kernel ridge regression.
09:38.000 --> 09:45.000
But we are also able to show that for angles which are small enough, we are able to obtain
09:45.000 --> 09:50.000
a generalization error which is of the same order as standard kernel ridge regression.
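Schematically, the comparison just described between the two bounds can be written as follows; the weighting function w and the symbols are illustrative, not the paper's exact statement:

```latex
% Standard KRR: excess risk bounded by a bias term plus a variance term.
\mathcal{E}(\hat f_{\mathrm{KRR}})
  \;\lesssim\; \mathrm{bias}(\lambda) + \mathrm{var}(\lambda, n)

% PARC: the same two terms, weighted by a factor depending on the minimum
% angle \theta between the subspaces of the partition; the weight equals 1
% when the subspaces are mutually orthogonal, recovering the KRR bound.
\mathcal{E}(\hat f_{\mathrm{PARC}})
  \;\lesssim\; w(\theta)\,\bigl(\mathrm{bias}(\lambda) + \mathrm{var}(\lambda, n)\bigr)
```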
09:50.000 --> 09:54.000
These theoretical results suggest how to construct a good partition.
09:54.000 --> 10:00.000
In particular, PARC selects the Voronoi centroids greedily in order to promote orthogonality
10:00.000 --> 10:01.000
between the Voronoi cells.
10:01.000 --> 10:06.000
And in particular, we use the Schur complement to measure the orthogonality.
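A Schur-complement-driven greedy selection can be sketched as follows; this is one plausible reading of the procedure (essentially a pivoted-Cholesky-style update), not necessarily PARC's exact rule:

```python
import numpy as np

def greedy_centroids(K, q):
    # Repeatedly pick the point whose kernel feature is least explained by
    # the centroids chosen so far. The "unexplained" mass is the diagonal
    # of the Schur complement of K with respect to the selected set,
    # updated incrementally as in pivoted Cholesky.
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()    # Schur complement diagonal
    F = np.zeros((n, q))                   # partial Cholesky factors
    sel = []
    for t in range(q):
        i = int(np.argmax(d))              # most orthogonal remaining point
        sel.append(i)
        f = (K[:, i] - F[:, :t] @ F[i, :t]) / np.sqrt(d[i])
        F[:, t] = f
        d = d - f ** 2
        d[i] = 0.0                         # guard against re-selection
    return sel
```

With a Gaussian kernel, for instance, this rule skips a point that lies very close to an already-selected centroid and prefers one that is far away, i.e. nearly orthogonal in feature space.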
10:24.000 --> 10:28.000
Given all these ingredients, we are now able to measure the computational complexity of
10:28.000 --> 10:32.000
PARC, which has a time complexity that is the sum of two terms.
10:33.000 --> 10:40.000
A first term, Q squared n log n, which is the cost of computing the centroids with the
10:40.000 --> 10:41.000
just-mentioned procedure.
10:41.000 --> 10:46.000
And a second term, Tm times n over Q to the power of 2, which is the cost of computing the most expensive
10:46.000 --> 10:47.000
local estimator.
10:51.000 --> 10:57.000
Empirically, we performed experiments on datasets of millions and billions of points,
10:57.000 --> 11:01.000
and we compared with the currently fastest global kernel methods and with some other
11:01.000 --> 11:02.000
splitting kernel methods.
11:03.000 --> 11:08.000
We can see that PARC is the only method that manages to match the accuracy of the global
11:08.000 --> 11:11.000
estimator.
11:11.000 --> 11:13.000
Thank you all for your attention.
11:13.000 --> 11:40.000
And please come to the poster for all your questions and more details.