WEBVTT
00:00.000 --> 00:13.040
Hello, I'm Hassan Mortagy. I'm a PhD student at Georgia Tech. Along with my collaborator
00:13.040 --> 00:16.880
Jai Moondra, we will present our work on reusing combinatorial structure: faster projections
00:16.880 --> 00:20.260
over submodular base polytopes. This is joint work with Swati Gupta.
00:20.260 --> 00:24.220
In this talk, we consider a sequence of similarly structured optimization problems, a setup often
00:24.220 --> 00:28.220
encountered in practice. We first start with our main problem of minimizing a convex function
00:28.220 --> 00:32.260
over a decision set P. At the next time step, this problem sees some perturbation and we
00:32.260 --> 00:36.700
obtain another similar problem, and so on. An example of this setup is the case of iterative
00:36.700 --> 00:40.340
projections, where at each time step we are computing the projection of a new point
00:40.340 --> 00:44.860
y_t that is close to previously projected points y_i. These iterative projections form a key
00:44.860 --> 00:48.140
step in many online learning algorithms, and they are currently solved from scratch every
00:48.140 --> 00:51.900
iteration. They are not viewed in the context of an iterative environment where previously
00:51.900 --> 00:55.140
computed projections can be exploited to speed up subsequent ones.
00:55.140 --> 00:59.500
Thus, in this talk, we ask: is it possible to speed up similar iterative optimization
00:59.500 --> 01:03.660
problems by reusing structural information from previous minimizers?
01:03.660 --> 01:07.580
Let me now give you some more details about our setup. Here is a table that summarizes
01:07.580 --> 01:11.180
various widespread first-order optimization algorithms. The first two algorithms are conditional
01:11.180 --> 01:16.660
gradient variants, and they only solve a linear optimization every iteration. Their convergence
01:16.660 --> 01:21.140
rates depend on the dimension of the problem and on geometric constants of the underlying
01:21.140 --> 01:25.140
decision set, such as the pyramidal width for the away-step Frank-Wolfe variant given in
01:25.140 --> 01:28.340
the second row. On the other hand, the remaining three algorithms
01:28.340 --> 01:32.580
are projection-based algorithms that compute a projection every iteration. Their
01:32.580 --> 01:37.660
convergence rates are optimal in the sense that they only rely on the condition
01:37.660 --> 01:43.260
number of the function and are dimension-independent. Further, to capture a wide range of combinatorial
01:43.260 --> 01:47.980
sets, we consider the case where the decision set P is given by a submodular polytope, and
01:47.980 --> 01:53.380
the challenge is that these polytopes have an exponential number of constraints. Thus,
01:53.380 --> 01:58.020
computing a projection over these polytopes is a big computational bottleneck in projection-based
01:58.020 --> 02:03.820
algorithms. Motivated by this trade-off in convergence rates versus runtime, we further
02:03.820 --> 02:08.740
ask: is it possible to speed up iterative projections over submodular polytopes by reusing
02:08.740 --> 02:14.300
structural information from previous minimizers? I'm now going to give more background on
02:14.300 --> 02:18.500
the problem and submodularity, and review first-order methods. So, as mentioned, we
02:18.500 --> 02:22.220
assume that the combinatorial structure in a problem is given by a submodular function.
02:22.220 --> 02:26.780
A set function F, defined over a ground set E of n elements, is submodular if it satisfies
02:26.780 --> 02:33.260
the following diminishing-returns property. Furthermore, the base polytope associated with F is defined as the
02:33.260 --> 02:38.060
following system of linear inequalities, and here we see that B(F) is modeled using an
02:38.060 --> 02:41.500
exponential number of constraints, because we have a constraint for each subset of the
02:41.500 --> 02:46.860
ground set. An example is the permutahedron, a polytope whose vertices are permutations
02:46.860 --> 02:51.740
of 1 through n. And here we have an example on the slide for when n is equal to 3. These
02:51.740 --> 02:58.220
polytopes are extensively used in online learning over rankings of items. A special class of
02:58.220 --> 03:02.140
submodular polytopes are known as cardinality-based polytopes, and a cardinality-based function
03:02.140 --> 03:07.620
F is defined as F(S) = g(|S|), where g is a concave function. And here
03:07.620 --> 03:10.940
we have another table that summarizes various machine and online learning problems and the
03:10.940 --> 03:14.380
submodular set functions that give rise to them. We see the permutahedron in the second
03:14.380 --> 03:19.180
row of this table, and it is in fact a cardinality-based polytope. Other non-cardinality-based examples
03:19.180 --> 03:22.220
include spanning trees and independent sets of matroids.
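To make these definitions concrete, here is a small Python sketch (the ground set, the choice of g, and the brute-force check are illustrative, not from the talk): a set function f is submodular when its marginal gains diminish, and a cardinality-based function f(S) = g(|S|) with concave g always satisfies this.

```python
from itertools import combinations

def is_submodular(f, ground_set):
    """Brute-force check of the diminishing-returns property:
    f(A + e) - f(A) >= f(B + e) - f(B) for all A subset of B, e not in B."""
    subsets = [frozenset(c) for r in range(len(ground_set) + 1)
               for c in combinations(ground_set, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for e in ground_set:
                if e in B:
                    continue
                if f(A | {e}) - f(A) < f(B | {e}) - f(B) - 1e-12:
                    return False
    return True

# Cardinality-based submodular function f(S) = g(|S|) for concave g.
# g(k) = min(k, 1) yields the probability simplex as the base polytope.
g = lambda k: min(k, 1)
f = lambda S: g(len(S))

print(is_submodular(f, {1, 2, 3}))  # → True
```

The exponential check above is only for intuition on tiny ground sets; the algorithms in the talk never enumerate subsets.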
03:24.060 --> 03:28.220
So let's go back to our main problem of minimizing a convex function over the base polytope.
03:28.220 --> 03:32.620
There typically exist three main paradigms to solve this problem. The first is the class
03:32.620 --> 03:37.020
of methods known as conditional gradient (CG) methods, and as I mentioned before, those
03:37.020 --> 03:42.620
assume access to B(F) via a linear optimization oracle. These methods are specifically
03:42.620 --> 03:46.780
advantageous for base polytopes because linear optimization over base polytopes can be
03:46.780 --> 03:51.180
done very efficiently using Edmonds' greedy algorithm. The second class of methods are
03:51.180 --> 03:55.500
mirror descent variants, and those compute a projection every iteration to ensure feasibility.
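Edmonds' greedy algorithm for linear optimization over a base polytope can be sketched as follows (a minimal illustration; the function names and the simplex example are our own):

```python
def edmonds_greedy(f, ground_set, w):
    """Linear optimization max <w, x> over the base polytope B(f).
    Sort elements by decreasing weight; each coordinate receives the
    marginal gain of its element along the sorted prefix order."""
    order = sorted(ground_set, key=lambda e: -w[e])
    x, prefix = {}, set()
    for e in order:
        x[e] = f(prefix | {e}) - f(prefix)
        prefix.add(e)
    return x

# Example on the probability simplex, f(S) = min(|S|, 1):
f = lambda S: min(len(S), 1)
w = {'a': 0.5, 'b': 2.0, 'c': 1.0}
print(edmonds_greedy(f, {'a', 'b', 'c'}, w))  # puts all mass on 'b'
```

Note the cost: one sort plus n evaluations of f, which is why CG methods are so cheap per iteration on these polytopes.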
03:56.060 --> 03:59.980
And again, as I also previously mentioned, although those methods have optimal convergence
03:59.980 --> 04:05.100
rates and are robust, they have remained of theoretical nature due to being computationally
04:05.100 --> 04:10.060
expensive. The third class of methods are combinatorial algorithms specifically tailored
04:10.060 --> 04:15.820
for convex optimization over submodular base polytopes. Those algorithms instead require
04:15.820 --> 04:20.860
solving a submodular function minimization problem every iteration, which again can be
04:20.860 --> 04:25.020
very expensive. However, those algorithms enjoy the nice property of returning an exact
04:25.020 --> 04:30.700
optimal solution. In this talk, we will focus on bridging the efficiency of CG methods and
04:30.700 --> 04:35.420
the structural properties and exactness of combinatorial algorithms to speed up iterative
04:35.420 --> 04:40.780
projections appearing in mirror descent and beyond. So first, let's consider the simpler
04:40.780 --> 04:44.380
case when our polytope is cardinality-based. So here we have a cardinality-based submodular
04:44.380 --> 04:48.940
function F, and for notation we define the vector c to be the vector of discrete derivatives
04:48.940 --> 04:53.340
of the concave function g. We now give the following duality result, which states that
04:53.340 --> 04:57.180
the problem of computing a Bregman projection over a cardinality-based polytope is dual
04:57.180 --> 05:02.860
to isotonic optimization. Although our results hold for general Bregman projections, we will
05:02.860 --> 05:08.620
focus on the case of Euclidean projections for simplicity. To that end, consider a vector
05:08.620 --> 05:11.980
y whose Euclidean projection over a cardinality-based polytope we want to compute,
05:11.980 --> 05:17.340
and let e1 through en be an ordering of the ground set such that the entries of y are decreasing. In this
05:17.340 --> 05:21.580
case, we have the following primal problem, and its dual is the following isotonic
05:21.580 --> 05:27.820
regression problem. And further, we can map between the two problems using the following identity.
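As a sketch of this dual approach (assuming, as in the simplex example later in the talk, that the dual iterate is initialized to c - y for y sorted in decreasing order, and that the primal solution is recovered as x* = y + z*):

```python
def pav(a):
    """Isotonic regression: minimize ||z - a||^2 subject to
    z[0] <= z[1] <= ... <= z[n-1], via pool adjacent violators.
    Each block stores (mean, count); violating blocks are merged."""
    blocks = []  # list of [mean, count]
    for v in a:
        blocks.append([v, 1])
        # Merge while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    return [m for m, c in blocks for _ in range(c)]

def project_cardinality_based(y, c):
    """Euclidean projection onto a cardinality-based base polytope,
    assuming y is already sorted in decreasing order and c is the
    vector of discrete derivatives c_i = g(i) - g(i-1)."""
    z = pav([ci - yi for ci, yi in zip(c, y)])
    return [yi + zi for yi, zi in zip(y, z)]

# Projecting onto the probability simplex: g(k) = min(k, 1),
# so c = (1, 0, 0) for n = 3.
x = project_cardinality_based([0.8, 0.3, 0.1], [1, 0, 0])
print(x)  # → [0.7333..., 0.2333..., 0.0333...], sums to 1
```

The PAV pass is linear time; the O(n log n) total cost comes from sorting y first.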
05:29.580 --> 05:32.860
So just to give you some historical context: previously, the best known running time for these
05:32.860 --> 05:37.500
projections was O(n^2), using a primal algorithm by Gupta et al. Later that
05:37.500 --> 05:41.180
year, Lim and Wright used the same duality approach to compute projections over the permutahedron,
05:41.180 --> 05:45.020
and we extended their approach to general cardinality-based polytopes. Now, the dual
05:45.020 --> 05:49.340
isotonic regression problem can be solved in O(n) time using a simple algorithm called the
05:49.340 --> 05:53.900
pool adjacent violators (PAV) algorithm, and this gives us an O(n log n) algorithm overall by
05:53.900 --> 05:59.180
solving the problem in the dual space and mapping it back to the primal space. This is currently
05:59.180 --> 06:04.060
the fastest known algorithm. The key takeaway is that projections over these polytopes
06:04.060 --> 06:09.420
can be computed very efficiently. In fact, computing a projection and solving linear optimization
06:09.420 --> 06:15.260
have the same running time. Now let's demonstrate our result with an example. So here we are going
06:15.260 --> 06:20.060
to project this vector y onto the probability simplex, and the probability simplex is modeled
06:20.060 --> 06:24.940
by the cardinality-based submodular function given here on the slide. And we see that y is already
06:24.940 --> 06:29.900
ordered, for simplicity, and c is the vector of discrete derivatives. Now the algorithm
06:29.900 --> 06:35.260
proceeds as follows. It initializes the dual iterate with the vector we are
06:35.260 --> 06:39.980
computing the isotonic regression for, c minus y, and here we have an adjacent violation because the
06:39.980 --> 06:45.180
second coordinate is strictly smaller than the first coordinate. The algorithm
06:45.180 --> 06:49.740
averages those two coordinates to obtain the following solution z star, and here we see that
06:49.740 --> 06:54.060
the ordering constraints are satisfied and z star is in fact the dual optimum. Next, it maps z star
06:54.060 --> 06:58.700
back to the primal optimum. Let's go back to the figure from the previous slide that compares
06:58.700 --> 07:04.060
a basic linear regression fit with an isotonic regression fit. In the red stepwise curve,
07:04.060 --> 07:08.060
the points at which the curve remains flat are where a block of consecutive adjacent violating
07:08.060 --> 07:12.460
points is averaged, as in our example. This very efficient algorithm for computing
07:12.460 --> 07:15.820
Bregman projections over cardinality-based polytopes unfortunately does not extend to
07:15.820 --> 07:20.460
general submodular base polytopes. Now my collaborator Jai will present different combinatorial
07:20.460 --> 07:25.100
strategies for dealing with those polytopes. We now describe our toolkit for speeding up
07:25.100 --> 07:31.100
projections on general submodular base polytopes. There are two basic objects that we can learn from.
07:31.100 --> 07:35.900
First, given projections of previous points, can we do better than computing a new projection from
07:35.900 --> 07:41.820
scratch? Second, given an iterative algorithm to compute a projection, can we use the combinatorial
07:41.820 --> 07:46.780
structure present in the sequence of iterates to speed up the algorithm and terminate it early?
07:49.180 --> 07:53.180
We have the well-known first-order optimality condition on the left. It helps us verify whether a
07:53.180 --> 07:58.140
point is indeed optimal. This check reduces to a linear optimization over the base polytope,
07:58.140 --> 08:03.740
which can be done using Edmonds' greedy algorithm. Here is an example. Suppose we know the gradient
08:03.740 --> 08:08.780
at a point x star and want to check whether x star is indeed optimal. We look at the distinct values
08:08.780 --> 08:13.900
of the partial derivatives at x star and arrange them in increasing order. Each time we see a
08:13.900 --> 08:19.580
gap in this order, we want the point x star on the prefix set to equal the submodular function
08:19.580 --> 08:26.380
value on that set. In the figure, the first such gap is after we have seen e1 and e5. Therefore,
08:26.380 --> 08:35.900
x star of S1 must equal f of S1. Similarly, x star of S2 must equal f of S2. Finally, x star of E must equal f of
08:35.900 --> 08:42.940
E. These sets S1, S2, and E are called tight sets at x star, and they define the face containing the point x
08:42.940 --> 08:49.100
star. This leads us to two interesting observations that we use later. One, if we know precisely
08:49.100 --> 08:53.980
what the tight sets are at the optimal point, we can also calculate the optimal point for all
08:53.980 --> 08:59.100
suitable functions h. Two, knowing the gradient at the optimal point gives us these
08:59.100 --> 09:06.220
tight sets. We give an example using our combinatorial idea. Suppose we know a point
09:06.220 --> 09:11.420
z_k that is close to our optimum x star. If the function is smooth, this implies the gradients at z_k
09:11.420 --> 09:16.540
and x star are close. This gives us a way to learn some tight sets defining the optimal face.
09:17.260 --> 09:21.740
In the example, for each coordinate, the blue line in the middle represents the partial derivative
09:21.740 --> 09:26.700
value at z_k, and the blue shade represents the possible variation in that value for the optimal
09:26.700 --> 09:31.500
point x star. That is, the corresponding partial derivative for x star lies in the shaded interval.
09:32.140 --> 09:36.860
The largest values in these intervals for e1 and e5 are lower than the lowest values in the
09:36.860 --> 09:44.300
intervals for every other element. This helps us conclude that the set {e1, e5}, that is, S1,
09:44.300 --> 09:50.540
is a tight set at x star. Similarly, we infer that S2 is also a tight set at x star.
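A minimal sketch of this inference step (the function name and the tolerance eps are hypothetical; eps stands for a coordinate-wise bound on how far the gradient at z_k can be from the gradient at x*):

```python
def infer_tight_prefixes(grad_zk, eps):
    """Given partial derivatives at a point z_k known to be within eps
    (coordinate-wise) of the gradient at the optimum x*, return prefix
    sets that must be tight at x*: a prefix is provably separated when
    every interval [g_i - eps, g_i + eps] inside it lies strictly below
    every interval outside it."""
    order = sorted(range(len(grad_zk)), key=lambda i: grad_zk[i])
    tight_sets = []
    for k in range(1, len(order)):
        hi = max(grad_zk[i] + eps for i in order[:k])
        lo = min(grad_zk[i] - eps for i in order[k:])
        if hi < lo:  # the gap in the sorted order survives the uncertainty
            tight_sets.append(set(order[:k]))
    return tight_sets

# Toy example: elements 0 and 4 clearly have the smallest derivatives,
# so the prefix {0, 4} (the talk's S1) is provably tight.
grads = [1.0, 5.0, 5.2, 8.0, 1.1]
print(infer_tight_prefixes(grads, eps=0.5))  # → [{0, 4}, {0, 1, 2, 4}]
```

As eps shrinks (i.e., as z_k approaches x*), more gaps survive and more tight sets are recovered, which matches the talk's observation that inference improves along the iterate sequence.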
09:51.500 --> 09:56.460
We now use that idea to give our first two tools. These apply more generally, but we demonstrate
09:56.460 --> 10:01.820
them using Euclidean projections. Suppose we already know the projection x_i of a point y_i,
10:01.820 --> 10:06.540
and we wish to find the projection x_t of a point y_t, given that y_t is close to y_i.
10:07.660 --> 10:11.980
The non-expansiveness of projection implies that the gradients at x_i and x_t are also close,
10:11.980 --> 10:15.820
and therefore we can infer some tight sets at x_t even before solving.
10:16.940 --> 10:20.620
Suppose we start computing the projection of y_t using an iterative algorithm.
10:20.620 --> 10:26.860
We now use the iterates z_k that converge to x_t. An iterate z_k that is close to x_t also has a
10:26.860 --> 10:32.780
gradient that is close to the gradient at x_t, and once again we can infer some tight sets at x_t
10:32.780 --> 10:39.740
as we approach the optimum. We also conducted an experiment to show that tool T1 can recover
10:39.740 --> 10:44.300
most tight sets from previous projections. We now give two tools that help us round an
10:44.300 --> 10:49.260
approximate solution exactly to the projection. First is our tool T3, called Relax.
10:49.260 --> 10:53.500
We give a heuristic to check whether we have already found all the tight sets at the optimum.
10:55.020 --> 10:59.660
We also show that we can round combinatorially when we know the function f to be integral
10:59.660 --> 11:04.140
and an iterate z_k is close enough to the optimum x_t. This is our tool T4.
11:05.900 --> 11:10.620
Our fifth tool: we can reuse previously known vertices of the polytope. Suppose that our optimum is x_t,
11:10.620 --> 11:15.660
and we are given a close-by point x_i as a convex combination of some vertices of the polytope.
11:15.660 --> 11:21.980
We can use those vertices to warm-start the search for x_t. Now our sixth tool, Restrict.
11:23.580 --> 11:27.260
Once we know a few tight sets for x_t using our inference tools T1 and T2,
11:27.260 --> 11:32.540
we needn't search for the optimum over the whole base polytope. We can restrict ourselves to the
11:32.540 --> 11:38.380
face of the polytope that satisfies these constraints. We show that a simple extension
11:38.380 --> 11:42.300
of Edmonds' greedy algorithm provides a linear optimization oracle for each face of the polytope.
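One plausible way to sketch such an extension (under the assumption that the inferred tight sets form a nested chain; the interface is our own, not the talk's): run the same greedy, but only over element orderings in which each tight set appears as a prefix, so the resulting vertex lies on the restricted face.

```python
def restricted_greedy(f, ground_set, w, chain):
    """Edmonds-style greedy over the face of B(f) on which each set in
    `chain` (a nested family S1 ⊂ S2 ⊂ ...) is tight: sort elements by
    decreasing weight *within* each chain layer, so every tight set is
    a prefix of the final order, then assign marginal gains as usual."""
    layers, prev = [], set()
    for S in list(chain) + [set(ground_set)]:
        layers.append(sorted(S - prev, key=lambda e: -w[e]))
        prev = set(S)
    x, prefix = {}, set()
    for layer in layers:
        for e in layer:
            x[e] = f(prefix | {e}) - f(prefix)
            prefix.add(e)
    return x

# Face of the simplex where S1 = {'a'} is tight, so x('a') = f({'a'}) = 1
# regardless of the weights on the other elements.
f = lambda S: min(len(S), 1)
x = restricted_greedy(f, {'a', 'b', 'c'}, {'a': 0.1, 'b': 9.0, 'c': 2.0}, [{'a'}])
print(x)  # → {'a': 1, 'b': 0, 'c': 0}
```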
11:42.300 --> 11:46.300
We now bring together these tools and apply them to the away-step Frank-Wolfe algorithm,
11:46.300 --> 11:50.700
giving the algorithm we dub adaptive away-step Frank-Wolfe, or A2FW for short.
11:51.660 --> 11:57.340
First, we warm-start A2FW using tight sets for the optimum inferred from previously projected points,
11:57.340 --> 12:01.660
and active sets from previously projected points. While the algorithm runs and generates new
12:01.660 --> 12:05.660
iterates, it keeps inferring new tight sets for the optimal point using these iterates.
12:05.660 --> 12:09.580
In each iteration, if a new tight set has been found, the algorithm checks whether all tight sets have been
12:09.580 --> 12:16.300
found. If so, it stops and outputs the exact solution. Otherwise, it simply restricts the
12:16.300 --> 12:22.060
problem to a lower-dimensional face and keeps going. Note that the linear optimization is over a
12:22.060 --> 12:27.420
restricted face of the polytope. Let's see an example. Suppose we are optimizing over the
12:27.420 --> 12:32.860
polytope P. We look for the best Frank-Wolfe vertex and the best away vertex. We find that
12:32.860 --> 12:37.500
the best Frank-Wolfe vertex is also the best away vertex. Since the direction opposite to the away
12:37.500 --> 12:44.060
vertex is the better direction to move in, we take an away step to find the next iterate z_{t+1}. Now, z_{t+1} is
12:44.060 --> 12:50.540
close enough to x star that it allows us to detect another tight set and round to the face F_new.
12:50.540 --> 12:56.140
One way to do that is to round to an arbitrary vertex in F_new using our LO oracle. Another
12:56.140 --> 13:00.700
option is to relax to F_new and check whether the solution obtained is feasible. If the feasibility
13:00.700 --> 13:06.300
check fails, we return to the previous strategy. Eventually, we reach the optimum
13:06.300 --> 13:11.660
x star either way. We give the following theorem about the primal gap for the modified algorithm.
13:11.660 --> 13:16.460
The function h is L-smooth and mu-strongly convex, and D refers to the diameter of B(f).
13:17.100 --> 13:22.940
Notice how this compares to the AFW algorithm: when we restrict to a face F of B(f), our guarantee
13:22.940 --> 13:29.020
depends only on the pyramidal width of F instead of the pyramidal width of B(f). This pyramidal width
13:29.020 --> 13:33.420
can be much lower for the restricted face. For instance, it depends on the dimension of the face
13:33.420 --> 13:40.620
in the case of the probability simplex. Therefore, A2FW leads to faster convergence. We now show the
13:40.620 --> 13:46.460
effectiveness of our toolkit and the A2FW algorithm using experiments. For our computations,
13:46.460 --> 13:50.940
we simulate an online recommendation system where we are learning over rankings of items
13:50.940 --> 13:55.660
displayed to users. Our loss functions are stochastic and model click-through rates. This
13:55.660 --> 14:01.340
can be seen as optimization over the permutahedron. We use online mirror descent (OMD), which performs
14:01.340 --> 14:07.020
iterative projections, and uses away-step Frank-Wolfe for these projections. We benchmark the
14:07.020 --> 14:14.780
original AFW algorithm against variants modified by our tools. We report significant improvements
14:14.780 --> 14:19.820
in both runtime and the number of AFW iterations. The green line stands for OMD with the original,
14:19.820 --> 14:27.180
unoptimized AFW. The yellow line stands for OMD with the A2FW algorithm. We do note that both OMD-PAV,
14:27.180 --> 14:31.900
that is, OMD with projections computed using the pool adjacent violators algorithm, and OFW were
14:31.900 --> 14:38.140
significantly faster than OMD with any AFW variant. However, OFW does not achieve optimal
14:38.140 --> 14:43.420
regret rates, while OMD-PAV works only for cardinality-based submodular polytopes. To
14:43.420 --> 14:48.140
conclude, we studied iterative projections for prevalent submodular base polytopes. We presented
14:48.140 --> 14:53.020
a fast algorithm for cardinality-based polytopes. For general polytopes, we developed a combinatorial
14:53.020 --> 14:58.380
toolkit to speed up iterative projections, applied it to the AFW algorithm, and computationally
14:58.380 --> 15:18.540
showed that our algorithm is orders of magnitude faster than the original AFW variant.