WEBVTT
00:00.000 --> 00:13.040
Hello, I'm Hassan Mortagy. I'm a PhD student at Georgia Tech. Along with my collaborator
00:13.040 --> 00:16.880
Jai Moondra, we will present our work on reusing combinatorial structure: faster iterative projections
00:16.880 --> 00:20.260
over submodular base polytopes. This is joint work with Swati Gupta.
00:20.260 --> 00:24.220
In this talk, we consider a sequence of similar structured optimization problems, a setup often
00:24.220 --> 00:28.220
encountered in practice. We first start with our main problem of minimizing a convex function
00:28.220 --> 00:32.260
over a decision set P. At the next time step, this problem sees some perturbation and we
00:32.260 --> 00:36.700
obtain another similar problem, and so on. An example of this setup is the case of iterative
00:36.700 --> 00:40.340
projections where at each time step, we are computing the projection of a new point y
00:40.340 --> 00:44.860
t that is close to previously projected points y i. These iterative projections form a key
00:44.860 --> 00:48.140
step in many online learning algorithms and they are currently solved from scratch every
00:48.140 --> 00:51.900
iteration. They are not viewed in the context of an iterative environment where previously
00:51.900 --> 00:55.140
computed projections can be exploited to speed up subsequent ones.
00:55.140 --> 00:59.500
Thus, in this talk, we ask, is it possible to speed up similar iterative optimization
00:59.500 --> 01:03.660
problems by reusing structural information from previous minimizers?
01:03.660 --> 01:07.580
Let me now give you some more details about our setup. Here is a table that summarizes
01:07.580 --> 01:11.180
various widespread first-order optimization algorithms. The first two algorithms are conditional
01:11.180 --> 01:16.660
gradient variants and they only solve linear optimization every iteration. Their convergence
01:16.660 --> 01:21.140
rates depend on the dimension of the problem and on geometric constants for the underlying
01:21.140 --> 01:25.140
decision set, such as the pyramidal width for the away-step Frank-Wolfe variant given in
01:25.140 --> 01:28.340
the second row. On the other hand, the remaining three algorithms
01:28.340 --> 01:32.580
are projection-based algorithms that compute a projection every iteration. Their
01:32.580 --> 01:37.660
convergence rates, however, are optimal in the sense that they only rely on the condition
01:37.660 --> 01:43.260
number of the function and they are dimension-independent. Further, to capture a wide range of combinatorial
01:43.260 --> 01:47.980
sets, we consider the case where the decision set P is given by a submodular polytope, and
01:47.980 --> 01:53.380
the challenge is that these polytopes have an exponential number of constraints. Thus,
01:53.380 --> 01:58.020
computing a projection over those polytopes is a big computational bottleneck in projection-based
01:58.020 --> 02:03.820
algorithms. Motivated by this trade-off in convergence rates versus runtime, we further
02:03.820 --> 02:08.740
ask, is it possible to speed up iterative projections over submodular polytopes by reusing
02:08.740 --> 02:14.300
structural information from previous minimizers? I'm now going to give more background on
02:14.300 --> 02:18.500
the problem, submodularity, and a review of first-order methods. So, as mentioned, we
02:18.500 --> 02:22.220
assume that the combinatorial structure in a problem is given by a submodular function.
02:22.220 --> 02:26.780
A set function F, defined over a ground set E of n elements, is submodular if it satisfies
02:26.780 --> 02:33.260
the following diminishing-returns property: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) for all subsets A and B of E. Furthermore, the base polytope B(F) associated with F is defined as the
02:33.260 --> 02:38.060
following system of linear inequalities: x(S) ≤ F(S) for every subset S of E, with x(E) = F(E). Here we see that B(F) is modeled using an
02:38.060 --> 02:41.500
exponential number of constraints because we have a constraint for each subset of the
02:41.500 --> 02:46.860
ground set. An example is the permutahedron, a polytope whose vertices are permutations
02:46.860 --> 02:51.740
of 1 through n. And here we have an example in the slide for when n is equal to 3. These
02:51.740 --> 02:58.220
polytopes are extensively used in online learning over rankings of items. A special class of
02:58.220 --> 03:02.140
submodular polytopes arises from cardinality-based functions, and a cardinality-based function
03:02.140 --> 03:07.620
F is defined as F(S) = g(|S|), where g is a concave function. And here
03:07.620 --> 03:10.940
we have another table that summarizes various machine learning and online learning problems and the
03:10.940 --> 03:14.380
submodular set functions that give rise to them. We see the permutahedron in the second
03:14.380 --> 03:19.180
row of this table, and it is in fact a cardinality-based polytope. Other non-cardinality-based examples
03:19.180 --> 03:22.220
include spanning trees and independent sets of matroids.
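To make the definition concrete, here is a small Python sketch (our own illustrative code, not from the paper) that encodes two cardinality-based functions, the probability simplex and the permutahedron for n = 3, and spot-checks the submodularity inequality f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B) on random pairs of sets.

    import random

    n = 3
    E = set(range(1, n + 1))                  # ground set {1, ..., n}

    def g_simplex(k):
        return min(k, 1)                      # f(S) = min(|S|, 1): probability simplex

    def g_perm(k):
        return sum(range(n, n - k, -1))       # n + (n-1) + ...: permutahedron

    def f(S, g):                              # cardinality-based: f(S) = g(|S|)
        return g(len(S))

    random.seed(0)
    for _ in range(100):
        A = {e for e in E if random.random() < 0.5}
        B = {e for e in E if random.random() < 0.5}
        for g in (g_simplex, g_perm):
            assert f(A, g) + f(B, g) >= f(A | B, g) + f(A & B, g)
    print("submodularity spot-checks passed")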
03:24.060 --> 03:28.220
So let's go back to our main problem of minimizing a convex function over the base polytope.
03:28.220 --> 03:32.620
So there typically exist three main paradigms to solve this problem. The first is a class
03:32.620 --> 03:37.020
of methods, known as conditional gradient methods, and as I mentioned before, those
03:37.020 --> 03:42.620
assume access to B(F) via a linear optimization oracle. And these methods are specifically
03:42.620 --> 03:46.780
advantageous for base polytopes because linear optimization over base polytopes could be
03:46.780 --> 03:51.180
done very efficiently using Edmonds' greedy algorithm. The second class of methods are
03:51.180 --> 03:55.500
mirror descent variants, and those compute a projection every iteration to ensure feasibility.
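As a concrete aside on the linear optimization oracle just mentioned, here is a minimal Python sketch of Edmonds' greedy algorithm (our own illustrative code and naming, not the authors' implementation). To maximize <w, x> over B(f), visit elements in decreasing order of weight and give each element its marginal gain; visiting them in increasing order minimizes instead.

    def greedy_lo(w, f):
        """Edmonds' greedy: argmax of <w, x> over the base polytope B(f)."""
        order = sorted(w, key=w.get, reverse=True)   # decreasing weights
        x, prefix = {}, frozenset()
        for e in order:
            x[e] = f(prefix | {e}) - f(prefix)       # marginal value of e
            prefix = prefix | {e}
        return x

    # Example: permutahedron for n = 3, f(S) = sum of the |S| largest of {1, 2, 3}.
    f_perm = lambda S: sum([3, 2, 1][:len(S)])
    print(greedy_lo({'a': 0.3, 'b': 0.9, 'c': 0.5}, f_perm))  # {'b': 3, 'c': 2, 'a': 1}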
03:56.060 --> 03:59.980
And again, as I also previously mentioned, although those methods have optimal convergence
03:59.980 --> 04:05.100
rates and are robust, they have remained of a theoretical nature due to being computationally
04:05.100 --> 04:10.060
expensive. The third class of methods are combinatorial algorithms specifically tailored
04:10.060 --> 04:15.820
for convex optimization over submodular base polytopes. Those algorithms instead require
04:15.820 --> 04:20.860
solving a submodular function minimization problem every iteration, which again can be
04:20.860 --> 04:25.020
very expensive. However, those algorithms enjoy the nice property of returning an exact
04:25.020 --> 04:30.700
optimal solution. In this talk, we will focus on bridging the efficiency of CG methods and
04:30.700 --> 04:35.420
the structural properties and exactness of combinatorial algorithms to speed up iterative
04:35.420 --> 04:40.780
projections appearing in mirror descent and beyond. So first, let's consider the simpler
04:40.780 --> 04:44.380
case when our polytope is cardinality-based. So here we have a cardinality-based submodular
04:44.380 --> 04:48.940
function F, and for notation we define this vector c to be the vector of discrete derivatives
04:48.940 --> 04:53.340
of the concave function g. We now give the following duality result, which states that
04:53.340 --> 04:57.180
the problem of computing a Bregman projection over a cardinality-based polytope is dual
04:57.180 --> 05:02.860
to isotonic optimization. Although our results hold for general Bregman projections, we will
05:02.860 --> 05:08.620
focus on the case of Euclidean projections for simplicity. To that end, consider a vector
05:08.620 --> 05:11.980
y whose Euclidean projection onto a cardinality-based polytope we are trying to compute,
05:11.980 --> 05:17.340
and let e1 through en be an ordering of the ground set such that y is decreasing. In this
05:17.340 --> 05:21.580
case, we have the following primal problem, and the dual to that is the following isotonic
05:21.580 --> 05:27.820
regression problem. And further, we can map between the two problems using the following identity here.
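To make the duality concrete, here is a minimal Python sketch of the whole pipeline (our own code and variable names, assuming Euclidean projections and that y is already sorted in decreasing order): run pool adjacent violators on c minus y to get the dual optimum z, then map back to the primal via x = y + z.

    import numpy as np

    def pav(a):
        """Isotonic (non-decreasing) least-squares regression of a, via PAV."""
        blocks = []                                   # each block holds [sum, count]
        for v in a:
            blocks.append([v, 1])
            # merge adjacent blocks while their means violate the ordering
            while len(blocks) > 1 and \
                    blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
                s, c = blocks.pop()
                blocks[-1][0] += s
                blocks[-1][1] += c
        return np.concatenate([np.full(c, s / c) for s, c in blocks])

    def project_cardinality(y, g):
        """Euclidean projection of y (sorted decreasingly) onto B(f), f(S) = g(|S|)."""
        n = len(y)
        c = np.array([g(k + 1) - g(k) for k in range(n)])   # discrete derivatives of g
        z = pav(c - y)                                      # dual: isotonic regression
        return y + z                                        # map dual optimum to primal

    # Projecting y = (0.8, 0.6) onto the probability simplex, g(k) = min(k, 1):
    print(project_cardinality(np.array([0.8, 0.6]), lambda k: min(k, 1)))  # [0.6 0.4]

Sorting y first costs O(n log n) and dominates the linear-time PAV pass, which matches the running time discussed next.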
05:29.580 --> 05:32.860
So just to give you some historical context, previously the best known running time for
05:32.860 --> 05:37.500
projections was O(n²) using a primal algorithm by Gupta et al. Later on in that
05:37.500 --> 05:41.180
year, Lim and Wright used the same duality approach to compute projections over the permutahedron,
05:41.180 --> 05:45.020
and we extended their approach to general cardinality-based polytopes. Now the dual
05:45.020 --> 05:49.340
isotonic regression problem can be solved in O(n) time using a simple algorithm called
05:49.340 --> 05:53.900
the pool adjacent violators algorithm, and this basically gives us an O(n log n) algorithm by
05:53.900 --> 05:59.180
solving the problem in the dual space and mapping it back to the primal space. And this is currently
05:59.180 --> 06:04.060
the fastest known algorithm. And the key takeaway is that solving projections over these polytopes
06:04.060 --> 06:09.420
can be very efficiently done. In fact, computing a projection and solving linear optimization
06:09.420 --> 06:15.260
have the same running time. Now let's demonstrate our result with an example. So here we are going
06:15.260 --> 06:20.060
to project this vector y onto the probability simplex, and the probability simplex is modeled
06:20.060 --> 06:24.940
by this cardinality-based submodular function given here on the slide. And we see that y is already
06:24.940 --> 06:29.900
ordered for simplicity and c is the vector of discrete derivatives. Now the algorithm will
06:29.900 --> 06:35.260
proceed as follows. It initializes the dual iterate to the vector that we're trying to
06:35.260 --> 06:39.980
compute the isotonic regression for, namely c minus y, and here we have an adjacent violation because the
06:39.980 --> 06:45.180
second coordinate is strictly smaller than the first coordinate. Now the algorithm will basically
06:45.180 --> 06:49.740
average those two coordinates to obtain the following solution z star, and here we see that
06:49.740 --> 06:54.060
the ordering constraints are satisfied and z star is in fact the dual optimal. Next it will map it
06:54.060 --> 06:58.700
back to a primal optimal. And let's go back to this figure from the previous slide that just compares
06:58.700 --> 07:04.060
a basic linear regression fit with an isotonic regression fit. Here in the red stepwise curve,
07:04.060 --> 07:08.060
the points at which the curve remains flat are where a block of consecutive adjacent violating
07:08.060 --> 07:12.460
points is averaged, similar to our example. This very efficient algorithm for computing
07:12.460 --> 07:15.820
Bregman projections over cardinality-based polytopes unfortunately does not extend to
07:15.820 --> 07:20.460
general submodular base polytopes. And now my collaborator Jai will present different combinatorial
07:20.460 --> 07:25.100
strategies for dealing with those polytopes. We now describe our toolkit for speeding up
07:25.100 --> 07:31.100
projections on general submodular base polytopes. There are two basic objects that we can learn from.
07:31.100 --> 07:35.900
First, given projections of previous points, can we do better than computing a new projection from
07:35.900 --> 07:41.820
scratch? Second, given an iterative algorithm to compute a projection, can we use the combinatorial
07:41.820 --> 07:46.780
structure present in the sequence of iterates to speed up the algorithm and terminate it early?
07:49.180 --> 07:53.180
We have the well-known first-order optimality condition on the left. It helps us verify if a
07:53.180 --> 07:58.140
point is indeed optimal. This check is reduced to a linear optimization over the base polytope,
07:58.140 --> 08:03.740
which can be done using Edmonds' greedy algorithm. Here is an example. Suppose we know the gradient
08:03.740 --> 08:08.780
at a point x star and want to check if x star is indeed optimal. We look at the distinct values
08:08.780 --> 08:13.900
of the partial derivatives at x star and arrange them in an increasing order. Each time we see a
08:13.900 --> 08:19.580
gap in this order, we want the value of x star on the prefix set to equal the submodular function
08:19.580 --> 08:26.380
value on that set. In the figure, the first such gap is after we have seen e1 and e5. Therefore,
08:26.380 --> 08:35.900
x star of S1 must equal f of S1. Similarly, x star of S2 must equal f of S2. Finally, x star of E must equal f of
08:35.900 --> 08:42.940
E. These sets S1, S2, and E are called tight sets at x star and they define the face containing the point x
08:42.940 --> 08:49.100
star. This leads us to two interesting observations that we use later. One, that if we know precisely
08:49.100 --> 08:53.980
what the tight sets are at the optimal point, we can also calculate the optimal point for all
08:53.980 --> 08:59.100
suitable functions h. Two, that knowing the gradient at the optimal point gives us these
08:59.100 --> 09:06.220
tight sets. We give an example using our combinatorial idea. Suppose we know a point
09:06.220 --> 09:11.420
zk that is close to our optimal x star. If the function is smooth, this implies the gradients at zk
09:11.420 --> 09:16.540
and x star are close. This gives us a way to learn some tight sets defining the optimal face.
09:17.260 --> 09:21.740
In the example, for each coordinate, the blue line in the middle represents the partial derivative
09:21.740 --> 09:26.700
value at zk and the blue shade represents the possible variation in that value for the optimal
09:26.700 --> 09:31.500
point x star. That is, the corresponding partial derivative for x star lies in the shaded interval.
09:32.140 --> 09:36.860
The largest values in these intervals for e1 and e5 are lower than the lowest values in these
09:36.860 --> 09:44.300
intervals for every other element. This helps us conclude that the set {e1, e5}, that is S1,
09:44.300 --> 09:50.540
is a tight set at x star. Similarly, we infer that S2 is also a tight set at x star.
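The inference step can be written down compactly. Here is an illustrative Python helper (our own hypothetical naming, not the paper's code): sort coordinates by their approximate partial derivatives, and certify a prefix set as tight whenever the gap to the next value exceeds twice the uncertainty eps, so the shaded intervals cannot overlap; eps = 0 recovers the exact check from the previous slide.

    def inferred_tight_sets(grad, eps):
        """Prefix sets certified tight at x* given +/- eps noise on the derivatives."""
        order = sorted(grad, key=grad.get)            # increasing partial derivatives
        tight, prefix = [], []
        for e, nxt in zip(order, order[1:]):
            prefix.append(e)
            if grad[nxt] - grad[e] > 2 * eps:         # intervals cannot overlap
                tight.append(set(prefix))
        tight.append(set(order))                      # the ground set is always tight
        return tight

    # e1 and e5 have clearly smaller derivatives than the rest -> S1 = {e1, e5}:
    g = {'e1': 0.10, 'e5': 0.15, 'e2': 0.90, 'e3': 1.00, 'e4': 1.60}
    print(inferred_tight_sets(g, eps=0.1))  # [{'e1','e5'}, {'e1','e5','e2','e3'}, ...]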
09:51.500 --> 09:56.460
We now use that idea to give our first two tools. These apply more generally, but we demonstrate
09:56.460 --> 10:01.820
them using Euclidean projections. Suppose we already know the projection xi of a point yi,
10:01.820 --> 10:06.540
and we wish to find the projection xt of point yt, given that yt is close to yi.
10:07.660 --> 10:11.980
The non-expansiveness of projection implies that the gradients at xi and xt are also close,
10:11.980 --> 10:15.820
and therefore we can infer some tight sets at xt even before solving.
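For Euclidean projections this instantiates directly: the gradient of ½||x − yt||² at the known projection xi is xi − yt, and non-expansiveness bounds its distance to the gradient at xt by ||yt − yi||. A sketch of tool T1 with illustrative numbers, reusing the hypothetical helper above:

    import numpy as np

    yi = np.array([0.11, 0.09, 0.10])
    yt = np.array([0.10, 0.10, 0.10])            # nearby point to be projected next
    xi = np.array([0.10, 0.50, 0.90])            # known projection of yi (illustrative)

    approx_grad = dict(zip(['e1', 'e2', 'e3'], xi - yt))
    eps = float(np.linalg.norm(yt - yi))         # gradient uncertainty bound
    print(inferred_tight_sets(approx_grad, eps)) # tight sets known before solving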
10:16.940 --> 10:20.620
Suppose we start computing the projection of yt using an iterative algorithm.
10:20.620 --> 10:26.860
We now use the iterates zi that converge to xt. An iterate zt that is close to xt also has a
10:26.860 --> 10:32.780
gradient that is close to the gradient at xt, and once again we can infer some tight sets at xt
10:32.780 --> 10:39.740
as we approach the optimal. We also conducted an experiment to show that tool T1 can recover
10:39.740 --> 10:44.300
most tight sets from previous projections. We now give two tools that help us round an
10:44.300 --> 10:49.260
approximate solution to the exact projection. First is our tool T3, called Relax.
10:49.260 --> 10:53.500
We give a heuristic to check if we have already found all the tight sets at the optimal.
10:55.020 --> 10:59.660
We also show that we can round combinatorially when we know the function f to be integral,
10:59.660 --> 11:04.140
and an iterate zt is close enough to the optimal xt. This is our tool T4.
11:05.900 --> 11:10.620
Our fifth tool: we can reuse previously known vertices of the polytope. Suppose that our optimal is xt,
11:10.620 --> 11:15.660
and we are given a close by point xi as a convex combination of some vertices in the polytope.
11:15.660 --> 11:21.980
We can use those vertices to warm start the search for xt. Now our sixth tool, Restrict.
11:23.580 --> 11:27.260
Once we know a few tight sets for xt using our inference tools T1 and T2,
11:27.260 --> 11:32.540
we needn't search for the optimal over the whole base polytope. We can restrict ourselves to the
11:32.540 --> 11:38.380
face of the polytope that satisfies these constraints. We show that a simple extension
11:38.380 --> 11:42.300
of Edmonds' greedy algorithm provides a linear optimization (LO) oracle for each face of the polytope.
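Concretely, here is a sketch of such a face-restricted oracle (again our own hypothetical code, reusing the greedy ordering from before): process a known chain of tight sets first, block by block, and run the same greedy within each block; the returned vertex then lies on the corresponding face.

    def greedy_lo_on_face(w, f, chain):
        """argmax <w, x> over the face of B(f) where x(S) = f(S) for each S in chain."""
        full_chain = list(chain) + [frozenset(w)]      # end with the ground set E
        x, prefix, done = {}, frozenset(), frozenset()
        for S in full_chain:
            block = sorted(S - done, key=w.get, reverse=True)  # greedy inside block
            for e in block:
                x[e] = f(prefix | {e}) - f(prefix)
                prefix |= {e}
            done = frozenset(S)
        return x

    # With chain = [{'b'}], 'b' is processed first, pinning x('b') = f({'b'}).
    f_perm = lambda S: sum([3, 2, 1][:len(S)])
    print(greedy_lo_on_face({'a': 0.9, 'b': 0.1, 'c': 0.5}, f_perm, [frozenset('b')]))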
11:42.300 --> 11:46.300
We now bring together these tools and apply them to the away-step Frank-Wolfe algorithm,
11:46.300 --> 11:50.700
giving the algorithm we dub adaptive away-step Frank-Wolfe, or A2FW for short.
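Before the step-by-step description, here is a pseudocode-level skeleton of the loop in Python (a rough sketch under our assumptions: grad, lo_on_face, infer_tight, all_tight_found, and step are hypothetical callables passed in, and step-size and away-vertex bookkeeping are omitted).

    def a2fw_skeleton(z0, grad, lo_on_face, infer_tight, all_tight_found,
                      step, tol=1e-8, max_iter=10_000):
        """Adaptive away-step Frank-Wolfe, stripped to its control flow."""
        z, chain = z0, []                         # warm-started iterate and tight sets
        for _ in range(max_iter):
            g = grad(z)
            chain = infer_tight(z, g, chain)      # learn new tight sets on the fly
            if all_tight_found(chain):            # heuristic check (tools T3/T4):
                return z, chain                   # round to the face and stop early
            v = lo_on_face(g, chain)              # LO over the restricted face only
            if g @ (z - v) <= tol:                # Frank-Wolfe gap small: done
                return z, chain
            z = step(z, v)                        # FW or away step (details omitted)
        return z, chain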
11:51.660 --> 11:57.340
First, warm start A2FW using tight sets for the optimal inferred from previous projected points,
11:57.340 --> 12:01.660
and active sets from those projections. While the algorithm runs and generates new
12:01.660 --> 12:05.660
iterates, it keeps inferring new tight sets for the optimal point using these iterates.
12:05.660 --> 12:09.580
In each iteration, if a new tight set has been found, the algorithm checks if all tight sets have been
12:09.580 --> 12:16.300
found. If indeed so, it stops and outputs the exact solution. Otherwise, it simply restricts the
12:16.300 --> 12:22.060
problem to a lower-dimensional face and keeps going. Note that the linear optimization is over a
12:22.060 --> 12:27.420
restricted face of the polytope. Let's see an example. Suppose we are optimizing over the
12:27.420 --> 12:32.860
polytope P. We look for the best Frank-Wolfe vertex and the best away vertex. Since the
12:32.860 --> 12:37.500
direction opposite to the away
12:37.500 --> 12:44.060
vertex is the better direction to move in, we find the next iterate z t+1. Now, z t+1 is
12:44.060 --> 12:50.540
close enough to x star that it allows us to detect another tight set and round to the face F new.
12:50.540 --> 12:56.140
One way to do that is to round to an arbitrary vertex in F new using our LO oracle. Another
12:56.140 --> 13:00.700
option is to relax to F new and see if the solution obtained is feasible. If the feasibility
13:00.700 --> 13:06.300
check is inconclusive, we return to the previous strategy. Eventually, we reach the optimal
13:06.300 --> 13:11.660
x star either way. We give this theorem about the primal gap for the modified algorithm.
13:11.660 --> 13:16.460
The function h is L-smooth and mu-strongly convex, and D refers to the diameter of B(F).
13:17.100 --> 13:22.940
Notice how this compares to the AFW algorithm. When we restrict to a face F of B(F), our guarantee
13:22.940 --> 13:29.020
depends only on the pyramidal width of F instead of the pyramidal width of B(F). This pyramidal width
13:29.020 --> 13:33.420
can be much lower for the restricted face. For instance, it depends on the dimension of the face
13:33.420 --> 13:40.620
for the probability simplex. Therefore, A2FW leads to faster convergence. We now show the
13:40.620 --> 13:46.460
effectiveness of our toolkit and the A2FW algorithm using experiments. For our computations,
13:46.460 --> 13:50.940
we simulate an online recommendation system where we are learning over rankings of items
13:50.940 --> 13:55.660
displayed to users. Our loss functions are stochastic and model click-through rates. This
13:55.660 --> 14:01.340
can be seen as optimization over the permutahedron. We use online mirror descent which performs
14:01.340 --> 14:07.020
iterative projections and uses away-step Frank-Wolfe for these projections. We benchmark the
14:07.020 --> 14:14.780
original AFW algorithm against variants modified by our tools. We report significant improvement
14:14.780 --> 14:19.820
in both runtime and the number of AFW iterations. The green line stands for OMD with the original
14:19.820 --> 14:27.180
unoptimized AFW. The yellow line stands for OMD with the A2FW algorithm. We do note that both OMD-PAV,
14:27.180 --> 14:31.900
that is, OMD with projections using the pool adjacent violators algorithm, and OFW were
14:31.900 --> 14:38.140
significantly faster than OMD with any AFW variant. However, OFW does not lead to optimal
14:38.140 --> 14:43.420
regret rates, while OMD-PAV works only for cardinality-based submodular polytopes. To
14:43.420 --> 14:48.140
conclude, we studied iterative projections for prevalent submodular base polytopes. We presented
14:48.140 --> 14:53.020
an O(n log n) algorithm for cardinality-based polytopes. For general polytopes, we developed a combinatorial
14:53.020 --> 14:58.380
toolkit to speed up iterative projections and applied it to the AFW algorithm and computationally
14:58.380 --> 15:18.540
showed that our algorithm is orders of magnitude faster than the original AFW variant.