WEBVTT

00:00.000 --> 00:13.040
Hello, I'm Hassan Mortagy. I'm a PhD student at Georgia Tech. Along with my collaborator

00:13.040 --> 00:16.880
Jai Moondra, we will present our work on reusing combinatorial structure for faster iterative projections

00:16.880 --> 00:20.260
over submodular base polytopes. This is joint work with Swati Gupta.

00:20.260 --> 00:24.220
In this talk, we consider a sequence of similar structured optimization problems, a setup often

00:24.220 --> 00:28.220
encountered in practice. We first start with our main problem of minimizing a convex function

00:28.220 --> 00:32.260
over a decision set P. At the next time step, this problem sees some perturbation and we

00:32.260 --> 00:36.700
obtain another similar problem, and so on. An example of this setup is the case of iterative

00:36.700 --> 00:40.340
projections where at each time step, we are computing the projection of a new point

00:40.340 --> 00:44.860
yt that is close to previously projected points yi. These iterative projections form a key

00:44.860 --> 00:48.140
step in many online learning algorithms, and they are currently solved from scratch every

00:48.140 --> 00:51.900
iteration. They are not viewed in the context of an iterative environment where previously

00:51.900 --> 00:55.140
computed projections can be exploited to speed up subsequent ones.

00:55.140 --> 00:59.500
Thus, in this talk, we ask, is it possible to speed up similar iterative optimization

00:59.500 --> 01:03.660
problems by reusing structural information from previous minimizers?

01:03.660 --> 01:07.580
Let me now give you some more details about our setup. Here is a table that summarizes

01:07.580 --> 01:11.180
various widespread first-order optimization algorithms. The first two algorithms are conditional

01:11.180 --> 01:16.660
gradient variants and they only solve linear optimization every iteration. Their convergence

01:16.660 --> 01:21.140
rates depend on the dimension of the problem and on geometric constants for the underlying

01:21.140 --> 01:25.140
decision set, such as the pyramidal width for the Away-Step Frank-Wolfe variant given in

01:25.140 --> 01:28.340
the second row. On the other hand, the remaining three algorithms

01:28.340 --> 01:32.580
are projection-based algorithms that compute a projection every iteration. Their

01:32.580 --> 01:37.660
convergence rates, however, are optimal in the sense that they only rely on the condition

01:37.660 --> 01:43.260
number of the function and they are dimension-independent. Further, to capture a wide range of combinatorial

01:43.260 --> 01:47.980
sets, we consider the case where the decision set P is given by a submodular polytope, and

01:47.980 --> 01:53.380
the challenge is that these polytopes have an exponential number of constraints. Thus,

01:53.380 --> 01:58.020
computing a projection over those polytopes is a big computational bottleneck in projection-based

01:58.020 --> 02:03.820
algorithms. Motivated by the trade-off in convergence rates versus runtime, we further

02:03.820 --> 02:08.740
ask, is it possible to speed up iterative projections over submodular polytopes by reusing

02:08.740 --> 02:14.300
structural information from previous minimizers? I'm now going to give some more background on

02:14.300 --> 02:18.500
the problem, submodularity, and a review of first-order methods. So, as mentioned, we

02:18.500 --> 02:22.220
assume that the combinatorial structure in a problem is given by a submodular function.

02:22.220 --> 02:26.780
A set function F, defined over a ground set E of n elements, is submodular if it satisfies

02:26.780 --> 02:33.260
the following property: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) for all subsets A, B of E. Furthermore, the base polytope associated with F is defined as the

02:33.260 --> 02:38.060
following system of linear inequalities: B(F) = {x in R^E : x(S) ≤ F(S) for every subset S of E, and x(E) = F(E)}. Here we see that B(F) is modeled using an

02:38.060 --> 02:41.500
exponential number of constraints because we have a constraint for each subset of the

02:41.500 --> 02:46.860
ground set. An example is the permutahedron, a polytope whose vertices are permutations

02:46.860 --> 02:51.740
of 1 through n. And here we have an example in the slide for when n is equal to 3. These

02:51.740 --> 02:58.220
polytopes are extensively used in online learning over rankings of items. A special class of

02:58.220 --> 03:02.140
submodular polytopes arises from cardinality-based functions: a cardinality-based function

03:02.140 --> 03:07.620
F is defined as F(S) = g(|S|), where g is a concave function. And here

03:07.620 --> 03:10.940
we have another table that summarizes various machine and online learning problems and the

03:10.940 --> 03:14.380
submodular set function that gives rise to them. We see the permutahedron in the second

03:14.380 --> 03:19.180
row of this table, and it is in fact a cardinality-based polytope. Other non-cardinality-based examples

03:19.180 --> 03:22.220
include spanning trees and independent sets of matroids.
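
To make the definitions above concrete, here is a small self-contained Python sketch, my own illustration rather than part of the talk: it instantiates the permutahedron's cardinality-based function for n = 3 and brute-force checks the submodular inequality.

```python
# Illustration only (not from the talk): the permutahedron on n elements is the
# base polytope of the cardinality-based function f(S) = g(|S|), where
# g(k) = n + (n-1) + ... + (n-k+1) is concave.
from itertools import combinations

n = 3
E = range(n)

def g(k):
    # concave, since its increments n, n-1, ..., 1 are decreasing
    return sum(n - i for i in range(k))

def f(S):
    # cardinality-based submodular function
    return g(len(S))

subsets = [frozenset(c) for k in range(n + 1) for c in combinations(E, k)]
# submodularity: f(A) + f(B) >= f(A | B) + f(A & B) for all A, B
assert all(f(A) + f(B) >= f(A | B) + f(A & B) for A in subsets for B in subsets)
print("f is submodular; the vertices of B(f) are the permutations of (1, 2, 3)")
```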

03:24.060 --> 03:28.220
So let's go back to our main problem of minimizing a convex function over the base polytope.

03:28.220 --> 03:32.620
So there typically exist three main paradigms to solve this problem. The first is a class

03:32.620 --> 03:37.020
of methods, known as conditional gradient methods, and as I mentioned before, those

03:37.020 --> 03:42.620
assume access to B(F) via a linear optimization oracle. And these methods are specifically

03:42.620 --> 03:46.780
advantageous for base polytopes because linear optimization over base polytopes can be

03:46.780 --> 03:51.180
done very efficiently using Edmonds' greedy algorithm. The second class of methods are

03:51.180 --> 03:55.500
mirror descent variants, and those compute a projection every iteration to ensure feasibility.

03:56.060 --> 03:59.980
And again, as I also previously mentioned, although those methods have optimal convergence

03:59.980 --> 04:05.100
rates and are robust, they have remained of a theoretical nature due to being computationally

04:05.100 --> 04:10.060
expensive. The third class of methods are combinatorial algorithms specifically tailored

04:10.060 --> 04:15.820
for convex optimization over submodular base polytopes. Those algorithms require instead

04:15.820 --> 04:20.860
solving a submodular function minimization problem every iteration, which again can be

04:20.860 --> 04:25.020
very expensive. However, those algorithms enjoy the nice property of returning an exact

04:25.020 --> 04:30.700
optimal solution. In this talk, we will focus on bridging the efficiency of CG methods and

04:30.700 --> 04:35.420
the structural properties and exactness of combinatorial algorithms to speed up iterative

04:35.420 --> 04:40.780
projections appearing in mirror descent and beyond. So first, let's consider the simpler

04:40.780 --> 04:44.380
case when our polytope is cardinality-based. So here we have a cardinality-based submodular

04:44.380 --> 04:48.940
function F, and for notation we define this vector c to be the vector of discrete derivatives

04:48.940 --> 04:53.340
of the concave function g. We now give the following duality result, which states that

04:53.340 --> 04:57.180
the problem of computing a Bregman projection over a cardinality-based polytope is dual

04:57.180 --> 05:02.860
to isotonic optimization. Although our results hold for general Bregman projections, we will

05:02.860 --> 05:08.620
focus on the case of Euclidean projections for simplicity. To that end, consider a vector

05:08.620 --> 05:11.980
y whose Euclidean projection onto a cardinality-based polytope we want to compute,

05:11.980 --> 05:17.340
and let e1 through en be an ordering of the ground set such that y is decreasing. In this

05:17.340 --> 05:21.580
case, we have the following primal problem, and the dual to that is the following isotonic

05:21.580 --> 05:27.820
regression problem. Further, we can map between the two problems with a simple coordinate-wise identity; a code sketch follows.
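
The following sketch is my reconstruction of this pipeline for Euclidean projections, not the authors' code: it assumes y is sorted in decreasing order, with c the vector of discrete derivatives of g, runs PAV on c minus y, and maps the dual solution back as x = y + z, which reproduces the probability-simplex example used later in the talk.

```python
# Reconstruction (hedged): Euclidean projection of a sorted y onto a
# cardinality-based base polytope via its dual isotonic regression problem.
import numpy as np

def pav(a):
    """Pool Adjacent Violators: least-squares nondecreasing fit to a, in O(n)."""
    vals, weights = [], []                 # one entry per pooled block
    for x in a:
        vals.append(float(x)); weights.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:   # adjacent violation
            w = weights[-2] + weights[-1]              # pool the two blocks
            v = (weights[-2] * vals[-2] + weights[-1] * vals[-1]) / w
            vals[-2:] = [v]; weights[-2:] = [w]
    return np.repeat(vals, weights)        # expand blocks back to coordinates

def project(y, c):
    z = pav(c - y)                         # dual optimal: isotonic fit of c - y
    return y + z                           # map back to the primal optimal

# Probability simplex: f(S) = min(|S|, 1), so c = (1, 0, 0); y is hypothetical.
y = np.array([1.0, 0.5, 0.1])              # already sorted in decreasing order
print(project(y, np.array([1.0, 0.0, 0.0])))   # approx. [0.75, 0.25, 0.0]
```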

05:29.580 --> 05:32.860
So just to give you some historical context, previously the best known running time for

05:32.860 --> 05:37.500
projections was O(n^2) using a primal algorithm by Gupta et al. Later on in that

05:37.500 --> 05:41.180
year, Lim and Wright used this duality approach to compute projections over the permutahedron,

05:41.180 --> 05:45.020
and we extended their approach to general cardinality-based polytopes. Now the dual

05:45.020 --> 05:49.340
isotonic regression problem can be solved in O(n) time using a simple algorithm called

05:49.340 --> 05:53.900
the pool adjacent violators (PAV) algorithm, and this basically gives us an O(n log n) algorithm by

05:53.900 --> 05:59.180
solving the problem in the dual space and mapping it back to the primal space. And this is currently

05:59.180 --> 06:04.060
the fastest known algorithm. And the key takeaway is that solving projections over these polytopes

06:04.060 --> 06:09.420
can be very efficiently done. In fact, computing a projection and solving linear optimization

06:09.420 --> 06:15.260
have the same running time. Now let's demonstrate our result with an example. So here we are going

06:15.260 --> 06:20.060
to project this vector y onto the probability simplex, and the probability simplex is modeled

06:20.060 --> 06:24.940
by this cardinality-based submodular function given here on the slide. And we see that y is already

06:24.940 --> 06:29.900
ordered for simplicity and c is the vector of discrete derivatives. Now the algorithm will

06:29.900 --> 06:35.260
proceed as follows. It initializes the dual iterates by the vector that we're trying to

06:35.260 --> 06:39.980
compute the isotonic regression for, c minus y, and here we have an adjacent violation because the

06:39.980 --> 06:45.180
second coordinate is strictly smaller than the first coordinate. Now the algorithm will basically

06:45.180 --> 06:49.740
average those two coordinates to obtain the following solution z star, and here we see that

06:49.740 --> 06:54.060
the ordering constraints are satisfied and z star is in fact the dual optimal. Next it will map it

06:54.060 --> 06:58.700
back to a primal optimal. And let's go back to this figure from the previous slide that just compares

06:58.700 --> 07:04.060
a basic linear regression fit with an isotonic regression fit. Here in the red stepwise curve,

07:04.060 --> 07:08.060
the points at which the curve remains flat are where a block of consecutive adjacent violating

07:08.060 --> 07:12.460
points is averaged, similar to our example. This very efficient algorithm for computing

07:12.460 --> 07:15.820
Bregman projections over cardinality-based polytopes unfortunately does not extend to

07:15.820 --> 07:20.460
general submodular base polytopes. And now my collaborator Jai will present different combinatorial

07:20.460 --> 07:25.100
strategies for dealing with those polytopes. We now describe our toolkit for speeding up

07:25.100 --> 07:31.100
projections on general submodular base polytopes. There are two basic objects that we can learn from.

07:31.100 --> 07:35.900
First, given projections of previous points, can we do better than computing a new projection from

07:35.900 --> 07:41.820
scratch? Second, given an iterative algorithm to compute a projection, can we use the combinatorial

07:41.820 --> 07:46.780
structure present in the sequence of iterates to speed up the algorithm and terminate it early?

07:49.180 --> 07:53.180
We have the well-known first-order optimality condition on the left: x star is optimal if and only if the inner product of the gradient at x star with x minus x star is nonnegative for every x in B(F). It helps us verify if a

07:53.180 --> 07:58.140
point is indeed optimal. This check is reduced to a linear optimization over the base polytope,

07:58.140 --> 08:03.740
which can be done using Edmonds' greedy algorithm. We have an example. Suppose we know the gradient

08:03.740 --> 08:08.780
at a point x star and want to check if x star is indeed optimal. We look at the distinct values

08:08.780 --> 08:13.900
of the partial derivatives at x star and arrange them in an increasing order. Each time we see a

08:13.900 --> 08:19.580
gap in this order, we want x star of the prefix set to equal the submodular function

08:19.580 --> 08:26.380
value on that set. In the figure, the first such gap is after we have seen e1 and e5. Therefore,

08:26.380 --> 08:35.900
x star of S1 must equal f of S1. Similarly, x star of S2 must equal f of S2. Finally, x star of E must equal f of

08:35.900 --> 08:42.940
E. These sets S1, S2, and E are called tight sets at x star and define the face containing the point x

08:42.940 --> 08:49.100
star. This leads us to two interesting observations that we use later. One, that if we know precisely

08:49.100 --> 08:53.980
what the tight sets are at the optimal point, we can also calculate the optimal point for all

08:53.980 --> 08:59.100
suitable functions h. Two, that knowing the gradient at the optimal point gives us these

08:59.100 --> 09:06.220
tight sets. We give an example using our combinatorial idea. Suppose we know a point

09:06.220 --> 09:11.420
zk that is close to our optimal x star. If the function is smooth, this implies the gradients at zk

09:11.420 --> 09:16.540
and x star are close. This gives us a way to learn some tight sets defining the optimal face.

09:17.260 --> 09:21.740
In the example, for each coordinate, the blue line in the middle represents the partial derivative

09:21.740 --> 09:26.700
value at zk and the blue shade represents the possible variation in that value for the optimal

09:26.700 --> 09:31.500
point x star. That is, the corresponding partial derivative for x star lies in the shaded interval.

09:32.140 --> 09:36.860
The largest values in these intervals for e1 and e5 are lower than the lowest values in these

09:36.860 --> 09:44.300
intervals for every other element. This helps us conclude that the set {e1, e5}, that is S1,

09:44.300 --> 09:50.540
is a tight set at x star. Similarly, we infer that S2 is also a tight set at x star.
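
As a small illustration of this rule, with hypothetical numbers rather than the slide's actual values: if each partial derivative at the optimum is known up to an error of eps, then any gap larger than 2 times eps in the sorted derivatives certifies the prefix before it as a tight set.

```python
# Illustration (hedged): infer tight sets at the optimum from an approximate
# gradient. grad_z estimates the gradient at x* with coordinatewise error <= eps.
import numpy as np

def infer_tight_sets(grad_z, eps):
    """Return prefix sets (index lists) whose uncertainty intervals separate."""
    order = np.argsort(grad_z)                 # elements by increasing derivative
    vals = grad_z[order]
    tight_sets = []
    for i in range(len(vals) - 1):
        if vals[i + 1] - vals[i] > 2 * eps:    # intervals cannot overlap here
            tight_sets.append(order[: i + 1].tolist())
    return tight_sets

# Indices 0..4 stand for e1..e5: e1 and e5 have clearly lower derivatives,
# so S1 = {e1, e5} is detected, and then S2 = {e1, e5, e2, e3}.
grad = np.array([0.1, 1.0, 1.1, 2.0, 0.2])
print(infer_tight_sets(grad, eps=0.05))        # [[0, 4], [0, 4, 1, 2]]
```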

09:51.500 --> 09:56.460
We now use that idea to give our first two tools. These apply more generally, but we demonstrate

09:56.460 --> 10:01.820
them using Euclidean projections. Suppose we already know the projection xi of a point yi,

10:01.820 --> 10:06.540
and we wish to find the projection xt of point yt, given that yt is close to yi.

10:07.660 --> 10:11.980
The non-expansiveness of projection implies that the gradients at xi and xt are also close,

10:11.980 --> 10:15.820
and therefore we can infer some tight sets at xt even before solving.

10:16.940 --> 10:20.620
Suppose we start computing the projection of yt using an iterative algorithm.

10:20.620 --> 10:26.860
We now use the iterates zk that converge to xt. An iterate zk that is close to xt also has a

10:26.860 --> 10:32.780
gradient that is close to the gradient at xt, and once again we can infer some tight sets at xt

10:32.780 --> 10:39.740
as we approach the optimal. We also conducted an experiment to show that tool T1 can recover

10:39.740 --> 10:44.300
most tight sets from previous projections. We now give two tools that help us round an

10:44.300 --> 10:49.260
approximate solution to the exact projection. First is our tool T3, called Relax.

10:49.260 --> 10:53.500
We give a heuristic to check if we have already found all the tight sets at the optimal.

10:55.020 --> 10:59.660
We also show that we can round combinatorially when we know the function f to be integral,

10:59.660 --> 11:04.140
and an iterate zt is close enough to the optimal xt. This is our tool T4.

11:05.900 --> 11:10.620
Our fifth tool: we can reuse previously known vertices of the polytope. Suppose that our optimal is xt,

11:10.620 --> 11:15.660
and we are given a close by point xi as a convex combination of some vertices in the polytope.

11:15.660 --> 11:21.980
We can use those vertices to warm start the search for xt. Now our sixth tool, Restrict.

11:23.580 --> 11:27.260
Once we know a few tight sets for xt using our inference tools T1 and T2,

11:27.260 --> 11:32.540
we needn't search for the optimum over the whole base polytope. We can restrict ourselves to the

11:32.540 --> 11:38.380
face of the polytope that satisfies these constraints. We show that a simple extension

11:38.380 --> 11:42.300
of Edmonds' greedy algorithm provides a linear optimization (LO) oracle for each face of the polytope.
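
For reference, here is a minimal sketch of Edmonds' greedy algorithm for linear optimization over B(F), my own illustration; the face-restricted extension mentioned above would additionally place the elements of each inferred tight set first in the sorted order, which is not shown here.

```python
# Illustration (hedged): Edmonds' greedy algorithm maximizes <w, x> over B(F)
# by sorting elements in decreasing weight and assigning marginal gains.
import numpy as np

def edmonds_greedy(w, F, n):
    order = np.argsort(-np.asarray(w))     # elements in decreasing weight
    x, S = np.zeros(n), []
    for e in order:
        x[e] = F(S + [e]) - F(S)           # marginal gain of adding element e
        S.append(e)
    return x

# Permutahedron: F(S) = g(|S|), g(k) = sum of the k largest values among 1..n
n = 3
F = lambda S: sum(range(n, n - len(S), -1))
print(edmonds_greedy([0.3, 0.9, 0.1], F, n))   # [2. 3. 1.]
```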

11:42.300 --> 11:46.300
We now bring together these tools and apply them to the Away-Step Frank-Wolfe algorithm,

11:46.300 --> 11:50.700
giving the algorithm we dub Adaptive Away-Step Frank-Wolfe, or A2FW for short.

11:51.660 --> 11:57.340
First, warm start A2FW using tight sets for the optimal inferred from previous projected points,

11:57.340 --> 12:01.660
and active sets from previous projected points. While the algorithm runs and generates new

12:01.660 --> 12:05.660
iterates, it keeps inferring new tight sets for the optimal point using these iterates.

12:05.660 --> 12:09.580
In each iteration, if a new set has been found, the algorithm checks if all tight sets have been

12:09.580 --> 12:16.300
found. If indeed so, then stop and output the exact solution. Otherwise, simply restrict the

12:16.300 --> 12:22.060
problem to a low-dimensional face and keep going. Note that the linear optimization is over a

12:22.060 --> 12:27.420
restricted face of the polytope; a runnable sketch of the loop follows.
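
Below is a compact runnable illustration of my own, not the authors' implementation: a plain away-step Frank-Wolfe loop on the probability simplex, where a feasible point's coordinates double as its vertex weights, with comments marking where A2FW's infer, restrict, and round steps plug in.

```python
# Illustration (hedged): away-step Frank-Wolfe on the probability simplex.
import numpy as np

def afw_simplex(grad, z, steps=200):
    n = len(z)
    for t in range(steps):
        g = grad(z)
        i_fw = int(np.argmin(g))                    # Frank-Wolfe vertex e_i
        # A2FW: infer tight sets from z here; if the whole optimal face is
        # identified, round and stop; otherwise restrict the LO step above
        # to the inferred face.
        support = np.where(z > 1e-12)[0]
        i_aw = int(support[np.argmax(g[support])])  # away vertex in the support
        e_fw, e_aw = np.eye(n)[i_fw], np.eye(n)[i_aw]
        if g @ (z - e_fw) >= g @ (e_aw - z):
            d, gmax = e_fw - z, 1.0                 # Frank-Wolfe step
        else:
            d, gmax = z - e_aw, z[i_aw] / (1.0 - z[i_aw] + 1e-12)  # away step
        z = z + min(gmax, 2.0 / (t + 2.0)) * d      # simple open-loop step size
    return z

# Euclidean projection of a hypothetical y onto the simplex: min ||x - y||^2
y = np.array([1.0, 0.5, 0.1])
print(afw_simplex(lambda x: 2 * (x - y), np.ones(3) / 3))  # ~ [0.75, 0.25, 0.0]
```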

12:27.420 --> 12:32.860
Let's see an example. Suppose we are optimizing over the polytope P. We look for

12:32.860 --> 12:37.500
the best Frank-Wolfe vertex and the best away vertex. Since the direction opposite to the away

12:37.500 --> 12:44.060
vertex is the better direction to move in, we find the next iterate zt+1. Now, zt+1 is

12:44.060 --> 12:50.540
close enough to x star that it allows us to detect another tight set and move to the new face F_new.

12:50.540 --> 12:56.140
One way to do that is to round to an arbitrary vertex in F_new using our LO oracle. Another

12:56.140 --> 13:00.700
option is to relax to F_new and see if the solution obtained is feasible. If the feasibility

13:00.700 --> 13:06.300
check fails, we return to the previous strategy. Eventually, we reach the optimal

13:06.300 --> 13:11.660
x star either way. We give this theorem about the primal gap for the modified algorithm.

13:11.660 --> 13:16.460
The function h is L-smooth and mu-strongly convex, and D refers to the diameter of B(F).

13:17.100 --> 13:22.940
Notice how this compares to the AFW algorithm. When we restrict to a face F of B(F), our guarantee

13:22.940 --> 13:29.020
depends only on the pyramidal width of F instead of the pyramidal width of B(F). This pyramidal width

13:29.020 --> 13:33.420
can be much lower for the restricted face. For instance, it depends on the dimension of the face

13:33.420 --> 13:40.620
for the probability simplex. Therefore, A2FW leads to faster convergence. We now show the

13:40.620 --> 13:46.460
effectiveness of our toolkit and the A2FW algorithm using experiments. For our computations,

13:46.460 --> 13:50.940
we simulate an online recommendation system where we are learning over rankings of items

13:50.940 --> 13:55.660
displayed to users. Our loss functions are stochastic and model click-through rates. This

13:55.660 --> 14:01.340
can be seen as optimization over the permutahedron. We use online mirror descent which performs

14:01.340 --> 14:07.020
iterative projections and uses Away-Step Frank-Wolfe (AFW) for these projections. We benchmark the

14:07.020 --> 14:14.780
original AFW algorithm against variants modified by our tools. We report significant improvement

14:14.780 --> 14:19.820
in both runtime and the number of AFW iterations. The green line stands for OMD with the original

14:19.820 --> 14:27.180
unoptimized AFW. The yellow line stands for OMD with the A2FW algorithm. We do note that both OMD-PAV,

14:27.180 --> 14:31.900
that is, OMD with projections using the pool adjacent violators algorithm, and OFW (online Frank-Wolfe) were

14:31.900 --> 14:38.140
significantly faster than OMD with any AFW variant. However, OFW does not lead to optimal

14:38.140 --> 14:43.420
regret rates, while OMD-PAV works only for cardinality-based submodular polytopes. To

14:43.420 --> 14:48.140
conclude, we studied iterative projections for prevalent submodular base polytopes. We presented

14:48.140 --> 14:53.020
an algorithm for cardinality-based polytopes. For general polytopes, we developed a combinatorial

14:53.020 --> 14:58.380
toolkit to speed up iterative projections and applied it to the AFW algorithm and computationally

14:58.380 --> 15:18.540
showed that our algorithm is orders of magnitude faster than the original AFW variant.