WEBVTT
00:00.000 --> 00:13.040
Hello, I'm Hassan Mortagy. I'm a PhD student at Georgia Tech. Along with my collaborator
00:13.040 --> 00:16.880
Jai Moondra, we will present our work on reusing combinatorial structure: faster projections
00:16.880 --> 00:20.260
over submodular base polytopes. This is joint work with Swati Gupta.
00:20.260 --> 00:24.220
In this talk, we consider a sequence of similar structured optimization problems, a setup often
00:24.220 --> 00:28.220
encountered in practice. We start with our main problem of minimizing a convex function
00:28.220 --> 00:32.260
over a decision set P. At the next time step, this problem sees some perturbation and we
00:32.260 --> 00:36.700
obtain another similar problem, and so on. An example of this setup is the case of iterative
00:36.700 --> 00:40.340
projections, where at each time step we compute the projection of a new point
00:40.340 --> 00:44.860
y_t that is close to previously projected points y_i. These iterative projections form a key
00:44.860 --> 00:48.140
step in many online learning algorithms, and they are currently solved from scratch every
00:48.140 --> 00:51.900
iteration. They are not viewed in the context of an iterative environment where previously
00:51.900 --> 00:55.140
computed projections can be exploited to speed up subsequent ones.
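NOTE
[Editor's illustration] A minimal sketch, in Python, of the iterative-projection setup just described: an online-mirror-descent-style loop in which the projection of each new, nearby point is solved from scratch. The project and loss_grad routines are placeholders I am assuming, not the authors' code.
def online_mirror_descent(project, loss_grad, y0, eta, T):
    """Each round projects a new point that is typically close to the last one."""
    x = project(y0)          # initial feasible point
    for t in range(T):
        g = loss_grad(x, t)  # gradient of the round-t loss
        y = x - eta * g      # unconstrained gradient step
        x = project(y)       # expensive: ignores all previous projections
    return x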
00:55.140 --> 00:59.500
Thus, in this talk, we ask, is it possible to speed up similar iterative optimization
00:59.500 --> 01:03.660
problems by reusing structural information from previous minimizers?
01:03.660 --> 01:07.580
Let me now give you some more details about our setup. Here is a table that summarizes
01:07.580 --> 01:11.180
various widespread first-order optimization algorithms. The first two algorithms are conditional
01:11.180 --> 01:16.660
gradient variants and they only solve linear optimization every iteration. Their convergence
01:16.660 --> 01:21.140
rates depend on the dimension of the problem and on geometric constants for the underlying
01:21.140 --> 01:25.140
decision set, such as the pyramidal width for the away-step Frank-Wolfe variant given in
01:25.140 --> 01:28.340
the second row. On the other hand, the remaining three algorithms
01:28.340 --> 01:32.580
are projection-based algorithms that compute a projection every iteration. Their
01:32.580 --> 01:37.660
convergence rates, however, are optimal in the sense that they rely only on the condition
01:37.660 --> 01:43.260
number of the function and they are dimension-independent. Further, to capture a wide range of combinatorial
01:43.260 --> 01:47.980
sets, we consider the case where the decision set P is given by a submodular polytope, and
01:47.980 --> 01:53.380
the challenge is that these polytopes have an exponential number of constraints. Thus,
01:53.380 --> 01:58.020
computing a projection over those polytopes is a big computational bottleneck in projection-based
01:58.020 --> 02:03.820
algorithms. Motivated by this trade-off between convergence rates and runtime, we further
02:03.820 --> 02:08.740
ask, is it possible to speed up iterative projections over submodular polytopes by reusing
02:08.740 --> 02:14.300
structural information from previous minimizers? I'm now going to give more background on
02:14.300 --> 02:18.500
the problem and submodularity, and a review of first-order methods. So, as mentioned, we
02:18.500 --> 02:22.220
assume that the combinatorial structure in a problem is given by a submodular function.
02:22.220 --> 02:26.780
A set function F, defined over a ground set E of n elements, is submodular if it satisfies
02:26.780 --> 02:33.260
the following property: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) for all subsets A, B of E.
02:33.260 --> 02:38.060
Furthermore, the base polytope B(F) associated with F is defined as the following system of
linear inequalities: x(S) ≤ F(S) for all S ⊆ E, with x(E) = F(E).
02:38.060 --> 02:41.500
Here we see that B(F) is modeled using an exponential number of constraints, because we have
a constraint for each subset of the ground set.
02:41.500 --> 02:46.860
An example is the permutahedron, a polytope whose vertices are permutations
02:46.860 --> 02:51.740
of 1 through n. And here we have an example in the slide for when n is equal to 3. These
02:51.740 --> 02:58.220
polytopes are extensively used in online learning over rankings of items. A special class of
02:58.220 --> 03:02.140
submodular functions is known as cardinality-based functions, and a cardinality-based function
03:02.140 --> 03:07.620
F is defined as F(S) = g(|S|), where g is a concave function. And here
03:07.620 --> 03:10.940
we have another table that summarizes various machine and online learning problems and the
03:10.940 --> 03:14.380
submodular set functions that give rise to them. We see the permutahedron in the second
03:14.380 --> 03:19.180
row of this table, and it is in fact a cardinality-based polytope. Other non-cardinality-based examples
03:19.180 --> 03:22.220
include spanning trees and independent sets of matroids.
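NOTE
[Editor's sketch] A small Python illustration of a cardinality-based submodular function F(S) = g(|S|), together with a brute-force check of the submodularity inequality stated earlier. The particular concave g is an arbitrary choice for illustration.
from itertools import combinations
def make_cardinality_based(g):
    """F(S) = g(|S|); F is submodular whenever g is concave."""
    return lambda S: g(len(S))
def is_submodular(F, ground):
    """Check F(A) + F(B) >= F(A | B) + F(A & B) for all pairs A, B."""
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    return all(F(A) + F(B) >= F(A | B) + F(A & B)
               for A in subsets for B in subsets)
F = make_cardinality_based(lambda k: k ** 0.5)  # g concave
print(is_submodular(F, range(4)))               # True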
03:24.060 --> 03:28.220
So let's go back to our main problem of minimizing a convex function over the base polytope.
03:28.220 --> 03:32.620
So there typically exist three main paradigms to solve this problem. The first is a class
03:32.620 --> 03:37.020
of methods, known as conditional gradient methods, and as I mentioned before, those
03:37.020 --> 03:42.620
assume access to B(F) via a linear optimization oracle. These methods are specifically
03:42.620 --> 03:46.780
advantageous for base polytopes because linear optimization over base polytopes can be
03:46.780 --> 03:51.180
done very efficiently using Edmonds' greedy algorithm.
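NOTE
[Editor's sketch] Edmonds' greedy algorithm for linear maximization over B(F), assuming F is available as a Python callable on frozensets; an illustrative version, not the authors' implementation.
def edmonds_greedy(F, ground_set, w):
    """Maximize <w, x> over B(F): sort elements by decreasing weight and
    give each element its marginal value along the resulting chain."""
    x, S = {}, frozenset()
    for e in sorted(ground_set, key=lambda e: -w[e]):
        x[e] = F(S | {e}) - F(S)  # marginal gain of adding e
        S = S | {e}
    return x
# Example: permutahedron for n = 3, i.e. F(S) = g(|S|) with g(k) = n + ... + (n-k+1).
g = lambda k: sum(range(3 - k + 1, 3 + 1))
print(edmonds_greedy(lambda S: g(len(S)), range(3), {0: 0.5, 1: 2.0, 2: 1.0}))
# {1: 3, 2: 2, 0: 1} -- a vertex of the permutahedron (a permutation of 1..3)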
03:51.180 --> 03:55.500
The second class of methods are mirror descent variants, which compute a projection every iteration to ensure feasibility.
03:56.060 --> 03:59.980
And again, as I also previously mentioned, although those methods have optimal convergence
03:59.980 --> 04:05.100
rates and are robust, they have remained largely theoretical due to being computationally
04:05.100 --> 04:10.060
expensive. The third class of methods are combinatorial algorithms specifically tailored
04:10.060 --> 04:15.820
for convex optimization over submodular base polytopes. Those algorithms instead require
04:15.820 --> 04:20.860
solving a submodular function minimization problem every iteration, which again can be
04:20.860 --> 04:25.020
very expensive. However, those algorithms enjoy the nice property of returning an exact
04:25.020 --> 04:30.700
optimal solution. In this talk, we will focus on bridging the efficiency of CG methods and
04:30.700 --> 04:35.420
the structural properties and exactness of combinatorial algorithms to speed up iterative
04:35.420 --> 04:40.780
projections appearing in mirror descent and beyond. So first, let's consider the simpler
04:40.780 --> 04:44.380
case when our polytope is cardinality-based. So here we have a cardinality-based submodular
04:44.380 --> 04:48.940
function F, and for notation we define this vector c to be the vector of discrete derivatives
04:48.940 --> 04:53.340
of the concave function g. We now give the following duality result, which states that
04:53.340 --> 04:57.180
the problem of computing a Bregman projection over a cardinality-based polytope is dual
04:57.180 --> 05:02.860
to isotonic optimization. Although our results hold for general Bregman projections, we will
05:02.860 --> 05:08.620
focus on the case of Euclidean projections for simplicity. To that end, consider a vector
05:08.620 --> 05:11.980
y whose Euclidean projection onto a cardinality-based polytope we wish to compute,
05:11.980 --> 05:17.340
and let e1 through en be an ordering of the ground set such that y is decreasing. In this
05:17.340 --> 05:21.580
case, we have the following primal problem, and the dual to that is the following isotonic
05:21.580 --> 05:27.820
regression problem. And further, we can map between the two problems using the following identity here.
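NOTE
[Editor's reconstruction] In LaTeX, my reading of the primal/dual pair and the mapping identity referred to above, consistent with the worked example that follows; here c_i = g(i) - g(i-1) are the discrete derivatives of g:
\min_{x \in B(F)} \tfrac12 \|x - y\|^2
\quad\Longleftrightarrow\quad
\min_{z \in \mathbb{R}^n} \tfrac12 \|z - (c - y)\|^2
\ \text{ s.t. }\ z_{e_1} \le z_{e_2} \le \dots \le z_{e_n},
% where y_{e_1} >= ... >= y_{e_n}, and the optima are related by x* = y + z*.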
05:29.580 --> 05:32.860
So just to give you some historical context, previously the best known running time for
05:32.860 --> 05:37.500
projections was O(n^2), using a primal algorithm by Gupta et al. Later on in that
05:37.500 --> 05:41.180
year, Lim and Wright used the same duality approach to compute projections over the permutahedron,
05:41.180 --> 05:45.020
and we extended their approach to general cardinality-based polytopes. Now the dual
05:45.020 --> 05:49.340
isotonic regression problem can be solved in O(n) time using a simple algorithm called
05:49.340 --> 05:53.900
the pool adjacent violators (PAV) algorithm, and this gives us an O(n log n) algorithm by
05:53.900 --> 05:59.180
solving the problem in the dual space and mapping it back to the primal space. And this is currently
05:59.180 --> 06:04.060
the fastest known algorithm. The key takeaway is that projections over these polytopes
06:04.060 --> 06:09.420
can be computed very efficiently. In fact, computing a projection and solving linear optimization
06:09.420 --> 06:15.260
have the same running time.
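NOTE
[Editor's sketch] A compact pool adjacent violators (PAV) routine plus the dual-to-primal map x* = y + z*, assuming Euclidean projection onto a cardinality-based polytope described by its discrete-derivative vector c; a sketch, not the authors' code.
import numpy as np
def pav(v):
    """Isotonic regression: argmin ||z - v||^2 s.t. z nondecreasing (O(n))."""
    blocks = []  # each block holds [mean, length]
    for val in v:
        blocks.append([val, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, l2 = blocks.pop()
            m1, l1 = blocks.pop()
            blocks.append([(m1 * l1 + m2 * l2) / (l1 + l2), l1 + l2])
    return np.concatenate([np.full(l, m) for m, l in blocks])
def project_cardinality(y, c):
    """O(n log n): sort y decreasing, solve the dual by PAV, map back."""
    order = np.argsort(-y)
    z = pav(c - y[order])        # dual isotonic regression
    x = np.empty_like(y)
    x[order] = y[order] + z      # identity x* = y + z*
    return x
# Probability simplex: c = (1, 0, 0), matching the example that follows.
print(project_cardinality(np.array([0.9, 0.05, 0.02]), np.array([1.0, 0.0, 0.0])))
# [0.91 0.06 0.03]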
06:15.260 --> 06:20.060
Now let's demonstrate the result with an example. We are going to project a vector y onto the probability simplex, which is modeled
06:20.060 --> 06:24.940
by the cardinality-based submodular function given on the slide. We see that y is already
06:24.940 --> 06:29.900
ordered for simplicity and c is the vector of discrete derivatives. Now the algorithm will
06:29.900 --> 06:35.260
proceed as follows. It initializes the dual iterates by the vector that we're trying to
06:35.260 --> 06:39.980
compute the isotonic regression for, c − y, and here we have an adjacent violation because the
06:39.980 --> 06:45.180
second coordinate is strictly smaller than the first coordinate. Now the algorithm will basically
06:45.180 --> 06:49.740
average those two coordinates to obtain the following solution z*, and here we see that
06:49.740 --> 06:54.060
the ordering constraints are satisfied and z* is in fact the dual optimal. Next, it will map it
06:54.060 --> 06:58.700
back to a primal optimal. Let's now go back to the figure from the previous slide, which compares
06:58.700 --> 07:04.060
a basic linear regression fit with an isotonic regression fit. Here in the red stepwise curve,
07:04.060 --> 07:08.060
the intervals where the curve remains flat are where a block of consecutive adjacent violating
07:08.060 --> 07:12.460
points is averaged, similar to our example. This very efficient algorithm for computing
07:12.460 --> 07:15.820
Bregman projections over cardinality-based polytopes unfortunately does not extend to
07:15.820 --> 07:20.460
general submodular base polytopes. And now my collaborator Jai will present different combinatorial
07:20.460 --> 07:25.100
strategies for dealing with those polytopes. We now describe our toolkit for speeding up
07:25.100 --> 07:31.100
projections on general submodular base polytopes. There are two basic objects that we can learn from.
07:31.100 --> 07:35.900
First, given projections of previous points, can we do better than computing a new projection from
07:35.900 --> 07:41.820
scratch? Second, given an iterative algorithm to compute a projection, can we use the combinatorial
07:41.820 --> 07:46.780
structure present in the sequence of iterates to speed up the algorithm and terminate it early?
07:49.180 --> 07:53.180
We have the well-known first-order optimality condition on the left. It helps us verify if a
07:53.180 --> 07:58.140
point is indeed optimal. This check reduces to a linear optimization over the base polytope,
07:58.140 --> 08:03.740
which can be done using Edmonds' greedy algorithm. Here is an example. Suppose we know the gradient
08:03.740 --> 08:08.780
at a point x* and want to check if x* is indeed optimal. We look at the distinct values
08:08.780 --> 08:13.900
of the partial derivatives at x* and arrange them in increasing order. Each time we see a
08:13.900 --> 08:19.580
gap in this order, we require that the value of x* on the prefix set equal the submodular function
08:19.580 --> 08:26.380
value on that set. In the figure, the first such gap is after we have seen e1 and e5. Therefore,
08:26.380 --> 08:35.900
x*(S1) must equal F(S1). Similarly, x*(S2) must equal F(S2). Finally, x*(E) must equal F(E).
08:35.900 --> 08:42.940
These sets S1, S2, and E are called tight sets at x* and define the face containing the point
08:42.940 --> 08:49.100
x*. This leads us to two interesting observations that we use later. One, if we know precisely
08:49.100 --> 08:53.980
what the tight sets are at the optimal point, we can calculate the optimal point for all
08:53.980 --> 08:59.100
suitable functions h. Two, knowing the gradient at the optimal point gives us these
08:59.100 --> 09:06.220
tight sets. We give an example using our combinatorial idea. Suppose we know a point
09:06.220 --> 09:11.420
z_k that is close to our optimal x*. If the function is smooth, this implies the gradients at z_k
09:11.420 --> 09:16.540
and x* are close. This gives us a way to learn some tight sets defining the optimal face.
09:17.260 --> 09:21.740
In the example, for each coordinate, the blue line in the middle represents the partial derivative
09:21.740 --> 09:26.700
value at z_k and the blue shade represents the possible variation in that value for the optimal
09:26.700 --> 09:31.500
point x star. That is, the corresponding partial derivative for x star lies in the shaded interval.
09:32.140 --> 09:36.860
The largest values in these intervals for e1 and e5 are lower than the smallest values in the
09:36.860 --> 09:44.300
intervals for every other element. This helps us conclude that the set {e1, e5}, that is S1,
09:44.300 --> 09:50.540
is a tight set at x*. Similarly, we infer that S2 is also a tight set at x*.
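NOTE
[Editor's sketch] The gap argument above, in Python: if the coordinate-wise gradient error at x* is at most eps, any gap wider than 2*eps in the sorted partial derivatives certifies that the preceding prefix is a tight set. This is my illustrative reading of the tight-set inference used by the tools described next, not the authors' implementation.
import numpy as np
def infer_tight_sets(grad_zk, eps):
    """Return index prefixes that must be tight sets at x*."""
    order = np.argsort(grad_zk)       # increasing partial derivatives
    vals = grad_zk[order]
    tight = []
    for i in range(len(vals) - 1):
        if vals[i + 1] - vals[i] > 2 * eps:   # intervals cannot overlap
            tight.append(set(map(int, order[: i + 1])))
    return tight
# Toy example mimicking the slide: e1 and e5 sit well below the rest.
print(infer_tight_sets(np.array([0.1, 0.9, 1.0, 0.95, 0.12]), eps=0.05))
# [{0, 4}]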
09:51.500 --> 09:56.460
We now use that idea to give our first two tools. These apply more generally, but we demonstrate
09:56.460 --> 10:01.820
them using Euclidean projections. Suppose we already know the projection x_i of a point y_i,
10:01.820 --> 10:06.540
and we wish to find the projection x_t of a point y_t, given that y_t is close to y_i.
10:07.660 --> 10:11.980
The non-expansiveness of projection implies that the gradients at x_i and x_t are also close,
10:11.980 --> 10:15.820
and therefore we can infer some tight sets at x_t even before solving.
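NOTE
[Editor's note] The reasoning behind this step for the Euclidean case, written out in LaTeX (my paraphrase): for h_t(x) = 0.5*||x - y_t||^2, the gradient at the unknown optimum is x_t - y_t, and the known x_i gives a good surrogate because
\|(x_i - y_t) - \nabla h_t(x_t)\| = \|x_i - x_t\| \le \|y_i - y_t\| ,
% by non-expansiveness of the projection operator. So eps = ||y_i - y_t||
% bounds the shaded-interval width used in the gap argument above.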
10:16.940 --> 10:20.620
Suppose we start computing the projection of y_t using an iterative algorithm.
10:20.620 --> 10:26.860
We now use the iterates z_i that converge to x_t. An iterate z_t that is close to x_t also has a
10:26.860 --> 10:32.780
gradient that is close to the gradient at x_t, and once again we can infer some tight sets at x_t
10:32.780 --> 10:39.740
as we approach the optimal. We also conducted an experiment to show that tool T1 can recover
10:39.740 --> 10:44.300
most tight sets from previous projections. We now give two tools that help us round an
10:44.300 --> 10:49.260
approximate solution exactly to the projection. First is our tool T3, called Relax.
10:49.260 --> 10:53.500
We give a heuristic to check if we have already found all the tight sets at the optimal.
10:55.020 --> 10:59.660
We also show that we can round combinatorially when we know the function F to be integral
10:59.660 --> 11:04.140
and an iterate z_t is close enough to the optimal x_t. This is our tool T4.
11:05.900 --> 11:10.620
Our fifth tool: we can reuse previously known vertices of the polytope. Suppose that our optimal is x_t,
11:10.620 --> 11:15.660
and we are given a close-by point x_i as a convex combination of some vertices of the polytope.
11:15.660 --> 11:21.980
We can use those vertices to warm-start the search for x_t. Now, our sixth tool: Restrict.
11:23.580 --> 11:27.260
Once we know a few tight sets for x_t using our inference tools T1 and T2,
11:27.260 --> 11:32.540
we needn't search for the optimal over the whole base polytope. We can restrict ourselves to the
11:32.540 --> 11:38.380
face of the polytope that satisfies these constraints. We show that a simple extension
11:38.380 --> 11:42.300
of Edmonds' greedy algorithm provides an LO (linear optimization) oracle for each face of the polytope.
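NOTE
[Editor's sketch] One natural reading of that extension, in Python: given a chain of inferred tight sets S_1 ⊂ S_2 ⊂ ... ⊂ E, run the greedy order block by block, so that every S_j is a prefix of the order and is therefore forced tight. Assumes the tight sets form a chain; illustrative, not the authors' code.
def greedy_on_face(F, chain, w):
    """LO oracle over the face of B(F) on which every set in `chain` is tight."""
    x, S, prev = {}, frozenset(), set()
    for Sj in chain:  # chain must end with the full ground set
        for e in sorted(Sj - prev, key=lambda e: -w[e]):
            x[e] = F(S | {e}) - F(S)  # marginal gain, as in plain greedy
            S = S | {e}
        prev = Sj
    return x
# Permutahedron face (n = 3) where S1 = {0, 2} is tight:
g = lambda k: sum(range(3 - k + 1, 3 + 1))
v = greedy_on_face(lambda A: g(len(A)), [{0, 2}, {0, 1, 2}], {0: 1.0, 1: 5.0, 2: 0.2})
print(v, sum(v[e] for e in (0, 2)) == g(2))  # x(S1) = F(S1): tight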
11:42.300 --> 11:46.300
We now bring together these tools and apply them to the away-step Frank-Wolfe algorithm,
11:46.300 --> 11:50.700
giving the algorithm we dub adaptive away-step Frank-Wolfe, or A2FW for short.
11:51.660 --> 11:57.340
First, we warm-start A2FW using tight sets for the optimal inferred from previously projected points,
11:57.340 --> 12:01.660
and active sets from previous projections. While the algorithm runs and generates new
12:01.660 --> 12:05.660
iterates, it keeps inferring new tight sets for the optimal point using these iterates.
12:05.660 --> 12:09.580
In each iteration, if a new tight set has been found, the algorithm checks whether all tight sets have been
12:09.580 --> 12:16.300
found. If so, it stops and outputs the exact solution. Otherwise, it simply restricts the
12:16.300 --> 12:22.060
problem to a lower-dimensional face and keeps going. Note that the linear optimization is now over a
12:22.060 --> 12:27.420
restricted face of the polytope.
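NOTE
[Editor's sketch] A runnable away-step Frank-Wolfe loop for Euclidean projection, with comments marking where A2FW would add its adaptive steps; a schematic reading of the description above, not the authors' implementation. Warm-starting with a previous projection's active set (tool T5) amounts to passing those vertices and weights instead of a single v0.
import numpy as np
def afw_project(y, v0, lo_oracle, T=500, tol=1e-10):
    """min 0.5*||x - y||^2 via away-step FW; lo_oracle(w) maximizes <w, v>."""
    active = {tuple(v0): 1.0}            # active set: vertex -> convex weight
    x = np.array(v0, dtype=float)
    for _ in range(T):
        grad = x - y
        s = np.array(lo_oracle(-grad))   # FW vertex (A2FW: oracle on restricted face, T6)
        a = max(active, key=lambda v: np.dot(grad, v))  # away vertex
        if np.dot(grad, x - s) < tol:    # FW gap ~ 0: done
            break                        # (A2FW: also stop early once all tight sets found, T3)
        if np.dot(grad, x - s) >= np.dot(grad, np.array(a) - x):
            d, gmax, fw = s - x, 1.0, True               # FW direction
        else:
            lam = active[a]
            d, gmax, fw = x - np.array(a), lam / (1.0 - lam), False  # away direction
        gamma = min(gmax, max(0.0, -np.dot(grad, d) / np.dot(d, d)))  # exact line search
        x = x + gamma * d
        if fw:                           # shift weight toward s
            active = {v: (1 - gamma) * wt for v, wt in active.items()}
            active[tuple(s)] = active.get(tuple(s), 0.0) + gamma
        else:                            # shift weight away from a
            active = {v: (1 + gamma) * wt for v, wt in active.items()}
            active[a] -= gamma
            if active[a] <= 1e-12:
                del active[a]
        # (A2FW: infer new tight sets from the iterate x here, tools T1/T2)
    return x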
12:27.420 --> 12:32.860
Let's see an example. Suppose we are optimizing over the polytope P. We look for the best
12:32.860 --> 12:37.500
Frank-Wolfe vertex and the best away vertex. Since the direction opposite to the away
12:37.500 --> 12:44.060
vertex is the better direction to move in, we take an away step and find the next iterate z_{t+1}. Now, z_{t+1} is
12:44.060 --> 12:50.540
close enough to x* that it allows us to detect another tight set and move to the face F_new.
12:50.540 --> 12:56.140
One way to do that is to round to an arbitrary vertex of F_new using our LO oracle. Another
12:56.140 --> 13:00.700
option is to relax to F_new and see if the solution obtained is feasible. If the feasibility
13:00.700 --> 13:06.300
check is inconclusive, we return to the previous strategy. Eventually, we reach the optimal
13:06.300 --> 13:11.660
x* either way. We give this theorem about the primal gap for the modified algorithm.
13:11.660 --> 13:16.460
The function h is L-smooth and μ-strongly convex, and D refers to the diameter of B(F).
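NOTE
[Editor's reconstruction] For reference, the away-step Frank-Wolfe primal-gap contraction of Lacoste-Julien and Jaggi has the form below; as I read the theorem above, A2FW's guarantee has the same shape with the pyramidal width of B(F) replaced by that of the restricted face (the exact constants in the authors' statement may differ):
h(z_{t+1}) - h(x^*) \;\le\; \Bigl(1 - \tfrac{\mu}{4L}\bigl(\tfrac{\delta}{D}\bigr)^{2}\Bigr)\bigl(h(z_t) - h(x^*)\bigr),
% \delta = pyramidal width (of B(F) for AFW; of the restricted face for A2FW), D = diameter.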
13:17.100 --> 13:22.940
Notice how this compares to the AFW algorithm. When we restrict to a face F of B(F), our guarantee
13:22.940 --> 13:29.020
depends only on the pyramidal width of F instead of the pyramidal width of B(F). This pyramidal width
13:29.020 --> 13:33.420
can be much larger for the restricted face. For instance, it depends on the dimension of the face
13:33.420 --> 13:40.620
for the probability simplex. Therefore, A2FW leads to a faster convergence. We now show the
13:40.620 --> 13:46.460
effectiveness of our toolkit and the A2FW algorithm using experiments. For our computations,
13:46.460 --> 13:50.940
we simulate an online recommendation system where we are learning over rankings of items
13:50.940 --> 13:55.660
displayed to users. Our loss functions are stochastic and model click-through rates. This
13:55.660 --> 14:01.340
can be seen as optimization over the permutahedron. We use online mirror descent which performs
14:01.340 --> 14:07.020
iterative projections and uses away-step Frank-Wolfe for these projections. We benchmark the
14:07.020 --> 14:14.780
original AFW algorithm against variants modified by our tools. We report significant improvement
14:14.780 --> 14:19.820
in both runtime and the number of AFW iterations. The green line stands for OMD with the original
14:19.820 --> 14:27.180
unoptimized AFW. The yellow line stands for OMD with the A2FW algorithm. We do note that both OMD-PAV,
14:27.180 --> 14:31.900
that is, OMD with projections computed using the pool adjacent violators algorithm, and OFW were
14:31.900 --> 14:38.140
significantly faster than OMD with any AFW variant. However, OFW does not lead to optimal
14:38.140 --> 14:43.420
regret rates, while OMD-PAV works only for cardinality-based submodular polytopes. To
14:43.420 --> 14:48.140
conclude, we studied iterative projections for prevalent submodular base polytopes. We presented
14:48.140 --> 14:53.020
an algorithm for cardinality-based polytopes. For general polytopes, we developed a combinatorial
14:53.020 --> 14:58.380
toolkit to speed up iterative projections and applied it to the AFW algorithm and computationally
14:58.380 --> 15:18.540
showed that our algorithm is orders of magnitude faster than the original AFW variant.