WEBVTT
00:00.000 --> 00:13.040
Hello, I'm Hassan Mortagy. I'm a PhD student at Georgia Tech. Along with my collaborator
00:13.040 --> 00:16.880
Jai Moondra, we will present our work on reusing combinatorial structure: faster iterative projections
00:16.880 --> 00:20.260
over submodular base polytopes. This is joint work with Swati Gupta.
00:20.260 --> 00:24.220
In this talk, we consider a sequence of similar structured optimization problems, a setup often
00:24.220 --> 00:28.220
encountered in practice. We first start with our main problem of minimizing a convex function
00:28.220 --> 00:32.260
over a decision set P. At the next time step, this problem sees some perturbation and we
00:32.260 --> 00:36.700
obtain another similar problem, and so on. An example of this setup is the case of iterative
00:36.700 --> 00:40.340
projections where at each time step, we are computing the projection of a new point y
00:40.340 --> 00:44.860
t that is close to previously projected points y i. These iterative projections form a key
00:44.860 --> 00:48.140
step in many online learning algorithms and they are currently solved from scratch every
00:48.140 --> 00:51.900
iteration. They are not viewed in the context of an iterative environment where previously
00:51.900 --> 00:55.140
computed projections can be exploited to speed up subsequent ones.
00:55.140 --> 00:59.500
Thus, in this talk, we ask, is it possible to speed up similar iterative optimization
00:59.500 --> 01:03.660
problems by reusing structural information from previous minimizers?
01:03.660 --> 01:07.580
Let me now give you some more details about our setup. Here is a table that summarizes
01:07.580 --> 01:11.180
various widespread first-order optimization algorithms. The first two algorithms are conditional
01:11.180 --> 01:16.660
gradient variants and they only solve linear optimization every iteration. Their convergence
01:16.660 --> 01:21.140
rates depend on the dimension of the problem and on geometric constants for the underlying
01:21.140 --> 01:25.140
decision set, such as the pyramidal width for the away-step Frank-Wolfe variant given in
01:25.140 --> 01:28.340
the second row. On the other hand, the remaining three algorithms
01:28.340 --> 01:32.580
are projection-based algorithms that compute a projection every iteration. Their
01:32.580 --> 01:37.660
convergence rates, however, are optimal in the sense that they only rely on the condition
01:37.660 --> 01:43.260
number of the function and they are dimension-independent. Further, to capture a wide range of combinatorial
01:43.260 --> 01:47.980
sets, we consider the case where the decision set P is given by a submodular polytope, and
01:47.980 --> 01:53.380
the challenge is that these polytopes have an exponential number of constraints. Thus,
01:53.380 --> 01:58.020
computing a projection over those polytopes is a big computational bottleneck in projection-based
01:58.020 --> 02:03.820
algorithms. Motivated by this trade-off in convergence rates versus runtime, we further
02:03.820 --> 02:08.740
ask, is it possible to speed up iterative projections over submodular polytopes by reusing
02:08.740 --> 02:14.300
structural information from previous minimizers? I'm now going to give more background on
02:14.300 --> 02:18.500
the problem, submodularity, and a review of first-order methods. So, as mentioned, we
02:18.500 --> 02:22.220
assume that the combinatorial structure in a problem is given by a submodular function.
02:22.220 --> 02:26.780
A set function F, defined over a ground set E of n elements, is submodular if it satisfies
02:26.780 --> 02:33.260
the following diminishing-returns property: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) for all subsets A and B of E. Furthermore, the base polytope B(F) associated with F is defined as the
02:33.260 --> 02:38.060
following system of linear inequalities: x(S) ≤ F(S) for every subset S of E, with x(E) = F(E). Here we see that B(F) is modeled using an
02:38.060 --> 02:41.500
exponential number of constraints because we have a constraint for each subset of the
02:41.500 --> 02:46.860
ground set. An example is the permutahedron, a polytope whose vertices are permutations
02:46.860 --> 02:51.740
of 1 through n. And here we have an example in the slide for when n is equal to 3. These
02:51.740 --> 02:58.220
polytopes are extensively used in online learning over rankings of items. A special class of
02:58.220 --> 03:02.140
submodular polytopes arises from cardinality-based functions, and a cardinality-based function
03:02.140 --> 03:07.620
F is defined as F(S) = g(|S|), where g is a concave function. And here
03:07.620 --> 03:10.940
we have another table that summarizes various machine learning and online learning problems and the
03:10.940 --> 03:14.380
submodular set functions that give rise to them. We see the permutahedron in the second
03:14.380 --> 03:19.180
row of this table, and it is in fact a cardinality-based polytope. Other non-cardinality-based examples
03:19.180 --> 03:22.220
include spanning trees and independent sets of matroids.
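To make the definition concrete, here is a small Python sketch (our own illustrative code, not from the paper) that encodes two cardinality-based functions, the probability simplex and the permutahedron for n = 3, and spot-checks the submodularity inequality f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B) on random pairs of sets.

    import random

    n = 3
    E = set(range(1, n + 1))                  # ground set {1, ..., n}

    def g_simplex(k):
        return min(k, 1)                      # f(S) = min(|S|, 1): probability simplex

    def g_perm(k):
        return sum(range(n, n - k, -1))       # n + (n-1) + ...: permutahedron

    def f(S, g):                              # cardinality-based: f(S) = g(|S|)
        return g(len(S))

    random.seed(0)
    for _ in range(100):
        A = {e for e in E if random.random() < 0.5}
        B = {e for e in E if random.random() < 0.5}
        for g in (g_simplex, g_perm):
            assert f(A, g) + f(B, g) >= f(A | B, g) + f(A & B, g)
    print("submodularity spot-checks passed")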
03:24.060 --> 03:28.220
So let's go back to our main problem of minimizing a convex function over the base polytope.
03:28.220 --> 03:32.620
So there typically exist three main paradigms to solve this problem. The first is a class
03:32.620 --> 03:37.020
of methods, known as conditional gradient methods, and as I mentioned before, those
03:37.020 --> 03:42.620
assume access to B(F) via a linear optimization oracle. And these methods are specifically
03:42.620 --> 03:46.780
advantageous for base polytopes because linear optimization over base polytopes could be
03:46.780 --> 03:51.180
done very efficiently using Edmonds' greedy algorithm. The second class of methods are
03:51.180 --> 03:55.500
mirror descent variants, and those compute a projection every iteration to ensure feasibility.
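As a concrete aside on the linear optimization oracle just mentioned, here is a minimal Python sketch of Edmonds' greedy algorithm (our own illustrative code and naming, not the authors' implementation). To maximize <w, x> over B(f), visit elements in decreasing order of weight and give each element its marginal gain; visiting them in increasing order minimizes instead.

    def greedy_lo(w, f):
        """Edmonds' greedy: argmax of <w, x> over the base polytope B(f)."""
        order = sorted(w, key=w.get, reverse=True)   # decreasing weights
        x, prefix = {}, frozenset()
        for e in order:
            x[e] = f(prefix | {e}) - f(prefix)       # marginal value of e
            prefix = prefix | {e}
        return x

    # Example: permutahedron for n = 3, f(S) = sum of the |S| largest of {1, 2, 3}.
    f_perm = lambda S: sum([3, 2, 1][:len(S)])
    print(greedy_lo({'a': 0.3, 'b': 0.9, 'c': 0.5}, f_perm))  # {'b': 3, 'c': 2, 'a': 1}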
03:56.060 --> 03:59.980
And again, as I also previously mentioned, although those methods have optimal convergence
03:59.980 --> 04:05.100
rates and are robust, they have remained of a theoretical nature due to being computationally
04:05.100 --> 04:10.060
expensive. The third class of methods are combinatorial algorithms specifically tailored
04:10.060 --> 04:15.820
for convex optimization over submodular base polytopes. Those algorithms instead require
04:15.820 --> 04:20.860
solving a submodular function minimization problem every iteration, which again can be
04:20.860 --> 04:25.020
very expensive. However, those algorithms enjoy the nice property of returning an exact
04:25.020 --> 04:30.700
optimal solution. In this talk, we will focus on bridging the efficiency of CG methods and
04:30.700 --> 04:35.420
the structural properties and exactness of combinatorial algorithms to speed up iterative
04:35.420 --> 04:40.780
projections appearing in mirror descent and beyond. So first, let's consider the simpler
04:40.780 --> 04:44.380
case when our polytope is cardinality-based. So here we have a cardinality-based submodular
04:44.380 --> 04:48.940
function F, and for notation we define this vector c to be the vector of discrete derivatives
04:48.940 --> 04:53.340
of the concave function g. We now give the following duality result, which states that
04:53.340 --> 04:57.180
the problem of computing a Bregman projection over a cardinality-based polytope is dual
04:57.180 --> 05:02.860
to isotonic optimization. Although our results hold for general Bregman projections, we will
05:02.860 --> 05:08.620
focus on the case of Euclidean projections for simplicity. To that end, consider a vector
05:08.620 --> 05:11.980
y whose Euclidean projection onto a cardinality-based polytope we are trying to compute,
05:11.980 --> 05:17.340
and let e1 through en be an ordering of the ground set such that y is decreasing. In this
05:17.340 --> 05:21.580
case, we have the following primal problem, and the dual to that is the following isotonic
05:21.580 --> 05:27.820
regression problem. And further, we can map between the two problems using the following identity here.
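To make the duality concrete, here is a minimal Python sketch of the whole pipeline (our own code and variable names, assuming Euclidean projections and that y is already sorted in decreasing order): run pool adjacent violators on c minus y to get the dual optimum z, then map back to the primal via x = y + z.

    import numpy as np

    def pav(a):
        """Isotonic (non-decreasing) least-squares regression of a, via PAV."""
        blocks = []                                   # each block holds [sum, count]
        for v in a:
            blocks.append([v, 1])
            # merge adjacent blocks while their means violate the ordering
            while len(blocks) > 1 and \
                    blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
                s, c = blocks.pop()
                blocks[-1][0] += s
                blocks[-1][1] += c
        return np.concatenate([np.full(c, s / c) for s, c in blocks])

    def project_cardinality(y, g):
        """Euclidean projection of y (sorted decreasingly) onto B(f), f(S) = g(|S|)."""
        n = len(y)
        c = np.array([g(k + 1) - g(k) for k in range(n)])   # discrete derivatives of g
        z = pav(c - y)                                      # dual: isotonic regression
        return y + z                                        # map dual optimum to primal

    # Projecting y = (0.8, 0.6) onto the probability simplex, g(k) = min(k, 1):
    print(project_cardinality(np.array([0.8, 0.6]), lambda k: min(k, 1)))  # [0.6 0.4]

Sorting y first costs O(n log n) and dominates the linear-time PAV pass, which matches the running time discussed next.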
05:29.580 --> 05:32.860
So just to give you some historical context, previously the best known running time for
05:32.860 --> 05:37.500
projections was O(n²) using a primal algorithm by Gupta et al. Later on in that
05:37.500 --> 05:41.180
year, Lim and Wright used the same duality approach to compute projections over the permutahedron,
05:41.180 --> 05:45.020
and we extended their approach to general cardinality-based polytopes. Now the dual
05:45.020 --> 05:49.340
isotonic regression problem can be solved in O(n) time using a simple algorithm called
05:49.340 --> 05:53.900
the pool adjacent violators algorithm, and this basically gives us an O(n log n) algorithm by
05:53.900 --> 05:59.180
solving the problem in the dual space and mapping it back to the primal space. And this is currently
05:59.180 --> 06:04.060
the fastest known algorithm. And the key takeaway is that solving projections over these polytopes
06:04.060 --> 06:09.420
can be very efficiently done. In fact, computing a projection and solving linear optimization
06:09.420 --> 06:15.260
have the same running time. Now let's demonstrate our result with an example. So here we are going
06:15.260 --> 06:20.060
to project this vector y onto the probability simplex, and the probability simplex is modeled
06:20.060 --> 06:24.940
by this cardinality-based submodular function given here on the slide. And we see that y is already
06:24.940 --> 06:29.900
ordered for simplicity and c is the vector of discrete derivatives. Now the algorithm will
06:29.900 --> 06:35.260
proceed as follows. It initializes the dual iterate to the vector that we're trying to
06:35.260 --> 06:39.980
compute the isotonic regression for, namely c minus y, and here we have an adjacent violation because the
06:39.980 --> 06:45.180
second coordinate is strictly smaller than the first coordinate. Now the algorithm will basically
06:45.180 --> 06:49.740
average those two coordinates to obtain the following solution z star, and here we see that
06:49.740 --> 06:54.060
the ordering constraints are satisfied and z star is in fact the dual optimal. Next it will map it
06:54.060 --> 06:58.700
back to a primal optimal. And let's go back to this figure from the previous slide that just compares
06:58.700 --> 07:04.060
a basic linear regression fit with an isotonic regression fit. Here in the red stepwise curve,
07:04.060 --> 07:08.060
the points at which the curve remains flat are where a block of consecutive adjacent violating
07:08.060 --> 07:12.460
points is averaged, similar to our example. This very efficient algorithm for computing
07:12.460 --> 07:15.820
Bregman projections over cardinality-based polytopes unfortunately does not extend to
07:15.820 --> 07:20.460
general submodular base polytopes. And now my collaborator Jai will present different combinatorial
07:20.460 --> 07:25.100
strategies for dealing with those polytopes. We now describe our toolkit for speeding up
07:25.100 --> 07:31.100
projections on general submodular base polytopes. There are two basic objects that we can learn from.
07:31.100 --> 07:35.900
First, given projections of previous points, can we do better than computing a new projection from
07:35.900 --> 07:41.820
scratch? Second, given an iterative algorithm to compute a projection, can we use the combinatorial
07:41.820 --> 07:46.780
structure present in the sequence of iterates to speed up the algorithm and terminate it early?
07:49.180 --> 07:53.180
We have the well-known first-order optimality condition on the left. It helps us verify if a
07:53.180 --> 07:58.140
point is indeed optimal. This check is reduced to a linear optimization over the base polytope,
07:58.140 --> 08:03.740
which can be done using Edmonds' greedy algorithm. Here is an example. Suppose we know the gradient
08:03.740 --> 08:08.780
at a point x star and want to check if x star is indeed optimal. We look at the distinct values
08:08.780 --> 08:13.900
of the partial derivatives at x star and arrange them in an increasing order. Each time we see a
08:13.900 --> 08:19.580
gap in this order, we want the value of x star on the prefix set to equal the submodular function
08:19.580 --> 08:26.380
value on that set. In the figure, the first such gap is after we have seen e1 and e5. Therefore,
08:26.380 --> 08:35.900
x star of S1 must equal f of S1. Similarly, x star of S2 must equal f of S2. Finally, x star of E must equal f of
08:35.900 --> 08:42.940
E. These sets S1, S2, and E are called tight sets at x star and they define the face containing the point x
08:42.940 --> 08:49.100
star. This leads us to two interesting observations that we use later. One, that if we know precisely
08:49.100 --> 08:53.980
what the tight sets are at the optimal point, we can also calculate the optimal point for all
08:53.980 --> 08:59.100
suitable functions h. Two, that knowing the gradient at the optimal point gives us these
08:59.100 --> 09:06.220
tight sets. We give an example using our combinatorial idea. Suppose we know a point
09:06.220 --> 09:11.420
zk that is close to our optimal x star. If the function is smooth, this implies the gradients at zk
09:11.420 --> 09:16.540
and x star are close. This gives us a way to learn some tight sets defining the optimal face.
09:17.260 --> 09:21.740
In the example, for each coordinate, the blue line in the middle represents the partial derivative
09:21.740 --> 09:26.700
value at zk and the blue shade represents the possible variation in that value for the optimal
09:26.700 --> 09:31.500
point x star. That is, the corresponding partial derivative for x star lies in the shaded interval.
09:32.140 --> 09:36.860
The largest values in these intervals for e1 and e5 are lower than the lowest values in these
09:36.860 --> 09:44.300
intervals for every other element. This helps us conclude that the set {e1, e5}, that is S1,
09:44.300 --> 09:50.540
is a tight set at x star. Similarly, we infer that S2 is also a tight set at x star.
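The inference step can be written down compactly. Here is an illustrative Python helper (our own hypothetical naming, not the paper's code): sort coordinates by their approximate partial derivatives, and certify a prefix set as tight whenever the gap to the next value exceeds twice the uncertainty eps, so the shaded intervals cannot overlap; eps = 0 recovers the exact check from the previous slide.

    def inferred_tight_sets(grad, eps):
        """Prefix sets certified tight at x* given +/- eps noise on the derivatives."""
        order = sorted(grad, key=grad.get)            # increasing partial derivatives
        tight, prefix = [], []
        for e, nxt in zip(order, order[1:]):
            prefix.append(e)
            if grad[nxt] - grad[e] > 2 * eps:         # intervals cannot overlap
                tight.append(set(prefix))
        tight.append(set(order))                      # the ground set is always tight
        return tight

    # e1 and e5 have clearly smaller derivatives than the rest -> S1 = {e1, e5}:
    g = {'e1': 0.10, 'e5': 0.15, 'e2': 0.90, 'e3': 1.00, 'e4': 1.60}
    print(inferred_tight_sets(g, eps=0.1))  # [{'e1','e5'}, {'e1','e5','e2','e3'}, ...]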
09:51.500 --> 09:56.460
We now use that idea to give our first two tools. These apply more generally, but we demonstrate
09:56.460 --> 10:01.820
them using Euclidean projections. Suppose we already know the projection xi of a point yi,
10:01.820 --> 10:06.540
and we wish to find the projection xt of point yt, given that yt is close to yi.
10:07.660 --> 10:11.980
The non-expansiveness of projection implies that the gradients at xi and xt are also close,
10:11.980 --> 10:15.820
and therefore we can infer some tight sets at xt even before solving.
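For Euclidean projections this instantiates directly: the gradient of ½||x − yt||² at the known projection xi is xi − yt, and non-expansiveness bounds its distance to the gradient at xt by ||yt − yi||. A sketch of tool T1 with illustrative numbers, reusing the hypothetical helper above:

    import numpy as np

    yi = np.array([0.11, 0.09, 0.10])
    yt = np.array([0.10, 0.10, 0.10])            # nearby point to be projected next
    xi = np.array([0.10, 0.50, 0.90])            # known projection of yi (illustrative)

    approx_grad = dict(zip(['e1', 'e2', 'e3'], xi - yt))
    eps = float(np.linalg.norm(yt - yi))         # gradient uncertainty bound
    print(inferred_tight_sets(approx_grad, eps)) # tight sets known before solving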
10:16.940 --> 10:20.620
Suppose we start computing the projection of yt using an iterative algorithm.
10:20.620 --> 10:26.860
We now use the iterates zi that converge to xt. An iterate zt that is close to xt also has a
10:26.860 --> 10:32.780
gradient that is close to the gradient at xt, and once again we can infer some tight sets at xt
10:32.780 --> 10:39.740
as we approach the optimal. We also conducted an experiment to show that tool T1 can recover
10:39.740 --> 10:44.300
most tight sets from previous projections. We now give two tools that help us round an
10:44.300 --> 10:49.260
approximate solution to the exact projection. First is our tool T3, called Relax.
10:49.260 --> 10:53.500
We give a heuristic to check if we have already found all the tight sets at the optimal.
10:55.020 --> 10:59.660
We also show that we can round combinatorially when we know the function f to be integral,
10:59.660 --> 11:04.140
and an iterate zt is close enough to the optimal xt. This is our tool T4.
11:05.900 --> 11:10.620
Our fifth tool: we can reuse previously known vertices of the polytope. Suppose that our optimal is xt,
11:10.620 --> 11:15.660
and we are given a close by point xi as a convex combination of some vertices in the polytope.
11:15.660 --> 11:21.980
We can use those vertices to warm start the search for xt. Now our sixth tool, Restrict.
11:23.580 --> 11:27.260
Once we know a few tight sets for xt using our inference tools T1 and T2,
11:27.260 --> 11:32.540
we needn't search for the optimal over the whole base polytope. We can restrict ourselves to the
11:32.540 --> 11:38.380
face of the polytope that satisfies these constraints. We show that a simple extension
11:38.380 --> 11:42.300
of Edmonds' greedy algorithm provides a linear optimization (LO) oracle for each face of the polytope.
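Concretely, here is a sketch of such a face-restricted oracle (again our own hypothetical code, reusing the greedy ordering from before): process a known chain of tight sets first, block by block, and run the same greedy within each block; the returned vertex then lies on the corresponding face.

    def greedy_lo_on_face(w, f, chain):
        """argmax <w, x> over the face of B(f) where x(S) = f(S) for each S in chain."""
        full_chain = list(chain) + [frozenset(w)]      # end with the ground set E
        x, prefix, done = {}, frozenset(), frozenset()
        for S in full_chain:
            block = sorted(S - done, key=w.get, reverse=True)  # greedy inside block
            for e in block:
                x[e] = f(prefix | {e}) - f(prefix)
                prefix |= {e}
            done = frozenset(S)
        return x

    # With chain = [{'b'}], 'b' is processed first, pinning x('b') = f({'b'}).
    f_perm = lambda S: sum([3, 2, 1][:len(S)])
    print(greedy_lo_on_face({'a': 0.9, 'b': 0.1, 'c': 0.5}, f_perm, [frozenset('b')]))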
11:42.300 --> 11:46.300
We now bring together these tools and apply them to the away-step Frank-Wolfe algorithm,
11:46.300 --> 11:50.700
giving the algorithm we dub adaptive away-step Frank-Wolfe, or A2FW for short.
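Before the step-by-step description, here is a pseudocode-level skeleton of the loop in Python (a rough sketch under our assumptions: grad, lo_on_face, infer_tight, all_tight_found, and step are hypothetical callables passed in, and step-size and away-vertex bookkeeping are omitted).

    def a2fw_skeleton(z0, grad, lo_on_face, infer_tight, all_tight_found,
                      step, tol=1e-8, max_iter=10_000):
        """Adaptive away-step Frank-Wolfe, stripped to its control flow."""
        z, chain = z0, []                         # warm-started iterate and tight sets
        for _ in range(max_iter):
            g = grad(z)
            chain = infer_tight(z, g, chain)      # learn new tight sets on the fly
            if all_tight_found(chain):            # heuristic check (tools T3/T4):
                return z, chain                   # round to the face and stop early
            v = lo_on_face(g, chain)              # LO over the restricted face only
            if g @ (z - v) <= tol:                # Frank-Wolfe gap small: done
                return z, chain
            z = step(z, v)                        # FW or away step (details omitted)
        return z, chain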
11:51.660 --> 11:57.340
First, warm start A2FW using tight sets for the optimal inferred from previous projected points,
11:57.340 --> 12:01.660
and active sets from those projections. While the algorithm runs and generates new
12:01.660 --> 12:05.660
iterates, it keeps inferring new tight sets for the optimal point using these iterates.
12:05.660 --> 12:09.580
In each iteration, if a new tight set has been found, the algorithm checks if all tight sets have been
12:09.580 --> 12:16.300
found. If indeed so, it stops and outputs the exact solution. Otherwise, it simply restricts the
12:16.300 --> 12:22.060
problem to a lower-dimensional face and keeps going. Note that the linear optimization is over a
12:22.060 --> 12:27.420
restricted face of the polytope. Let's see an example. Suppose we are optimizing over the
12:27.420 --> 12:32.860
polytope P. We look for the best Frank-Wolfe vertex and the best away vertex. Since the
12:32.860 --> 12:37.500
direction opposite to the away
12:37.500 --> 12:44.060
vertex is the better direction to move in, we find the next iterate z t+1. Now, z t+1 is
12:44.060 --> 12:50.540
close enough to x star that it allows us to detect another tight set and round to the face F new.
12:50.540 --> 12:56.140
One way to do that is to round to an arbitrary vertex in F new using our LO oracle. Another
12:56.140 --> 13:00.700
option is to relax to F new and see if the solution obtained is feasible. If the feasibility
13:00.700 --> 13:06.300
check is inconclusive, we return to the previous strategy. Eventually, we reach the optimal
13:06.300 --> 13:11.660
x star either way. We give this theorem about the primal gap for the modified algorithm.
13:11.660 --> 13:16.460
The function h is L-smooth and mu-strongly convex, and D refers to the diameter of B(F).
13:17.100 --> 13:22.940
Notice how this compares to the AFW algorithm. When we restrict to a face F of B(F), our guarantee
13:22.940 --> 13:29.020
depends only on the pyramidal width of F instead of the pyramidal width of B(F). This pyramidal width
13:29.020 --> 13:33.420
can be much lower for the restricted face. For instance, it depends on the dimension of the face
13:33.420 --> 13:40.620
for the probability simplex. Therefore, A2FW leads to faster convergence. We now show the
13:40.620 --> 13:46.460
effectiveness of our toolkit and the A2FW algorithm using experiments. For our computations,
13:46.460 --> 13:50.940
we simulate an online recommendation system where we are learning over rankings of items
13:50.940 --> 13:55.660
displayed to users. Our loss functions are stochastic and model click-through rates. This
13:55.660 --> 14:01.340
can be seen as optimization over the permutahedron. We use online mirror descent which performs
14:01.340 --> 14:07.020
iterative projections and uses away-step Frank-Wolfe for these projections. We benchmark the
14:07.020 --> 14:14.780
original AFW algorithm against variants modified by our tools. We report significant improvement
14:14.780 --> 14:19.820
in both runtime and the number of AFW iterations. The green line stands for OMD with the original
14:19.820 --> 14:27.180
unoptimized AFW. The yellow line stands for OMD with the A2FW algorithm. We do note that both OMD-PAV,
14:27.180 --> 14:31.900
that is, OMD with projections using the pool adjacent violators algorithm, and OFW were
14:31.900 --> 14:38.140
significantly faster than OMD with any AFW variant. However, OFW does not lead to optimal
14:38.140 --> 14:43.420
regret rates, while OMD-PAV works only for cardinality-based submodular polytopes. To
14:43.420 --> 14:48.140
conclude, we studied iterative projections for prevalent submodular base polytopes. We presented
14:48.140 --> 14:53.020
an O(n log n) algorithm for cardinality-based polytopes. For general polytopes, we developed a combinatorial
14:53.020 --> 14:58.380
toolkit to speed up iterative projections and applied it to the AFW algorithm and computationally
14:58.380 --> 15:18.540
showed that our algorithm is orders of magnitude faster than the original AFW variant.