ytseg_demo / demo_data /nips-2021 /25969 /transcript_whisper_large-v2.vtt
WEBVTT
00:00.000 --> 00:14.160
Hello everyone, my name is Allen. I'm a PhD student from Stanford University. I'm presenting
00:14.160 --> 00:19.880
our work Play to Grade: Testing Coding Games as Classifying Markov Decision Process. This
00:19.880 --> 00:23.720
is joint work with Emma Brunskill and Chris Piech.
00:23.720 --> 00:28.000
In this talk, we will highlight the central problem that we're trying to solve: scaling
00:28.000 --> 00:34.240
up quality feedback for students learning to code. Grading interactive
00:34.240 --> 00:40.040
coding games is very difficult, and we frame this as an instance of identifying if a program
00:40.040 --> 00:48.040
has the same behavior as a desired MDP. Even with 11 labeled programs, we can achieve 94%
00:48.040 --> 00:52.560
accuracy on real student assignments from code.org.
00:52.560 --> 00:58.560
Each year, hundreds of thousands of people, children and adults alike, want to learn coding.
00:58.560 --> 01:06.680
Modern massive online education platforms like code.org serve over 40% of US K-12 students.
01:06.680 --> 01:11.580
Scaling up quality feedback for these students is crucial, especially in areas where there
01:11.580 --> 01:15.780
are shortages of computer science teachers.
01:15.780 --> 01:20.040
Interactive coding assignments are becoming more popular. It's a lot more fun for students
01:20.040 --> 01:25.880
to program them. They're also a common type of program for students to write. For example,
01:25.880 --> 01:31.160
web pages are interactive. However, in order to grade them, teachers often need to play
01:31.160 --> 01:36.560
each student's homework for 20 seconds to a couple of minutes. This quickly becomes a scaling
01:36.560 --> 01:42.760
issue. A 20-student classroom might still be manageable, but in a large university where
01:42.760 --> 01:47.840
there are hundreds of students taking the same class or on an online education platform
01:47.840 --> 01:53.720
like code.org, grading these assignments is a real challenge. This places a real burden
01:53.720 --> 01:55.960
on teachers.
01:55.960 --> 02:01.680
Why is it difficult to develop automatic grading tools? First of all, each assignment is different
02:01.680 --> 02:06.280
from the others. Traditional machine learning solutions that rely on collecting a large
02:06.280 --> 02:13.040
dataset simply won't work here. Oftentimes, assignments for the same class can even change
02:13.040 --> 02:19.840
from year to year. Spending effort to collect a large labeled dataset is a hard sell to teachers.
02:19.840 --> 02:24.840
Second, the same assignment can be written in different coding languages. The solutions
02:24.840 --> 02:31.080
could end up looking quite different. At last, code solutions can be very long, especially
02:31.080 --> 02:36.840
when interaction is involved. Unfortunately, current state-of-the-art code analysis solutions
02:36.840 --> 02:42.240
don't scale beyond 10 lines of code. In this work, we hope to offer a new solution
02:42.240 --> 02:46.480
inspired by how human teachers grade these assignments.
02:46.480 --> 02:50.760
Let's take a look at how a teacher plays to grade a student homework. This is what
02:50.760 --> 02:55.760
a correct solution for code.org's coding assignment, Bounce, looks like. The teacher
02:55.760 --> 03:02.080
controls a paddle to bounce a ball into a goal post and scores one point.
03:02.080 --> 03:06.080
Here's what an incorrect student submission looks like. The student didn't implement the boundary
03:06.080 --> 03:10.120
condition for the wall, and the ball goes right through it.
03:10.120 --> 03:14.920
Here's another incorrect submission. Instead of getting a point after successfully bouncing
03:14.920 --> 03:19.840
the ball into the goal post, the player gets a point whenever the ball bounces off the wall
03:19.840 --> 03:24.040
or the paddle. This is clearly not the correct behavior.
03:24.040 --> 03:29.640
However, a teacher isn't just playing the game normally. In order to grade it, the teacher
03:29.640 --> 03:35.880
has to play it in a specific way to expose bugs in the game. Take a look at both programs
03:35.880 --> 03:41.320
on the left and right. Both have wall boundary problems, but we would never know if the teacher
03:41.320 --> 03:47.200
didn't try to bounce the ball on the wall. The right panel shows that a game, though broken,
03:47.200 --> 03:50.420
can look like a perfectly correct game.
03:50.420 --> 03:55.040
Using the Markov Decision Process framework from reinforcement learning, we can characterize
03:55.040 --> 04:00.720
the intuition we have built up. The MDP framework can be used to describe any interactive environment,
04:00.720 --> 04:06.280
not just games. It includes a state space, an action space, transition dynamics that define
04:06.280 --> 04:12.160
how the game moves from one frame to the next, and a reward function. We can train an agent
04:12.160 --> 04:16.800
using a reinforcement learning algorithm that learns to maximize the reward. So how does
04:16.800 --> 04:21.480
the MDP framework help us understand programs with bugs?
04:21.480 --> 04:26.600
We can treat each program as its own MDP. The teacher's correct program is the correct
04:26.600 --> 04:33.480
or desired MDP, while the student's program is another MDP or a test MDP. We can frame
04:33.480 --> 04:39.240
grading as an instance of identifying if a test MDP has the same behavior as a desired
04:39.240 --> 04:46.000
MDP. Using components from the MDP framework, we can express bugs as distance between two
04:46.000 --> 04:50.960
MDPs' transition and reward functions. The ball going through the wall is clearly not
04:50.960 --> 04:55.600
a correct transition. Receiving a reward when you shouldn't can also be captured by the
04:55.600 --> 05:02.240
difference in the reward function output. More precisely, we can treat grading as calculating
05:02.240 --> 05:09.000
a distance between two MDPs. Equation 1 might suggest that we should check over all states.
05:09.000 --> 05:14.160
However, since distance is non-negative and we're interested in the overall sum, we
05:14.160 --> 05:19.800
only need to find one state-action pair in the test MDP to know if the overall distance
05:19.800 --> 05:25.720
is non-zero. If we set this distance as a reward for an RL agent, we can make the task
05:25.720 --> 05:32.120
of reaching bug states a lot more intelligent and efficient. This RL agent's objective
05:32.120 --> 05:37.680
is to reach states that have the highest potential to be different between the two MDPs with
05:37.680 --> 05:43.700
respect to this distance function. We do have one more challenge that remains.
05:43.700 --> 05:50.620
The distance function d(s, a) requires access to both MDPs' transition and reward functions.
05:50.620 --> 05:55.960
We cannot assume we have access to the student program's inner mechanism. We can't control
05:55.960 --> 06:01.320
the randomness in the student's code either, meaning two MDPs can have different random
06:01.320 --> 06:07.000
initial starting positions. Therefore, when we interact with the student's MDP, we need
06:07.000 --> 06:12.640
to learn a parametrized distance function that can tell us how far the observed state-action
06:12.640 --> 06:17.600
pairs from the student MDP are from the correct MDP.
06:17.600 --> 06:23.120
Now we have two parametrized models. The agent requires training to find the bug. The classifier
06:23.120 --> 06:28.880
requires training to identify the bug. We call this the cold-start problem. So, if I
06:28.880 --> 06:34.360
have a classifier that can classify which state triggers a bug, then we can simply replace
06:34.360 --> 06:40.640
the reward function in the MDP with this classifier and directly teach our agent. If I have an
06:40.640 --> 06:46.480
agent that can always reach the bug state, I can probably just collect a dataset of trajectories
06:46.480 --> 06:52.240
and train a good classifier. But at the beginning, neither the agent nor the classifier can do
06:52.240 --> 06:56.640
a very good job. Therefore, we introduce a procedure called
06:56.640 --> 07:01.440
collaborative training. The agent will start out as a random agent, or we can train
07:01.440 --> 07:07.680
the agent to maximize the original reward in the MDP. It collects trajectories and trains
07:07.680 --> 07:12.960
the classifier. Then we use the classifier as a reward function to guide the agent on
07:12.960 --> 07:18.360
how to reach bug states. They both start out bad, but the agent can help the classifier
07:18.360 --> 07:23.280
learn and the classifier can in return teach the agent.
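The loop just described can be sketched in a few lines. Everything in this toy, self-contained version (the 1-D line world, `BUG_STATES`, the lookup-table "classifier", and the greedy policy update standing in for real RL training) is an illustrative assumption, not the paper's implementation:

```python
# Toy sketch of collaborative training: the agent's trajectories train the
# classifier, and the classifier's scores become the agent's reward.

BUG_STATES = {8, 9, 10}  # states where the broken program deviates (hidden from the agent)

def rollout(policy, steps=40):
    """Walk states 0..10 under `policy(state) -> -1 or +1`; return visited states."""
    s, visited = 0, []
    for _ in range(steps):
        s = max(0, min(10, s + policy(s)))
        visited.append(s)
    return visited

def collaborative_training(rounds=2):
    bug_score = {s: 0.0 for s in range(11)}  # the "classifier": per-state bug score
    policy = lambda s: 1                     # initial agent: single-direction, like the talk's
    for _ in range(rounds):
        # 1. The agent collects trajectories; deviations observed in the broken
        #    MDP label the visited states and train the classifier.
        for s in rollout(policy):
            if s in BUG_STATES:
                bug_score[s] = 1.0
        # 2. The classifier becomes the reward: move toward the neighbouring
        #    state with the higher bug score (a greedy stand-in for RL training).
        policy = lambda s: 1 if bug_score[min(10, s + 1)] >= bug_score[max(0, s - 1)] else -1
    return bug_score, policy

scores, trained_policy = collaborative_training()
```

Round by round, the trajectories train the classifier and the classifier's scores redirect the agent, mirroring the mutual improvement described in the talk.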
07:23.280 --> 07:28.360
We present two baselines to train the bug classifier. Since we have some training data,
07:28.360 --> 07:33.560
though not a lot, we can simply apply coarse labeling, creating a dataset where all state-action
07:33.560 --> 07:40.240
pairs from the correct MDP are labeled as non-bug states and all state-action pairs from the
07:40.240 --> 07:46.160
broken MDP as bug states. This is incredibly noisy because not all state-action pairs from
07:46.160 --> 07:51.600
the broken MDP are bug states, only a few of them are. But this is a good baseline to
07:51.600 --> 07:54.640
have. We can also train an unsupervised learning
07:54.640 --> 08:00.120
model to memorize all state-action pairs from the correct MDP and use log probability or
08:00.120 --> 08:06.200
reconstruction loss to detect abnormal state-action pairs in the broken MDP.
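As a sketch, the coarse-labeling step for the first baseline might look like this; the trajectory format and helper name are hypothetical, not from the paper:

```python
# "Coarse labeling": every state-action pair from the correct MDP is labeled
# non-bug (0), every pair from the broken MDP is labeled bug (1) -- even though
# only a few truly are, which is why this baseline is noisy.

def coarse_label(correct_trajectories, broken_trajectories):
    data = []
    for traj in correct_trajectories:
        data += [(s, a, 0) for (s, a) in traj]  # all labeled non-bug
    for traj in broken_trajectories:
        data += [(s, a, 1) for (s, a) in traj]  # all labeled bug (noisy!)
    return data

# Tiny dummy trajectories of (state, action) pairs:
correct = [[(0, "right"), (1, "right")]]
broken = [[(0, "right"), (1, "up")]]
dataset = coarse_label(correct, broken)
```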
08:06.200 --> 08:12.280
Inspired by Hoare triples and the MDP state equivalence literature, we designed two models to fully
08:12.280 --> 08:18.240
capture this notion of MDP-based state difference. We assume that the students can specify and
08:18.240 --> 08:23.680
set a random seed for their game, so game objects, such as the ball, will not
08:23.680 --> 08:30.000
always appear in the same initial state. It is therefore crucial for us to approximate one MDP's
08:30.000 --> 08:35.840
transition dynamics and reward function when our agent interacts with a new MDP. This is
08:35.840 --> 08:41.560
where Hoare-LSTM comes in. We train it to model the correct MDP's transition dynamics and
08:41.560 --> 08:47.320
reward function and flag states in the new MDP as bug states when they deviate sufficiently from
08:47.320 --> 08:52.440
the prediction. We further introduce Contrastive Hoare-LSTM.
08:52.440 --> 08:57.880
Sometimes the agent will explore a new region that it might not have visited in the correct
08:57.880 --> 09:03.800
MDP. The difference between the observed state and the predicted state is then in
09:03.800 --> 09:09.560
fact just function approximation error. In order to reduce this error, we approximate both
09:09.560 --> 09:13.760
the correct MDP and the broken MDP.
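The detection idea behind Hoare-LSTM can be sketched with a lookup-table dynamics model standing in for the LSTM; the line-world transitions below are an invented example, not the paper's environment:

```python
# Model the correct MDP's dynamics, then flag observed transitions in a new
# MDP as bugs when they deviate from the model's prediction.

class CorrectDynamicsModel:
    """Memorizes the correct MDP's (state, action) -> next_state transitions."""

    def __init__(self):
        self.table = {}

    def fit(self, transitions):
        for s, a, s_next in transitions:
            self.table[(s, a)] = s_next

    def is_bug(self, s, a, observed_next, tol=0):
        """Flag a bug when the observed next state deviates from the prediction."""
        predicted = self.table.get((s, a))
        if predicted is None:
            # Unvisited region: prediction error here is just approximation
            # error, which motivates the contrastive variant (model both MDPs).
            return False
        return abs(observed_next - predicted) > tol

model = CorrectDynamicsModel()
# Correct MDP on a line: moving right stops at the wall (state 2 stays at 2).
model.fit([(0, 1, 1), (1, 1, 2), (2, 1, 2)])
broken_is_bug = model.is_bug(2, 1, 3)  # broken MDP drives through the wall
```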
09:13.760 --> 09:18.600
Let's take a look at how these models work. We introduce a car environment. In here, the
09:18.600 --> 09:23.480
student miscalculated the boundary of this environment, so whenever the car goes outside
09:23.480 --> 09:28.240
of the red dotted line, it will get stuck and can only wriggle back and forth. This
09:28.240 --> 09:35.160
is a task where you will always reach a bug state at the end of each trajectory. Therefore,
09:35.160 --> 09:41.280
every single agent is already an optimal agent. We create a specific one that only knows how
09:41.280 --> 09:44.360
to drive north in a straight line.
09:44.360 --> 09:50.840
As we can see, almost all models, except the Gaussian mixture model, achieve close to 100% accuracy
09:50.840 --> 09:56.680
at classifying bug states and non-bug states. However, the agent that only knows how to
09:56.680 --> 10:01.440
drive north is not a very interesting agent, and we probably will never use that in real
10:01.440 --> 10:05.560
life. So what if we make it a little bit harder?
10:05.560 --> 10:10.840
We can create an agent that drives the car randomly. Now the trajectory will become different
10:10.840 --> 10:16.660
each time. We see a significant drop in performance for baseline solutions like noisy supervised
10:16.660 --> 10:22.880
learning and variational autoencoder. However, our LSTM-based models can still do very well
10:22.880 --> 10:28.400
at close to 100% accuracy. This is a pretty challenging task because we're measuring the
10:28.400 --> 10:33.880
accuracy of each classifier on every state in a trajectory, even though we're in a toy
10:33.880 --> 10:35.760
environment.
10:35.760 --> 10:40.360
Let's make this setting even harder. The car environment can stay the same, but for now,
10:40.360 --> 10:45.800
bugs can only be triggered if the agent successfully drives the car into some small red rectangular
10:45.800 --> 10:51.400
areas. Not all agents are optimal now, and it would be unlikely for a single-direction
10:51.400 --> 10:56.400
agent to ever see a bug state. We can now showcase the power of collaborative training
10:56.400 --> 10:58.880
through this example.
10:58.880 --> 11:03.200
We can see at the beginning, the agent is pretty random, and the classifier is pretty
11:03.200 --> 11:10.240
bad except for the LSTM models. However, after only one round of collaborative training,
11:10.240 --> 11:15.360
we see a substantial improvement for the two baseline models: both the noisy supervised learning
11:15.360 --> 11:21.720
model and the variational autoencoder are able to improve their accuracy by 30% and precision
11:21.720 --> 11:26.800
by 60%. This shows that the collaborative training is helping both the agent and the
11:26.800 --> 11:32.000
classifier become more effective, even for the weaker classifiers.
11:32.000 --> 11:37.680
We also notice that this improvement is not monotonic. Just like every other AI training
11:37.680 --> 11:43.560
scheme, overfitting sometimes happens. Only the most expressive classifiers, our proposed
11:43.560 --> 11:49.840
Hoare-LSTM and Contrastive Hoare-LSTM, can remain stable and even mildly improve their recall
11:49.840 --> 11:53.700
in the last round of collaborative training.
11:53.700 --> 12:00.040
We can directly examine the agent's learning by looking at its trajectory. At first, the
12:00.040 --> 12:05.320
agent drives the car randomly, but after only one round of collaborative training, the agent
12:05.320 --> 12:11.000
becomes sharply focused and only visits the possible buggy areas.
12:11.000 --> 12:16.560
We verify our method on a real student dataset that we obtained from code.org. We used this
12:16.560 --> 12:23.920
assignment as our motivating example earlier. Bounce is a simple coding exercise to which 450,000
12:23.920 --> 12:28.520
students have submitted their solutions. We built a simulator that can run and execute
12:28.520 --> 12:34.280
students' programs and conforms to the OpenAI Gym API. For each student program, we have
12:34.280 --> 12:40.040
created gold labels for bug behaviors. We further binarize them into a single label
12:40.040 --> 12:43.360
indicating correct or incorrect.
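A simulator exposing the classic OpenAI Gym API (reset/step returning obs, reward, done, info) has roughly this shape; the state variables and toy dynamics below are invented placeholders, not the authors' actual simulator:

```python
# Illustrative skeleton of a Bounce-like environment with the classic Gym
# interface, so an RL agent can interact with a student's program.

class BounceEnv:
    ACTIONS = ("left", "stay", "right")

    def reset(self):
        self.ball_x, self.paddle_x, self.score = 5, 5, 0
        return self._obs()

    def step(self, action):
        move = {"left": -1, "stay": 0, "right": 1}[self.ACTIONS[action]]
        self.paddle_x = max(0, min(10, self.paddle_x + move))
        self.ball_x = max(0, min(10, self.ball_x - 1))  # toy ball motion
        # Toy scoring rule: a point when the paddle meets the ball at the goal.
        reward = 1 if self.ball_x == 0 and self.paddle_x == 0 else 0
        self.score += reward
        done = self.score >= 1
        return self._obs(), reward, done, {}

    def _obs(self):
        return (self.ball_x, self.paddle_x)

env = BounceEnv()
obs = env.reset()
obs, reward, done, info = env.step(1)  # action 1 = "stay"
```

A buggy student program would be wrapped behind the same interface, so the grading agent interacts with correct and broken programs identically.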
12:43.360 --> 12:48.360
Bounce is a lot more complicated than car. Learning to bounce a ball into the goalpost
12:48.360 --> 12:54.080
and understanding the physics is a lot more difficult for the agent. Therefore, we pre-train
12:54.080 --> 12:59.680
the agent using the score as a reward. We call this the play-to-win agent. Then we use this
12:59.680 --> 13:06.400
agent to train our bug classifier. We're able to reach 94% accuracy with only 11 labeled
13:06.400 --> 13:13.440
programs as training data. A similar algorithm that uses code as text input cannot match
13:13.440 --> 13:18.960
our method's performance because the training dataset is so small.
13:18.960 --> 13:24.240
In addition to just grading, since we're able to determine bugs at the state level,
13:24.240 --> 13:30.040
we can simply record a few frames before and after the bug occurs and compile a short video
13:30.040 --> 13:35.040
for the students to demonstrate what the bug is in their assignment.
13:35.040 --> 13:39.640
To summarize our work, we provide a fully functional simulator and a massive amount
13:39.640 --> 13:44.080
of real student programs with gold labels. We demonstrate that our solution achieves
13:44.080 --> 13:48.920
high performance. However, many problems remain. For example, can we
13:48.920 --> 13:53.840
know which bug is triggered in the student program? This is helpful for providing fine-grained
13:53.840 --> 13:59.200
feedback to the students. Training an RL agent with a classifier has also been explored in
13:59.200 --> 14:04.520
other areas like SafeRL, where unsafe states are predicted by a classifier.
14:04.520 --> 14:10.760
At last, we pose this question of creativity. Can our formulation accommodate creativity?
14:10.760 --> 14:15.520
Creative programs are different but not broken. A ball can move faster or slower than the
14:15.520 --> 14:20.040
teacher's solution, but it doesn't mean it's wrong. Exploring how we can recognize
14:20.040 --> 14:25.200
and encourage student creativity is crucial for automated grading. Thanks for listening.
14:25.200 --> 14:34.840
Come and chat with me during the poster session.