ytseg_demo / demo_data /nips-2021 /25969 /transcript_whisper_large-v2.vtt
WEBVTT
00:00.000 --> 00:14.160
Hello everyone, my name is Allen. I'm a PhD student from Stanford University. I'm presenting
00:14.160 --> 00:19.880
our work Play to Grade: Testing Coding Games as Classifying Markov Decision Process. This
00:19.880 --> 00:23.720
is joint work with Emma Brunskill and Chris Piech.
00:23.720 --> 00:28.000
In this talk, we will highlight the central problem that we're trying to solve: scaling
00:28.000 --> 00:34.240
up quality feedback for students learning to code. Grading interactive
00:34.240 --> 00:40.040
coding games is very difficult, and we frame this as an instance of identifying if a program
00:40.040 --> 00:48.040
has the same behavior as a desired MDP. Even with 11 labeled programs, we can achieve 94%
00:48.040 --> 00:52.560
accuracy on real student assignments from code.org.
00:52.560 --> 00:58.560
Each year, hundreds of thousands of people, children and adults alike, want to learn coding.
00:58.560 --> 01:06.680
Modern massive online education platforms like code.org serve over 40% of US K-12 students.
01:06.680 --> 01:11.580
Scaling up quality feedback for these students is crucial, especially in areas where there
01:11.580 --> 01:15.780
are shortages of computer science teachers.
01:15.780 --> 01:20.040
Interactive coding assignments are becoming more popular. It's a lot more fun for students
01:20.040 --> 01:25.880
to program them. They're also a common type of program for students to write. For example,
01:25.880 --> 01:31.160
web pages are interactive. However, in order to grade them, teachers often need to play
01:31.160 --> 01:36.560
each student's homework for 20 seconds to a couple of minutes. This quickly becomes a scaling
01:36.560 --> 01:42.760
issue. A 20-student classroom might still be manageable, but in a large university where
01:42.760 --> 01:47.840
there are hundreds of students taking the same class or on an online education platform
01:47.840 --> 01:53.720
like code.org, grading these assignments is a real challenge. This places a real burden
01:53.720 --> 01:55.960
on teachers.
01:55.960 --> 02:01.680
Why is it difficult to develop automatic grading tools? First of all, each assignment is different
02:01.680 --> 02:06.280
from the others. Traditional machine learning solutions that rely on collecting a large
02:06.280 --> 02:13.040
dataset simply won't work here. Oftentimes, assignments for the same class can even change
02:13.040 --> 02:19.840
from year to year. Spending effort to collect a large labeled dataset is a hard sell to teachers.
02:19.840 --> 02:24.840
Second, the same assignment can be written in different coding languages. The solutions
02:24.840 --> 02:31.080
could end up looking quite different. At last, code solutions can be very long, especially
02:31.080 --> 02:36.840
when interaction is involved. Unfortunately, current state-of-the-art code analysis solutions
02:36.840 --> 02:42.240
don't scale beyond 10 lines of code. In this work, we hope to offer a new solution
02:42.240 --> 02:46.480
inspired by how human teachers grade these assignments.
02:46.480 --> 02:50.760
Let's take a look at how a teacher plays to grade a student homework. This is what
02:50.760 --> 02:55.760
a correct solution for code.org's coding assignment, Bounce, looks like. The teacher
02:55.760 --> 03:02.080
controls a paddle to bounce a ball into a goal post and scores one point.
03:02.080 --> 03:06.080
Here's what an incorrect student submission looks like. The student didn't implement the boundary
03:06.080 --> 03:10.120
condition for the wall, and the ball goes right through it.
03:10.120 --> 03:14.920
Here's another incorrect submission. Instead of getting a point after successfully bouncing
03:14.920 --> 03:19.840
the ball into the goal post, the player gets a point whenever the ball bounces off the wall
03:19.840 --> 03:24.040
or the paddle. This is clearly not the correct behavior.
03:24.040 --> 03:29.640
However, a teacher isn't just playing the game normally. In order to grade it, the teacher
03:29.640 --> 03:35.880
has to play it in a specific way to expose bugs in the game. Take a look at both programs
03:35.880 --> 03:41.320
on the left and right. Both have wall boundary problems, but we would never know if the teacher
03:41.320 --> 03:47.200
didn't try to bounce the ball on the wall. The right panel shows that a game, though broken,
03:47.200 --> 03:50.420
can look like a perfectly correct game.
03:50.420 --> 03:55.040
Using the Markov Decision Process framework from reinforcement learning, we can characterize
03:55.040 --> 04:00.720
the intuition we have built up. The MDP framework can be used to describe any interactive environment,
04:00.720 --> 04:06.280
not just games. It includes a state space, an action space, transition dynamics that define
04:06.280 --> 04:12.160
how the game moves from one frame to the next, and a reward function. We can train an agent
04:12.160 --> 04:16.800
using a reinforcement learning algorithm that learns to maximize the reward. So how does
04:16.800 --> 04:21.480
the MDP framework help us understand programs with bugs?
04:21.480 --> 04:26.600
We can treat each program as its own MDP. The teacher's correct program is the correct
04:26.600 --> 04:33.480
or desired MDP, while the student's program is another MDP or a test MDP. We can frame
04:33.480 --> 04:39.240
grading as an instance of identifying if a test MDP has the same behavior as a desired
04:39.240 --> 04:46.000
MDP. Using components from the MDP framework, we can express bugs as distance between two
04:46.000 --> 04:50.960
MDPs' transition and reward functions. The ball going through the wall is clearly not
04:50.960 --> 04:55.600
a correct transition. Receiving a reward when you shouldn't can also be captured by the
04:55.600 --> 05:02.240
difference in the reward function output. More precisely, we can treat grading as calculating
05:02.240 --> 05:09.000
a distance between two MDPs. Equation 1 might suggest that we should check over all states.
05:09.000 --> 05:14.160
However, since distance is non-negative and we're interested in the overall sum, we
05:14.160 --> 05:19.800
only need to find one state-action pair in the test MDP to know if the overall distance
05:19.800 --> 05:25.720
is non-zero. If we set this distance as a reward for an RL agent, we can make the task
05:25.720 --> 05:32.120
of reaching bug states a lot more intelligent and efficient. This RL agent's objective
05:32.120 --> 05:37.680
is to reach states that have the highest potential to be different between the two MDPs with
05:37.680 --> 05:43.700
respect to this distance function. We do have one more challenge that remains.
05:43.700 --> 05:50.620
The distance function d(s, a) requires access to both MDPs' transition and reward functions.
05:50.620 --> 05:55.960
We cannot assume we have access to the student program's inner mechanism. We can't control
05:55.960 --> 06:01.320
the randomness in the student's code either, meaning two MDPs can have different random
06:01.320 --> 06:07.000
initial starting positions. Therefore, when we interact with the student's MDP, we need
06:07.000 --> 06:12.640
to learn a parametrized distance function that can tell us how far the observed state-action
06:12.640 --> 06:17.600
pairs from the student MDP are from the correct MDP.
06:17.600 --> 06:23.120
Now we have two parametrized models. The agent requires training to find the bug. The classifier
06:23.120 --> 06:28.880
requires training to identify the bug. We call this the cold-start problem. So, if I
06:28.880 --> 06:34.360
have a classifier that can classify which state triggers a bug, then we can simply replace
06:34.360 --> 06:40.640
the reward function in the MDP with this classifier and directly teach our agent. If I have an
06:40.640 --> 06:46.480
agent that can always reach the bug state, I can probably just collect a dataset of trajectories
06:46.480 --> 06:52.240
and train a good classifier. But at the beginning, neither the agent nor the classifier can do
06:52.240 --> 06:56.640
a very good job. Therefore, we introduce a procedure called
06:56.640 --> 07:01.440
collaborative training. The agent will start out as a random agent, or we can train
07:01.440 --> 07:07.680
the agent to maximize the original reward in the MDP. It collects trajectories and trains
07:07.680 --> 07:12.960
the classifier. Then we use the classifier as a reward function to guide the agent on
07:12.960 --> 07:18.360
how to reach bug states. They both start out bad, but the agent can help the classifier
07:18.360 --> 07:23.280
learn and the classifier can in return teach the agent.
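The loop just described can be sketched in a few lines. Everything in this toy, self-contained version (the 1-D line world, `BUG_STATES`, the lookup-table "classifier", and the greedy policy update standing in for real RL training) is an illustrative assumption, not the paper's implementation:

```python
# Toy sketch of collaborative training: the agent's trajectories train the
# classifier, and the classifier's scores become the agent's reward.

BUG_STATES = {8, 9, 10}  # states where the broken program deviates (hidden from the agent)

def rollout(policy, steps=40):
    """Walk states 0..10 under `policy(state) -> -1 or +1`; return visited states."""
    s, visited = 0, []
    for _ in range(steps):
        s = max(0, min(10, s + policy(s)))
        visited.append(s)
    return visited

def collaborative_training(rounds=2):
    bug_score = {s: 0.0 for s in range(11)}  # the "classifier": per-state bug score
    policy = lambda s: 1                     # initial agent: single-direction, like the talk's
    for _ in range(rounds):
        # 1. The agent collects trajectories; deviations observed in the broken
        #    MDP label the visited states and train the classifier.
        for s in rollout(policy):
            if s in BUG_STATES:
                bug_score[s] = 1.0
        # 2. The classifier becomes the reward: move toward the neighbouring
        #    state with the higher bug score (a greedy stand-in for RL training).
        policy = lambda s: 1 if bug_score[min(10, s + 1)] >= bug_score[max(0, s - 1)] else -1
    return bug_score, policy

scores, trained_policy = collaborative_training()
```

Round by round, the trajectories train the classifier and the classifier's scores redirect the agent, mirroring the mutual improvement described in the talk.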
07:23.280 --> 07:28.360
We present two baselines to train the bug classifier. Since we have some training data,
07:28.360 --> 07:33.560
though not a lot, we can simply apply coarse labeling, creating a dataset where all state-action
07:33.560 --> 07:40.240
pairs from the correct MDP are labeled as non-bug states and all state-action pairs from the
07:40.240 --> 07:46.160
broken MDP as bug states. This is incredibly noisy because not all state-action pairs from
07:46.160 --> 07:51.600
the broken MDP are bug states, only a few of them are. But this is a good baseline to
07:51.600 --> 07:54.640
have. We can also train an unsupervised learning
07:54.640 --> 08:00.120
model to memorize all state-action pairs from the correct MDP and use log probability or
08:00.120 --> 08:06.200
reconstruction loss to detect abnormal state-action pairs in the broken MDP.
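As a sketch, the coarse-labeling step for the first baseline might look like this; the trajectory format and helper name are hypothetical, not from the paper:

```python
# "Coarse labeling": every state-action pair from the correct MDP is labeled
# non-bug (0), every pair from the broken MDP is labeled bug (1) -- even though
# only a few truly are, which is why this baseline is noisy.

def coarse_label(correct_trajectories, broken_trajectories):
    data = []
    for traj in correct_trajectories:
        data += [(s, a, 0) for (s, a) in traj]  # all labeled non-bug
    for traj in broken_trajectories:
        data += [(s, a, 1) for (s, a) in traj]  # all labeled bug (noisy!)
    return data

# Tiny dummy trajectories of (state, action) pairs:
correct = [[(0, "right"), (1, "right")]]
broken = [[(0, "right"), (1, "up")]]
dataset = coarse_label(correct, broken)
```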
08:06.200 --> 08:12.280
Inspired by Hoare triples and the MDP state equivalence literature, we designed two models to fully
08:12.280 --> 08:18.240
capture this notion of MDP-based state difference. We assume that the students can specify and
08:18.240 --> 08:23.680
set a random seed for their game, so game objects, such as the ball, will not
08:23.680 --> 08:30.000
always appear in the same initial state. It is therefore crucial for us to approximate one MDP's
08:30.000 --> 08:35.840
transition dynamics and reward function when our agent interacts with a new MDP. This is
08:35.840 --> 08:41.560
where Hoare-LSTM comes in. We train it to model the correct MDP's transition dynamics and
08:41.560 --> 08:47.320
reward function and flag states in the new MDP as bug states when they deviate sufficiently from
08:47.320 --> 08:52.440
the prediction. We further introduce Contrastive Hoare-LSTM.
08:52.440 --> 08:57.880
Sometimes the agent will explore a new region that it might not have visited in the correct
08:57.880 --> 09:03.800
MDP. The difference between the observed state and the predicted state is then in
09:03.800 --> 09:09.560
fact just function approximation error. In order to reduce this error, we approximate both
09:09.560 --> 09:13.760
the correct MDP and the broken MDP.
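The detection idea behind Hoare-LSTM can be sketched with a lookup-table dynamics model standing in for the LSTM; the line-world transitions below are an invented example, not the paper's environment:

```python
# Model the correct MDP's dynamics, then flag observed transitions in a new
# MDP as bugs when they deviate from the model's prediction.

class CorrectDynamicsModel:
    """Memorizes the correct MDP's (state, action) -> next_state transitions."""

    def __init__(self):
        self.table = {}

    def fit(self, transitions):
        for s, a, s_next in transitions:
            self.table[(s, a)] = s_next

    def is_bug(self, s, a, observed_next, tol=0):
        """Flag a bug when the observed next state deviates from the prediction."""
        predicted = self.table.get((s, a))
        if predicted is None:
            # Unvisited region: prediction error here is just approximation
            # error, which motivates the contrastive variant (model both MDPs).
            return False
        return abs(observed_next - predicted) > tol

model = CorrectDynamicsModel()
# Correct MDP on a line: moving right stops at the wall (state 2 stays at 2).
model.fit([(0, 1, 1), (1, 1, 2), (2, 1, 2)])
broken_is_bug = model.is_bug(2, 1, 3)  # broken MDP drives through the wall
```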
09:13.760 --> 09:18.600
Let's take a look at how these models work. We introduce a car environment. In here, the
09:18.600 --> 09:23.480
student miscalculated the boundary of this environment, so whenever the car goes outside
09:23.480 --> 09:28.240
of the red dotted line, it will get stuck and can only wriggle back and forth. This
09:28.240 --> 09:35.160
is a task where you will always reach a bug state at the end of each trajectory. Therefore,
09:35.160 --> 09:41.280
every single agent is already an optimal agent. We create a specific one that only knows how
09:41.280 --> 09:44.360
to drive north in a straight line.
09:44.360 --> 09:50.840
As we can see, almost all models, except the Gaussian mixture model, achieve close to 100% accuracy
09:50.840 --> 09:56.680
at classifying bug states and non-bug states. However, the agent that only knows how to
09:56.680 --> 10:01.440
drive north is not a very interesting agent, and we probably will never use that in real
10:01.440 --> 10:05.560
life. So what if we make it a little bit harder?
10:05.560 --> 10:10.840
We can create an agent that drives the car randomly. Now the trajectory will become different
10:10.840 --> 10:16.660
each time. We see a significant drop in performance for baseline solutions like noisy supervised
10:16.660 --> 10:22.880
learning and variational autoencoder. However, our LSTM-based models can still do very well
10:22.880 --> 10:28.400
at close to 100% accuracy. This is a pretty challenging task because we're measuring the
10:28.400 --> 10:33.880
accuracy of each classifier on every state in a trajectory, even though we're in a toy
10:33.880 --> 10:35.760
environment.
10:35.760 --> 10:40.360
Let's make this setting even harder. The car environment can stay the same, but for now,
10:40.360 --> 10:45.800
bugs can only be triggered if the agent successfully drives the car into some small red rectangular
10:45.800 --> 10:51.400
areas. Not all agents are optimal now, and it would be unlikely for a single-direction
10:51.400 --> 10:56.400
agent to ever see a bug state. We can now showcase the power of collaborative training
10:56.400 --> 10:58.880
through this example.
10:58.880 --> 11:03.200
We can see at the beginning, the agent is pretty random, and the classifier is pretty
11:03.200 --> 11:10.240
bad except for the LSTM models. However, after only one round of collaborative training,
11:10.240 --> 11:15.360
we see a substantial improvement for the two baseline models: both the noisy supervised learning
11:15.360 --> 11:21.720
model and the variational autoencoder are able to improve their accuracy by 30% and precision
11:21.720 --> 11:26.800
by 60%. This shows that the collaborative training is helping both the agent and the
11:26.800 --> 11:32.000
classifier become more effective, even for the weaker classifiers.
11:32.000 --> 11:37.680
We also notice that this improvement is not monotonic. Just like every other AI training
11:37.680 --> 11:43.560
scheme, overfitting sometimes happens. Only the most expressive classifiers, our proposed
11:43.560 --> 11:49.840
Hoare-LSTM and Contrastive Hoare-LSTM, can remain stable and even mildly improve their recall
11:49.840 --> 11:53.700
in the last round of collaborative training.
11:53.700 --> 12:00.040
We can directly examine the agent's learning by looking at its trajectory. At first, the
12:00.040 --> 12:05.320
agent drives the car randomly, but after only one round of collaborative training, the agent
12:05.320 --> 12:11.000
becomes sharply focused and only visits the possible buggy areas.
12:11.000 --> 12:16.560
We verify our method on a real student dataset that we obtained from code.org. We used this
12:16.560 --> 12:23.920
assignment as our motivating example earlier. Bounce is a simple coding exercise to which 450,000
12:23.920 --> 12:28.520
students have submitted their solutions. We built a simulator that can run and execute
12:28.520 --> 12:34.280
students' programs and conforms to the OpenAI Gym API. For each student program, we have
12:34.280 --> 12:40.040
created gold labels for bug behaviors. We further binarize them into a single label
12:40.040 --> 12:43.360
indicating correct or incorrect.
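A simulator exposing the classic OpenAI Gym API (reset/step returning obs, reward, done, info) has roughly this shape; the state variables and toy dynamics below are invented placeholders, not the authors' actual simulator:

```python
# Illustrative skeleton of a Bounce-like environment with the classic Gym
# interface, so an RL agent can interact with a student's program.

class BounceEnv:
    ACTIONS = ("left", "stay", "right")

    def reset(self):
        self.ball_x, self.paddle_x, self.score = 5, 5, 0
        return self._obs()

    def step(self, action):
        move = {"left": -1, "stay": 0, "right": 1}[self.ACTIONS[action]]
        self.paddle_x = max(0, min(10, self.paddle_x + move))
        self.ball_x = max(0, min(10, self.ball_x - 1))  # toy ball motion
        # Toy scoring rule: a point when the paddle meets the ball at the goal.
        reward = 1 if self.ball_x == 0 and self.paddle_x == 0 else 0
        self.score += reward
        done = self.score >= 1
        return self._obs(), reward, done, {}

    def _obs(self):
        return (self.ball_x, self.paddle_x)

env = BounceEnv()
obs = env.reset()
obs, reward, done, info = env.step(1)  # action 1 = "stay"
```

A buggy student program would be wrapped behind the same interface, so the grading agent interacts with correct and broken programs identically.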
12:43.360 --> 12:48.360
Bounce is a lot more complicated than car. Learning to bounce a ball into the goalpost
12:48.360 --> 12:54.080
and understanding the physics is a lot more difficult for the agent. Therefore, we pre-train
12:54.080 --> 12:59.680
the agent using the score as a reward. We call this the play-to-win agent. Then we use this
12:59.680 --> 13:06.400
agent to train our bug classifier. We're able to reach 94% accuracy with only 11 labeled
13:06.400 --> 13:13.440
programs as training data. A similar algorithm that uses code as text input cannot match
13:13.440 --> 13:18.960
our method's performance because the training dataset is so small.
13:18.960 --> 13:24.240
In addition to just grading, since we're able to determine bugs at the state level,
13:24.240 --> 13:30.040
we can simply record a few frames before and after the bug occurs and compile a short video
13:30.040 --> 13:35.040
for the students to demonstrate what the bug is in their assignment.
13:35.040 --> 13:39.640
To summarize our work, we provide a fully functional simulator and a massive amount
13:39.640 --> 13:44.080
of real student programs with gold labels. We demonstrate that our solution achieves
13:44.080 --> 13:48.920
high performance. However, many problems remain. For example, can we
13:48.920 --> 13:53.840
know which bug is triggered in the student program? This is helpful for providing fine-grained
13:53.840 --> 13:59.200
feedback to the students. Training an RL agent with a classifier has also been explored in
13:59.200 --> 14:04.520
other areas like SafeRL, where unsafe states are predicted by a classifier.
14:04.520 --> 14:10.760
At last, we pose this question of creativity. Can our formulation accommodate creativity?
14:10.760 --> 14:15.520
Creative programs are different but not broken. A ball can move faster or slower than the
14:15.520 --> 14:20.040
teacher's solution, but it doesn't mean it's wrong. Exploring how we can recognize
14:20.040 --> 14:25.200
and encourage student creativity is crucial for automated grading. Thanks for listening.
14:25.200 --> 14:34.840
Come and chat with me during the poster session.