WEBVTT
00:00.000 --> 00:14.520
Hi, my name is Maxwell Nye, and today I'll be talking about improving coherence and consistency
00:14.520 --> 00:19.620
in neural sequence models with dual system neurosymbolic reasoning.
00:19.620 --> 00:23.800
So I first want to give a little bit of a demo, which is to ask this question.
00:23.800 --> 00:26.920
A bat and a ball cost $1.10 in total.
00:26.920 --> 00:29.300
The bat costs $1 more than the ball.
00:29.300 --> 00:31.720
How much does the ball cost?
00:31.720 --> 00:34.920
So I'll let you think about this for a little bit.
00:34.920 --> 00:39.200
So one answer that might jump out at you is $0.10, but this is actually incorrect
00:39.200 --> 00:43.920
because the sum of the two objects should be $1.10.
00:43.920 --> 00:46.880
So the correct answer is actually $0.05.
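The arithmetic behind that answer can be checked directly:

```python
# Let ball be the ball's price. Then:
#   ball + bat = 1.10  and  bat = ball + 1.00
# Substituting: ball + (ball + 1.00) = 1.10, so 2 * ball = 0.10.
ball = (1.10 - 1.00) / 2
bat = ball + 1.00
print(round(ball, 2))        # 0.05
print(round(bat, 2))         # 1.05
print(round(ball + bat, 2))  # 1.1
```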
00:46.880 --> 00:54.240
And this is an example from a cognitive reflection test, and these are questions designed to
00:54.240 --> 01:00.140
have a particular answer which comes to mind quite quickly, which is in fact wrong.
01:00.140 --> 01:06.640
And something that's interesting is that large-scale language models such as GPT-3 predict the
01:06.640 --> 01:08.320
wrong answers as well.
01:08.320 --> 01:11.300
And this is true not just for the classic cognitive reflection test, but
01:11.300 --> 01:15.160
also for variants with different numbers.
01:15.160 --> 01:19.680
So this is sort of an interesting observation.
01:19.680 --> 01:27.400
It illustrates how neural language models often have issues with consistency and coherence.
01:27.400 --> 01:30.720
So another place where we can see this a little more concretely is the CLUTRR dataset.
01:30.720 --> 01:36.680
In the CLUTRR dataset,
01:36.680 --> 01:42.080
there are sentences about people and their family relationships and stories about those
01:42.080 --> 01:43.840
people.
01:43.840 --> 01:48.800
And this was originally devised as a question-answering data set where you ask what the relations
01:48.800 --> 01:49.800
are.
01:49.800 --> 01:58.080
One thing you can do is train models on this dataset and then have them generate new stories.
01:58.080 --> 02:02.880
And when you do that, you'll see that the generated stories often contain inconsistencies.
02:02.880 --> 02:06.560
So if we look at the bottom of the screen here, we can see an example of this.
02:06.560 --> 02:10.080
Robert and his brother Antonio played harmonicas together.
02:10.080 --> 02:13.440
Robert's daughter, Elsie, asked him to play with her.
02:13.440 --> 02:17.280
Elsie doesn't like having to babysit her younger brother, Antonio.
02:17.280 --> 02:21.240
And so we can see that this is a commonsense error, because Antonio is not
02:21.240 --> 02:22.240
Elsie's younger brother;
02:22.240 --> 02:27.720
he is Robert's brother, and therefore her uncle.
02:27.720 --> 02:35.760
So what we've done is we've built a dual-system model using large-scale neural networks and
02:35.760 --> 02:42.800
symbolic deliberative logic in order to try to help with these consistency issues.
02:42.800 --> 02:44.400
So the model is as follows.
02:44.400 --> 02:52.680
You use neural generation to generate sentences in a particular story.
02:52.680 --> 02:59.360
You might generate the next sentence using a model such as GPT-3 or BART.
02:59.360 --> 03:10.320
What you can then do is parse that sentence into the semantic meaning with respect to
03:10.320 --> 03:15.520
the family relationships and check whether or not it matches the current state of the
03:15.520 --> 03:20.960
family relationships that's been described so far, and only accept the candidate sentence
03:20.960 --> 03:25.800
generations that are actually consistent.
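As a rough sketch of that generate-and-check loop (all names and the toy consistency rule here are illustrative stand-ins, not the paper's actual implementation):

```python
# Toy dual-system loop: a stand-in "neural" proposer (System 1) plus a
# symbolic consistency check (System 2). Names and rules are illustrative.

KNOWN_FACTS = {("daughter", "Elsie", "Robert"), ("brother", "Antonio", "Robert")}

def candidates(context):
    """Stand-in for System 1: candidate facts parsed from generated sentences."""
    return [("brother", "Antonio", "Elsie"),   # contradicts the story so far
            ("uncle", "Antonio", "Elsie")]     # consistent with it

def parent_of(person, known):
    for rel, child, parent in known:
        if rel == "daughter" and child == person:
            return parent
    return None

def is_consistent(fact, known):
    """Stand-in for the System 2 world-model check (an SMT solver in the talk).
    Toy rule: someone's brother cannot also be their parent's brother."""
    rel, a, b = fact
    if rel == "brother" and ("brother", a, parent_of(b, known)) in known:
        return False
    return True

def accept_first_consistent(context):
    """Reject candidate generations until one passes the symbolic check."""
    for fact in candidates(context):
        if is_consistent(fact, KNOWN_FACTS):
            return fact
    return None

print(accept_first_consistent("story so far"))  # ('uncle', 'Antonio', 'Elsie')
```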
03:25.800 --> 03:27.600
So this has a few components.
03:27.600 --> 03:30.380
One of the components here is a symbolic world model.
03:30.380 --> 03:35.160
In the case of this CLUTRR domain, the symbolic world model that we built encodes people and
03:35.160 --> 03:36.160
their family relationships.
03:36.160 --> 03:42.840
So in other words, you could take a sentence and encode what the underlying family relationship
03:42.840 --> 03:43.840
is.
03:43.840 --> 03:50.680
And what you can do is you can use SMT solvers such as the Z3 solver to check consistency.
03:50.680 --> 03:57.240
So given a new sentence, you can check that it doesn't disobey the rules of ancestry that
03:57.240 --> 03:58.240
we've defined here.
03:58.240 --> 04:04.120
And so some of those are, for example, what is the relationship between children and grandchildren?
04:04.120 --> 04:10.000
And then another concerns the rules of ancestry: can you be your own ancestor,
04:10.000 --> 04:12.180
et cetera.
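The paper uses an SMT solver (Z3) for this check; as a pure-Python stand-in, the two rules just mentioned might look like:

```python
def ancestry_consistent(parent_facts):
    """Toy stand-in for the SMT-based check: take (child, parent) facts,
    close them under 'a parent of my ancestor is my ancestor', and reject
    any world in which someone is their own ancestor."""
    ancestor = set(parent_facts)
    changed = True
    while changed:
        changed = False
        for child, anc in list(ancestor):
            for c2, p2 in parent_facts:
                if c2 == anc and (child, p2) not in ancestor:
                    ancestor.add((child, p2))
                    changed = True
    # Rule: nobody can be their own ancestor.
    return all(c != a for c, a in ancestor)

facts = {("Elsie", "Robert")}                              # Robert is Elsie's parent
print(ancestry_consistent(facts))                          # True
print(ancestry_consistent(facts | {("Robert", "Elsie")}))  # False: a parent cycle
```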
04:12.180 --> 04:15.040
So one question is how is this semantic parsing done?
04:15.040 --> 04:19.560
And it turns out we can actually do this quite cheaply using GPT-3.
04:19.560 --> 04:26.920
So what we can see here in the dotted box is an actual example of a few-shot prompt
04:26.920 --> 04:34.440
we can use to parse each new sentence, each new candidate sentence from the system one
04:34.440 --> 04:42.360
generation model and parse it into the semantic form that we can then give to the world model
04:42.360 --> 04:46.280
solver.
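A few-shot parsing prompt in this style might look something like the following (the example sentences and the relation syntax are hypothetical, not the paper's exact prompt):

```python
# Hypothetical few-shot prompt: worked sentence -> relation examples,
# followed by the new candidate sentence for the language model to parse.
PROMPT_TEMPLATE = """\
Sentence: Robert and his brother Antonio played harmonicas together.
Relations: brother(Antonio, Robert)

Sentence: Robert's daughter, Elsie, asked him to play with her.
Relations: daughter(Elsie, Robert)

Sentence: {sentence}
Relations:"""

def make_parse_prompt(sentence):
    """Build the prompt that would be sent to the language model."""
    return PROMPT_TEMPLATE.format(sentence=sentence)

print(make_parse_prompt("Elsie asked her uncle Antonio to babysit."))
```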
04:46.280 --> 04:52.120
So the results here show that stories generated with this dual-system neurosymbolic
04:52.120 --> 05:02.160
approach show improved coherence over sentences generated by a neural model alone.
05:02.160 --> 05:10.160
So the evaluation here is that we collected human judgments of which of
05:10.160 --> 05:14.800
the following sentences make more sense given the prior context of the story.
05:14.800 --> 05:25.280
And we see that if we use a symbolic world model and the parsing scheme described above,
05:25.280 --> 05:32.520
humans prefer the continuations generated by this model.
05:32.520 --> 05:36.360
We can also apply the same sort of reasoning to a completely different task.
05:36.360 --> 05:42.080
Here we consider a grounded instruction-following
05:42.080 --> 05:44.020
domain called gSCAN.
05:44.020 --> 05:49.360
In this domain, the goal is to have an agent, which is shown by this pink triangle, follow
05:49.360 --> 05:53.240
a command to perform some simple action in this grid world.
05:53.240 --> 06:00.520
So you can see here, "walk to a small yellow cylinder" might be an example of a command.
06:00.520 --> 06:06.800
Prior work has shown that one thing you can do is encode the initial state, encode the
06:06.800 --> 06:14.280
instruction and then train a neural model to predict the action sequences.
06:14.280 --> 06:19.600
Other work has also shown that one thing you can do is train a model to predict a distribution
06:19.600 --> 06:25.200
over the correct target location as part of the neural model.
06:25.200 --> 06:29.600
That will also increase the performance of the model.
06:29.600 --> 06:38.400
What we do here is show that if you do both of these things, you predict both an action
06:38.400 --> 06:43.800
sequence and a target location, like what is the location you should end up in, and
06:43.800 --> 06:48.600
then check whether or not when you execute the set of instructions, you will end up in
06:48.600 --> 06:50.720
the predicted target location.
06:50.720 --> 06:57.800
You can sort of check consistency between these two different predictions and only accept
06:57.800 --> 07:06.560
those instruction sequences which match the target location prediction.
07:06.560 --> 07:14.700
And this also leads to higher accuracy, especially in the low-data regime.
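A minimal sketch of that consistency filter (the grid semantics and the names here are simplified stand-ins, not the actual gSCAN action set):

```python
# Accept a predicted action sequence only if executing it from the start
# position lands on the independently predicted target location.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def execute(start, actions):
    """Run an action sequence on the grid; return the final (row, col)."""
    r, c = start
    for a in actions:
        dr, dc = MOVES[a]
        r, c = r + dr, c + dc
    return (r, c)

def first_consistent(start, candidate_seqs, predicted_target):
    """Dual-system check: keep the first candidate action sequence whose
    execution matches the predicted target location."""
    for actions in candidate_seqs:
        if execute(start, actions) == predicted_target:
            return actions
    return None

candidates = [["right", "right", "up"],      # ends at (1, 2): rejected
              ["right", "right", "down"]]    # ends at (3, 2): accepted
print(first_consistent((2, 0), candidates, (3, 2)))  # ['right', 'right', 'down']
```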
07:14.700 --> 07:18.320
There are more details about these results in the paper.
07:18.320 --> 07:21.160
So that's a little bit of an overview of our paper.
07:21.160 --> 07:24.520
Our takeaways are that you can build systems that combine neural methods with explicit
07:24.520 --> 07:25.560
world knowledge.
07:25.560 --> 07:28.880
And if you add just a little bit of world knowledge, you can really help increase coherence
07:28.880 --> 07:34.880
and consistency for these large sequence models.
07:34.880 --> 07:38.520
There are some challenges here about parsing in larger-scale domains and also what it would
07:38.520 --> 07:41.360
mean to automatically build a more complete world model.
07:41.360 --> 08:01.360
Thank you very much.