<!DOCTYPE html>
<html lang="en"><head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1"><link rel="shortcut icon" type="image/x-icon" href="/narsil.github.io/favicon.ico"><!-- Begin Jekyll SEO tag v2.6.1 -->
<title>Model based encodings (3) | Narsil</title>
<meta name="generator" content="Jekyll v3.8.5" />
<meta property="og:title" content="Model based encodings (3)" />
<meta name="author" content="nicolas" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="In the first segment we looked into how we could make a BPE based encoding, not only based on frequency in the dataset, but directly on the model probability measure of the next token. In that article I mention that dynamic BPE are costly because they stop being a one time operation but have to be done for every batch because the vocabulary might have changed. In this article I try to completely remove the “static” BPE approach and replace it completely with ML blocks." />
<meta property="og:description" content="In the first segment we looked into how we could make a BPE based encoding, not only based on frequency in the dataset, but directly on the model probability measure of the next token. In that article I mention that dynamic BPE are costly because they stop being a one time operation but have to be done for every batch because the vocabulary might have changed. In this article I try to completely remove the “static” BPE approach and replace it completely with ML blocks." />
<link rel="canonical" href="http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html" />
<meta property="og:url" content="http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html" />
<meta property="og:site_name" content="Narsil" />
<meta property="og:type" content="article" />
<meta property="article:published_time" content="2019-08-06T00:00:00+02:00" />
<script type="application/ld+json">
{"description":"In the first segment we looked into how we could make a BPE based encoding, not only based on frequency in the dataset, but directly on the model probability measure of the next token. In that article I mention that dynamic BPE are costly because they stop being a one time operation but have to be done for every batch because the vocabulary might have changed. In this article I try to completely remove the “static” BPE approach and replace it completely with ML blocks.","author":{"@type":"Person","name":"nicolas"},"mainEntityOfPage":{"@type":"WebPage","@id":"http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html"},"@type":"BlogPosting","url":"http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html","headline":"Model based encodings (3)","dateModified":"2019-08-06T00:00:00+02:00","datePublished":"2019-08-06T00:00:00+02:00","@context":"https://schema.org"}</script>
<!-- End Jekyll SEO tag -->
<link href="https://unpkg.com/@primer/css/dist/primer.css" rel="stylesheet" />
<link rel="stylesheet" href="/narsil.github.io/assets/main.css">
<link rel="stylesheet" href="//use.fontawesome.com/releases/v5.0.7/css/all.css"><link type="application/atom+xml" rel="alternate" href="http://localhost:4000/narsil.github.io/feed.xml" title="Narsil" />
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.css" integrity="sha384-zB1R0rpPzHqg7Kpt0Aljp8JPLqbXI3bhnPWROx27a9N0Ll6ZP/+DiW/UqRcLbRjq" crossorigin="anonymous">
<script type="text/javascript" async src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML"> </script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.js" integrity="sha384-y23I5Q6l+B6vatafAwxRu/0oK/79VlbSz7Q9aiSZUvyWYIYsd+qj+o24G5ZU2zJz" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous"></script>
<script>
document.addEventListener("DOMContentLoaded", function() {
renderMathInElement( document.body, {
delimiters: [
{left: "$$", right: "$$", display: true},
{left: "[%", right: "%]", display: true},
{left: "$", right: "$", display: false}
]}
);
});
</script>
<script>
function wrap_img(fn) {
if (document.attachEvent ? document.readyState === "complete" : document.readyState !== "loading") {
var elements = document.querySelectorAll(".post img");
Array.prototype.forEach.call(elements, function(el, i) {
if (el.getAttribute("title")) {
const caption = document.createElement('figcaption');
var node = document.createTextNode(el.getAttribute("title"));
caption.appendChild(node);
const wrapper = document.createElement('figure');
wrapper.className = 'image';
el.parentNode.insertBefore(wrapper, el);
el.parentNode.removeChild(el);
wrapper.appendChild(el);
wrapper.appendChild(caption);
}
});
} else { document.addEventListener('DOMContentLoaded', fn); }
}
window.onload = wrap_img;
</script>
<script>
document.addEventListener("DOMContentLoaded", function(){
// add link icon to anchor tags
var elem = document.querySelectorAll(".anchor-link")
elem.forEach(e => (e.innerHTML = '<i class="fas fa-link fa-xs"></i>'));
// remove paragraph tags in rendered toc (happens from notebooks)
var toctags = document.querySelectorAll(".toc-entry")
toctags.forEach(e => (e.firstElementChild.innerText = e.firstElementChild.innerText.replace('¶', '')))
});
</script>
</head><body><header class="site-header" role="banner">
<div class="wrapper"><a class="site-title" rel="author" href="/narsil.github.io/">Narsil</a><nav class="site-nav">
<input type="checkbox" id="nav-trigger" class="nav-trigger" />
<label for="nav-trigger">
<span class="menu-icon">
<svg viewBox="0 0 18 15" width="18px" height="15px">
<path d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.032C17.335,0,18,0.665,18,1.484L18,1.484z M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.032C17.335,6.031,18,6.696,18,7.516L18,7.516z M18,13.516C18,14.335,17.335,15,16.516,15H1.484 C0.665,15,0,14.335,0,13.516l0,0c0-0.82,0.665-1.483,1.484-1.483h15.032C17.335,12.031,18,12.695,18,13.516L18,13.516z"/>
</svg>
</span>
</label>
<div class="trigger"><a class="page-link" href="/narsil.github.io/about/">About Me</a><a class="page-link" href="/narsil.github.io/search/">Search</a><a class="page-link" href="/narsil.github.io/categories/">Tags</a></div>
</nav></div>
</header>
<main class="page-content" aria-label="Content">
<div class="wrapper">
<article class="post h-entry" itemscope itemtype="http://schema.org/BlogPosting">
<header class="post-header">
<h1 class="post-title p-name" itemprop="name headline">Model based encodings (3)</h1><p class="post-meta post-meta-title"><time class="dt-published" datetime="2019-08-06T00:00:00+02:00" itemprop="datePublished">
Aug 6, 2019
</time>•
<span itemprop="author" itemscope itemtype="http://schema.org/Person">
<span class="p-author h-card" itemprop="name">nicolas</span></span>
• <span class="read-time" title="Estimated read time">
12 min read
</span></p>
<p class="category-tags"><i class="fas fa-tags category-tags-icon"></i></i>
<a class="category-tags-link" href="/narsil.github.io/categories/#ml">ml</a>
<a class="category-tags-link" href="/narsil.github.io/categories/#nlp">nlp</a>
</p>
</header>
<div class="post-content e-content" itemprop="articleBody">
<p>In the <a href="/narsil.github.io/ml/nlp/2019/05/16/model-based-bpe-encodings.html">first segment</a>
we looked into how we could build a BPE-style encoding based not only on
frequency in the dataset, but directly on the model’s probability of the
next token. In that article I mention that dynamic BPE is costly because it
stops being a one-time operation and has to be redone for every batch, since
the vocabulary might have changed. In this article I try to remove the
“static” BPE approach entirely and replace it with ML blocks.</p>
<blockquote>
<h1 id="tldr-in-this-article-we-present-an-idea-to-replace-classical-bpe-algorithm-with-a-pure-ml-version-of-it">TL;DR In this article we present an idea to replace classical BPE algorithm with a pure ML version of it.</h1>
</blockquote>
<h2 id="what-is-the-goal-">What is the goal ?</h2>
<p>So the goal is to replace BPE algorithm. So it’s go from something like</p>
<p>“T|h|e| |c|a|t| |a|t|e| |t|h|e| |a|p|p|l|e|.”</p>
<p>To something with fewer elements:</p>
<p>“The |ca|t |at|e |the| |app|le|.”</p>
<p>In one sentence, BPE fuses bytes to form tokens based on frequency in the full
dataset. For a more detailed example, look at <a href="/narsil.github.io/ml/nlp/2019/05/16/model-based-bpe-encodings.html">the previous
article</a>.
In this example, you can see there is always a split after a space. That’s a
limitation of BPE, so our actual target might look a bit different, maybe more like</p>
<p>“The cat |at|e |the app|le|.”</p>
<p>Here we can notice that “The cat” is a full token containing 2 actual words.
So the goal is to fuse the starting bytes into N tokens (let’s say ~10k) that
hopefully capture regularities in our dataset and are at least correlated with
frequency in the original dataset, like BPE tokens were.</p>
<p>Another property we need to keep from BPE is that it can encode an arbitrary
string of text. It does not matter if it’s in a different language or even if it
makes no sense: you CAN encode it, and that is a very desirable property. It avoids
the <a href="https://medium.com/cisco-emerge/creating-semantic-representations-of-out-of-vocabulary-words-for-common-nlp-tasks-842dbdafba18">out-of-vocabulary</a> problem.</p>
<h2 id="approach">Approach</h2>
<h3 id="tokenization">Tokenization</h3>
<p>So let’s imagine we have a trained transformer like
<a href="https://openai.com/blog/better-language-models/">GPT-2</a>, but trained
directly on bytes, NOT on tokens like the original model. Now we can use the idea
that when the model is highly confident, it probably means that what it’s about
to predict is “in the same token”. Let’s take an example. Try to predict the
following character (as in a single letter) in the next 2 sentences:</p>
<blockquote>
<p>Sentence 1: “Who are yo…”</p>
</blockquote>
<blockquote>
<p>Sentence 2: “I like …”</p>
</blockquote>
<p>In the first sentence, you would normally vote with very high confidence for
“u”, whereas in the second sentence, you lack too much context to be exactly
sure of what’s coming next. So “you” would be a token, whereas “like …” can’t
be a single token; it has to be at least 2, “like ” and “…”.</p>
<p>Here is a small gif of actual probabilities of the language model on a small sentence</p>
<p><img src="/narsil.github.io/images/models-2-approach.gif" /></p>
<p>You can see that on the left of the graph the probabilities drop; those are the
tokens the model tries to predict while missing context (because we have very
few characters before them). On the right side, you can see the drops in probability
are pretty consistent and most often correspond to word boundaries.</p>
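<p>To make the fusion rule concrete, here is a minimal sketch of it (not taken from the actual code of the experiment). It assumes a byte-level language model exposing a hypothetical <code>next_char_probabilities</code> helper, and treats epsilon as a plain probability threshold; the real experiment may apply the threshold differently.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def tokenize(text, byte_lm, epsilon):
    # probs[i] = model probability of text[i + 1] given text[: i + 1]
    # (hypothetical helper, assumed to exist on the byte-level LM)
    probs = byte_lm.next_char_probabilities(text)
    tokens = []
    current = text[0]
    for i in range(1, len(text)):
        if probs[i - 1] >= epsilon:
            # High confidence: text[i] "belongs to the same token" as what precedes it.
            current += text[i]
        else:
            # A drop in confidence below epsilon starts a new token.
            tokens.append(current)
            current = text[i]
    tokens.append(current)
    return tokens
</code></pre></div></div>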
<h3 id="handling-unknown-tokens">Handling unknown tokens</h3>
<p>Now we know how we are going to “fuse” characters, but we are not done yet. BPE
tokens are a discrete SET of identified values from 0 to N (~10k in this
experiment), and BPE can encode an arbitrary new string by using its fusion
table. So we can’t just run our algorithm on some specific dataset, count all
the tokens created and declare that these are the N tokens for eternity. Let’s
imagine I feed my algorithm a new sentence in a different language, French for
instance.</p>
<p>“J’adore l’Italie.”</p>
<p>We can run our “tokenizer” on this, and receive something like this</p>
<p>“J|’|ado|re |l’|Ita|lie.”</p>
<p>Now “ado” might not be in our original list, so what do we do with it? Do we
declare the token wrong and split it? That would be odd.</p>
<p>A key insight is to remember what happens to a discrete “token” once
it enters the model (all models do this, it’s really not specific to
transformers or GPT-2): it gets embedded, meaning we go from a number between 1
and N to a vector in <em>d</em>-dimensional space (<em>d</em> is generally between 100 and 1000).</p>
<p>For instance token 3 gets mapped to [0.3, -0.15, 1.4, …] while token 4 gets mapped
to [-2.4, -0.014, 0.45, …]</p>
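<p>For reference, this embedding step is just a lookup table. A minimal illustration (the sizes here are only examples, not the exact ones from the experiment):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from torch import nn

# N = 10000 tokens, each mapped to a d = 128 dimensional vector (example sizes)
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)

token_ids = torch.tensor([3, 4])
vectors = embedding(token_ids)  # shape [2, 128]: one d-dimensional vector per token id
</code></pre></div></div>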
<p>So the idea is to directly generate a token embedding (a vector in <em>d</em> dimensions), not necessarily a
discrete value (a number between 0 and the vocabulary size).</p>
<p>In order to do that, all tokens now need to be represented in the
same way, by a <em>d</em>-dimensional vector. One way to achieve that is to use an
autoencoder.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/2/28/Autoencoder_structure.png" alt="" />
or with code</p>
<p>The core idea is that when we encounter a new unseen token like “ado”, it will still have
a representation through the VAE, and it will probably be close to a known token like “add”.
This can help the network overcome odd tokenizations or spelling errors.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## The name is VAE but I didn't use the internal KL loss in the end as it prevented/slowed down the learning.
</span><span class="k">class</span> <span class="nc">VAE</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="nb">super</span><span class="p">(</span><span class="n">VAE</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">M</span> <span class="o">=</span> <span class="n">config</span><span class="o">.</span><span class="n">CONTEXT_SIZE</span> <span class="o">*</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span>
<span class="n">layer</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span>
<span class="n">m</span> <span class="o">=</span> <span class="mi">400</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">M</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fc21</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fc22</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fc3</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fc4</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">M</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="c1"># x is [Batch, Context size, Embedding dim]
</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">M</span><span class="p">)</span>
<span class="n">h1</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">fc21</span><span class="p">(</span><span class="n">h1</span><span class="p">),</span> <span class="bp">self</span><span class="o">.</span><span class="n">fc22</span><span class="p">(</span><span class="n">h1</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">reparameterize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">):</span>
<span class="n">std</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">logvar</span><span class="p">)</span>
<span class="n">eps</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn_like</span><span class="p">(</span><span class="n">std</span><span class="p">)</span>
<span class="k">return</span> <span class="n">mu</span> <span class="o">+</span> <span class="n">eps</span> <span class="o">*</span> <span class="n">std</span>
<span class="k">def</span> <span class="nf">decode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span>
<span class="n">h3</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fc3</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fc4</span><span class="p">(</span><span class="n">h3</span><span class="p">)</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">CONTEXT_SIZE</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">reparameterize</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">)</span>
<span class="k">return</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
</code></pre></div></div>
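<p>As an illustration of how an unseen token like “ado” could get its <em>d</em>-dimensional representation, here is a hypothetical usage sketch. The <code>char_embedding</code> lookup and the front-padding scheme are assumptions made for the example, not the actual code.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def embed_token(token, char_embedding, vae):
    # Hypothetical helper: char_embedding maps byte ids to EMBEDDING_DIM vectors.
    # config is the same project-level constants module used by the VAE above.
    ids = torch.tensor([ord(c) for c in token])
    chars = char_embedding(ids)  # [len(token), EMBEDDING_DIM]
    # Pad at the front up to CONTEXT_SIZE so every token has the same shape.
    pad = torch.zeros(config.CONTEXT_SIZE - len(token), config.EMBEDDING_DIM)
    x = torch.cat([pad, chars]).unsqueeze(0)  # [1, CONTEXT_SIZE, EMBEDDING_DIM]
    mu, logvar = vae.encode(x)
    return mu.squeeze(0)  # the token's d-dimensional representation

# embed_token("ado", char_embedding, vae) should land close to
# embed_token("add", char_embedding, vae) if the VAE learned useful structure.
</code></pre></div></div>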
<h3 id="final-network">Final network</h3>
<p><img src="/narsil.github.io/images/model-based-2.png" /></p>
<h2 id="results">Results</h2>
<p>Here is a summary of the tokenization statistics we obtained.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Raw</th>
<th>BPE</th>
<th>Model based</th>
</tr>
</thead>
<tbody>
<tr>
<td>Vocabulary size</td>
<td>256</td>
<td>10000</td>
<td>26262</td>
</tr>
<tr>
<td>#Tokens</td>
<td>387k</td>
<td>90k</td>
<td>92k</td>
</tr>
<tr>
<td>Avg token length</td>
<td>1</td>
<td>3.3</td>
<td>6.65</td>
</tr>
</tbody>
</table>
<p>Here is an excerpt of the kind of tokenization we created:</p>
<pre><i>|He w|as on|e of|
the |most |n|oticea|ble member|s of the| Reform| Club|, |th|ough| he| s|eemed
|always |to |avoid |att|racting at|tention|; an en|ig|mat|i|cal |p|erson|age|,|
|ab|out whom l|ittle| was |known|, |e|xc|ept that |he| w|as |a |poli|shed m|an|
o|f |th|e |wo|rld|. |Pe|ople sa|id| that h|e |re|sembl|ed| |Byron|--at least|
t|hat |his hea|d w|as |Byronic|; |but| he was |a |b|earde|d, tranquil| Byron|,
who| |might live| on a |thousand year|s |w|ithout g|r|owing o|ld|.|
|Certainly| an| English|man|, it |was |m|ore |doubt|ful w|h|ether |Phileas Fogg|
w|as |a |London|er|.</i></pre>
<p><a href="/txt/80day_tokenized_exp2.txt">Full text</a></p>
<p>This version was produced with epsilon=0.0015.</p>
<p>As you can see, “Phileas Fogg” is already a token in this situation, a multi-word token not
achievable by regular BPE. You can also see that a lot of words are made of single-byte tokens, which
is why this method compresses LESS than regular BPE at the same vocabulary size.
Another note: common words like “was” already form a token (in the last sentence), but not always;
tokenization is now context dependent!</p>
<h2 id="vae">VAE</h2>
<p>After the VAE step, the reconstruction is not perfect, yet still perfectly legible.</p>
<pre><i>|He w|as on|e of|
the |most |n|oticea|ihe member|s of the| reform| Club|, |th|ough| he| s|eemed
|always |to |asoid |att|nacting at|tention|, an en|ig|mat|i|cal |p|erson|age|,|
|ab|
it whom l|ittle| was | nown|, |e|xc| pt that |he| w|as |a |poli|shed m|an|
o|f |th|e |wo|rld|. |Pe|ople sa|id| that h|e |re|sembl|ed| |pyron| cat least|
t|hat |has hea|d w|as |blronic|; |but| he was |a |b|earde|in tranquil| pyron|
who| |eight live| on a |dar and year|s |w|ithout g|r|owing o|ld|.|
|rertainly| an| English|man|, it |was |m|ore |doubt|ful w|h|ether |Phileas Fogg|
w|as |a |London|er|.</i></pre>
<p><a href="/txt/80day_reconstructed2.txt">Full text</a></p>
<p>Most of the errors tend to lie in the first characters of <em>long tokens</em>. That’s because I’m forced to pad
the input of the VAE and to mask that padding. In practice that means the first characters of long tokens get updated
less than the others, so they necessarily contain more errors. <a href="#notes">More information</a>.</p>
<h2 id="upper-level">Upper level</h2>
<p>In order to complete the experiment, we need to check that the language modeling originally
done directly at the BPE level can also be done on top of this new model-based BPE encoding.</p>
<p>It’s pretty slow to train that upper level because we need to flow the
gradients all the way through the VAE decoder and the lower-layer decoding
step in order to get the <strong>character level loss</strong> (softmax + nll_loss) needed to properly train anything.
That’s a limit of the current approach.</p>
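<p>Roughly, a training step of that upper level looks like the sketch below. The names <code>upper_lm</code>, <code>lower_lm.decode_logits</code> and the padding index are hypothetical stand-ins, not the actual code; the point is that the loss is computed on bytes, so gradients traverse the VAE decoder and the byte-level decoding.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from torch.nn import functional as F

def upper_step(token_embeddings, target_bytes, upper_lm, vae, lower_lm, optimizer):
    pred_embedding = upper_lm(token_embeddings)  # predicted next-token embedding, [B, EMBEDDING_DIM]
    chars = vae.decode(pred_embedding)           # back to character embeddings, [B, CONTEXT_SIZE, EMBEDDING_DIM]
    logits = lower_lm.decode_logits(chars)       # hypothetical byte logits, [B, CONTEXT_SIZE, 256]
    loss = F.nll_loss(
        F.log_softmax(logits, dim=-1).view(-1, 256),
        target_bytes.view(-1),
        ignore_index=0,  # assumed padding id, masked out of the loss
    )
    optimizer.zero_grad()
    loss.backward()  # gradients flow through the VAE decoder and the lower decoder
    optimizer.step()
    return loss.item()
</code></pre></div></div>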
<p>If we randomly split the text into train &amp; validation, we can learn the language model on top of that
model-based BPE almost perfectly (97% top-1 character-level accuracy).</p>
<p><img src="/narsil.github.io/images/models-2-overfit.png" /></p>
<p>However this can be considered <strong>overfitting</strong> because even though a specific validation input
was never seen during training, a very close one <em>was</em>.</p>
<p>If instead we compare with a fixed split, where the last part of the book
is used as the validation set, then we get much lower results.</p>
<p>We could achieve 25% exact character matching and ~77%
top-10 character matching on the validation set, which is the end of the book!
The same happens with BPE, only worse: we can’t get past 13% top-1 and 25% top-10
with regular BPE. That’s understandable because the dataset is very small and
the last part of the book is different, so it’s very hard to infer it from just the
beginning and no other text.</p>
<p>Another note is that model-based BPE does not tokenize deterministically; there
is some variance to it, depending on the context of a particular word.
This actually seems to be a good property (see <a href="https://arxiv.org/abs/1804.10959">this paper</a>) and
might explain the better performance of model-based BPE over regular BPE.
Keep in mind it’s 25% of the <strong>characters</strong> that are correct.
If we looked at a discrete view of <strong>tokens</strong> we would probably have a much higher prediction rate (that is left for future work for now).</p>
<p>Here is a picture of the tensorboard values; P_1 is the probability that the
predicted character is the correct one, P_10 that the correct one is in the top-10
values.</p>
<p><img src="/narsil.github.io/images/models-2-upper.png" /></p>
<p>The overfitting starts happening around the ~1M steps mark.</p>
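<p>For clarity, those two metrics can be computed from the byte logits as in the short sketch below (an illustration, not the original evaluation code):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def top_k_char_accuracy(logits, targets, k):
    # logits: [num_chars, 256] byte logits, targets: [num_chars] true byte ids
    topk = logits.topk(k, dim=-1).indices                # [num_chars, k]
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)   # True where the target is in the top-k
    return hits.float().mean().item()

# P_1  = top_k_char_accuracy(logits, targets, k=1)
# P_10 = top_k_char_accuracy(logits, targets, k=10)
</code></pre></div></div>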
<h3 id="notes">Notes</h3>
<ul>
<li>In the experiment we learned model by model, freezing the lower model
before training something on top. That’s because the batching of the different
layers occurs differently. Learning the whole thing end-to-end will probably
need some thought. The batching is easy for the lower level: every batch
needs a tensor of shape CONTEXT_SIZE (=64) of [0-255] ints. For the VAE, we
need a variable length (depending on the token length) times EMBEDDING_DIM
(=128). The upper level needs only tensors of size CONTEXT_SIZE *
EMBEDDING_DIM, yet if we want to try end-to-end training, we have <strong>no
idea</strong> how many bytes we need to generate 1 correct tensor in the upper layer.
We know it’s no more than CONTEXT_SIZE² but it would be prohibitive to use
that value.</li>
<li>The loss NEEDS to always be the byte-level NLL loss. At first I thought a
simple MSE loss in the embedding space could be enough to learn the proper
models. It seems not to be the case. I could only achieve meaningful results by
always going back to the original strings and computing the NLL loss. When
using this loss, the MSE actually <em>increases</em>. This leads me to think that
encoding/decoding + softmax are highly anisotropic operators. Looking at the
singular values of the embedding matrix, we can see that the highest one is
7.35 and the lowest one 0.12, so there are 2 orders of magnitude between the two.
This anisotropy means that the MSE loss, which treats all dimensions of the
embedding as equal, actually counts some irrelevant dimensions way too much.
It would be much faster and simpler if we could train directly on MSE (it would
let us train without running all the decoding steps to generate the
loss). We would need to add some spectral loss on the embedding of the lower
language model to test that hypothesis.</li>
<li>The tokens have variable lengths. In order to fix this, we have to pad all
sequences during learning. Because we pad, we have to mask the padding
during training for both the VAE and the upper LM (see the sketch after this list). Keeping track of this is
pretty tricky and it means gradients on rarely used positions will rarely get updated. So
we will almost surely miss some letters in our tokens, either at the front or
the end of the token depending on how we pad the tokens.</li>
</ul>
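<p>Here is a small sketch (not the original code) of that padding and masking step, assuming front-padding and a boolean mask over real characters:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def pad_and_mask(token_char_embeddings, context_size, embedding_dim):
    # token_char_embeddings: list of [len_i, embedding_dim] tensors, one per token
    batch = torch.zeros(len(token_char_embeddings), context_size, embedding_dim)
    mask = torch.zeros(len(token_char_embeddings), context_size, dtype=torch.bool)
    for i, chars in enumerate(token_char_embeddings):
        n = chars.shape[0]
        batch[i, context_size - n:] = chars  # real characters at the end, padding at the front
        mask[i, context_size - n:] = True    # True on real characters, False on padding
    return batch, mask

# Padded positions are excluded from the loss via the mask, which is exactly why
# the rarely filled positions (first characters of long tokens) get few updates.
</code></pre></div></div>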
<h2 id="future-work"><strong>Future work</strong></h2>
<ul>
<li>Actually discretizing the tokens to compare with regular BPE. In that direction,
also comparing with a randomized tokenizer as used in <a href="https://github.com/google/sentencepiece">SentencePiece</a>
to make sure the results are actually comparable and are indeed linked to tokenization variance.</li>
<li>The masking problem really seems to be a current limit of the model. Finding a workaround would be really valuable.</li>
<li>The fact that the NLL loss is required slows down the upper layers. It would be awesome if we could smooth out
the encoding/decoding matrix so that an L2 loss works directly for the VAE and the upper layer. It probably goes against regular
language model embeddings, so I’m not sure it’s doable.</li>
<li>Making the epsilon-based tokenization work directly after the embedding layer. This would help <em>stack</em> those levels, hopefully learning
higher and higher representations of text, leading to sentence embeddings and so on.</li>
<li>On the same idea, another direction would be to do actual discrete tokenization to allow for the models to stack.</li>
</ul>
</div><a class="u-url" href="/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html" hidden></a>
</article>
</div>
</main><footer class="site-footer h-card">
<data class="u-url" href="/narsil.github.io/"></data>
<div class="wrapper">
<h2 class="footer-heading">Narsil</h2>
<div class="footer-col-wrapper">
<div class="footer-col footer-col-1">
<ul class="contact-list">
<li class="p-name">Narsil</li></ul>
</div>
<div class="footer-col footer-col-2"><ul class="social-media-list">
<li><a href="https://github.com/Narsil"><svg class="social svg-icon"><use xlink:href="/narsil.github.io/assets/minima-social-icons.svg#github"></use></svg> <span class="username">Narsil</span></a></li><li><a href="https://www.twitter.com/narsilou"><svg class="social svg-icon"><use xlink:href="/narsil.github.io/assets/minima-social-icons.svg#twitter"></use></svg> <span class="username">narsilou</span></a></li></ul>
</div>
<div class="footer-col footer-col-3">
<p>Small experiments and insights from ML and software development.</p>
</div>
</div>
</div>
</footer>
</body>
</html>