|
<!DOCTYPE html> |
|
<html lang="en"><head> |
|
<meta charset="utf-8"> |
|
<meta http-equiv="X-UA-Compatible" content="IE=edge"> |
|
<meta name="viewport" content="width=device-width, initial-scale=1"><link rel="shortcut icon" type="image/x-icon" href="/narsil.github.io/favicon.ico"> |
|
<title>Model based encodings (3) | Narsil</title> |
|
<meta name="generator" content="Jekyll v3.8.5" /> |
|
<meta property="og:title" content="Model based encodings (3)" /> |
|
<meta name="author" content="nicolas" /> |
|
<meta property="og:locale" content="en_US" /> |
|
<meta name="description" content="In the first segment we looked into how we could make a BPE based encoding, not only based on frequency in the dataset, but directly on the model probability measure of the next token. In that article I mention that dynamic BPE are costly because they stop being a one time operation but have to be done for every batch because the vocabulary might have changed. In this article I try to completely remove the “static” BPE approach and replace it completely with ML blocks." /> |
|
<meta property="og:description" content="In the first segment we looked into how we could make a BPE based encoding, not only based on frequency in the dataset, but directly on the model probability measure of the next token. In that article I mention that dynamic BPE are costly because they stop being a one time operation but have to be done for every batch because the vocabulary might have changed. In this article I try to completely remove the “static” BPE approach and replace it completely with ML blocks." /> |
|
<link rel="canonical" href="http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html" /> |
|
<meta property="og:url" content="http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html" /> |
|
<meta property="og:site_name" content="Narsil" /> |
|
<meta property="og:type" content="article" /> |
|
<meta property="article:published_time" content="2019-08-06T00:00:00+02:00" /> |
|
<script type="application/ld+json"> |
|
{"description":"In the first segment we looked into how we could make a BPE based encoding, not only based on frequency in the dataset, but directly on the model probability measure of the next token. In that article I mention that dynamic BPE are costly because they stop being a one time operation but have to be done for every batch because the vocabulary might have changed. In this article I try to completely remove the “static” BPE approach and replace it completely with ML blocks.","author":{"@type":"Person","name":"nicolas"},"mainEntityOfPage":{"@type":"WebPage","@id":"http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html"},"@type":"BlogPosting","url":"http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html","headline":"Model based encodings (3)","dateModified":"2019-08-06T00:00:00+02:00","datePublished":"2019-08-06T00:00:00+02:00","@context":"https://schema.org"}</script> |
|
|
|
|
|
<link href="https://unpkg.com/@primer/css/dist/primer.css" rel="stylesheet" /> |
|
<link rel="stylesheet" href="/narsil.github.io/assets/main.css"> |
|
<link rel="stylesheet" href="//use.fontawesome.com/releases/v5.0.7/css/all.css"><link type="application/atom+xml" rel="alternate" href="http://localhost:4000/narsil.github.io/feed.xml" title="Narsil" /> |
|
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.css" integrity="sha384-zB1R0rpPzHqg7Kpt0Aljp8JPLqbXI3bhnPWROx27a9N0Ll6ZP/+DiW/UqRcLbRjq" crossorigin="anonymous"> |
|
<script type="text/javascript" async src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML"> </script> |
|
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.js" integrity="sha384-y23I5Q6l+B6vatafAwxRu/0oK/79VlbSz7Q9aiSZUvyWYIYsd+qj+o24G5ZU2zJz" crossorigin="anonymous"></script> |
|
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous"></script> |
|
<script> |
|
document.addEventListener("DOMContentLoaded", function() { |
|
renderMathInElement( document.body, { |
|
delimiters: [ |
|
{left: "$$", right: "$$", display: true}, |
|
{left: "[%", right: "%]", display: true}, |
|
{left: "$", right: "$", display: false} |
|
]} |
|
); |
|
}); |
|
</script> |
|
|
|
|
|
<script> |
|
function wrap_img(fn) { |
|
if (document.attachEvent ? document.readyState === "complete" : document.readyState !== "loading") { |
|
var elements = document.querySelectorAll(".post img"); |
|
Array.prototype.forEach.call(elements, function(el, i) { |
|
if (el.getAttribute("title")) { |
|
const caption = document.createElement('figcaption'); |
|
var node = document.createTextNode(el.getAttribute("title")); |
|
caption.appendChild(node); |
|
const wrapper = document.createElement('figure'); |
|
wrapper.className = 'image'; |
|
el.parentNode.insertBefore(wrapper, el); |
|
el.parentNode.removeChild(el); |
|
wrapper.appendChild(el); |
|
wrapper.appendChild(caption); |
|
} |
|
}); |
|
} else { document.addEventListener('DOMContentLoaded', fn); } |
|
} |
|
window.onload = wrap_img; |
|
</script> |
|
|
|
<script> |
|
document.addEventListener("DOMContentLoaded", function(){ |
|
|
|
var elem = document.querySelectorAll(".anchor-link") |
|
elem.forEach(e => (e.innerHTML = '<i class="fas fa-link fa-xs"></i>')); |
|
|
|
var toctags = document.querySelectorAll(".toc-entry") |
|
toctags.forEach(e => (e.firstElementChild.innerText = e.firstElementChild.innerText.replace('¶', ''))) |
|
}); |
|
</script> |
|
</head><body><header class="site-header" role="banner"> |
|
|
|
<div class="wrapper"><a class="site-title" rel="author" href="/narsil.github.io/">Narsil</a><nav class="site-nav"> |
|
<input type="checkbox" id="nav-trigger" class="nav-trigger" /> |
|
<label for="nav-trigger"> |
|
<span class="menu-icon"> |
|
<svg viewBox="0 0 18 15" width="18px" height="15px"> |
|
<path d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.032C17.335,0,18,0.665,18,1.484L18,1.484z M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.032C17.335,6.031,18,6.696,18,7.516L18,7.516z M18,13.516C18,14.335,17.335,15,16.516,15H1.484 C0.665,15,0,14.335,0,13.516l0,0c0-0.82,0.665-1.483,1.484-1.483h15.032C17.335,12.031,18,12.695,18,13.516L18,13.516z"/> |
|
</svg> |
|
</span> |
|
</label> |
|
|
|
<div class="trigger"><a class="page-link" href="/narsil.github.io/about/">About Me</a><a class="page-link" href="/narsil.github.io/search/">Search</a><a class="page-link" href="/narsil.github.io/categories/">Tags</a></div> |
|
</nav></div> |
|
</header> |
|
<main class="page-content" aria-label="Content"> |
|
<div class="wrapper"> |
|
<article class="post h-entry" itemscope itemtype="http://schema.org/BlogPosting"> |
|
|
|
<header class="post-header"> |
|
<h1 class="post-title p-name" itemprop="name headline">Model based encodings (3)</h1><p class="post-meta post-meta-title"><time class="dt-published" datetime="2019-08-06T00:00:00+02:00" itemprop="datePublished"> |
|
Aug 6, 2019 |
|
</time>• |
|
<span itemprop="author" itemscope itemtype="http://schema.org/Person"> |
|
<span class="p-author h-card" itemprop="name">nicolas</span></span> |
|
• <span class="read-time" title="Estimated read time"> |
|
|
|
|
|
12 min read |
|
|
|
</span></p> |
|
|
|
|
|
<p class="category-tags"><i class="fas fa-tags category-tags-icon"></i></i> |
|
|
|
<a class="category-tags-link" href="/narsil.github.io/categories/#ml">ml</a> |
|
|
|
|
|
<a class="category-tags-link" href="/narsil.github.io/categories/#nlp">nlp</a> |
|
|
|
|
|
</p> |
|
|
|
|
|
</header> |
|
|
|
<div class="post-content e-content" itemprop="articleBody"> |
|
<p>In the <a href="/narsil.github.io/ml/nlp/2019/05/16/model-based-bpe-encodings.html">first segment</a>
we looked into how we could build a BPE-based
encoding, not only on frequency in the dataset, but directly on the
model’s probability measure of the next token. In that article I mention that
dynamic BPEs are costly because they stop being a one-time operation and have to
be redone for every batch, because the vocabulary might have changed. In this
article I try to remove the “static” BPE approach entirely and replace it
with ML blocks.</p>
|
|
|
<blockquote> |
|
<h1 id="tldr-in-this-article-we-present-an-idea-to-replace-classical-bpe-algorithm-with-a-pure-ml-version-of-it">TL;DR In this article we present an idea to replace classical BPE algorithm with a pure ML version of it.</h1> |
|
</blockquote> |
|
|
|
<h2 id="what-is-the-goal-">What is the goal ?</h2> |
|
|
|
<p>So the goal is to replace the BPE algorithm, that is, to go from something like</p>
|
|
|
<p>“T|h|e| |c|a|t| |a|t|e| |t|h|e| |a|p|p|l|e|.”</p> |
|
|
|
<p>to something that has fewer elements:</p>
|
|
|
<p>“The |ca|t |at|e |the| |app|le|.”</p> |
|
|
|
<p>In one sentence, BPE fuses bytes to form tokens based on their frequency in the full
dataset. For a more detailed example, look at <a href="/narsil.github.io/ml/nlp/2019/05/16/model-based-bpe-encodings.html">the previous
article</a>.
In this example, you can see there is always a split after a space. That’s a
limitation of BPE, so our actual target might look different, maybe more like</p>
|
|
|
<p>“The cat |at|e |the app|le|.”</p> |
|
|
|
<p>Here we can notice that “The cat” is a full token and contains 2 actual words.
So the goal is to fuse the starting bytes into N tokens (let’s say ~10k) that
hopefully capture regularities in our dataset and are at least correlated to
frequency in the original dataset, like BPE tokens were.</p>
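<p>For reference, here is a minimal sketch of what one classical BPE merge step does (pure
frequency counting, no model involved); the function names are mine, not taken from any library.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

def bpe_merge_step(corpus):
    """One BPE step: fuse the most frequent adjacent pair in the corpus.

    corpus is a list of token lists, e.g. [["T", "h", "e"], ...].
    """
    pairs = Counter()
    for seq in corpus:
        pairs.update(zip(seq, seq[1:]))
    best, _count = pairs.most_common(1)[0]

    def merge(seq):
        out, i = [], 0
        while i != len(seq):
            if tuple(seq[i:i + 2]) == best:
                out.append(best[0] + best[1])  # fuse the pair into one token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        return out

    return [merge(seq) for seq in corpus], best
</code></pre></div></div>

<p>Applied ~10k times, this produces the fusion table that a classical BPE tokenizer then reuses verbatim at encoding time.</p>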
|
|
|
<p>Another property we need to keep from BPE is that it can encode an arbitrary
string of text. It does not matter if it’s not the same language, or even if it
makes sense: you CAN encode it. That is a very desirable property, as it avoids
the <a href="https://medium.com/cisco-emerge/creating-semantic-representations-of-out-of-vocabulary-words-for-common-nlp-tasks-842dbdafba18">out-of-vocabulary</a> problem.</p>
|
|
|
<h2 id="approach">Approach</h2> |
|
|
|
<h3 id="tokenization">Tokenization</h3> |
|
|
|
<p>So let’s imagine we have a trained transformer like
<a href="https://openai.com/blog/better-language-models/">GPT-2</a>, but trained
directly on bytes, NOT on tokens like the original transformer. Now we can use the idea
that when the model is highly confident, it probably means that what it’s about
to predict is “in the same token”. Let’s take an example: try to predict the
following character (a single letter) in these 2 sentences.</p>
|
|
|
<blockquote> |
|
<p>Sentence 1: “Who are yo…”</p> |
|
</blockquote> |
|
|
|
<blockquote> |
|
<p>Sentence 2: “I like …”</p>
|
</blockquote> |
|
|
|
<p>In the first sentence, you would normally vote with very high confidence for
“u”, whereas in the second sentence, you lack the context to be sure
of what’s coming next. So “you” would be a token, whereas “like …” can’t
be a single token; it has to be at least 2, “like ” and “…”.</p>
|
|
|
<p>Here is a small gif of the actual probabilities of the language model on a short sentence.</p>
|
|
|
<p><img src="/narsil.github.io/images/models-2-approach.gif" /></p> |
|
|
|
<p>You can see that on the left of the graph the probabilities drop: those are
tokens being predicted with missing context (because there are very
few characters before them). On the right side, the drops in probability
are fairly consistent and most often correspond to word boundaries.</p>
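<p>This suggests a very simple segmentation rule: cut whenever the probability the model assigned
to the byte that actually came next falls below some threshold epsilon. Here is a minimal sketch,
assuming a hypothetical byte-level model <code>lm</code> that returns the next-byte probability
distribution for a prefix (a real implementation would get all positions from a single forward
pass instead of one call per prefix):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def epsilon_tokenize(data, lm, epsilon=0.0015):
    """Split a byte string wherever the model was not confident.

    data: bytes. lm(prefix) is assumed to return a [256] tensor of
    next-byte probabilities. epsilon is the confidence threshold.
    """
    tokens, current = [], [data[0]]
    for i in range(1, len(data)):
        probs = lm(torch.tensor(list(data[:i])))
        if probs[data[i]].item() &lt; epsilon:  # low confidence: token boundary
            tokens.append(bytes(current))
            current = []
        current.append(data[i])
    tokens.append(bytes(current))
    return tokens
</code></pre></div></div>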
|
|
|
<h3 id="handling-unknown-tokens">Handling unknown tokens</h3> |
|
|
|
<p>Now we know how we are going to “fuse” characters, but we are not done yet. BPE
tokens are a discrete SET of identified values from 0 to N (~10k in this
experiment), and BPE can encode an arbitrary new string by using its fusion
table. So we can’t just run our algorithm on some specific dataset, count all
the tokens created, and declare that these are the N tokens for eternity. Let’s
imagine I feed my algorithm a new sentence in a different language, French for
instance.</p>
|
|
|
<p>“J’adore l’Italie.”</p> |
|
|
|
<p>We can run our “tokenizer” on this, and receive something like this</p> |
|
|
|
<p>“J|’|ado|re |l’|Ita|lie.”</p> |
|
|
|
<p>Now “ado” might not be in our original list, so what do we do with it? Do we
declare the token wrong and split it? That would be odd.</p>
|
|
|
<p>A key insight is to remember what happens to a discrete “token” once
it enters the model (all models do this, it’s really not specific to
transformers or GPT-2): it gets embedded, meaning we go from a number between 1
and N to a vector in <em>d</em>-dimensional space (<em>d</em> is generally between 100 and 1000).</p>
|
|
|
<p>For instance token 3 gets mapped to [0.3, -0.15, 1.4, …] while token 4 gets mapped |
|
to [-2.4, -0.014, 0.45, …]</p> |
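<p>In PyTorch terms, this embedding step is just a lookup table; here is a tiny sketch, where the
vocabulary size and dimension are assumed values for illustration:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from torch import nn

# Discrete ids 0..N-1 are mapped to d-dimensional vectors by a lookup table.
N, d = 10000, 128  # assumed vocabulary size and embedding dimension
embed = nn.Embedding(N, d)
print(embed(torch.tensor([3, 4])).shape)  # torch.Size([2, 128])
</code></pre></div></div>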
|
|
|
<p>So the idea is to directly generate a token embedding (a vector in <em>d</em> dimensions), not necessarily a
discrete value (a number between 0 and the vocabulary size).</p>
|
|
|
<p>In order to do that, all tokens now need to be represented in the
same way, by a <em>d</em>-dimensional vector. One way to achieve that is to use an
autoencoder.</p>
|
|
|
<p><img src="https://upload.wikimedia.org/wikipedia/commons/2/28/Autoencoder_structure.png" alt="" /> |
|
or with code</p> |
|
|
|
<p>The core idea is that when we encounter a new, unseen token like “ado”, it will still have
a representation through the VAE, and it will probably be close to a known token like “add”.
This can help the network overcome odd tokenizations or spelling errors.</p>
|
|
|
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## The name is VAE but I didn't use the internal KL loss in the end as it prevented/slowed down the learning. |
|
</span><span class="k">class</span> <span class="nc">VAE</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span> |
|
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> |
|
<span class="nb">super</span><span class="p">(</span><span class="n">VAE</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">M</span> <span class="o">=</span> <span class="n">config</span><span class="o">.</span><span class="n">CONTEXT_SIZE</span> <span class="o">*</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span> |
|
<span class="n">layer</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span> |
|
<span class="n">m</span> <span class="o">=</span> <span class="mi">400</span> |
|
|
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">M</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc21</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">)</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc22</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">)</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc3</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc4</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">M</span><span class="p">)</span> |
|
|
|
<span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> |
|
<span class="c1"># x is [Batch, Context size, Embedding dim] |
|
</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">M</span><span class="p">)</span> |
|
<span class="n">h1</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> |
|
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">fc21</span><span class="p">(</span><span class="n">h1</span><span class="p">),</span> <span class="bp">self</span><span class="o">.</span><span class="n">fc22</span><span class="p">(</span><span class="n">h1</span><span class="p">)</span> |
|
|
|
<span class="k">def</span> <span class="nf">reparameterize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">):</span> |
|
<span class="n">std</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">logvar</span><span class="p">)</span> |
|
<span class="n">eps</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn_like</span><span class="p">(</span><span class="n">std</span><span class="p">)</span> |
|
<span class="k">return</span> <span class="n">mu</span> <span class="o">+</span> <span class="n">eps</span> <span class="o">*</span> <span class="n">std</span> |
|
|
|
<span class="k">def</span> <span class="nf">decode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span> |
|
<span class="n">h3</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fc3</span><span class="p">(</span><span class="n">z</span><span class="p">))</span> |
|
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc4</span><span class="p">(</span><span class="n">h3</span><span class="p">)</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">CONTEXT_SIZE</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">)</span> |
|
<span class="p">)</span> |
|
|
|
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> |
|
<span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> |
|
<span class="n">z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">reparameterize</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">)</span> |
|
<span class="k">return</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">z</span><span class="p">)</span> |
|
</code></pre></div></div> |
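<p>As a quick sanity check, this is how the module above would be used; the shapes follow the
config values (CONTEXT_SIZE=64, EMBEDDING_DIM=128) mentioned in the notes below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># One batch of 8 padded tokens, each a sequence of character embeddings.
vae = VAE()
x = torch.randn(8, config.CONTEXT_SIZE, config.EMBEDDING_DIM)
mu, logvar, z, recon = vae(x)
print(z.shape)      # [8, EMBEDDING_DIM]: one d-dim vector per token
print(recon.shape)  # [8, CONTEXT_SIZE, EMBEDDING_DIM]: the reconstruction
</code></pre></div></div>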
|
|
|
<h3 id="final-network">Final network</h3> |
|
|
|
<p><img src="/narsil.github.io/images/model-based-2.png" /></p> |
|
|
|
<h2 id="results">Results</h2> |
|
|
|
<p>Here is a summary of the tokenization statistics we obtained.</p>
|
|
|
<table> |
|
<thead> |
|
<tr> |
|
<th> </th> |
|
<th>Raw</th> |
|
<th>BPE</th> |
|
<th>Model based</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td>Vocabulary size</td> |
|
<td>256</td> |
|
<td>10000</td> |
|
<td>26262</td> |
|
</tr> |
|
<tr> |
|
<td>#Tokens</td> |
|
<td>387k</td> |
|
<td>90k</td> |
|
<td>92k</td> |
|
</tr> |
|
<tr> |
|
<td>Avg token length</td> |
|
<td>1</td> |
|
<td>3.3</td> |
|
<td>6.65</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
|
|
<p>Here is an excerpt of the kind of tokenization we created.</p>
|
|
|
<pre><i>|He w|as on|e of| |
|
the |most |n|oticea|ble member|s of the| Reform| Club|, |th|ough| he| s|eemed |
|
|always |to |avoid |att|racting at|tention|; an en|ig|mat|i|cal |p|erson|age|,| |
|
|ab|out whom l|ittle| was |known|, |e|xc|ept that |he| w|as |a |poli|shed m|an| |
|
o|f |th|e |wo|rld|. |Pe|ople sa|id| that h|e |re|sembl|ed| |Byron|--at least| |
|
t|hat |his hea|d w|as |Byronic|; |but| he was |a |b|earde|d, tranquil| Byron|, |
|
who| |might live| on a |thousand year|s |w|ithout g|r|owing o|ld|.| |
|
|
|
|Certainly| an| English|man|, it |was |m|ore |doubt|ful w|h|ether |Phileas Fogg| |
|
w|as |a |London|er|.</i></pre> |
|
|
|
<p><a href="/txt/80day_tokenized_exp2.txt">Full text</a></p> |
|
|
|
<p>This tokenization was produced with epsilon=0.0015 (the confidence threshold used to decide token boundaries).</p>
|
|
|
<p>As you can see, “Phileas Fogg” is already a token here, a multi-word token not
achievable by regular BPE. You can also see that a lot of words consist only of single-byte tokens, which
is why this method compresses LESS than regular BPE at the same vocabulary size.
Another note: a common word like “was” is already a token (in the last sentence), but that’s not always
the case; tokenization is now context dependent!</p>
|
|
|
<h2 id="vae">VAE</h2> |
|
|
|
<p>After the VAE step, the reconstruction is not perfect, yet still perfectly legible.</p>
|
|
|
<pre><i>|He w|as on|e of| |
|
the |most |n|oticea|ihe member|s of the| reform| Club|, |th|ough| he| s|eemed |
|
|always |to |asoid |att|nacting at|tention|, an en|ig|mat|i|cal |p|erson|age|,| |
|
|ab| |
|
it whom l|ittle| was | nown|, |e|xc| pt that |he| w|as |a |poli|shed m|an| |
|
o|f |th|e |wo|rld|. |Pe|ople sa|id| that h|e |re|sembl|ed| |pyron| cat least| |
|
t|hat |has hea|d w|as |blronic|; |but| he was |a |b|earde|in tranquil| pyron| |
|
who| |eight live| on a |dar and year|s |w|ithout g|r|owing o|ld|.| |
|
|
|
|rertainly| an| English|man|, it |was |m|ore |doubt|ful w|h|ether |Phileas Fogg| |
|
w|as |a |London|er|.</i></pre> |
|
|
|
<p><a href="/txt/80day_reconstructed2.txt">Full text</a></p> |
|
|
|
<p>Most of the errors tend to lie in the first characters of <em>long tokens</em>. That’s because I’m forced to pad
the input of the VAE and to mask that padding. In practice, that means the first characters of long tokens get updated
less than the others, so they necessarily contain more errors. <a href="#notes">More information</a>.</p>
|
|
|
<h2 id="upper-level">Upper level</h2> |
|
|
|
<p>In order to complete the experiment, we need to check that the language modelling
originally done directly at the BPE level can also be done with this new model-based BPE encoding.</p>
|
|
|
<p>It’s pretty slow to train that upper level, because we need to flow the
gradients all the way through the VAE decoder and the lower-level decoding
step in order to get the <strong>character-level loss</strong> (softmax + nll_loss) that is needed to properly train anything.
That’s a limit of the current approach.</p>
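<p>Schematically, the upper-level training loss looks like the following sketch. The
<code>byte_head</code> layer mapping character embeddings back to byte logits is an assumption of
mine (the actual decoding step may differ), as are all the names:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn.functional as F

def upper_loss(pred_token_emb, vae, byte_head, target_bytes, pad_id=0):
    """Byte-level NLL for the upper LM's predicted token embeddings.

    pred_token_emb: [B, EMBEDDING_DIM], output of the upper language model.
    byte_head: assumed linear layer from EMBEDDING_DIM to 256 byte logits.
    target_bytes: [B, CONTEXT_SIZE], original (padded) characters of each token.
    """
    chars = vae.decode(pred_token_emb)   # [B, CONTEXT_SIZE, EMBEDDING_DIM]
    logits = byte_head(chars)            # [B, CONTEXT_SIZE, 256]
    return F.nll_loss(
        F.log_softmax(logits, dim=-1).flatten(0, 1),
        target_bytes.flatten(),
        ignore_index=pad_id,             # mask the padding
    )
</code></pre></div></div>

<p>The gradients of this loss have to travel through <code>vae.decode</code> before reaching the upper model, which is what makes training slow.</p>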
|
|
|
<p>If we randomly split the text into train &amp; validation, we can learn the language model
on top of this model-based BPE almost perfectly (97% top-1 character-level accuracy).</p>
|
|
|
<p><img src="/narsil.github.io/images/models-2-overfit.png" /></p> |
|
|
|
<p>However, this can be considered <strong>overfitting</strong>: even though a specific input
from the validation set was never seen during training, a very close one <em>was</em>.</p>
|
|
|
<p>If instead we compare on a fixed split, where the last part of the book
is taken as the validation set, then we get much lower results.</p>
|
|
|
<p>We could achieve 25% exact character matching and ~77%
top-10 character matching on the validation set, which is the end of the book!
The same happens with regular BPE, only worse: we can’t get past 13% top-1 and 25% top-10
with it. That’s understandable because the dataset is very small and
the last part of the book is different, so it’s very hard to infer it from just the
beginning and no other text.</p>
|
|
|
<p>Another note is that model-based BPE does not tokenize deterministically; there
is some variance, depending on the context of a particular word.
This actually seems to be a good property (see <a href="https://arxiv.org/abs/1804.10959">this paper</a>), and it
might explain the better performance of model-based BPE over regular BPE.
Keep in mind it’s 25% of the <strong>characters</strong> that are correct;
if we looked at a discrete view of <strong>tokens</strong>, we would probably get a much higher prediction rate (that’s left for future work for now).</p>
|
|
|
<p>Here is a picture of the tensorboard values. P_1 is the probability that the
predicted character is the correct one; P_10 is the probability that it is in the top-10
predictions.</p>
|
|
|
<p><img src="/narsil.github.io/images/models-2-upper.png" /></p> |
|
|
|
<p>The overfitting starts happening around the ~1M-step mark.</p>
|
|
|
<h3 id="notes">Notes</h3> |
|
|
|
<ul> |
|
<li>In the experiment we learned model by model, freezing the lower model
before training something on top. That’s because the batching of the different
layers occurs differently, so learning the whole thing end-to-end is probably going
to need some thought. The batching is easy for the lower level: every batch
needs a tensor of CONTEXT_SIZE (=64) ints in [0-255]. For the VAE, we
need a variable length (depending on the token length) times EMBEDDING_DIM
(=128). The upper level only needs tensors of size CONTEXT_SIZE *
EMBEDDING_DIM, yet if we want to try end-to-end training, we have <strong>no
idea</strong> how many bytes are needed to generate 1 correct tensor in the upper layer.
We know it’s no more than CONTEXT_SIZE² but using that value would be prohibitive.</li>
|
<li>The loss NEEDS to always be the byte-level NLL loss (see the sketch after this list). At first I thought a
simple MSE loss in the embedding space could be enough to learn the proper
models. That seems to not be the case: I could only achieve meaningful results by
always referring back to the original strings and computing the NLL loss. When
using this loss, the MSE actually <em>increases</em>. This leads me to think that
encoding/decoding + softmax are highly anisotropic operators. Looking at the
singular values of the embedding matrix, we can see that the highest one is
7.35 and the lowest one is 0.12, so there are 2 orders of magnitude between the two.
This anisotropy means that the MSE loss, which treats all dimensions of the
embedding as equal, is actually counting some irrelevant dimensions way too much.
It would be much faster and simpler if we could train directly on MSE (it would
let us train without running all the decoding steps to compute the
loss), so we need to add some spectral loss on the embedding of the lower
language model to test that hypothesis.</li>
|
<li>The tokens have variable lengths. To deal with this, we have to pad all
sequences during learning. Because we pad, we have to mask the padding
during training for both the VAE and the upper LM. Keeping track of this is pretty
fiddly, and it means gradients at rarely used positions will rarely get updated. So
we will almost surely miss some letters in our tokens, either at the front or
at the end of the token, depending on how we pad.</li>
|
</ul> |
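<p>For the anisotropy claim in the second note, the check itself is short; a sketch, assuming
the lower byte-level model exposes its embedding matrix as <code>lm.embedding</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

emb = lm.embedding.weight.detach()  # [256, EMBEDDING_DIM] byte embeddings
_, s, _ = torch.svd(emb)            # singular values, largest first
print(s[0].item(), s[-1].item())    # our run: ~7.35 vs ~0.12
</code></pre></div></div>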
|
|
|
<h2 id="future-work"><strong>Future work</strong></h2> |
|
|
|
<ul> |
|
<li>Actually testing discretizing the tokens, to compare with regular BPE. In that direction,
also comparing with a randomized tokenizer as used in <a href="https://github.com/google/sentencepiece">SentencePiece</a>,
to make sure the results are actually comparable and are indeed linked to tokenization variance.</li>
|
<li>The masking problem really seems to be a current limit of the model. Finding a workaround would be really valuable.</li> |
|
<li>The fact that the NLL loss is required slows down the upper layers. It would be awesome if we could smooth out
the encoding/decoding matrix so that an L2 loss works directly for the VAE and the upper layer. That probably goes against regular
language model embeddings, so I’m not sure it’s doable.</li>
|
<li>Applying the epsilon-based tokenization directly after the embedding layer. This would help <em>stack</em> those levels, hopefully learning
higher and higher representations of text, leading to sentence embeddings and so on.</li>
|
<li>On the same idea, another direction would be to do actual discrete tokenization to allow the models to stack.</li>
|
</ul> |
|
|
|
</div><a class="u-url" href="/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html" hidden></a> |
|
</article> |
|
</div> |
|
</main><footer class="site-footer h-card"> |
|
<data class="u-url" href="/narsil.github.io/"></data> |
|
|
|
<div class="wrapper"> |
|
|
|
<h2 class="footer-heading">Narsil</h2> |
|
|
|
<div class="footer-col-wrapper"> |
|
<div class="footer-col footer-col-1"> |
|
<ul class="contact-list"> |
|
<li class="p-name">Narsil</li></ul> |
|
</div> |
|
|
|
<div class="footer-col footer-col-2"><ul class="social-media-list"> |
|
<li><a href="https://github.com/Narsil"><svg class="social svg-icon"><use xlink:href="/narsil.github.io/assets/minima-social-icons.svg#github"></use></svg> <span class="username">Narsil</span></a></li><li><a href="https://www.twitter.com/narsilou"><svg class="social svg-icon"><use xlink:href="/narsil.github.io/assets/minima-social-icons.svg#twitter"></use></svg> <span class="username">narsilou</span></a></li></ul> |
|
</div> |
|
|
|
<div class="footer-col footer-col-3"> |
|
<p>Small experiements insights from ML and software development.</p> |
|
</div> |
|
</div> |
|
|
|
</div> |
|
|
|
</footer> |
|
</body> |
|
|
|
</html> |
|
|