|
<!DOCTYPE html> |
|
<html lang="en"><head> |
|
<meta charset="utf-8"> |
|
<meta http-equiv="X-UA-Compatible" content="IE=edge"> |
|
<meta name="viewport" content="width=device-width, initial-scale=1"><link rel="shortcut icon" type="image/x-icon" href="/narsil.github.io/favicon.ico"> |
|
<title>Model based encodings (3) | Narsil</title> |
|
<meta name="generator" content="Jekyll v3.8.5" /> |
|
<meta property="og:title" content="Model based encodings (3)" /> |
|
<meta name="author" content="nicolas" /> |
|
<meta property="og:locale" content="en_US" /> |
|
<meta name="description" content="In the first segment we looked into how we could make a BPE based encoding, not only based on frequency in the dataset, but directly on the model probability measure of the next token. In that article I mention that dynamic BPE are costly because they stop being a one time operation but have to be done for every batch because the vocabulary might have changed. In this article I try to completely remove the “static” BPE approach and replace it completely with ML blocks." /> |
|
<meta property="og:description" content="In the first segment we looked into how we could make a BPE based encoding, not only based on frequency in the dataset, but directly on the model probability measure of the next token. In that article I mention that dynamic BPE are costly because they stop being a one time operation but have to be done for every batch because the vocabulary might have changed. In this article I try to completely remove the “static” BPE approach and replace it completely with ML blocks." /> |
|
<link rel="canonical" href="http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html" /> |
|
<meta property="og:url" content="http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html" /> |
|
<meta property="og:site_name" content="Narsil" /> |
|
<meta property="og:type" content="article" /> |
|
<meta property="article:published_time" content="2019-08-06T00:00:00+02:00" /> |
|
<script type="application/ld+json"> |
|
{"description":"In the first segment we looked into how we could make a BPE based encoding, not only based on frequency in the dataset, but directly on the model probability measure of the next token. In that article I mention that dynamic BPE are costly because they stop being a one time operation but have to be done for every batch because the vocabulary might have changed. In this article I try to completely remove the “static” BPE approach and replace it completely with ML blocks.","author":{"@type":"Person","name":"nicolas"},"mainEntityOfPage":{"@type":"WebPage","@id":"http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html"},"@type":"BlogPosting","url":"http://localhost:4000/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html","headline":"Model based encodings (3)","dateModified":"2019-08-06T00:00:00+02:00","datePublished":"2019-08-06T00:00:00+02:00","@context":"https://schema.org"}</script> |
|
|
|
|
|
<link href="https://unpkg.com/@primer/css/dist/primer.css" rel="stylesheet" /> |
|
<link rel="stylesheet" href="/narsil.github.io/assets/main.css"> |
|
<link rel="stylesheet" href="//use.fontawesome.com/releases/v5.0.7/css/all.css"><link type="application/atom+xml" rel="alternate" href="http://localhost:4000/narsil.github.io/feed.xml" title="Narsil" /> |
|
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.css" integrity="sha384-zB1R0rpPzHqg7Kpt0Aljp8JPLqbXI3bhnPWROx27a9N0Ll6ZP/+DiW/UqRcLbRjq" crossorigin="anonymous"> |
|
<script type="text/javascript" async src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML"> </script> |
|
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.js" integrity="sha384-y23I5Q6l+B6vatafAwxRu/0oK/79VlbSz7Q9aiSZUvyWYIYsd+qj+o24G5ZU2zJz" crossorigin="anonymous"></script> |
|
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous"></script> |
|
<script> |
|
document.addEventListener("DOMContentLoaded", function() { |
|
renderMathInElement( document.body, { |
|
delimiters: [ |
|
{left: "$$", right: "$$", display: true}, |
|
{left: "[%", right: "%]", display: true}, |
|
{left: "$", right: "$", display: false} |
|
]} |
|
); |
|
}); |
|
</script> |
|
|
|
|
|
<script> |
|
function wrap_img(fn) { |
|
if (document.attachEvent ? document.readyState === "complete" : document.readyState !== "loading") { |
|
var elements = document.querySelectorAll(".post img"); |
|
Array.prototype.forEach.call(elements, function(el, i) { |
|
if (el.getAttribute("title")) { |
|
const caption = document.createElement('figcaption'); |
|
var node = document.createTextNode(el.getAttribute("title")); |
|
caption.appendChild(node); |
|
const wrapper = document.createElement('figure'); |
|
wrapper.className = 'image'; |
|
el.parentNode.insertBefore(wrapper, el); |
|
el.parentNode.removeChild(el); |
|
wrapper.appendChild(el); |
|
wrapper.appendChild(caption); |
|
} |
|
}); |
|
} else { document.addEventListener('DOMContentLoaded', fn); } |
|
} |
|
window.onload = wrap_img; |
|
</script> |
|
|
|
<script> |
|
document.addEventListener("DOMContentLoaded", function(){ |
|
|
|
var elem = document.querySelectorAll(".anchor-link") |
|
elem.forEach(e => (e.innerHTML = '<i class="fas fa-link fa-xs"></i>')); |
|
|
|
var toctags = document.querySelectorAll(".toc-entry") |
|
toctags.forEach(e => (e.firstElementChild.innerText = e.firstElementChild.innerText.replace('¶', ''))) |
|
}); |
|
</script> |
|
</head><body><header class="site-header" role="banner"> |
|
|
|
<div class="wrapper"><a class="site-title" rel="author" href="/narsil.github.io/">Narsil</a><nav class="site-nav"> |
|
<input type="checkbox" id="nav-trigger" class="nav-trigger" /> |
|
<label for="nav-trigger"> |
|
<span class="menu-icon"> |
|
<svg viewBox="0 0 18 15" width="18px" height="15px"> |
|
<path d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.032C17.335,0,18,0.665,18,1.484L18,1.484z M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.032C17.335,6.031,18,6.696,18,7.516L18,7.516z M18,13.516C18,14.335,17.335,15,16.516,15H1.484 C0.665,15,0,14.335,0,13.516l0,0c0-0.82,0.665-1.483,1.484-1.483h15.032C17.335,12.031,18,12.695,18,13.516L18,13.516z"/> |
|
</svg> |
|
</span> |
|
</label> |
|
|
|
<div class="trigger"><a class="page-link" href="/narsil.github.io/about/">About Me</a><a class="page-link" href="/narsil.github.io/search/">Search</a><a class="page-link" href="/narsil.github.io/categories/">Tags</a></div> |
|
</nav></div> |
|
</header> |
|
<main class="page-content" aria-label="Content"> |
|
<div class="wrapper"> |
|
<article class="post h-entry" itemscope itemtype="http://schema.org/BlogPosting"> |
|
|
|
<header class="post-header"> |
|
<h1 class="post-title p-name" itemprop="name headline">Model based encodings (3)</h1><p class="post-meta post-meta-title"><time class="dt-published" datetime="2019-08-06T00:00:00+02:00" itemprop="datePublished"> |
|
Aug 6, 2019 |
|
</time>• |
|
<span itemprop="author" itemscope itemtype="http://schema.org/Person"> |
|
<span class="p-author h-card" itemprop="name">nicolas</span></span> |
|
• <span class="read-time" title="Estimated read time"> |
|
|
|
|
|
12 min read |
|
|
|
</span></p> |
|
|
|
|
|
<p class="category-tags"><i class="fas fa-tags category-tags-icon"></i></i> |
|
|
|
<a class="category-tags-link" href="/narsil.github.io/categories/#ml">ml</a> |
|
|
|
|
|
<a class="category-tags-link" href="/narsil.github.io/categories/#nlp">nlp</a> |
|
|
|
|
|
</p> |
|
|
|
|
|
</header> |
|
|
|
<div class="post-content e-content" itemprop="articleBody"> |
|
<p>In the <a href="/narsil.github.io/ml/nlp/2019/05/16/model-based-bpe-encodings.html">first segment</a>
we looked into how we could build a BPE-based
encoding, not only on frequency in the dataset, but directly on the
model’s probability measure of the next token. In that article I mention that
dynamic BPEs are costly because they stop being a one-time operation and have to
be redone for every batch, because the vocabulary might have changed. In this
article I try to remove the “static” BPE approach entirely and replace it
with ML blocks.</p>
|
|
|
<blockquote> |
|
<h1 id="tldr-in-this-article-we-present-an-idea-to-replace-classical-bpe-algorithm-with-a-pure-ml-version-of-it">TL;DR In this article we present an idea to replace classical BPE algorithm with a pure ML version of it.</h1> |
|
</blockquote> |
|
|
|
<h2 id="what-is-the-goal-">What is the goal ?</h2> |
|
|
|
<p>So the goal is to replace the BPE algorithm, that is, to go from something like</p>
|
|
|
<p>“T|h|e| |c|a|t| |a|t|e| |t|h|e| |a|p|p|l|e|.”</p> |
|
|
|
<p>to something that has fewer elements:</p>
|
|
|
<p>“The |ca|t |at|e |the| |app|le|.”</p> |
|
|
|
<p>In one sentence, BPE fuses bytes to form tokens based on their frequency in the full
dataset. For a more detailed example, look at <a href="/narsil.github.io/ml/nlp/2019/05/16/model-based-bpe-encodings.html">the previous
article</a>.
In this example, you can see there is always a split after a space. That’s a
limitation of BPE, so our actual target might look different, maybe more like</p>
|
|
|
<p>“The cat |at|e |the app|le|.”</p> |
|
|
|
<p>Here we can notice that “The cat” is a full token and contains 2 actual words.
So the goal is to fuse the starting bytes into N tokens (let’s say ~10k) that
hopefully capture regularities in our dataset and are at least correlated to
frequency in the original dataset, like BPE tokens were.</p>
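<p>For reference, here is a minimal sketch of what one classical BPE merge step does (pure
frequency counting, no model involved); the function names are mine, not taken from any library.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

def bpe_merge_step(corpus):
    """One BPE step: fuse the most frequent adjacent pair in the corpus.

    corpus is a list of token lists, e.g. [["T", "h", "e"], ...].
    """
    pairs = Counter()
    for seq in corpus:
        pairs.update(zip(seq, seq[1:]))
    best, _count = pairs.most_common(1)[0]

    def merge(seq):
        out, i = [], 0
        while i != len(seq):
            if tuple(seq[i:i + 2]) == best:
                out.append(best[0] + best[1])  # fuse the pair into one token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        return out

    return [merge(seq) for seq in corpus], best
</code></pre></div></div>

<p>Applied ~10k times, this produces the fusion table that a classical BPE tokenizer then reuses verbatim at encoding time.</p>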
|
|
|
<p>Another property we need to keep from BPE is that it can encode an arbitrary
string of text. It does not matter if it’s not the same language, or even if it
makes sense: you CAN encode it. That is a very desirable property, as it avoids
the <a href="https://medium.com/cisco-emerge/creating-semantic-representations-of-out-of-vocabulary-words-for-common-nlp-tasks-842dbdafba18">out-of-vocabulary</a> problem.</p>
|
|
|
<h2 id="approach">Approach</h2> |
|
|
|
<h3 id="tokenization">Tokenization</h3> |
|
|
|
<p>So let’s imagine we have a trained transformer like
<a href="https://openai.com/blog/better-language-models/">GPT-2</a>, but trained
directly on bytes, NOT on tokens like the original transformer. Now we can use the idea
that when the model is highly confident, it probably means that what it’s about
to predict is “in the same token”. Let’s take an example: try to predict the
following character (a single letter) in these 2 sentences.</p>
|
|
|
<blockquote> |
|
<p>Sentence 1: “Who are yo…”</p> |
|
</blockquote> |
|
|
|
<blockquote> |
|
<p>Sentence 2: “I like …”</p>
|
</blockquote> |
|
|
|
<p>In the first sentence, you would normally vote with very high confidence for
“u”, whereas in the second sentence, you lack the context to be sure
of what’s coming next. So “you” would be a token, whereas “like …” can’t
be a single token; it has to be at least 2, “like ” and “…”.</p>
|
|
|
<p>Here is a small gif of the actual probabilities of the language model on a short sentence.</p>
|
|
|
<p><img src="/narsil.github.io/images/models-2-approach.gif" /></p> |
|
|
|
<p>You can see that on the left of the graph the probabilities drop: those are
tokens being predicted with missing context (because there are very
few characters before them). On the right side, the drops in probability
are fairly consistent and most often correspond to word boundaries.</p>
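<p>This suggests a very simple segmentation rule: cut whenever the probability the model assigned
to the byte that actually came next falls below some threshold epsilon. Here is a minimal sketch,
assuming a hypothetical byte-level model <code>lm</code> that returns the next-byte probability
distribution for a prefix (a real implementation would get all positions from a single forward
pass instead of one call per prefix):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def epsilon_tokenize(data, lm, epsilon=0.0015):
    """Split a byte string wherever the model was not confident.

    data: bytes. lm(prefix) is assumed to return a [256] tensor of
    next-byte probabilities. epsilon is the confidence threshold.
    """
    tokens, current = [], [data[0]]
    for i in range(1, len(data)):
        probs = lm(torch.tensor(list(data[:i])))
        if probs[data[i]].item() &lt; epsilon:  # low confidence: token boundary
            tokens.append(bytes(current))
            current = []
        current.append(data[i])
    tokens.append(bytes(current))
    return tokens
</code></pre></div></div>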
|
|
|
<h3 id="handling-unknown-tokens">Handling unknown tokens</h3> |
|
|
|
<p>Now we know how we are going to “fuse” characters, but we are not done yet. BPE
tokens are a discrete SET of identified values from 0 to N (~10k in this
experiment), and BPE can encode an arbitrary new string by using its fusion
table. So we can’t just run our algorithm on some specific dataset, count all
the tokens created, and declare that these are the N tokens for eternity. Let’s
imagine I feed my algorithm a new sentence in a different language, French for
instance.</p>
|
|
|
<p>“J’adore l’Italie.”</p> |
|
|
|
<p>We can run our “tokenizer” on this, and receive something like this</p> |
|
|
|
<p>“J|’|ado|re |l’|Ita|lie.”</p> |
|
|
|
<p>Now “ado” might not be in our original list, so what do we do with it? Do we
declare the token wrong and split it? That would be odd.</p>
|
|
|
<p>A key insight is to remember what happens to a discrete “token” once
it enters the model (all models do this, it’s really not specific to
transformers or GPT-2): it gets embedded, meaning we go from a number between 1
and N to a vector in <em>d</em>-dimensional space (<em>d</em> is generally between 100 and 1000).</p>
|
|
|
<p>For instance token 3 gets mapped to [0.3, -0.15, 1.4, …] while token 4 gets mapped |
|
to [-2.4, -0.014, 0.45, …]</p> |
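<p>In PyTorch terms, this embedding step is just a lookup table; here is a tiny sketch, where the
vocabulary size and dimension are assumed values for illustration:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from torch import nn

# Discrete ids 0..N-1 are mapped to d-dimensional vectors by a lookup table.
N, d = 10000, 128  # assumed vocabulary size and embedding dimension
embed = nn.Embedding(N, d)
print(embed(torch.tensor([3, 4])).shape)  # torch.Size([2, 128])
</code></pre></div></div>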
|
|
|
<p>So the idea is to directly generate a token embedding (a vector in <em>d</em> dimensions), not necessarily a
discrete value (a number between 0 and the vocabulary size).</p>
|
|
|
<p>In order to do that, all tokens now need to be represented in the
same way, by a <em>d</em>-dimensional vector. One way to achieve that is to use an
autoencoder.</p>
|
|
|
<p><img src="https://upload.wikimedia.org/wikipedia/commons/2/28/Autoencoder_structure.png" alt="" /> |
|
or with code</p> |
|
|
|
<p>The core idea is that when we encounter a new, unseen token like “ado”, it will still have
a representation through the VAE, and it will probably be close to a known token like “add”.
This can help the network overcome odd tokenizations or spelling errors.</p>
|
|
|
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## The name is VAE but I didn't use the internal KL loss in the end as it prevented/slowed down the learning. |
|
</span><span class="k">class</span> <span class="nc">VAE</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span> |
|
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> |
|
<span class="nb">super</span><span class="p">(</span><span class="n">VAE</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">M</span> <span class="o">=</span> <span class="n">config</span><span class="o">.</span><span class="n">CONTEXT_SIZE</span> <span class="o">*</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span> |
|
<span class="n">layer</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span> |
|
<span class="n">m</span> <span class="o">=</span> <span class="mi">400</span> |
|
|
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">M</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc21</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">)</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc22</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">)</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc3</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc4</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">M</span><span class="p">)</span> |
|
|
|
<span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> |
|
<span class="c1"># x is [Batch, Context size, Embedding dim] |
|
</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">M</span><span class="p">)</span> |
|
<span class="n">h1</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> |
|
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">fc21</span><span class="p">(</span><span class="n">h1</span><span class="p">),</span> <span class="bp">self</span><span class="o">.</span><span class="n">fc22</span><span class="p">(</span><span class="n">h1</span><span class="p">)</span> |
|
|
|
<span class="k">def</span> <span class="nf">reparameterize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">):</span> |
|
<span class="n">std</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">logvar</span><span class="p">)</span> |
|
<span class="n">eps</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn_like</span><span class="p">(</span><span class="n">std</span><span class="p">)</span> |
|
<span class="k">return</span> <span class="n">mu</span> <span class="o">+</span> <span class="n">eps</span> <span class="o">*</span> <span class="n">std</span> |
|
|
|
<span class="k">def</span> <span class="nf">decode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span> |
|
<span class="n">h3</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fc3</span><span class="p">(</span><span class="n">z</span><span class="p">))</span> |
|
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span> |
|
<span class="bp">self</span><span class="o">.</span><span class="n">fc4</span><span class="p">(</span><span class="n">h3</span><span class="p">)</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">CONTEXT_SIZE</span><span class="p">,</span> <span class="n">config</span><span class="o">.</span><span class="n">EMBEDDING_DIM</span><span class="p">)</span> |
|
<span class="p">)</span> |
|
|
|
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> |
|
<span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> |
|
<span class="n">z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">reparameterize</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">)</span> |
|
<span class="k">return</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">z</span><span class="p">)</span> |
|
</code></pre></div></div> |
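<p>As a quick sanity check, this is how the module above would be used; the shapes follow the
config values (CONTEXT_SIZE=64, EMBEDDING_DIM=128) mentioned in the notes below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># One batch of 8 padded tokens, each a sequence of character embeddings.
vae = VAE()
x = torch.randn(8, config.CONTEXT_SIZE, config.EMBEDDING_DIM)
mu, logvar, z, recon = vae(x)
print(z.shape)      # [8, EMBEDDING_DIM]: one d-dim vector per token
print(recon.shape)  # [8, CONTEXT_SIZE, EMBEDDING_DIM]: the reconstruction
</code></pre></div></div>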
|
|
|
<h3 id="final-network">Final network</h3> |
|
|
|
<p><img src="/narsil.github.io/images/model-based-2.png" /></p> |
|
|
|
<h2 id="results">Results</h2> |
|
|
|
<p>Here is a summary of the tokenization statistics we obtained.</p>
|
|
|
<table> |
|
<thead> |
|
<tr> |
|
<th> </th> |
|
<th>Raw</th> |
|
<th>BPE</th> |
|
<th>Model based</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td>Vocabulary size</td> |
|
<td>256</td> |
|
<td>10000</td> |
|
<td>26262</td> |
|
</tr> |
|
<tr> |
|
<td>#Tokens</td> |
|
<td>387k</td> |
|
<td>90k</td> |
|
<td>92k</td> |
|
</tr> |
|
<tr> |
|
<td>Avg token length</td> |
|
<td>1</td> |
|
<td>3.3</td> |
|
<td>6.65</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
|
|
<p>Here is an excerpt of the kind of tokenization we created.</p>
|
|
|
<pre><i>|He w|as on|e of| |
|
the |most |n|oticea|ble member|s of the| Reform| Club|, |th|ough| he| s|eemed |
|
|always |to |avoid |att|racting at|tention|; an en|ig|mat|i|cal |p|erson|age|,| |
|
|ab|out whom l|ittle| was |known|, |e|xc|ept that |he| w|as |a |poli|shed m|an| |
|
o|f |th|e |wo|rld|. |Pe|ople sa|id| that h|e |re|sembl|ed| |Byron|--at least| |
|
t|hat |his hea|d w|as |Byronic|; |but| he was |a |b|earde|d, tranquil| Byron|, |
|
who| |might live| on a |thousand year|s |w|ithout g|r|owing o|ld|.| |
|
|
|
|Certainly| an| English|man|, it |was |m|ore |doubt|ful w|h|ether |Phileas Fogg| |
|
w|as |a |London|er|.</i></pre> |
|
|
|
<p><a href="/txt/80day_tokenized_exp2.txt">Full text</a></p> |
|
|
|
<p>This tokenization was produced with epsilon=0.0015 (the confidence threshold used to decide token boundaries).</p>
|
|
|
<p>As you can see, “Phileas Fogg” is already a token here, a multi-word token not
achievable by regular BPE. You can also see that a lot of words consist only of single-byte tokens, which
is why this method compresses LESS than regular BPE at the same vocabulary size.
Another note: a common word like “was” is already a token (in the last sentence), but that’s not always
the case; tokenization is now context dependent!</p>
|
|
|
<h2 id="vae">VAE</h2> |
|
|
|
<p>After the VAE step, the reconstruction is not perfect, yet still perfectly legible.</p>
|
|
|
<pre><i>|He w|as on|e of| |
|
the |most |n|oticea|ihe member|s of the| reform| Club|, |th|ough| he| s|eemed |
|
|always |to |asoid |att|nacting at|tention|, an en|ig|mat|i|cal |p|erson|age|,| |
|
|ab| |
|
it whom l|ittle| was | nown|, |e|xc| pt that |he| w|as |a |poli|shed m|an| |
|
o|f |th|e |wo|rld|. |Pe|ople sa|id| that h|e |re|sembl|ed| |pyron| cat least| |
|
t|hat |has hea|d w|as |blronic|; |but| he was |a |b|earde|in tranquil| pyron| |
|
who| |eight live| on a |dar and year|s |w|ithout g|r|owing o|ld|.| |
|
|
|
|rertainly| an| English|man|, it |was |m|ore |doubt|ful w|h|ether |Phileas Fogg| |
|
w|as |a |London|er|.</i></pre> |
|
|
|
<p><a href="/txt/80day_reconstructed2.txt">Full text</a></p> |
|
|
|
<p>Most of the errors tend to lie in the first characters of <em>long tokens</em>. That’s because I’m forced to pad
the input of the VAE and to mask that padding. In practice, that means the first characters of long tokens get updated
less than the others, so they necessarily contain more errors. <a href="#notes">More information</a>.</p>
|
|
|
<h2 id="upper-level">Upper level</h2> |
|
|
|
<p>In order to complete the experiment, we need to check that the language modelling
originally done directly at the BPE level can also be done with this new model-based BPE encoding.</p>
|
|
|
<p>It’s pretty slow to train that upper level, because we need to flow the
gradients all the way through the VAE decoder and the lower-level decoding
step in order to get the <strong>character-level loss</strong> (softmax + nll_loss) that is needed to properly train anything.
That’s a limit of the current approach.</p>
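<p>Schematically, the upper-level training loss looks like the following sketch. The
<code>byte_head</code> layer mapping character embeddings back to byte logits is an assumption of
mine (the actual decoding step may differ), as are all the names:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn.functional as F

def upper_loss(pred_token_emb, vae, byte_head, target_bytes, pad_id=0):
    """Byte-level NLL for the upper LM's predicted token embeddings.

    pred_token_emb: [B, EMBEDDING_DIM], output of the upper language model.
    byte_head: assumed linear layer from EMBEDDING_DIM to 256 byte logits.
    target_bytes: [B, CONTEXT_SIZE], original (padded) characters of each token.
    """
    chars = vae.decode(pred_token_emb)   # [B, CONTEXT_SIZE, EMBEDDING_DIM]
    logits = byte_head(chars)            # [B, CONTEXT_SIZE, 256]
    return F.nll_loss(
        F.log_softmax(logits, dim=-1).flatten(0, 1),
        target_bytes.flatten(),
        ignore_index=pad_id,             # mask the padding
    )
</code></pre></div></div>

<p>The gradients of this loss have to travel through <code>vae.decode</code> before reaching the upper model, which is what makes training slow.</p>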
|
|
|
<p>If we randomly split the text into train &amp; validation, we can learn the language model
on top of this model-based BPE almost perfectly (97% top-1 character-level accuracy).</p>
|
|
|
<p><img src="/narsil.github.io/images/models-2-overfit.png" /></p> |
|
|
|
<p>However, this can be considered <strong>overfitting</strong>: even though a specific input
from the validation set was never seen during training, a very close one <em>was</em>.</p>
|
|
|
<p>If instead we compare on a fixed split, where the last part of the book
is taken as the validation set, then we get much lower results.</p>
|
|
|
<p>We could achieve 25% exact character matching and ~77%
top-10 character matching on the validation set, which is the end of the book!
The same happens with regular BPE, only worse: we can’t get past 13% top-1 and 25% top-10
with it. That’s understandable because the dataset is very small and
the last part of the book is different, so it’s very hard to infer it from just the
beginning and no other text.</p>
|
|
|
<p>Another note is that model-based BPE does not tokenize deterministically; there
is some variance, depending on the context of a particular word.
This actually seems to be a good property (see <a href="https://arxiv.org/abs/1804.10959">this paper</a>), and it
might explain the better performance of model-based BPE over regular BPE.
Keep in mind it’s 25% of the <strong>characters</strong> that are correct;
if we looked at a discrete view of <strong>tokens</strong>, we would probably get a much higher prediction rate (that’s left for future work for now).</p>
|
|
|
<p>Here is a picture of the tensorboard values. P_1 is the probability that the
predicted character is the correct one; P_10 is the probability that it is in the top-10
predictions.</p>
|
|
|
<p><img src="/narsil.github.io/images/models-2-upper.png" /></p> |
|
|
|
<p>The overfitting starts happening around the ~1M-step mark.</p>
|
|
|
<h3 id="notes">Notes</h3> |
|
|
|
<ul> |
|
<li>In the experiment we learned model by model, freezing the lower model
before training something on top. That’s because the batching of the different
layers occurs differently, so learning the whole thing end-to-end is probably going
to need some thought. The batching is easy for the lower level: every batch
needs a tensor of CONTEXT_SIZE (=64) ints in [0-255]. For the VAE, we
need a variable length (depending on the token length) times EMBEDDING_DIM
(=128). The upper level only needs tensors of size CONTEXT_SIZE *
EMBEDDING_DIM, yet if we want to try end-to-end training, we have <strong>no
idea</strong> how many bytes are needed to generate 1 correct tensor in the upper layer.
We know it’s no more than CONTEXT_SIZE² but using that value would be prohibitive.</li>
|
<li>The loss NEEDS to always be the byte-level NLL loss (see the sketch after this list). At first I thought a
simple MSE loss in the embedding space could be enough to learn the proper
models. That seems to not be the case: I could only achieve meaningful results by
always referring back to the original strings and computing the NLL loss. When
using this loss, the MSE actually <em>increases</em>. This leads me to think that
encoding/decoding + softmax are highly anisotropic operators. Looking at the
singular values of the embedding matrix, we can see that the highest one is
7.35 and the lowest one is 0.12, so there are 2 orders of magnitude between the two.
This anisotropy means that the MSE loss, which treats all dimensions of the
embedding as equal, is actually counting some irrelevant dimensions way too much.
It would be much faster and simpler if we could train directly on MSE (it would
let us train without running all the decoding steps to compute the
loss), so we need to add some spectral loss on the embedding of the lower
language model to test that hypothesis.</li>
|
<li>The tokens have variable lengths. To deal with this, we have to pad all
sequences during learning. Because we pad, we have to mask the padding
during training for both the VAE and the upper LM. Keeping track of this is pretty
fiddly, and it means gradients at rarely used positions will rarely get updated. So
we will almost surely miss some letters in our tokens, either at the front or
at the end of the token, depending on how we pad.</li>
|
</ul> |
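<p>For the anisotropy claim in the second note, the check itself is short; a sketch, assuming
the lower byte-level model exposes its embedding matrix as <code>lm.embedding</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

emb = lm.embedding.weight.detach()  # [256, EMBEDDING_DIM] byte embeddings
_, s, _ = torch.svd(emb)            # singular values, largest first
print(s[0].item(), s[-1].item())    # our run: ~7.35 vs ~0.12
</code></pre></div></div>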
|
|
|
<h2 id="future-work"><strong>Future work</strong></h2> |
|
|
|
<ul> |
|
<li>Actually testing discretizing the tokens, to compare with regular BPE. In that direction,
also comparing with a randomized tokenizer as used in <a href="https://github.com/google/sentencepiece">SentencePiece</a>,
to make sure the results are actually comparable and are indeed linked to tokenization variance.</li>
|
<li>The masking problem really seems to be a current limit of the model. Finding a workaround would be really valuable.</li> |
|
<li>The fact that the NLL loss is required slows down the upper layers. It would be awesome if we could smooth out
the encoding/decoding matrix so that an L2 loss works directly for the VAE and the upper layer. That probably goes against regular
language model embeddings, so I’m not sure it’s doable.</li>
|
<li>Applying the epsilon-based tokenization directly after the embedding layer. This would help <em>stack</em> those levels, hopefully learning
higher and higher representations of text, leading to sentence embeddings and so on.</li>
|
<li>On the same idea, another direction would be to do actual discrete tokenization to allow the models to stack.</li>
|
</ul> |
|
|
|
</div><a class="u-url" href="/narsil.github.io/ml/nlp/2019/08/06/model-based-bpe-encodings-3.html" hidden></a> |
|
</article> |
|
</div> |
|
</main><footer class="site-footer h-card"> |
|
<data class="u-url" href="/narsil.github.io/"></data> |
|
|
|
<div class="wrapper"> |
|
|
|
<h2 class="footer-heading">Narsil</h2> |
|
|
|
<div class="footer-col-wrapper"> |
|
<div class="footer-col footer-col-1"> |
|
<ul class="contact-list"> |
|
<li class="p-name">Narsil</li></ul> |
|
</div> |
|
|
|
<div class="footer-col footer-col-2"><ul class="social-media-list"> |
|
<li><a href="https://github.com/Narsil"><svg class="social svg-icon"><use xlink:href="/narsil.github.io/assets/minima-social-icons.svg#github"></use></svg> <span class="username">Narsil</span></a></li><li><a href="https://www.twitter.com/narsilou"><svg class="social svg-icon"><use xlink:href="/narsil.github.io/assets/minima-social-icons.svg#twitter"></use></svg> <span class="username">narsilou</span></a></li></ul> |
|
</div> |
|
|
|
<div class="footer-col footer-col-3"> |
|
<p>Small experiements insights from ML and software development.</p> |
|
</div> |
|
</div> |
|
|
|
</div> |
|
|
|
</footer> |
|
</body> |
|
|
|
</html> |
|
|