<font size=4><b>Language Model on One Billion Word Benchmark</b></font>

<b>Authors:</b>
Oriol Vinyals (vinyals@google.com, github: OriolVinyals),
Xin Pan

<b>Paper Authors:</b>
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu

<b>TL;DR</b>
This is a pretrained model on the One Billion Word Benchmark.
If you use this model in your publication, please cite the original paper:
@article{jozefowicz2016exploring,
  title={Exploring the Limits of Language Modeling},
  author={Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike
          and Shazeer, Noam and Wu, Yonghui},
  journal={arXiv preprint arXiv:1602.02410},
  year={2016}
}

<b>Introduction</b>
In this release, we open source a model trained on the One Billion Word
Benchmark (http://arxiv.org/abs/1312.3005), a large English-language corpus
released in 2013. The dataset contains about one billion words and has a
vocabulary of about 800K words, drawn mostly from news data. Since sentences
in the training set are shuffled, models can ignore the context and focus on
sentence-level language modeling.
In the original release and subsequent work, models trained on this dataset
have been evaluated against the same held-out test set, making it a standard
benchmark for language modeling.
Recently, we wrote an article (http://arxiv.org/abs/1602.02410) describing a
hybrid model that combines a character-level CNN, a large and deep LSTM, and a
specific Softmax architecture, which allowed us to train the best model on
this dataset so far, almost halving the best perplexity previously reported by
others.
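
For orientation only, here is a rough tf.keras sketch of that overall shape: a
character-level CNN builds word representations, a deep LSTM runs over them,
and a word-level softmax predicts the next word. All sizes below (character
handling, layer widths, the small vocabulary) are illustrative assumptions and
do not reproduce the released configuration, which is defined by the GraphDef
and checkpoints in this release.

```python
import tensorflow as tf

# Illustrative sizes only; the released configuration lives in the
# GraphDef/checkpoint files, not in this sketch.
CHAR_VOCAB = 256        # assumed byte-level character vocabulary
MAX_WORD_LEN = 50       # assumed maximum characters per word
WORD_VOCAB = 10000      # the released model uses a ~800K-word vocabulary

# Input: character ids, shaped (batch, words_per_sentence, chars_per_word).
char_ids = tf.keras.Input(shape=(None, MAX_WORD_LEN), dtype="int32")

# Character CNN word encoder: embed characters, convolve over each word,
# then max-pool to a fixed-size word vector.
x = tf.keras.layers.Embedding(CHAR_VOCAB, 16)(char_ids)
x = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Conv1D(512, 5, activation="relu"))(x)
word_vecs = tf.keras.layers.TimeDistributed(
    tf.keras.layers.GlobalMaxPooling1D())(x)

# Large LSTM stack over the word representations.
h = tf.keras.layers.LSTM(1024, return_sequences=True)(word_vecs)
h = tf.keras.layers.LSTM(1024, return_sequences=True)(h)

# Word-level softmax over the next word at every position.
logits = tf.keras.layers.Dense(WORD_VOCAB)(h)
next_word_probs = tf.keras.layers.Softmax()(logits)

model = tf.keras.Model(char_ids, next_word_probs)
model.summary()
```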

<b>Code Release</b>
The open-sourced components include:

* TensorFlow GraphDef proto buffer text file.
* TensorFlow pre-trained checkpoint shards.
* Code used to evaluate the pre-trained model.
* Vocabulary file.
* Test set from LM-1B evaluation.

The code supports four evaluation modes:

* Given a provided dataset, calculate the model's perplexity (see the sketch
  after this list).
* Given a prefix sentence, predict the next words.
* Dump the softmax embedding and the character-level CNN word embeddings.
* Given a sentence, dump the embedding from the LSTM state.
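
The perplexity reported in eval mode is, in essence, the exponential of the
average negative log-probability the model assigns to each word of the
held-out text. Below is a minimal NumPy sketch of that definition; it is not
the evaluation script itself, which additionally handles batching, the
character inputs, and the 800K-word vocabulary.

```python
import numpy as np

def perplexity(word_log_probs):
    """word_log_probs: natural-log probabilities the model assigned to each
    target word in the evaluation text, e.g. [-2.3, -0.7, ...]."""
    return float(np.exp(-np.mean(word_log_probs)))

# A model that assigns probability 1/30 to every word has perplexity 30.
print(perplexity(np.log(np.full(1000, 1.0 / 30.0))))  # -> ~30.0
```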

<b>Results</b>

Model | Test Perplexity | Number of Params [billions]
------|-----------------|----------------------------
Sigmoid-RNN-2048 [Blackout] | 68.3 | 4.1
Interpolated KN 5-gram, 1.1B n-grams [chelba2013one] | 67.6 | 1.76
Sparse Non-Negative Matrix LM [shazeer2015sparse] | 52.9 | 33
RNN-1024 + MaxEnt 9-gram features [chelba2013one] | 51.3 | 20
LSTM-512-512 | 54.1 | 0.82
LSTM-1024-512 | 48.2 | 0.82
LSTM-2048-512 | 43.7 | 0.83
LSTM-8192-2048 (No Dropout) | 37.9 | 3.3
LSTM-8192-2048 (50% Dropout) | 32.2 | 3.3
2-Layer LSTM-8192-1024 (BIG LSTM) | 30.6 | 1.8
(THIS RELEASE) BIG LSTM+CNN Inputs | <b>30.0</b> | <b>1.04</b>

<b>How To Run</b>
Prerequisites:

* Install TensorFlow.
* Install Bazel.
* Download the data files (a download helper sketch follows this list):
  * Model GraphDef file:
    [link](http://download.tensorflow.org/models/LM_LSTM_CNN/graph-2016-09-10.pbtxt)
  * Model checkpoint shards (12 files):
    [1](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-base)
    [2](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-char-embedding)
    [3](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-lstm)
    [4](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax0)
    [5](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax1)
    [6](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax2)
    [7](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax3)
    [8](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax4)
    [9](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax5)
    [10](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax6)
    [11](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax7)
    [12](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax8)
  * Vocabulary file:
    [link](http://download.tensorflow.org/models/LM_LSTM_CNN/vocab-2016-09-10.txt)
  * Test dataset:
    [link](http://download.tensorflow.org/models/LM_LSTM_CNN/test/news.en.heldout-00000-of-00050)
* It is recommended to run on a modern desktop instead of a laptop.
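
If you prefer to script the download step, a small Python helper like the
following (not part of the release) fetches every file listed above into a
data/ directory. Note that the checkpoint shards are large; the model has
roughly a billion parameters.

```python
import os
import urllib.request

BASE = "http://download.tensorflow.org/models/LM_LSTM_CNN"
FILES = (
    ["graph-2016-09-10.pbtxt", "vocab-2016-09-10.txt",
     "test/news.en.heldout-00000-of-00050"]
    + ["all_shards-2016-09-10/ckpt-" + name
       for name in ["base", "char-embedding", "lstm"]]
    + ["all_shards-2016-09-10/ckpt-softmax%d" % i for i in range(9)]
)

os.makedirs("data", exist_ok=True)
for path in FILES:
    dest = os.path.join("data", os.path.basename(path))
    if not os.path.exists(dest):          # skip files already downloaded
        print("Downloading", path)
        urllib.request.urlretrieve(BASE + "/" + path, dest)
```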
```shell
# 1. Clone the code to your workspace.
# 2. Download the data to your workspace.
# 3. Create an empty WORKSPACE file in your workspace.
# 4. Create an empty output directory in your workspace.
# Example directory structure below:
$ ls -R
.:
data  lm_1b  output  WORKSPACE

./data:
ckpt-base            ckpt-lstm      ckpt-softmax1  ckpt-softmax3  ckpt-softmax5
ckpt-softmax7        graph-2016-09-10.pbtxt        vocab-2016-09-10.txt
ckpt-char-embedding  ckpt-softmax0  ckpt-softmax2  ckpt-softmax4  ckpt-softmax6
ckpt-softmax8        news.en.heldout-00000-of-00050

./lm_1b:
BUILD  data_utils.py  lm_1b_eval.py  README.md

./output:
# Build the code.
$ bazel build -c opt lm_1b/...

# Run sample mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode sample \
                             --prefix "I love that I" \
                             --pbtxt data/graph-2016-09-10.pbtxt \
                             --vocab_file data/vocab-2016-09-10.txt \
                             --ckpt 'data/ckpt-*'
...(omitted some TensorFlow output)
I love
I love that
I love that I
I love that I find
I love that I find that
I love that I find that amazing
...(omitted)

# Run eval mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode eval \
                             --pbtxt data/graph-2016-09-10.pbtxt \
                             --vocab_file data/vocab-2016-09-10.txt \
                             --input_data data/news.en.heldout-00000-of-00050 \
                             --ckpt 'data/ckpt-*'
...(omitted some TensorFlow output)
Loaded step 14108582.
# perplexity is high initially because words without context are harder to
# predict.
Eval Step: 0, Average Perplexity: 2045.512297.
Eval Step: 1, Average Perplexity: 229.478699.
Eval Step: 2, Average Perplexity: 208.116787.
Eval Step: 3, Average Perplexity: 338.870601.
Eval Step: 4, Average Perplexity: 228.950107.
Eval Step: 5, Average Perplexity: 197.685857.
Eval Step: 6, Average Perplexity: 156.287063.
Eval Step: 7, Average Perplexity: 124.866189.
Eval Step: 8, Average Perplexity: 147.204975.
Eval Step: 9, Average Perplexity: 90.124864.
Eval Step: 10, Average Perplexity: 59.897914.
Eval Step: 11, Average Perplexity: 42.591137.
...(omitted)
Eval Step: 4529, Average Perplexity: 29.243668.
Eval Step: 4530, Average Perplexity: 29.302362.
Eval Step: 4531, Average Perplexity: 29.285674.
...(omitted. At convergence, it should be around 30.)

# Run dump_emb mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode dump_emb \
                             --pbtxt data/graph-2016-09-10.pbtxt \
                             --vocab_file data/vocab-2016-09-10.txt \
                             --ckpt 'data/ckpt-*' \
                             --save_dir output
...(omitted some TensorFlow output)
Finished softmax weights
Finished word embedding 0/793471
Finished word embedding 1/793471
Finished word embedding 2/793471
...(omitted)
$ ls output/
embeddings_softmax.npy ...

# Run dump_lstm_emb mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode dump_lstm_emb \
                             --pbtxt data/graph-2016-09-10.pbtxt \
                             --vocab_file data/vocab-2016-09-10.txt \
                             --ckpt 'data/ckpt-*' \
                             --sentence "I love who I am ." \
                             --save_dir output
$ ls output/
lstm_emb_step_0.npy  lstm_emb_step_2.npy  lstm_emb_step_4.npy  lstm_emb_step_6.npy
lstm_emb_step_1.npy  lstm_emb_step_3.npy  lstm_emb_step_5.npy
```
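
After dump_emb and dump_lstm_emb have run, the resulting .npy files can be
inspected directly with NumPy. Here is a small sketch; the exact array shapes
depend on the dump, so treat the shape comments as assumptions rather than
documentation.

```python
import glob
import numpy as np

# Softmax embedding matrix written by dump_emb mode; presumably one row
# per word of the 793471-word vocabulary.
softmax_emb = np.load("output/embeddings_softmax.npy")
print("softmax embeddings:", softmax_emb.shape)

# Per-step LSTM embeddings written by dump_lstm_emb for the given sentence,
# one lstm_emb_step_<i>.npy file per step, sorted numerically by step index.
for path in sorted(glob.glob("output/lstm_emb_step_*.npy"),
                   key=lambda p: int(p.rsplit("_", 1)[1].split(".")[0])):
    vec = np.load(path)
    print(path, vec.shape)
```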