# Show and Tell: A Neural Image Caption Generator

A TensorFlow implementation of the image-to-text model described in the paper:

"Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning
Challenge."

Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.

*IEEE Transactions on Pattern Analysis and Machine Intelligence (2016).*

Full text available at: http://arxiv.org/abs/1609.06647

## Contact

***Author:*** Chris Shallue

***Pull requests and issues:*** @cshallue

## Contents

* [Model Overview](#model-overview)
    * [Introduction](#introduction)
    * [Architecture](#architecture)
* [Getting Started](#getting-started)
    * [A Note on Hardware and Training Time](#a-note-on-hardware-and-training-time)
    * [Install Required Packages](#install-required-packages)
    * [Prepare the Training Data](#prepare-the-training-data)
    * [Download the Inception v3 Checkpoint](#download-the-inception-v3-checkpoint)
* [Training a Model](#training-a-model)
    * [Initial Training](#initial-training)
    * [Fine Tune the Inception v3 Model](#fine-tune-the-inception-v3-model)
* [Generating Captions](#generating-captions)

## Model Overview

### Introduction

The *Show and Tell* model is a deep neural network that learns how to describe
the content of images. For example:

![Example captions](g3doc/example_captions.jpg)

### Architecture

The *Show and Tell* model is an example of an *encoder-decoder* neural network.
It works by first "encoding" an image into a fixed-length vector representation,
and then "decoding" the representation into a natural language description.

The image encoder is a deep convolutional neural network. This type of network
is widely used for image tasks and is currently state-of-the-art for object
recognition and detection. Our particular choice of network is the
[*Inception v3*](http://arxiv.org/abs/1512.00567) image recognition model
pretrained on the
[ILSVRC-2012-CLS](http://www.image-net.org/challenges/LSVRC/2012/) image
classification dataset.

The decoder is a long short-term memory (LSTM) network. This type of network is
commonly used for sequence modeling tasks such as language modeling and machine
translation. In the *Show and Tell* model, the LSTM network is trained as a
language model conditioned on the image encoding.

Words in the captions are represented with an embedding model. Each word in the
vocabulary is associated with a fixed-length vector representation that is
learned during training.
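
For intuition, here is a minimal Keras-style sketch of how these three pieces
(CNN encoder, word embeddings, LSTM decoder) fit together. This is illustrative
only, not the repository's TF-Slim implementation; the layer sizes and tensor
names below are placeholder assumptions.

```python
import tensorflow as tf

# Placeholder sizes, not the repository's configuration.
vocab_size, embedding_size, lstm_units = 12000, 512, 512

# Encoder output: a fixed-length image feature vector from Inception v3
# (represented here as a plain Input), mapped by a single trainable dense
# layer into the word-embedding space.
inception_features = tf.keras.Input(shape=(2048,), name="inception_features")
image_embedding = tf.keras.layers.Dense(embedding_size)(inception_features)

# Decoder: an LSTM language model conditioned on the image embedding.
caption_ids = tf.keras.Input(shape=(None,), dtype="int32", name="caption_ids")
word_vectors = tf.keras.layers.Embedding(vocab_size, embedding_size)(caption_ids)

lstm = tf.keras.layers.LSTM(lstm_units, return_sequences=True, return_state=True)

# Feed the image embedding as a single "timestep" to produce the initial LSTM
# state, then run the same LSTM over the embedded caption words.
_, state_h, state_c = lstm(tf.keras.layers.Reshape((1, embedding_size))(image_embedding))
outputs, _, _ = lstm(word_vectors, initial_state=[state_h, state_c])

# Per-step logits over the vocabulary for the next word.
next_word_logits = tf.keras.layers.Dense(vocab_size)(outputs)

model = tf.keras.Model([inception_features, caption_ids], next_word_logits)
```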

The following diagram illustrates the model architecture.

![Show and Tell Architecture](g3doc/show_and_tell_architecture.png)

In this diagram, \{*s*<sub>0</sub>, *s*<sub>1</sub>, ..., *s*<sub>*N*-1</sub>\}
are the words of the caption and \{*w*<sub>*e*</sub>*s*<sub>0</sub>,
*w*<sub>*e*</sub>*s*<sub>1</sub>, ..., *w*<sub>*e*</sub>*s*<sub>*N*-1</sub>\}
are their corresponding word embedding vectors. The outputs \{*p*<sub>1</sub>,
*p*<sub>2</sub>, ..., *p*<sub>*N*</sub>\} of the LSTM are probability
distributions generated by the model for the next word in the sentence. The
terms \{log *p*<sub>1</sub>(*s*<sub>1</sub>),
log *p*<sub>2</sub>(*s*<sub>2</sub>), ...,
log *p*<sub>*N*</sub>(*s*<sub>*N*</sub>)\} are the log-likelihoods of the
correct word at each step; the negated sum of these terms is the minimization
objective of the model.
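
Written out, the training loss for an image *I* with caption words
*s*<sub>1</sub>, ..., *s*<sub>*N*</sub> is simply that negated sum:

```latex
% Loss for one image-caption pair: the negated sum of the per-step
% log-likelihoods of the correct words.
L(I, S) = -\sum_{t=1}^{N} \log p_t(s_t)
```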

During the first phase of training, the parameters of the *Inception v3* model
are kept fixed: it is simply a static image encoder function. A single trainable
layer is added on top of the *Inception v3* model to transform the image
embedding into the word embedding vector space. The model is trained with
respect to the parameters of the word embeddings, the parameters of the layer on
top of *Inception v3*, and the parameters of the LSTM. In the second phase of
training, all parameters, including those of *Inception v3*, are trained to
jointly fine-tune the image encoder and the LSTM.
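
In Keras terms, the two phases amount to freezing and then unfreezing the CNN,
as in the illustrative pattern below. The repository itself controls this with
the `--train_inception` flag rather than this API.

```python
import tensorflow as tf

# Illustrative two-phase pattern (Keras-style, not the repository's code).
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg",
                                        weights=None)

# Phase 1: the CNN is a static image encoder; only the new projection layer,
# the word embeddings, and the LSTM are trained.
cnn.trainable = False

# Phase 2: unfreeze the CNN so all parameters are fine-tuned jointly.
cnn.trainable = True
```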

Given a trained model and an image, we use *beam search* to generate captions
for that image. Captions are generated word-by-word, where at each step *t* we
use the set of sentences already generated with length *t* - 1 to generate a new
set of sentences with length *t*. We keep only the top *k* candidates at each
step, where the hyperparameter *k* is called the *beam size*. We have found the
best performance with *k* = 3.
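
Below is a minimal, framework-agnostic sketch of this beam search procedure.
The `log_prob_fn` callback and the toy vocabulary are illustrative stand-ins
for the LSTM's next-word distribution, not part of the repository's API.

```python
import heapq
import math

def beam_search(log_prob_fn, start_token, end_token, beam_size=3, max_len=20):
    """Generic beam search: log_prob_fn(prefix) -> {next_token: log_prob}."""
    # Each beam entry is (cumulative log-probability, list of tokens so far).
    beams = [(0.0, [start_token])]
    complete = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token, logp in log_prob_fn(prefix).items():
                candidate = (score + logp, prefix + [token])
                # Sentences that emit the end token are set aside as complete.
                (complete if token == end_token else candidates).append(candidate)
        if not candidates:
            break
        # Keep only the top `beam_size` partial sentences at each step.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return sorted(complete or beams, key=lambda c: c[0], reverse=True)

# Toy next-word distribution that ignores the prefix, just to make the sketch
# runnable; a real model conditions on the prefix and the image.
def toy_log_prob_fn(prefix):
    return {w: math.log(p) for w, p in
            {"a": 0.5, "surfer": 0.3, "</S>": 0.2}.items()}

print(beam_search(toy_log_prob_fn, "<S>", "</S>", beam_size=3, max_len=5)[0])
```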

## Getting Started

### A Note on Hardware and Training Time

The time required to train the *Show and Tell* model depends on your specific
hardware and computational capacity. In this guide we assume you will be running
training on a single machine with a GPU. In our experience on an NVIDIA Tesla
K20m GPU, the initial training phase takes 1-2 weeks. The second training phase
may take several additional weeks to achieve peak performance (but you can stop
this phase early and still get reasonable results).

It is possible to achieve a speed-up by implementing distributed training across
a cluster of machines with GPUs, but that is not covered in this guide.

Whilst it is possible to run this code on a CPU, beware that this may be
approximately 10 times slower.

### Install Required Packages

First ensure that you have installed the following required packages:

* **Bazel** ([instructions](http://bazel.io/docs/install.html))
* **Python 2.7**
* **TensorFlow** 1.0 or greater ([instructions](https://www.tensorflow.org/install/))
* **NumPy** ([instructions](http://www.scipy.org/install.html))
* **Natural Language Toolkit (NLTK)**:
    * First install NLTK ([instructions](http://www.nltk.org/install.html))
    * Then install the NLTK data package "punkt"
      ([instructions](http://www.nltk.org/data.html)); see the snippet after
      this list for one way to do it
* **Unzip**
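
With NLTK installed, the "punkt" data package can be fetched programmatically
using NLTK's standard downloader:

```python
# Download the "punkt" tokenizer models used to tokenize captions.
import nltk

nltk.download("punkt")
```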

### Prepare the Training Data

To train the model you will need to provide training data in native TFRecord
format. The TFRecord format consists of a set of sharded files containing
serialized `tf.SequenceExample` protocol buffers. Each `tf.SequenceExample`
proto contains an image (JPEG format), a caption and metadata such as the image
id.

Each caption is a list of words. During preprocessing, a dictionary is created
that assigns each word in the vocabulary to an integer-valued id. Each caption
is encoded as a list of integer word ids in the `tf.SequenceExample` protos.
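
For illustration, a single image-caption record could be assembled roughly as
follows. This is a sketch only; the feature keys shown here are assumptions,
so check the preprocessing script for the exact names it writes.

```python
import tensorflow as tf

def make_sequence_example(image_id, encoded_jpeg, caption_ids):
    """Builds one image-caption tf.SequenceExample (feature keys illustrative)."""
    # Context: per-image features (the image id and the raw JPEG bytes).
    context = tf.train.Features(feature={
        "image/image_id": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[image_id])),
        "image/data": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
    })
    # Feature lists: the caption as a variable-length sequence of word ids.
    feature_lists = tf.train.FeatureLists(feature_list={
        "image/caption_ids": tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=[word_id]))
            for word_id in caption_ids
        ]),
    })
    return tf.train.SequenceExample(context=context, feature_lists=feature_lists)

# Example: serialize one record (word ids here are arbitrary placeholders).
example = make_sequence_example(42, b"<jpeg bytes>", [1, 7, 23, 4, 2])
serialized = example.SerializeToString()
```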

We have provided a script to download and preprocess the
[MSCOCO](http://mscoco.org/) image captioning data set into this format.
Downloading and preprocessing the data may take several hours depending on your
network and computer speed. Please be patient.

Before running the script, ensure that your hard disk has at least 150GB of
available space for storing the downloaded and processed data.

```shell
# Location to save the MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"

# Build the preprocessing script.
cd research/im2txt
bazel build //im2txt:download_and_preprocess_mscoco

# Run the preprocessing script.
bazel-bin/im2txt/download_and_preprocess_mscoco "${MSCOCO_DIR}"
```

The final line of the output should read:

```
2016-09-01 16:47:47.296630: Finished processing all 20267 image-caption pairs in data set 'test'.
```

When the script finishes you will find 256 training, 4 validation and 8 testing
files in `${MSCOCO_DIR}`. The files will match the patterns
`train-?????-of-00256`, `val-?????-of-00004` and `test-?????-of-00008`,
respectively.
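
As a quick sanity check, a short snippet like the following counts the shards
(it assumes the `MSCOCO_DIR` location used above):

```python
import glob
import os

# Count the preprocessed TFRecord shards written by the script.
mscoco_dir = os.path.expanduser("~/im2txt/data/mscoco")
for pattern in ("train-?????-of-00256", "val-?????-of-00004", "test-?????-of-00008"):
    shards = glob.glob(os.path.join(mscoco_dir, pattern))
    print("%s: %d files" % (pattern, len(shards)))
```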

### Download the Inception v3 Checkpoint

The *Show and Tell* model requires a pretrained *Inception v3* checkpoint file
to initialize the parameters of its image encoder submodel.

This checkpoint file is provided by the
[TensorFlow-Slim image classification library](https://github.com/tensorflow/models/tree/master/research/slim#tensorflow-slim-image-classification-library)
which provides a suite of pre-trained image classification models. You can read
more about the models provided by the library
[here](https://github.com/tensorflow/models/tree/master/research/slim#pre-trained-models).

Run the following commands to download the *Inception v3* checkpoint.

```shell
# Location to save the Inception v3 checkpoint.
INCEPTION_DIR="${HOME}/im2txt/data"
mkdir -p ${INCEPTION_DIR}

wget "http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz"
tar -xvf "inception_v3_2016_08_28.tar.gz" -C ${INCEPTION_DIR}
rm "inception_v3_2016_08_28.tar.gz"
```
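
Optionally, you can verify the extracted checkpoint by listing a few of its
variables with TensorFlow's checkpoint reader (the path below assumes the
`INCEPTION_DIR` used above):

```python
import os
import tensorflow as tf

# Open the downloaded Inception v3 checkpoint with the TF 1.x reader and
# print a handful of variable names as a sanity check.
checkpoint_path = os.path.expanduser("~/im2txt/data/inception_v3.ckpt")
reader = tf.train.NewCheckpointReader(checkpoint_path)
for name in sorted(reader.get_variable_to_shape_map())[:5]:
    print(name)
```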

Note that the *Inception v3* checkpoint will only be used for initializing the
parameters of the *Show and Tell* model. Once the *Show and Tell* model starts
training it will save its own checkpoint files containing the values of all its
parameters (including copies of the *Inception v3* parameters). If training is
stopped and restarted, the parameter values will be restored from the latest
*Show and Tell* checkpoint and the *Inception v3* checkpoint will be ignored. In
other words, the *Inception v3* checkpoint is only used in the 0-th global step
(initialization) of training the *Show and Tell* model.

## Training a Model

### Initial Training

Run the training script.

```shell
# Directory containing preprocessed MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"

# Inception v3 checkpoint file.
INCEPTION_CHECKPOINT="${HOME}/im2txt/data/inception_v3.ckpt"

# Directory to save the model.
MODEL_DIR="${HOME}/im2txt/model"

# Build the model.
cd research/im2txt
bazel build -c opt //im2txt/...

# Run the training script.
bazel-bin/im2txt/train \
  --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
  --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" \
  --train_dir="${MODEL_DIR}/train" \
  --train_inception=false \
  --number_of_steps=1000000
```

Run the evaluation script in a separate process. This will log evaluation
metrics to TensorBoard, which allows training progress to be monitored in
real time.

Note that you may run out of memory if you run the evaluation script on the same
GPU as the training script. You can run the command
`export CUDA_VISIBLE_DEVICES=""` to force the evaluation script to run on the
CPU. If evaluation runs too slowly on the CPU, you can decrease the value of
`--num_eval_examples`.

```shell
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
MODEL_DIR="${HOME}/im2txt/model"

# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""

# Run the evaluation script. This will run in a loop, periodically loading the
# latest model checkpoint file and computing evaluation metrics.
bazel-bin/im2txt/evaluate \
  --input_file_pattern="${MSCOCO_DIR}/val-?????-of-00004" \
  --checkpoint_dir="${MODEL_DIR}/train" \
  --eval_dir="${MODEL_DIR}/eval"
```

Run a TensorBoard server in a separate process for real-time monitoring of
training progress and evaluation metrics.

```shell
MODEL_DIR="${HOME}/im2txt/model"

# Run a TensorBoard server.
tensorboard --logdir="${MODEL_DIR}"
```

### Fine Tune the Inception v3 Model

Your model will already be able to generate reasonable captions after the first
phase of training. Try it out! (See [Generating Captions](#generating-captions).)

You can further improve the performance of the model by running a second
training phase to jointly fine-tune the parameters of the *Inception v3* image
submodel and the LSTM.

```shell
# Restart the training script with --train_inception=true.
bazel-bin/im2txt/train \
  --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
  --train_dir="${MODEL_DIR}/train" \
  --train_inception=true \
  --number_of_steps=3000000  # Additional 2M steps (assuming 1M in initial training).
```

Note that training will proceed much more slowly now, and the model will
continue to improve by a small amount for a long time. We have found that it
will improve slowly for an additional 2-2.5 million steps before it begins to
overfit. This may take several weeks on a single GPU. If you don't care about
absolutely optimal performance, then feel free to halt training sooner by
stopping the training script or passing a smaller value to the flag
`--number_of_steps`. Your model will still work reasonably well.

## Generating Captions

Your trained *Show and Tell* model can generate captions for any JPEG image! The
following command line will generate captions for an image from the test set.

```shell
# Path to checkpoint file or a directory containing checkpoint files. Passing
# a directory will only work if there is also a file named 'checkpoint' which
# lists the available checkpoints in the directory. It will not work if you
# point to a directory with just a copy of a model checkpoint: in that case,
# you will need to pass the checkpoint path explicitly.
CHECKPOINT_PATH="${HOME}/im2txt/model/train"

# Vocabulary file generated by the preprocessing script.
VOCAB_FILE="${HOME}/im2txt/data/mscoco/word_counts.txt"

# JPEG image file to caption.
IMAGE_FILE="${HOME}/im2txt/data/mscoco/raw-data/val2014/COCO_val2014_000000224477.jpg"

# Build the inference binary.
cd research/im2txt
bazel build -c opt //im2txt:run_inference

# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""

# Run inference to generate captions.
bazel-bin/im2txt/run_inference \
  --checkpoint_path=${CHECKPOINT_PATH} \
  --vocab_file=${VOCAB_FILE} \
  --input_files=${IMAGE_FILE}
```

Example output:

```
Captions for image COCO_val2014_000000224477.jpg:
  0) a man riding a wave on top of a surfboard . (p=0.040413)
  1) a person riding a surf board on a wave (p=0.017452)
  2) a man riding a wave on a surfboard in the ocean . (p=0.005743)
```

Note: you may get different results. Some variation between different models is
expected.

Here is the image:

![Surfer](g3doc/COCO_val2014_000000224477.jpg)