# Show and Tell: A Neural Image Caption Generator

A TensorFlow implementation of the image-to-text model described in the paper:

"Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning
Challenge."

Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.

*IEEE Transactions on Pattern Analysis and Machine Intelligence (2016).*

Full text available at: http://arxiv.org/abs/1609.06647

## Contact

***Author:*** Chris Shallue

***Pull requests and issues:*** @cshallue

## Contents

* [Model Overview](#model-overview)
    * [Introduction](#introduction)
    * [Architecture](#architecture)
* [Getting Started](#getting-started)
    * [A Note on Hardware and Training Time](#a-note-on-hardware-and-training-time)
    * [Install Required Packages](#install-required-packages)
    * [Prepare the Training Data](#prepare-the-training-data)
    * [Download the Inception v3 Checkpoint](#download-the-inception-v3-checkpoint)
* [Training a Model](#training-a-model)
    * [Initial Training](#initial-training)
    * [Fine Tune the Inception v3 Model](#fine-tune-the-inception-v3-model)
* [Generating Captions](#generating-captions)

## Model Overview

### Introduction

The *Show and Tell* model is a deep neural network that learns how to describe
the content of images. For example:

![Example captions](g3doc/example_captions.jpg)

### Architecture

The *Show and Tell* model is an example of an *encoder-decoder* neural network.
It works by first "encoding" an image into a fixed-length vector representation,
and then "decoding" the representation into a natural language description.

The image encoder is a deep convolutional neural network. This type of network
is widely used for image tasks and is currently state-of-the-art for object
recognition and detection. Our particular choice of network is the
[*Inception v3*](http://arxiv.org/abs/1512.00567) image recognition model
pretrained on the
[ILSVRC-2012-CLS](http://www.image-net.org/challenges/LSVRC/2012/) image
classification dataset.

The decoder is a long short-term memory (LSTM) network. This type of network is
commonly used for sequence modeling tasks such as language modeling and machine
translation. In the *Show and Tell* model, the LSTM network is trained as a
language model conditioned on the image encoding.

Words in the captions are represented with an embedding model. Each word in the
vocabulary is associated with a fixed-length vector representation that is
learned during training.
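
For intuition, here is a minimal Keras-style sketch of how these three pieces
(CNN encoder, word embeddings, LSTM decoder) fit together. This is illustrative
only, not the repository's TF-Slim implementation; the layer sizes and tensor
names below are placeholder assumptions.

```python
import tensorflow as tf

# Placeholder sizes, not the repository's configuration.
vocab_size, embedding_size, lstm_units = 12000, 512, 512

# Encoder output: a fixed-length image feature vector from Inception v3
# (represented here as a plain Input), mapped by a single trainable dense
# layer into the word-embedding space.
inception_features = tf.keras.Input(shape=(2048,), name="inception_features")
image_embedding = tf.keras.layers.Dense(embedding_size)(inception_features)

# Decoder: an LSTM language model conditioned on the image embedding.
caption_ids = tf.keras.Input(shape=(None,), dtype="int32", name="caption_ids")
word_vectors = tf.keras.layers.Embedding(vocab_size, embedding_size)(caption_ids)

lstm = tf.keras.layers.LSTM(lstm_units, return_sequences=True, return_state=True)

# Feed the image embedding as a single "timestep" to produce the initial LSTM
# state, then run the same LSTM over the embedded caption words.
_, state_h, state_c = lstm(tf.keras.layers.Reshape((1, embedding_size))(image_embedding))
outputs, _, _ = lstm(word_vectors, initial_state=[state_h, state_c])

# Per-step logits over the vocabulary for the next word.
next_word_logits = tf.keras.layers.Dense(vocab_size)(outputs)

model = tf.keras.Model([inception_features, caption_ids], next_word_logits)
```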

The following diagram illustrates the model architecture.

![Show and Tell Architecture](g3doc/show_and_tell_architecture.png)

In this diagram, \{*s*<sub>0</sub>, *s*<sub>1</sub>, ..., *s*<sub>*N*-1</sub>\}
are the words of the caption and \{*w*<sub>*e*</sub>*s*<sub>0</sub>,
*w*<sub>*e*</sub>*s*<sub>1</sub>, ..., *w*<sub>*e*</sub>*s*<sub>*N*-1</sub>\}
are their corresponding word embedding vectors. The outputs \{*p*<sub>1</sub>,
*p*<sub>2</sub>, ..., *p*<sub>*N*</sub>\} of the LSTM are probability
distributions generated by the model for the next word in the sentence. The
terms \{log *p*<sub>1</sub>(*s*<sub>1</sub>),
log *p*<sub>2</sub>(*s*<sub>2</sub>), ...,
log *p*<sub>*N*</sub>(*s*<sub>*N*</sub>)\} are the log-likelihoods of the
correct word at each step; the negated sum of these terms is the minimization
objective of the model.
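
Written out, the training loss for an image *I* with caption words
*s*<sub>1</sub>, ..., *s*<sub>*N*</sub> is simply that negated sum:

```latex
% Loss for one image-caption pair: the negated sum of the per-step
% log-likelihoods of the correct words.
L(I, S) = -\sum_{t=1}^{N} \log p_t(s_t)
```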

During the first phase of training, the parameters of the *Inception v3* model
are kept fixed: it is simply a static image encoder function. A single trainable
layer is added on top of the *Inception v3* model to transform the image
embedding into the word embedding vector space. The model is trained with
respect to the parameters of the word embeddings, the parameters of the layer on
top of *Inception v3*, and the parameters of the LSTM. In the second phase of
training, all parameters, including those of *Inception v3*, are trained to
jointly fine-tune the image encoder and the LSTM.
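
In Keras terms, the two phases amount to freezing and then unfreezing the CNN,
as in the illustrative pattern below. The repository itself controls this with
the `--train_inception` flag rather than this API.

```python
import tensorflow as tf

# Illustrative two-phase pattern (Keras-style, not the repository's code).
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg",
                                        weights=None)

# Phase 1: the CNN is a static image encoder; only the new projection layer,
# the word embeddings, and the LSTM are trained.
cnn.trainable = False

# Phase 2: unfreeze the CNN so all parameters are fine-tuned jointly.
cnn.trainable = True
```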

Given a trained model and an image, we use *beam search* to generate captions
for that image. Captions are generated word-by-word, where at each step *t* we
use the set of sentences already generated with length *t* - 1 to generate a new
set of sentences with length *t*. We keep only the top *k* candidates at each
step, where the hyperparameter *k* is called the *beam size*. We have found the
best performance with *k* = 3.
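
Below is a minimal, framework-agnostic sketch of this beam search procedure.
The `log_prob_fn` callback and the toy vocabulary are illustrative stand-ins
for the LSTM's next-word distribution, not part of the repository's API.

```python
import heapq
import math

def beam_search(log_prob_fn, start_token, end_token, beam_size=3, max_len=20):
    """Generic beam search: log_prob_fn(prefix) -> {next_token: log_prob}."""
    # Each beam entry is (cumulative log-probability, list of tokens so far).
    beams = [(0.0, [start_token])]
    complete = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token, logp in log_prob_fn(prefix).items():
                candidate = (score + logp, prefix + [token])
                # Sentences that emit the end token are set aside as complete.
                (complete if token == end_token else candidates).append(candidate)
        if not candidates:
            break
        # Keep only the top `beam_size` partial sentences at each step.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return sorted(complete or beams, key=lambda c: c[0], reverse=True)

# Toy next-word distribution that ignores the prefix, just to make the sketch
# runnable; a real model conditions on the prefix and the image.
def toy_log_prob_fn(prefix):
    return {w: math.log(p) for w, p in
            {"a": 0.5, "surfer": 0.3, "</S>": 0.2}.items()}

print(beam_search(toy_log_prob_fn, "<S>", "</S>", beam_size=3, max_len=5)[0])
```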

## Getting Started

### A Note on Hardware and Training Time

The time required to train the *Show and Tell* model depends on your specific
hardware and computational capacity. In this guide we assume you will be running
training on a single machine with a GPU. In our experience on an NVIDIA Tesla
K20m GPU, the initial training phase takes 1-2 weeks. The second training phase
may take several additional weeks to achieve peak performance (but you can stop
this phase early and still get reasonable results).

It is possible to achieve a speed-up by implementing distributed training across
a cluster of machines with GPUs, but that is not covered in this guide.

Whilst it is possible to run this code on a CPU, beware that this may be
approximately 10 times slower.

### Install Required Packages

First ensure that you have installed the following required packages:

* **Bazel** ([instructions](http://bazel.io/docs/install.html))
* **Python 2.7**
* **TensorFlow** 1.0 or greater ([instructions](https://www.tensorflow.org/install/))
* **NumPy** ([instructions](http://www.scipy.org/install.html))
* **Natural Language Toolkit (NLTK)**:
    * First install NLTK ([instructions](http://www.nltk.org/install.html))
    * Then install the NLTK data package "punkt"
      ([instructions](http://www.nltk.org/data.html)); see the snippet after
      this list for one way to do it
* **Unzip**
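
With NLTK installed, the "punkt" data package can be fetched programmatically
using NLTK's standard downloader:

```python
# Download the "punkt" tokenizer models used to tokenize captions.
import nltk

nltk.download("punkt")
```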

### Prepare the Training Data

To train the model you will need to provide training data in native TFRecord
format. The TFRecord format consists of a set of sharded files containing
serialized `tf.SequenceExample` protocol buffers. Each `tf.SequenceExample`
proto contains an image (JPEG format), a caption and metadata such as the image
id.

Each caption is a list of words. During preprocessing, a dictionary is created
that assigns each word in the vocabulary to an integer-valued id. Each caption
is encoded as a list of integer word ids in the `tf.SequenceExample` protos.
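
For illustration, a single image-caption record could be assembled roughly as
follows. This is a sketch only; the feature keys shown here are assumptions,
so check the preprocessing script for the exact names it writes.

```python
import tensorflow as tf

def make_sequence_example(image_id, encoded_jpeg, caption_ids):
    """Builds one image-caption tf.SequenceExample (feature keys illustrative)."""
    # Context: per-image features (the image id and the raw JPEG bytes).
    context = tf.train.Features(feature={
        "image/image_id": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[image_id])),
        "image/data": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
    })
    # Feature lists: the caption as a variable-length sequence of word ids.
    feature_lists = tf.train.FeatureLists(feature_list={
        "image/caption_ids": tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=[word_id]))
            for word_id in caption_ids
        ]),
    })
    return tf.train.SequenceExample(context=context, feature_lists=feature_lists)

# Example: serialize one record (word ids here are arbitrary placeholders).
example = make_sequence_example(42, b"<jpeg bytes>", [1, 7, 23, 4, 2])
serialized = example.SerializeToString()
```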

We have provided a script to download and preprocess the
[MSCOCO](http://mscoco.org/) image captioning data set into this format.
Downloading and preprocessing the data may take several hours depending on your
network and computer speed. Please be patient.

Before running the script, ensure that your hard disk has at least 150GB of
available space for storing the downloaded and processed data.

```shell
# Location to save the MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"

# Build the preprocessing script.
cd research/im2txt
bazel build //im2txt:download_and_preprocess_mscoco

# Run the preprocessing script.
bazel-bin/im2txt/download_and_preprocess_mscoco "${MSCOCO_DIR}"
```

The final line of the output should read:

```
2016-09-01 16:47:47.296630: Finished processing all 20267 image-caption pairs in data set 'test'.
```

When the script finishes you will find 256 training, 4 validation and 8 testing
files in `${MSCOCO_DIR}`. The files will match the patterns
`train-?????-of-00256`, `val-?????-of-00004` and `test-?????-of-00008`,
respectively.
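
As a quick sanity check, a short snippet like the following counts the shards
(it assumes the `MSCOCO_DIR` location used above):

```python
import glob
import os

# Count the preprocessed TFRecord shards written by the script.
mscoco_dir = os.path.expanduser("~/im2txt/data/mscoco")
for pattern in ("train-?????-of-00256", "val-?????-of-00004", "test-?????-of-00008"):
    shards = glob.glob(os.path.join(mscoco_dir, pattern))
    print("%s: %d files" % (pattern, len(shards)))
```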

### Download the Inception v3 Checkpoint

The *Show and Tell* model requires a pretrained *Inception v3* checkpoint file
to initialize the parameters of its image encoder submodel.

This checkpoint file is provided by the
[TensorFlow-Slim image classification library](https://github.com/tensorflow/models/tree/master/research/slim#tensorflow-slim-image-classification-library)
which provides a suite of pre-trained image classification models. You can read
more about the models provided by the library
[here](https://github.com/tensorflow/models/tree/master/research/slim#pre-trained-models).

Run the following commands to download the *Inception v3* checkpoint.

```shell
# Location to save the Inception v3 checkpoint.
INCEPTION_DIR="${HOME}/im2txt/data"
mkdir -p ${INCEPTION_DIR}

wget "http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz"
tar -xvf "inception_v3_2016_08_28.tar.gz" -C ${INCEPTION_DIR}
rm "inception_v3_2016_08_28.tar.gz"
```
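
Optionally, you can verify the extracted checkpoint by listing a few of its
variables with TensorFlow's checkpoint reader (the path below assumes the
`INCEPTION_DIR` used above):

```python
import os
import tensorflow as tf

# Open the downloaded Inception v3 checkpoint with the TF 1.x reader and
# print a handful of variable names as a sanity check.
checkpoint_path = os.path.expanduser("~/im2txt/data/inception_v3.ckpt")
reader = tf.train.NewCheckpointReader(checkpoint_path)
for name in sorted(reader.get_variable_to_shape_map())[:5]:
    print(name)
```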

Note that the *Inception v3* checkpoint will only be used for initializing the
parameters of the *Show and Tell* model. Once the *Show and Tell* model starts
training it will save its own checkpoint files containing the values of all its
parameters (including copies of the *Inception v3* parameters). If training is
stopped and restarted, the parameter values will be restored from the latest
*Show and Tell* checkpoint and the *Inception v3* checkpoint will be ignored. In
other words, the *Inception v3* checkpoint is only used in the 0-th global step
(initialization) of training the *Show and Tell* model.

## Training a Model

### Initial Training

Run the training script.

```shell
# Directory containing preprocessed MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"

# Inception v3 checkpoint file.
INCEPTION_CHECKPOINT="${HOME}/im2txt/data/inception_v3.ckpt"

# Directory to save the model.
MODEL_DIR="${HOME}/im2txt/model"

# Build the model.
cd research/im2txt
bazel build -c opt //im2txt/...

# Run the training script.
bazel-bin/im2txt/train \
  --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
  --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" \
  --train_dir="${MODEL_DIR}/train" \
  --train_inception=false \
  --number_of_steps=1000000
```

Run the evaluation script in a separate process. This will log evaluation
metrics to TensorBoard, which allows training progress to be monitored in
real time.

Note that you may run out of memory if you run the evaluation script on the same
GPU as the training script. You can run the command
`export CUDA_VISIBLE_DEVICES=""` to force the evaluation script to run on the
CPU. If evaluation runs too slowly on the CPU, you can decrease the value of
`--num_eval_examples`.

```shell
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
MODEL_DIR="${HOME}/im2txt/model"

# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""

# Run the evaluation script. This will run in a loop, periodically loading the
# latest model checkpoint file and computing evaluation metrics.
bazel-bin/im2txt/evaluate \
  --input_file_pattern="${MSCOCO_DIR}/val-?????-of-00004" \
  --checkpoint_dir="${MODEL_DIR}/train" \
  --eval_dir="${MODEL_DIR}/eval"
```

Run a TensorBoard server in a separate process for real-time monitoring of
training progress and evaluation metrics.

```shell
MODEL_DIR="${HOME}/im2txt/model"

# Run a TensorBoard server.
tensorboard --logdir="${MODEL_DIR}"
```

### Fine Tune the Inception v3 Model

Your model will already be able to generate reasonable captions after the first
phase of training. Try it out! (See [Generating Captions](#generating-captions).)

You can further improve the performance of the model by running a second
training phase to jointly fine-tune the parameters of the *Inception v3* image
submodel and the LSTM.

```shell
# Restart the training script with --train_inception=true.
bazel-bin/im2txt/train \
  --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
  --train_dir="${MODEL_DIR}/train" \
  --train_inception=true \
  --number_of_steps=3000000  # Additional 2M steps (assuming 1M in initial training).
```

Note that training will proceed much more slowly now, and the model will
continue to improve by a small amount for a long time. We have found that it
will improve slowly for an additional 2-2.5 million steps before it begins to
overfit. This may take several weeks on a single GPU. If you don't care about
absolutely optimal performance, then feel free to halt training sooner by
stopping the training script or passing a smaller value to the flag
`--number_of_steps`. Your model will still work reasonably well.

## Generating Captions

Your trained *Show and Tell* model can generate captions for any JPEG image! The
following command line will generate captions for an image from the test set.

```shell
# Path to checkpoint file or a directory containing checkpoint files. Passing
# a directory will only work if there is also a file named 'checkpoint' which
# lists the available checkpoints in the directory. It will not work if you
# point to a directory with just a copy of a model checkpoint: in that case,
# you will need to pass the checkpoint path explicitly.
CHECKPOINT_PATH="${HOME}/im2txt/model/train"

# Vocabulary file generated by the preprocessing script.
VOCAB_FILE="${HOME}/im2txt/data/mscoco/word_counts.txt"

# JPEG image file to caption.
IMAGE_FILE="${HOME}/im2txt/data/mscoco/raw-data/val2014/COCO_val2014_000000224477.jpg"

# Build the inference binary.
cd research/im2txt
bazel build -c opt //im2txt:run_inference

# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""

# Run inference to generate captions.
bazel-bin/im2txt/run_inference \
  --checkpoint_path=${CHECKPOINT_PATH} \
  --vocab_file=${VOCAB_FILE} \
  --input_files=${IMAGE_FILE}
```

Example output:

```
Captions for image COCO_val2014_000000224477.jpg:
  0) a man riding a wave on top of a surfboard . (p=0.040413)
  1) a person riding a surf board on a wave (p=0.017452)
  2) a man riding a wave on a surfboard in the ocean . (p=0.005743)
```

Note: you may get different results. Some variation between different models is
expected.

Here is the image:

![Surfer](g3doc/COCO_val2014_000000224477.jpg)