# Skip-Thought Vectors

This is a TensorFlow implementation of the model described in:

Jamie Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel,
Antonio Torralba, Raquel Urtasun, Sanja Fidler.
[Skip-Thought Vectors](https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf).
*In NIPS, 2015.*
## Contact
***Code author:*** Chris Shallue

***Pull requests and issues:*** @cshallue
## Contents
* [Model Overview](#model-overview)
* [Getting Started](#getting-started)
    * [Install Required Packages](#install-required-packages)
    * [Download Pretrained Models (Optional)](#download-pretrained-models-optional)
* [Training a Model](#training-a-model)
    * [Prepare the Training Data](#prepare-the-training-data)
    * [Run the Training Script](#run-the-training-script)
    * [Track Training Progress](#track-training-progress)
* [Expanding the Vocabulary](#expanding-the-vocabulary)
    * [Overview](#overview)
    * [Preparation](#preparation)
    * [Run the Vocabulary Expansion Script](#run-the-vocabulary-expansion-script)
* [Evaluating a Model](#evaluating-a-model)
    * [Overview](#overview-1)
    * [Preparation](#preparation-1)
    * [Run the Evaluation Tasks](#run-the-evaluation-tasks)
* [Encoding Sentences](#encoding-sentences)
## Model overview

The *Skip-Thoughts* model is a sentence encoder. It learns to encode input
sentences into a fixed-dimensional vector representation that is useful for many
tasks, such as detecting paraphrases or classifying whether a product review is
positive or negative. See the
[Skip-Thought Vectors](https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf)
paper for details of the model architecture and more example applications.

A trained *Skip-Thoughts* model encodes similar sentences close to each other
in the embedding vector space. The following examples show, for several sentences
from the
[movie review dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/),
the nearest neighbor by cosine similarity.

| Input sentence | Nearest Neighbor |
|----------------|------------------|
| Simplistic, silly and tedious. | Trite, banal, cliched, mostly inoffensive. |
| Not so much farcical as sour. | Not only unfunny, but downright repellent. |
| A sensitive and astute first feature by Anne-Sophie Birot. | Absorbing character study by André Turpin. |
| An enthralling, entertaining feature. | A slick, engrossing melodrama. |
## Getting Started

### Install Required Packages
First ensure that you have installed the following required packages:

* **Bazel** ([instructions](http://bazel.build/docs/install.html))
* **TensorFlow** ([instructions](https://www.tensorflow.org/install/))
* **NumPy** ([instructions](http://www.scipy.org/install.html))
* **scikit-learn** ([instructions](http://scikit-learn.org/stable/install.html))
* **Natural Language Toolkit (NLTK)**
    * First install NLTK ([instructions](http://www.nltk.org/install.html))
    * Then install the NLTK data ([instructions](http://www.nltk.org/data.html))
* **gensim** ([instructions](https://radimrehurek.com/gensim/install.html))
    * Only required if you will be expanding your vocabulary with the [word2vec](https://code.google.com/archive/p/word2vec/) model.
### Download Pretrained Models (Optional)
You can download model checkpoints pretrained on the
[BookCorpus](http://yknzhu.wixsite.com/mbweb) dataset in the following
configurations:

* Unidirectional RNN encoder ("uni-skip" in the paper)
* Bidirectional RNN encoder ("bi-skip" in the paper)

```shell
# Directory to download the pretrained models to.
PRETRAINED_MODELS_DIR="${HOME}/skip_thoughts/pretrained/"

mkdir -p "${PRETRAINED_MODELS_DIR}"
cd "${PRETRAINED_MODELS_DIR}"

# Download and extract the unidirectional model.
wget "http://download.tensorflow.org/models/skip_thoughts_uni_2017_02_02.tar.gz"
tar -xvf skip_thoughts_uni_2017_02_02.tar.gz
rm skip_thoughts_uni_2017_02_02.tar.gz

# Download and extract the bidirectional model.
wget "http://download.tensorflow.org/models/skip_thoughts_bi_2017_02_16.tar.gz"
tar -xvf skip_thoughts_bi_2017_02_16.tar.gz
rm skip_thoughts_bi_2017_02_16.tar.gz
```
You can now skip to the sections [Evaluating a Model](#evaluating-a-model) and
[Encoding Sentences](#encoding-sentences).
## Training a Model

### Prepare the Training Data
To train a model you will need to provide training data in TFRecord format. The
TFRecord format consists of a set of sharded files containing serialized
`tf.Example` protocol buffers. Each `tf.Example` proto contains three
sentences:

* `encode`: The sentence to encode.
* `decode_pre`: The sentence preceding `encode` in the original text.
* `decode_post`: The sentence following `encode` in the original text.

Each sentence is a list of words. During preprocessing, a dictionary is created
that assigns each word in the vocabulary to an integer-valued id. Each sentence
is encoded as a list of integer word ids in the `tf.Example` protos.

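To make the record layout concrete, the sketch below reads one serialized
`tf.Example` from a training shard and prints its three integer-id lists. It is
only illustrative: it assumes the features are stored under the names `encode`,
`decode_pre` and `decode_post` (matching the fields above), and that your
TensorFlow version supports `tf.data` with eager execution.

```python
import tensorflow as tf

# Read a single serialized tf.Example from one training shard.
dataset = tf.data.TFRecordDataset("train-00000-of-00100")
for serialized in dataset.take(1):
  example = tf.train.Example.FromString(serialized.numpy())
  # Each feature is a list of integer word ids.
  for name in ("encode", "decode_pre", "decode_post"):
    word_ids = example.features.feature[name].int64_list.value
    print(name, list(word_ids))
```
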
We have provided a script to preprocess any set of text files into this format.
You may wish to use the [BookCorpus](http://yknzhu.wixsite.com/mbweb) dataset.
Note that the preprocessing script may take **12 hours** or more to complete
on this large dataset.
```shell
# Comma-separated list of globs matching the input files. The format of the
# input files is assumed to be a list of newline-separated sentences, where
# each sentence is already tokenized.
INPUT_FILES="${HOME}/skip_thoughts/bookcorpus/*.txt"

# Location to save the preprocessed training and validation data.
DATA_DIR="${HOME}/skip_thoughts/data"

# Build the preprocessing script.
cd tensorflow-models/skip_thoughts
bazel build -c opt //skip_thoughts/data:preprocess_dataset

# Run the preprocessing script.
bazel-bin/skip_thoughts/data/preprocess_dataset \
  --input_files="${INPUT_FILES}" \
  --output_dir="${DATA_DIR}"
```
When the script finishes you will find 100 training files and 1 validation file
in `DATA_DIR`. The files will match the patterns `train-?????-of-00100` and
`validation-00000-of-00001` respectively.

The script will also produce a file named `vocab.txt`. The format of this file
is a list of newline-separated words, where the word id is the corresponding
0-based line index. Words are sorted in descending order of frequency in the
input data. Only the top 20,000 words are assigned unique ids; all other words
are assigned the "unknown id" of 1 in the processed data.

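As a quick illustration of this file format, the following sketch (using a
hypothetical `load_vocab` helper and a placeholder path) loads `vocab.txt` into
a word-to-id mapping and converts a tokenized sentence into word ids, mapping
out-of-vocabulary words to the unknown id 1:

```python
UNK_ID = 1  # Words outside the top 20,000 are mapped to this id.

def load_vocab(vocab_file):
  # The id of each word is its 0-based line index in vocab.txt.
  with open(vocab_file) as f:
    return {line.strip(): i for i, line in enumerate(f)}

vocab = load_vocab("/path/to/skip_thoughts/data/vocab.txt")
sentence = "an enthralling , entertaining feature .".split()
word_ids = [vocab.get(word, UNK_ID) for word in sentence]
```
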
### Run the Training Script
Execute the following commands to start the training script. By default it will
run for 500k steps (around 9 days on a GeForce GTX 1080 GPU).

```shell
# Directory containing the preprocessed data.
DATA_DIR="${HOME}/skip_thoughts/data"

# Directory to save the model.
MODEL_DIR="${HOME}/skip_thoughts/model"

# Build the model.
cd tensorflow-models/skip_thoughts
bazel build -c opt //skip_thoughts/...

# Run the training script.
bazel-bin/skip_thoughts/train \
  --input_file_pattern="${DATA_DIR}/train-?????-of-00100" \
  --train_dir="${MODEL_DIR}/train"
```
### Track Training Progress
Optionally, you can run the `track_perplexity` script in a separate process.
This will log per-word perplexity on the validation set, which allows training
progress to be monitored on
[TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard).

Note that you may run out of memory if you run this script on the same GPU as
the training script. You can set the environment variable
`CUDA_VISIBLE_DEVICES=""` to force the script to run on CPU. If it runs too
slowly on CPU, you can decrease the value of `--num_eval_examples`.
```shell
DATA_DIR="${HOME}/skip_thoughts/data"
MODEL_DIR="${HOME}/skip_thoughts/model"

# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""

# Run the evaluation script. This will run in a loop, periodically loading the
# latest model checkpoint file and computing evaluation metrics.
bazel-bin/skip_thoughts/track_perplexity \
  --input_file_pattern="${DATA_DIR}/validation-?????-of-00001" \
  --checkpoint_dir="${MODEL_DIR}/train" \
  --eval_dir="${MODEL_DIR}/val" \
  --num_eval_examples=50000
```
If you started the `track_perplexity` script, run a
[TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard)
server in a separate process for real-time monitoring of training summaries and
validation perplexity.

```shell
MODEL_DIR="${HOME}/skip_thoughts/model"

# Run a TensorBoard server.
tensorboard --logdir="${MODEL_DIR}"
```
## Expanding the Vocabulary

### Overview
The vocabulary generated by the preprocessing script contains only 20,000 words,
which is insufficient for many tasks. For example, a sentence from Wikipedia
might contain nouns that do not appear in this vocabulary.

A solution to this problem, described in the
[Skip-Thought Vectors](https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf)
paper, is to learn a mapping that transfers word representations from one model
to another. This idea is based on the "Translation Matrix" method from the paper
[Exploiting Similarities Among Languages for Machine Translation](https://arxiv.org/abs/1309.4168).

Specifically, we will load the word embeddings from a trained *Skip-Thoughts*
model and from a trained [word2vec model](https://arxiv.org/pdf/1301.3781.pdf)
(which has a much larger vocabulary). We will train a linear regression model
without regularization to learn a linear mapping from the word2vec embedding
space to the *Skip-Thoughts* embedding space. We will then apply the linear
model to all words in the word2vec vocabulary, yielding vectors in the
*Skip-Thoughts* word embedding space for the union of the two vocabularies.

The linear regression task is to learn a parameter matrix *W* to minimize
*|| X - Y \* W ||<sup>2</sup>*, where *X* is a matrix of *Skip-Thoughts*
embeddings of shape `[num_words, dim1]`, *Y* is a matrix of word2vec embeddings
of shape `[num_words, dim2]`, and *W* is a matrix of shape `[dim2, dim1]`.

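The sketch below spells out this regression with scikit-learn. It is a minimal
illustration only: the matrices `X` and `Y` are random placeholders standing in
for the aligned *Skip-Thoughts* and word2vec embeddings of the shared
vocabulary, and the dimensions are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder embedding matrices for the words common to both vocabularies.
num_words, dim1, dim2 = 10000, 620, 300
X = np.random.randn(num_words, dim1).astype(np.float32)  # Skip-Thoughts embeddings.
Y = np.random.randn(num_words, dim2).astype(np.float32)  # word2vec embeddings.

# Fit W of shape [dim2, dim1] to minimize ||X - Y * W||^2, with no regularization.
regression = LinearRegression(fit_intercept=False)
regression.fit(Y, X)
W = regression.coef_.T  # scikit-learn stores the coefficients as [dim1, dim2].

# Map any word2vec embedding into the Skip-Thoughts word embedding space.
expanded_embeddings = Y @ W
```
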
### Preparation
First you will need to download and unpack a pretrained
[word2vec model](https://arxiv.org/pdf/1301.3781.pdf) from
[this website](https://code.google.com/archive/p/word2vec/)
([direct download link](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)).
This model was trained on the Google News dataset (about 100 billion words).

Also ensure that you have already [installed gensim](https://radimrehurek.com/gensim/install.html).

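If you want to sanity-check the download, one quick way (assuming the file path
below points at the unpacked model) is to load it with gensim's word2vec reader:

```python
from gensim.models import KeyedVectors

# Load the pretrained GoogleNews vectors in binary word2vec format.
word2vec = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
print(word2vec.vector_size)  # Embedding dimensionality (300 for this model).
```
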
### Run the Vocabulary Expansion Script

```shell
# Path to checkpoint file or a directory containing checkpoint files (the script
# will select the most recent).
CHECKPOINT_PATH="${HOME}/skip_thoughts/model/train"

# Vocabulary file generated by the preprocessing script.
SKIP_THOUGHTS_VOCAB="${HOME}/skip_thoughts/data/vocab.txt"

# Path to the downloaded word2vec model.
WORD2VEC_MODEL="${HOME}/skip_thoughts/googlenews/GoogleNews-vectors-negative300.bin"

# Output directory.
EXP_VOCAB_DIR="${HOME}/skip_thoughts/exp_vocab"

# Build the vocabulary expansion script.
cd tensorflow-models/skip_thoughts
bazel build -c opt //skip_thoughts:vocabulary_expansion

# Run the vocabulary expansion script.
bazel-bin/skip_thoughts/vocabulary_expansion \
  --skip_thoughts_model="${CHECKPOINT_PATH}" \
  --skip_thoughts_vocab="${SKIP_THOUGHTS_VOCAB}" \
  --word2vec_model="${WORD2VEC_MODEL}" \
  --output_dir="${EXP_VOCAB_DIR}"
```
## Evaluating a Model

### Overview
The model can be evaluated using the benchmark tasks described in the
[Skip-Thought Vectors](https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf)
paper. The following tasks are supported (refer to the paper for full details):

* **SICK** semantic relatedness task.
* **MSRP** (Microsoft Research Paraphrase Corpus) paraphrase detection task.
* Binary classification tasks:
    * **MR** movie review sentiment task.
    * **CR** customer product review task.
    * **SUBJ** subjectivity/objectivity task.
    * **MPQA** opinion polarity task.
* **TREC** question-type classification task.
### Preparation
You will need to clone or download the
[skip-thoughts GitHub repository](https://github.com/ryankiros/skip-thoughts) by
[ryankiros](https://github.com/ryankiros) (the first author of the Skip-Thoughts
paper):

```shell
# Folder to clone the repository to.
ST_KIROS_DIR="${HOME}/skip_thoughts/skipthoughts_kiros"

# Clone the repository.
git clone git@github.com:ryankiros/skip-thoughts.git "${ST_KIROS_DIR}/skipthoughts"

# Make the package importable.
export PYTHONPATH="${ST_KIROS_DIR}/:${PYTHONPATH}"
```

You will also need to download the data needed for each evaluation task. See the
instructions [here](https://github.com/ryankiros/skip-thoughts).

For example, the CR (customer review) dataset is found
[here](http://nlp.stanford.edu/~sidaw/home/projects:nbsvm). For this task we want
the files `custrev.pos` and `custrev.neg`.

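For orientation (this is not part of the evaluation script itself), here is a
minimal sketch of what that data looks like once downloaded, assuming each file
contains one review per line and using a placeholder directory path:

```python
import os

EVAL_DATA_DIR = "/path/to/eval_data"  # Directory holding custrev.pos and custrev.neg.

sentences, labels = [], []
for filename, label in [("custrev.pos", 1), ("custrev.neg", 0)]:
  with open(os.path.join(EVAL_DATA_DIR, filename)) as f:
    for line in f:
      sentences.append(line.strip())
      labels.append(label)

print(len(sentences), "labelled reviews")
```
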
### Run the Evaluation Tasks
In the following example we will evaluate a unidirectional model ("uni-skip" in
the paper) on the CR task. To use a bidirectional model ("bi-skip" in the
paper), simply pass the flags `--bi_vocab_file`, `--bi_embeddings_file` and
`--bi_checkpoint_path` instead. To use the "combine-skip" model described in the
paper you will need to pass both the unidirectional and bidirectional flags.

```shell
# Path to checkpoint file or a directory containing checkpoint files (the script
# will select the most recent).
CHECKPOINT_PATH="${HOME}/skip_thoughts/model/train"

# Vocabulary file generated by the vocabulary expansion script.
VOCAB_FILE="${HOME}/skip_thoughts/exp_vocab/vocab.txt"

# Embeddings file generated by the vocabulary expansion script.
EMBEDDINGS_FILE="${HOME}/skip_thoughts/exp_vocab/embeddings.npy"

# Directory containing files custrev.pos and custrev.neg.
EVAL_DATA_DIR="${HOME}/skip_thoughts/eval_data"

# Build the evaluation script.
cd tensorflow-models/skip_thoughts
bazel build -c opt //skip_thoughts:evaluate

# Run the evaluation script.
bazel-bin/skip_thoughts/evaluate \
  --eval_task=CR \
  --data_dir="${EVAL_DATA_DIR}" \
  --uni_vocab_file="${VOCAB_FILE}" \
  --uni_embeddings_file="${EMBEDDINGS_FILE}" \
  --uni_checkpoint_path="${CHECKPOINT_PATH}"
```
Output:

```python
[0.82539682539682535, 0.84084880636604775, 0.83023872679045096,
 0.86206896551724133, 0.83554376657824936, 0.85676392572944293,
 0.84084880636604775, 0.83023872679045096, 0.85145888594164454,
 0.82758620689655171]
```

The output is a list of accuracies of 10 cross-validation classification models.
To get a single number, simply take the average:
```python
ipython  # Launch iPython.

In [0]:
import numpy as np
np.mean([0.82539682539682535, 0.84084880636604775, 0.83023872679045096,
         0.86206896551724133, 0.83554376657824936, 0.85676392572944293,
         0.84084880636604775, 0.83023872679045096, 0.85145888594164454,
         0.82758620689655171])
Out [0]: 0.84009936423729525
```
## Encoding Sentences

In this example we will encode data from the
[movie review dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/)
(specifically the [sentence polarity dataset v1.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz)).
```python
ipython  # Launch iPython.

In [0]:
# Imports.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import os.path
import scipy.spatial.distance as sd
from skip_thoughts import configuration
from skip_thoughts import encoder_manager

In [1]:
# Set paths to the model.
VOCAB_FILE = "/path/to/vocab.txt"
EMBEDDING_MATRIX_FILE = "/path/to/embeddings.npy"
CHECKPOINT_PATH = "/path/to/model.ckpt-9999"
# The following directory should contain files rt-polarity.neg and
# rt-polarity.pos.
MR_DATA_DIR = "/dir/containing/mr/data"

In [2]:
# Set up the encoder. Here we are using a single unidirectional model.
# To use a bidirectional model as well, call load_model() again with
# configuration.model_config(bidirectional_encoder=True) and paths to the
# bidirectional model's files. The encoder will use the concatenation of
# all loaded models.
encoder = encoder_manager.EncoderManager()
encoder.load_model(configuration.model_config(),
                   vocabulary_file=VOCAB_FILE,
                   embedding_matrix_file=EMBEDDING_MATRIX_FILE,
                   checkpoint_path=CHECKPOINT_PATH)

In [3]:
# Load the movie review dataset.
data = []
with open(os.path.join(MR_DATA_DIR, 'rt-polarity.neg'), 'rb') as f:
  data.extend([line.decode('latin-1').strip() for line in f])
with open(os.path.join(MR_DATA_DIR, 'rt-polarity.pos'), 'rb') as f:
  data.extend([line.decode('latin-1').strip() for line in f])

In [4]:
# Generate Skip-Thought Vectors for each sentence in the dataset.
encodings = encoder.encode(data)

In [5]:
# Define a helper function to generate nearest neighbors.
def get_nn(ind, num=10):
  encoding = encodings[ind]
  scores = sd.cdist([encoding], encodings, "cosine")[0]
  sorted_ids = np.argsort(scores)
  print("Sentence:")
  print("", data[ind])
  print("\nNearest neighbors:")
  for i in range(1, num + 1):
    print(" %d. %s (%.3f)" %
          (i, data[sorted_ids[i]], scores[sorted_ids[i]]))

In [6]:
# Compute nearest neighbors of the first sentence in the dataset.
get_nn(0)
```
Output:

```
Sentence:
 simplistic , silly and tedious .

Nearest neighbors:
 1. trite , banal , cliched , mostly inoffensive . (0.247)
 2. banal and predictable . (0.253)
 3. witless , pointless , tasteless and idiotic . (0.272)
 4. loud , silly , stupid and pointless . (0.295)
 5. grating and tedious . (0.299)
 6. idiotic and ugly . (0.330)
 7. black-and-white and unrealistic . (0.335)
 8. hopelessly inane , humorless and under-inspired . (0.335)
 9. shallow , noisy and pretentious . (0.340)
 10. . . . unlikable , uninteresting , unfunny , and completely , utterly inept . (0.346)
```