# LexNET for Noun Compound Relation Classification

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for classifying relationships, specifically applied to classifying the
relationships that hold between noun compounds:

* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor

The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:

1. A neural "paraphrase" of each syntactic dependency path that connects the
   constituents in a large corpus. For example, given a sentence like *This fine
   oil is made from first-press olives*, the dependency path is something like
   `oil <NSUBJPASS made PREP> from POBJ> olive`.
2. The distributional information provided by the individual words; i.e., the
   word embeddings of the two constituents.
3. The distributional signal provided by the compound itself; i.e., the
   embedding of the noun compound in context.

The model includes several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses (1) and
(2). The *distributional-nc model* and the *integrated-nc model* each add (3).
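
To make the variants concrete, here is a minimal NumPy sketch of how the
classifier input differs between them. The vector names and dimensions below are
purely illustrative placeholders, not the model's actual tensors:

```python
import numpy as np

# Illustrative placeholders: in the real model these are TensorFlow tensors.
path_vector = np.zeros(60)    # (1) an encoding of the dependency paths
w_mod = np.zeros(300)         # (2) word embedding of the modifier, e.g. "olive"
w_head = np.zeros(300)        # (2) word embedding of the head, e.g. "oil"
nc_vector = np.zeros(300)     # (3) embedding of the compound "olive_oil" itself

classifier_inputs = {
    'path': path_vector,
    'dist': np.concatenate([w_mod, w_head]),
    'dist-nc': np.concatenate([w_mod, w_head, nc_vector]),
    'integrated': np.concatenate([path_vector, w_mod, w_head]),
    'integrated-nc': np.concatenate([path_vector, w_mod, w_head, nc_vector]),
}
```

Each variant feeds its feature vector to the same kind of classifier; the only
difference is which signals are concatenated together.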

Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
   inventory*. The inventory describes the specific relationships that you'd
   like the model to differentiate (e.g. *part of* versus *composed of* versus
   *purpose*), and typically consists of tens of classes. You can download
   the dataset used in the paper from
   [here](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz).
2. A collection of word embeddings: the path-based model uses the word
   embeddings as part of the path representation, and the distributional models
   use the word embeddings directly as prediction features.
3. The path-based model requires a collection of syntactic dependency parses
   that connect the constituents of each noun compound. To generate these,
   you'll need a corpus from which to extract this data; we used Wikipedia and
   the [LDC GigaWord5](https://catalog.ldc.upenn.edu/LDC2011T07) corpora.

# Contents

The following source code is included here:

* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
  model to predict a noun-compound relationship given labeled noun-compounds and
  dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier based
  on any combination of paths, word embeddings, and noun-compound embeddings.
* `get_indicative_paths.py` is a script that generates the most indicative
  syntactic dependency paths for a particular relationship.

Also included are utilities for preparing data for training:

* `text_embeddings_to_binary.py` converts a text file containing word embeddings
  into a binary file that is quicker to load.
* `extract_paths.py` finds all the dependency paths that connect words in a
  corpus.
* `sorted_paths_to_examples.py` processes the output of `extract_paths.py` to
  produce summarized training data.

This code (in particular, the utilities used to prepare the data) differs from
the code that was used to prepare the data for the paper. Notably, the paper's
data was prepared with a proprietary dependency parser, whereas the code here
uses spaCy.

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
  instructions at that site.
* [scikit-learn](http://scikit-learn.org/): you can probably just install this
  with `pip install scikit-learn`.
* [spaCy](https://spacy.io/): `pip install spacy` ought to do the trick, along
  with an English model (e.g. `python -m spacy download en_core_web_sm`).
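
If you want a quick sanity check that everything is importable before you start,
a snippet along these lines will do (the model name `en_core_web_sm` is just one
choice of English model):

```python
import tensorflow as tf
import sklearn
import spacy

print('TensorFlow:', tf.__version__)
print('scikit-learn:', sklearn.__version__)
print('spaCy:', spacy.__version__)

# This will raise an error if no English model has been downloaded yet.
nlp = spacy.load('en_core_web_sm')
print('Loaded spaCy model:', nlp.meta['name'])
```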

# Creating the Model

This section describes the steps necessary to create and evaluate the model
described in the paper.

## Generate Path Data

To begin, you need three text files:

1. **Corpus**. This file should contain natural language sentences, written with
   one sentence per line. For purposes of exposition, we'll assume that you
   have English Wikipedia serialized this way in `${HOME}/data/wiki.txt`.
2. **Labeled Noun Compound Pairs**. This file contains (modifier, head, label)
   tuples, tab-separated, with one per line. The *label* represents the
   relationship between the head and the modifier; e.g., if `purpose` is one of
   your labels, you might include `tooth<tab>paste<tab>purpose`.
3. **Word Embeddings**. We used the
   [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings; in
   particular, the 6B-token, 300d variant. We'll assume you have this file as
   `${HOME}/data/glove.6B.300d.txt`.

We first process the embeddings from their text format into something that we
can load a little more quickly:

    ./text_embeddings_to_binary.py \
      --input ${HOME}/data/glove.6B.300d.txt \
      --output_vocab ${HOME}/data/vocab.txt \
      --output_npy ${HOME}/data/glove.6B.300d.npy
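
Conceptually, the conversion just splits each line of the GloVe file into a word
and a vector of floats. The sketch below illustrates the same idea with a
hypothetical `convert_embeddings` helper; it is not the actual
`text_embeddings_to_binary.py`:

```python
import numpy as np

def convert_embeddings(text_path, vocab_path, npy_path):
    """Split a GloVe-style text file into a vocab file and a float32 matrix."""
    words, vectors = [], []
    with open(text_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            vectors.append([float(x) for x in parts[1:]])
    with open(vocab_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(words) + '\n')
    np.save(npy_path, np.array(vectors, dtype=np.float32))

# convert_embeddings('glove.6B.300d.txt', 'vocab.txt', 'glove.6B.300d.npy')
```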

Next, we'll extract all the dependency parse paths connecting our labeled pairs
from the corpus. This process takes a *looooong* time, but is trivially
parallelized using map-reduce if you have access to that technology.

    ./extract_paths.py \
      --corpus ${HOME}/data/wiki.txt \
      --labeled_pairs ${HOME}/data/labeled-pairs.tsv \
      --output ${HOME}/data/paths.tsv
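
For intuition about what a path looks like, here is a minimal spaCy-based sketch
of extracting and encoding a dependency path between two words. It is not the
actual `extract_paths.py`, and the exact encoding (lemmatization, direction
markers, placeholders) used by that script may differ:

```python
import spacy

nlp = spacy.load('en_core_web_sm')  # any English model will do

def dependency_path(sentence, modifier, head):
    """Return an encoded dependency path between two words, or None."""
    doc = nlp(sentence)
    try:
        m = next(t for t in doc if t.text.lower() == modifier)
        h = next(t for t in doc if t.text.lower() == head)
    except StopIteration:
        return None

    # Walk up from each token to the root and find the lowest common ancestor.
    m_chain = [m] + list(m.ancestors)
    h_chain = [h] + list(h.ancestors)
    h_indices = {t.i for t in h_chain}
    lca = next((t for t in m_chain if t.i in h_indices), None)
    if lca is None:
        return None

    up = m_chain[:[t.i for t in m_chain].index(lca.i)]          # moving up
    down = h_chain[:[t.i for t in h_chain].index(lca.i)][::-1]  # moving down

    def encode(token, direction, placeholder=None):
        word = placeholder or token.lemma_.lower()
        return '%s/%s/%s/%s' % (word, token.pos_, token.dep_, direction)

    parts = [encode(t, '>', '<X>' if t.i == m.i else None) for t in up]
    parts.append(encode(lca, '^', '<X>' if lca.i == m.i else
                        ('<Y>' if lca.i == h.i else None)))
    parts += [encode(t, '<', '<Y>' if t.i == h.i else None) for t in down]
    return '::'.join(parts)

print(dependency_path(
    'This fine oil is made from first-press olives.', 'olives', 'oil'))
```

On the example sentence this produces a path along the lines of
`<X>/NOUN/pobj/>::from/ADP/prep/>::make/VERB/ROOT/^::<Y>/NOUN/nsubjpass/<`
(the exact tags depend on the spaCy model), which has the same shape as the
sample row shown below.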

The file it produces (`paths.tsv`) is a tab-separated file that contains the
modifier, the head, the label, the encoded path, and the sentence from which the
path was drawn. (The last field is mostly for sanity checking.) A sample row
might look something like this (where the newlines would actually be tab
characters):

    navy
    captain
    owner_emp_use
    <X>/PROPN/dobj/>::enter/VERB/ROOT/^::follow/VERB/advcl/<::in/ADP/prep/<::footstep/NOUN/pobj/<::of/ADP/prep/<::father/NOUN/pobj/<::bover/PROPN/appos/<::<Y>/PROPN/compound/<
    He entered the Royal Navy following in the footsteps of his father Captain John Bover and two of his elder brothers as volunteer aboard HMS Perseus

This file must be sorted as follows:

    sort -k1,3 -t$'\t' paths.tsv > sorted.paths.tsv

In particular, rows with the same modifier, head, and label must appear
contiguously.
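
The reason contiguity matters is that the next step can then aggregate all the
paths for a given pair in a single streaming pass, e.g. with
`itertools.groupby`. This is an illustration of the idea, not the actual
implementation:

```python
import itertools

# groupby only merges *adjacent* rows, which is why the file must be sorted.
with open('sorted.paths.tsv', encoding='utf-8') as f:
    rows = (line.rstrip('\n').split('\t') for line in f)
    for (modifier, head, label), group in itertools.groupby(
            rows, key=lambda row: (row[0], row[1], row[2])):
        paths = [row[3] for row in group]
        print(modifier, head, label, len(paths))
```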

We next create a file that contains all the relation labels from our original
labeled pairs:

    awk 'BEGIN {FS="\t"} {print $3}' < ${HOME}/data/labeled-pairs.tsv \
      | sort -u > ${HOME}/data/relations.txt

With these in hand, we're ready to produce the train, validation, and test data:

    ./sorted_paths_to_examples.py \
      --input ${HOME}/data/sorted.paths.tsv \
      --vocab ${HOME}/data/vocab.txt \
      --relations ${HOME}/data/relations.txt \
      --splits ${HOME}/data/splits.txt \
      --output_dir ${HOME}/data

Here, `splits.txt` is a file that indicates which "split" (train, test, or
validation) you want each pair to appear in. It should be a tab-separated file
which contains the modifier, the head, and the dataset (`train`, `test`, or
`val`) into which the pair should be placed; e.g.:

    tooth <TAB> paste <TAB> train
    banana <TAB> seat <TAB> test
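
If you don't already have a preferred split, you can generate a simple random
one from the labeled pairs. This sketch is only illustrative: the 80/10/10
proportions and file names are assumptions, and the paper also uses lexical
splits (by head, modifier, or both) that require more care than this:

```python
import random

random.seed(0)
with open('labeled-pairs.tsv', encoding='utf-8') as f:
    pairs = [line.rstrip('\n').split('\t') for line in f if line.strip()]

with open('splits.txt', 'w', encoding='utf-8') as out:
    for modifier, head, _label in pairs:
        r = random.random()
        split = 'train' if r < 0.8 else ('val' if r < 0.9 else 'test')
        out.write('%s\t%s\t%s\n' % (modifier, head, split))
```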

The program will produce a separate file for each dataset split in the directory
specified by `--output_dir`. Each file contains `tf.train.Example` protocol
buffers encoded using the `TFRecord` file format.
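
To spot-check the generated files, you can iterate over one of them with the
TensorFlow 1.x `tf.python_io` API. The feature names inside each example depend
on `sorted_paths_to_examples.py`, so this sketch only prints whatever keys it
finds:

```python
import tensorflow as tf

path = 'train.tfrecs.gz'  # one of the files written to --output_dir
options = tf.python_io.TFRecordOptions(
    tf.python_io.TFRecordCompressionType.GZIP)
for i, record in enumerate(tf.python_io.tf_record_iterator(path, options)):
    example = tf.train.Example.FromString(record)
    print(sorted(example.features.feature.keys()))
    if i >= 2:  # a few records are enough for a sanity check
        break
```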

## Create Path Embeddings

Now we're ready to train the path embeddings using `learn_path_embeddings.py`:

    ./learn_path_embeddings.py \
      --train ${HOME}/data/train.tfrecs.gz \
      --val ${HOME}/data/val.tfrecs.gz \
      --test ${HOME}/data/test.tfrecs.gz \
      --embeddings ${HOME}/data/glove.6B.300d.npy \
      --relations ${HOME}/data/relations.txt \
      --output ${HOME}/data/path-embeddings \
      --logdir /tmp/learn_path_embeddings

The path embeddings will be placed at the location specified by `--output`.

## Train classifiers

Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script. The following shell fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers.

    LOGDIR=/tmp/learn_classifier
    mkdir -p "${LOGDIR}"
    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigiawords ; do
          for MODEL in dist dist-nc path integrated integrated-nc ; do
            # Filename for the log that will contain the classifier results.
            LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
            python learn_classifier.py \
              --dataset_dir ~/lexnet/datasets \
              --dataset "${DATASET}" \
              --corpus "${SPLIT}/${CORPUS}" \
              --embeddings_base_path ~/lexnet/embeddings \
              --logdir ${LOGDIR} \
              --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
          done
        done
      done
    done

The log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.

# Contact

If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.

If you use this code for any published research, please include the following
citation:

Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun
Compounds Using Paraphrases in a Neural Model. Vered Shwartz and Chris Waterson.
NAACL 2018. [link](https://arxiv.org/pdf/1803.08073.pdf)