# Entity Tag Guide
This document describes the annotation tags that appear in the NER and merged NER output. Each **entity** tag labels a span in the text, and each **relation** tag has the form `<span> <> <relation type>` (e.g. `RUV <> acronym`).
| Tag | Meaning |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **`match_named`** | Model and ground-truth agree on an explicit, uniquely named dataset span. |
| **`actual_named`** | A named dataset span present in the ground-truth but missed by the model. |
| **`pred_named`** | A named dataset span predicted by the model but not in the ground-truth. |
| **`match_unnamed`** | Model and ground-truth agree on a clearly described but unnamed dataset span. |
| **`actual_unnamed`** | An unnamed dataset span present in the ground-truth but missed by the model. |
| **`pred_unnamed`** | An unnamed dataset span predicted by the model but not in the ground-truth. |
| **`match_vague`** | Model and ground-truth agree on a vague dataset mention (lacking specific identifying details). |
| **`actual_vague`** | A vague dataset mention present in the ground-truth but missed by the model. |
| **`pred_vague`** | A vague dataset mention predicted by the model but not in the ground-truth. |
| **`<span> <> acronym`** | Relation: marks the dataset’s acronym (e.g. `RUV <> acronym`). |
| **`<span> <> data description`** | Relation: describes what the dataset contains or how it was collected. |
| **`<span> <> data geography`** | Relation: indicates the geographic coverage of the dataset (e.g. country, region). |
| **`<span> <> data source`** | Relation: links to the original source or repository of the data. |
| **`<span> <> data type`** | Relation: specifies the type of data (e.g. survey, census, register). |
| **`<span> <> geography`** | Relation: connects the dataset to its referenced geography (may duplicate `data geography`). |
| **`<span> <> publication year`** | Relation: the year the dataset (or its documentation) was published. |
| **`<span> <> publisher`** | Relation: the organization or entity that published the dataset. |
| **`<span> <> reference year`** | Relation: the year the data were actually collected or refer to. |
| **`<span> <> version`** | Relation: the version identifier of the dataset (e.g. “v5”, “Version 2”). |
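Use this guide when reviewing model predictions: `match_*` tags mark correct matches, `pred_*` tags mark false positives (predicted but not in the ground truth), `actual_*` tags mark false negatives (in the ground truth but missed by the model), and tags containing `<>` mark extracted relations.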
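The tag naming scheme in the table can be read mechanically. The following is a minimal sketch, assuming tags arrive as plain strings exactly as written above; the names `parse_tag`, `tally_outcomes`, and `ParsedTag` are hypothetical helpers for illustration and are not part of any annotation tool.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Prefix -> review outcome, as described in the tag table.
OUTCOMES = {
    "match": "true positive",    # model and ground truth agree
    "actual": "false negative",  # in ground truth, missed by the model
    "pred": "false positive",    # predicted, not in ground truth
}
MENTION_TYPES = {"named", "unnamed", "vague"}
RELATION_SEP = " <> "  # assumed separator, taken from e.g. "RUV <> acronym"


class ParsedTag(NamedTuple):
    kind: str                    # "entity" or "relation"
    outcome: Optional[str]       # set for entity tags only
    mention_type: Optional[str]  # set for entity tags only
    span: Optional[str]          # set for relation tags only
    relation: Optional[str]      # set for relation tags only


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag string into its parts; raise ValueError if unrecognized."""
    if RELATION_SEP in tag:
        span, relation = tag.split(RELATION_SEP, 1)
        return ParsedTag("relation", None, None, span.strip(), relation.strip())
    prefix, _, mention_type = tag.partition("_")
    if prefix not in OUTCOMES or mention_type not in MENTION_TYPES:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return ParsedTag("entity", OUTCOMES[prefix], mention_type, None, None)


def tally_outcomes(tags: Iterable[str]) -> Counter:
    """Count entity tags by review outcome, skipping relation tags."""
    counts: Counter = Counter()
    for tag in tags:
        parsed = parse_tag(tag)
        if parsed.kind == "entity":
            counts[parsed.outcome] += 1
    return counts
```

For example, `tally_outcomes(["match_named", "pred_vague", "RUV <> acronym"])` counts one true positive and one false positive and ignores the relation tag. Raising on unknown tags is a design choice in this sketch so that typos in tag names surface during review instead of being silently dropped.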