---
license: mit
datasets:
- disi-unibo-nlp/Pile-NER-biomed-IOB
- disi-unibo-nlp/Pile-NER-biomed-descriptions
language:
- en
base_model:
- dmis-lab/biobert-v1.1
pipeline_tag: token-classification
tags:
- medical
---

# Model card for OpenBioNER

We introduce **OpenBioNER**, a lightweight BERT-based model tailored for *open-domain* biomedical named entity recognition (NER).
This model can recognize previously unseen entity types based solely on their **natural language descriptions**, eliminating the need for retraining.

OpenBioNER is pretrained on synthetic silver annotations generated through LLM self-supervision.
Extensive experiments demonstrate that OpenBioNER outperforms both specialized and general-purpose LLMs, such as UniNER and GPT-4o, improving the F1 score by up to 10% in zero-shot settings across various biomedical benchmarks.
In comparison to smaller baselines such as GLiNER, our model achieves better performance while using up to 4x fewer parameters.

# Links
- Blog: [link to blog](https://medium.com/@a.cocchieri/zero-shot-biomedical-named-entity-recognition-through-entity-type-description-3fd3518fca17)
- Demo: [link to demo](https://huggingface.co/spaces/disi-unibo-nlp/openbioner-demo)
- Example usage in Colab: [link to colab](https://colab.research.google.com/drive/136yfjTZdDLeej_Odx73nqFDv-oS4HGR3?usp=sharing)
  
# Installation
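To use this model, you must install the IBM Zshot library (from the main branch before the next release). The commands below are written for a regular shell; in a notebook such as Colab, prefix each one with `!`:

```bash
pip install -U zshot==0.0.11 datasets gliner
python -m spacy download en_core_web_sm
```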

# Usage
```python
import spacy

from zshot import PipelineConfig, displacy
from zshot.linker import LinkerSMXM
from zshot.utils.data_models import Entity

# Define the candidate entity types; each type is specified by its natural language description.
entities = [
    Entity(
        name='BACTERIUM',
        description='A bacterium refers to a type of microorganism that can exist as a single cell and may cause infections or play a role in various biological processes. Examples include species like Streptococcus pneumoniae and Streptomyces ahygroscopicus.',
        vocabulary=None,
    ),
]

nlp = spacy.blank("en")
nlp_config = PipelineConfig(
    linker=LinkerSMXM(model_name="disi-unibo-nlp/openbioner-base"),
    entities=entities,
    device='cuda' # or 'cpu' if GPU not available
)
nlp.add_pipe("zshot", config=nlp_config, last=True)


sentence = "Impact of cofactor - binding loop mutations on thermotolerance and activity of E. coli transketolase"
doc = nlp(sentence)

displacy.render(doc, style="ent")
```

# Performance

OpenBioNER achieves the **highest average performance** across the eight benchmarks below, although it is not the top-scoring model on every individual dataset. The best score in each column is shown in bold.

| Model                 | Size  | AnatEM | NCBI | JNLPBA | BC2GM | BC4CHEMD | BC5CDR | JNLPBA-R | MedMentions-R | AVG  |
| :-------------------- | :---- | :----- | :--- | :----- | :---- | :------- | :----- | :------- | :------------ | :--- |
| GPT-4o                | -     | **38.7** | 50.0 | 41.9   | 37.3  | 36.4     | 66.4   | 26.6     | 49.1          | 43.3 |
| UniNER               | 7B    | 25.1   | 60.4 | 48.1   | 46.2  | 47.9     | **68.0** | 50.2     | **53.4** | 49.9 |
| GLiNER_large-v1     | 459M  | 33.3   | **61.9** | **57.1** | 47.9  | 43.1     | 66.4   | 51.9     | **53.4** | 51.9 |
| OpenBioNER *(Ours)* | 110M  | 35.2   | 58.5 | **57.1** | **49.1** | **48.0** | 60.4   | **63.9** | 50.9          | **52.9** |
| OpenBioNER *(Ours)* - Zshot | 110M   |     34.8 |   57.8 |     56.8 |    49.5 |       47.1 |     60.1 |          64.6 |               52.9 |  53.0 |
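
The AVG column in the table above is consistent with an unweighted mean over the eight datasets, rounded to one decimal place. The short sketch below recomputes it from the per-dataset scores as a sanity check. The rounding mode (half-up) and the helper name `macro_average` are assumptions made for this illustration, not something stated by the authors.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-dataset scores copied from the table above, in column order:
# AnatEM, NCBI, JNLPBA, BC2GM, BC4CHEMD, BC5CDR, JNLPBA-R, MedMentions-R.
SCORES = {
    "GPT-4o": ["38.7", "50.0", "41.9", "37.3", "36.4", "66.4", "26.6", "49.1"],
    "UniNER": ["25.1", "60.4", "48.1", "46.2", "47.9", "68.0", "50.2", "53.4"],
    "GLiNER_large-v1": ["33.3", "61.9", "57.1", "47.9", "43.1", "66.4", "51.9", "53.4"],
    "OpenBioNER": ["35.2", "58.5", "57.1", "49.1", "48.0", "60.4", "63.9", "50.9"],
    "OpenBioNER - Zshot": ["34.8", "57.8", "56.8", "49.5", "47.1", "60.1", "64.6", "52.9"],
}


def macro_average(scores):
    """Unweighted mean over datasets, rounded half-up to one decimal place."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = {model: macro_average(values) for model, values in SCORES.items()}
for model, avg in averages.items():
    print(f"{model}: {avg}")
```

Decimal arithmetic is used here only to avoid binary floating-point surprises when a mean falls exactly on a rounding boundary.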

> ⚠️ **Disclaimer**: Please note that running evaluations using the `zshot` library may lead to slightly different results on certain benchmarks compared to those reported in the paper (above). This discrepancy is due to differences in token alignment: `zshot` uses spaCy's character-based span matching, while our experiments use token-level alignment as handled by BERT-based NER pipelines. These differences can affect how entity spans are matched and evaluated, particularly in cases with subword tokenization or punctuation.
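
To make the alignment issue more concrete, the sketch below projects one character-level span onto whitespace tokens under two different policies and shows that the resulting token labels differ. It is purely illustrative: `tokenize_with_offsets` and `span_to_iob` are hypothetical helpers written for this example, the sentence is made up, and neither policy is claimed to be the exact behavior of `zshot`, spaCy, or the evaluation code used for the paper.

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace and keep each token's (start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def span_to_iob(tokens, span_start, span_end, label, strict):
    """Project a character span onto tokens as IOB tags.

    strict=True  -> tag only tokens that lie fully inside the span.
    strict=False -> tag every token that overlaps the span.
    """
    tags = []
    inside = False
    for _, start, end in tokens:
        if strict:
            hit = span_start <= start and end <= span_end
        else:
            hit = start < span_end and end > span_start
        tags.append((("I-" if inside else "B-") + label) if hit else "O")
        inside = hit
    return tags


text = "Infection with E. coli(K12) was confirmed."
tokens = tokenize_with_offsets(text)

# Character span covering "E. coli" only; it ends in the middle of the token "coli(K12)".
start = text.index("E. coli")
end = start + len("E. coli")

strict_tags = span_to_iob(tokens, start, end, "BACTERIUM", strict=True)
loose_tags = span_to_iob(tokens, start, end, "BACTERIUM", strict=False)
print(strict_tags)  # only "E." is tagged
print(loose_tags)   # "E." and "coli(K12)" are tagged
```

Under a token-level metric, these two projections of the same character span would be scored as different predictions, which is the kind of effect the disclaimer describes.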


### Descriptions

Below we provide all the descriptions used to evaluate *OpenBioNER* for each dataset. 

---

### Negative Class

This is the description used for the NEG class (i.e., not an entity) for all the datasets, except for MedMentions-Rare:
> Coal, water, oil, etc. are normally used for traditional electricity generation. However using liquefied natural gas as fuel for joint circulatory electricity generation has advantages. The chief financial officer is the only one there taking the fall. It has a very talented team, eh. What will happen to the wildlife? I just tell them, you've got to change. They're here to stay. They have no insurance on their cars. What else would you like? Whether holding an international cultural event or setting the city's cultural policies, she always asks for the participation or input of other cities and counties.


---

### NCBI

| TYPE    | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| :------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| DISEASE | A disease is a medical condition that disrupts normal bodily functions or structures, affecting various organs or systems, and leading to symptoms like muscle weakness, fatigue, stiffness, or cognitive impairment. Diseases can impact muscles, the nervous system, heart, eyes, and more, and may be chronic or acute, such as diabetes, cardiovascular or neurological disorders, and cancer-related conditions like lymphoblastic leukemia or lymphoma. |


### AnatEM

| TYPE    | Description                                                                                                                                                                                                                                                                                                                         |
| :------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ANATOMY | The anatomy refers to biological components at various scales, including cells, tissues, and organs. These entities can be identified by proper nouns referring to cell types (e.g., HeLa cells, neurospheres, NSCLC, SCC), body parts (e.g., serum, blood) or biological substances (e.g., vegetables, meats, cow milk) or tumors. |


### BC4CHEMD

| TYPE     | Description                                                                                                                                                                                                                                                                                                                                                                   |
| :------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| CHEMICAL | Chemicals are substances that are composed of one or more elements, typically consisting of atoms bonded together by chemical bonds. They can be naturally occurring, such as vitamins or sterols, or synthesized, like alkylcarbazoles or tetrachlorodibenzo-p-dioxins (TCDD). Chemicals can also be modified or combined to form new compounds, such as esters or polymers. |


### BC2GM

| TYPE | Description                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| :--- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| GENE | A gene is a unit of heredity that carries information from one generation to the next and is composed of DNA sequences that encode the instructions for the development, growth, and function of an organism. It can be a segment of DNA that is passed from one generation to the next and is responsible for the transmission of traits from parents to offspring. A gene is often represented using a three-letter code (e.g., trios, ABL, DNA-PK). |


### BC5CDR

| TYPE     | Description                                                                                                                                                                                                                                                                                                                                                      |
| :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| CHEMICAL | Chemicals are substances that are composed of atoms, either bonded together in a molecule or as a mixture of different substances. This includes medications (e.g., nitroarginine methyl ester, nifedipine, prednisolone, methyldopa), compounds (e.g., potassium, calcium, ammonium), and other substances that can have various effects on the body.           |
| DISEASE  | Diseases are any medical condition that affects the normal functioning of the body, resulting in symptoms, discomfort, or potentially life-threatening complications. This includes chronic and acute disorders, conditions affecting specific bodily systems, cancer-related conditions, and complications arising from medical treatments or external factors. |


### JNLPBA

| TYPE       | Description                                                                                                                                                                                                                                                                              |
| :--------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| PROTEIN    | A protein is a large biomolecule composed of one or more chains of amino acids, essential for structure and function within cells. Proteins serve as enzymes, receptors, and signaling molecules, playing critical roles in hormone action, immune response, and cellular communication. |
| DNA        | DNA refers to a molecule that contains the genetic instructions used in the development and function of all living organisms. It is composed of two strands of nucleotides that are coiled together in a double helix structure.                                                         |
| CELL\_TYPE | A cell type refers to a specific category of cells defined by characteristic morphology, function, and molecular markers. Examples include lymphocytes, leukocytes, mononuclear cells, polymorphonuclear leukocytes, and B-lymphoblastoid cells.                                         |
| CELL\_LINE | A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments.                |
| RNA        | RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins. |


### JNLPBA-Rare

| TYPE       | Description                                                                                                                                                                                                                                                               |
| :--------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| CELL\_LINE | A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments. |
| RNA        | RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins.  |


### MedMentions-Rare

| TYPE | Description                                                                                                                                                                                                             |
| :--- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| NEG  | In this study, we fabricated prevascularized synthetic device ports to help mitigate this limitation. Thus, the optimum range of pore size for prevascularization of these membranes was estimated to be 75 - 100 μm. A total of 51 patients were included, 16 in group I and 35 in group II.                                                                                                                      |
| Bacterium (T007) | A bacterium refers to a type of microorganism that can exist as a single cell and may cause infections or play a role in various biological processes. Examples include species like Streptococcus pneumoniae and Streptomyces ahygroscopicus.                                                                  |
| Body Substance (T031) | A body substance is any material produced by or found within the body, such as blood, serum, saliva, sweat, or gastric acid. Specific examples include serum cytokine levels for immune responses, blood lipids for metabolic studies, and hemolymph glucose for stress responses.                                                                                            |
| Food (T168) | A food refers to any substance consumed to provide nutritional support for the body. This includes a wide range of items such as snacks, meat, dairy products, grains like wheat, and edible substances like carbohydrates, proteins, and fats.                  |
| Body System (T022) | A body system consists of interconnected organs and tissues working together to carry out essential functions. Examples include the gastrointestinal tract for digestion, the nervous system for sensory and motor control, the hematological system for blood-related functions, and the endocrine system for hormone regulation. |
| Professional or Occupational Group (T097) | A professional refers to individuals who share the same profession, occupation, or role within a specific field. Examples include cardiologists, psychologists, assessors, hospice staff, and volunteers.               |

---



# 🧬 How to Write Effective Entity Type Descriptions

Entity type descriptions are crucial for improving generalization in OpenBioNER. Well-written descriptions help models disambiguate types, handle rare classes, and align with real-world usage across diverse datasets.

### ✅ Best Practices

- **Start with a clear definition**: Briefly explain what the entity type is.

- **Include functions or context**: Add what it does, its purpose, or where it appears.

- **List 3–5 concrete examples**: Use domain-relevant examples (e.g., real diseases, proteins, or food items).

- **Mention subtypes or synonyms (optional)**: Helps capture lexical variation and rare mentions.

- **Keep it concise**: 1–3 well-structured sentences are ideal.

### ⚠️ Common Mistakes to Avoid

- Vague or overly generic descriptions  
- No examples  
- Just a list of terms  
- Redundant or circular wording  
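
As a rough aid, the guidelines above can be turned into a few automatic checks. The sketch below is only a heuristic under stated assumptions: `check_description`, the cue phrases in `EXAMPLE_CUES`, the regex-based sentence split, and the 8-word minimum are all hypothetical choices made for this illustration and are not part of OpenBioNER or `zshot`.

```python
import re

# Phrases that usually introduce examples (an assumed, non-exhaustive list).
EXAMPLE_CUES = ("e.g.", "such as", "examples include", "for example", "including")


def check_description(description):
    """Return a list of warnings based on rough, illustrative heuristics."""
    warnings = []
    text = description.strip()
    # Rough sentence split: ., ! or ? followed by whitespace and a capital letter.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]
    if len(sentences) > 3:
        warnings.append("longer than 3 sentences")
    if len(text.split()) < 8:
        warnings.append("too short to define the type")
    if not any(cue in text.lower() for cue in EXAMPLE_CUES):
        warnings.append("no examples")
    return warnings


good = (
    "A bacterium refers to a type of microorganism that can exist as a single cell "
    "and may cause infections or play a role in various biological processes. "
    "Examples include species like Streptococcus pneumoniae and Streptomyces ahygroscopicus."
)
bad = "Bacterium, microbe, germ."

print(check_description(good))
print(check_description(bad))
```

Checks like these can flag obviously weak descriptions, but they cannot judge whether a definition is accurate or whether the examples are representative.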

### 🧪 Template (Recommended Format)

```text
A [TYPE] refers to [concise definition]. It includes examples such as [example1], [example2], and [example3].
```
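
The template can also be filled programmatically when many entity types need descriptions. The following is a minimal sketch; `build_description` is a hypothetical helper (not part of `zshot`), and the 3 to 5 example limit simply mirrors the best practices listed earlier.

```python
def build_description(type_name, definition, examples):
    """Fill the recommended template with a definition and concrete examples."""
    if not 3 <= len(examples) <= 5:
        raise ValueError("provide 3 to 5 examples")
    # Pick the article from the first letter of the type name (a simplification).
    article = "An" if type_name[0].lower() in "aeiou" else "A"
    listed = ", ".join(examples[:-1]) + f", and {examples[-1]}"
    return (
        f"{article} {type_name} refers to {definition}. "
        f"It includes examples such as {listed}."
    )


description = build_description(
    "food",
    "any substance consumed to provide nutritional support for the body",
    ["snacks", "meat", "dairy products"],
)
print(description)
```

The resulting string can then be passed as the `description` argument of an `Entity`, as in the Usage section.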

# Authors
- [Alessio Cocchieri](https://huggingface.co/alecocc)
- [Giacomo Frisoni](https://huggingface.co/giacomo-frisoni)
- [Marcos Martinez Galindo](https://huggingface.co/marmg)
- Gianluca Moro
- Giuseppe Tagliavini
- [Francesco Candoli](https://huggingface.co/CheccoCando)

# 📬 Contacts

For questions, collaborations, or feedback, feel free to reach out:

- Alessio: [a.cocchieri@unibo.it](mailto:a.cocchieri@unibo.it)
- Giacomo: [giacomo.frisoni@unibo.it](mailto:giacomo.frisoni@unibo.it)
- Marcos: [marcos.martinez.galindo@ibm.com](mailto:marcos.martinez.galindo@ibm.com)