---
license: mit
datasets:
- disi-unibo-nlp/Pile-NER-biomed-IOB
- disi-unibo-nlp/Pile-NER-biomed-descriptions
language:
- en
base_model:
- dmis-lab/biobert-v1.1
pipeline_tag: token-classification
tags:
- medical
---
# Model card for OpenBioNER
We introduce **OpenBioNER**, a lightweight BERT-based model tailored for *open-domain* Biomedical NER.
This model can find unseen target entity types based solely on their **natural language descriptions**, eliminating the need for retraining.
OpenBioNER is pretrained on synthetic silver annotations generated through LLM self-supervision.
Extensive experiments demonstrate that OpenBioNER outperforms specialized LLMs, such as UniNER and GPT-4o, achieving an F1 score improvement of up to 10% in zero-shot settings across various biomedical benchmarks.
In comparison to smaller baselines such as GLiNER, our model achieves better performance while using up to 4x fewer parameters.
# Links
- Blog: [link to blog](https://medium.com/@a.cocchieri/zero-shot-biomedical-named-entity-recognition-through-entity-type-description-3fd3518fca17)
- Demo: [link to demo](https://huggingface.co/spaces/disi-unibo-nlp/openbioner-demo)
- Example usage in Colab: [link to colab](https://colab.research.google.com/drive/136yfjTZdDLeej_Odx73nqFDv-oS4HGR3?usp=sharing)
# Installation
To use this model, install the IBM Zshot library; the command below pins release 0.0.11. The leading `!` is notebook (Colab/Jupyter) syntax and should be dropped in a regular shell:
```bash
!pip install -U zshot==0.0.11 datasets gliner
!python -m spacy download en_core_web_sm
```
# Usage
```python
import spacy
from zshot import PipelineConfig, displacy
from zshot.linker import LinkerSMXM
from zshot.evaluation.metrics._seqeval._seqeval import Seqeval
from zshot.utils.data_models import Entity
from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report
# define your list of candidate entity types
entities = [
    Entity(
        name='BACTERIUM',
        description='A bacterium refers to a type of microorganism that can exist as a single cell and may cause infections or play a role in various biological processes. Examples include species like Streptococcus pneumoniae and Streptomyces ahygroscopicus.',
        vocabulary=None,
    ),
]
nlp = spacy.blank("en")
nlp_config = PipelineConfig(
    linker=LinkerSMXM(model_name="disi-unibo-nlp/openbioner-base"),
    entities=entities,
    device='cuda',  # or 'cpu' if GPU not available
)
nlp.add_pipe("zshot", config=nlp_config, last=True)
sentence = "Impact of cofactor - binding loop mutations on thermotolerance and activity of E. coli transketolase"
doc = nlp(sentence)
displacy.render(doc, style="ent")
```
# Performance
OpenBioNER achieves the **highest average performance** across all datasets, although individual baselines remain ahead on some benchmarks (for example GPT-4o on AnatEM, GLiNER on NCBI, and UniNER on BC5CDR).
| Model | Size | AnatEM | NCBI | JNLPBA | BC2GM | BC4CHEMD | BC5CDR | JNLPBA-R | MedMentions-R | AVG |
| :-------------------- | :---- | :----- | :--- | :----- | :---- | :------- | :----- | :------- | :------------ | :--- |
| GPT-4o | - | **38.7** | 50.0 | 41.9 | 37.3 | 36.4 | 66.4 | 26.6 | 49.1 | 43.3 |
| UniNER | 7B | 25.1 | 60.4 | 48.1 | 46.2 | 47.9 | **68.0** | 50.2 | **53.4** | 49.9 |
| GLiNER_large-v1 | 459M | 33.3 | **61.9** | **57.1** | 47.9 | 43.1 | 66.4 | 51.9 | **53.4** | 51.9 |
| OpenBioNER *(Ours)* | 110M | 35.2 | 58.5 | **57.1** | **49.1** | **48.0** | 60.4 | **63.9** | 50.9 | **52.9** |
| OpenBioNER *(Ours)* - Zshot | 110M | 34.8 | 57.8 | 56.8 | 49.5 | 47.1 | 60.1 | 64.6 | 52.9 | 53.0 |
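
The AVG column is consistent with an unweighted mean over the eight benchmark columns, rounded to one decimal. The sketch below recomputes it from the table values; the half-up rounding mode and the names `ROWS` and `mean_score` are assumptions made for this check, not something stated in the paper.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the table above, in column order:
# AnatEM, NCBI, JNLPBA, BC2GM, BC4CHEMD, BC5CDR, JNLPBA-R, MedMentions-R
ROWS = {
    "GPT-4o": ["38.7", "50.0", "41.9", "37.3", "36.4", "66.4", "26.6", "49.1"],
    "UniNER": ["25.1", "60.4", "48.1", "46.2", "47.9", "68.0", "50.2", "53.4"],
    "GLiNER_large-v1": ["33.3", "61.9", "57.1", "47.9", "43.1", "66.4", "51.9", "53.4"],
    "OpenBioNER": ["35.2", "58.5", "57.1", "49.1", "48.0", "60.4", "63.9", "50.9"],
    "OpenBioNER - Zshot": ["34.8", "57.8", "56.8", "49.5", "47.1", "60.1", "64.6", "52.9"],
}


def mean_score(scores):
    """Unweighted mean of the per-dataset scores, rounded half-up to one decimal."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for model, scores in ROWS.items():
    print(f"{model}: {mean_score(scores)}")
```

Decimal arithmetic is used instead of floats so that a value such as 52.95 rounds the same way every time.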
> ⚠️ **Disclaimer**: Please note that running evaluations using the `zshot` library may lead to slightly different results on certain benchmarks compared to those reported in the paper (above). This discrepancy is due to differences in token alignment: `zshot` uses spaCy's character-based span matching, while our experiments use token-level alignment as handled by BERT-based NER pipelines. These differences can affect how entity spans are matched and evaluated, particularly in cases with subword tokenization or punctuation.
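
To illustrate how the alignment strategy can change which spans count as matches, here is a minimal, self-contained sketch. It does not reproduce the internals of `zshot` or of our evaluation scripts: the whitespace tokenizer, the `align` helper, and the example sentence are all hypothetical. The `strict` and `expand` modes are modelled loosely on the `alignment_mode` options of spaCy's `Doc.char_span`.

```python
import re


def tokenize(text):
    """Whitespace tokenizer returning (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def align(start, end, tokens, mode="strict"):
    """Map a character span [start, end) to a list of token indices.

    strict: both span boundaries must coincide with token boundaries,
            otherwise the span is dropped (None is returned).
    expand: every token that overlaps the span is kept.
    """
    covered = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
    if not covered:
        return None
    if mode == "strict":
        first, last = tokens[covered[0]], tokens[covered[-1]]
        if first[1] != start or last[2] != end:
            return None
    return covered


text = "Mutations in E. coli transketolase."
tokens = tokenize(text)

# "E. coli" ends exactly on a token boundary.
s1 = text.index("E. coli")
e1 = s1 + len("E. coli")

# "transketolase" ends before the trailing period, which the whitespace
# tokenizer keeps attached to the last token.
s2 = text.index("transketolase")
e2 = s2 + len("transketolase")

print(align(s1, e1, tokens))            # token indices for "E. coli"
print(align(s2, e2, tokens))            # strict mode drops the span
print(align(s2, e2, tokens, "expand"))  # expand mode keeps the whole token
```

The second span is lost under strict matching but kept under expansion, so the same prediction can be scored as a miss or a hit depending on the alignment rule. Punctuation and subword boundaries are the typical triggers.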
### Descriptions
Below we provide all the descriptions used to evaluate *OpenBioNER* for each dataset.
---
### Negative Class
This is the description used as the NEG class (i.e., not an entity) for all the datasets, except for MedMentions-Rare:
> Coal, water, oil, etc. are normally used for traditional electricity generation. However using liquefied natural gas as fuel for joint circulatory electricity generation has advantages. The chief financial officer is the only one there taking the fall. It has a very talented team, eh. What will happen to the wildlife? I just tell them, you've got to change. They're here to stay. They have no insurance on their cars. What else would you like? Whether holding an international cultural event or setting the city's cultural policies, she always asks for the participation or input of other cities and counties.
---
### NCBI
| TYPE | Description |
| :------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| DISEASE | A disease is a medical condition that disrupts normal bodily functions or structures, affecting various organs or systems, and leading to symptoms like muscle weakness, fatigue, stiffness, or cognitive impairment. Diseases can impact muscles, the nervous system, heart, eyes, and more, and may be chronic or acute, such as diabetes, cardiovascular or neurological disorders, and cancer-related conditions like lymphoblastic leukemia or lymphoma. |
### AnatEM
| TYPE | Description |
| :------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ANATOMY | The anatomy refers to biological components at various scales, including cells, tissues, and organs. These entities can be identified by proper nouns referring to cell types (e.g., HeLa cells, neurospheres, NSCLC, SCC), body parts (e.g., serum, blood) or biological substances (e.g., vegetables, meats, cow milk) or tumors. |
### BC4CHEMD
| TYPE | Description |
| :------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| CHEMICAL | Chemicals are substances that are composed of one or more elements, typically consisting of atoms bonded together by chemical bonds. They can be naturally occurring, such as vitamins or sterols, or synthesized, like alkylcarbazoles or tetrachlorodibenzo-p-dioxins (TCDD). Chemicals can also be modified or combined to form new compounds, such as esters or polymers. |
### BC2GM
| TYPE | Description |
| :--- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| GENE | A gene is a unit of heredity that carries information from one generation to the next and is composed of DNA sequences that encode the instructions for the development, growth, and function of an organism. It can be a segment of DNA that is passed from one generation to the next and is responsible for the transmission of traits from parents to offspring. A gene is often represented using a three-letter code (e.g., trios, ABL, DNA-PK). |
### BC5CDR
| TYPE | Description |
| :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| CHEMICAL | Chemicals are substances that are composed of atoms, either bonded together in a molecule or as a mixture of different substances. This includes medications (e.g., nitroarginine methyl ester, nifedipine, prednisolone, methyldopa), compounds (e.g., potassium, calcium, ammonium), and other substances that can have various effects on the body. |
| DISEASE | Diseases are any medical condition that affects the normal functioning of the body, resulting in symptoms, discomfort, or potentially life-threatening complications. This includes chronic and acute disorders, conditions affecting specific bodily systems, cancer-related conditions, and complications arising from medical treatments or external factors. |
### JNLPBA
| TYPE | Description |
| :--------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| PROTEIN | A protein is a large biomolecule composed of one or more chains of amino acids, essential for structure and function within cells. Proteins serve as enzymes, receptors, and signaling molecules, playing critical roles in hormone action, immune response, and cellular communication. |
| DNA | DNA refers to a molecule that contains the genetic instructions used in the development and function of all living organisms. It is composed of two strands of nucleotides that are coiled together in a double helix structure. |
| CELL\_TYPE | A cell type refers to a specific category of cells defined by characteristic morphology, function, and molecular markers. Examples include lymphocytes, leukocytes, mononuclear cells, polymorphonuclear leukocytes, and B-lymphoblastoid cells. |
| CELL\_LINE | A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments. |
| RNA | RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins. |
### JNLPBA-Rare
| TYPE | Description |
| :--------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| CELL\_LINE | A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments. |
| RNA | RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins. |
### MedMentions-Rare
| TYPE | Description |
| :--- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| NEG | In this study, we fabricated prevascularized synthetic device ports to help mitigate this limitation. Thus, the optimum range of pore size for prevascularization of these membranes was estimated to be 75 - 100 μm. A total of 51 patients were included, 16 in group I and 35 in group II. |
| Bacterium (T007) | A bacterium refers to a type of microorganism that can exist as a single cell and may cause infections or play a role in various biological processes. Examples include species like Streptococcus pneumoniae and Streptomyces ahygroscopicus. |
| Body Substance (T031) | A body substance is any material produced by or found within the body, such as blood, serum, saliva, sweat, or gastric acid. Specific examples include serum cytokine levels for immune responses, blood lipids for metabolic studies, and hemolymph glucose for stress responses. |
| Food (T168) | A food refers to any substance consumed to provide nutritional support for the body. This includes a wide range of items such as snacks, meat, dairy products, grains like wheat, and edible substances like carbohydrates, proteins, and fats. |
| Body System (T022) | A body system consists of interconnected organs and tissues working together to carry out essential functions. Examples include the gastrointestinal tract for digestion, the nervous system for sensory and motor control, the hematological system for blood-related functions, and the endocrine system for hormone regulation. |
| Professional or Occupational Group (T097) | A professional refers to individuals who share the same profession, occupation, or role within a specific field. Examples include cardiologists, psychologists, assessors, hospice staff, and volunteers. |
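
Each table row above maps to one candidate entity type. As a minimal sketch, the descriptions can be kept in a plain dictionary and turned into the keyword arguments that `Entity` takes in the Usage section (`name`, `description`, `vocabulary`). The label strings, the `DESCRIPTIONS` dictionary, and the `to_entity_kwargs` helper are illustrative assumptions; only two MedMentions-Rare rows are shown.

```python
# Descriptions copied verbatim from the MedMentions-Rare table.
# The upper-case label names are an assumption made for this example.
DESCRIPTIONS = {
    "BACTERIUM": (
        "A bacterium refers to a type of microorganism that can exist as a "
        "single cell and may cause infections or play a role in various "
        "biological processes. Examples include species like Streptococcus "
        "pneumoniae and Streptomyces ahygroscopicus."
    ),
    "FOOD": (
        "A food refers to any substance consumed to provide nutritional "
        "support for the body. This includes a wide range of items such as "
        "snacks, meat, dairy products, grains like wheat, and edible "
        "substances like carbohydrates, proteins, and fats."
    ),
}


def to_entity_kwargs(descriptions):
    """Build one keyword-argument dict per entity type."""
    return [
        {"name": name, "description": text, "vocabulary": None}
        for name, text in descriptions.items()
    ]


entity_kwargs = to_entity_kwargs(DESCRIPTIONS)

# With zshot installed, the dicts can be unpacked into Entity objects:
#   from zshot.utils.data_models import Entity
#   entities = [Entity(**kw) for kw in entity_kwargs]
for kw in entity_kwargs:
    print(kw["name"], "->", kw["description"][:40] + "...")
```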
---
# 🧬 How to Write Effective Entity Type Descriptions
Entity type descriptions are crucial for improving generalization in OpenBioNER. Well-written descriptions help models disambiguate types, handle rare classes, and align with real-world usage across diverse datasets.
### ✅ Best Practices
- **Start with a clear definition**: Briefly explain what the entity type is.
- **Include functions or context**: Add what it does, its purpose, or where it appears.
- **List 3–5 concrete examples**: Use domain-relevant examples (e.g., real diseases, proteins, or food items).
- **Mention subtypes or synonyms (optional)**: Helps capture lexical variation and rare mentions.
- **Keep it concise**: 1–3 well-structured sentences are ideal.
### ⚠️ Common Mistakes to Avoid
- Vague or overly generic descriptions
- No examples
- Just a list of terms
- Redundant or circular wording
### 🧪 Template (Recommended Format)
```text
A [TYPE] refers to [concise definition]. It includes examples such as [example1], [example2], and [example3].
```
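The template and the best practices above can be sketched as a small helper. This is an illustrative assumption, not part of OpenBioNER or zshot: `build_description` and `count_sentences` are hypothetical names, and the sentence counter is a rough heuristic that would miscount abbreviations such as "e.g.".

```python
import re

TEMPLATE = "A {type_name} refers to {definition}. It includes examples such as {examples}."


def build_description(type_name, definition, examples):
    """Fill the recommended template, enforcing the 3-5 examples guideline."""
    if not 3 <= len(examples) <= 5:
        raise ValueError("list 3-5 concrete examples")
    joined = ", ".join(examples[:-1]) + ", and " + examples[-1]
    return TEMPLATE.format(
        type_name=type_name,
        definition=definition.rstrip("."),
        examples=joined,
    )


def count_sentences(description):
    """Rough sentence count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", description.strip())
    return len([p for p in parts if p])


description = build_description(
    "food",
    "any substance consumed to provide nutritional support for the body",
    ["snacks", "meat", "dairy products"],
)
print(description)
print(count_sentences(description), "sentences")
```

A check like `count_sentences(description) <= 3` is one simple way to keep descriptions within the recommended 1–3 sentences before passing them to the model.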
# Authors
- [Alessio Cocchieri](https://huggingface.co/alecocc)
- [Giacomo Frisoni](https://huggingface.co/giacomo-frisoni)
- [Marcos Martinez Galindo](https://huggingface.co/marmg)
- Gianluca Moro
- Giuseppe Tagliavini
- [Francesco Candoli](https://huggingface.co/CheccoCando)
# 📬 Contacts
For questions, collaborations, or feedback, feel free to reach out:
- Alessio: [a.cocchieri@unibo.it](mailto:a.cocchieri@unibo.it)
- Giacomo: [giacomo.frisoni@unibo.it](mailto:giacomo.frisoni@unibo.it)
- Marcos: [marcos.martinez.galindo@ibm.com](mailto:marcos.martinez.galindo@ibm.com)