---
license: mit
base_model: microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
tags:
- text-embeddings
- sentence-transformers
- llm2vec
- medical
- chest-xray
- radiology
- clinical-nlp
language:
- en
pipeline_tag: feature-extraction
library_name: transformers
---

# LLM2Vec4CXR - Fine-tuned Model for Chest X-ray Report Analysis

This model is a fine-tuned version of [microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned](https://huggingface.co/microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned) specifically optimized for chest X-ray report analysis and medical text understanding.

## Model Description

LLM2Vec4CXR is a bidirectional language model that converts the base decoder-only LLM into a text encoder optimized for medical text embeddings. The model has been fully fine-tuned with a modified pooling strategy (`latent_attention`) to better capture semantic relationships in chest X-ray reports.

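To make "bidirectional" concrete, the sketch below contrasts the causal attention mask of a decoder-only LLM with the full mask an encoder uses. It is a toy illustration in plain Python, not code from this model; the helper names `causal_mask` and `bidirectional_mask` are hypothetical.

```python
def causal_mask(n):
    """Decoder-style mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


# Toy word-level "tokens"; a real tokenizer would split the text differently.
tokens = ["small", "left", "pleural", "effusion"]
causal = causal_mask(len(tokens))
bidirectional = bidirectional_mask(len(tokens))

for name, mask in (("causal", causal), ("bidirectional", bidirectional)):
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>9}: {row}")
```

Under the causal mask, "small" cannot see "effusion"; under the bidirectional mask it can, which is the property an embedding model relies on when each token representation should reflect the whole report.
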
### Key Features

- **Base Architecture**: LLM2CLIP-Llama-3.2-1B-Instruct
- **Pooling Mode**: Latent Attention (fine-tuned weights automatically loaded)
- **Bidirectional Processing**: Enabled for better context understanding
- **Medical Domain**: Specialized for chest X-ray report analysis
- **Max Length**: 512 tokens
- **Precision**: bfloat16
- **Automatic Loading**: Latent attention weights are automatically loaded from safetensors
- **Simple API**: Built-in methods for similarity computation and instruction-based encoding

## Training Details

### Training Data
- Fully fine-tuned on chest X-ray reports and medical text data
- Training focused on understanding pleural effusion status and other chest X-ray findings

### Training Configuration
- **Pooling Mode**: `latent_attention` (modified from base model)
- **Enable Bidirectional**: True
- **Max Length**: 512
- **Torch Dtype**: bfloat16
- **Full Fine-tuning**: All model weights were updated during training

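For readers unfamiliar with the `latent_attention` pooling mode, here is a deliberately simplified sketch of one common formulation: each token's hidden state acts as a query over a small learned latent array (used as both keys and values), and the attended outputs are mean-pooled into one vector. The sketch omits learned projections, multiple heads, and any MLP, and whether this model's layer matches it in detail is an assumption. The function name `latent_attention_pool` and all values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def latent_attention_pool(hidden_states, latents):
    """Pool token vectors into one vector via attention over a latent array.

    hidden_states: list of [dim] token vectors (the queries).
    latents:       list of [dim] latent vectors (keys and values); in a real
                   layer these would be trained parameters.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for h in hidden_states:
        # Scaled dot-product score between this token and every latent.
        scores = [scale * sum(a * b for a, b in zip(h, lat)) for lat in latents]
        weights = softmax(scores)
        # Weighted sum of latents: the token's attended representation.
        attended.append(
            [sum(w * lat[d] for w, lat in zip(weights, latents)) for d in range(dim)]
        )
    # Mean-pool the attended token representations into one embedding.
    count = len(attended)
    return [sum(vec[d] for vec in attended) / count for d in range(dim)]


# Toy inputs: three 2-dimensional "token" vectors and two latents.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[1.0, 0.0], [0.0, 1.0]]
print(latent_attention_pool(hidden_states, latents))
```
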
## Usage

### Installation

```bash
# Install the LLM2Vec4CXR package directly from GitHub
pip install git+https://github.com/lukeingawesome/llm2vec4cxr.git

# Or clone and install in development mode
git clone https://github.com/lukeingawesome/llm2vec4cxr.git
cd llm2vec4cxr
pip install -e .
```

### Basic Usage

```python
import torch
from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

# Load the model - latent attention weights are automatically loaded!
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LLM2Vec.from_pretrained(
    base_model_name_or_path='lukeingawesome/llm2vec4cxr',
    pooling_mode="latent_attention",
    max_length=512,
    enable_bidirectional=True,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
).to(device).eval()

# Configure tokenizer
model.tokenizer.padding_side = 'left'

# Simple text encoding
report = "There is a small increase in the left-sided effusion. There continues to be volume loss at both bases."
embedding = model.encode_text([report])

# Multiple texts at once
reports = [
    "No acute cardiopulmonary abnormality.",
    "Small bilateral pleural effusions.",
    "Large left pleural effusion with compressive atelectasis."
]
embeddings = model.encode_text(reports)
```

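The `padding_side = 'left'` setting above controls where padding is placed when reports of different lengths are batched. The following sketch shows the effect on a toy batch; the helper `pad_batch` and the token IDs are hypothetical and only mimic what a tokenizer produces.

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad token-ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        pad_mask = [0] * len(padding)   # 0 marks padding positions
        real_mask = [1] * len(seq)      # 1 marks real tokens
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(real_mask + pad_mask)
    return input_ids, attention_mask


# Made-up token IDs for a short and a longer report.
batch = [[11, 12, 13], [21, 22, 23, 24, 25]]
ids, mask = pad_batch(batch, side="left")
print(ids)   # [[0, 0, 11, 12, 13], [21, 22, 23, 24, 25]]
print(mask)  # [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]
```

With left padding, the final position of every row holds a real token, and the attention mask tells the model which positions to ignore.
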
### Advanced Usage with Instructions and Similarity

```python
# For instruction-following tasks with separator
instruction = 'Determine the change or the status of the pleural effusion.'
report = 'There is a small increase in the left-sided effusion.'
query_text = instruction + '!@#$%^&*()' + report

# Compare against multiple options
candidates = [
    'No pleural effusion',
    'Pleural effusion present',
    'Pleural effusion is worsening',
    'Pleural effusion is improving'
]

# Get similarity scores using the built-in method
similarities = model.compute_similarities(query_text, candidates)
print(f"Similarities: {similarities}")

# For custom separator-based encoding
embeddings = model.encode_with_separator([query_text], separator='!@#$%^&*()')
```

**Note**: The model now includes convenient methods like `compute_similarities()` and `encode_with_separator()` that handle complex tokenization automatically.

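The separator string divides a query into an instruction and the report text. The sketch below illustrates that split and a toy `embed_mask` that marks only the report tokens, on the assumption that the wrapper follows the upstream LLM2Vec convention of pooling over the text after the separator. The helpers `split_query` and `toy_embed_mask` are hypothetical, and whitespace splitting stands in for real tokenization.

```python
SEPARATOR = "!@#$%^&*()"


def split_query(query_text, separator=SEPARATOR):
    """Split 'instruction<separator>content' into (instruction, content)."""
    if separator not in query_text:
        return "", query_text
    instruction, content = query_text.split(separator, 1)
    return instruction.strip(), content.strip()


def toy_embed_mask(instruction, content):
    """Return whitespace 'tokens' and a mask: 0 = instruction, 1 = content."""
    instruction_tokens = instruction.split()
    content_tokens = content.split()
    tokens = instruction_tokens + content_tokens
    mask = [0] * len(instruction_tokens) + [1] * len(content_tokens)
    return tokens, mask


query = (
    "Determine the change or the status of the pleural effusion."
    + SEPARATOR
    + "There is a small increase in the left-sided effusion."
)
instruction, content = split_query(query)
tokens, embed_mask = toy_embed_mask(instruction, content)
print(instruction)
print(content)
print(embed_mask)
```

The instruction still shapes every token's representation through attention, but under this convention only positions with mask value 1 contribute to the pooled embedding.
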
### Quick Start Example

Here's a complete example showing the model's capabilities:

```python
import torch
from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LLM2Vec.from_pretrained(
    base_model_name_or_path='lukeingawesome/llm2vec4cxr',
    pooling_mode="latent_attention",
    max_length=512,
    enable_bidirectional=True,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
).to(device).eval()

# Configure tokenizer
model.tokenizer.padding_side = 'left'

# Medical text analysis
instruction = 'Determine the change or the status of the pleural effusion.'
report = 'There is a small increase in the left-sided effusion.'
query = instruction + '!@#$%^&*()' + report

# Compare with different diagnoses
options = [
    'No pleural effusion',
    'Pleural effusion is worsening',
    'Pleural effusion is stable',
    'Pleural effusion is improving'
]

# Get similarity scores
scores = model.compute_similarities(query, options)
best_match = options[torch.argmax(scores)]
print(f"Best match: {best_match} (score: {torch.max(scores):.4f})")
```

## API Reference

The model provides several convenient methods:

### Core Methods

- **`encode_text(texts)`**: Simple text encoding with automatic embed_mask handling
- **`encode_with_separator(texts, separator='!@#$%^&*()')`**: Encoding with instruction/content separation
- **`compute_similarities(query_text, candidate_texts)`**: One-line similarity computation
- **`from_pretrained(..., pooling_mode="latent_attention")`**: Automatic latent attention weight loading

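As a rough mental model of what a one-line similarity method does, the sketch below ranks candidates by cosine similarity between a query embedding and candidate embeddings. That `compute_similarities()` uses cosine similarity is an assumption, the helper names are hypothetical, and the 3-dimensional vectors are hand-made for illustration, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidate_vecs, labels):
    """Return (label, score) pairs sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, vec) for vec in candidate_vecs]
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    return [(labels[i], scores[i]) for i in order]


# Hand-made toy vectors standing in for embeddings (NOT produced by the model).
query_vec = [0.9, 0.1, 0.0]
candidate_vecs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
labels = ["candidate A", "candidate B", "candidate C"]

ranking = rank_candidates(query_vec, candidate_vecs, labels)
for label, score in ranking:
    print(f"{label}: {score:.4f}")
```
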
### Migration from Manual Usage

If you were previously using manual tokenization, you can now simply use:

```python
# Old way (still works); `...` stands for your remaining tokenizer arguments
tokenized = model.tokenizer(text, return_tensors="pt", ...)
# Pool over every attended token by copying the attention mask
tokenized["embed_mask"] = tokenized["attention_mask"].clone()
embeddings = model(tokenized)

# New way (recommended)
embeddings = model.encode_text([text])
```

## Evaluation

The model has been evaluated on chest X-ray report analysis tasks, particularly for:
- Text retrieval and encoding
- Medical text similarity comparison
- Clinical finding extraction

### Sample Performance

The model shows improved performance compared to the base model on medical text understanding tasks, particularly in distinguishing between different pleural effusion states and in handling medical abbreviations.

## Intended Use

### Primary Use Cases
- **Medical Text Embeddings**: Generate embeddings for chest X-ray reports
- **Clinical Text Similarity**: Compare medical texts for semantic similarity
- **Medical Information Retrieval**: Find relevant medical reports or findings
- **Clinical NLP Research**: Foundation model for medical text analysis

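For the retrieval use case, a minimal sketch of top-k search over stored report embeddings might look like the following. It assumes embeddings have already been computed and converted to plain lists of floats; the `TinyIndex` class, the report IDs, and the 2-dimensional vectors are all hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class TinyIndex:
    """In-memory index: stores unit-length embeddings, searches by dot product."""

    def __init__(self):
        self._items = []

    def add(self, doc_id, embedding):
        # Normalize once at insert time so search only needs a dot product.
        self._items.append((doc_id, normalize(embedding)))

    def search(self, query_embedding, k=3):
        query = normalize(query_embedding)
        scored = (
            (sum(a * b for a, b in zip(query, emb)), doc_id)
            for doc_id, emb in self._items
        )
        return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Toy 2-dimensional vectors standing in for report embeddings.
index = TinyIndex()
index.add("report-1", [1.0, 0.0])
index.add("report-2", [0.0, 1.0])
index.add("report-3", [1.0, 1.0])

results = index.search([1.0, 0.2], k=2)
print(results)
```

A linear scan like this is adequate for small collections; larger corpora would typically use a dedicated vector index.
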
### Limitations
- Specialized for chest X-ray reports; may not generalize to other medical domains
- Requires careful preprocessing for optimal performance
- Should be used as part of a larger clinical decision support system, not for standalone diagnosis

## Technical Specifications

- **Model Type**: Bidirectional Language Model (LLM2Vec)
- **Architecture**: LlamaBiModel (modified Llama 3.2)
- **Parameters**: ~1B parameters
- **Input Length**: Up to 512 tokens
- **Output**: Dense embeddings
- **Precision**: bfloat16

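The parameter count and precision above give a quick back-of-the-envelope estimate of the memory needed just to hold the weights. The sketch assumes exactly 1 billion parameters (the card only says "~1B") and ignores activations, buffers, and framework overhead; `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# Assumption: a round 1e9 parameters; the true count may differ.
params = 1_000_000_000
for dtype_name, nbytes in (("float32", 4), ("bfloat16", 2)):
    print(f"{dtype_name}: {weight_memory_gib(params, nbytes):.2f} GiB")
# float32: 3.73 GiB
# bfloat16: 1.86 GiB
```
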
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{llm2vec4cxr,
  title={LLM2Vec4CXR: Fine-tuned LLM for Chest X-ray Report Analysis},
  author={Hanbin Ko},
  year={2025},
  howpublished={\url{https://huggingface.co/lukeingawesome/llm2vec4cxr}},
}
```

A preprint describing this model will be released soon.

## Acknowledgments

This model is built upon:
- [LLM2Vec](https://github.com/McGill-NLP/llm2vec) - Framework for converting decoder-only LLMs into text encoders
- [LLM2CLIP](https://github.com/microsoft/LLM2CLIP) - Microsoft's implementation for connecting LLMs with CLIP models

## License

This model is licensed under the MIT License.