---
license: mit
base_model: microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
tags:
- text-embeddings
- sentence-transformers
- llm2vec
- medical
- chest-xray
- radiology
- clinical-nlp
language:
- en
pipeline_tag: feature-extraction
library_name: transformers
---
# LLM2Vec4CXR - Fine-tuned Model for Chest X-ray Report Analysis
This model is a fine-tuned version of [microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned](https://huggingface.co/microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned) specifically optimized for chest X-ray report analysis and medical text understanding.
## Model Description
LLM2Vec4CXR converts the base decoder-only LLM into a bidirectional text encoder optimized for medical text embeddings. The model was fully fine-tuned with a modified pooling strategy (`latent_attention`) to better capture semantic relationships in chest X-ray reports.
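As a rough illustration of what latent-attention pooling does, the sketch below lets each token state attend over a small set of latent vectors and then mean-pools the result over non-padding tokens. It is a simplified, dependency-free sketch of the general technique, not this model's actual layer: the function name `latent_attention_pool`, the toy inputs, and the omission of learned projections, multiple heads, and any MLP are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def latent_attention_pool(hidden, mask, latents):
    """Pool token states into one vector (hypothetical helper).

    hidden:  T x d list of token hidden states (queries)
    mask:    T ints, 1 for real tokens and 0 for padding
    latents: r x d list of latent vectors (used as keys and values)
    Assumes at least one position in `mask` is 1.
    """
    d = len(latents[0])
    attended = []
    for h in hidden:
        # Scaled dot-product scores between this token and every latent vector.
        scores = [sum(a * b for a, b in zip(h, z)) / math.sqrt(d) for z in latents]
        weights = softmax(scores)
        # Weighted sum of the latent vectors.
        attended.append([sum(w * z[j] for w, z in zip(weights, latents)) for j in range(d)])
    # Mean-pool the attended states over non-padding tokens only.
    kept = [row for row, keep in zip(attended, mask) if keep]
    return [sum(col) / len(kept) for col in zip(*kept)]


# Toy example: three tokens (the last one is padding) and two latent vectors.
pooled = latent_attention_pool(
    hidden=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    mask=[1, 1, 0],
    latents=[[1.0, 0.0], [0.0, 1.0]],
)
print(pooled)
```

In the real model the latent vectors and projections would be learned during fine-tuning; here they are fixed toy values so the data flow is easy to follow.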
### Key Features
- **Base Architecture**: LLM2CLIP-Llama-3.2-1B-Instruct
- **Pooling Mode**: Latent Attention (fine-tuned weights automatically loaded)
- **Bidirectional Processing**: Enabled for better context understanding
- **Medical Domain**: Specialized for chest X-ray report analysis
- **Max Length**: 512 tokens
- **Precision**: bfloat16
- **Automatic Loading**: Latent attention weights are automatically loaded from safetensors
- **Simple API**: Built-in methods for similarity computation and instruction-based encoding
## Training Details
### Training Data
- Fully fine-tuned on chest X-ray reports and medical text data
- Training focused on understanding pleural effusion status and other chest X-ray findings
### Training Configuration
- **Pooling Mode**: `latent_attention` (modified from base model)
- **Enable Bidirectional**: True
- **Max Length**: 512
- **Torch Dtype**: bfloat16
- **Full Fine-tuning**: All model weights were updated during training
## Usage
### Installation
```bash
# Install the LLM2Vec4CXR package directly from GitHub
pip install git+https://github.com/lukeingawesome/llm2vec4cxr.git
# Or clone and install in development mode
git clone https://github.com/lukeingawesome/llm2vec4cxr.git
cd llm2vec4cxr
pip install -e .
```
### Basic Usage
```python
import torch
from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

# Load the model - latent attention weights are automatically loaded!
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LLM2Vec.from_pretrained(
    base_model_name_or_path='lukeingawesome/llm2vec4cxr',
    pooling_mode="latent_attention",
    max_length=512,
    enable_bidirectional=True,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
).to(device).eval()

# Configure tokenizer
model.tokenizer.padding_side = 'left'

# Simple text encoding
report = "There is a small increase in the left-sided effusion. There continues to be volume loss at both bases."
embedding = model.encode_text([report])

# Multiple texts at once
reports = [
    "No acute cardiopulmonary abnormality.",
    "Small bilateral pleural effusions.",
    "Large left pleural effusion with compressive atelectasis.",
]
embeddings = model.encode_text(reports)
```
### Advanced Usage with Instructions and Similarity
```python
# For instruction-following tasks with separator
instruction = 'Determine the change or the status of the pleural effusion.'
report = 'There is a small increase in the left-sided effusion.'
query_text = instruction + '!@#$%^&*()' + report
# Compare against multiple options
candidates = [
    'No pleural effusion',
    'Pleural effusion present',
    'Pleural effusion is worsening',
    'Pleural effusion is improving',
]
# Get similarity scores using the built-in method
similarities = model.compute_similarities(query_text, candidates)
print(f"Similarities: {similarities}")
# For custom separator-based encoding
embeddings = model.encode_with_separator([query_text], separator='!@#$%^&*()')
```
**Note**: The model now includes convenient methods like `compute_similarities()` and `encode_with_separator()` that handle complex tokenization automatically.
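To make the role of the separator concrete, here is a minimal sketch of splitting a query into its instruction and report parts. The helper `split_query` is hypothetical and only illustrates the string convention; how the wrapper handles the separator internally is an assumption not confirmed by this card.

```python
SEPARATOR = "!@#$%^&*()"


def split_query(text, separator=SEPARATOR):
    """Return (instruction, content) for a separator-joined query.

    If the separator is absent, the whole text is treated as content.
    """
    instruction, found, content = text.partition(separator)
    if not found:
        return "", text
    return instruction, content


query_text = (
    "Determine the change or the status of the pleural effusion."
    + SEPARATOR
    + "There is a small increase in the left-sided effusion."
)
instruction, content = split_query(query_text)
print("instruction:", instruction)
print("content:", content)
```

Only the first occurrence of the separator is used as the split point, so a report that happens to contain the same character sequence later stays intact.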
### Quick Start Example
Here's a complete example showing the model's capabilities:
```python
import torch
from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LLM2Vec.from_pretrained(
    base_model_name_or_path='lukeingawesome/llm2vec4cxr',
    pooling_mode="latent_attention",
    max_length=512,
    enable_bidirectional=True,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
).to(device).eval()

# Configure tokenizer
model.tokenizer.padding_side = 'left'

# Medical text analysis
instruction = 'Determine the change or the status of the pleural effusion.'
report = 'There is a small increase in the left-sided effusion.'
query = instruction + '!@#$%^&*()' + report

# Compare with different diagnoses
options = [
    'No pleural effusion',
    'Pleural effusion is worsening',
    'Pleural effusion is stable',
    'Pleural effusion is improving',
]

# Get similarity scores
scores = model.compute_similarities(query, options)

# Convert the argmax tensor to a plain int before indexing the Python list
best_idx = int(torch.argmax(scores))
best_match = options[best_idx]
print(f"Best match: {best_match} (score: {torch.max(scores).item():.4f})")
```
## API Reference
The model provides several convenient methods:
### Core Methods
- **`encode_text(texts)`**: Simple text encoding with automatic `embed_mask` handling
- **`encode_with_separator(texts, separator='!@#$%^&*()')`**: Encoding with instruction/content separation
- **`compute_similarities(query_text, candidate_texts)`**: One-line similarity computation
- **`from_pretrained(..., pooling_mode="latent_attention")`**: Automatic latent attention weight loading
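For intuition about what a similarity score between a query and candidates represents, the sketch below ranks toy vectors by cosine similarity. It assumes cosine similarity as the scoring function, which this card does not state explicitly, and the helper names `cosine_similarity` and `rank_candidates` as well as the 3-dimensional vectors are illustrative stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for a query embedding and three candidate embeddings.
query = [1.0, 2.0, 0.0]
candidates = [[2.0, 4.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -2.0, 0.0]]
ranking = rank_candidates(query, candidates)
for index, score in ranking:
    print(index, round(score, 4))
```

Cosine similarity ignores vector length, so a candidate pointing in the same direction as the query ranks first regardless of its magnitude.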
### Migration from Manual Usage
If you were previously using manual tokenization, you can now simply use:
```python
# Old way (still works)
tokenized = model.tokenizer(text, return_tensors="pt")  # plus any other tokenizer options you were using
tokenized["embed_mask"] = tokenized["attention_mask"].clone()
embeddings = model(tokenized)
# New way (recommended)
embeddings = model.encode_text([text])
```
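The `embed_mask` in the manual path marks which token positions feed the pooling step; copying `attention_mask` selects every non-padding token. When an instruction precedes the report, an LLM2Vec-style encoder can instead restrict the mask to the report tokens, which sit at the end of a left-padded sequence. The sketch below illustrates that idea with plain lists. The helper `build_embed_mask` is hypothetical, and how the wrapper builds its mask internally is an assumption here.

```python
def build_embed_mask(attention_mask, n_content_tokens):
    """Mark only the trailing content tokens of a left-padded sequence.

    attention_mask:   list of 0/1 values, padding (0) on the left
    n_content_tokens: number of tokens belonging to the report text
    """
    # Never mark more positions than there are real tokens.
    n = min(n_content_tokens, sum(attention_mask))
    mask = [0] * len(attention_mask)
    if n > 0:
        mask[-n:] = [1] * n
    return mask


# Two padding tokens, two instruction tokens, three report tokens.
attention_mask = [0, 0, 1, 1, 1, 1, 1]
embed_mask = build_embed_mask(attention_mask, n_content_tokens=3)
print(embed_mask)
```

This also shows why left padding matters for this scheme: with right padding, the trailing positions would be padding rather than report tokens.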
## Evaluation
The model has been evaluated on chest X-ray report analysis tasks, particularly for:
- Text retrieval (used as a text encoder)
- Medical text similarity comparison
- Clinical finding extraction
### Sample Performance
The model shows improved performance compared to the base model on medical text understanding tasks, particularly in distinguishing between different pleural effusion states and medical abbreviations.
## Intended Use
### Primary Use Cases
- **Medical Text Embeddings**: Generate embeddings for chest X-ray reports
- **Clinical Text Similarity**: Compare medical texts for semantic similarity
- **Medical Information Retrieval**: Find relevant medical reports or findings
- **Clinical NLP Research**: Foundation model for medical text analysis
### Limitations
- Specialized for chest X-ray reports - may not generalize to other medical domains
- Requires careful preprocessing for optimal performance
- Should be used as part of a larger clinical decision support system, not for standalone diagnosis
## Technical Specifications
- **Model Type**: Bidirectional Language Model (LLM2Vec)
- **Architecture**: LlamaBiModel (modified Llama 3.2)
- **Parameters**: ~1B parameters
- **Input Length**: Up to 512 tokens
- **Output**: Dense embeddings
- **Precision**: bfloat16
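As a back-of-the-envelope check on these numbers, the weight memory follows from the parameter count and the bytes per parameter (2 for bfloat16, 4 for float32). The sketch below uses the approximate figure of 1B parameters from the list above; the helper `weight_memory_gib` is illustrative, and the estimate covers weights only, not activations or framework overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


approx_params = 1_000_000_000  # "~1B" as stated above; the exact count may differ
for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: {weight_memory_gib(approx_params, dtype):.2f} GiB")
```

Loading in bfloat16 therefore needs roughly half the weight memory of float32.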
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{llm2vec4cxr,
  title={LLM2Vec4CXR: Fine-tuned LLM for Chest X-ray Report Analysis},
  author={Hanbin Ko},
  year={2025},
  howpublished={\url{https://huggingface.co/lukeingawesome/llm2vec4cxr}},
}
```
A preprint describing this model will be released soon.
## Acknowledgments
This model is built upon:
- [LLM2Vec](https://github.com/McGill-NLP/llm2vec) - Framework for converting decoder-only LLMs into text encoders
- [LLM2CLIP](https://github.com/microsoft/LLM2CLIP) - Microsoft's implementation for connecting LLMs with CLIP models
## License
This model is licensed under the MIT License.