---
license: mit
base_model: microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
tags:
- text-embeddings
- sentence-transformers
- llm2vec
- medical
- chest-xray
- radiology
- clinical-nlp
language:
- en
pipeline_tag: feature-extraction
library_name: transformers
---

# LLM2Vec4CXR - Fine-tuned Model for Chest X-ray Report Analysis

This model is a fine-tuned version of [microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned](https://huggingface.co/microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned) specifically optimized for chest X-ray report analysis and medical text understanding.

## Model Description

LLM2Vec4CXR is a bidirectional language model that converts the base decoder-only LLM into a text encoder optimized for medical text embeddings. The model was fully fine-tuned with a modified pooling strategy (`latent_attention`) to better capture semantic relationships in chest X-ray reports.

### Key Features

- **Base Architecture**: LLM2CLIP-Llama-3.2-1B-Instruct
- **Pooling Mode**: Latent Attention (fine-tuned weights automatically loaded)
- **Bidirectional Processing**: Enabled for better context understanding
- **Medical Domain**: Specialized for chest X-ray report analysis
- **Max Length**: 512 tokens
- **Precision**: bfloat16
- **Automatic Loading**: Latent attention weights are automatically loaded from safetensors
- **Simple API**: Built-in methods for similarity computation and instruction-based encoding

## Training Details

### Training Data
- Fully fine-tuned on chest X-ray reports and medical text data
- Training focused on understanding pleural effusion status and other chest X-ray findings

### Training Configuration
- **Pooling Mode**: `latent_attention` (modified from base model)
- **Enable Bidirectional**: True
- **Max Length**: 512
- **Torch Dtype**: bfloat16
- **Full Fine-tuning**: All model weights were updated during training

## Usage

### Installation

```bash
# Install the LLM2Vec4CXR package directly from GitHub
pip install git+https://github.com/lukeingawesome/llm2vec4cxr.git

# Or clone and install in development mode
git clone https://github.com/lukeingawesome/llm2vec4cxr.git
cd llm2vec4cxr
pip install -e .
```

### Basic Usage

```python
import torch
from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

# Load the model - latent attention weights are automatically loaded!
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LLM2Vec.from_pretrained(
    base_model_name_or_path='lukeingawesome/llm2vec4cxr',
    pooling_mode="latent_attention",
    max_length=512,
    enable_bidirectional=True,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
).to(device).eval()

# Configure tokenizer
model.tokenizer.padding_side = 'left'

# Simple text encoding
report = "There is a small increase in the left-sided effusion. There continues to be volume loss at both bases."
embedding = model.encode_text([report])

# Multiple texts at once
reports = [
    "No acute cardiopulmonary abnormality.",
    "Small bilateral pleural effusions.",
    "Large left pleural effusion with compressive atelectasis."
]
embeddings = model.encode_text(reports)
```

### Advanced Usage with Instructions and Similarity

```python
# For instruction-following tasks with separator
instruction = 'Determine the change or the status of the pleural effusion.'
report = 'There is a small increase in the left-sided effusion.'
query_text = instruction + '!@#$%^&*()' + report

# Compare against multiple options
candidates = [
    'No pleural effusion',
    'Pleural effusion present',
    'Pleural effusion is worsening',
    'Pleural effusion is improving'
]

# Get similarity scores using the built-in method
similarities = model.compute_similarities(query_text, candidates)
print(f"Similarities: {similarities}")

# For custom separator-based encoding
embeddings = model.encode_with_separator([query_text], separator='!@#$%^&*()')
```

**Note**: The model now includes convenient methods like `compute_similarities()` and `encode_with_separator()` that handle complex tokenization automatically.

### Quick Start Example

Here's a complete example showing the model's capabilities:

```python
import torch
from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LLM2Vec.from_pretrained(
    base_model_name_or_path='lukeingawesome/llm2vec4cxr',
    pooling_mode="latent_attention",
    max_length=512,
    enable_bidirectional=True,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
).to(device).eval()

# Configure tokenizer
model.tokenizer.padding_side = 'left'

# Medical text analysis
instruction = 'Determine the change or the status of the pleural effusion.'
report = 'There is a small increase in the left-sided effusion.'
query = instruction + '!@#$%^&*()' + report

# Compare with different diagnoses
options = [
    'No pleural effusion',
    'Pleural effusion is worsening', 
    'Pleural effusion is stable',
    'Pleural effusion is improving'
]

# Get similarity scores
scores = model.compute_similarities(query, options)
best_match = options[torch.argmax(scores)]
print(f"Best match: {best_match} (score: {torch.max(scores):.4f})")
```

## API Reference

The model provides several convenient methods (a combined usage sketch follows the list below):

### Core Methods

- **`encode_text(texts)`**: Simple text encoding with automatic embed_mask handling
- **`encode_with_separator(texts, separator='!@#$%^&*()')`**: Encoding with instruction/content separation
- **`compute_similarities(query_text, candidate_texts)`**: One-line similarity computation
- **`from_pretrained(..., pooling_mode="latent_attention")`**: Automatic latent attention weight loading
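
A compact sketch tying these methods together (this assumes `model` was loaded as in the Basic Usage section and that each method returns a `torch.Tensor`; the printed shapes are illustrative, not guaranteed):

```python
sep = '!@#$%^&*()'

# encode_text: plain reports in, one embedding per report out
emb = model.encode_text(["No acute cardiopulmonary abnormality."])
print(emb.shape)  # e.g. (1, hidden_size)

# encode_with_separator: instruction + separator + report
query = 'Determine the change or the status of the pleural effusion.' + sep + 'The effusion has resolved.'
q_emb = model.encode_with_separator([query], separator=sep)

# compute_similarities: one similarity score per candidate
scores = model.compute_similarities(query, ['No pleural effusion', 'Pleural effusion is improving'])
print(scores)
```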

### Migration from Manual Usage

If you were previously using manual tokenization, you can now simply use:

```python
# Old way (still works)
tokenized = model.tokenizer(text, return_tensors="pt", ...)
tokenized["embed_mask"] = tokenized["attention_mask"].clone()
embeddings = model(tokenized)

# New way (recommended)
embeddings = model.encode_text([text])
```

## Evaluation

The model has been evaluated on chest X-ray report analysis tasks, particularly for:
- Text retrieval and report encoding
- Medical text similarity comparison
- Clinical finding extraction

### Sample Performance

The model shows improved performance compared to the base model on medical text understanding tasks, particularly in distinguishing between different pleural effusion states and in handling medical abbreviations.

## Intended Use

### Primary Use Cases
- **Medical Text Embeddings**: Generate embeddings for chest X-ray reports
- **Clinical Text Similarity**: Compare medical texts for semantic similarity
- **Medical Information Retrieval**: Find relevant medical reports or findings (see the retrieval sketch after this list)
- **Clinical NLP Research**: Foundation model for medical text analysis
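
For the retrieval use case, here is a minimal sketch using only `encode_text` and cosine similarity (it assumes `model` was loaded as in the Basic Usage section and that `encode_text` returns a `torch.Tensor` of shape `(batch, dim)`; the corpus and query are illustrative):

```python
import torch
import torch.nn.functional as F

# Small report corpus and a free-text query (illustrative examples)
corpus = [
    "No acute cardiopulmonary abnormality.",
    "Small bilateral pleural effusions.",
    "Large left pleural effusion with compressive atelectasis.",
]
query = "Findings consistent with a pleural effusion."

# Encode, L2-normalize, and rank by cosine similarity
corpus_emb = F.normalize(model.encode_text(corpus).float(), dim=-1)
query_emb = F.normalize(model.encode_text([query]).float(), dim=-1)
scores = (query_emb @ corpus_emb.T).squeeze(0)  # shape: (len(corpus),)

for idx in torch.argsort(scores, descending=True):
    print(f"{scores[idx].item():.4f}  {corpus[int(idx)]}")
```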

### Limitations
- Specialized for chest X-ray reports - may not generalize to other medical domains
- Requires careful preprocessing for optimal performance
- Should be used as part of a larger clinical decision support system, not for standalone diagnosis

## Technical Specifications

- **Model Type**: Bidirectional Language Model (LLM2Vec)
- **Architecture**: LlamaBiModel (modified Llama 3.2)
- **Parameters**: ~1B parameters
- **Input Length**: Up to 512 tokens
- **Output**: Dense embeddings
- **Precision**: bfloat16

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{llm2vec4cxr,
  title={LLM2Vec4CXR: Fine-tuned LLM for Chest X-ray Report Analysis},
  author={Hanbin Ko},
  year={2025},
  howpublished={\url{https://huggingface.co/lukeingawesome/llm2vec4cxr}},
}
```

A preprint describing this model will be released soon.

## Acknowledgments

This model is built upon:
- [LLM2Vec](https://github.com/McGill-NLP/llm2vec) - Framework for converting decoder-only LLMs into text encoders
- [LLM2CLIP](https://github.com/microsoft/LLM2CLIP) - Microsoft's implementation for connecting LLMs with CLIP models

## License

This model is licensed under the MIT License.