	---
	license: mit
	base_model: microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
	tags:
	- text-embeddings
	- sentence-transformers
	- llm2vec
	- medical
	- chest-xray
	- radiology
	- clinical-nlp
	language:
	- en
	pipeline_tag: feature-extraction
	library_name: transformers
	---

	# LLM2Vec4CXR - Fine-tuned Model for Chest X-ray Report Analysis

	This model is a fine-tuned version of [microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned](https://huggingface.co/microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned) specifically optimized for chest X-ray report analysis and medical text understanding.

	## Model Description

	LLM2Vec4CXR is a bidirectional text encoder obtained by converting the base decoder-only LLM into an embedding model for medical text. The model was fully fine-tuned with a modified pooling strategy (`latent_attention`) to better capture semantic relationships in chest X-ray reports.

	### Key Features

	- Base Architecture: LLM2CLIP-Llama-3.2-1B-Instruct
	- Pooling Mode: Latent Attention (fine-tuned weights automatically loaded)
	- Bidirectional Processing: Enabled for better context understanding
	- Medical Domain: Specialized for chest X-ray report analysis
	- Max Length: 512 tokens
	- Precision: bfloat16
	- Automatic Loading: Latent attention weights are automatically loaded from safetensors
	- Simple API: Built-in methods for similarity computation and instruction-based encoding

	## Training Details

	### Training Data
	- Fully fine-tuned on chest X-ray reports and medical text data
	- Training focused on understanding pleural effusion status and other chest X-ray findings

	### Training Configuration
	- Pooling Mode: `latent_attention` (modified from base model)
	- Enable Bidirectional: True
	- Max Length: 512
	- Torch Dtype: bfloat16
	- Full Fine-tuning: All model weights were updated during training

	## Usage

	### Installation

	```bash
	# Install the LLM2Vec4CXR package directly from GitHub
	pip install git+https://github.com/lukeingawesome/llm2vec4cxr.git

	# Or clone and install in development mode
	git clone https://github.com/lukeingawesome/llm2vec4cxr.git
	cd llm2vec4cxr
	pip install -e .
	```

	### Basic Usage

	```python
	import torch
	from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

	# Load the model - latent attention weights are automatically loaded!
	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
	model = LLM2Vec.from_pretrained(
	    base_model_name_or_path='lukeingawesome/llm2vec4cxr',
	    pooling_mode="latent_attention",
	    max_length=512,
	    enable_bidirectional=True,
	    torch_dtype=torch.bfloat16,
	    use_safetensors=True,
	).to(device).eval()

	# Configure tokenizer
	model.tokenizer.padding_side = 'left'

	# Simple text encoding
	report = "There is a small increase in the left-sided effusion. There continues to be volume loss at both bases."
	embedding = model.encode_text([report])

	# Multiple texts at once
	reports = [
	    "No acute cardiopulmonary abnormality.",
	    "Small bilateral pleural effusions.",
	    "Large left pleural effusion with compressive atelectasis.",
	]
	embeddings = model.encode_text(reports)
	```

	### Advanced Usage with Instructions and Similarity

	```python
	# For instruction-following tasks with separator
	instruction = 'Determine the change or the status of the pleural effusion.'
	report = 'There is a small increase in the left-sided effusion.'
	query_text = instruction + '!@#$%^&*()' + report

	# Compare against multiple options
	candidates = [
	    'No pleural effusion',
	    'Pleural effusion present',
	    'Pleural effusion is worsening',
	    'Pleural effusion is improving',
	]

	# Get similarity scores using the built-in method
	similarities = model.compute_similarities(query_text, candidates)
	print(f"Similarities: {similarities}")

	# For custom separator-based encoding
	embeddings = model.encode_with_separator([query_text], separator='!@#$%^&*()')
	```

	Note: The model now includes convenient methods like `compute_similarities()` and `encode_with_separator()` that handle complex tokenization automatically.
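
	The separator marks where the instruction ends and the report text begins. As a rough illustration of that boundary, the sketch below splits a combined query string back into its two parts. `split_query` is a hypothetical helper written for this example and is not part of the `llm2vec_wrapper` API; it assumes the separator appears at most once.

```python
SEPARATOR = "!@#$%^&*()"  # same separator string used in the examples above


def split_query(query_text: str, separator: str = SEPARATOR) -> tuple[str, str]:
    """Split 'instruction + separator + report' into its two parts.

    Hypothetical helper for illustration only. If the separator is absent,
    the whole string is treated as report text with an empty instruction.
    """
    instruction, found, report = query_text.partition(separator)
    if not found:
        return "", query_text
    return instruction, report


instruction, report = split_query(
    "Determine the change or the status of the pleural effusion."
    + SEPARATOR
    + "There is a small increase in the left-sided effusion."
)
print(instruction)
print(report)
```

	Using an unusual character sequence as the separator makes it unlikely to collide with text that occurs naturally in a radiology report.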

	### Quick Start Example

	Here's a complete example showing the model's capabilities:

	```python
	import torch
	from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

	# Load model
	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
	model = LLM2Vec.from_pretrained(
	    base_model_name_or_path='lukeingawesome/llm2vec4cxr',
	    pooling_mode="latent_attention",
	    max_length=512,
	    enable_bidirectional=True,
	    torch_dtype=torch.bfloat16,
	    use_safetensors=True,
	).to(device).eval()

	# Configure tokenizer
	model.tokenizer.padding_side = 'left'

	# Medical text analysis
	instruction = 'Determine the change or the status of the pleural effusion.'
	report = 'There is a small increase in the left-sided effusion.'
	query = instruction + '!@#$%^&*()' + report

	# Compare with different diagnoses
	options = [
	    'No pleural effusion',
	    'Pleural effusion is worsening',
	    'Pleural effusion is stable',
	    'Pleural effusion is improving',
	]

	# Get similarity scores
	scores = model.compute_similarities(query, options)
	best_match = options[torch.argmax(scores)]
	print(f"Best match: {best_match} (score: {torch.max(scores):.4f})")
	```

	## API Reference

	The model provides several convenient methods:

	### Core Methods

	- `encode_text(texts)`: Simple text encoding with automatic `embed_mask` handling
	- `encode_with_separator(texts, separator='!@#$%^&*()')`: Encoding with instruction/content separation
	- `compute_similarities(query_text, candidate_texts)`: One-line similarity computation
	- `from_pretrained(..., pooling_mode="latent_attention")`: Automatic latent attention weight loading
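
	To make the idea of similarity-based ranking concrete, here is a minimal pure-Python sketch that scores candidates against a query by cosine similarity. It is an assumption that `compute_similarities()` uses cosine similarity internally; the functions `cosine_similarity` and `rank_candidates` and the toy 3-dimensional vectors are hypothetical stand-ins, not the wrapper's implementation or real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidate_vecs, labels):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(query_vec, vec))
        for label, vec in zip(labels, candidate_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (illustrative values only).
query = [1.0, 0.0, 1.0]
candidates = [[1.0, 0.0, 0.9], [0.0, 1.0, 0.0], [-1.0, 0.0, -1.0]]
labels = ["candidate A", "candidate B", "candidate C"]
ranking = rank_candidates(query, candidates, labels)
print(ranking[0][0])  # candidate A
```

	Cosine similarity ignores vector magnitude, so the ranking depends only on the direction of each embedding.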

	### Migration from Manual Usage

	If you were previously using manual tokenization, you can now simply use:

	```python
	# Old way (still works)
	tokenized = model.tokenizer(text, return_tensors="pt", ...)
	tokenized["embed_mask"] = tokenized["attention_mask"].clone()
	embeddings = model(tokenized)

	# New way (recommended)
	embeddings = model.encode_text([text])
	```
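
	In the manual path above, `embed_mask` is a copy of `attention_mask`, so every non-padding token contributes to pooling. When an instruction precedes the report, one plausible scheme is to pool over the report tokens only. The sketch below illustrates that with plain Python lists; it is an assumption modeled on how the upstream LLM2Vec library treats instructions, not a confirmed description of this wrapper, and `build_embed_mask` is a hypothetical helper.

```python
def build_embed_mask(attention_mask: list[int], num_content_tokens: int) -> list[int]:
    """Mark only the last `num_content_tokens` real tokens for pooling.

    Assumes left padding, so real tokens sit at the end of the sequence
    and the content (report) tokens come after the instruction tokens.
    """
    num_real = sum(attention_mask)
    if not 0 <= num_content_tokens <= num_real:
        raise ValueError("content length must fit inside the real tokens")
    seq_len = len(attention_mask)
    mask = [0] * seq_len
    for i in range(seq_len - num_content_tokens, seq_len):
        mask[i] = 1
    return mask


# 2 padding tokens, 3 instruction tokens, 4 report tokens (left-padded).
attention_mask = [0, 0, 1, 1, 1, 1, 1, 1, 1]
embed_mask = build_embed_mask(attention_mask, num_content_tokens=4)
print(embed_mask)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

	With left padding the report tokens always occupy the final positions, which is why the mask can be built by counting back from the end of the sequence.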

	## Evaluation

	The model has been evaluated on chest X-ray report analysis tasks, particularly for:
	- Text retrieval and encoding
	- Medical text similarity comparison
	- Clinical finding extraction

	### Sample Performance

	The model shows improved performance over the base model on medical text understanding tasks, particularly in distinguishing between pleural effusion states and in interpreting medical abbreviations.

	## Intended Use

	### Primary Use Cases
	- Medical Text Embeddings: Generate embeddings for chest X-ray reports
	- Clinical Text Similarity: Compare medical texts for semantic similarity
	- Medical Information Retrieval: Find relevant medical reports or findings
	- Clinical NLP Research: Foundation model for medical text analysis

	### Limitations
	- Specialized for chest X-ray reports - may not generalize to other medical domains
	- Requires careful preprocessing for optimal performance
	- Should be used as part of a larger clinical decision support system, not for standalone diagnosis

	## Technical Specifications

	- Model Type: Bidirectional Language Model (LLM2Vec)
	- Architecture: LlamaBiModel (modified Llama 3.2)
	- Parameters: ~1B parameters
	- Input Length: Up to 512 tokens
	- Output: Dense embeddings
	- Precision: bfloat16

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{llm2vec4cxr,
	  title={LLM2Vec4CXR: Fine-tuned LLM for Chest X-ray Report Analysis},
	  author={Hanbin Ko},
	  year={2025},
	  howpublished={\url{https://huggingface.co/lukeingawesome/llm2vec4cxr}},
	}
	```

	A preprint describing this model will be released soon.

	## Acknowledgments

	This model is built upon:
	- [LLM2Vec](https://github.com/McGill-NLP/llm2vec) - Framework for converting decoder-only LLMs into text encoders
	- [LLM2CLIP](https://github.com/microsoft/LLM2CLIP) - Microsoft's implementation for connecting LLMs with CLIP models

	## License

	This model is licensed under the MIT License.