Model Card
Model Description
MedSwin-7B-KD is a 7B-parameter language model for medical question answering and clinical reasoning. It was created by applying a Dual-Phase Knowledge Distillation (KD) pipeline to the medalpaca/medalpaca-7b base model. Unlike its SFT predecessor, this model uses the larger google/medgemma-27b-it model as a "teacher" to guide the training of the smaller, more efficient "student" model, drawing on the teacher's broader knowledge and stronger reasoning. The result is a compact model that aims to capture much of the clinical reasoning of its far larger teacher.
- Developed by: Medical AI Team, Swinburne University of Technology
- Funded by: Swinburne University of Technology
- Base Model (Student): medalpaca/medalpaca-7b
- Teacher Model: google/medgemma-27b-it
- Language(s): English
- License: Apache 2.0
Intended Use
This model is intended for research purposes in the following domains:
- AI-assisted medicine and clinical decision support research.
- Biomedical natural language processing (NLP).
- Exploration of efficient knowledge distillation and model compression in specialized domains.
- Generating high-quality, clinically-grounded synthetic data.
Training Data
The model was trained on the same curated and augmented collection of medical QA datasets as the SFT version, but the target outputs were generated by the teacher model.
- PubMedQA: Original and processed (map, u, l) variants for factoid and research-oriented questions.
- HealthCareMagic & iCliniq: Real-world patient-doctor interactions from online portals.
Data Curation & Knowledge Distillation Pipeline
The training pipeline was fundamentally redesigned to center on knowledge distillation, moving beyond simple paraphrasing to focus on transferring deep reasoning patterns.
| Stage | Purpose | Methodology & Quality Control |
|---|---|---|
| A. Augmented Query Generation | Create a diverse set of high-quality input prompts. | Utilizes the same multi-model paraphrasing, back-translation, and style standardization pipeline from the SFT model to generate a rich variety of instructions and inputs. |
| B. Teacher Forcing & Output Generation | Generate "gold-standard" responses using the superior teacher model. | Teacher Model: google/medgemma-27b-it. Generation Strategy: Low-temperature sampling with contrastive decoding to produce confident, factually-dense, and well-structured answers. Input: The entire augmented set of (Instruction, Input) pairs from Stage A. |
| C. Response Filtering & Alignment | Ensure the teacher's outputs are of the highest quality for student training. | Factual Consistency Check: Cross-referencing key medical claims against the original context. Style Alignment: Enforcing the neutral, professional clinical tone. Complexity Pruning: Removing outputs that are overly verbose or rely on reasoning chains too complex for the student model to learn effectively. |
| D. Dual-Phase Knowledge Distillation | Transfer knowledge from teacher to student. | Phase 1 (Response Mimicking): The student model is trained to directly reproduce the teacher's filtered outputs, learning its style and factual presentation. Phase 2 (Logit Matching): The student is trained to align its internal probability distributions (logits) with the teacher's for the same input, capturing the teacher's "thinking process" and confidence calibration. A minimal loss sketch is shown below the table. |
| E. Quality Assurance | Ensure the final training pairs are optimal for distillation. | F1. Data Cleaning: PHI removal; MD5-based deduplication. F2. KD-Specific Validation: Checking for alignment between query complexity and response depth; ensuring student-trainable reasoning patterns. |
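The exact Stage D training objective is not published on this card; the snippet below is only a minimal sketch of a dual-phase distillation loss under common assumptions: a weighted sum of cross-entropy on the teacher's filtered response (Phase 1) and a temperature-scaled KL term between student and teacher logits (Phase 2). The function name `kd_loss` and the hyperparameters `alpha` and `temperature` are illustrative, and the sketch assumes both models share a vocabulary, which the Gemma-based teacher and LLaMA-based student do not; the real pipeline would need an additional vocabulary-alignment step.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, target_ids, alpha=0.5, temperature=2.0):
    """Illustrative dual-phase distillation loss (not the released training code).

    student_logits, teacher_logits: (batch, seq_len, vocab_size) tensors
    target_ids: (batch, seq_len) ids of the teacher's filtered response,
                with prompt/padding positions set to -100
    """
    vocab_size = student_logits.size(-1)

    # Phase 1 (response mimicking): cross-entropy against the teacher-generated tokens.
    ce = F.cross_entropy(
        student_logits.reshape(-1, vocab_size),
        target_ids.reshape(-1),
        ignore_index=-100,
    )

    # Phase 2 (logit matching): KL divergence between temperature-softened
    # student and teacher distributions, scaled by T^2 (Hinton et al., 2015).
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab_size),
        F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab_size),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * ce + (1.0 - alpha) * kl
```

Note that the table describes the two phases as sequential training stages; combining them into a single weighted loss, as above, is only one common formulation.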
Output Format
All training data was formatted into the same standardized SFT structure, but the outputs are now teacher-generated:
```
### Instruction:
{Task descriptor and/or user question with context}

### Input:
{Additional user question or context, if any}

### Output:
{The teacher model's (MedGemma-27b) target response}
```
Each data point includes metadata tags for its augmentation source and a `distilled_from: medgemma-27b` tag.
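For illustration, a single training record could be represented as the Python dict below; aside from the Instruction/Input/Output structure and the documented `distilled_from: medgemma-27b` tag, the key names and example values are hypothetical.

```python
# Hypothetical training record. Only the Instruction/Input/Output structure and
# the "distilled_from" tag are documented on this card; the other keys and the
# example values are illustrative.
example_record = {
    "instruction": "Based on the provided context, what is the most likely diagnosis?",
    "input": "A 45-year-old male presents with acute, crushing substernal chest pain ...",
    "output": "<teacher-generated response from google/medgemma-27b-it>",
    "metadata": {
        "augmentation_source": "back-translation",  # e.g., which Stage A augmentation produced the query
        "distilled_from": "medgemma-27b",
    },
}
```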
Usage
You can load and use the model with the Hugging Face transformers library in exactly the same way as the SFT version, with potentially improved performance.
```python
import transformers
model_id = "MedAI-COS30018/MedSwin-7B-KD"
pipeline = transformers.pipeline(
"text-generation",
model=model_id,
device_map="auto", # Use GPU if available
)
# Format your input according to the training template
instruction = "Based on the provided context, what is the most likely diagnosis?"
context = "A 45-year-old male presents with acute, crushing substernal chest pain radiating to the left arm, associated with diaphoresis and nausea for the past hour."
formatted_prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Output:\n"
# Generate a response
sequences = pipeline(
formatted_prompt,
max_new_tokens=256,
do_sample=True,
temperature=0.3,
top_p=0.9,
eos_token_id=pipeline.tokenizer.eos_token_id,
)
print(sequences[0]['generated_text'])
```
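Note that the text-generation pipeline returns the prompt together with the completion by default; to show only the model's answer, you can either slice off the prompt or pass `return_full_text=False` when calling the pipeline.

```python
# Keep only the newly generated answer (the pipeline echoes the prompt by default;
# alternatively, pass return_full_text=False in the pipeline call).
answer = sequences[0]["generated_text"][len(formatted_prompt):].strip()
print(answer)
```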
Bias, Risks, and Limitations
The model inherits and may amplify biases and limitations present in its base model, teacher model, and training data. These can include:
- Demographic Biases: Biases related to race, gender, age, or socioeconomic status based on patterns in the source data.
- Clinical Biases: Potential over-representation of certain conditions, treatments, or clinical perspectives.
- Factual Accuracy: While the teacher model is highly capable, it is not infallible. The distilled model may propagate or even amplify any errors made by the teacher. It is not a certified medical knowledge base and can generate incorrect or outdated information.
Safe Deployment: Use a Human-in-the-Loop (HITL) system for any real-world application. Outputs must be verified by a qualified healthcare professional. Do not use for direct patient care without rigorous clinical validation.
Technical Specifications & Evaluation
- Model Architecture: Based on LLaMA, fine-tuned via Dual-Phase Knowledge Distillation.
- Model Size: 7 billion parameters.
- Teacher Model Size: 27 billion parameters.
- Input Format: Instruction-Input-Output structure.
- Key Metric: BERTScore (F1) = 0.84.
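The card does not describe the evaluation harness behind the BERTScore figure above; as a rough illustration, a BERTScore F1 can be computed with the public bert-score package as follows, where the candidate/reference strings are placeholders rather than the actual test set.

```python
from bert_score import score  # pip install bert-score

# Placeholder model outputs and gold answers; the reported 0.84 F1 comes from
# the benchmark referenced below, not from these examples.
candidates = ["The presentation is most consistent with an acute myocardial infarction."]
references = ["Acute myocardial infarction is the most likely diagnosis."]

precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```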
See the Benchmark Document Preview for the full set of benchmark metrics.