Model Card
Model Description
MedSwin-7B-KD is a 7B-parameter language model for medical question answering and clinical reasoning. It was created by applying a Dual-Phase Knowledge Distillation (KD) pipeline to the medalpaca/medalpaca-7b base model. Unlike its SFT predecessor, this model uses the larger google/medgemma-27b-it model as a "teacher" to guide the training of the smaller, more efficient "student" model, drawing on the teacher's broader knowledge and stronger reasoning. The result is a compact model that aims to capture much of the clinical reasoning of its far larger teacher.
- Developed by: Medical AI Team, Swinburne University of Technology
- Funded by: Swinburne University of Technology
- Base Model (Student): medalpaca/medalpaca-7b
- Teacher Model: google/medgemma-27b-it
- Language(s): English
- License: Apache 2.0
Intended Use
This model is intended for research purposes in the following domains:
- AI-assisted medicine and clinical decision support research.
- Biomedical natural language processing (NLP).
- Exploration of efficient knowledge distillation and model compression in specialized domains.
- Generating high-quality, clinically-grounded synthetic data.
Training Data
The model was trained on the same curated and augmented collection of medical QA datasets as the SFT version, but the target outputs were generated by the teacher model.
- PubMedQA: Original and processed (map, u, l) variants for factoid and research-oriented questions.
- HealthCareMagic & iCliniq: Real-world patient-doctor interactions from online portals.
Data Curation & Knowledge Distillation Pipeline
The training pipeline was fundamentally redesigned to center on knowledge distillation, moving beyond simple paraphrasing to focus on transferring deep reasoning patterns.
| Stage | Purpose | Methodology & Quality Control |
|---|---|---|
| A. Augmented Query Generation | Create a diverse set of high-quality input prompts. | Utilizes the same multi-model paraphrasing, back-translation, and style standardization pipeline from the SFT model to generate a rich variety of instructions and inputs. |
| B. Teacher Forcing & Output Generation | Generate "gold-standard" responses using the superior teacher model. | Teacher Model: google/medgemma-27b-it. Generation Strategy: Low-temperature sampling with contrastive decoding to produce confident, factually-dense, and well-structured answers. Input: The entire augmented set of (Instruction, Input) pairs from Stage A. |
| C. Response Filtering & Alignment | Ensure the teacher's outputs are of the highest quality for student training. | Factual Consistency Check: Cross-referencing key medical claims against the original context. Style Alignment: Enforcing the neutral, professional clinical tone. Complexity Pruning: Removing outputs that are overly verbose or rely on reasoning chains too complex for the student model to learn effectively. |
| D. Dual-Phase Knowledge Distillation | Transfer knowledge from teacher to student. | Phase 1 (Response Mimicking): The student model is trained to directly reproduce the teacher's filtered outputs, learning its style and factual presentation. Phase 2 (Logit Matching): The student is trained to align its internal probability distributions (logits) with the teacher's for the same input, capturing the teacher's "thinking process" and confidence calibration. A minimal loss sketch is shown below the table. |
| E. Quality Assurance | Ensure the final training pairs are optimal for distillation. | F1. Data Cleaning: PHI removal; MD5-based deduplication. F2. KD-Specific Validation: Checking for alignment between query complexity and response depth; ensuring student-trainable reasoning patterns. |
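The exact Stage D training objective is not published on this card; the snippet below is only a minimal sketch of a dual-phase distillation loss under common assumptions: a weighted sum of cross-entropy on the teacher's filtered response (Phase 1) and a temperature-scaled KL term between student and teacher logits (Phase 2). The function name `kd_loss` and the hyperparameters `alpha` and `temperature` are illustrative, and the sketch assumes both models share a vocabulary, which the Gemma-based teacher and LLaMA-based student do not; the real pipeline would need an additional vocabulary-alignment step.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, target_ids, alpha=0.5, temperature=2.0):
    """Illustrative dual-phase distillation loss (not the released training code).

    student_logits, teacher_logits: (batch, seq_len, vocab_size) tensors
    target_ids: (batch, seq_len) ids of the teacher's filtered response,
                with prompt/padding positions set to -100
    """
    vocab_size = student_logits.size(-1)

    # Phase 1 (response mimicking): cross-entropy against the teacher-generated tokens.
    ce = F.cross_entropy(
        student_logits.reshape(-1, vocab_size),
        target_ids.reshape(-1),
        ignore_index=-100,
    )

    # Phase 2 (logit matching): KL divergence between temperature-softened
    # student and teacher distributions, scaled by T^2 (Hinton et al., 2015).
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab_size),
        F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab_size),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * ce + (1.0 - alpha) * kl
```

Note that the table describes the two phases as sequential training stages; combining them into a single weighted loss, as above, is only one common formulation.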
Output Format
All training data was formatted into the same standardized SFT structure, but the outputs are now teacher-generated:
```
### Instruction:
{Task descriptor and/or user question with context}

### Input:
{Additional user question or context, if any}

### Output:
{The teacher model's (MedGemma-27b) target response}
```
Each data point includes metadata tags for its augmentation source and a `distilled_from: medgemma-27b` tag.
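For illustration, a single training record could be represented as the Python dict below; aside from the Instruction/Input/Output structure and the documented `distilled_from: medgemma-27b` tag, the key names and example values are hypothetical.

```python
# Hypothetical training record. Only the Instruction/Input/Output structure and
# the "distilled_from" tag are documented on this card; the other keys and the
# example values are illustrative.
example_record = {
    "instruction": "Based on the provided context, what is the most likely diagnosis?",
    "input": "A 45-year-old male presents with acute, crushing substernal chest pain ...",
    "output": "<teacher-generated response from google/medgemma-27b-it>",
    "metadata": {
        "augmentation_source": "back-translation",  # e.g., which Stage A augmentation produced the query
        "distilled_from": "medgemma-27b",
    },
}
```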
Usage
You can load and use the model with the Hugging Face transformers library in exactly the same way as the SFT version, with potentially improved performance.
```python
import transformers
model_id = "MedAI-COS30018/MedSwin-7B-KD"
pipeline = transformers.pipeline(
"text-generation",
model=model_id,
device_map="auto", # Use GPU if available
)
# Format your input according to the training template
instruction = "Based on the provided context, what is the most likely diagnosis?"
context = "A 45-year-old male presents with acute, crushing substernal chest pain radiating to the left arm, associated with diaphoresis and nausea for the past hour."
formatted_prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Output:\n"
# Generate a response
sequences = pipeline(
formatted_prompt,
max_new_tokens=256,
do_sample=True,
temperature=0.3,
top_p=0.9,
eos_token_id=pipeline.tokenizer.eos_token_id,
)
print(sequences[0]['generated_text'])
```
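Note that the text-generation pipeline returns the prompt together with the completion by default; to show only the model's answer, you can either slice off the prompt or pass `return_full_text=False` when calling the pipeline.

```python
# Keep only the newly generated answer (the pipeline echoes the prompt by default;
# alternatively, pass return_full_text=False in the pipeline call).
answer = sequences[0]["generated_text"][len(formatted_prompt):].strip()
print(answer)
```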
Bias, Risks, and Limitations
The model inherits and may amplify biases and limitations present in its base model, teacher model, and training data. These can include:
- Demographic Biases: Biases related to race, gender, age, or socioeconomic status based on patterns in the source data.
- Clinical Biases: Potential over-representation of certain conditions, treatments, or clinical perspectives.
- Factual Accuracy: While the teacher model is highly capable, it is not infallible. The distilled model may propagate or even amplify any errors made by the teacher. It is not a certified medical knowledge base and can generate incorrect or outdated information.
Safe Deployment: Use a Human-in-the-Loop (HITL) system for any real-world application. Outputs must be verified by a qualified healthcare professional. Do not use for direct patient care without rigorous clinical validation.
Technical Specifications & Evaluation
- Model Architecture: Based on LLaMA, fine-tuned via Dual-Phase Knowledge Distillation.
- Model Size: 7 billion parameters.
- Teacher Model Size: 27 billion parameters.
- Input Format: Instruction-Input-Output structure.
- Key Metric: BERTScore (F1) = 0.84.
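The card does not describe the evaluation harness behind the BERTScore figure above; as a rough illustration, a BERTScore F1 can be computed with the public bert-score package as follows, where the candidate/reference strings are placeholders rather than the actual test set.

```python
from bert_score import score  # pip install bert-score

# Placeholder model outputs and gold answers; the reported 0.84 F1 comes from
# the benchmark referenced below, not from these examples.
candidates = ["The presentation is most consistent with an acute myocardial infarction."]
references = ["Acute myocardial infarction is the most likely diagnosis."]

precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```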
See the Benchmark Document Preview for the full set of benchmark metrics.