vit-large-patch32-384-finetuned-skin-lesion-classification

Vision Transformer model fine-tuned for skin lesion classification across 12 classes. This model builds on the pre-trained ViT (originally trained on ImageNet-21k and fine-tuned on ImageNet-2012) and adapts a checkpoint initially focused on melanoma detection (UnipaPolitoUnimore/vit-large-patch32-384-melanoma).


Model Description

  • Architecture: Vision Transformer (ViT) that takes 384x384-pixel images and processes them as a sequence of fixed-size 32x32 patches.
  • Modifications:
    • Replaced the original melanoma model's three-class head with a new linear classifier for 12 classes (a loading sketch follows this list).
    • Classes: actinic keratosis, basal cell carcinoma, clear skin, dermatofibroma, melanoma, melanoma metastasis, nevus, random, seborrheic keratosis, solar lentigo, squamous cell carcinoma, and vascular lesion.
  • Feature Extractor: Leverages pretrained skin lesion features for transfer learning.
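
A minimal sketch of this head swap with the transformers library; the class names come from the list above and the checkpoint is the melanoma model mentioned earlier, but the exact fine-tuning procedure used for this model is an assumption:

from transformers import ViTForImageClassification

# The 12 target classes listed in this model card.
labels = [
    "actinic keratosis", "basal cell carcinoma", "clear skin",
    "dermatofibroma", "melanoma", "melanoma metastasis", "nevus",
    "random", "seborrheic keratosis", "solar lentigo",
    "squamous cell carcinoma", "vascular lesion",
]

# Load the melanoma checkpoint and swap its 3-class head for a freshly
# initialized 12-class linear classifier.
model = ViTForImageClassification.from_pretrained(
    "UnipaPolitoUnimore/vit-large-patch32-384-melanoma",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
    ignore_mismatched_sizes=True,  # head shapes differ (3 -> 12), so reinitialize
)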

Intended Uses & Limitations

  • Intended Uses:
    • Automated skin lesion classification for research and decision support in dermatology.
  • Limitations:
    • Data bias may persist despite augmentation. Training would benefit from more data, especially for rare classes.
    • Performance may vary across imaging conditions or devices.

Training and Evaluation

  • Training Data:
    • Approximately 70k images assembled from real and high-quality synthetic data, with improved class balance.
  • Training Setup (a configuration sketch follows this list):
    • Optimizer: AdamW (default settings) with learning rate 2e-05.
    • Epochs: 3
    • Batch sizes: 8 (train) / 16 (eval)
  • Evaluation Results: see the tables below.
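
A minimal sketch of the training configuration above, assuming the transformers Trainer API; the output directory and any argument not listed in this card are assumptions:

from transformers import TrainingArguments

# Hypothetical reconstruction of the reported setup; hyperparameter
# values are taken from this card, everything else is assumed.
training_args = TrainingArguments(
    output_dir="vit-skin-lesion-12class",  # assumed output path
    learning_rate=2e-5,                    # AdamW is the Trainer's default optimizer
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
)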

Validation Set Performance

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Actinic Keratosis | 0.74 | 0.77 | 0.76 | 163 |
| Basal Cell Carcinoma | 0.90 | 0.86 | 0.88 | 551 |
| Clear Skin | 1.00 | 1.00 | 1.00 | 13 |
| Dermatofibroma | 0.85 | 0.68 | 0.76 | 25 |
| Melanoma | 0.93 | 0.81 | 0.87 | 600 |
| Melanoma Metastasis | 0.85 | 0.77 | 0.81 | 95 |
| Nevus | 0.84 | 0.96 | 0.90 | 847 |
| Random | 1.00 | 1.00 | 1.00 | 52 |
| Seborrheic Keratosis | 0.74 | 0.77 | 0.75 | 190 |
| Solar Lentigo | 0.71 | 0.64 | 0.68 | 42 |
| Squamous Cell Carcinoma | 0.87 | 0.71 | 0.78 | 84 |
| Vascular Lesion | 0.92 | 0.52 | 0.67 | 23 |
| Accuracy | | | 0.86 | 2685 |
| Macro Avg | 0.86 | 0.79 | 0.82 | 2685 |
| Weighted Avg | 0.86 | 0.86 | 0.86 | 2685 |

Test Set Performance

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Actinic Keratosis | 0.81 | 0.80 | 0.81 | 164 |
| Basal Cell Carcinoma | 0.86 | 0.93 | 0.89 | 552 |
| Dermatofibroma | 0.95 | 0.77 | 0.85 | 26 |
| Melanoma | 0.88 | 0.89 | 0.89 | 601 |
| Melanoma Metastasis | 0.96 | 0.73 | 0.83 | 95 |
| Nevus | 0.91 | 0.93 | 0.92 | 848 |
| Seborrheic Keratosis | 0.81 | 0.76 | 0.79 | 191 |
| Solar Lentigo | 0.82 | 0.63 | 0.71 | 43 |
| Squamous Cell Carcinoma | 0.91 | 0.74 | 0.82 | 84 |
| Vascular Lesion | 1.00 | 0.83 | 0.90 | 23 |
| Accuracy | | | 0.88 | 2627 |
| Macro Avg | 0.89 | 0.80 | 0.84 | 2627 |
| Weighted Avg | 0.88 | 0.88 | 0.88 | 2627 |
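
Both tables follow the layout of scikit-learn's classification_report (per-class precision, recall, and F1 plus accuracy, macro, and weighted averages). A minimal sketch of producing such a report, assuming y_true and y_pred are integer class indices collected by running the model over an evaluation split; the short lists here are placeholders so the snippet runs on its own:

from sklearn.metrics import classification_report

# Hypothetical: integer class indices gathered over an evaluation split.
y_true = [0, 1, 1, 4, 6, 6]  # placeholder values for illustration
y_pred = [0, 1, 4, 4, 6, 6]

print(classification_report(y_true, y_pred, digits=2))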

Confusion Matrix

The confusion matrices for both the validation and test sets have been generated to provide insight into per-class performance. They are available as PNG files:

  • Validation Confusion Matrix
  • Test Confusion Matrix
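
A sketch of how such a matrix can be generated and saved as a PNG with scikit-learn; y_true and y_pred reuse the placeholder values from the metrics sketch above, and the output filename is an assumption:

from sklearn.metrics import ConfusionMatrixDisplay

# Hypothetical: integer class indices from an evaluation split.
y_true = [0, 1, 1, 4, 6, 6]  # placeholder values, as in the report sketch
y_pred = [0, 1, 4, 4, 6, 6]

# Plot the confusion matrix and save it to disk (assumed filename).
disp = ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
disp.figure_.savefig("validation_confusion_matrix.png", bbox_inches="tight")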


How to Use

Example inference code:

from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import torch

# Load model and processor
model_id = "path_or_repo_identifier_for_your_model"
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)
model.eval()

# Load and process image
image_path = "path/to/skin_lesion_image.jpg"
image = Image.open(image_path).convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

# Get prediction results
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=1)[0]
predicted_class_idx = torch.argmax(probabilities).item()
predicted_class = model.config.id2label[predicted_class_idx]
confidence = probabilities[predicted_class_idx].item()

print(f"Predicted class: {predicted_class}")
print(f"Confidence: {confidence:.2%}")

Citation

If you use this model, please cite the original ViT paper:

@misc{dosovitskiy2020image,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and others},
  year={2020},
  eprint={2010.11929},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
