---
license: apache-2.0
language: en
library_name: pytorch
tags:
- image-classification
- facial-expression-recognition
- clip
- fer-2013
datasets:
- msambare/fer2013
metrics:
- accuracy
- f1
- precision
- recall
base_model: openai/clip-vit-large-patch14
---
# CLIP-based Facial Expression Recognition (FER-2013)
This is a fine-tuned version of **`openai/clip-vit-large-patch14`** for the task of Facial Expression Recognition (FER).
This model was trained on the **FER-2013 dataset** and can classify a facial image into one of seven emotions: angry, disgust, fear, happy, neutral, sad, and surprise. It was created using a transfer learning approach where the pre-trained CLIP vision encoder was frozen, and a new linear classification head was trained on top of it to recognize the emotion classes.
## Model Description
- **Base Model:** `openai/clip-vit-large-patch14`
- **Task:** Image Classification (Facial Expression Recognition)
- **Framework:** PyTorch
- **Dataset:** FER-2013
- **Final Accuracy (Test Set):** 72%
## Intended Uses & Limitations
### Intended Uses
This model is intended for academic research and as a baseline for developing more advanced emotion recognition systems. Potential applications include:
- Analyzing sentiment in user-submitted media (e.g., product review videos).
- Content analysis for social science research on emotion portrayal in images.
- A building block for assistive technology applications.
### Limitations and Bias
This model inherits the limitations of its training data, the FER-2013 dataset.
- **Dataset Bias:** The FER-2013 dataset is known to have biases in its representation of age, gender, and race. As a result, the model's performance may be inconsistent across different demographic groups. It is not recommended for use in production systems that affect individuals without thorough bias evaluation and mitigation.
- **Posed vs. Natural Expressions:** The dataset primarily contains posed, front-facing, and often exaggerated expressions. The model will likely perform worse on real-world images that feature subtle, natural, or non-frontal expressions.
- **Ambiguity of Emotion:** Emotion is subjective and context-dependent. A static image cannot capture the full story. The model's predictions are based on learned visual patterns from the dataset and should not be considered an objective measure of a person's true emotional state.
- **Misuse Potential:** This model should **NOT** be used for applications that involve making automated judgments about an individual's character, truthfulness, or employability. It is not suitable for surveillance or any application that could have a significant adverse impact on people's lives.
## How to Use
To use this model, first define its custom classifier architecture and then load the fine-tuned weights from this repository.
### 1. Installation
```bash
pip install transformers torch safetensors Pillow huggingface_hub requests
```
### 2. Prediction Script
This runnable script downloads the model from this repository and predicts the emotion for an example image from the web.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from transformers import AutoProcessor, CLIPVisionModel
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
import requests

# --- Configuration ---
# This is the repository ID for the model on the Hugging Face Hub
REPO_ID = "syntheticbot/clip-face-expression"
FILENAME = "model.safetensors"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# The class names must be in alphabetical order, as used during training
CLASS_NAMES = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
NUM_CLASSES = len(CLASS_NAMES)

# --- Define the Model Architecture ---
# This class must be defined to match the architecture of the saved model
class ClipClassifier(nn.Module):
    def __init__(self, vision_model, num_classes):
        super(ClipClassifier, self).__init__()
        self.vision_model = vision_model
        # The base model's config provides the hidden size for the classifier input
        self.classifier = nn.Linear(vision_model.config.hidden_size, num_classes)

    def forward(self, pixel_values):
        outputs = self.vision_model(pixel_values=pixel_values)
        image_features = outputs.pooler_output
        logits = self.classifier(image_features)
        return logits

# --- Load Model and Processor ---
print("Loading model and processor...")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(DEVICE)

# Instantiate the custom classifier
model = ClipClassifier(vision_model, NUM_CLASSES).to(DEVICE)

# Download the fine-tuned weights from the Hub and load them
print(f"Downloading model from {REPO_ID}...")
model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
state_dict = load_file(model_path, device=DEVICE)
model.load_state_dict(state_dict)
model.eval()

# --- Run Prediction on an Example Image ---
# Example image from the web
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/Carol_Burnett_1958.JPG/250px-Carol_Burnett_1958.JPG"
try:
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
except Exception as e:
    print(f"Could not load image from URL: {e}")
    exit()

print("Processing image and making prediction...")
with torch.no_grad():
    processed_image = processor(images=image, return_tensors="pt")['pixel_values'].to(DEVICE)
    logits = model(processed_image)
    probabilities = F.softmax(logits, dim=1)
    top_prob, top_idx = torch.max(probabilities, 1)

predicted_class = CLASS_NAMES[top_idx.item()]
print(f"\nPredicted Emotion: {predicted_class}")
print(f"Confidence: {top_prob.item() * 100:.2f}%")
```
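For local files or batched inference, the same `processor`, `model`, `DEVICE`, and `CLASS_NAMES` objects from the script above can be reused. The sketch below is illustrative only; the image paths are hypothetical placeholders.
```python
# Minimal batched-inference sketch. Reuses `processor`, `model`, `DEVICE`, and
# `CLASS_NAMES` from the prediction script above; the paths are hypothetical.
import torch
import torch.nn.functional as F
from PIL import Image

image_paths = ["face1.jpg", "face2.jpg"]  # replace with your own image files
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    pixel_values = processor(images=images, return_tensors="pt")["pixel_values"].to(DEVICE)
    logits = model(pixel_values)              # shape: (batch_size, 7)
    probs = F.softmax(logits, dim=1)
    top_probs, top_idxs = probs.max(dim=1)

for path, idx, prob in zip(image_paths, top_idxs, top_probs):
    print(f"{path}: {CLASS_NAMES[idx.item()]} ({prob.item() * 100:.2f}%)")
```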
## Training Procedure
The model was trained with a transfer learning approach: the `openai/clip-vit-large-patch14` vision encoder was used as a frozen feature extractor, and a single linear layer added on top of it was trained as the classification head on the FER-2013 dataset. A minimal sketch of this setup follows the hyperparameters below.
### Hyperparameters
- **Learning Rate:** 1e-3
- **Batch Size:** 128
- **Optimizer:** Adam
- **Loss Function:** Cross-Entropy Loss
- **Number of Epochs:** 10
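The training script itself is not part of this repository. The sketch below illustrates the setup described above (frozen encoder, trainable linear head, Adam at 1e-3, cross-entropy loss, 10 epochs), assuming a hypothetical `train_loader` DataLoader that yields `(pixel_values, labels)` batches of preprocessed FER-2013 images and reusing the `ClipClassifier` class from the "How to Use" section.
```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

# Illustrative sketch only; `train_loader` is an assumed DataLoader yielding
# (pixel_values, labels) batches, and ClipClassifier is defined in "How to Use".
device = "cuda" if torch.cuda.is_available() else "cpu"

vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
for param in vision_model.parameters():
    param.requires_grad = False  # freeze the CLIP vision encoder

model = ClipClassifier(vision_model, num_classes=7).to(device)
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)  # only the head is trained
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for pixel_values, labels in train_loader:
        pixel_values, labels = pixel_values.to(device), labels.to(device)
        logits = model(pixel_values)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```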
## Evaluation Results
The model was evaluated on the FER-2013 test set, which contains 7,178 images.
**Overall Accuracy: 72%**
### Classification Report
| | precision | recall | f1-score | support |
|:------------|:---------:|:------:|:--------:|:-------:|
| **angry** | 0.67 | 0.62 | 0.65 | 958 |
| **disgust** | 0.68 | 0.64 | 0.66 | 111 |
| **fear** | 0.54 | 0.51 | 0.53 | 1024 |
| **happy** | 0.89 | 0.93 | 0.91 | 1774 |
| **neutral** | 0.68 | 0.74 | 0.71 | 1233 |
| **sad** | 0.61 | 0.61 | 0.61 | 1247 |
| **surprise**| 0.83 | 0.76 | 0.79 | 831 |
| | | | | |
| **accuracy**| | | 0.72 | 7178 |
| **macro avg** | 0.70 | 0.69 | 0.69 | 7178 |
| **weighted avg**| 0.72 | 0.72 | 0.72 | 7178 |
### Confusion Matrix

## Citation
### Citing this Model
```bibtex
@misc{syntheticbot_2024_ferclip,
  author = {syntheticbot},
  title = {CLIP-based Facial Expression Recognition (FER-2013)},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/syntheticbot/clip-face-expression}}
}
```
### Citing the Original CLIP Model
```bibtex
@inproceedings{radford2021learning,
  title = {Learning transferable visual models from natural language supervision},
  author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  booktitle = {International conference on machine learning},
  pages = {8748--8763},
  year = {2021},
  organization = {PMLR}
}
```