---
license: apache-2.0
language: en
library_name: transformers
tags:
- clip
- image-classification
- multi-task-classification
- fairface
- vision
- autoeval-has-no-ethical-license
model-index:
- name: clip-face-attribute-classifier
  results:
  - task:
      type: image-classification
      name: image-classification
    dataset:
      name: FairFace
      type: joojs/fairface
      split: validation
    metrics:
    - type: accuracy
      value: 0.9638
      name: Gender Accuracy
    - type: accuracy
      value: 0.7322
      name: Race Accuracy
    - type: accuracy
      value: 0.5917
      name: Age Accuracy
---
# Fine-tuned CLIP Model for Face Attribute Classification

This repository contains **`clip-face-attribute-classifier`**, a fine-tuned version of the **[openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)** model adapted for multi-task classification of perceived age, gender, and race from facial images.

The model was trained on the **[FairFace dataset](https://github.com/joojs/fairface)**, which is designed to be balanced across these demographic categories. This model card details the model's performance, limitations, and intended use to encourage responsible application.

## Model Description

The base model, CLIP (Contrastive Language-Image Pre-Training), learns rich visual representations by matching images to their corresponding text descriptions. This fine-tuned version repurposes CLIP's vision encoder for a multi-task classification objective.

It takes an image as input and outputs three separate predictions (the label-to-index order of each head is shown in the sketch after this list):
* **Age:** 9 categories (0-2, 3-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, more than 70)
* **Gender:** 2 categories (Male, Female)
* **Race:** 7 categories (White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, Latino_Hispanic)
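Note that each head's logit index follows the alphabetically sorted label list (so, for age, '10-19' precedes '3-9'). A minimal orientation sketch of the resulting index-to-label mapping, mirroring the `id_mappings` constructed in the quick-start code below:

```python
# Orientation sketch only: the logit index of each head follows the sorted
# label list, matching the id_mappings built in "How to Get Started" below.
age_labels = ['0-2', '3-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', 'more than 70']
gender_labels = ['Male', 'Female']
race_labels = ['White', 'Black', 'Indian', 'East Asian', 'Southeast Asian', 'Middle Eastern', 'Latino_Hispanic']

for task, labels in [('age', age_labels), ('gender', gender_labels), ('race', race_labels)]:
    print(task, {i: label for i, label in enumerate(sorted(labels))})
# age    -> {0: '0-2', 1: '10-19', 2: '20-29', 3: '3-9', ...}
# gender -> {0: 'Female', 1: 'Male'}
# race   -> {0: 'Black', 1: 'East Asian', 2: 'Indian', 3: 'Latino_Hispanic', ...}
```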
## Intended Uses & Limitations

This model is intended primarily for research and analysis purposes.

### Intended Uses
* **Research on model fairness and bias:** Analyzing the model's performance differences across demographic groups.
* **Providing a public baseline:** Serving as a starting point for researchers aiming to improve performance on these specific classification tasks.
* **Educational purposes:** Demonstrating a multi-task fine-tuning approach on a vision model.

### Out-of-Scope and Prohibited Uses
This model makes predictions about sensitive demographic attributes and carries significant risks if misused. The following uses are explicitly out-of-scope and strongly discouraged:
* **Surveillance, monitoring, or tracking of individuals.**
* **Automated decision-making that impacts an individual's rights or opportunities** (e.g., loan applications, hiring decisions, insurance eligibility).
* **Inferring or assigning an individual's self-identity.** The model's predictions are based on learned visual patterns and do not reflect how a person identifies.
* **Creating or reinforcing harmful social stereotypes.**
## How to Get Started

Because this is not a standard `AutoModel`, you need to define the custom `MultiTaskClipVisionModel` class (shown below) before loading the weights.

```python
import os

import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# --- 0. Define the Custom Model Class ---
# You must define the model architecture to load the weights into it.
class MultiTaskClipVisionModel(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        # Load the vision encoder of the base CLIP model.
        self.vision_model = AutoModel.from_pretrained("openai/clip-vit-large-patch14").vision_model

        hidden_size = self.vision_model.config.hidden_size
        # One linear classification head per task.
        self.age_head = nn.Linear(hidden_size, num_labels['age'])
        self.gender_head = nn.Linear(hidden_size, num_labels['gender'])
        self.race_head = nn.Linear(hidden_size, num_labels['race'])

    def forward(self, pixel_values):
        outputs = self.vision_model(pixel_values=pixel_values)
        pooled_output = outputs.pooler_output
        return {
            'age': self.age_head(pooled_output),
            'gender': self.gender_head(pooled_output),
            'race': self.race_head(pooled_output),
        }

# --- 1. Configuration ---
MODEL_PATH = "syntheticbot/clip-face-attribute-classifier"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# --- 2. Define Label Mappings (must match training) ---
age_labels = ['0-2', '10-19', '20-29', '3-9', '30-39', '40-49', '50-59', '60-69', 'more than 70']
gender_labels = ['Female', 'Male']
race_labels = ['Black', 'East Asian', 'Indian', 'Latino_Hispanic', 'Middle Eastern', 'Southeast Asian', 'White']

# Use sorted lists to create a consistent id -> label mapping.
id_mappings = {
    'age': {i: label for i, label in enumerate(sorted(age_labels))},
    'gender': {i: label for i, label in enumerate(sorted(gender_labels))},
    'race': {i: label for i, label in enumerate(sorted(race_labels))},
}
NUM_LABELS = {'age': len(age_labels), 'gender': len(gender_labels), 'race': len(race_labels)}

# --- 3. Load Model and Processor ---
processor = CLIPImageProcessor.from_pretrained(MODEL_PATH)
model = MultiTaskClipVisionModel(num_labels=NUM_LABELS)

# Load the fine-tuned weights from the Hub.
model.load_state_dict(
    torch.hub.load_state_dict_from_url(
        f"https://huggingface.co/{MODEL_PATH}/resolve/main/pytorch_model.bin",
        map_location=DEVICE,
    )
)
model.to(DEVICE)
model.eval()

# --- 4. Prediction Function ---
def predict(image_path):
    if not os.path.exists(image_path):
        print(f"Error: Image not found at {image_path}")
        return

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        logits = model(pixel_values=inputs['pixel_values'])

    predictions = {}
    for task in ['age', 'gender', 'race']:
        pred_id = torch.argmax(logits[task], dim=-1).item()
        predictions[task] = id_mappings[task][pred_id]

    print(f"Predictions for {image_path}:")
    for task, label in predictions.items():
        print(f"  - {task.capitalize()}: {label}")
    return predictions

# --- 5. Run Prediction ---
# Download a sample image for testing:
# !wget -q https://huggingface.co/syntheticbot/clip-face-attribute-classifier/resolve/main/sample.jpg -O sample.jpg
predict('sample.jpg')  # Replace with the path to your image
```
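The `predict` helper returns only the highest-scoring label for each task. If confidence scores are useful, the logits can be passed through a softmax; a minimal sketch (not part of the original example) reusing the `model`, `processor`, `id_mappings`, and `DEVICE` defined above:

```python
import torch.nn.functional as F

def predict_with_confidence(image_path):
    # Same preprocessing as predict(), but also report a softmax probability.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        logits = model(pixel_values=inputs['pixel_values'])
    results = {}
    for task in ['age', 'gender', 'race']:
        probs = F.softmax(logits[task], dim=-1)[0]
        confidence, pred_id = torch.max(probs, dim=-1)
        results[task] = (id_mappings[task][pred_id.item()], round(confidence.item(), 4))
    return results  # e.g. {'gender': ('Female', 0.99), ...} (illustrative values)
```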
## Training Details

* **Base Model:** [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)
* **Dataset:** [FairFace](https://github.com/joojs/fairface)
* **Training Procedure:** The model was fine-tuned for 5 epochs. The vision encoder was largely frozen, with only its final 3 transformer layers unfrozen. A separate linear classification head was added for each task (age, gender, race), and the total loss was the sum of the cross-entropy losses of the three tasks (see the sketch after this list).
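A minimal sketch of that setup (not the original training script), assuming `model` is the `MultiTaskClipVisionModel` from the quick-start section, the standard `transformers` CLIP module layout (`vision_model.encoder.layers`), and batches that provide pixel values plus integer labels for the three tasks:

```python
import torch.nn as nn

# Freeze the CLIP vision encoder except for its final 3 transformer layers.
for param in model.vision_model.parameters():
    param.requires_grad = False
for layer in model.vision_model.encoder.layers[-3:]:
    for param in layer.parameters():
        param.requires_grad = True

criterion = nn.CrossEntropyLoss()

def training_step(batch):
    # Total loss = sum of the per-task cross-entropy losses.
    logits = model(pixel_values=batch['pixel_values'])
    return sum(criterion(logits[task], batch[task]) for task in ['age', 'gender', 'race'])
```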
## Evaluation

The model was evaluated on the FairFace validation split, which contains 10,954 images.
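The per-task reports below are standard classification reports. A minimal sketch of how they could be reproduced with scikit-learn, assuming hypothetical dicts `y_true` and `y_pred` that map each task to the ground-truth and predicted label lists collected over the validation split (e.g., via the `predict` helper above):

```python
from sklearn.metrics import classification_report

def print_reports(y_true, y_pred):
    # y_true / y_pred: dicts mapping task name -> list of string labels.
    for task in ['gender', 'race', 'age']:
        print(f"--- {task} ---")
        print(classification_report(y_true[task], y_pred[task], digits=2))
```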
### Performance Metrics

The following reports detail the model's performance on each task.

#### **Gender Classification (Overall Accuracy: 96.38%)**
```
              precision    recall  f1-score   support

      Female       0.96      0.96      0.96      5162
        Male       0.96      0.97      0.97      5792

    accuracy                           0.96     10954
   macro avg       0.96      0.96      0.96     10954
weighted avg       0.96      0.96      0.96     10954
```

#### **Race Classification (Overall Accuracy: 73.22%)**
```
                 precision    recall  f1-score   support

          Black       0.90      0.89      0.89      1556
     East Asian       0.74      0.78      0.76      1550
         Indian       0.81      0.75      0.78      1516
Latino_Hispanic       0.58      0.62      0.60      1623
 Middle Eastern       0.69      0.57      0.62      1209
Southeast Asian       0.66      0.65      0.65      1415
          White       0.75      0.80      0.77      2085

       accuracy                           0.73     10954
      macro avg       0.73      0.72      0.73     10954
   weighted avg       0.73      0.73      0.73     10954
```

#### **Age Classification (Overall Accuracy: 59.17%)**
```
              precision    recall  f1-score   support

         0-2       0.93      0.45      0.60       199
       10-19       0.62      0.41      0.50      1181
       20-29       0.64      0.76      0.70      3300
         3-9       0.77      0.88      0.82      1356
       30-39       0.49      0.50      0.49      2330
       40-49       0.46      0.44      0.45      1353
       50-59       0.47      0.40      0.43       796
       60-69       0.45      0.32      0.38       321
more than 70       0.75      0.10      0.18       118

    accuracy                           0.59     10954
   macro avg       0.62      0.47      0.51     10954
weighted avg       0.59      0.59      0.58     10954
```
## Bias, Risks, and Limitations

* **Perceptual vs. Identity:** The model predicts perceived attributes based on visual data. These predictions are not a determination of an individual's true self-identity.
* **Performance Disparities:** The evaluation clearly shows that performance is not uniform across all categories. The model is significantly less accurate for certain racial groups (e.g., Latino_Hispanic, Middle Eastern) and older age groups. Using this model in any application will perpetuate these biases.
* **Data Representation:** While trained on FairFace, a balanced dataset, the model may still reflect societal biases present in the original pre-training data of CLIP.
* **Risk of Misclassification:** Any misclassification, particularly of sensitive attributes, can have negative social consequences. The model's moderate accuracy in age and race prediction makes this a significant risk.

### Citation

**Original CLIP Model:**
```bibtex
@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={International Conference on Machine Learning},
  year={2021}
}
```

**FairFace Dataset:**
```bibtex
@inproceedings{karkkainenfairface,
  title={FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age},
  author={Karkkainen, Kimmo and Joo, Jungseock},
  booktitle={IEEE Winter Conference on Applications of Computer Vision (WACV)},
  pages={1548--1558},
  year={2021}
}
```