Data Processing for slimface Training
Table of Contents
Command-Line Arguments
Command-Line Arguments for process_dataset.py
When running python scripts/process_dataset.py, you can customize the dataset processing with the following command-line arguments:
| Argument | Type | Default | Description |
|---|---|---|---|
| --dataset_slug | str | vasukipatel/face-recognition-dataset | The Kaggle dataset slug in username/dataset-name format. Specifies which dataset to download from Kaggle. |
| --base_dir | str | ./data | The base directory where the dataset will be stored and processed. |
| --augment | flag | False | Enables data augmentation (e.g., flipping, rotation) for training images to increase dataset variety. Use --augment to enable. |
| --random_state | int | 42 | Random seed for reproducibility in the train-test split. Ensures consistent splitting across runs. |
| --test_split_rate | float | 0.2 | Proportion of data to use for validation (between 0 and 1). For example, 0.2 means 20% of the data is used for validation. |
| --rotation_range | int | 15 | Maximum rotation angle in degrees for data augmentation (if --augment is enabled). Images may be rotated randomly within this range. |
| --source_subdir | str | Original Images/Original Images | Subdirectory within raw_dir containing the images to process. Used for both Kaggle and custom datasets. |
| --delete_raw | flag | False | Deletes the raw folder after processing to save storage. Use --delete_raw to enable. |
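For readers who want to see how the arguments above fit together, here is a minimal argparse sketch that mirrors the documented names, types, and defaults. It is an illustration only: the function name build_parser and the help strings are hypothetical, and the real parser in scripts/process_dataset.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        description="Process a face dataset for slimface training."
    )
    parser.add_argument("--dataset_slug", type=str,
                        default="vasukipatel/face-recognition-dataset",
                        help="Kaggle dataset slug in username/dataset-name format.")
    parser.add_argument("--base_dir", type=str, default="./data",
                        help="Base directory for stored and processed data.")
    # Flags default to False and become True only when passed.
    parser.add_argument("--augment", action="store_true",
                        help="Enable data augmentation for training images.")
    parser.add_argument("--random_state", type=int, default=42,
                        help="Random seed for the train-test split.")
    parser.add_argument("--test_split_rate", type=float, default=0.2,
                        help="Proportion of data used for validation (0 to 1).")
    parser.add_argument("--rotation_range", type=int, default=15,
                        help="Maximum rotation angle in degrees for augmentation.")
    parser.add_argument("--source_subdir", type=str,
                        default="Original Images/Original Images",
                        help="Subdirectory of the raw folder holding the images.")
    parser.add_argument("--delete_raw", action="store_true",
                        help="Delete the raw folder after processing.")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--augment", "--test_split_rate", "0.3"])
print(vars(args))
```

Arguments that are not passed keep the defaults listed in the table, so only the values you want to change need to appear on the command line.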
Example Usage
To process a Kaggle dataset with augmentation and a custom validation split:
python scripts/process_dataset.py \
--augment \
--test_split_rate 0.3 \
--rotation_range 15
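The --augment and --rotation_range flags in the command above control random augmentation. As a rough sketch of what "rotated randomly within this range" could mean, the snippet below samples one set of augmentation parameters per image. It assumes the range is symmetric around zero and that flipping happens with probability 0.5; neither detail is confirmed by this document, and sample_augmentation is a hypothetical helper, not a function from the script.

```python
import random


def sample_augmentation(rotation_range=15, rng=None):
    """Sample hypothetical augmentation parameters for a single image.

    Assumptions (not confirmed by the slimface script): the angle is drawn
    uniformly from [-rotation_range, +rotation_range] and a horizontal flip
    is applied half of the time.
    """
    if rng is None:
        rng = random.Random()
    return {
        "angle": rng.uniform(-rotation_range, rotation_range),
        "flip": rng.random() < 0.5,
    }


# A seeded generator makes the sampled parameters repeatable.
params = sample_augmentation(rotation_range=15, rng=random.Random(42))
print(params)
```

With an imaging library such as Pillow, parameters like these could then be applied with Image.rotate and ImageOps.mirror; whether the script uses Pillow is not stated here.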
To process a custom dataset with a specific subdirectory and delete the raw folder:
python scripts/process_dataset.py \
--source_subdir your_custom_dataset_dir \
--delete_raw
These options allow flexible dataset processing tailored to your needs.
Step-by-step process for handling a dataset
Step 1: Clone the Repository
Ensure the slimface project is set up by cloning the repository and navigating to the project directory:
git clone https://github.com/danhtran2mind/slimface/
cd slimface
Step 2: Process the Dataset
Option 1: Using Dataset from Kaggle
To download and process the sample dataset from Kaggle, run:
python scripts/process_dataset.py
This script organizes the dataset into the following structure under data/:
data/
├── processed_ds/
│   ├── train_data/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ├── Charlize Theron_46.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ├── Dwayne Johnson_9.jpg
│   │   │   ...
│   └── val_data/
│       ├── Charlize Theron/
│       │   ├── Charlize Theron_60.jpg
│       │   ├── Charlize Theron_45.jpg
│       │   ...
│       ├── Dwayne Johnson/
│       │   ├── Dwayne Johnson_11.jpg
│       │   ├── Dwayne Johnson_46.jpg
│       │   ...
├── raw/
│   ├── Faces/
│   │   ├── Jessica Alba_90.jpg
│   │   ├── Hugh Jackman_70.jpg
│   │   ...
│   ├── Original Images/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_11.jpg
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ...
│   ├── dataset.zip
│   └── Dataset.csv
└── .gitignore
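The structure above shows every identity appearing in both train_data/ and val_data/. One way such a split could be produced is to shuffle each class separately with a fixed seed, as sketched below using only the standard library. This is an assumption about the approach, not the script's actual code: split_per_class is a hypothetical helper, and the real implementation may split differently.

```python
import random


def split_per_class(files_by_class, test_split_rate=0.2, random_state=42):
    """Split each class's file list into train and validation subsets.

    Hypothetical helper: shuffles every class with one seeded generator so
    the same inputs and seed always give the same split.
    """
    if not 0 < test_split_rate < 1:
        raise ValueError("test_split_rate must be between 0 and 1")
    rng = random.Random(random_state)
    train, val = {}, {}
    for label in sorted(files_by_class):
        # Sort first so the result does not depend on input ordering.
        files = sorted(files_by_class[label])
        rng.shuffle(files)
        n_val = round(len(files) * test_split_rate)
        val[label] = files[:n_val]
        train[label] = files[n_val:]
    return train, val


# Illustrative file names following the naming pattern shown in the tree.
files = {
    "Charlize Theron": [f"Charlize Theron_{i}.jpg" for i in range(10)],
    "Dwayne Johnson": [f"Dwayne Johnson_{i}.jpg" for i in range(5)],
}
train, val = split_per_class(files, test_split_rate=0.2, random_state=42)
print({name: (len(train[name]), len(val[name])) for name in files})
```

Splitting per class keeps each identity represented in both subsets, which matters when some identities have only a few images.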
Option 2: Using a Custom Dataset
If you prefer to use your own dataset, place it in ./data/raw/your_custom_dataset_dir/ with the following structure:
data/
├── raw/
│   ├── your_custom_dataset_dir/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_11.jpg
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ...
If you use your own dataset, the images do not need to contain only faces: the script extracts faces using face detection and saves all extracted faces under data/processed_ds.
Then, process your custom dataset by specifying the subdirectory:
python scripts/process_dataset.py \
--source_subdir your_custom_dataset_dir
This ensures your dataset is properly formatted for training.
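Before processing a custom dataset, it can help to confirm that it follows the one-folder-per-person layout described above. The sketch below counts images per class folder using only the standard library. The helper name summarize_dataset and the set of accepted file extensions are assumptions for illustration; the script itself may accept other formats.

```python
from pathlib import Path

# Assumed image extensions; the actual script may accept a different set.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def summarize_dataset(source_dir):
    """Return {class_name: image_count} for a one-folder-per-class layout."""
    root = Path(source_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {root}")
    summary = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Count only regular files with a recognized image extension.
        summary[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return summary


# Example call (the path is the placeholder used in this guide):
# print(summarize_dataset("./data/raw/your_custom_dataset_dir"))
```

A class folder that reports zero images usually points to a nesting problem, such as an extra directory level between the dataset root and the per-person folders.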