Data Processing for slimface Training 🖼️

Table of Contents

Command-Line Arguments

Command-Line Arguments for process_dataset.py

When running python scripts/process_dataset.py, you can customize the dataset processing with the following command-line arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| `--dataset_slug` | str | `vasukipatel/face-recognition-dataset` | The Kaggle dataset slug in username/dataset-name format. Specifies which dataset to download from Kaggle. |
| `--base_dir` | str | `./data` | The base directory where the dataset will be stored and processed. |
| `--augment` | flag | False | Enables data augmentation (e.g., flipping, rotation) for training images to increase dataset variety. Use `--augment` to enable. |
| `--random_state` | int | 42 | Random seed for reproducibility in the train-test split. Ensures consistent splitting across runs. |
| `--test_split_rate` | float | 0.2 | Proportion of data to use for validation (between 0 and 1). For example, 0.2 means 20% of the data is used for validation. |
| `--rotation_range` | int | 15 | Maximum rotation angle in degrees for data augmentation (if `--augment` is enabled). Images may be rotated randomly within this range. |
| `--source_subdir` | str | `Original Images/Original Images` | Subdirectory within raw_dir containing the images to process. Used for both Kaggle and custom datasets. |
| `--delete_raw` | flag | False | Deletes the raw folder after processing to save storage. Use `--delete_raw` to enable. |
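
To make the table concrete, here is a minimal sketch of how these arguments could be declared with Python's `argparse`. The names and defaults are copied from the table; the function name `build_parser` and the overall layout are assumptions for illustration, and the actual `scripts/process_dataset.py` may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented arguments (illustrative, not the project's code)."""
    parser = argparse.ArgumentParser(description="Process a face dataset.")
    parser.add_argument("--dataset_slug", type=str,
                        default="vasukipatel/face-recognition-dataset")
    parser.add_argument("--base_dir", type=str, default="./data")
    # Flags: absent means False, present means True.
    parser.add_argument("--augment", action="store_true")
    parser.add_argument("--random_state", type=int, default=42)
    parser.add_argument("--test_split_rate", type=float, default=0.2)
    parser.add_argument("--rotation_range", type=int, default=15)
    parser.add_argument("--source_subdir", type=str,
                        default="Original Images/Original Images")
    parser.add_argument("--delete_raw", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--augment", "--test_split_rate", "0.3"])
print(args.augment, args.delete_raw, args.test_split_rate)
```

Declaring `--augment` and `--delete_raw` with `action="store_true"` matches the "flag" type in the table: they take no value and default to False.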

Example Usage

To process a Kaggle dataset with augmentation and a custom validation split:

python scripts/process_dataset.py \
    --augment \
    --test_split_rate 0.3 \
    --rotation_range 15
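
As a rough picture of what `--augment` and `--rotation_range` control, the sketch below samples a random rotation angle and a flip decision, then applies them with Pillow. It assumes the angle is drawn uniformly from `[-rotation_range, +rotation_range]` and that flipping is horizontal with probability 0.5; both are assumptions, and the helper names `sample_augmentation` and `apply_augmentation` are hypothetical rather than taken from the project.

```python
import random


def sample_augmentation(rotation_range: int, rng: random.Random):
    """Pick an angle in [-rotation_range, rotation_range] and a flip flag."""
    angle = rng.uniform(-rotation_range, rotation_range)
    flip = rng.random() < 0.5  # assumed 50% chance of a horizontal flip
    return angle, flip


def apply_augmentation(image, angle: float, flip: bool):
    """Apply the sampled parameters to a Pillow image (requires Pillow)."""
    from PIL import ImageOps  # imported lazily so sampling works without Pillow

    if flip:
        image = ImageOps.mirror(image)  # left-right flip
    return image.rotate(angle)  # counter-clockwise rotation, in degrees


# A seeded generator makes the sampled parameters repeatable across runs.
rng = random.Random(42)
angle, flip = sample_augmentation(15, rng)
print(f"angle={angle:.1f} degrees, flip={flip}")
```

With Pillow installed, `apply_augmentation(Image.open(path), angle, flip)` would return the transformed copy of an image.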

To process a custom dataset with a specific subdirectory and delete the raw folder:

python scripts/process_dataset.py \
    --source_subdir your_custom_dataset_dir \
    --delete_raw

These options allow flexible dataset processing tailored to your needs. 🚀

Step-by-step process for handling a dataset

Step 1: Clone the Repository

Ensure the slimface project is set up by cloning the repository and navigating to the project directory:

git clone https://github.com/danhtran2mind/slimface/
cd slimface

Step 2: Process the Dataset

Option 1: Using Dataset from Kaggle

To download and process the sample dataset from Kaggle, run:

python scripts/process_dataset.py

This script organizes the dataset into the following structure under data/:

data/
├── processed_ds/
│   ├── train_data/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ├── Charlize Theron_46.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ├── Dwayne Johnson_9.jpg
│   │   │   ...
│   └── val_data/
│       ├── Charlize Theron/
│       │   ├── Charlize Theron_60.jpg
│       │   ├── Charlize Theron_45.jpg
│       │   ...
│       ├── Dwayne Johnson/
│       │   ├── Dwayne Johnson_11.jpg
│       │   ├── Dwayne Johnson_46.jpg
│       │   ...
├── raw/
│   ├── Faces/
│   │   ├── Jessica Alba_90.jpg
│   │   ├── Hugh Jackman_70.jpg
│   │   ...
│   ├── Original Images/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_11.jpg
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ...
│   ├── dataset.zip
│   └── Dataset.csv
└── .gitignore
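
The `train_data/` and `val_data/` folders above reflect a seeded train/validation split. The following is a minimal sketch of one way to get such a split, assuming each identity's files are shuffled and divided independently; the actual script may split differently (for example, over the whole dataset at once), and `split_per_class` is a hypothetical helper name.

```python
import random


def split_per_class(files_by_class, test_split_rate=0.2, random_state=42):
    """Split each identity's file list into train and validation parts."""
    rng = random.Random(random_state)  # fixed seed gives the same split each run
    train, val = {}, {}
    for name in sorted(files_by_class):
        files = sorted(files_by_class[name])  # sort first so input order is irrelevant
        rng.shuffle(files)
        n_val = round(len(files) * test_split_rate)
        val[name] = files[:n_val]
        train[name] = files[n_val:]
    return train, val


# Toy file lists standing in for the contents of data/raw/.
files = {
    "Charlize Theron": [f"Charlize Theron_{i}.jpg" for i in range(10)],
    "Dwayne Johnson": [f"Dwayne Johnson_{i}.jpg" for i in range(5)],
}
train, val = split_per_class(files, test_split_rate=0.2, random_state=42)
for name in files:
    print(name, len(train[name]), "train /", len(val[name]), "val")
```

Because the generator is seeded from `random_state`, rerunning with the same inputs reproduces the same assignment of files to each folder.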

Option 2: Using a Custom Dataset

If you prefer to use your own dataset, place it in ./data/raw/your_custom_dataset_dir/ with the following structure:

data/
├── raw/
│   ├── your_custom_dataset_dir/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_11.jpg
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ...

The images in your own dataset do not need to contain only faces: the script runs face detection to extract the faces, and all extracted faces are saved under data/processed_ds.
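
Before running the script, it can help to confirm that the custom directory follows the one-folder-per-identity layout shown above. This is a small optional check, not part of the project; the helper name `count_images_per_class` and the set of accepted extensions are assumptions.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of accepted formats


def count_images_per_class(dataset_dir):
    """Return {identity_folder_name: image_count} for a per-identity layout."""
    counts = {}
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


# Only report if the directory exists, so the sketch is safe to run anywhere.
dataset_dir = Path("data/raw/your_custom_dataset_dir")
if dataset_dir.is_dir():
    for name, n in count_images_per_class(dataset_dir).items():
        print(f"{name}: {n} images")
```

An identity folder that reports zero images usually points to a misplaced directory or an unexpected file extension.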

Then, process your custom dataset by specifying the subdirectory:

python scripts/process_dataset.py \
    --source_subdir your_custom_dataset_dir

This ensures your dataset is properly formatted for training. 🚀