# Data Processing for slimface Training 🖼️
## Table of Contents
- [Data Processing for slimface Training 🖼️](#data-processing-for-slimface-training-)
  - [Command-Line Arguments](#command-line-arguments)
    - [Command-Line Arguments for `process_dataset.py`](#command-line-arguments-for-process_datasetpy)
    - [Example Usage](#example-usage)
  - [Step-by-step process for handling a dataset](#step-by-step-process-for-handling-a-dataset)
    - [Step 1: Clone the Repository](#step-1-clone-the-repository)
    - [Step 2: Process the Dataset](#step-2-process-the-dataset)
      - [Option 1: Using Dataset from Kaggle](#option-1-using-dataset-from-kaggle)
      - [Option 2: Using a Custom Dataset](#option-2-using-a-custom-dataset)
## Command-Line Arguments
### Command-Line Arguments for `process_dataset.py`
When running `python scripts/process_dataset.py`, you can customize the dataset processing with the following command-line arguments:
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--dataset_slug` | `str` | `vasukipatel/face-recognition-dataset` | The Kaggle dataset slug in `username/dataset-name` format. Specifies which dataset to download from Kaggle. |
| `--base_dir` | `str` | `./data` | The base directory where the dataset will be stored and processed. |
| `--augment` | `flag` | `False` | Enables data augmentation (e.g., flipping, rotation) for training images to increase dataset variety. Use `--augment` to enable. |
| `--random_state` | `int` | `42` | Random seed for reproducibility in the train-test split. Ensures consistent splitting across runs. |
| `--test_split_rate` | `float` | `0.2` | Proportion of data to use for validation (between 0 and 1). For example, `0.2` means 20% of the data is used for validation. |
| `--rotation_range` | `int` | `15` | Maximum rotation angle in degrees for data augmentation (if `--augment` is enabled). Images may be rotated randomly within this range. |
| `--source_subdir` | `str` | `Original Images/Original Images` | Subdirectory within the raw directory (`raw/` under `--base_dir`) containing the images to process. Used for both Kaggle and custom datasets. |
| `--delete_raw` | `flag` | `False` | Deletes the raw folder after processing to save storage. Use `--delete_raw` to enable. |
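
For readers who want to see how the table maps onto code, here is a minimal sketch of an `argparse` parser with the same names, types, and defaults. It is an illustration only: the function name `build_parser` and the help strings are assumptions, and the actual `scripts/process_dataset.py` may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments (not the real script)."""
    parser = argparse.ArgumentParser(description="Illustrative dataset-processing CLI.")
    parser.add_argument("--dataset_slug", type=str,
                        default="vasukipatel/face-recognition-dataset",
                        help="Kaggle dataset slug in username/dataset-name format.")
    parser.add_argument("--base_dir", type=str, default="./data",
                        help="Base directory where the dataset is stored and processed.")
    # Flags default to False and become True only when passed on the command line.
    parser.add_argument("--augment", action="store_true",
                        help="Enable data augmentation for training images.")
    parser.add_argument("--random_state", type=int, default=42,
                        help="Random seed for the train/validation split.")
    parser.add_argument("--test_split_rate", type=float, default=0.2,
                        help="Proportion of data used for validation (0 to 1).")
    parser.add_argument("--rotation_range", type=int, default=15,
                        help="Maximum rotation angle in degrees for augmentation.")
    parser.add_argument("--source_subdir", type=str,
                        default="Original Images/Original Images",
                        help="Subdirectory of the raw directory that holds the images.")
    parser.add_argument("--delete_raw", action="store_true",
                        help="Delete the raw folder after processing.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--augment", "--test_split_rate", "0.3"])
print(args.augment, args.test_split_rate, args.delete_raw)
```

Because `--augment` and `--delete_raw` are flags, they take no value: passing the flag turns the option on, and omitting it leaves the option at `False`.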
### Example Usage
To process a Kaggle dataset with augmentation and a custom validation split:
```bash
python scripts/process_dataset.py \
--augment \
--test_split_rate 0.3 \
--rotation_range 15
```
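
To make the `--rotation_range` semantics more concrete, the sketch below samples augmentation parameters for a few images. It assumes the angle is drawn uniformly from `[-rotation_range, +rotation_range]` and that flipping happens with a fixed probability; both are assumptions for illustration, and `sample_augmentation` and `flip_probability` are hypothetical names, not part of the script.

```python
import random


def sample_augmentation(rng, rotation_range=15, flip_probability=0.5):
    """Draw one set of augmentation parameters for a single image."""
    # Assumption: the range is symmetric around zero, so 15 means -15..+15 degrees.
    angle = rng.uniform(-rotation_range, rotation_range)
    # Assumption: horizontal flips are applied with a fixed probability.
    flip = rng.random() < flip_probability
    return {"angle": angle, "horizontal_flip": flip}


# A seeded generator keeps the sampled parameters reproducible across runs.
rng = random.Random(42)
plans = [sample_augmentation(rng, rotation_range=15) for _ in range(5)]
for plan in plans:
    print(f"rotate {plan['angle']:+.1f} deg, flip={plan['horizontal_flip']}")
```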
To process a **custom dataset** with a specific subdirectory and delete the raw folder:
```bash
python scripts/process_dataset.py \
--source_subdir your_custom_dataset_dir \
--delete_raw
```
## Step-by-step process for handling a dataset
The options above let you tailor dataset processing to your needs; the steps below walk through a complete run. 🚀
### Step 1: Clone the Repository
Ensure the `slimface` project is set up by cloning the repository and navigating to the project directory:
```bash
git clone https://github.com/danhtran2mind/slimface/
cd slimface
```
### Step 2: Process the Dataset
#### Option 1: Using Dataset from Kaggle
To download and process the sample dataset from Kaggle, run:
```bash
python scripts/process_dataset.py
```
This script organizes the dataset into the following structure under `data/`:
```markdown
data/
├── processed_ds/
│   ├── train_data/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ├── Charlize Theron_46.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ├── Dwayne Johnson_9.jpg
│   │   │   ...
│   └── val_data/
│       ├── Charlize Theron/
│       │   ├── Charlize Theron_60.jpg
│       │   ├── Charlize Theron_45.jpg
│       │   ...
│       ├── Dwayne Johnson/
│       │   ├── Dwayne Johnson_11.jpg
│       │   ├── Dwayne Johnson_46.jpg
│       │   ...
├── raw/
│   ├── Faces/
│   │   ├── Jessica Alba_90.jpg
│   │   ├── Hugh Jackman_70.jpg
│   │   ...
│   ├── Original Images/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_11.jpg
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ...
│   ├── dataset.zip
│   └── Dataset.csv
└── .gitignore
```
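
The `train_data/` and `val_data/` folders come from a per-person split controlled by `--test_split_rate` and `--random_state`. The sketch below shows one way such a split could be produced with the standard library alone. It is a simplified illustration under stated assumptions: it copies files as they are and skips the face-detection step, the helper names `split_class_files` and `build_processed_tree` are hypothetical, and the real script may use a different splitting routine.

```python
import random
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def split_class_files(files, test_split_rate=0.2, random_state=42):
    """Shuffle one person's files with a fixed seed and return (train, val)."""
    files = sorted(files)  # sort first so the seeded shuffle is reproducible
    random.Random(random_state).shuffle(files)
    n_val = int(round(len(files) * test_split_rate))
    return files[n_val:], files[:n_val]


def build_processed_tree(source_dir, processed_dir, test_split_rate=0.2, random_state=42):
    """Copy source_dir/<person>/* into processed_dir/train_data and val_data."""
    source_dir, processed_dir = Path(source_dir), Path(processed_dir)
    for class_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        images = [p for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES]
        train, val = split_class_files(images, test_split_rate, random_state)
        for subset, paths in (("train_data", train), ("val_data", val)):
            target = processed_dir / subset / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for path in paths:
                shutil.copy2(path, target / path.name)
```

With `test_split_rate=0.2`, a person folder holding 10 images would contribute 8 files to `train_data/` and 2 to `val_data/` in this sketch, and rerunning with the same `random_state` selects the same files.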
#### Option 2: Using a Custom Dataset
If you prefer to use your own dataset, place it in `./data/raw/your_custom_dataset_dir/` with the following structure:
```markdown
data/
├── raw/
│   ├── your_custom_dataset_dir/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_11.jpg
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ...
```
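
Before processing, it can help to confirm that a folder follows this one-subfolder-per-person layout. The sketch below counts images per person and lists folders that look too small to split. It is not part of the repository: `summarize_dataset`, `find_small_classes`, the accepted file suffixes, and the `min_images` threshold are all assumptions for illustration.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarize_dataset(source_dir):
    """Return {person_name: image_count} for a one-folder-per-person layout."""
    source_dir = Path(source_dir)
    if not source_dir.is_dir():
        raise FileNotFoundError(f"{source_dir} is not a directory")
    counts = {}
    for class_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for p in class_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def find_small_classes(counts, min_images=2):
    """List people with too few images to appear in both train and validation sets."""
    return [name for name, n in counts.items() if n < min_images]


if __name__ == "__main__":
    root = Path("data/raw/your_custom_dataset_dir")
    if root.is_dir():  # only report when the example folder actually exists
        counts = summarize_dataset(root)
        print(f"{len(counts)} people, {sum(counts.values())} images")
        print("too few images:", find_small_classes(counts))
```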
Images in a custom dataset do not need to be cropped to faces beforehand: **the script extracts faces using face detection**, and all extracted faces are saved under `data/processed_ds`.
Then, process your custom dataset by specifying the subdirectory:
```bash
python scripts/process_dataset.py \
--source_subdir your_custom_dataset_dir
```
This ensures your dataset is properly formatted for training. 🚀