# Data Processing for slimface Training 🖼️
## Table of Contents
- [Data Processing for slimface Training 🖼️](#data-processing-for-slimface-training-)
- [Command-Line Arguments](#command-line-arguments)
- [Command-Line Arguments for `process_dataset.py`](#command-line-arguments-for-process_datasetpy)
- [Example Usage](#example-usage)
- [Step-by-step process for handling a dataset](#step-by-step-process-for-handling-a-dataset)
- [Step 1: Clone the Repository](#step-1-clone-the-repository)
- [Step 2: Process the Dataset](#step-2-process-the-dataset)
- [Option 1: Using Dataset from Kaggle](#option-1-using-dataset-from-kaggle)
- [Option 2: Using a Custom Dataset](#option-2-using-a-custom-dataset)
## Command-Line Arguments
### Command-Line Arguments for `process_dataset.py`
When running `python scripts/process_dataset.py`, you can customize the dataset processing with the following command-line arguments:
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--dataset_slug` | `str` | `vasukipatel/face-recognition-dataset` | The Kaggle dataset slug in `username/dataset-name` format. Specifies which dataset to download from Kaggle. |
| `--base_dir` | `str` | `./data` | The base directory where the dataset will be stored and processed. |
| `--augment` | `flag` | `False` | Enables data augmentation (e.g., flipping, rotation) for training images to increase dataset variety. Use `--augment` to enable. |
| `--random_state` | `int` | `42` | Random seed for reproducibility in the train-test split. Ensures consistent splitting across runs. |
| `--test_split_rate` | `float` | `0.2` | Proportion of data to use for validation (between 0 and 1). For example, `0.2` means 20% of the data is used for validation. |
| `--rotation_range` | `int` | `15` | Maximum rotation angle in degrees for data augmentation (if `--augment` is enabled). Images may be rotated randomly within this range. |
| `--source_subdir` | `str` | `Original Images/Original Images` | Subdirectory within the raw directory (`data/raw` by default) containing the images to process. Used for both Kaggle and custom datasets. |
| `--delete_raw` | `flag` | `False` | Deletes the raw folder after processing to save storage. Use `--delete_raw` to enable. |
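The table above maps directly onto a standard `argparse` setup. The following is a minimal sketch of what such a parser could look like — argument names and defaults are taken from the table, but the actual implementation in `process_dataset.py` may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser matching the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Process a face dataset.")
    parser.add_argument("--dataset_slug", type=str,
                        default="vasukipatel/face-recognition-dataset",
                        help="Kaggle dataset slug (username/dataset-name).")
    parser.add_argument("--base_dir", type=str, default="./data",
                        help="Base directory for the dataset.")
    parser.add_argument("--augment", action="store_true",
                        help="Enable data augmentation.")
    parser.add_argument("--random_state", type=int, default=42,
                        help="Seed for the train-test split.")
    parser.add_argument("--test_split_rate", type=float, default=0.2,
                        help="Fraction of data used for validation.")
    parser.add_argument("--rotation_range", type=int, default=15,
                        help="Max rotation angle in degrees when augmenting.")
    parser.add_argument("--source_subdir", type=str,
                        default="Original Images/Original Images",
                        help="Subdirectory of the raw folder holding images.")
    parser.add_argument("--delete_raw", action="store_true",
                        help="Delete the raw folder after processing.")
    return parser

args = build_parser().parse_args(["--augment", "--test_split_rate", "0.3"])
print(args.augment, args.test_split_rate, args.random_state)  # True 0.3 42
```

Flags like `--augment` and `--delete_raw` are boolean switches (`action="store_true"`), which is why they default to `False` and take no value.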
### Example Usage
To process a Kaggle dataset with augmentation and a custom validation split:
```bash
python scripts/process_dataset.py \
--augment \
--test_split_rate 0.3 \
--rotation_range 15
```
To process a **custom dataset** with a specific subdirectory and delete the raw folder:
```bash
python scripts/process_dataset.py \
--source_subdir your_custom_dataset_dir \
--delete_raw
```
These options allow flexible dataset processing tailored to your needs.
## Step-by-step process for handling a dataset
### Step 1: Clone the Repository
Ensure the `slimface` project is set up by cloning the repository and navigating to the project directory:
```bash
git clone https://github.com/danhtran2mind/slimface/
cd slimface
```
### Step 2: Process the Dataset
#### Option 1: Using Dataset from Kaggle
To download and process the sample dataset from Kaggle, run:
```bash
python scripts/process_dataset.py
```
This script organizes the dataset into the following structure under `data/`:
```
data/
├── processed_ds/
│   ├── train_data/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ├── Charlize Theron_46.jpg
│   │   │   └── ...
│   │   └── Dwayne Johnson/
│   │       ├── Dwayne Johnson_58.jpg
│   │       ├── Dwayne Johnson_9.jpg
│   │       └── ...
│   └── val_data/
│       ├── Charlize Theron/
│       │   ├── Charlize Theron_60.jpg
│       │   ├── Charlize Theron_45.jpg
│       │   └── ...
│       └── Dwayne Johnson/
│           ├── Dwayne Johnson_11.jpg
│           ├── Dwayne Johnson_46.jpg
│           └── ...
├── raw/
│   ├── Faces/
│   │   ├── Jessica Alba_90.jpg
│   │   ├── Hugh Jackman_70.jpg
│   │   └── ...
│   ├── Original Images/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   └── ...
│   │   └── Dwayne Johnson/
│   │       ├── Dwayne Johnson_11.jpg
│   │       ├── Dwayne Johnson_58.jpg
│   │       └── ...
│   ├── dataset.zip
│   └── Dataset.csv
└── .gitignore
```
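The split into `train_data/` and `val_data/` can be pictured as a per-identity shuffle-and-slice driven by the `--random_state` and `--test_split_rate` defaults. The helper below is an illustrative sketch of that idea, not code taken from `process_dataset.py`:

```python
import random

def split_identity(filenames, test_split_rate=0.2, random_state=42):
    """Shuffle one identity's images and carve off a validation slice.

    Illustrative sketch: the real script may split differently.
    """
    rng = random.Random(random_state)   # fixed seed -> reproducible split
    files = sorted(filenames)           # stable order before shuffling
    rng.shuffle(files)
    n_val = max(1, int(len(files) * test_split_rate))  # at least 1 val image
    return files[n_val:], files[:n_val]  # (train, val)

images = [f"Charlize Theron_{i}.jpg" for i in range(10)]
train, val = split_identity(images)
print(len(train), len(val))  # 8 2
```

Because the seed is fixed (`42` by default), re-running the script yields the same train/validation assignment, which is what the `--random_state` argument guarantees.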
#### Option 2: Using a Custom Dataset
If you prefer to use your own dataset, place it in `./data/raw/your_custom_dataset_dir/` with the following structure:
```
data/
└── raw/
    └── your_custom_dataset_dir/
        ├── Charlize Theron/
        │   ├── Charlize Theron_60.jpg
        │   ├── Charlize Theron_70.jpg
        │   └── ...
        └── Dwayne Johnson/
            ├── Dwayne Johnson_11.jpg
            ├── Dwayne Johnson_58.jpg
            └── ...
```
Your images do not need to contain only human faces, because **we support face extraction using face detection**; all extracted faces are saved to `data/processed_ds`.
Then, process your custom dataset by specifying the subdirectory:
```bash
python scripts/process_dataset.py \
--source_subdir your_custom_dataset_dir
```
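Before running the script, you can sanity-check that your custom directory follows the one-folder-per-identity layout shown above. This small standard-library sketch is a convenience check of our own, not part of `process_dataset.py`:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def check_layout(dataset_dir):
    """Return {identity: image_count}; raise if the layout looks wrong."""
    root = Path(dataset_dir)
    counts = {}
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            raise ValueError(f"Unexpected file at top level: {entry.name}")
        images = [p for p in entry.iterdir()
                  if p.suffix.lower() in IMAGE_EXTS]
        if not images:
            raise ValueError(f"No images found for identity: {entry.name}")
        counts[entry.name] = len(images)
    return counts
```

For example, `check_layout("data/raw/your_custom_dataset_dir")` would return a mapping such as `{"Charlize Theron": 70, "Dwayne Johnson": 60}`, and raises an error if a stray file sits at the top level or an identity folder is empty.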
This ensures your dataset is properly formatted for training.