# Data Processing for slimface Training πŸ–ΌοΈ

## Table of Contents

- [Data Processing for slimface Training πŸ–ΌοΈ](#data-processing-for-slimface-training-)
  - [Command-Line Arguments](#command-line-arguments)
    - [Command-Line Arguments for `process_dataset.py`](#command-line-arguments-for-process_datasetpy)
    - [Example Usage](#example-usage)
  - [Step-by-step process for handling a dataset](#step-by-step-process-for-handling-a-dataset)
    - [Step 1: Clone the Repository](#step-1-clone-the-repository)
    - [Step 2: Process the Dataset](#step-2-process-the-dataset)
      - [Option 1: Using Dataset from Kaggle](#option-1-using-dataset-from-kaggle)
      - [Option 2: Using a Custom Dataset](#option-2-using-a-custom-dataset)

## Command-Line Arguments
### Command-Line Arguments for `process_dataset.py`

When running `python scripts/process_dataset.py`, you can customize the dataset processing with the following command-line arguments:

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--dataset_slug` | `str` | `vasukipatel/face-recognition-dataset` | The Kaggle dataset slug in `username/dataset-name` format. Specifies which dataset to download from Kaggle. |
| `--base_dir` | `str` | `./data` | The base directory where the dataset will be stored and processed. |
| `--augment` | `flag` | `False` | Enables data augmentation (e.g., flipping, rotation) for training images to increase dataset variety. Use `--augment` to enable. |
| `--random_state` | `int` | `42` | Random seed for reproducibility in the train-test split. Ensures consistent splitting across runs. |
| `--test_split_rate` | `float` | `0.2` | Proportion of data to use for validation (between 0 and 1). For example, `0.2` means 20% of the data is used for validation. |
| `--rotation_range` | `int` | `15` | Maximum rotation angle in degrees for data augmentation (if `--augment` is enabled). Images may be rotated randomly within this range. |
| `--source_subdir` | `str` | `Original Images/Original Images` | Subdirectory within `raw_dir` containing the images to process. Used for both Kaggle and custom datasets. |
| `--delete_raw` | `flag` | `False` | Deletes the raw folder after processing to save storage. Use `--delete_raw` to enable. |

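The split driven by `--test_split_rate` and `--random_state` can be pictured with a minimal per-class sketch using scikit-learn. This is illustrative only, assuming the script splits each identity's images independently; it is not the repository's actual implementation:

```python
from pathlib import Path

from sklearn.model_selection import train_test_split

def split_class_images(class_dir: Path, test_split_rate: float = 0.2,
                       random_state: int = 42):
    """Illustrative per-identity split mirroring --test_split_rate
    and --random_state (an assumption about the script's internals)."""
    images = sorted(class_dir.glob("*.jpg"))
    train_imgs, val_imgs = train_test_split(
        images,
        test_size=test_split_rate,  # e.g. 0.2 -> 20% of images go to val_data
        random_state=random_state,  # fixed seed -> reproducible split
    )
    return train_imgs, val_imgs
```

With the seed fixed, repeated runs produce the same train/validation assignment, which is why the default `--random_state 42` gives consistent splits across runs.
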
### Example Usage
To process a Kaggle dataset with augmentation and a custom validation split:

```bash
python scripts/process_dataset.py \
    --augment \
    --test_split_rate 0.3 \
    --rotation_range 15
```
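
When `--augment` is enabled, flipping and bounded rotation are the kinds of transforms applied. Here is a minimal sketch with Pillow, assuming a random horizontal flip plus a rotation drawn uniformly from within `--rotation_range` degrees (the script's exact pipeline may differ):

```python
import random

from PIL import Image

def augment(img: Image.Image, rotation_range: int = 15) -> Image.Image:
    """Illustrative augmentation: random horizontal flip plus a random
    rotation within +/- rotation_range degrees (cf. --rotation_range)."""
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    angle = random.uniform(-rotation_range, rotation_range)
    return img.rotate(angle)
```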

To process a **custom dataset** with a specific subdirectory and delete the raw folder:

```bash
python scripts/process_dataset.py \
    --source_subdir your_custom_dataset_dir \
    --delete_raw
```
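
Under the hood, `--delete_raw` presumably just removes `data/raw/` once `processed_ds/` has been written; a rough Python equivalent (an assumption, not the script's code):

```python
import shutil
from pathlib import Path

raw_dir = Path("data/raw")
if raw_dir.exists():
    shutil.rmtree(raw_dir)  # reclaim the storage used by the raw download
```
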
These options allow flexible dataset processing tailored to your needs. πŸš€

## Step-by-step process for handling a dataset

### Step 1: Clone the Repository
Ensure the `slimface` project is set up by cloning the repository and navigating to the project directory:

```bash
git clone https://github.com/danhtran2mind/slimface/
cd slimface
```

### Step 2: Process the Dataset

#### Option 1: Using Dataset from Kaggle
To download and process the default sample dataset from Kaggle (`vasukipatel/face-recognition-dataset`), run:

```bash
python scripts/process_dataset.py
```

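The download step is likely equivalent to the following use of the official `kaggle` package (an assumption about the script's internals; valid credentials in `~/.kaggle/kaggle.json` are required):

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json
api.dataset_download_files(
    "vasukipatel/face-recognition-dataset",  # matches --dataset_slug
    path="data/raw",
    unzip=True,
)
```
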
This script organizes the dataset into the following structure under `data/`:

```markdown
data/
β”œβ”€β”€ processed_ds/
β”‚   β”œβ”€β”€ train_data/
β”‚   β”‚   β”œβ”€β”€ Charlize Theron/
β”‚   β”‚   β”‚   β”œβ”€β”€ Charlize Theron_70.jpg
β”‚   β”‚   β”‚   β”œβ”€β”€ Charlize Theron_46.jpg
β”‚   β”‚   β”‚   ...
β”‚   β”‚   β”œβ”€β”€ Dwayne Johnson/
β”‚   β”‚   β”‚   β”œβ”€β”€ Dwayne Johnson_58.jpg
β”‚   β”‚   β”‚   β”œβ”€β”€ Dwayne Johnson_9.jpg
β”‚   β”‚   β”‚   ...
β”‚   └── val_data/
β”‚       β”œβ”€β”€ Charlize Theron/
β”‚       β”‚   β”œβ”€β”€ Charlize Theron_60.jpg
β”‚       β”‚   β”œβ”€β”€ Charlize Theron_45.jpg
β”‚       β”‚   ...
β”‚       β”œβ”€β”€ Dwayne Johnson/
β”‚       β”‚   β”œβ”€β”€ Dwayne Johnson_11.jpg
β”‚       β”‚   β”œβ”€β”€ Dwayne Johnson_46.jpg
β”‚       β”‚   ...
β”œβ”€β”€ raw/
β”‚   β”œβ”€β”€ Faces/
β”‚   β”‚   β”œβ”€β”€ Jessica Alba_90.jpg
β”‚   β”‚   β”œβ”€β”€ Hugh Jackman_70.jpg
β”‚   β”‚   ...
β”‚   β”œβ”€β”€ Original Images/
β”‚   β”‚   β”œβ”€β”€ Charlize Theron/
β”‚   β”‚   β”‚   β”œβ”€β”€ Charlize Theron_60.jpg
β”‚   β”‚   β”‚   β”œβ”€β”€ Charlize Theron_70.jpg
β”‚   β”‚   β”‚   ...
β”‚   β”‚   β”œβ”€β”€ Dwayne Johnson/
β”‚   β”‚   β”‚   β”œβ”€β”€ Dwayne Johnson_11.jpg
β”‚   β”‚   β”‚   β”œβ”€β”€ Dwayne Johnson_58.jpg
β”‚   β”‚   β”‚   ...
β”‚   β”œβ”€β”€ dataset.zip
β”‚   └── Dataset.csv
└── .gitignore
```
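
A quick sanity check after processing is to count the images per identity in each split (a hypothetical helper, not part of the repository):

```python
from pathlib import Path

for split in ("train_data", "val_data"):
    split_dir = Path("data/processed_ds") / split
    for class_dir in sorted(split_dir.iterdir()):
        if class_dir.is_dir():
            n_images = sum(1 for _ in class_dir.glob("*.jpg"))
            print(f"{split}/{class_dir.name}: {n_images} images")
```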

#### Option 2: Using a Custom Dataset
If you prefer to use your own dataset, place it in `./data/raw/your_custom_dataset_dir/` with the following structure:

```markdown
data/
β”œβ”€β”€ raw/
β”‚   β”œβ”€β”€ your_custom_dataset_dir/
β”‚   β”‚   β”œβ”€β”€ Charlize Theron/
β”‚   β”‚   β”‚   β”œβ”€β”€ Charlize Theron_60.jpg
β”‚   β”‚   β”‚   β”œβ”€β”€ Charlize Theron_70.jpg
β”‚   β”‚   β”‚   ...
β”‚   β”‚   β”œβ”€β”€ Dwayne Johnson/
β”‚   β”‚   β”‚   β”œβ”€β”€ Dwayne Johnson_11.jpg
β”‚   β”‚   β”‚   β”œβ”€β”€ Dwayne Johnson_58.jpg
β”‚   β”‚   β”‚   ...
```

Your custom dataset does not need to contain only pre-cropped human faces: **the pipeline supports face extraction using face detection**, and all extracted faces are saved to `data/processed_ds`.
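
As an illustration of that extraction step (the repository may use a different detection model), a face can be located and cropped with OpenCV's bundled Haar cascade:

```python
import cv2

# OpenCV's bundled frontal-face Haar cascade (illustrative detector only;
# the actual script may rely on a different face-detection model).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

img = cv2.imread("path/to/photo.jpg")  # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for i, (x, y, w, h) in enumerate(faces):
    crop = img[y:y + h, x:x + w]  # crop the detected face region
    cv2.imwrite(f"face_{i}.jpg", crop)
```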

Then, process your custom dataset by specifying the subdirectory:

```bash
python scripts/process_dataset.py \
    --source_subdir your_custom_dataset_dir
```

This ensures your dataset is properly formatted for training. πŸš€
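
Because `train_data/` and `val_data/` use one folder per identity, the processed dataset follows the standard class-per-folder layout and can, for example, be loaded with torchvision's `ImageFolder` (a sketch; the `224` input size is an assumption, so adjust it to your model):

```python
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # assumed input size; match your model
    transforms.ToTensor(),
])

train_ds = datasets.ImageFolder("data/processed_ds/train_data", transform=transform)
val_ds = datasets.ImageFolder("data/processed_ds/val_data", transform=transform)
print(train_ds.classes)  # identity names inferred from folder names
```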