# Dataset Management

- [Dataset Management](#dataset-management)
  - [Dataset Format](#dataset-format)
  - [Dataset to CSV](#dataset-to-csv)
  - [Manage datasets](#manage-datasets)
    - [Requirement](#requirement)
    - [Basic Usage](#basic-usage)
    - [Score filtering](#score-filtering)
    - [Documentation](#documentation)
  - [Transform datasets](#transform-datasets)
    - [Resize](#resize)
    - [Frame extraction](#frame-extraction)
    - [Crop Midjourney 4 grid](#crop-midjourney-4-grid)
  - [Analyze datasets](#analyze-datasets)
  - [Data Process Pipeline](#data-process-pipeline)

After preparing the raw dataset according to the [instructions](/docs/datasets.md), you can use the following commands to manage the dataset.

## Dataset Format

All datasets should be provided in a `.csv` file (or `parquet.gzip` to save space), which is used for both training and data preprocessing. The columns should use the names below:

- `path`: the relative/absolute path or url to the image or video file. Required.
- `text`: the caption or description of the image or video. Required for training.
- `num_frames`: the number of frames in the video. Required for training.
- `width`: the width of the video frame. Required for dynamic bucket.
- `height`: the height of the video frame. Required for dynamic bucket.
- `aspect_ratio`: the aspect ratio of the video frame (height / width). Required for dynamic bucket.
- `resolution`: height x width. For analysis.
- `text_len`: the number of tokens in the text. For analysis.
- `aes`: aesthetic score calculated by the [aesthetic scorer](/tools/aesthetic/README.md). For filtering.
- `flow`: optical flow score calculated by [UniMatch](/tools/scoring/README.md). For filtering.
- `match`: matching score of an image-text/video-text pair calculated by [CLIP](/tools/scoring/README.md). For filtering.
- `fps`: the frame rate of the video. Optional.
- `cmotion`: the camera motion.

An example ready for training:

```csv
path, text, num_frames, width, height, aspect_ratio
/absolute/path/to/image1.jpg, caption, 1, 720, 1280, 0.5625
/absolute/path/to/video1.mp4, caption, 120, 720, 1280, 0.5625
/absolute/path/to/video2.mp4, caption, 20, 256, 256, 1
```

We use pandas to manage the `.csv` or `.parquet` files. The following code is for reading and writing files:

```python
import pandas as pd

df = pd.read_csv(input_path)
df.to_csv(output_path, index=False)
# or use parquet, which is smaller
df = pd.read_parquet(input_path)
df.to_parquet(output_path, index=False)
```
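The dynamic-bucket columns can be derived from `height` and `width` directly. A minimal pandas sketch (the tiny in-memory DataFrame is a hypothetical example; in practice you would read your own file):

```python
import pandas as pd

# hypothetical example rows with the required columns
df = pd.DataFrame({
    "path": ["video1.mp4", "image1.jpg"],
    "height": [720, 720],
    "width": [1280, 1280],
})

# aspect_ratio is defined above as height / width; resolution as height x width
df["aspect_ratio"] = df["height"] / df["width"]
df["resolution"] = df["height"] * df["width"]
```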

## Dataset to CSV

As a starting point, `convert.py` is used to convert a raw dataset into a CSV file. You can use the following commands:

```bash
python -m tools.datasets.convert DATASET-TYPE DATA_FOLDER

# general video folder
python -m tools.datasets.convert video VIDEO_FOLDER --output video.csv
# general image folder
python -m tools.datasets.convert image IMAGE_FOLDER --output image.csv
# imagenet
python -m tools.datasets.convert imagenet IMAGENET_FOLDER --split train
# ucf101
python -m tools.datasets.convert ucf101 UCF101_FOLDER --split videos
# vidprom
python -m tools.datasets.convert vidprom VIDPROM_FOLDER --info VidProM_semantic_unique.csv
```
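Conceptually, converting a general media folder amounts to collecting file paths into a one-column CSV. The sketch below illustrates that idea under an assumed extension list (`folder_to_csv` and `VIDEO_EXTS` are illustrative names, not the actual `convert.py` internals):

```python
from pathlib import Path

import pandas as pd

VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}  # assumed extension list


def folder_to_csv(folder: str, output: str) -> pd.DataFrame:
    # collect every video file under the folder, recursively
    paths = sorted(
        str(p) for p in Path(folder).rglob("*") if p.suffix.lower() in VIDEO_EXTS
    )
    df = pd.DataFrame({"path": paths})
    df.to_csv(output, index=False)
    return df
```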

## Manage datasets

Use `datautil` to manage the dataset.

### Requirement

To accelerate processing speed, you can install [pandarallel](https://github.com/nalepae/pandarallel):

```bash
pip install pandarallel
```

To get image and video information, you need to install [opencv-python](https://github.com/opencv/opencv-python):

```bash
pip install opencv-python
# If your videos are in av1 codec instead of h264, you need to
# - install ffmpeg first
# - install via conda to support av1 codec
conda install -c conda-forge opencv
```

Alternatively, to get video information, you can install ffmpeg and [ffmpeg-python](https://github.com/kkroening/ffmpeg-python):

```bash
pip install ffmpeg-python
```

To filter a specific language, you need to install [lingua](https://github.com/pemistahl/lingua-py):

```bash
pip install lingua-language-detector
```

### Basic Usage

You can use the following commands to process the `csv` or `parquet` files. The output file will be saved in the same directory as the input, with a suffix indicating the processing applied.

```bash
# datautil takes multiple CSV files as input and merges them into one CSV file
# output: DATA1+DATA2.csv
python -m tools.datasets.datautil DATA1.csv DATA2.csv

# shard CSV files into multiple CSV files
# output: DATA1_0.csv, DATA1_1.csv, ...
python -m tools.datasets.datautil DATA1.csv --shard 10

# keep samples whose frame count is between 128 and 256
# output: DATA_fmin_128_fmax_256.csv
python -m tools.datasets.datautil DATA.csv --fmin 128 --fmax 256

# Disable parallel processing
python -m tools.datasets.datautil DATA.csv --fmin 128 --fmax 256 --disable-parallel

# Compute num_frames, height, width, fps, aspect_ratio for videos or images
# output: IMG_DATA+VID_DATA_vinfo.csv
python -m tools.datasets.datautil IMG_DATA.csv VID_DATA.csv --video-info

# You can run multiple operations at the same time.
python -m tools.datasets.datautil DATA.csv --video-info --remove-empty-caption --remove-url --lang en
```
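Merging and sharding are conceptually simple pandas operations. A hedged sketch of their semantics (not the actual `datautil` implementation):

```python
import numpy as np
import pandas as pd


def merge(dfs):
    # concatenate multiple datasets into one, renumbering the index
    return pd.concat(dfs, ignore_index=True)


def shard(df, n):
    # split a dataset into n roughly equal parts
    return np.array_split(df, n)
```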

### Score filtering

To examine and filter the quality of the dataset by aesthetic score and clip score, you can use the following commands:

```bash
# sort the dataset by aesthetic score
# output: DATA_sort.csv
python -m tools.datasets.datautil DATA.csv --sort aesthetic_score
# View examples of high aesthetic score
head -n 10 DATA_sort.csv
# View examples of low aesthetic score
tail -n 10 DATA_sort.csv

# sort the dataset by clip score
# output: DATA_sort.csv
python -m tools.datasets.datautil DATA.csv --sort clip_score

# filter the dataset by aesthetic score
# output: DATA_aesmin_0.5.csv
python -m tools.datasets.datautil DATA.csv --aesmin 0.5
# filter the dataset by clip score
# output: DATA_matchmin_0.5.csv
python -m tools.datasets.datautil DATA.csv --matchmin 0.5
```
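Under the hood, sorting and threshold filtering are one-liners on the score columns of the dataset format above. A minimal pandas sketch (the example rows and the 5.0 threshold are illustrative):

```python
import pandas as pd

# hypothetical dataset with aesthetic scores in the `aes` column
df = pd.DataFrame({"path": ["a.mp4", "b.mp4"], "aes": [6.1, 3.2]})

# sort by aesthetic score, best first (what --sort does conceptually)
ranked = df.sort_values("aes", ascending=False)

# keep only rows at or above a minimum score (what --aesmin does conceptually)
filtered = df[df["aes"] >= 5.0]
```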

### Documentation

You can also use `python -m tools.datasets.datautil --help` to see usage.

| Args                        | File suffix    | Description                                                   |
| --------------------------- | -------------- | ------------------------------------------------------------- |
| `--output OUTPUT`           |                | Output path                                                   |
| `--format FORMAT`           |                | Output format (csv, parquet, parquet.gzip)                    |
| `--disable-parallel`        |                | Disable `pandarallel`                                         |
| `--seed SEED`               |                | Random seed                                                   |
| `--shard SHARD`             | `_0`,`_1`, ... | Shard the dataset                                             |
| `--sort KEY`                | `_sort`        | Sort the dataset by KEY                                       |
| `--sort-descending KEY`     | `_sort`        | Sort the dataset by KEY in descending order                   |
| `--difference DATA.csv`     |                | Remove the paths in DATA.csv from the dataset                 |
| `--intersection DATA.csv`   |                | Keep the paths in DATA.csv from the dataset and merge columns |
| `--info`                    | `_info`        | Get the basic information of each video and image (cv2)       |
| `--ext`                     | `_ext`         | Remove rows if the file does not exist                        |
| `--relpath`                 | `_relpath`     | Convert `path` to be relative to a given root                 |
| `--abspath`                 | `_abspath`     | Convert `path` to be absolute, based on a given root          |
| `--remove-empty-caption`    | `_noempty`     | Remove rows with empty caption                                |
| `--remove-url`              | `_nourl`       | Remove rows with url in caption                               |
| `--lang LANG`               | `_lang`        | Remove rows whose caption is not in language LANG             |
| `--remove-path-duplication` | `_noduppath`   | Remove rows with duplicated path                              |
| `--remove-text-duplication` | `_noduptext`   | Remove rows with duplicated caption                           |
| `--refine-llm-caption`      | `_llm`         | Modify the caption generated by LLM                           |
| `--clean-caption MODEL`     | `_clean`       | Modify the caption according to T5 pipeline to suit training  |
| `--unescape`                | `_unescape`    | Unescape the caption                                          |
| `--merge-cmotion`           | `_cmotion`     | Merge the camera motion to the caption                        |
| `--count-num-token`         | `_ntoken`      | Count the number of tokens in the caption                     |
| `--load-caption EXT`        | `_load`        | Load the caption from the file                                |
| `--fmin FMIN`               | `_fmin`        | Filter the dataset by minimum number of frames                |
| `--fmax FMAX`               | `_fmax`        | Filter the dataset by maximum number of frames                |
| `--hwmax HWMAX`             | `_hwmax`       | Filter the dataset by maximum height x width                  |
| `--aesmin AESMIN`           | `_aesmin`      | Filter the dataset by minimum aesthetic score                 |
| `--matchmin MATCHMIN`       | `_matchmin`    | Filter the dataset by minimum clip score                      |
| `--flowmin FLOWMIN`         | `_flowmin`     | Filter the dataset by minimum optical flow score              |

## Transform datasets

The `tools.datasets.transform` module provides a set of tools to transform the dataset. The general usage is as follows:

```bash
python -m tools.datasets.transform TRANSFORM_TYPE META.csv ORIGINAL_DATA_FOLDER DATA_FOLDER_TO_SAVE_RESULTS --additional-args
```

### Resize

Sometimes you may need to resize the images or videos to a specific resolution. You can use the following commands to resize the dataset:

```bash
# TRANSFORM_TYPE assumed to be `resize` here, following the general usage above
python -m tools.datasets.transform resize meta.csv /path/to/raw/data /path/to/new/data --length 2160
```

### Frame extraction

To extract frames from videos, you can use the following commands:

```bash
python -m tools.datasets.transform vid_frame_extract meta.csv /path/to/raw/data /path/to/new/data --points 0.1 0.5 0.9
```
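Assuming `--points` gives relative positions in `[0, 1]` along the video, the sampled frame indices would be computed along these lines (a sketch, not the actual implementation):

```python
def frame_indices(num_frames: int, points: list[float]) -> list[int]:
    # map each relative position in [0, 1] to a concrete frame index
    return [round(p * (num_frames - 1)) for p in points]
```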

### Crop Midjourney 4 grid

Randomly select one of the four images in the 2x2 grid generated by Midjourney.

```bash
python -m tools.datasets.transform img_rand_crop meta.csv /path/to/raw/data /path/to/new/data
```
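Picking a random quadrant reduces to computing one of four crop boxes over the grid image. A sketch, assuming a standard `(left, upper, right, lower)` box convention such as PIL's `Image.crop` (`quadrant_box` is an illustrative name):

```python
import random


def quadrant_box(width: int, height: int, rng: random.Random = random):
    # split the image into a 2x2 grid and pick one quadrant at random
    half_w, half_h = width // 2, height // 2
    col, row = rng.randrange(2), rng.randrange(2)
    # (left, upper, right, lower) crop box
    return (col * half_w, row * half_h, (col + 1) * half_w, (row + 1) * half_h)
```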

## Analyze datasets

You can easily get basic information about a `.csv` dataset by using the following commands:

```bash
# examine the first 10 rows of the CSV file
head -n 10 DATA1.csv
# count the number of rows in the CSV file (approximate, since captions may contain newlines)
wc -l DATA1.csv
```

For the dataset provided in a `.csv` or `.parquet` file, you can easily analyze the dataset using the following commands. Plots will be automatically saved.

```bash
python -m tools.datasets.analyze DATA_info.csv
```
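If you prefer to stay inside pandas, the same kind of summary can be sketched directly, assuming the info columns described above are present (the example rows are hypothetical):

```python
import pandas as pd

# hypothetical dataset with the info columns computed earlier
df = pd.DataFrame({
    "num_frames": [1, 120, 20],
    "height": [1280, 1280, 256],
    "width": [720, 720, 256],
})

# per-column summary statistics (count, mean, min, max, quartiles)
stats = df.describe()

# distribution of a single column, e.g. how many samples share each frame count
frame_counts = df["num_frames"].value_counts()
```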

## Data Process Pipeline

```bash
# Suppose videos and images are under ~/dataset/
# 1. Convert dataset to CSV
python -m tools.datasets.convert video ~/dataset --output meta.csv

# 2. Get video information
python -m tools.datasets.datautil meta.csv --info --fmin 1

# 3. Get caption
# 3.1. generate caption
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava meta_info_fmin1.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video
# merge generated results
python -m tools.datasets.datautil meta_info_fmin1_caption_part*.csv --output meta_caption.csv
# merge caption and info
python -m tools.datasets.datautil meta_info_fmin1.csv --intersection meta_caption.csv --output meta_caption_info.csv
# clean caption
python -m tools.datasets.datautil meta_caption_info.csv --clean-caption --refine-llm-caption --remove-empty-caption --output meta_caption_processed.csv
# 3.2. extract caption
python -m tools.datasets.datautil meta_info_fmin1.csv --load-caption json --remove-empty-caption --clean-caption

# 4. Scoring
# aesthetic scoring
torchrun --standalone --nproc_per_node 8 -m tools.scoring.aesthetic.inference meta_caption_processed.csv
python -m tools.datasets.datautil meta_caption_processed_part*.csv --output meta_caption_processed_aes.csv
# optical flow scoring
torchrun --standalone --nproc_per_node 8 -m tools.scoring.optical_flow.inference meta_caption_processed.csv
# matching scoring
torchrun --standalone --nproc_per_node 8 -m tools.scoring.matching.inference meta_caption_processed.csv
# camera motion
python -m tools.caption.camera_motion_detect meta_caption_processed.csv
```