# Video Captioning
Human labeling of videos is expensive and time-consuming, so we adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves better performance, its speed of 20 s per sample is too slow for us. LLaVA is the second-best open-source model on [MMMU](https://mmmu-benchmark.github.io/) and accepts inputs of any resolution, and we find the quality of its 34B variant comparable to GPT-4V.

## LLaVA Captioning
We extract three frames from each video for captioning. With batch inference, we achieve a 10x speedup. At approximately 720p resolution with 3 frames, the speed is 2~3 videos/s on 8 GPUs. If we resize the smaller side of each frame to 336, the speed reaches 8 videos/s.
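The sampling and resizing can be sketched as follows (a minimal illustration using `decord` and Pillow, which the requirements below install; evenly spaced sampling is an assumption here, and the real pipeline lives in `tools/caption/caption_llava.py`):
```python
# A rough sketch of the frame sampling described above (assumed logic;
# see tools/caption/caption_llava.py for the actual implementation).
import numpy as np
from decord import VideoReader
from PIL import Image

def extract_frames(video_path: str, num_frames: int = 3, short_side: int = 336):
    vr = VideoReader(video_path)
    # pick evenly spaced frame indices across the whole video
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3), uint8
    images = []
    for frame in frames:
        img = Image.fromarray(frame)
        # resize so the smaller side becomes `short_side`, keeping aspect ratio
        scale = short_side / min(img.size)
        img = img.resize((round(img.width * scale), round(img.height * scale)))
        images.append(img)
    return images
```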
### Requirements
```bash
# create conda env
conda create -n llava python=3.10 -y
conda activate llava
# install torch
pip install torch torchvision
# clone llava
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
# CAUTION: the following line deletes the pinned torch dependency in pyproject.toml, i.e.:
# "torch==2.1.2", "torchvision==0.16.2",
# It is safer to remove it manually in your local pyproject.toml
sed -i '16d' pyproject.toml
# install llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
# install flash attention
pip install flash-attn --no-build-isolation
# install colossalai and decord
pip install colossalai decord
```
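Optionally, verify the environment with a quick import check (a minimal sketch):
```python
# optional sanity check that the key packages import and CUDA is visible
import torch
import decord          # video decoding
import colossalai      # tensor parallelism
import flash_attn      # flash attention kernels

print(torch.__version__, torch.cuda.is_available())
```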
Since only the 34B model's performance is comparable to GPT-4V, we focus on the usage of the 34B model. It is available [here](https://huggingface.co/liuhaotian/llava-v1.6-34b); alternatively, run our script and it will be downloaded automatically.
### Usage
Prepare a csv file for processing. The csv file can be generated by `convert_dataset.py` according to its [documentation](/tools/datasets/README.md).
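If you assemble the csv by hand instead, a minimal sketch might look like this (the single `path` column is an assumption; consult the linked documentation for the authoritative schema):
```python
# hypothetical example: build a minimal input csv by hand
# (the `path` column is an assumption; see tools/datasets/README.md)
import pandas as pd

pd.DataFrame(
    {"path": ["/data/videos/0001.mp4", "/data/videos/0002.mp4"]}
).to_csv("DATA.csv", index=False)
```
Then, run one of the following commands to generate captions for videos/images with LLaVA: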
```bash
# we run this on 8xH800 GPUs
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 4 --bs 16
# at least two 80G GPUs are required
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16
# can also caption images
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16 --prompt image-3ex
# caption with llava-34B
# NOTE: remember to enable flash attention for this model
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 4 --tp-size 2 --model-path liuhaotian/llava-v1.6-34b --prompt image-3ex --flash-attention
# caption with mistral-7B (batch size can be increased to 48)
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video
```
In every command, `--nproc_per_node` must equal `--tp-size` multiplied by `--dp-size`. Note that you should add the `--flash-attention` flag when running Llama-based LLaVA models, as it provides a speedup, but turn it off for Mistral-based ones. The reasons can be found in [this issue](https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453).
After running the script with `--dp-size N`, you will get `N` csv parts. Run the following command to merge them:
```bash
python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv
```
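Conceptually, the merge is a simple concatenation; a rough pandas equivalent is sketched below (illustration only, `datautil` is the supported tool):
```python
# rough pandas equivalent of the merge step (sketch; use datautil in practice)
import glob
import pandas as pd

parts = [pd.read_csv(p) for p in sorted(glob.glob("DATA_caption_part*.csv"))]
pd.concat(parts, ignore_index=True).to_csv("DATA_caption.csv", index=False)
```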
### Resume
Sometimes the process may be interrupted. You can resume it by running the following commands:
```bash
# merge generated results
python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv
# get the remaining videos
python -m tools.datasets.datautil DATA.csv --difference DATA_caption.csv --output DATA_remaining.csv
```
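Conceptually, the difference step keeps only the rows that have not been captioned yet; a rough pandas equivalent is sketched below (an illustration, assuming both files share a `path` column):
```python
# rough pandas equivalent of the --difference step
# (the shared `path` column is an assumption; use datautil in practice)
import pandas as pd

data = pd.read_csv("DATA.csv")
done = pd.read_csv("DATA_caption.csv")
data[~data["path"].isin(done["path"])].to_csv("DATA_remaining.csv", index=False)
```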
Then rerun the captioning command with `DATA_remaining.csv` as the input to resume the process.
## GPT-4V Captioning
Run the following command to generate captions for videos with GPT-4V:
```bash
# output: DATA_caption.csv
python -m tools.caption.caption_gpt4 DATA.csv --key $OPENAI_API_KEY
```
The cost is approximately $0.01 per video (3 frames per video).
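Under the hood, the script sends the sampled frames to the OpenAI API. A rough sketch of such a call follows (the model name, prompt, and `caption` helper are illustrative assumptions; see `tools/caption/caption_gpt4.py` for the real implementation):
```python
# illustrative sketch of a GPT-4V captioning request
# (model name and prompt are assumptions; see tools/caption/caption_gpt4.py)
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption(frame_paths, prompt="Describe this video in one paragraph."):
    content = [{"type": "text", "text": prompt}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append(
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        )
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=300,
    )
    return resp.choices[0].message.content
```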