# WORK IN PROGRESS

Spreadsheet (WIP) of supported models and their features: https://docs.google.com/spreadsheets/d/16eA6mSL8XkTcu9fSWkPSHfRIqyAKJbR1O99xnuGdCKY/edit?usp=sharing

## Update 9
Added preliminary support for [Go-with-the-Flow](https://github.com/VGenAI-Netflix-Eyeline-Research/Go-with-the-Flow)

This uses LoRA weights available here: https://huggingface.co/Eyeline-Research/Go-with-the-Flow/tree/main

To create the input videos for the NoiseWarp process, I've added a node to KJNodes that works alongside my SplineEditor and uses either [comfyui-inpaint-nodes](https://github.com/Acly/comfyui-inpaint-nodes) or plain cv2 inpainting to create the cut-and-drag input videos (a minimal cv2 sketch below).
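
For the plain-cv2 route, the per-frame fill is essentially one `cv2.inpaint` call; this is only a minimal sketch with hypothetical file names, not the node's actual code:

```python
import cv2

# Hypothetical file names; in practice these come from the masked frames of the source video.
frame = cv2.imread("frame_0001.png")                      # BGR, uint8
mask = cv2.imread("mask_0001.png", cv2.IMREAD_GRAYSCALE)  # 255 where the subject was cut out

# Fill the cut-out region from its surroundings (radius 3, Telea algorithm).
filled = cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)
cv2.imwrite("frame_0001_inpainted.png", filled)
```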

The workflows are in the example_workflows folder.



A quick showcase video: first mask the subject, then use the cut-and-drag workflow to create a video as seen here, and finally use that video as input to the NoiseWarp node in the main workflow.



https://github.com/user-attachments/assets/112706b0-a38b-4c3c-b779-deba0827af4f



## BREAKING Update8



This is a big one, and unfortunately the cleanup and refactoring needed here will break every old workflow as it is.

I apologize for the inconvenience; if I didn't do this now I would keep making things worse until maintenance became too much of a chore, so from my point of view there was no choice.



*Please either use the new workflows or fix the nodes in your old ones before posting issue reports!*



The old version will be kept in a legacy branch, but it will not be maintained.



- Support CogVideoX 1.5 models

- Major code cleanup (it was bad, still isn't great, wip)

- Merge Fun model functionality into the main pipeline:

    - All Fun-specific nodes are gone, besides the image encode node for Fun-InP models

    - Main CogVideo Sampler works with Fun models

    - DimensionX LoRAs now work with Fun models as well



- Remove width/height from the sampler widgets and detect them from the input instead; this means text2vid now requires using empty latents

- Separate VAE from the model, allow using fp32 VAE

- Add the ability to load some of the non-GGUF models as single files (only a few available for now: https://huggingface.co/Kijai/CogVideoX-comfy)

- Add some torchao quantizations as options (see the sketch after this list)

- Add interpolation as an option for the main encode node; the old interpolation-specific node is gone

- torch.compile optimizations

- Remove PAB in favor of FasterCache and cleaner code

- Other smaller things I've forgotten about at this point
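
For reference, weight-only quantization with torchao looks roughly like this in plain PyTorch (a sketch, not the node's code; `int8_weight_only` is just one of the available configs and assumes a recent torchao):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# Stand-in for the CogVideoX transformer; any nn.Module works.
model = torch.nn.Linear(4096, 4096).to(device="cuda", dtype=torch.bfloat16)

# In-place weight-only int8 quantization; activations stay in bf16.
quantize_(model, int8_weight_only())
```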



For Fun-model-based workflows the change is more drastic; for others, migrating generally means re-creating many of the nodes.



## Update7



- Refactored the Fun version's sampler to accept any resolution; this should make it a lot simpler to use with Tora. **BREAKS OLD WORKFLOWS**: old FunSampler nodes need to be remade.

- The old bucket resizing is now on its own node (CogVideoXFunResizeToClosestBucket) to keep the functionality; I honestly don't know if it matters at all, but it's there just in case.

- The Fun version's vid2vid is now also in the same node; the old vid2vid node is deprecated.

- Added support for FasterCache, which trades more VRAM use for speed with a slight quality hit, similar to PAB: https://github.com/Vchitect/FasterCache

- Improved torch.compile support; it actually works now (minimal sketch below)
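
As a minimal illustration of what torch.compile does (not the node's exact settings), compilation is a single wrapper call and kicks in on the first forward pass:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
).cuda()

# The first call triggers compilation; later calls reuse the compiled graph.
compiled = torch.compile(model)
out = compiled(torch.randn(8, 1024, device="cuda"))
```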



## Update6



Initial support for Tora (https://github.com/alibaba/Tora)



Converted model (included in the autodownload node):



https://huggingface.co/Kijai/CogVideoX-5b-Tora/tree/main





https://github.com/user-attachments/assets/d5334237-03dc-48f5-8bec-3ae5998660c6





## Update5

This week there have been some bigger updates that will most likely affect some old workflows; the sampler node in particular will probably need to be refreshed (re-created) if it errors out!



New features:

- Initial context windowing with FreeNoise noise shuffling, mainly for longer generations in the vid2vid and pose2vid pipelines; I haven't figured it out for img2vid yet

- GGUF models and tiled encoding for I2V and pose pipelines (thanks to MinusZoneAI)

- [sageattention](https://github.com/thu-ml/SageAttention) support (Linux only) for a speed boost; I saw a ~20-30% increase with it, it stacks with fp8 fast mode, and it doesn't need compiling (install hint after this list)

- Support CogVideoX-Fun 1.1 and its pose models, with additional control strength and application step settings; this model's input does NOT have to be just dwpose skeletons, just about anything can work

- Support LoRAs
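
To try sageattention (see the bullet above), installation on Linux is roughly a single pip command; check the SageAttention repo for the current instructions if this has changed:

`pip install sageattention`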



https://github.com/user-attachments/assets/ddeb8f38-a647-42b3-a4b1-c6936f961deb



https://github.com/user-attachments/assets/c78b2832-9571-4941-8c97-fbcc1a4cc23d



https://github.com/user-attachments/assets/d9ed98b1-f917-432b-a16e-e01e87efb1f9







## Update4

Initial support for the official I2V version of CogVideoX: https://huggingface.co/THUDM/CogVideoX-5b-I2V



**Also needs diffusers 0.30.3**
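
Updating diffusers is a one-liner (pin taken from the note above; a newer version may also work):

`pip install -U diffusers==0.30.3`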



https://github.com/user-attachments/assets/c672d0af-a676-495d-a42c-7e3dd802b4b0







## Update3



Added initial support for CogVideoX-Fun: https://github.com/aigc-apps/CogVideoX-Fun



Note that while this one can do image2vid, it is NOT the official I2V model; that one should also be released very soon.



https://github.com/user-attachments/assets/68f9ed16-ee53-4955-b931-1799461ac561





## Update2



Added **experimental** support for onediff; this reduced sampling time by ~40% for me, reaching 4.23 s/it on a 4090 with 49 frames.

This requires Linux, torch 2.4.0, and installing onediff and nexfort:



`pip install --pre onediff onediffx`



`pip install nexfort`



The first run will take around 5 minutes for the compilation.



## Update

The 5b model is now also supported for basic text2vid: https://huggingface.co/THUDM/CogVideoX-5b



It is also autodownloaded to `ComfyUI/models/CogVideo/CogVideoX-5b`; the text encoder is not needed as we use the ComfyUI T5.



https://github.com/user-attachments/assets/991205cc-826e-4f93-831a-c10441f0f2ce



Requires diffusers 0.30.1 (this is specified in requirements.txt)



Uses the same T5 model as SD3 and Flux; fp8 works fine too. Memory requirements depend mostly on the video length.

VAE decoding seems to be the only big step that takes a lot of VRAM when everything is offloaded; it momentarily peaks at around 13-14GB at that stage.

Sampling itself takes only maybe 5-6GB.
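
For reference, the same memory-saving ideas (CPU offloading plus tiled VAE decode) look roughly like this in plain diffusers; this is a hedged sketch rather than the wrapper's own code, and `enable_tiling` on the CogVideoX VAE needs a recent diffusers version:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

pipe.enable_sequential_cpu_offload()  # keep weights on CPU, move submodules to GPU as needed
pipe.vae.enable_tiling()              # decode the latent video in tiles to reduce peak VRAM

# Prompt and output path are illustrative.
frames = pipe(prompt="a cat surfing a wave at sunset", num_frames=49).frames[0]
export_to_video(frames, "output.mp4", fps=8)
```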





Hacked in img2img to attempt a vid2vid workflow; it works interestingly with some inputs and is highly experimental.



https://github.com/user-attachments/assets/e6951ef4-ea7a-4752-94f6-cf24f2503d83



https://github.com/user-attachments/assets/9e41f37b-2bb3-411c-81fa-e91b80da2559



Also added temporal tiling as a means of generating endless videos:



https://github.com/kijai/ComfyUI-CogVideoXWrapper



https://github.com/user-attachments/assets/ecdac8b8-d434-48b6-abd6-90755b6b552d







Original repo:

https://github.com/THUDM/CogVideo



CogVideoX-Fun:

https://github.com/aigc-apps/CogVideoX-Fun



Controlnet:

https://github.com/TheDenk/cogvideox-controlnet