Commit · 56d5816
Parent(s): 14ba40f

add docs for image conditioning
Files changed:

- docs/finetrainers/documentation_README.md +129 -0
- docs/finetrainers/documentation_args.md +316 -0
- docs/finetrainers/documentation_models_README.md +11 -4
- docs/finetrainers/documentation_models_wan.md +65 -3
- docs/finetrainers/documentation_trainers_control_trainer.md +3 -0
- docs/finetrainers/documentation_trainers_sft_trainer.md +3 -0
- docs/finetrainers/examples_training_wan_image_conditioning__train.sh +175 -0
- docs/finetrainers/examples_training_wan_image_conditioning__ttraining.json +26 -0
- docs/finetrainers/examples_training_wan_image_conditioning__tvalidation.json +44 -0
- requirements.txt +3 -2
- requirements_without_flash_attention.txt +3 -2
- train.py +9 -3
docs/finetrainers/documentation_README.md
ADDED
@@ -0,0 +1,129 @@
# finetrainers 🧪

Finetrainers is a work-in-progress library to support (accessible) training of diffusion models and various commonly used training algorithms.

<table align="center">
<tr>
  <td align="center"><video src="https://github.com/user-attachments/assets/aad07161-87cb-4784-9e6b-16d06581e3e5">Your browser does not support the video tag.</video></td>
  <td align="center"><video src="https://github.com/user-attachments/assets/c23d53e2-b422-4084-9156-3fce9fd01dad">Your browser does not support the video tag.</video></td>
</tr>
<tr>
  <th align="center">CogVideoX LoRA training as the first iteration of this project</th>
  <th align="center">Replication of PikaEffects</th>
</tr>
</table>

## Table of Contents

- [Quickstart](#quickstart)
- [Features](#features)
- [News](#news)
- [Support Matrix](#support-matrix)
- [Featured Projects](#featured-projects-)
- [Acknowledgements](#acknowledgements)

## Quickstart

Clone the repository and make sure the requirements are installed: `pip install -r requirements.txt`, and install `diffusers` from source with `pip install git+https://github.com/huggingface/diffusers`. The requirements specify `diffusers>=0.32.1`, but it is always recommended to use the `main` branch of Diffusers for the latest features and bugfixes. Note that the `main` branch for `finetrainers` is also the development branch, and stable support should be expected from the release tags.

Check out the latest stable release tag:

```bash
git fetch --all --tags
git checkout tags/v0.1.0
```

Follow the instructions mentioned in the [README](https://github.com/a-r-r-o-w/finetrainers/tree/v0.1.0) for the latest stable release.

#### Using the main branch

To get started quickly with example training scripts on the main development branch, refer to the following:
- [LTX-Video Pika Effects Crush](./examples/training/sft/ltx_video/crush_smol_lora/)
- [CogVideoX Pika Effects Crush](./examples/training/sft/cogvideox/crush_smol_lora/)
- [Wan T2V Pika Effects Crush](./examples/training/sft/wan/crush_smol_lora/)

The following are some simple datasets/HF orgs with good datasets to test training with quickly:
- [Disney Video Generation Dataset](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset)
- [bigdatapw Video Dataset Collection](https://huggingface.co/bigdata-pw)
- [Finetrainers HF Dataset Collection](https://huggingface.co/finetrainers)

Please check out [`docs/models`](./docs/models/) and [`examples/training`](./examples/training/) to learn more about supported models for training & example reproducible training launch scripts. For a full list of arguments that can be set for training, refer to [`docs/args`](./docs/args.md).

> [!IMPORTANT]
> It is recommended to use PyTorch 2.5.1 or above for training. Previous versions can lead to completely black videos, OOM errors, or other issues and are not tested. For fully reproducible training, please use the same environment as mentioned in [environment.md](./docs/environment.md).

## Features

- DDP, FSDP-2 & HSDP support for all models
- LoRA and full-rank finetuning; Conditional Control training
- Memory-efficient single-GPU training
- Auto-detection of commonly used dataset formats
- Combined image/video datasets, multiple chainable local/remote datasets, multi-resolution bucketing & more
- Memory-efficient precomputation support with/without on-the-fly precomputation for large scale datasets
- Standardized model specification format for training arbitrary models
- Fake FP8 training (QAT upcoming!)

## News

- 🔥 **2025-04-12**: Channel-concatenated control conditioning support added for CogView4 and Wan!
- 🔥 **2025-04-08**: `torch.compile` support added!
- 🔥 **2025-04-06**: Flux support added!
- 🔥 **2025-03-07**: CogView4 support added!
- 🔥 **2025-03-03**: Wan T2V support added!
- 🔥 **2025-03-03**: We have shipped a complete refactor to support multi-backend distributed training, better precomputation handling for big datasets, model specification format (externally usable for training custom models), FSDP & more.
- 🔥 **2025-02-12**: We have shipped a set of tooling to curate small and high-quality video datasets for fine-tuning. See the [video-dataset-scripts](https://github.com/huggingface/video-dataset-scripts) documentation page for details!
- 🔥 **2025-02-12**: Check out [eisneim/ltx_lora_training_i2v_t2v](https://github.com/eisneim/ltx_lora_training_i2v_t2v/)! It builds off of `finetrainers` to support image to video training for LTX-Video and STG guidance for inference.
- 🔥 **2025-01-15**: Support for naive FP8 weight-casting training added! This allows training HunyuanVideo in under 24 GB up to specific resolutions.
- 🔥 **2025-01-13**: Support for T2V full-finetuning added! Thanks to [@ArEnSc](https://github.com/ArEnSc) for taking up the initiative!
- 🔥 **2025-01-03**: Support for T2V LoRA finetuning of [CogVideoX](https://huggingface.co/docs/diffusers/main/api/pipelines/cogvideox) added!
- 🔥 **2024-12-20**: Support for T2V LoRA finetuning of [Hunyuan Video](https://huggingface.co/docs/diffusers/main/api/pipelines/hunyuan_video) added! We would like to thank @SHYuanBest for his work on a training script [here](https://github.com/huggingface/diffusers/pull/10254).
- 🔥 **2024-12-18**: Support for T2V LoRA finetuning of [LTX Video](https://huggingface.co/docs/diffusers/main/api/pipelines/ltx_video) added!

## Support Matrix

The following trainers are currently supported:

- [SFT Trainer](./docs/trainer/sft_trainer.md)
- [Control Trainer](./docs/trainer/control_trainer.md)

> [!NOTE]
> The following numbers were obtained from the [release branch](https://github.com/a-r-r-o-w/finetrainers/tree/v0.0.1). The `main` branch is unstable at the moment and may use higher memory.

<div align="center">

| **Model Name** | **Tasks** | **Min. LoRA VRAM<sup>*</sup>** | **Min. Full Finetuning VRAM<sup>^</sup>** |
|:----------------------------------------------:|:-------------:|:----------------------------------:|:---------------------------------------------:|
| [LTX-Video](./docs/models/ltx_video.md) | Text-to-Video | 5 GB | 21 GB |
| [HunyuanVideo](./docs/models/hunyuan_video.md) | Text-to-Video | 32 GB | OOM |
| [CogVideoX-5b](./docs/models/cogvideox.md) | Text-to-Video | 18 GB | 53 GB |
| [Wan](./docs/models/wan.md) | Text-to-Video | TODO | TODO |
| [CogView4](./docs/models/cogview4.md) | Text-to-Image | TODO | TODO |
| [Flux](./docs/models/flux.md) | Text-to-Image | TODO | TODO |

</div>

<sub><sup>*</sup>Noted for training-only, no validation, at resolution `49x512x768`, rank 128, with pre-computation, using **FP8** weights & gradient checkpointing. Pre-computation of conditions and latents may require higher limits (but typically under 16 GB).</sub><br/>
<sub><sup>^</sup>Noted for training-only, no validation, at resolution `49x512x768`, with pre-computation, using **BF16** weights & gradient checkpointing.</sub>

If you would like to use a custom dataset, refer to the dataset preparation guide [here](./docs/dataset/README.md).

## Featured Projects 🔥

Check out some amazing projects citing `finetrainers`:
- [Diffusion as Shader](https://github.com/IGL-HKUST/DiffusionAsShader)
- [SkyworkAI's SkyReels-A1](https://github.com/SkyworkAI/SkyReels-A1) & [SkyReels-A2](https://github.com/SkyworkAI/SkyReels-A2)
- [Aether](https://github.com/OpenRobotLab/Aether)
- [MagicMotion](https://github.com/quanhaol/MagicMotion)
- [eisneim's LTX Image-to-Video](https://github.com/eisneim/ltx_lora_training_i2v_t2v/)
- [wileewang's TransPixar](https://github.com/wileewang/TransPixar)
- [Feizc's Video-In-Context](https://github.com/feizc/Video-In-Context)

Check out the following UIs built for `finetrainers`:
- [jbilcke's VideoModelStudio](https://github.com/jbilcke-hf/VideoModelStudio)
- [neph1's finetrainers-ui](https://github.com/neph1/finetrainers-ui)

## Acknowledgements

* `finetrainers` builds on top of & takes inspiration from great open-source libraries - `transformers`, `accelerate`, `torchtune`, `torchtitan`, `peft`, `diffusers`, `bitsandbytes`, `torchao` and `deepspeed` - to name a few.
* Some of the design choices of `finetrainers` were inspired by [`SimpleTuner`](https://github.com/bghira/SimpleTuner).
docs/finetrainers/documentation_args.md
ADDED
@@ -0,0 +1,316 @@
# Arguments

This document lists all the arguments that can be passed to the `train.py` script. For more information, please take a look at the `finetrainers/args.py` file.

## Table of contents

- [General arguments](#general)
- [SFT training arguments](#sft-training)
- [Control training arguments](#control-training)

## General

<!-- TODO(aryan): write a github workflow that automatically updates this page -->

```
PARALLEL ARGUMENTS
------------------
parallel_backend (`str`, defaults to `accelerate`):
    The parallel backend to use for training. Choose between ['accelerate', 'ptd'].
pp_degree (`int`, defaults to `1`):
    The degree of pipeline parallelism.
dp_degree (`int`, defaults to `1`):
    The degree of data parallelism (number of model replicas).
dp_shards (`int`, defaults to `-1`):
    The number of data parallel shards (number of model partitions).
cp_degree (`int`, defaults to `1`):
    The degree of context parallelism.

MODEL ARGUMENTS
---------------
model_name (`str`):
    Name of model to train. To get a list of models, run `python train.py --list_models`.
pretrained_model_name_or_path (`str`):
    Path to pretrained model or model identifier from https://huggingface.co/models. The model should be
    loadable based on specified `model_name`.
revision (`str`, defaults to `None`):
    If provided, the model will be loaded from a specific branch of the model repository.
variant (`str`, defaults to `None`):
    Variant of model weights to use. Some models provide weight variants, such as `fp16`, to reduce disk
    storage requirements.
cache_dir (`str`, defaults to `None`):
    The directory where the downloaded models and datasets will be stored, or loaded from.
tokenizer_id (`str`, defaults to `None`):
    Identifier for the tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_2_id (`str`, defaults to `None`):
    Identifier for the second tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_3_id (`str`, defaults to `None`):
    Identifier for the third tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
text_encoder_id (`str`, defaults to `None`):
    Identifier for the text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_2_id (`str`, defaults to `None`):
    Identifier for the second text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_3_id (`str`, defaults to `None`):
    Identifier for the third text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
transformer_id (`str`, defaults to `None`):
    Identifier for the transformer model. This is useful when using a different transformer model than the default from `pretrained_model_name_or_path`.
vae_id (`str`, defaults to `None`):
    Identifier for the VAE model. This is useful when using a different VAE model than the default from `pretrained_model_name_or_path`.
text_encoder_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the text encoder when generating text embeddings.
text_encoder_2_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the text encoder 2 when generating text embeddings.
text_encoder_3_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the text encoder 3 when generating text embeddings.
transformer_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the transformer model.
vae_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the VAE model.
layerwise_upcasting_modules (`List[str]`, defaults to `[]`):
    Modules that should have fp8 storage weights but higher precision computation. Choose between ['transformer'].
layerwise_upcasting_storage_dtype (`torch.dtype`, defaults to `float8_e4m3fn`):
    Data type for the layerwise upcasting storage. Choose between ['float8_e4m3fn', 'float8_e5m2'].
layerwise_upcasting_skip_modules_pattern (`List[str]`, defaults to `["patch_embed", "pos_embed", "x_embedder", "context_embedder", "^proj_in$", "^proj_out$", "norm"]`):
    Modules to skip for layerwise upcasting. Layers such as normalization and modulation, when cast to fp8 precision
    naively (as done in layerwise upcasting), can lead to poorer training and inference quality. We skip these layers
    by default, and recommend adding more layers to the default list based on the model architecture.
compile_modules (`List[str]`, defaults to `[]`):
    Modules that should be regionally compiled with `torch.compile`. Choose one or more from ['transformer'].

DATASET ARGUMENTS
-----------------
dataset_config (`str`):
    Path to a dataset file containing information about training data. This file can contain information about one or
    more datasets in JSON format. The file must have a key called "datasets", which is a list of dictionaries. Each
    dictionary must contain the following keys:
    - "data_root": (`str`)
        The root directory containing the dataset. This parameter must be provided if `dataset_file` is not provided.
    - "dataset_file": (`str`)
        Path to a CSV/JSON/JSONL/PARQUET/ARROW/HF_HUB_DATASET file containing metadata for training. This parameter
        must be provided if `data_root` is not provided.
    - "dataset_type": (`str`)
        Type of dataset. Choose between ['image', 'video'].
    - "id_token": (`str`)
        Identifier token appended to the start of each prompt if provided. This is useful for LoRA-type training
        for single subject/concept/style training, but is not necessary.
    - "image_resolution_buckets": (`List[Tuple[int, int]]`)
        Resolution buckets for images. This should be a list of tuples containing 2 values, where each tuple
        represents the resolution (height, width). All images will be resized to the nearest bucket resolution.
        This parameter must be provided if `dataset_type` is 'image'.
    - "video_resolution_buckets": (`List[Tuple[int, int, int]]`)
        Resolution buckets for videos. This should be a list of tuples containing 3 values, where each tuple
        represents the resolution (num_frames, height, width). All videos will be resized to the nearest bucket
        resolution. This parameter must be provided if `dataset_type` is 'video'.
    - "reshape_mode": (`str`)
        All input images/videos are reshaped using this mode. Choose between the following:
        ["center_crop", "random_crop", "bicubic"].
    - "remove_common_llm_caption_prefixes": (`boolean`)
        Whether or not to remove common LLM caption prefixes. See `~constants.py` for the list of common prefixes.
dataset_shuffle_buffer_size (`int`, defaults to `1`):
    The buffer size for shuffling the dataset. This is useful for shuffling the dataset before training. The default
    value of `1` means that the dataset will not be shuffled.
precomputation_items (`int`, defaults to `512`):
    Number of data samples to precompute at once for memory-efficient training. The higher this value,
    the more disk memory will be used to save the precomputed samples (conditions and latents).
precomputation_dir (`str`, defaults to `None`):
    The directory where the precomputed samples will be stored. If not provided, the precomputed samples
    will be stored in a temporary directory of the output directory.
precomputation_once (`bool`, defaults to `False`):
    Precompute embeddings from all datasets at once before training. This is useful to save time during training
    with smaller datasets. If set to `False`, will save disk space by precomputing embeddings on-the-fly during
    training when required. Make sure to set `precomputation_items` to a reasonable value in line with the size
    of your dataset(s).

DATALOADER ARGUMENTS
--------------------
See https://pytorch.org/docs/stable/data.html for more information.

dataloader_num_workers (`int`, defaults to `0`):
    Number of subprocesses to use for data loading. `0` means that the data will be loaded in a blocking manner
    on the main process.
pin_memory (`bool`, defaults to `False`):
    Whether or not to use the pinned memory setting in the PyTorch dataloader. This is useful for faster data loading.

DIFFUSION ARGUMENTS
-------------------
flow_resolution_shifting (`bool`, defaults to `False`):
    Resolution-dependent shifting of timestep schedules.
    [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://arxiv.org/abs/2403.03206).
    TODO(aryan): We don't support this yet.
flow_base_seq_len (`int`, defaults to `256`):
    Base number of tokens for images/video when applying resolution-dependent shifting.
flow_max_seq_len (`int`, defaults to `4096`):
    Maximum number of tokens for images/video when applying resolution-dependent shifting.
flow_base_shift (`float`, defaults to `0.5`):
    Base shift for timestep schedules when applying resolution-dependent shifting.
flow_max_shift (`float`, defaults to `1.15`):
    Maximum shift for timestep schedules when applying resolution-dependent shifting.
flow_shift (`float`, defaults to `1.0`):
    Instead of training with uniform/logit-normal sigmas, shift them as (shift * sigma) / (1 + (shift - 1) * sigma).
    Setting it higher is helpful when trying to train models for high-resolution generation or to produce better
    samples in a lower number of inference steps.
flow_weighting_scheme (`str`, defaults to `none`):
    We default to the "none" weighting scheme for uniform sampling and uniform loss.
    Choose between ['sigma_sqrt', 'logit_normal', 'mode', 'cosmap', 'none'].
flow_logit_mean (`float`, defaults to `0.0`):
    Mean to use when using the `'logit_normal'` weighting scheme.
flow_logit_std (`float`, defaults to `1.0`):
    Standard deviation to use when using the `'logit_normal'` weighting scheme.
flow_mode_scale (`float`, defaults to `1.29`):
    Scale of mode weighting scheme. Only effective when using `'mode'` as the `weighting_scheme`.

TRAINING ARGUMENTS
------------------
training_type (`str`, defaults to `None`):
    Type of training to perform. Choose between ['lora'].
seed (`int`, defaults to `42`):
    A seed for reproducible training.
batch_size (`int`, defaults to `1`):
    Per-device batch size.
train_steps (`int`, defaults to `1000`):
    Total number of training steps to perform.
max_data_samples (`int`, defaults to `2**64`):
    Maximum number of data samples observed during training. If less than that required by `train_steps`,
    training will stop early.
gradient_accumulation_steps (`int`, defaults to `1`):
    Number of gradient steps to accumulate before performing an optimizer step.
gradient_checkpointing (`bool`, defaults to `False`):
    Whether or not to use gradient/activation checkpointing to save memory at the expense of a slower
    backward pass.
checkpointing_steps (`int`, defaults to `500`):
    Save a checkpoint of the training state every X training steps. These checkpoints can be used both
    as final checkpoints in case they are better than the last checkpoint, and are also suitable for
    resuming training using `resume_from_checkpoint`.
checkpointing_limit (`int`, defaults to `None`):
    Max number of checkpoints to store.
resume_from_checkpoint (`str`, defaults to `None`):
    Whether training should be resumed from a previous checkpoint. Use a path saved by `checkpointing_steps`,
    or `"latest"` to automatically select the last available checkpoint.

OPTIMIZER ARGUMENTS
-------------------
optimizer (`str`, defaults to `adamw`):
    The optimizer type to use. Choose between the following:
    - Torch optimizers: ["adam", "adamw"]
    - Bitsandbytes optimizers: ["adam-bnb", "adamw-bnb", "adam-bnb-8bit", "adamw-bnb-8bit"]
lr (`float`, defaults to `1e-4`):
    Initial learning rate (after the potential warmup period) to use.
lr_scheduler (`str`, defaults to `cosine_with_restarts`):
    The scheduler type to use. Choose between ['linear', 'cosine', 'cosine_with_restarts', 'polynomial',
    'constant', 'constant_with_warmup'].
lr_warmup_steps (`int`, defaults to `500`):
    Number of steps for the warmup in the lr scheduler.
lr_num_cycles (`int`, defaults to `1`):
    Number of hard resets of the lr in cosine_with_restarts scheduler.
lr_power (`float`, defaults to `1.0`):
    Power factor of the polynomial scheduler.
beta1 (`float`, defaults to `0.9`):
beta2 (`float`, defaults to `0.95`):
beta3 (`float`, defaults to `0.999`):
weight_decay (`float`, defaults to `0.0001`):
    Penalty for large weights in the model.
epsilon (`float`, defaults to `1e-8`):
    Small value to avoid division by zero in the optimizer.
max_grad_norm (`float`, defaults to `1.0`):
    Maximum gradient norm to clip the gradients.

VALIDATION ARGUMENTS
--------------------
validation_dataset_file (`str`, defaults to `None`):
    Path to a CSV/JSON/PARQUET/ARROW file containing information for validation. The file must contain at least the
    "caption" column. Other columns such as "image_path" and "video_path" can be provided too. If provided, "image_path"
    will be used to load a PIL.Image.Image and set the "image" key in the sample dictionary. Similarly, "video_path"
    will be used to load a List[PIL.Image.Image] and set the "video" key in the sample dictionary.
    The validation dataset file may contain other attributes specific to inference/validation such as:
    - "height" and "width" and "num_frames": Resolution
    - "num_inference_steps": Number of inference steps
    - "guidance_scale": Classifier-free Guidance Scale
    - ... (any number of additional attributes can be provided. The ModelSpecification::validate method will be
      invoked with the sample dictionary to validate the sample.)
validation_steps (`int`, defaults to `500`):
    Number of training steps after which a validation step is performed.
enable_model_cpu_offload (`bool`, defaults to `False`):
    Whether or not to offload different modeling components to CPU during validation.

MISCELLANEOUS ARGUMENTS
-----------------------
tracker_name (`str`, defaults to `finetrainers`):
    Name of the tracker/project to use for logging training metrics.
push_to_hub (`bool`, defaults to `False`):
    Whether or not to push the model to the Hugging Face Hub.
hub_token (`str`, defaults to `None`):
    The API token to use for pushing the model to the Hugging Face Hub.
hub_model_id (`str`, defaults to `None`):
    The model identifier to use for pushing the model to the Hugging Face Hub.
output_dir (`str`, defaults to `None`):
    The directory where the model checkpoints and logs will be stored.
logging_dir (`str`, defaults to `logs`):
    The directory where the logs will be stored.
logging_steps (`int`, defaults to `1`):
    Training logs will be tracked every `logging_steps` steps.
allow_tf32 (`bool`, defaults to `False`):
    Whether or not to allow the use of TF32 matmul on compatible hardware.
nccl_timeout (`int`, defaults to `1800`):
    Timeout for NCCL communication.
report_to (`str`, defaults to `wandb`):
    The name of the logger to use for logging training metrics. Choose between ['wandb'].
verbose (`int`, defaults to `1`):
    Whether or not to print verbose logs.
    - 0: Diffusers/Transformers warning logging on local main process only
    - 1: Diffusers/Transformers info logging on local main process only
    - 2: Diffusers/Transformers debug logging on local main process only
    - 3: Diffusers/Transformers debug logging on all processes
```

## SFT training

If using `--training_type lora`, these arguments can be specified.

```
rank (int):
    Rank of the low rank approximation.
lora_alpha (int):
    The lora_alpha parameter to compute scaling factor (lora_alpha / rank) for low-rank matrices.
target_modules (`str` or `List[str]`):
    Target modules for the low rank approximation. Can be a regex string or a list of regex strings.
```

No additional arguments are required for `--training_type full-finetune`.

## Control training

If using `--training_type control-lora`, these arguments can be specified.

```
control_type (`str`, defaults to `"canny"`):
    Control type for the low rank approximation matrices. Can be "canny", "custom".
rank (int, defaults to `64`):
    Rank of the low rank approximation matrix.
lora_alpha (int, defaults to `64`):
    The lora_alpha parameter to compute scaling factor (lora_alpha / rank) for low-rank matrices.
target_modules (`str` or `List[str]`, defaults to `"(transformer_blocks|single_transformer_blocks).*(to_q|to_k|to_v|to_out.0|ff.net.0.proj|ff.net.2)"`):
    Target modules for the low rank approximation matrices. Can be a regex string or a list of regex strings.
train_qk_norm (`bool`, defaults to `False`):
    Whether to train the QK normalization layers.
frame_conditioning_type (`str`, defaults to `"full"`):
    Type of frame conditioning. Can be "index", "prefix", "random", "first_and_last", or "full".
frame_conditioning_index (int, defaults to `0`):
    Index of the frame conditioning. Only used if `frame_conditioning_type` is "index".
frame_conditioning_concatenate_mask (`bool`, defaults to `False`):
    Whether to concatenate the frame mask with the latents across the channel dim.
```

If using `--training_type control-full-finetune`, these arguments can be specified.

```
control_type (`str`, defaults to `"canny"`):
    Control type for the low rank approximation matrices. Can be "canny", "custom".
train_qk_norm (`bool`, defaults to `False`):
    Whether to train the QK normalization layers.
frame_conditioning_type (`str`, defaults to `"index"`):
    Type of frame conditioning. Can be "index", "prefix", "random", "first_and_last", or "full".
frame_conditioning_index (int, defaults to `0`):
    Index of the frame conditioning. Only used if `frame_conditioning_type` is "index".
frame_conditioning_concatenate_mask (`bool`, defaults to `False`):
    Whether to concatenate the frame mask with the latents across the channel dim.
```
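A quick aside (not part of the commit): the `flow_shift` argument documented above is just a remapping of the sampled sigmas. A minimal sketch of that formula, with an illustrative shift value:

```python
# Sketch of the sigma shift described for `flow_shift`:
# shifted_sigma = (shift * sigma) / (1 + (shift - 1) * sigma)
import torch

def shift_sigmas(sigmas: torch.Tensor, shift: float) -> torch.Tensor:
    # For shift > 1, sigmas are pushed toward 1.0 (the noisier end of the schedule),
    # which matches the guidance above for high-resolution training.
    return (shift * sigmas) / (1.0 + (shift - 1.0) * sigmas)

sigmas = torch.rand(4)                  # uniformly sampled sigmas in [0, 1)
print(shift_sigmas(sigmas, shift=3.0))  # shift=3.0 is only an illustrative value
```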
docs/finetrainers/documentation_models_README.md
CHANGED
@@ -1,4 +1,4 @@
-#
+# Finetrainers training documentation
 
 This directory contains the training-related specifications for all the models we support in `finetrainers`. Each model page has:
 - an example training command
@@ -20,14 +20,21 @@ The following table shows the algorithms supported for training and the models t
 
 | Model | SFT | Control | ControlNet | Distillation |
 |:-----------------------------------------:|:---:|:-------:|:----------:|:------------:|
-| [CogVideoX](./cogvideox.md)
-| [
-| [
+| [CogVideoX](./cogvideox.md) | 🤗 | 😡 | 😡 | 😡 |
+| [CogView4](./cogview4.md) | 🤗 | 🤗 | 😡 | 😡 |
+| [Flux](./flux.md) | 🤗 | 😡 | 😡 | 😡 |
+| [HunyuanVideo](./hunyuan_video.md) | 🤗 | 😡 | 😡 | 😡 |
+| [LTX-Video](./ltx_video.md) | 🤗 | 😡 | 😡 | 😡 |
+| [Wan](./wan.md) | 🤗 | 🤗 | 😡 | 😡 |
 
 For launching SFT Training:
 - `--training_type lora`: Trains a new set of low-rank weights of the model, yielding a smaller adapter model. Currently, only LoRA is supported from [🤗 PEFT](https://github.com/huggingface/peft)
 - `--training_type full-finetune`: Trains the full-rank weights of the model, yielding a full-parameter trained model.
 
+For launching Control Training:
+- `--training_type control-lora`: Trains LoRA-rank weights for an additional channel-wise concatenated control condition.
+- `--training_type control-full-finetune`: Trains the full-rank control-conditioned model.
+
 Any model architecture loadable in diffusers/transformers for above models can be used for training. For example, [SkyReels-T2V](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-T2V) is a finetune of HunyuanVideo, which is compatible for continual training out-of-the-box. Custom models can be loaded either by writing your own [ModelSpecification](TODO(aryan): add link) or by using the following set of arguments:
 - `--tokenizer_id`, `--tokenizer_2_id`, `--tokenizer_3_id`: The tokenizers to use for training in conjunction with text encoder conditioning models.
 - `--text_encoder_id`, `--text_encoder_2_id`, `--text_encoder_3_id`: The text encoder conditioning models.
docs/finetrainers/documentation_models_wan.md
CHANGED
@@ -7,6 +7,7 @@ For LoRA training, specify `--training_type lora`. For full finetuning, specify
 Examples available:
 - [PIKA crush effect](../../examples/training/sft/wan/crush_smol_lora/)
 - [3DGS dissolve](../../examples/training/sft/wan/3dgs_dissolve/)
+- [I2V conditioning](../../examples/training/control/wan/image_condition/)
 
 To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):
 
@@ -36,8 +37,69 @@ video = pipe("<my-awesome-prompt>").frames[0]
 export_to_video(video, "output.mp4", fps=8)
 ```
 
+To use trained Control LoRAs, the following can be used for inference (ideally, you should raise a support request in Diffusers):
+
+<details>
+<summary> Control LoRA inference </summary>
+
+```python
+import numpy as np
+import torch
+from diffusers import WanPipeline
+from diffusers.utils import export_to_video, load_video
+from finetrainers.trainer.control_trainer.data import apply_frame_conditioning_on_latents
+from finetrainers.models.utils import _expand_conv3d_with_zeroed_weights
+from finetrainers.patches import load_lora_weights
+from finetrainers.patches.dependencies.diffusers.control import control_channel_concat
+
+dtype = torch.bfloat16
+device = torch.device("cuda")
+generator = torch.Generator().manual_seed(0)
+
+pipe = WanPipeline.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=dtype).to(device)
+
+in_channels = pipe.transformer.config.in_channels
+patch_channels = pipe.transformer.patch_embedding.in_channels
+pipe.transformer.patch_embedding = _expand_conv3d_with_zeroed_weights(pipe.transformer.patch_embedding, new_in_channels=2 * patch_channels)
+
+load_lora_weights(pipe, "/raid/aryan/wan-control-image-condition", "wan-lora")
+pipe.to(device)
+
+prompt = "The video shows a vibrant green Mustang GT parked in a parking lot. The car is positioned at an angle, showcasing its sleek design and black rims. The car's hood is black, contrasting with the green body. The Mustang GT logo is visible on the side of the car. The parking lot appears to be empty, with the car being the main focus of the video. The car's position and the absence of other vehicles suggest that the video might be a promotional or showcase video for the Mustang GT. The overall style of the video is simple and straightforward, focusing on the car and its design."
+control_video = load_video("examples/training/control/wan/image_condition/validation_dataset/0.mp4")
+height, width, num_frames = 480, 704, 49
+
+# Take evenly spaced `num_frames` frames from the control video
+indices = np.linspace(0, len(control_video) - 1, num_frames).astype(int)
+control_video = [control_video[i] for i in indices]
+
+with torch.no_grad():
+    latents = pipe.prepare_latents(1, in_channels, height, width, num_frames, dtype, device, generator)
+    latents_mean = torch.tensor(pipe.vae.config.latents_mean).view(1, -1, 1, 1, 1).to(latents)
+    latents_std = 1.0 / torch.tensor(pipe.vae.config.latents_std).view(1, -1, 1, 1, 1).to(latents)
+    control_video = pipe.video_processor.preprocess_video(control_video, height=height, width=width)
+    control_video = control_video.to(device=device, dtype=dtype)
+    control_latents = pipe.vae.encode(control_video).latent_dist.sample(generator=generator)
+    control_latents = ((control_latents.float() - latents_mean) * latents_std).to(dtype)
+    control_latents = apply_frame_conditioning_on_latents(
+        control_latents,
+        expected_num_frames=latents.size(2),
+        channel_dim=1,
+        frame_dim=2,
+        frame_conditioning_type="index",
+        frame_conditioning_index=0,
+        concatenate_mask=False,
+    )
+
+with control_channel_concat(pipe.transformer, ["hidden_states"], [control_latents], dims=[1]):
+    video = pipe(prompt, latents=latents, num_inference_steps=30, generator=generator).frames[0]
+
+export_to_video(video, "output.mp4", fps=16)
+```
+</details>
+
 You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
 
-
-
-
+- [Wan in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan)
+- [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
+- [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)
docs/finetrainers/documentation_trainers_control_trainer.md
ADDED
@@ -0,0 +1,3 @@
# Control Trainer

The Control trainer supports channel-concatenated control conditioning for models, using either low-rank adapters or full-rank training. It involves adding extra input channels to the patch embedding layer (referred to as the "control injection" layer in finetrainers) to mix conditioning features into the latent stream. This architecture choice is very common and has been seen before in many models - CogVideoX-I2V, HunyuanVideo-I2V, Alibaba's Fun Control models, etc.
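For intuition only (this sketch is not part of the commit, and the layer shapes are made up): the "control injection" described above amounts to widening the patch-embedding convolution with zero-initialized weights for the extra channels and concatenating the control latents along the channel dimension, which is what `_expand_conv3d_with_zeroed_weights` and `control_channel_concat` do in the Wan inference example earlier.

```python
# Illustrative sketch, not finetrainers code: expand a Conv3d patch embedding so that the
# extra (control) input channels start with zeroed weights. At initialization the expanded
# layer behaves exactly like the original one, and training only has to learn how to use
# the concatenated control signal.
import torch
import torch.nn as nn

def expand_patch_embedding(conv: nn.Conv3d, new_in_channels: int) -> nn.Conv3d:
    expanded = nn.Conv3d(
        new_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        expanded.weight.zero_()
        expanded.weight[:, : conv.in_channels] = conv.weight  # copy original weights
        if conv.bias is not None:
            expanded.bias.copy_(conv.bias)
    return expanded

# Hypothetical shapes: 16 latent channels, patch size (1, 2, 2)
patch_embedding = nn.Conv3d(16, 1536, kernel_size=(1, 2, 2), stride=(1, 2, 2))
patch_embedding = expand_patch_embedding(patch_embedding, new_in_channels=32)

latents = torch.randn(1, 16, 13, 30, 44)
control_latents = torch.randn(1, 16, 13, 30, 44)
out = patch_embedding(torch.cat([latents, control_latents], dim=1))  # channel-wise concat
```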
docs/finetrainers/documentation_trainers_sft_trainer.md
ADDED
@@ -0,0 +1,3 @@
# SFT Trainer

The SFT trainer supports low-rank and full-rank finetuning of models.
docs/finetrainers/examples_training_wan_image_conditioning__train.sh
ADDED
@@ -0,0 +1,175 @@
#!/bin/bash

set -e -x

# export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
# export TORCHDYNAMO_VERBOSE=1
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL="INFO"

# Download the validation dataset
if [ ! -d "examples/training/control/wan/image_condition/validation_dataset" ]; then
  echo "Downloading validation dataset..."
  huggingface-cli download --repo-type dataset finetrainers/OpenVid-1k-split-validation --local-dir examples/training/control/wan/image_condition/validation_dataset
else
  echo "Validation dataset already exists. Skipping download."
fi

# Finetrainers supports multiple backends for distributed training. Select your favourite and benchmark the differences!
# BACKEND="accelerate"
BACKEND="ptd"

# In this setting, I'm using 1 GPU on a 4-GPU node for training
NUM_GPUS=1
CUDA_VISIBLE_DEVICES="3"

# Check the JSON files for the expected JSON format
TRAINING_DATASET_CONFIG="examples/training/control/wan/image_condition/training.json"
VALIDATION_DATASET_FILE="examples/training/control/wan/image_condition/validation.json"

# Depending on how many GPUs you have available, choose your degree of parallelism and technique!
DDP_1="--parallel_backend $BACKEND --pp_degree 1 --dp_degree 1 --dp_shards 1 --cp_degree 1 --tp_degree 1"
DDP_2="--parallel_backend $BACKEND --pp_degree 1 --dp_degree 2 --dp_shards 1 --cp_degree 1 --tp_degree 1"
DDP_4="--parallel_backend $BACKEND --pp_degree 1 --dp_degree 4 --dp_shards 1 --cp_degree 1 --tp_degree 1"
DDP_8="--parallel_backend $BACKEND --pp_degree 1 --dp_degree 8 --dp_shards 1 --cp_degree 1 --tp_degree 1"
FSDP_2="--parallel_backend $BACKEND --pp_degree 1 --dp_degree 1 --dp_shards 2 --cp_degree 1 --tp_degree 1"
FSDP_4="--parallel_backend $BACKEND --pp_degree 1 --dp_degree 1 --dp_shards 4 --cp_degree 1 --tp_degree 1"
HSDP_2_2="--parallel_backend $BACKEND --pp_degree 1 --dp_degree 2 --dp_shards 2 --cp_degree 1 --tp_degree 1"

# Parallel arguments
parallel_cmd=(
  $DDP_1
)

# Model arguments
model_cmd=(
  --model_name "wan"
  --pretrained_model_name_or_path "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
  --compile_modules transformer
)

# Control arguments
control_cmd=(
  --control_type none
  --rank 128
  --lora_alpha 128
  --target_modules "blocks.*(to_q|to_k|to_v|to_out.0|ff.net.0.proj|ff.net.2)"
  --frame_conditioning_type index
  --frame_conditioning_index 0
)

# Dataset arguments
dataset_cmd=(
  --dataset_config $TRAINING_DATASET_CONFIG
  --dataset_shuffle_buffer_size 32
)

# Dataloader arguments
dataloader_cmd=(
  --dataloader_num_workers 0
)

# Diffusion arguments
diffusion_cmd=(
  --flow_weighting_scheme "logit_normal"
)

# Training arguments
# We target just the attention projection layers for LoRA training here.
# You can modify as you please and target any layer (regex is supported)
training_cmd=(
  --training_type control-lora
  --seed 42
  --batch_size 1
  --train_steps 10000
  --gradient_accumulation_steps 1
  --gradient_checkpointing
  --checkpointing_steps 1000
  --checkpointing_limit 2
  # --resume_from_checkpoint 3000
  --enable_slicing
  --enable_tiling
)

# Optimizer arguments
optimizer_cmd=(
  --optimizer "adamw"
  --lr 2e-5
  --lr_scheduler "constant_with_warmup"
  --lr_warmup_steps 1000
  --lr_num_cycles 1
  --beta1 0.9
  --beta2 0.99
  --weight_decay 1e-4
  --epsilon 1e-8
  --max_grad_norm 1.0
)

# Validation arguments
validation_cmd=(
  --validation_dataset_file "$VALIDATION_DATASET_FILE"
  --validation_steps 501
)

# Miscellaneous arguments
miscellaneous_cmd=(
  --tracker_name "finetrainers-wan-control"
  --output_dir "/raid/aryan/wan-control-image-condition"
  --init_timeout 600
  --nccl_timeout 600
  --report_to "wandb"
)

# Execute the training script
if [ "$BACKEND" == "accelerate" ]; then

  ACCELERATE_CONFIG_FILE=""
  if [ "$NUM_GPUS" == 1 ]; then
    ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml"
  elif [ "$NUM_GPUS" == 2 ]; then
    ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_2.yaml"
  elif [ "$NUM_GPUS" == 4 ]; then
    ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_4.yaml"
  elif [ "$NUM_GPUS" == 8 ]; then
    ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_8.yaml"
  fi

  accelerate launch --config_file "$ACCELERATE_CONFIG_FILE" --gpu_ids $CUDA_VISIBLE_DEVICES train.py \
    "${parallel_cmd[@]}" \
    "${model_cmd[@]}" \
    "${control_cmd[@]}" \
    "${dataset_cmd[@]}" \
    "${dataloader_cmd[@]}" \
    "${diffusion_cmd[@]}" \
    "${training_cmd[@]}" \
    "${optimizer_cmd[@]}" \
    "${validation_cmd[@]}" \
    "${miscellaneous_cmd[@]}"

elif [ "$BACKEND" == "ptd" ]; then

  export CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES

  torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=$NUM_GPUS \
    --rdzv_backend c10d \
    --rdzv_endpoint="localhost:19242" \
    train.py \
    "${parallel_cmd[@]}" \
    "${model_cmd[@]}" \
    "${control_cmd[@]}" \
    "${dataset_cmd[@]}" \
    "${dataloader_cmd[@]}" \
    "${diffusion_cmd[@]}" \
    "${training_cmd[@]}" \
    "${optimizer_cmd[@]}" \
    "${validation_cmd[@]}" \
    "${miscellaneous_cmd[@]}"
fi

echo -ne "-------------------- Finished executing script --------------------\n\n"
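A small aside on the `--rank 128 --lora_alpha 128` pair in the control arguments above: per the arguments documentation in this commit, the scaling applied to the low-rank update is `lora_alpha / rank`, so this configuration keeps it at 1.0.

```python
# LoRA scaling as documented: scaling = lora_alpha / rank
rank, lora_alpha = 128, 128
print(lora_alpha / rank)  # 1.0 for the configuration used in train.sh above
```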
docs/finetrainers/examples_training_wan_image_conditioning__ttraining.json
ADDED
@@ -0,0 +1,26 @@
{
  "datasets": [
    {
      "data_root": "recoilme/aesthetic_photos_xs",
      "dataset_type": "image",
      "image_resolution_buckets": [
        [1024, 1024]
      ],
      "reshape_mode": "bicubic",
      "remove_common_llm_caption_prefixes": true
    },
    {
      "data_root": "finetrainers/OpenVid-1k-split",
      "dataset_type": "video",
      "video_resolution_buckets": [
        [49, 512, 512],
        [49, 768, 768],
        [49, 1024, 1024],
        [49, 480, 704],
        [49, 704, 480]
      ],
      "reshape_mode": "bicubic",
      "remove_common_llm_caption_prefixes": true
    }
  ]
}
docs/finetrainers/examples_training_wan_image_conditioning__tvalidation.json
ADDED
@@ -0,0 +1,44 @@
{
  "data": [
    {
      "caption": "A vibrant green Mustang GT parked in a parking lot. The car is positioned at an angle, showcasing its sleek design and black rims. The car's hood is black, contrasting with the green body. The Mustang GT logo is visible on the side of the car. The parking lot appears to be empty, with the car being the main focus of the video. The car's position and the absence of other vehicles suggest that the video might be a promotional or showcase video for the Mustang GT. The overall style of the video is simple and straightforward, focusing on the car and its design.",
      "video_path": "examples/training/control/wan/image_condition/validation_dataset/0.mp4",
      "num_inference_steps": 30,
      "num_frames": 49,
      "height": 480,
      "width": 704,
      "frame_conditioning_type": "index",
      "frame_conditioning_index": 0
    },
    {
      "caption": "A cooking tutorial featuring a man in a kitchen. He is wearing a white t-shirt and a black apron. In the first frame, he is holding a bowl and appears to be explaining something. In the second frame, he is mixing ingredients in the bowl. In the third frame, he is pouring the mixture into another bowl. The kitchen is well-equipped with various appliances and utensils. There are also other people in the background, possibly assisting with the cooking process. The style of the video is informative and instructional, with a focus on the cooking process.",
      "video_path": "examples/training/control/wan/image_condition/validation_dataset/1.mp4",
      "num_inference_steps": 30,
      "num_frames": 49,
      "height": 480,
      "width": 704,
      "frame_conditioning_type": "index",
      "frame_conditioning_index": 0
    },
    {
      "caption": "A man in a suit and tie, standing against a blue background with a digital pattern. He appears to be speaking or presenting, as suggested by his open mouth and focused expression. The style of the video is likely a news segment or an interview, given the professional attire of the man and the formal setting. The background suggests a modern, technology-focused theme, which could be related to the content of the man's speech or the context of the interview. The overall impression is one of professionalism and seriousness.",
      "video_path": "examples/training/control/wan/image_condition/validation_dataset/2.mp4",
      "num_inference_steps": 30,
      "num_frames": 49,
      "height": 480,
      "width": 704,
      "frame_conditioning_type": "index",
      "frame_conditioning_index": 0
    },
    {
      "caption": "A man in a workshop, dressed in a black shirt and a beige hat, with a beard and glasses. He is holding a hammer and a metal object, possibly a piece of iron or a tool. The workshop is filled with various tools and equipment, including a workbench, a vice, and a shelf with various items. The man appears to be engaged in a craft or a repair project. The style of the video is realistic and documentary, capturing the man's actions and the environment of the workshop in detail. The focus is on the man and his work, with no additional text or graphics. The lighting is natural, suggesting that the video was taken during the day. The overall impression is of a skilled craftsman at work in his workshop.",
      "video_path": "examples/training/control/wan/image_condition/validation_dataset/3.mp4",
      "num_inference_steps": 30,
      "num_frames": 49,
      "height": 480,
      "width": 704,
      "frame_conditioning_type": "index",
      "frame_conditioning_index": 0
    }
  ]
}
requirements.txt
CHANGED
@@ -3,8 +3,9 @@
 # For GPU monitoring of NVIDIA chipsets
 pynvml
 
-
-#finetrainers
+# we are waiting for the next PyPI release
+#finetrainers==0.1.0
+finetrainers @ git+https://github.com/a-r-r-o-w/finetrainers.git@main
 # temporary fix for pip install bug:
 #finetrainers @ git+https://github.com/jbilcke-hf/finetrainers-patches.git@fix_missing_sft_trainer_files
 
requirements_without_flash_attention.txt
CHANGED
@@ -7,8 +7,9 @@ pynvml
 #eva-decord==0.6.1
 #decord
 
-
-#finetrainers
+# we are waiting for the next PyPI release
+#finetrainers==0.1.0
+finetrainers @ git+https://github.com/a-r-r-o-w/finetrainers.git@main
 # temporary fix for pip install bug:
 #finetrainers @ git+https://github.com/jbilcke-hf/finetrainers-patches.git@fix_missing_sft_trainer_files
 
train.py
CHANGED
@@ -1,8 +1,9 @@
 import sys
 import traceback
 
-from finetrainers import BaseArgs, SFTTrainer, TrainingType, get_logger
+from finetrainers import BaseArgs, ControlTrainer, SFTTrainer, TrainingType, get_logger
 from finetrainers.config import _get_model_specifiction_cls
+from finetrainers.trainer.control_trainer.config import ControlFullRankConfig, ControlLowRankConfig
 from finetrainers.trainer.sft_trainer.config import SFTFullRankConfig, SFTLowRankConfig
 
 
@@ -35,11 +36,14 @@ def main():
         training_cls = SFTLowRankConfig
     elif training_type == TrainingType.FULL_FINETUNE:
         training_cls = SFTFullRankConfig
+    elif training_type == TrainingType.CONTROL_LORA:
+        training_cls = ControlLowRankConfig
+    elif training_type == TrainingType.CONTROL_FULL_FINETUNE:
+        training_cls = ControlFullRankConfig
     else:
         raise ValueError(f"Training type {training_type} not supported.")
 
-
-    args.extend_args(training_config.add_args, training_config.map_args, training_config.validate_args)
+    args.register_args(training_cls())
     args = args.parse_args()
 
     model_specification_cls = _get_model_specifiction_cls(args.model_name, args.training_type)
@@ -64,6 +68,8 @@
 
     if args.training_type in [TrainingType.LORA, TrainingType.FULL_FINETUNE]:
         trainer = SFTTrainer(args, model_specification)
+    elif args.training_type in [TrainingType.CONTROL_LORA, TrainingType.CONTROL_FULL_FINETUNE]:
+        trainer = ControlTrainer(args, model_specification)
     else:
         raise ValueError(f"Training type {args.training_type} not supported.")
 