# ControlNet Training Documentation
This document outlines the process for training a ControlNet model using the provided Python scripts (`train.py` and `train_controlnet.py`). The scripts facilitate training a ControlNet model integrated with a Stable Diffusion pipeline for conditional image generation. Below, we describe the training process and provide a detailed table of the command-line arguments used to configure the training.
## Overview
The training process involves two main scripts:
- `train.py`: A wrapper script that executes `train_controlnet.py` with the provided command-line arguments.
- `train_controlnet.py`: The core script that handles the training of the ControlNet model, including dataset preparation, model initialization, the training loop, and validation.
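The wrapper adds no training logic of its own. A minimal sketch of what such a wrapper might look like (the actual contents of `train.py` may differ):

```python
# Hypothetical sketch of a train.py-style wrapper: it forwards its own
# command-line arguments to train_controlnet.py in a subprocess.
import subprocess
import sys

def main() -> None:
    cmd = [sys.executable, "train_controlnet.py", *sys.argv[1:]]
    subprocess.run(cmd, check=True)  # propagate training failures as an exception

if __name__ == "__main__":
    main()
```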
## Training Workflow
- Argument Parsing: The script parses command-line arguments to configure the training process, such as model paths, dataset details, and hyperparameters.
- Dataset Preparation: Loads and preprocesses the dataset (from the HuggingFace Hub or a local directory), applying transformations to images and tokenizing captions (see the preprocessing sketch after this list).
- Model Initialization: Loads pretrained models (e.g., Stable Diffusion, VAE, UNet, text encoder) and initializes or loads ControlNet weights.
- Training Loop: Trains the ControlNet model using the Accelerate library for distributed training, with support for mixed precision, gradient checkpointing, and learning rate scheduling (a sketch of a single training step follows this list).
- Validation: Periodically validates the model by generating images using validation prompts and images, logging results to TensorBoard or Weights & Biases.
- Checkpointing and Saving: Saves model checkpoints during training and the final model to the output directory. Optionally pushes the model to the HuggingFace Hub.
- Model Card Creation: Generates a model card with training details and example images for documentation.
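To make the dataset-preparation step concrete, here is a hedged sketch of the kind of preprocessing `train_controlnet.py` applies: target images are normalized to [-1, 1] for the VAE, conditioning images stay in [0, 1], and captions are tokenized. The column names follow the defaults in the argument table below, and the dataset name is illustrative; the real script's transforms may differ in detail.

```python
# Sketch of dataset preprocessing (assumes the default column names
# "image", "conditioning_image", and "text"; dataset name is illustrative).
from datasets import load_dataset
from torchvision import transforms
from transformers import CLIPTokenizer

resolution = 512

# Target images are normalized to [-1, 1], as expected by the VAE encoder.
image_transforms = transforms.Compose([
    transforms.Resize(resolution),
    transforms.CenterCrop(resolution),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Conditioning images stay in [0, 1]; the ControlNet consumes them directly.
conditioning_transforms = transforms.Compose([
    transforms.Resize(resolution),
    transforms.CenterCrop(resolution),
    transforms.ToTensor(),
])

tokenizer = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="tokenizer")

def preprocess(examples):
    examples["pixel_values"] = [
        image_transforms(img.convert("RGB")) for img in examples["image"]]
    examples["conditioning_pixel_values"] = [
        conditioning_transforms(img.convert("RGB"))
        for img in examples["conditioning_image"]]
    examples["input_ids"] = tokenizer(
        examples["text"], padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True, return_tensors="pt").input_ids
    return examples

dataset = load_dataset("huggingface/controlnet-dataset", split="train")
dataset = dataset.with_transform(preprocess)
```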
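The training loop itself optimizes a standard diffusion denoising objective: noise is added to the VAE latents of the target image at a random timestep, the ControlNet turns the conditioning image into residuals that are injected into the frozen UNet, and the loss is the mean squared error between the predicted and actual noise. A minimal sketch of one step, assuming the batch keys from the preprocessing sketch above (the real loop also handles Accelerate setup, gradient accumulation, clipping, and alternative prediction types):

```python
# Sketch of a single ControlNet training step. Variable names are
# illustrative; the models come from the pretrained Stable Diffusion
# checkpoint, with only the ControlNet receiving gradient updates.
import torch
import torch.nn.functional as F

def training_step(batch, vae, text_encoder, unet, controlnet, noise_scheduler):
    # Encode target images into latents and add noise at a random timestep.
    latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Text conditioning from the frozen text encoder.
    encoder_hidden_states = text_encoder(batch["input_ids"])[0]

    # The ControlNet produces residuals from the conditioning image ...
    down_res, mid_res = controlnet(
        noisy_latents, timesteps,
        encoder_hidden_states=encoder_hidden_states,
        controlnet_cond=batch["conditioning_pixel_values"],
        return_dict=False)

    # ... which are injected into the frozen UNet's skip connections.
    model_pred = unet(
        noisy_latents, timesteps,
        encoder_hidden_states=encoder_hidden_states,
        down_block_additional_residuals=down_res,
        mid_block_additional_residual=mid_res).sample

    # Standard epsilon-prediction MSE objective.
    return F.mse_loss(model_pred.float(), noise.float())
```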
## Command-Line Arguments
The following table describes the command-line arguments available in `train_controlnet.py` for configuring the training process:
| Argument | Type | Default | Description |
|---|---|---|---|
| `--pretrained_model_name_or_path` | `str` | None | Path to pretrained model or model identifier from huggingface.co/models. Required. |
| `--controlnet_model_name_or_path` | `str` | None | Path to pretrained ControlNet model or model identifier. If not specified, ControlNet weights are initialized from the UNet. |
| `--revision` | `str` | None | Revision of pretrained model identifier from huggingface.co/models. |
| `--variant` | `str` | None | Variant of the model files (e.g., 'fp16'). |
| `--tokenizer_name` | `str` | None | Pretrained tokenizer name or path if different from model_name. |
| `--output_dir` | `str` | "controlnet-model" | Directory where model predictions and checkpoints are saved. |
| `--cache_dir` | `str` | None | Directory for storing downloaded models and datasets. |
| `--seed` | `int` | None | Seed for reproducible training. |
| `--resolution` | `int` | 512 | Resolution for input images (must be divisible by 8). |
| `--train_batch_size` | `int` | 4 | Batch size per device for the training dataloader. |
| `--num_train_epochs` | `int` | 1 | Number of training epochs. |
| `--max_train_steps` | `int` | None | Total number of training steps. Overrides `num_train_epochs` if provided. |
| `--checkpointing_steps` | `int` | 500 | Save a checkpoint every X updates. |
| `--checkpoints_total_limit` | `int` | None | Maximum number of checkpoints to store. |
| `--resume_from_checkpoint` | `str` | None | Resume training from a previous checkpoint path or "latest". |
| `--gradient_accumulation_steps` | `int` | 1 | Number of update steps to accumulate before a backward pass. |
| `--gradient_checkpointing` | flag | False | Enable gradient checkpointing to save memory at the cost of slower backward passes. |
| `--learning_rate` | `float` | 5e-6 | Initial learning rate after warmup. |
| `--scale_lr` | flag | False | Scale the learning rate by the number of GPUs, gradient accumulation steps, and batch size. |
| `--lr_scheduler` | `str` | "constant" | Learning rate scheduler type: ["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"]. |
| `--lr_warmup_steps` | `int` | 500 | Number of steps for learning rate warmup. |
| `--lr_num_cycles` | `int` | 1 | Number of hard resets for the cosine_with_restarts scheduler. |
| `--lr_power` | `float` | 1.0 | Power factor for the polynomial scheduler. |
| `--use_8bit_adam` | flag | False | Use the 8-bit Adam optimizer from bitsandbytes for lower memory usage. |
| `--dataloader_num_workers` | `int` | 0 | Number of subprocesses for data loading (0 means the main process). |
| `--adam_beta1` | `float` | 0.9 | Beta1 parameter for the Adam optimizer. |
| `--adam_beta2` | `float` | 0.999 | Beta2 parameter for the Adam optimizer. |
| `--adam_weight_decay` | `float` | 1e-2 | Weight decay for the Adam optimizer. |
| `--adam_epsilon` | `float` | 1e-08 | Epsilon value for the Adam optimizer. |
| `--max_grad_norm` | `float` | 1.0 | Maximum gradient norm for clipping. |
| `--push_to_hub` | flag | False | Push the model to the HuggingFace Hub. |
| `--hub_token` | `str` | None | Token for pushing to the HuggingFace Hub. |
| `--hub_model_id` | `str` | None | Repository name for syncing with `output_dir`. |
| `--logging_dir` | `str` | "logs" | TensorBoard log directory. |
| `--allow_tf32` | flag | False | Allow TF32 on Ampere GPUs for faster training. |
| `--report_to` | `str` | "tensorboard" | Integration for logging: ["tensorboard", "wandb", "comet_ml", "all"]. |
| `--mixed_precision` | `str` | None | Mixed precision training: ["no", "fp16", "bf16"]. |
| `--enable_xformers_memory_efficient_attention` | flag | False | Enable xformers for memory-efficient attention. |
| `--set_grads_to_none` | flag | False | Set gradients to None instead of zero to save memory. |
| `--dataset_name` | `str` | None | Name of the dataset from the HuggingFace Hub or a local path. |
| `--dataset_config_name` | `str` | None | Dataset configuration name. |
| `--train_data_dir` | `str` | None | Directory containing training data with `metadata.jsonl`. |
| `--image_column` | `str` | "image" | Dataset column for target images. |
| `--conditioning_image_column` | `str` | "conditioning_image" | Dataset column for ControlNet conditioning images. |
| `--caption_column` | `str` | "text" | Dataset column for captions. |
| `--max_train_samples` | `int` | None | Truncate training examples to this number for debugging or quicker training. |
| `--proportion_empty_prompts` | `float` | 0 | Proportion of prompts to replace with empty strings (0 to 1). |
| `--validation_prompt` | `str` | None | Prompts for validation, evaluated every `validation_steps`. |
| `--validation_image` | `str` | None | Paths to ControlNet conditioning images for validation. |
| `--num_validation_images` | `int` | 4 | Number of images generated per validation prompt-image pair. |
| `--validation_steps` | `int` | 100 | Run validation every X steps. |
| `--tracker_project_name` | `str` | "train_controlnet" | Project name for Accelerator trackers. |
## Usage Example
To train a ControlNet model, run the following command:
```bash
python src/controlnet_image_generator/train.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" \
  --dataset_name="huggingface/controlnet-dataset" \
  --output_dir="controlnet_output" \
  --resolution=512 \
  --train_batch_size=4 \
  --num_train_epochs=3 \
  --learning_rate=1e-5 \
  --validation_prompt="A cat sitting on a chair" \
  --validation_image="path/to/conditioning_image.png" \
  --push_to_hub \
  --hub_model_id="your-username/controlnet-model"
```
This command trains a ControlNet model on the specified dataset, starting from the Stable Diffusion 2.1 pretrained model, and pushes the trained model to the HuggingFace Hub.
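Once training finishes, the ControlNet saved in `--output_dir` can be loaded into a `StableDiffusionControlNetPipeline` for inference. A minimal sketch, assuming the output directory and prompt from the example above and a CUDA device:

```python
# Load the trained ControlNet into a Stable Diffusion pipeline for inference.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "controlnet_output", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

cond = load_image("path/to/conditioning_image.png")
image = pipe("A cat sitting on a chair", image=cond,
             num_inference_steps=30).images[0]
image.save("generated.png")
```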
## Notes
- Ensure the dataset contains columns for target images, conditioning images, and captions, as specified by `image_column`, `conditioning_image_column`, and `caption_column`. For local datasets, these keys appear in `metadata.jsonl` (see the example after this list).
- The resolution must be divisible by 8 to ensure compatibility with the VAE and ControlNet encoder.
- Mixed precision training (`fp16` or `bf16`) can reduce memory usage but requires compatible hardware.
- Validation images and prompts must be provided in matching quantities, or as a single value to be reused across pairs.
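When training from a local `--train_data_dir`, each line of `metadata.jsonl` describes one training example. A hypothetical two-line example, with keys matching the default `image_column`, `conditioning_image_column`, and `caption_column` values (file paths and captions are illustrative):

```json
{"image": "images/0001.png", "conditioning_image": "conditioning/0001.png", "text": "A cat sitting on a chair"}
{"image": "images/0002.png", "conditioning_image": "conditioning/0002.png", "text": "A dog lying on a sofa"}
```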
For further details, refer to the source scripts or the HuggingFace Diffusers documentation.