About training

by ArranEye

Will you publish the training method in the future?

Owner

Yes, later. Actually it's quite straightforward:

1. A simple trainer performs distillation from pre-made CLIP outputs for pretraining.
2. As soon as it begins to show signs of life and generates something other than noise, it is trained as part of the whole checkpoint.

In the early stages the unet and LLM are frozen (the released model was trained like that); later they should be unfrozen and trained all together.
Currently I'm using a rewritten version of sd-scripts for training, but there is a lot of mess and hardcoded stuff that must be fixed before publishing. It will take some time.
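
As a rough illustration of this staged schedule, a minimal PyTorch sketch might look like the following (the `unet`, `llm`, and `adapter` modules and the learning rates here are toy placeholders, not the actual setup):

```python
import torch
from torch import nn

# Toy placeholders standing in for the real unet, LLM, and adapter modules.
unet = nn.Linear(8, 8)
llm = nn.Linear(8, 8)
adapter = nn.Linear(8, 8)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Early stage: only the adapter learns; unet and LLM stay frozen.
set_trainable(unet, False)
set_trainable(llm, False)
set_trainable(adapter, True)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Later stage: unfreeze everything and train the whole checkpoint together.
set_trainable(unet, True)
set_trainable(llm, True)
optimizer = torch.optim.AdamW(
    [*unet.parameters(), *llm.parameters(), *adapter.parameters()], lr=1e-5
)
```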

Will other, smaller LLMs get similar results, such as qwen-0.7b?

Owner

It depends, but you have to retrain it to be used with a different LLM.

Amazing work! Could you describe the distillation procedure from pre-made CLIP outputs in more detail? Right now I'm training umT5 as an additional text encoder, but so far the results are bad :(

Owner

Using a small part of the rouwei dataset with a little balancing, I prepared about 1M varied captions and precalculated the CLIP-L and CLIP-G states and pooled outputs that are used by the SDXL unet.
Then LLM states were prepared for the same captions as well.
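
For the CLIP side, precomputing the targets could look roughly like this (a sketch with CLIP-L only; CLIP-G is analogous, and the model ID and the choice of the penultimate hidden state are assumptions about the SDXL setup, not confirmed details):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "openai/clip-vit-large-patch14"  # assumed CLIP-L checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id)
clip_l = CLIPTextModel.from_pretrained(model_id).eval()

@torch.no_grad()
def encode(caption: str):
    ids = tokenizer(caption, padding="max_length", truncation=True,
                    return_tensors="pt").input_ids
    out = clip_l(ids, output_hidden_states=True)
    # SDXL consumes per-token hidden states plus a pooled vector;
    # the penultimate layer is used here as a common convention.
    return out.hidden_states[-2], out.pooler_output

states, pooled = encode("a caption from the dataset")
torch.save({"clip_states": states, "clip_pooled": pooled}, "caption_000000.pt")
```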
Finally comes a simple trainer script with a few steps:
1. load the LLM states and the CLIP states/pools
2. process the LLM states with the adapter
3. calculate the loss by comparing the adapter output with the precalculated CLIP outputs
4. backward pass
The same approach has proven to work with t5gemma.
Maybe I'll post some code examples in a couple of weeks when I get back from vacation.
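
A minimal sketch of those four steps, assuming precomputed tensors like the ones above and an MSE objective (the `Adapter` module, its dimensions, and the loss are illustrative assumptions, not the exact recipe):

```python
import torch
from torch import nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Toy stand-in mapping LLM hidden states to CLIP-shaped states + pool."""
    def __init__(self, llm_dim: int = 4096, clip_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(llm_dim, clip_dim)
        self.pool_proj = nn.Linear(llm_dim, clip_dim)

    def forward(self, llm_states):
        states = self.proj(llm_states)               # per-token states
        pooled = self.pool_proj(llm_states.mean(1))  # crude pooled summary
        return states, pooled

adapter = Adapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

def batches():
    # Stand-in for a dataloader over the precomputed caption tensors.
    while True:
        yield {
            "llm_states": torch.randn(4, 77, 4096),
            "clip_states": torch.randn(4, 77, 2048),
            "clip_pooled": torch.randn(4, 2048),
        }

for step, batch in zip(range(1000), batches()):
    pred_states, pred_pooled = adapter(batch["llm_states"])    # (2) adapter
    loss = (F.mse_loss(pred_states, batch["clip_states"])
            + F.mse_loss(pred_pooled, batch["clip_pooled"]))   # (3) loss
    optimizer.zero_grad()
    loss.backward()                                            # (4) backward
    optimizer.step()
```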

But from my observations it can only work as a pretraining stage: it saves a lot of compute after initializing the model from noise, since the forward/backward passes are limited to the adapter alone. The result will be something like "shows some understanding" rather than "good and accurate".
It only really starts working after being trained within the whole checkpoint.
