why not use CLIP?

#45
by ruwwww - opened

Hello, I've been exploring Flux and this model architecture recently. I noticed that Chroma doesn't utilize CLIP, unlike many other models. Could you clarify the reasoning behind this choice? Was there a specific limitation or design consideration that led to this decision?

Original Flux is already barely affected by CLIP. You could run it without loading CLIP and not notice any difference in output quality.
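
If you want to check that claim yourself, here's a rough sketch of the experiment using diffusers' FluxPipeline: encode the prompt normally, then hand the pipeline a zeroed CLIP pooled vector. The model ID and sampler settings are just illustrative defaults, not anything Chroma-specific.

```python
# Sketch: run Flux with the CLIP pooled vector replaced by zeros.
# Assumes a recent diffusers, access to the FLUX.1-dev weights, and a GPU
# with enough memory.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "a photo of a cat sitting on a windowsill"

# encode_prompt returns the T5 sequence embeddings, the CLIP pooled
# vector, and the text position ids.
prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
    prompt=prompt, prompt_2=prompt
)

# Zero out CLIP's contribution; the T5 embeddings still carry the prompt.
image = pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=torch.zeros_like(pooled_prompt_embeds),
    num_inference_steps=28,
    guidance_scale=3.5,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("flux_zeroed_clip.png")
```

Run the same seed with and without the zeroing and compare; that's essentially the experiment the model card describes.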

As for why it was removed, the explanation is on the model card:

> But after a simple experiment of zeroing these pooled vectors out, the model’s output barely changed—which made pruning a breeze! Why? Because the only information left for this layer to encode is just a single number in the range of 0-1. Yes, you heard it right—3.3B parameters were used to encode 8 bytes of float values. So this was the most obvious layer to prune and replace with a simple FFN. The whole replacement process only took a day on my single 3090, and after that, the model size was reduced to just 8.9B.
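
For intuition, the replacement is conceptually just a small shared MLP that maps the timestep/guidance conditioning to one modulation vector per slot in the network. Here's a toy sketch; the 344 slots and 3072 width roughly correspond to Flux's modulation outputs, but the depth, hidden size, and slot-embedding trick are my own invention, not Chroma's actual distilled_guidance_layer.

```python
# Toy illustration of "replace the modulation stack with a simple FFN":
# one small MLP, shared across all modulation slots, produces each
# block's scale/shift/gate vectors from the timestep/guidance embedding.
# All layer sizes here are invented for the example.
import torch
import torch.nn as nn

class GuidanceFFN(nn.Module):
    def __init__(self, cond_dim=64, hidden=1024, mod_dim=3072, num_vectors=344):
        super().__init__()
        self.num_vectors = num_vectors
        # one learned embedding per modulation slot, concatenated with the
        # conditioning so the shared MLP knows which slot it is producing
        self.slot_emb = nn.Embedding(num_vectors, cond_dim)
        self.net = nn.Sequential(
            nn.Linear(2 * cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, mod_dim),
        )

    def forward(self, cond):
        # cond: (batch, cond_dim) embedding of (timestep, guidance)
        b = cond.shape[0]
        slots = self.slot_emb.weight.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat(
            [cond.unsqueeze(1).expand(-1, self.num_vectors, -1), slots], dim=-1
        )
        return self.net(x)  # (batch, num_vectors, mod_dim)

ffn = GuidanceFFN()
mods = ffn(torch.randn(2, 64))
print(mods.shape)                                # torch.Size([2, 344, 3072])
print(sum(p.numel() for p in ffn.parameters()))  # a few million, vs the 3.3B replaced
```

Because the MLP is shared across slots, it weighs in at a few million parameters instead of the billions it replaces.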

Does this imply that we could take any Flux-based model and do the same, reducing its footprint? Because I can run Chroma in Q8, but standard Flux models are just on the other side of that threshold for me without other tricks.

A day on a 3090 isn't trivial per se, but for a good model it may be worthwhile. Or does this damage the model in the immediate term, such that Chroma can only get away with it because it follows up with extensive training?

@CognitiveSourceress Don't just take my word for it, but I believe you could just take the distilled_guidance_layer from Chroma, use it to replace the modulation layers in any other Flux model, and already have something that kind of works (though some training passes afterwards might be needed to iron things out and make it good).
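
For anyone who wants to poke at that, here's a rough state-dict sketch. The key prefixes are assumptions based on the released Flux/Chroma safetensors, the file paths are placeholders, and the merged result would have to be loaded with Chroma's architecture code, since stock Flux has no distilled_guidance_layer module.

```python
# Rough sketch of the transplant idea above. Inspect both state dicts
# before trusting the key names used here.
from safetensors.torch import load_file, save_file

chroma = load_file("chroma-unlocked.safetensors")      # hypothetical path
target = load_file("some-flux-finetune.safetensors")   # hypothetical path

# Drop the target's per-block modulation weights (Flux names these
# img_mod / txt_mod / modulation in its block keys -- verify on your file).
merged = {
    k: v for k, v in target.items()
    if not any(tag in k for tag in ("img_mod", "txt_mod", ".modulation."))
}

# Copy over Chroma's replacement FFN.
merged.update(
    {k: v for k, v in chroma.items() if k.startswith("distilled_guidance_layer")}
)

save_file(merged, "flux-with-chroma-guidance.safetensors")
```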
