Quant precision vs quality

#16
by GTManiK - opened

Worth mentioning that 'float8_e4m3fn' is WAY worse in quality than BF16. That was already the case for Flux, but the difference there wasn't as dramatic. Is there any way to improve it?
I've seen 'scaled' variants of 8-bit Chroma here: https://huggingface.co/Clybius/Chroma-fp8-scaled , but it only outputs latent noise.

Any suggestions?

Use the Q_8 version. It's the basic rule of thumb, unless some ultra nerd creates a new quantisation algorithm: fp32 > fp16 = Q_8 > bf16 = Q_6 > fp8_scaled > fp8 (that last step is closer to '=' than to '>').
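To make the precision gap concrete, here is a rough pure-Python sketch that rounds a few values the way an e4m3 (3 mantissa bits) and a bf16 (7 mantissa bits) format would. The helper names are made up for illustration, overflow is simply saturated (real casts may behave differently), and the per-tensor scaling in `fp8_scaled` is only an assumed scheme, not necessarily what the linked scaled checkpoint does.

```python
import math

def round_to_float(x, mant_bits, min_exp, max_val):
    """Round x to the nearest value of a small binary float format.

    mant_bits: explicit mantissa bits; min_exp: exponent of the smallest
    normal number; max_val: largest finite value. Overflow saturates to
    max_val, which is a simplification made for this sketch.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_val)
    e = math.frexp(a)[1] - 1        # 2**e <= a < 2**(e + 1)
    e = max(e, min_exp)             # subnormals share the lowest exponent
    step = 2.0 ** (e - mant_bits)   # spacing of representable values near a
    # Python's round() is round-half-to-even, like IEEE default rounding.
    return sign * min(round(a / step) * step, max_val)

def fp8_e4m3(x):
    # 4 exponent bits (bias 7), 3 mantissa bits, largest finite value 448.
    return round_to_float(x, mant_bits=3, min_exp=-6, max_val=448.0)

def bf16(x):
    # 8 exponent bits, 7 mantissa bits.
    return round_to_float(x, mant_bits=7, min_exp=-126,
                          max_val=(2 - 2 ** -7) * 2.0 ** 127)

def fp8_scaled(x, absmax):
    """Assumed per-tensor scaling: map absmax onto 448, round, map back."""
    scale = absmax / 448.0
    return fp8_e4m3(x / scale) * scale

# Pretend these are weights from a tensor whose largest magnitude is 0.02.
for w in (0.02, 0.001, 0.0005):
    print(w, bf16(w), fp8_e4m3(w), fp8_scaled(w, absmax=0.02))
```

In this toy model, plain e4m3 turns 0.001 into 0.001953125 and flushes 0.0005 to zero, because both fall into the subnormal range. With the assumed scaling, both values stay within a few percent of the original.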

On my machine with RTX4070 (12GB VRAM), with torch.compile and sage attention:

Q6_K -> around 3.1 s/it
Q_8 -> around 2.6 s/it
fp8_scaled (using either 'default' or 'fp8_e4m3fn' dtype) -> around 2.2 s/it
fp8_scaled (using 'fp8_e4m3fn_fast' dtype) -> around 1.3 s/it

So the last one is twice as fast as Q_8 in my case...
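For anyone who wants the ratios spelled out, a quick sketch that computes the speedup of each variant relative to Q_8. The dictionary just copies the s/it figures above; the function name is made up for illustration.

```python
# Seconds per iteration as reported above (RTX 4070, torch.compile + sage attention).
timings = {
    "Q6_K": 3.1,
    "Q_8": 2.6,
    "fp8_scaled (default / fp8_e4m3fn)": 2.2,
    "fp8_scaled (fp8_e4m3fn_fast)": 1.3,
}

def speedup_over(baseline, timings):
    """Return how many times faster each entry is than the baseline entry."""
    base = timings[baseline]
    return {name: base / seconds for name, seconds in timings.items()}

ratios = speedup_over("Q_8", timings)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Values below 1.0 mean the variant is slower than Q_8 (Q6_K here), and the 'fp8_e4m3fn_fast' entry comes out at 2.0x.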

Quality-wise, fp8_scaled with 'fp8_e4m3fn_fast' is better than the stock Chroma model using the same dtype, as you can see here:

Scaled (v26):
image.png

Stock (v26):
image.png

Here is a side-by-side comparison, downscaled for convenience (Scaled on the left, Stock on the right):
image.png

Oh, you got torch.compile to work? I'd like to have your workflow too...

@levzzz, I think the workflow should be embedded in the first two images above (the ones with the flower lady)...

Note that I merge two LoRAs to speed things up a little, both at weight 0.4:

  1. https://civitai.com/models/686704?modelVersionId=768547
  2. https://huggingface.co/silveroxides/Chroma-LoRA-Experiments/blob/main/Hyper-Chroma-low-step-LoRA.safetensors

Plus a ton of CFG trickery, Detail Daemon and so on, which might be irrelevant to you.
With the above LoRAs it converges in about 30-35 steps instead of 45-50 with dpmpp_2m / simple.

Important! For torch.compile to work, you have to find 'vcvarsall.bat' (it comes with the MSVC build tools) and execute it before running ComfyUI, like so:

call ..\some_path\vcvarsall.bat x64
RunYourComfyUiBatFile.bat

This is because, for Chroma specifically, compilation needs additional include directories for some reason; otherwise it complains about a missing header file like 'algorithm.h'.
Update: now that the Chroma code is merged into ComfyUI master (using standard nodes), invoking 'vcvarsall.bat' is no longer required.
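If you are still on an older setup and hit the missing-header error, a small check like the following can tell you whether the MSVC environment is actually active in the shell that launches ComfyUI. This is only a sketch: it assumes that 'vcvarsall.bat' puts `cl` on PATH and sets the INCLUDE and LIB variables, and the function name is made up for illustration.

```python
import os
import shutil

def msvc_env_report(env=None):
    """Report whether an MSVC build environment looks active.

    env: mapping of environment variables (defaults to os.environ).
    """
    env = os.environ if env is None else env
    return {
        # The compiler driver should be discoverable on PATH.
        "cl_on_path": shutil.which("cl", path=env.get("PATH", "")) is not None,
        # Header and library search paths are passed via these variables.
        "include_set": bool(env.get("INCLUDE", "").strip()),
        "lib_set": bool(env.get("LIB", "").strip()),
    }

if __name__ == "__main__":
    for key, ok in msvc_env_report().items():
        print(f"{key}: {'yes' if ok else 'no'}")
```

Run it from the same console session, after the `call ...vcvarsall.bat x64` line. If `include_set` is 'no', that would be consistent with a missing-header complaint.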

Hi, thanks, I would love to see that workflow as well. However, when I saved the first two images (in PNG format), ComfyUI did not find the workflow. Could you post it separately (as JSON)?
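One way to check whether a downloaded PNG still carries a workflow is to look at its text chunks directly. The sketch below assumes the workflow is stored as JSON in a `tEXt` chunk with the keyword `workflow` (which is how ComfyUI's image saving normally embeds it). It does not handle compressed or international text chunks, and the function names are made up for illustration. If nothing is found, the image host may have re-encoded the file and dropped the metadata.

```python
import json
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def read_png_text_chunks(data):
    """Return {keyword: text} for every tEXt chunk in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            keyword, _, text = body.partition(b"\x00")
            chunks[keyword.decode("latin-1")] = text.decode("latin-1")
        if ctype == b"IEND":
            break
        pos += 8 + length + 4
    return chunks

def extract_workflow(data):
    """Return the embedded workflow as a Python object, or None if absent."""
    text = read_png_text_chunks(data).get("workflow")
    return json.loads(text) if text is not None else None

# Example use (the file name is hypothetical):
# with open("downloaded.png", "rb") as f:
#     workflow = extract_workflow(f.read())
# print("workflow found" if workflow is not None else "no embedded workflow")
```

If `extract_workflow` returns a dictionary, it can be written out with `json.dump` and loaded into ComfyUI as a regular workflow file.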
