Permanent residence in VRAM
Every time I do an image generation, even if I keep the pipe alive, it seems to be loading/unloading stuff, which adds tons of waiting time.
I tried removing the `enable_model_cpu_offload()` call, but then there are no memory savings compared to base Qwen-Image.
My use case: I run this as a service and I want to have it all in VRAM permanently for fast serving (reaping the benefit of DF11's lower VRAM usage).
I spent a whole day trying what you are suggesting. I am not the author and not at his level. What I learned is that it's a trade-off: you get a lossless full model, but at the cost of longer generation times. DFloat11 handles CPU offloading internally, and it does so better than the default CPU offloads available in PyTorch. The way DFloat11 does this, it is probably transferring one block (or a few blocks) at a time (I think it's just one), so it uses little VRAM during inference. I am not sure whether you are running with `pin_memory=True` (the default) or not, but if you have more than enough RAM you will see "some" speedup.
Only the DFloat11 team can optimize their package, and it may not be possible. The other option might be to use a quantized version that can fully load into memory; then you don't need DFloat11.
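To see what `pin_memory` actually buys, here is a small self-contained PyTorch timing sketch. It illustrates the general mechanism (pinned, page-locked host RAM allows faster CPU-to-GPU copies), not DFloat11's internals:

```python
import time
import torch

# Compare CPU->GPU copy speed for pageable vs. pinned (page-locked) host memory.
# Pinned memory is what pin_memory=True requests during offloaded inference.
x_pageable = torch.empty(1024, 1024, 256)                 # ~1GiB, ordinary pageable RAM
x_pinned = torch.empty(1024, 1024, 256, pin_memory=True)  # ~1GiB, page-locked RAM

for name, x in [("pageable", x_pageable), ("pinned", x_pinned)]:
    torch.cuda.synchronize()
    start = time.perf_counter()
    x.to("cuda", non_blocking=True)   # async copy; pinned memory makes this faster
    torch.cuda.synchronize()
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```

The catch is that page-locking large buffers is itself slow, so pinning adds a one-time cost at load time; that trade-off shows up later in this thread.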
Thank you for bringing this to my attention!
I have added a feature to the DFloat11 package for configuring the number of blocks to offload, which means:
- offloading more blocks uses less GPU memory and more CPU memory,
- offloading fewer blocks uses more GPU memory and less CPU memory, and could be faster.

This allows you to configure the optimal number of blocks to offload for the best balance between memory efficiency and speed. To try it, upgrade to the latest version with `pip install -U dfloat11[cuda12]` and follow the instructions in this model card.
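For a concrete picture, here is a minimal sketch of what this configuration could look like, modeled on the usual DFloat11 diffusers recipe. The repo ID and the `cpu_offload_blocks` parameter name are assumptions based on the description above; the model card has the exact spelling:

```python
import torch
from diffusers import DiffusionPipeline
from dfloat11 import DFloat11Model

# Load the BF16 Qwen-Image pipeline, then swap in the DFloat11-compressed transformer.
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

DFloat11Model.from_pretrained(
    "DFloat11/Qwen-Image-DF11",   # assumed repo ID
    device="cpu",
    cpu_offload=True,
    cpu_offload_blocks=20,        # assumed name: how many transformer blocks stay in CPU RAM
    pin_memory=True,              # page-locked RAM for faster CPU->GPU block transfers
    bfloat16_model=pipe.transformer,
)

pipe.enable_model_cpu_offload()
```

Lowering the number of offloaded blocks keeps more of the transformer resident on the GPU, so if you have spare VRAM you can trade memory for speed.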
> Every time I do an image generation, even if I keep the pipe alive, it seems to be loading/unloading stuff, which adds tons of waiting time. I tried removing the `enable_model_cpu_offload()` call, but then there are no memory savings compared to base Qwen-Image. My use case: I run this as a service and I want to have it all in VRAM permanently for fast serving (reaping the benefit of DF11's lower VRAM usage).
To answer your question, the DFloat11 version does save around 11GB in VRAM usage if you load everything into VRAM.
The problem is that the Qwen-Image model is larger than you think. It has a diffusion transformer (41GB), a text encoder (14GB), and a VAE (0.25GB). If you load everything into VRAM, it would consume around 55GB, which is probably more than your GPU's capacity. The DFloat11 model reduces the diffusion transformer from 41GB to 28.5GB.
To load everything into VRAM, replace the line `pipe.enable_model_cpu_offload()` with `pipe = pipe.to('cuda')`. This should remain compatible with DFloat11 offloading.
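Putting that together, a minimal sketch of the fully VRAM-resident setup (again, the repo ID and the `cpu_offload` flag spelling are assumptions; see the model card):

```python
import torch
from diffusers import DiffusionPipeline
from dfloat11 import DFloat11Model

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Attach the DFloat11-compressed transformer, with DFloat11's own CPU offloading disabled.
DFloat11Model.from_pretrained(
    "DFloat11/Qwen-Image-DF11",   # assumed repo ID
    device="cpu",
    cpu_offload=False,            # assumed flag spelling; see the model card
    bfloat16_model=pipe.transformer,
)

# Instead of pipe.enable_model_cpu_offload(), keep everything resident in VRAM:
# ~28.5GB (DF11 transformer) + 14GB (text encoder) + 0.25GB (VAE) ≈ 43GB,
# versus ~55GB for the unmodified BF16 pipeline.
pipe = pipe.to("cuda")
```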
Hi,
I run the DFloat11 version on an RTX 3090. It runs fine, but it takes about 15-20 minutes per image, with long stretches of low GPU utilization and a slow ramp-up in VRAM usage. I was thinking "OK, maybe it is slow and/or I need 32GB to get it running a lot faster, with the CPU offloading slowing things down"... but when running Qwen-Image-Edit on Replicate.com, it takes 2.7 sec/image!!! OK, they have decent hardware, but there is no way their hardware is that much faster.
What performance are you getting on RTX 5090 hardware? I am wondering if the open-source model is just super under-optimized to give an edge to paying services, but hopefully I am wrong.
Note: BTW, the memory-pin option just slows things down like crazy on Ubuntu. If I activate it, I have to wait 5 to 6 minutes for the weights to load into memory! If I do not activate it, loading takes only about 20 seconds each time, so a consistent and faster load time. Not an expert, but I guess the memory-pin option is only useful for Windows users.
I am surprised by your 2.7-second Replicate report. It's possible to do an edit in just 10 steps, maybe fewer. Qwen-Image-Edit can be tested on HF Spaces for free; it should run on an H200 there by default. If it's really 2.7 seconds, I want to see some proof.