Maybe it could help future readers to have a brief overview of the CUDA programming model itself, so that the hierarchy of its components (SMs, thread blocks, registers, etc.) is clear.
As an ML practitioner you have probably implemented the three-loop matrix multiplication many times, but the naive implementation is terrible for GPU performance. Modern GPUs reach peak performance only through careful memory access patterns and by minimizing scheduling overhead.
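For concreteness, here is a minimal sketch of that naive approach ported directly to a CUDA kernel, one thread per output element; the kernel name, parameter names, and row-major layout are assumptions made for illustration:

```cuda
// Naive matmul: C[M,N] = A[M,K] * B[K,N], one thread per output element.
// Every operand is read straight from global memory on each iteration,
// so there is no data reuse and performance is far from peak.
// (Row-major layout assumed.)
__global__ void naive_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)   // innermost of the "3 loops"
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```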
In a naive (non-persistent) tiled matmul (M×K · K×N), the computation happens in tiles, both for the output matrix and for the chunks read from the input matrices. Each thread block processes one output tile: it loads the corresponding tiles from the inputs (sum-reducing across the K dimension), performs the computation, and then terminates. The GPU launches many thread blocks and schedules them across the available streaming multiprocessors (SMs). When an SM finishes one tile, it is assigned a new thread block for the next uncomputed tile. This way, multiple output tiles are computed in parallel across the SMs, but we pay the thread-block launch cost each time a new tile is computed.
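As a rough illustration of that scheme (the tile size, names, and one-output-element-per-thread layout are simplifying assumptions), here is a shared-memory tiled kernel in which each thread block computes exactly one output tile and then terminates:

```cuda
#define TILE 32  // assumed tile size

// Tiled matmul: each thread block owns one TILE x TILE output tile.
// It stages TILE-wide slices of A and B through shared memory while
// sum-reducing across K, writes its tile, and then exits. The grid is
// sized to cover every output tile.
__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // output row of this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // output col of this thread
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Cooperatively load one input tile from A and one from B.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
// Launch: a grid of ceil(N/TILE) x ceil(M/TILE) blocks, one per output tile,
// each with TILE x TILE threads.
```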
Persistent matmul changes this approach. Instead of repeatedly launching thread blocks, computing a batch of output tiles on the SMs in parallel, and terminating them until all output tiles are done, you launch only as many thread blocks as there are SMs available (typically 80-132 on modern GPUs). These thread blocks stay alive until every output tile is computed, with each persistent thread block looping over multiple output tiles sequentially.
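Here is a sketch of the persistent variant under the same (assumed) tiling: the grid is capped at the SM count queried with cudaDeviceGetAttribute, and each block strides through the flattened list of output tiles. The inner K-loop is simplified to global-memory reads for brevity; a real kernel would keep the shared-memory staging from the tiled version above.

```cuda
#define TILE 32  // same assumed tile size as above

// Persistent matmul: roughly one thread block per SM, each looping over
// many output tiles instead of terminating after a single tile.
__global__ void persistent_matmul(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
    int tiles_m = (M + TILE - 1) / TILE;
    int tiles_n = (N + TILE - 1) / TILE;
    int num_tiles = tiles_m * tiles_n;

    // Grid-stride loop over output tiles: block 0 handles tiles 0, gridDim.x, ...
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int row = (tile / tiles_n) * TILE + threadIdx.y;
        int col = (tile % tiles_n) * TILE + threadIdx.x;

        float acc = 0.0f;
        // Simplified K-loop (global-memory reads only, for brevity).
        for (int k = 0; k < K; ++k)
            acc += (row < M ? A[row * K + k] : 0.0f) *
                   (col < N ? B[k * N + col] : 0.0f);

        if (row < M && col < N)
            C[row * N + col] = acc;
    }
}

// Host side: cap the grid at the number of SMs on the current device.
// int num_sms;
// cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, /*device=*/0);
// persistent_matmul<<<num_sms, dim3(TILE, TILE)>>>(A, B, C, M, N, K);
```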
The key benefit is the reduced thread-block launch latency. Combined with other optimizations such as coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, and ping-pong scheduling, this persistence strategy helps achieve peak performance. More on this in the future!