---
license: apache-2.0
---

vllm-flash-attn3

This is an implementation of Flash Attention 3 CUDA kernels with support for attention sinks. The attention sinks implementation was contributed to Flash Attention by the vLLM team. The transformers team packaged the implementation and pre-built it for use with the kernels library.

How to Use

When loading your model with transformers, provide this repository id as the source of the attention implementation:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your model id on the Hub>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    # Flash Attention with Sinks
    attn_implementation="kernels-community/vllm-flash-attn3",
)
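
Once the model is loaded, the kernel is used transparently by the attention layers and generation works as usual. A minimal sketch (the prompt and max_new_tokens value below are only illustrative):

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))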

Providing this repository id as the attention implementation will automatically resolve and download the appropriate code for your architecture. See more details in this post.
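
Resolution and download are handled by the kernels library. If you prefer to fetch the kernel ahead of time, or to inspect what it exposes, you can load it directly with get_kernel. This is a sketch only: the public functions are defined by the kernel itself, so the listing below simply prints whatever it provides rather than assuming specific names.

from kernels import get_kernel

# Downloads the build matching the local environment (if not already cached) and loads it
vllm_flash_attn3 = get_kernel("kernels-community/vllm-flash-attn3")

# List the public functions the kernel module exposes
print([name for name in dir(vllm_flash_attn3) if not name.startswith("_")])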

Credits