---
license: apache-2.0
---

vllm-flash-attn3

This is an implementation of Flash Attention 3 CUDA kernels with support for attention sinks. The attention sinks implementation was contributed to Flash Attention by the vLLM team. The transformers team packaged the implementation and pre-built it for use with the kernels library.

How to Use

When loading your model with transformers, provide this repository id as the source of the attention implementation:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your model id on the Hub>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    # Flash Attention with Sinks
    attn_implementation="kernels-community/vllm-flash-attn3",
)
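
Once the model is loaded, the kernel is used transparently by the attention layers and generation works as usual. A minimal sketch (the prompt and max_new_tokens value below are only illustrative):

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))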

Providing this repository id as the attention implementation will automatically resolve and download the appropriate code for your architecture. See more details in this post.
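
Resolution and download are handled by the kernels library. If you prefer to fetch the kernel ahead of time, or to inspect what it exposes, you can load it directly with get_kernel. This is a sketch only: the public functions are defined by the kernel itself, so the listing below simply prints whatever it provides rather than assuming specific names.

from kernels import get_kernel

# Downloads the build matching the local environment (if not already cached) and loads it
vllm_flash_attn3 = get_kernel("kernels-community/vllm-flash-attn3")

# List the public functions the kernel module exposes
print([name for name in dir(vllm_flash_attn3) if not name.startswith("_")])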

Credits