README for vllm-flash-attn3

#1 by pcuenq - opened

Files changed (1): README.md (+34 -0, new file)

---
license: apache-2.0
---

# vllm-flash-attn3

This is an implementation of Flash Attention 3 CUDA kernels with support for attention sinks. The attention sinks implementation was contributed to Flash Attention by the [vLLM team](https://huggingface.co/vllm-project). The [transformers team](https://huggingface.co/transformers-community) packaged the implementation and pre-built it for use with the [kernels library](https://github.com/huggingface/kernels).
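
The pre-built binaries can also be fetched directly with the [kernels library](https://github.com/huggingface/kernels). Below is a minimal sketch, assuming a recent `kernels` release and a compatible PyTorch/CUDA environment; the exact functions exposed by the loaded module are not listed in this README, so the snippet only inspects what is available rather than calling a specific entry point.

```python
# Hedged sketch: load the pre-built kernel directly via the kernels library.
# The repository id matches this repo; what the module exposes is inspected
# at runtime rather than assumed.
from kernels import get_kernel

vllm_flash_attn3 = get_kernel("kernels-community/vllm-flash-attn3")
print([name for name in dir(vllm_flash_attn3) if not name.startswith("_")])
```

For most users, the transformers integration described below is the simpler path.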

## How to Use

When loading your model with transformers, provide this repository id as the source of the attention implementation:

```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your model id on the Hub>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Flash Attention with Sinks
+   attn_implementation="kernels-community/vllm-flash-attn3",
)
```

This will automatically resolve and download the appropriate pre-built kernel code for your architecture. See [this post](https://huggingface.co/blog/hello-hf-kernels) for more details.
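
For a quick sanity check once the model is loaded, a short generation call looks like this (the prompt and `max_new_tokens` value are illustrative placeholders, not part of the example above):

```python
# Hedged usage sketch: run a short generation with the model loaded above.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```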

## Credits

- [Tri Dao](https://huggingface.co/tridao) and team for Flash Attention and [Flash Attention 3](https://tridao.me/blog/2024/flash3/).
- The [vLLM team](https://huggingface.co/vllm-project) for their implementation and their contribution of attention sinks.
- The [transformers team](https://huggingface.co/transformers-community) for packaging, testing, building and making it available for use with the [kernels library](https://github.com/huggingface/kernels).