README for vllm-flash-attn3

#1 by pcuenq - opened

Files changed (1): README.md (+34 -0, new file)

---
license: apache-2.0
---

# vllm-flash-attn3

This is an implementation of Flash Attention 3 CUDA kernels with support for attention sinks. The attention sinks implementation was contributed to Flash Attention by the [vLLM team](https://huggingface.co/vllm-project). The [transformers team](https://huggingface.co/transformers-community) packaged the implementation and pre-built it for use with the [kernels library](https://github.com/huggingface/kernels).
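
The pre-built binaries can also be fetched directly with the [kernels library](https://github.com/huggingface/kernels). Below is a minimal sketch, assuming a recent `kernels` release and a compatible PyTorch/CUDA environment; the exact functions exposed by the loaded module are not listed in this README, so the snippet only inspects what is available rather than calling a specific entry point.

```python
# Hedged sketch: load the pre-built kernel directly via the kernels library.
# The repository id matches this repo; what the module exposes is inspected
# at runtime rather than assumed.
from kernels import get_kernel

vllm_flash_attn3 = get_kernel("kernels-community/vllm-flash-attn3")
print([name for name in dir(vllm_flash_attn3) if not name.startswith("_")])
```

For most users, the transformers integration described below is the simpler path.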

## How to Use

When loading your model with transformers, provide this repository id as the source of the attention implementation:

```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your model id on the Hub>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Flash Attention with Sinks
+   attn_implementation="kernels-community/vllm-flash-attn3",
)
```

This will automatically resolve and download the appropriate pre-built kernel code for your architecture. See [this post](https://huggingface.co/blog/hello-hf-kernels) for more details.
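
For a quick sanity check once the model is loaded, a short generation call looks like this (the prompt and `max_new_tokens` value are illustrative placeholders, not part of the example above):

```python
# Hedged usage sketch: run a short generation with the model loaded above.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```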

## Credits

- [Tri Dao](https://huggingface.co/tridao) and team for Flash Attention and [Flash Attention 3](https://tridao.me/blog/2024/flash3/).
- The [vLLM team](https://huggingface.co/vllm-project) for their implementation and their contribution of attention sinks.
- The [transformers team](https://huggingface.co/transformers-community) for packaging, testing, building and making it available for use with the [kernels library](https://github.com/huggingface/kernels).