---
license: apache-2.0
---
# vllm-flash-attn3

This is an implementation of the Flash Attention 3 CUDA kernels with support for attention sinks. The attention sinks implementation was contributed to Flash Attention by the [vLLM team](https://huggingface.co/vllm-project). The [transformers team](https://huggingface.co/transformers-community) packaged the implementation and pre-built it for use with the [kernels library](https://github.com/huggingface/kernels).
## How to Use

When loading your model with transformers, pass this repository ID as the attention implementation:
```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your model id on the Hub>"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Flash Attention 3 with attention sinks
+   attn_implementation="kernels-community/vllm-flash-attn3",
)
```
This will automatically resolve and download the pre-built kernel that matches your environment. See more details in [this post](https://huggingface.co/blog/hello-hf-kernels).
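Once the model is loaded this way, inference works exactly as with any other attention implementation. Below is a minimal end-to-end sketch: the placeholder model ID, prompt, and `max_new_tokens` value are illustrative only, and the checkpoint you choose must be one whose architecture supports this backend, running on a GPU supported by Flash Attention 3.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with a checkpoint whose architecture supports this backend.
model_id = "<your model id on the Hub>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    # Resolve the attention kernel from this repository at load time.
    attn_implementation="kernels-community/vllm-flash-attn3",
)

# Tokenize a prompt, move it to the model's device, and generate.
inputs = tokenizer("Flash Attention 3 is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Apart from the `attn_implementation` argument, nothing in the loading or generation code changes, so the kernel can be swapped in or out without touching the rest of your pipeline.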
## Credits

- [Tri Dao](https://huggingface.co/tridao) and team for Flash Attention and [Flash Attention 3](https://tridao.me/blog/2024/flash3/).
- The [vLLM team](https://huggingface.co/vllm-project) for their implementation and their contribution of attention sinks.
- The [transformers team](https://huggingface.co/transformers-community) for packaging, testing, building, and making it available for use with the [kernels library](https://github.com/huggingface/kernels).