---
license: apache-2.0
---
# vllm-flash-attn3

This is an implementation of the Flash Attention 3 CUDA kernels with support for attention sinks. The attention sinks implementation was contributed to Flash Attention by the [vLLM team](https://huggingface.co/vllm-project). The [transformers team](https://huggingface.co/transformers-community) packaged the implementation and pre-built it for use with the [kernels library](https://github.com/huggingface/kernels).
## How to Use

When loading your model with transformers, pass this repository ID as the attention implementation:
```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your model id on the Hub>"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Flash Attention 3 with attention sinks
+   attn_implementation="kernels-community/vllm-flash-attn3",
)
```
This will automatically resolve and download the pre-built kernel that matches your environment. See more details in [this post](https://huggingface.co/blog/hello-hf-kernels).
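Once the model is loaded this way, inference works exactly as with any other attention implementation. Below is a minimal end-to-end sketch: the placeholder model ID, prompt, and `max_new_tokens` value are illustrative only, and the checkpoint you choose must be one whose architecture supports this backend, running on a GPU supported by Flash Attention 3.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with a checkpoint whose architecture supports this backend.
model_id = "<your model id on the Hub>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    # Resolve the attention kernel from this repository at load time.
    attn_implementation="kernels-community/vllm-flash-attn3",
)

# Tokenize a prompt, move it to the model's device, and generate.
inputs = tokenizer("Flash Attention 3 is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Apart from the `attn_implementation` argument, nothing in the loading or generation code changes, so the kernel can be swapped in or out without touching the rest of your pipeline.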
## Credits

- [Tri Dao](https://huggingface.co/tridao) and team for Flash Attention and [Flash Attention 3](https://tridao.me/blog/2024/flash3/).
- The [vLLM team](https://huggingface.co/vllm-project) for their implementation and their contribution of attention sinks.
- The [transformers team](https://huggingface.co/transformers-community) for packaging, testing, building, and making it available for use with the [kernels library](https://github.com/huggingface/kernels).