
FlexAttention + FlashAttention-4: Fast and Flexible

TL;DR: On Hopper and Blackwell GPUs, FlexAttention now has a FlashAttention-4 backend.

PyTorch Blog · ~2 min read
  • The PyTorch team added support in PyTorch to automatically generate CuTeDSL score/mask modification functions and to JIT-instantiate FlashAttention-4 for custom attention variants.
  • This yields performance gains of 1.2× to 3.2× over the existing Triton implementation on compute-bound workloads.

Context

The PyTorch team added support in PyTorch to automatically generate CuTeDSL score/mask modification functions and to JIT-instantiate FlashAttention-4 for custom attention variants. This yields performance gains of 1.2× to 3.2× over the existing Triton implementation on compute-bound workloads.

FlexAttention recap

FlexAttention is a PyTorch API that lets you implement custom attention variants in a few lines of Python, no CUDA required. You write a score_mod or mask_mod function that modifies attention scores, and the compiler handles the rest: ALiBi, sliding window, document masking, soft-capping, and combinations of these all work through the same interface. Under the hood, it is two extensions over vanilla FlashAttention:

  • Pointwise modifications to pre-softmax scores, with arbitrary loads from global memory.
  • Block-sparse iteration for both forward and backward passes, with a simple data structure for encoding data-dependent sparsity at runtime.

Of course, the devil is in the details, but as shown in the original FlexAttention post and FlexAttention for inference, these two extensions cover a wide range of popular attention variants. With this release, FlexAttention now…
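To make the score_mod/mask_mod contract concrete, here is a minimal eager-mode sketch of the semantics FlexAttention compiles into a fused kernel: compute scores, apply a pointwise score_mod, apply a boolean mask_mod, then softmax. The specific functions (`alibi_score_mod`, `sliding_window_mask`), the slope values, and the window size are illustrative assumptions, not the blog's code; this reference version materializes the full score matrix, which the real fused kernel never does.

```python
import torch

# score_mod contract: (score, batch, head, q_idx, kv_idx) -> new score.
# ALiBi (illustrative): subtract a per-head linear penalty on query/key distance.
def alibi_score_mod(score, b, h, q_idx, kv_idx, slopes):
    return score - slopes[h] * (q_idx - kv_idx)

# mask_mod contract: (batch, head, q_idx, kv_idx) -> bool (True = attend).
# Causal sliding window of width `window` (illustrative).
def sliding_window_mask(b, h, q_idx, kv_idx, window=4):
    return (kv_idx <= q_idx) & (q_idx - kv_idx < window)

def eager_flex_attention(q, k, v, slopes, window=4):
    """Reference semantics only: scores -> score_mod -> mask_mod -> softmax."""
    B, H, L, D = q.shape
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / (D ** 0.5)
    q_idx = torch.arange(L).view(L, 1)   # broadcasts over the kv axis
    kv_idx = torch.arange(L).view(1, L)  # broadcasts over the query axis
    for h in range(H):  # apply the pointwise score_mod per head
        scores[:, h] = alibi_score_mod(scores[:, h], None, h, q_idx, kv_idx, slopes)
    keep = sliding_window_mask(None, None, q_idx, kv_idx, window)
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

B, H, L, D = 1, 2, 8, 16
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))
slopes = torch.tensor([0.25, 0.0625])  # one slope per head (assumed values)
out = eager_flex_attention(q, k, v, slopes)
print(out.shape)  # torch.Size([1, 2, 8, 16])
```

In actual use you would pass such functions to `torch.nn.attention.flex_attention.flex_attention` (with `create_block_mask` turning the mask_mod into the runtime block-sparsity structure), and the compiler generates the fused kernel rather than this dense reference.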

