
FlexAttention + FlashAttention-4: Fast and Flexible

TL;DR: On Hopper and Blackwell GPUs, FlexAttention now has a FlashAttention-4 backend.

PyTorch Blog · ~2 min read
  • The PyTorch team added support in PyTorch to automatically generate CuTeDSL score/mask modification functions and to JIT-instantiate FlashAttention-4 for custom attention variants.
  • This yields performance gains of 1.2× to 3.2× over the existing Triton implementation on compute-bound workloads.

Context

The PyTorch team added support in PyTorch to automatically generate CuTeDSL score/mask modification functions and to JIT-instantiate FlashAttention-4 for custom attention variants. This yields performance gains of 1.2× to 3.2× over the existing Triton implementation on compute-bound workloads.

FlexAttention recap

FlexAttention is a PyTorch API that lets you implement custom attention variants in a few lines of Python, no CUDA required. You write a score_mod or mask_mod function that modifies attention scores, and the compiler handles the rest: ALiBi, sliding window, document masking, soft-capping, and combinations of these all work through the same interface. Under the hood, it is two extensions over vanilla FlashAttention:

  • Pointwise modifications to pre-softmax scores, with arbitrary loads from global memory.
  • Block-sparse iteration for both forward and backward passes, with a simple data structure for encoding data-dependent sparsity at runtime.

Of course, the devil is in the details, but as shown in the original FlexAttention post and FlexAttention for inference, these two extensions cover a wide range of popular attention variants. With this release, FlexAttention now…
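To make the score_mod/mask_mod contract concrete, here is a minimal eager-mode sketch of the semantics FlexAttention compiles into a fused kernel: compute scores, apply a pointwise score_mod, apply a boolean mask_mod, then softmax. The specific functions (`alibi_score_mod`, `sliding_window_mask`), the slope values, and the window size are illustrative assumptions, not the blog's code; this reference version materializes the full score matrix, which the real fused kernel never does.

```python
import torch

# score_mod contract: (score, batch, head, q_idx, kv_idx) -> new score.
# ALiBi (illustrative): subtract a per-head linear penalty on query/key distance.
def alibi_score_mod(score, b, h, q_idx, kv_idx, slopes):
    return score - slopes[h] * (q_idx - kv_idx)

# mask_mod contract: (batch, head, q_idx, kv_idx) -> bool (True = attend).
# Causal sliding window of width `window` (illustrative).
def sliding_window_mask(b, h, q_idx, kv_idx, window=4):
    return (kv_idx <= q_idx) & (q_idx - kv_idx < window)

def eager_flex_attention(q, k, v, slopes, window=4):
    """Reference semantics only: scores -> score_mod -> mask_mod -> softmax."""
    B, H, L, D = q.shape
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / (D ** 0.5)
    q_idx = torch.arange(L).view(L, 1)   # broadcasts over the kv axis
    kv_idx = torch.arange(L).view(1, L)  # broadcasts over the query axis
    for h in range(H):  # apply the pointwise score_mod per head
        scores[:, h] = alibi_score_mod(scores[:, h], None, h, q_idx, kv_idx, slopes)
    keep = sliding_window_mask(None, None, q_idx, kv_idx, window)
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

B, H, L, D = 1, 2, 8, 16
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))
slopes = torch.tensor([0.25, 0.0625])  # one slope per head (assumed values)
out = eager_flex_attention(q, k, v, slopes)
print(out.shape)  # torch.Size([1, 2, 8, 16])
```

In actual use you would pass such functions to `torch.nn.attention.flex_attention.flex_attention` (with `create_block_mask` turning the mask_mod into the runtime block-sparsity structure), and the compiler generates the fused kernel rather than this dense reference.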

