The Problem
FlashAttention fused the attention computation (matmul, softmax, matmul) into a single kernel, cutting memory transfers and speeding up transformer inference. But there's a hardware mismatch: tensor cores handle the fast matrix multiplications, while the slower CUDA cores handle softmax. While softmax runs, the tensor units sit idle. As NVIDIA ships ever-faster tensor hardware, this bottleneck worsens.
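To make the split concrete, here is a naive (unfused) attention in numpy, annotated with which hardware unit each step maps to. This is an illustrative sketch of the computation, not the fused kernel itself; the function name and shapes are my own.

```python
import numpy as np

def attention(Q, K, V):
    """Naive scaled dot-product attention, annotated with the
    tensor-core / CUDA-core split described above (illustrative)."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)   # matmul: runs on tensor cores (fast)
    # Softmax: element-wise exp and row sums run on CUDA (vector)
    # cores; the tensor cores sit idle during this step.
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V               # matmul: tensor cores again
```

The two matmuls bracket a softmax that the tensor units cannot help with; that serialization is the idle time the paper attacks.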
The Approach
FlashAttention-T repurposes tensor matrix multiply-add (MMA) instructions to run softmax primitives like element-wise scaling. The team built a tensorized online softmax algorithm that maintains numerical stability while running on repurposed tensor cores. They parallelized softmax across both tensor and vector units, eliminating the wait.
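The key numerical ingredient is online softmax: process K/V in blocks, track a running row maximum and denominator, and rescale the accumulated output as each block arrives. The sketch below writes that rescaling step as a diagonal-matrix multiply, one common way to express row-wise scaling in MMA form; the paper's exact kernel-level formulation may differ, and all names here are illustrative.

```python
import numpy as np

def online_softmax_attention(Q, K, V, block=2):
    """Streaming (online) softmax attention over K/V blocks, the
    numerically stable scheme FlashAttention builds on. Illustrative
    sketch; not the paper's kernel."""
    n, d = Q.shape
    m = np.full(n, -np.inf)   # running row maxima
    l = np.zeros(n)           # running softmax denominators
    O = np.zeros((n, d))      # running (unnormalized) output
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T * scale
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)          # per-row correction factors
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=1)
        # Rescale the accumulator: O = diag(alpha) @ O + P @ V_block.
        # diag(alpha) @ O equals alpha[:, None] * O, but the matmul
        # form is the shape of operation an MMA unit can execute.
        O = np.diag(alpha) @ O + P @ V[j:j + block]
        m = m_new
    return O / l[:, None]
```

Element-wise scaling recast as a small matmul is exactly the kind of rewrite that lets tensor units take over work the vector units used to do.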
The implementation targets NVIDIA Ampere (A100, AGX Orin) and Hopper (H100) GPUs specifically. Architecture-aware scheduling ensures both unit types stay busy.
The Results
On A100s, the vector interval ratio (a measure of how long tensor cores sit idle waiting on vector units) dropped by 1.17-2.18x compared to baseline FlashAttention. On H100s, tensor units now idle just 2.7% of the time. Average speedups reached 1.17x over FlashAttention-2 and FlashAttention-3, with accuracy maintained.
For context: FlashAttention-2 already hit 72% model FLOPs utilization (225 TFLOPs/s on A100s). Even small gains matter at this efficiency level. The original FlashAttention enabled 2x longer sequences with 3x speedup on GPT-2. This iteration squeezes more from the same silicon.
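The 72% figure is consistent with the A100's spec-sheet peak of 312 TFLOPs/s for dense FP16/BF16 tensor-core math (that peak number is an assumption from NVIDIA's published specs, not from the article):

```python
PEAK_TFLOPS = 312.0   # assumed A100 dense FP16/BF16 tensor-core peak
achieved = 225.0      # FlashAttention-2 throughput quoted above
utilization = achieved / PEAK_TFLOPS
print(f"{utilization:.0%}")
```

At that utilization, most of the remaining gap is softmax and scheduling overhead rather than raw matmul throughput, which is why a 1.17x speedup on top of it is notable.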
What This Means
Teams running long-context models (16k-64k tokens) on NVIDIA hardware get measurable throughput gains without changing model architecture. The catch: this is an NVIDIA-specific optimization. Portability to AMD or other accelerators isn't clear.
The broader pattern: as tensor cores get faster, software needs to evolve beyond just using them for GEMMs. FlashAttention-T shows one path: fully tensorizing operations that previously ran on slower units. Worth watching if you're planning H100 or H200 deployments for LLM inference.
The artifact is available on Zenodo for teams wanting to benchmark against their workloads. This is incremental progress, not revolution, but incremental matters when you're burning thousands of GPU hours monthly.