The Problem
FlashAttention fused the attention computation (matmul, softmax, matmul) into a single kernel, cutting memory transfers and speeding up transformer inference. But there's a hardware mismatch: tensor cores handle the fast matrix multiplications, while the slower CUDA cores handle softmax. While softmax runs, the tensor units sit idle. As NVIDIA ships ever-faster tensor hardware, this bottleneck worsens.
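To make the split concrete, here is a naive (unfused) attention in numpy, annotated with which hardware unit each step maps to. This is an illustrative sketch of the computation, not the fused kernel itself; the function name and shapes are my own.

```python
import numpy as np

def attention(Q, K, V):
    """Naive scaled dot-product attention, annotated with the
    tensor-core / CUDA-core split described above (illustrative)."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)   # matmul: runs on tensor cores (fast)
    # Softmax: element-wise exp and row sums run on CUDA (vector)
    # cores; the tensor cores sit idle during this step.
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V               # matmul: tensor cores again
```

The two matmuls bracket a softmax that the tensor units cannot help with; that serialization is the idle time the paper attacks.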
The Approach
FlashAttention-T repurposes tensor matrix multiply-add (MMA) instructions to run softmax primitives like element-wise scaling. The team built a tensorized online softmax algorithm that maintains numerical stability while running on repurposed tensor cores. They parallelized softmax across both tensor and vector units, eliminating the wait.
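The key numerical ingredient is online softmax: process K/V in blocks, track a running row maximum and denominator, and rescale the accumulated output as each block arrives. The sketch below writes that rescaling step as a diagonal-matrix multiply, one common way to express row-wise scaling in MMA form; the paper's exact kernel-level formulation may differ, and all names here are illustrative.

```python
import numpy as np

def online_softmax_attention(Q, K, V, block=2):
    """Streaming (online) softmax attention over K/V blocks, the
    numerically stable scheme FlashAttention builds on. Illustrative
    sketch; not the paper's kernel."""
    n, d = Q.shape
    m = np.full(n, -np.inf)   # running row maxima
    l = np.zeros(n)           # running softmax denominators
    O = np.zeros((n, d))      # running (unnormalized) output
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T * scale
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)          # per-row correction factors
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=1)
        # Rescale the accumulator: O = diag(alpha) @ O + P @ V_block.
        # diag(alpha) @ O equals alpha[:, None] * O, but the matmul
        # form is the shape of operation an MMA unit can execute.
        O = np.diag(alpha) @ O + P @ V[j:j + block]
        m = m_new
    return O / l[:, None]
```

Element-wise scaling recast as a small matmul is exactly the kind of rewrite that lets tensor units take over work the vector units used to do.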
The implementation targets NVIDIA Ampere (A100, AGX Orin) and Hopper (H100) GPUs specifically. Architecture-aware scheduling ensures both unit types stay busy.
The Results
On A100s, the vector interval ratio (a measure of how long tensor cores sit idle waiting on vector units) dropped by 1.17-2.18x compared to baseline FlashAttention. On H100s, tensor units now idle just 2.7% of the time. Average speedups reached 1.17x over FlashAttention-2 and FlashAttention-3, with accuracy maintained.
For context: FlashAttention-2 already hit 72% model FLOPs utilization (225 TFLOPs/s on A100s). Even small gains matter at this efficiency level. The original FlashAttention enabled 2x longer sequences with 3x speedup on GPT-2. This iteration squeezes more from the same silicon.
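The 72% figure is consistent with the A100's spec-sheet peak of 312 TFLOPs/s for dense FP16/BF16 tensor-core math (that peak number is an assumption from NVIDIA's published specs, not from the article):

```python
PEAK_TFLOPS = 312.0   # assumed A100 dense FP16/BF16 tensor-core peak
achieved = 225.0      # FlashAttention-2 throughput quoted above
utilization = achieved / PEAK_TFLOPS
print(f"{utilization:.0%}")
```

At that utilization, most of the remaining gap is softmax and scheduling overhead rather than raw matmul throughput, which is why a 1.17x speedup on top of it is notable.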
What This Means
Teams running long-context models (16k-64k tokens) on NVIDIA hardware get measurable throughput gains without changing model architecture. The catch: this is an NVIDIA-specific optimization. Portability to AMD or other accelerators isn't clear.
The broader pattern: as tensor cores get faster, software needs to evolve beyond just using them for GEMMs. FlashAttention-T shows one path: fully tensorizing operations that previously ran on slower units. Worth watching if you're planning H100 or H200 deployments for LLM inference.
The artifact is available on Zenodo for teams wanting to benchmark against their workloads. This is incremental progress, not revolution, but incremental matters when you're burning thousands of GPU hours monthly.