FlashMLA is a production CUDA kernel that optimizes Multi-head Latent Attention (MLA) inference on NVIDIA Hopper GPUs (H100/H800). Developed by DeepSeek-AI, it addresses a critical bottleneck in modern LLM inference: decode-time attention is usually memory-bound rather than compute-bound, so traditional kernel designs leave GPU compute idle while waiting on memory. FlashMLA reaches up to 3000 GB/s of memory bandwidth in memory-bound dense decoding and up to 660 TFLOPS in compute-bound configurations. It gets close to the hardware's limits through kernel-level scheduling that overlaps CUDA Core operations, Tensor Core operations, and memory transfers.
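The memory-bound versus compute-bound distinction can be made concrete with a roofline-style calculation. The sketch below is illustrative only: it treats the two throughput figures quoted above as stand-in hardware ceilings (an assumption, since they are achieved rather than theoretical numbers), and the function names are hypothetical, not part of FlashMLA.

```python
# Figures quoted above, used here as stand-in hardware ceilings (an assumption).
PEAK_FLOPS = 660 * 10**12      # 660 TFLOPS, in FLOPs per second
PEAK_BANDWIDTH = 3000 * 10**9  # 3000 GB/s, in bytes per second


def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOPs per byte) where the two ceilings meet."""
    return peak_flops / peak_bandwidth


def is_memory_bound(flops, bytes_moved,
                    peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH):
    """True when the workload's FLOPs-per-byte falls below the ridge point."""
    intensity = flops / bytes_moved
    return intensity < ridge_point(peak_flops, peak_bandwidth)


# Hypothetical workload: one multiply-add (2 FLOPs) per 2-byte BF16 value read.
print(ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH))  # 220.0
print(is_memory_bound(flops=2, bytes_moved=2))  # True
```

Under these assumptions, a kernel needs roughly 220 FLOPs per byte moved before compute becomes the limit, which is why a workload that performs only a few operations per value read spends most of its time waiting on memory.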

The implementation combines several optimization strategies. FlashMLA uses programmatic dependent launch to overlap the splitkv_mla and combine kernels, reducing synchronization overhead. A tile scheduler assigns jobs to streaming multiprocessors to balance load. The kernel supports BF16 precision natively and implements a paged KV cache with a block size of 64 tokens, which reduces memory fragmentation compared to contiguous allocations. For sparse workloads using an FP8 KV cache, throughput reaches 410 TFLOPS. Padding-free handling of variable-length sequences further improves efficiency for batched inference.
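The paged KV cache idea can be sketched as a simple address translation. This is a minimal illustration, assuming the block size of 64 is counted in tokens; the function names and the example block table are hypothetical and do not reflect FlashMLA's internal layout.

```python
BLOCK_SIZE = 64  # tokens per KV cache block (assumed unit: tokens)


def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of cache blocks required to hold seq_len tokens (ceiling division)."""
    return (seq_len + block_size - 1) // block_size


def locate_token(block_table: list[int], token_idx: int,
                 block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Map a logical token position to (physical block id, offset within block)."""
    logical_block, offset = divmod(token_idx, block_size)
    return block_table[logical_block], offset


# Hypothetical block table: logical blocks 0, 1, 2 live in physical blocks 7, 2, 9.
block_table = [7, 2, 9]

print(blocks_needed(130))              # 3
print(locate_token(block_table, 129))  # (9, 1)
```

Because each sequence only claims whole blocks as it grows, physical blocks need not be contiguous, and sequences of different lengths can share one cache pool without padding to a common length.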

DeepSeek released FlashMLA during its Open Source Week, targeting inference infrastructure teams that operate large model deployments. The kernel integrates with the vLLM and SGLang inference engines, allowing drop-in speedups for production LLM APIs. Infrastructure providers hosting DeepSeek or other MLA-based models can see 2-3x throughput improvements. For research teams fine-tuning MLA architectures, FlashMLA serves as a reference implementation of memory-efficient kernel design, with techniques that apply beyond MLA to attention optimization in general.

Alternatives

DeepEP

DeepGEMM

Related Tools

KubeAI

Freestyle

OpenSRE

Twill AI

Baseten

Resolve AI
