FlashMLA is a production CUDA kernel that optimizes Multi-head Latent Attention (MLA) inference on NVIDIA Hopper GPUs (H100/H800). Developed by DeepSeek-AI, it addresses a critical bottleneck in modern LLM inference: decode-time attention is typically memory-bound rather than compute-bound, so traditional kernel designs leave GPU compute idle while waiting on memory. On an H800 SXM5, FlashMLA reaches up to 3000 GB/s of memory throughput in memory-bound dense decoding and up to 660 TFLOPS in compute-bound configurations. It approaches the hardware's limits through kernel-level scheduling that overlaps CUDA Core operations, Tensor Core operations, and memory transfers.
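The memory-bound versus compute-bound distinction can be made concrete with a simple roofline estimate. The sketch below is illustrative only: the hardware peaks (about 989 dense BF16 TFLOPS and 3350 GB/s of HBM bandwidth for an H100/H800 SXM-class part) and the cache layout (512 latent plus 64 RoPE dimensions per token in BF16) are assumptions, the function names are hypothetical, and traffic for queries and outputs is ignored.

```python
# Assumed hardware peaks for an H100/H800 SXM-class GPU (illustrative, not measured).
PEAK_TFLOPS = 989.0   # dense BF16 Tensor Core throughput
PEAK_GBPS = 3350.0    # HBM bandwidth

# Assumed MLA cache layout: 512 latent dims + 64 RoPE dims per token, BF16.
D_QK, D_V, BYTES_PER_ELEM = 576, 512, 2


def ridge_point(peak_tflops: float = PEAK_TFLOPS, peak_gbps: float = PEAK_GBPS) -> float:
    """FLOPs per byte above which a kernel becomes compute-bound."""
    return (peak_tflops * 1e12) / (peak_gbps * 1e9)


def decode_intensity(num_q_heads: int, q_tokens: int = 1) -> float:
    """FLOPs performed per byte of KV cache read during decoding.

    Each cached token is read once and reused by every query head and
    query token, so intensity grows with num_q_heads * q_tokens.
    """
    flops_per_kv_token = 2 * (D_QK + D_V) * num_q_heads * q_tokens  # QK^T plus PV
    bytes_per_kv_token = D_QK * BYTES_PER_ELEM
    return flops_per_kv_token / bytes_per_kv_token


def is_memory_bound(intensity: float) -> bool:
    """True when the kernel sits left of the roofline ridge point."""
    return intensity < ridge_point()


def attainable_tflops(intensity: float) -> float:
    """Roofline ceiling: the lower of the compute peak and the bandwidth limit."""
    return min(PEAK_TFLOPS, intensity * PEAK_GBPS * 1e9 / 1e12)


for heads, q_tokens in [(16, 1), (128, 1), (128, 2)]:
    ai = decode_intensity(heads, q_tokens)
    bound = "memory" if is_memory_bound(ai) else "compute"
    print(f"heads={heads:3d} q_tokens={q_tokens}: {ai:6.1f} FLOP/B, "
          f"{bound}-bound, ceiling {attainable_tflops(ai):.0f} TFLOPS")
```

Under these assumptions, a configuration with few query heads per GPU stays well below the ridge point and is limited by bandwidth, while many heads combined with several query tokens per step crosses into the compute-bound regime. This is why both a GB/s figure and a TFLOPS figure are meaningful for the same kernel.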
The implementation combines several optimization strategies. FlashMLA uses programmatic dependent launch to overlap the splitkv_mla and combine kernels, reducing synchronization overhead. A tile scheduler distributes work across streaming multiprocessors to balance load. The kernel supports BF16 natively and implements a paged KV cache with a block size of 64 tokens, which reduces fragmentation and over-allocation compared with contiguous per-sequence buffers. For sparse workloads using an FP8 KV cache, throughput reaches 410 TFLOPS. Padding-free handling of variable-length sequences further improves efficiency for batched inference.
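To illustrate how a paged KV cache and a split-then-combine attention pass fit together, here is a minimal pure-Python reference sketch. It is not FlashMLA's code: the helper names (`lookup`, `gather`, `partial_attention`, `combine`) are hypothetical, the split is made at page boundaries purely for simplicity (a real scheduler chooses splits to balance load), and the softmax scale factor is omitted.

```python
import math
import random

BLOCK_SIZE = 64  # tokens per KV-cache page


def lookup(block_table, pos):
    """Map a logical token position to (physical page, offset within page)."""
    return block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE


def gather(pool, block_table, start, stop):
    """Collect rows for logical positions [start, stop) from scattered pages."""
    rows = []
    for pos in range(start, stop):
        page, offset = lookup(block_table, pos)
        rows.append(pool[page][offset])
    return rows


def partial_attention(q, keys, values):
    """Attention over one KV chunk: normalized output plus log-sum-exp of scores."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def combine(partials):
    """Merge per-chunk outputs, weighting each by exp(lse) of its chunk."""
    m = max(lse for _, lse in partials)
    scales = [math.exp(lse - m) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * o[d] for s, (o, _) in zip(scales, partials)) / total
            for d in range(dim)]


random.seed(0)
dim, seq_len, num_pages = 8, 150, 8


def random_pool():
    return [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(BLOCK_SIZE)]
            for _ in range(num_pages)]


k_pool, v_pool = random_pool(), random_pool()
block_table = [5, 2, 7]  # pages of one sequence need not be contiguous
q = [random.gauss(0, 1) for _ in range(dim)]

# Split phase: one partial result per page of the sequence.
partials = []
for start in range(0, seq_len, BLOCK_SIZE):
    stop = min(start + BLOCK_SIZE, seq_len)
    partials.append(partial_attention(q,
                                      gather(k_pool, block_table, start, stop),
                                      gather(v_pool, block_table, start, stop)))

# Combine phase, checked against attention over the whole sequence at once.
split_out = combine(partials)
full_out, _ = partial_attention(q,
                                gather(k_pool, block_table, 0, seq_len),
                                gather(v_pool, block_table, 0, seq_len))
assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for a, b in zip(split_out, full_out))
```

Because each partial result carries its own log-sum-exp, the chunks can be computed independently and merged afterwards without revisiting the KV cache. That independence is what allows a split kernel and a combine kernel to be scheduled separately and overlapped.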
DeepSeek released FlashMLA as part of its Open Source Week initiative, targeting infrastructure teams that operate large model deployments. The kernel integrates with the vLLM and SGLang inference engines, allowing drop-in speedups for production LLM APIs. Infrastructure providers hosting DeepSeek or other MLA-based models can see throughput improvements of roughly 2-3x, depending on workload. For research teams fine-tuning MLA architectures, FlashMLA also serves as a reference implementation of memory-efficient kernel design, with techniques that apply beyond MLA to attention optimization in general.