FlashMLA

DeepSeek's optimized attention kernel for Multi-Head Latent Attention


FlashMLA is DeepSeek's MIT-licensed CUDA kernel library for optimized attention in DeepSeek-V3 and DeepSeek-V3.2-Exp style inference. It includes dense MLA decoding kernels plus sparse attention kernels for DeepSeek Sparse Attention. The README reports H800 figures of up to 3000 GB/s and 660 TFLOPS for the dense decoding kernel, and 640 and 410 TFLOPS for the sparse attention paths. The repository has 12.7K+ GitHub stars.

Review by the aicoolies team

FlashMLA is a production CUDA kernel library that optimizes Multi-head Latent Attention (MLA) inference on NVIDIA Hopper GPUs (H100/H800). Developed by DeepSeek-AI, it targets a common bottleneck in LLM inference: decode-time attention is usually memory-bound rather than compute-bound, so a conventional kernel leaves GPU compute idle while it waits on memory. According to the README, FlashMLA reaches up to 3000 GB/s of memory bandwidth in memory-bound configurations and 660 TFLOPS in compute-bound configurations on H800. It gets there through kernel-level scheduling that overlaps CUDA Core operations, Tensor Core operations, and memory transfers.
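To make the memory-bound versus compute-bound distinction concrete, here is a rough roofline estimate in Python. It is a back-of-the-envelope sketch, not a measurement: the peak throughput and bandwidth constants are assumed round figures for a Hopper-class GPU, and the head dimensions (576 for the cached latent key, 512 for the value) and head counts are assumptions modeled on DeepSeek-V3 style MLA rather than values taken from this review.

```python
# Rough roofline arithmetic. All hardware and model numbers are assumptions
# chosen for illustration, not figures from the FlashMLA README.
PEAK_TFLOPS = 989.0   # assumed dense BF16 peak, Hopper-class GPU
PEAK_BW_GBS = 3350.0  # assumed HBM bandwidth in GB/s


def ridge_point(peak_tflops, peak_bw_gbs):
    """FLOPs per byte above which a kernel stops being memory-bound."""
    return (peak_tflops * 1e12) / (peak_bw_gbs * 1e9)


def decode_intensity(num_q_heads, head_dim_k, head_dim_v, bytes_per_elem=2):
    """Arithmetic intensity of one decode step over a shared latent KV cache.

    Per cached token, every query head performs a QK dot product
    (2 * head_dim_k FLOPs) and a PV accumulation (2 * head_dim_v FLOPs),
    while the token's cache entry (head_dim_k elements) is read once
    and shared by all heads.
    """
    flops = num_q_heads * 2 * (head_dim_k + head_dim_v)
    bytes_read = head_dim_k * bytes_per_elem
    return flops / bytes_read


if __name__ == "__main__":
    ridge = ridge_point(PEAK_TFLOPS, PEAK_BW_GBS)
    for heads in (16, 128):  # hypothetical per-GPU query head counts
        ai = decode_intensity(heads, head_dim_k=576, head_dim_v=512)
        regime = "memory-bound" if ai < ridge else "compute-bound"
        print(f"{heads} heads: {ai:.1f} FLOPs/byte vs ridge {ridge:.1f} ({regime})")
```

Under these assumptions, 16 query heads per GPU gives roughly 30 FLOPs per byte against a ridge point near 295, so the step is firmly memory-bound. At 128 heads the intensity rises to about 242, much closer to the ridge, which is why both a bandwidth figure and a TFLOPS figure are meaningful for the same kernel.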

The implementation combines several optimization strategies. FlashMLA uses programmatic dependent launch to overlap the splitkv_mla and combine kernels, which reduces synchronization overhead. A tile scheduler assigns work to streaming multiprocessors to balance load. The kernel supports BF16 natively and implements a paged KV cache with a block size of 64 tokens, so the cache can grow block by block instead of reserving one contiguous, maximum-length buffer per sequence. For sparse workloads using an FP8 KV cache, reported throughput reaches 410 TFLOPS. Padding-free handling of variable-length sequences further improves efficiency for batched inference.
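The paged cache and the split-then-combine structure described above can be sketched on the CPU in plain Python. This is a sketch of the underlying math only, not FlashMLA's kernel code or API: the function names, the pool layout, and the choice to split work exactly at block boundaries are all hypothetical, and the real tile scheduler distributes work across streaming multiprocessors rather than looping.

```python
import math

BLOCK_SIZE = 64  # tokens per KV cache block, matching the block size above


def locate(block_table, token_idx, block_size=BLOCK_SIZE):
    """Map a logical token index to (physical block id, offset in block)."""
    return block_table[token_idx // block_size], token_idx % block_size


def partial_attention(q, keys, values):
    """Attention over one KV chunk: returns (normalized output, log-sum-exp)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)


def combine(partials):
    """Merge chunk results, weighting each by exp(lse_chunk - lse_total)."""
    m = max(lse for _, lse in partials)
    total = m + math.log(sum(math.exp(lse - m) for _, lse in partials))
    dim = len(partials[0][0])
    return [sum(math.exp(lse - total) * out[d] for out, lse in partials)
            for d in range(dim)]


def split_kv_attention(q, k_pool, v_pool, block_table, seq_len,
                       block_size=BLOCK_SIZE):
    """Process one physical block at a time, then combine the partials."""
    partials = []
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)  # last block may be partial
        slots = [locate(block_table, t, block_size) for t in range(start, end)]
        keys = [k_pool[b][o] for b, o in slots]
        values = [v_pool[b][o] for b, o in slots]
        partials.append(partial_attention(q, keys, values))
    return combine(partials)
```

Because each chunk carries its own log-sum-exp, the chunks can be computed independently and merged afterwards without changing the result, which is what makes a separate combine step possible. The block table also shows why no padding is needed: a sequence only owns as many blocks as its length requires, and the blocks need not be adjacent in the pool.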

DeepSeek released FlashMLA as part of its open-source week initiative, targeting infrastructure teams that operate large model deployments. The kernel integrates with the vLLM and SGLang inference engines, so teams serving production LLM APIs can pick up the speedup without rewriting their stack. Infrastructure providers hosting DeepSeek or other MLA-based models may see throughput improvements on the order of 2-3x, though the actual gain depends on the workload and on the baseline kernel being replaced. For research teams working on MLA architectures, FlashMLA also serves as a reference implementation of memory-efficient kernel design, with techniques that apply to attention optimization beyond MLA.

Pricing

Free and open-source under MIT license

Platforms

CUDA, NVIDIA GPUs, Python/C++ integration

Related Tools

KubeAI

Kubernetes operator for serving AI inference workloads

KubeAI is an Apache-2.0 Kubernetes operator for deploying and scaling AI inference workloads, including LLMs, embeddings, reranking, and speech-to-text. It gives platform teams OpenAI-compatible endpoints, model proxy/controller primitives, model caching, scale-from-zero behavior, and cluster-native resource management for self-hosted inference on Kubernetes.

Open Source

Freestyle

Sandboxes for coding agents — Linux VMs, Git, and deploys in one box

Freestyle is YC-backed sandbox infrastructure built for AI coding agents, shipping secure Linux VMs with nested virtualization, Git servers, and one-click web deploys. It lets agents run real workloads, branch repos, and deploy apps under short-lived identities while billing only for active compute. Used in production by vly.ai, Rork, and Vibeflow.

freemium

OpenSRE

Open-source toolkit for building AI SRE incident response agents

OpenSRE is Tracer Cloud’s open-source public-alpha Python toolkit for building AI SRE agents that investigate and respond to production incidents. It ships 60+ tools across observability, databases, incident management, communications, deployment and protocol integrations, plus simulation/evaluation workflows for benchmarking agent accuracy before live pager use.

Open Source

Twill AI

Autonomous coding agents that ship while you sleep

Twill is an autonomous coding agent platform that implements features, fixes bugs, and ships pull requests without manual intervention. It follows a structured workflow: research, planning, human review, implementation in an isolated sandbox, AI code review, then merge. It supports custom agent configurations with multiple LLM providers, isolated dev environments for verification, and integrations with GitHub, Linear, Sentry, Notion, and cloud platforms for end-to-end engineering automation.

freemium

Baseten

ML inference platform for production AI models

Baseten is an inference platform for deploying AI models at scale, offering dedicated deployments, pre-optimized model APIs, and performance-tuned infrastructure. It specializes in image generation, transcription, text-to-speech, LLM serving, embeddings, and compound AI workloads. The vendor cites a 75% latency reduction, 415ms cold starts, and scaling to 3000+ concurrent requests. It is available as managed cloud or self-hosted, and is used by Cursor, Notion, Descript, and Sourcegraph for production inference.

api-usage-based

Resolve AI

AI-powered production incident resolution

Resolve AI automates production incident investigation, diagnosis, and remediation, acting as an AI SRE that participates in every on-call rotation. It investigates incidents autonomously, pursues multiple hypotheses in parallel, validates them against real evidence, creates code snippets and draft PRs, generates post-mortems, and helps onboard new teammates with instant answers about code and infrastructure. The vendor cites 5x faster MTTR and 87% faster incident investigations.

paid
