
Most LLM serving infrastructure posts get ten minutes of attention and then disappear. Kimi's Mooncake paper made it about three weeks before the internet found something shiny. That's a shame, because the core idea — KV cache disaggregation for distributed LLM serving — is one of the more architecturally honest takes on what actually limits long-context inference at scale.
Let me explain why it matters, because the name undersells it.
When an LLM generates tokens, it doesn't reprocess the entire prompt on every step. It caches the key and value tensors produced by each attention layer (the "KV cache"), so each new token only computes its own query, key, and value and attends over the stored context instead of recomputing it. This is essential for efficiency. Without KV caching, every decoding step would rerun the forward pass over all n tokens seen so far, making attention cost O(n²) per generated token instead of O(n), and long contexts would be unusable.
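To make the caching argument concrete, here is a minimal single-head sketch in plain Python. The `project` function is a made-up stand-in for learned projection weights, and the counter only tracks key/value projections, so treat it as an illustration of the idea rather than how any real model is implemented.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def project(token, offset):
    """Hypothetical stand-in for a learned projection (W_q, W_k or W_v)."""
    return [math.sin(token * (i + 1) + offset) for i in range(4)]

def decode_without_cache(tokens):
    """Recompute K and V for the whole context at every step."""
    outputs, kv_projections = [], 0
    for step in range(1, len(tokens) + 1):
        context = tokens[:step]
        keys = [project(t, 1.0) for t in context]
        values = [project(t, 2.0) for t in context]
        kv_projections += len(context)
        outputs.append(attend(project(context[-1], 0.0), keys, values))
    return outputs, kv_projections

def decode_with_cache(tokens):
    """Append one K/V pair per step and reuse everything computed earlier."""
    keys, values, outputs, kv_projections = [], [], [], 0
    for t in tokens:
        keys.append(project(t, 1.0))
        values.append(project(t, 2.0))
        kv_projections += 1
        outputs.append(attend(project(t, 0.0), keys, values))
    return outputs, kv_projections

tokens = [3, 1, 4, 1, 5, 9, 2, 6]
slow, slow_work = decode_without_cache(tokens)
fast, fast_work = decode_with_cache(tokens)
```

Both paths produce the same attention outputs, but the uncached path performs 1 + 2 + ... + n key/value projections while the cached path performs n. The price of the cached path is that `keys` and `values` must stay resident somewhere, which is the memory problem the rest of this post is about.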
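The problem: in most serving stacks, that KV cache lives inside the GPU's HBM. When you're serving a 70B-parameter model, the weights alone take ~140GB in FP16. The KV cache comes on top of that and grows linearly with context length and with the number of concurrent requests: for a 70B model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, as in Llama 2 70B), a single 128K-token sequence needs roughly 40GB in FP16, and a handful of such requests outweighs the model itself. At a certain point, you run out of GPU memory for cache, and you either truncate the context or evict cache entries and recompute them later. Both are terrible options.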
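To put rough numbers on the cache footprint, the arithmetic is simple enough to script. The model shape below (80 layers, head dimension 128, 8 KV heads with grouped-query attention versus 64 with full multi-head attention) is an assumption borrowed from Llama 2 70B's published configuration, not a figure from the Mooncake paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

GIB = 1024 ** 3
CONTEXT = 128 * 1024  # 128K tokens

# Assumed shapes: grouped-query attention (8 KV heads) vs. full multi-head (64).
gqa_gib = kv_cache_bytes(80, 8, 128, CONTEXT) / GIB
mha_gib = kv_cache_bytes(80, 64, 128, CONTEXT) / GIB

print(f"GQA, one 128K sequence in FP16: {gqa_gib:.0f} GiB")  # 40 GiB
print(f"MHA, one 128K sequence in FP16: {mha_gib:.0f} GiB")  # 320 GiB
```

Multiply either figure by the number of concurrent long-context requests and the cache quickly competes with the weights for HBM.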
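Mooncake's answer is to disaggregate the KV cache: treat it as a first-class object that lives in a pooled store rather than being tied to the GPU that produced it. Per the paper, prefill and decoding run on separate clusters, and the otherwise underused CPU DRAM and SSD across the GPU cluster form a distributed cache reachable over a high-speed interconnect. The prefill phase still runs on GPU (that's where the compute happens). Once the KV cache is generated, it is stored in the pool and transferred to a decode node. Decode still reads the active sequence's cache from HBM, but anything not in active use no longer occupies precious GPU memory.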
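The flow can be sketched with a toy in-process pool. Everything here is hypothetical: `KVPool`, `prefill_node`, and `decode_node` are illustrative names, the "KV" entries are placeholders rather than tensors, and a real deployment moves data between machines over the network rather than through a Python dict.

```python
from dataclasses import dataclass, field

@dataclass
class KVPool:
    """Hypothetical in-process stand-in for a remote pooled KV store."""
    entries: dict = field(default_factory=dict)

    def put(self, request_id, kv):
        self.entries[request_id] = list(kv)

    def get(self, request_id):
        # Return a copy, mimicking a transfer into the caller's local memory.
        return list(self.entries[request_id])

def prefill_node(request_id, prompt_tokens, pool):
    """Compute KV for the whole prompt (faked here) and publish it to the pool."""
    kv = [("kv", tok) for tok in prompt_tokens]  # placeholder for real tensors
    pool.put(request_id, kv)

def decode_node(request_id, pool, max_new_tokens):
    """Fetch the prompt's KV, then extend it locally one token at a time."""
    kv = pool.get(request_id)
    generated = []
    for _ in range(max_new_tokens):
        next_token = len(kv)  # placeholder for sampling from the model
        kv.append(("kv", next_token))
        generated.append(next_token)
    return generated, kv

pool = KVPool()
prefill_node("req-1", [11, 12, 13], pool)
generated, local_kv = decode_node("req-1", pool, max_new_tokens=2)
```

The point of the separation is that the prefill worker and the decode worker share nothing except the pool, so either side can be scaled or scheduled without regard to where the cache was produced.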
The disaggregation bet is specifically about the relative cost of bandwidth versus compute. As context windows grow, the bottleneck in LLM inference shifts from compute to memory. A GPU doing attention math is fast — but it's blocked waiting for KV cache reads when the cache is larger than GPU memory can hold.
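A back-of-envelope comparison shows why the bet can pay off. The sketch below assumes prefill costs about 2 FLOPs per parameter per token (ignoring the attention term, which only makes recomputation look cheaper than it is), and the hardware figures are hypothetical round numbers, not measurements from Mooncake.

```python
def recompute_seconds(params, tokens, flops_per_second):
    """Approximate prefill time using ~2 FLOPs per parameter per token."""
    return 2 * params * tokens / flops_per_second

def transfer_seconds(kv_bytes, bytes_per_second):
    """Time to pull an existing KV cache across the interconnect."""
    return kv_bytes / bytes_per_second

# Hypothetical setup: 70B model, 32K-token prefix, 320 KiB of KV per token
# (the grouped-query shape: 80 layers, 8 KV heads, head dim 128, FP16).
PARAMS = 70e9
TOKENS = 32 * 1024
KV_BYTES = 327_680 * TOKENS

# Assumed hardware: 8 GPUs sustaining 300 TFLOP/s each, one 50 GB/s network link.
recompute = recompute_seconds(PARAMS, TOKENS, 8 * 300e12)
transfer = transfer_seconds(KV_BYTES, 50e9)

print(f"recompute: {recompute:.2f} s, transfer: {transfer:.2f} s")
```

Under these assumptions, fetching the cache takes roughly a fifth of a second against nearly two seconds to recompute it. The ratio moves with model size, link speed, and achieved GPU utilization, so it is worth redoing with your own numbers.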
By separating cache storage from compute, Mooncake is arguing that you should size your memory pool independently from your compute pool. Memory nodes don't need fast CUDA cores — they need high bandwidth and low latency access to large KV tensors. Compute nodes don't need to hold entire caches — they need enough memory for the active window plus weights.
This maps onto real hardware trends. High-bandwidth memory is expensive per GB but getting denser. CPU memory and NVMe are cheaper per GB but slower. The disaggregated design lets you pick the right tool for each job: GPU HBM for active compute, pooled memory for cached states.
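The production numbers from Kimi are what make this worth taking seriously. Mooncake serves inference for Kimi (the chatbot) and its API services. The paper's abstract reports up to a 525% throughput increase over its baseline in certain simulated long-context scenarios while meeting latency SLOs, and 75% more requests handled under real workloads. That second figure is not a toy benchmark; it comes from production traffic.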
There's a second-order benefit that's worth highlighting separately: when you disaggregate the KV cache, you open up the possibility of sharing cached states across requests.
Think about a RAG pipeline. Multiple requests come in with the same retrieved document context. In a traditional serving setup, each request generates its own KV cache for that context — even if it's byte-for-byte identical. That's wasted compute.
In Mooncake's architecture, if the cache for a given prefix is stored in pooled memory, subsequent requests with the same prefix can reuse it. Prefill loads the stored KV for the shared prefix and computes only the tokens that follow it, instead of recomputing from the beginning of the context. For applications with shared document contexts, batched RAG, or any scenario with repeated prefixes, this cuts both prefill compute and time to first token.
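One common way to find reusable prefixes is to split the token sequence into fixed-size blocks and key each block by a hash that chains in its parent's hash, so a key identifies the block together with everything before it. The sketch below illustrates that idea; the block size, the `block_keys` and `cached_prefix_tokens` names, and the use of a Python set as the pool index are all assumptions made for illustration.

```python
import hashlib

BLOCK_TOKENS = 4  # hypothetical block size; real systems use larger blocks

def block_keys(tokens, block_tokens=BLOCK_TOKENS):
    """Chain-hash each full block so a key depends on the entire prefix before it."""
    keys, parent = [], b""
    full = len(tokens) - len(tokens) % block_tokens  # ignore the trailing partial block
    for start in range(0, full, block_tokens):
        block = tokens[start:start + block_tokens]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys

def cached_prefix_tokens(tokens, pool_index):
    """Count how many leading tokens are already covered by cached blocks."""
    hit = 0
    for key in block_keys(tokens):
        if key not in pool_index:
            break
        hit += BLOCK_TOKENS
    return hit

document = list(range(10))                        # shared retrieved context
first_request = document + [100, 101]
second_request = document + [200, 201, 202]

pool_index = set(block_keys(first_request))       # blocks stored after the first prefill
reusable = cached_prefix_tokens(second_request, pool_index)
print(reusable)  # 8
```

Only whole blocks match, so the second request reuses 8 of the 10 shared tokens and recomputes the rest. That is the sharing-granularity trade-off in miniature: smaller blocks waste less at the boundary but mean more keys to track and more transfers to issue.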
It's not a silver bullet — cache invalidation becomes your problem now, and the sharing granularity matters. But the infrastructure to support prefix-level cache reuse is cleanly handled by the disaggregated architecture in a way that's much harder to implement when cache and compute are tightly coupled.
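A pool with finite capacity needs an eviction policy, and least-recently-used is the simplest place to start. This is a generic LRU sketch built on `collections.OrderedDict`; the class name and the policy are illustrative assumptions, not a description of Mooncake's actual store.

```python
from collections import OrderedDict

class LRUBlockPool:
    """Capacity-bounded block store that evicts the least recently used block."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, kv):
        self.blocks[key] = kv
        self.blocks.move_to_end(key)
        evicted = []
        while len(self.blocks) > self.capacity:
            old_key, _ = self.blocks.popitem(last=False)  # drop the oldest entry
            evicted.append(old_key)
        return evicted

pool = LRUBlockPool(capacity_blocks=2)
pool.put("block-a", "kv-a")
pool.put("block-b", "kv-b")
pool.get("block-a")                       # touch a, so b becomes the oldest
evicted = pool.put("block-c", "kv-c")
print(evicted)  # ['block-b']
```

With prefix-chained keys there is an extra wrinkle: evicting a block makes every block that extends it unreachable by prefix lookup, so a production policy would want to evict from the tail of a chain first or drop descendants along with their parent.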
Mooncake's core scheduler, memory pool management, and the disaggregation primitives are available on GitHub under an open license. The project is actively maintained by Moonshot AI's infrastructure team, which matters — this isn't a research-only release, it's the actual production stack.
The transfer engine that handles KV cache movement between compute and memory nodes is part of the open source package. So is the prefix-aware scheduler that decides which cached states to keep in GPU memory versus pushing to the pool. If you're building custom inference infrastructure or evaluating vLLM versus SGLang versus raw TGI, Mooncake is worth a look specifically because it attacks the problem from a different architectural angle than the compute-focused schedulers.
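As a rough illustration of what "prefix-aware" can mean for scheduling, the sketch below routes a request to whichever prefill node has the lowest estimated time to first token, trading cached prefix length against queueing delay. The `PrefillNode` fields, the linear cost model, and the per-token cost are hypothetical; this is not Mooncake's scheduler code.

```python
from dataclasses import dataclass

@dataclass
class PrefillNode:
    name: str
    queue_seconds: float        # estimated wait before this node can start
    cached_prefix_tokens: int   # how much of the prompt this node can reuse

def estimated_ttft(node, prompt_tokens, seconds_per_token):
    """Queue wait plus prefill time for the tokens that are not already cached."""
    uncached = max(prompt_tokens - node.cached_prefix_tokens, 0)
    return node.queue_seconds + uncached * seconds_per_token

def pick_node(nodes, prompt_tokens, seconds_per_token=0.0001):
    """Choose the node with the lowest estimated time to first token."""
    return min(nodes, key=lambda n: estimated_ttft(n, prompt_tokens, seconds_per_token))

nodes = [
    PrefillNode("idle-cold", queue_seconds=0.0, cached_prefix_tokens=0),
    PrefillNode("busy-warm", queue_seconds=0.5, cached_prefix_tokens=16_000),
    PrefillNode("swamped-hot", queue_seconds=3.0, cached_prefix_tokens=20_000),
]
choice = pick_node(nodes, prompt_tokens=20_000)
print(choice.name)  # busy-warm
```

The node with the best cache hit loses here because its queue is too long, and the idle node loses because it would have to recompute everything. A scheduler that only looks at load, or only at cache hits, would pick one of those two.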
Mooncake is not a magic throughput multiplier. The disaggregation gains are most visible when your context windows are large and your cache hit rates are reasonable — which is a specific set of workloads, not every LLM use case. If you're running short-context classification or extraction tasks, you won't see much benefit.
But if you're building systems that depend on long contexts — document understanding, agentic workflows with large working memories, RAG over large corpora — the disaggregation architecture addresses a real bottleneck that the compute-focused serving frameworks have been dancing around. The memory wall is a hardware constraint that compute-side scheduling can't fully solve. You need to move the problem to a different layer, which is exactly what Mooncake does.
It's also a useful frame for thinking about where LLM infrastructure is heading. Context windows are not going to shrink. The demands on KV cache will grow proportionally. The question is whether that cache lives inside GPU HBM (and runs out) or whether it lives in a properly sized pool that you manage independently. Mooncake is betting on the latter. So far, the production numbers suggest it's a reasonable bet.
In short, Mooncake is KV cache disaggregation for distributed LLM serving: open source from Moonshot AI, production-deployed at Kimi scale, with cross-request cache sharing for repeated-prefix workloads and a memory pool you can size independently of GPU HBM.