*The era of bragging about context window size is over. What comes next is infrastructure.*
There's a running joke in the AI community: every time a new model drops, the benchmark charts get taller and the context windows get longer, but the actual experience of using these models for real work stays exactly the same. You paste in a codebase, the model drifts. You load a long document, attention gets spotty past page 20. You run an agent through 50 tool calls and by call 40 it's forgetting what it was doing.
DeepSeek V4, released April 24, 2026, is the first flagship model that actually takes the infrastructure problem seriously. The 1 million-token context window isn't the headline — it's the proof of concept. The real story is *how* they made a 1M window usable on hardware that doesn't cost more than a car.
**DeepSeek V4 ships a 1M-token context window that runs at 2% of the KV cache cost of standard attention. Two variants — V4-Pro (1.6T params, 49B active) and V4-Flash (284B params, 13B active) — both support the full window. V4-Flash costs $0.14 per million input tokens. That's 85% cheaper than GPT-5.5 at equivalent context lengths. And it runs on Huawei Ascend chips — no NVIDIA required.**
Here's what the benchmarks never tell you: a 1M-token context window doesn't mean the model actually *attends* to all 1M tokens with equal fidelity. Standard transformer attention is expensive — every token you add increases the memory footprint of the KV cache and the FLOPs per forward pass. At 1M tokens, naive attention would require hundreds of GB of KV cache just to run one inference call.
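To make the "hundreds of GB" figure concrete, here is the back-of-envelope arithmetic for a naive KV cache at 1M tokens. The dimensions below (61 layers, 8 grouped-query KV heads, head dimension 128, BF16 storage) are illustrative assumptions, not published V4 specs:

```python
# Back-of-envelope KV cache size for naive attention at 1M tokens.
# All model dimensions here are illustrative assumptions.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x for keys and values, stored per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical frontier-scale config: 61 layers, 8 KV heads (GQA),
# head_dim 128, BF16 (2 bytes per element).
naive = kv_cache_bytes(1_000_000, 61, 8, 128, 2)
print(f"Naive BF16 KV cache at 1M tokens: {naive / 2**30:.0f} GiB")
```

Even with grouped-query attention already reducing the KV head count, a single 1M-token request lands in the hundreds-of-GiB range before a single output token is produced.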
What most vendors do: they advertise the window size, let the model technically accept 1M tokens, and then watch it degrade on tasks that require reasoning over the full context. The model looks like it has a big brain. It doesn't.
The actual bottleneck isn't the *window* — it's the *attention math*. Every token decoded pays a cost proportional to the full sequence length. Long context degrades into expensive, slow, memory-hungry inference that nobody wants to pay for. That's why most "1M context" models end up being used for RAG retrieval pipelines instead of true end-to-end reasoning.
DeepSeek V4 is built around the idea that the context window is only as useful as the infrastructure supporting it. They didn't just build a bigger window — they rebuilt the attention mechanism so the window actually works.
The key innovation in V4 is a two-path attention system: **Compressed Sparse Attention (CSA)** and **Hybrid Compressed Attention (HCA)**, interleaved in alternating transformer layers. Here's how each works:
**CSA (Compressed Sparse Attention)** compresses KV entries by 4x along the sequence dimension. Instead of storing every token's key-value representation, it merges groups of 4 tokens into a single compressed entry using a learned token-level compressor with a softmax-gated pooling mechanism. A "lightning indexer" — a sparse selection module running in FP4 — then picks the top-k compressed blocks per query. This means the model only attends to the most relevant compressed regions, dramatically reducing the effective search space.
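The two CSA stages can be sketched in a few lines. This is a toy illustration of the shapes and data flow only: mean-pooling stands in for the learned softmax-gated compressor, and a plain dot-product score stands in for the FP4 lightning indexer.

```python
# Sketch of CSA's two stages: 4x sequence compression, then top-k block
# selection per query. Toy dimensions; not the real kernels.
import math
import random

random.seed(0)
SEQ_LEN, DIM, RATIO, TOP_K = 64, 8, 4, 4

def rand_vec(n):
    return [random.gauss(0, 1) for _ in range(n)]

keys = [rand_vec(DIM) for _ in range(SEQ_LEN)]
values = [rand_vec(DIM) for _ in range(SEQ_LEN)]

def mean_pool(block):
    # Merge RATIO token vectors into one compressed KV entry
    # (stand-in for the learned token-level compressor).
    return [sum(v[i] for v in block) / len(block) for i in range(DIM)]

# Stage 1: compress groups of 4 tokens along the sequence dimension.
comp_k = [mean_pool(keys[i:i + RATIO]) for i in range(0, SEQ_LEN, RATIO)]
comp_v = [mean_pool(values[i:i + RATIO]) for i in range(0, SEQ_LEN, RATIO)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Stage 2: score every compressed block for this query, keep the top-k,
# and run softmax attention over the selected blocks only.
query = rand_vec(DIM)
scores = [dot(k, query) for k in comp_k]
selected = sorted(range(len(scores)), key=scores.__getitem__)[-TOP_K:]

weights = [math.exp(scores[i]) for i in selected]
total = sum(weights)
out = [sum((w / total) * comp_v[i][d] for w, i in zip(weights, selected))
       for d in range(DIM)]

print(len(comp_k), len(selected), len(out))  # 16 blocks, 4 kept, DIM output
```

The point of the two-stage design: compression shrinks the candidate set 4x, and selection shrinks the attended set again, so per-query attention cost no longer scales with the raw sequence length.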
**HCA (Hybrid Compressed Attention)** goes further: 128x compression. It drops the sparse selection entirely — the compressed sequence is short enough that dense attention over every compressed block is computationally affordable. Every query attends to every compressed block with full fidelity.
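The arithmetic behind "drops the sparse selection entirely" is simple: at 128x compression, even a 1M-token sequence collapses to a few thousand entries per query, small enough for dense attention. Illustrative numbers only:

```python
# Why HCA can afford dense attention while CSA cannot: the number of
# compressed entries each query must consider. Illustrative arithmetic.

SEQ_LEN = 1_000_000

csa_len = SEQ_LEN // 4     # 250,000 entries — still needs top-k selection
hca_len = SEQ_LEN // 128   # 7,812 entries — dense attention is affordable

print(f"CSA entries per query: {csa_len:,}")
print(f"HCA entries per query: {hca_len:,}")
```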
The layers alternate: in V4-Pro's 61-layer stack, layers 0–1 run pure HCA, layers 2–60 alternate CSA and HCA, and the Multi-Token Prediction (MTP) block at the end runs sliding-window only. Different layers carry different attention patterns, and forcing one mechanism across all layers wastes capacity.
**The result:** V4-Pro requires only 27% of single-token inference FLOPs compared with V3.2, and 10% of the KV cache memory. On a 1M-token context, V4-Pro uses roughly 2% of the KV cache that a standard grouped-query attention architecture would demand. That 2% figure is what makes the whole thing deployable.
Both models also use FP8 storage for most KV entries, reserving BF16 only for the RoPE rotary dimensions, which compounds the memory savings. The lightning indexer inside CSA runs in FP4 — aggressive quantization that further shrinks the working set.
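Here is how the mixed-precision split compounds, using a hypothetical dimension split (512 FP8 dims plus 64 BF16 RoPE dims per entry — an assumed example, not published V4 numbers):

```python
# Per-token, per-layer KV footprint under mixed precision: FP8 for most
# of the entry, BF16 only for the RoPE rotary dims. The 512/64 split is
# a hypothetical example, not a published V4 spec.
FP8, BF16 = 1, 2          # bytes per element
main_dims, rope_dims = 512, 64

mixed = main_dims * FP8 + rope_dims * BF16    # mixed-precision entry
all_bf16 = (main_dims + rope_dims) * BF16     # full-BF16 baseline
print(f"mixed: {mixed} B, all-BF16: {all_bf16} B "
      f"({mixed / all_bf16:.0%} of full precision)")
```

Stacked on top of the 4x and 128x sequence compression, this per-element shrinkage is how the cache lands at a small single-digit percentage of the naive footprint.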
Standard chatbots are single-turn or short multi-turn. Agents are different. They run long-horizon workflows — hundreds of tool calls, each result appended to context, each subsequent token paying full attention cost against everything that came before. The model can't afford to lose state halfway through, and it can't slow to a crawl when the context gets deep.
The known failure modes for agentic LLMs are:
1. **Context budget blowup** — the KV cache fills the GPU and inference speed collapses
2. **Reasoning state loss across user turns** — the model discards chain-of-thought traces when a new user message arrives
3. **Tool-call degradation** — JSON parsing errors, escaping failures, nested object corruption in function calls
4. **Training infrastructure mismatch** — RL pipelines that can't run hundreds of thousands of concurrent sandboxes cheaply
V4 addresses all four:
**Interleaved thinking across tool calls.** Previous DeepSeek models kept reasoning traces across tool-result rounds but discarded them when a new user message arrived. For single-turn chat, fine. For multi-turn agentic workflows, the model lost accumulated reasoning state. V4 preserves the complete reasoning history across all rounds, including across user turns, whenever the conversation contains tool calls. The model maintains a coherent, cumulative chain of thought over long-horizon tasks.
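The behavioral difference can be sketched as a history-pruning policy. The message schema below is invented for illustration — it is not DeepSeek's actual conversation format:

```python
# Sketch of the behavioral difference described above: legacy behavior
# discards reasoning entries when a new user message arrives; the V4
# behavior keeps them whenever the conversation contains tool calls.
# Message schema is invented for illustration.

def prune_on_user_turn(history, keep_reasoning):
    """Return history as it would look after a new user message arrives."""
    if keep_reasoning:
        return list(history)  # V4-style: full reasoning trace survives
    # Legacy-style: accumulated chain-of-thought entries are dropped.
    return [m for m in history if m["role"] != "reasoning"]

history = [
    {"role": "user", "content": "Refactor the auth module."},
    {"role": "reasoning", "content": "Plan: scan files, then edit."},
    {"role": "tool_call", "content": "read_file(auth.py)"},
    {"role": "tool_result", "content": "<file contents>"},
    {"role": "reasoning", "content": "Found the bug in token refresh."},
]

legacy = prune_on_user_turn(history, keep_reasoning=False)
v4 = prune_on_user_turn(history, keep_reasoning=True)
print(len(legacy), len(v4))  # legacy loses both reasoning entries
```

Over a workflow with hundreds of tool calls, the legacy policy forces the model to re-derive its plan after every user interruption; the V4 policy keeps the plan in context.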
**DSec: A sandbox built for RL rollouts.** V4's agent behaviors were trained with reinforcement learning against real tool environments. The infrastructure for this — DeepSeek Elastic Compute (DSec) — is a Rust platform that exposes four execution substrates behind one Python SDK: function calls, containers, microVMs (Firecracker), and full VMs (QEMU). A single cluster runs hundreds of thousands of concurrent sandboxes. Key features: fast container image loading via layered 3FS storage (RL rollouts don't wait on container startup), and preemption-safe trajectory replay (so interrupted training steps don't corrupt the replay buffer).
**XML-based tool-call schema.** V4 introduces a `|DSML|` special token and an XML-based tool-call format that reduces escaping failures compared to JSON-in-string tool calls — a common failure mode when models emit nested quoted content. The schema separates string parameters (passed as-is with `string="true"`) from structured parameters (passed as JSON with `string="false"`). This removes a class of parsing errors around numbers and booleans that JSON tool-call formats routinely hit.
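The escaping advantage is easiest to see in a round-trip. The tag and attribute names below are guesses at the general shape, not the documented DSML schema — the point is only that free text in a `string="true"` parameter never passes through a JSON parser, so embedded quotes and backslashes can't corrupt it:

```python
# Illustrative round-trip of the XML tool-call idea: string parameters
# pass through verbatim, structured parameters are JSON-typed. Names here
# are hypothetical, not the documented DSML schema.
import json
import xml.etree.ElementTree as ET

call = ET.Element("invoke", name="create_ticket")
p1 = ET.SubElement(call, "parameter", name="title", string="true")
p1.text = 'Crash when path contains "quotes" \\ and backslashes'
p2 = ET.SubElement(call, "parameter", name="labels", string="false")
p2.text = json.dumps({"priority": 2, "urgent": True})

# Parsing side: only string="false" parameters go through the JSON
# parser, so quotes in free text never need escaping, while numbers and
# booleans still arrive correctly typed.
parsed = {}
for p in call.findall("parameter"):
    raw = p.text
    parsed[p.get("name")] = raw if p.get("string") == "true" else json.loads(raw)

print(parsed["title"])
print(parsed["labels"]["priority"], type(parsed["labels"]["urgent"]))
```

Compare this with JSON-in-string tool calls, where the title above would need every quote and backslash escaped — exactly the class of model-emitted escaping errors the schema is designed to eliminate.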
The most underreported aspect of the V4 launch: DeepSeek explicitly optimized the models for **Huawei Ascend AI chips**, specifically the Ascend 950-based supernode lineup. In a press release dated April 24, 2026, Huawei said its entire Ascend SuperNode product line was "fully adapted" to V4.
This matters for two reasons:
**Economic access.** Ascend chips are available in Chinese data centers without the export restrictions that limit NVIDIA H100/H200 availability outside China. For developers and enterprises outside the U.S. who want to run frontier-level models at scale, Ascend provides a path that bypasses the NVIDIA supply crunch.
**Geopolitical signal.** The AI infrastructure race is increasingly a hardware race. DeepSeek's V3 already demonstrated that Chinese labs could match frontier model quality at a fraction of the training cost. V4 extends that argument to inference hardware. If Huawei Ascend can run a 1.6T parameter MoE model with a 1M-token context at efficiency levels competitive with NVIDIA H100s, the U.S. chip export controls become less effective as a lever.
The pragmatic implication for developers: V4 is the first frontier-scale open-source model you can actually host on non-NVIDIA hardware at production scale. If you're building agentic systems and your infrastructure is constrained by GPU availability or cost, the Huawei partnership opens options that didn't exist last month.
The two V4 variants serve different priorities:
| | **V4-Pro** | **V4-Flash** |
|---|---|---|
| Total parameters | 1.6 trillion | 284 billion |
| Active parameters | 49 billion | 13 billion |
| Context window | 1M tokens | 1M tokens |
| Input cost / 1M tokens | $1.74 | $0.14 |
| Output cost / 1M tokens | $3.48 | $0.28 |
| Best for | Complex reasoning, agentic tasks | High-volume, cost-sensitive tasks |
| Hardware requirements | High-end (consumer-grade 128GB+ hardware can run it) | Accessible |