AI Engineering · 2026-05-11

Docs Beat Skills in 250 Evals. Scenario Models Need Guardrails. Rust Is Quietly Eating AI.

Wix ran 250 evaluations to test whether AI agent skills beat documentation. The answer is uncomfortable. The TLDR coverage adds two more threads: scenario models need runtime guardrails, and Rust is becoming the default systems language for AI infrastructure. Here is the synthesis.

Hey guys, Mr. Technology here.

The TLDR May 11 issue dropped a research result that the entire agent-builder community needs to sit with. Wix ran 250 evaluations to test whether AI agent skills outperform documentation, and documentation won on net. The TLDR coverage added two more threads that read as separate stories but actually fit the same pattern: scenario models for agents need runtime guardrails, and Rust is becoming the default systems layer for serious AI infrastructure. The unifying question is: how do you build software that AI agents can use reliably? The honest answer in May 2026 is: with great documentation, runtime guardrails, and Rust where you can.

The Wix 250-Eval Result

The full writeup is at Wix Engineering. The setup: Wix's developer productivity team gave AI agents two interfaces for the same tasks — one a skill (a packaged capability with structured input/output, tool definitions, and an execution contract) and one a documentation page (Markdown describing how to use the underlying API or service). They ran 250 evaluations across both interfaces. The result, in their own words: skills are not a clean win.

The uncomfortable findings:

  • Skills help when the underlying API is hard to discover or has non-obvious parameters. That is the case skills are designed for, and they did help there.
  • Documentation wins when the API is well-described and the agent is competent at reading it. A clean REST API with good examples was faster for the agent to use via docs than via a skill.
  • Skills rot. The moment the underlying API changes, a skill that does not get updated becomes worse than documentation. Wix found skill maintenance overhead was a bigger drain on engineering time than skill authoring.
  • Mixed-mode is the worst outcome. When a skill partially covers a task and the agent falls back to docs for the gap, the model is more confused than when it uses docs for the entire task. Handoff cost is real.
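
One way to catch skill rot early is to compare what a skill declares against what the API currently accepts, and fail CI when they diverge. The following is a minimal sketch, not something taken from the Wix writeup: the function names `find_skill_drift` and `is_stale`, the dict-based schema format, and the example parameters are all hypothetical.

```python
from __future__ import annotations


def find_skill_drift(
    skill_params: dict[str, str], api_params: dict[str, str]
) -> dict[str, list[str]]:
    """Compare a skill's declared parameters with the API's current ones.

    Both arguments map parameter name -> type name (an assumed format).
    """
    skill_names = set(skill_params)
    api_names = set(api_params)
    return {
        # Parameters the skill still sends but the API no longer accepts.
        "removed_from_api": sorted(skill_names - api_names),
        # Parameters the API added that the skill does not know about.
        "missing_from_skill": sorted(api_names - skill_names),
        # Parameters present on both sides whose declared types disagree.
        "type_changed": sorted(
            name
            for name in skill_names & api_names
            if skill_params[name] != api_params[name]
        ),
    }


def is_stale(drift: dict[str, list[str]]) -> bool:
    """A skill is stale if any drift category is non-empty."""
    return any(drift.values())


# Hypothetical example: the API renamed `template` and changed a type.
skill = {"site_name": "string", "template": "string", "owner_id": "int"}
api = {"site_name": "string", "template_id": "string", "owner_id": "string"}

drift = find_skill_drift(skill, api)
print(drift)
print("stale:", is_stale(drift))
```

A check like this does not remove the maintenance cost, but it turns silent rot into a failing build in the same PR as the API change.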

The synthesis that the agent-builder community has been avoiding: a skill is a maintenance liability. It is a code dependency that lives in a special directory, has its own version, and breaks when the API changes. A documentation page is a Markdown file that lives in your docs site and can be updated in the same PR as the API change. For most agent consumers, the latter is the lower-friction path.

This does not mean skills are useless. Skills win in three cases:

1. The API is so complex that no agent can figure it out from docs. This is the rare case. Most APIs are not that complex.
2. The skill wraps stateful behavior that the agent should not have to reconstruct. Examples: database connection pooling, multi-step authentication flows, transaction handling.
3. The skill enforces policy the agent should not be able to override. Examples: rate limits, audit logging, PII redaction.
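
The third case is the easiest to picture in code: the skill owns the policy, so the agent has no path around it. Here is a rough sketch under stated assumptions. The `SupportReplySkill` class, the `send` callback, and the email-only regex are hypothetical stand-ins, and real PII redaction needs far more than one pattern.

```python
from __future__ import annotations

import re
from typing import Callable

# Deliberately simple pattern for illustration; it only catches email-like strings.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_pii(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)


class SupportReplySkill:
    """Hypothetical skill: the agent supplies text, the skill enforces policy."""

    def __init__(self, send: Callable[[str, str], None]) -> None:
        self._send = send  # the underlying API call, hidden from the agent
        self.audit_log: list[dict] = []

    def run(self, ticket_id: str, body: str) -> str:
        safe_body = redact_pii(body)
        # Audit logging happens on every call; the agent cannot skip it.
        self.audit_log.append(
            {"ticket_id": ticket_id, "redacted": safe_body != body}
        )
        self._send(ticket_id, safe_body)
        return safe_body


# Usage with a fake transport that just records what was sent.
sent: list[tuple[str, str]] = []
skill = SupportReplySkill(send=lambda ticket_id, body: sent.append((ticket_id, body)))
result = skill.run("T-1", "Contact jane.doe@example.com for details.")
print(result)  # Contact [REDACTED_EMAIL] for details.
```

A docs page can tell the agent to redact PII. Only a wrapper like this can guarantee it.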

Outside of those three cases, the Wix data says: write good docs.

Scenario Models Need Runtime Guardrails

The second story in the TLDR issue is the technical answer to "what if the agent is going to do something irreversible." As agents get more capable, the worst-case scenarios stop being hypothetical. An agent that can send email, move money, deploy code, and modify production data can cause real damage. The traditional alignment approach — pre-deployment safety training — is necessary but not sufficient. Once the model is running with tools, you need runtime guardrails.

The pattern that is emerging: scenario models — small, specialized models that monitor the main agent's actions in real time and flag or block specific behaviors. The classic example is a model that watches every outbound email an agent composes and refuses to let it send anything that looks like a phishing payload. Another is a model that watches every tool call and refuses to let an agent delete more than N records per minute. The scenario model runs in parallel to the main agent and has veto power on specific action classes.
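
To make the veto idea concrete, here is a minimal sketch of the "no more than N deleted records per minute" check. It uses a plain sliding-window counter as a stand-in for the scenario model, which is an assumption of this sketch. The `DeleteRateGuard` name and its interface are hypothetical, and time is passed in explicitly so the behavior is easy to test.

```python
from __future__ import annotations

from collections import deque


class DeleteRateGuard:
    """Vetoes a delete call if it would push the total past `max_records`
    deleted within the last `window_seconds`."""

    def __init__(self, max_records: int, window_seconds: float = 60.0) -> None:
        self.max_records = max_records
        self.window_seconds = window_seconds
        self._events: deque[tuple[float, int]] = deque()  # (timestamp, count)

    def allow(self, record_count: int, now: float) -> bool:
        # Drop deletions that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            self._events.popleft()
        used = sum(count for _, count in self._events)
        if used + record_count > self.max_records:
            return False  # veto: the call is never executed, so it is not counted
        self._events.append((now, record_count))
        return True


guard = DeleteRateGuard(max_records=100)
decisions = [
    guard.allow(60, now=0.0),   # allowed: 60 of 100 used
    guard.allow(50, now=10.0),  # vetoed: 60 + 50 would exceed 100
    guard.allow(40, now=20.0),  # allowed: exactly 100 used
    guard.allow(60, now=61.0),  # allowed: the first 60 have aged out
]
print(decisions)  # [True, False, True, True]
```

The main agent never sees the counter. It only sees that a specific action class was refused, which is the veto-on-action-class behavior described above.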

StateTech Magazine's January 2026 piece made the case that runtime guardrails are no longer optional in 2026. Menlo Security's March 2026 post on the same topic argued that agentic action at machine speed has outpaced pre-deployment evaluation: by the time you have a test suite for a new attack pattern, an agent has already attempted it 10,000 times in production.

The practical stack that is forming:

  • Action-level guardrails. A small, fast model (often a fine-tuned model in the 7B-13B parameter range) inspects every tool call before it executes. Latency budget: 50-200ms per call.
  • Policy-level guardrails. A rules engine (OPA, Cedar, custom) that enforces structured policies — "no emails to addresses outside this list," "no database writes after 6pm," "no production deploys without human approval."
  • Audit trail. Every guardrail decision logged with model confidence, the full action context, and the policy that triggered the block.
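
The policy and audit layers of that stack can be sketched in a few dozen lines. This is an illustration under stated assumptions, not how OPA or Cedar work internally: the `ToolCall` shape, the tool names, and the allowlist are hypothetical, the three rules are the ones quoted above, and model confidence is left out because no model is involved here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Optional


@dataclass
class ToolCall:
    tool: str
    args: dict
    at: datetime
    human_approved: bool = False


ALLOWED_RECIPIENTS = {"ops@example.com"}  # hypothetical allowlist


# Each policy returns the rule text if it is violated, otherwise None.
def email_allowlist(call: ToolCall) -> Optional[str]:
    if call.tool == "send_email" and call.args.get("to") not in ALLOWED_RECIPIENTS:
        return "no emails to addresses outside this list"
    return None


def no_late_db_writes(call: ToolCall) -> Optional[str]:
    if call.tool == "db_write" and call.at.hour >= 18:
        return "no database writes after 6pm"
    return None


def deploy_needs_approval(call: ToolCall) -> Optional[str]:
    if call.tool == "deploy_production" and not call.human_approved:
        return "no production deploys without human approval"
    return None


POLICIES = [email_allowlist, no_late_db_writes, deploy_needs_approval]


@dataclass
class Guardrail:
    policies: list[Callable[[ToolCall], Optional[str]]]
    audit_trail: list[dict] = field(default_factory=list)

    def check(self, call: ToolCall) -> bool:
        """Return True if the call may execute; log every decision either way."""
        violated = None
        for policy in self.policies:
            violated = policy(call)
            if violated is not None:
                break
        self.audit_trail.append(
            {
                "tool": call.tool,
                "args": call.args,
                "at": call.at.isoformat(),
                "allowed": violated is None,
                "policy": violated,  # the rule that triggered the block, if any
            }
        )
        return violated is None


guardrail = Guardrail(POLICIES)
evening = datetime(2026, 5, 11, 19, 0)
print(guardrail.check(ToolCall("db_write", {"table": "orders"}, evening)))  # False
print(guardrail.audit_trail[-1]["policy"])  # no database writes after 6pm
```

Keeping the rule text in the audit record means a blocked action can be traced back to the exact policy without replaying the call.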

The vendors in this space are mostly second-generation — Vendia, Robust Intelligence, Calypso AI, Lakera, and a dozen newer entrants. The interesting move is that the major agent platforms are now shipping built-in guardrail hooks: LangChain's policy middleware, CrewAI's action review layer, and Anthropic's Claude Code Review being the most visible examples.

Rust Is Quietly Becoming the AI Systems Layer

The third thread is the easiest to miss and the most consequential. Rust is not a popular language for AI application code — Python is and will be for a long time. But Rust is becoming the default for the systems layer underneath: inference servers, agent runtimes, vector database internals, model serving infrastructure, and the new generation of agentic code-execution sandboxes.

A few signals:

  • Hugging Face's text-embeddings-inference, the fastest open-source embedding server, is Rust.
  • Anthropic's model serving stack — internally Rust and Go, per the engineering blog and the Claude Code Review launch materials.
  • Qdrant and LanceDB — the two most popular open-source vector databases — are Rust.
  • Hugging Face's tokenizers and safetensors libraries: Rust cores with Python bindings.
  • The new agent runtimes (LangChain's Rust port, AutoGen-Rust, and the MCP server SDKs in Rust) — Rust.

The Reddit r/rust discussion from May 2026 captured the trend. The reasons are not subtle: Rust gives you memory safety without a garbage collector, predictable latency under load, and a binary that is small enough to embed in a sidecar. For AI infrastructure, those three properties are worth the steeper learning curve.

If you are building AI infrastructure — model servers, agent runtimes, vector stores, inference hardware adapters — Rust is now the table-stakes choice. The performance gap between a Rust implementation and a Python implementation at the systems layer is often 10x-100x. The reliability gap (no GC pauses, no runtime errors from bad memory access) is the reason production teams keep picking Rust for new components.

The corollary is that the AI infrastructure hiring market is bifurcating. Application-layer engineers write Python. Systems-layer engineers write Rust. If you are a Rust engineer reading this, your skills are about to be very valuable. If you are a Python engineer who has been putting off learning Rust, the next 18 months are a good window to add it.

The Take

Three threads, one underlying shift: the agent-builder ecosystem is maturing past the "build something that works for the demo" phase into the "build something that works in production for two years" phase. That means:

  • Documentation over skills for most agent interfaces. The Wix 250-eval data is the strongest signal yet that the skill abstraction is not a free win.
  • Runtime guardrails over pre-deployment alignment for any agent that takes irreversible actions. The scenario-model pattern is the new floor.
  • Rust over Python for AI infrastructure systems. The application layer stays in Python. The systems layer moves to Rust.

If you are picking a stack in 2026, that is the rule set. If you are hiring, that is the skill map. The agent era is not "Python everything" the way the web era was "JavaScript everything." It is Python at the top, Rust at the bottom, and runtime guardrails in the middle.

Mr. Technology


Sources:

  • Wix Engineering, "We Ran 250 AI Agent Evals to Find Out if Skills Beat Docs"
  • StateTech Magazine, "AI Guardrails Will Stop Being Optional in 2026"
  • Menlo Security, "When AI Acts: Why Guardrails Must Move Into the Runtime"
  • Anthropic Engineering Blog
  • Epsilla, "The $25 Code Review Tax"
  • Reddit r/rust, "Rust is quietly becoming the foundation layer for AI tooling"
  • MCP server SDKs, Model Context Protocol
