Every AI agent framework eventually runs into the same wall: the model knows the tools exist, but it doesn't know *how* to use them reliably. Tool calling isn't a feature you enable and forget. It's an engineering problem that determines whether your agent actually accomplishes things or just pretends to.
This is the part the demos don't show you. The demos show a model calling `search` and getting a clean result. What they don't show is the 40% of tool calls that fail silently, the retries that compound latency, the malformed arguments that produce garbage output, and the architectural decisions that determine whether any of this scales.
After watching teams build agentic systems at scale, here's what actually works.
Most tool descriptions are written as afterthoughts: a one-liner explaining what the tool does, treated as a hint the model can take or leave. That's a mistake.
Your tool description is the model's only interface definition. It determines whether the model can select the right tool, format the right arguments, and interpret the result correctly. A vague description produces vague behavior.
The pattern that works: **exhaustive, example-laden tool descriptions** written from the model's perspective. Include:
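As a rough sketch of what such a description could contain: a statement of purpose, when to use the tool and when not to, hard constraints, and a concrete example. The tool name `run_sql_query`, the "analytics database", and the JSON-Schema-style `parameters` layout are illustrative assumptions; check your framework's exact tool format before copying the shape.

```python
# Hypothetical tool definition. The name, wording, and schema layout are
# assumptions for illustration, not a prescribed format.
RUN_SQL_QUERY_TOOL = {
    "name": "run_sql_query",
    "description": (
        "Runs a read-only SQL query against the analytics database and "
        "returns the matching rows.\n"
        "Use when: you need exact figures that only the database holds.\n"
        "Do not use when: the answer is already in the conversation.\n"
        "Constraints: only SELECT statements are allowed. INSERT, UPDATE, "
        "DELETE, and DROP are rejected with an error.\n"
        "Example: SELECT id, email FROM users "
        "WHERE created_at > '2024-01-01' LIMIT 10"
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "A single SELECT statement.",
            },
        },
        "required": ["query"],
    },
}
```

The constraint is stated in the description itself, so the model sees it before it ever formats a call.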
A model whose description spells out the guardrails, for example that only SELECT queries are allowed, knows where the limits are. A model given only "Executes SQL queries on the database" may try to run `DROP TABLE users` and not understand why it failed.
The model will send malformed arguments. Not occasionally — frequently. A tool that expects `user_id` as an integer will receive `user_id: "usr_12345"` because the model read it from context that was labeled `user_id`. A tool expecting a URL will receive a search query because the distinction wasn't clear in the description.
The fix isn't better prompts — it's validation at the boundary. Every tool should validate its inputs before executing anything. This serves two purposes: it prevents garbage-in-garbage-out from propagating to external systems, and it provides structured error feedback that the model can actually use to recover.
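A minimal sketch of boundary validation for a SQL tool, assuming the tool receives its arguments as a plain dict. The function names, the `ok`/`error`/`detail` result shape, and the prefix check are all illustrative choices. A production system would more likely use a SQL parser and a read-only database role than a string check.

```python
from typing import Any


def validate_sql_args(args: dict[str, Any]) -> dict[str, Any]:
    """Return {"ok": True} or a structured error the model can act on."""
    query = args.get("query")
    if not isinstance(query, str):
        return {
            "ok": False,
            "error": "Invalid argument type",
            "detail": f"'query' must be a string, got {type(query).__name__}",
        }
    # Deliberately simple check for illustration only.
    if not query.lstrip().lower().startswith("select"):
        return {
            "ok": False,
            "error": "Only SELECT queries allowed",
            "detail": "Rewrite the request as a single SELECT statement.",
        }
    return {"ok": True}


def run_sql_tool(args: dict[str, Any]) -> dict[str, Any]:
    """Validate first; only valid input would reach the database."""
    check = validate_sql_args(args)
    if not check["ok"]:
        return check  # returned to the model; nothing was executed
    # Real query execution would go here; this sketch returns no rows.
    return {"ok": True, "rows": []}
```

Calling `run_sql_tool({"query": "DROP TABLE users"})` returns the error dict without touching the database, and the `detail` field tells the model what to change on the next attempt.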
Structured error feedback lets the model correct itself on the next attempt. "Only SELECT queries allowed" teaches it the constraint. "Invalid argument type" teaches it the format. This is prompt engineering through consequences, and it works better than verbose descriptions alone.
Not all tool calls are equal. A search tool is a lookup — the model asks a question and gets information back. The cost of getting it wrong is low; the model can often recover by rephrasing the query.
An action tool — one that sends emails, moves money, modifies records — is fundamentally different. The cost of getting it wrong is high. The model needs more deliberation time, more confirmation steps, and clearer error handling before execution.
The pattern: **separate these into different tool types with different behavioral contracts**.
Lookup tools: fast response, retry on failure, accept partial information.
Action tools: explicit confirmation, timeout handling, rollback awareness.
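One way to encode the two contracts, sketched with a hypothetical `ToolSpec` type. The tool names, retry counts, and description wording are assumptions for illustration. The point is that the contract lives both in the description the model reads and in a code-level gate that does not depend on the model behaving.

```python
from dataclasses import dataclass
from enum import Enum


class ToolKind(Enum):
    LOOKUP = "lookup"
    ACTION = "action"


@dataclass(frozen=True)
class ToolSpec:
    name: str
    kind: ToolKind
    description: str
    max_retries: int
    requires_confirmation: bool


SEARCH = ToolSpec(
    name="search_docs",
    kind=ToolKind.LOOKUP,
    description=(
        "Searches internal docs. Read-only and safe to retry. "
        "If results look off, rephrase the query and call again."
    ),
    max_retries=3,
    requires_confirmation=False,
)

SEND_EMAIL = ToolSpec(
    name="send_email",
    kind=ToolKind.ACTION,
    description=(
        "Sends an email to a real recipient. This cannot be undone. "
        "Show the draft to the user and get explicit confirmation first. "
        "Do not retry on timeout; report the uncertain status instead."
    ),
    max_retries=0,
    requires_confirmation=True,
)


def may_execute(tool: ToolSpec, user_confirmed: bool) -> bool:
    """Gate execution in code instead of trusting the description alone."""
    return user_confirmed or not tool.requires_confirmation
```

Here `may_execute(SEND_EMAIL, user_confirmed=False)` is `False`, while the search tool passes the gate without confirmation.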
A model given an email tool description that demands confirmation knows it has to ask first. A model given a search description that permits retries knows it can be more aggressive. Without this distinction, both tools get the same behavioral treatment, which is wrong.
Agentic systems fail in production in ways they don't fail in demos. One of the most common: latency accumulation. A model calls a tool and waits 800ms for the response, then calls another and waits 1.2s. That is 2 seconds of tool time before counting any model inference, and every further sequential call adds to it. By the time the user gets a response, they've forgotten what they asked.
The pattern: **design tools with latency budgets**. Each tool should have an expected response time range, and the model should know whether it's dealing with a fast lookup or a slow operation.
For fast tools (<500ms): the model can call them eagerly, as part of a parallel batch where it makes sense.
For slow tools (>2s): the model should signal to the user that it's working, surface intermediate progress, and be conservative about chaining multiple slow tools.
The model can't manage what it doesn't know. Tell it the latency expectations explicitly in the tool description.
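A small sketch of a latency budget attached to each tool. The 500ms and 2s thresholds come from the tiers above. The `LatencyBudget` type, the "moderate" middle tier, and the hint wording are assumptions added for illustration.

```python
from dataclasses import dataclass

FAST_MS = 500   # below this, a tool counts as fast
SLOW_MS = 2000  # above this, a tool counts as slow


@dataclass(frozen=True)
class LatencyBudget:
    tool: str
    expected_ms: int  # typical response time in milliseconds

    @property
    def tier(self) -> str:
        if self.expected_ms < FAST_MS:
            return "fast"
        if self.expected_ms > SLOW_MS:
            return "slow"
        return "moderate"  # assumption: the middle range is not specified

    def description_note(self) -> str:
        """One line to append to the tool description."""
        hints = {
            "fast": "Safe to call eagerly and in parallel batches.",
            "moderate": "Call when needed; avoid long sequential chains.",
            "slow": "Tell the user you are working; avoid chaining slow tools.",
        }
        return f"Typical latency: ~{self.expected_ms}ms. {hints[self.tier]}"


def sequential_cost_ms(budgets: list[LatencyBudget]) -> int:
    """Total tool time when calls run one after another."""
    return sum(b.expected_ms for b in budgets)


def parallel_cost_ms(budgets: list[LatencyBudget]) -> int:
    """Total tool time when independent calls run concurrently."""
    return max((b.expected_ms for b in budgets), default=0)
```

For two independent calls budgeted at 800ms and 1200ms, `sequential_cost_ms` gives 2000 and `parallel_cost_ms` gives 1200, which is the argument for batching calls that don't depend on each other.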
Here's the problem that manifests at scale: tool outputs are designed for systems, not for model context windows. A database query returns 50 columns of raw data. A web search returns 20 results with full snippets. The model receives all of it, tries to parse what's relevant, and either runs out of context or produces an answer that misses the important part.
The fix isn't bigger context windows — it's smarter output shaping at the tool level. Every tool should return *summarized* results by default, with the option to fetch more detail if the model determines it's needed.
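A sketch of summary-by-default output shaping for a search tool. The in-memory `_FAKE_RESULTS` list stands in for a real backend, and the `limit` parameter, the 80-character snippet cap, and the field names are illustrative assumptions. The `summary` flag is the switch the model flips when it needs the full payload.

```python
from typing import Any

# Stand-in for a real search backend; the data is made up.
_FAKE_RESULTS = [
    {"title": f"Result {i}", "url": f"https://example.com/{i}", "snippet": "x" * 500}
    for i in range(20)
]


def search(query: str, summary: bool = True, limit: int = 3) -> dict[str, Any]:
    """Return a compact view by default and the full payload on request."""
    results = _FAKE_RESULTS  # a real tool would run `query` against a backend
    if not summary:
        return {"total": len(results), "results": results}
    compact = [
        {"title": r["title"], "url": r["url"], "snippet": r["snippet"][:80]}
        for r in results[:limit]
    ]
    return {
        "total": len(results),
        "shown": len(compact),
        "results": compact,
        # Tell the model how to get more, so it never has to guess.
        "hint": "Call again with summary=false for full results.",
    }
```

The default response carries 3 trimmed results plus the total count, so the model knows there is more without paying for it in context.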
The model gets enough to answer the question. If it needs more, it can call the tool again with `summary: false`. This is the retrieval equivalent of lazy loading — and it keeps your context windows from filling up with data the model didn't need.
All of these patterns share the same foundation: **treat your agent's tools as a designed system, not an assembled list**.
The tools aren't just available functions — they're a contract between your agent and the external world. That contract needs to be specific, validated, categorized by cost and risk, and designed for the model's actual information needs. When it's done right, tool calling stops being a failure mode and starts being the thing that actually makes your agent capable.
The teams that get agentic systems to work reliably are the ones that spend as much time on tool interface design as they do on model selection. Everything else is just hoping the model figures it out.
It won't. Design it properly.
*Tool use is an engineering discipline, not a feature checkbox. The agents that work are the ones where someone thought carefully about every tool boundary.*