ai · 2026-06-03

AI agents pointed at enterprise warehouses hallucinate joins on more than 65% of complex queries, and DataHub's new Context Intelligence layer, which mines SQL query history, is the first real fix. It also targets the lakehouse routing gap, where a mis-routed query costs roughly 8x as much compute as a properly joined warehouse query.
Your AI agents are getting joins wrong — here's what it's costing you

When Miro's data team pointed AI agents at their production warehouse, more than 65% of the agents' complex multi-table queries contained hallucinated joins: the agents either invented relationships between tables that didn't exist, or relied on real relationships that had since been deprecated. The fix wasn't better prompts. It was exposing agents to the actual SQL query history that human analysts had already validated. That's the bet behind DataHub's new "Context Intelligence" layer, which mines your existing query log to build a semantic index the agents can read.

What You Need to Know: In Miro's reported case, AI agents pointed at an enterprise data warehouse hallucinated joins on more than 65% of complex queries, and the fix the team pointed to is grounding agents in the SQL query history your team has already produced, not in static schema docs.

Why It Matters

  • The 8x cost problem is real. When an agent routes a query to a lakehouse without context, it pulls whole tables and runs full scans instead of targeted joins. DataHub's research shows a mis-routed lakehouse query costs roughly 8x what a properly joined warehouse query does.
  • Your data sprawl is the silent killer. Most enterprise data teams now operate 50+ source systems feeding warehouses, lakes, and lakehouses. Agents without a unified metadata layer will pick the wrong source almost every time.
  • Static schemas are obsolete for agent work. Column names alone don't tell an agent that customer_status = 'gold' means "paying" in one table and "lifetime value tier" in another. Only validated queries carry that business context.
  • The DataHub approach is replicable. Because it's built on the same lineage-tracking infrastructure that LinkedIn open-sourced, the underlying query-log corpus already exists in most companies — it just hasn't been exposed to agents.

What Actually Happened

DataHub ships Context Intelligence, mining SQL query history for agents

On May 28, 2026, DataHub — the metadata platform born from the open-source project founded at LinkedIn — released its Context Intelligence layer. The new module reads a company's existing SQL query history and builds a semantic index that AI agents can query directly, through integrations with LangChain, Google's Agent Development Kit, and CrewAI.

The system builds on "production-proven" lineage tech, according to DataHub CTO Shirshanka Das, who spent nearly 11 years leading data infrastructure at LinkedIn. The pitch: instead of asking an LLM to guess what a column means, give it the actual queries your analysts have already written, and let the agent reason from proven intent.

Futurum Group's coverage of the launch drove the point home: "When agents ingest static, developer-defined schemas bereft of business context, they frequently hallucinate complex joins and query deprecated tables." That's not a theoretical risk. It's measured.
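Both failure modes in that quote can, in principle, be caught before a query ever runs. What follows is a minimal sketch, not DataHub's implementation: it assumes you already have a set of join pairs seen in validated queries and a list of deprecated tables. `KNOWN_JOINS`, `DEPRECATED_TABLES`, and `check_agent_sql` are hypothetical names, and the regex only handles simple `FROM ... JOIN ...` queries (a real checker would use a proper SQL parser).

```python
import re

# Hypothetical inputs: join pairs observed in validated analyst queries,
# and tables flagged as deprecated in a catalog.
KNOWN_JOINS = {
    frozenset({"orders", "customers"}),
    frozenset({"orders", "payments"}),
}
DEPRECATED_TABLES = {"customers_v1"}

# Captures the keyword and the table name that follows FROM or JOIN.
TABLE_RE = re.compile(r"\b(from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def check_agent_sql(sql):
    """Return a list of human-readable problems found in agent-written SQL."""
    problems = []
    base = None
    for keyword, table in TABLE_RE.findall(sql):
        table = table.lower()
        if table in DEPRECATED_TABLES:
            problems.append(f"deprecated table: {table}")
        if keyword.lower() == "from":
            base = table
        elif base is not None and frozenset({base, table}) not in KNOWN_JOINS:
            problems.append(f"unseen join: {base} + {table}")
    return problems


print(check_agent_sql("SELECT * FROM orders o JOIN customers_v1 c ON o.cid = c.id"))
print(check_agent_sql("SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id"))
```

The first query gets flagged twice (deprecated table, never-seen join); the second passes clean. A flagged query isn't necessarily wrong, but it's one no human has validated yet.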

Miro's 65% error rate became the new baseline

The catalyst for the conversation was Miro. The collaborative whiteboard company's data team publicly reported that AI agents, pointed at its warehouse with only a schema, hallucinated joins or used deprecated relationships on more than 65% of complex multi-table queries. Miro didn't publish its fix in detail, but the data team pointed to query history as the missing ingredient.

It's worth pausing on the number. Roughly two out of three complex multi-table queries that agents generated against a real production warehouse contained a bad join. That means any AI-powered analytics product shipping today without grounding is shipping broken, and most of them are.

The lakehouse routing gap is costing 8x per query

The "8x" in the headline refers to a separate problem: when an agent doesn't know which table to hit, it pulls from the most permissive source available — usually the lakehouse. Lakehouse queries without proper predicates scan massive files and run on much more expensive compute than warehouse queries with clean joins. The cumulative cost across thousands of agent-driven queries is the kind of line item that gets a CFO's attention within a quarter.
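To see how that adds up, here's a back-of-the-envelope sketch. Only the 8x multiplier comes from the reporting above; the query volume, per-query cost, and mis-routed share are made-up inputs, and `agent_query_spend` is a hypothetical helper.

```python
def agent_query_spend(n_queries, warehouse_cost, misrouted_share, lakehouse_multiplier=8.0):
    """Total spend when a share of queries falls back to lakehouse scans."""
    routed = n_queries * (1 - misrouted_share) * warehouse_cost
    misrouted = n_queries * misrouted_share * warehouse_cost * lakehouse_multiplier
    return routed + misrouted


# Assumed inputs: 10,000 agent queries a month at $0.05 each,
# with a quarter of them mis-routed to the lakehouse.
baseline = agent_query_spend(10_000, 0.05, 0.0)
actual = agent_query_spend(10_000, 0.05, 0.25)
print(f"baseline ${baseline:,.2f}, with mis-routing ${actual:,.2f}")
```

Under these assumptions, a 25% mis-routing rate takes the bill from $500 to $1,375 a month, about 2.75x, without a single extra query being asked.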

This is the routing gap DataHub's Context Intelligence is targeting. By showing the agent which tables are typically joined, in which order, and with which filter patterns, the system reduces the chance an agent falls back to "scan everything" mode.
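The core idea is easy to prototype. The sketch below is not how DataHub does it; it's a toy that counts which table pairs show up together in a query log. `QUERY_LOG`, `join_pairs`, and `build_join_index` are hypothetical, and the regex will misfire on subqueries or expressions like `EXTRACT(YEAR FROM ...)`, which is why real systems parse SQL properly.

```python
import re
from collections import Counter

# Hypothetical sample of validated analyst queries; a real log would come
# from the warehouse's query history.
QUERY_LOG = [
    "SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id WHERE c.region = 'EU'",
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT * FROM orders o JOIN payments p ON p.order_id = o.id",
]

# Captures the keyword and the table name that follows FROM or JOIN.
TABLE_RE = re.compile(r"\b(from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def join_pairs(sql):
    """Return (base_table, joined_table) pairs found in one query."""
    base = None
    pairs = []
    for keyword, table in TABLE_RE.findall(sql):
        table = table.lower()
        if keyword.lower() == "from":
            base = table
        elif base is not None:
            pairs.append((base, table))
    return pairs


def build_join_index(queries):
    """Count how often each (base, joined) table pair appears in the log."""
    counts = Counter()
    for sql in queries:
        counts.update(join_pairs(sql))
    return counts


index = build_join_index(QUERY_LOG)
for (left, right), n in index.most_common():
    print(f"{left} -> {right}: seen {n} times")
```

Feed an agent the top of that list alongside the schema and it has a ranked set of joins humans actually use, instead of a guess based on column names.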

The Take

I've been saying this since 2024: a 1M-context model with zero context is still dumber than a 200K-context model with the right one. The DataHub launch is the first productized admission that static schemas are dead for agent work. If you're building analytics AI in 2026, your agent's "memory" needs to be the actual query log of the humans who knew what they were doing. Anything less is hallucination as a service.

The hard truth: most companies don't have a query log. They have a query relic — a partial warehouse query_history table from Snowflake, a dbt run log, a Looker audit table, and a bunch of tribal knowledge in senior analysts' heads. DataHub, Atlan, and the rest of the metadata vendors are going to make a fortune stitching these together over the next 18 months.
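The stitching itself starts with something unglamorous: recognizing that the same query was recorded in two places. Here's a rough sketch, assuming each source can be exported as a list of SQL strings. The source names and the `fingerprint` and `stitch` helpers are hypothetical, and the normalization is deliberately crude.

```python
import hashlib
import re


def fingerprint(sql):
    """Normalize SQL so the same query from different tools hashes alike."""
    text = sql.strip().rstrip(";").lower()
    text = re.sub(r"'[^']*'", "?", text)  # replace string literals
    text = re.sub(r"\b\d+\b", "?", text)  # replace numeric literals
    text = re.sub(r"\s+", " ", text)      # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def stitch(sources):
    """Merge {source_name: [sql, ...]} into {fingerprint: {"sql", "sources"}}."""
    merged = {}
    for name, queries in sources.items():
        for sql in queries:
            entry = merged.setdefault(fingerprint(sql), {"sql": sql, "sources": set()})
            entry["sources"].add(name)
    return merged


# Hypothetical extracts from three places a query might be recorded.
sources = {
    "warehouse_history": ["SELECT id FROM orders WHERE status = 'paid';"],
    "dbt_run_log": ["select id\nfrom orders\nwhere status = 'refunded'"],
    "bi_audit": ["SELECT * FROM customers LIMIT 100"],
}
merged = stitch(sources)
for fp, entry in merged.items():
    print(fp, sorted(entry["sources"]))
```

The two `orders` queries collapse into one entry because they differ only in case, whitespace, and a literal. Whether that's the right call (is a 'paid' filter the same intent as a 'refunded' one?) is exactly the kind of judgment the vendors will be charging for.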

Quick Summary

AI agents hallucinating joins at a 65% clip is a 2026 baseline, not an edge case. DataHub's Context Intelligence, which mines your SQL query history, is the first real product to fix it, and the lakehouse routing gap it targets is where a mis-routed query costs roughly 8x as much compute as a properly joined one.
