Fivetran and dbt Labs closed their merger with dbt Core v2 going Apache 2.0 and dbt State claiming 30%+ infra savings. Zepto's cart-as-sentence MLM is the new pattern to steal, and an eight-workload benchmark put Ray Data ahead of Daft for production multimodal pipelines.

dbt Core v2 Alpha, Cart Prediction with LLMs, Ray vs Daft

The TLDR Data digest for June 4 was a data-engineer fever dream: dbt Labs and Fivetran closed their merger with a stack of new open-source releases, Zepto published a masked-language-model approach to cart prediction, and a benchmark of eight production-like workloads gave Ray Data the edge over Daft, at least for now.

What You Need to Know: Fivetran and dbt Labs closed their merger on June 4 and immediately open-sourced the dbt Core v2 (Fusion) runtime under Apache 2.0, shipped dbt State in preview as a pipeline caching layer, and launched dbt Wizard in beta as an AI co-author. Separately, Zepto published its "Cart Contextual Model" — a Transformer MLM that treats a shopping cart as a sentence — and an independent eight-workload benchmark of Ray Data vs Daft for multimodal data lakes picked Ray for production stability, with Daft winning on ergonomics.

Why It Matters

The dbt / Fivetran stack just became the de-facto foundation for "trusted AI agents." dbt State promises >30% infra cost reduction via pipeline caching, and dbt Wizard is the first serious attempt at an in-pipeline AI co-author that understands the project DAG. If you're building agent infra, the lineage you were hand-rolling is now a primitive.
The "cart as a sentence" trick is generalizable. Zepto is doing last-mile retail, but the masked-LM approach works anywhere you have an ordered event stream with a strong prior on completion: checkout flows, form fills, code completions, search refinements.
Ray Data won the multimodal data-lake bake-off, but Daft's ergonomics are not a fluke. Anyone architecting a multimodal pipeline in 2026 needs to know that "Ray for scale, Daft for clarity" is a defensible position, and the test matrix the field-journal author ran is the closest thing to a real benchmark we have.
dbt Core v2 staying Apache 2.0 is bigger than the runtime itself. Vendors were whispering about a license change; the Apache commitment kills the FUD for any team that was deferring an adoption decision.

What Actually Happened

dbt Core v2: Fusion runtime open-sourced under Apache 2.0

On June 4, Fivetran and dbt Labs officially closed their merger — first announced in March — and used the moment to ship four pieces of news: (1) dbt Core v2 (alpha) with the Fusion engine's Rust-based runtime released under Apache 2.0, unifying Core and Fusion around a shared foundation, (2) dbt State in preview, a caching layer for data pipelines that dbt claims cuts underlying infrastructure costs by more than 30%, (3) dbt Wizard in beta, an AI assistant that understands a project's full lineage, model health, test coverage, and semantic definitions, and (4) Agents Schema, a proposed open standard for agentic context. Fusion remains the recommended free CLI for most users; Core v2 exists for teams that need fully open-source code or custom builds.
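To make "a caching layer for data pipelines" concrete, here is a minimal sketch of one common approach: fingerprint each model from its SQL plus the fingerprints of its upstream models, and rebuild only what changed. This is not dbt State's actual mechanism, which the announcement does not describe; `fingerprint`, `plan_run`, and the `models` layout are hypothetical names for illustration, and the sketch tracks code changes only, not new source data.

```python
import hashlib


def fingerprint(sql: str, upstream: list[str]) -> str:
    """Hash a model's SQL together with the fingerprints of its dependencies."""
    h = hashlib.sha256(sql.encode("utf-8"))
    for fp in upstream:
        h.update(fp.encode("utf-8"))
    return h.hexdigest()


def plan_run(models: dict[str, dict], cache: dict[str, str]) -> list[str]:
    """Return the models whose fingerprint differs from the cached one.

    Assumptions (hypothetical layout): `models` maps name -> {"sql", "deps"}
    and is listed in dependency order; `cache` maps name -> fingerprint from
    the previous run and is updated in place.
    """
    current: dict[str, str] = {}
    to_build: list[str] = []
    for name, spec in models.items():
        fp = fingerprint(spec["sql"], [current[d] for d in spec["deps"]])
        current[name] = fp
        if cache.get(name) != fp:
            to_build.append(name)  # changed itself, or something upstream changed
    cache.update(current)
    return to_build


models = {
    "stg_orders": {"sql": "select * from raw.orders", "deps": []},
    "fct_orders": {"sql": "select * from stg_orders", "deps": ["stg_orders"]},
    "dim_users": {"sql": "select * from raw.users", "deps": []},
}
cache: dict[str, str] = {}
first = plan_run(models, cache)   # cold cache: everything builds
second = plan_run(models, cache)  # nothing changed: nothing builds
```

Editing only `stg_orders` would schedule `stg_orders` and `fct_orders` but leave `dim_users` untouched. Skipped work of that kind is where any savings would come from, and how large they are depends on how often a real project's models actually change.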

The version-1 → version-2 path matters here. dbt Core v1 was the Python-based engine that built the modern analytics-engineering category. Core v2 is the same product surface backed by a Rust runtime with up to 10x faster parsing, compilation, and execution, dialect-aware SQL comprehension, column-level lineage, and Parquet artifacts for cleaner local docs and simpler installs. The community has been asking whether the merger meant a license change; the Apache 2.0 commitment is the answer.

Source coverage: dbt blog, "dbt Core v2 is here: still open source, now rebuilt for what's next", dbt Developer Hub, "About dbt local installations", Fivetran press release on the merger close.

Zepto's "Cart as a Sentence": masked-LM cart prediction

Zepto (the Indian 10-minute grocery delivery company) published a detailed post on its Cart Contextual Model, which treats a shopping cart as a sentence in a masked-language-model problem. The model is a Transformer MLM trained on historical cart patterns with temporal, geographical, and product signals plus an inverse-frequency masking strategy to handle long-tail items, and it infers user intent in real time as items are added. The output is a ranked list of "what else you'll probably buy," surfaced live in the cart UI.
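A minimal sketch of what "cart as a sentence" could look like at the input layer, assuming context signals are encoded as special tokens ahead of the item tokens and the slot to predict is a trailing mask token. The token formats and the `cart_to_sentence` and `encode` helpers are hypothetical; the post summary above does not specify Zepto's actual encoding.

```python
from datetime import datetime

MASK = "[MASK]"
UNK = "[UNK]"


def cart_to_sentence(items: list[str], ts: datetime, city: str) -> list[str]:
    """Turn a cart plus its context into an ordered token sequence.

    Context tokens (hour, day of week, city) come first, then the items in
    the order they were added, then a mask slot for the item to predict.
    """
    context = [f"[HOUR_{ts.hour}]", f"[DOW_{ts.weekday()}]", f"[CITY_{city}]"]
    return context + list(items) + [MASK]


def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map tokens to integer ids, falling back to the unknown-token id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


ts = datetime(2026, 6, 7, 9, 30)
sentence = cart_to_sentence(["tea_leaves", "milk", "ginger"], ts, "BLR")

# Toy vocabulary built from this one sentence; a real one would cover the catalog.
vocab = {UNK: 0, **{tok: i + 1 for i, tok in enumerate(dict.fromkeys(sentence))}}
ids = encode(sentence, vocab)
```

A Transformer MLM would then be trained to fill the masked position, and its output distribution over the item vocabulary gives the ranked suggestions.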

The interesting engineering choices: (1) treating the cart as an ordered, mutable sequence (not a bag-of-items) lets the model capture "if A goes in first, B usually follows," (2) inverse-frequency masking prevents the model from overfitting to predict high-frequency items (rice, milk) and forces it to learn long-tail co-occurrence, and (3) the temporal features include hour-of-day and day-of-week, which is what gives the model its "I know you order chai on Sunday morning" quality. Zepto is open about the trade-off: retraining cadence, not inference latency, is the bottleneck.
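The inverse-frequency masking idea can be sketched as follows, assuming each item's masking probability is proportional to one over its training-set count and scaled so an item of average rarity is masked at a base rate. The 0.15 base rate is borrowed from standard BERT-style MLM training as an assumption, and `masking_probs` and `mask_cart` are hypothetical helpers, not Zepto's implementation.

```python
import random
from collections import Counter


def masking_probs(carts: list[list[str]], base_rate: float = 0.15) -> dict[str, float]:
    """Per-item masking probability, higher for rarer items."""
    counts = Counter(item for cart in carts for item in cart)
    inv = {item: 1.0 / c for item, c in counts.items()}
    mean_inv = sum(inv.values()) / len(inv)
    # An item of average rarity is masked at base_rate; cap probabilities at 1.0.
    return {item: min(1.0, base_rate * w / mean_inv) for item, w in inv.items()}


def mask_cart(cart, probs, rng, mask_token="[MASK]"):
    """Return (masked_cart, labels); labels hold the original item at masked slots."""
    masked, labels = [], []
    for item in cart:
        if rng.random() < probs[item]:
            masked.append(mask_token)
            labels.append(item)
        else:
            masked.append(item)
            labels.append(None)
    return masked, labels


carts = [["milk", "rice"], ["milk", "saffron"], ["milk", "rice"]]
probs = masking_probs(carts)
masked, labels = mask_cart(carts[1], probs, random.Random(0))
```

With this weighting the training loss sees rare items as prediction targets more often than uniform masking would, which is the effect the post attributes to the strategy.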

Source coverage: Zepto engineering blog, "Your Cart Has a Story. Here's How We Learned to Read It".

Ray Data vs Daft: an eight-workload field journal

An independent practitioner published a field journal (roughly a 14-minute read) after running eight production-like multimodal data-lake workloads side by side on Ray Data and Daft. The conclusion: Ray Data won on production stability and resilience at scale, particularly around async LLM inference, while Daft won on ergonomic native multimodal primitives and cleaner code for many operations. The author picks Ray for production deployments today and flags Daft as the better choice for teams whose workload shape matches Daft's strengths.

The two technical issues that decided it: (1) under sustained async load (think: thousands of in-flight LLM calls), Daft's error-recovery path required more manual operator intervention than Ray's, and (2) Ray's distributed shuffle and fault-tolerance primitives are more battle-tested for the long-tail failure modes the benchmark deliberately induced (slow nodes, OOMs, partial completion). Daft is younger, and the gap isn't architectural — it's ecosystem maturity. For a team starting greenfield, the answer in 2026 is still "test both on your real workload."
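To show what "error recovery under sustained async load" involves, here is a framework-agnostic sketch using only Python's standard `asyncio`: bounded concurrency, retries with exponential backoff, and a failed set handed back for reprocessing. It uses neither Ray Data's nor Daft's API and does not reproduce the benchmark; `call_with_retry`, `run_batch`, and the `flaky` stand-in for an LLM call are hypothetical.

```python
import asyncio


async def call_with_retry(fn, item, sem, max_attempts=3, base_delay=0.01):
    """Run fn(item) under a concurrency cap; return (item, result, error)."""
    async with sem:
        for attempt in range(1, max_attempts + 1):
            try:
                return item, await fn(item), None
            except Exception as exc:
                if attempt == max_attempts:
                    return item, None, exc  # give up, report instead of raising
                await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def run_batch(fn, items, max_in_flight=100):
    """Process all items; one permanent failure must not sink the batch."""
    sem = asyncio.Semaphore(max_in_flight)
    results = await asyncio.gather(*(call_with_retry(fn, it, sem) for it in items))
    ok = {item: out for item, out, err in results if err is None}
    failed = {item: err for item, out, err in results if err is not None}
    return ok, failed


calls: dict[int, int] = {}


async def flaky(x: int) -> int:
    """Stand-in for an LLM call: item 3 always fails, even items fail once."""
    calls[x] = calls.get(x, 0) + 1
    if x == 3:
        raise RuntimeError("always fails")
    if x % 2 == 0 and calls[x] == 1:
        raise TimeoutError("transient")
    return x * 10


ok, failed = asyncio.run(run_batch(flaky, list(range(6)), max_in_flight=2))
```

When testing both engines on a real workload, the useful question is how much of this bookkeeping (partial results, retry policy, requeueing failures) the framework handles and how much lands on the operator.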

Source coverage: Mehul Batra on Medium, "A field journal on Ray Data and Daft for multimodal data lake".

The Take

The data-engineering stack just had its "Kubernetes moment" — the abstraction layer (Fivetran + dbt) is consolidating, and the application layer (agents) is being built on top of it as a first-class primitive. dbt State is the part to watch: if 30% infra-cost reduction is real, that's not a feature release, it's a category re-price.

The cart-as-a-sentence post is the kind of writeup I wish more ML teams would publish. It's specific (inverse-frequency masking, temporal features, retraining cadence) and honest (Zepto says plainly that retraining cadence, not inference latency, is the bottleneck). The Ray-vs-Daft field journal is the same energy from the platform side: a real benchmark, real workloads, real failures, no vendor Y-axis manipulation.

If you build data products, the playbook for 2026 just got tighter: dbt Core v2 as the runtime, dbt State for caching, dbt Wizard for the agentic layer, and Ray Data (with a Daft pilot) for the multimodal lake underneath.

Quick Summary

Fivetran and dbt Labs closed their merger with dbt Core v2 going Apache 2.0, dbt State (in preview) claiming infra savings of more than 30%, and dbt Wizard in beta. Zepto showed that "cart as a sentence" with inverse-frequency MLM is the cart-prediction pattern to steal. Ray Data edged Daft in an eight-workload multimodal benchmark, with stability and async LLM inference as the deciding factors.

Sources

Source: TLDR Data (2026-06-04) | mr.technology — The Master Skill Index