
Every RL library I have poked at in the last 18 months (TRL and its GRPO trainer, veRL, OpenRLHF) assumes you are training a single-turn chat model on a static dataset. That is a solved-shaped problem. It is also not the problem anyone running a real agent has. The problem is: my multi-step agent takes 14 tool calls, half of them are wrong, and I want the policy to learn from rollouts of the whole trajectory, not from a labeled pair. OpenPipe ART (openpipe/art on GitHub, Apache 2.0, 10.2k stars, last commit 2 days ago) is the first framework I have seen that treats trajectory-level RL as the default and not the escape hatch. It is a thin wrapper around GRPO that drops into existing Python agent code with one model registration and a log() call. That is the whole pitch, and the benchmark numbers back it up.
We are mid-inference-time-compute backlash, but that backlash is about reasoning models, not agents. Agents are still the place where the model is the bottleneck — every team I talk to has a 70% tool-call success rate they cannot push past 85% with prompt engineering. RL is the obvious escape valve; the tooling is what stopped everyone. You have to hand-roll a rollout harness, a reward function, a vLLM inference pool, a LoRA-swap dance, and a checkpoint manager. ART pretends that loop does not exist. The reason this matters now is that W&B Training (Serverless RL) launched in Q1 and ART integrates with it, so you can run the trainer on W&B-managed GPUs without owning a single A100. That collapses the cost story from "six figures of capex plus an MLOps hire" to "a W&B API key and a Saturday."
ART has three moving parts. The client is a Python library (pip install openpipe-art) you embed in your existing agent loop; it does not care whether you are on LangGraph, raw OpenAI calls, or hand-rolled asyncio. The server is a trainer you run on a single local GPU or hand to W&B; it runs the GRPO update against a base model (Qwen 2.5/3.6, Llama 3.x, GPT-OSS are first-class). The rollout store holds trajectories until the reward function scores them. The minimal loop:
```python
import art
from art.serverless.backend import ServerlessBackend

model = art.TrainableModel(
    project="voice-agent",
    name="agent-001",
    base_model="Qwen/Qwen3.6-27B",
)
model.register(ServerlessBackend(api_key="your_wandb_api_key"))

# inside your existing agent loop, after each trajectory:
await model.log(group=trajectory_group, reward=judge(trajectory))
# train() per N trajectories; GRPO runs server-side
```

The clever bit is judge(trajectory). ART's optional RULER module is a general-purpose reward model you can drop in when you cannot hand-code a reward: it samples multiple rollouts, ranks them with a strong judge model, and uses the ranking as the GRPO signal. That is the part that makes the framework usable for the 80% of agent tasks where "the user got what they wanted" is the only honest reward.
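To make "uses the ranking as the GRPO signal" concrete, here is a minimal sketch of the group-relative advantage step at the heart of GRPO: each rollout's reward is compared with the mean of its own group and scaled by the group's standard deviation. This is a plain-Python illustration, not ART's or RULER's internal code; the function name `group_advantages`, the `eps` constant, and the example scores are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Illustrative only; `eps` is an assumed constant that avoids division by zero.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over this one group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for four rollouts of the same task.
scores = [0.9, 0.4, 0.4, 0.1]
advantages = group_advantages(scores)
```

Rollouts that beat their group's average get a positive advantage and are reinforced; rollouts below it are pushed down. Only the relative ordering and spacing inside the group matter, which is why a judge that can rank rollouts is enough to drive training.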
The proof point is ART·E: a Qwen 2.5 14B email-retrieval agent trained with ART that beats OpenAI's o3 on OpenPipe's Enron-derived email search benchmark — on accuracy, cost, and latency, against a frontier reasoning model. The training run was ~1,200 rollouts against synthetic Enron-style corpora (no real emails in training, which is the legally interesting part). The framework also ships notebooks for things you would not expect an RL library to handle cleanly: 2048, Codenames, MCP-server tool use (MCP·RL), Temporal Clue. If your task is multi-turn with a verifiable end state, ART can probably train it. The LangGraph integration is the one I have used in production: two decorators and a state-graph node. Right ergonomics.
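As a rough illustration of what a "verifiable end state" reward can look like, the sketch below scores only the final answer of a run and applies a small per-tool-call cost. Everything in it is hypothetical: `AgentRun`, `end_state_reward`, and the penalty values are made up for this example, and it is not the reward used for ART·E or any ART notebook.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Hypothetical record of one finished agent run (not an ART type)."""
    final_answer: str
    tool_calls: int


def end_state_reward(run, expected, max_calls=14, call_penalty=0.02):
    """Return 0.0 for a wrong answer; otherwise 1.0 minus a small cost per tool call.

    `max_calls` and `call_penalty` are assumed values chosen for illustration.
    """
    correct = run.final_answer.strip().lower() == expected.strip().lower()
    if not correct:
        return 0.0
    calls = min(run.tool_calls, max_calls)  # cap the penalty
    return 1.0 - call_penalty * calls


# A correct answer reached in 5 tool calls versus a wrong answer.
good = end_state_reward(AgentRun("Invoice #4471 ", 5), "invoice #4471")
bad = end_state_reward(AgentRun("Invoice #9", 3), "invoice #4471")
```

The point of the shape is that the reward needs no labels for intermediate steps: the trajectory is judged by where it ends up, and the tool-call cost nudges the policy toward shorter paths.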
Four honest weaknesses. First, GRPO is not magic on tasks with sparse or deceptive rewards: if your success metric is "the user clicked thumbs up," you will burn GPU-hours chasing noise. RULER helps but does not fix it. Second, the serverless backend ties you to W&B's inference fleet, which has its own rate-limit and queueing story you will discover at 2 a.m. on a deadline. Third, ART assumes your agent is Python. If you are on a Node/TypeScript stack (Mastra, Vercel AI SDK, LangChain.js), you are in wrapper-script hell; TypeScript clients exist but are not first-class. Fourth, the framework optimizes the model policy, not prompt or tool selection. If your agent's failures are "which tool to call," RL is the wrong hammer; use DSPy or a router.
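The first weakness has a mechanical cause that is easy to see in code: GRPO learns from reward differences within a group of rollouts, so a group where every rollout earned the same reward (for example, nobody clicked thumbs up) contributes no learning signal. The sketch below measures how many groups are informative. `has_signal` and `signal_fraction` are hypothetical helper names and the reward data is invented; nothing here is part of ART's API.

```python
def has_signal(rewards, tol=1e-9):
    """True if at least two rollouts in the group received different rewards."""
    return len(rewards) > 1 and (max(rewards) - min(rewards)) > tol


def signal_fraction(groups):
    """Fraction of rollout groups that can yield a non-zero group-relative advantage."""
    if not groups:
        return 0.0
    return sum(has_signal(g) for g in groups) / len(groups)


# Hypothetical thumbs-up rewards: 1.0 if the user clicked, else 0.0.
groups = [
    [0.0, 0.0, 0.0, 0.0],  # no clicks: nothing to learn from
    [0.0, 1.0, 0.0, 0.0],  # one click: usable signal
    [0.0, 0.0, 0.0, 0.0],  # no clicks again
    [1.0, 1.0, 1.0, 1.0],  # all clicks: also nothing to learn from
]
fraction = signal_fraction(groups)
```

If this fraction is low on a small pilot batch, most of the GPU time will go to groups that cannot move the policy, which is a cheap check to run before committing to a full training run.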
Use ART if you have a multi-step agent with a measurable end-of-trajectory reward, a Python runtime, and a few hundred dollars of GPU time to test the hypothesis. Skip it if your bottleneck is tool selection, if you are on TypeScript, or if your reward is a thumbs-up button. Try it first on a synthetic environment; the 2048 notebook is the right 30-minute smoke test. The reason ART matters in 2026 is not that GRPO is new (it dates to the DeepSeekMath paper in early 2024 and was popularized by DeepSeek-R1). It is that ART is the first wrapper that makes GRPO as easy to try as a new prompt, and the ART·E result is the evidence the abstraction does not leak on a real task. That is the part of the RL-for-agents story that has been missing.