Your agent loop blew through 14 tool calls before producing garbage and you cannot reproduce it. Drop this recorder into your loop once, get a JSONL of every step, and replay any prefix without rerunning the model.

A 12-Line Agent Trace Recorder So You Can Replay Any Step

Your agent loop just burned 14 tool calls and a dollar before producing garbage. You have nothing to show for it. No screenshot of the prompt, no copy of the tool output, no way to reproduce. The model is non-deterministic, so the bug is gone. Tomorrow it does the same thing on a different prompt and you still cannot reproduce it.

Fix: write every step to a JSONL file as it happens. One line per step. Append-only. The file is your ground truth. You can run head -n 5 on it to inspect the first five events, replay any prefix into a fresh session, or diff two runs to see where they diverged.

The Recorder

python

import json, time, uuid
from pathlib import Path
class Trace:
    def __init__(self, path: str | Path):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.f = self.path.open("a", buffering=1, encoding="utf-8")  # line-buffered, explicit encoding
        self.run_id = str(uuid.uuid4())[:8]
    def step(self, kind: str, **fields) -> None:
        self.f.write(json.dumps({
            "ts": time.time(),
            "run": self.run_id,
            "kind": kind,
            **fields,
        }, default=str) + "\n")  # default=str: an unserializable field must not crash the loop
    def close(self) -> None:
        self.f.close()

Twelve real lines. That is the whole library. The kind field is what you grep on — llm_call, tool_call, tool_result, error, final.
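
If grep is not at hand, the same filter is a few lines of Python. This is a sketch that assumes the event shape written by the recorder above; iter_events, events_of_kind, and kind_counts are hypothetical helper names.

```python
import json
from collections import Counter

def iter_events(path: str):
    """Yield one dict per non-empty line of a JSONL trace."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def events_of_kind(path: str, *kinds: str) -> list[dict]:
    """Return only the events whose kind is in `kinds`, in file order."""
    return [ev for ev in iter_events(path) if ev.get("kind") in kinds]

def kind_counts(path: str) -> Counter:
    """Count events per kind, for example to spot a run with many errors."""
    return Counter(ev.get("kind") for ev in iter_events(path))
```

For example, events_of_kind("traces/run.jsonl", "error", "final") returns just the failures and the final answers.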

Wiring It Into Any Loop

Drop a handful of trace.step calls into whatever agent loop you already have. OpenAI flavored for clarity; the shape is the same for Anthropic, Gemini, and the Responses API. TOOLS and HANDLERS stand in for your own tool schemas and handler functions.

python

import json
import openai
client = openai.OpenAI()
trace = Trace("traces/run.jsonl")
# TOOLS (tool schemas) and HANDLERS (name -> callable) are your own definitions.
messages = [{"role": "user", "content": "Plan a 3-day trip to Lisbon under $800."}]
try:
    for turn in range(8):
        resp = client.chat.completions.create(
            model="gpt-5-mini",
            messages=messages,
            tools=TOOLS,
        )
        msg = resp.choices[0].message
        # exclude_none drops null fields so the dict can be sent back as a message on replay
        trace.step("llm_call", turn=turn, content=msg.model_dump(exclude_none=True),
                   usage=resp.usage.model_dump() if resp.usage else None)
        if not msg.tool_calls:
            trace.step("final", text=msg.content)
            break
        messages.append(msg)
        for tc in msg.tool_calls:
            name, raw_args = tc.function.name, tc.function.arguments
            # Log the call before running it, with the raw argument string (it may be invalid JSON).
            trace.step("tool_call", id=tc.id, name=name, args=raw_args)
            try:
                output = str(HANDLERS[name](**json.loads(raw_args)))
            except Exception as e:
                output = f"error: {e}"
                trace.step("error", tool=name, err=str(e))
            output = output[:4000]  # cap once, so the trace matches what the model sees
            trace.step("tool_result", id=tc.id, output=output)
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": output})
finally:
    trace.close()  # runs even if the loop raises

The Gotcha

Cap tool output before you write it. A read_file call returning a 200KB CSV will balloon your trace, your replay, and your token costs when you copy it back into a fresh context. Truncate to about 4KB and, if you need the full blob, store it separately and record a pointer to it in its own event with kind=blob_ref. Replay only needs the truncated text, provided the loop sends that same truncated text to the model, because the trace then matches what the model saw on the next turn.

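Also: open the file with buffering=1 (line-buffered), not the default buffering=-1 (block-buffered). With line buffering, every completed step has already been handed to the operating system, so a crashed or killed process loses nothing. A block-buffered process can lose whatever is still in its buffer (typically up to 8KB) on SIGKILL, and the loss is always the line you needed. Line buffering does not protect against a power cut or a kernel crash; that takes an explicit fsync.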
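
A quick way to convince yourself of the line-buffering behavior is to read the file through a second handle while the writer is still open. The sketch below uses a throwaway temp file; the os.fsync call at the end is optional and only matters if you also want to survive power loss.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")

# Line-buffered text mode: each write ending in a newline is flushed to the OS.
f = open(path, "a", buffering=1, encoding="utf-8")
f.write('{"kind": "llm_call"}\n')

# A second handle already sees the line, although the writer was never
# closed or flushed by hand.
with open(path, encoding="utf-8") as reader:
    visible = reader.read()

# Optional and slower: ask the OS to push its cache to the storage device.
os.fsync(f.fileno())
f.close()
```

Calling fsync after every step costs latency, so it is usually reserved for traces you cannot afford to lose.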

The Replay

Once the trace exists, replay is a short function. Take the first N entries, project them back into a messages list, and you can feed that into a fresh model call to ask "what would have happened if I had done X here?"

python

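import json
def replay(path: str, n: int) -> list[dict]:
    """Rebuild the messages list from the first n trace events (lines, not turns).
    Assumes the file holds a single run."""
    msgs = [{"role": "user", "content": "..."}]  # original user prompt
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            ev = json.loads(line)
            if ev["kind"] == "llm_call":
                msgs.append(ev["content"])
            elif ev["kind"] == "tool_result":
                msgs.append({"role": "tool", "tool_call_id": ev["id"], "content": ev["output"]})
    return msgs
# Ask: "what if the model had refused the flaky tool call here?"
# n counts trace lines, not turns. With one tool call per turn and no errors, a turn is
# three events (llm_call, tool_call, tool_result), so n=3 replays turn 0 in full.
# Cut on a turn boundary so every tool call keeps its tool_result.
messages = replay("traces/run.jsonl", n=3)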

This is the workflow that actually scales. Do not bolt on Langfuse or Phoenix before you have this. Their dashboards are nice. The JSONL file is what you diff at 1am when a regression ships.
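
Because the recorder opens the file in append mode, one file can end up holding several runs. A sketch of a replay that selects one run by its run id and stops at a turn boundary follows; it assumes the run, turn, id, and output fields written by the loop above, and replay_until_turn is a hypothetical name.

```python
import json

def replay_until_turn(path: str, run_id: str, turn: int, prompt: str) -> list[dict]:
    """Rebuild the messages the model saw at the start of `turn` for one run."""
    msgs = [{"role": "user", "content": prompt}]
    with open(path, encoding="utf-8") as f:
        for line in f:
            ev = json.loads(line)
            if ev.get("run") != run_id:
                continue  # skip events that belong to other runs in the same file
            if ev["kind"] == "llm_call":
                if ev["turn"] >= turn:
                    break  # stop before the model's reply on the target turn
                msgs.append(ev["content"])
            elif ev["kind"] == "tool_result":
                msgs.append({"role": "tool", "tool_call_id": ev["id"],
                             "content": ev["output"]})
    return msgs
```

Stopping on the llm_call of the target turn means every earlier tool call still has its tool_result, so the list can be passed straight to a new model call.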

The Take

Three rules. (1) Record every step with a timestamp and a stable run id; str(uuid.uuid4())[:8] is fine. (2) Line-buffer the file, cap tool output, and close it in a finally block so a crash mid-loop still leaves a readable trace. (3) Never let the recorder touch the model. It is a passive observer. The day you start branching on trace state inside the loop, you have invented a debugger. Most agents never need one.

— Mr. Technology

*Tested June 2026 with openai Python SDK 1.x against gpt-5-mini and the anthropic SDK 0.5x against claude-sonnet-4-5. The recorder is pure stdlib — json, time, uuid, pathlib — no third-party deps. Trace files are line-delimited JSON (one event per line), which is what jq, grep, and less expect. For a heavier alternative, OpenTelemetry + Phoenix gives you the same data plus a UI; this is the version you ship in the first 10 minutes of debugging.*