
Your agent loop just burned 14 tool calls and a dollar before producing garbage. You have nothing to show for it. No screenshot of the prompt, no copy of the tool output, no way to reproduce. The model is non-deterministic, so the bug is gone. Tomorrow it does the same thing on a different prompt and you still cannot reproduce it.
Fix: write every step to a JSONL file as it happens. One line per step. Append-only. The file is your ground truth. You can `head -n 5` to see what the model saw on turn 3, replay the prefix into a fresh session, or diff two runs to see where they diverged.
```python
import json
import time
import uuid
from pathlib import Path


class Trace:
    def __init__(self, path: str | Path):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.f = self.path.open("a", buffering=1)  # line-buffered
        self.run_id = str(uuid.uuid4())[:8]

    def step(self, kind: str, **fields) -> None:
        # One JSON object per line: timestamp, run id, kind, then caller-supplied fields.
        self.f.write(json.dumps({
            "ts": time.time(),
            "run": self.run_id,
            "kind": kind,
            **fields,
        }) + "\n")

    def close(self) -> None:
        self.f.close()
```
Twelve real lines. That is the whole library. The `kind` field is what you grep on: `llm_call`, `tool_call`, `tool_result`, `error`, `final`.
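If you would rather filter by `kind` from Python than from the shell, a minimal sketch could look like the following. The helper names `events` and `kind_counts` are made up for illustration and are not part of the recorder; the only assumption is that each line is a JSON object with a `kind` key, as written by `Trace.step`.

```python
import json
from collections import Counter
from pathlib import Path


def events(path, kind=None):
    """Yield trace events from a JSONL file, optionally only those of one kind."""
    with Path(path).open() as f:
        for line in f:
            if not line.strip():
                continue  # tolerate a trailing blank line
            ev = json.loads(line)
            if kind is None or ev["kind"] == kind:
                yield ev


def kind_counts(path):
    """Count events per kind, e.g. to spot a run that made far too many tool calls."""
    return Counter(ev["kind"] for ev in events(path))
```

Something like `kind_counts("traces/run.jsonl")` gives a quick per-kind tally before you start reading individual lines.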
Drop a few lines into whatever agent loop you already have. OpenAI-flavored for clarity; the shape is the same for Anthropic, Gemini, and the Responses API. `TOOLS` and `HANDLERS` stand for your own tool schemas and tool functions.
```python
import json

import openai

client = openai.OpenAI()
trace = Trace("traces/run.jsonl")
messages = [{"role": "user", "content": "Plan a 3-day trip to Lisbon under $800."}]

for turn in range(8):
    resp = client.chat.completions.create(
        model="gpt-5-mini",
        messages=messages,
        tools=TOOLS,  # your tool schemas
    )
    msg = resp.choices[0].message
    trace.step("llm_call", turn=turn, content=msg.model_dump(),
               usage=resp.usage.model_dump() if resp.usage else None)

    if not msg.tool_calls:
        trace.step("final", text=msg.content)
        break

    messages.append(msg)
    for tc in msg.tool_calls:
        try:
            args = json.loads(tc.function.arguments)
            trace.step("tool_call", name=tc.function.name, args=args)
            # HANDLERS maps tool name -> function; handlers are assumed to return str
            output = HANDLERS[tc.function.name](**args)
        except Exception as e:
            output = f"error: {e}"
            trace.step("error", tool=tc.function.name, err=str(e))
        output = output[:4000]  # cap huge blobs; the model sees the same capped text
        trace.step("tool_result", id=tc.id, output=output)
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": output})

trace.close()
```
Cap tool output before you write it. A `read_file` returning a 200KB CSV will balloon your trace, your replay, and your token costs when you copy it back into a fresh context. Truncate to ~4KB and store the full blob under a separate `kind=blob_ref` entry if you need it. The replay only needs the first 4KB anyway, provided the same capped text is what went into `messages`: then that is exactly what the model saw on the next turn.
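One way the cap-plus-reference idea might be sketched is below. Everything here is an assumption for illustration: the helper name `cap_output`, the `traces/blobs` directory, the content-hash filename, and reading "~4KB" as 4000 characters rather than bytes.

```python
import hashlib
from pathlib import Path

LIMIT = 4000  # characters kept inline (assumed stand-in for "~4KB")


def cap_output(output: str, blob_dir="traces/blobs", limit=LIMIT):
    """Return (capped_text, blob_path_or_None); the full text is saved only when truncated."""
    if len(output) <= limit:
        return output, None
    # Name the blob by content hash so identical outputs are stored once.
    digest = hashlib.sha256(output.encode("utf-8")).hexdigest()[:16]
    blob_path = Path(blob_dir) / f"{digest}.txt"
    blob_path.parent.mkdir(parents=True, exist_ok=True)
    blob_path.write_text(output, encoding="utf-8")
    return output[:limit], str(blob_path)


# Possible use inside the tool loop (trace and tc come from the surrounding agent code):
#   capped, ref = cap_output(output)
#   trace.step("tool_result", id=tc.id, output=capped)
#   if ref:
#       trace.step("blob_ref", id=tc.id, path=ref, size=len(output))
```

Keeping the blob outside the JSONL file means the trace stays small enough to `grep` and diff, while the full payload is still one path lookup away.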
Also: open the file with `buffering=1` (line-buffered), not `buffering=-1` (the default, block-buffered). If the process crashes, every completed step has already been handed to the OS, so nothing is lost. A block-buffered process loses up to a buffer's worth (typically 8KB) on SIGKILL, and the loss is always the line you needed.
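To see the buffering difference for yourself, a small experiment along these lines may help. The function name `visible_before_close` is hypothetical; it writes one short line and checks whether a second reader can see it while the writer is still open.

```python
import os
import tempfile


def visible_before_close(buffering: int) -> bool:
    """Write one short line, then check whether a second reader sees it before close()."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    os.close(fd)
    f = open(path, "a", buffering=buffering)
    try:
        f.write('{"kind": "llm_call"}\n')
        with open(path) as reader:
            return reader.read() != ""
    finally:
        f.close()
        os.remove(path)
```

On CPython, `visible_before_close(1)` should report the line as visible, while `visible_before_close(-1)` should not, because the short write is still sitting in the process's buffer. That unwritten buffer is what a SIGKILL throws away.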
Once the trace exists, replay is a one-liner. Take the first N entries, project the messages list, and you can feed that into a fresh model call to ask "what would have happened if I had done X here?"
```python
import json


def replay(path: str, n: int) -> list[dict]:
    msgs = [{"role": "user", "content": "..."}]  # original user prompt
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            ev = json.loads(line)
            if ev["kind"] == "llm_call":
                msgs.append(ev["content"])
            elif ev["kind"] == "tool_result":
                msgs.append({"role": "tool", "tool_call_id": ev["id"], "content": ev["output"]})
    return msgs


messages = replay("traces/run.jsonl", n=3)
```
This is the workflow that actually scales. Do not bolt on Langfuse or Phoenix before you have this. Their dashboards are nice. The JSONL file is what you diff at 1am when a regression ships.
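For the 1am diff, a rough sketch of "where did two runs diverge" might look like this. It assumes one run per trace file and compares only `tool_call` and `final` events by default, since `ts`, `run`, and provider-generated ids differ on every run. The names `steps`, `first_divergence`, and `VOLATILE` are illustrative, not part of the recorder.

```python
import json
from itertools import zip_longest

VOLATILE = {"ts", "run", "id"}  # fields that differ on every run


def steps(path, kinds=("tool_call", "final")):
    """Load events of the given kinds, dropping fields that change between runs."""
    out = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            ev = json.loads(line)
            if ev["kind"] in kinds:
                out.append({k: v for k, v in ev.items() if k not in VOLATILE})
    return out


def first_divergence(path_a, path_b, kinds=("tool_call", "final")):
    """Return (index, step_a, step_b) for the first differing step, or None if none differ."""
    a, b = steps(path_a, kinds), steps(path_b, kinds)
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:  # a missing step on one side shows up as None
            return i, x, y
    return None
```

The index counts only the compared kinds, so index 1 means "the second tool call", not the second line of the file.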
Three rules. (1) Record every step with a timestamp and a stable run id; `str(uuid4())[:8]` is fine. (2) Line-buffer the file, cap tool output, and close it in a `finally` block so a crash mid-loop still leaves a readable trace. (3) Never let the recorder touch the model: it is a passive observer. The day you start branching on trace state inside the loop, you have invented a debugger. Most agents never need one.
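Rule (2) can be packaged so that the `finally` is impossible to forget. The sketch below is one possible shape, not the recorder itself: `traced` is a hypothetical context manager that mirrors the fields `Trace.step` writes, records an uncaught exception as an `error` event, and closes the file either way.

```python
import json
import time
import uuid
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def traced(path):
    """Yield a step(kind, **fields) function; close the file even if the loop raises."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    f = p.open("a", buffering=1)  # line-buffered
    run_id = str(uuid.uuid4())[:8]

    def step(kind, **fields):
        f.write(json.dumps({"ts": time.time(), "run": run_id, "kind": kind, **fields}) + "\n")

    try:
        yield step
    except Exception as e:
        step("error", err=repr(e))  # record the crash itself, then re-raise
        raise
    finally:
        f.close()


# Possible use:
#   with traced("traces/run.jsonl") as step:
#       step("llm_call", turn=0)
```

The recorder stays passive here: it writes what happened and re-raises, and never decides what the loop does next.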
— Mr. Technology
*Tested June 2026 with openai Python SDK 1.x against gpt-5-mini and the anthropic SDK 0.5x against claude-sonnet-4-5. The recorder is pure stdlib — json, time, uuid, pathlib — no third-party deps. Trace files are line-delimited JSON (one event per line), which is what jq, grep, and less expect. For a heavier alternative, OpenTelemetry + Phoenix gives you the same data plus a UI; this is the version you ship in the first 10 minutes of debugging.*