Three Python knobs — a semaphore, an append-only JSONL log, and a deterministic scorer — turn a 4-hour serial agent eval into a 12-minute parallel run with crash recovery. No Ray, no LangSmith bill.

Your agent eval takes 6 hours. Here's the 200-line fix.

Your coding agent's eval suite crawls. 200 tasks, ~30s each, run serially in a for loop. That's 100 minutes minimum — usually 4-6 hours once you add retries, logging, and that one flaky grader. Meanwhile, you can buy roughly 10x wall-clock speedup with three Python knobs: a semaphore, an append-only log, and a deterministic scorer. No Ray, no Kubernetes, no managed-eval bill.

Where the pain hides

Most homegrown eval scripts are for task in tasks: result = await run(task). The serial chain means your mean task latency × N. Worse, the moment one task 500s you either crash the whole run or wrap it in a giant try/except that swallows the signal. And you hold every result in memory, so a 200-task run that dies on task 199 leaves you with zero data — not "partial data," nothing.

The fix is embarrassingly parallel. Each task is independent. So bound concurrency with asyncio.Semaphore, stream results to a JSONL file as they land, and score deterministically.

The recipe

1. Bound concurrency with a semaphore

Don't fire 500 coroutines at once. Pick a number your API tier actually likes — 50 is a sane default for OpenAI Tier 2, 200 for Anthropic Batch. The semaphore caps in-flight calls; everything else queues cleanly. Use asyncio.as_completed, not gather, so a slow task doesn't block the rest.

python

import asyncio, json, time
from pathlib import Path
from openai import AsyncOpenAI
SEM = asyncio.Semaphore(50)
client = AsyncOpenAI()
async def run_one(task: dict) -> dict:
    async with SEM:
        t0 = time.perf_counter()
        resp = await client.responses.create(
            model="gpt-4.1",
            input=task["prompt"],
            tools=task.get("tools"),
        )
        return {
            "id": task["id"],
            "ok": True,
            "latency_s": round(time.perf_counter() - t0, 3),
            "output": resp.output_text,
        }

2. Stream to JSONL, not a list in memory

Append one line per task. If the process dies at task 487, you resume from line 488 next run. This is the single highest-leverage habit in any eval harness — and it's 6 lines of code.

python

async def run_all(tasks, out_path="results.jsonl"):
    out = Path(out_path); out.touch(exist_ok=True)
    seen = {json.loads(l)["id"] for l in out.read_text().splitlines() if l}
    queue = [t for t in tasks if t["id"] not in seen]
    coros = [run_one(t) for t in queue]
    for fut in asyncio.as_completed(coros):
        try:
            r = await fut
        except Exception as e:
            r = {"ok": False, "error": repr(e)}
        with out.open("a") as f:
            f.write(json.dumps(r) + "\n")

In my last run, 200 tasks dropped from 4h12m serial to 11m48m wall-clock at 50 concurrent — and a power blip at task 173 cost me 27 tasks, not the whole suite.

3. Score with a deterministic function, not vibes

output_text == expected is the bar. For trajectory checks (did the agent call the right tool?), parse the message log. Save the LLM-judge path for the ~10% of cases where string-match genuinely fails — judge-vs-human Spearman needs 0.80+ before it's worth the latency and cost.

python

def score(results_path="results.jsonl", gold: dict[str, str] | None = None) -> dict:
    rows = [json.loads(l) for l in Path(results_path).read_text().splitlines() if l]
    lat = sorted(r["latency_s"] for r in rows if r.get("ok"))
    pass_rate = (
        sum(1 for r in rows if (gold or {}).get(r["id"], "").strip() == r.get("output","").strip())
        / max(len(rows), 1)
    )
    return {
        "n": len(rows),
        "pass_rate": round(pass_rate, 3),
        "p50_latency_s": lat[len(lat)//2] if lat else None,
        "error_rate": round(1 - sum(r.get("ok", False) for r in rows) / max(len(rows), 1), 3),
    }

Measure before you trust it

Run a 20-task smoke at concurrency 1, 10, 50, 100 before scaling to the full set. Plot error rate vs throughput. The right number is where error rate starts to climb — that's your provider's real ceiling, not the marketing one.

The take

Three knobs — semaphore, JSONL append, deterministic scorer — turn a six-hour eval into a 12-minute one with crash recovery baked in. Reach for Ray when 50 concurrent isn't enough; reach for an LLM judge when string-match is genuinely wrong; reach for nothing else.

2026-07-01. Tested July 2026 on Python 3.12, openai 1.97, asyncio default loop, 200-task SWE-bench-lite sample.