
Your coding agent's eval suite crawls. 200 tasks, ~30s each, run serially in a for loop. That's 100 minutes minimum, and usually 4-6 hours once you add retries, logging, and that one flaky grader. Meanwhile, you can buy at least a 10x wall-clock speedup with three Python knobs: a semaphore, an append-only log, and a deterministic scorer. No Ray, no Kubernetes, no managed-eval bill.
Most homegrown eval scripts boil down to for task in tasks: result = await run(task). Run serially, your wall-clock time is mean task latency × N. Worse, the moment one task returns a 500 you either crash the whole run or wrap it in a giant try/except that swallows the signal. And because every result sits in memory until the end, a 200-task run that dies on task 199 leaves you with zero data: not partial data, nothing.
The workload is embarrassingly parallel: each task is independent. So bound concurrency with asyncio.Semaphore, stream results to a JSONL file as they land, and score deterministically.
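As a rough sanity check on the arithmetic, here is a minimal sketch of the idealized wall-clock model: tasks run in waves of at most `concurrency`, and every task takes the same time. The function name `estimate_wall_clock_s` and the uniform-latency assumption are illustrative only; real runs add variance, retries, and rate limits.

```python
import math


def estimate_wall_clock_s(n_tasks: int, task_latency_s: float, concurrency: int = 1) -> float:
    """Idealized wall-clock time, assuming uniform latency and full waves of work."""
    if n_tasks < 0 or concurrency < 1:
        raise ValueError("n_tasks must be >= 0 and concurrency must be >= 1")
    waves = math.ceil(n_tasks / concurrency)  # how many rounds of in-flight calls
    return waves * task_latency_s


serial = estimate_wall_clock_s(200, 30.0)       # 200 waves of one task each
bounded = estimate_wall_clock_s(200, 30.0, 50)  # 4 waves of 50 tasks each
print(f"serial: {serial / 60:.0f} min, 50 concurrent: {bounded / 60:.0f} min")
# prints "serial: 100 min, 50 concurrent: 2 min"
```

Treat the bounded figure as a best case. Stragglers, rate limiting, and retries all pull a measured run away from it.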
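Don't fire 500 API calls at once. Pick a number your API tier actually likes: 50 is a sane starting point for OpenAI Tier 2, and any figure should be checked against your account's published rate limits before you trust it. The semaphore caps in-flight calls; everything else queues cleanly. Use asyncio.as_completed, not gather, so each result can be written the moment it finishes instead of waiting for the slowest task in the batch.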

import asyncio, json, time
from pathlib import Path
from openai import AsyncOpenAI

SEM = asyncio.Semaphore(50)  # caps the number of in-flight API calls
client = AsyncOpenAI()       # reads OPENAI_API_KEY from the environment

async def run_one(task: dict) -> dict:
    async with SEM:
        t0 = time.perf_counter()
        # Pass tools only when the task defines them, instead of sending an explicit None.
        kwargs = {"tools": task["tools"]} if task.get("tools") else {}
        resp = await client.responses.create(
            model="gpt-4.1",
            input=task["prompt"],
            **kwargs,
        )
        return {
            "id": task["id"],
            "ok": True,
            "latency_s": round(time.perf_counter() - t0, 3),
            "output": resp.output_text,
        }

Append one line per task. If the process dies partway through, the next run skips every task id already in the file and runs only the rest. This is the single highest-leverage habit in any eval harness, and it takes about a dozen lines of code.
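
async def run_all(tasks, out_path="results.jsonl"):
    out = Path(out_path)
    out.touch(exist_ok=True)
    # Resume: skip every task whose id is already in the log.
    # Failed rows count as seen too; delete them from the file to retry them.
    seen = {json.loads(l).get("id") for l in out.read_text().splitlines() if l}
    queue = [t for t in tasks if t["id"] not in seen]

    async def guarded(t):
        # Record failures with their task id, so the log stays resumable and scorable.
        try:
            return await run_one(t)
        except Exception as e:
            return {"id": t["id"], "ok": False, "error": repr(e)}

    with out.open("a") as f:
        for fut in asyncio.as_completed([guarded(t) for t in queue]):
            f.write(json.dumps(await fut) + "\n")
            f.flush()  # hand each line to the OS as soon as the result lands

In my last run, 200 tasks dropped from 4h12m serial to 11m48s wall-clock at 50 concurrent, and a power blip at task 173 cost me 27 tasks, not the whole suite.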
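To try the pattern without an API key, the sketch below swaps the model call for asyncio.sleep and checks two properties: the semaphore never lets more than the limit run at once, and a second run against the same log does no work. The names `fake_run_one`, `run_all_demo`, and `demo` are hypothetical stand-ins, and the pre-seeded log simulates an earlier run that died after five tasks.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def fake_run_one(task: dict, sem: asyncio.Semaphore, stats: dict) -> dict:
    """Stand-in for a model call: sleeps instead of hitting an API."""
    async with sem:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        try:
            await asyncio.sleep(task["sleep_s"])
        finally:
            stats["in_flight"] -= 1
    return {"id": task["id"], "ok": True}


async def run_all_demo(tasks: list[dict], out: Path, limit: int) -> dict:
    """Run tasks not yet recorded in `out`, appending one JSON line per finished task."""
    out.touch(exist_ok=True)
    seen = {json.loads(line)["id"] for line in out.read_text().splitlines() if line}
    todo = [t for t in tasks if t["id"] not in seen]
    sem = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0, "ran": len(todo)}
    with out.open("a") as f:
        for fut in asyncio.as_completed([fake_run_one(t, sem, stats) for t in todo]):
            f.write(json.dumps(await fut) + "\n")
            f.flush()
    return stats


def demo() -> tuple[dict, dict]:
    tasks = [{"id": f"t{i}", "sleep_s": 0.01} for i in range(20)]
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "results.jsonl"
        # Pretend an earlier run finished the first 5 tasks before it died.
        out.write_text("".join(json.dumps({"id": f"t{i}", "ok": True}) + "\n" for i in range(5)))
        first = asyncio.run(run_all_demo(tasks, out, limit=4))   # runs the remaining 15
        second = asyncio.run(run_all_demo(tasks, out, limit=4))  # nothing left to run
    return first, second


if __name__ == "__main__":
    first, second = demo()
    print("first run:", first["ran"], "tasks, peak in-flight", first["peak"])
    print("second run:", second["ran"], "tasks")
```

The first call should run only the 15 unrecorded tasks with at most 4 in flight, and the second should find every id already logged.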
Exact match, output_text == expected, is the bar. For trajectory checks (did the agent call the right tool?), parse the message log. Save the LLM-judge path for the ~10% of cases where string match genuinely fails; judge-vs-human Spearman correlation needs to reach 0.80+ before a judge is worth the latency and cost.

def score(results_path="results.jsonl", gold: dict[str, str] | None = None) -> dict:
    gold = gold or {}
    rows = [json.loads(l) for l in Path(results_path).read_text().splitlines() if l]
    lat = sorted(r["latency_s"] for r in rows if r.get("ok"))
    # A row passes only if it succeeded, has a gold answer, and matches it exactly
    # (ignoring surrounding whitespace). Errors and missing gold never count as passes.
    passed = sum(
        1 for r in rows
        if r.get("ok") and r.get("id") in gold
        and gold[r["id"]].strip() == r.get("output", "").strip()
    )
    n = max(len(rows), 1)
    return {
        "n": len(rows),
        "pass_rate": round(passed / n, 3),
        "p50_latency_s": lat[len(lat) // 2] if lat else None,
        "error_rate": round(1 - sum(bool(r.get("ok")) for r in rows) / n, 3),
    }

Run a 20-task smoke test at concurrency 1, 10, 50, and 100 before scaling to the full set. Plot error rate against throughput. The right number sits just below the point where error rate starts to climb; that's your provider's real ceiling, not the marketing one.
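One way to turn the smoke-test sweep into a decision is to pick the highest-throughput level whose error rate stays inside a budget. The sketch below assumes you have already measured each level; `SweepPoint`, `pick_concurrency`, the 2% default budget, and the sample numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepPoint:
    concurrency: int
    tasks_per_min: float
    error_rate: float


def pick_concurrency(points: list[SweepPoint], max_error_rate: float = 0.02) -> int:
    """Return the highest-throughput concurrency level whose error rate is within budget."""
    within_budget = [p for p in points if p.error_rate <= max_error_rate]
    if not within_budget:
        raise ValueError("every level exceeded the error budget; try lower concurrency")
    return max(within_budget, key=lambda p: p.tasks_per_min).concurrency


# Made-up numbers purely for illustration, not measurements.
sweep = [
    SweepPoint(concurrency=1, tasks_per_min=2.0, error_rate=0.00),
    SweepPoint(concurrency=10, tasks_per_min=19.0, error_rate=0.00),
    SweepPoint(concurrency=50, tasks_per_min=80.0, error_rate=0.01),
    SweepPoint(concurrency=100, tasks_per_min=95.0, error_rate=0.15),
]
print(pick_concurrency(sweep))  # prints 50
```

In this invented sweep, 100 concurrent has the best raw throughput but a 15% error rate, so the budget rules it out and 50 wins.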
Three knobs (semaphore, JSONL append, deterministic scorer) turn a multi-hour eval into a 12-minute one with crash recovery baked in. Reach for Ray when 50 concurrent isn't enough; reach for an LLM judge when string match is genuinely wrong; reach for nothing else.
2026-07-01. Tested July 2026 on Python 3.12, openai 1.97, asyncio default loop, 200-task SWE-bench-lite sample.