
**TL;DR** - New agent reliability scoring framework uses OpenTelemetry traces to measure AI agent output quality at scale.
1. Instrument agentic workflows with OpenTelemetry spans - you cannot score what you cannot observe
2. Define reliability as composite of: task completion rate, recovery rate, and output variance over time
3. Use trace divergence as proxy for software slop - high divergence from expected execution paths indicates problems
**Example Prompt:**
Design an OpenTelemetry-based scoring system for an AI customer support agent handling tier-1 tickets.
| Pros | Cons |
|---|---|
| OpenTelemetry-based scoring operationally clean | Requires instrumentation investment upfront |
| Composite scoring captures what accuracy alone misses | Scoring criteria domain-specific and political |
|---|---|
| Trace divergence as slop detection novel and useful | Slop detection thresholds hard to tune |