Arize Phoenix Is the Only LLM Observability Tool That Picked the Right Primitive, and the Rest of the Stack Should Have Done the Same

I have deployed four LLM observability stacks in production this year. Three were Langfuse clones with a different logo. The fourth is Arize Phoenix, and the reason it does not feel like the others is the only decision that matters in this category: it is built on OpenTelemetry.

Phoenix is the open-source AI observability platform from Arize AI: 9,000+ GitHub stars, released under the Elastic License 2.0 (the companion OpenInference instrumentation is Apache 2.0). It ships a Python client, a TypeScript client, an OTel wrapper (arize-phoenix-otel), an evals package, and a self-hostable server you can run on a laptop, in Docker, or in Kubernetes. It is the only LLM observability project I have used that does not invent its own telemetry wire format.

The OTel Decision Is the Whole Product

Most LLM observability tools ship their own SDK, their own span schema, their own collector, and their own way to back-propagate trace_id into your application. You instrument with their decorator, push spans to their cloud, and hope the data model survives a real agent.

Phoenix took a different path. It is OpenTelemetry-native. Your spans are OTel spans. Ingestion is OTLP, so an OTel collector you already run can forward to it. The semantic conventions come from the OpenInference project, which defines span attributes for LLM calls, retrievals, embeddings, and tool calls that sit alongside OTel's own gen_ai.* attributes. The UI renders OTel traces with the LLM-specific bits highlighted. If you already run an OTel pipeline for your service mesh, your LLM traces live next to your HTTP traces, your database queries, and your gRPC calls.
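To make "your spans are OTel spans" concrete, here is a minimal sketch of the flat key/value attributes such spans might carry. The attribute names follow the OpenInference conventions as I understand them, so check the spec before relying on them. The two helper functions and the sample values are hypothetical, and the code is standard-library only so it runs without Phoenix installed.

```python
from typing import Any, Dict, List, Tuple


def llm_span_attributes(
    model: str,
    prompt: str,
    completion: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Dict[str, Any]:
    """Flatten one LLM call into span attributes (hypothetical helper)."""
    return {
        "openinference.span.kind": "LLM",
        "llm.model_name": model,
        "input.value": prompt,
        "output.value": completion,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }


def retriever_span_attributes(
    query: str,
    documents: List[Tuple[str, float]],
) -> Dict[str, Any]:
    """Flatten one retrieval step into span attributes (hypothetical helper)."""
    attrs: Dict[str, Any] = {
        "openinference.span.kind": "RETRIEVER",
        "input.value": query,
    }
    # Span attribute values are primitives, so each (content, score) pair
    # is flattened into indexed keys instead of a nested structure.
    for i, (content, score) in enumerate(documents):
        attrs[f"retrieval.documents.{i}.document.content"] = content
        attrs[f"retrieval.documents.{i}.document.score"] = score
    return attrs
```

The point of the flat shape is that any OTel backend can store these spans as-is; an LLM-aware UI just knows which keys to highlight.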

This is the right abstraction. Tracing is a solved problem. The interesting problem in LLM observability is the LLM-specific signals — token counts, retrieval relevance, prompt-version drift, evaluation regressions. Phoenix treats transport as someone else's job and spends its complexity budget on the LLM layer. Almost every competitor gets the priority order backwards.

The Pieces, Quickly

arize-phoenix-otel is a thin wrapper around OTel primitives that wires up Phoenix's exporter and the OpenInference conventions:

```python
from phoenix.otel import register

tracer_provider = register(
    project_name="support-agent",
    endpoint="http://localhost:6006/v1/traces",
)
```

Auto-instrumentation exists for OpenAI, Anthropic, Google GenAI, AWS Bedrock, LiteLLM, OpenRouter, LangGraph, LlamaIndex, DSPy, Vercel AI SDK, Mastra, and CrewAI. For a small codebase that talks to one model, end-to-end traces take under five minutes.

arize-phoenix-evals ships pre-built evaluators for hallucination, retrieval relevance, answer relevance, toxicity, summarization faithfulness, and code generation. Each evaluator is an LLM-as-judge prompt template with a structured output schema. Rank metrics such as NDCG@10 and MRR are not judge prompts themselves; they are computed from the judged per-document relevance labels. Run the evaluators on a dataset, get a score per row, compare runs in the UI.
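For readers who have not computed the retrieval metrics by hand, here is a small sketch of MRR and NDCG@k over relevance labels. This is not Phoenix's implementation. It is the textbook definition with linear gain, with the ideal ranking taken from the retrieved list only, and every function name is my own.

```python
import math
from typing import List, Sequence


def reciprocal_rank(relevance: Sequence[float]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: List[Sequence[float]]) -> float:
    """Average reciprocal rank across queries."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def dcg_at_k(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevance[:k], start=1)
    )


def ndcg_at_k(relevance: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0
```

Feed these the 0/1 labels a relevance judge produced for each retrieved document, in retrieved order, and you get one number per query that is comparable across runs.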

Datasets and Experiments are the third leg. Create a versioned dataset of input/output pairs, run an experiment — same prompts, new model, new retrieval config — and you get side-by-side eval scores. This is the only observability UI I have used where "I changed the prompt, did it get better?" takes less than ten seconds to answer.
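The comparison behind "did it get better?" is simple enough to sketch. The snippet below assumes each run is a mapping from a dataset example id to an eval score. The `compare_runs` function, the `Comparison` type, and the `tolerance` parameter are illustrative names of mine, not part of the Phoenix API.

```python
from statistics import mean
from typing import Dict, List, NamedTuple


class Comparison(NamedTuple):
    baseline_mean: float
    candidate_mean: float
    delta: float              # candidate_mean - baseline_mean
    regressed_ids: List[str]  # examples that scored worse in the candidate run


def compare_runs(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Comparison:
    """Compare two experiment runs on the examples they have in common."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no example ids")
    base_mean = mean(baseline[i] for i in shared)
    cand_mean = mean(candidate[i] for i in shared)
    # A row counts as regressed only if it drops by more than the tolerance.
    regressed = [i for i in shared if candidate[i] < baseline[i] - tolerance]
    return Comparison(base_mean, cand_mean, cand_mean - base_mean, regressed)
```

Comparing only the shared ids matters: a dataset version that gains rows between runs would otherwise shift the mean for reasons unrelated to the prompt change.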

What Phoenix Is Not

It is not a prompt playground. The Playground exists, but it is not the centerpiece. The trace is, and the trace is the right centerpiece.

It is not a model router. It does not pick a model for you. It tells you which model is on fire right now. Big difference.

It is not a hosted-only product. Phoenix runs on a MacBook. It runs in a 2-CPU container. The cloud product (app.phoenix.arize.com) exists for teams that want managed infra, but the open-source project is not a "connect to our cloud" trap. The OTel-native design is exactly why the self-host story works — you are extending the telemetry stack you already have, not running a foreign one.

The Comparison You Will Make

You will compare Phoenix to Langfuse. Both are open source. Both self-host. Both have evals. The architectural difference is OTel. Langfuse has its own SDK, its own span model, its own event ingestion API. It works fine. It is also a parallel telemetry stack you maintain alongside the OTel one your service mesh already runs.

In a Kubernetes shop past 30 engineers, Phoenix is the lower-friction choice. In a startup with a single FastAPI app, the Langfuse SDK is simpler to drop in. Both are fine. Phoenix is the right choice for the long-term trajectory of the field, because OTel is the trajectory. Adopting a Langfuse-style SDK means betting that your LLM observability vendor will be the one that lasts. Phoenix bets on OTel winning, which is a much smaller bet.

The Take

LLM observability is the most over-marketed category in AI infrastructure right now. Half the tools in this space are React dashboards over a Postgres table of prompt, response, latency_ms, cost_usd. Phoenix is not that. It picked the only primitive in observability that has already won — OpenTelemetry — and built the LLM-specific layer on top. The evals are good. The dataset/experiment loop is the only one in open source I would trust to drive a real CI gate. The self-host story is real.

If you are picking an LLM observability stack today and have not evaluated Phoenix, you are probably overpaying for a more opinionated product. 9,000+ stars, Elastic License 2.0, OTel-native, and the only observability tool I shipped in 2026 that did not eventually get replaced by something simpler. That last sentence is the only one that matters.

— Mr. Technology

*Arize Phoenix: github.com/Arize-ai/phoenix. Elastic License 2.0, 9,000+ GitHub stars. OTel-native, OpenInference semantic conventions, pre-built evals for hallucination, retrieval relevance, and toxicity. Self-host via Docker or pip install arize-phoenix. Cloud: app.phoenix.arize.com. OTel wrapper: pypi.org/project/arize-phoenix-otel.*