2026-06-05

Langfuse @observe + Custom Spans: How to Actually Trace Multi-Step Agents

Auto-instrumentation gives you one span per LLM call, which is useless for multi-step agents. Here is the @observe plus custom-tool-span pattern that makes Langfuse traces actually debuggable, with the asyncio gotcha the docs bury.

After you ship the Langfuse cost dashboard, the next problem shows up by Wednesday: a 12-step agent fails on a user's run and the trace shows one observation with no way to find the broken step. By the end of this you will wrap any Python agent in @observe, emit a custom span per tool call, and score terminal outcomes so failures are filterable.

1. Install the SDK and the OTel Exporter (2 min)

```bash
pip install langfuse openai openinference-instrumentation-openai
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="http://localhost:3000"  # or your self-hosted URL
```

The openinference-instrumentation-openai package is the part most people skip, and it is the only reason LLM calls show up automatically. It does nothing until you call OpenAIInstrumentor().instrument() once at startup. Without it, @observe gives you a parent span with no LLM telemetry inside it.

2. Wrap Your Agent with @observe and Decorate Tool Calls (5 min)

```python
import json

from langfuse import observe, get_client
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor

client = OpenAI()
langfuse = get_client()
OpenAIInstrumentor().instrument()  # switch on auto-tracing for OpenAI calls

# TOOL_SCHEMAS and dispatch() are assumed to be defined elsewhere in your agent.


@observe(name="agent.run")
def run_agent(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    for step in range(12):
        # Every LLM call is auto-traced by openinference
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOL_SCHEMAS
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for call in msg.tool_calls:
            # Custom span: this is the bit the docs bury
            with langfuse.start_as_current_observation(
                name="tool.call",
                as_type="span",
                input={"tool": call.function.name, "args": call.function.arguments},
            ) as span:
                result = dispatch(call.function.name, json.loads(call.function.arguments))
                span.update(output=result)
            # Feed the tool result back to the model for the next step
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
            )
    # Fail loudly instead of silently returning None after the last step
    raise RuntimeError("agent did not finish within 12 steps")
```

Two things to notice. First, @observe decorates the outer function only; openinference auto-instruments the OpenAI client and emits LLM spans as children. Second, langfuse.start_as_current_observation is the API for custom spans. Passing as_type="span" is what makes each tool call show up as its own span in the waterfall UI. Use as_type="generation" only for actual model calls.
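The agent loop relies on two names it never defines: TOOL_SCHEMAS and dispatch. Here is a minimal sketch of both, assuming a single hypothetical cancel_order tool. The tool, its schema, and the TOOLS registry are illustrative and not part of Langfuse or the OpenAI SDK.

```python
from typing import Any, Callable


def cancel_order(order_id: str) -> dict[str, Any]:
    """Hypothetical tool: pretends to cancel an order."""
    return {"order_id": order_id, "status": "cancelled"}


# Hypothetical registry mapping tool names to Python callables.
TOOLS: dict[str, Callable[..., Any]] = {"cancel_order": cancel_order}

# Tool schema in the shape the chat completions `tools` parameter expects.
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "cancel_order",
            "description": "Cancel an order by its ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]


def dispatch(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Run a tool by name and return errors as data instead of raising."""
    tool = TOOLS.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return tool(**args)
    except Exception as exc:
        return {"error": f"{type(exc).__name__}: {exc}"}
```

Returning errors as a dict is a design choice for this sketch: the failure ends up in the tool.call span's output and in the tool message sent back to the model, so a broken step is visible in the trace instead of aborting the run.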

3. Score the Final Outcome (1 min)

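```python
with langfuse.start_as_current_observation(name="request", as_type="span"):
    run_agent("cancel my order #4471")
    # Score while a span is still active; boolean scores take 1 or 0
    langfuse.score_current_span(name="success", value=1, data_type="BOOLEAN")

# Or attach a user-feedback score from a thumbs-up event in the UI
```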

Scores are how you stop reading traces by hand. Set a success score on every terminal span and filter the trace list for score=success=false to find the failures in seconds.
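A success score is only useful for filtering if it reflects what actually happened. The sketch below derives the value from the run outcome. The run_and_grade helper and its rule (an exception or an empty answer counts as failure) are assumptions for illustration, not a Langfuse API.

```python
from typing import Callable, Optional, Tuple


def run_and_grade(
    agent: Callable[[str], Optional[str]], user_input: str
) -> Tuple[Optional[str], int, str]:
    """Run the agent and return (answer, success, comment).

    success is 1 or 0 so it can be recorded as a boolean score.
    """
    try:
        answer = agent(user_input)
    except Exception as exc:
        # A crash is a failure; keep the reason for the score comment.
        return None, 0, f"{type(exc).__name__}: {exc}"
    if not answer:
        return answer, 0, "empty answer"
    return answer, 1, ""


# Inside an active span you would then record the result, for example:
# langfuse.score_current_span(name="success", value=success, comment=comment)
```

Keeping the grading rule in one function means every agent in the repo scores failures the same way, which is what makes a single filter on the success score trustworthy.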

What the Docs Don't Tell You

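The decorator emits a span per call, but it finds its parent through the OpenTelemetry context, which in Python is stored in contextvars. An asyncio task copies that context at the moment it is created, so calls launched with asyncio.gather or anyio.create_task_group nest correctly only when the tasks are created while the parent span is active. Work handed to a thread (ThreadPoolExecutor, loop.run_in_executor) or to a task created before the span opened does not inherit the context, and its spans land under the wrong parent or in a separate trace. Fix: capture the context in the parent with opentelemetry.context.get_current() and attach it inside the worker with opentelemetry.context.attach(), detaching when the work is done. LANGFUSE_TRACING_ENABLED only switches tracing on or off; it does not repair parenting. This bug costs a day to find because the UI looks fine; it just has the wrong parents.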
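The propagation rule can be checked with the standard library alone. In this sketch a plain ContextVar named current_span stands in for the tracer's "current span" slot; it is a hypothetical placeholder, not a Langfuse or OpenTelemetry object. It shows which concurrent calls still see the parent and which lose it.

```python
import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the tracer's "current span" slot.
current_span = contextvars.ContextVar("current_span", default="<no parent>")


def read_parent() -> str:
    return current_span.get()


async def read_parent_async() -> str:
    return current_span.get()


async def demo() -> dict:
    current_span.set("agent.run")  # pretend the parent span was just opened
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # gather wraps each coroutine in a task, which copies the current context.
        in_tasks = await asyncio.gather(read_parent_async(), read_parent_async())
        # A plain executor thread does not receive the caller's context.
        in_thread = await loop.run_in_executor(pool, read_parent)
        # Copying the context and running inside it carries the parent across.
        ctx = contextvars.copy_context()
        in_thread_fixed = await loop.run_in_executor(pool, ctx.run, read_parent)
    return {
        "tasks": in_tasks,
        "thread": in_thread,
        "thread_with_context": in_thread_fixed,
    }


results = asyncio.run(demo())
```

Here the gathered tasks and the thread that ran inside the copied context both see "agent.run", while the plain executor thread falls back to the default. asyncio.to_thread performs the same context copy for you, which is why it is usually the safer way to push blocking tool calls off the event loop.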

Next Step

Wire @observe around the three highest-traffic agents in your repo, add a tool.call span per tool dispatch, and a success score per terminal call. After 24 hours, the trace list filtered by score=success=false is the only debugging surface you need.

Mr. Technology
