Helicone is a one-line OpenAI proxy that gives you request logs, per-user cost breakdowns, latency histograms, and prompt caching in 10 minutes. No SDK rewrite, no deploy. Here is the setup, the gotchas, and when to skip it for Langfuse.

Helicone, in 10 Minutes: Real LLM Observability Without Re-Writing Your Client

You ship an LLM feature on Tuesday. On Thursday your CFO asks why the OpenAI bill is $4,200. You have no idea which user cost you the most, which prompt burned the tokens, or whether that latency spike was your code or OpenAI's. Helicone is a one-line proxy that fixes this. Point your OpenAI client at it, and every request gets logged, costed, and searchable. No SDK rewrite. No data warehouse. Ten minutes to useful.

The Setup

Install the SDK (or skip it for pure HTTP):

bash

pip install helicone

Then flip your client. One base URL, one header:

python

import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # <- the proxy
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)

The Implementation

The proxy is a pass-through. Every request your client makes gets mirrored, parsed, and stored. You can attach metadata with custom properties, and this is where it gets useful.

python

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "summarize this PDF"}],
    extra_headers={
        "Helicone-Property-User-Id": user.id,
        "Helicone-Property-Feature": "pdf-summarizer",
        "Helicone-Property-Env": "prod",
    },
)

Those Helicone-Property-* headers become first-class filters in the dashboard. What did your pdf-summarizer feature cost last week? One click. Every 5xx for users in cohort B? One click. Token cost between gpt-4.1 and gpt-4.1-mini for the same prompt? One click. Helicone logs request, response, latency, cost, and your custom properties, with retention controls and per-key rate limits.

First Run

Make ten calls with different User-Id headers and open the dashboard. The request stream populates in real time, with a per-user cost breakdown, latency histogram, and token-usage table by model. The Requests view shows the full prompt and response inline, redacted if you toggle PII scrubbing. The Costs view shows a per-feature, per-user, per-model breakdown exportable to CSV.

Add Helicone-Cache-Enabled: true and matching requests are served from cache at zero tokens. On a RAG-heavy workload this is a 30-50% cost cut in the first week.

The Gotchas

**Streaming needs stream=True.** Without it the response buffers at the proxy and your latency quietly doubles. One-line fix, confusing symptom.
Custom properties are case-sensitive and lazy-indexed. The dashboard only filters on properties it has seen. If you ship Helicone-Property-Plan on Friday, you cannot filter old data on it. Forward-plan your property names.
Do not double-wrap. If you already route through LiteLLM or a custom gateway, do not stack Helicone in front. Helicone has a native LiteLLM integration; read the docs before layering.
PII is your problem. Full request and response are stored by default. Toggle Helicone-Moderations or scrub upstream if you are in a regulated vertical. HIPAA lives on the enterprise tier, not free.

The Take

Use Helicone when you need observability today and do not want to deploy anything. The 10-minute setup is real, the dashboard is useful, and the per-feature cost breakdowns will save the next "why is the bill $4,200" Slack thread. Skip it when you need OpenTelemetry-native tracing for a multi-step agent: use Langfuse for that, or the Helicone OTel exporter if you want both. For a one-week MVP that just needs to know what it is spending, this is the tool. I have deployed it on three production systems this quarter. None have regretted it.

— Mr. Technology