
AI pipelines fail in ways that are uniquely frustrating to debug. Not because the errors are complex — they're often mundane — but because the output that would tell you what went wrong was never captured in the first place. A prompt gets sent to an API. Something goes wrong. The response is garbage, or a timeout, or a malformed JSON that crashed your parser. You have no idea which of the twelve steps in your pipeline actually failed, what the model was given as input, or why it decided to output something structurally invalid.
This is not a mysterious problem. It's a logging problem. And it has a straightforward solution: structured logs, not console.log.
Most AI pipeline logging looks like this:
print(f"Calling API with prompt: {prompt}")
result = call_api(prompt)
print(f"Got result: {result}")
This is not logging. This is archaeologists leaving notes for future civilizations in a language future civilizations don't speak. When something breaks in production and you're grepping through 40GB of log files that are mostly print statements, you will not find what you need. You will find timestamps out of order, JSON objects truncated mid-field, and output that was logged before an exception but not after it.
The fundamental failure: unstructured logs mix levels, mix concerns, and don't capture the relationships between pipeline steps that you need when something breaks.
Here's the pattern that actually works.
Every pipeline execution gets a correlation ID — a UUID generated at entry and propagated through every step, every function call, every API call. This ID appears in every log line for that execution. When you search for it, you get everything that happened in that specific run, in order.
import structlog
import uuid
structlog.configure(
processors=[
structlog.stdlib.filter_by_level,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.JSONRenderer()
]
)
logger = structlog.get_logger()
async def run_pipeline(user_input: str) -> dict:
run_id = str(uuid.uuid4())
logger = logger.bind(run_id=run_id)
logger.info("pipeline_start", user_input=user_input)
try:
parsed = await parse_input(run_id, logger, user_input)
logger.info("input_parsed", tokens=len(parsed))
context = await build_context(run_id, logger, parsed)
logger.info("context_built", context_length=len(context))
response = await call_model(run_id, logger, context)
logger.info("model_response_received", response_length=len(response))
result = await parse_response(run_id, logger, response)
logger.info("pipeline_complete", run_id=run_id)
return result
except Exception as e:
logger.error("pipeline_failed", error=str(e), error_type=type(e).__name__)
raise
The key insight: every log line is JSON with the same shape. The `run_id` is always there. The event name tells you what happened. The data payload captures the inputs and outputs of each step.
The common mistake is logging everything or logging nothing. You want to log the right things: inputs that might fail, outputs that might be unexpected, and enough context to understand the chain without flooding your log storage.
For LLM API calls specifically:
async def call_model(run_id, logger, prompt: str) -> str:
logger.info(
"llm_api_call_start",
prompt_length=len(prompt),
model="gpt-4o",
temperature=0.7
)
try:
start = time.time()
response = await openai_client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
temperature=0.7
)
latency_ms = (time.time() - start) * 1000
output = response.choices[0].message.content
logger.info(
"llm_api_call_success",
latency_ms=round(latency_ms, 2),
output_length=len(output),
finish_reason=response.choices[0].finish_reason,
tokens_used=response.usage.total_tokens
)
return output
except Exception as e:
logger.error(
"llm_api_call_failed",
error=str(e),
error_type=type(e).__name__,
prompt_length=len(prompt)
)
raise
Notice what you capture: latency, token usage, output length, finish reason. These are the metrics that tell you whether the model was working normally or behaving oddly. A finish reason of "length" instead of "stop" tells you the model hit the max_tokens limit — which often explains output truncation that breaks your JSON parser downstream.
Never log the full prompt or full response in production logs. Log the first N characters, the length, and a SHA of the content. This lets you:
1. Know roughly what the content was
2. Correlate across systems if you're logging to multiple backends
3. Not blow up your log storage with million-token outputs
def log_output(logger, label: str, content: str, max_chars: int = 500):
logger.info(
f"{label}_output",
content_preview=content[:max_chars],
content_length=len(content),
content_hash=hashlib.sha256(content.encode()).hexdigest()[:16]
)
When something goes wrong, the hash lets you search for the exact same content across all your log backends without storing the full content in your search index.
The real value of structured logs is queryability. With JSON logs in Elasticsearch, Grafana Loki, or any decent log aggregator, you can answer questions that are impossible with text logs:
This is the difference between debugging in 20 minutes and debugging in 4 hours.
Every log entry should have:
That's it. Everything else is payload data specific to that step. Keep it consistent, keep it queryable, and for the love of everything: log before the call and after the call, so you know whether the failure was in the attempt or the result.
The pattern doesn't require fancy infrastructure. It requires discipline. Every step, every external call, every potential failure mode — logged consistently, with correlation IDs, before you need them. The night something breaks in production, you'll be glad you built the habit.
*The tools are structlog for Python, winston for Node.js, and slog for Go. All support structured JSON output. Pick one. Use it. Your future on-call self will thank you.*