Single-prompt requests fail in predictable ways: the model bites off more than it can reason through, takes wrong turns mid-response, and you end up regenerating three times before getting something usable. Prompt chaining solves this by splitting the work into smaller, verifiable steps where each output feeds the next.
The key insight: you don't chain because you want the model to do more work. You chain because you want to **control the work** — catch errors early, inject structured state between steps, and ensure each stage has a narrow, well-defined job.
This isn't chain-of-thought prompting (that's internal reasoning). This is architectural: explicitly separating tasks into distinct LLM calls with structured data passing between them.
The most common real-world chain is extract, transform, validate: three steps, each with a clear purpose.
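
    import json

    # `llm`, `original_input`, and `topic` are assumed to be defined by the caller;
    # `llm.json(prompt)` stands for "call the model and parse its JSON reply".

    # Step 1: extract structured fields from the raw article.
    extract_prompt = """From the article below, extract: title, author, date, key claims (3-5).
    Return JSON only. No explanation.
    Article: {input}""".format(input=original_input)
    extracted = llm.json(extract_prompt)

    # Step 2: apply the business logic to the extracted data.
    transform_prompt = """Given this extracted article data, determine:
    1. Is it relevant to {}? (yes/no + 1 sentence reason)
    2. Sentiment: positive/neutral/negative
    3. Priority: high/medium/low
    Data: {}
    Return JSON only.""".format(topic, json.dumps(extracted))
    transformed = llm.json(transform_prompt)

    # Step 3: check the transformed data against the original article.
    validation_prompt = """Review this transformed data. Flag any inconsistencies:
    Transformed: {}
    Original: {}
    Return {{"valid": bool, "issues": []}}""".format(json.dumps(transformed), original_input)
    validated = llm.json(validation_prompt)
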
Why this works: each step is narrow enough to be reliable. Extraction handles raw→structured. Transform handles business logic. Validation catches drift before it propagates. If validation fails, you retry the step that failed, not the whole chain.
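Retrying only the step that failed can be sketched as below. This is a minimal illustration, not a definitive implementation: `run_step_with_retry`, `call_model`, and `fake_model` are hypothetical names, and the stand-in model replays canned replies so the example runs without a real client.

```python
import json
from typing import Callable


def run_step_with_retry(
    call_model: Callable[[str], str],
    prompt: str,
    required_keys: set,
    max_attempts: int = 3,
) -> dict:
    """Re-run a single chain step until its reply is a JSON object with the required keys."""
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            last_error = f"attempt {attempt}: reply was not valid JSON"
            continue
        if not isinstance(data, dict):
            last_error = f"attempt {attempt}: reply was not a JSON object"
            continue
        missing = set(required_keys) - set(data)
        if not missing:
            return data
        last_error = f"attempt {attempt}: missing keys {sorted(missing)}"
    raise RuntimeError(f"step failed after {max_attempts} attempts ({last_error})")


# Hypothetical stand-in for a model client: one malformed reply, then a good one.
canned_replies = iter([
    "Sure! Here is the JSON you asked for...",
    '{"title": "T", "author": "A", "date": "2024-01-01", "key_claims": ["c1", "c2", "c3"]}',
])


def fake_model(prompt: str) -> str:
    return next(canned_replies)


extracted = run_step_with_retry(
    fake_model, "Extract ...", {"title", "author", "date", "key_claims"}
)
```

Because the retry wraps one step, a malformed extraction never reaches the transform step, and a failed transform does not force the extraction to run again.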
For tasks where the approach matters more than the output, use a plan, execute, refine chain. The model plans first, then executes the plan, then reflects on the result.
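
    # `llm`, `task`, and `constraints` are assumed to be defined by the caller.

    # Step 1: plan.
    plan_prompt = """Given this task and constraints, produce a step-by-step plan.
    Each step: [action] → [expected output]
    Task: {task}
    Constraints: {constraints}
    Respond with numbered steps only. No preamble.""".format(task=task, constraints=constraints)
    # Assumption: llm.call returns a plan object whose `steps` each carry a `name`
    # and an `action_prompt` template; a raw string reply would need parsing first.
    plan = llm.call(plan_prompt)

    # Step 2: execute each step, accumulating outputs so later steps can see earlier ones.
    context = {}
    for step in plan.steps:
        step_output = llm.call(step.action_prompt.format(context=context))
        context[step.name] = step_output

    # Step 3: refine.
    refine_prompt = """Review the execution against the original task.
    Did we achieve the goal? What could be improved?
    If nothing needs changing, set "status" to "done".
    Task: {task}
    Execution: {context}
    Return: {{"status": "done"|"needs_revision", "revision": "..." if needed}}""".format(task=task, context=context)
    result = llm.json(refine_prompt)
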
The plan step sounds like overhead but it catches big-picture errors before you've invested compute in the wrong direction. If the plan is wrong, you fix it. If the execution is wrong, you fix that — separately.
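Fixing the plan separately from the execution is easier when the plan is structured data rather than free text. The sketch below shows one possible way to turn numbered "[action] → [expected output]" lines into step objects. `PlanStep` and `parse_plan` are hypothetical names, and accepting `->` as well as `→` is an assumption about what a model might return.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class PlanStep:
    number: int
    action: str
    expected_output: str


# Matches lines such as "1. Load the CSV → rows as dicts" or "2) Clean -> tidy rows".
STEP_PATTERN = re.compile(r"^\s*(\d+)[.)]\s*(.+?)\s*(?:→|->)\s*(.+?)\s*$")


def parse_plan(plan_text: str) -> List[PlanStep]:
    """Parse numbered plan lines, rejecting anything that is not a step."""
    steps = []
    for line in plan_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between steps
        match = STEP_PATTERN.match(line)
        if match is None:
            raise ValueError(f"unparseable plan line: {line!r}")
        number, action, expected = match.groups()
        steps.append(PlanStep(int(number), action, expected))
    if not steps:
        raise ValueError("plan contained no steps")
    return steps


plan_steps = parse_plan(
    "1. Load the CSV → rows as dicts\n"
    "2) Drop empty rows -> cleaned rows\n"
)
```

A plan that fails to parse is rejected before any execution call is made, which is where the plan step is meant to catch problems.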
When input types determine workflow, use a classification step first.
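
    # `llm`, `user_input`, and the four *_chain functions are assumed to be defined elsewhere.
    classify_prompt = """Classify this input type. Respond with one word only:
    bug_report, feature_request, question, or complaint.
    Input: {user_input}""".format(user_input=user_input)
    input_type = llm.call(classify_prompt).strip().lower()

    if input_type == "bug_report":
        result = bug_report_chain(user_input)
    elif input_type == "feature_request":
        result = feature_request_chain(user_input)
    elif input_type == "question":
        result = qa_chain(user_input)
    else:
        # Complaints, plus any label outside the expected set, land here.
        result = complaint_chain(user_input)
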
The classification step adds almost nothing beyond the input itself (roughly 10-20 tokens of instruction and label) and means each downstream chain can be much narrower and more reliable than a single "handle everything" prompt. A bug report chain knows it's dealing with a problem; a question chain knows it's answering, not solving.
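Model replies to a "one word only" instruction do not always arrive clean, so the routing above can be made more forgiving. The following is a sketch under stated assumptions: `normalize_label`, `route`, `needs_review`, and the lambda handlers are hypothetical stand-ins for the real chains, and sending unrecognized labels to an explicit fallback is one design option rather than the only one.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def normalize_label(raw: str) -> str:
    """Trim whitespace and quotes, lowercase, and join words with underscores."""
    cleaned = raw.strip().strip(".\"'`").lower()
    return "_".join(cleaned.replace("-", " ").split())


def route(raw_label: str, handlers: Dict[str, Handler],
          fallback: Handler, user_input: str) -> str:
    """Dispatch to the handler for the label, or to the fallback if it is unknown."""
    handler = handlers.get(normalize_label(raw_label), fallback)
    return handler(user_input)


# Hypothetical handlers standing in for the downstream chains.
handlers: Dict[str, Handler] = {
    "bug_report": lambda text: f"bug chain: {text}",
    "feature_request": lambda text: f"feature chain: {text}",
    "question": lambda text: f"qa chain: {text}",
    "complaint": lambda text: f"complaint chain: {text}",
}


def needs_review(text: str) -> str:
    return f"manual review: {text}"


result = route(" Bug Report.\n", handlers, needs_review, "App crashes on login")
unknown = route("I think this is spam", handlers, needs_review, "Buy now")
```

A dispatch table keeps the set of labels in one place, and the explicit fallback makes an off-script classification visible instead of silently treating it as one of the known types.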
All chains need a way to pass data between steps. The naive approach: string interpolation at each step. The right approach: a state object.

    class ChainState:
        """Carries the original input and every step's output through the chain."""

        def __init__(self, original_input):
            self.input = original_input
            self.extracted = None
            self.transformed = None
            self.validated = None
            self.errors = []

        def add_error(self, step, message):
            self.errors.append({"step": step, "message": message})

        def is_valid(self):
            return len(self.errors) == 0


    def run_chain(initial_input, chain_steps, continue_on_error=False):
        """Run (step_name, step_fn) pairs in order; each step_fn receives the state.

        step_name is the attribute the output is stored under, e.g. "extracted".
        """
        state = ChainState(initial_input)
        for step_name, step_fn in chain_steps:
            try:
                output = step_fn(state)
                setattr(state, step_name, output)
            except Exception as e:
                state.add_error(step_name, str(e))
                if not continue_on_error:
                    break
        return state

State lets any step inspect any prior step's output. Validation can see extraction. Refinement can see execution. You get visibility across the chain without tight coupling.
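To make that visibility concrete, here is a small self-contained sketch in which each step reads whatever it needs from a shared state. Plain functions stand in for LLM calls, and `SimpleNamespace` stands in for the state object; the step functions and field names are illustrative assumptions.

```python
from types import SimpleNamespace


def extract(state):
    # Toy extraction: the first line is the title, everything after it is the body.
    title, _, body = state.input.partition("\n")
    return {"title": title.strip(), "body": body.strip()}


def transform(state):
    # Reads the extraction step's output from the shared state.
    body_words = state.extracted["body"].split()
    return {"headline": state.extracted["title"].upper(), "word_count": len(body_words)}


def validate(state):
    # Cross-checks the transform against both the extraction and the raw input.
    issues = []
    if state.transformed["headline"].lower() != state.extracted["title"].lower():
        issues.append("headline does not match extracted title")
    if state.transformed["word_count"] > len(state.input.split()):
        issues.append("word count exceeds input length")
    return {"valid": not issues, "issues": issues}


state = SimpleNamespace(
    input="Chaining prompts\nSplit the work into small steps.",
    extracted=None,
    transformed=None,
    validated=None,
)
for name, step in [("extracted", extract), ("transformed", transform), ("validated", validate)]:
    setattr(state, name, step(state))
```

None of the step functions call each other. The validation step reaches back to the raw input and the extraction only through the state, which is what keeps the coupling loose.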
**Use it when:**
**Don't use it when:**
Chaining trades latency for reliability and debuggability. If you're regenerating because a single prompt got messy, you're paying the latency anyway — just with worse outcomes.
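The trade-off can be put into rough numbers. The sketch below assumes each attempt succeeds independently with a fixed probability, so the expected number of attempts is the reciprocal of that probability, and it assumes a failed chain step is retried on its own. The probabilities and timings are invented for illustration, not measurements.

```python
from typing import Sequence


def expected_attempts(success_probability: float) -> float:
    """Mean number of independent tries until the first success (geometric distribution)."""
    if not 0 < success_probability <= 1:
        raise ValueError("success_probability must be in (0, 1]")
    return 1 / success_probability


def single_prompt_latency(success_probability: float, seconds_per_call: float) -> float:
    """Expected total latency when the whole prompt is regenerated on failure."""
    return expected_attempts(success_probability) * seconds_per_call


def chain_latency(step_success_probabilities: Sequence[float],
                  seconds_per_step: Sequence[float]) -> float:
    """Expected total latency when each step is retried on its own until it succeeds."""
    if len(step_success_probabilities) != len(seconds_per_step):
        raise ValueError("need one duration per step")
    return sum(
        expected_attempts(p) * seconds
        for p, seconds in zip(step_success_probabilities, seconds_per_step)
    )


# Illustrative numbers only: one large prompt that is usable a quarter of the time,
# against three smaller steps that each succeed most of the time.
single = single_prompt_latency(success_probability=0.25, seconds_per_call=12.0)
chained = chain_latency([0.8, 0.8, 0.8], [4.0, 4.0, 4.0])
```

With these made-up inputs the single prompt averages 48 seconds of generation against about 15 for the chain. When the single prompt is reliable the comparison flips, which is the case where chaining is not worth its extra calls.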
The reason chaining works isn't that it breaks tasks into smaller pieces. It's that it makes **intermediate states inspectable**. You can log what was extracted, see what was transformed, verify what was validated. When something goes wrong, you know exactly where and why — not just "the output was bad."
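One way to get that inspectability is to record every step's input and output as the chain runs. The following is a minimal sketch, assuming a linear chain where each step takes the previous step's output; `run_traced`, `first_failing_step`, and the toy steps are hypothetical, and the transform step contains a deliberate bug so there is something to find.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Tuple


def run_traced(value: Any, steps: List[Tuple[str, Callable[[Any], Any]]],
               trace: List[Dict[str, Any]]) -> Any:
    """Run steps in order, appending one trace record per step."""
    for name, step_fn in steps:
        started = time.perf_counter()
        output = step_fn(value)
        trace.append({
            "step": name,
            "input": value,
            "output": output,
            "seconds": time.perf_counter() - started,
        })
        value = output
    return value


def first_failing_step(trace: List[Dict[str, Any]],
                       checks: Dict[str, Callable[[Any], bool]]) -> Optional[str]:
    """Return the name of the first step whose output fails its check, or None."""
    for record in trace:
        check = checks.get(record["step"])
        if check is not None and not check(record["output"]):
            return record["step"]
    return None


# Toy steps standing in for LLM calls; "urgent" is not an allowed priority.
steps = [
    ("extract", lambda text: {"title": text.split(":")[0]}),
    ("transform", lambda data: {"title": data["title"], "priority": "urgent"}),
    ("format", lambda data: f"[{data['priority']}] {data['title']}"),
]
checks = {
    "extract": lambda out: bool(out["title"]),
    "transform": lambda out: out["priority"] in {"high", "medium", "low"},
}

trace: List[Dict[str, Any]] = []
final = run_traced("Outage: login is down", steps, trace)
culprit = first_failing_step(trace, checks)
```

The final string alone only shows that the output looks wrong. The trace pins the problem to the transform step and keeps the exact input that produced it.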
That visibility is what separates production-grade LLM pipelines from demo-quality chains. If you're building anything that runs repeatedly without human review, chaining isn't optional — it's the foundation.