Breaking complex tasks into LLM steps with clean data flow: practical patterns for reliable multi-step workflows with lower hallucination risk.

Why Chaining Beats One-Shot

Single-prompt requests fail in predictable ways: the model takes on more than it can reason through and takes wrong turns mid-response, and you end up regenerating several times before getting something usable. Prompt chaining addresses this by splitting the work into smaller, verifiable steps where each output feeds the next.

The key insight: you don't chain because you want the model to do more work. You chain because you want to **control the work** — catch errors early, inject structured state between steps, and ensure each stage has a narrow, well-defined job.

This isn't chain-of-thought prompting (that's internal reasoning). This is architectural: explicitly separating tasks into distinct LLM calls with structured data passing between them.

Pattern 1: Extract → Transform → Validate

A common real-world chain. Three steps, each with a clear purpose.

Step 1: Extract - pull structured data from raw input

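extract_prompt = """From the article below, extract: title, author, date, key claims (3-5).
Return JSON only. No explanation.

Article: {input}""".format(input=original_input)

# llm.json is assumed to send the prompt and return the parsed JSON reply
extracted = llm.json(extract_prompt)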

Step 2: Transform - apply business logic to extracted data

import json

transform_prompt = """Given this extracted article data, determine:
1. Is it relevant to {}? (yes/no + 1 sentence reason)
2. Sentiment: positive/neutral/negative
3. Priority: high/medium/low

Data: {}

Return JSON only.""".format(topic, json.dumps(extracted))

transformed = llm.json(transform_prompt)

Step 3: Validate - sanity check the transform output

validation_prompt = """Review this transformed data. Flag any inconsistencies:
- Does the priority match the relevance and sentiment?
- Are the extracted key claims actually present in the original?

Extracted: {}

Transformed: {}

Original: {}

Return {{"valid": bool, "issues": []}}""".format(
    json.dumps(extracted), json.dumps(transformed), original_input
)

validated = llm.json(validation_prompt)

Why this works: each step is narrow enough to be reliable. Extraction handles raw→structured. Transform handles business logic. Validation catches drift before it propagates. If validation fails, you retry the step that failed, not the whole chain.
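
Retrying only the failed step can be sketched roughly as below. This is a minimal sketch under stated assumptions, not a definitive implementation: `run_step_with_retry`, `fake_extract`, and `check_extraction` are hypothetical names, and the stub stands in for a real LLM call so the example runs offline.

```python
import json


def run_step_with_retry(step_fn, check, payload, max_attempts=3):
    """Run one chain step, retrying only this step until `check` passes.

    step_fn and check are hypothetical callables: step_fn(payload, issues)
    produces the step output, and check(output) returns a list of issues
    (an empty list means the output is acceptable).
    """
    issues = []
    for attempt in range(1, max_attempts + 1):
        # Issues from the previous attempt are passed back so the step can
        # mention them in its next prompt.
        output = step_fn(payload, issues)
        issues = check(output)
        if not issues:
            return output, attempt
    raise RuntimeError(f"step failed after {max_attempts} attempts: {issues}")


# Stub standing in for an LLM extraction call: the first attempt omits a field.
calls = {"n": 0}


def fake_extract(article, previous_issues):
    calls["n"] += 1
    if calls["n"] == 1:
        return json.dumps({"title": "Example"})
    return json.dumps({"title": "Example", "author": "A. Writer"})


def check_extraction(raw):
    data = json.loads(raw)
    return [f"missing field: {key}" for key in ("title", "author") if key not in data]


output, attempts = run_step_with_retry(fake_extract, check_extraction, "article text")
print(attempts, output)
```

Only the extraction step runs twice here; any later steps would see the corrected output and run once.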

Pattern 2: Plan → Execute → Refine

For tasks where the approach matters more than the output. The model plans first, then executes the plan, then reflects.

Planning step - not what to do, but how to do it

plan_prompt = """Given this task and constraints, produce a step-by-step plan.
Each step: [action] → [expected output]

Task: {task}
Constraints: {constraints}

Respond with numbered steps only. No preamble.""".format(task=task, constraints=constraints)

plan = llm.call(plan_prompt)

Execution steps - run each plan step, capture output

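# Assumes the plan text has already been parsed into step objects,
# each with a name and an action_prompt template.
context = {}
for step in plan.steps:
    step_output = llm.call(step.action_prompt.format(context=context))
    context[step.name] = step_output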

Refinement step - quality check the full execution

refine_prompt = """Review the execution against the original task.
Did we achieve the goal? What could be improved?
If nothing needs changing, set "status" to "done" and omit "revision".

Task: {task}

Execution: {context}

Return: {{"status": "done"|"needs_revision", "revision": "..." if needed}}""".format(task=task, context=json.dumps(context, indent=2))

result = llm.json(refine_prompt)

The plan step sounds like overhead but it catches big-picture errors before you've invested compute in the wrong direction. If the plan is wrong, you fix it. If the execution is wrong, you fix that — separately.
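
Before the plan can drive execution, the model's numbered reply has to become something a loop can iterate over. The sketch below shows one possible way to do that, assuming the reply follows the `N. action → expected output` shape the plan prompt asks for. `PlanStep`, `STEP_RE`, and `parse_plan` are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class PlanStep:
    number: int
    action: str
    expected_output: str


# Matches lines such as "1. Collect sources → list of URLs" (also accepts "->").
STEP_RE = re.compile(r"^\s*(\d+)[.)]\s*(.+?)\s*(?:→|->)\s*(.+?)\s*$")


def parse_plan(plan_text):
    """Turn the model's numbered plan into PlanStep objects.

    Lines that do not match the expected shape are skipped rather than guessed at.
    """
    steps = []
    for line in plan_text.splitlines():
        match = STEP_RE.match(line)
        if match:
            number, action, expected = match.groups()
            steps.append(PlanStep(int(number), action, expected))
    return steps


sample_plan = """1. Collect sources → list of URLs
2. Summarize each source -> one paragraph per source
Some stray commentary the model added anyway"""

steps = parse_plan(sample_plan)
for step in steps:
    print(step.number, step.action, "=>", step.expected_output)
```

Skipping malformed lines is a design choice. A stricter pipeline might treat any unparsed line as a reason to regenerate the plan.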

Pattern 3: Branch on Type

When input types determine workflow, use a classification step first.

classify_prompt = """Classify this input type. Respond with one word only:
bug_report
feature_request
question
complaint

Input: {user_input}""".format(user_input=user_input)

input_type = llm.call(classify_prompt).strip().lower()

Each branch is a specialized chain

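if input_type == "bug_report":
    result = bug_report_chain(user_input)
elif input_type == "feature_request":
    result = feature_request_chain(user_input)
elif input_type == "question":
    result = qa_chain(user_input)
elif input_type == "complaint":
    result = complaint_chain(user_input)
else:
    # The model returned something outside the four labels; don't guess a branch.
    raise ValueError(f"Unexpected input type: {input_type!r}")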

The classification step is cheap (the reply is a single label of a few tokens, on top of the cost of reading the input) and means each downstream chain can be much narrower and more reliable than a single "handle everything" prompt. A bug report chain knows it's dealing with a problem; a question chain knows it's answering, not solving.
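
Classifier replies are not always a clean label, so it can help to normalize the reply before branching. The sketch below is one possible approach: `ALLOWED_TYPES`, `normalize_label`, `CHAINS`, and `route` are hypothetical names, the chain functions are placeholders, and falling back to `question` is an assumption rather than a recommendation from the text above.

```python
ALLOWED_TYPES = ("bug_report", "feature_request", "question", "complaint")


def normalize_label(raw_label, default="question"):
    """Map a raw classifier reply onto one allowed label, or fall back to a default."""
    cleaned = raw_label.strip().lower().strip(".\"'`").replace(" ", "_")
    return cleaned if cleaned in ALLOWED_TYPES else default


# Placeholder chains; real ones would each run their own sequence of LLM calls.
def bug_report_chain(text):
    return f"bug_report: {text}"


def feature_request_chain(text):
    return f"feature_request: {text}"


def qa_chain(text):
    return f"question: {text}"


def complaint_chain(text):
    return f"complaint: {text}"


CHAINS = {
    "bug_report": bug_report_chain,
    "feature_request": feature_request_chain,
    "question": qa_chain,
    "complaint": complaint_chain,
}


def route(raw_label, user_input):
    """Pick the specialized chain for a classifier reply and run it."""
    return CHAINS[normalize_label(raw_label)](user_input)


print(route(" Bug Report.\n", "App crashes on login"))
print(route("not sure", "How do I export data?"))
```

A dispatch table keeps the label list and the branch list in one place, so adding a fifth type is a single new entry.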

The State Object Pattern

All chains need a way to pass data between steps. The naive approach: string interpolation at each step. The right approach: a state object.

class ChainState:
    def __init__(self, original_input):
        self.input = original_input
        self.extracted = None
        self.transformed = None
        self.validated = None
        self.errors = []

    def add_error(self, step, message):
        self.errors.append({"step": step, "message": message})

    def is_valid(self):
        return len(self.errors) == 0


def run_chain(initial_input, chain_steps, continue_on_error=False):
    # chain_steps: list of (step_name, step_fn) pairs; step_name must match
    # a ChainState attribute, and step_fn takes the state and returns the output.
    state = ChainState(initial_input)
    for step_name, step_fn in chain_steps:
        try:
            output = step_fn(state)
            setattr(state, step_name, output)
        except Exception as e:
            state.add_error(step_name, str(e))
            if not continue_on_error:
                break
    return state

State lets any step inspect any prior step's output. Validation can see extraction. Refinement can see execution. You get visibility across the chain without tight coupling.
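
To make the cross-step visibility concrete, here is a small self-contained usage sketch. It restates a trimmed version of the state class so it runs on its own. The step functions `extract` and `validate` are hypothetical stand-ins for LLM-backed steps, and the step names are assumed to match attributes on the state object.

```python
class ChainState:
    def __init__(self, original_input):
        self.input = original_input
        self.extracted = None
        self.validated = None
        self.errors = []

    def is_valid(self):
        return len(self.errors) == 0


def run_chain(initial_input, chain_steps):
    state = ChainState(initial_input)
    for step_name, step_fn in chain_steps:
        try:
            setattr(state, step_name, step_fn(state))
        except Exception as exc:
            # Record which step failed, then stop the chain.
            state.errors.append({"step": step_name, "message": str(exc)})
            break
    return state


def extract(state):
    # Stand-in for an LLM extraction call: treat the first sentence as the title.
    return {"title": state.input.split(".")[0]}


def validate(state):
    # Reads both the original input and the extraction output from the state.
    title = state.extracted["title"]
    if title not in state.input:
        raise ValueError("title not found in original input")
    return {"valid": True}


state = run_chain(
    "Chaining works. Here is why.",
    [("extracted", extract), ("validated", validate)],
)
print(state.extracted, state.validated, state.errors)

# Running validation without extraction fails and is recorded, not raised.
broken = run_chain("Chaining works.", [("validated", validate)])
print(broken.errors)
```

The validation step never receives the extraction output as an argument; it reads it from the shared state, which is what keeps the steps loosely coupled.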

When to Use Chaining

**Use it when:**

The task has distinct phases (extract → process → format)
Errors need to be caught mid-process, not at output
Different steps need different prompting strategies
You're doing multi-document synthesis

**Don't use it when:**

A single prompt gets you there reliably
Latency is critical (each chain step = round trip)
The task is simple enough that failure is cheap and retry is easy

Chaining trades latency for reliability and debuggability. If you're regenerating because a single prompt got messy, you're paying the latency anyway — just with worse outcomes.

The Real Benefit

The reason chaining works isn't that it breaks tasks into smaller pieces. It's that it makes **intermediate states inspectable**. You can log what was extracted, see what was transformed, verify what was validated. When something goes wrong, you know exactly where and why — not just "the output was bad."
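
One lightweight way to get that inspectability is to wrap each step so its input, output, and duration are recorded. The sketch below uses only the standard library. `logged_step` and `trace` are hypothetical names, and the lambdas stand in for real LLM-backed steps.

```python
import json
import logging
import time

logger = logging.getLogger("chain")


def logged_step(step_name, step_fn, payload, trace):
    """Run one step and record its input, output, and duration in `trace`."""
    started = time.perf_counter()
    output = step_fn(payload)
    record = {
        "step": step_name,
        "input": payload,
        "output": output,
        "seconds": round(time.perf_counter() - started, 4),
    }
    trace.append(record)
    # One JSON line per step makes the log easy to search by step name later.
    logger.info(json.dumps(record, default=str))
    return output


trace = []
extracted = logged_step(
    "extract", lambda text: {"title": text.title()}, "prompt chaining", trace
)
transformed = logged_step(
    "transform", lambda data: {**data, "priority": "high"}, extracted, trace
)

for record in trace:
    print(record["step"], "->", record["output"])
```

When a run goes wrong, the trace shows which step first produced a bad output, without having to rerun the whole chain.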

That visibility is what separates production-grade LLM pipelines from demo-quality chains. If you're building anything that runs repeatedly without human review, chaining isn't optional — it's the foundation.

Prompt Chaining Patterns That Actually Work