opinion

Fine-Tuning Is Mostly Expensive Prompt Engineering

Every AI startup offers fine-tuning as a premium service tier. Most of them are selling you expensive prompt engineering and calling it machine learning. Here's why the math never works out the way the vendors promise.

Let me say something that will make every AI infrastructure vendor uncomfortable: fine-tuning is mostly expensive prompt engineering with a better sales pitch.

I know. I know. The papers show real improvements. The benchmark lifts are real. Instruction tuning works. RL from human feedback produces models that behave differently in measurable ways. I'm not disputing any of that. I'm disputing the conclusion the industry draws from it, which is: when your model isn't behaving the way you want, you should fine-tune.

That conclusion is wrong most of the time. And the evidence is in what happens after the fine-tuning bill arrives.

The Fine-Tuning Industrial Complex

Every AI startup, every hyperscaler, every API vendor now has a fine-tuning tier. It's presented as the premium path — you've outgrown base models, you need something tailored to your domain, you need to stand apart. The pricing reflects the premium: thousands of dollars, days of compute, a process that feels serious and scientific.

And then the results come in.

The teams that get real value from fine-tuning are the ones who had a specific, well-defined behavioral gap that couldn't be solved any other way. The teams that don't are the ones who fine-tuned because that's what you're supposed to do when the model isn't performing — and the gap was actually in their system prompt, their retrieval quality, or their data cleanliness.

I've talked to ML engineers at three different companies who independently described the same pattern: they spent $30K+ on fine-tuning, saw a 3-5% improvement on their internal eval, realized the actual problem was in the prompt pipeline, fixed it in an afternoon, and got a 25% improvement. The fine-tuning wasn't wrong. It was just solving the wrong problem at the wrong price.
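To make that gap concrete, here is a rough back-of-the-envelope sketch of cost per percentage point of eval improvement. The $30K and the 3-5% and 25% figures come from the anecdote above. The price of "an afternoon" is my assumption (4 engineer-hours at a hypothetical $150/hour), so treat the exact ratio as illustrative rather than measured.

```python
def cost_per_point(cost_usd: float, improvement_points: float) -> float:
    """Dollars spent per percentage point of eval improvement."""
    if improvement_points <= 0:
        raise ValueError("improvement_points must be positive")
    return cost_usd / improvement_points


# Figures from the anecdote; $30K is a lower bound ("$30K+").
fine_tune_cost = 30_000
fine_tune_gain = 4.0  # midpoint of the reported 3-5% range

# ASSUMPTION: "an afternoon" = 4 engineer-hours at a hypothetical $150/hour.
prompt_fix_cost = 4 * 150
prompt_fix_gain = 25.0

ft = cost_per_point(fine_tune_cost, fine_tune_gain)
pf = cost_per_point(prompt_fix_cost, prompt_fix_gain)

print(f"fine-tuning: ${ft:,.0f} per point")  # $7,500 per point
print(f"prompt fix:  ${pf:,.0f} per point")  # $24 per point
print(f"ratio: {ft / pf:.1f}x")              # 312.5x
```

Even if you triple the assumed hourly rate, the prompt fix stays roughly two orders of magnitude cheaper per point.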

What Fine-Tuning Actually Costs

Let's talk numbers, because the industry doesn't.

A meaningful fine-tuning run on a mid-sized model — one that actually changes behavior rather than just adjusting surface-level responses — requires hundreds to thousands of dollars in compute. If you're doing it properly, with proper evaluation loops, proper data curation, proper hyperparameter search, you're talking about weeks of iteration before you ship anything.

That cost is not inherently wrong if the value is there. But the value is rarely there for the problems most teams are trying to solve.

Instruction tuning — the dominant fine-tuning approach for making models follow formats, adopt tones, and complete specific task types — is mostly teaching the model to do things that a well-crafted system prompt can teach it to do. The difference is that instruction tuning costs thousands of dollars and takes days. A system prompt change costs nothing and takes minutes.

The exception is behavioral fine-tuning: the work that requires the model to internalize something that can't be expressed in context. This is where RLHF earns its cost. It fits when you're trying to instill values, safety behaviors, or domain-specific reasoning patterns that the model can't pick up from a prompt, either because the instructions would be too long to fit in context or because the behavior is too foundational.

But here's the thing: that's a small fraction of the fine-tuning jobs I see being sold. Most of them are instruction tuning. Most of those are solvable with prompts. And the teams selling them aren't having that conversation because "fix your prompts first" doesn't show up on an invoice.

The Data Quality Problem Nobody Talks About

You know what makes fine-tuning genuinely painful? Bad training data.

The teams that get real value from fine-tuning almost uniformly have one thing in common: rigorous, curated training data pipelines. They spent real time building the infrastructure to generate, clean, evaluate, and maintain their fine-tuning datasets. They have human raters, quality checks, distribution analysis, deduplication.

The teams that don't get value also uniformly have one thing in common: they fine-tuned on whatever data they had, assumed quantity would compensate for quality, and shipped a model that learned the wrong things more thoroughly than the base model learned the right things.

Fine-tuning on garbage data still produces garbage. It just costs more to produce.
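As a minimal sketch of what the cheapest quality checks might look like, the snippet below counts near-exact duplicate prompts and reports label balance. The record shape (`prompt` and `label` keys), the `audit` helper, and the sample rows are all hypothetical, chosen only for illustration. A real pipeline would add human review and fuzzier deduplication on top.

```python
from collections import Counter


def audit(examples: list[dict]) -> dict:
    """Report duplicate prompts and label balance for a fine-tuning set."""
    # Normalize case and whitespace so trivially different copies collapse.
    keys = [" ".join(ex["prompt"].lower().split()) for ex in examples]
    counts = Counter(keys)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())

    labels = Counter(ex["label"] for ex in examples)
    total = len(examples)
    return {
        "total": total,
        "duplicates": duplicates,
        "label_share": {k: v / total for k, v in labels.items()},
    }


# Hypothetical sample data.
sample = [
    {"prompt": "Summarize this ticket", "label": "good"},
    {"prompt": "summarize  this ticket", "label": "good"},
    {"prompt": "Classify the intent", "label": "bad"},
    {"prompt": "Draft a reply", "label": "good"},
]
report = audit(sample)
print(report)
```

If a check this simple turns up a large duplicate count or a badly skewed label share, that is a strong hint the dataset is not ready to train on.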

The Honest Framework

Here's the decision tree I wish someone had given me five years ago:

**Is the gap behavioral or surface-level?** If the model keeps producing outputs in the wrong format, the wrong tone, or the wrong structure — that's a system prompt problem. Fix the prompt first.

**Is the data clean enough to fine-tune on?** If you can't articulate what good examples look like, if you don't have rigorous quality labels, if your dataset has distribution issues — fine-tuning will amplify them.

**Is the improvement worth the cost?** If a 3-5% internal eval improvement doesn't materially change your product metrics, the fine-tuning isn't earning its cost even if the benchmark says it worked.

**Is there a foundation problem underneath?** If your retrieval is returning irrelevant chunks, if your context assembly is noisy, if your data pipeline is dirty — fixing the foundation will almost always outperform fine-tuning the model on top of it.
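The four questions above can be sketched as a simple triage function. This is one possible encoding, not a definitive rule set: the `GapAssessment` fields, the `recommend` name, and the returned strings are my own hypothetical labels, and the checks run in the order the questions are listed.

```python
from dataclasses import dataclass


@dataclass
class GapAssessment:
    surface_level: bool          # wrong format, tone, or structure
    data_is_clean: bool          # curated examples with quality labels
    moves_product_metrics: bool  # eval gain changes product outcomes
    foundation_is_sound: bool    # retrieval, context assembly, pipeline


def recommend(a: GapAssessment) -> str:
    """Return the first cheaper fix that applies, else allow fine-tuning."""
    if a.surface_level:
        return "fix the prompt first"
    if not a.data_is_clean:
        return "clean the training data"
    if not a.moves_product_metrics:
        return "skip fine-tuning: the gain is not worth the cost"
    if not a.foundation_is_sound:
        return "fix the foundation (retrieval, context, pipeline)"
    return "fine-tuning is a reasonable candidate"


# Usage: a formatting problem never reaches the fine-tuning branch.
print(recommend(GapAssessment(True, True, True, True)))
```

The point of the structure is that fine-tuning is the last branch: it only comes up once every cheaper explanation has been ruled out.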

The teams that understand this are the ones who use fine-tuning as a precision instrument: surgical, targeted, earned. Everyone else is paying for the premium service tier and getting prompt engineering in an expensive box.

*Fine-tuning is not the problem. Fine-tuning for the wrong problems at the wrong stage is the problem. Fix your prompts first. Then fix your data. Then, maybe, talk to us about fine-tuning.*