Out of the box, Ollama gives you a generic chat model. With a short Modelfile and four parameter settings, you get a code assistant that does not ramble.

The Ollama Modelfile Trick That Turns llama3.1 Into a Real Code Assistant

Hey guys, Mr. Technology here.

If you have run `ollama run llama3.1` and asked it to refactor a function, you already know the problem. The model gives you a 400-word explanation of what a closure is, then a code block, then an alternative, then a caveat, then a closing pleasantry. It is a chat model. You want a tool.

The fix is a Modelfile. Most people never touch it. With four parameter settings and a tight system prompt, the same model becomes something you can wire into a pipeline.

The Default Problem

```bash
ollama run llama3.1:8b "Refactor this Python function: def f(x): return [i*2 for i in x if i > 0]"
```

You will get a paragraph about list comprehensions, the refactor, an alternative using map and filter, a note about type hints, and a closing pleasantry. Four settings and a system prompt fix this.

The Modelfile

Create Modelfile.code:

```dockerfile
FROM llama3.1:8b
# 1. Lower the temperature. 0.2 makes output far more repeatable.
#    The default of 0.8 is fine for chat, too loose for code.
PARAMETER temperature 0.2
# 2. Set the context window. 8192 tokens fits most single source files.
#    The 2048 default in many Ollama releases truncates most real code files.
PARAMETER num_ctx 8192
# 3. Cap response length. 1024 tokens is enough for a refactor
#    and short enough to kill the rambling.
PARAMETER num_predict 1024
# 4. Stop sequences. If the model emits chat-template tokens
#    mid-response, cut generation there.
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|eot_id|>"
SYSTEM """You are a code assistant. Output code and one short sentence of context. No explanations of basic concepts. No 'here's how you could also do it.' No closing pleasantries. When the task is done, stop."""
```

Build it and run the same prompt:

```bash
ollama create code-assistant -f Modelfile.code
ollama run code-assistant "Refactor this Python function: def f(x): return [i*2 for i in x if i > 0]"
```

Typical output (a temperature of 0.2 narrows the variation but does not make runs byte-identical):

```python
def double_positive(values: list[int]) -> list[int]:
    return [v * 2 for v in values if v > 0]
```

No preamble. No alternatives. No "let me explain." Just the refactor.
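A terse answer only helps if the refactor still behaves like the original. Here is a minimal sketch of a sanity check that compares the original `f` with the `double_positive` shown above. The sample inputs and the helper name `behaves_the_same` are my own illustration, not part of any Ollama tooling.

```python
def f(x):
    """The original one-liner from the prompt."""
    return [i*2 for i in x if i > 0]


def double_positive(values: list[int]) -> list[int]:
    """The refactored version the model returned."""
    return [v * 2 for v in values if v > 0]


def behaves_the_same(cases: list[list[int]]) -> bool:
    """Return True if both versions agree on every input list."""
    return all(f(case) == double_positive(case) for case in cases)


# Hypothetical sample inputs: empty, all negative, mixed, zero included.
SAMPLE_CASES = [[], [-3, -1], [1, -2, 3], [0, 5]]

if __name__ == "__main__":
    print(behaves_the_same(SAMPLE_CASES))
```

In a pipeline, a check like this is a cheap gate to run before accepting any model-written rewrite.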

Why Each Setting Matters

**temperature 0.2** is the difference between a model that sometimes picks a clever variable name and one that picks the same one every time. For code, low temperature is almost always what you want.
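To see why a low temperature makes the model pick the same token again and again, here is a small sketch of textbook temperature scaling. It is not Ollama's actual sampler, which applies other filters as well, and the three scores are made up for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up scores for three candidate next tokens.
EXAMPLE_LOGITS = [2.0, 1.5, 0.5]

if __name__ == "__main__":
    for t in (0.8, 0.2):
        probs = softmax_with_temperature(EXAMPLE_LOGITS, t)
        print(t, [round(p, 3) for p in probs])
```

With these example scores, the top candidate goes from roughly a 59% chance at 0.8 to roughly 92% at 0.2, which is why the variable names stop changing between runs.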

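**num_ctx 8192** is the setting that lets the model actually see your file. Ollama's default window is much smaller (2048 tokens in many releases) and truncates anything beyond a short snippet. The window covers the prompt and the response together, and a larger window costs more memory.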
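A rough way to tell whether a file will fit before you send it is sketched below. The four-characters-per-token ratio is an assumed average for English text and code, not a real tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count; chars_per_token is an assumed average, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(prompt: str, num_ctx: int = 8192, num_predict: int = 1024) -> bool:
    """True if the estimated prompt tokens plus the reply budget fit in the window."""
    return estimate_tokens(prompt) + num_predict <= num_ctx
```

The system prompt also takes up part of the window, so treat a borderline result as a "no".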

**num_predict 1024** is the kill switch for rambling. Capping output means even if the model starts to wander, it runs out of runway before it goes off the rails.
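The cap cuts output mid-stream rather than asking the model to be brief, so a pipeline should notice when it was hit. The sketch below assumes the API response carries a `done_reason` field that reads "length" when the token cap ended generation, as recent Ollama releases report; check this against your version. The helper name is hypothetical.

```python
def was_truncated(api_response: dict) -> bool:
    """True if generation ended because the token cap was reached.

    Assumes the response dict has a `done_reason` key whose value is
    "length" at the cap and "stop" on a normal finish.
    """
    return api_response.get("done_reason") == "length"
```

If this returns True, the code block you got back may be missing its tail and should not be applied blindly.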

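**The stop parameter** is the one nobody talks about. Llama 3.1 uses chat-template tokens to mark turn boundaries: `<|eot_id|>` ends a turn and `<|start_header_id|>` opens the next one. If one of them leaks into the output, everything after it is noise, so generation should end there. The base model's bundled parameters may already include these stops (`ollama show llama3.1:8b --parameters` lists what is set); declaring them in your own Modelfile keeps the behavior explicit. Adding them to the stop list cuts roughly 30% of the noise from default outputs, a rough estimate rather than a benchmarked figure.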
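What a stop list does can be imitated in a few lines. This is an illustration of the idea on an already-generated string, not how Ollama implements it internally, and `apply_stop_sequences` is a made-up name.

```python
def apply_stop_sequences(text: str, stops: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# The two stop strings from the Modelfile above.
LLAMA_STOPS = ["<|start_header_id|>", "<|eot_id|>"]
```

Everything from the first matching token onward is dropped, including the token itself.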

The Trick Most People Miss: `num_gpu`

On a 16GB laptop, add this to the Modelfile:

```dockerfile
# Offload only 25 layers to GPU. Rest runs on CPU.
PARAMETER num_gpu 25
```

By default, Ollama decides on its own how many layers to place on the GPU, and it tends to fill the memory it finds. That works for short prompts and can crash when the context fills and memory use grows. Pinning `num_gpu` to 25 of the model's 32 layers leaves headroom: you get a stable model that uses the GPU for what fits and falls back to CPU for the rest.
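The reason memory grows as the context fills is the key-value cache, which stores one entry per token per layer. Here is a back-of-the-envelope estimate using the published Llama 3.1 8B shape (32 layers, 8 key-value heads, head dimension 128) and assuming a 16-bit cache. Real usage adds the model weights and runtime buffers on top.

```python
def kv_cache_bytes(num_ctx: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Memory for keys and values across all layers at a full context window."""
    # The leading 2 counts one key vector and one value vector per token.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


if __name__ == "__main__":
    for ctx in (2048, 8192):
        print(ctx, kv_cache_bytes(ctx) / 2**20, "MiB")
```

Under these assumptions, a full 8192-token window needs about 1 GiB for the cache alone, four times what a 2048-token window needs.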

Wiring It Into a Pipeline

The point of a Modelfile is not the chat. It is the API:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "code-assistant",
  "prompt": "Refactor this Python function: def f(x): return [i*2 for i in x if i > 0]",
  "stream": false
}'
```

Same model, same behavior, no interactive chat session. Drop it into a script, a Makefile, or a coding agent. The Modelfile is the contract: every call to `code-assistant` gets the same parameters and the same system prompt.
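The same call from a script could look like the sketch below. It uses only the standard library and assumes Ollama is listening on its default local port and that the reply is a JSON object with a `response` field, as the curl example implies. `build_request` and `generate` are my own helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "code-assistant") -> urllib.request.Request:
    """Build the same POST request the curl example sends."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a locally running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

Keeping the request builder separate from the network call makes the payload easy to inspect or test without a running server.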

A Modelfile is about fifteen lines that turn a chat demo into a real tool.

— Mr. Technology
