
Hey guys, Mr. Technology here.
If you have run `ollama run llama3.1` and asked it to refactor a function, you already know the problem. The model gives you a 400-word explanation of what a closure is, then a code block, then an alternative, then a caveat, then a closing pleasantry. It is a chat model. You want a tool.
The fix is a Modelfile. Most people never touch it. With a few parameter tweaks and a tight system prompt, the same model becomes something you can wire into a pipeline.
```bash
ollama run llama3.1:8b "Refactor this Python function: def f(x): return [i*2 for i in x if i > 0]"
```
You will get a paragraph about list comprehensions, the refactor, an alternative using `map` and `filter`, a note about type hints, and a closing pleasantry. A few settings fix this.
Create Modelfile.code:
```dockerfile
FROM llama3.1:8b
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER num_predict 1024
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|eot_id|>"
SYSTEM """You are a code assistant. Output code and one short sentence of context. No explanations of basic concepts. No 'here's how you could also do it.' No closing pleasantries. When the task is done, stop."""
```
Build it and run the same prompt:
```bash
ollama create code-assistant -f Modelfile.code
ollama run code-assistant "Refactor this Python function: def f(x): return [i*2 for i in x if i > 0]"
```
Typical output:
```python
def double_positive(values: list[int]) -> list[int]:
    return [v * 2 for v in values if v > 0]
```
No preamble. No alternatives. No "let me explain." Just the refactor.
**temperature 0.2** is the difference between a model that sometimes picks a clever variable name and one that picks the same one every time. For code, low temperature is almost always what you want.
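To see why a low temperature makes the output more repeatable, here is a minimal sketch of temperature-scaled softmax. The candidate names and scores are made up for illustration, and this is not Ollama's actual sampler, which applies other filters as well.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate variable names (illustrative only).
candidates = ["values", "items", "xs"]
logits = [2.0, 1.5, 1.0]

for t in (1.0, 0.2):
    probs = softmax_with_temperature(logits, t)
    summary = ", ".join(f"{name}={p:.2f}" for name, p in zip(candidates, probs))
    print(f"temperature {t}: {summary}")
```

With these made-up scores, the top candidate's probability rises from about 0.51 at temperature 1.0 to about 0.92 at 0.2, which is why the same name keeps winning.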
**num_ctx 8192** is the setting that lets the model actually see your file. The default of 2048 tokens (newer Ollama releases may differ) truncates anything beyond a short snippet.
**num_predict 1024** is the kill switch for rambling. Capping output means even if the model starts to wander, it runs out of runway before it goes off the rails.
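The two limits interact: the prompt and the generated tokens share one context window, so a file has to fit in roughly `num_ctx` minus the `num_predict` reserve. Below is a rough pre-flight check. It assumes about four characters per token, which is a crude heuristic and not the Llama tokenizer, and `fits_in_context` is a hypothetical helper name.

```python
NUM_CTX = 8192       # matches PARAMETER num_ctx in the Modelfile
NUM_PREDICT = 1024   # matches PARAMETER num_predict in the Modelfile
CHARS_PER_TOKEN = 4  # assumption: crude average, real tokenizers vary

def fits_in_context(prompt: str, system_prompt: str = "") -> bool:
    """Estimate whether the prompt plus system prompt leave room for the reply."""
    estimated_tokens = (len(prompt) + len(system_prompt)) // CHARS_PER_TOKEN + 1
    return estimated_tokens <= NUM_CTX - NUM_PREDICT

# Example: a small source file repeated to simulate a longer one.
source = "def f(x): return [i*2 for i in x if i > 0]\n" * 100
print(fits_in_context(source))
```

If the check fails, either raise `num_ctx` (at the cost of memory) or send less of the file.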
**The stop parameter** is the one nobody talks about. Llama 3.1 emits chat-template tokens (`<|start_header_id|>`, `<|eot_id|>`) when it thinks a turn is over. Adding them to the stop list cuts roughly 30% of the noise from default outputs.
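Conceptually, a stop sequence ends generation as soon as that string would appear in the output. The sketch below mimics the effect on an already finished string. It illustrates the idea only, not how Ollama implements it internally, and `apply_stop` is a hypothetical helper.

```python
STOP_SEQUENCES = ["<|start_header_id|>", "<|eot_id|>"]

def apply_stop(text: str, stops=STOP_SEQUENCES) -> str:
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

# Made-up raw output with template tokens leaking after the code.
raw = "return [v * 2 for v in values if v > 0]<|eot_id|><|start_header_id|>assistant"
print(apply_stop(raw))  # prints: return [v * 2 for v in values if v > 0]
```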
**num_gpu** matters on a 16GB laptop. Add this to the Modelfile:
```dockerfile
PARAMETER num_gpu 25
```
Default behavior pushes the whole model to GPU, which works for short prompts and can crash when the context fills. `num_gpu` sets how many model layers are offloaded to the GPU, so pinning it gives you a stable model that keeps those layers on the GPU and runs the rest on the CPU.
The point of a Modelfile is not the chat. It is the API:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "code-assistant",
  "prompt": "Refactor this Python function: def f(x): return [i*2 for i in x if i > 0]",
  "stream": false
}'
```
Same model, same behavior, no chat overhead. Drop it into a script, a Makefile, or a coding agent. The Modelfile is the contract: the model that comes out is the model that goes in, every time.
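For a script, the same request can be made from Python's standard library. This is a minimal sketch, assuming Ollama is listening on the default `localhost:11434` and the `code-assistant` model has been created. `build_payload` and `generate` are illustrative names, and error handling is left out.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "code-assistant") -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")

def generate(prompt: str, model: str = "code-assistant") -> str:
    """POST the prompt to Ollama and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # With "stream": false, the generated text arrives in the "response" field.
    return reply["response"]
```

Calling `generate("Refactor this Python function: ...")` from a build script or agent loop returns the model's text as a plain string, ready to write to a file or pass to the next step.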
A Modelfile is fewer than a dozen lines that turn a chat demo into a real tool.
— Mr. Technology