
Fine-tuning stopped being a privilege in 2026. A small team rewrote the slow parts in Triton and called the result Unsloth. The library is unsloth, the new product is Unsloth Studio, and the numbers are real: 2x faster, 70% less VRAM, 500+ supported models, Apache 2.0, direct GGUF export. That is just what happens when you stop letting PyTorch's general kernels touch your training loop and write the hot path yourself.
Hey guys, Mr. Technology here.
unsloth patches Hugging Face Transformers + TRL + PEFT to swap PyTorch's general CUDA implementations of attention, RoPE, cross-entropy, RMS layernorm, MLP, and QKV for hand-written Triton kernels with hand-derived backward passes, so those ops do not rely on PyTorch's recorded autograd tape. It looks like a normal FastLanguageModel.from_pretrained(...) call. It is not a normal training loop. On a free Colab T4: 2x speedup and 70% less VRAM versus the same model with the same hyperparameters in raw HF + PEFT. On a single H100 the ratio is smaller (1.6x-1.8x), but the VRAM savings buy larger context windows and bigger batch sizes, which is what you actually want.
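To make "manual autograd" concrete, here is a minimal sketch in plain Python, not Unsloth's Triton code. It assumes the standard RMSNorm definition, writes the backward pass by hand, and checks it against finite differences. The function names and the sample vectors are mine, purely for illustration.

```python
import math

def rmsnorm_forward(x, w, eps=1e-6):
    """y_i = w_i * x_i / rms, with rms = sqrt(mean(x^2) + eps)."""
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    y = [wi * xi / rms for wi, xi in zip(w, x)]
    return y, rms  # rms is the only intermediate the backward pass needs

def rmsnorm_backward(grad_y, x, w, rms):
    """Hand-derived dL/dx; no graph of intermediate ops is recorded."""
    n = len(x)
    # dot = sum_j grad_y_j * w_j * x_j, shared by every element of dL/dx
    dot = sum(g * wi * xi for g, wi, xi in zip(grad_y, w, x))
    return [g * wi / rms - xi * dot / (n * rms ** 3)
            for g, wi, xi in zip(grad_y, w, x)]

def numeric_grad(x, w, grad_y, h=1e-6):
    """Central finite differences of L = sum(grad_y * y), for checking only."""
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        xm = list(x)
        xm[i] -= h
        lp = sum(g * y for g, y in zip(grad_y, rmsnorm_forward(xp, w)[0]))
        lm = sum(g * y for g, y in zip(grad_y, rmsnorm_forward(xm, w)[0]))
        grads.append((lp - lm) / (2 * h))
    return grads

# Hypothetical sample values, chosen only to exercise the math.
x = [0.5, -1.2, 2.0, 0.3]
w = [1.0, 0.8, 1.1, 0.9]
grad_y = [0.1, -0.4, 0.25, 0.7]

y, rms = rmsnorm_forward(x, w)
analytic = rmsnorm_backward(grad_y, x, w, rms)
numeric = numeric_grad(x, w, grad_y)
max_err = max(abs(a - b) for a, b in zip(analytic, numeric))
print(f"max |analytic - numeric| = {max_err:.2e}")
```

The point of the sketch: the forward pass saves a single scalar, and the backward pass is one closed-form expression. A fused kernel can exploit that in a way a generic op-by-op graph cannot.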
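```python
from unsloth import FastLanguageModel

# Load a pre-quantized 4-bit base model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-4-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)

# Train as usual here (for example with a TRL trainer) before exporting.

# Merge the LoRA weights and write a quantized GGUF file.
model.save_pretrained_gguf("llama-4-8b-lora", tokenizer, quantization_method="q4_k_m")
```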
That last line is the one nobody talks about. Unsloth's GGUF export writes a quantized llama.cpp file directly from a LoRA-merged model — no llama.cpp clone, no convert.py, no manual merge. The 6x GGUF conversion speedup Daniel Han published in the TinyLlama benchmarks is real because the merge happens inside the same kernel pipeline.
The wins are not from one trick. They are from six, all small, all cumulative: Triton RoPE (~7% standalone), Triton cross-entropy (~1%, free in the gradient path), Triton RMS layernorm (~3%), manual autograd for MLP / QKV (~6% combined), native 4/8-bit + QLoRA with fused dequant, and **use_gradient_checkpointing="unsloth", which fits 4x longer context** in the same VRAM. Stack all six and you get the 2x headline; the itemized kernel gains are individually small, so the headline leans on the last two items, which carry no standalone percentage here. No single clever trick, just the discipline of writing the hot path in Triton and not letting PyTorch touch it.
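A quick sanity check on that arithmetic, as a rough sketch. It assumes the four standalone percentages quoted above multiply independently, which is my simplification (real kernels interact). The dictionary keys are just labels I made up for the items in the list.

```python
# Standalone gains quoted in the text, as fractional speedups.
standalone_gains = {
    "triton_rope": 0.07,
    "triton_cross_entropy": 0.01,
    "triton_rms_layernorm": 0.03,
    "manual_autograd_mlp_qkv": 0.06,
}

def compound(gains):
    """Multiply (1 + g) over all gains, assuming they are independent."""
    total = 1.0
    for g in gains:
        total *= 1.0 + g
    return total

itemized = compound(standalone_gains.values())
headline = 2.0                    # the 2x figure from the text
remaining = headline / itemized   # factor the un-itemized tricks must supply

print(f"itemized kernels compound to {itemized:.2f}x")
print(f"remaining factor needed for {headline:.0f}x: {remaining:.2f}x")
```

Under that assumption the four itemized kernels compound to about 1.18x. The rest of the gap to 2x would have to come from the quantized path and from what the memory savings let you do with batch size and context.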
In March 2026 they layered Unsloth Studio on top — an open-source, no-code web UI for training, running, and exporting the same models, with tool-calling auto-healing and Python/bash execution. Runs locally on Mac, Windows, Linux, DGX Spark, consumer NVIDIA. Most teams that fine-tune today do not need a training cluster, they need a Jupyter replacement with kernels that do not suck. Studio is that.
Versus raw HF + PEFT + TRL. Same model, same hyperparameters, 2x faster, 70% less memory. The only reason to use raw HF is an architecture Unsloth does not yet support — and the list is now 500+ models: Llama 4, Qwen 3, Gemma 3, Mistral, Phi 4, DeepSeek-V3, Command A+, plus every vision and audio variant. No good reason to train on raw HF in 2026.
Versus Axolotl. Axolotl is the YAML-config framework the labs use. Excellent for full fine-tunes and multi-node. Slow per step on a single GPU because it does not rewrite the kernels. For QLoRA / LoRA / DoRA on a single 24-80GB box, Unsloth wins. For 70B+ full fine-tunes across a 32-GPU cluster, Axolotl is still the right answer.
Versus LLaMA-Factory. LLaMA-Factory has a web UI and ~200 model recipes. Unsloth has a web UI and 500+ models with custom Triton kernels. LLaMA-Factory wins on multi-modal recipes. Unsloth wins on raw training speed. Use Unsloth for the training itself and LLaMA-Factory's dataset utilities upstream of it.
Versus hosted fine-tuning services. Unsloth is Apache 2.0 and runs on your hardware. Hosted services cost $0.50-$4.00 per training hour: fine for teams without GPUs, irrelevant for teams with them. If the hosted service runs stock kernels, the 2x speedup means one self-hosted hour does the work of two hosted hours, so a self-hosted box costing up to $8/hr matches a $4/hr hosted run on cost per job. For pipelines re-run monthly, self-hosted wins on cost and on data sovereignty.
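Here is the cost comparison as a small sketch. The $4/hr rate and the 2x speedup come from the text. The 10-hour job length is a made-up example, and the whole thing assumes the hosted service runs at baseline speed without the speedup.

```python
def run_cost(hours_at_baseline_speed, rate_per_hour, speedup=1.0):
    """Cost of one job: wall-clock hours shrink by `speedup`, rate stays fixed."""
    return hours_at_baseline_speed / speedup * rate_per_hour

def break_even_rate(hosted_rate_per_hour, speedup):
    """Self-hosted hourly cost at which both options cost the same per job."""
    return hosted_rate_per_hour * speedup

baseline_hours = 10  # hypothetical job length at baseline speed

hosted = run_cost(baseline_hours, 4.00)                      # stock kernels at $4/hr
self_hosted = run_cost(baseline_hours, 8.00, speedup=2.0)    # 2x faster at $8/hr
threshold = break_even_rate(4.00, 2.0)

print(f"hosted: ${hosted:.2f}, self-hosted: ${self_hosted:.2f}")
print(f"self-hosted breaks even at ${threshold:.2f}/hr")
```

Any self-hosted hourly cost below the break-even rate comes out cheaper per job, before counting idle time on hardware you own.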
Where Unsloth loses. Multi-node training. FSDP / DeepSpeed integration is partial. Some architectures (Mamba, Jamba, RWKV) are not yet supported. The 2x number is measured on a single GPU and does not stack with GPU count, so scaling to 8 GPUs and beyond does not give you 16x. If you are training a 405B base, use Megatron-LM or NeMo. If you are training a 7B-70B with PEFT on a single box, Unsloth is the answer.
Unsloth is the most important open-source fine-tuning library of the last two years, and the reason is not the 2x number. The reason is the discipline of writing the hot path. Most "fast fine-tuning" libraries wrap a faster optimizer or a smarter scheduler. Unsloth wrote Triton kernels for the six operations that actually cost time in an LLM training step and shipped them under Apache 2.0. Different category of work.
If you are still training with raw Hugging Face + PEFT + TRL on a single GPU, you are leaving 2x throughput and 70% VRAM on the table every step. Patch in Unsloth, re-run your last fine-tune, watch the wall-clock halve. There is no excuse in 2026 to fine-tune slow.
— Mr. Technology