Open Source · 2026-06-10

Liger Kernel Is the Open-Source Triton Hack That Trains LLMs on 60% Less Memory — and Almost Nobody Uses It Properly

LinkedIn's Liger Kernel is the single highest-leverage open-source library in the LLM training stack and almost nobody is talking about it. Triton-fused RMSNorm, RoPE, SwiGLU, and a genuinely clever FusedLinearCrossEntropy that drops 5-7 GB of activation memory at the loss layer. One line of code. 20% higher throughput, 60% less memory, 7M+ downloads, integrated into HuggingFace Transformers, TRL, LLaMa-Factory, Axolotl, and SWIFT. If you fine-tune, this is the change you make this week.

Every team I talk to is overpaying for GPU memory. Not because H100s are expensive — because the default PyTorch path through an LLM wastes 30-50% of the memory you paid for on activations and intermediates you never needed to materialize. The fix is Liger Kernel — LinkedIn's open-source collection of Triton-fused training kernels, BSD 2-clause, 7M+ downloads, ~40 architectures, integrated into HuggingFace Transformers, TRL, LLaMa-Factory, Axolotl, and SWIFT. One line of code.

The Mechanism: Kernel Fusion Where It Hurts

The claim is concrete: on a LLaMA-3-8B SFT run with batch 8, bf16, AdamW, gradient checkpointing, FSDP1 across 8 A100s, Liger gives 20% higher throughput and 60% less memory than stock HuggingFace. HuggingFace OOMs at 4K context; HuggingFace + Liger scales to 16K. (Liger Kernel README)

The mechanism is kernel fusion. In eager PyTorch, RMSNorm runs as a sequence of low-level ops (pow, mean, rsqrt, mul), each one a separate kernel launch and a full intermediate tensor in HBM. Liger fuses the sequence into a single Triton kernel that loads each row once, does the reduction and scaling on-chip, and writes only the output, never materializing the intermediates. Same trick for RoPE, SwiGLU, and the one that matters most: FusedLinearCrossEntropy.
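To make the fusion idea concrete, here is a minimal pure-Python sketch, not Liger's Triton code. The function names `rms_norm_unfused` and `rms_norm_fused` are hypothetical, and each Python list stands in for a tensor that an unfused GPU path would write to HBM. The first function builds one intermediate per op; the second keeps the reduction in a scalar accumulator and writes only the output.

```python
import math

def rms_norm_unfused(x, weight, eps=1e-6):
    """Op-by-op RMSNorm: each step builds a full intermediate list,
    standing in for a separate kernel launch plus a tensor in HBM."""
    squared = [v * v for v in x]                    # pow
    mean_sq = sum(squared) / len(squared)           # mean
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)        # rsqrt
    normed = [v * inv_rms for v in x]               # mul
    return [n * w for n, w in zip(normed, weight)]  # mul by the learned weight

def rms_norm_fused(x, weight, eps=1e-6):
    """Fused-style RMSNorm: one reduction held in a scalar accumulator,
    then a single output write per element, with no intermediate lists."""
    acc = 0.0
    for v in x:
        acc += v * v
    inv_rms = 1.0 / math.sqrt(acc / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

row = [1.0, -2.0, 3.0, 0.5]      # one row of hidden states (illustrative values)
gamma = [1.0, 1.0, 0.5, 2.0]     # RMSNorm weight (illustrative values)
unfused = rms_norm_unfused(row, gamma)
fused = rms_norm_fused(row, gamma)
max_diff = max(abs(a - b) for a, b in zip(unfused, fused))
```

Both paths compute the same numbers; the difference is how many full-size buffers exist along the way. On a GPU, those buffers are the HBM traffic a fused kernel avoids.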

FusedLinearCrossEntropy is the headline. The final lm_head projection from hidden states (e.g. 4096) to vocabulary (e.g. 128k) materializes a [B, T, 128000] logits tensor in bf16: bytes you compute, write to HBM, read back, softmax, and discard. FusedLinearCrossEntropy fuses the projection and the loss into one chunked operation that computes logits a chunk of tokens at a time, takes the loss and its gradient for that chunk, and never materializes the full tensor. On a 7B-class model with a 128k vocab, that is a 5-7 GB activation savings per step, and it is the reason Liger unlocks long-context training on hardware that OOMs on stock PyTorch.
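The chunking idea can be sketched in plain Python. This is an illustration under assumptions, not Liger's implementation: the helper names (`logits_bytes`, `full_loss`, `chunked_loss`) and the toy shapes are hypothetical, and only the forward loss is shown, whereas the real kernel also produces gradients chunk by chunk. The sketch tracks how many logits are live at once and confirms the chunked loss matches the full one.

```python
import math

def logits_bytes(num_tokens, vocab_size, bytes_per_element=2):
    """Bytes held by a fully materialized logits tensor (bf16 = 2 bytes)."""
    return num_tokens * vocab_size * bytes_per_element

def project(hidden, lm_head):
    """lm_head matmul for one token: hidden vector -> one logit per vocab row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]

def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

def full_loss(hiddens, lm_head, targets):
    """Reference path: the whole [T, V] logits table exists before the loss."""
    logits = [project(h, lm_head) for h in hiddens]
    losses = [cross_entropy(z, t) for z, t in zip(logits, targets)]
    return sum(losses) / len(losses), len(logits) * len(lm_head)

def chunked_loss(hiddens, lm_head, targets, chunk_size):
    """Chunked path: only [chunk_size, V] logits are live at any moment."""
    total, peak_live = 0.0, 0
    for start in range(0, len(hiddens), chunk_size):
        chunk = hiddens[start:start + chunk_size]
        logits = [project(h, lm_head) for h in chunk]
        peak_live = max(peak_live, len(logits) * len(lm_head))
        for z, t in zip(logits, targets[start:start + chunk_size]):
            total += cross_entropy(z, t)
    return total / len(hiddens), peak_live

# Toy problem: 6 tokens, hidden size 2, vocabulary of 3.
hiddens = [[0.1 * i, 0.5 - 0.2 * i] for i in range(6)]
lm_head = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
targets = [0, 1, 2, 0, 1, 2]

full, full_live = full_loss(hiddens, lm_head, targets)
chunked, chunk_live = chunked_loss(hiddens, lm_head, targets, chunk_size=2)

# Assumed shape for scale: 16,384 tokens in a step against a 128k vocabulary.
bf16_logits = logits_bytes(16_384, 128_000)  # 4,194,304,000 bytes
```

With those assumed shapes the bf16 logits alone come to about 4.2 GB. The actual saving depends on batch shape, sequence length, and any float32 copies of the logits, so treat the arithmetic as an order-of-magnitude check rather than a reproduction of the 5-7 GB figure.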

The Code

The integration is a one-liner:

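```python
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Patch the HuggingFace Llama modeling code before the model is instantiated.
apply_liger_kernel_to_llama(
    rope=True,
    fused_linear_cross_entropy=True,
    rms_norm=True,
    swiglu=True,
)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
```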

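For TRL, the same switch is a flag on the SFTTrainer config (use_liger=True since TRL 0.10.1; newer TRL releases deprecate it in favor of use_liger_kernel, matching the HuggingFace Trainer argument), and for LLaMa-Factory there is an enable_liger_kernel: true YAML toggle. No rewriting your model code or your training loop: the apply call patches the HuggingFace modeling classes in place, which is why it is called before from_pretrained above. The backward pass is rewritten in Triton too, and gradients are exact, not approximated.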
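A fused kernel cannot lean on autograd for its internals, so its backward pass has to be derived by hand and checked against a reference. The sketch below shows that style of check for RMSNorm in pure Python. It is an illustration, not Liger's Triton backward or its test suite, and the function names are hypothetical. The analytic input gradient is compared with central finite differences.

```python
import math

def rms_norm_forward(x, w, eps=1e-6):
    """y_i = x_i * inv_rms * w_i, with inv_rms = 1 / sqrt(mean(x^2) + eps)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * wi for v, wi in zip(x, w)], inv_rms

def rms_norm_backward(x, w, grad_out, inv_rms):
    """Hand-derived gradient with respect to the input row:
    dx_j = g_j * w_j * inv_rms - x_j * inv_rms^3 / n * sum_i(g_i * w_i * x_i)."""
    n = len(x)
    dot = sum(g * wi * v for g, wi, v in zip(grad_out, w, x))
    return [g * wi * inv_rms - v * (inv_rms ** 3) * dot / n
            for g, wi, v in zip(grad_out, w, x)]

def numeric_grad(x, w, grad_out, h=1e-6):
    """Central finite differences on the scalar sum(grad_out * y)."""
    grads = []
    for j in range(len(x)):
        hi = x[:j] + [x[j] + h] + x[j + 1:]
        lo = x[:j] + [x[j] - h] + x[j + 1:]
        f_hi = sum(g * y for g, y in zip(grad_out, rms_norm_forward(hi, w)[0]))
        f_lo = sum(g * y for g, y in zip(grad_out, rms_norm_forward(lo, w)[0]))
        grads.append((f_hi - f_lo) / (2 * h))
    return grads

x = [1.0, -2.0, 3.0, 0.5]          # illustrative input row
w = [1.0, 1.0, 0.5, 2.0]           # illustrative RMSNorm weight
grad_out = [0.3, -0.1, 0.7, 0.2]   # illustrative upstream gradient

_, inv_rms = rms_norm_forward(x, w)
analytic = rms_norm_backward(x, w, grad_out, inv_rms)
numeric = numeric_grad(x, w, grad_out)
max_err = max(abs(a - b) for a, b in zip(analytic, numeric))
```

The agreement is limited only by finite-difference error, which is what "exact" means here: the hand-written backward computes the same derivative as the unfused math, with no approximation traded for speed.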

When To Use It, When Not To

Use it whenever you are fine-tuning a supported architecture on NVIDIA or AMD hardware — which is basically every fine-tune. The v0.5.0 release added post-training kernels — DPO, CPO, ORPO, SimPO, KTO — that deliver up to 80% memory savings on alignment tasks, the only reason some teams can run a 70B DPO on 8 H100s at all.

Skip it for anything custom on the forward pass — non-standard attention, MoE routing tweaks, new normalization schemes Liger does not cover. Skip it for inference — vLLM, SGLang, and llama.cpp own inference.

Liger vs The Alternatives

Liger vs Unsloth. Unsloth ships QLoRA tricks (4-bit base, custom autograd through bitsandbytes) and is the right answer for LoRA adapters on a 24GB GPU. Liger is the right answer for full-parameter SFT, DPO, and long-context pretraining. They are stackable in some configs, orthogonal in most.

Liger vs torch.compile. torch.compile is a general graph fuser. Liger is hand-written, architecture-aware. The LinkedIn team reports torch.compile on top of Liger yields an additional 10x speedup on a training encoder (LinkedIn, April 2026). Liger first, torch.compile second.

Liger vs Flash Attention. Flash Attention is upstream of Liger's philosophy — it proved kernel fusion is the right answer for the attention layer. Liger is what Flash Attention looks like when you apply the same idea to the other 20 layers nobody bothered to fuse.

The 2026 Angle: Agents Writing The Kernels

The most interesting post in the project this year is not a release — it is LinkedIn's April 2026 engineering blog on using coding agents to write new Liger kernels. Three workflows (liger-kernel-dev, liger-autopatch, liger-kernel-bench) classify a target op into three complexity tiers, generate up to 8 files of Triton + PyTorch wrapper + tests, and benchmark. A ReLU² kernel that would have taken days of expert time was shipped end-to-end after a human review checkpoint. (LinkedIn blog) Humans define the tier, agents write the code, humans verify the numbers.

The Take

Liger is the rare open-source project that cuts memory by 60% and lifts throughput by 20% on the most expensive operation in the AI stack and asks for nothing but a one-line config flag. The trade-off is essentially zero: forward and backward are exact, kernels are tested against stock PyTorch for convergence, and integration is upstreamed into every major trainer. If you are paying for a GPU cluster, the first optimization you should make is not a cheaper cloud or a smaller model. It is a pip install liger-kernel and a True in your config.

Mr. Technology


*Repo: github.com/linkedin/Liger-Kernel — BSD 2-clause, 7M+ downloads, 100+ contributors, supporting ~40 model architectures including Llama 3.x, Qwen 2/3, Mistral, Gemma 2/3, Phi-3, GPT-OSS, DeepSeek-V3, Granite, and the Qwen2-VL / Llama-3.2-Vision vision-language models. Tech report: arxiv.org/abs/2410.10989. LinkedIn engineering: Liger Kernel launch (Dec 2024), Agents accelerating Liger engineering (Apr 2026), TorchTune × Liger joint post (Mar 2025). Docs: linkedin.github.io/Liger-Kernel. Integrations: HuggingFace Transformers, TRL SFTTrainer, LLaMa-Factory, Axolotl, SWIFT, oumi. Install: pip install liger-kernel. Requires torch >= 2.1.2 (CUDA) or torch >= 2.5.0 (ROCm 7.2), triton >= 2.3.0 (CUDA) or triton >= 3.0.0 (ROCm). Compatible with Flash Attention, FSDP, DeepSpeed ZeRO-1/2/3, DDP.*
