The GPU shortage is not a hardware problem. It is a software problem. SkyPilot (Apache 2.0, 10k+ stars, from UC Berkeley's Sky Computing Lab, now at v0.12) abstracts 20+ clouds, Kubernetes, and Slurm behind a single YAML, ships Managed Spot and a multi-cloud optimizer, and is the layer Shopify, H Company, and CoreWeave are running their AI training on. It is the first compute substrate that treats agent workloads as first-class citizens.

SkyPilot Is the Open-Source AI Compute Layer the Rest of the Stack Has Been Waiting For

Every AI team I have talked to this year is fighting the same war on two fronts. The model is getting smarter. The GPU bill is getting stupid. The frontier is moving toward agentic, multi-step, RL-heavy workloads that do not fit cleanly into a single cloud or scheduler. The labs are answering the model question. The cloud question is still open. SkyPilot is the open-source answer — Apache 2.0, 10k+ GitHub stars, built by UC Berkeley's Sky Computing Lab, v0.12 released March 2026.

The Mechanism: One YAML, Every Backend

The core abstraction is a single task spec. You write YAML describing what you need — accelerators, nodes, workdir, setup, run — and submit with sky launch my_task.yaml. SkyPilot finds the cheapest available infra across your registered clouds, provisions GPUs with auto-failover, syncs your workdir, runs setup, and streams logs. The spec is the same whether it lands on AWS, GCP, Azure, CoreWeave, Nebius, Lambda, RunPod, Vast.ai, or your own Kubernetes or Slurm cluster. The mechanism is a policy engine that normalizes the differences between every cloud's SDK, auth, instance vocabulary, spot semantics, pricing, and preemption behavior. The v0.12 Slurm backend follows the same contract — the same YAML that runs on AWS today runs on your on-prem Slurm cluster tomorrow with no code change.
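As a rough illustration of that single task spec, the Python sketch below renders the five fields named above into YAML text. The TaskSpec class is hypothetical and not part of SkyPilot, and the key names (resources.accelerators, num_nodes, workdir, setup, run) are an assumption about the spec layout that should be checked against the SkyPilot YAML reference before use.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical in-memory mirror of a task spec (not a SkyPilot class)."""
    accelerators: str   # e.g. "H100:8", read here as eight H100 GPUs per node
    num_nodes: int
    workdir: str
    setup: str
    run: str

    def to_yaml(self) -> str:
        """Render the spec as YAML text. Key names are assumed, not verified."""
        lines = [
            "resources:",
            f"  accelerators: {self.accelerators}",
            f"num_nodes: {self.num_nodes}",
            f"workdir: {self.workdir}",
            f"setup: {self.setup}",
            f"run: {self.run}",
        ]
        return "\n".join(lines) + "\n"


# Illustrative values only; the commands are placeholders.
spec = TaskSpec(
    accelerators="H100:8",
    num_nodes=2,
    workdir=".",
    setup="pip install -r requirements.txt",
    run="python train.py",
)
print(spec.to_yaml())
```

Saved as my_task.yaml, the printed text is the kind of file you would hand to sky launch my_task.yaml. Nothing in it names a cloud, which is the point: the backend is chosen at launch time, not in the spec.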

The Cost Engine Is The Real Product

The mechanism is necessary. The cost engine is what makes the project matter. Managed Spot runs jobs on spot/preemptible instances with checkpointing and automatic recovery. If AWS yanks your H200 instance mid-training, SkyPilot requeues the job on the next available spot instance and resumes from the last checkpoint. The team reports 3-6x cost savings versus on-demand on the same hardware class. The Optimizer queries live pricing tables across every registered provider on every launch and picks the cheapest available instance. A100 prices on Lambda, Vast, and CoreWeave fluctuate 3-5x week-to-week, so the Optimizer adds another 2x on top of Spot. Autostop terminates idle clusters and Binpacking packs workloads onto the smallest set of nodes that fits, lifting GPU utilization above 80% on multi-team deployments, versus 30-40% on unmanaged Kubernetes GPU pools.
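To make the cheapest-first selection and the resume-from-checkpoint behavior concrete, here is a toy simulation in Python. It is a sketch of the idea, not SkyPilot's implementation: the Offer type, the provider labels, the prices, and the preemption schedule are all made up for illustration, and the checkpoint cadence is assumed to be set by the job itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    provider: str       # hypothetical provider label
    hourly_usd: float   # made-up price, for illustration only
    available: bool     # whether capacity can be provisioned right now


def rank_offers(offers):
    """Return the offers that have capacity, cheapest first."""
    return sorted((o for o in offers if o.available), key=lambda o: o.hourly_usd)


def run_with_recovery(offers, total_steps, checkpoint_every, preempt_at):
    """Simulate a job that resumes from its last checkpoint after each preemption.

    preempt_at maps a provider to the step at which it reclaims the instance.
    Returns the providers used, in order, and the number of steps recomputed.
    """
    used, redone, checkpoint = [], 0, 0
    for offer in rank_offers(offers):
        used.append(offer.provider)
        step = checkpoint                       # resume from the last saved step
        limit = preempt_at.get(offer.provider, total_steps)
        while step < min(limit, total_steps):
            step += 1
            if step % checkpoint_every == 0:
                checkpoint = step               # persist progress
        if step >= total_steps:
            return used, redone
        redone += step - checkpoint             # work lost since the last checkpoint
    raise RuntimeError("no remaining capacity to finish the job")


offers = [
    Offer("cloud-a", 1.00, True),
    Offer("cloud-b", 1.50, True),
    Offer("cloud-c", 0.80, False),   # cheapest, but no capacity right now
]
used, redone = run_with_recovery(
    offers, total_steps=1000, checkpoint_every=100, preempt_at={"cloud-a": 250}
)
print(used, redone)
```

In this run the job starts on cloud-a, is preempted at step 250, and finishes on cloud-b after repeating the 50 steps taken since the checkpoint at step 200. The checkpoint interval is the knob that trades storage writes against recomputed work after a preemption.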

SkyPilot vs The Alternatives

The alternatives are all partial. Ray is a distributed-computing framework: the right answer for custom training code, the wrong answer for cross-cloud scheduling. It does not provision resources, optimize cost, or speak Slurm. Modal is serverless Python: excellent for fast cold starts, wrong for a 32-GPU training job that runs six hours. Runhouse is closer in spirit but focuses more on cluster bring-up than on scheduling, spot recovery, and policy. BentoML and LitServe are serving frameworks that assume you already have a cluster. Vanilla Kubernetes with Kueue is the right answer for the 30% of workloads that fit cleanly into K8s and the wrong answer for the 70% that do not: multi-cloud, spot, Slurm, RL rollouts, agents that bring up an H100 for five minutes and tear it down. SkyPilot sits on top of Kueue when your workloads outgrow what Kueue can express.

Production Use

Three stories from the past quarter.

Shopify runs all AI training on SkyPilot: H200s with InfiniBand on Nebius for distributed training, L4s on GCP for dev. Engineers write accelerators: H200:8 in YAML, and a custom plugin decides the cloud, injects the InfiniBand pod spec, mounts shared caches, and routes to the right Kueue queue. Most engineers do not know which cloud their job landed on. (Shopify, January 2026)

H Company used SkyPilot to unify its multi-cloud platform for online RL, a workload that breaks Slurm's batch scheduling model. Job Groups (v0.12) is the abstraction that lets RL rollouts and trainer loops share resources. (H Company, March 2026)

Research-Driven Agents is the most interesting: a Claude agent on SkyPilot was given the llama.cpp repo and a budget, read arXiv, implemented five flash-attention kernel fusions, and landed a 15% speedup in about three hours for ~$29 in spot GPU time. (SkyPilot blog, April 2026)

The Take

The AI stack in 2026 has a clear top (the model), a clear middle (the agent framework), and a clear bottom (the GPU). What it has been missing is the compute layer — the thing that takes a task spec and finds the right hardware, on the right cloud, at the right price, with the right failure semantics. SkyPilot is the first open-source project to ship that abstraction end-to-end, and the production deployments are showing what it is worth: 3-6x on managed spot, 2x on top from the Optimizer, and a multi-cloud posture that lets you route around capacity, pricing, and preemption events without rewriting your job. The model layer is commoditizing. The agent layer is commoditizing. The compute layer is the next place the open-source stack will take a real position, and SkyPilot is the project defining it.

— Mr. Technology

*Repo: github.com/skypilot-org/skypilot (Apache 2.0, 10,071 stars, 1,089 forks, 74 contributors). v0.12 (March 2026). Built by UC Berkeley's Sky Computing Lab. Production: Shopify (Jan 2026), H Company (Mar 2026), CoreWeave (Nov 2025), AWS SageMaker HyperPod. Featured: Research-Driven Agents, GPU Compass, Agent Skill, RL Doesn't Work on Slurm. Install: uv pip install "skypilot[kubernetes,aws,gcp,azure,oci,nebius,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,seeweb,shadeform,verda]". Supported infra: Kubernetes, Slurm, AWS, GCP, Azure, OCI, CoreWeave, Nebius, Lambda, RunPod, Fluidstack, Cudo, DigitalOcean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai, VastData, Crusoe, Seeweb, Prime Intellect, Shadeform, Verda, VMware vSphere.*