
Hey guys, Mr. Technology here.
Open-source video models have been silent for two years: clips you have to Foley yourself, generation separated from editing, inpainting that means regenerating the whole shot. SkyReels-V4 from Kunlun's Skywork AI went live in April 2026, and it is the first open-weights foundation model that generates synchronized video AND audio in a single forward pass. The single-pass part is the only detail that matters.
SkyReels-V4 is a unified multi-modal video foundation model. You give it a prompt — optionally with reference images, video clips, masks, or audio references — and it gives you back a 1080p, 32 FPS, 15-second clip with sound baked in. The sound is not added in post. It is generated alongside the frames, locked to visual events by the same model that drew them. A wave crashing produces its crash. A foot on gravel produces gravel. A piano key produces the note in the acoustic space of the room shown.
It is also the first open model that unifies generation, inpainting, and editing in a single network. Same weights handle "make me a 15-second shot," "remove the coffee cup on the left," and "change her jacket to the reference image." No separate ControlNet, no LoRA, no model swap.
The lineage is from Skywork AI — V1 was text-to-video, V2 added diffusion forcing for autoregressive length, V3 shipped January 2026 with multimodal in-context learning. V4 is the architectural reset.
The headline architecture is a dual-stream Multimodal Diffusion Transformer. One branch is a video DiT that synthesizes video latents. The other is an audio DiT that synthesizes audio latents. Both share a text encoder instantiated from a strong multimodal LLM. The branches do not fuse mid-network; they remain independent, connected only through shared conditioning.
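To make "independent branches, shared conditioning" concrete, here is a toy sketch in plain Python. Everything in it is a stand-in I made up for illustration: `encode_prompt` fakes the MLLM text encoder with seeded random vectors, and `cross_attend` is a bare softmax attention over prompt tokens, not the paper's actual blocks. The only point is the data flow: both branches read the same `cond`, and neither one reads the other's latents.

```python
import math
import random


def encode_prompt(prompt, dim=8):
    """Hypothetical stand-in for the shared MLLM text encoder.

    Returns one deterministic pseudo-random vector per whitespace token.
    """
    vectors = []
    for token in prompt.lower().split():
        rng = random.Random(token)  # string seed, so the output is repeatable
        vectors.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return vectors


def cross_attend(latents, cond):
    """Each latent vector attends over the prompt tokens and adds the result."""
    dim = len(cond[0])
    updated = []
    for query in latents:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in cond]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        context = [sum(w / total * key[i] for w, key in zip(weights, cond))
                   for i in range(dim)]
        updated.append([q + c for q, c in zip(query, context)])
    return updated


def denoise_step(video_latents, audio_latents, cond):
    """One toy step: the branches share `cond` and nothing else."""
    return cross_attend(video_latents, cond), cross_attend(audio_latents, cond)


cond = encode_prompt("a wave crashing on rocks")
video = [[0.1] * 8 for _ in range(4)]   # 4 toy video latent tokens
audio = [[0.2] * 8 for _ in range(6)]   # 6 toy audio latent tokens
new_video, new_audio = denoise_step(video, audio, cond)
```

Swap the audio latents for anything you like and `new_video` comes out identical. In this layout the only thing that can tie sound to picture is the prompt encoding plus whatever the model learned from data.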
This is not the obvious design. The obvious design is cross-attention between audio and video at every layer, as in Omnihuman and Multitalk. SkyReels-V4 says: keep them separate, share the prompt encoder, learn temporal alignment from data. The paper claims better lip-sync and event-locked audio because each branch has a clean unimodal objective to optimize. The argument is that cross-attention designs entangle the two objectives and force each branch to compromise.
For editing and inpainting, the video branch uses a channel-concatenation formulation. The conditioning signal — mask, reference image, style reference — is concatenated channel-wise with the noisy latent before denoising. Image-to-video, video extension, video editing, and object replacement are all instances of "inpainting with a different mask." One network, one interface.
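Here is a rough sketch of what "inpainting with a different mask" looks like, using nested Python lists as toy tensors. The task names, the `make_mask` and `concat_channels` helpers, and the channel order (noisy latent, then mask, then masked reference) are my assumptions, borrowed from how inpainting diffusion models are commonly wired up. The paper's exact layout may differ.

```python
def make_mask(task, frames, height, width, keep_frames=1, box=None):
    """Return mask[t][y][x]: 1.0 = content is given, 0.0 = model must generate it."""
    mask = [[[0.0] * width for _ in range(height)] for _ in range(frames)]
    if task == "text_to_video":
        return mask  # nothing is given, everything is generated
    if task in ("image_to_video", "video_extension"):
        given = 1 if task == "image_to_video" else keep_frames
        for t in range(min(given, frames)):
            mask[t] = [[1.0] * width for _ in range(height)]
        return mask
    if task == "inpaint":
        top, bottom, left, right = box  # region to regenerate, end-exclusive
        for t in range(frames):
            for y in range(height):
                for x in range(width):
                    inside = top <= y < bottom and left <= x < right
                    mask[t][y][x] = 0.0 if inside else 1.0
        return mask
    raise ValueError(f"unknown task: {task}")


def concat_channels(noisy, mask, reference):
    """noisy, reference: [C][T][H][W]; mask: [T][H][W]. Returns [2C + 1][T][H][W]."""
    masked_reference = [
        [[[value * keep for value, keep in zip(row, mask_row)]
          for row, mask_row in zip(frame, mask_frame)]
         for frame, mask_frame in zip(channel, mask)]
        for channel in reference
    ]
    # Stack along the channel axis: noisy latent, mask, masked reference.
    return noisy + [mask] + masked_reference


C, T, H, W = 2, 3, 2, 2
noisy = [[[[0.5] * W for _ in range(H)] for _ in range(T)] for _ in range(C)]
reference = [[[[1.0] * W for _ in range(H)] for _ in range(T)] for _ in range(C)]

i2v_input = concat_channels(noisy, make_mask("image_to_video", T, H, W), reference)
edit_input = concat_channels(noisy, make_mask("inpaint", T, H, W, box=(0, 1, 0, 1)), reference)
print(len(i2v_input), len(edit_input))  # both 2 * C + 1 = 5 channels
```

Both tasks hand the denoiser an input of the same shape. The only thing that changes is where the mask says "keep" and where it says "generate," which is why one set of weights can cover all of them.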
The MLLM text encoder is what lets this generalize. Because the encoder understands visual references semantically, the model does visually-referenced inpainting — "change the character's shirt to match this image" — without explicit feature-matching losses. The MLLM already knows what "match this image" means. The video branch just attends through the shared conditioning.
1080p, 32 FPS, 15 seconds is 480 frames at 1920×1080. Generating all of that in one diffusion pass would blow memory on any GPU you can reasonably buy. V4's efficiency trick: generate a low-resolution full sequence plus high-resolution keyframes in one diffusion pass, then run a separate super-resolution pass and frame interpolation to reconstruct the final video. The audio branch runs at its own temporal resolution and gets its own upsampler. This is the same trick the proprietary Veo and Kling stacks use; SkyReels-V4 is the first open model to publish it.
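A quick back-of-the-envelope script shows why the split helps. The 1920×1080, 32 FPS, and 15-second figures are from the spec above. The 4x spatial downscale and the one-keyframe-every-16-frames stride are numbers I picked for illustration, not values from the paper. The model also works on compressed latents rather than raw pixels, so read the result as a ratio, not a memory figure.

```python
def pixel_budget(width=1920, height=1080, fps=32, seconds=15,
                 spatial_downscale=4, keyframe_stride=16):
    """Compare pixels generated in one full-res pass vs. low-res sequence + keyframes.

    spatial_downscale and keyframe_stride are hypothetical, chosen for illustration.
    """
    frames = fps * seconds
    full = frames * width * height
    low_res = frames * (width // spatial_downscale) * (height // spatial_downscale)
    keyframe_count = -(-frames // keyframe_stride)  # ceiling division
    keyframes = keyframe_count * width * height
    return {
        "frames": frames,
        "full_pixels": full,
        "joint_pixels": low_res + keyframes,
        "ratio": (low_res + keyframes) / full,
    }


budget = pixel_budget()
print(budget["frames"])        # 480
print(budget["full_pixels"])   # 995328000
print(budget["joint_pixels"])  # 124416000
print(budget["ratio"])         # 0.125
```

Under those assumed settings the joint pass covers one eighth of the pixels of a naive full-resolution pass. The super-resolution and interpolation stages pay for the rest, but they are cheaper jobs than diffusion over the full clip.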
The closed-source comparison is brutal and worth naming. Veo 3.1, Sora 2, Kling 2.6, Gen-4.5, Seedance 1.5, and Wan 2.6 all generate synchronized audio-video. None of them are open weights. None let you run them on your hardware, fine-tune them on your data, or inspect the latents. They all charge per clip. SkyReels-V4 takes the Veo 3.1 feature set and ships it under permissive licensing on Hugging Face.
Against the open-source field — CogVideoX, Mochi 1, Open-Sora 2.0, HunyuanVideo, LTX-2 — V4 wins on the audio axis. None of those generate audio at all. You are doing silent video and bolting on MMAudio or Stable Audio Open in post. The synchronization is approximate, the workflow is two-stage, and the lip-sync is bad on anything that is not a still face. SkyReels-V4 makes that downstream stack redundant.
V4 also wins on task unification. Runway Aleph does editing. Vidu does reference-to-video. Kling-Omni does multimodal conditioning. No single open model unifies generation, inpainting, and editing under one set of weights. V4 does.
SkyReels-V4 is the open-source video foundation model the field has been waiting two years for. The dual-stream MMDiT is the right architecture: separate branches, shared conditioning, MLLM-driven reference understanding. The 1080p/32 FPS/15-second envelope is what you actually need for short-form content, and the joint low-res / high-res keyframe strategy is what makes that envelope feasible on consumer GPUs.
The honest caveats: 15 seconds is short for narrative work, the MLLM backbone is undisclosed (Qwen2.5-Omni class; the paper does not name it), and the inference stack is not yet a one-line install. If you are running video pipelines in 2026 and you are not testing SkyReels-V4 against whatever you shipped this quarter, you are paying for sync drift that does not have to exist.
— Mr. Technology