SkyReels V4 — The Chinese Dark Horse That Just Topped the AI Video Leaderboard
Kunlun's SkyReels V4 quietly climbed to #1 on the Artificial Analysis audio-video arena, beating Kling 3.0 and Veo 3.1. Dual-stream architecture, frame-accurate lip-sync, and unified editing — here's what you need to know.
A model nobody saw coming
If you follow the AI video space, you know the usual suspects: Seedance 2.0 from ByteDance, Kling 3.0 from Kuaishou, Sora 2 from OpenAI, Veo 3.1 from Google. These are the models that get the press coverage, the comparison articles, the Twitter debates.
Then on February 25, 2026, a model called SkyReels V4 from Kunlun Tech — a company most people outside China associate with mobile games, not AI research — quietly appeared on the Artificial Analysis leaderboard. Within days it climbed to #1 in the Text-to-Video with Audio category, posting an ELO score of 1132. That put it ahead of Kling 3.0 (1097) and Veo 3.1 (1087). Not by a slim margin.
The AI community's reaction was roughly: wait, who? Kunlun's previous video work was SkyReels V3, an open-source model focused on avatar generation. Perfectly fine for talking-head videos, but not the kind of thing that threatens Sora. V4 is a completely different animal.
What makes the architecture different
Most AI video models treat audio as a second pass — generate the video first, then synthesize audio to match. It works, but the sync is always slightly off. You notice it in lip movements, in footsteps that land a frame too early, in ambient sounds that feel pasted on.
SkyReels V4 uses a dual-stream Multimodal Diffusion Transformer (MMDiT). One branch generates video. The other generates audio. They share a text encoder and run simultaneously, not sequentially. The paper (arXiv:2602.21818) calls this "temporally aligned" generation, but the practical effect is simpler: sound and image are born together.
The difference is immediately obvious when you watch the output. A person speaking has lip movements that match frame for frame, not the "close enough" sync you get from post-hoc audio generation. Footsteps hit exactly when feet touch ground. A door closing produces its sound precisely when the latch engages. These aren't cherry-picked examples; the precision is consistent across generations.
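Kunlun hasn't released V4's code, so the internals below are assumptions. Purely as a mental model, here is a toy PyTorch sketch of what a dual-stream denoiser can look like: video and audio latent tokens conditioned on the same text encoding, exchanging information through cross-attention in every block so both modalities are denoised in lockstep. The dimensions, depth, and block design are invented for illustration and are not SkyReels' actual architecture.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Toy joint block: each stream self-attends over its own tokens plus the
    shared text tokens, then the streams exchange information via cross-attention
    so audio and video are denoised together rather than sequentially."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                       nn.GELU(), nn.Linear(4 * dim, dim))
        self.audio_mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                       nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video, audio, text):
        # Shared text conditioning: both streams attend to the same text tokens.
        v_ctx = torch.cat([video, text], dim=1)
        a_ctx = torch.cat([audio, text], dim=1)
        video = video + self.video_attn(video, v_ctx, v_ctx)[0]
        audio = audio + self.audio_attn(audio, a_ctx, a_ctx)[0]
        # Cross-modal exchange: this is what keeps sound and image aligned.
        video = video + self.audio_to_video(video, audio, audio)[0]
        audio = audio + self.video_to_audio(audio, video, video)[0]
        return video + self.video_mlp(video), audio + self.audio_mlp(audio)


class DualStreamDenoiser(nn.Module):
    """Predicts noise for video and audio latents in a single forward pass."""
    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([DualStreamBlock(dim) for _ in range(depth)])
        self.video_head = nn.Linear(dim, dim)
        self.audio_head = nn.Linear(dim, dim)

    def forward(self, video_latents, audio_latents, text_tokens):
        v, a = video_latents, audio_latents
        for block in self.blocks:
            v, a = block(v, a, text_tokens)
        return self.video_head(v), self.audio_head(a)


# Made-up shapes: a short clip's worth of latent tokens for each modality.
model = DualStreamDenoiser()
video = torch.randn(1, 120, 256)   # noisy video latent tokens
audio = torch.randn(1, 240, 256)   # noisy audio latent tokens
text = torch.randn(1, 77, 256)     # output of the shared text encoder
video_noise, audio_noise = model(video, audio, text)
print(video_noise.shape, audio_noise.shape)
```

The key point the sketch tries to capture is that neither stream is ever generated alone: every denoising step sees both modalities, which is why sync errors don't accumulate the way they do in a generate-then-dub pipeline.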
The specs that matter
Resolution: up to 1080p. Frame rate: 32 FPS. Duration: up to 15 seconds. Native audio generation in Chinese, English, Japanese, Korean, German, and French. Accepts text, images, video clips, masks, and audio references as input.
On paper, these specs aren't dramatically different from Kling 3.0 or Seedance 2.0. The 32 FPS is actually lower than Kling's 60 FPS, and max resolution is 1080p versus Kling's 4K. Where SkyReels V4 separates itself isn't raw specs — it's what happens when audio and video are generated as a single coherent output rather than stitched together.
The model also unifies generation, inpainting, and editing in one framework. You can generate a video, then use masks to edit specific regions, replace subjects, or modify attributes — all without switching tools or re-rendering from scratch. That's a workflow advantage that pure generation models don't offer.
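How the editing side works internally isn't documented, but mask-based diffusion editing generally follows the same recipe (RePaint-style inpainting): at each denoising step the model regenerates only the masked region, while the rest of the clip is re-noised from the source latents and held fixed. A generic sketch of that idea, not SkyReels' actual code:

```python
import torch

def masked_edit_step(denoise, x_t, x_source, mask, alpha_bar_t):
    """One step of generic mask-guided diffusion editing (RePaint-style).
    Illustrative only: regenerate the masked region, pin the rest to the source."""
    # The model proposes a cleaner latent for the whole clip.
    x_pred = denoise(x_t)
    # Re-noise the source latents to the current noise level so the region we
    # want to keep stays consistent with the original footage at this timestep.
    x_known = alpha_bar_t.sqrt() * x_source + (1 - alpha_bar_t).sqrt() * torch.randn_like(x_source)
    # Inside the mask (1 = editable) take the model's output; elsewhere keep the source.
    return mask * x_pred + (1 - mask) * x_known

# Toy usage with a stand-in denoiser that just shrinks the noise slightly.
x_source = torch.randn(1, 120, 512)   # latents of the clip being edited
x_t = torch.randn_like(x_source)      # current noisy latents
mask = torch.zeros_like(x_source)
mask[:, 40:80, :] = 1.0               # edit only the middle stretch of frames
x_next = masked_edit_step(lambda x: 0.9 * x, x_t, x_source, mask, torch.tensor(0.5))
print(x_next.shape)
```

Whatever the exact mechanism, folding this into the same model that does generation is what lets you re-touch a region without re-rendering the whole clip.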
Chinese speech synthesis is where it shines brightest
Every AI video model with audio supports English reasonably well at this point. The gap shows up in other languages. SkyReels V4's Chinese speech synthesis is noticeably ahead of the competition — tones are accurate, prosody sounds natural, and the lip-sync precision that works so well in English carries over perfectly to Mandarin.
This matters for more than the Chinese market. If you're producing multilingual content — product videos, educational material, social media for global audiences — having one model that handles both English and Chinese without quality degradation eliminates an entire step in your pipeline. Most competitors require you to generate video in one language, then re-do audio in another, often with degraded sync.
Japanese and Korean synthesis is solid too, though a step below the Chinese quality. French and German are functional but have the slightly robotic edge you'd expect from any current-gen model doing European languages.
Where it falls short compared to Kling 3.0 and Seedance 2.0
SkyReels V4 wins the audio-video sync battle convincingly. But video generation isn't just about audio sync, and in other dimensions, the competition still leads.
Motion quality: Kling 3.0 still produces more physically convincing movement. Walk cycles, fabric dynamics, complex multi-person scenes — Kling handles these with a fluidity that SkyReels V4 doesn't quite match. V4's motion is good, but it occasionally has a slight "floaty" quality in fast action sequences.
Creative control: Seedance 2.0's multimodal reference system — feeding in images, video clips, audio files, and text as combined input — gives creators a level of directorial control that SkyReels V4 can't match. V4 accepts multi-modal inputs, but the reference-based compositing isn't as precise.
Resolution and frame rate: Kling 3.0's native 4K at 60 FPS is a clear advantage for anyone producing content for large screens or broadcast. SkyReels V4's 1080p at 32 FPS is fine for social media and web, but it's not broadcast-ready.
Availability: As of mid-March 2026, SkyReels V4 is still in limited preview. You can access it through the official Tiangong platform with daily free generation limits, but there's no public API yet. Atlas Cloud announced integration on March 18, but broad availability is still rolling out.
Pricing: surprisingly affordable when you can get access
The official pricing is $7.20 per minute with audio and $8.40 per minute without audio — which seems backwards until you realize the dual-stream architecture is actually more efficient when generating both modalities together than when suppressing one.
For context, that's roughly competitive with Kling 3.0's Pro tier and significantly cheaper than Sora 2 for equivalent output. The catch is access — the free tier gives you limited daily generations on the Tiangong platform, and the API isn't publicly available yet.
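If you want a quick feel for per-clip cost at those rates, the arithmetic is simple. The snippet below assumes per-second proration, which hasn't been confirmed, and the clip lengths are just examples:

```python
# Per-minute rates from the official pricing above; proration per second is assumed.
RATE_WITH_AUDIO = 7.20      # USD per minute of generated video with audio
RATE_WITHOUT_AUDIO = 8.40   # USD per minute without audio

def clip_cost(seconds: float, with_audio: bool = True) -> float:
    rate = RATE_WITH_AUDIO if with_audio else RATE_WITHOUT_AUDIO
    return rate * seconds / 60.0

print(f"15 s with audio:    ${clip_cost(15):.2f}")          # $1.80
print(f"15 s without audio: ${clip_cost(15, False):.2f}")   # $2.10
print(f"10 x 15 s clips:    ${10 * clip_cost(15):.2f}")     # $18.00
```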
On Sovra, we're tracking SkyReels V4 availability closely. As soon as API access opens up, it will join the model lineup alongside Seedance 2.0, Kling, Sora 2, and the rest. The audio sync quality alone makes it worth having in a multi-model workflow.
The real story: audio-video sync is the next battleground
SkyReels V4's success points to something bigger than one model's leaderboard ranking. For the past year, the AI video competition has been about visual quality — resolution, motion realism, frame rate, consistency. Those problems aren't solved, but they're approaching diminishing returns. The models all look pretty good now.
Audio sync is the new frontier. When you watch a Kling 3.0 output with audio enabled, it looks incredible and sounds... fine. The audio is there, it's roughly synchronized, but there's a subtle uncanny valley. Your brain registers that something is slightly off even if you can't articulate what.
SkyReels V4 closes that gap. The dual-stream approach isn't just an engineering trick — it's a fundamentally different way of thinking about video generation. Sound isn't an afterthought bolted onto images. It's part of the same generative process. And the leaderboard results suggest human evaluators can tell the difference.
Expect every major model to move toward simultaneous audio-video generation in the next 6 months. Seedance 2.0 already uses audio references for sync control. Kling 3.0 has native audio. But the "generate both streams from the same latent representation" approach that SkyReels V4 pioneered is likely where everyone ends up.
Bottom line
SkyReels V4 is the most impressive audio-video model available today, and it came from a company nobody was watching. The dual-stream architecture delivers on its promise — synchronized audio that actually feels like it belongs with the video, not pasted on top.
It's not the best at everything. If you need 4K output, Kling 3.0 wins. If you need precise creative control from reference files, Seedance 2.0 wins. If you need the most physically realistic motion, it's still a toss-up between Kling and Sora 2.
But if your workflow involves dialogue, narration, music videos, or any content where sound-image sync is critical — SkyReels V4 is the model to watch. And with API access expected to broaden through March and April 2026, it's about to become a lot more accessible.