Kling O1 (Omni): Is the "Fragmented" Video Workflow Ending?
Video production has traditionally been a fragmented process: generate with one model, upscale with another, and try to edit with a third tool (or Photoshop). Kuaishou's Kling O1 (Kling-Omni) made one of the most interesting debuts of 2025 with the claim of consolidating this process onto a single "unified multimodal foundation."
This article examines Kling O1’s technical architecture and how it aims to solve the "edit loop" problem.
1. Technical Innovation: Unified Architecture
Kling O1’s biggest claim is that video generation, editing, and understanding are not separate modules, but capabilities of the same neural network.
Why Is This Important?
In traditional models, if you wanted to change a character's outfit in a generated video, the model tended to regenerate the entire video (breaking scene consistency). Since Kling O1 "understands" the video, it can execute the command "Make the jacket red" without disturbing the lighting of the scene or the character's movement.
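To make the contrast concrete, here is a minimal conceptual sketch in Python. Every class and method name is a hypothetical placeholder, not a Kling O1 SDK call; the point is only the shape of the two workflows.

```python
# A minimal conceptual sketch, not real SDK code: every class and method name
# here is a hypothetical placeholder used only to contrast the two workflows.

class FragmentedPipeline:
    """Traditional stack: separate models for generation and editing."""
    def generate(self, prompt):
        return {"scene": prompt}
    def edit(self, clip, instruction):
        # A standalone editor has no memory of the original scene, so in
        # practice the whole clip gets regenerated, breaking consistency.
        return self.generate(f"{clip['scene']} ({instruction})")

class UnifiedOmniModel:
    """Kling O1-style idea: one model generates, understands, and edits."""
    def generate(self, prompt):
        return {"scene": prompt, "lighting": "original", "motion": "original"}
    def edit(self, clip, instruction):
        # The same model "understands" the clip, so it can apply a local
        # change while leaving lighting and motion untouched.
        return {**clip, "local_edit": instruction}

unified = UnifiedOmniModel()
clip = unified.generate("a person walking in a grey jacket")
print(unified.edit(clip, "Make the jacket red"))
```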
2. Benchmarks and Performance: VBench-2.0 Realities
Comparing video models is difficult because "quality" is subjective. However, frameworks like VBench-2.0, which became standard in 2025, offer data-driven insights.
Kling O1 Strengths (VBench Data):
Temporal Consistency: High performance at preventing objects from shape-shifting between frames (the flickering issue); a toy sketch of what this metric captures follows the note below.
Motion Smoothness: Adherence to physics in complex action scenes (running, fighting, etc.).
Instruction Following: Higher reported compliance with "edit" commands than competitors (Runway, Pika).
Note: The Kling-Omni technical report (arXiv) claims the model is SOTA specifically in "reasoning-based editing."
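To give a feel for what "temporal consistency" measures, here is a toy proxy in Python. It is not the actual VBench-2.0 metric, which relies on far more sophisticated detectors, but the intuition is the same: large frame-to-frame changes on a static object read as flicker.

```python
# Toy proxy for temporal consistency, not the actual VBench-2.0 metric:
# it simply measures how much consecutive frames differ on average.
import numpy as np

def flicker_score(frames: np.ndarray) -> float:
    """frames: (T, H, W, C) pixel values in [0, 1]. Lower = steadier video."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return float(diffs.mean())

# Synthetic example: a perfectly static clip vs. one with per-frame noise.
rng = np.random.default_rng(0)
static = np.full((16, 64, 64, 3), 0.5, dtype=np.float32)
noisy = static + rng.normal(0.0, 0.1, static.shape).astype(np.float32)
print(flicker_score(static), flicker_score(noisy))  # 0.0 vs. roughly 0.11
```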
3. Real-World Use: Agencies and Creators
Kling O1 could be a lifesaver for sectors with intense revision cycles.
Scenario: An ad agency has produced a 10-second beverage commercial. The client says, "Great, but make the sky in the background more like a sunset and make the bottle sweat a bit more."
Old Method: Change the prompt, regenerate the video (the bottle's position will likely change), try again.
Kling O1 Method: Input the existing video, enter the command "Make the background sunset." The model changes only the atmosphere while preserving the scene structure.
Risks and Limitations
Access: As a China-based model, Kling O1 may raise global availability and data-privacy (GDPR) questions for corporate users.
Hardware Requirements: The unified multimodal architecture may demand significant GPU power at inference time, which can drive up API costs.
Example Prompt (Video Edit)
Input: [Existing Video File]
Command: Transform the ground the character is walking on from concrete to grass. Keep the walking speed and camera angle exactly the same. Shift the lighting to a warmer, late-afternoon atmosphere.
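For readers who want to picture how such an edit might be submitted programmatically, here is a hedged sketch. The endpoint URL, field names, and control flags are hypothetical; Kling's actual API may be structured quite differently.

```python
# Hypothetical request shape only: Kling's real API endpoint, field names, and
# auth scheme may differ. The sketch just shows a video-in, instruction-in edit.
import requests

payload = {
    "video": "beverage_commercial_v1.mp4",  # reference to the existing clip
    "instruction": (
        "Transform the ground the character is walking on from concrete to "
        "grass. Keep the walking speed and camera angle exactly the same. "
        "Shift the lighting to a warmer, late-afternoon atmosphere."
    ),
    "preserve": ["camera_motion", "subject_motion"],  # assumed control flags
}

response = requests.post(
    "https://api.example.com/v1/video/edit",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
    timeout=120,
)
print(response.status_code)
```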
Article 5: Sora 2 & Veo 3.1
Focus: Native Audio, Physics, and the LMArena Leadership Race
Meta Title: Sora 2 vs. Veo 3.1 Comparison: Native Audio, Physics, and Arena Scores (2025)
Meta Description: Who is the video giant of 2025? We compare OpenAI Sora 2 and Google Veo 3.1 using LMArena Elo scores, Native Audio capabilities, and VABench tests.
The Power of Sound in Video: Sora 2 and Veo 3.1 Face Off
2024 was the era of "silent movies" for AI video. Autumn 2025 is setting the stage for the "talkies" revolution. OpenAI’s Sora 2 and Google’s Veo 3.1 are not just generating pixels; they are creating synchronized audio (native audio), dialogue, and effects in a single pass.
In this article, we lay out the Text-to-Video Arena scores and the technical differences between these two giants.
1. The Biggest Innovation: Native Audio
Both models no longer add sound as an afterthought; they generate waveforms alongside pixels.
Sora 2: OpenAI focuses on physical realism in "synced audio." It can simulate how the sound of a breaking glass changes depending on the surface it hits (wood vs. concrete).
Veo 3.1: Using the vast audio library in the Gemini ecosystem, Google is very strong in "Audio Prompt Fidelity." It understands abstract descriptions like "Cinematic, bass-heavy, tension music" exceptionally well.
2. The Benchmark Arena: Who is the Leader?
LMArena (Text-to-Video Arena), where users watch two anonymized videos and vote for the better one, is one of the most reliable references for video models.
December 2025 LMArena Table (Summary):
| Rank | Model                | Elo Score | Strength                          |
|------|----------------------|-----------|-----------------------------------|
| 1    | Veo 3.1 (Fast Audio) | ~1383     | Speed and audio-visual sync       |
| 2    | Veo 3.1 (Audio)      | ~1374     | High-quality audio detail         |
| 4    | Sora 2 Pro           | ~1356     | Physical realism and simulation   |
| 6    | Sora 2               | ~1325     | General use                       |
Analysis: Google Veo 3.1 appears to be a step ahead in users' eyes, specifically in the "video with audio" category. However, Sora 2 remains a formidable competitor in complex physics simulations (e.g., water flow, fabric movement).
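The Elo gaps in the table are modest, and the standard Elo expectation formula shows what they imply in head-to-head votes. The snippet below uses the approximate December 2025 scores quoted above.

```python
# What a ~27-point Elo gap means under the standard Elo expectation formula,
# using the approximate December 2025 scores from the table above.
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Probability that model A is preferred over model B in a head-to-head vote."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

veo_fast, sora_pro = 1383, 1356
rate = expected_win_rate(veo_fast, sora_pro)
print(f"Veo 3.1 (Fast Audio) vs. Sora 2 Pro: {rate:.1%}")  # ~53.9%: a real but modest edge
```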
3. What Do VABench and VBench-2.0 Say?
The new generation benchmark VABench (Video-Audio Benchmark) measures how well the sound matches the image (e.g., lip sync or the timing of an explosion).
Veo 3.1: Leader in Semantic Audio Matching (finding the right sound for the object in the image).
Sora 2: More successful in Spatial Audio (sound coming from the correct position in the video).
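As a toy illustration of the kind of alignment such tests probe (not VABench's actual method), the sketch below compares the moment the biggest visual change happens with the moment the audio peaks; a well-synced clip keeps that offset near zero.

```python
# A toy sync check, not VABench's actual method: compare the time of the
# biggest visual change with the time of the loudest audio sample.
import numpy as np

def av_offset_seconds(frame_brightness, audio_envelope, fps, sample_rate):
    """(audio peak time - visual event time); values near zero mean good sync."""
    visual_t = (np.argmax(np.abs(np.diff(frame_brightness))) + 1) / fps
    audio_t = np.argmax(audio_envelope) / sample_rate
    return audio_t - visual_t

# Synthetic clip: a flash appears at frame 24 (1.00 s at 24 fps) and the
# bang peaks 30 ms later in the audio track.
fps, sr = 24, 16_000
brightness = np.zeros(72)
brightness[24:] = 1.0
audio = np.zeros(3 * sr)
audio[int(1.03 * sr)] = 1.0
print(f"offset: {av_offset_seconds(brightness, audio, fps, sr):.3f} s")  # ~0.030
```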
4. Recommendation
Choose Google Veo 3.1: If you are producing YouTube content, social media videos, or fast-turnaround marketing materials. Its speed and audio quality are tuned to grab the viewer's attention.
Choose OpenAI Sora 2: If you are creating a movie scene, in-game cinematics, or physical simulation visualization. "Controllability" and physics engine accuracy are critical here.
Example Prompt (Veo 3.1 Audio)
Visual Prompt: A futuristic motorcycle speeding under neon lights in a rainy cyberpunk city. Camera tracking from behind.
Audio Prompt: High-RPM electric motor sound of the motorcycle, sound of rain hitting the helmet, and muffled city sirens in the distance. The sounds should intensify in sync with the motorcycle's acceleration.
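If you want to picture how such a prompt could travel through an API, here is a hedged sketch of a request payload. The model id, field names, and parameters are placeholders, not the actual Veo 3.1 / Gemini API schema; the point is only that visual and audio instructions are explicit, separate fields in one request.

```python
# Hypothetical payload shape only: Veo 3.1's real API (via Gemini / Vertex AI)
# uses its own parameter names. This just illustrates visual and audio
# instructions travelling as separate, explicit fields in one request.
import json

request = {
    "model": "veo-3.1",  # placeholder model id
    "prompt": (
        "A futuristic motorcycle speeding under neon lights in a rainy "
        "cyberpunk city. Camera tracking from behind."
    ),
    "audio_prompt": (
        "High-RPM electric motor sound, rain hitting the helmet, muffled city "
        "sirens in the distance; sounds intensify as the motorcycle accelerates."
    ),
    "duration_seconds": 8,
    "resolution": "1080p",
}
print(json.dumps(request, indent=2))
```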
Kuaishou's Kling O1 model combines video generation, editing, and understanding in a single multimodal structure. We analyze its VBench scores and the claims of its technical report.
- Yunus Yigit, 2025