Diverging Corporate Strategies in the Multimodal "DeepSeek Era": ByteDance Prioritizes "Efficiency", Kuaishou Targets "Professionalism", and Alibaba Focuses on "E-Commerce"

The recent wave of multimodal updates has come in rapid succession. On January 31st, KUAISHOU-W upgraded Kling to version 3.0. On February 7th, ByteDance released Seedance 2.0. On February 10th, ByteDance's Seedream 5.0 and Alibaba's Qwen-Image-2.0 further enhanced the "text-to-image/image editing" foundation. A report from Hua Chuang Securities Research Institute on the 12th offered a direct assessment: video generation is no longer just a technical showcase but is evolving into a tool that can be integrated into workflows. The report attributes the core obstacle to commercialization to uncontrollable marginal costs caused by a "gacha," or lottery-like, generation process: the same requirement has to be generated and revised repeatedly, and high discard rates consume time and budget.

The focus of the upgrades for Kling 3.0 and Seedance 2.0 is not merely on improving visual quality, but on elevating controllability to a higher priority. Enhancements in cross-shot subject consistency, adherence to complex semantic instructions, and post-generation editing capabilities collectively aim to reduce the discard rate. The report concludes that these technological leaps provide a foundation for AI video to enter large-scale B2B workflows, with sectors like e-commerce advertising and short-form drama production likely feeling the impact sooner.

The report breaks down the implications into two layers. The first involves a divergence in product strategies: ByteDance appears to be building an "efficiency infrastructure," while KUAISHOU-W leans towards a "professional narrative." The second layer is a supply-side revolution that recalibrates cost structures, where the marginal cost of content production increasingly converges with compute costs. Regarding investment opportunities, the report highlights potential beneficiaries in content IP, content copyright, AI video tools/models, and the demand for inference capabilities from cloud and platform providers.

The key issue being addressed is the uncontrollable cost associated with the "gacha" mechanic. The report repeatedly emphasizes a logical chain: the past difficulty in commercializing AI video was not an inability to "produce output," but rather the "instability of the output." The same script, assets, and prompts produced large swings in final video quality, forcing creators to run more generation cycles and gamble on a good result, which led to runaway marginal costs. The significance of the new generation of models lies in prioritizing "controllability" over raw "generation capability." Through native multimodal architectures, instruction alignment, and enhanced subject consistency and semantic adherence, these models aim to lower the discard rate and thereby reduce overall video production costs. The threshold for commercialization is thus being redefined, shifting from "is it possible to do?" to "can it be delivered reliably?"
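The cost logic behind the "gacha" problem can be illustrated with a short calculation. The sketch below is not taken from the report: it assumes each generation attempt is independent and is discarded with a fixed probability, so the expected number of attempts per usable clip is 1 / (1 - discard rate). The function name and the figures used are hypothetical and purely illustrative.

```python
def expected_cost_per_usable_clip(cost_per_generation: float, discard_rate: float) -> float:
    """Expected spend to obtain one usable clip.

    Assumption (not from the report): attempts are independent and each is
    discarded with probability `discard_rate`, so the number of attempts
    until the first keeper follows a geometric distribution with mean
    1 / (1 - discard_rate).
    """
    if not 0.0 <= discard_rate < 1.0:
        raise ValueError("discard_rate must be in [0, 1)")
    return cost_per_generation / (1.0 - discard_rate)


# Hypothetical figures: one unit of compute cost per generation attempt.
before = expected_cost_per_usable_clip(1.0, 0.75)  # 3 of 4 attempts discarded
after = expected_cost_per_usable_clip(1.0, 0.5)    # 1 of 2 attempts discarded
print(before, after)
```

Under these assumptions, cutting the discard rate from 75% to 50% halves the expected cost per usable clip even though the price of a single generation is unchanged, which is why controllability rather than raw generation quality drives the marginal cost.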

Kling 3.0 bets on a "blockbuster feel," prioritizing physical realism and long-form logical narratives. The report identifies Kling 3.0's key advancements as a systematic upgrade of core capabilities and the integration of generation and editing (Omni). For video, Kling 3.0's improvements include stronger subject consistency in multi-shot/continuous action scenes; finer parsing of complex text instructions; reduced referential confusion in multi-person scenes; and an emphasis on precise text-to-visual character mapping (including multilingual support, dialect accents, and natural lip-syncing and expressions). The Omni mode is another highlighted change, allowing for localized, controllable modifications to already generated content, reducing the need for complete regeneration. The report also mentions two capabilities leaning towards professional creation: the ability to create video subjects (extracting character features and original voice timbre for precise lip-sync and driving); and native custom storyboarding, with single-generation duration increased to 15 seconds, allowing specification of shot duration, framing, perspective, narrative content, and camera movement at the shot level.

For images, Kling Image 3.0 is positioned as part of "workflow completion." It supports up to 10 reference images to lock subject outlines, core elements, and color tones; allows free specification, addition, deletion, and modification of elements across multiple reference images; supports batch image set output for storyboard/material package creation; and enhances high-definition output and detail rendering.

Seedance 2.0 is positioned as an "orchestratable" industrial tool. The report describes its foundation as emphasizing reasonable physics, natural motion, precise instruction understanding, and stable style maintenance. It highlights three key capabilities: consistency optimization (from faces to clothing, font details, scene transitions, etc.); controllable replication of complex camera work and actions; and precise replication of creative templates/complex special effects.

More crucial is the interaction paradigm. The report suggests that Seedance 2.0's use of "@assetname" to reference images, videos, and audio essentially breaks black-box generation down into a controllable production pipeline. The model can separately extract camera movements from @video, details from @image, and rhythm from @audio, thereby significantly lowering the "discard rate." The usage limits are also more closely aligned with "production constraints": up to 9 input images; up to 3 input videos with a total duration of no more than 15 seconds; up to 3 MP3 audio uploads with a total duration of no more than 15 seconds; a mixed input limit of 12 files in total; a generation duration of up to 15 seconds (selectable from 4 to 15 seconds); and output that includes sound effects and music. In terms of entry points, "First/Last Frame" and "All-round Reference" correspond to different ways of organizing assets.
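The input limits listed above can be expressed as a simple pre-flight check. The following is a minimal sketch that encodes only the numbers quoted in the report; the `Asset` type, the `check_request` function, and the error messages are hypothetical and do not represent Seedance's actual API or how the product enforces these limits.

```python
from dataclasses import dataclass
from typing import List

# Limits as quoted in the report.
MAX_IMAGES, MAX_VIDEOS, MAX_AUDIOS = 9, 3, 3
MAX_VIDEO_TOTAL_S = 15.0
MAX_AUDIO_TOTAL_S = 15.0
MAX_FILES = 12
MIN_OUTPUT_S, MAX_OUTPUT_S = 4.0, 15.0


@dataclass
class Asset:
    """Hypothetical description of one uploaded asset."""
    name: str
    kind: str                 # "image", "video", or "audio"
    duration_s: float = 0.0   # ignored for images


def check_request(assets: List[Asset], output_s: float) -> List[str]:
    """Return a list of limit violations; an empty list means the request fits."""
    problems = []
    images = [a for a in assets if a.kind == "image"]
    videos = [a for a in assets if a.kind == "video"]
    audios = [a for a in assets if a.kind == "audio"]

    if len(images) + len(videos) + len(audios) != len(assets):
        problems.append("unknown asset kind")
    if len(images) > MAX_IMAGES:
        problems.append("more than 9 images")
    if len(videos) > MAX_VIDEOS:
        problems.append("more than 3 videos")
    if sum(v.duration_s for v in videos) > MAX_VIDEO_TOTAL_S:
        problems.append("video total exceeds 15 seconds")
    if len(audios) > MAX_AUDIOS:
        problems.append("more than 3 audio files")
    if sum(a.duration_s for a in audios) > MAX_AUDIO_TOTAL_S:
        problems.append("audio total exceeds 15 seconds")
    if len(assets) > MAX_FILES:
        problems.append("more than 12 files in total")
    if not MIN_OUTPUT_S <= output_s <= MAX_OUTPUT_S:
        problems.append("output duration outside 4-15 seconds")
    return problems


request = [
    Asset("hero", "image"),
    Asset("dolly_shot", "video", 8.0),
    Asset("beat", "audio", 10.0),
]
print(check_request(request, output_s=10.0))
```

Note that the per-type caps add up to 15 files (9 + 3 + 3), so the 12-file mixed limit is a separate constraint that can be hit even when every per-type cap is respected.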

The report's view on the competitive landscape focuses less on "benchmark rankings" and more on strategic differentiation among vendors. ByteDance's approach is characterized as focusing on low-barrier, low-cost toolification and generalization capabilities, akin to an advanced version of "Jianying," aiming to reduce content production costs across the internet and benefit its ecosystem. KUAISHOU-W's Kling bets on physical simulation, realism in complex scenes, and character consistency, making it more suitable for professional content requiring high coherence, such as film demos or cinematic narratives. Alibaba's Qwen update for image models, with its high-fidelity improvements, leans more towards vertical scenarios (e-commerce), strengthening capabilities related to product digitization. These three paths point towards different business models: one pursues large-scale throughput, another aims for high-quality narrative delivery, and the third focuses on "production-ready" solutions for vertical industries.

In its commercialization outlook, the report presents a radical view of a "supply-side revolution." With the dual enhancement of image and video foundational capabilities, the marginal cost of content production is expected to increasingly converge with compute costs. In the short term, the report is more optimistic about two changes: improved efficiency in material production for marketing/e-commerce service providers, leading to better gross margins; and a potential explosion in production capacity for comic-drama and short-drama industries. In the medium to long term, the focus shifts to the IP side—as content becomes easier to produce, pricing based on scarcity will concentrate more heavily on IP: the value of top-tier IP and derivatives will increase, and mid-tier IP may also be revalued through AI video adaptation. Simultaneously, giants with strong computing infrastructure (cloud) and closed-loop traffic scenarios (platforms) are positioned to more directly benefit from the frequent inference calls this revolution will entail.

Disclaimer: Investing carries risk. This is not financial advice. The above content should not be regarded as an offer, recommendation, or solicitation on acquiring or disposing of any financial products, any associated discussions, comments, or posts by author or other users should not be considered as such either. It is solely for general information purpose only, which does not consider your own investment objectives, financial situations or needs. TTM assumes no responsibility or warranty for the accuracy and completeness of the information, investors should do their own research and may seek professional advice before investing.

Tiger Brokers
