Meituan's Wang Xing Launches 560B-Parameter "Omni" LongCat Model, Debuts AI Assistant App

Deep News
2025/11/03

Meituan has officially open-sourced its multimodal model LongCat-Flash-Omni, with 560 billion total parameters and 27 billion activated parameters. Dubbed the "first open-source large language model integrating full-modal coverage, end-to-end architecture, and efficient large-parameter inference," LongCat-Flash-Omni ("Omni" meaning "all-capable") achieves state-of-the-art performance on multimodal benchmarks while remaining strong on key unimodal tasks: text, image and video understanding, and speech perception and generation.

Built upon the MoE-architecture LongCat-Flash framework, the new model integrates efficient multimodal perception and speech reconstruction modules, supporting 128K-token context windows and more than 8 minutes of audio-video interaction. For pretraining, researchers compiled a diverse 2.5-trillion-token multimodal corpus and trained with a progressive curriculum that moves from simple to complex sequence modeling. This marks Meituan's third model release since September 1, following LongCat-Flash-Chat and LongCat-Flash-Thinking.
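The gap between total and activated parameters comes from MoE routing: each token is dispatched to only a few experts per layer, so only a small slice of the weights is used per forward pass. A minimal sketch of top-k expert routing (illustrative only; the expert counts and function names are assumptions, not LongCat's actual implementation):

```python
import math
import random

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts for one token and softmax their logits
    into mixing weights (illustrative top-k MoE routing)."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical layer: 64 experts, 2 active per token (not LongCat's real numbers).
NUM_EXPERTS, ACTIVE = 64, 2
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = top_k_route(logits, k=ACTIVE)

# At the model level the same idea yields 27B activated out of 560B total:
activated_ratio = 27 / 560  # roughly 4.8% of weights touched per token
```

Because only the chosen experts' weights are multiplied per token, inference cost tracks the activated-parameter count rather than the full 560B.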

Concurrently, Meituan launched public testing for its LongCat app, currently featuring web search and voice call capabilities (with video calls coming soon). Users can try audio interaction via the web or app interfaces, though early testing surfaced image-upload glitches on Android that required reinstalling the app.

Benchmark evaluations show LongCat-Flash-Omni rivals closed-source models such as Gemini-2.5-Pro and GPT-4o while outperforming open-source alternatives such as Qwen3-Omni. Key results:

- Image-to-text: comparable to Gemini-2.5-Flash, surpassing Qwen3-Omni on multi-image tasks
- Video-to-text: state-of-the-art in short-video understanding, competitive on long-video tasks
- Audio: top performance in ASR, TTS, and speech-continuation tasks
- Cross-modal interaction: ranks third in naturalness and fluency, behind closed-source leaders

The model addresses four core challenges in multimodal training:

1. Cross-modal heterogeneity, via unified representation strategies
2. Unified offline/streaming capabilities, via human-AI collaborative data construction
3. Real-time audio-video interaction, via the ScMoE architecture and efficient encoders
4. Training efficiency, via a Modal-Decoupled Parallelism (MDP) strategy that sustains more than 90% of text-only training throughput

Training leveraged a progressive curriculum spanning Stages 0 through 5:

- Stage 0: Large-scale text pretraining
- Stage 1: Speech-text alignment
- Stage 2: Visual-language integration
- Stage 3: Spatiotemporal video reasoning
- Stage 4: Context expansion to 128K tokens
- Stage 5: Continuous audio feature processing
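The staged curriculum above can be sketched as a simple schedule driver. The per-stage modality mixes and early-stage context lengths below are assumptions for illustration; only the 128K (131,072-token) expansion at Stage 4 comes from the article:

```python
# Hedged sketch of the staged curriculum; modality mixes and pre-Stage-4
# context lengths are hypothetical, not LongCat's published configuration.
STAGES = [
    ("stage0_text_pretraining",  ["text"],                            8_192),
    ("stage1_speech_text_align", ["text", "audio"],                   8_192),
    ("stage2_vision_language",   ["text", "audio", "image"],          8_192),
    ("stage3_video_reasoning",   ["text", "audio", "image", "video"], 8_192),
    ("stage4_context_expansion", ["text", "audio", "image", "video"], 131_072),  # 128K tokens
    ("stage5_audio_features",    ["text", "audio", "image", "video"], 131_072),
]

def run_curriculum(train_step):
    """Run each stage in order; train_step is a placeholder for a real trainer."""
    for name, modalities, context_len in STAGES:
        train_step(name, modalities=modalities, context_len=context_len)

# Record what a trainer would see at each stage.
log = []
run_curriculum(lambda name, **cfg: log.append((name, cfg["context_len"])))
```

The design point is that each stage only adds capability (a new modality or a longer context) on top of the previous one, so the context length never shrinks across stages.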

Future work will focus on greater training-data diversity, adaptive reasoning modes, and embodied forms of AI interaction. The release aims to accelerate multimodal research toward human-centric AGI systems.

