Alibaba's Qwen3.7-Plus Model Surpasses GPT-5.4 in Screen Understanding, Develops Apps in 11 Hours, Unifying Vision and Action

A day after MiniMax's M3 model made a splash, Alibaba's Qwen team has released a formidable new contender. On June 2, Alibaba Cloud's Tongyi Qianwen (Qwen) team officially announced Qwen3.7-Plus on X. It is a multimodal agent model, described by its creators as "unifying vision and language into an integrated intelligent agent foundation."

The team summarized its product positioning in one sentence: "One model that can see, think, write code, and act."

Building an app from scratch or replicating a stock-market application is well within Qwen3.7-Plus's capabilities. The official Qwen blog disclosed that a Hybrid-Agent system built on Qwen3.7-Plus ran continuously and stably for over 11 hours, autonomously completing the full development cycle of an English vocabulary learning app. The same system also independently completed a high-fidelity replication of the native macOS Stocks application. In addition, the model scored 79.0 on the ScreenSpot Pro screen-understanding benchmark, surpassing both GPT-5.4 (67.4) and Gemini 3.1 Pro (68.1).

Unifying Vision and Action

The core innovation of Qwen3.7-Plus is its true integration of visual understanding with task execution. The official blog describes the model as capable of "perceiving real-world scenes, reading screens and operating GUIs, generating code based on visual references, navigating mobile applications end-to-end," and seamlessly integrating GUI and CLI interactions within a single agent loop. In essence, it doesn't just "understand pictures"; it can comprehend your phone screen or computer interface and then click, type, and navigate to complete tasks autonomously.
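To make the idea of a single agent loop concrete, here is a minimal sketch in which GUI actions and CLI actions share one history. Everything in it is hypothetical and for illustration only: the `Action` type, the action names, the `scripted_policy` stand-in for the model, and the fake handlers are assumptions, not Qwen's actual API or architecture.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    kind: str      # "click", "type", "shell", or "done" (hypothetical action names)
    payload: str   # target element, text to type, or command line


def run_agent(policy: Callable[[List[str]], Action],
              handlers: Dict[str, Callable[[str], str]],
              max_steps: int = 20) -> List[str]:
    """Run one loop in which GUI and CLI actions share the same history."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(history)          # the model would pick the next action here
        if action.kind == "done":
            break
        observation = handlers[action.kind](action.payload)
        history.append(f"{action.kind}({action.payload}) -> {observation}")
    return history


def scripted_policy(history: List[str]) -> Action:
    """Scripted stand-in for the model: replays a fixed plan step by step."""
    plan = [
        Action("click", "Search box"),
        Action("type", "AAPL"),
        Action("shell", "echo build ok"),
        Action("done", ""),
    ]
    return plan[len(history)]


# Fake handlers; a real system would drive a screen or a terminal instead.
fake_handlers = {
    "click": lambda target: f"clicked {target}",
    "type": lambda text: f"typed {text}",
    "shell": lambda cmd: f"ran {cmd}",
}

trace = run_agent(scripted_policy, fake_handlers)
for line in trace:
    print(line)
```

The point of the sketch is only that clicks, keystrokes, and shell commands pass through the same dispatch step and feed the same history, so the next decision can draw on both kinds of observation.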

Extended Development Capabilities

The official blog backs these claims with several product-oriented demonstrations. The key takeaway from the 11-hour app development case is not the volume of code written (over 10,000 lines) but the fact that the system sustained a long, complex workflow. A real software task involves installing, running, testing, debugging, and re-validating, not just one-shot code generation.

In another demonstration, the system autonomously replicated the macOS Stocks app. This process involved interacting with the native app to understand its UI layout and functional details, generating SwiftUI source code based on the interaction log, integrating with the LongBridge real-time market data API, and automatically compiling, building, and launching the replicated application. The model passed all 10 functional verification tests.

The model can also convert images, videos, UI screenshots, and design references into executable code, from SVG reproduction to full webpage generation. It can act as a visual agent to solve puzzles like finding differences, completing image blocks, or navigating mazes by first understanding the visual problem and then autonomously writing and executing code to find a solution.
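As an illustration of the kind of program a visual agent might write once it has turned a maze image into a grid, the sketch below solves a small maze with breadth-first search. The grid encoding (`#` for walls), the function name `solve_maze`, and the sample maze are assumptions made for this example. The article does not say what code the model actually generates.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def solve_maze(grid: List[str], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Return a shortest path from start to goal using breadth-first search.

    '#' marks a wall; any other character is open floor.
    Returns None when the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    parent: Dict[Cell, Optional[Cell]] = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            current: Optional[Cell] = cell
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


maze = [
    "S.#",
    ".##",
    "..G",
]
path = solve_maze(maze, (0, 0), (2, 2))
print(path)  # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
```

Breadth-first search is used here because it guarantees a shortest path on an unweighted grid. The hard part in the scenario described, extracting the grid from an image, is left out of the sketch.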

Performance Benchmarks

Qwen3.7-Plus posts strong results on multimodal benchmarks. On ScreenSpot Pro, which covers screen understanding and mobile control, it scored 79.0, ahead of GPT-5.4 (67.4) and Gemini 3.1 Pro (68.1). On MathVision, a visual math reasoning benchmark, it scored 90.3, just behind GPT-5.4's 91.0. On the CharXiv (RQ) chart recognition benchmark it scored 85.9, the highest among the compared models.

In pure text capabilities, the official statement indicates Qwen3.7-Plus is "overall close to Max-level models." It scored 70.3 on Terminal Bench 2.0 and 62.3 on Deep-Planning (complex multi-step planning), leading its peers. However, it has weaker areas, scoring 77.7 on SWE-Verified (real software engineering tasks) and 34.7 on HLE (extremely hard reasoning), which are lower than some competitors.

Market Context and Comparison

The timing of this release is notable: it came just a day after MiniMax launched its new flagship open-source model, M3, intensifying competition among China's large-model developers. While both models compete on programming-agent capabilities, their focuses differ. MiniMax M3 emphasizes open-source availability, ultra-long context (1M tokens), and autonomous research and code optimization. Qwen3.7-Plus, currently offered only via API, focuses on the deep integration of multimodal and GUI operation capabilities, with plug-and-play compatibility for mainstream development frameworks.

Pricing for Qwen3.7-Plus is set at $0.40 per million input tokens and $1.60 per million output tokens.
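For a rough sense of what these rates mean in practice, the following sketch estimates the cost of a workload. The two prices come from the article. The token counts are a made-up example, and the calculation assumes flat per-token billing with no tiers or discounts, which the article does not specify.

```python
INPUT_USD_PER_MILLION = 0.40   # input price quoted in the article
OUTPUT_USD_PER_MILLION = 1.60  # output price quoted in the article


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate cost in USD, assuming flat per-token pricing (an assumption)."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
cost = estimate_cost_usd(2_000_000, 500_000)
print(f"${cost:.2f}")  # prints $1.60
```

In this example the input and output sides each contribute $0.80. Output tokens cost four times as much as input tokens, so a quarter as many of them add the same amount to the bill.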

