Qwen Unveils Qwen3.7-Max, Aiming to Be the Ultimate Agent Foundation

On May 20, Qwen officially launched Qwen3.7-Max, a new-generation flagship model designed for the era of intelligent agents, which will soon be available via API. Qwen3.7-Max aims to be a versatile foundation for agents, handling tasks that range from writing and debugging code, through automating office workflows, to autonomously executing long-cycle tasks spanning hundreds or even thousands of steps.

The core strengths of Qwen3.7-Max lie in the breadth and depth of its agent capabilities. In programming, it can manage everything from front-end prototype development to complex multi-file engineering projects. For office and productivity, it automates workflows through MCP integration and multi-agent collaboration. In long-cycle autonomous execution, it maintained coherent reasoning in a fully autonomous kernel optimization experiment lasting 35 hours with over 1,000 tool calls, thoroughly validating its persistent and stable execution capability. Furthermore, whether deployed in frameworks like Claude Code, OpenClaw, Qwen Code, or others, it consistently demonstrates excellent cross-framework generalization.

Qwen3.7-Max will soon be available through API calls on Alibaba Cloud's Bailian platform, offering:

* Advanced programming agent capabilities, from prototypes to complex software engineering.
* Office productivity and workflow automation, supporting MCP integration and multi-agent collaboration.
* Sustained, stable long-cycle autonomous execution.
* Generalization capability across multiple agent frameworks.

On programming-agent benchmarks, Qwen3.7-Max achieved leading results on SWE-Pro (60.6), SWE-Multilingual (78.3), SciCode (53.5), and QwenSVG (1608). On Terminal Bench 2.0-Terminus it scored 69.7, ahead of DS-V4-Pro Max (67.9). On SWE-Verified it scored 80.4, comparable to Opus-4.6 Max (80.8) and DS-V4-Pro Max (80.6).

Improvements in general agent capabilities are even more significant. Qwen3.7-Max performed exceptionally well on MCP-Mark (60.8 vs. GLM-5.1's 57.5), MCP-Atlas (76.4 vs. Opus-4.6's 75.8), and Skillbench (59.2 vs. K2.6's 56.2). It also demonstrated powerful GPU kernel optimization on Kernel Bench L3 (1.98x median speedup, 96% acceleration rate). It performed strongly on BFCL-V4 (75.0), Qwenclaw (64.3), and ClawEval (65.2), closely following Opus-4.6 Max. On the office automation benchmark SpreadSheetBench-v1, it scored 87.0, placing it at a top-tier level.
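As a rough illustration of how figures like the Kernel Bench L3 numbers above could be derived, the sketch below computes a median speedup and an "acceleration rate" from per-kernel timings. The article does not define either metric, so this assumes speedup means baseline time divided by optimized time, and acceleration rate means the share of kernels with a speedup greater than 1. The function names and timings are hypothetical and are not benchmark data.

```python
from statistics import median


def speedups(baseline_ms, optimized_ms):
    """Per-kernel speedup, assumed here to be baseline time / optimized time."""
    return [b / o for b, o in zip(baseline_ms, optimized_ms)]


def summarize(baseline_ms, optimized_ms):
    """Return (median speedup, fraction of kernels that got faster).

    Assumption: "acceleration rate" counts kernels whose speedup exceeds 1.0.
    """
    s = speedups(baseline_ms, optimized_ms)
    accelerated = sum(1 for x in s if x > 1.0)
    return median(s), accelerated / len(s)


# Hypothetical timings in milliseconds, for illustration only.
baseline = [10.0, 8.0, 6.0, 4.0]
optimized = [5.0, 4.0, 6.0, 1.0]

median_speedup, acceleration_rate = summarize(baseline, optimized)
print(f"median speedup: {median_speedup:.2f}x")
print(f"acceleration rate: {acceleration_rate:.0%}")
```

Under these assumptions, a 1.98x median speedup would mean that half of the evaluated kernels ran at least 1.98 times faster than their baseline.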

In reasoning, Qwen3.7-Max achieved leading scores on GPQA Diamond (92.4 vs. Opus-4.6's 91.3), HLE (41.4 vs. Opus-4.6's 40.0), HMMT 2026 Feb (97.1 vs. Opus-4.6's 96.2), IMOAnswerBench (90.0 vs. DS-V4-Pro's 89.8), and Apex (44.5 vs. DS-V4-Pro's 38.3), showcasing robust capabilities on high-difficulty reasoning benchmarks.

In general capabilities and multilingual performance, Qwen3.7-Max excelled on IFBench (79.1 vs. DS-V4-Pro's 77.0), demonstrating precise instruction-following ability. It also led on WMT24++ (85.8) and MAXIFE (89.2), indicating its multilingual understanding and translation quality are first-rate. It performed well on SuperGPQA (73.6) and QwenWorldBench (57.3).

Note that the evaluation scores above were obtained across a variety of agent frameworks. Qwen3.7-Max is not optimized for any single framework; it performs stably across Claude Code, OpenClaw, Qwen Code, and various custom tool-use frameworks, making it a reliable foundation for diverse agent systems.

**Productivity Assistant**

For real-world productivity scenarios, Qwen3.7-Max will serve as a deep collaborator. Drawing on its agent capabilities, it can reshape professional workflows end to end: researching and integrating large volumes of information, analyzing and modeling complex data, and generating publication-ready documents and visualizations. This lets it take on high-complexity, high-intensity enterprise-level tasks accurately.

Qwen3.7-Max natively adapts to mainstream agent frameworks. For long-chain delivery tasks, it supports autonomous planning and execution lasting several hours, continuously improving deliverable quality through thousands of tool calls and dozens of version iterations. Complex projects that previously took professional teams one to two weeks can now be delivered end to end within hours by agents powered by Qwen3.7-Max, a genuine leap in productivity.

**Agent Expansion**

Building on the environment expansion method introduced in Qwen3.5, Qwen3.7 significantly increases the quality and diversity of agent training environments. Just as language models gain generalization ability from diverse pre-training text, we find that agent capabilities also generalize from diverse training environments. As shown in the figure below, this environment expansion has produced a clear and stable trajectory of performance improvement. Qwen3.7-Max ranks in the top three in comprehensive rankings, approaching the level of Claude-4.6-Opus-Max.

Notably, all environments involved in our benchmark evaluations are completely new, out-of-distribution domains never seen during training. We also observed that expansion behavior is highly predictable: performance gains on any benchmark subset are consistent enough to reliably predict relative gains on the remaining benchmarks or on the overall average. This indicates that environment expansion drives true capability generalization, not improvements specific to particular benchmarks. Further analysis of expansion dynamics and methodology will be detailed in an upcoming technical report.

**Cross-Framework Generalization Capability**

Our Rollout environment infrastructure decouples each training instance into three orthogonal components (Task, Harness, and Verifier) that can be freely recombined. The infrastructure supports multiple harnesses and their successive versions, and we ground environments in real-world scenarios rather than synthetic substitutes. This decoupled design enables combinatorial expansion: the same task can be paired with different types and versions of harnesses and verifiers at very low marginal cost.
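The combinatorial effect of this decoupling can be sketched in a few lines. The following is a minimal illustration, not Qwen's actual infrastructure: the class names mirror the three components named above, but every field, the `build_instances` helper, and the sample tasks, harnesses, and verifiers are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable


@dataclass(frozen=True)
class Task:
    name: str
    prompt: str


@dataclass(frozen=True)
class Harness:
    name: str
    version: str


@dataclass(frozen=True)
class Verifier:
    name: str
    check: Callable[[str], bool]  # returns True if the agent output passes


@dataclass(frozen=True)
class TrainingInstance:
    task: Task
    harness: Harness
    verifier: Verifier


def build_instances(tasks, harnesses, verifiers):
    """Pair every task with every harness and every verifier."""
    return [TrainingInstance(t, h, v) for t, h, v in product(tasks, harnesses, verifiers)]


# Hypothetical components, for illustration only.
tasks = [
    Task("fix-bug", "Repair the failing unit test."),
    Task("clean-sheet", "Deduplicate rows in the spreadsheet."),
]
harnesses = [
    Harness("harness-a", "1.0"),
    Harness("harness-a", "2.0"),
    Harness("harness-b", "1.0"),
]
verifiers = [
    Verifier("non-empty", lambda out: bool(out.strip())),
    Verifier("mentions-done", lambda out: "done" in out.lower()),
]

instances = build_instances(tasks, harnesses, verifiers)
print(len(instances))  # 2 tasks * 3 harnesses * 2 verifiers = 12
```

Because the components are independent, adding one more harness version here yields four additional instances without writing any new task or verifier, which is the low marginal cost the paragraph describes. A real system would presumably also filter out pairings that do not make sense together, which this sketch does not attempt.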

More crucially, it enables reinforcement learning (RL) training across frameworks and verifiers. The model is forced to handle the same tasks under varying harness configurations, which compels it to learn problem-solving strategies that generalize rather than relying on shortcuts specific to a particular harness.
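One way to picture this is a rollout schedule in which each task is paired with a different harness on every pass. The article does not say how harnesses are assigned during training, so the round-robin rotation below is purely an assumption chosen for simplicity, and the function names and harness labels are hypothetical.

```python
from collections import defaultdict


def rotate_harnesses(tasks, harnesses, epochs):
    """Assign a harness to each task per epoch, rotating so that no task
    stays tied to a single harness (an assumed policy, not the source's)."""
    schedule = []
    for epoch in range(epochs):
        for i, task in enumerate(tasks):
            harness = harnesses[(epoch + i) % len(harnesses)]
            schedule.append((epoch, task, harness))
    return schedule


def harness_coverage(schedule):
    """Map each task to the set of harnesses it was rolled out under."""
    seen = defaultdict(set)
    for _, task, harness in schedule:
        seen[task].add(harness)
    return dict(seen)


# Hypothetical task and harness labels, for illustration only.
tasks = ["fix-bug", "clean-sheet"]
harnesses = ["harness-a", "harness-b", "harness-c"]

schedule = rotate_harnesses(tasks, harnesses, epochs=3)
coverage = harness_coverage(schedule)
for task, used in coverage.items():
    print(task, sorted(used))
```

With at least as many epochs as harnesses, every task is attempted under every harness, so a strategy that only works in one harness cannot score well on that task consistently.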

In evaluations on QwenClawBench and CoWorkBench, regardless of the harness used during assessment, Qwen3.7-Max demonstrated strong and consistent performance, significantly surpassing the Qwen3.6 series models. This confirms that the model has genuinely mastered the ability to solve tasks rather than overfitting to a specific framework.

Qwen3.7-Max can be seamlessly integrated into mainstream agent frameworks and programming assistants, including Claude Code, OpenClaw, and Qwen Code.

