Zhipu Launches AutoGLM 2.0: The "Manus Moment" for Mobile Agents?

Deep News
08/20

When AI stops competing for your phone screen, mobile Agents might finally become truly usable.

On August 18, Zhipu officially released AutoGLM 2.0, its new consumer-facing universal mobile Agent. The previous AutoGLM version, released in March, had a significant limitation: during task execution, users could only watch the screen and do nothing else. This local "screen-grabbing" approach forced an either/or choice between human and machine control of the device.

For example, when using the Agent to order coffee on a phone, users could only watch and wait for the task to finish. This mode capped AI-driven efficiency gains at roughly 1.x times, far short of the multiplicative productivity improvements expected.

Now, as AutoGLM re-enters the public eye, the situation has changed significantly. In the 2.0 version, each user receives a cloud phone and cloud computer. With just a single command, the Agent can automatically execute operations, collaborate across applications, and complete entire task workflows in the cloud.

This means AI can work independently in the cloud 24/7 without interfering with foreground operations. The human-AI collaboration paradigm is evolving from a synchronous "you watch me work" mode to an asynchronous parallel "you do your thing, I do mine" mode.

When AI gains an independent "body" and "workspace," a new parallel digital world driven by Agents appears to be opening up.

**From "Screen Grabbing" to "Cloud Avatar"**

Let's return to that core pain point. Previously, whether with early versions of AutoGLM or similar attempts, every AI operation was reflected in real-time on users' physical screens. This "screen-grabbing" mode created several obstacles:

First was the efficiency problem — when AI worked, humans had to wait, creating a mutually exclusive relationship. This limited overall efficiency improvements and failed to achieve ideal productivity multiplication.

Second was execution interruption potential — screen locks, network fluctuations, app switching, or any user behavior could interrupt Agent long-task flows. AI struggled to work continuously during non-attention periods (sleep, entertainment), significantly reducing its value.

Finally, there were adaptation challenges — Android system fragmentation made local adaptation costs prohibitively high. Every phone brand and system version could potentially affect Agent stable operation.

AutoGLM's new solution replaces "local mirroring" with "cloud-native" deployment. It deploys a complete Android environment (cloud phone) and Linux environment (cloud computer, with Windows support coming later) in the cloud for each user.

When users issue commands like "go to Meituan, find nearby bubble tea shops, order 20 cups, remember to use coupons," the entire task flow — from opening apps, skipping ads, searching stores, selecting products, repeatedly clicking to increase quantities, to intelligently applying coupons — all runs on that cloud phone.

Meanwhile, users' physical phones remain free. Users can continue chatting, watching videos, or simply put the phone in their pocket with the screen off. AI work and user operations are physically decoupled, with no interference. Users only need to check progress in task lists and return to "confirm" at key points like payment or publishing.

During Zhipu's closed-door session, product manager Liu Xiao demonstrated this core experience live. When he used an iPhone to assign AutoGLM a Xiaohongshu operation task — "create and publish a video introducing AutoGLM, style should fit self-media" — the Agent began efficient work in the cloud.

It conducted high-concurrency searches of dozens of keywords, quickly browsed multiple web pages, then completed information collection and copywriting, automatically beginning video production. During this time, Liu Xiao also demonstrated bubble tea ordering and "scroll TikTok until finding a cat video" entertainment tasks on the cloud phone.

According to official information, AutoGLM can currently operate over 40 high-frequency applications in the cloud, including TikTok, Xiaohongshu, Meituan, and JD.com.

This reflects Zhipu's insights into future human-machine collaboration relationships. Zhipu CEO Zhang Peng shared a perspective at the meeting: future personal competitiveness will center on "personal ability + N AI agents" combined. Everyone will transform from "worker" to "leader," with core capabilities shifting from hands-on execution to "communication, task assignment, and command."

AutoGLM's cloud architecture represents the product implementation of this concept. It enables AI to become a "digital employee" capable of 24/7 parallel work, breaking the barrier of "AI must operate under your supervision," allowing users to "outsource" time-consuming, repetitive, or beyond-capability tasks to this cloud avatar.

In my own testing, I tried using AutoGLM to buy coconut water on Meituan. Before launching the task, I had to take control of the cloud machine, log into the relevant app accounts, and release control before starting the task normally. Manual intervention was also required at payment, but the remaining steps did complete automatically. The system even refined the request beforehand, automatically adding "use red packets." Speed, however, was a concern: it was slower than operating the phone by hand.

Next, I tried a computer task: "answer a question under the top trending topic on Zhihu." Midway through, possibly because I missed a confirmation point, the task automatically restarted; when I switched back from other pages, I found it running from the beginning.

After I took control and confirmed the operation point, I could see AutoGLM executing. However, this task execution was somewhat problematic — the instruction was "answer questions under Zhihu's top trending topic," but it only found the topic and remained there, considering the task complete without actually "answering."

Perhaps the instruction wasn't detailed enough? I updated it to "find Zhihu's top trending topic, write a 200-word response to the question, and publish directly after writing," starting a new task. This time it did write an answer, but due to system connection restrictions, manual submission was still required.

There is also a risk-control issue: some users reported forced logouts and device-ID lockouts when using AutoGLM to publish Xiaohongshu content, likely because the automated posting triggered the platform's risk controls.

**"3A Principles" and "Online Reinforcement Learning" Driving Agents**

If "cloud phone/cloud computer" represents AutoGLM's new "body," then the powerful models, training methodology, and product principles behind it serve as the "brain" enabling efficient operation.

Through team discussions, we learned that AutoGLM's product philosophy can be distilled into "3A Principles." These three principles collectively define AutoGLM's vision of mature Agent form and explain its current product architecture.

Previously, many Agents relied on supervised fine-tuning (SFT), learning from human expert operation trajectories. This method's weakness was "poor generalization" — AI could only imitate operations it had seen, often helpless with unseen scenarios or interface changes.

To enable Agents to truly complete tasks in complex, variable real environments (thousands of concurrent phone, computer, and browser environments), the AutoGLM team chose end-to-end online reinforcement learning. The core idea is that after minimal expert data "cold start," models learn through "trial and error" in thousands of parallel real cloud environments, just like humans.

The system no longer tells models "where to click next" but only provides "success" reward signals when tasks are finally completed. Models must explore optimal decision paths themselves.
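The sparse-reward setup described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not Zhipu's actual training stack: a miniature "cloud phone" environment where only the full task completion yields a reward of 1, and the agent's policy must discover the action sequence on its own.

```python
import random

class CloudPhoneEnv:
    """Toy stand-in for one cloud environment: the agent must advance
    through TARGET steps by choosing the right action at each step."""
    TARGET = 3

    def __init__(self):
        self.step_idx = 0

    def reset(self):
        self.step_idx = 0
        return self.step_idx

    def step(self, action):
        # Only the "correct" action (here: matching the current step) advances.
        if action == self.step_idx:
            self.step_idx += 1
        done = self.step_idx == self.TARGET
        # Sparse reward: 1.0 only when the whole task succeeds, else 0.
        return self.step_idx, (1.0 if done else 0.0), done

def run_episode(env, policy, max_steps=10):
    """Roll out one episode and return the total (sparse) reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

# A policy that happens to act correctly earns the reward; a random
# policy usually earns nothing, which is why exploration at scale matters.
expert = lambda s: s
rand_policy = lambda s: random.randint(0, 3)
print(run_episode(CloudPhoneEnv(), expert))  # 1.0
```

The point of the sketch is the reward shape: no per-step supervision, only a terminal success signal, which is what makes thousands of parallel environments necessary for exploration.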

This presents enormous engineering challenges, requiring a massive system capable of simultaneously scheduling and monitoring thousands of cloud computers and phones. On the technical side, Zhipu disclosed several reinforcement learning advances: an API-GUI collaborative paradigm that improves data diversity on computers (ComputerRL); a difficulty-adaptive reinforcement learning method for stability on complex mobile tasks (MobileRL); and a cross-sampling mechanism that addresses instability in multi-task training (AgentRL).

These specific technical innovations collectively ensure AutoGLM's high success rates in complex environments. According to Zhipu, online reinforcement learning improved AutoGLM's task success rate by 165% compared to cold start, with over 66% of success gains coming from this approach.

"We discovered that with sufficiently good 'Environment' and 'Reward,' existing algorithms can optimize virtually any task," Liu Xiao shared. "Bottlenecks are no longer in algorithms themselves, but in building scalable validation and feedback environments."

This "model as Agent" concept is also reflected in base models. GLM-4.5 and GLM-4.5V underwent deep optimization for Agent tasks from pre-training stages, termed "Agentic Language Models." This native design from the ground up enables AutoGLM to excel in multiple public benchmarks. For instance, in OSWorld Benchmark testing computer operation capabilities, AutoGLM scored 48.1, surpassing ChatGPT Agent and Anthropic models.

Advanced technical routes bring significant commercial viability breakthroughs — cost reduction. Traditional Agents built on third-party large model APIs cost $3-5 per complex task (like Deep Research). AutoGLM, leveraging proprietary models and integrated architecture, compressed single-task costs including model calls and virtual machine resources to approximately $0.2 (about 1.5 RMB).

That is within roughly an order of magnitude of Google's estimated cost of about $0.02 per search. Such cost reduction gives Zhipu the confidence to open the product directly to all consumer users without invitation codes, and it is what makes super-app scale economically plausible.

**From "Tool" to "Ecosystem"**

By providing independent cloud runtime environments and GLM-4.5/4.5V model capabilities for Agents, AutoGLM's positioning transcends single efficiency tools, beginning to build an ecosystem connecting multiple devices and services.

First is product capability depth. Beyond demonstrated cross-application operations, AutoGLM's cloud computer targets support for professional productivity tools like Office and Photoshop. The upcoming "scheduled tasks" feature will mark AI's transition from "passive response" to "semi-autonomous planning."

Imagine: "Every morning at 9 AM, automatically summarize boss's unread emails and send summaries to my WeChat," "Weekdays at 10 AM, automatically compare prices across platforms and order my regular coffee" — essentially half a secretary.
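The "scheduled tasks" idea above amounts to recurring rules that hand a natural-language task to the agent at a set time. A minimal sketch, with entirely hypothetical rule names and no connection to AutoGLM's actual feature API:

```python
import datetime

# Each rule: (hour to fire, weekdays only?, task handed to the cloud agent).
SCHEDULE = [
    (9, False, "summarize unread emails and send the digest to WeChat"),
    (10, True, "compare coffee prices across platforms and order the usual"),
]

def due_tasks(now: datetime.datetime) -> list[str]:
    """Return the tasks whose rules fire at this clock tick."""
    is_weekday = now.weekday() < 5
    return [task for hour, weekdays_only, task in SCHEDULE
            if now.hour == hour and (is_weekday or not weekdays_only)]

# A Monday 9 AM tick fires the email-digest task.
print(due_tasks(datetime.datetime(2025, 8, 18, 9, 0)))
```

The interesting shift is not the scheduler itself but what it dispatches: a free-form instruction to an agent rather than a fixed script.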

Second is hardware ecosystem empowerment. Current AI hardware like smart glasses and Pin-type devices universally face the "impossible triangle" of computing power, battery life, and interaction. Stacking heavy systems and large batteries on miniature devices often yields poor experiences.

AutoGLM's solution involves "lightweight" edge hardware responsible only for sensing and issuing commands, while delegating complex application operations and task execution entirely to cloud Agents.

Creative cases demonstrated at the session illustrate this: connecting weight scales that automatically trigger cloud Agents to order meal replacements when detecting weight exceeding 70kg thresholds; connecting gas sensors that automatically order deodorizing shoe pads when detecting excessive ammonia/hydrogen sulfide concentrations in shoe cabinets.

This showcases a relatively complete "physical sensor → cloud Agent → real-world service" chain, enabling Agents to connect and operate the physical world.
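The sensor-to-service chain can be sketched as a simple rule table: a sensor reading crosses a threshold, and a natural-language task is dispatched to a cloud agent. The trigger rules and the dispatch function below are hypothetical; the article does not describe AutoGLM's actual API surface.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trigger:
    sensor: str
    condition: Callable[[float], bool]
    task: str  # natural-language task handed to the cloud agent

def dispatch_to_agent(task: str) -> str:
    # Placeholder for a call into a cloud-agent API (hypothetical).
    return f"agent task queued: {task}"

# The two demo cases from the session, expressed as trigger rules.
TRIGGERS = [
    Trigger("weight_scale", lambda kg: kg > 70,
            "order meal replacements on Meituan"),
    Trigger("shoe_cabinet_gas", lambda ppm: ppm > 25,
            "order deodorizing shoe pads"),
]

def on_sensor_reading(sensor: str, value: float) -> List[str]:
    """Fire every trigger whose sensor matches and condition holds."""
    return [dispatch_to_agent(t.task)
            for t in TRIGGERS
            if t.sensor == sensor and t.condition(value)]

print(on_sensor_reading("weight_scale", 72.5))
```

The hardware stays "lightweight" in exactly this sense: it only reports a number, and all application logic lives with the agent in the cloud.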

Through open APIs and developer programs, AutoGLM attempts to make "everything Agent-capable." To accelerate this process, Zhipu launched "AutoGLM Mobile API Application Portal" and "Developer Ecosystem Co-building Program," allowing developers to encapsulate AutoGLM's cloud execution capabilities into their hardware or software products.

Finally, the traditional internet's traffic ceiling is the user's attention limit: there are only 24 hours in a day, and time spent in one app is time not spent in another. Agents create a new form of traffic that is parallel and demand-driven. With only single-threaded attention yourself, you can deploy many parallel Agents to research travel guides, compare prices across stores, and filter work materials.

This mode of AI proxies using services for humans could dramatically expand the effective traffic pool across the internet. Moreover, this traffic carries clear "transaction intent," representing relatively higher commercial value.

From another perspective, an Agent's average single-task consumption exceeds 256k tokens, roughly 32 times that of a traditional dialogue session (implying a baseline of about 8k tokens), a corresponding jump in demand and value density for upstream inference infrastructure.

At the session's conclusion, Liu Xiao proposed a staged definition of AGI (Artificial General Intelligence) he calls "AGI's lower bound": when an Agent can run autonomously and stably for an entire day (24 hours) as your colleague or secretary, collaborating on work and life tasks and improving your overall efficiency by more than 2x, the dawn of AGI is in sight.

AutoGLM's evolution may still have distance from this "lower bound." It remains in early form, with rudimentary instruction understanding and some bugs. However, by building the core "cloud avatar" architecture, it's genuinely beginning to pave the way for Agent "independent operation."

The shift from synchronous operation to asynchronous delegation may mark the beginning of human-machine collaboration paradigm transformation. Future personal competitiveness might depend on "personal ability + N AI agents" models, where users issue commands for multiple AI parallel task completion, fundamentally changing how individuals handle daily and work affairs.

More ideally, perhaps a future where you only need to speak while countless digital avatars manage your digital world is unfolding.
