Amidst the buzz surrounding Anthropic's Claude Code and open-source projects like OpenClaw lies a significant risk: such AI agents could potentially be manipulated into leaking sensitive personal data, such as bank information. Earlier this year, Anthropic explicitly identified the study of rogue agents as a core research topic for its scholar program, underscoring the seriousness of these concerns.
According to proposal documents reviewed by The Information, Anthropic insiders suggested that scholars train agents to exhibit anomalous behavior in specific scenarios—for instance, writing code containing security vulnerabilities. The team also requested researchers to establish a benchmark for measuring how frequently agents are exposed to security risks.
Anthropic proposed a total of 49 research projects for its scholars, covering a wide range of directions from training Claude to win cybersecurity competitions to investigating Chinese open-source large language models. This disclosure provides a rare glimpse into the company's research priorities.
Guided by senior researchers, the scholars advance Anthropic's work in AI safety and security, though this work does not include core technical development such as training more advanced frontier models. While scholars ultimately pursued only about half of the proposed projects, the proposals clearly illustrate the key issues identified by Anthropic's researchers.
This is significant: for Anthropic and its competitors like OpenAI, Google DeepMind, and xAI, foundational research is the first step in developing new products and applications, and is crucial for building safety guardrails that enable user trust.
Anthropic's spokesperson stated that over half of the research output published by the company's Alignment team, which focuses on catastrophic AI risks, from November to December of last year originated from the scholar program. Participants in the program are primarily undergraduate or graduate students who spend four to six months working on projects selected by Anthropic staff and partners, such as the Berkeley-based AI research organization Redwood Research.
Ethan Perez, who leads a significant portion of Anthropic's safety research and helped launch the scholar program, said the initiative "has significantly increased our research capacity and helped us bring more talent into the field."
In the scholar program launched this January, Anthropic's team and its partners proposed 49 projects. Fifteen of these focused on safety, primarily researching agent-related security issues and proposing fixes. Dozens of other projects aimed to monitor and steer AI system behavior, including guarding against models that might "scheme" against users.
For example, one proposal suggested using Anthropic's flagship model, Claude Opus, to replicate attack behaviors so the company could better defend against them. Currently, when Anthropic discovers new vulnerabilities targeting its agents, employees must manually set up reproduction environments, such as creating a fake phishing bank website designed to trick an agent. Researchers proposed having Claude Opus automatically generate such websites for training models to resist attacks.
Preventing hackers from misusing agents is critical to Anthropic's business. The company has gained a competitive edge against rivals like OpenAI with its code agent, Claude Code, and its non-technical applications like the email-handling Claude Cowork.
Anthropic's spokesperson revealed that since its launch last February, Claude Code has reached an annualized revenue of $25 billion, excluding Cowork. This growth helped the company secure $30 billion in funding earlier this month, achieving a pre-money valuation of $350 billion.
However, frequent anomalous agent behavior, such as clearing a user's inbox, could limit user adoption, highlighting the necessity of robust safety measures. Anthropic has already advised Cowork users to "be mindful of Claude's suspicious behavior." The difficulty of defending against such attacks also presents challenges for OpenAI.
Anthropic researchers also proposed several projects focusing on Chinese AI models, such as replicating innovations from Chinese AI labs. However, Perez noted that no scholars recently chose to pursue these directions, and the reasons for their preference for other topics remain unclear.
Another nine projects aimed to understand the internal workings of AI models, a traditional strength for Anthropic and an area where the company is currently hiring extensively. Related projects include uncovering the mathematical principles behind certain AI models' bizarre behaviors.
For instance, one project seeks to study so-called "large language model mind viruses," such as parasitic personas in AI models: an obsession with spiral patterns or inducing humans to post strange content on social platforms, thereby spreading the "virus" to other models.
This type of research is vital for AI companies, which are willing to offer top researchers compensation packages worth hundreds of millions of dollars. Even for Anthropic's scholars, the compensation is substantial: program application documents indicate that future cohorts will receive a weekly stipend of $3,850, equivalent to an annual salary exceeding $200,000.
Perez stated that beyond supporting core research directions, the scholar program allows Anthropic to explore "more unconventional and offbeat ideas," which may evolve into important research avenues in the future.