Stanford's ACE Framework: AI Learns to Write Its Own Prompts, Boosting Performance 17% While Cutting Costs 87%

Deep News
Oct 13, 2025

Stanford University and SambaNova AI have recently published a joint research paper introducing Agentic Context Engineering (ACE). The core concept focuses on optimizing input context without modifying model parameters, allowing models to generate their own prompts, reflect on effectiveness, and iteratively improve.

This process can be visualized as the model maintaining a "work manual" where failed attempts are recorded as troubleshooting guides, while successful cases are distilled into reusable rules.

**Performance Data**

The research demonstrates impressive results:

- AppWorld task accuracy improved by 10.6% compared to GPT-4-driven agents
- Financial reasoning tasks showed an 8.6% improvement
- Cost and latency were reduced by 86.9%

The entire process requires no human annotation, achieving optimization through feedback loops alone.

**Counterintuitive Approach**

The framework challenges conventional wisdom. While mainstream thinking pursues concise prompts and streamlined instructions, ACE constructs an information-dense, continuously growing "operations manual." Over time, this manual becomes increasingly comprehensive, and its effectiveness compounds as entries accumulate.

Large language models appear to need sufficient context density rather than brevity. This suggests we may have been overly focused on the models themselves while neglecting how to communicate with them more effectively. This represents both a technical advancement and a fundamental shift in thinking.

**Technical Framework Details**

**Research Motivation**

LLM-based AI applications, including LLM agents and composite AI systems, increasingly rely on context adaptation. Unlike modifying model weights, context adaptation directly incorporates explicit instructions, structured reasoning steps, or domain-specific formats into inputs to enhance performance.

Context serves as the foundation for many AI system components: system prompts that guide downstream tasks, memory that stores historical facts and experiences, and factual evidence that reduces hallucinations and supplements knowledge.

Context-based adaptation offers several advantages over weight-based approaches. Context remains interpretable and understandable to users and developers, enables rapid integration of new knowledge at runtime, and allows sharing across different models or modules in composite systems.

**Core Problems with Existing Methods**

Current context adaptation methods face two critical limitations:

First is brevity bias. Many prompt optimizers prioritize concise, general instructions over comprehensive knowledge accumulation. This abstraction loses important domain heuristics, tool usage guidelines, and common failure modes crucial for practical applications. While appearing reasonable on certain validation metrics, such approaches often miss the detailed strategies required for agents and knowledge-intensive applications.

Second is context collapse. Methods that rely on wholesale LLM rewriting tend to degrade over time into shorter, less informative summaries, causing performance drops. When an LLM is tasked with completely rewriting the accumulated context at each adaptation step, this collapse can be dramatic.

Experimental evidence shows that at step 60, context contained 18,282 tokens with 66.7% accuracy, but collapsed to 122 tokens in the next step with accuracy dropping to 57.1% — worse than the unadapted baseline of 63.7%.

**ACE Framework Design**

To address these limitations, the research proposes ACE (Agentic Context Engineering), a comprehensive framework for both offline scenarios (system prompt optimization) and online scenarios (test-time memory adaptation).

Rather than compressing context into refined summaries, ACE treats contexts as evolving playbooks that accumulate and organize strategies over time. Based on Dynamic Cheatsheet's agentic architecture, ACE incorporates modular workflows of generation, reflection, and curation, while adding structured incremental updates guided by grow-and-refine principles.

The workflow begins with a Generator creating reasoning traces for new queries, exposing effective strategies and recurring issues. A Reflector critically analyzes these traces to extract lessons, optionally refining them across multiple iterations. A Curator then synthesizes these lessons into compact delta entries, deterministically merging them into existing context through lightweight non-LLM logic.
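The Generator, Reflector, and Curator roles described above can be sketched as a single adaptation step. This is a minimal illustration, assuming a hypothetical `llm(prompt) -> str` helper and a plain-text context; it is not the authors' implementation, and the real system uses structured bullets rather than raw lines.

```python
# Sketch of one ACE adaptation step: Generator -> Reflector -> Curator.
# llm(prompt) -> str is a hypothetical helper standing in for a model call.

def ace_step(query, context, llm):
    # Generator: produce a reasoning trace for the query, conditioned on
    # the current context (the evolving "playbook").
    trace = llm(f"Context:\n{context}\n\nSolve:\n{query}\nShow your reasoning.")

    # Reflector: critically analyze the trace and extract concrete lessons.
    lessons = llm(f"Trace:\n{trace}\n\nList reusable lessons, one per line.")

    # Curator: fold the lessons into the context as delta entries using
    # deterministic, non-LLM merge logic (here: append only unseen lines).
    existing = set(context.splitlines())
    for entry in (line.strip() for line in lessons.splitlines()):
        if entry and entry not in existing:
            existing.add(entry)
            context = f"{context}\n{entry}" if context else entry
    return context
```

Note that only the Generator and Reflector invoke the model; the merge itself is cheap string logic, which is where the framework's cost savings come from.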

**Incremental Delta Updates**

ACE's core design principle represents context as structured, itemized bullet collections rather than monolithic prompts. Each bullet includes:

- Metadata: a unique identifier and counters tracking how often the bullet was marked helpful or harmful
- Content: an independent unit of knowledge, such as a reusable strategy, domain concept, or common failure mode

This itemized design enables three key properties:

- Localization: only the relevant bullets are updated
- Fine-grained retrieval: the Generator can focus on the most relevant knowledge
- Incremental adaptation: efficient merging, pruning, and deduplication during reasoning
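The bullet structure and its localized updates can be sketched as follows. The field names are illustrative, not the paper's exact schema:

```python
# Illustrative sketch of ACE's itemized bullet representation.
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple unique-identifier generator

@dataclass
class Bullet:
    content: str   # a reusable strategy, domain concept, or failure mode
    bullet_id: int = field(default_factory=lambda: next(_ids))
    helpful: int = 0   # counter: times this bullet was marked helpful
    harmful: int = 0   # counter: times this bullet was marked harmful

def update_feedback(playbook, bullet_id, was_helpful):
    # Localization: only the single relevant bullet is touched;
    # the rest of the playbook is left untouched.
    for b in playbook:
        if b.bullet_id == bullet_id:
            if was_helpful:
                b.helpful += 1
            else:
                b.harmful += 1
            return b
    return None
```

Because each bullet is independent, the Curator can merge, prune, or re-rank entries without rewriting the rest of the context.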

**Grow-and-Refine Mechanism**

Beyond incremental growth, ACE ensures context remains compact and relevant through periodic or lazy refinement. In grow-and-refine, bullets with new identifiers are appended while existing bullets are updated in-place. Deduplication steps then prune redundancy through semantic embedding comparison of bullets.
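The deduplication step can be illustrated with a toy similarity check. Here a bag-of-words cosine score stands in for the semantic embeddings the paper describes, and the 0.9 threshold is an assumed value:

```python
# Toy sketch of grow-and-refine deduplication: drop bullets that are
# near-duplicates of ones already kept. Bag-of-words cosine similarity
# stands in for real semantic embeddings here.
import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe(bullets, threshold=0.9):
    kept = []
    for b in bullets:
        # Keep a bullet only if it is not too similar to any kept bullet.
        if all(cosine(b, k) < threshold for k in kept):
            kept.append(b)
    return kept
```

In practice an embedding model would replace `cosine`, but the control flow, appending new entries and lazily pruning redundant ones, is the same.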

**Experimental Evaluation**

**Performance Results**

Evaluation demonstrates that ACE enables high-performance self-improving agents that dynamically refine their own input context. Learning solely from execution feedback, with no ground-truth labels, ACE achieved a 17.1% accuracy improvement on the AppWorld benchmark.

This context-driven improvement enables smaller open-source models to match performance of top-tier proprietary agents on leaderboards.

On domain-specific benchmarks, ACE showed substantial improvements. In complex financial reasoning benchmarks, ACE achieved average improvements of 8.6% over strong baselines by constructing comprehensive playbooks containing domain-specific concepts and information.

**Cost and Efficiency Analysis**

ACE demonstrates particular advantages in reducing adaptation costs and latency through support for incremental "delta" context updates and non-LLM-based context merging and deduplication.

For AppWorld offline adaptation, ACE reduced adaptation latency by 82.3% and rollout count by 75.1% compared to GEPA. For FiNER online adaptation, ACE reduced adaptation latency by 91.5% and token dollar cost by 83.6% compared to Dynamic Cheatsheet.

These efficiency gains stem primarily from two design features:

- Incremental updates, which avoid the overhead of complete context rewrites
- Parallel processing of multiple deltas, which enables batch adaptation
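The batch-merge idea can be sketched in a few lines. This is an assumed simplification in which deltas are plain strings and deduplication is exact-match; the point is that folding many deltas into the context needs no model call:

```python
# Sketch of batched, non-LLM delta merging: deltas produced in parallel
# from several rollouts are folded into the context in one cheap pass.
def merge_deltas(context_lines, delta_batches):
    seen = set(context_lines)
    merged = list(context_lines)
    for batch in delta_batches:      # each batch: deltas from one rollout
        for entry in batch:
            if entry not in seen:    # exact-match dedup; no LLM involved
                seen.add(entry)
                merged.append(entry)
    return merged
```

A complete-rewrite approach would instead re-generate the entire context with an LLM at every step, which is exactly the overhead this design avoids.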

**Task Evaluation**

The research evaluated ACE on two types of LLM applications that most benefit from comprehensive evolving context:

**Agent Benchmarks**: AppWorld provides autonomous agent tasks involving API understanding, code generation, and environment interaction. It offers real execution environments with common applications and APIs, plus two difficulty levels. The public leaderboard shows the best systems achieving only 60.3% average accuracy at submission time.

**Domain-Specific Tasks**: FiNER and Formula test LLM performance on financial reasoning tasks dependent on eXtensible Business Reporting Language (XBRL). FiNER requires annotating tokens in XBRL financial documents with 139 fine-grained entity types. Formula focuses on extracting values from structured XBRL files and performing calculations to answer financial queries.

**Results Summary**

On AppWorld benchmark, ACE consistently improved strong baselines. In offline settings, ReAct + ACE substantially outperformed ReAct + In-Context Learning and ReAct + GEPA by 12.3% and 11.9% respectively. These improvements extended to online settings, with ACE continuing to outperform previous adaptive methods like Dynamic Cheatsheet by an average of 7.6%.

Notably, on the latest AppWorld leaderboard, ReAct + ACE (59.4%) matched the top-performing IBM CUGA (60.3%), a production-grade GPT-4.1-based agent, despite using a smaller open-source model.

On financial analysis benchmarks, ACE provided strong improvements. In offline settings with training set ground-truth answers, ACE outperformed baselines by an average of 10.9%, demonstrating particular effectiveness when tasks require precise domain knowledge.

**Limitations and Future Directions**

ACE's main potential limitation lies in its dependence on a capable Reflector. If the Reflector cannot extract meaningful lessons from generated traces and results, the constructed context may become noisy or even harmful. In domain-specific tasks where the model cannot extract useful information, the resulting context naturally adds little value.

Not all applications require rich or detailed context. Tasks like HotPotQA often benefit more from concise high-level instructions rather than long contexts. Similarly, games with fixed strategies may only need single reusable rules, making additional context redundant.

Overall, ACE proves most effective for applications requiring detailed domain knowledge, complex reasoning chains, or long-term strategy accumulation. For tasks with simple structures or fixed strategies, traditional concise prompt optimization may remain sufficient.

The research represents progress toward more flexible, interpretable, and efficient LLM adaptation, opening new possibilities for building AI systems that continuously learn and improve from experience. Future work can explore ACE applications across broader domains and integration with other adaptation techniques such as parameter-efficient fine-tuning.

