A new assessment from OpenAI indicates that artificial intelligence is rapidly approaching the performance of human professionals on economically valuable work tasks.
On Thursday, OpenAI released a new evaluation called GDPval-v0, designed to measure how well AI models complete "real-world work deliverables" such as legal documents, engineering blueprints, and care plans.
The study covers nine major business sectors that together account for a significant share of United States gross domestic product (GDP), comprising 1,320 specific work tasks across 44 occupations. The results indicate that today's top-tier AI models have reached capabilities comparable to human professionals on many occupational tasks, and that this capability is improving at an accelerating pace.
Following the release of GDPval-v0, Jack Clark, former OpenAI policy director and Anthropic co-founder, comprehensively evaluated GDPval's research process and results in his latest blog post "Eval the world economy; singularity economics; and Swiss sovereign AI."
**GDPval May Become New Standard for Measuring AI Economic Value**
The GDPval benchmark encompasses 1,320 professional tasks spanning technology services, finance and insurance, healthcare, the information industry, manufacturing, and other sectors. Each task was carefully designed and reviewed by seasoned professionals averaging more than 14 years of industry experience.
Clark noted that this list encompasses virtually all key knowledge-intensive positions in the modern economy, demonstrating that AI companies are systematically testing their systems' adaptability across various economic "niches."
Another strength of the benchmark is that it spans multiple response formats and attempts to capture the complexity inherent in real-world work.
To simulate real-world work complexity, GDPval tasks are not simple text Q&As but come with reference documents and context, requiring AI to deliver diverse outputs including documents, slides, charts, and spreadsheets.
The evaluation results put concrete numbers on current AI capability. Claude Opus 4.1 ranked first, achieving a 47.6% "win or tie" rate against human experts, followed by GPT-5-high (38.8%) and o3 high (34.1%).
These figures indicate that on complex professional knowledge work, AI output quality is approaching that of experienced humans, and in some cases matches or exceeds it.
Clark believes GDPval's emergence provides a crucial benchmark for assessing AI's broad economic impact, with significance similar to SWE-Bench's role in the programming field.
Public information shows SWE-Bench was introduced in October 2023 to evaluate AI models' programming capabilities. The benchmark draws on more than 2,000 real programming problems extracted from the public GitHub repositories of 12 different Python projects.
**Evaluating the World Economy; Singularity Economics; and Swiss Sovereign AI**
OpenAI built and released GDPval, a well-crafted benchmark for testing how AI systems perform on the kinds of tasks people do in the real-world economy. As an evaluation, GDPval may matter for broad real-world economic impact the way SWE-Bench matters for programming. This is a big deal!
GDPval "measures model performance on tasks directly from the real world involving knowledge work of experienced professionals across industries, providing a clearer picture of model performance on economically valuable tasks."
The benchmark covers 44 occupations across 9 industries, comprising 1,320 professional tasks, "each carefully crafted and reviewed by experienced professionals with an average of over 14 years of experience." The dataset "includes 30 thoroughly reviewed tasks per occupation (complete set), plus 5 tasks per occupation in our open-source gold set."
Another strength of the benchmark is that it spans multiple response formats and attempts to capture the complexity inherent in the real world. "GDPval tasks are not simple text prompts. They come with reference documents and context, with expected deliverables spanning documents, slides, charts, spreadsheets, and multimedia. This realism makes GDPval a stronger test of how models can support professionals."
"To evaluate model performance on GDPval tasks, we rely on expert 'graders'—a group of experienced professionals from the same occupations represented in the dataset. These graders blindly compare model-generated deliverables with task writer-produced outcomes (not knowing which is AI-generated and which is human-generated) and provide critiques and rankings. Graders then rank human and AI deliverables and categorize each AI deliverable as 'better,' 'equally good,' or 'worse than' each other."
**Results:** "We found that today's best frontier models are already approaching the quality of work produced by industry experts." Claude Opus 4.1 ranked first with an overall win or tie rate of 47.6% compared to human work, followed by GPT-5-high at 38.8% and o3 high at 34.1%.
**Faster and Cheaper:** More importantly, "we found that frontier models complete GDPval tasks about 100 times faster than industry experts and about 100 times cheaper."
**What Types of Work Does GDPval Include?**
• Real Estate and Rental: Concierges; Property, Real Estate, and Community Association Managers; Real Estate Sales Agents; Real Estate Brokers; Counter and Rental Clerks.
• Government: Recreation Workers; Compliance Officers; First-Line Supervisors of Police and Detectives; Administrative Services Managers; Child, Family, and School Social Workers.
• Manufacturing: Mechanical Engineers; Industrial Engineers; Purchasing Agents and Buyers; Shipping, Receiving, and Inventory Clerks; First-Line Supervisors of Production and Operating Workers.
• Professional, Scientific, and Technical Services: Software Developers; Lawyers; Accountants and Auditors; Computer and Information Systems Managers; Project Management Specialists.
• Healthcare and Social Assistance: Registered Nurses; Nurse Practitioners; Medical and Health Services Managers; First-Line Supervisors of Office and Administrative Support Workers; Medical Secretaries and Administrative Assistants.
• Finance and Insurance: Customer Service Representatives; Financial and Investment Analysts; Financial Managers; Personal Financial Advisors; Securities, Commodities, and Financial Services Sales Agents.
• Retail Trade: Pharmacists; First-Line Supervisors of Retail Sales Workers; General and Operations Managers; Private Detectives and Investigators.
• Wholesale Trade: Sales Managers; Order Clerks; First-Line Supervisors of Non-Retail Sales Workers; Wholesale and Manufacturing Sales Representatives, Except Technical and Scientific Products; Wholesale and Manufacturing Sales Representatives, Technical and Scientific Products.
• Information: Audio and Video Technicians; Producers and Directors; News Analysts, Reporters, and Journalists; Film and Video Editors; Editors.
**Why This Matters—AI Companies Are Building Systems to Enter Every Part of the Economy:** At this point, I hope readers imagine me standing in the center of Washington D.C., holding a huge sign that reads: AI companies are building benchmarks designed to test their systems on various jobs in the economy—and they're already very good at it!
This is not normal!
We are testing systems across an extremely broad range of behaviors through ecologically valid benchmarks, which ultimately tell us how well these systems can slot into roughly 44 different "economic niches" in the world. We find they are already very close to human-level performance on these tasks, and that is just today's models. Soon they will surpass many humans at these tasks. Then what happens? Nothing? No! The economy will undergo extremely strange changes!