Large Models Acting as Their Own Judges Prove Unreliable! Shanghai Jiao Tong University Reveals Flaws in LLM-as-a-Judge Mechanisms

Deep News
Aug 17

Large language models (LLMs) are evolving from tools into "judges" (LLM-as-a-judge), increasingly being used to evaluate AI-generated content on a large scale. While this efficient evaluation paradigm offers convenience, its reliability and consistency with human judgment have rarely been thoroughly verified.

A fundamental yet critical question arises: before judging whether a model can "stay in character," can AI judges accurately identify who is speaking in a conversation?

Addressing this question, a research team led by Professor Wang Dequan at Shanghai Jiao Tong University conducted a systematic study in their paper "PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?"

The paper introduces a novel benchmark called PersonaEval. The core task of this test requires models to identify the actual speaker from several candidate characters after being given a dialogue segment.
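
The paper and its accompanying code define the exact data format and prompting protocol; the snippet below is only a minimal Python sketch of how such a multiple-choice role-identification task could be posed to a judge model and scored by accuracy. The `RoleIDItem` fields, the prompt wording, and the `ask_llm` callable are hypothetical placeholders, not the benchmark's actual interface.

```python
# Minimal sketch of a PersonaEval-style role-identification check.
# The item fields, prompt wording, and ask_llm() callable are illustrative
# assumptions, not the paper's exact data format or protocol.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoleIDItem:
    dialogue: str          # the conversation excerpt shown to the judge
    candidates: List[str]  # candidate character names
    answer: str            # the character who actually produced the utterance


def build_prompt(item: RoleIDItem) -> str:
    options = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(item.candidates))
    return (
        "Read the dialogue excerpt and decide which candidate is the speaker.\n\n"
        f"Dialogue:\n{item.dialogue}\n\nCandidates:\n{options}\n\n"
        "Reply with the candidate's name only."
    )


def accuracy(items: List[RoleIDItem], ask_llm: Callable[[str], str]) -> float:
    """Fraction of items for which the judge names the true speaker."""
    correct = 0
    for item in items:
        reply = ask_llm(build_prompt(item)).strip()
        correct += int(reply.lower() == item.answer.lower())
    return correct / len(items)
```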

Test results show that even the best-performing model, Gemini-2.5-pro, achieved only 68.8% accuracy, while human test groups averaged 90.8% accuracy.

The paper has been accepted to the 2nd Conference on Language Modeling (COLM), to be held in October 2025.

**A Simple Problem That Stumps Top Models**

Recently, debate over whether large language models can serve as competent "judges" has intensified. From controversies over "hidden prompts" influencing LLM-assisted peer review to Stanford University's preparations for Agent4Science, the first purely AI academic conference, these developments signal an emerging trend: LLMs acting as judges of AI-generated content.

This trend is particularly evident in role-playing scenarios. From having large models portray classic literary characters and game NPCs to the popularity of Character.AI and the rise of "AI companions" in various applications, an era of LLM-driven virtual companions and content creation is approaching.

As the enormous commercial and application potential draws widespread industry attention, evaluating AI "acting skill" naturally becomes a core problem in urgent need of a solution, and using LLMs as judges has accordingly become one of the field's mainstream evaluation methods.

Before AI can serve as a judge, it must first be confirmed whether AI can accurately perform "role identification." The authors argue that if this basic capability is lacking, all subsequent advanced evaluations regarding tone, emotion, and personality consistency become meaningless.

Consider an example that appears simple to humans but causes top large models to make judgment errors:

In the example, the character Zhuang Yan is conversing with someone. Her inner monologue explicitly mentions "Luo Ji," and in the dialogue she addresses the other party as "Teacher Luo." To a human reader, these clues point unmistakably to Luo Ji as her interlocutor, yet top models still pick the wrong candidate.

This example pinpoints a fatal flaw in current LLM judges: they seem to focus more on superficial language style (who it sounds like), while humans primarily observe actual conversational intent and context (who would say this in that situation).

Why does this divergence occur? It reflects a fundamental difference between how AI and human intelligence work. As the paper notes, citing cognitive scientist Josh Tenenbaum: LLM intelligence is "derived" from learning patterns in massive amounts of language data, making these models expert pattern matchers; human intelligence, by contrast, "precedes" language: we develop and use language as a tool in the service of pre-existing intent and cognition.

**PersonaEval: A "Truth Detector" Designed for LLM Judges**

To systematically evaluate LLMs' ability to identify roles, the paper's authors carefully constructed the PersonaEval benchmark, with several design choices that keep the evaluation aligned with human judgment and appropriately challenging.

The benchmark comprises three test tracks, each probing a different aspect of role-identification ability.

**Test Findings: AI Judgment Still Has Enormous Gaps Compared to Humans**

How do existing LLMs perform in this PersonaEval "examination room"? The results are shocking.

The paper's authors tested a range of top models, including the GPT, Claude, and DeepSeek series. Even the best performer, Gemini-2.5-pro, reached only 68.8% accuracy. By contrast, a human study organized by the authors, with 20 well-educated volunteers, achieved an average accuracy of 90.8%!

The gap is stark, and it answers the question posed in the paper's title: current LLM judges are far from "human enough" to reliably evaluate role-play.

**How to Bridge the Gap? Strengthening "Reasoning" is Key, Not "Feeding" Character Knowledge**

Having identified the problem, how can it be solved? The paper's authors further explored two common improvement strategies: fine-tuning models on character-related data, and scaling up reasoning at test time.

Results were again surprising. Research found that fine-tuning models with character-related data not only failed to improve their role identification capabilities but might actually cause performance degradation. This could be because rote memorization of character knowledge interferes with the model's more fundamental, general reasoning abilities.

Meanwhile, test-time computation shows greater potential, and models built for reasoning hold a clear advantage: reasoning-optimized models such as DeepSeek-R1 and QwQ-32B ranked at the top of the benchmark.

This indicates that creating good "AI judges" depends not on instilling more character knowledge, but on strengthening the model's own robust, context-aware reasoning.
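
One simple way to vary how much explicit reasoning a judge performs at test time is through prompting alone. The sketch below is a generic, hypothetical illustration of that idea (a "reason first, answer last" instruction plus a small answer parser); it is not the paper's procedure, and the instruction wording is an assumption.

```python
# Illustrative only: a generic "reason first, answer last" test-time prompting
# pattern; not the paper's specific procedure.

def add_reasoning_instruction(base_prompt: str) -> str:
    """Ask the judge to reason about conversational context before naming the speaker."""
    return (
        base_prompt
        + "\n\nFirst, briefly reason about the conversational context and intent"
        " (who is being addressed, what each speaker knows and wants)."
        " Then state the chosen candidate's name alone on the final line."
    )


def parse_final_name(reply: str) -> str:
    """Take the last non-empty line of the judge's reply as its answer."""
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    return lines[-1] if lines else ""
```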

**Conclusion**

This paper reveals serious flaws in the currently popular "LLM-as-a-judge" evaluation paradigm in a fundamental yet overlooked dimension. This research not only provides us with a valuable evaluation tool but also prompts us to reconsider how to build AI systems truly aligned with human values and judgment capabilities.

Future research might delve deeper into analyzing the "thinking pathways" behind models' incorrect judgments, thereby developing more effective, reasoning-oriented improvement methods.

PersonaEval is a step toward that goal. Ultimately, we hope AI can not only "portray" humans but truly "understand" how humans interact.

**Author Information**

The paper's first author is Zhou Lingfeng, a doctoral student at Shanghai Jiao Tong University whose research focuses on large-model agents and AI-enabled social science. The corresponding author is Wang Dequan, a tenure-track assistant professor and doctoral supervisor at Shanghai Jiao Tong University. He received his undergraduate degree from Fudan University and his doctorate from UC Berkeley under Professor Trevor Darrell. His papers have received over 12,000 citations on Google Scholar over the past five years, with an h-index of 22.

Project link: https://github.com/maple-zhou/PersonaEval

Paper address: https://arxiv.org/abs/2508.10014

