Aligning large models with human intent remains a fundamental challenge in AI. Mainstream Reinforcement Learning from Human Feedback (RLHF) methods are effective, but they suffer from a critical flaw: reward over-optimization, often called the “Achilles' heel” of large model alignment. In simple terms, models learn to “game the system”: rather than genuinely improving, they figure out how to score highly on the reward model, and the quality of their actual output can decline. The situation resembles students memorizing answers verbatim to pass examinations without truly understanding the underlying material.
Recent research from Scale AI directly addresses this pain point by revealing the theoretical roots of the problem and proposing innovative solutions. The code and data are publicly available:
- Code: https://github.com/Jun-Kai-Zhang/rubrics
- Data: https://huggingface.co/datasets/JunkaiZ/Rubrics
Theoretical Breakthrough: Focus on the High-Score Region

A research team from Scale AI, UCLA, and the University of Chicago has provided a clear theoretical answer for the first time: the root cause of reward over-optimization is the reward model's inaccuracy in the high-score region. Accuracy in high-reward regions is what matters; when the reward model's scores are wrong for high-scoring responses, model performance can collapse dramatically during training, while errors in low-score regions have minimal impact. Remarkably, accurately ranking just the top two responses can suffice: if the top 10% of high-quality answers are ordered correctly, model performance approaches that of a perfect reward model. In other words, we do not need to assess every response accurately; distinguishing “good” from “excellent” is all that is required.
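To see why this matters, here is a minimal best-of-n simulation (our own illustration, not an experiment from the paper): a proxy reward that misorders responses only within the low-score region still picks the truly best candidate, while one that misorders the high-score region does not.

```python
# Minimal simulation (our own, not from the paper) of why reward-model errors
# in the high-score region matter most. The proxy reward misorders responses
# either within the top half or within the bottom half of each prompt's
# candidates; best-of-n selection is then run against that proxy.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_candidates = 2000, 16
true_quality = rng.normal(size=(n_prompts, n_candidates))

def pick_with_misranking(quality, region):
    """Best-of-n pick when the proxy shuffles ranks inside one region."""
    picks = np.empty(len(quality), dtype=int)
    for p, q in enumerate(quality):
        order = np.argsort(q)              # candidate indices, worst -> best
        half = len(q) // 2
        low, high = order[:half], order[half:]
        if region == "high":
            rng.shuffle(high)              # proxy confuses the good responses
        else:
            rng.shuffle(low)               # proxy confuses the bad responses
        proxy_order = np.concatenate([low, high])
        picks[p] = proxy_order[-1]         # response the proxy ranks highest
    return picks

for region in ("low", "high"):
    picks = pick_with_misranking(true_quality, region)
    mean_q = true_quality[np.arange(n_prompts), picks].mean()
    print(f"misranking in {region}-score region -> true quality of picks: {mean_q:.3f}")
```

Misranking the bottom half leaves the selected response untouched, whereas misranking the top half turns selection into a coin flip among the strong candidates, which is exactly the failure mode the theory attributes to reward over-optimization.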
Innovative Method: Capturing “Excellence” with Scoring Criteria

While the theory is clear, a new question arises: how do we obtain high-quality samples to train the reward model? There is a paradox here. Sampling from the base model is too inefficient, since high-scoring samples are scarce; generating them with a more advanced model can introduce distribution shift, so the reward model may learn superficial features instead of genuine quality. The research team proposes a rubric-based solution. A rubric is a set of explicit criteria for judging answer quality, each with a corresponding weight. In medical diagnosis, for instance, high-weight criteria might include “identifies the correct disease” and “states the urgency,” while low-weight criteria might cover “mentions treatment options.”
The core advantage of a rubric is that it decomposes scoring into verifiable standards: each criterion is a binary judgment (met or not met), and the final score is a weighted average over the criteria that are met. More importantly, a rubric is inherently distribution-invariant: it evaluates the quality characteristics of a response rather than its generation source.
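As a concrete sketch of this idea (our own, with illustrative criteria and a placeholder judge rather than the paper's exact setup), a rubric score can be computed as the weighted fraction of binary criteria a response satisfies:

```python
# Sketch of rubric scoring as described in the article: each criterion is a
# binary check with a weight, and the reward is the weighted average over the
# criteria that are met. The criteria, weights, and the `judge` callable are
# illustrative placeholders, not the paper's exact implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str   # e.g. "identifies the correct disease"
    weight: float      # high-weight items capture the most important qualities

def rubric_score(response: str,
                 criteria: list[Criterion],
                 judge: Callable[[str, str], bool]) -> float:
    """Weighted fraction of criteria the response satisfies (in [0, 1])."""
    total = sum(c.weight for c in criteria)
    met = sum(c.weight for c in criteria if judge(response, c.description))
    return met / total if total > 0 else 0.0

# Example rubric mirroring the medical illustration in the text.
rubric = [
    Criterion("mentions the correct diagnosis", weight=3.0),
    Criterion("states the urgency of the condition", weight=2.0),
    Criterion("mentions treatment options", weight=1.0),
]
```

Because each check asks only whether a quality characteristic is present in the text, the same rubric can score responses from any model without caring where they came from, which is the distribution-invariance property noted above.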
Two Key Principles: How to Construct an Effective Rubric

To capture discrepancies in the high-score region, the research team proposes two principles:
- Principle 1: Distinguish between “good” and “excellent.” Compare two strong responses, identify their subtle differences, and encode those differences as new scoring criteria.
- Principle 2: Look for differences among diverse high-quality responses. Expand the candidate pool by sampling from 16 top models, ensuring a variety of excellent response patterns.
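A hedged sketch of how these two principles might be combined into a single refinement step is shown below; the `sample_response`, `score`, and `propose_criteria` callables are placeholders for model calls, not the paper's actual API.

```python
# Illustrative rubric-refinement step built around the two principles.
# All callables are assumed to be supplied by the caller (e.g. wrappers
# around LLM calls); they are assumptions, not the paper's implementation.
from typing import Callable

def refine_rubric(prompt: str,
                  models: list[str],
                  rubric: list[str],
                  sample_response: Callable[[str, str], str],
                  score: Callable[[list[str], str], float],
                  propose_criteria: Callable[[str, str, str], list[str]]) -> list[str]:
    # Principle 2: draw candidates from many strong models so the pool
    # contains diverse patterns of high-quality answers.
    candidates = [sample_response(model, prompt) for model in models]

    # Principle 1: rank with the current rubric and keep the best two, so
    # refinement targets the gap between "good" and "excellent".
    ranked = sorted(candidates, key=lambda r: score(rubric, r), reverse=True)
    best, runner_up = ranked[0], ranked[1]   # assumes at least two candidates

    # Encode the subtle differences between the two strongest responses
    # as new, verifiable criteria appended to the rubric.
    return rubric + propose_criteria(prompt, best, runner_up)
```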
Empirical Validation: Significantly Outperforming Baselines

The team ran extensive experiments in both general and medical domains, and rubrics refined with high-quality samples delivered significant gains. The win rate rose from 31.3% to 39.7%, and the HealthBench score in the medical domain improved from 0.3004 to 0.3513. The model trained with the initial rubric collapsed after 60 training steps, whereas the refined rubric delayed the collapse to 160 steps, nearly tripling the stable training horizon.
High-Reward Region Accuracy: A Marked Improvement

After the rubrics were refined, accuracy in the high-reward region improved substantially while accuracy in the low-reward region remained largely unchanged, exactly as the theory predicts.
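One way to make “accuracy in the high-reward region” concrete (our own framing, not necessarily the paper's exact metric) is pairwise ranking accuracy restricted to the top fraction of responses by true quality:

```python
# Illustrative metric: fraction of correctly ordered pairs among the
# responses whose true quality falls in the top `top_frac` of the pool.
import numpy as np
from itertools import combinations

def high_region_accuracy(true_quality, proxy_reward, top_frac=0.1):
    """Pairwise ranking accuracy of the proxy, restricted to the top region."""
    true_quality = np.asarray(true_quality, dtype=float)
    proxy_reward = np.asarray(proxy_reward, dtype=float)
    cutoff = np.quantile(true_quality, 1.0 - top_frac)
    idx = np.where(true_quality >= cutoff)[0]
    pairs = list(combinations(idx, 2))
    if not pairs:
        return float("nan")
    correct = sum(
        (true_quality[i] > true_quality[j]) == (proxy_reward[i] > proxy_reward[j])
        for i, j in pairs
    )
    return correct / len(pairs)
```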
Qualitative Leap: Excellent Samples Yield Deeper Improvements

The research team also analyzed the kinds of rubric enhancements produced by samples of different quality:
- Improvements driven by excellent samples: adding penalty terms to avoid gross errors, easing overly strict standards, and correcting errors or aligning expectations.
- Improvements driven by exceptional samples: decomposing complex standards into sub-criteria, strengthening verification and evidence requirements, and clearly delineating ranges, boundaries, and constraints, including risk analysis and safety considerations.
Take a medical case as an example: the initial rubric required only “mentions the correct diagnosis” and “states the urgency,” and both excellent responses satisfied these criteria. The refined rubric added standards such as “clearly indicates the need for urgent imaging (such as CT or MRI/MRV) to confirm the diagnosis,” successfully distinguishing the superior response. This is the qualitative leap: moving from superficial judgments to deep verification standards.
Industry Significance and Outlook

This research offers a fresh perspective on large model alignment. Theory now guides practice: it pins down the optimization direction for reward modeling, namely focusing on high-reward regions. The method is highly actionable, since rubrics are easy to implement and to explain, and it adapts well to specialized domains such as healthcare.
The study also acknowledges its current limitations: a simple weighted average may not be the optimal way to aggregate criterion scores. For large model practitioners, the takeaway is clear: do not strive for accuracy everywhere; focus on accurately distinguishing the top responses, because that is what alignment hinges on.