AURORA

Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification

Xiaoyu Tan1,*,‡, Tianchu Yao2,*, Chao Qu2,*, Bin Li3,*, Minghao Yang2, Dakuan Lu2, Haozhe Wang4, Yinghui Xu2, Xihe Qiu3,†
1Tencent Youtu Lab, 2Fudan University, 3Shanghai University of Engineering Science, 4The Hong Kong University of Science and Technology
*Equal contribution. †Corresponding author: qiuxihe1993@gmail.com
‡The first author is currently affiliated with Tencent Youtu Lab. This work was partially conducted while the author was at INF Technology (Shanghai) Co., Ltd.

Abstract

The reasoning capabilities of advanced large language models (LLMs) like o1 have revolutionized artificial intelligence applications. Nevertheless, evaluating and optimizing complex reasoning processes remain significant challenges due to diverse policy distributions and the inherent limitations of human effort and accuracy. In this paper, we present AURORA, a novel automated framework for training universal process reward models (PRMs) using ensemble prompting and reverse verification. The framework employs a two-phase approach: First, it uses diverse prompting strategies and ensemble methods to perform automated annotation and evaluation of processes, ensuring robust assessments for reward learning. Second, it leverages practical reference answers for reverse verification, enhancing the model's ability to validate outputs and improving training accuracy. To assess the framework's performance, we extend beyond the existing ProcessBench benchmark by introducing UniversalBench, which evaluates reward predictions across full trajectories under diverse policy distributions with long Chain-of-Thought (CoT) outputs. Experimental results demonstrate that AURORA improves process evaluation accuracy and strengthens PRM performance on diverse policy distributions and long-CoT responses.

Overall Workflow of AURORA

The overall workflow of AURORA.
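To make the two-phase procedure concrete, the following is a minimal Python sketch of the labeling loop. It assumes a generic chat-completion helper complete(system_prompt, user_prompt) and illustrative judging prompts; the exact prompt ensemble, voting rule, and reward scheme used by AURORA are simplified here.

from collections import Counter
from typing import Callable, List

# Illustrative stand-ins for the judging prompt ensemble (not the paper's exact prompts).
JUDGE_PROMPTS: List[str] = [
    "Decide whether the given reasoning step is logically valid. Answer 'correct' or 'incorrect'.",
    "Verify the arithmetic and logic of the given step. Answer 'correct' or 'incorrect'.",
    "Does the given step follow from the problem and the previous steps? Answer 'correct' or 'incorrect'.",
]

LLM = Callable[[str, str], str]  # (system_prompt, user_prompt) -> completion text

def ensemble_judge(complete: LLM, question: str, prior: List[str], step: str) -> bool:
    """Phase 1: judge one step under several prompt variants and majority-vote."""
    user = f"Problem: {question}\nPrevious steps: {' '.join(prior)}\nStep to check: {step}"
    votes = Counter(complete(p, user).strip().lower().startswith("correct")
                    for p in JUDGE_PROMPTS)
    return votes[True] > votes[False]

def reverse_verify(complete: LLM, question: str, steps: List[str], reference: str) -> bool:
    """Phase 2: check the trajectory's final result against a reference answer."""
    system = ("Given a reference answer, decide whether the candidate solution "
              "reaches the same result. Answer 'yes' or 'no'.")
    user = (f"Problem: {question}\nCandidate solution: {' '.join(steps)}\n"
            f"Reference answer: {reference}")
    return complete(system, user).strip().lower().startswith("yes")

def label_trajectory(complete: LLM, question: str, steps: List[str],
                     reference: str) -> List[float]:
    """Combine both phases into per-step reward labels for PRM training."""
    verified = reverse_verify(complete, question, steps, reference)
    labels = []
    for i, step in enumerate(steps):
        ok = ensemble_judge(complete, question, steps[:i], step)
        # A final answer that fails reverse verification cannot leave the
        # last step labeled correct, whatever the ensemble vote says.
        if i == len(steps) - 1 and not verified:
            ok = False
        labels.append(1.0 if ok else 0.0)
    return labels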

Experiment

We evaluate the performance of the universal PRM trained using our proposed AURORA. First, we assess the effectiveness of our framework on ProcessBench, a benchmark specifically designed to evaluate generation processes using human-annotated labels. Next, we investigate the universal capabilities of the trained PRM across diverse policy distributions. To facilitate this evaluation, we construct a novel dataset called UniversalBench that spans a wide range of policy distributions, varying in both sequence length and step separation.
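For reference, the sketch below shows one way such benchmarks can be scored: a ProcessBench-style metric that matches the predicted earliest erroneous step against a label (None for fully correct trajectories), and a UniversalBench-style metric that scores every step of the trajectory. The thresholding and aggregation here are illustrative assumptions, not the benchmarks' official scoring code.

from typing import List, Optional

def first_error(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step a PRM scores below threshold; None if all pass."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

def first_error_accuracy(preds: List[List[float]],
                         gold: List[Optional[int]]) -> float:
    """ProcessBench-style: match the predicted earliest error to the label."""
    return sum(first_error(p) == g for p, g in zip(preds, gold)) / len(preds)

def stepwise_accuracy(preds: List[List[float]],
                      gold: List[List[int]], threshold: float = 0.5) -> float:
    """Full-trajectory style: score every step, not only the first error."""
    correct = total = 0
    for scores, labels in zip(preds, gold):
        for score, label in zip(scores, labels):
            correct += int((score >= threshold) == bool(label))
            total += 1
    return correct / total

# Example: the first trajectory errs at step 1; the second is fully correct.
preds = [[0.9, 0.2, 0.8], [0.9, 0.8, 0.7]]
print(first_error_accuracy(preds, [1, None]))            # 1.0
print(stepwise_accuracy(preds, [[1, 0, 1], [1, 1, 1]]))  # 1.0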

Experimental results on ProcessBench, shown in Table 1, demonstrate that Universal-PRM-7B, trained using our proposed AURORA framework, achieves superior performance, excelling in both the overall average score and the evaluations across all four subsets. These results highlight the effectiveness of our approach in generalizing under a universal policy training distribution and underscore the robustness of our proposed ensemble prompting techniques.

Experimental results on UniversalBench, shown in Table 2, demonstrate that the PRM trained under our proposed AURORA framework achieves superior performance, highlighting its strong generalization across diverse policy distributions. AURORA addresses the challenges posed by UniversalBench by constructing \(\mathcal{D}_{\text{gen}}\) from diverse policy and prompt distributions, notably including long-CoT reasoning (a construction sketched below). This indicates the robustness of our approach in capturing and adapting to a wide range of policy behaviors; it outperforms existing methods in accuracy and remains applicable to real-world scenarios where the policy distribution shifts as the policy is updated.
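A minimal sketch of how such a generation pool can be assembled, assuming callable policies and using a subset of the system and question prompts listed under Prompt Details below; the actual sampling procedure and mixture weights in AURORA are not reproduced here.

import itertools
from typing import Callable, Dict, List

# Subset of the system prompts from Prompt Details (INF-o1 omitted for brevity).
SYSTEM_PROMPTS: Dict[str, str] = {
    "default": "You are a helpful assistant.",
    "qwq": ("You are a helpful and harmless assistant. You are Qwen developed "
            "by Alibaba. You should think step-by-step."),
}

# Question prompts p0, p1, p2 from Prompt Details.
QUESTION_TEMPLATES: List[str] = [
    "{question}",
    "{question}\nLet's think step by step.",
    "{question}\nFirst, deeply analyze the problem and identify key concepts "
    "and relationships, then solve it step by step with clear reasoning.",
]

Policy = Callable[[str, str], str]  # (system_prompt, user_prompt) -> trajectory

def build_generation_pool(policies: Dict[str, Policy],
                          questions: List[str],
                          samples_per_combo: int = 1) -> List[dict]:
    """Collect candidate trajectories over the cross product of policies,
    system prompts, and question templates."""
    pool = []
    combos = itertools.product(policies.items(), SYSTEM_PROMPTS.items(),
                               QUESTION_TEMPLATES, questions)
    for (pname, policy), (sname, system), template, q in combos:
        for _ in range(samples_per_combo):
            pool.append({
                "policy": pname,
                "system_prompt": sname,
                "question": q,
                "trajectory": policy(system, template.format(question=q)),
            })
    return pool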

Conclusion

In this paper, we introduce a novel framework, AURORA, designed for automated process reward labeling and learning using LLMs. Unlike recent approaches \cite{zheng2024processbench, lightman2023let} that operate on limited data distributions, rely solely on questions and partial solutions, or focus only on the first error occurrence, AURORA aims to train universal PRMs by addressing these limitations. Specifically, AURORA collects candidate reasoning trajectories from diverse policy distributions, evaluates process rewards across the entire reasoning sequence to support downstream RL algorithms, and incorporates reverse verification and ensemble prompting techniques to further enhance performance. To comprehensively evaluate our approach, we curate a new benchmark, UniversalBench, which captures a wide range of policy distributions and, in particular, contains long-CoT policy outputs that closely mirror real-world PRM usage in optimizing long-CoT policies. Experiments on both ProcessBench and UniversalBench demonstrate that Universal-PRM-7B, trained using AURORA, achieves state-of-the-art (SOTA) performance. We have open-sourced Universal-PRM-7B and UniversalBench to encourage community adoption and further research.

Prompt Details

Default system prompt

[System]:
You are a helpful assistant.

QwQ system prompt

[System]:
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.

INF-o1 \(\pi_0\) system prompt

[System]:
You are an advanced AI language model specializing in solving math and programming problems step by step. Carefully analyze each part of the problem, verify the accuracy of your reasoning with relevant facts and data, and provide clear, logical solutions. Reflect on and review your approach throughout the problem-solving process to ensure precision and thoroughness. Always think through the problem step by step and provide your answers accordingly.

Question prompt \(p_{0}\)

[User]:
{question}

Question prompt \(p_{1}\)

[User]:
{question}
Let's think step by step.

Question prompt \(p_{2}\)

[User]:
{question}
First, deeply analyze the problem and identify key concepts and relationships, then solve it step by step with clear reasoning.
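For completeness, a small helper sketching how one system prompt and one question template pair into a chat request; the role/content message format is an assumption about the serving stack, not something specified above.

def build_messages(system_prompt: str, question_template: str, question: str) -> list:
    """Pair one system prompt with one question template, as in the ensemble above."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question_template.format(question=question)},
    ]

# Example: the default system prompt combined with question prompt p1.
msgs = build_messages(
    "You are a helpful assistant.",
    "{question}\nLet's think step by step.",
    "What is 17 * 24?",
)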

BibTeX


@misc{auroraprm2025,
  author       = {Tan, Xiaoyu and Yao, Tianchu and Qu, Chao and Li, Bin and Yang, Minghao and Lu, Dakuan and Wang, Haozhe and Xu, Yinghui and Qiu, Xihe},
  title        = {AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification},
  year         = {2025},
  url          = {https://auroraprm.github.io/}
}