AURORA

Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification

Anonymous for ICML 2025 Submission
UniversalBench will be made available as open source following acceptance.

Abstract

The reasoning capabilities of advanced large language models (LLMs) like o1 have revolutionized artificial intelligence applications. Nevertheless, evaluating and optimizing complex reasoning processes remain significant challenges due to diverse policy distributions and the inherent limitations of human effort and accuracy. In this paper, we present AURORA, a novel automated framework for training universal process reward models (PRMs) using ensemble prompting and reverse verification. The framework employs a two-phase approach: first, it uses diverse prompting strategies and ensemble methods to perform automated annotation and evaluation of reasoning processes, ensuring robust assessments for reward learning; second, it leverages practical reference answers for reverse verification, enhancing the model's ability to validate outputs and improving training accuracy. To assess the framework's performance, we extend beyond the existing ProcessBench benchmark by introducing UniversalBench, which evaluates reward predictions across full trajectories under diverse policy distributions with long Chain-of-Thought (CoT) outputs. Experimental results demonstrate that AURORA improves process evaluation accuracy and strengthens PRM accuracy on diverse policy distributions and long-CoT responses.
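
The sketch below illustrates the two-phase idea at a high level. It is a minimal sketch only: the judge callable, the prompt templates, and the majority-vote and yes/no parsing rules are illustrative assumptions, not the paper's actual prompts or aggregation scheme.

from collections import Counter
from typing import Callable, List

# Hypothetical prompt variants for the ensemble; the real templates differ.
PROMPT_TEMPLATES = [
    "Judge whether step {k} is logically valid.\nQuestion: {question}\nSteps so far:\n{steps}",
    "Check every calculation in step {k}.\nQuestion: {question}\nSteps so far:\n{steps}",
    "Does step {k} follow from the preceding steps?\nQuestion: {question}\nSteps so far:\n{steps}",
]

def ensemble_step_label(judge: Callable[[str], str],
                        question: str, steps: List[str], k: int) -> int:
    """Phase 1 (sketch): ask several differently phrased judges about step k
    and majority-vote their 'correct' / 'incorrect' verdicts."""
    votes = []
    for template in PROMPT_TEMPLATES:
        prompt = template.format(k=k, question=question,
                                 steps="\n".join(steps[: k + 1]))
        votes.append(judge(prompt).strip().lower())
    return int(Counter(votes).most_common(1)[0][0] == "correct")

def reverse_verify(judge: Callable[[str], str],
                   question: str, steps: List[str], reference_answer: str) -> bool:
    """Phase 2 (sketch): compare the trajectory's final answer against a
    reference answer; the outcome can be used to filter or correct labels."""
    prompt = ("Question: " + question
              + "\nProposed solution:\n" + "\n".join(steps)
              + "\nReference answer: " + reference_answer
              + "\nDoes the final answer match the reference? Answer yes or no.")
    return judge(prompt).strip().lower().startswith("yes")

Here, judge stands for any call to an LLM that returns a short verdict string; wiring it to a concrete API is left to the reader.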

Overall Workflow of AURORA

The overall workflow of AURORA.

Experiment

We evaluate the performance of the universal PRM trained using our proposed AURORA. First, we assess the effectiveness of our framework on ProcessBench, a benchmark specifically designed to evaluate generation processes using human-annotated labels. Next, we investigate the universal capabilities of the trained PRM across diverse policy distributions. To facilitate this evaluation, we construct a novel dataset called UniversalBench that spans a wide range of policy distributions, varying in both sequence length and step separation.
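
As a rough illustration of how a step-level PRM can be scored against ProcessBench-style annotations, the sketch below predicts the first erroneous step as the earliest step whose score falls below a threshold. The field names, the 0.5 threshold, and the plain-accuracy metric are assumptions for illustration; the benchmark's official scoring may differ.

from typing import Callable, List

def first_error_step(step_scores: List[float], threshold: float = 0.5) -> int:
    """Return the index of the first step scored below the threshold,
    or -1 if every step is judged correct."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1

def evaluate(examples, prm_score_fn: Callable[[List[str]], List[float]],
             threshold: float = 0.5) -> float:
    """examples: iterable of dicts with 'steps' and 'label', where 'label' is the
    gold first-error index (-1 for fully correct solutions). prm_score_fn returns
    one score per step. Returns the fraction of examples labeled correctly."""
    total, correct = 0, 0
    for ex in examples:
        scores = prm_score_fn(ex["steps"])
        correct += int(first_error_step(scores, threshold) == ex["label"])
        total += 1
    return correct / max(total, 1)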

Experimental results on ProcessBench, shown in Table 1, demonstrate that Universal-PRM-7B, trained with our proposed AURORA, achieves superior performance, excelling in both the overall average score and the evaluations on all four subsets. These results highlight the effectiveness of our approach in generalizing under a universal policy training distribution and underscore the robustness of our proposed ensemble prompting techniques.

Experimental results on UniversalBench, shown in Table 2, demonstrate that the PRM trained under our proposed AURORA framework achieves superior performance, highlighting its strong generalization across diverse policy distributions. AURORA addresses the challenges posed by UniversalBench by constructing \(\mathcal{D}_{\text{gen}}\) from diverse policy and prompt distributions, in particular including long CoT reasoning. This indicates the robustness of our approach in capturing and adapting to a wide range of policy behaviors, allowing it to outperform existing methods in accuracy and making it applicable to real-world scenarios where the policy distribution shifts dynamically during training.
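
A minimal sketch of what assembling \(\mathcal{D}_{\text{gen}}\) from heterogeneous policies might look like is given below; the policy objects, their generate method, and the uniform sampling scheme are hypothetical stand-ins for the actual data-collection pipeline.

import random

def build_d_gen(questions, policies, prompt_styles, samples_per_question=4):
    """Sample trajectories from several policies and prompt styles so that the
    resulting set varies in sequence length and step separation, including long CoT."""
    d_gen = []
    for question in questions:
        for _ in range(samples_per_question):
            policy = random.choice(policies)       # e.g. short-CoT vs. long-CoT models
            style = random.choice(prompt_styles)   # e.g. different step-separation formats
            trajectory = policy.generate(style.format(question=question))
            d_gen.append({"question": question,
                          "trajectory": trajectory,
                          "policy": getattr(policy, "name", "unknown")})
    return d_gen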

Conclusion

In this paper, we introduce a novel framework, AURORA, designed for automated process reward labeling and learning with LLMs. Unlike recent approaches \cite{zheng2024processbench, lightman2023let} that operate on limited data distributions, rely solely on questions and partial solutions, or focus only on the first error occurrence, AURORA aims to train universal PRMs by addressing these limitations. Specifically, AURORA collects candidate reasoning trajectories from diverse policy distributions, evaluates process rewards across the entire reasoning sequence to support downstream RL algorithms, and incorporates reverse verification and ensemble prompting techniques to further enhance performance. To comprehensively evaluate our approach, we curate a new benchmark, UniversalBench, which covers a wide range of policy distributions and, in particular, contains long CoT policy outputs that closely mirror real-world scenarios in which PRMs are used to optimize long-CoT policies. Experiments on both ProcessBench and UniversalBench demonstrate that Universal-PRM-7B, trained using AURORA, achieves SOTA performance. We will open-source Universal-PRM-7B and UniversalBench to encourage community adoption and further research.

BibTeX


@misc{auroraprm2025,
  author       = {AURORA},
  title        = {AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification},
  year         = {2025},
  url          = {https://auroraprm.github.io/}
}