Entropy-Regularized Process Reward Model

1University of Illinois Urbana-Champaign, 2University of Toronto,
3NVIDIA, 4Princeton University, 5Salesforce Research
*Indicates Equal Contribution
All of our training code, model checkpoints, and paper will be released soon.

Abstract

Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning, often making systematic errors. A promising solution is reinforcement learning (RL) guided by reward models, particularly those focusing on process rewards, which score each intermediate step rather than solely evaluating the final outcome. This approach is more effective at guiding policy models towards correct reasoning trajectories. However, traditional process reward models diverge from typical RL practice and can limit model generalization. In this work, we propose an entropy-regularized process reward model (ER-PRM) that addresses these limitations by integrating KL-regularized Markov Decision Processes (MDP), balancing policy optimization against the need to prevent the policy from shifting too far from its initial distribution. We derive a novel reward construction method and introduce an automatic labeling mechanism based on the theoretical results. Our experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models, achieving a 1% improvement on GSM8K and a 2-3% improvement on MATH under best-of-N evaluation. These results highlight the efficacy of entropy regularization in enhancing LLMs' reasoning capabilities.

Method Overview: Process Reward Labeling

Illustration of ER-PRM, along with other baseline methods for constructing process reward data and outcome reward data. The key idea of ER-PRM is to calculate the process reward from sampled trajectories under entropy regularization.

Introduction

Complex mathematical reasoning with Large Language Models (LLMs) typically requires multiple reasoning steps before producing the final answer. This motivates the Process Reward Model (PRM), which assigns a score to each step in the reasoning process and thus provides more fine-grained feedback. PRMs have been shown to outperform the Outcome Reward Model (ORM), which only provides an overall score for the entire response. To obtain PRM training data, one prominent method, proposed by Math-Shepherd, automatically labels the process reward without human annotators. The main idea of Math-Shepherd is to interpret the score of each step as its potential to reach the correct final answer: sample multiple trajectories starting from the intermediate step, and use the number of correct trajectories as a proxy for this score.
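As a concrete illustration, the following is a minimal sketch of this automatic labeling idea. The helper functions `sample_completions` and `is_correct`, as well as the sampling budget `k`, are hypothetical placeholders rather than Math-Shepherd's exact implementation.

```python
def math_shepherd_step_label(prompt, partial_steps, sample_completions, is_correct, k=8):
    """Estimate the quality of an intermediate reasoning step via Monte Carlo rollouts.

    sample_completions(prompt, partial_steps, k) -> list of k completed reasoning chains
    is_correct(completion) -> True if the completion's final answer is correct
    """
    completions = sample_completions(prompt, partial_steps, k)
    num_correct = sum(is_correct(c) for c in completions)
    soft_label = num_correct / k              # fraction of rollouts reaching the correct answer
    hard_label = 1 if num_correct > 0 else 0  # did any rollout reach the correct answer?
    return soft_label, hard_label
```

The soft label keeps the empirical success rate, while the hard label only records whether any rollout succeeds; ER-PRM replaces this aggregation with the entropy-regularized formula derived in the Method section.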

Despite their effectiveness, previous automatic labeling strategies assume a traditional Markov Decision Process (MDP), which differs from the typical RL practice used with LLMs. Since the introduction of reinforcement learning from human feedback (RLHF), the standard approach has been to optimize an entropy-regularized reward relative to a reference model. The same applies to RL for reasoning tasks, where the problem is naturally formulated as an entropy-regularized MDP. In this project, we formulate the multi-step mathematical reasoning task under the entropy-regularized MDP framework, derive the mathematical principles of process reward construction, and propose practical algorithms.

Method

We consider training an LLM for reasoning. Given a prompt \(x\), the LLM produces an \(L\)-step reasoning chain \(a = [a^1,\dots,a^L]\). The reward \(r(a,x)\) indicates whether the result of the reasoning chain is correct. We want to find the LLM, denoted by the policy \(\pi_*(a|x)\), that minimizes the KL-regularized loss function over \(\pi\in \Pi\): \[ \mathcal{L}(\pi) = -\mathbb{E}_x\mathbb{E}_{a\sim\pi(\cdot|x)} \big[r(a,x) - \frac{1}{\eta}\ln{\frac{\pi(a|x)}{\pi_0(a|x)}}\big] \] where \(\pi_0\) is the initial policy, a pretrained LLM, and \(\pi\) is the model being fine-tuned.

The minimizer of \(\mathcal{L}\) is: \[ \pi_*(a|x) = \frac{\pi_0(a|x)e^{\eta r(a,x)}} {\mathbb{E}_{a\sim\pi_0}e^{\eta r(a,x)}} \propto {\pi_0(a|x)e^{\eta r(a,x)}} \]
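For completeness, here is a short standard derivation of this closed form. Writing \(Z(x) = \mathbb{E}_{a\sim\pi_0}e^{\eta r(a,x)}\) for the normalizer, the loss can be rewritten as \[ \mathcal{L}(\pi) = -\mathbb{E}_x\frac{1}{\eta}\,\mathbb{E}_{a\sim\pi(\cdot|x)}\Big[\ln{\frac{\pi_0(a|x)e^{\eta r(a,x)}}{\pi(a|x)}}\Big] = \mathbb{E}_x\frac{1}{\eta}\Big[\mathrm{KL}\big(\pi(\cdot|x)\,\|\,\pi_*(\cdot|x)\big) - \ln Z(x)\Big] \] so \(\mathcal{L}\) is minimized exactly when the KL term vanishes, i.e. at \(\pi = \pi_*\).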

Let \(a^{[l]} = [a^1,\dots,a^l]\) be the partial reasoning up to step \(l\), and \(a^{-[l]} = [a^{l+1},\dots,a^L]\) be the completion of the partial reasoning from step \(l+1\). We define the intermediate reward by \( e^{\eta r(a^{[l]},x)} = \mathbb{E}_{a^{-[l]}\sim\pi_0}e^{\eta r(a,x)} \). Thus, we have: \[ \pi_*(a^{-[l]}|x, a^{[l]}) = \frac{\pi_0(a^{-[l]}|x, a^{[l]})e^{\eta r(a,x)}}{\mathbb{E}_{a^{-[l]}\sim\pi_0(\cdot|x,a^{[l]})}e^{\eta r(a,x)}} \]

And we obtain: \[ \pi_*(a^{[l]}|x) =\frac{\pi_0(a^{[l]}|x)e^{\eta r(a^{[l]},x)}}{\mathbb{E}_{a^{[l]}\sim\pi_0}e^{\eta r(a^{[l]},x)}} \] \[ r(a^{[l]}, x) = \frac{1}{\eta} \ln \mathbb{E}_{a^{-[l]}\sim\pi_0}e^{\eta r(a,x)} \tag{1} \] The partial reward given by this formula is our definition of the entropy-regularized process reward. We can then optimize the partial reasoning policy \(\pi(a^{[l]}|x)\) using this process reward.
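In practice, the expectation in (1) can be estimated by Monte Carlo rollouts from \(\pi_0\). Below is a minimal sketch under the assumption that each sampled completion yields a scalar final reward (for correctness rewards, simply 0 or 1); the function name and the default \(\eta\) are illustrative choices, not the paper's exact configuration.

```python
import math

def er_process_reward(rewards, eta=1.0):
    """Monte Carlo estimate of the entropy-regularized process reward (Eq. 1):
        r(a^{[l]}, x) ~ (1/eta) * log( (1/K) * sum_k exp(eta * r_k) ),
    where r_k is the final reward of the k-th completion sampled from pi_0.
    A log-sum-exp formulation is used for numerical stability.
    """
    k = len(rewards)
    m = max(eta * r for r in rewards)
    log_mean_exp = m + math.log(sum(math.exp(eta * r - m) for r in rewards)) - math.log(k)
    return log_mean_exp / eta

# Example with binary correctness rewards, 4 of 8 rollouts correct:
# the plain average (soft label) is 0.5, while the entropy-regularized reward
# tilts toward the better outcomes as eta grows (mean as eta -> 0, max as eta -> inf).
print(er_process_reward([1, 1, 1, 1, 0, 0, 0, 0], eta=1.0))  # ~ 0.62
```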

An important property of this formulation is that our process reward can be computed by using the reference policy \(\pi_0\) to generate the completions. In contrast, in traditional RL the reward depends on the optimal policy that generates the completion, so the reward and the policy must be learned simultaneously. This is not necessary in our entropy-regularized approach. As one can see, equation (1) employs soft optimism over paths generated by the reference policy.

More generally, the reward can be computed using any policy, including the optimal policy. In that case, the equivalent formula becomes: \[ r(a^{[l]}, x) = -\frac{1}{\eta} \ln{\mathbb{E}}_{a^{-[l]}\sim\pi_*}e^{-\eta r(a,x)} \tag{2} \] where we have used the following fact to express (1) in terms of \(\pi_*\): \[ \pi_0(a^{-[l]}|x, a^{[l]}) = \frac{\pi_*(a^{-[l]}|x, a^{[l]})e^{-\eta r(a,x)}}{\mathbb{E}_{a^{-[l]}\sim\pi_*}e^{-\eta r(a,x)}} \] Intuitively, (2) implements soft pessimism over paths generated by the optimal policy, and the resulting reward is equivalent to the reward computed using the reference policy in (1). As pointed out above, the fact that the entropy-regularized process reward can be computed using the reference policy is a key advantage of the entropy-regularization approach. In this project, we use (1) to compute our process rewards and demonstrate their advantage over previous approaches empirically.
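For completeness, substituting this change-of-measure identity into the expectation in (1) verifies the equivalence directly: \[ \mathbb{E}_{a^{-[l]}\sim\pi_0(\cdot|x,a^{[l]})}e^{\eta r(a,x)} = \mathbb{E}_{a^{-[l]}\sim\pi_*(\cdot|x,a^{[l]})}\Big[\frac{e^{-\eta r(a,x)}\,e^{\eta r(a,x)}}{\mathbb{E}_{a^{-[l]}\sim\pi_*}e^{-\eta r(a,x)}}\Big] = \frac{1}{\mathbb{E}_{a^{-[l]}\sim\pi_*}e^{-\eta r(a,x)}} \] and taking \(\frac{1}{\eta}\ln\) of both sides turns (1) into (2).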

Experiment Results

Best-of-N evaluation results on the GSM8K and MATH500 datasets with Mistral-MetaMath-7b as the generator. The Hard-label PRM uses the same setting as Math-Shepherd.

Method (generator: Mistral-MetaMath-7b)    GSM8K    MATH
Baseline                                    77.9     28.6
+ Soft-label PRM                            78.7     30.1
+ Hard-label PRM                            79.0     32.0
+ ER-PRM (Ours)                             79.5     32.8
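For reference, a minimal sketch of the best-of-N procedure used for this evaluation is shown below. The helpers `generate`, `prm_step_scores`, and `extract_answer`, the sampling budget `n`, and the use of the minimum step score as the aggregate are illustrative assumptions, not necessarily the exact setting behind the numbers above.

```python
def best_of_n(prompt, generate, prm_step_scores, extract_answer, n=64):
    """Sample n candidate solutions, score each with the PRM, keep the highest-scoring one.

    generate(prompt, n)        -> list of n candidate reasoning chains
    prm_step_scores(prompt, c) -> list of per-step PRM scores for candidate c
    extract_answer(c)          -> final answer parsed from candidate c
    """
    candidates = generate(prompt, n)
    # Aggregate per-step scores into one score per candidate (minimum over steps here).
    best = max(candidates, key=lambda c: min(prm_step_scores(prompt, c)))
    return extract_answer(best)
```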

Zero-shot CoT evaluation of the policy model Mistral-MetaMath-7b after improvement via Rejection Sampling Fine-tuning (RAFT).
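At a high level, one RAFT round can be sketched as follows. The helpers `generate`, `score`, and `finetune`, the per-prompt sampling budget, and the keep-top-one selection rule are assumptions about a generic rejection sampling fine-tuning loop rather than the exact recipe used here.

```python
def raft_round(prompts, generate, score, finetune, n=8):
    """One round of rejection sampling fine-tuning (RAFT): sample n responses per prompt,
    keep the highest-reward response, and fine-tune the policy on the kept pairs.

    generate(p, n) -> list of n sampled responses for prompt p
    score(p, r)    -> scalar reward for response r (e.g., an aggregate of ER-PRM step scores)
    finetune(data) -> policy fine-tuned on the (prompt, response) pairs in data
    """
    kept = []
    for p in prompts:
        responses = generate(p, n)
        best = max(responses, key=lambda r: score(p, r))
        kept.append((p, best))
    return finetune(kept)
```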

Acknowledgements

We would like to thank Wei Xiong for his support for the organization of the project and his participation in the discussions, which helped improve this work.

BibTeX

If you find our work or code useful, please consider citing our project:

@misc{zhang2024entropyregularizedprocessrewardmodel,
      title={Entropy-Regularized Process Reward Model}, 
      author={Hanning Zhang and Pengcheng Wang and Shizhe Diao and Yong Lin and Rui Pan and Hanze Dong and Dylan Zhang and Pavlo Molchanov and Tong Zhang},
      year={2024},
      eprint={2412.11006},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.11006}, 
}