We consider training an LLM for reasoning.
Given a prompt \(x\), the LLM produces an \(L\)-step reasoning chain \(a = [a^1,\dots,a^L]\).
The reward \(r(a,x)\) indicates whether the result of the reasoning chain is correct or not.
We want to find the LLM, denoted by a policy \(\pi_*(a|x)\), that minimizes the following KL-regularized loss over \(\pi\in \Pi\):
\[
\mathcal{L}(\pi) = -\mathbb{E}_x\mathbb{E}_{a\sim\pi(\cdot|x)}
\big[r(a,x) - \frac{1}{\eta}\ln{\frac{\pi(a|x)}{\pi_0(a|x)}}\big]
\]
where \(\pi_0\) is the initial policy model, a pretrained LLM, and \(\pi\) is the model being fine-tuned.
The minimizer for \(\mathcal{L}\) is:
\[
\pi_*(a|x) =
\frac{\pi_0(a|x)e^{\eta r(a,x)}}
{\mathbb{E}_{a\sim\pi_0(\cdot|x)}e^{\eta r(a,x)}}
\propto
{\pi_0(a|x)e^{\eta r(a,x)}}
\]
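To see this, write \(Z(x)=\mathbb{E}_{a\sim\pi_0(\cdot|x)}e^{\eta r(a,x)}\) and note that the loss differs from a KL divergence to \(\pi_*\) only by a term that does not depend on \(\pi\):
\[
\mathcal{L}(\pi)
= \frac{1}{\eta}\mathbb{E}_x\mathbb{E}_{a\sim\pi(\cdot|x)}
\Big[\ln\frac{\pi(a|x)}{\pi_0(a|x)e^{\eta r(a,x)}}\Big]
= \frac{1}{\eta}\mathbb{E}_x\Big[\mathrm{KL}\big(\pi(\cdot|x)\,\big\|\,\pi_*(\cdot|x)\big)-\ln Z(x)\Big],
\]
so \(\mathcal{L}\) is minimized exactly at \(\pi=\pi_*\).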
Let \(a^{[l]} = [a^1,\dots,a^l]\) be the partial reasoning up to step \(l\),
and \(a^{-[l]} = [a^{l+1},\dots,a^L]\)
be the completion of that partial reasoning from step \(l+1\) onward.
We define the intermediate reward \(r(a^{[l]},x)\) by
\(
e^{\eta r(a^{[l]},x)} = \mathbb{E}_{a^{-[l]}\sim\pi_0(\cdot|x,a^{[l]})}e^{\eta r(a,x)}
\). Thus, we have:
\[
\pi_*(a^{-[l]}|x, a^{[l]}) = \frac{\pi_0(a^{-[l]}|x, a^{[l]})e^{\eta r(a,x)}}{\mathbb{E}_{a^{-[l]}\sim\pi_0(\cdot|x,a^{[l]})}e^{\eta r(a,x)}}
\]
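This conditional form follows from the chain rule and the closed form of \(\pi_*\): with \(a^{[l]}\) fixed, every factor that does not depend on \(a^{-[l]}\) cancels into the normalizer,
\[
\pi_*(a^{-[l]}|x,a^{[l]})
= \frac{\pi_*(a|x)}{\pi_*(a^{[l]}|x)}
\propto \pi_0(a^{[l]}|x)\,\pi_0(a^{-[l]}|x,a^{[l]})\,e^{\eta r(a,x)}
\propto \pi_0(a^{-[l]}|x,a^{[l]})\,e^{\eta r(a,x)},
\]
and normalizing over \(a^{-[l]}\) gives the stated expression.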
And we obtain:
\[
\pi_*(a^{[l]}|x)
=\frac{\pi_0(a^{[l]}|x)e^{\eta r(a^{[l]},x)}}{\mathbb{E}_{a^{[l]}\sim\pi_0(\cdot|x)}e^{\eta r(a^{[l]},x)}}
\]
\[
r(a^{[l]}, x) =
\frac{1}{\eta} \ln
\mathbb{E}_{a^{-[l]}\sim\pi_0(\cdot|x,a^{[l]})}e^{\eta r(a,x)} \tag{1}
\]
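The first identity follows by marginalizing \(\pi_*(\cdot|x)\) over completions and using the definition of the intermediate reward, with the tower rule applied in the denominator:
\[
\pi_*(a^{[l]}|x)
= \frac{\pi_0(a^{[l]}|x)\,\mathbb{E}_{a^{-[l]}\sim\pi_0(\cdot|x,a^{[l]})}e^{\eta r(a,x)}}{\mathbb{E}_{a\sim\pi_0(\cdot|x)}e^{\eta r(a,x)}}
= \frac{\pi_0(a^{[l]}|x)\,e^{\eta r(a^{[l]},x)}}{\mathbb{E}_{a^{[l]}\sim\pi_0(\cdot|x)}e^{\eta r(a^{[l]},x)}},
\]
and the second identity, (1), is simply the definition of the intermediate reward solved for \(r(a^{[l]},x)\).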
The partial reward given by (1) is our definition of the entropy-regularized process reward.
We can then optimize the policy over partial reasoning steps,
\(\pi(a^{[l]}|x)\), using this process reward.
An important property of this formulation is that our process reward model can be computed by using the reference policy \(\pi_0\) to generate completions.
In contrast, in traditional RL the reward of a partial trajectory depends on the optimal policy that generates the completion,
so one has to learn the reward and the policy simultaneously. This is not necessary in our entropy-regularized approach.
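As a concrete illustration, the expectation in (1) can be estimated by Monte Carlo: sample a handful of completions of the partial chain from \(\pi_0\), score each full chain with the outcome reward, and take a log-mean-exp. The sketch below assumes hypothetical callables \texttt{sample\_completion} and \texttt{reward} standing in for the reference model and the outcome verifier; it is a minimal sketch, not our actual implementation.
\begin{verbatim}
import math

def estimate_process_reward(x, partial_chain, sample_completion, reward,
                            eta, n_samples=16):
    """Monte Carlo estimate of the entropy-regularized process reward (1).

    sample_completion(x, partial_chain): draws a completion a^{-[l]} from pi_0
    reward(x, full_chain): outcome reward r(a, x) of the full chain
    Both callables are placeholders for the reference model and the verifier.
    """
    log_terms = []
    for _ in range(n_samples):
        completion = sample_completion(x, partial_chain)  # a^{-[l]} ~ pi_0
        full_chain = partial_chain + completion           # a = [a^{[l]}, a^{-[l]}]
        log_terms.append(eta * reward(x, full_chain))     # eta * r(a, x)
    # r(a^{[l]}, x) = (1/eta) * log( (1/N) * sum_i exp(eta * r_i) ),
    # computed with the usual log-sum-exp shift for numerical stability.
    m = max(log_terms)
    log_mean_exp = m + math.log(sum(math.exp(t - m) for t in log_terms) / n_samples)
    return log_mean_exp / eta
\end{verbatim}
With \texttt{n\_samples} set to 1 the estimate reduces to the outcome reward of the single sampled rollout; as \(\eta\) grows, the log-mean-exp is increasingly dominated by the best sampled completions, which is the soft optimism referred to below.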
As one can see, equation (1) employs soft optimism over paths generated by the reference policy.
More generally, the reward can be computed using any policy, including the optimal policy. In that case, the equivalent formula becomes:
\[
r(a^{[l]}, x) =
-\frac{1}{\eta}
\ln\mathbb{E}_{a^{-[l]}\sim\pi_*(\cdot|x,a^{[l]})}e^{-\eta r(a,x)} \tag{2}
\]
where we have used the following fact to express (1) using \(\pi_*\):
\[
\pi_0(a^{-[l]}|x, a^{[l]}) =
\frac{\pi_*(a^{-[l]}|x, a^{[l]})e^{-\eta r(a,x)}}{\mathbb{E}_{a^{-[l]}\sim\pi_*(\cdot|x,a^{[l]})}e^{-\eta r(a,x)}}
\]
Intuitively, (2) implements soft pessimism over paths generated by the optimal policy.
It can be shown that the reward computed in this way is identical to the reward computed using the reference policy in (1).
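Indeed, substituting the conditional form of \(\pi_*(\cdot|x,a^{[l]})\) derived earlier and using the definition of the intermediate reward gives a one-line verification:
\[
\mathbb{E}_{a^{-[l]}\sim\pi_*(\cdot|x,a^{[l]})}e^{-\eta r(a,x)}
= \frac{\mathbb{E}_{a^{-[l]}\sim\pi_0(\cdot|x,a^{[l]})}\big[e^{\eta r(a,x)}e^{-\eta r(a,x)}\big]}{\mathbb{E}_{a^{-[l]}\sim\pi_0(\cdot|x,a^{[l]})}e^{\eta r(a,x)}}
= e^{-\eta r(a^{[l]},x)},
\]
so (2) recovers exactly the process reward defined in (1).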
As we have already pointed out, the fact that the entropy-regularized process reward model can be computed using the reference policy is a key advantage of the entropy-regularization approach.
In this project, we will use (1) to compute our process reward models, and demonstrate its advantage over previous approaches empirically.