"Let's Verify Step by Step" is a research paper from OpenAI. The authors focus on the performance of large language models on complex multi-step reasoning tasks. Despite significant improvements in recent years, these models still often produce logical errors. To train more reliable models, the researchers compare two methods: outcome supervision and process supervision. Outcome supervision provides feedback on the final result, while process supervision provides feedback on each intermediate reasoning step. The authors conduct their investigation on the challenging MATH dataset and find that process supervision significantly outperforms outcome supervision. Their process-supervised model solves 78% of problems from a representative subset of the MATH test set. They also show that active learning significantly improves the efficacy of process supervision. The authors release PRM800K, a dataset of 800,000 step-level human feedback labels used to train their best reward model.

The authors discuss related work, including a study by Uesato et al. (2022) that compared the impact of outcome and process supervision on grade-school math. That study found that both methods led to similar final-answer error rates, with process supervision achieving those results using less data. However, the authors of this paper use a more capable model, evaluate on a more challenging dataset, and collect a larger quantity of process supervision data. They argue that process supervision leads to better performance when scaled up, even when judged solely on outcomes. The authors also discuss synthetic supervision, in which a large reward model is used to supervise the training of smaller models; they mention the work of Gao et al. (2022), who studied overoptimization in reinforcement learning from human feedback (RLHF). Several recent studies examining the reasoning ability of large language models are also mentioned. For example, Lewkowycz et al. (2022) showed that fine-tuning models on a large corpus of technical content led to significantly improved performance on the MATH dataset.

The authors also discuss the impact of active learning and iterative retraining on the performance of their models. They observe that a relative lack of diversity in the data limits the possible upside from active learning. They also attempted iterative retraining of PRM_selector throughout data collection, but observed instability in this process; the resulting reward models performed no better than the models described above. They consider this a compelling direction for future research.

To measure out-of-distribution generalization, the authors evaluate their large-scale outcome-supervised reward model (ORM) and process-supervised reward model (PRM) on a held-out set of 224 STEM questions drawn from the most recent AP Physics, AP Calculus, AP Chemistry, AMC 10, and AMC 12 exams. They report that the PRM outperforms both the ORM and majority voting, showing that the PRM can tolerate a modest amount of distribution shift and that its strong performance holds up on fresh test questions.

The authors discuss the advantage of process supervision in providing more precise feedback than outcome supervision. A reward model trained with outcome supervision faces a difficult credit-assignment task: to generalize well, it must determine where an incorrect solution went wrong. This is particularly difficult for hard problems, where most model-generated solutions contain an error somewhere, so the marginal value of a negative label from outcome supervision is low.
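To make the reranking comparisons above (PRM versus ORM versus majority voting) concrete, here is a minimal Python sketch of best-of-N selection with a PRM. The solution records, their per-step probabilities, and the example values are hypothetical; the one detail taken from the paper is that a solution's PRM score is the probability that every step is correct, computed as the product of the per-step correctness probabilities.

```python
import math
from collections import Counter

# Each candidate solution is a dict with:
#   "answer"     - the final answer string
#   "step_probs" - the PRM's predicted probability that each step is correct
#                  (hypothetical values standing in for real PRM outputs)

def prm_score(solution: dict) -> float:
    # Score a full solution as the product of its per-step correctness
    # probabilities; summing logs avoids underflow on long solutions.
    return math.exp(sum(math.log(max(p, 1e-12)) for p in solution["step_probs"]))

def best_of_n_prm(solutions: list[dict]) -> str:
    # PRM reranking: return the answer of the highest-scoring sampled solution.
    return max(solutions, key=prm_score)["answer"]

def best_of_n_majority(solutions: list[dict]) -> str:
    # Majority-voting baseline: return the most common final answer.
    return Counter(s["answer"] for s in solutions).most_common(1)[0][0]

# Example: three sampled solutions to one problem.
candidates = [
    {"answer": "34", "step_probs": [0.98, 0.95, 0.97]},
    {"answer": "36", "step_probs": [0.99, 0.60, 0.90]},
    {"answer": "36", "step_probs": [0.97, 0.55, 0.92]},
]
print(best_of_n_prm(candidates))       # "34" - highest product of step probabilities
print(best_of_n_majority(candidates))  # "36" - most frequent final answer
```

The example is contrived to show where the two selection rules can disagree: the PRM prefers the single solution whose every step looks sound, while majority voting prefers the answer that happens to appear most often.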
The authors also provide visualizations of the PRM's performance on various problems, noting the pass rate under the generator to give a sense of each problem's difficulty. They show examples of true positives, where the reward model correctly recognizes that a valid chain of thought has been found, and false positives, where the generator makes a mistake that the reward model fails to catch. In one instance, the generator makes a subtle counting error: on the surface it seems reasonable to claim there are five ways to exchange a same-colored ball, since there are five colors, but this undercounts by a factor of two because Bob has two choices for which ball to return to Alice. The reward model is fooled by this mistake. In another problem, the generator attempts to simplify an equation by combining like terms; it correctly moves and combines the linear terms on the left-hand side but mistakenly leaves the right-hand side untouched, and again the reward model fails to catch the error. In a third problem, the generator attempts long division but, at one step, forgets to include the leading zeros in the repeating part of the decimal, and once more the reward model is fooled. These examples illustrate the challenge of training models to solve complex problems and the importance of careful supervision and feedback during training.

The authors conduct a direct comparison of outcome and process supervision. They sample between 1 and 200 solutions per problem from a small-scale generator. For each dataset, they provide three forms of supervision: process supervision from PRM_large, outcome supervision from PRM_large, and outcome supervision from final-answer checking. The results show that process supervision significantly outperforms both forms of outcome supervision at all data-collection scales. Using PRM_large for outcome supervision is noticeably more effective than final-answer checking, as it provides better supervision for solutions that reach the correct final answer through incorrect reasoning.

The authors also investigate the impact of active learning. They train a small-scale reward model, PRM_selector, on a single sample from each problem, and use this model to score 1,000 samples per problem when deciding which solutions to surface to human labelers (a simplified sketch of this selection step appears at the end of this section). They estimate that this form of active learning is approximately 2.6x more data efficient than uniform data labeling.

The authors conclude that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset, and that active learning significantly improves the efficacy of process supervision. They suggest that future research should focus on improving the diversity of data for active learning and on investigating the impact of iterative retraining. They also highlight the need for better methods to handle false positives and to improve the reliability of large language models.
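The selection step referenced above can be illustrated with a short sketch. The sample records, the "selector_score" field, and the heuristic of surfacing the highest-scoring wrong-answer solutions first are assumptions for illustration, not the authors' exact criterion; the underlying idea is that convincing wrong answers are the likeliest false positives and therefore the most informative to label at the step level.

```python
def select_for_labeling(samples: list[dict], n: int) -> list[dict]:
    """Choose n of the (e.g. 1,000) generator samples for step-level labeling.

    Each sample dict holds a "selector_score" from the small PRM_selector model
    and an "is_correct" flag from final-answer checking. Assumed heuristic:
    surface the most convincing (highest-scoring) wrong-answer samples first,
    then fill any remainder with the top-scoring correct-answer samples.
    """
    ranked = sorted(samples, key=lambda s: s["selector_score"], reverse=True)
    wrong_first = [s for s in ranked if not s["is_correct"]]
    selected = wrong_first[:n]
    if len(selected) < n:
        leftovers = [s for s in ranked if s["is_correct"]]
        selected += leftovers[:n - len(selected)]
    return selected

# Example: surface 2 of 4 samples for one problem.
samples = [
    {"selector_score": 0.96, "is_correct": True},
    {"selector_score": 0.91, "is_correct": False},   # convincing but wrong
    {"selector_score": 0.40, "is_correct": False},
    {"selector_score": 0.88, "is_correct": True},
]
print(select_for_labeling(samples, 2))  # the two wrong-answer samples, best first
```

This kind of score-guided selection is what drives the estimated 2.6x data-efficiency gain over uniform labeling, since labeling effort is concentrated on the solutions the reward model is currently most likely to be fooled by.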