Even before OpenAI's o1 appeared, people had already begun analyzing the technical route behind OpenAI, and one of the proposed directions was a shift from pretrain scaling and post-train scaling toward inference scaling. In this chapter we pick three inference-scaling papers to discuss: the first two optimize the breadth of reasoning from the aggregation-strategy and search-strategy angles respectively, while the last gives a comprehensive analysis of how to optimally use the various breadth and depth reasoning strategies.
Breadth scoring strategy
- Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems
The starting point of the first paper is fairly simple: it asks whether an inference ensemble is actually effective. The model is asked the same question many times, and the answers are then aggregated with different ensemble scoring strategies, such as majority vote or filter-vote, to analyze the impact on answer accuracy. The filter-vote variant uses the following LLM prompt to filter the sampled answers before running the majority vote:
[User Question]: {query}
[Answer]:{answer}
Instruction: Review your previous answer and find problems with your answer. Finally, conclude with either '[[correct]]' if the above answer is correct or '[[wrong]]' if it is incorrect. Think step by step.
Verdict:
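To make the two aggregation strategies concrete, here is a minimal sketch in Python. It assumes a hypothetical `call_llm(prompt)` helper that returns one sampled completion from the underlying model; everything else follows the description above.

```python
from collections import Counter

def majority_vote(query, call_llm, n_calls):
    """Ask the same question n_calls times and return the most frequent answer.
    `call_llm(prompt)` is an assumed helper returning one sampled completion."""
    answers = [call_llm(query) for _ in range(n_calls)]
    return Counter(answers).most_common(1)[0][0]

def filter_vote(query, call_llm, n_calls):
    """Like majority_vote, but each answer is first self-checked with the verdict
    prompt above; only answers judged '[[correct]]' survive to the vote."""
    answers = [call_llm(query) for _ in range(n_calls)]
    kept = []
    for ans in answers:
        verdict = call_llm(
            f"[User Question]: {query}\n[Answer]: {ans}\n"
            "Instruction: Review your previous answer and find problems with your answer. "
            "Finally, conclude with either '[[correct]]' if the above answer is correct or "
            "'[[wrong]]' if it is incorrect. Think step by step.\nVerdict:"
        )
        if "[[correct]]" in verdict:
            kept.append(ans)
    # Fall back to a plain vote if the filter rejects every answer
    return Counter(kept or answers).most_common(1)[0][0]
```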
The paper was tested on datasets with standardized answers such as MMLU and QA tasks, producing the non-monotonic curves shown below: as the number of inference calls rises, answer accuracy is non-monotonic for both vote and filter-vote. Vote rises first and then declines, while filter-vote behaves differently across datasets. Here we focus on the majority vote, since filter-vote introduces another round of model inference and therefore mixes two variables.
Going back to the most familiar territory of data analysis, U-shaped and ∩-shaped patterns are usually caused by multiple subgroups in the data that behave differently: some groups' metrics rise and then plateau, others plateau and then fall, and after aggregating across groups the combined metric rises then falls (or falls then rises). So the way to resolve a U-shaped or ∩-shaped pattern is to find the confounding variable that clearly separates the diverging trends.
The confounder the paper identifies is query difficulty, defined by whether a question can ultimately be answered correctly. Personally, I think the real driver is not query difficulty per se, but the probability the model's internalized knowledge assigns to the right answer versus the wrong ones: only when the model is more likely to produce the correct answer do more LLM samples and majority voting help. So as the number of inference calls increases, accuracy on simple questions rises first and then levels off, while accuracy on complex questions keeps falling; merged together they show a rise-then-fall trend.
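This mixing effect is easy to reproduce numerically. The sketch below uses made-up numbers (not from the paper): on "easy" questions the model samples the right answer with probability 0.8, on "hard" ones with 0.4, and a k-sample majority vote is approximated as a binomial tail.

```python
from math import comb

def p_majority_correct(p, k):
    """Probability that the correct answer wins a k-sample majority vote,
    treating the vote as binary correct/incorrect (k odd)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k // 2 + 1, k + 1))

# Made-up subgroups: easy questions (p=0.8) vs hard questions (p=0.4), mixed 50/50
for k in (1, 3, 7, 15, 31):
    easy, hard = p_majority_correct(0.8, k), p_majority_correct(0.4, k)
    print(k, round(easy, 3), round(hard, 3), round(0.5 * easy + 0.5 * hard, 3))
# Easy accuracy saturates toward 1, hard accuracy collapses toward 0, and the
# 50/50 mixture rises slightly at small k before declining: a ∩-shaped curve.
```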
The first paper mainly serves as an opening act: it finds that query difficulty affects how well an inference aggregation strategy works, which coincides with Google's later optimization of inference strategies.
Breadth search strategy
- REBASE: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
A generalized inference strategy actually consists of two parts: how to generate reasoning chains (the search strategy) and how to score them
- Breadth search: from simple repeated random sampling, as in the previous paper, to more complex multi-step tree search such as ToT and MCTS
- Scoring: majority vote, filtered majority vote, weighted majority vote, and best-of-N based on a reward model (either outcome-based ORM or process-based PRM)
The previous paper covered scoring strategies such as majority vote, but its search strategy was just simple repeated random sampling. Here we look at another paper, REBASE, which optimizes the search strategy. REBASE uses a tree search whose logic is similar to ToT, and its experiments offer some insight into the balance between inference accuracy and inference cost.
The tree search proceeds as follows (a code sketch follows the list).
- First, generate N candidates for the problem, with the inference budget set to N
- Use the PRM model to score the N candidate reasoning steps, and check whether any of these candidates have already completed their reasoning; if C nodes are complete, then budget -= C
- For unfinished nodes, perform weighted sampling based on the PRM scores, and expand the sampled nodes by one more reasoning step
- Repeat until budget = 0, i.e., terminate after successfully generating N complete reasoning chains
- The final N candidate answers can then be aggregated with any of the breadth scoring strategies; here the paper uses the theoretically stronger weighted majority vote and best-of-N
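Here is a minimal sketch of that loop. It assumes three hypothetical helpers: `expand(question, chain)` samples one more reasoning step and returns the extended chain, `prm_score(chain)` returns the process reward of its last step (assumed non-negative, e.g. a probability), and `is_complete(chain)` checks whether the chain ends in a final answer.

```python
import random

def rebase_search(question, n_budget, expand, prm_score, is_complete):
    """Sketch of the REBASE-style tree search described above."""
    budget = n_budget
    frontier = [expand(question, []) for _ in range(n_budget)]  # N candidate first steps
    finished = []
    while budget > 0 and frontier:
        still_open = []
        for chain in frontier:
            if is_complete(chain):       # completed chains leave the tree ...
                finished.append(chain)
                budget -= 1              # ... and shrink the remaining budget
            else:
                still_open.append(chain)
        if budget <= 0 or not still_open:
            break
        # Weighted sampling of unfinished chains by PRM score, then expand each sample
        weights = [prm_score(chain) for chain in still_open]
        sampled = random.choices(still_open, weights=weights, k=budget)
        frontier = [expand(question, chain) for chain in sampled]
    return finished  # up to N complete chains, aggregated later by weighted vote / best-of-N
```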
On the MATH and GSM8K datasets, the paper uses a Llemma-34B model fine-tuned on a PRM dataset as the reward model, and Mistral-7B, Llemma-7B, and Llemma-34B fine-tuned on the MetaMath dataset as the inference models. Below is the comparison of REBASE against the other breadth search and tree search strategies, both for a given budget and without considering budget.
In terms of results, for the same inference error rate, the REBASE search strategy requires a lower inference cost than random sampling and MCTS, and as the error rate decreases its inference cost grows more slowly than that of the other strategies.
Also, within the same model family, for the same error rate the 7B model needs less inference compute than the 34B model when using breadth or tree search strategies. The interesting point is that a small model can match a large model at lower cost through a better inference strategy, a point argued more fully and carefully in Google's later paper.
Full analysis: Test Time Scaling
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- RISE: Recursive Introspection: Teaching Foundation Models How to Self-Improve
- MATH-SHEPHERD: VERIFY AND REINFORCE LLMS STEP-BY-STEP WITHOUT HUMAN ANNOTATIONS
If the first two papers each picked one direction of the reasoning strategy to optimize and analyze, Google's paper takes the exhaustive approach: it considers all the reasoning strategies and optimization directions together and gives a comprehensive recipe for choosing the optimal one. The paper sets out to answer: given the prompt and the inference budget, how should the optimal inference strategy be chosen, and does that strategy improve results more than switching to a larger model?
The paper frames inference-time optimization as adjusting the token distribution of the model's output. One route is to let the model adjust itself, for example by training it with RL to generate intermediate reasoning (STaR), or by having the model critique its own output and optimize against those suggestions; the paper calls this the "Proposal" route. The other route is to search over outputs with a scoring model, which the paper calls the "Verifier" route. As I see it, the former is essentially a depth strategy (e.g., self-refine, self-critique, self-reflection), while the latter is essentially a breadth strategy (e.g., best-of-N).
The paper argues that these different strategies suit problems of different difficulty, so the first step is to classify problem difficulty. Using the MATH dataset, the base model generates 2048 inference results per sample, a scoring model judges whether each result is correct, and questions are then split by correctness rate into 5 buckets representing different difficulty levels. With this difficulty classification in place, let's look at optimizing the Verifier and the Proposal routes in turn.
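A minimal sketch of that bucketing step, assuming hypothetical `generate(q)` and `is_correct(q, answer)` helpers (base-model sampling and answer checking); only the pass-rate-then-quantile logic follows the description above.

```python
def bucket_by_difficulty(questions, generate, is_correct, n_samples=2048, n_buckets=5):
    """Estimate each question's pass rate from the base model's own samples, then split
    the questions into equal-sized difficulty buckets (1 = easiest, n_buckets = hardest)."""
    pass_rates = []
    for q in questions:
        answers = [generate(q) for _ in range(n_samples)]
        pass_rates.append(sum(is_correct(q, a) for a in answers) / n_samples)
    # Rank from highest pass rate (easy) to lowest (hard), then cut into equal slices
    order = sorted(range(len(questions)), key=lambda i: -pass_rates[i])
    buckets = {}
    for rank, i in enumerate(order):
        buckets[questions[i]] = rank * n_buckets // len(questions) + 1
    return buckets
```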
Verifier
The Verifier route relies on a scoring model, so the first step is training a PRM. If you are not familiar with PRMs, you can first read Demystifying Prompt Series 34: A Different Approach to RLHF Training: Step by Step & Blue is Better than Blue. Here the paper follows the fully automated labeling scheme of MATH-SHEPHERD: generate multiple reasoning trajectories for the same problem, and for each node in a trajectory, use the probability that the node eventually leads to the correct answer as that node's label.
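A sketch of that labeling idea: for each reasoning prefix, roll out several continuations and use the fraction that reach the gold answer as the soft label for that step. `continue_from(question, prefix)` and `extract_answer(text)` are assumed helpers, not part of the paper's released code.

```python
def prm_step_labels(question, steps, gold_answer, continue_from, extract_answer,
                    n_rollouts=8):
    """MATH-SHEPHERD-style soft labels: for each reasoning prefix, the label is the
    empirical probability that completions from that prefix reach the gold answer."""
    labels = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        hits = 0
        for _ in range(n_rollouts):
            completion = continue_from(question, prefix)  # one sampled full continuation
            hits += extract_answer(completion) == gold_answer
        labels.append(hits / n_rollouts)  # label for step t
    return labels
```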
As for how to use the PRM at inference time, there are two steps: searching to generate multiple reasoning chains, and scoring those chains and voting to obtain the final answer.
Let's start with the search strategy for the reasoning chains. The paper tries three schemes, covering simple breadth search and tree search (a beam-search sketch follows the list):
- Best-of-N: generate N reasoning chains by random sampling
- Beam Search: generate N initial nodes, at each step keep the Top-M nodes by PRM score to continue inference, and finally retain the N highest-scoring reasoning chains
- Lookahead Search: the same as Beam Search, except that which nodes are kept at each step is decided not by the current node's score but by rolling out K further steps and scoring the node reached after those K steps; K = 0 recovers Beam Search. It looks like the most principled search strategy, but it also adds extra compute cost.
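Here is a minimal sketch of the PRM-guided beam search, under the assumptions that `expand(question, chain)` samples one more reasoning step, `prm_score(chain)` scores the chain's latest step, and `n_beams` is divisible by `top_m`; termination of completed chains is omitted for brevity.

```python
def prm_beam_search(question, n_beams, top_m, max_steps, expand, prm_score):
    """PRM-guided beam search as described above: keep N partial chains, score them with
    the PRM at every step, keep the Top-M, and branch each survivor so the beam width
    returns to N."""
    beams = [expand(question, []) for _ in range(n_beams)]  # N initial first steps
    for _ in range(max_steps):
        beams.sort(key=prm_score, reverse=True)
        kept = beams[:top_m]
        # Branch each surviving chain into N / M continuations
        beams = [expand(question, chain) for chain in kept
                 for _ in range(n_beams // top_m)]
    beams.sort(key=prm_score, reverse=True)
    return beams[:n_beams]  # N highest-scoring chains, handed to weighted best-of-N
```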
Next, the scoring scheme. All of the search strategies above produce N reasoning chains, and the paper scores each chain as a whole: instead of aggregating PRM scores across steps, the PRM score of the last node is used directly as the score of the whole chain. Given the scores, weighted best-of-N produces the final answer: for each candidate answer, sum the scores of all chains that arrive at that answer, and pick the answer with the highest total.
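That aggregation is only a few lines; the sketch below assumes hypothetical `final_answer(chain)` and `last_step_prm(chain)` helpers.

```python
from collections import defaultdict

def weighted_best_of_n(chains, final_answer, last_step_prm):
    """Weighted best-of-N: each chain is scored by the PRM score of its last step,
    scores are summed per distinct final answer, and the highest total wins."""
    totals = defaultdict(float)
    for chain in chains:
        totals[final_answer(chain)] += last_step_prm(chain)
    return max(totals, key=totals.get)
```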
A comparison of the effects of different strategies leads to several conclusions
- Beam Search works best when the inference budget is limited; as N grows, Best-of-N gradually demonstrates that brute force works wonders
- For the same budget, Lookahead is the worst. The paper suspects over-optimization at the search level, e.g., reasoning chains that become too short (rolled forward K steps and then stopped early). But I rather suspect it is because the lookahead rollouts use temperature = 0, which hurts generation quality
- On problems of varying difficulty, Best-of-N gets progressively better on simple problems (levels 1-2) as the budget goes up, suggesting the breadth strategy is effective for simple problems. On harder problems (levels 3-4), Beam Search keeps working better, but on the hardest problems (level 5) nothing works well
In short: the simpler the problem, the simpler the search strategy should be, and the larger the inference budget, the simpler the search strategy should be.
Proposal
The Proposal route relies on the model's ability to reflect on and improve its own answers, so the first step is training a revision model. The paper follows RISE's scheme, except that it constructs the multi-turn answers offline. Each sample is independently sampled 64 times; correct responses are paired with 0-4 incorrect responses (randomly sampled), and the incorrect responses are used as the preceding turns with the correct response as the final turn, forming multi-turn conversations as training data.
During pairing, edit distance is used to pick the incorrect responses most similar to the correct one, so that the model can more easily find the connection between right and wrong answers and actually learn to locate the cause of the error and fix it, rather than ignoring the incorrect context and simply trying to produce a correct answer from scratch. The model is then fine-tuned on these samples.
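A sketch of that data construction for a single question, assuming the 64 samples have already been split into `correct_answers` and `wrong_answers`, and that an `edit_distance` helper (e.g. Levenshtein) is available; the user-turn wording is my own placeholder, not the paper's template.

```python
import random

def build_revision_dialogue(question, correct_answers, wrong_answers, edit_distance,
                            max_wrong=4):
    """Offline revision data: 0-4 wrong answers become earlier assistant turns, and the
    wrong answers chosen are the ones closest (by edit distance) to the correct answer
    that forms the final turn."""
    target = random.choice(correct_answers)
    n_wrong = random.randint(0, min(max_wrong, len(wrong_answers)))
    context = sorted(wrong_answers, key=lambda w: edit_distance(w, target))[:n_wrong]
    turns = [{"role": "user", "content": question}]
    for wrong in context:
        turns.append({"role": "assistant", "content": wrong})
        turns.append({"role": "user",
                      "content": "The previous answer is incorrect, please revise it."})
    turns.append({"role": "assistant", "content": target})
    return turns
```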
However, this training data has a bias: the preceding turns contain only wrong answers, and the final answer always differs from them, whereas at inference time the context consists of multiple previous attempts, some of which may already be correct, so the model might revise a correct answer into a wrong one. The paper therefore combines revision with the verifier: the scoring model is used to select the best answer from the sequence of results produced by the revision chain.
In terms of results, the paper finds that under different inference budgets, with the same number N of reasoning chains, sequential revision (depth) beats parallel sampling (breadth) across the board. Still, the paper argues that each strategy should have its own strengths in different scenarios: the breadth strategy is better at global search, while the depth strategy depends on having picked the right direction at the start and then refining locally. So the paper tries to mix the two and find the optimal ratio, i.e., how much of the budget goes to depth search and how much to breadth search.
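A sketch of that budget split, under the assumption that `revise(question, history)` returns the next attempt given previous attempts (the first call gets an empty history) and that `verifier_score` / `final_answer` are available helpers.

```python
def split_budget_inference(question, n_budget, ratio_sequential, revise,
                           verifier_score, final_answer):
    """Mix depth and breadth under a fixed budget: n_budget total samples are split into
    n_parallel independent chains, each revised n_sequential times, and the verifier
    picks the best candidate overall."""
    n_sequential = max(1, int(n_budget * ratio_sequential))
    n_parallel = max(1, n_budget // n_sequential)
    candidates = []
    for _ in range(n_parallel):
        history = []
        for _ in range(n_sequential):
            attempt = revise(question, history)  # depth: condition on earlier attempts
            history.append(attempt)
            candidates.append(attempt)
    best = max(candidates, key=verifier_score)   # verifier selects across all candidates
    return final_answer(best)
```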
The paper argues that there should be an optimal ratio of breadth to depth for different budgets and problem difficulties. Honestly, the trend in the figure below is not very clear; the conclusions that can be drawn are
- When the budget is limited, more revisions work better, and when the budget is large there is an optimal ratio. My sense, though, is that this is not really a balance ratio, but rather that the breadth strategy depends on the budget more abruptly: once the budget passes a threshold, some problems see a marked jump in performance, whereas the gains from revisions grow more smoothly with budget
- For simple problems more revisions work better, while for complex problems there is an optimal ratio: depth and breadth strategies complement each other on complex problems
Finally, one remaining question is the trade-off between scaling inference and scaling pre-training. The figure speaks for itself; the specific numbers are not that important since they depend on the model and the dataset, so I'll just give the insights:
- Simple problems: extra inference compute can cover much of what extra pre-training would buy, so spending more on inference is a better fit for smaller models
- Complex problems: for problems beyond the model's capability, pre-training remains central to improving the model
For a fuller compendium of papers on large models, fine-tuning, pre-training data and frameworks, and AIGC applications, head over to GitHub >> DecryPrompt