Some thoughts on the REACT paradigm

The REACT paradigm has been explored for almost a year. It has enabled a wide range of applications in many domains and has made progress on problems that were previously hard to solve. For example, although large models have shown impressive performance on language comprehension and interactive decision-making tasks, how to get an LLM to generate reasoning traces and task-specific actions in an interleaved manner had long been an open problem. The REACT paradigm proposes a way to mimic the close synergy between "acting" and "reasoning" in humans, and with it the human ability to learn new tasks quickly and to make robust decisions or inferences even in previously unseen situations or under information uncertainty.

To cite some cases

List some specific issues

In-depth exploration of issues

This subsection states the purpose behind the paper's design.

The experiments in the previous subsection show that REACT outperforms Act. Table 1 shows the HotpotQA and Fever results using PaLM-540B as the base model with different prompting methods. We note that REACT outperforms Act on both tasks, which demonstrates the value of reasoning in guiding action, especially in synthesizing the final answer, as shown in Figure 1 (1c-d). The fine-tuning results also confirm the benefit of reasoning traces for more informed acting.

To compare the behavior of REACT and CoT on HotpotQA, the authors designed a ROC measure to judge the quality of the classification and detection results. The experimental steps are as follows:

From each of REACT and CoT, 50 trajectories with correct answers and 50 with incorrect answers (as judged by EM) were randomly sampled (200 examples in total), and their success and failure modes were manually labeled in Table 2.

B) While interleaving reasoning, action, and observation steps improves REACT's groundedness and trustworthiness compared to CoT, this structural constraint also reduces REACT's flexibility in formulating reasoning steps, resulting in a higher reasoning error rate than CoT.

REACT has a frequent error pattern in which the model regenerates previous thoughts and actions. We categorize this as part of "reasoning error", because the model fails to reason out the correct next action and cannot break out of the loop. (This might be addressed by engineering the context, but that would further increase the cost of inference; this was attempted in the law competition.)

C) The authors do not address the accuracy of retrieving informative knowledge. Uninformative searches accounted for 23% of the error cases in the original experiment; they derail the model's reasoning and make it hard for the model to recover and reformulate its thoughts.
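The repeated thought/action pattern described above is easy to check for mechanically. The following is only a minimal sketch, assuming a trajectory is stored as a list of (thought, action) string pairs; the function name `find_repeated_step` and this data layout are my own illustration, not something from the paper.

```python
from typing import List, Optional, Tuple


def find_repeated_step(steps: List[Tuple[str, str]]) -> Optional[int]:
    """Return the index of the first step whose (thought, action) pair
    already appeared earlier in the trajectory, or None if there is no repeat.

    Comparison ignores case and surrounding whitespace (an assumption;
    a real system might use fuzzy or embedding-based matching instead).
    """
    seen = set()
    for index, (thought, action) in enumerate(steps):
        key = (thought.strip().lower(), action.strip().lower())
        if key in seen:
            return index
        seen.add(key)
    return None


# Hypothetical trajectory in which the third step repeats the first one.
looping = [
    ("I need to search for X.", "Search[X]"),
    ("That was not helpful, try Y.", "Search[Y]"),
    ("I need to search for X.", "search[x]"),
]
clean = looping[:2]

repeat_index = find_repeated_step(looping)
no_repeat = find_repeated_step(clean)
```

A caller could stop the episode, or switch to another strategy, as soon as a repeat is found, instead of spending more steps inside the loop.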

This subsection looks at the authors' ablation experiments.

The paper ultimately reports that, among the methods it tested, prompting LLMs with REACT + CoT-SC worked best, matching the performance of CoT-SC with 21 samples while using only 3-5 samples. These results show the value of properly combining the model's internal knowledge with external knowledge for reasoning tasks.
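To make the combination concrete, here is a minimal sketch of one way to pair self-consistency voting with a REACT fallback. It assumes the rule, as I understand it from the paper, that CoT-SC backs off to REACT when the majority answer appears in fewer than half of the samples. The function names and the `react_fallback` callable are hypothetical placeholders.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_answer(samples: List[str]) -> Tuple[str, int]:
    """Return the most frequent answer among the samples and its count."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count


def cot_sc_then_react(cot_samples: List[str],
                      react_fallback: Callable[[], str]) -> str:
    """Use the CoT-SC majority vote unless the model looks unsure.

    Assumption: "unsure" means the majority answer occurs in fewer than
    half of the sampled chains; in that case we call the REACT fallback.
    """
    answer, count = majority_answer(cot_samples)
    if count < len(cot_samples) / 2:
        return react_fallback()
    return answer


# Toy inputs: the answers a model might return over several sampled chains.
confident = cot_sc_then_react(["A", "A", "B"], lambda: "from REACT")
unsure = cot_sc_then_react(["A", "B", "C"], lambda: "from REACT")
```

In the first call the vote is trusted; in the second no answer reaches half of the samples, so the external (REACT) path decides.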

But I have some thoughts here.

The overall idea of the paper can be summarized by the following paradigm:

Question
Thought
Action [Finish]
Observation
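The paradigm above can be sketched as a small loop. This is only an illustration under my own assumptions: `think` stands in for the LLM call and `act` for the external tool, and both are scripted stubs here, not a real model or API.

```python
from typing import Callable, List, Optional, Tuple

# think(context) -> (thought, action_name, action_argument)
ThinkFn = Callable[[List[str]], Tuple[str, str, str]]
# act(action_name, action_argument) -> observation text
ActFn = Callable[[str, str], str]


def react_loop(question: str, think: ThinkFn, act: ActFn,
               max_steps: int = 5) -> Tuple[Optional[str], List[str]]:
    """Interleave Thought -> Action -> Observation until Action is Finish."""
    context = [f"Question: {question}"]
    for step in range(1, max_steps + 1):
        thought, action, argument = think(context)
        context.append(f"Thought {step}: {thought}")
        context.append(f"Action {step}: {action}[{argument}]")
        if action == "Finish":
            return argument, context  # the argument of Finish is the answer
        observation = act(action, argument)
        context.append(f"Observation {step}: {observation}")
    return None, context  # no answer within the step budget


def scripted_think(context: List[str]) -> Tuple[str, str, str]:
    """Hypothetical stand-in for the LLM: search once, then finish."""
    if not any(line.startswith("Observation") for line in context):
        return ("I need to look up the capital.", "Search", "France")
    return ("The observation contains the answer.", "Finish", "Paris")


def toy_act(action: str, argument: str) -> str:
    """Hypothetical stand-in for a search tool."""
    knowledge = {"France": "France is a country whose capital is Paris."}
    return knowledge.get(argument, "No results.")


answer, trace = react_loop("What is the capital of France?",
                           scripted_think, toy_act)
```

Note that everything the tool returns is appended to the context, which is why the context grows with every step.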

In 3.3 RESULTS AND OBSERVATIONS, many of the problems discussed seem to be attributed to model hallucination. Seen from the linguistic point of view of interpreting an action, the model forms an interpretation of the problem (Thought) before taking the next action. What drives the action in this process is clearly not discussed in the paper. The source of the problems may be model hallucination, or it may be something deeper.

We won't explore that here. Instead, let's look at the success rate of the actions that are taken. We clearly need an efficient way to evaluate it. We can keep using ROC as the evaluation method: list a confusion matrix over true, illusory (hallucinated), and false outcomes, and let the model try to learn from it to improve the effectiveness of this step.
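As a rough illustration of that idea, the sketch below tabulates a three-class confusion matrix. It assumes each action step has a human label and a label assigned by some automatic judge, both drawn from "true", "illusory", and "false"; the labels, the judge, and the data are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence

LABELS = ("true", "illusory", "false")


def confusion_matrix(actual: Sequence[str], predicted: Sequence[str],
                     labels: Sequence[str] = LABELS) -> List[List[int]]:
    """Rows are the actual (human) labels, columns the predicted (judge) labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Made-up labels for five action steps.
human = ["true", "true", "illusory", "false", "false"]
judge = ["true", "illusory", "illusory", "false", "true"]

matrix = confusion_matrix(human, judge)
for label, row in zip(LABELS, matrix):
    print(f"{label:>8}: {row}")
```

The diagonal counts the steps where the judge agrees with the human label; the off-diagonal cells show which kinds of outcome get confused with each other.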

The above step may be of some use, but are the many problems mentioned earlier really just this? Note that the condition that interrupts the "chain of thought" is REASONING, and, as mentioned earlier, REACT + CoT-SC is actually:

Action [act] > Observation > Action [Finish] > Observation > CoT [Question > Thought > Answer]

This is another complex problem. How should we solve it? Hmm, I don't know. Let's start by saying that this problem is unsolved not because it can't be solved, but because it is very complex.

When the model triggers an action to query a piece of information, that process actually falls outside what the above paradigm covers. The triggered action may be a data query or a button operation. These behaviors happen outside the model, their result comes back as the Observation, and the model's ability is then used again to decide whether the next step is really Action [Finish], and so on. By then the model's context is very large and may contain a lot of useless information, and the final CoT after the Observation may fail because the huge context cannot be processed. Could this be solved simply by not worrying about the length of the context? Is this really the case?

Then think about this question: "Once upon a time there was a mountain, and on the mountain there was a temple. In the temple there was an old monk telling a story to a young monk about."

We see a loop. There is something here that I can't articulate at the moment; maybe I need to raise it to a higher level. Here is GPT's explanation, which may help you understand what I am talking about.

#### The Nature of Loops

Then think about this question "Once upon a time there was a mountain, and on the mountain there was a temple. In the temple there was an old monk telling a story to a young monk about ......"

We see a cycle.

This loop is not just a logical trap, it can be the key to understanding the reasoning process. Each cycle of "the old monk telling the story" summarizes and enhances the previous cycle. Hegel's dialectic teaches us that any seemingly insoluble contradiction contains the seeds of a solution. Each cycle is a reflection and adjustment on the basis of the past to reach a new level of understanding.

In model reasoning, a similar cycle can be seen as a process of constant revision and improvement. Whenever the model gets stuck in the reasoning process, or hallucinates, this actually provides us with an opportunity to revisit the model's reasoning path and adjust its decision-making mechanism.

Perhaps we can think about this cycle as a learning opportunity, rather than just a pattern of errors to be avoided. By accumulating new experience and knowledge in each cycle, the model can gradually reduce its errors and eventually come out of the cycle to reach a higher level of reasoning.

We may seem trapped in an endless loop, but each iteration actually provides us with a ladder to higher understanding. The model's reasoning ability may be gradually improved by such constant loops and reflections.

Ultimately, we need to realize that the cycle itself is not the problem; the key lies in how we use it to facilitate the growth and evolution of the model.

An explanation of the ROC (/wuxiping2019/p/)

Confusion matrix

where:

TN: negative class predicted as negative (true negative)
FN: positive class predicted as negative (false negative)
TP: positive class predicted as positive (true positive)
FP: negative class predicted as positive (false positive)

Accuracy

The number of correctly categorized samples in the test sample as a percentage of the total number of samples tested.

Precision

Precision is the ratio of the number of samples correctly classified as positive to the number of samples classified as positive in the test sample.

Recall

Recall is the ratio of the number of samples correctly classified as positive to the number of samples that are actually positive in the test sample.

F1 value

The F1 value is the harmonic mean of precision and recall. It combines precision and recall into a single evaluation metric, and is commonly used for that reason.

True positive rate (TPR)

The proportion of actual positive instances in the predicted positive class to all positive instances. (Same formula as recall).

False positive rate (FPR)

The ratio of actual negative instances in the predicted positive class to all negative instances.
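The definitions above can be checked on a small example. This is a minimal sketch using made-up binary labels (1 = positive, 0 = negative); the function names are my own, and the zero-division fallbacks are just one common convention.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the metrics defined above; ratios with a zero denominator become 0.0."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # same formula as TPR
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "tpr": recall,
        "fpr": ratio(fp, fp + tn),
    }


# Made-up test set: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
result = metrics(*counts)
```

Here TP = 3, FP = 1, TN = 5, FN = 1, so accuracy is 0.8, precision and recall are both 0.75, and FPR is 1/6.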

KS value

We want to assess the model's ability, but TPR and FPR differ at different thresholds, which is confusing. We need a single criterion, and a maximum is unique. The red part of the above figure shows TPR-FPR, so let's use its highest point to assess the model's ability. That's right: the highest point is the so-called KS value. We use the KS value as an indicator of the model's discriminatory ability: the larger the KS value, the stronger that ability. The formula is as follows:

$$\mathrm{KS} = \max_{t}\,\bigl(\mathrm{TPR}(t) - \mathrm{FPR}(t)\bigr)$$

where t ranges over the thresholds.
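The KS value described above can be computed by sweeping the threshold over the observed scores. This is a minimal sketch on made-up scores and labels, assuming a sample is predicted positive when its score is greater than or equal to the threshold; the function names are illustrative.

```python
from typing import Optional, Sequence, Tuple


def rates_at(scores: Sequence[float], labels: Sequence[int],
             threshold: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when samples with score >= threshold are predicted positive."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / pos, fp / neg


def ks_value(scores: Sequence[float],
             labels: Sequence[int]) -> Tuple[float, Optional[float]]:
    """Return (KS, threshold), where KS is the largest TPR - FPR over all thresholds."""
    best_gap, best_threshold = 0.0, None
    for t in sorted(set(scores), reverse=True):
        tpr, fpr = rates_at(scores, labels, t)
        if tpr - fpr > best_gap:
            best_gap, best_threshold = tpr - fpr, t
    return best_gap, best_threshold


# Made-up model scores with their true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

ks, ks_threshold = ks_value(scores, labels)
```

On this toy data the gap between TPR and FPR peaks at 0.75, reached at the threshold 0.55.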

ROC curve

ROC, or Receiver Operating Characteristic, was first developed by electrical and radar engineers in World War II to detect enemy objects (airplanes, ships) on the battlefield, as part of what became known as signal detection theory. It was soon introduced into psychology for the study of perceptual signal detection, and has since been introduced into the field of machine learning to judge the results of classification and detection.

Consider again the TPR and FPR values above. We know the KS value can indicate the model's discriminatory ability, but it only tells us how good the model is at the threshold where the KS value is attained, and ignores what happens when the threshold takes other values. Is there an evaluation criterion that does not depend on the choice of threshold?

In practical application scenarios, the model makes predictions on a sample set. We of course hope that the proportion of actual positives predicted as positive is as high as possible, and that the proportion of actual negatives predicted as positive is as low as possible. In other words, the larger the TPR the better, ideally equal to 1, and the smaller the FPR the better, ideally equal to 0. However, there is no such thing as perfection: as the TPR gets bigger, so does the FPR. Mathematicians are smart, though: even if both change at the same time, there is always a difference in how fast they change, right?

We randomly take many thresholds and get many FPR and TPR values. Using the X-axis for FPR and the Y-axis for TPR, we plot the above curve, which is the ROC curve. (The two points (0,0) and (1,1) are fixed. If the two rates change at the same speed, the curve is a straight line through (0,0) and (1,1) with slope 1; that is, as the threshold changes, FPR and TPR change to the same extent.) Having plotted the curve, is it possible to characterize the model's capability in terms of the curve's properties? What we would like to have is:

TPR changes slowly when FPR changes quickly.
FPR changes slowly when TPR changes quickly.

By this point, we have found another metric for evaluating the model: the shaded area in the figure. The size of this shaded area reflects the characteristics we wished for above, and it is called the AUC value.
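The threshold sweep and the area under the resulting curve can be sketched in a few lines. This is a minimal illustration on made-up scores, assuming "predicted positive" means score >= threshold and using the trapezoid rule for the area; the function names are my own.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points from the strictest to the loosest threshold.

    The curve starts at (0, 0), where nothing is predicted positive, and
    ends at (1, 1), where everything is predicted positive.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def area_under(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoid-rule area under a curve given as points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up model scores with their true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

curve = roc_points(scores, labels)
area = area_under(curve)
```

For this toy data the area is 0.875; a curve that followed the diagonal from (0,0) to (1,1) would give 0.5.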

AUC

AUC (Area Under Curve) is defined as the area under the ROC curve. Because the ROC curve generally lies above the straight line y=x, AUC takes a value between 0.5 and 1. The reason for using AUC as an evaluation metric is that in many cases the ROC curve does not clearly indicate which classifier is more effective, whereas AUC is a single number: the larger its value, the better the classifier. It is important to note that AUC is a probability: when a positive sample and a negative sample are randomly selected, the probability that the classification algorithm ranks the positive sample ahead of the negative sample based on the computed score is the AUC value. So the larger the AUC, the more likely the algorithm is to rank positive samples ahead of negative samples, and the better the classification.
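The probability interpretation can be checked directly by comparing every positive sample with every negative sample. This is a minimal sketch on made-up scores; counting a tie as half a win is a common convention that I am assuming here, and the function name is illustrative.

```python
from typing import Sequence


def auc_by_ranking(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win (an assumed convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up model scores with their true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

auc = auc_by_ranking(scores, labels)
```

Here 14 of the 16 positive/negative pairs are ranked correctly, so the AUC is 0.875. A classifier that ranks every positive ahead of every negative gets 1.0.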