In the previous chapter we discussed DSPy, a standardized prompt-generation scheme, but DSPy still leans on the few-shot prompt-writing paradigm and offers limited optimization of purely task-descriptive instructions. In this chapter we focus on optimizing descriptive instructions: we first briefly introduce structured prompt writing, and then discuss UniPrompt, an algorithmic scheme that iteratively optimizes a prompt from multiple structured perspectives.
1. Structured Prompt Writing
1.1 LangGPT
- /wiki/RXdbwRyASiShtDky381ciwFEnpe
- /langgptai/LangGPT
LangGPT was arguably the first to suggest that prompts are better written in a structured way, and that is essentially the prompt style you now see on task-flow platforms such as Coze.
Structured prompts are generally built with Markdown or JSON. Markdown seems more common nowadays, while JSON was used more in the early GPT-3.5 days; after all, many open-source models now include a large number of Markdown samples in SFT. Below is the sample Markdown format provided by LangGPT:
# Role: Your_Role_Name
## Profile
- Author: YZFly
- Version: 1.0
- Language: English or Chinese writing or Other language
- Description: Describe your role. Give an overview of the role's characteristics and skills
### Skill-1
description 1
description 2
### Skill-2
description 1
description 2
## Rules
1. Don't break character under any circumstance.
2. Don't talk nonsense and make up facts.
## Workflow
1. First, xxx
2. Then, xxx
3. Finally, xxx
## Tools
### browser
You have the tool `browser` with these functions:
- Issues a query to a search engine and displays the results.
- Opens the webpage with the given id, displaying it.
- Returns to the previous page and displays it.
- Scrolls up or down in the open webpage by the given amount.
- Opens the given URL and displays it.
- Stores a text span from an open webpage. Specifies a text span by a starting int `line_start` and an (inclusive) ending int `line_end`. To quote a single line, use `line_start` = `line_end`.
### python
When you send a message containing Python code to python, it will be executed in a
stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0
seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.
### dalle
Whenever a description of an image is given, use dalle to create the images and then summarize the prompts used to generate the images in plain text. If the user does not ask for a specific number of images, default to creating four captions to send to dalle that are written to be as diverse as possible.
### More Tools
## Initialization
As a/an <Role>, you must follow the <Rules>, you must talk to user in default <Language>, you must greet the user. Then introduce yourself and introduce the <Workflow>.
It is not difficult to spot several features and advantages of structured prompts:
- Use heading separators such as # and ### to build a hierarchy: for example, the third-level heading Skill sits under the second-level heading Profile, so the model understands that the skills belong to the profile; and the third-level headings python, browser, dalle, etc. under the second-level heading Tools indicate that these are tools the model can call.
- Sub-module task descriptions: each second-level heading is a main module. Building in modules has two benefits: modules are reusable, and they make it easy to get started and to iterate on the prompt (see the assembly sketch after this list). Common modules include the following
- profile & skills: who the character is, what abilities it has, what language it uses, and so on.
- goal & task: description of the task and goal, e.g. generating a writing outline from the user's instructions.
- constraint & requirements: requirements and constraints, e.g. RAG requires that model responses come from the provided context and are not made up.
- workflow: for complex tasks you often have to tell the model what to do first and what to do next, e.g. for a scoring task, analyze the problem before giving a score of 1-5.
- example & demos: provide a few few-shot examples.
- style & output format: requirements on the answer format, e.g. a single-choice question may only output one of A/B/C/D.
- Init & prefix: a lead-in that tells the model the prompt is over and it should start answering, e.g. for a single-choice question it could be ">>> The option that you think makes the most sense is:"
- Module variable references: in the final Initialization section, <Rules> refers back to the corresponding module name, emphasizing to the model that the rules here are the previously stated rules rather than rules in a broader sense. Such variable references are heavily used in RAG to constrain the model to the provided context, and to further restrict the output format, e.g. "Your answer must be one of <label>".
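To make the modularity concrete, here is a minimal Python sketch of assembling a structured prompt from named modules. The module names and the `build_prompt` helper are illustrative, not part of LangGPT itself.

```python
# Minimal sketch: assemble a structured prompt from reusable modules.
# Module names and this helper are illustrative, not part of LangGPT.

SECTIONS = {
    "Profile": "You are a careful grading assistant for short-answer questions.",
    "Rules": "1. Don't break character.\n2. Don't make up facts.",
    "Workflow": "1. First, analyze the answer.\n2. Then, give a score from 1 to 5.",
    "Output Format": "Reply with a single integer between 1 and 5.",
}

def build_prompt(role: str, sections: dict) -> str:
    """Render modules as Markdown headings, mirroring the LangGPT layout."""
    lines = [f"# Role: {role}"]
    for name, content in sections.items():
        lines.append(f"## {name}")
        lines.append(content)
    # Initialization: tell the model the prompt is over and reference the modules.
    lines.append("## Initialization")
    lines.append("As <Role>, you must follow the <Rules> and the <Workflow>.")
    return "\n".join(lines)

print(build_prompt("Grading Assistant", SECTIONS))
```

Because each module is an independent key, modules can be swapped, reused across tasks, or iterated on one at a time.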
The disadvantages of structured prompts are equally obvious
- They demand more from the model: weak small models with poor complex-instruction-following cannot use them. This is easy to understand intuitively: an instruction is like a region carved out of the model's high-dimensional space, and the more complex the instruction, the finer the carving; for a model whose space is poorly separable to begin with, carve finely enough and nothing is left.
- The longer the prompt and the more constraints it has, the more it biases the model's output, and many corner cases end up traced back to some specific requirement. I therefore personally recommend starting with a prompt that is as simple as possible and adding to it gradually, rather than writing something very complex up front. Each requirement you add may not help, but it can certainly dig a pit.
1.2 Practical Guide
- /articles/2024-05-14-4
Building on the structured prompts above, the winner of the Singapore prompt engineering competition shared further structured prompt-writing tips; here is a summary of two tips that have worked well in my own testing.
- Use of separators: separators here are, broadly, characters that differ from the other hierarchy markers, including longer runs such as #####, 》》》》, ------ and so on. A few places in a prompt call for special separators; the core purpose is for the model to understand that what comes before and after the separator differ significantly and their semantics should be kept apart. For example, in a RAG paragraph-continuation task, special separators are needed to separate the retrieved context [Context] from the paragraph the model has already written [paragraph] before continuing it. In a general question-answering task, it is recommended to clearly mark where the answer should begin, as below
<Annex>
Give a table of the list of row numbers belonging to each cluster, in order to back up your analysis. Use these table headers: [[CLUSTER_NAME], List of Rows].
#############
# START ANALYSIS #
If you understand, ask me for my dataset.
- XML tagging: for some classification tasks, and for tasks whose output is an enumerable value, constraining the output with XML tags gives more stable output than Markdown (see the sketch after this example).
Classify the sentiment of the following conversations into one of the two sentiment classes without any other preamble text.
<classes>
Positive
Negative
</classes>
<conversations>
[Agent]: Good morning, how can I assist you today?
[Customer]: This product is terrible, nothing like what was advertised!
[Customer]: I’m extremely disappointed and expect a full refund.
[Agent]: Good morning, how can I help you today?
[Customer]: Hi, I just wanted to say that I’m really impressed with your
product. It exceeded my expectations!
</conversations>
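A small Python sketch of how such an XML-constrained classification prompt might be built and the reply validated against the enumerable classes; the helper and variable names are made up for illustration.

```python
# Sketch: build an XML-tag-constrained classification prompt and validate the reply.
# The helper and variable names here are illustrative, not from the original guide.

CLASSES = ["Positive", "Negative"]

def build_classification_prompt(conversation: str) -> str:
    classes_block = "\n".join(CLASSES)
    return (
        "Classify the sentiment of the following conversations into one of the "
        "two sentiment classes without any other preamble text.\n"
        f"<classes>\n{classes_block}\n</classes>\n"
        f"<conversations>\n{conversation}\n</conversations>"
    )

def parse_label(reply: str) -> str:
    """Accept the model reply only if it is exactly one of the allowed classes."""
    label = reply.strip()
    if label not in CLASSES:
        raise ValueError(f"Model output {label!r} is not in {CLASSES}")
    return label
```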
2. Structured Prompt Optimization
- Task Facet Learning: A Structured Approach to Prompt Optimization
With the structured prompts above as background, UniPrompt's optimization idea is easier to understand. Structured prompt writing effectively splits the prompt into multiple facets such as profile, rules, workflow, etc., each optimized separately. UniPrompt adopts the same structured idea: it has the model directly generate a structured prompt and then optimizes each part in a targeted way. It also offers a solution to the problem that, when iterating on the prompt, generalization is easily thrown off by individual samples. Compared with the direct LLM-reflection optimization and random-search schemes mentioned in the previous DSPy chapter, it is more systematic.
Is Prompt Optimization Reliable?
The opening of the paper is interesting: the authors first try to establish whether directed prompt optimization is reliable at all.
Proof of continuity
The authors first verify the feasibility of optimization via instruction sensitivity, i.e. how much a small change in the instruction affects task performance (Lipschitz continuity). After all, if a tiny random change in the instruction produces a huge difference, random search may be more appropriate; but if instruction sensitivity has an upper bound, an optimization scheme is a better fit. The Lipschitz continuity condition is as follows
Given a probability distribution X and a nonnegative real number r, with \(L \ge 0\) the Lipschitz constant and \(d\) a distance metric, the condition is

\[ P_{x, x' \sim X}\big[\, d\big(f(x), f(x')\big) \le L \cdot d(x, x') \,\big] \ge 1 - r \]
In plain terms, the slope of the function's change is bounded. To test prompt sensitivity experimentally, the paper rewrites the initial prompt with GPT-4 and uses the cosine distance (Ada-002 embeddings) between the rewritten and initial prompts as the measure of instruction change d(x); it then evaluates the rewritten prompt on a validation set, using the change in accuracy as the measure of the change in task performance d(f(x)). As shown in the paper's Fig. (a), with 95% probability the upper bound of the change is < 1 for GPT-4 and GPT-3, while for the smaller Llama2-13B it exceeds 2. So the more capable the model, the more robust it is to small instruction changes, and the more feasible instruction optimization becomes.
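A minimal sketch of that sensitivity test, with `rewrite_prompt`, `embed`, and `eval_accuracy` as hypothetical stand-ins for the GPT-4 rewriter, the Ada-002 embedding call, and validation-set scoring:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def estimate_sensitivity(base_prompt, n_rewrites, rewrite_prompt, embed, eval_accuracy):
    """Estimate d(f(x)) / d(x) over small prompt perturbations.

    rewrite_prompt, embed and eval_accuracy are hypothetical callables standing in
    for the GPT-4 rewriter, the embedding API and validation-set scoring.
    """
    base_emb = np.asarray(embed(base_prompt))
    base_acc = eval_accuracy(base_prompt)
    ratios = []
    for _ in range(n_rewrites):
        variant = rewrite_prompt(base_prompt)                        # small instruction change
        d_x = cosine_distance(base_emb, np.asarray(embed(variant)))  # instruction distance
        d_fx = abs(eval_accuracy(variant) - base_acc)                # task-effect change
        if d_x > 0:
            ratios.append(d_fx / d_x)
    # The 95th percentile of these ratios plays the role of the
    # "with 95% probability" upper bound discussed above.
    return float(np.percentile(ratios, 95))
```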
Proof of submodularity
With Lipschitz continuity as a basis, the paper further argues that, under the constraints of finite samples and finite prompt length, optimizing the prompt by iterating over multiple facets is feasible, and compares it favorably with few-shot iteration. The paper frames this in terms of submodularity, defined as follows: put simply, the marginal gain from adding the same element shrinks as the set it is added to grows.
For a set V and a nonnegative real-valued function f, if for all \(A, B \subseteq V\) with \(A \subseteq B\), and for all \(x \in V \setminus B\):

\[ f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B) \]
Then, under the constraints of finite samples and finite prompt length, finding the optimal prompt becomes a submodular-function maximization problem: find the set \(S \subseteq V\) that maximizes \(f(S)\) while satisfying \(|S| \lt K\). Functions satisfying submodularity can be near-optimally maximized by a greedy algorithm that, at each iteration, adds the element with the largest marginal gain to the set, until the marginal gain falls below a threshold or the set reaches its size limit.
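The greedy procedure can be sketched as follows; here `f` is assumed to be the validation-set metric evaluated on a set of prompt sections, and the threshold and size limit are illustrative.

```python
def greedy_select(candidates, f, max_size, min_gain=1e-3):
    """Greedy maximization of an (approximately) submodular set function f.

    At every step, add the candidate with the largest marginal gain, stopping
    when the gain falls below min_gain or the set reaches max_size.
    """
    selected = []
    current = f(frozenset())          # metric of the empty section set
    remaining = set(candidates)
    while remaining and len(selected) < max_size:
        gains = {x: f(frozenset(selected) | {x}) - current for x in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:    # marginal gain below threshold: stop
            break
        selected.append(best)
        current += gains[best]
        remaining.remove(best)
    return selected
```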
The paper computes the marginal benefit of the greedy algorithm for few-shot examples and for task facets separately, with the function f above being the validation-set metric. For few-shot, pairs of nested few-shot example sets A and B (one contained in the other) are randomly sampled; the same shot is then added to both A and B, and the change in the validation metric is computed. In the resulting probability distribution, the few-shot gains concentrate in [-0.01, 0.01], essentially a random distribution, with no visible diminishing-marginal-return effect.
For the task-facet part, whereas few-shot above adds a demo, here a section is added, which can be thought of as a sub-module of the structured prompt discussed earlier. The paper uses a fine-tuned model (Llama2-13B) to generate task-facet prompts; Introduction, Task Description, Real-life Application, Background Knowledge, and Challenges in the paper's figure are each a section. A and B are then sampled as different sets of sections, and the marginal gain of adding a new section is computed. Compared with the few-shot dashed line, the blue Facet line shows a clearly more pronounced diminishing-marginal-return trend. This, however, depends heavily on how the sections are generated. Below we look specifically at how the model generates task descriptions (sections) from different perspectives and how a larger model iteratively optimizes them.
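A sketch of that marginal-gain probe: sample nested sets A ⊆ B of sections (or few-shot examples), add the same element x to both, and compare the metric changes. Again `f` is assumed to be the validation-set metric; the sampling details are illustrative.

```python
import random

def marginal_gain_probe(elements, f, n_trials=100, seed=0):
    """Empirically check diminishing returns: gain of adding x to A vs. to B ⊇ A.

    Assumes at least 3 candidate elements; f maps a set of elements to a score.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_trials):
        pool = list(elements)
        rng.shuffle(pool)
        b_size = rng.randint(2, len(pool) - 1)
        B = set(pool[:b_size])
        A = set(rng.sample(list(B), rng.randint(1, b_size - 1)))
        x = rng.choice([e for e in pool if e not in B])
        gain_A = f(A | {x}) - f(A)
        gain_B = f(B | {x}) - f(B)
        # Submodularity predicts gain_A >= gain_B, so diffs should skew positive.
        diffs.append(gain_A - gain_B)
    return diffs
```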
UNIPROMPT
The whole process of UNIPROMPT is divided into the following steps
- Fine-tune Llama 2-13B so the model can directly generate a structured initial prompt
Here the paper uses GPT-4 to construct samples: given a task description (using instructions from the tasksource sample set) and a section description such as background, description, or requirements, GPT-4 generates the content of that section; these samples are then used to fine-tune Llama2-13B.
### Instruction:
You are a prompt engineer, you have to write a structured prompt.
For the given task description, examples and section description,
write the contents of the section that align with section description.
### Task Description:
{data_point['task_description']}
### Section Description:
{data_point['section']}: {section_descriptions[data_point['section']]}
### Response:
{data_point['prompt']}
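The template above can be rendered into a fine-tuning example with something like the following; `section_descriptions` and the `data_point` fields mirror the placeholders in the template, and the example descriptions are illustrative.

```python
# Sketch: render one fine-tuning example from the template above.
# section_descriptions and the data_point fields mirror the template placeholders.

section_descriptions = {
    "background": "Domain knowledge the model needs before attempting the task.",
    "requirements": "Hard constraints the answer must satisfy.",
}

def render_example(data_point: dict) -> str:
    return (
        "### Instruction:\n"
        "You are a prompt engineer, you have to write a structured prompt.\n"
        "For the given task description, examples and section description,\n"
        "write the contents of the section that align with section description.\n"
        f"### Task Description:\n{data_point['task_description']}\n"
        f"### Section Description:\n{data_point['section']}: "
        f"{section_descriptions[data_point['section']]}\n"
        f"### Response:\n{data_point['prompt']}"
    )
```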
Given a task description and a section description, the fine-tuned Llama2 generates a prompt for that section. Below is a prompt generated from the background angle. For initialization, 10 model-generated prompts are sampled and the one with the best result on the validation set is selected (sketched after the example below).
Task: glue qnli
Task Description: With no explanation, label A to B with either entailment or not entailment
Section: background
Prompt:
1. Entailment means that the information in statement B can be inferred directly from statement A.
2. Not entailment means that the information in statement B cannot be inferred directly from statement A or is unrelated.
3. Understanding the context and relationship between the two statements is crucial for accurate classification.
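Initialization can then be sketched as sampling several candidate prompts from the fine-tuned model and keeping the one that scores best on the validation set; `generate_section_prompt` and `eval_on_validation` are hypothetical stand-ins for the fine-tuned generator and validation-set scoring.

```python
def init_prompt(task_description, section, generate_section_prompt, eval_on_validation, k=10):
    """Sample k candidate section prompts from the fine-tuned generator and keep the best one."""
    candidates = [generate_section_prompt(task_description, section) for _ in range(k)]
    scored = [(eval_on_validation(p), p) for p in candidates]
    best_score, best_prompt = max(scored, key=lambda t: t[0])
    return best_prompt, best_score
```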
- Sample Clustering
With the initial prompt in hand, the next step is iterative optimization. To avoid the sample bias introduced by earlier work that optimizes on a single sample, samples are drawn randomly for optimization; the paper clusters the samples and treats the task representation within each cluster as similar. It uses an LLM prompt to label each question with a topic, then forms clusters by label. One reason not to use cosine similarity, I suspect, is that semantic similarity and task-representation similarity are not the same thing, so the clustering here probably needs to be examined case by case and adjusted for different tasks and outputs.
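A sketch of the clustering step: ask an LLM for a short topic label per question and group by label. `label_topic` is a hypothetical wrapper around that labelling prompt.

```python
from collections import defaultdict

def cluster_by_topic(questions, label_topic):
    """Group questions by an LLM-assigned topic label (label_topic is a hypothetical wrapper)."""
    clusters = defaultdict(list)
    for q in questions:
        clusters[label_topic(q)].append(q)
    return dict(clusters)
```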
- 2-stage feedback generation
Based on the clustered samples, each cluster is further split into mini-batches (3-5 samples), and on each mini-batch GPT-4 generates feedback based on the model's predictions. The feedback from each mini-batch is then summarized at batch granularity (5-7 samples) to extract commonalities, directly producing concrete add/delete/modify suggestions for specific sections. The two-stage design here is similar in spirit to gradient accumulation: we want to prevent individual samples, or even individual mini-batches, from turning prompt iteration into optimizing for idiosyncrasies rather than commonalities (anyone who has tried LLM-based prompt optimization will find the model is very easily biased, so smoothing and extracting commonalities matter a lot).
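The two-stage structure can be sketched like this, with `minibatch_feedback` and `summarize_feedbacks` as hypothetical wrappers around the two GPT-4 prompts shown below, and examples assumed to be dicts carrying an "answer" field.

```python
def two_stage_feedback(cluster_examples, predict, minibatch_feedback, summarize_feedbacks,
                       minibatch_size=4):
    """Stage 1: per-mini-batch feedback on wrongly predicted samples.
    Stage 2: summarize all mini-batch feedback into a few concrete section edits."""
    feedbacks = []
    for i in range(0, len(cluster_examples), minibatch_size):
        batch = cluster_examples[i:i + minibatch_size]
        wrong = [ex for ex in batch if predict(ex) != ex["answer"]]
        if wrong:
            feedbacks.append(minibatch_feedback(wrong))
    # Batch-level aggregation smooths out sample-specific noise,
    # similar in spirit to gradient accumulation.
    return summarize_feedbacks(feedbacks)
```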
The following are the feedback prompt at mini-batch granularity and the summary prompt at batch granularity, respectively
You are a teacher and you have to give feedback to your students on their answers.
You are teaching how to solve math problems to your students.
You are given a question, it’s true answer and answer given by student.
You are also given the explanations written by your students while solving the questions.
The questions are answered wrong by the students.
You have to tell why the solution is wrong and what information can be added to the Background Knowledge part that would have helped the student to write better explanations.
## IMPORTANT: You are also given a history of changes you made to the background knowledge part and the change in student’s accuracy after making the change. You have to use this history to make your feedback.
Be explicit and tell the exact information that can be added without further modification / addition.
### IMPORTANT: Give feedback in form of instructions like add a section, add a subsection, set the content of a section, set the content of a subsection, delete a section or delete a subsection in the background knowledge part. Give very granular feedbacks, like if the student has made a mistake in the calculation, then tell what is the mistake in the calculation and how to correct it, if the student has made a mistake in the concept, then tell what is the mistake in the concept and how to correct it.
## Background Knowledge
{current_prompt}
## History
{history_string}
Now, it is your turn to give feedbacks to the students.
You can only provide a one line feedback.
You are given a set of feedbacks for some problems. The feedbacks for each problem are separated by ===========. You have to summarize the feedbacks into a final feedback. You are also given a set of wrong questions.
You need to tell which edit can be applied to aid the student in solving the wrong question.
To achieve your task, try to follow the following steps;
1. Identify the general problem that is being solved by all the feedbacks.
2. Once you have identified the problem, try to make a new feedback that covers most of the feedbacks. Let's say the problem in the first feedback is the absence of methods to solve linear equations, and in the second feedback it is the method to invert a matrix. We know that both of these problems can be addressed by adding how to convert a matrix into row reduced echelon form. So, add that.
3. Try and validate your feedback. Once you have a feedback, try to see if it covers every feedback; if it does not cover some feedback, add that to your new feedback.
4. See the wrong questions and try to identify what the problem is. If the problem is not covered by your feedback, add that to your feedback.
5. You can add specifics like examples, definitions etc. Make sure that the feedback is enough to be directly added without any modification.
You may use the following function templates
add_section(section_name)
add_subsection(section_name, subsection_name)
set_section_content(section_name, new_content)
set_subsection_content(section_name, subsection_name, new_content)
delete_section(section_name)
delete_subsection(section_name, subsection_name)
Your summary cannot include more than four functions. Make sure that the content is useful, not just a very general statement. Something specific.
Instructions:
{edits}
Wrong Questions:
{wrong_examples_string}
Summary:
- Prompt editing and optimization based on feedback
Based on the feedback and edit operations obtained above, the paper uses the following instruction to let the model edit the prompt. Only edited prompts (greedy) that improve the validation-set score are kept, and the best few optimized prompts are maintained at each step (analogous to a beam of size 2); iteration stops after 5 consecutive rounds with no improvement on the validation set (a sketch of this outer loop is given after the editing prompt below).
You are given an input prompt and a feedback, you have to incorporate the feedback into the input prompt and output the final prompt.
An example of the task is given below
### Input Prompt
Introduction: In this task you have to answer the given question.
### Feedback
The background knowledge is incomplete, it does not include what are the factors that affect the water usage and how many water sources are there.
\\add_subsection("Background Knowledge")
\\add_subsection_content(water usage depends on the population, climate, economic development, and availability of water sources. There are two sources of water, surface water and groundwater.)
### Final Prompt
Introduction: In this task you have to answer the given question.
Background Knowledge: water usage depends on the population, climate, economic development, and availability of water sources. There are two sources of water, surface water and groundwater.
Only output the final prompt nothing else.
### INPUT PROMPT
{current_prompt}
### FEEDBACK
{edits}
### FINAL PROMPT
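Putting the pieces together, here is a rough sketch of the outer optimization loop. `get_feedback`, `apply_edits`, and `eval_on_validation` are hypothetical wrappers around the two-stage feedback, the editing prompt above, and validation-set scoring; the beam size and patience mirror the values described in the text.

```python
def optimize_prompt(initial_prompt, get_feedback, apply_edits, eval_on_validation,
                    beam_size=2, patience=5, max_rounds=50):
    """Greedy prompt editing with a small beam and early stopping.

    Each round: edit every beam member using summarized feedback, keep an edited
    prompt only if it improves the validation score, retain the top beam_size
    prompts, and stop after `patience` rounds without any improvement.
    """
    beam = [(eval_on_validation(initial_prompt), initial_prompt)]
    best_score = beam[0][0]
    rounds_without_improvement = 0
    for _ in range(max_rounds):
        candidates = list(beam)
        for score, prompt in beam:
            feedback = get_feedback(prompt)               # two-stage feedback summary
            new_prompt = apply_edits(prompt, feedback)    # editing prompt shown above
            new_score = eval_on_validation(new_prompt)
            if new_score > score:                         # greedy: keep improvements only
                candidates.append((new_score, new_prompt))
        beam = sorted(candidates, key=lambda t: t[0], reverse=True)[:beam_size]
        if beam[0][0] > best_score:
            best_score = beam[0][0]
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break
    return beam[0][1]
```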
The paper compares its results with previous algorithms such as OPRO and ProTeGi, showing fairly significant improvements on multiple datasets.