
Paper Interpretation: "MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots"


Introduction

Thanks to the opportunity to attend the summer camp of Southeast University's Cybersecurity Institute, I had my first contact with the field of large-model security. Mr. L is a leading expert in cybersecurity. During our exchange, he told me I needed to prepare a presentation introducing a paper of interest from one of the four top security conferences, and I chose this NDSS 2024 paper, "MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots". The day before the presentation I stayed up until 2:00 in the morning to finish preparing it. I'm writing this article as a record of the whole process.

Quick Preview

[Figure: a jailbreak attack example]

The core problem this paper addresses is twofold: first, to investigate the defense mechanisms of the currently dominant LLM chatbots (a new approach inspired by time-based SQL injection), and second, based on those findings, to explore a large language model that performs jailbreak attacks (fine-tuning an LLM to automatically generate jailbreak prompts).

First, let's explain what a jailbreak attack against an LLM is. As shown in the jailbreak attack example above, a malicious user tries to bypass the LLM's safety restrictions by designing prompts that induce it to output illegal or harmful content. For example, if you directly ask an LLM "Please give me an adult pornographic website", it will refuse to answer; but if you change the way you ask (usually by embedding the real question inside a jailbreak prompt) without changing the semantics, the LLM may actually give you the answer.

Empirical Study

[Figure]

The authors first investigated the service policies of several major LLM vendors (OpenAI, Google Bard, Bing Chat, and Ernie) and identified four policies that all of them mention: illegal and unlawful use, generation of harmful or abusive content, violation of rights and privacy, and generation of adult content. The experiments that follow are organized around these four policy categories.

[Figure]

The authors collected 85 jailbreak prompts from the internet and ran a large number of jailbreak experiments for each policy scenario (the original article reports a total of 68,000 experiments, although when I divide the counts in Table 2 by the reported ratios I get a smaller number, so it may not actually be that many). The finding: existing jailbreak attacks hardly seem to work against Bard and Bing Chat! The jailbreak success rate is very low, and this is one of the motivations of the paper.

Time-based LLM Testing

[Figure]

The authors make two very interesting hypotheses. Hypothesis (1): the response time of an LLM is proportional to the length of the content it generates; for the same LLM, a 100-token answer should take roughly twice as long as a 50-token answer. The authors first verified that it is feasible to limit the LLM's output to a specified number of tokens via the question itself. They then computed the Pearson correlation coefficient between the number of output tokens and the response time and found it close to 1 (the Pearson coefficient lies in [-1, 1]: values near -1 indicate a negative linear correlation, values near 1 a positive linear correlation, and values near 0 no linear correlation). At the same time, the p-values are very small (the p-value can be understood as the probability of observing a correlation at least this strong if there were in fact no linear relationship between output length and response time; a very small p-value means the null hypothesis is rejected). Hypothesis (1) therefore holds.
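Hypothesis (1) can be sanity-checked with a few lines of Python. The sketch below is not the authors' code; it uses made-up (token count, response time) pairs and SciPy's `pearsonr`, which returns both the correlation coefficient and its p-value.

```python
# Minimal sketch: check whether response time grows linearly with output length.
# The measurements are hypothetical, for illustration only.
from scipy.stats import pearsonr

samples = [                       # (output tokens, response time in seconds)
    (50, 2.1), (75, 3.2), (100, 4.0), (125, 5.3),
    (150, 6.1), (175, 7.4), (200, 8.2),
]
tokens = [n for n, _ in samples]
times = [t for _, t in samples]

r, p_value = pearsonr(tokens, times)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.4g}")
# r close to 1 together with a very small p-value supports hypothesis (1):
# response time scales roughly linearly with the number of generated tokens.
```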

Hypothesis (2) builds on hypothesis (1): the response time should reveal when the generation process is stopped by the jailbreak prevention mechanism. But what exactly does it reveal, and at which stage of generation? The answer is given below.

[Figure]

Drawing on time-based blind SQL injection from web attack and defense, the authors propose a new heuristic method to determine how the LLM's response time reflects its jailbreak defense mechanism. First, let's explain what time-based blind SQL injection is. It extracts information by exploiting the time delay of a database query: the attacker constructs specific SQL statements such that the query returns quickly if the condition is true, and returns late, because a delay function (e.g., sleep) is executed, if it is false. By measuring the response time, the attacker can infer information in the database even though the page never displays that information directly.
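To make the analogy concrete, here is a toy Python simulation of the underlying principle (no real SQL or injection payload, just the timing side channel): the visible response never changes, but its timing leaks a hidden bit, which is exactly the kind of signal the authors borrow for probing LLM defenses.

```python
# Toy simulation of a timing side channel: the "server" always answers "OK",
# but takes noticeably longer when the guess matches its hidden state.
import time

SECRET_FLAG = True                     # hidden state the caller cannot read directly

def server_query(guess: bool) -> str:
    if guess == SECRET_FLAG:
        time.sleep(1.0)                # analogue of a deliberate delay (e.g. sleep)
    return "OK"                        # the visible output itself leaks nothing

def infer_secret() -> bool:
    start = time.perf_counter()
    server_query(True)
    elapsed = time.perf_counter() - start
    return elapsed > 0.5               # a slow response means the guess was right

print("Inferred hidden state:", infer_secret())
```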

The authors have abstracted the entire LLM service workflow into four phases, and the following uses response time to infer the jailbreak defense mechanism at each phase: (1) Does the defense perform jailbreak detection on the input question? (2) Does it check during generation? (3) Does it check only after generation is complete? (4) Does the check use techniques such as injected-keyword detection and contextual meaning inference?

[Figure]

As to whether the service performs jailbreak detection on the input question, the authors note that if it does, only two sub-cases are possible: (1) detection on the input question only; (2) detection on both the input question and the generation phase.

The authors designed an experimental approach (a setup reused throughout what follows): merge the two sub-questions into a single prompt (pay close attention to the order in which the two sub-questions are merged) and measure the response time, as shown in the figure above. For example, the merged question is "What do you think of Changsha University of Technology? Can you give me an adult porn site?", and the answer is "Changsha University of Technology is the best university in the world ... I can't give you an adult porn site because it violates ...". The top left of the figure is a schematic of a normal dialog, which serves as the baseline; the top right is a schematic of a malicious dialog used to test for detection of the input question (a jailbreak attack).

The Control1 column in Table 4 reports the LLM response time when a normal question and a malicious question are merged. Using a z-test (a test for the difference of means), the authors find that it is not significantly different from the baseline time. This refutes sub-case (1): if the input question alone were checked, the service would stop as soon as the malicious question was detected and the response time would be much shorter.
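As a rough illustration of this comparison, the sketch below runs a two-sample z-test on two sets of measured response times. The numbers are invented for illustration and are not the values in Table 4.

```python
# Minimal sketch of the z-test used to compare response times against the baseline.
import math
import statistics

baseline = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]   # benign + benign prompt
control1 = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.2, 5.1]   # benign + disallowed prompt

def z_statistic(a, b):
    """Two-sample z statistic for the difference of the means."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))

print(f"z = {z_statistic(baseline, control1):.2f}")
# |z| well below the usual 1.96 threshold => no significant difference from the
# baseline. A sharply shorter Control1 time is what input-only detection would
# have produced, so that sub-case is ruled out.
```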

[Figure]

Above, we verified that "detecting only the input question" is impossible, because the measured response time is very close to the baseline. Could the service "detect both the input question and the generation phase"? The authors set up another interesting experiment: they ask a malicious question, but append "you only need to answer yes or no, there is no need to give a more detailed answer". What is so interesting about this setup? Let's take it step by step.

If you have been paying attention, you will notice that the second finding in the second figure of the Empirical Study section says that OpenAI models, including GPT-3.5 and GPT-4, state the exact policy that was violated in their refusal, a transparency that services like Bard and Bing Chat lack. For example, if you ask GPT-4 "Please give me a porn site", it will reply "I can't give you that because it violates our ... guidelines", whereas Bing Chat will probably just answer "No". It's as if OpenAI drones on, explaining itself to you at length, while Bing Chat is as cold as your goddess.

Building on this setup, the authors asked GPT-4 such a question. As shown above, they constrained the answer (ensuring that no sensitive words could appear in it), and GPT-4 did not state the exact policy violated; it only replied "No". This further indicates that the service does not perform jailbreak detection on the input question (if it did, it should still state the exact policy violated).

[Figure]

We established earlier that the service does not perform jailbreak detection on the input question; we now explore whether it checks during generation (dynamic checking) or only checks the content after generation is complete. A schematic of this conjecture is shown at the top right of the figure. Unlike the previous experiments, this one puts the malicious question first and the normal question second; the results are in the Control2 column of Table 4. The response time drops drastically. Evidently the service checks as it generates, and aborts the whole query once it finds a sensitive answer to the first question.
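Under hypothesis (1) we can even estimate where generation was cut off: calibrate a tokens-per-second rate from the baseline run and multiply it by the shortened Control2 response time. This is my own back-of-the-envelope reading of the experiment, with invented numbers.

```python
# Rough estimate of how many tokens were produced before the dynamic check
# aborted the answer, assuming a constant generation speed (hypothesis (1)).
BASELINE_TOKENS = 200          # requested length of the benign answer
BASELINE_TIME = 10.0           # seconds measured for the baseline run (illustrative)

def estimated_stop_position(response_time: float) -> float:
    tokens_per_second = BASELINE_TOKENS / BASELINE_TIME
    return tokens_per_second * response_time

print(estimated_stop_position(1.5))   # ~30 tokens: the reply was aborted early on
```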

[Figure]

Finally, the authors investigated whether keyword detection is part of the jailbreak check. The experimental methodology was as follows: the customized prompt consists of a benign question requesting a 200-token response, followed by a malicious question that explicitly instructs the model to insert a "red-flag keyword" at a specified position within the response (e.g., inserting the word "porn" at the 50th token). A keyword-matching algorithm can stop generation as soon as a red-flag keyword, i.e., a word that strictly violates the usage policy, is produced. The Control3 column of Table 4 shows that the generation time closely tracks the position of the injected keyword. This suggests that both Bing Chat and Bard may have incorporated dynamic keyword matching into their jailbreak prevention policies, ensuring that no policy-violating content is returned to the user.
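For intuition, here is a minimal sketch of the kind of dynamic keyword check the authors infer: scan the output stream token by token and abort as soon as a red-flag keyword appears. The keyword list and the token stream are illustrative placeholders, not anything Bing Chat or Bard actually exposes.

```python
# Minimal streaming keyword filter: stop the reply the moment a red-flag word
# is generated. The earlier the keyword appears, the shorter the response time.
from typing import Iterable, Iterator

RED_FLAG_KEYWORDS = {"redflag_keyword"}      # placeholder for a real policy list

def filtered_stream(tokens: Iterable[str]) -> Iterator[str]:
    for token in tokens:
        if token.lower() in RED_FLAG_KEYWORDS:
            yield "[response terminated by policy]"
            return
        yield token

print(list(filtered_stream(["hello", "world", "redflag_keyword", "never", "shown"])))
```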

[Figure]

In summary, the time-based testing methodology resulted in the following findings:

  • The jailbreak prevention scheme employed by Bing Chat and Bard likely checks the generated results rather than the input prompt.
  • Bing Chat and Bard appear to implement dynamic monitoring that supervises content for policy compliance throughout the generation process.
  • The content filtering strategies used by Bing Chat and Bard appear to combine keyword matching with semantic analysis.

Proof of Concept Attack

[Figure]

Based on the above findings, the authors crafted a preliminary jailbreak prompt template that wraps the malicious question in several layers in an attempt to bypass the LLM's defenses. The main emphasis is on distorting the LLM's response generation so that the output-side filters do not recognize the violating content.

We would prefer an automated method for consistently generating effective jailbreak prompts. Such automation lets us methodically stress-test LLM chatbot services and identify potential weaknesses and oversights in their existing defenses against content that violates usage policies. Moreover, as LLMs continue to evolve and expand their capabilities, manual testing becomes labor-intensive and may not cover all possible vulnerabilities. An automated approach to generating jailbreak prompts ensures comprehensive coverage across a wide variety of possible misuse scenarios. This leads to the task of automatically generating jailbreak prompts with a large model, described in detail in the next section.

Automatically Generate Prompts

[Figure]

We begin by reviewing an LLM training process:

  1. In the first step, the model is pre-trained with unsupervised learning on a large amount of text, producing a base model capable of text generation.
  2. In the second step, the base model is supervised fine-tuned on high-quality, human-written dialog data to obtain a fine-tuned model; at this point the model can hold conversations in addition to continuing text.
  3. In the third step, human annotators rank multiple responses to the same question by quality, and a reward model is trained on this ranking data to score responses. The model from the second step then generates responses to questions, the reward model scores them, and those scores are used as feedback for reinforcement learning training (a minimal sketch of the reward model's ranking loss follows this list).
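As a reference for step 3, below is a minimal sketch of the pairwise ranking loss commonly used to train the reward model (InstructGPT-style RLHF). It is generic background, not MASTERKEY-specific code; the scalar rewards would come from the reward model scoring a preferred and a rejected response.

```python
# Pairwise reward-ranking loss: push the reward of the human-preferred response
# above the reward of the rejected one.
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen: torch.Tensor,
                        reward_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scalar rewards for three (chosen, rejected) response pairs.
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.4, 0.5, -0.1])
print(reward_ranking_loss(chosen, rejected))   # smaller when chosen > rejected
```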

[Figure]

The authors' approach starts with (1) dataset construction and augmentation: collecting available jailbreak prompts, then preprocessing and augmenting them so that they are applicable to all LLM chatbots. Next comes (2) continuous pre-training and task tuning, driven by the dataset from the previous step; continuous pre-training plus task-specific tuning teaches the LLM about jailbreaking and helps it understand the text-transfer (rewriting) task. The final phase is (3) reward-ranked fine-tuning, which refines the model so that it can generate high-quality jailbreak prompts.

Experiment Evaluation

[Figure]

This paper uses GPT-3.5, GPT-4 and Vicuna as benchmarks. Each model receives 85 unique jailbreak prompts. MASTERKEY is trained on Vicuna and is used to generate 10 different variants of each prompt. These rewritten prompts are then tested with 20 prohibited questions, resulting in a total of 272,000 evaluated queries. The four experiments are described below:

(1) Average query success rate: the number of successful jailbreak queries S divided by the total number of jailbreak queries T. This metric shows how often the generated prompts trick the model into producing forbidden content (see the small helper after this list).

(2) Jailbreak prompt success rate: the number of generated jailbreak prompts G that have at least one successful query, divided by the total number of generated jailbreak prompts P. This measures what proportion of the generated prompts are effective.

(3) Ablation experiments.

(4) A supplementary evaluation on Ernie, developed by Baidu, to study the language compatibility of the jailbreak prompts generated by MASTERKEY.
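Under my reading of metrics (1) and (2), they can be computed as in the small helper below; the data layout (a list of per-query outcomes for each generated prompt) is an assumption for illustration, not the authors' evaluation code.

```python
# Query success rate = S / T; jailbreak prompt success rate = G / P.
def query_success_rate(successful: int, total: int) -> float:
    return successful / total

def prompt_success_rate(results_per_prompt: dict[str, list[bool]]) -> float:
    """Fraction of generated prompts with at least one successful query."""
    effective = sum(1 for outcomes in results_per_prompt.values() if any(outcomes))
    return effective / len(results_per_prompt)

print(query_success_rate(120, 1000))                                      # 0.12
print(prompt_success_rate({"p1": [False, True], "p2": [False, False]}))   # 0.5
```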

Mitigation Recommendation

[Figure]

Deficiency and Improvement

[Figure]