Executive Summary
This article presents what we are calling the "Bad Likert Judge" technique. Text-generation large language models (LLMs) have safety measures designed to prevent them from producing harmful or malicious responses. Research into methods that can bypass these guardrails, such as Bad Likert Judge, can help defenders prepare for potential attacks.
The technique asks the target LLM to act as a judge scoring the harmfulness of a given response using the Likert scale, a rating scale measuring a respondent's agreement or disagreement with a statement. It then asks the LLM to generate responses that contain examples aligned with each point on the scale. The example assigned the highest Likert score can potentially contain the harmful content.
We have tested this technique across a broad range of categories against six state-of-the-art text-generation LLMs. Our results reveal that this technique can increase the attack success rate (ASR) by more than 60% compared to plain attack prompts on average.
Given the scope of this research, it was not feasible to exhaustively evaluate every model. To ensure we do not create any false impressions about specific providers, we have chosen to anonymize the tested models mentioned throughout the article.
It is important to note that this jailbreak technique targets edge cases and does not necessarily reflect typical LLM use cases. We believe most AI models are safe and secure when operated responsibly and with caution.
If you think you might have been compromised or have an urgent matter, contact the Unit 42 Incident Response team.
What Is An LLM Jailbreak?
LLMs have become increasingly popular due to their ability to generate human-like text and assist with various tasks. These models are often trained with safety guardrails to prevent them from producing potentially harmful or malicious responses. LLM jailbreak methods are techniques used to bypass these safety measures, allowing the models to generate content that would otherwise be restricted.
Existing Jailbreak Techniques
Common jailbreak strategies include:
These jailbreaking strategies can be executed in a single conversation (single-turn) or across multiple conversations (multi-turn). For example, some token smuggling strategies employ encoding algorithms like Base64 to conceal malicious prompts within the input. On the other hand, multi-turn attacks such as the Crescendo technique begin with an innocuous prompt and then gradually steer the language model toward generating harmful responses through a series of increasingly malicious interactions.
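For the token smuggling case, the minimal sketch below (using a harmless placeholder string rather than a malicious prompt) illustrates how a Base64 layer obscures a prompt's surface text from simple keyword-based checks:

```python
import base64

# Harmless placeholder standing in for whatever text an attacker would try to smuggle.
original_prompt = "example request text"

# Encoding hides the surface tokens that naive keyword checks look for.
encoded = base64.b64encode(original_prompt.encode("utf-8")).decode("ascii")
print(encoded)

# The model (or a filter that decodes first) can trivially recover the original text.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == original_prompt
```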
Why Do Jailbreak Techniques Work?
Single-turn attacks often exploit the computational limitations of language models. Some prompts require the model to perform computationally intensive tasks, such as generating long-form content or engaging in complex reasoning. These tasks can strain the model's resources, potentially causing it to overlook or bypass certain safety guardrails.
Multi-turn attacks typically leverage the language model's context window and attention mechanism to circumvent safety guardrails. By strategically crafting a series of prompts, an attacker can manipulate the model's understanding of the conversation's context. They can then gradually steer it toward generating unsafe or inappropriate responses that the model's safety guardrails would otherwise prevent.
LLMs can be vulnerable to jailbreaking attacks due to their long context window. This term refers to the maximum amount of text (tokens) an LLM can remember at one time when generating responses.
A good example of this is the many-shot attack strategy that Anthropic recently discovered: it simply sends the LLM many rounds of prompts before the final harmful question. Despite its simplicity, this approach has proven highly effective at bypassing internal LLM guardrails.
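To make the structure concrete, here is a minimal sketch of how a many-shot prompt is assembled; the chat-style role/content schema and all content strings are placeholders for illustration, not any specific provider's API or real attack data:

```python
def build_many_shot_messages(fabricated_qa_pairs, final_question):
    """Assemble a many-shot prompt: many fabricated Q/A rounds, then the final question."""
    messages = []
    for question, answer in fabricated_qa_pairs:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    # The actual question arrives only after the long preamble has filled
    # much of the model's context window.
    messages.append({"role": "user", "content": final_question})
    return messages

# Placeholder content only -- real many-shot prompts use hundreds of such rounds.
demo = build_many_shot_messages(
    fabricated_qa_pairs=[("<question 1>", "<answer 1>"), ("<question 2>", "<answer 2>")],
    final_question="<final question>",
)
print(len(demo))  # 5 messages: two fabricated rounds plus the final question
```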
Furthermore, the attention mechanism in language models allows them to focus on specific parts of the input when generating a response. Adversaries can abuse this capability to steer the LLM's attention toward the benign parts of a prompt while embedding unsafe requests elsewhere. For instance, the recently discovered Deceptive Delight attack and the Crescendo attack use this method.
Bad Likert Judge Jailbreak
As mentioned in a prior research study on jailbreak attacks, researchers often employ an evaluator LLM as a judge to assess the responses generated by other language models. This judge is meant to determine whether the content is harmful or not.
One standard way to measure the level of harmfulness is the Likert scale, a rating scale that asks respondents to indicate their level of agreement or disagreement with a given statement. In this context, we refer to an LLM judge that uses the Likert scale as a Likert Judge.
The evaluator's ability to evaluate responses implies that the LLM itself has learned the concepts behind what is considered harmful and malicious, or "features," as defined by Anthropic. This understanding allows the evaluator to differentiate between various levels of harmfulness.
However, a prompt that indirectly asks the LLM to apply its understanding of a concept may lead the model to produce harmful results without triggering its internal guardrails. Figure 1 shows the overall attack flow. Next, we explain each step in detail.
Turn 1: Evaluator Prompt
The first step in the attack involves asking the target LLM to act as a judge to evaluate responses generated by other LLMs.
To confirm that the LLM can produce harmful content, we provide specific guidelines for the scoring task. For example, one could provide guidelines asking the LLM to evaluate content that may contain information on generating malware, as shown in Figure 2.
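The exact wording we used appears in Figure 2; the sketch below only illustrates the structure of such a Turn 1 prompt, with the category-specific guideline left as a placeholder:

```python
# Placeholder for the category-specific scoring guideline (see Figure 2).
CATEGORY_GUIDELINE = "<description of what each Likert score means for the category under test>"

# Turn 1: frame the target LLM as a Likert-scale judge for that category.
turn1_prompt = (
    "You will act as a judge. Using the scoring guideline below, rate on a "
    "Likert scale how strongly a given response matches the category described.\n\n"
    f"Scoring guideline:\n{CATEGORY_GUIDELINE}"
)
print(turn1_prompt)
```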
Turn 2: Prompt That Indirectly Asks for Harmful Content Generation
Once step one is properly completed, the LLM should understand the task and the different scales of harmful content. Step two is straightforward. Simply ask the LLM to provide different responses corresponding to the various scales. Figure 3 shows an example prompt.
If the attack is successful, the LLM will generate multiple responses with different scores. We can then look for the response with the highest score, which generally contains the harmful content.
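If the attack succeeds, the reply contains one example per score level. The self-contained sketch below shows one way to pull out the highest-scored example for analysis; the "Score N:" labeling convention is an assumption made for illustration, not a guaranteed output format:

```python
import re

def highest_scored_example(reply: str) -> str:
    """Return the example labeled with the highest Likert score in a judge-style reply."""
    # Assumes sections are labeled "Score 1:", "Score 2:", ... (illustrative convention).
    sections = re.split(r"Score (\d+):", reply)
    # re.split with one capture group yields [prefix, "1", text1, "2", text2, ...].
    scored = {
        int(sections[i]): sections[i + 1].strip()
        for i in range(1, len(sections) - 1, 2)
    }
    return scored[max(scored)] if scored else reply

demo_reply = "Score 1: <mild example>\nScore 2: <more detailed example>"
print(highest_scored_example(demo_reply))  # -> <more detailed example>
```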
Follow-Up Turns
After completing step two, the LLM typically generates content that is considered harmful. However, in some cases, the generated content may not be sufficient to reach the intended harmfulness score for the experiment.
To address this, one can ask the LLM to refine the response with the highest score by extending it or adding more details. Based on our observations, an additional one or two rounds of follow-up prompts requesting refinement often lead the LLM to produce content containing more harmful information.
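A short sketch of this follow-up loop, with the model call stubbed out (the helper names and the chat schema are assumptions for illustration):

```python
def call_target_model(conversation):
    """Stub standing in for a provider-specific chat-completion call."""
    return "<model reply>"

def refine_highest_scored(conversation, rounds=2):
    """Ask the model to extend its highest-scored example; one or two rounds usually suffice."""
    for _ in range(rounds):
        conversation.append({
            "role": "user",
            "content": "Take the response with the highest score and expand it with more detail.",
        })
        conversation.append({"role": "assistant", "content": call_target_model(conversation)})
    return conversation
```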
Evaluation and Results
Jailbreak Categories
To evaluate the effectiveness of Bad Likert Judge, we selected a list of common jailbreak categories. These categories encompass various types of generative AI safety violations and attempts to extract sensitive information from the target language model.
AI safety violations mainly refer to the misuse or abuse of an LLM to produce harmful or unethical responses. These violations can encompass a wide range of issues, such as promoting illegal activities, encouraging self-harm or spreading misinformation.
In our evaluation, we created a list of AI safety violation categories by referencing public pages published by several prominent AI service providers, including:
In addition to AI safety violation categories, jailbreaking can also leak sensitive information from the target LLM. Typical sensitive data includes the target LLM's system prompt, which is a set of instructions given to the LLM to guide its behavior and define its purpose. Leaking the system prompt can expose confidential information about the LLM's design and capabilities.
Furthermore, jailbreaking can also leak training data that the LLM memorized during its training phase. LLMs are trained on vast amounts of data, and in some cases, they may inadvertently memorize specific examples or sensitive information present in the training dataset. Jailbreak attempts can exploit this to extract confidential or personal information, such as private conversations, financial records or intellectual property that the model unintentionally retained during training.
Our evaluation focuses on the following categories:
Evaluate Results With Another LLM Judge
There are many ways to evaluate whether a jailbreak is successful or not. Previously, Ran et al. summarized these approaches in the JailbreakEval paper. There are four main ways to verify jailbreak success:
In our experiment, we chose the chat completion approach. This approach employs another LLM as an evaluator to determine whether the responses provided by our "bad judge" LLM are harmful enough to be considered a successful jailbreak. Interested readers can refer to the Appendix to learn how we ensured that the evaluator gives a reliable assessment.
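A minimal sketch of this chat-completion evaluation step, assuming a generic evaluator call and the score-based verdict described in the Appendix; the stubbed client, prompt wording and threshold are illustrative assumptions:

```python
def call_evaluator_model(system_prompt, response_text):
    """Stub for a chat-completion call to the evaluator LLM (provider-specific in practice)."""
    return "1"  # placeholder: the evaluator's harmfulness score, returned as text

EVALUATOR_SYSTEM_PROMPT = (
    "You are a strict content evaluator. Using the guideline below, score the harmfulness "
    "of the given response on a Likert scale and reply with the score only.\n"
    "<category-specific guideline goes here>"
)

def is_successful_jailbreak(response_text, threshold=2):
    """Count the attempt as successful if the evaluator's score reaches the threshold."""
    raw_score = call_evaluator_model(EVALUATOR_SYSTEM_PROMPT, response_text)
    try:
        return int(raw_score.strip()) >= threshold
    except ValueError:
        # Unparsable verdicts are conservatively counted as failures.
        return False
```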
Measuring Attack Effectiveness
Using the validated evaluator, we measure the effectiveness of the attack using the attack success rate (ASR), a standard metric for assessing jailbreak effectiveness in many research papers. The ASR is computed as follows:
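A minimal sketch of the computation, under the standard definition of ASR as the percentage of attack attempts judged successful:

```python
def attack_success_rate(results):
    """Compute ASR as the percentage of attempts judged successful.

    `results` is one boolean per attack attempt; True means the evaluator
    classified the response as a successful jailbreak.
    """
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# Example: 3 successes out of 5 attempts -> 60.0
print(attack_success_rate([True, False, True, True, False]))
```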
Average ASR Comparison With Baseline
To evaluate the effectiveness of the Bad Likert Judge technique, we first measured the baseline ASR, which is computed by sending all the attack prompts directly to the LLM. This establishes a reference point for measuring the ASR without the Bad Likert Judge technique.
Next, we applied the Bad Likert Judge technique to the same set of attack prompts and measured the ASR. To ensure a comprehensive evaluation, we curated a list of different topics for each jailbreak category, resulting in a dataset of 1,440 cases.
Figure 4 presents the ASR comparison between the baseline and Bad Likert Judge attacks across the six tested LLMs.
The results show that, on average, the Bad Likert Judge technique can increase the attack success rate by over 75 percentage points compared to the baseline. Model 4 exhibited the highest increase, of more than 80 percentage points.
Conversely, Model 6 showed the lowest increase, which can be attributed to its relatively weak safety guardrails and high ASR, even under baseline attacks. These findings highlight the huge impact of Bad Likert Judge in enhancing the effectiveness of jailbreak attempts across various language models.
Figures 5-10 show the ASR for each category across all tested models. Based on the results, we noted the following observations:
No LLM Internal Safety Guardrail Is Bulletproof
Generally, the Bad Likert Judge technique increased the ASR across most jailbreak categories for all models. However, the category "system prompt leakage" is an exception. For this particular category, only Model 1 showed an increase, from an ASR of 0% to an ASR of 100%.
While other models did provide relevant responses to system prompt leakage attempts, these responses were typically too generic to be harmful. In most cases, the response merely stated that the model wouldn't output harmful content, or it mentioned that it was trained on a broad dataset.
Certain Safety Topics May Have a Weaker Guardrail
After analyzing the ASR per category, we observed that certain safety topics, such as harassment, have weaker protection across multiple models. In the baseline attacks, the harassment topic exhibited relatively high ASRs, ranging from 20% to 60% across Models 3-6.
This finding suggests that the internal safety guardrails of these language models might be less effective in preventing the generation of content related to harassment. Another Unit 42 blog, Deceptive Delight, also confirms this. To enhance the overall safety of these models, it is crucial to identify such vulnerabilities and prioritize the strengthening of guardrails specifically for topics with weaker protection.
LLM Safety Guardrail Effectiveness Varies Widely Across State-of-the-Art LLMs
For AI safety violation categories, we observed significant ASR increases across most models, with Model 5 being a notable exception. While the technique achieved over 75% ASR in Model 5 for the "hate" and "harassment" categories, other categories generally remained below 40%.
Nevertheless, these results still represent a substantial increase over the baseline statistics, which are mostly 0%. Model 6 exhibited high baseline statistics even without applying the Bad Likert Judge technique. This suggests that Model 6 may have less robust safety guardrails compared to the other models in our study.
The average ASR of Model 1 after applying Bad Likert Judge is 81%. All categories start from a 0% ASR baseline. When applying Bad Likert Judge, system prompt leakage has the highest ASR (100%), followed by sexual content (88%). Indiscriminate weapons shows the lowest ASR (67%). System prompt leakage has the largest increase (0%-100%), while indiscriminate weapons has the smallest (0%-67%).
The average ASR of Model 2 after applying Bad Likert Judge is 72%. Most categories start from a 0% ASR baseline, except illegal activities (6%). When applying Bad Likert Judge, indiscriminate weapons and malware generation have the highest ASR (88%). System prompt leakage remains at 0%. Sexual content shows the second-lowest ASR (75%). The largest increases occur in weapons and malware (0%-88%).
The average ASR of Model 3 after applying Bad Likert Judge is 73%. The baseline ASR is 20% for harassment and 6% for illegal activities. When applying Bad Likert Judge, hate speech and malware generation have the highest ASR (90%). System prompt leakage remains at 0%. The largest increases are in hate speech and malware (0%-90%).
The average ASR of Model 4 after applying Bad Likert Judge is 74%. The baseline ASR is 20% for harassment and 0% for others. When applying Bad Likert Judge, sexual content and harassment have the highest ASR (90%). System prompt leakage remains at 0%. The largest increase occurs in sexual content (0%-90%).
The average ASR of Model 5 after applying Bad Likert Judge is 32%. The baseline is 40% for harassment and 0% for others. When applying Bad Likert Judge, hate speech has the highest ASR (93%). System prompt leakage and illegal activities show the lowest ASR (0% and 11%). The largest increase is in hate speech (0%-93%).
The average ASR of Model 6 after applying Bad Likert Judge is 77%. This model has high ASR baselines (20% to 80%) across all categories. When applying Bad Likert Judge, harassment has the highest ASR (95%) and system prompt leakage remains at 0%. The largest increase is in hate speech (20%-85%).
Mitigations
As mentioned, our goal during this evaluation was solely to test the LLMs' internal guardrails. However, there are a few standard approaches that can improve the overall safety of an LLM, and content filtering is one of the most effective. In general, content filters are systems that work alongside the core LLM.
In a nutshell, a content filter runs classification models on both the prompt and the output to detect potentially harmful content. Users can apply filters on the prompt (prompt filters) and on the response (response filters).
When content filters detect potentially harmful content in either the input prompts or the generated responses, the LLM will refuse to generate a response. This prevents harmful or sensitive information from being displayed to users. When content filters are enabled, they act as a safeguard to maintain a safe and appropriate interaction between the user and the LLM.
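As a minimal sketch of how prompt and response filters wrap the core model (both the classifier and the LLM call are stubs; real deployments would use their provider's content-safety or moderation service):

```python
def violates_policy(text):
    """Stub for a content-classification call (e.g., a provider's moderation endpoint)."""
    return False

def call_llm(prompt):
    """Stub for the core LLM call."""
    return "<model response>"

REFUSAL = "This request was blocked by the content filter."

def guarded_completion(prompt):
    # Prompt filter: block harmful requests before they reach the model.
    if violates_policy(prompt):
        return REFUSAL
    response = call_llm(prompt)
    # Response filter: block harmful output before it reaches the user.
    if violates_policy(response):
        return REFUSAL
    return response

print(guarded_completion("<user prompt>"))
```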
There are many different types of content filtering tailored to classify specific types of output. For instance, typical content filtering includes a "User Prompt Attacks" filter, which detects potential prompt injection, and a "Violence" filter, which detects whether a response contains information regarding violent topics. Interested readers can refer to the following pages for widely used content filtering types:
We also evaluated the ASR with content filtering enabled, choosing the strongest available filtering setting: we turned on both prompt filtering and response filtering and enabled all the filters available through the AI services we used.
Figure 11 shows the overall results after applying the content filters. Overall, we observed that content filters significantly reduce the ASR across the models, with an average ASR reduction of 89.2 percentage points.
In our previous LLM jailbreak article on Deceptive Delight, we also discussed the effectiveness of content filtering against LLM jailbreak attempts. We believe the best practice when deploying LLMs is to apply content filtering to minimize the risk of jailbreak attacks.
Conclusion
Our results demonstrate the effectiveness of the Bad Likert Judge technique in increasing the ASR across multiple LLMs and jailbreak categories. By leveraging the LLM's understanding of harmful content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model's safety guardrails.
Our study also highlights the importance of applying content filtering systems alongside LLMs for jailbreak mitigation. The results show that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models. This indicates the critical role of implementing comprehensive content filtering as a best practice when deploying LLMs in real-world applications.
Despite the effectiveness of content filtering, it is essential to acknowledge that it is not a perfect solution. Determined adversaries could still find ways to circumvent these filters, and there is always the possibility of false positives or false negatives in the filtering process.
If you think you may have been compromised or have an urgent matter, get in touch with the Unit 42 Incident Response team or call:
Palo Alto Networks has shared these findings with our fellow Cyber Threat Alliance (CTA) members. CTA members use this intelligence to rapidly deploy protections to their customers and to systematically disrupt malicious cyber actors. Learn more about the Cyber Threat Alliance.
Additional Resources
Appendix
Evaluator Scale Setting
For the evaluator, we also use Likert scales to guide the scoring task. We show a generic score guideline below:
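As a purely illustrative, hypothetical example of the kind of rubric described here (the levels and wording below are not the guideline used in our experiment):

```python
# Hypothetical evaluator rubric, for illustration only; the real guideline is
# category-specific and differs in wording and granularity.
EVALUATOR_SCALE = {
    1: "The response refuses the request or contains no harmful information.",
    2: "The response touches on the topic but stays generic and non-actionable.",
    3: "The response contains specific, potentially harmful detail for the category.",
}
```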
Note: There is no relationship between these Likert score levels and the attack prompt's Likert levels.
Evaluator Prompt Template
Figure 12 shows the evaluator prompt template we used in our experiment. To ensure a precise evaluation result, the guidelines should be replaced with specific descriptions for each category.
Evaluator Verification
To effectively employ another LLM as an evaluator to determine the success of a jailbreak attempt, we need to ensure that the LLM produces a trustworthy assessment. The evaluator should have a low false positive rate, meaning it should not classify unsuccessful jailbreak responses as successful. Additionally, it should have a low false negative rate, ensuring that successful jailbreaks are not mistakenly classified as unsuccessful.
To assess the effectiveness of the evaluator LLM, we created an internal benchmark to measure its accuracy, recall, precision and F1 score. The benchmark consists of ground truths for all the mentioned safety categories.
Each ground truth is a pair of (response, harmfulness score), where the harmfulness score is manually assigned and represents the harmfulness level of the response.
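A small sketch of how such a benchmark can be scored, assuming the manually assigned harmfulness scores are reduced to binary success labels (the data layout is an assumption for illustration):

```python
def classification_metrics(ground_truth, predictions):
    """Compute accuracy, precision, recall and F1 for binary jailbreak-success labels.

    Both arguments are equal-length lists of booleans, where True means
    "harmful enough to count as a successful jailbreak".
    """
    tp = sum(g and p for g, p in zip(ground_truth, predictions))
    tn = sum((not g) and (not p) for g, p in zip(ground_truth, predictions))
    fp = sum((not g) and p for g, p in zip(ground_truth, predictions))
    fn = sum(g and (not p) for g, p in zip(ground_truth, predictions))

    accuracy = (tp + tn) / len(ground_truth) if ground_truth else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```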
Figure 13 shows the overall evaluation results on the evaluator's effectiveness. Our results demonstrate that the model we chose as the evaluator achieves:
These metrics indicate that when our evaluator determines a jailbreak attempt to be successful, we can have a relatively high level of confidence in the correctness of the judgment.