TY - GEN
T1 - ProAdvPrompter
T2 - 13th International Conference on Learning Representations, ICLR 2025
AU - Di, Hao
AU - He, Tong
AU - Ye, Haishan
AU - Huang, Yinghui
AU - Chang, Xiangyu
AU - Dai, Guang
AU - Tsang, Ivor W.
N1 - Publisher Copyright: © 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
PY - 2025
Y1 - 2025
AB - As large language models (LLMs) are increasingly integrated into real-world applications, identifying their vulnerabilities to jailbreaking attacks becomes essential to ensuring their safety and reliability. Previous studies have developed LLM assistants, known as adversarial prompters, that automatically generate suffixes to manipulate target LLMs into producing harmful and undesirable outputs. However, these approaches often suffer from low performance or generate semantically meaningless prompts, which can be easily identified by perplexity-based defenses. In this paper, we introduce a novel two-stage method, ProAdvPrompter, that significantly improves the performance of adversarial prompters. In the first stage (Exploration), ProAdvPrompter uses loss information to guide the adversarial prompter toward suffixes that are more likely to elicit harmful responses. In the second stage (Exploitation), it iteratively fine-tunes the prompter on high-quality generated adversarial suffixes to further boost performance. Additionally, we incorporate a prompt template to aid the Exploration stage and propose a filtering mechanism to accelerate training in the Exploitation stage. We evaluate ProAdvPrompter against well-aligned LLMs (Llama2-Chat-7B and Llama3-Instruct-8B), achieving attack success rates of 99.68% and 97.12%, respectively, after 10 trials on the AdvBench dataset, improving performance approximately 2-fold over previous work. Moreover, ProAdvPrompter reduces training time by 20% on Llama3-Instruct-8B, generates more generalized adversarial suffixes, and demonstrates resilience against the perplexity defense. An ablation study further evaluates the effects of the key components of ProAdvPrompter (the prompt template and the filtering mechanism).
UR - https://www.scopus.com/pages/publications/105010246278
M3 - Conference contribution
AN - SCOPUS:105010246278
T3 - 13th International Conference on Learning Representations, ICLR 2025
SP - 82676
EP - 82700
BT - 13th International Conference on Learning Representations, ICLR 2025
PB - International Conference on Learning Representations, ICLR
Y2 - 24 April 2025 through 28 April 2025
ER -