ProAdvPrompter: A Two-Stage Journey to Effective Adversarial Prompting for LLMs

Hao Di, Tong He, Haishan Ye, Yinghui Huang, Xiangyu Chang, Guang Dai, Ivor W. Tsang

Research output: Conference contribution in book/report/conference proceeding (peer-reviewed)

Abstract

As large language models (LLMs) are increasingly integrated into real-world applications, identifying their vulnerabilities to jailbreaking attacks becomes essential to ensuring their safety and reliability. Previous studies have developed LLM assistants, known as adversarial prompters, to automatically generate suffixes that manipulate target LLMs into producing harmful and undesirable outputs. However, these approaches often suffer from low performance or generate semantically meaningless prompts, which are easily identified by perplexity-based defenses. In this paper, we introduce a novel two-stage method, ProAdvPrompter, that significantly improves the performance of adversarial prompters. In the first stage (Exploration), ProAdvPrompter uses loss information to guide the adversarial prompter toward suffixes that are more likely to elicit harmful responses. In the second stage (Exploitation), it iteratively fine-tunes the prompter on high-quality generated adversarial suffixes to further boost performance. Additionally, we incorporate a prompt template to aid the Exploration stage and propose a filtering mechanism to accelerate training in the Exploitation stage. We evaluate ProAdvPrompter against well-aligned LLMs (Llama2-Chat-7B and Llama3-Instruct-8B), achieving attack success rates of 99.68% and 97.12% respectively after 10 trials on the AdvBench dataset, roughly doubling the performance of previous works. Moreover, ProAdvPrompter reduces training time by 20% on Llama3-Instruct-8B, generates more generalized adversarial suffixes, and demonstrates resilience against the perplexity defense. An ablation study further evaluates the effects of the key components of ProAdvPrompter (the prompt template and the filtering mechanism).
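The perplexity-based defense the abstract refers to can be illustrated with a minimal sketch: a defender scores a prompt by the language model's per-token perplexity and rejects prompts that score above a threshold, which catches gibberish suffixes but not fluent ones. The function names and the `threshold` value below are illustrative assumptions, not details from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token,
    where token_logprobs are natural-log probabilities from a language model."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def flags_as_adversarial(token_logprobs, threshold=500.0):
    # A prompt whose perplexity exceeds the threshold is rejected as likely
    # machine-generated gibberish; fluent adversarial suffixes slip under it.
    return perplexity(token_logprobs) > threshold

# Fluent suffix: the LM assigns each token probability ~0.5 -> perplexity 2.
fluent = [math.log(0.5)] * 8
# Gibberish suffix: each token gets probability ~1e-4 -> perplexity 10,000.
gibberish = [math.log(1e-4)] * 8
```

This is why semantically meaningful suffixes matter: an attack whose suffix reads like natural text keeps perplexity low and is indistinguishable from a benign prompt under this defense.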

Original language: English
Title of host publication: 13th International Conference on Learning Representations, ICLR 2025
Publisher: International Conference on Learning Representations, ICLR
Pages: 82676-82700
Number of pages: 25
ISBN (Electronic): 9798331320850
State: Published - 2025
Event: 13th International Conference on Learning Representations, ICLR 2025 - Singapore, Singapore
Duration: 24 Apr 2025 - 28 Apr 2025

Publication series

Name: 13th International Conference on Learning Representations, ICLR 2025

Conference

Conference: 13th International Conference on Learning Representations, ICLR 2025
Country/Territory: Singapore
City: Singapore
Period: 24/04/25 - 28/04/25
