大语言模型越狱攻击: 模型、根因及其攻防演化

Translated title of the contribution: Jailbreaking large language models: models, origins, and evolution of attacks and defenses

Research output: Contribution to journal, Article, peer-review

Abstract

Large language models have demonstrated outstanding performance across a wide range of applications and are being widely adopted as core engines for new productivity tools. However, malicious users can employ specific techniques to bypass the safety protections established through mechanisms such as alignment; this is known as a jailbreak attack. Such attacks can cause the model to generate content that violates its usage guidelines, ethical norms, or laws, raising ethical concerns. This paper comprehensively examines the origins of jailbreak attacks and the evolution of attacks and defenses. First, it proposes a definition and a formal framework of jailbreak attacks based on three elements: methods, objects, and targets. Next, it traces the development history of jailbreak attacks from two perspectives, the evolution of large language models and changes in security perceptions, and summarizes the root cause of jailbreak attacks as a mismatch between the service attributes and the values of large language models. Finally, from the perspective of the attack-defense game, it summarizes the evolution of jailbreak attacks and defenses and discusses new threat models and future directions for defense methods.

Original language: Chinese (Traditional)
Pages (from-to): 1372
Number of pages: 1
Journal: Scientia Sinica Informationis
Volume: 55
Issue number: 6
State: Published - 1 Jun 2025
