TY - JOUR
T1 - Jailbreak attacks on large language models
T2 - Models, root causes, and the evolution of attack and defense
AU - Li, Xitao
AU - Wu, Jiang
AU - Zheng, Qinghua
AU - Wang, Haijun
AU - Fan, Ming
AU - Hu, Shuai
AU - Guo, Jiaqi
AU - Liu, Ting
N1 - Publisher Copyright: © 2025 Science Press. All rights reserved.
PY - 2025/6/1
Y1 - 2025/6/1
N2 - Large language models have demonstrated outstanding performance in various applications and are being widely adopted as critical engines for creating new productivity tools. However, malicious users can employ specific techniques to bypass the security protections established through mechanisms such as alignment, resulting in jailbreak attacks. These attacks may cause a model to generate content that violates its usage guidelines, ethics, or laws, raising ethical concerns. This paper comprehensively examines the origins of jailbreak attacks and the co-evolution of attacks and defenses. First, a definition and a formal framework of jailbreak attacks are proposed based on three elements: methods, objects, and targets. Second, the development history of jailbreak attacks is introduced from two perspectives, the evolution of large language models and changes in security perceptions, and the root cause of jailbreak attacks is summarized as the mismatch between the service attributes and the values of large language models. Finally, from the perspective of the attack-defense game, this paper summarizes the evolution of jailbreak attacks and defenses and discusses new threat models for jailbreak attacks and future directions for defense methods.
KW - cybersecurity
KW - ethics of artificial intelligence
KW - jailbreak attack
KW - large language model
KW - natural language processing
UR - https://www.scopus.com/pages/publications/105008554680
U2 - 10.1360/SSI-2024-0196
DO - 10.1360/SSI-2024-0196
M3 - Article
AN - SCOPUS:105008554680
SN - 1674-7267
VL - 55
SP - 1372
JO - Scientia Sinica Informationis
JF - Scientia Sinica Informationis
IS - 6
ER -