TY - CONF
T1 - JBShield
T2 - 34th USENIX Security Symposium, USENIX Security 2025
AU - Zhang, Shenyi
AU - Zhai, Yuchen
AU - Guo, Keyan
AU - Hu, Hongxin
AU - Guo, Shengnan
AU - Fang, Zheng
AU - Zhao, Lingchen
AU - Shen, Chao
AU - Wang, Cong
AU - Wang, Qian
N1 - Publisher Copyright: © 2025 by The USENIX Association. All Rights Reserved.
PY - 2025
Y1 - 2025
N2 - Despite the implementation of safety alignment strategies, large language models (LLMs) remain vulnerable to jailbreak attacks, which undermine these safety guardrails and pose significant security threats. Some defenses have been proposed to detect or mitigate jailbreaks, but they are unable to withstand the test of time due to an insufficient understanding of jailbreak mechanisms. In this work, we investigate the mechanisms behind jailbreaks based on the Linear Representation Hypothesis (LRH), which states that neural networks encode high-level concepts as subspaces in their hidden representations. We define the toxic semantics in harmful and jailbreak prompts as toxic concepts and describe the semantics in jailbreak prompts that manipulate LLMs to comply with unsafe requests as jailbreak concepts. Through concept extraction and analysis, we reveal that LLMs can recognize the toxic concepts in both harmful and jailbreak prompts. However, unlike harmful prompts, jailbreak prompts activate the jailbreak concepts and alter the LLM output from rejection to compliance. Building on our analysis, we propose a comprehensive jailbreak defense framework, JBSHIELD, consisting of two key components: jailbreak detection JBSHIELD-D and mitigation JBSHIELD-M. JBSHIELD-D identifies jailbreak prompts by determining whether the input activates both toxic and jailbreak concepts. When a jailbreak prompt is detected, JBSHIELD-M adjusts the hidden representations of the target LLM by enhancing the toxic concept and weakening the jailbreak concept, ensuring LLMs produce safe content. Extensive experiments demonstrate the superior performance of JBSHIELD, achieving an average detection accuracy of 0.95 and reducing the average attack success rate of various jailbreak attacks to 2% from 61% across distinct LLMs.
UR - https://www.scopus.com/pages/publications/105021384859
M3 - Conference contribution
AN - SCOPUS:105021384859
T3 - Proceedings of the 34th USENIX Security Symposium
SP - 8215
EP - 8234
BT - Proceedings of the 34th USENIX Security Symposium
PB - USENIX Association
Y2 - 13 August 2025 through 15 August 2025
ER -