TY - JOUR
T1 - JailGuard
T2 - A Universal Detection Framework for Prompt-based Attacks on LLM Systems
AU - Zhang, Xiaoyu
AU - Zhang, Cen
AU - Li, Tianlin
AU - Huang, Yihao
AU - Jia, Xiaojun
AU - Hu, Ming
AU - Zhang, Jie
AU - Liu, Yang
AU - Ma, Shiqing
AU - Shen, Chao
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/12/11
Y1 - 2025/12/11
N2 - Systems and software powered by Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) play a critical role in numerous scenarios. However, current LLM systems are vulnerable to prompt-based attacks: jailbreaking attacks cause the LLM system to generate harmful content, while hijacking attacks manipulate the LLM system into performing attacker-desired tasks, underscoring the need for detection tools. Unfortunately, existing detection approaches are usually tailored to specific attacks, so they generalize poorly across attack types and modalities. To address this, we propose JailGuard, a universal detection framework deployed on top of LLM systems for prompt-based attacks across text and image modalities. JailGuard operates on the principle that attack inputs are inherently less robust than benign ones. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the discrepancy among the variants’ responses on the target model to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator combination policy to further improve detection generalization. Evaluation on a dataset containing 15 known attack types suggests that JailGuard achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81–25.73% and 12.20–21.40%.
AB - Systems and software powered by Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) play a critical role in numerous scenarios. However, current LLM systems are vulnerable to prompt-based attacks: jailbreaking attacks cause the LLM system to generate harmful content, while hijacking attacks manipulate the LLM system into performing attacker-desired tasks, underscoring the need for detection tools. Unfortunately, existing detection approaches are usually tailored to specific attacks, so they generalize poorly across attack types and modalities. To address this, we propose JailGuard, a universal detection framework deployed on top of LLM systems for prompt-based attacks across text and image modalities. JailGuard operates on the principle that attack inputs are inherently less robust than benign ones. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the discrepancy among the variants’ responses on the target model to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator combination policy to further improve detection generalization. Evaluation on a dataset containing 15 known attack types suggests that JailGuard achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81–25.73% and 12.20–21.40%.
KW - Large Language Model System
KW - LLM Defense
KW - LLM Security
KW - Software and Application Security
UR - https://www.scopus.com/pages/publications/105028004706
U2 - 10.1145/3724393
DO - 10.1145/3724393
M3 - Article
AN - SCOPUS:105028004706
SN - 1049-331X
VL - 35
JO - ACM Transactions on Software Engineering and Methodology
JF - ACM Transactions on Software Engineering and Methodology
IS - 1
M1 - 8
ER -