TY - GEN
T1 - Exploring Intrinsic Alignments within Text Corpus
AU - Liang, Zi
AU - Wang, Pinghui
AU - Zhang, Ruofei
AU - Hu, Haibo
AU - Zhang, Shuo
AU - Ye, Qingqing
AU - Xu, Nuo
AU - Xiao, Yaxin
AU - Zhang, Chen
AU - Cui, Lizhen
N1 - Publisher Copyright: Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2025/4/11
Y1 - 2025/4/11
N2 - Recent years have witnessed rapid advancements in the safety alignments of large language models (LLMs). Methods such as supervised instruction fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) have thus emerged as vital components in constructing LLMs. While these methods achieve robust and fine-grained alignment to human values, their practical application is still hindered by high annotation costs and incomplete human alignments. Besides, the intrinsic human values within training corpora have not been fully exploited. To address these issues, we propose ISAAC (Intrinsically Supervised Alignments by Assessing Corpus), a primary and coarse-grained safety alignment strategy for LLMs. ISAAC only relies on a prior assumption about the text corpus, and does not require preferences in RLHF or human responses selection in SFT. Specifically, it assumes a long-tail distribution of text corpus and employs a specialized sampling strategy to automatically sample high-quality responses. Theoretically, we prove that this strategy can improve the safety of LLMs under our assumptions. Empirically, our evaluations on mainstream LLMs show that ISAAC achieves a safety score comparable to current SFT solutions. Moreover, we conduct experiments on ISAAC for some RLHF-based LLMs, where we find that ISAAC can even improve the safety of these models under specific safety domains. These findings demonstrate that ISAAC can provide preliminary alignment to LLMs, thereby reducing the construction costs of existing human-feedback-based methods.
UR - https://www.scopus.com/pages/publications/105003999984
U2 - 10.1609/aaai.v39i26.34957
DO - 10.1609/aaai.v39i26.34957
M3 - Conference contribution
AN - SCOPUS:105003999984
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 27455
EP - 27463
BT - Special Track on AI Alignment
A2 - Walsh, Toby
A2 - Shah, Julie
A2 - Kolter, Zico
PB - Association for the Advancement of Artificial Intelligence
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Y2 - 25 February 2025 through 4 March 2025
ER -