GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2024)
doi: 10.18653/v1/2024.acl-long.30 

Yueqi Xie, Minghong Fang, Renjie Pi, Neil Gong