SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
The Annual Conference on Neural Information Processing Systems (NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models), 2025-04-25