RepIt: Steering Language Models with Concept-Specific Refusal Vectors

The Annual Conference on Neural Information Processing Systems (NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models), 2025-04-23