Preference Poisoning Attacks on Reward Model Learning

arXiv preprint arXiv:2402.01920 (arXiv 2024), 2024-09-01