Cognitive Empathy as a Defense against Persistent Backdoors in Reasoning Models
Authors: Edward Kim, Dr. Michael Wollowski
Developed a self-inhibiting safety mechanism that lets a reasoning model assess the impact of its actions from the perspective of others and preempt harmful ones;
Outperformed supervised fine-tuning (SFT) and reinforcement learning (RL) baselines by 40-50% in reducing harmful behavior from reasoning-based sleeper agents;
Proposed an emergency security patch for frontier model training pipelines and a new approach to superalignment using empathy as a continuous self-learning mechanism for ethical behavior.