Cognitive Empathy as a Defense against Persistent Backdoors in Reasoning Models

Authors: Edward Kim, Dr. Michael Wollowski

Developed a self-inhibiting safety mechanism that lets a reasoning model assess the impact of its actions from the perspective of others and preempt harmful ones;

Outperformed supervised fine-tuning (SFT) and reinforcement learning (RL) baselines by 40-50% in reducing harmful behavior from reasoning-based sleeper agents;

Proposed an emergency security patch for frontier model training pipelines and a new approach to superalignment using empathy as a continuous self-learning mechanism for ethical behavior.




























