arxiv:2603.01297

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Published on Mar 1

Abstract

Instruction-tuned reasoning models are vulnerable to small embedding perturbations that severely degrade safety classifier performance while the classifier's confidence remains high, revealing a fundamental fragility in AI safety architectures.

AI-generated summary

Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, under the assumption that representations remain stable across model updates. We systematically investigate this assumption and find that it fails: normalized perturbations of magnitude σ = 0.02 (approximately 1° of angular drift on the embedding sphere) reduce classifier performance from 85% to 50% ROC-AUC. Critically, mean confidence drops by only 14%, producing dangerous silent failures: 72% of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit 20% worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
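The correspondence between a perturbation of magnitude σ = 0.02 and roughly 1° of angular drift can be checked numerically. The sketch below is a minimal illustration, not the paper's code: the embedding dimension (768) and random seed are arbitrary assumptions. It perturbs a unit-norm embedding with Gaussian noise rescaled to norm σ and measures the resulting angle on the sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768       # hypothetical embedding dimension (assumption)
sigma = 0.02  # perturbation magnitude from the abstract

# A unit-norm embedding vector.
e = rng.standard_normal(d)
e /= np.linalg.norm(e)

# Gaussian perturbation rescaled so its norm is exactly sigma.
noise = rng.standard_normal(d)
noise *= sigma / np.linalg.norm(noise)

# Perturb and project back onto the unit sphere.
e_pert = e + noise
e_pert /= np.linalg.norm(e_pert)

# Angular drift between original and perturbed embeddings.
angle_deg = np.degrees(np.arccos(np.clip(e @ e_pert, -1.0, 1.0)))
print(f"angular drift: {angle_deg:.2f} degrees")  # close to 1 degree
```

In high dimensions the noise is nearly orthogonal to the embedding, so the drift is approximately arctan(σ) ≈ 1.15°, consistent with the abstract's "approximately 1°" figure.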

