Secure Linear Alignment of Large Language Models

(arxiv.org)

1 point | by walterbell 2 hours ago

1 comment

gregfrank 1 hour ago
The "linear" assumption here is worth interrogating. In my own work on alignment evaluation, linear probes reach high accuracy on refusal-relevant directions, but probe accuracy is not diagnostic of whether the model actually routes behavior through those directions at inference time.
DeepSeek-R1 and Qwen2.5-72B have cleanly separable routing layers (ablating the refusal direction recovers accurate outputs), but Qwen3-8B doesn't: after ablation it confabulates, which suggests knowledge and suppression are jointly encoded. Whether a linear alignment method holds up may depend heavily on which of those architectural regimes a given model falls into.
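
To make the probe-vs-routing distinction concrete, here is a toy NumPy sketch. Everything in it is an assumption for illustration: the activations are synthetic Gaussians rather than anything extracted from the models above, the direction is a plain difference of means, and `refuses_routed` / `refuses_unrouted` are hypothetical stand-ins for "the model's behavior". The only point is that the same probe can score equally well in both cases, while projecting the direction out changes behavior in only one of them.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 16, 500

# Synthetic "activations" (an assumption, not real model data).
harmless = rng.normal(size=(n, dim))
harmful = rng.normal(size=(n, dim))
harmful[:, 0] += 4.0  # linearly separable shift along dimension 0
# Dimension 1 carries a sign-symmetric feature with zero mean on harmful inputs,
# so a difference-of-means direction barely picks it up.
signs = np.tile([1.0, -1.0], n // 2)
harmful[:, 1] = 3.0 * signs + 0.3 * rng.normal(size=n)


def unit_mean_difference(pos, neg):
    """Unit vector along the difference of class means."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)


def ablate(acts, direction):
    """Project out the component of each activation along a unit direction."""
    return acts - np.outer(acts @ direction, direction)


def probe_accuracy(pos, neg, direction):
    """Accuracy of a midpoint-threshold probe on the 1-D projection."""
    p, q = pos @ direction, neg @ direction
    thr = 0.5 * (p.mean() + q.mean())
    return float(((p > thr).sum() + (q <= thr).sum()) / (len(p) + len(q)))


def refuses_routed(acts):
    """Toy behavior that reads from the probed dimension."""
    return acts[:, 0] > 2.0


def refuses_unrouted(acts):
    """Toy behavior that reads a nonlinear feature on another dimension."""
    return np.abs(acts[:, 1]) > 1.5


def flip_rate(refuses, acts, direction):
    """Fraction of inputs that refused before ablation and not after."""
    before = refuses(acts)
    after = refuses(ablate(acts, direction))
    return float(np.mean(before & ~after))


d = unit_mean_difference(harmful, harmless)
acc = probe_accuracy(harmful, harmless, d)
routed_flips = flip_rate(refuses_routed, harmful, d)
unrouted_flips = flip_rate(refuses_unrouted, harmful, d)

print(f"probe accuracy:        {acc:.3f}")
print(f"flip rate (routed):    {routed_flips:.3f}")
print(f"flip rate (unrouted):  {unrouted_flips:.3f}")
```

In this toy setup the probe accuracy is identical for both readouts, because it is computed on the same activations. Only the ablation flip rate separates the two, which is why I'd treat an intervention-style check, not probe accuracy, as the evidence for routing.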