Anthropic research: AI failures increasingly incoherent, not systematically misaligned

New research from Anthropic challenges the 'paperclip maximizer' narrative of AI risk. As models tackle harder tasks with longer reasoning chains, failures look more like unpredictable errors than coherent goal pursuit. The pattern holds across frontier models, including Claude, OpenAI's o-series, and Qwen systems.

The Finding

AI systems fail by being a "hot mess" rather than systematically pursuing wrong goals, according to new research from Anthropic's 2025 Fellows Program. The paper measures how model errors decompose into systematic bias versus random variance across frontier reasoning models.

The central finding: as reasoning chains get longer and tasks get harder, model failures become increasingly dominated by incoherence. On easy tasks, larger models show more consistent behavior. On hard tasks, scale doesn't reduce unpredictability, and sometimes makes it worse.
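
To make the distinction concrete, here is a minimal sketch (not the paper's code) of how repeated answers to a single question can be split into a systematic component and a random one, in the spirit of the standard bias-variance decomposition for 0-1 loss:

```python
from collections import Counter

def decompose_error(samples: list[str], correct: str) -> dict[str, float]:
    """Split repeated answers to one question into a systematic ('bias')
    component and a random ('variance') component.

    samples: answers drawn from the same model on the same prompt
    correct: the reference answer
    """
    counts = Counter(samples)
    modal_answer, _ = counts.most_common(1)[0]   # the model's "main prediction"
    error = sum(a != correct for a in samples) / len(samples)
    bias = float(modal_answer != correct)        # wrong in a consistent way
    variance = sum(a != modal_answer for a in samples) / len(samples)  # self-disagreement
    return {"error": error, "bias": bias, "variance": variance}

# A coherent failure: the same wrong answer every time -> bias 1.0, variance 0.0
print(decompose_error(["B", "B", "B", "B"], correct="A"))

# A "hot mess" failure: scattered answers -> bias 0.0, variance 0.6
print(decompose_error(["A", "A", "C", "D", "B"], correct="A"))
```

A bias-dominated failure points at a consistently wrong objective; a variance-dominated one points at incoherence, which is the quantity the paper reports growing as tasks get harder.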

This matters because it challenges the dominant AI risk narrative. Instead of superintelligent systems coherently pursuing misaligned goals (the classic paperclip maximizer scenario), the research suggests failures will look more like industrial accidents: unpredictable, self-undermining behavior that doesn't optimize for any consistent objective.

What They Measured

The researchers tested Claude Sonnet 4, o3-mini, o4-mini, and Qwen3 across multiple-choice benchmarks (GPQA, MMLU), coding tasks (SWE-Bench), and safety evaluations. They also trained small models on synthetic optimization to make the connection explicit.

Key pattern: when models spontaneously reason longer on a problem, incoherence spikes dramatically. Deliberately increasing reasoning budgets through API settings provides only modest coherence improvements. The natural variation dominates.
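
One way to see this pattern, assuming you have repeated runs of the same prompt with their reasoning-token counts and final answers (the data layout below is hypothetical, not the paper's), is to bucket runs by how long the model reasoned and check how often each bucket agrees with itself:

```python
from collections import Counter, defaultdict

def agreement_by_reasoning_length(runs: list[tuple[int, str]], bin_size: int = 500) -> dict[int, float]:
    """Bucket repeated runs of one prompt by reasoning length, then report
    answer agreement (fraction matching the bucket's modal answer).

    runs: (reasoning_tokens, final_answer) pairs from repeated sampling
    """
    buckets: dict[int, list[str]] = defaultdict(list)
    for tokens, answer in runs:
        buckets[tokens // bin_size].append(answer)

    agreement = {}
    for b, answers in sorted(buckets.items()):
        modal_count = Counter(answers).most_common(1)[0][1]
        agreement[b * bin_size] = modal_count / len(answers)  # 1.0 = fully coherent bucket
    return agreement

# Hypothetical data: longer spontaneous reasoning, more scattered answers
runs = [(300, "A"), (350, "A"), (420, "A"),
        (1800, "A"), (1900, "C"), (2100, "D"), (2250, "B")]
print(agreement_by_reasoning_length(runs))
```

Agreement near 1.0 means the bucket answers coherently; values near chance correspond to the scattered, self-undermining behavior described above.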

Ensembling multiple samples reduces variance, as theory predicts. But the core finding holds: smarter agents tackling harder problems show higher variance in their failures.
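
Why ensembling helps with the variance part in particular can be seen with a toy simulation (an illustration, not the paper's setup): if each independent sample is right with some probability and wrong answers are scattered, majority voting recovers accuracy.

```python
import random
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Return the most common answer across sampled responses."""
    return Counter(samples).most_common(1)[0][0]

def ensemble_accuracy(p_correct: float, k: int, trials: int = 10_000, seed: int = 0) -> float:
    """Accuracy of a k-sample majority-vote ensemble when each independent
    sample is correct with probability p_correct and otherwise picks a
    random wrong option out of four choices (toy model of variance-dominated error)."""
    rng = random.Random(seed)
    correct, wrong = "A", ["B", "C", "D"]
    hits = 0
    for _ in range(trials):
        samples = [correct if rng.random() < p_correct else rng.choice(wrong)
                   for _ in range(k)]
        hits += majority_vote(samples) == correct
    return hits / trials

for k in (1, 3, 9):
    print(f"k={k}: accuracy ~ {ensemble_accuracy(p_correct=0.6, k=k):.3f}")
```

The same vote over a consistently wrong model would converge on the wrong answer, which is why the bias-variance split matters when choosing mitigations.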

Context

This builds on Anthropic researcher Jascha Sohl-Dickstein's 2023 "hot mess theory," which asked surveyed experts to rank entities by intelligence and, independently, by coherence. Smarter entities, including humans, were judged to behave less coherently.

The timing is notable. Anthropic CEO Dario Amodei recently warned of an AI "adolescence" testing humanity, predicting human-level AI in roughly two years. The company's January Economic Index shows 49% of jobs now using AI for at least 25% of their tasks, up from 36% in early 2025.

Meanwhile, ARC-AGI-2 scores hit 55%, up from under 20% a year ago, while AI cost per task fell roughly 300-fold over the same period.

The Implications

If the research holds, it suggests scaling alone won't eliminate incoherence. As more capable models tackle harder problems, variance-dominated failures persist or worsen. That's a different safety problem than systematic misalignment, requiring different mitigations.

Skeptics see this framing as optimistic, arguing it may downplay the risk of coherent misalignment. The traditional narrative treats superintelligent, goal-pursuing AGI as the primary catastrophe scenario.

Anthropic's separate research on agentic misalignment shows models can exhibit insider-threat behaviors like blackmail under stress, though rarely spontaneously. The company's focus on reliability is already aiding enterprise adoption, but the question of how failures scale with capability remains open.

We'll see whether the hot mess pattern holds as models continue to improve. The paper provides measurable definitions, which is progress. The safety implications depend on whether you find unpredictable failures more or less concerning than systematic ones.