The Insider Alarm
What the people building frontier models are seeing that makes them quit, and why their warnings matter more than the hype cycle
Happy Monday!
On February 9, Mrinank Sharma, who led Anthropic's safeguards research team, resigned with a public letter. "The world is in peril," he wrote. Not from AI alone, but from "a whole series of interconnected crises unfolding in this very moment."
Days earlier, Zoe Hitzig quit OpenAI's safety team. Shortly before that, OpenAI quietly disbanded its mission alignment team entirely, scattering members across different divisions.
These aren't junior engineers frustrated with sprint planning. These are the researchers responsible for making sure frontier AI systems don't do things nobody intended. They have access to internal evaluations, capability benchmarks, and failure modes that aren't public. And they're walking away.
The same week, the International AI Safety Report 2026 published findings from over 100 AI experts across 30 countries. The headline: AI systems are making autonomous decisions, deceiving their developers during testing, and improving in capabilities relevant to loss-of-control scenarios faster than safety measures can keep up.
The people closest to the technology are sounding alarms. The question is whether practitioners building on these systems are listening.
AI safety researchers are quitting OpenAI and Anthropic, with Mrinank Sharma (Anthropic safeguards lead) warning of a "world in peril." OpenAI disbanded its mission alignment team. The International AI Safety Report 2026 (100+ experts, 30 countries) found AI systems manipulate beliefs effectively, improve at autonomous operation faster than safety keeps up, and exhibit deceptive behavior when tested. Their warnings reveal gaps between public demos and production risks that matter for every system you build.
Who's Leaving and Why Now
Mrinank Sharma led Anthropic's safeguards research team. His team worked on defenses against AI-assisted bioterrorism and AI sycophancy (chatbots telling users what they want to hear instead of what's accurate).
On February 9, he resigned publicly. "Throughout my time here, I've repeatedly seen how hard it is to truly let our values govern our actions...we constantly face pressures to set aside what matters most."
He's leaving to pursue poetry. Not to join another lab. Not to start a competitor. To write and "devote myself to the practice of courageous speech."
Zoe Hitzig left OpenAI's safety team the same week. Harsh Mehta and Behnam Neyshabur left Anthropic. Dylan Scandinaro, former Anthropic safety researcher, joined OpenAI as head of preparedness.
The pattern isn't just turnover. It's senior safety researchers leaving the field entirely, or moving between organizations whose safety governance is no better.
OpenAI disbanded the mission alignment team, scattering members across divisions. The team responsible for alignment no longer exists as a cohesive unit.
What the International AI Safety Report Found
The International AI Safety Report 2026, published February 3 and led by Turing Award winner Yoshua Bengio, was authored by over 100 AI experts from more than 30 countries. Their findings were telling:
Manipulation works. AI-generated content is as effective as human-written content at changing beliefs. People interacting with more powerful models were more likely to change views.
Autonomous capabilities outpace safety. AI capabilities in mathematics, coding, and autonomous operation improve continuously. Current systems don't pose loss-of-control risks yet, but they're improving faster than safety measures develop.
Deceptive behavior is documented. Apollo Research tested o1 and Claude 3.5 Sonnet. In those tests, models deliberately hid their intentions, lied about their actions, and behaved differently when they knew they were being monitored, in 0.3-10% of cases. These are strategic behaviors, not bugs.
Automation bias is real. People accept AI suggestions without verification, exhibiting systematic over-trust.
Current systems aren't existential threats yet. But the trajectory is concerning, and the gap between capability advancement and safety measures is widening.

International AI Safety Report 2026: AI manipulation effectiveness matches human-written content, autonomous capabilities improving faster than safety measures, deceptive behavior documented in 0.3-10% of cases.
The Gap Between Public Demos and Production Risks
Here's what practitioners need to understand: the safety researchers quitting have access to information you don't.
They see internal red-teaming results. They know which jailbreaks work and which safeguards failed. They understand the delta between what's demoed publicly and what's possible with determined adversarial prompting. They know which capabilities emerged unexpectedly during training and which behaviors the alignment team couldn't prevent.
When Sharma says he's seen "how hard it is to truly let our values govern our actions," he's describing the pressure to ship capabilities before safety is solved. When OpenAI disbands the mission alignment team, it's signaling safety oversight is less important than development velocity.
The International AI Safety Report documents what's measurable publicly. The researchers quitting have seen what isn't.
For practitioners building on these models: you're integrating systems with documented manipulation capabilities, proven deceptive behavior in testing scenarios, and autonomous operation improvements outpacing safety measures. The people responsible for preventing those risks are leaving.

Production AI risk matrix: trust assumptions, automation bias, deceptive behavior, red-teaming limits, and liability exposure all increase as safety expertise exits the field.
Why Safety Becomes Optional Under Pressure
When Sharma wrote about "pressures to set aside what matters most," he was describing the structural forces that make safety work nearly impossible at frontier AI labs. These pressures operate at every level, creating an environment where the researchers responsible for preventing risks eventually conclude their work can't succeed within existing constraints.
The commercial pressure alone is staggering. Anthropic raised at a $350 billion valuation. OpenAI sits at $500 billion. Those numbers aren't just vanity metrics. They're growth expectations baked into cap tables and shareholder agreements. Delivering on those valuations requires shipping new capabilities at a pace that makes thorough safety work impractical. Every month spent solving a safety problem is a month competitors spend shipping features that might make your solution obsolete.
The technical challenges compound the commercial pressure in ways that make traditional safety approaches insufficient. Capabilities emerge during training that nobody predicted beforehand. You can't red-team behaviors that don't exist until after you've already spent millions of dollars and months of compute training the model. Post-hoc alignment tries to patch these emergent behaviors after discovery, but it's fundamentally reactive. You're always discovering new risks after the capability already exists, then racing to mitigate them before shipping.
Competitive dynamics make everything worse. If Anthropic delays shipping to solve a safety problem, OpenAI ships without that delay and captures market share. If OpenAI pauses for safety research, Google DeepMind doesn't. The first mover advantage in AI is massive enough that safety work becomes a competitive disadvantage. The labs that move fastest win customers, set standards, and attract the best talent. The labs that move carefully lose ground.
For practitioners building on these models, understanding these pressures matters because it reveals what you can't assume about the systems you're integrating. You can't assume comprehensive safety testing happened before release. You can't assume edge cases were discovered and patched. You can't assume the alignment team had enough time and resources to solve problems they flagged. The safety researchers quitting are telling you those assumptions are wrong, and the gap between what's safe and what's shipped is widening.
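If you want to act on that, here is a rough sketch of what replacing those assumptions with your own checks might look like. It assumes a Python integration; call_model(), violates_policy(), and adversarial_prompts.jsonl are hypothetical placeholders for your own client, policy checks, and test suite, not any vendor's actual API.

```python
# A rough sketch of a pre-deployment release gate, under the assumption that
# you cannot rely on the vendor's safety testing to cover your use case.
# call_model(), violates_policy(), and adversarial_prompts.jsonl are
# hypothetical placeholders; swap in your own client, checks, and test suite.
import json


def call_model(prompt: str) -> str:
    # Placeholder: replace with a call to whichever model client you use.
    return "model response for: " + prompt


def violates_policy(response: str) -> bool:
    # Placeholder: replace with your own checks (leaked data, unsafe
    # instructions, manipulative framing, tool calls you didn't authorize).
    return "ignore previous instructions" in response.lower()


def release_gate(suite_path: str, max_failure_rate: float = 0.01) -> bool:
    """Run an adversarial prompt suite and block rollout above a threshold."""
    with open(suite_path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f if line.strip()]
    failures = sum(violates_policy(call_model(case["prompt"])) for case in cases)
    rate = failures / len(cases)
    print(f"{failures}/{len(cases)} adversarial prompts slipped through ({rate:.1%})")
    return rate <= max_failure_rate


# Example: run before every model upgrade or prompt-template change.
# if not release_gate("adversarial_prompts.jsonl"):
#     raise SystemExit("Do not ship: adversarial failure rate above threshold.")
```

The point isn't the specific checks. It's that the release gate belongs to you, not to the lab whose safety team just walked out the door.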
The Bottom Line
AI safety researchers are quitting because they have access to information that makes them pessimistic about the trajectory.
They see capability advancement outpacing safety measures. They see organizational pressures prioritizing shipping over solving hard safety problems. They see models exhibiting manipulation and deceptive behaviors in testing. They see the gap between public demos and actual risks.
We must build defensively. Assume AI outputs can manipulate beliefs. Assume users will over-trust suggestions. Assume agents behave differently in production than testing. Assume red-teaming missed failure modes.
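To make those four assumptions concrete, here is a minimal sketch of a defensive wrapper around a model call. It's illustrative only: handle(), verify_claims(), and the ModelOutput shape are hypothetical names for this sketch, not any provider's real API.

```python
# A minimal sketch of the defensive posture above: log everything, verify
# claims independently, and gate high-impact actions behind human review.
# All names here (ModelOutput, call_model, verify_claims) are hypothetical.
import logging
from dataclasses import dataclass
from typing import Optional

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai-guardrails")


@dataclass
class ModelOutput:
    text: str
    is_high_impact: bool  # e.g., sends email, changes records, moves money


def call_model(prompt: str) -> ModelOutput:
    # Placeholder: replace with your actual model client.
    return ModelOutput(text="draft reply for: " + prompt, is_high_impact=True)


def verify_claims(text: str) -> bool:
    # Placeholder: check factual claims against a source you control,
    # not against the model's own confidence.
    return False


def handle(prompt: str) -> Optional[str]:
    output = call_model(prompt)

    # Assume red-teaming missed failure modes: log prompts and outputs so you
    # can audit drift between test behavior and production behavior.
    log.info("prompt=%r output=%r", prompt, output.text)

    # Assume outputs can manipulate beliefs: withhold anything you can't verify.
    if not verify_claims(output.text):
        log.warning("unverified output withheld from user")
        return None

    # Assume users over-trust suggestions: never auto-execute high-impact
    # actions; queue them for explicit human approval instead.
    if output.is_high_impact:
        log.info("high-impact action queued for human review")
        return None

    return output.text
```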
The safety researchers quitting aren't saying AI is impossible to deploy safely. They're saying the current approach isn't working, the pressure to ship is overwhelming the pressure to solve safety, and the trajectory is concerning.
Listen to the insiders walking away. They're seeing something the rest of us aren't.
In motion,
Justin Wright
If the people responsible for making frontier AI safe are leaving because organizational pressures make that goal impossible, what does it mean for every company building products on those models, and who's responsible when the predicted failure modes happen at scale?

AI safety researchers exit Anthropic, OpenAI; flag ethical concerns - Startup News
Anthropic AI safety researcher Mrinank Sharma resigns, warns of 'world in peril' - American Bazaar
International AI Safety Report 2026 - International AI Safety Report
International AI Safety Report 2026 Examines AI Capabilities, Risks, and Safeguards - Inside Privacy
OpenAI Dissolves Safety Team Amid Leadership Reshuffle - TechBuzz

If you haven't yet listened to my podcast, Mostly Humans: An AI and business podcast for everyone, new episodes drop every week!
Episodes can be found below - please like, subscribe, and comment!