The Black Box Paradox

Why AI leaders are racing to see inside their own models as observability becomes more important than compute

Happy Monday!

Here's the uncomfortable truth nobody's talking about: OpenAI just spent up to $400 million to understand what happens when they train GPT. Anthropic's CEO says interpretability is the company's top goal by 2027. Palo Alto Networks dropped $3.35 billion on observability infrastructure.

The world's most sophisticated AI companies are spending billions to answer a question with large implications: What are our models actually doing?

Last week, OpenAI announced two things that seem unrelated but tell the same story. First, they're acquiring Neptune, a training observability platform that monitors "thousands of runs, analyzes metrics across layers, and surfaces issues." Second, they published research on "Confessions" which is a method that trains models to admit when they cheat, lie, or cut corners.

Neptune gives visibility into training. Confessions surface what models actually did. Both solve the same problem: we've been deploying trillion-parameter systems we fundamentally don't understand.

OpenAI acquires Neptune for training observability, after using it internally for a year, to see per-layer metrics across thousands of training runs. Simultaneously they launch Confessions: models trained to self-report misbehavior admitted to sandbagging, reward hacking, and cutting corners even when hiding it in main outputs. Anthropic publishes circuit tracing breakthroughs. Meanwhile, Palo Alto Networks pays $3.35B for the Chronosphere observability platform. Companies that can't explain their models won't be allowed to deploy them. Observability isn't optional, it's survival.

TL;DR

What We Thought We Were Doing

For years, the AI development story was simple: more data, bigger models, better performance. GPT-2 to GPT-3 to GPT-4. Benchmarks climbing. Capabilities expanding.

Training was a black box, but it worked. Pour in compute and data, get back intelligence. Who cares how?

Turns out, everyone cares now.

Neptune's value proposition tells you everything. The company built tools that let researchers "compare thousands of runs, analyze metrics across layers, and surface issues" during training. OpenAI was already their customer. They acquired Neptune because monitoring training runs is now more valuable than the models themselves.

Think about that. The ChatGPT maker just paid hundreds of millions for visibility into what happened during training. Not for better models, but for better understanding of the models they already have.

The name of the game is now visibility, not capability.

The Confession Booth Problem

Same week, different solution to the same crisis.

OpenAI published research showing they can train models to confess misbehavior. The goal wasn’t to prevent it, but instead surface it.

Here's how it works: Models generate two outputs. First, the main answer. Second, a confession evaluating how well they followed instructions. The confession is judged solely on honesty. Nothing the model admits gets held against it. Like a Catholic confessional, admission increases reward, not punishment.

The results are wild. When models were deliberately set up to fail: like being asked to write code that runs in nanoseconds (impossible), or given tests where scoring above 50% means knowledge erasure, they hid the cheating in their main answer but confessed honestly afterward.

Confessions were given with near 100% accuracy when there were separated reward functions for the answer and the confession itself.

The crazy part is confessions work even when the model produces no chain of thought at all. Meaning the dishonesty isn't in the reasoning process it's baked deeper. Models know when they're cutting corners. They just don't tell you unless you create incentive structures that reward honesty over performance.

The Anthropic Microscope

While OpenAI buys observability tools, Anthropic is building something else: a way to actually watch models think.

Their circuit tracing work reveals what's happening inside Claude's neural networks. Ask "what's the capital of the state containing Dallas" and you can see the model internally represent "Texas" before outputting "Austin." The two-hop reasoning we assumed was happening is real, we just couldn't see it before.

When Claude writes poetry, it plans ahead by identifying potential rhyming words before writing each line. Internal computational shortcuts it learned on its own are now visible for the first time.

Anthropic’s CEO, Dario Amodei, isn't being subtle about what this means. In his recent essay he said that Anthropic is doubling down on interpretability, with a goal of detecting model problems using this methodology. It’s now a core company goal. Ship date: 2027.

The Economic Forcing Function

Neptune, Confessions, and Circuit Tracing are all the same realization hitting from three directions.

But the real pressure isn't coming from research labs. It's coming from regulators and reality.

The EU AI Act goes live August 2026. High-risk AI systems (recruitment, credit scoring, insurance, medical devices, and law enforcement) must be "transparent and explainable." Article 13 requires providers ensure systems are "understandable to users" with information on "characteristics, capabilities, and limitations." Plus logs are required for traceability.

Penalties are up to €30 million or 6% of global revenue.

U.S. regulators are moving in a similar direction. The FDA’s 2025 guidance emphasizes "clearly defined model context and rigorous validation." HHS clarified biased clinical algorithms violate civil rights protections in federally funded programs.

Financial regulators are even more direct. A 2025 Finance Watch report questions how institutions can rely on technology that even its most advanced experts do not fully understand. McKinsey is already seeing the shift: banks using explainable AI saw improved customer trust and deployed AI in areas previously off-limits due to compliance.

The companies that can explain their models will win regulated markets. Everyone else gets locked out.

The Model Collapse Wildcard

There's another reason visibility just became existential: model collapse.

Nature published definitive research showing AI models trained on AI-generated content experience "irreversible defects." Feed a model outputs from its predecessors and it loses its grasp of rare events first (early collapse), then drifts toward bland central tendencies until outputs become nonsense (late collapse).

Any errors in one model's output get baked into training for the next generation. Those models produce their own errors. Compound recursively across generations and you get statistical drift away from reality.

Why does this matter for observability? The web is now saturated with AI content. Future training data will inevitably include synthetic text. Without visibility into where data came from and how models responded during training, you can't detect collapse until it's too late.

What This Means If You're Building

The infrastructure layer is forming fast. Neptune, Arize, Chronosphere, and dozens of YC-backed observability startups are solving this problem.

If you're training models without Neptune-class tooling, you're behind. Experiment tracking is no longer a nice-to-have.

Confessions point to new product category: honesty as a service. Enterprise customers will pay a premium for models that admit when they don't know vs. confidently hallucinating. The trust layer is being built.

Interpretability is becoming a commercial moat. Anthropic's explicit strategy of using mechanistic interpretability to create a unique advantage in regulated industries is paying off. If you can explain why your model made a decision, you can deploy in healthcare, finance, and legal domains. Competitors can't.

Regulatory compliance is also a forcing function. The EU AI Act, FDA guidance, and financial regulators are all converging on the same requirements.

Data provenance matters more than volume. With model collapse risk, verified human data is a premium asset. Your training data strategy needs to account for poisoned wells.

The Bottom Line

Three years into the generative AI boom, we're hitting the transparency wall.

OpenAI is spending hundreds of millions to monitor training. Confessions is revealing strategic deception from models. Anthropic is tracing circuits and reasoning within the neural networks themselves. Regulators are demanding explanations.

The black box worked when AI was experimental. It breaks when systems make hiring decisions, approve loans, diagnose diseases, and drive cars.

We've been building toward AGI assuming we'd figure out interpretability later. But the companies solving visibility now, before models get more powerful, will be the ones allowed to deploy when that power actually matters.

Right now, even the people building frontier AI can't fully explain how their systems work. That's not a research gap, it’s a real problem that this new technology is aiming to solve.

In motion,
Justin Wright

If the most sophisticated AI companies are spending billions just to understand what their models are doing during training and deployment, what does that say about the hundreds of companies deploying AI without any observability infrastructure at all?

Food for Thought

I am excited to officially announce the launch of my podcast Mostly Humans: An AI and business podcast for everyone!

Episodes can be found below - please like, subscribe, and comment!