The Decade of Nines
Why Karpathy says AGI is further away than you think and why scaling performance is so difficult
Happy Monday!
Last week, Andrej Karpathy dropped a reality check that sent shockwaves through the AI community. The OpenAI co-founder and former Tesla AI director told Dwarkesh Patel on his podcast that AGI is still a decade away. Not two years, not five years: a full decade. Even more pointed: today's AI agents are "slop," not the reliable digital workers being marketed to the world.
This isn't pessimism from a skeptic. This is pragmatism from someone who spent five years watching Tesla's self-driving program march through what he calls the "nines," where every incremental improvement, from 90% to 99% to 99.9%, takes the same grueling amount of work. While Jensen Huang declares 2025 "the year of AI agents" and Dario Amodei predicts systems "better than almost all humans at almost all things" by 2027, the engineer who actually built these systems is warning about fundamental cognitive deficits that won't be fixed by scaling alone.
The gap between demo and product, between impressing users and actually replacing employees, remains vast. Understanding why reveals everything about where AI is actually headed.
Karpathy argues AGI is a decade away because current models suffer from fundamental architectural problems. They lack continual learning, collapse into repetitive outputs, and memorize too much while generalizing too little. Reinforcement learning "sucks supervision through a straw" by rewarding lucky guesses the same as genuine reasoning. The real AI progress happens through augmented intelligence and grinding incremental improvements, not sudden breakthroughs.
The Cognitive Core Problem
Karpathy introduces a framework that cuts through the hype: separate an LLM's knowledge from its cognitive core. Pre-training does two unrelated things: it loads models with facts, and it teaches algorithmic patterns like in-context learning. The problem? All that memorized knowledge actually holds models back.
LLMs compress 15 trillion tokens into billions of parameters, which act as a "hazy recollection" of the internet. Unlike humans, who forget and therefore generalize, LLMs get distracted by encyclopedic recall. Ask ChatGPT for a joke and you'll get one of the same three jokes. The models are "silently collapsed": their outputs occupy a tiny sliver of the possible response space.
Karpathy believes the optimal "cognitive core," or pure intelligence stripped of memorized facts, could be as small as one billion parameters. That's 1,000x smaller than frontier models. Why the bloat? Because the training data is the raw internet rather than curated articles, and the raw internet is garbage: random stock tickers, deprecated code, contradictions everywhere. Models need massive capacity just to compress that noise.
The implication: we need intelligent models to refine training data itself, then distill that into smaller, purer cognitive cores. The real solution starts with separating signal from noise.
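To make the distillation half of that idea concrete, here's a minimal sketch of standard knowledge distillation (the function name, temperature, and setup are my own illustration, not anything described on the podcast): a large teacher's softened output distribution supervises a much smaller student, transferring behavior without the full parameter budget.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 temperature: float = 2.0) -> torch.Tensor:
    """Classic knowledge-distillation objective: push the small student's
    softened distribution toward the large teacher's soft targets."""
    t = temperature
    teacher = F.softmax(teacher_logits / t, dim=-1)      # soft targets
    student = F.log_softmax(student_logits / t, dim=-1)  # student log-probs
    # The t*t factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student, teacher, reduction="batchmean") * (t * t)
```

Whether a one-billion-parameter student can actually hold the "cognitive core" is exactly the open question Karpathy is raising.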
Why Reinforcement Learning is "Terrible"
Karpathy reserves his harshest criticism for reinforcement learning, calling it "terrible," "stupid," and "crazy." Imagine solving a math problem. You try 100 approaches consisting of productive dead ends, lucky guesses, and genuine insights. One trajectory stumbles onto the correct answer.
The solution, according to reinforcement learning, is to upweight every single token in that winning trajectory. Every wrong turn and every lucky guess is treated as a correct decision. Karpathy describes this as "sucking supervision through a straw": you do minutes of work, then compress everything learned into a single binary reward signal broadcast across the entire trajectory.
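A minimal sketch makes the problem visible (my own illustration of outcome-based policy-gradient training, not any lab's actual code):

```python
import torch
import torch.nn.functional as F

def outcome_reward_loss(logits: torch.Tensor,  # [T, vocab] policy outputs
                        tokens: torch.Tensor,  # [T] sampled token ids
                        reward: float) -> torch.Tensor:
    """REINFORCE-style loss with a single outcome reward: 1.0 if the
    final answer checked out, 0.0 otherwise."""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs[torch.arange(tokens.shape[0]), tokens]  # [T]
    # The one binary signal is broadcast across every step, so a lucky
    # guess mid-trajectory earns the same credit as a genuine insight.
    return -(reward * chosen).sum()
```

Minutes of reasoning, one bit of feedback, spread uniformly over every token: that's the straw.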
Humans review after solving problems: "This worked, this didn't, I should try differently." There's nothing equivalent in current LLMs. Process-based supervision tries to fix this by judging each step, but LLM judges are gameable. Models find adversarial examples, nonsensical strings like "dhdhdhdh" that trick judges into assigning perfect scores.
Karpathy expects "three or four or five more" algorithmic breakthroughs of this magnitude before we reach truly capable agents. Each one requires fundamental rethinking of how models learn, not just scaling up what already exists.
The Demo-to-Product Gap
Karpathy's self-driving experience offers a sobering parallel. CMU demonstrated autonomous trucks in 1986. Waymo gave him a perfect ride in 2014. A decade later, we're still working through the "nines" of reliability.
Going from 90% to 99% takes the same work as going from 99% to 99.9%. Every nine is equivalent effort. Tesla spent five years advancing maybe two or three nines, but still more remain.
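A toy calculation shows why this grinds (my arithmetic, not Karpathy's): effort grows linearly with the number of nines, while the visible failure rate shrinks ten-fold per nine, so each equally expensive step buys a smaller and smaller apparent improvement.

```python
# If each nine costs roughly equal engineering effort, effort is linear
# in nines while the remaining failure rate shrinks 10x per nine.
for nines in range(1, 6):
    failure_rate = 10 ** -nines
    print(f"{nines} nine(s): {1 - failure_rate:.4%} reliable, "
          f"about 1 failure per {10 ** nines:,} attempts")
```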
When Karpathy built nanochat, coding assistants were "net unhelpful." They excel at boilerplate and common patterns but fail at novel architectures and intellectually intense code. They misunderstand context, use deprecated APIs, and add bloat instead of focusing on streamlined, architecturally beautiful code.
The implication: models aren't very good at code that has never been written before, which is exactly what frontier research requires. LLMs pattern-match against existing training data; they rarely create anything genuinely novel. Breakthroughs happen by charting new territory, which is why so much LLM-generated code has been deemed "slop" by senior engineers.
Pattern Recognition: Where Models Actually Excel
Strip away the hype and clarity emerges. Models excel at:
- Autocomplete for common patterns
- Accessibility to unfamiliar domains
- Starting points requiring human refinement
- Productivity multipliers, not full replacements
Karpathy's workflow: autocomplete is the "sweet spot." Navigate to where the code belongs, type the first few letters, and let the model complete the rest. You specify location and intent; the model fills in the details at high information bandwidth. This is categorically different from "vibe coding," where you describe what you want and hope things turn out well.
The jobs most resistant to automation? Complex, messy roles like radiology, where Geoff Hinton once predicted obsolescence only to watch the profession grow. Radiology isn't image classification; it's patient care, coordination, and complex workflows. AI tools, in their current state, cannot automate these kinds of roles away.
The Most Provocative Claim: AGI Blends Into 2% Growth
Karpathy argues AGI won't create an intelligence explosion. It will blend into the same 2% GDP growth we've seen for 250 years.
His reasoning: AI is continuous with computing itself. Search engines were AI. Compilers were recursive self-improvement. Each transformative technology (electricity, railways, computers, mobile phones) felt revolutionary in its moment, yet none broke the 2% pattern.
Even the iPhone didn't spike GDP. Impact spread gradually as adoption scaled and infrastructure adapted to this new paradigm. AI will follow the same path: slow integration, constant debugging, and gradual improvement across thousands of use cases.
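The arithmetic behind that 2% figure is worth seeing once (my own back-of-envelope, not a number from the podcast):

```python
import math

# Doubling time at steady 2% annual growth: ln(2) / ln(1.02).
# Transformative in aggregate, but never a discrete spike.
print(math.log(2) / math.log(1.02))  # ~35 years per doubling
```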
The counterargument: labor itself differs from productivity tools. Ten billion additional minds could represent a population explosion that historically drove hyper-exponential growth. Karpathy remains skeptical: "You're presupposing some discrete jump that has no historical precedent I can find."
This reframes everything. Progress comes through grinding incremental work, like better training data, algorithmic refinements, and integration improvements, rather than sudden breakthroughs. Reactions ranged from "If this doesn't pop the AI bubble, nothing will" to quiet acknowledgment from engineers who've seen the limitations firsthand.
What This Actually Means
Three frameworks emerge:
Augmented Intelligence, Not Replacement: For the next decade, AI accelerates human thinking and action rather than replacing it. The leverage is real, but it compounds through better integration, not autonomous breakthroughs.
The March of Nines Continues: Demo-to-product gaps are vast in high-stakes domains. Anyone building in AI should expect years of reliability work as part of that process. The unsexy middle years of debugging and integration will ultimately determine the winners.
Everything Improves 20%: No single breakthrough dominates. Algorithms, data quality, hardware, and training methods will all improve incrementally. Success comes from systematic improvement across dimensions, not betting on one scaling law.
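A quick back-of-envelope (my illustration, not Karpathy's numbers) shows why this compounding matters:

```python
# Four independent 20% gains (algorithms, data, hardware, training
# methods) compound multiplicatively to roughly a 2x overall improvement.
total = 1.0
for gain in [1.2, 1.2, 1.2, 1.2]:
    total *= gain
print(f"{total:.2f}x")  # ~2.07x
```

No single 20% gain looks like a breakthrough, but together they double capability.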
Looking Ahead
Karpathy's timeline isn't pessimistic; it's realistic. Ten years to AGI would be the fastest any general-purpose technology has reached maturity. The real question is whether momentum can be sustained through the unglamorous middle years of grinding work.
According to Karpathy, the models are amazing but still need a lot of work. That work looks less like breakthrough papers and more like solving cognitive deficits: continual learning, reflection mechanisms, reduced model collapse, and better training objectives.
The hype cycle churns. Demos wow audiences. Benchmarks hit new highs. But Karpathy's warning cuts through: "The industry is making too big of a jump and trying to pretend like this is amazing. Overall, the models are not there."
The gap between pretense and reality is where the actual work lives. Understanding that gap, why it exists, what it takes to close it, and how long closing it actually requires, matters more than any benchmark or demo.
The decade of agents isn't about autonomous AI workers replacing humans. It's about grinding through the “nines” of reliability while discovering what AI is actually good for versus what we wished it could do. That's the real trend shaping the next ten years.
In motion,
Justin Wright
If every nine of reliability requires equal work, and we're still multiple nines away from truly autonomous agents, does the current AI investment thesis, predicated on near-term labor replacement, fundamentally misunderstand the technology's actual trajectory and timeline?

Google Maps: Now available in the Gemini API (Google)
Introducing Veo 3.1 (Google)
How a Gemma model helped discover a new potential cancer therapy (Google)
Dropbox Dash is your AI teammate that surfaces the content and context you need to stay focused and on track (Dropbox)
Copilot gets a personality (Microsoft)

I am excited to officially announce the launch of my podcast Mostly Humans: An AI and business podcast for everyone!
Episodes can be found below - please like, subscribe, and comment!