Oranges vs. Giraffes: Why Language Models Aren’t Like Self-Driving Cars
Everyone’s comparing Gen AI to autonomous vehicles. Here’s why that’s a category error.
On a recent episode of the All-In Podcast, the crew drew parallels between self-driving cars and large language models (LLMs), suggesting both are on a similar trajectory of data-driven evolution.
The hosts' point seems to be that, in both cases, we are "running out of data to train on," and that the solution for better outcomes is synthetic data.
It's an interesting take—but one that IMHO misses the mark in some important ways.
In fact, I'll go so far as to say this felt like "comparing oranges to giraffes." (I learned the expression from a friend recently and loved it, so I'm reusing it!)
Here’s the core issue.
If you’re new to this blog, welcome.
I write about what I’ve learned as a technology exec over the last 25 years—from startups to scale-ups, acquisitions to IPOs, and scaling Data & AI at Microsoft and Google.
No one’s paying me to write this—not my employer, not anyone.
I write to share what I know and to learn from smart, curious people.
If this resonates with you, I hope you’ll consider subscribing and sharing it. It’s free.
These are fundamentally different systems, powered by fundamentally different kinds of data.
Self-driving cars rely on finite, real-world sensor data—roads, traffic patterns, physical environments.
LLMs rely on the vast, ever-expanding stream of digital content—text, code, conversations, memes, news.
The All-In crew rightly cites the "Bitter Lesson," Rich Sutton's observation that general-purpose methods plus compute and data win out over handcrafted rules. And they point to synthetic data as the next phase. But that answer might be too simplistic.
Because the real bottleneck for LLMs isn't generating more synthetic content: it's digitizing the 95% of human knowledge that still isn't online.
Why the comparison falls apart.
First, self-driving is about reacting to real-time stimuli. Compute is used on the fly to adjust to moving pedestrians, shifting traffic, unpredictable weather. The physical world doesn’t change that quickly—the challenge is in real-time response, not new data.
LLMs, on the other hand, are about pattern recognition on historical and streaming data.
And here's the scale: the internet adds roughly 402 million terabytes of new data every day. The challenge is continuously training and updating models to reflect this ever-shifting landscape.
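To make the contrast concrete, here's a toy sketch in Python. Everything in it is hypothetical and wildly simplified, but it shows where the compute goes in each pattern: the reactive loop runs a fixed policy against fresh inputs, while the batch pipeline rebuilds the model itself as data accumulates.

```python
import random

random.seed(1)

# Reactive pattern (self-driving): the model is fixed; compute is spent
# responding to each new sensor reading with low latency.
def reactive_loop(readings):
    policy = lambda dist: "brake" if dist < 10 else "cruise"
    return [policy(r) for r in readings]      # decisions made on the fly

# Batch pattern (LLM-style): compute is spent periodically rebuilding the
# model itself as the corpus grows. "Retraining" here is just word counts.
def retrain(corpus):
    counts = {}
    for doc in corpus:
        for word in doc.split():
            counts[word] = counts.get(word, 0) + 1
    return counts                             # the refreshed "model"

sensor_stream = [random.uniform(0, 50) for _ in range(5)]
print(reactive_loop(sensor_stream))           # same model, fresh inputs

corpus = ["roads traffic weather"]
corpus.append("text code conversations memes news")  # the expanding stream
print(retrain(corpus))                        # new model, because new data
```

The design point, not the code, is what matters: in the first pattern the intelligence is baked in and compute serves the moment; in the second, the data keeps moving, so the expensive step is refreshing the model itself.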
Then there’s the synthetic versus digitized data question. Sure, synthetic data may help LLMs generalize. But it can only remix what they already know. Digitizing new domains—like institutional knowledge, analog records, or sensory experiences—unlocks entirely new frontiers.
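To see why, here's a minimal, purely illustrative simulation. A word-frequency table stands in for a model (my simplification, not how LLMs actually work): each generation is trained only on text sampled from the one before it. Rare words that miss the sample are gone for good, so the "model" can only lose knowledge, never gain it.

```python
import random
from collections import Counter

random.seed(0)

def fit(corpus):
    """'Train' a model: just the word frequencies of the corpus."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sample(model, n=200):
    """Generate a synthetic corpus by sampling from the model."""
    words, weights = zip(*model.items())
    return random.choices(words, weights=weights, k=n)

# "Real" data: mostly common knowledge, a sliver of rare knowledge.
data = ["common"] * 900 + ["uncommon"] * 90 + ["rare"] * 10

for gen in range(6):
    model = fit(data)
    print(f"gen {gen}: vocabulary = {sorted(model)}")
    data = sample(model)   # the next generation sees only synthetic text
```

Once a word fails to appear in a synthetic sample, its probability is zero forever; the vocabulary can only shrink. Digitizing new domains is the opposite move: it adds words the model has never seen.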
Remember my CarCast on this?
At the beginning of the year, I recorded a CarCast explaining how everyone's focused on scaling compute while ignoring the missing input layer.
An estimated 95% of global knowledge isn't yet digitized. This includes undocumented employee expertise, niche practices, even physical-world data like taste, smell, and texture.
Right now, AI models are being trained on an incomplete map.
Here is the video as a refresher:
Here are the takeaways:
1. Don't confuse synthetic with sufficient. New insights require new data.
2. Digitization is the real unlock. The 95% that's still offline is where the next breakthroughs lie.
3. Self-driving is about compute-on-the-fly; LLMs are about update-at-scale.
4. Comparisons can be useful, but they can also be misleading. Understand the source of each system's intelligence.
I write about what I’ve learned as a technology executive over the last 25 years. I’ve helped build startups from inception and scale them. I’ve been acquired. I’ve acquired and invested in companies. I’ve worked at mid-size firms through IPO. I helped scale Data, AI and Analytics businesses at Microsoft and Google.
The above are my thoughts. No one’s paying me to write this—not my employer, not anyone.
Please feel free to comment and share your thoughts. I welcome smart debates. I write this to learn from smart people, whether they agree with me or not.
If this resonates with you, I hope you’ll consider subscribing. It’s free!
From the comments:
You allude to it in the narrative, but this is a question that has bugged me for a while: if we use data generated by an LLM to train an LLM, we just risk amplifying and reinforcing the errors (and the gaps) that already exist in it.
Synthetic data can be a great way to distill a model, or to expand it in one dimension or domain with guided generation based on the LLM and some resource you augment it with in the prompting. But the idea that it can be used to train whole new, bigger, better models feels a bit "perpetual motion" to me 🧐