From an Alpine Village to the Center of the AI Universe
Somewhere in the northeastern Italian Alps, in a community of a few hundred people, a teenager was obsessed with one question: how does the brain actually work? That obsession carried Stefano Ermon from Padova to Cornell, then to Stanford, and eventually into one of the most consequential bets in generative AI - that the entire industry had picked the wrong architecture.
Ermon is not a man who announces his moves loudly. He spent years quietly running the ermongroup at Stanford, a research lab that punched so far above its weight that it's now easy to forget how contrarian the work was at the time. When his team started applying diffusion to language models, the consensus response was polite skepticism. Diffusion was for images. Autoregression was for text. Everyone knew that.
He wasn't interested in what everyone knew. He was interested in what the math suggested.
An LLM is more like a typewriter where you go left to right one token at a time. A diffusion model is more like an editor.
- Stefano Ermon

The editor metaphor is the one that keeps appearing in his interviews, and it gets at something fundamental. Autoregressive models commit. Each token is final. The model cannot revise what it's already written. Diffusion models, by contrast, are trained to correct mistakes. They refine. They iterate. And in Ermon's hands, they've gotten frighteningly fast.
The Papers That Moved the Field
To understand why Inception Labs' investors - including Andrej Karpathy and Andrew Ng - moved fast, you need to look at the publication record. Ermon didn't just write papers. He co-authored the infrastructure that underpins modern generative AI.
The DDIM paper is the one powering the image generation tools you've used. DALL-E, Stable Diffusion, Midjourney - they all rely on the sampling acceleration that Ermon's group pioneered. DPO is now one of the standard ways to align language models with human preferences. It's used everywhere, often without anyone knowing where it came from.
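For the curious, the DPO objective mentioned above is compact enough to state inline. This is a minimal sketch of the standard published loss for a single preference pair - not Inception-specific code - where the inputs are summed log-probabilities of each response under the trained policy and a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin compares the policy-vs-reference log-ratio of the
    preferred (chosen) response against that of the rejected one.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

At a margin of zero the loss is ln 2; shifting probability mass toward the chosen response drives it down. The appeal is exactly this simplicity - no reward model, no RL loop, just a classification-style loss over preference pairs.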
The SEDD paper was the direct intellectual ancestor of Mercury. It extended score matching to discrete spaces - meaning text, not just images - and demonstrated that diffusion could outperform GPT-2. When that paper was named one of ICML 2024's ten Best Papers out of 9,473 submissions, Ermon was already thinking about what came next.
Inception Labs: The Diffusion Bet Goes Commercial
Ermon co-founded Inception Labs in 2024 alongside Aditya Grover (UCLA professor, former Stanford researcher) and Volodymyr Kuleshov (Cornell professor, former Stanford student). The founding team is almost comically credentialed - three professors with deep roots in the same Stanford ecosystem that produced the research in the first place.
The company emerged from stealth in February 2025 with Mercury - the world's first commercial-scale diffusion LLM. The pitch was simple: same quality, 10x the speed, 1/10th the cost. It was already deployed in Fortune 500 coding tools before the announcement went out.
The World's First Reasoning Diffusion LLM
Launched February 24, 2026. Mercury 2 processes and refines entire sequences in parallel - not left to right. The result: 1,000+ tokens per second at a fraction of the inference cost of competing models. Mercury Edit 2 handles latency-sensitive coding workflows.
The November 2025 Series A told its own story. $50 million, led by Menlo Ventures. Participating: Mayfield, Innovation Endeavors, NVentures (NVIDIA's corporate arm), Microsoft M12, Snowflake Ventures, Databricks Ventures. Then the angels: Andrew Ng and Andrej Karpathy. When Karpathy - formerly of OpenAI, formerly of Tesla, a man who chooses his bets carefully - backs a diffusion startup, people pay attention.
When we began applying diffusion to language in my lab at Stanford, many doubted it could work. That research became Mercury diffusion LLM: 10X faster, more efficient, and now the foundation of Inception Labs.
- Stefano Ermon

Why Diffusion, and Why Now
Ermon's central claim is architectural: autoregressive models are fundamentally limited by their sequential nature. Each token blocks the next. You cannot parallelize the core generation loop without abandoning the approach entirely.
Diffusion models generate text differently. They start with noise and iteratively refine toward coherence - processing positions in parallel, able to revise earlier decisions as generation proceeds. This means the architecture scales with compute in ways that autoregression doesn't.
The error correction point matters for a specific reason. Autoregressive models commit to mistakes. Once they've emitted a token, it's fixed - subsequent tokens condition on that error and amplify it. Diffusion models, trained to denoise, are designed to correct. At inference time, they keep refining, keep fixing, working toward the answer rather than toward the next character.
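The control-flow difference can be sketched in a toy script. Everything here is illustrative, not Inception's method: a real model predicts tokens from learned logits, while this toy just resamples toward a fixed target string. What it does show accurately is the shape of each loop - one sequential step per token versus a few parallel refinement rounds that can revisit every position:

```python
import random

VOCAB = "ab"
TARGET = list("abba" * 4)  # 16-token toy target sequence

def generate_autoregressive():
    """Left-to-right decoding: one sequential step per token, and each
    emitted token is final - no later revision is possible."""
    seq, steps = [], 0
    for tok in TARGET:           # stand-in for sampling the next token
        seq.append(tok)
        steps += 1
    return "".join(seq), steps   # steps == len(TARGET): strictly sequential

def generate_diffusion(max_rounds=50, seed=0):
    """Iterative refinement: every position is updated in parallel each
    round, so an early mistake can still be corrected later."""
    rng = random.Random(seed)
    seq = [rng.choice(VOCAB) for _ in TARGET]   # start from pure noise
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # one parallel "denoising" round: keep positions that already
        # match, resample the rest (a real model would predict them)
        seq = [t if s == t else rng.choice(VOCAB)
               for s, t in zip(seq, TARGET)]
        if seq == TARGET:
            break
    return "".join(seq), rounds
```

With a 16-token target, the autoregressive path always takes 16 sequential steps, while the refinement path typically converges in a handful of parallel rounds - the same intuition behind Mercury's throughput claims, minus everything that makes a real model work.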
Diffusion as the Architecture of Generative AI
"We're envisioning a future where all LLMs are going to be based on the diffusion paradigm. That's important because it will make generative AI solutions better, will make them cheaper, will make them faster, will improve the quality of the answers."
Mercury 2's performance numbers are the current evidence for this thesis. 1,000+ tokens per second isn't a marginal improvement - it's a different category. For applications that need real-time response, structured output, or high-volume inference, the gap between autoregression and diffusion is becoming difficult to explain away.
The Arc
Master's in Electrical Engineering at Università degli Studi di Padova. First serious exposure to signal processing and statistical modeling.
Cornell University: MS then PhD in Computer Science. Research into probabilistic graphical models and scalable inference.
Joins Stanford faculty as Computer Science professor. Founds the ermongroup, focused on machine learning, generative models, and computational sustainability.
Sloan Research Fellowship, NSF CAREER Award. Woods Institute for the Environment fellowship for applying ML to sustainability.
IJCAI Computers and Thought Award - the premier recognition for AI researchers under 35.
DDIM co-authored with Jiaming Song and Chenlin Meng. Enables 10-50x faster diffusion sampling. Becomes foundational to every major image generation system.
DDIM wins Outstanding Paper at ICLR 2021. FlashAttention published. ICLR 2022 Outstanding Paper. Lab output accelerates.
DPO published - becomes a standard LLM alignment technique. NeurIPS 2023 Outstanding Paper Runner-up. SEDD discrete diffusion work begins.
SEDD wins ICML 2024 Best Paper (1 of 10 from 9,473 submissions). Inception Labs co-founded with Aditya Grover and Volodymyr Kuleshov.
Mercury launches from stealth. $50M Series A with Andrej Karpathy, Andrew Ng, NVIDIA, Microsoft, Databricks, Snowflake. Monograph on diffusion models published.
Mercury 2 launches: world's first reasoning diffusion LLM at 1,000+ tokens per second. 5x faster than any speed-optimized competitor.
Awards & Honors
Best Paper Award for SEDD - 1 of 10 best papers from 9,473 submissions
Outstanding Paper Awards both years - exceptional in competitive ML venues
Premier award for AI researchers under 35. Awarded by the International Joint Conference on AI.
Alfred P. Sloan Foundation fellowship for outstanding early-career scientists
NSF's most prestigious award in support of early-career faculty
Air Force and Office of Naval Research both selected Ermon for young investigator programs
62,700+ citations on Semantic Scholar. 190+ publications across AI, ML, and sustainability
Plus Sony Faculty Innovation Award and Hellman Fellowship
For the Record
He grew up in an Alpine village of just a few hundred people in northeastern Italy. From there to shaping how AI generates text globally.
The ermongroup at Stanford produced the algorithms powering Stable Diffusion, DALL-E, and Midjourney. Most users of those tools have never heard his name.
Mercury 2 hits 1,000+ tokens per second. A fast human reader processes about 5 tokens per second. The gap is 200x.
Soccer, hiking, hockey. Not an obvious combination, but it tracks for someone from the Alps who ended up in Silicon Valley.
His fascination with neuroscience - with how the brain actually works - started in high school. It still drives the research agenda.
The transition from academia to startup wasn't about money. It was about scale. You can't run a 37-person company's worth of inference experiments from a university lab.
On the Record
As an academic, as a researcher, you always have to take contrarian bets - you're never going to be first if you just follow what everybody else is doing.
- Stefano Ermon

If we think about an autoregressive model, once it outputs something, it can never take it back. A diffusion model is actually trained to correct mistakes, and during inference when it generates an answer, it keeps refining, it keeps fixing mistakes.
- Stefano Ermon, on the fundamental difference

The theory and mathematical analysis does provide intuition, but it's still very empirical.
- Stefano Ermon, on deep learning practice vs. theory