Breaking
Mercury 2 launches: 1,000+ tokens per second - 5x faster than any speed-optimized LLM · Inception Labs raises $50M Series A led by Menlo Ventures - backers include Andrej Karpathy & Andrew Ng · ICML 2024 Best Paper: Ermon's SEDD discrete diffusion framework beats GPT-2 · Stanford professor on leave to build the future of AI inference · DPO, DDIM, FlashAttention - Ermon's lab fingerprints are on the modern AI stack

Stefano Ermon

Alpine village. Cornell PhD. Stanford lab. Now rewriting how AI thinks.

Co-inventor of DDIM, DPO & FlashAttention. Now CEO of Inception Labs - building diffusion models that generate text as fast as human thought.

CEO, Inception Labs · Stanford Associate Professor · ICML 2024 Best Paper · $50M Raised
62K+ Citations
85 H-Index
1K+ Tokens/sec
Stefano Ermon - CEO of Inception Labs and Stanford Professor
Currently at Inception Labs
1,000+ Tokens per second

Mercury 2's output speed. Claude Haiku Reasoning: 89 tok/s. GPT-5 Mini: 71 tok/s.

10x Faster & Cheaper

Mercury vs. traditional LLMs on inference speed and cost.

$50M Series A

Raised November 2025. Backed by NVIDIA, Microsoft, Databricks, Snowflake.

3 Foundational Inventions

DDIM, DPO, FlashAttention - techniques powering the modern AI stack.

From an Alpine Village to the Center of the AI Universe

Somewhere in the northeastern Italian Alps, in a community of a few hundred people, a teenager was obsessed with one question: how does the brain actually work? That obsession carried Stefano Ermon from Padova to Cornell, then to Stanford, and eventually into one of the most consequential bets in generative AI - that the entire industry had picked the wrong architecture.

Ermon is not a man who announces his moves loudly. He spent years quietly running the ermongroup at Stanford, a research lab that punched so far above its weight that it's now easy to forget how contrarian the work was at the time. When his team started applying diffusion to language models, the consensus response was polite skepticism. Diffusion was for images. Autoregression was for text. Everyone knew that.

He wasn't interested in what everyone knew. He was interested in what the math suggested.

An LLM is more like a typewriter where you go left to right one token at a time. A diffusion model is more like an editor.

- Stefano Ermon

The editor metaphor is the one that keeps appearing in his interviews, and it gets at something fundamental. Autoregressive models commit. Each token is final. The model cannot revise what it's already written. Diffusion models, by contrast, are trained to correct mistakes. They refine. They iterate. And in Ermon's hands, they've gotten frighteningly fast.

The Papers That Moved the Field

To understand why Inception Labs' investors - including Andrej Karpathy and Andrew Ng - moved fast, you need to look at the publication record. Ermon didn't just write papers. He co-authored the infrastructure that underpins modern generative AI.

Denoising Diffusion Implicit Models (DDIM) ICLR 2021 Outstanding Paper
2020 · Co-authors: Jiaming Song, Chenlin Meng · 10-50x faster sampling
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO) NeurIPS 2023 Runner-up
2023 · Co-authors: Rafailov, Sharma, Mitchell, Manning, Finn · Standard LLM alignment technique
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (SEDD) ICML 2024 Best Paper
2024 · Co-authors: Aaron Lou, Chenlin Meng · 1 of 10 best papers from 9,473 submissions · Outperforms GPT-2
The Principles of Diffusion Models
2025 · Monograph · Co-authors include Yang Song

The DDIM paper is the one powering the image generation tools you've used. DALL-E, Stable Diffusion, Midjourney - they all rely on the sampling acceleration that Ermon's group pioneered. DPO is now one of the standard ways to align language models with human preferences. It's used everywhere, often without anyone knowing where it came from.
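For readers who want to see what DPO actually optimizes, here is a minimal sketch of the objective from the paper. The function and the log-probability inputs are made up for illustration; in practice the log-probs come from a trained policy and a frozen reference model summed over a full response.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Negative log-sigmoid of the scaled difference in implicit rewards.
    Each argument is a summed log-probability of a complete response;
    beta controls how far the policy may drift from the reference."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigma(margin)

# If the policy already prefers the chosen response more strongly than
# the reference does, the margin is positive and the loss is small.
loss = dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-15.0)
print(round(loss, 4))
```

The key design choice DPO made is visible here: there is no separately trained reward model, because the reward is defined implicitly by the policy-to-reference log-ratio.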

The SEDD paper was the direct intellectual ancestor of Mercury. It extended score matching to discrete spaces - meaning text, not just images - and demonstrated that diffusion could outperform GPT-2. When that paper won Best Paper at ICML 2024, beating out 9,473 other submissions, Ermon was already thinking about what came next.

Inference Speed Comparison
Output throughput · tokens per second · as of February 2026
Mercury 2: 1,000+ tok/s
Claude Haiku Reasoning: 89 tok/s
GPT-5 Mini: 71 tok/s

Inception Labs: The Diffusion Bet Goes Commercial

Ermon co-founded Inception Labs in 2024 alongside Aditya Grover (UCLA professor, former Stanford researcher) and Volodymyr Kuleshov (Cornell professor, former Stanford student). The founding team is almost comically credentialed - three professors with deep roots in the same Stanford ecosystem that produced the research in the first place.

The company emerged from stealth in February 2025 with Mercury - the world's first commercial-scale diffusion LLM. The pitch was simple: same quality, 10x the speed, 1/10th the cost. It was already deployed in Fortune 500 coding tools before the announcement went out.

The November 2025 Series A told its own story. $50 million, led by Menlo Ventures. Participating: Mayfield, Innovation Endeavors, NVentures (NVIDIA's corporate arm), Microsoft M12, Snowflake Ventures, Databricks Ventures. Then the angels: Andrew Ng and Andrej Karpathy. When Karpathy - formerly of OpenAI, formerly of Tesla, a man who chooses his bets carefully - backs a diffusion startup, people pay attention.

When we began applying diffusion to language in my lab at Stanford, many doubted it could work. That research became Mercury diffusion LLM: 10X faster, more efficient, and now the foundation of Inception Labs.

- Stefano Ermon

Why Diffusion, and Why Now

Ermon's central claim is architectural: autoregressive models are fundamentally limited by their sequential nature. Each token blocks the next. You cannot parallelize the core generation loop without abandoning the approach entirely.

Diffusion models generate text differently. They start with noise and iteratively refine toward coherence - processing positions in parallel, able to revise earlier decisions as generation proceeds. This means the architecture scales with compute in ways that autoregression doesn't.

The error correction point matters for a specific reason. Autoregressive models commit to mistakes. Once they've emitted a token, it's fixed - subsequent tokens condition on that error and amplify it. Diffusion models, trained to denoise, are designed to correct. At inference time, they keep refining, keep fixing, working toward the answer rather than toward the next character.
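The editor-style decoding described above can be caricatured in a few lines. This is a toy sketch, not Mercury's actual algorithm: `toy_model`, the confidence scores, and the fill-half-per-round schedule are all invented for illustration. A real diffusion LM predicts tokens from context with a trained network rather than looking up a fixed target.

```python
import random

TARGET = ["the", "quick", "brown", "fox", "jumps"]
MASK = "<mask>"

def toy_model(seq):
    """Pretend denoiser: proposes a token for every masked slot at once,
    each with a confidence score. A real model predicts from context."""
    return {i: (TARGET[i], random.random())
            for i, tok in enumerate(seq) if tok == MASK}

def diffusion_decode(length, steps=3):
    """Start fully masked; each step commits the most confident
    proposals in parallel instead of emitting one token at a time."""
    seq = [MASK] * length
    for _ in range(steps):
        proposals = toy_model(seq)
        if not proposals:
            break
        # Commit the top half of proposals by confidence this round.
        k = max(1, len(proposals) // 2)
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:k]
        for i, (tok, _) in best:
            seq[i] = tok
    # Final pass: resolve any remaining masks.
    for i, (tok, _) in toy_model(seq).items():
        seq[i] = tok
    return seq

print(diffusion_decode(len(TARGET)))  # prints ['the', 'quick', 'brown', 'fox', 'jumps']
```

The contrast with autoregression lives in the inner loop: every masked position gets a proposal in the same forward pass, and nothing forces the model to finalize position 2 before position 5.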

The Core Bet

Diffusion as the Architecture of Generative AI

"We're envisioning a future where all LLMs are going to be based on the diffusion paradigm. That's important because it will make generative AI solutions better, will make them cheaper, will make them faster, will improve the quality of the answers."

Mercury 2's performance numbers are the current evidence for this thesis. 1,000+ tokens per second isn't a marginal improvement - it's a different category. For applications that need real-time response, structured output, or high-volume inference, the gap between autoregression and diffusion is becoming difficult to explain away.
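A quick back-of-envelope calculation shows why that gap matters for user-facing latency. The throughput figures are the ones quoted above; the 500-token response length is an assumption chosen to stand in for a typical long-form answer.

```python
# Time to stream a 500-token response at each quoted throughput.
speeds = {"Mercury 2": 1000, "Claude Haiku Reasoning": 89, "GPT-5 Mini": 71}
response_tokens = 500  # assumed length of a long-form answer

for model, tps in speeds.items():
    print(f"{model}: {response_tokens / tps:.1f}s for {response_tokens} tokens")
```

At these numbers a response that feels instantaneous from Mercury 2 (half a second) takes five to seven seconds from the comparison models, which is the difference between interactive and not.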

The Arc

2006 - 2008

Master's in Electrical Engineering at Università degli Studi di Padova. First serious exposure to signal processing and statistical modeling.

2008 - 2015

Cornell University: MS then PhD in Computer Science. Research into probabilistic graphical models and scalable inference.

2015

Joins Stanford faculty as Computer Science professor. Founds the ermongroup, focused on machine learning, generative models, and computational sustainability.

2017

Sloan Research Fellowship, NSF CAREER Award. Woods Institute for the Environment fellowship for applying ML to sustainability.

2018

IJCAI Computers and Thought Award - the premier recognition for AI researchers under 35.

2020

DDIM co-authored with Jiaming Song and Chenlin Meng. Enables 10-50x faster diffusion sampling. Becomes foundational to every major image generation system.

2021 - 2022

DDIM wins an Outstanding Paper Award at ICLR 2021. FlashAttention is published, and the lab collects another Outstanding Paper Award at ICLR 2022. Lab output accelerates.

2023

DPO published - becomes a standard LLM alignment technique. NeurIPS 2023 Outstanding Paper Runner-up. SEDD discrete diffusion work begins.

2024

SEDD wins ICML 2024 Best Paper (1 of 10 from 9,473 submissions). Inception Labs co-founded with Aditya Grover and Volodymyr Kuleshov.

2025

Mercury launches from stealth. $50M Series A with Andrej Karpathy, Andrew Ng, NVIDIA, Microsoft, Databricks, Snowflake. Monograph on diffusion models published.

2026

Mercury 2 launches: world's first reasoning diffusion LLM at 1,000+ tokens per second. 5x faster than any speed-optimized competitor.

Awards & Honors

🏆
ICML 2024

Best Paper Award for SEDD - 1 of 10 best papers from 9,473 submissions

🏆
ICLR 2021 & 2022

Outstanding Paper Awards both years - exceptional in competitive ML venues

🎓
IJCAI Computers & Thought 2018

Premier award for AI researchers under 35. Awarded by the International Joint Conference on AI.

🔬
Sloan Research Fellowship 2017

Alfred P. Sloan Foundation fellowship for outstanding early-career scientists

🏅
NSF CAREER Award

NSF's most prestigious award in support of early-career faculty

✈️
AFOSR & ONR Young Investigator

Air Force and Office of Naval Research both selected Ermon for young investigator programs

📊
H-Index: 85

62,700+ citations on Semantic Scholar. 190+ publications across AI, ML, and sustainability

💡
Microsoft Research Fellowship

Plus Sony Faculty Innovation Award and Hellman Fellowship

For the Record

01

He grew up in an Alpine village of just a few hundred people in northeastern Italy - and went from there to shaping how AI generates text worldwide.

02

The ermongroup at Stanford produced the algorithms powering Stable Diffusion, DALL-E, and Midjourney. Most users of those tools have never heard his name.

03

Mercury 2 hits 1,000+ tokens per second. A fast human reader processes about 5 tokens per second. The gap is 200x.

04

Soccer, hiking, hockey. Not an obvious combination, but it tracks for someone from the Alps who ended up in Silicon Valley.

05

His fascination with neuroscience - with how the brain actually works - started in high school. It still drives the research agenda.

06

The transition from academia to startup wasn't about money. It was about scale. You can't run a 37-person company's worth of inference experiments from a university lab.

On the Record

As an academic, as a researcher, you always have to take contrarian bets - you're never going to be first if you just follow what everybody else is doing.

- Stefano Ermon

If we think about an autoregressive model, once it outputs something, it can never take it back. A diffusion model is actually trained to correct mistakes, and during inference when it generates an answer, it keeps refining, it keeps fixing mistakes.

- Stefano Ermon, on the fundamental difference

The theory and mathematical analysis does provide intuition, but it's still very empirical.

- Stefano Ermon, on deep learning practice vs. theory

Links & Resources