Inception

A model that writes a paragraph at once.

Inside a converted office above University Avenue in Palo Alto, a developer types a half-formed prompt into a terminal and hits enter. The response does not creep across the screen one token at a time, the way a Claude or a GPT reply usually does. It arrives in a single visible block, sharpens, settles. The whole reply, drawn at once and then refined - the way a Polaroid pulls itself into focus.

That is the parlor trick. It is also the entire pitch of Inception, a 37-person company founded in 2024 by Stanford professor Stefano Ermon and two of his former PhD students, Aditya Grover and Volodymyr Kuleshov. They have spent a decade building diffusion models - the math that underwrites Midjourney's images and OpenAI's Sora videos. They are now arguing, with a flagship model named Mercury and $50 million in fresh seed money, that the same trick works for text.

It is not a small claim. Almost every commercial LLM you have used generates tokens autoregressively: predict the next word, append it, predict the next word, append it, repeat. The architecture is famously elegant and famously sequential. Inception's bet is that elegance has been paid for in latency - and that the bill is finally coming due as agents, voice interfaces and live coding tools demand answers in milliseconds, not seconds.

“When we began applying diffusion to language in my lab at Stanford, many doubted it could work.” — Stefano Ermon, co-founder & CEO

10×

Reported speedup vs autoregressive baselines

1,000+

Tokens/sec on a single H100

$50M

Seed round, Nov 2025

People in Palo Alto

Why parallel beats sequential.

Diffusion models start with noise and refine it into a signal. Autoregressive models start with nothing and guess forward, word by word. Both can produce a paragraph. Only one of them produces it in a single shot.

Two ways to write a sentence

Method AAutoregressive LLMs

One token at a time, strictly left to right.
Each token waits for the one before it.
Hard to revise an earlier mistake.
Latency scales linearly with output length.

Method BDiffusion LLMs (dLLMs)

Generates blocks of tokens in parallel.
Iteratively denoises the whole sequence.
Built-in error correction across the draft.
Throughput is the goal, not the byproduct.

Reported throughput, tokens/sec on a single H100

Mercury (Inception)1,000+

GPT-4o Mini~200

Claude 3.5 Haiku~180

Llama-class baseline~120

Approximate; Mercury figures per Inception. Comparison models are illustrative, speed-optimized peers.

What you can actually buy.

Three Mercury models, an OpenAI-compatible API, and an on-prem option for companies that won't ship their data over the public internet.

Mercury Coder

Code, fast.

The model that put Inception on the map. Diffusion-based, tuned for code generation, priced at $0.25 per million input tokens and $0.75 per million output tokens. OpenAI API compatible, which means swapping it in is mostly a base URL change.

Mercury 2

Reasoning, faster.

The company's pitch for the next leg: a diffusion LLM built for reasoning workloads. The selling point isn't a higher benchmark score - it's getting a comparable score back before the user notices a delay.

Mercury Edit 2

Code, in flight.

A compact dLLM specifically for code editing - the latency-sensitive loop inside your IDE where every 200ms of lag is a paper cut. Small model, narrow job, fast.

Three researchers, one bet.

Stefano Ermon is a Stanford computer science professor whose lab is one of the small handful of groups responsible for the diffusion idea in the first place. His co-founders Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell) are his former doctoral students who went on to start their own research groups. Between them, they've contributed to direct preference optimization, flash attention and decision transformers - the foundational machinery of the current AI moment.

Inception is what happens when researchers stop publishing about a thing and start selling it.

Stefano Ermon

Co-founder · CEO

Aditya Grover

Co-founder

Volodymyr Kuleshov

Co-founder

A short, fast year.

February 2025

Mercury Coder is unveiled.

Inception introduces what it calls the first commercial-grade diffusion LLM. Code is the wedge.

June 2025

Mercury technical paper hits arXiv (2506.17298).

The research write-up follows the product, not the other way around.

November 2025

$50M seed round closes.

Menlo leads. Mayfield, M12, NVentures, Snowflake and Databricks join. So do Andrew Ng and Andrej Karpathy.

November 2025

Mercury 2 and Mercury Edit 2 launch.

Reasoning and code-editing variants, both diffusion-based, both API-compatible with OpenAI.

The terminal, revisited.

Return to that developer in the Palo Alto office. A year ago, the response would have crawled out token by token; the developer would have read it as it streamed, the way people do now. With Mercury, the answer is on the screen before the eye has decided where to land. The waiting is gone. What was a stream is suddenly a sentence.

That is the thing Inception is selling. Not a smarter model - the company is careful not to claim that - but a different shape of the same thing, one where latency stops being the tax you pay for fluency. If they are right, the chunk of AI products that only kind of work today because the answer arrives too slowly - real-time voice agents, live pair-programmers, conversational agents that do not feel like dictation - become things that simply work.

If they're wrong, autoregression keeps winning. The cap table suggests a fair number of people who have built the current generation of LLMs would rather not bet against the alternative. Andrew Ng is in. Andrej Karpathy is in. NVIDIA, whose silicon Mercury runs on, is in. So is a Stanford professor who has spent ten years arguing that you can denoise your way to language, and who finally has a product to point at.

The developer hits enter again. The next answer is already there.

Inception

A model that writes a paragraph at once.

Why parallel beats sequential.

Two ways to write a sentence

Method AAutoregressive LLMs

Method BDiffusion LLMs (dLLMs)

Reported throughput, tokens/sec on a single H100

What you can actually buy.

Code, fast.

Reasoning, faster.

Code, in flight.

Three researchers, one bet.

Stefano Ermon

Aditya Grover

Volodymyr Kuleshov

A short, fast year.

Mercury Coder is unveiled.

Mercury technical paper hits arXiv (2506.17298).

$50M seed round closes.

Mercury 2 and Mercury Edit 2 launch.

Interviews & demos.

The terminal, revisited.

Where to find Inception.

Inception

A model that writes a paragraph at once.

Why parallel beats sequential.

Two ways to write a sentence

Method AAutoregressive LLMs

Method BDiffusion LLMs (dLLMs)

Reported throughput, tokens/sec on a single H100

What you can actually buy.

Code, fast.

Reasoning, faster.

Code, in flight.

Three researchers, one bet.

Stefano Ermon

Aditya Grover

Volodymyr Kuleshov

A short, fast year.

Mercury Coder is unveiled.

Mercury technical paper hits arXiv (2506.17298).

$50M seed round closes.

Mercury 2 and Mercury Edit 2 launch.

Interviews & demos.

The terminal, revisited.

Share Inception.

Where to find Inception.