◉ Breaking Inception closes $50M seed led by Menlo Ventures Mercury reportedly hits 1,000+ tokens/sec on a single H100 Andrew Ng and Andrej Karpathy join as angels Mercury Coder pricing: $0.25 / $0.75 per million tokens OpenAI API compatible, on-prem deployable Mercury 2 and Mercury Edit 2 announced NVIDIA, Snowflake, Databricks, Microsoft all on the cap table
YesPress / Profiles Issue No. 042 Palo Alto, CA

Inception

The diffusion company that wants to make autoregressive LLMs feel like dial-up.

Founded 2024 Team 37 Raised $50M Seed Flagship Mercury
Inception - A new frontier in LLM speed
FIG. 1 — The company calling card. ASCII drift over a navy field, the word inception in lowercase, a single tagline: A new frontier in LLM speed. Read it as a thesis statement, not a slogan.
The Scene · June 2026

A model that writes a paragraph at once.

Inside a converted office above University Avenue in Palo Alto, a developer types a half-formed prompt into a terminal and hits enter. The response does not creep across the screen one token at a time, the way a Claude or a GPT reply usually does. It arrives in a single visible block, sharpens, settles. The whole reply, drawn at once and then refined - the way a Polaroid pulls itself into focus.

That is the parlor trick. It is also the entire pitch of Inception, a 37-person company founded in 2024 by Stanford professor Stefano Ermon and two of his former PhD students, Aditya Grover and Volodymyr Kuleshov. They have spent a decade building diffusion models - the math that underwrites Midjourney's images and OpenAI's Sora videos. They are now arguing, with a flagship model named Mercury and $50 million in fresh seed money, that the same trick works for text.

It is not a small claim. Almost every commercial LLM you have used generates tokens autoregressively: predict the next word, append it, predict the next word, append it, repeat. The architecture is famously elegant and famously sequential. Inception's bet is that elegance has been paid for in latency - and that the bill is finally coming due as agents, voice interfaces and live coding tools demand answers in milliseconds, not seconds.

“When we began applying diffusion to language in my lab at Stanford, many doubted it could work.” — Stefano Ermon, co-founder & CEO
10×
Reported speedup vs autoregressive baselines
1,000+
Tokens/sec on a single H100
$50M
Seed round, Nov 2025
37
People in Palo Alto
The Mechanic

Why parallel beats sequential.

Diffusion models start with noise and refine it into a signal. Autoregressive models start with nothing and guess forward, word by word. Both can produce a paragraph. Only one of them produces it in a single shot.

Two ways to write a sentence

Method AAutoregressive LLMs

  • One token at a time, strictly left to right.
  • Each token waits for the one before it.
  • Hard to revise an earlier mistake.
  • Latency scales linearly with output length.

Method BDiffusion LLMs (dLLMs)

  • Generates blocks of tokens in parallel.
  • Iteratively denoises the whole sequence.
  • Built-in error correction across the draft.
  • Throughput is the goal, not the byproduct.

Reported throughput, tokens/sec on a single H100

Mercury (Inception)1,000+
GPT-4o Mini~200
Claude 3.5 Haiku~180
Llama-class baseline~120
Approximate; Mercury figures per Inception. Comparison models are illustrative, speed-optimized peers.
The Lineup

What you can actually buy.

Three Mercury models, an OpenAI-compatible API, and an on-prem option for companies that won't ship their data over the public internet.

Mercury Coder

Code, fast.

The model that put Inception on the map. Diffusion-based, tuned for code generation, priced at $0.25 per million input tokens and $0.75 per million output tokens. OpenAI API compatible, which means swapping it in is mostly a base URL change.

Mercury 2

Reasoning, faster.

The company's pitch for the next leg: a diffusion LLM built for reasoning workloads. The selling point isn't a higher benchmark score - it's getting a comparable score back before the user notices a delay.

Mercury Edit 2

Code, in flight.

A compact dLLM specifically for code editing - the latency-sensitive loop inside your IDE where every 200ms of lag is a paper cut. Small model, narrow job, fast.

The People

Three researchers, one bet.

Stefano Ermon is a Stanford computer science professor whose lab is one of the small handful of groups responsible for the diffusion idea in the first place. His co-founders Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell) are his former doctoral students who went on to start their own research groups. Between them, they've contributed to direct preference optimization, flash attention and decision transformers - the foundational machinery of the current AI moment.

Inception is what happens when researchers stop publishing about a thing and start selling it.

SE

Stefano Ermon

Co-founder · CEO
AG

Aditya Grover

Co-founder
VK

Volodymyr Kuleshov

Co-founder
The Receipts

A short, fast year.

February 2025

Mercury Coder is unveiled.

Inception introduces what it calls the first commercial-grade diffusion LLM. Code is the wedge.

June 2025

Mercury technical paper hits arXiv (2506.17298).

The research write-up follows the product, not the other way around.

November 2025

$50M seed round closes.

Menlo leads. Mayfield, M12, NVentures, Snowflake and Databricks join. So do Andrew Ng and Andrej Karpathy.

November 2025

Mercury 2 and Mercury Edit 2 launch.

Reasoning and code-editing variants, both diffusion-based, both API-compatible with OpenAI.

Watch

Interviews & demos.

Back to the Scene

The terminal, revisited.

Return to that developer in the Palo Alto office. A year ago, the response would have crawled out token by token; the developer would have read it as it streamed, the way people do now. With Mercury, the answer is on the screen before the eye has decided where to land. The waiting is gone. What was a stream is suddenly a sentence.

That is the thing Inception is selling. Not a smarter model - the company is careful not to claim that - but a different shape of the same thing, one where latency stops being the tax you pay for fluency. If they are right, the chunk of AI products that only kind of work today because the answer arrives too slowly - real-time voice agents, live pair-programmers, conversational agents that do not feel like dictation - become things that simply work.

If they're wrong, autoregression keeps winning. The cap table suggests a fair number of people who have built the current generation of LLMs would rather not bet against the alternative. Andrew Ng is in. Andrej Karpathy is in. NVIDIA, whose silicon Mercury runs on, is in. So is a Stanford professor who has spent ten years arguing that you can denoise your way to language, and who finally has a product to point at.

The developer hits enter again. The next answer is already there.

Share Inception.