The diffusion company that wants to make autoregressive LLMs feel like dial-up.
Inside a converted office above University Avenue in Palo Alto, a developer types a half-formed prompt into a terminal and hits enter. The response does not creep across the screen one token at a time, the way a Claude or a GPT reply usually does. It arrives in a single visible block, sharpens, settles. The whole reply, drawn at once and then refined - the way a Polaroid pulls itself into focus.
That is the parlor trick. It is also the entire pitch of Inception, a 37-person company founded in 2024 by Stanford professor Stefano Ermon and two of his former PhD students, Aditya Grover and Volodymyr Kuleshov. They have spent a decade building diffusion models - the math that underwrites Midjourney's images and OpenAI's Sora videos. They are now arguing, with a flagship model named Mercury and $50 million in fresh seed money, that the same trick works for text.
It is not a small claim. Almost every commercial LLM you have used generates tokens autoregressively: predict the next word, append it, predict the next word, append it, repeat. The architecture is famously elegant and famously sequential. Inception's bet is that elegance has been paid for in latency - and that the bill is finally coming due as agents, voice interfaces and live coding tools demand answers in milliseconds, not seconds.
Diffusion models start with noise and refine it into a signal. Autoregressive models start with nothing and guess forward, word by word. Both can produce a paragraph. Only one of them produces it in a single shot.
Three Mercury models, an OpenAI-compatible API, and an on-prem option for companies that won't ship their data over the public internet.
The model that put Inception on the map. Diffusion-based, tuned for code generation, priced at $0.25 per million input tokens and $0.75 per million output tokens. OpenAI API compatible, which means swapping it in is mostly a base URL change.
The company's pitch for the next leg: a diffusion LLM built for reasoning workloads. The selling point isn't a higher benchmark score - it's getting a comparable score back before the user notices a delay.
A compact dLLM specifically for code editing - the latency-sensitive loop inside your IDE where every 200ms of lag is a paper cut. Small model, narrow job, fast.
Stefano Ermon is a Stanford computer science professor whose lab is one of the small handful of groups responsible for the diffusion idea in the first place. His co-founders Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell) are his former doctoral students who went on to start their own research groups. Between them, they've contributed to direct preference optimization, flash attention and decision transformers - the foundational machinery of the current AI moment.
Inception is what happens when researchers stop publishing about a thing and start selling it.
Inception introduces what it calls the first commercial-grade diffusion LLM. Code is the wedge.
The research write-up follows the product, not the other way around.
Menlo leads. Mayfield, M12, NVentures, Snowflake and Databricks join. So do Andrew Ng and Andrej Karpathy.
Reasoning and code-editing variants, both diffusion-based, both API-compatible with OpenAI.
Return to that developer in the Palo Alto office. A year ago, the response would have crawled out token by token; the developer would have read it as it streamed, the way people do now. With Mercury, the answer is on the screen before the eye has decided where to land. The waiting is gone. What was a stream is suddenly a sentence.
That is the thing Inception is selling. Not a smarter model - the company is careful not to claim that - but a different shape of the same thing, one where latency stops being the tax you pay for fluency. If they are right, the chunk of AI products that only kind of work today because the answer arrives too slowly - real-time voice agents, live pair-programmers, conversational agents that do not feel like dictation - become things that simply work.
If they're wrong, autoregression keeps winning. The cap table suggests a fair number of people who have built the current generation of LLMs would rather not bet against the alternative. Andrew Ng is in. Andrej Karpathy is in. NVIDIA, whose silicon Mercury runs on, is in. So is a Stanford professor who has spent ten years arguing that you can denoise your way to language, and who finally has a product to point at.
The developer hits enter again. The next answer is already there.