He gives the software away for free. Then he raised $150 million to make it run faster.
Training gets the headlines. Inference pays the bills - every time a model answers a question, somewhere a GPU is doing the work, and somebody is paying as the meter runs.
Simon Mo builds the meter-shrinking machinery. As lead maintainer of vLLM, the open-source inference engine born in a Berkeley lab, he spends his days on the unglamorous problem of memory management - how to pack more answers into the same silicon. The trick has a name, PagedAttention, and a payoff: large language models that run cheaper and faster without anyone rewriting their model.
In November 2025 he and his fellow vLLM maintainers turned the project's gravity into a company, Inferact. Two months later it stepped out of stealth with a $150 million seed round at an $800 million valuation, co-led by Andreessen Horowitz and Lightspeed. The product they are commercializing is the same one they keep giving away.
Chart note: illustrative only - it shows the direction of vLLM's design goals, not benchmarked figures.
vLLM is used by Amazon's cloud service and by its shopping app.
Mo did not parachute into the inference boom. He grew up in it. Before vLLM there was Ray Serve, which he helped build from zero to one at Anyscale - the model-serving layer on top of the Ray distributed framework. Before that, he was a student researcher at UC Berkeley's RISELab, asking the question he is still asking: how do you make machine-learning serving more efficient, more ergonomic, more scalable?
His fingerprints sit across the open-source map - Ray, Modin, Clipper - and his production scars run through GPU inference work and Kubernetes multi-tenancy. The throughline is not a product. It is a problem: serving is hard, and almost nobody wants to do the plumbing. Mo wanted to do the plumbing.
At PyCon 2021 he stood up to talk about "Patterns of ML Models in Production." Four years later he was delivering a keynote at the PyTorch Conference on vLLM. Same lane, bigger stage.
Ray Serve: built from scratch at Anyscale - the serving layer that taught him how production models actually break.
vLLM: lead maintainer of the inference engine, now stewarded by the PyTorch Foundation.
UC Berkeley: research on ML systems and cloud infrastructure - the lab lineage behind Spark and Ray.
Inferact: co-founder of the company commercializing vLLM, while the project stays open.
An odd transcript for a man who counts GPU memory pages for a living.
At Berkeley he studied three fields, philosophy among them. That part is the tell. Inference is a question of how to allocate scarce, expensive resources fairly and fast - which is, if you squint, an ethics problem dressed up in CUDA.
Today he is a PhD student in Berkeley's Sky Computing Lab, advised by Joseph Gonzalez and Ion Stoica - the professor whose labs produced Spark and Ray and spun out Databricks and Anyscale. Mo is, in the most literal sense, building the company while finishing the degree. The startup did not pull him out of research. It walked out of the research with him.
His GitHub badges include "Galaxy Brain" and "Arctic Code Vault Contributor."
vLLM's "v" nods to virtual memory - the PagedAttention trick that made the project famous.
His advisor, Ion Stoica, is the same researcher behind Databricks and Anyscale.
Inferact kept vLLM open - the project is now governed under the PyTorch Foundation.
The conventional startup hides the crown jewels. Mo's company runs the opposite play: keep vLLM free, open, and community-governed, then sell the speed, reliability, and operational muscle that hyperscalers need to run it at scale. The funding isn't a pivot away from open source. It is a wager that the open project and the company make each other stronger.