The people who built vLLM - the open-source engine already serving a slice of the world's AI - turned it into a company. Then two rival venture firms co-led its seed round.
FOUNDED 2025 · SEED $150M · VALUATION $800M · ~22 PEOPLE
Here is a fun way to think about Inferact. Most startups spend their first two years trying to convince anyone at all to use their software. Inferact started the opposite way. Before the company had a name, a bank account, or a valuation, its software - an open-source engine called vLLM - was already running, at any given moment, on more than 400,000 GPUs around the world. The problem was never distribution. The problem was that all of it was free.
vLLM was incubated in 2023 at UC Berkeley, in the lab of Databricks co-founder Ion Stoica - the same lab that, in a bit of academic sibling rivalry, also produced its main competitor, SGLang. It does the unglamorous half of artificial intelligence. Not the training of models, which gets the headlines and the magazine covers, but inference - the part where a trained model actually answers your question, over and over, for every user, forever. That part is expensive. vLLM makes it cheaper and faster by handling the fiddly, low-level work of batching requests, caching the attention keys and values a model computes for each token (the "KV cache"), and squeezing performance out of whatever chip it lands on.
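Caching keys and values is the easiest of these tricks to see in miniature. The sketch below is a toy single-head attention layer in pure Python. It is not vLLM code, and every name in it is hypothetical. During generation each new token attends to every earlier token. Without a cache, the key and value projections for the whole prefix are recomputed at every step. With a cache, each token is projected once and then reused.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def project(x: Vector, w: Matrix) -> Vector:
    # Multiply vector x by matrix w (each row of w yields one output dimension).
    return [dot(row, x) for row in w]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    # Scaled dot-product attention for a single query vector.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


class KVCache:
    """Toy per-sequence cache: stores each token's key and value exactly once."""

    def __init__(self, wk: Matrix, wv: Matrix):
        self.wk, self.wv = wk, wv
        self.keys: List[Vector] = []
        self.values: List[Vector] = []
        self.projections = 0  # number of tokens whose K/V were actually computed

    def append(self, x: Vector) -> None:
        self.keys.append(project(x, self.wk))
        self.values.append(project(x, self.wv))
        self.projections += 1


def decode_with_cache(tokens: List[Vector], wq: Matrix, wk: Matrix, wv: Matrix) -> Tuple[List[Vector], int]:
    cache = KVCache(wk, wv)
    outputs = []
    for x in tokens:
        cache.append(x)  # only the newest token is projected
        outputs.append(attend(project(x, wq), cache.keys, cache.values))
    return outputs, cache.projections


def decode_without_cache(tokens: List[Vector], wq: Matrix, wk: Matrix, wv: Matrix) -> Tuple[List[Vector], int]:
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, wk) for x in prefix]  # recomputed at every step
        values = [project(x, wv) for x in prefix]
        projections += len(prefix)
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, projections
```

Take four 2-dimensional tokens and identity weight matrices. The cached loop computes 4 key/value projection pairs. The uncached loop computes 1 + 2 + 3 + 4 = 10. Both produce the same attention outputs, and the gap grows quadratically with sequence length.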
The thing about doing the unglamorous half well is that everyone quietly ends up depending on you. vLLM is used in production by Meta, Google, Character.ai, Amazon's cloud, and the Amazon Shopping app, among thousands of others. Which raises the obvious question that eventually confronts every wildly successful open-source project: how do you turn a thing people love precisely because it is free into a business without ruining the thing they love?
"Open source, especially how vLLM itself is structured, is critical to the AI infrastructure in the world."
Inferact's answer is deliberately counterintuitive: keep vLLM open, pour paid engineering resources back into it, and build a commercial product - a next-generation "universal inference layer" - on the same foundation. Optimizations built on the company's dime flow back to the community. The pitch to the 50-plus core developers, who could work anywhere, is that here they get paid to keep doing exactly what they already care about.
It is also a smart read on where the hardware is going. As CEO Simon Mo frames it, the enormous GPU clusters being built today for training don't stay training clusters for long - within months they get repurposed to serve models. Inference, in other words, eventually eats everything.
Make AI inference cheaper and faster by growing vLLM as the default engine for serving models - and accelerate AI progress in the process.
Any model, any chip, run efficiently - with open-source vLLM at the center, and Inferact collaborating with existing providers rather than competing with them.
All four co-founders are founding maintainers of vLLM. CEO Simon Mo was a Berkeley doctoral student before leading the company; co-founder Kaichao You is a Tsinghua University special-award winner. The team numbers roughly 22 people.
The open-source LLM inference and serving engine created by the founders. It manages continuous batching, KV caching, and low-level chip optimization so diverse models run efficiently across varied hardware. Runs on 400,000+ GPUs concurrently worldwide, with 2,000+ contributors and 50+ core developers.
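Continuous batching can be illustrated with a small simulation. It assumes each request's output length is known in advance, which is a simplification for illustration only; a real server discovers it as tokens are generated. The scheduler below is hypothetical and is not vLLM's. It refills free batch slots before every decoding step. Static batching, by contrast, holds a fixed batch together until its longest request finishes.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    rid: str
    tokens_to_generate: int  # hypothetical: known output length, for simulation only
    generated: int = 0

    @property
    def done(self) -> bool:
        return self.generated >= self.tokens_to_generate


def continuous_batching(requests: List[Request], max_batch: int) -> int:
    """Return decode steps needed when free slots are refilled before every step."""
    waiting: Deque[Request] = deque(requests)
    running: List[Request] = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One forward pass produces one token for every running request.
        for req in running:
            req.generated += 1
        steps += 1
        # Finished requests leave immediately, freeing their slots.
        running = [r for r in running if not r.done]
    return steps


def static_batching(lengths: List[int], max_batch: int) -> int:
    """Return decode steps needed when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i : i + max_batch])
    return steps
```

Consider output lengths [10, 2, 2, 2] with a batch size of 2. Static batching needs 10 + 2 = 12 steps. The continuous scheduler finishes in 10, because the short requests cycle through the slot the long request is not using.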
A planned next-generation commercial engine built on the vLLM foundation - designed to serve any model on any hardware at lower cost, while collaborating with existing inference providers rather than replacing them.
The round was co-led by Andreessen Horowitz and Lightspeed Venture Partners - two firms that don't often share a table - with Sequoia Capital, Altimeter Capital, Redpoint Ventures, and ZhenFund joining. a16z's stated reasoning: inference demand grows super-linearly as AI agents take on longer tasks and generate more tokens.