The startup that thinks your AI should run on the hardware you already own - not somebody else's cloud.
The company mark, photographed straight: a plain blue square on white. There is no mascot, no swoosh, no clever negative-space trick hiding a letter. For a company whose whole pitch is "stop paying for things you do not need," the restraint reads as the point.
There is a quiet, slightly heretical idea at the center of OpenInfer, and it is this: a great deal of the money that companies now send to cloud providers to run artificial intelligence is, technically speaking, optional. Not all of it. But a lot of it. OpenInfer's founders looked at the CPUs, GPUs and NPUs already sitting inside corporate data centers - many of them idle, most of them paid for - and asked why the industry decided that "running AI" and "renting someone else's computer" had to be the same sentence.
*On distilled DeepSeek models, per the company and its investors. †Market estimate cited by MFV Partners. Numbers here are self-reported or from backers; treat them as claims, not audited fact.
OpenInfer describes itself as the "Inference OS for the agentic era," a tagline that does a lot of work. The short version: inference is the part of AI where a trained model actually answers your question, and it is increasingly the expensive part. Training a model is a one-time capital event. Inference is a bill that arrives every single time someone uses the thing, forever.
Agents made that bill worse. A single chat message is cheap. An AI agent that plans, loops, calls tools and checks its own work can demand five to fifteen times the compute of an ordinary chat turn. OpenInfer's argument is that most of those agent tasks - roughly ninety percent, by its framing - are routine, latency-tolerant and boring, and yet they are being run on the most expensive silicon money can rent.
The OpenInfer Engine is the piece that fixes the mismatch. It takes an AI workload, automatically breaks it apart, and routes each piece to whatever hardware handles it best, whether that is a CPU under a desk or a mixed rack of GPUs in a private data center. Critically, it presents itself as a drop-in replacement: it speaks LangChain, Ollama and vLLM without code changes, so an engineer can point an existing stack at it and, in theory, notice nothing except a smaller bill.
On top of the engine sits an orchestration layer the company calls Loom, and a closed-loop router called Weave that schedules requests across whatever heterogeneous compute a company owns. Weave, per OpenInfer, gets better the more inference flows through it - it learns the shape of your workloads and tunes itself. The marketing sentence for all of this is admirably blunt: "No rewrites. No new hardware. No cloud dependency."
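For readers who think in code, the closed-loop idea can be sketched in a few lines of Python. This is not OpenInfer's implementation, which is not described in any detail by the sources cited here; it is a toy under stated assumptions. The `Backend` class, the `route` function, the pool names and every cost and latency figure are invented for illustration. The sketch sends each request to the cheapest pool whose measured latency fits the request's budget, then feeds each observed latency back into the estimate, which is the "loop" in closed-loop.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    """One pool of hardware. Names and numbers are hypothetical."""
    name: str
    cost_per_request: float        # made-up relative cost
    latency_estimate: float = 1.0  # running estimate, in seconds

    def observe(self, latency: float, alpha: float = 0.2) -> None:
        # Feedback step: fold the measured latency into the running
        # estimate with an exponential moving average.
        self.latency_estimate = (
            (1 - alpha) * self.latency_estimate + alpha * latency
        )


def route(backends: List[Backend], latency_budget: float) -> Backend:
    """Pick the cheapest backend expected to meet the latency budget.

    If no backend is expected to meet it, fall back to the fastest one.
    """
    eligible = [b for b in backends if b.latency_estimate <= latency_budget]
    if eligible:
        return min(eligible, key=lambda b: b.cost_per_request)
    return min(backends, key=lambda b: b.latency_estimate)


# Hypothetical fleet: a cheap, slow CPU pool and a costly, fast GPU rack.
cpu = Backend("cpu-pool", cost_per_request=1.0, latency_estimate=2.0)
gpu = Backend("gpu-rack", cost_per_request=8.0, latency_estimate=0.3)
fleet = [cpu, gpu]

print(route(fleet, latency_budget=5.0).name)  # latency-tolerant task
print(route(fleet, latency_budget=0.5).name)  # latency-sensitive task

# The CPU pool degrades; repeated observations shift the routing decision.
for _ in range(20):
    cpu.observe(10.0)
print(route(fleet, latency_budget=5.0).name)
```

The design choice worth noticing is that the router never needs to be told a pool has slowed down. It finds out from the traffic, which is roughly what "gets better the more inference flows through it" would have to mean in practice.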
Run AI agents on the CPUs, GPUs, and NPUs you already own - at the best cost and speed, with no cloud lock-in.
Every writeup of OpenInfer eventually arrives at the same statistic: the first preview build of its engine ran distilled DeepSeek models two to three times faster than Llama.cpp and Ollama, the two most common ways people run models locally. The company credits a stack of unglamorous optimizations - smarter handling of quantized values, better memory-access caching, and model-specific tuning. Here is the claim, drawn as a bar chart, with the honest caveat that OpenInfer drew the underlying numbers.
Relative throughput on distilled DeepSeek models, illustrative. Source: OpenInfer and its investors. Independent benchmarks were not available at time of writing - so read the bars as the company's homework, not a referee's.
The useful thing to know about OpenInfer's founders is where they come from, because it explains the whole company. Behnam Bastani and Reza Nourai spent nearly ten years building AI systems together across Meta's Reality Labs and Roblox - domains where the computer is a headset or a game console, the memory budget is cruel, and "just add more GPUs" is not an option. Constrained-compute discipline is not a marketing angle for them. It is a professional habit.
More than twenty years working on AI across constrained-compute platforms. Former Director of Architecture at Meta's Reality Labs; led mobile rendering, VR and display teams at Google; served as a senior engineering director for AI at Roblox. Has shipped AI engines at Meta, Google and Roblox.
Two decades in GPU and large-scale memory architecture, with breakthroughs at Meta, Microsoft, Roblox and Magic Leap. The graphics-and-gaming background is the tell: he has spent a career squeezing performance out of hardware that refuses to cooperate.
A third name rounds out the front office: Kam Eshghi, Chief Business Officer, who co-founded Lightbits Labs and helped push NVMe/TCP storage into data centers. He is the one whose email address you will find on the company's contact page.
Making sovereign inference inevitable.
"Sovereign" is the word OpenInfer keeps returning to, and it is doing a specific job. The company is for engineers who believe, in its words, that AI infrastructure should be "sovereign, efficient, and built to outlast any single provider's roadmap." Translated out of manifesto: if your entire AI operation depends on one cloud vendor not changing its prices, its terms, or its mind, you do not have infrastructure. You have a landlord. When Anthropic tightened certain agentic usage limits, OpenInfer promptly published a piece using it as Exhibit A for exactly this risk.
The flagship. Disaggregates AI workloads and routes them to optimal hardware, works as a drop-in replacement for existing endpoints, and speaks LangChain, Ollama and vLLM with no code changes. The 2-3x speed claim lives here.
A private, email-native agentic AI system that runs entirely on your own infrastructure - no cloud costs, no data exposure. Built on persistent contextual memory. AskJean.ai is the public way to try it.
The scheduling layer for enterprises running big models across fragmented compute. Weave routes and schedules every request in a closed loop, learning workload patterns and improving as more inference passes through it.
A zero-rewrite inference API that slots into an existing stack as a drop-in replacement for current inference endpoints. The whole design philosophy is "change nothing above the line."
OpenInfer is a B2B company, and its natural buyer is an engineering team running agentic AI workloads that wants two things at once: to stop sending everything to a hyperscaler, and to actually use the CPUs, GPUs and NPUs it has already bought. For those teams the promise is concrete - point your existing LangChain or vLLM stack at OpenInfer, keep your code, and run the routine ninety percent of your agent traffic on leaner, cheaper hardware while reserving the expensive topology for the tasks that genuinely need it.
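How much that split saves depends on a number nobody quoted in this article has published: what the cheaper hardware costs per task relative to the expensive kind. A minimal Python sketch, assuming a made-up price ratio, shows the shape of the arithmetic. The function name `blended_cost` and both price constants are hypothetical; only the ninety percent share comes from OpenInfer's framing.

```python
def blended_cost(routine_share: float, cheap_cost: float, premium_cost: float) -> float:
    """Average cost per task when `routine_share` of tasks run on cheap hardware."""
    if not 0.0 <= routine_share <= 1.0:
        raise ValueError("routine_share must be between 0 and 1")
    return routine_share * cheap_cost + (1.0 - routine_share) * premium_cost


ROUTINE_SHARE = 0.90  # OpenInfer's "roughly ninety percent" framing
PREMIUM_COST = 1.00   # hypothetical: one unit per task on top-tier hardware
CHEAP_COST = 0.25     # hypothetical: invented for this example, not a quoted price

baseline = blended_cost(0.0, CHEAP_COST, PREMIUM_COST)          # everything on premium
split = blended_cost(ROUTINE_SHARE, CHEAP_COST, PREMIUM_COST)   # the 90/10 split
savings = 1.0 - split / baseline

print(f"baseline cost per task: {baseline:.3f}")
print(f"blended cost per task:  {split:.3f}")
print(f"savings vs baseline:    {savings:.1%}")
```

With these invented prices the blended figure lands at roughly a third of the all-premium baseline. Change `CHEAP_COST` and the answer moves with it, which is the point: the ninety percent is the company's claim, and the price gap is yours to measure.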
There is a second, quieter audience: developers and privacy-sensitive organizations who simply cannot or will not put their data in the cloud. Jean is aimed squarely at them - an agent that reads its context, remembers your history, and never leaves your infrastructure. In regulated corners like defense, finance and healthcare, "runs entirely on your own hardware" is not a performance feature. It is a permission slip.
In February 2025 OpenInfer closed an $8 million seed round, led by Cota Capital and Essence VC. The institutional list is long, but the part that made people look twice was the angels: Google DeepMind's chief scientist Jeff Dean, Oculus co-founder and former CEO Brendan Iribe, and Microsoft's chief product officer for Experiences and Devices, Aparna Chennapragada. When the people who helped build modern AI and modern VR write personal checks into an edge-inference startup, it is at least worth noting.
The bet, per lead-adjacent investor MFV Partners: inference hardware is growing at a ~48% CAGR through 2032, faster than training, and "true AI adoption will happen at the edge." OpenInfer is a wager on that sentence being right.
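A growth rate like that compounds quickly, and it helps to see how quickly. The short Python sketch below turns a CAGR into a cumulative multiple. The seven-year window is an assumption made here for illustration: the estimate names its end year, 2032, but the base year is not given in the material quoted above. The function name `growth_multiple` is likewise invented.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Cumulative growth factor after compounding `cagr` once a year for `years` years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 + cagr) ** years


CAGR = 0.48  # the ~48% estimate cited by MFV Partners
YEARS = 7    # assumed window; the base year is not stated in the source

print(f"{growth_multiple(CAGR, YEARS):.1f}x over {YEARS} years")
```

Under that assumed window the market would end up more than fifteen times its starting size. A shorter or longer window changes the multiple considerably, so the figure is best read as a sense of scale and not as a forecast.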
Publishes early work on running large models within a fixed memory footprint and on memory-optimization gains - the seeds of the engine.
Announces the OpenInfer API, ships the first preview build of the OpenInfer Engine, and closes the $8M seed. VentureBeat covers the round.
Runs Llama 4 Scout locally in a client-side inference demo, making the "big model, your device" claim tangible.
Announces collaborations with Intel (Partner Alliance) and Microsoft (Pegasus Program).
Launches Jean, a sovereign agentic AI system; hires a revenue chief; ties agentic infrastructure inefficiency to Anthropic's Claude usage limits.
Argues "NVIDIA Dynamo Proved Inference Needs a Control Plane," positioning Weave/Loom as that control plane, and hosts a build-day hackathon.
OpenInfer sits in a crowded, fast-moving neighborhood. On the local-runtime side, its named comparisons are Ollama, Llama.cpp and vLLM - the tools it wants to replace or sit beneath. On the other flank are the hyperscalers themselves, AWS, Azure and Google Cloud, whose hosted inference is precisely the dependency OpenInfer is selling companies a way out of. And in the agentic-serving layer, NVIDIA's Dynamo and a wave of commercial vLLM-based offerings are circling the same "inference needs a control plane" idea. A sixteen-person company is picking a fight with all three groups at once. Whether that is ambition or overreach is, for now, an open question.
Video note: OpenInfer has shipped concept demos - Mementos and its client-side Llama 4 Scout run - via its news page. Direct YouTube and product-demo video links were not publicly confirmed at time of writing; the news page is the reliable jumping-off point.