OpenInfer: The Startup That Wants to Run Your AI on Hardware You Already Own

The Pitch

There is a quiet, slightly heretical idea at the center of OpenInfer, and it is this: a great deal of the money that companies now send to cloud providers to run artificial intelligence is, technically speaking, optional. Not all of it. But a lot of it. OpenInfer's founders looked at the CPUs, GPUs and NPUs already sitting inside corporate data centers - much of it idle, most of it paid for - and asked why the industry decided that "running AI" and "renting someone else's computer" had to be the same sentence.

$8M: Seed Raised · Feb 2025
2-3x: Faster vs. Ollama*
~16: Employees
$400B: Inference Market by 2030†

*On distilled DeepSeek models, per the company and its investors. †Market estimate cited by MFV Partners. Numbers here are self-reported or from backers; treat them as claims, not audited fact.

What It Actually Does

An operating system for inference, which is a phrase that means more than it sounds

OpenInfer describes itself as the "Inference OS for the agentic era," a tagline that does a lot of work. The short version: inference is the part of AI where a trained model actually answers your question, and it is increasingly the expensive part. Training a model is a one-time capital event. Inference is a bill that arrives every single time someone uses the thing, forever.

Agents made that bill worse. A single chat message is cheap. An AI agent that plans, loops, calls tools and checks its own work can demand five to fifteen times the compute of an ordinary chat turn. OpenInfer's argument is that most of those agent tasks - roughly ninety percent, by their framing - are routine, latency-tolerant and boring, and yet they are being run on the most expensive silicon money can rent.
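To put rough numbers on that paragraph, here is a minimal sketch in Python. Only the five-to-fifteen-times multiplier and the roughly-ninety-percent routine share come from the text above. The unit cost of a chat turn and the price ratio of the cheaper hardware are invented placeholders, not OpenInfer figures.

```python
def agent_cost(chat_turn_cost: float, multiplier: float) -> float:
    """Cost of one agent task, expressed as a multiple of one chat turn."""
    return chat_turn_cost * multiplier


def blended_cost(task_cost: float, routine_share: float, cheap_hardware_ratio: float) -> float:
    """Average cost per task when the routine share runs on cheaper hardware.

    cheap_hardware_ratio is the price of the cheaper hardware relative to the
    expensive hardware (0.25 means a quarter of the price). It is a made-up input.
    """
    routine = routine_share * task_cost * cheap_hardware_ratio
    demanding = (1.0 - routine_share) * task_cost
    return routine + demanding


# Hypothetical inputs: one chat turn costs 1 unit; cheaper hardware costs 25% as much.
CHAT_TURN = 1.0
low, high = agent_cost(CHAT_TURN, 5), agent_cost(CHAT_TURN, 15)
mixed = blended_cost(high, routine_share=0.9, cheap_hardware_ratio=0.25)

print(f"agent task: {low:.1f} to {high:.1f} chat turns of compute")
print(f"blended cost at the high end: {mixed:.2f} instead of {high:.2f}")
```

The saving depends entirely on the assumed price ratio, which is exactly the number a buyer would want measured on their own hardware.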

The OpenInfer Engine is the piece that fixes the mismatch. It takes an AI workload, automatically breaks it apart, and routes each piece to whatever hardware handles it best, whether that is a CPU under a desk or a mixed rack of GPUs in a private data center. Critically, it presents itself as a drop-in replacement: it speaks LangChain, Ollama and vLLM without code changes, so an engineer can point an existing stack at it and, in theory, notice nothing except a smaller bill.
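What "drop-in" would look like in practice can be sketched with Ollama's documented local API, which accepts a POST to /api/generate on port 11434. The sketch below builds the same request twice and changes only the host. The OpenInfer address is a hypothetical placeholder, and whether OpenInfer serves this exact path is an assumption; the source says only that it "speaks" Ollama. The model tag is just an example. Nothing is sent over the network.

```python
import json
import urllib.request

# Ollama's documented default local address.
OLLAMA_BASE_URL = "http://localhost:11434"
# Hypothetical address for an OpenInfer deployment; not taken from OpenInfer docs.
OPENINFER_BASE_URL = "http://openinfer.internal.example:11434"


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an Ollama-style /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


before = build_generate_request(OLLAMA_BASE_URL, "deepseek-r1:7b", "Summarize this ticket.")
after = build_generate_request(OPENINFER_BASE_URL, "deepseek-r1:7b", "Summarize this ticket.")

# Same payload, different host: that is the whole "drop-in" idea.
print(before.full_url)
print(after.full_url)
print(before.data == after.data)
```

If the claim holds, the base URL is the only line an engineer would touch.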

On top of the engine sits an orchestration layer the company calls Loom, and a closed-loop router called Weave that schedules requests across whatever heterogeneous compute a company owns. Weave, per OpenInfer, gets better the more inference flows through it - it learns the shape of your workloads and tunes itself. The marketing sentence for all of this is admirably blunt: "No rewrites. No new hardware. No cloud dependency."
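A "closed-loop" router is easier to picture with a toy. The sketch below is not Weave and makes no claim about how Weave works. It is a generic feedback scheduler, with invented device names and latencies, that sends each request to the device with the lowest estimated latency and then folds the observed latency back into that estimate using an exponential moving average.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    est_latency_ms: float  # current latency estimate, refined by feedback


@dataclass
class ClosedLoopRouter:
    devices: list[Device]
    alpha: float = 0.3  # weight given to the newest observation

    def pick(self) -> Device:
        """Choose the device with the lowest current latency estimate."""
        return min(self.devices, key=lambda d: d.est_latency_ms)

    def report(self, device: Device, observed_ms: float) -> None:
        """Close the loop: blend the observed latency into the estimate."""
        device.est_latency_ms = (
            self.alpha * observed_ms + (1 - self.alpha) * device.est_latency_ms
        )


# Hypothetical fleet: names and starting estimates are made up for illustration.
router = ClosedLoopRouter([Device("cpu-pool", 120.0), Device("gpu-rack", 80.0)])

first = router.pick()         # gpu-rack has the lowest estimate at the start
router.report(first, 300.0)   # pretend it turned out to be congested
second = router.pick()        # the updated estimate now favors cpu-pool
print(first.name, "->", second.name)
```

A purely greedy picker like this never revisits a device it has written off, so a production scheduler would need some form of exploration or estimate decay. The toy only shows the feedback step that makes a router "closed-loop."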

Run AI agents on the CPUs, GPUs, and NPUs you already own - at the best cost and speed, with no cloud lock-in.

— OpenInfer's own description of the pitch

The Number Everyone Repeats

2-3x, with an asterisk

Every writeup of OpenInfer eventually arrives at the same statistic: the first preview build of its engine ran distilled DeepSeek models two to three times faster than Llama.cpp and Ollama, the two most common ways people run models locally. The company credits a stack of unglamorous optimizations - smarter handling of quantized values, better memory-access caching, and model-specific tuning. Here is the claim, drawn as a bar chart, with the honest caveat that OpenInfer drew the underlying numbers.

[Bar chart: OpenInfer ~2.5x speed; Ollama baseline 1x; Llama.cpp baseline 1x]

Relative throughput on distilled DeepSeek models, illustrative. Source: OpenInfer and its investors. Independent benchmarks were not available at time of writing - so read the bars as the company's homework, not a referee's.

The People

Two systems engineers who spent a decade making AI fit in small places

The useful thing to know about OpenInfer's founders is where they come from, because it explains the whole company. Behnam Bastani and Reza Nourai spent nearly ten years building AI systems together across Meta's Reality Labs and Roblox - domains where the computer is a headset or a game console, the memory budget is cruel, and "just add more GPUs" is not an option. Constrained-compute discipline is not a marketing angle for them. It is a professional habit.

Behnam Bastani

Co-Founder & CEO

More than twenty years working on AI across constrained-compute platforms. Formerly Director of Architecture at Meta's Reality Labs; he also led mobile rendering, VR and display teams at Google and served as a senior engineering director for AI at Roblox. Has shipped AI engines at Meta, Google and Roblox.

Reza Nourai

Co-Founder & CTO

Two decades in GPU and large-scale memory architecture, with breakthroughs at Meta, Microsoft, Roblox and Magic Leap. The graphics-and-gaming background is the tell: he has spent a career squeezing performance out of hardware that refuses to cooperate.

A third name rounds out the front office: Kam Eshghi, Chief Business Officer, who co-founded Lightbits Labs and helped push NVMe/TCP storage into data centers. He is the one whose email address you will find on the company's contact page.

The Belief System

Making sovereign inference inevitable.

— OpenInfer's framing of its own mission

"Sovereign" is the word OpenInfer keeps returning to, and it is doing a specific job. The company is for engineers who believe, in its words, that AI infrastructure should be "sovereign, efficient, and built to outlast any single provider's roadmap." Translated out of manifesto: if your entire AI operation depends on one cloud vendor not changing its prices, its terms, or its mind, you do not have infrastructure. You have a landlord. When Anthropic tightened certain agentic usage limits, OpenInfer promptly published a piece using it as Exhibit A for exactly this risk.

The Products

An engine, an agent, and the plumbing between them

Core

OpenInfer Engine

The flagship. Disaggregates AI workloads and routes them to optimal hardware, works as a drop-in replacement for existing endpoints, and speaks LangChain, Ollama and vLLM with no code changes. The 2-3x speed claim lives here.

Agent

Jean · AskJean.ai

A private, email-native agentic AI system that runs entirely on your own infrastructure - no cloud costs, no data exposure. Built on persistent contextual memory. AskJean.ai is the public way to try it.

Orchestration

Loom & Weave

The scheduling layer for enterprises running big models across fragmented compute. Weave routes and schedules every request in a closed loop, learning workload patterns and improving as more inference passes through it.

Integration

OpenInfer API

A zero-rewrite inference API that slots into an existing stack as a drop-in replacement for current inference endpoints. The whole design philosophy is "change nothing above the line."

Who It's For

If you run agents at scale and hate your cloud bill, you are the customer

OpenInfer is a B2B company, and its natural buyer is an engineering team running agentic AI workloads that wants two things at once: to stop sending everything to a hyperscaler, and to actually use the CPUs, GPUs and NPUs it has already bought. For those teams the promise is concrete - point your existing LangChain or vLLM stack at OpenInfer, keep your code, and run the routine ninety percent of your agent traffic on leaner, cheaper hardware while reserving the expensive hardware for the tasks that genuinely need it.

There is a second, quieter audience: developers and privacy-sensitive organizations who simply cannot or will not put their data in the cloud. Jean is aimed squarely at them - an agent that reads its context, remembers your history, and never leaves your infrastructure. In regulated corners like defense, finance and healthcare, "runs entirely on your own hardware" is not a performance feature. It is a permission slip.

The Money

An over-subscribed $8M seed with a notable guest list

In February 2025 OpenInfer closed an $8 million seed round, led by Cota Capital and Essence VC. The institutional list is long, but the part that made people look twice was the angels: Google DeepMind's chief scientist Jeff Dean, Oculus co-founder and former CEO Brendan Iribe, and Microsoft's chief product officer for Experiences and Devices, Aparna Chennapragada. When the people who helped build modern AI and modern VR write personal checks into an edge-inference startup, it is at least worth noting.

Cota Capital · lead
Essence VC · lead
B5 Capital
Brave Capital
Future Fund
Machine Ventures
Pretiosum
SilverCircle
StemAI
Tau Ventures
YG Ventures
Jeff Dean · angel
Brendan Iribe · angel
Aparna Chennapragada · angel
Gokul Rajaram · angel

The bet, per lead-adjacent investor MFV Partners: inference hardware is growing at a ~48% CAGR through 2032, faster than training, and "true AI adoption will happen at the edge." OpenInfer is a wager on that sentence being right.
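For a sense of what a ~48% CAGR compounds to, here is a small worked calculation. The rate is the one quoted above. The seven-year window (2025 through 2032) is an assumption, since the source does not say which base year the estimate starts from.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor after compounding `cagr` annually for `years` years."""
    return (1.0 + cagr) ** years


# 48% is the quoted rate; the 7-year span is an assumed window, not from the source.
multiple = growth_multiple(0.48, 7)
print(f"{multiple:.1f}x")
```

Under those assumptions the market would end the period at roughly fifteen to sixteen times its starting size, which is why a shift of a year or two in the base changes the headline number considerably.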

The Story So Far

From preview build to control plane

Jan 2025

Publishes early work on running large models within a fixed memory footprint and on memory-optimization gains - the seeds of the engine.

Feb 2025

Announces the OpenInfer API, ships the first preview build of the OpenInfer Engine, and closes the $8M seed. VentureBeat covers the round.

Apr 2025

Runs Llama 4 Scout locally in a client-side inference demo, making the "big model, your device" claim tangible.

Oct 2025

Announces collaborations with Intel (Partner Alliance) and Microsoft (Pegasus Program).

Apr 2026

Launches Jean, a sovereign agentic AI system; hires a revenue chief; ties agentic infrastructure inefficiency to Anthropic's Claude usage limits.

Jun 2026

Argues "NVIDIA Dynamo Proved Inference Needs a Control Plane," positioning Weave/Loom as that control plane, and hosts a build-day hackathon.

Marginalia

Things that amuse, in no particular order

The founders met and built together not in a lab but inside Meta's Reality Labs and Roblox - the AI they optimized had to fit in headsets and game engines first.
The company's agent is named Jean, and it is designed to work over email, because that is how people already work.
CBO Kam Eshghi's last company, Lightbits Labs, pioneered NVMe/TCP - so the front office has already done the "convince data centers to adopt new plumbing" tour once before.
OpenInfer's mission statement is not a paragraph. It is four words: "making sovereign inference inevitable."
Its logo is a plain square. For a company selling restraint, the branding is on-message.

The Alternatives

Who else is in the ring

OpenInfer sits in a crowded, fast-moving neighborhood. On the local-runtime side, its named comparisons are Ollama, Llama.cpp and vLLM - the tools it wants to replace or sit beneath. On the other flank are the hyperscalers themselves, AWS, Azure and Google Cloud, whose hosted inference is precisely the dependency OpenInfer is selling companies a way out of. And in the agentic-serving layer, NVIDIA's Dynamo and a wave of commercial vLLM-based offerings are circling the same "inference needs a control plane" idea. A sixteen-person company is picking a fight with all three groups at once. Whether that is ambition or overreach is, for now, an open question.

Go Deeper

Links, sources & further reading

Website · openinfer.io
LinkedIn
About & Team
News & Blog
Try Jean · AskJean.ai
Solutions
VentureBeat: $8M Round
MFV: Why We Invested
Crunchbase Profile

Video note: OpenInfer has shipped concept demos - Mementos and its client-side Llama 4 Scout run - via its news page. Direct YouTube and product-demo video links were not publicly confirmed at time of writing; the news page is the reliable jumping-off point.