The frontier AI inference cloud, built by the researchers who invented continuous batching. Serving open-weight and custom generative models the way they were meant to be served.
It is a Tuesday afternoon in the Crown Point Press building. Below the high windows that once lit Wayne Thiebaud's etchings, a different kind of plate is being inked. Engineers from FriendliAI - many of them flown in from Seoul - are watching a dashboard tick upward. Tokens per second. GPU utilization. A customer in Tokyo just shipped a new LLM. The room does not cheer. It just keeps reading the numbers.
FriendliAI engineers reviewing inference metrics, San Francisco office, 2026. The mood: calm, the way a pit crew is calm.
Most of the AI conversation in 2026 is about who is training the next foundation model. FriendliAI sits one layer down, where the metaphor changes. Training is the laboratory. Inference is the factory. Once a model exists, the question is not who built it but who can run it: in production, at scale, without burning down the GPU budget. That is the room FriendliAI lives in.
The company calls itself the Frontier AI Inference Cloud. The phrase does some work. Frontier signals support for state-of-the-art open-weight models - the latest Llamas, Qwens, Mistrals, multimodal entries - alongside whatever a customer has fine-tuned in private. Cloud signals that you do not have to manage any of it. You point Friendli at a model on the Hugging Face Hub, and a few seconds later you have an endpoint.
Under the hood is PeriFlow, the proprietary inference engine FriendliAI has been refining since the company was founded. PeriFlow is where the magic actually happens: scheduling tricks, GPU kernels, native quantization, speculative decoding, multi-LoRA serving, all aimed at one number - tokens per dollar.
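Tokens per dollar is plain arithmetic: tokens generated per second, times 3,600 seconds, divided by what the GPU costs per hour. A minimal sketch follows. The function name, the throughput and the hourly price are all illustrative assumptions, not FriendliAI figures.

```python
def tokens_per_dollar(tokens_per_second: float, gpu_cost_per_hour: float) -> float:
    """Tokens produced for each dollar of GPU time (hypothetical helper)."""
    if gpu_cost_per_hour <= 0:
        raise ValueError("gpu_cost_per_hour must be positive")
    return tokens_per_second * 3600 / gpu_cost_per_hour


# Assumed numbers, for illustration only: one GPU at $4.00 per hour.
baseline = tokens_per_dollar(2000.0, 4.0)   # 1,800,000 tokens per dollar
optimized = tokens_per_dollar(4000.0, 4.0)  # same GPU, twice the throughput
```

Doubling throughput on the same hardware doubles tokens per dollar. That is why every optimization in the list above is ultimately scored against this one ratio.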
FriendliAI was founded in 2021 by Byung-Gon Chun, a computer science professor at Seoul National University, along with members of his research group. The team had already done the kind of academic work that, in hindsight, ends up everywhere. Specifically: continuous batching, also called iteration-level scheduling. It is the technique that lets an inference server slot new requests into a GPU's in-flight batch after every decoding step, instead of holding them back until the slowest request in the current batch finishes. If you have used a hosted LLM in the last two years, you have almost certainly used continuous batching, even if no one told you.
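The difference is easiest to see in a toy simulation. The sketch below is not FriendliAI's engine. It assumes every request emits one token per decoding step, ignores prefill and memory limits, and treats the batch as a fixed number of slots. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Request-level scheduling: each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        steps += max(batch)  # short requests sit idle until the longest finishes
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Iteration-level scheduling: free slots are refilled before every step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any slots freed by the previous step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decoding step produces one token per running request
        # Requests that just produced their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Two long requests mixed in with six short ones, four slots on the GPU.
output_lengths = [16, 2, 2, 2, 16, 2, 2, 2]
static_steps = static_batching_steps(output_lengths, 4)          # 32 steps
continuous_steps = continuous_batching_steps(output_lengths, 4)  # 18 steps
```

In this toy case the same eight requests finish in 18 decoding steps instead of 32, because the short requests no longer wait behind the long ones. Real servers add complications this sketch leaves out, but the scheduling idea is the same.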
That is the polite way to introduce FriendliAI. Less politely: the people running your inference are the people who figured out how to run inference.
Inference performance drives profitability. - FriendliAI, company mantra
FriendliAI sells the same core technology in three shapes. Friendli Serverless Endpoints are the front door - pay per token, popular open-weight models, no infrastructure to think about. Friendli Dedicated Endpoints are for teams who need their own model in their own autoscaling endpoint, often deployed straight from a Hugging Face repo. Friendli Container is for the security-conscious: drop the engine into your own VPC or on-prem cluster, keep your data behind your firewall, still get the speed.
The pitch is the same in all three rooms. You can pick whichever flavor of model you want - open, closed, multimodal, fine-tuned, quantized - and Friendli will make it fast enough and cheap enough to actually use. The platform supports structured outputs, tool calls, multi-LoRA adapters, and the kind of debugging hooks that turn an inference endpoint from a black box into something an engineering team can reason about.
The Korean conglomerate uses FriendliAI to scale generative AI products in production.
Video-understanding pioneer serving multimodal models through Friendli.
One-click deploy from the Hub to a Friendli endpoint - the path of least resistance from research to production.
Featured PeriFlow in its startup program as the cost-efficient path to generative AI training and serving.
Custom kernels across the latest NVIDIA accelerators, tuned where the per-token economics actually live.
Type II audited and HIPAA-certified, which is the small print that lets enterprise legal teams sleep.
In August 2025, FriendliAI closed a $20M seed extension led by Capstone Partners, with Sierra Ventures, Alumni Ventures, Korea Development Bank and KB Securities along for the ride. It brought the total raised to roughly $26.7M - modest compared to the nine-figure rounds being thrown around in the rest of the inference space. That, oddly enough, may be the point. FriendliAI's pitch to investors is not "give us a billion dollars and we will buy GPUs." It is "we already know how to get more out of the GPUs you have."
The capital is being spent on three things: go-to-market expansion across North America and Asia, more research and engineering against the PeriFlow stack, and additional GPU capacity to keep up with the customers who keep showing up.
The Frontier AI Inference Cloud - built by the people who invented continuous batching. - FriendliAI tagline, 2026
Byung-Gon Chun, the founder and CEO, is a working researcher who never quite stopped publishing. The other half of the executive story arrived in 2026, when FriendliAI named Brian Yoo - formerly Chief Operating Officer at Moloco - as its Chief Business Officer. Yoo's last act took Moloco from a ten-person startup to a global organization with revenue above $250M. He landed at FriendliAI to do something similar with the commercial side of an inference cloud that the engineers had quietly turned into a serious business.
The team is about fifty people, split between San Francisco and Seoul, with research talent on one continent and customer-facing operations being built on the other.
Byung-Gon Chun spins FriendliAI out of his Seoul National University research group.
The inference engine that started in a research lab becomes a product. Headquarters relocates to Redwood City, California.
Capstone Partners leads. Sierra Ventures, Alumni Ventures, KDB and KB Securities follow.
Moloco's former COO takes over commercial operations.
Seven thousand square feet at 20 Hawthorne Street, inside the historic Crown Point Press building.
The continuous batching paper that anchors so much of modern LLM serving did not come from Google or OpenAI. It came from an academic team in Seoul, and most of that team is still inside FriendliAI. The company's San Francisco office sits inside a building that used to print Sol LeWitt etchings - a useful reminder that the Bay Area's tech industry rents space from somebody's older idea of what a city should be. And the claim FriendliAI is most willing to put in writing is also the simplest: up to ninety percent lower GPU cost, with no measurable loss in accuracy.
Back in the Crown Point Press building, the dashboard keeps ticking. The Tokyo customer is now serving a few thousand requests a second. Somewhere in Seoul, an engineer pushes a config change and watches latency drop another fifteen milliseconds. The room still does not cheer. The numbers are the cheer. The walls that once held proof prints by Thiebaud now hold a quieter craft - the slow, exact work of making other people's models run.