The GPU Whisperer - Seoul → Berkeley → San Francisco • Inference at Scale
He wrote the paper that taught the world how to run AI at scale. Now he's building the company that makes it pay.
In 2022, a paper landed at OSDI - one of computing's most selective systems conferences. It was called ORCA: A Distributed Serving System for Transformer-Based Generative Models. It introduced a technique called continuous batching. Within two years, every major LLM inference engine on the planet - vLLM, TensorRT-LLM, Hugging Face Text Generation Inference - had adopted some version of it. The paper came from the Seoul National University lab of a professor named Byung-Gon Chun. Most people in the industry know the technique. Far fewer know the name behind it.
Call him Gon. That's what colleagues do. The full name - Byung-Gon - carries the weight of a career built across four institutions and three continents before FriendliAI ever existed. Berkeley for the PhD in 2007. Intel Research Berkeley for the postdoc years. Yahoo Research and Microsoft Silicon Valley for the industry stints. A visiting post at Facebook in Menlo Park in 2016. Then back to Korea, to Seoul National University, to run a lab and teach operating systems. And then, from that lab, the paper that changed everything.
The greatest challenge lies in scaling workloads from prototypes to production, where costs, latency, and GPU management complexity often stall deployment.
- Byung-Gon Chun, at The AI Conference

The insight behind continuous batching is deceptively simple. Traditional LLM serving treated a batch of requests as a single unit: the whole batch had to finish before new requests could be admitted, so GPUs sat partially idle whenever short requests completed early, wasting compute on a resource that costs a fortune per hour. Continuous batching schedules at the level of individual decoding iterations, slotting new requests into the batch mid-stream and filling those gaps dynamically. In some workloads, the throughput improvement exceeds 10x. It sounds obvious in retrospect. Most important ideas do.
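To make the mechanism concrete, here is a minimal toy sketch of iteration-level scheduling in Python - not FriendliAI's engine or the ORCA codebase, just the scheduling idea. The names (`Request`, `decode_step`, `serve`) and the single-token decode stand-in are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)  # tokens generated so far

    def finished(self) -> bool:
        # Real engines also stop on an end-of-sequence token.
        return len(self.tokens) >= self.max_new_tokens

def decode_step(batch):
    """Stand-in for one forward pass: emit one token for every active request."""
    for req in batch:
        req.tokens.append("<tok>")

def serve(waiting: deque, max_batch_size: int = 8):
    """Iteration-level scheduling: the batch is re-formed at every decoding step,
    so a newly arrived request never waits for the current batch to drain."""
    running = []
    while waiting or running:
        # Admit new requests into free batch slots between iterations,
        # instead of leaving them to wait until every running request finishes.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_step(running)                                   # one token per active request
        running = [r for r in running if not r.finished()]     # retire completed requests

# Toy usage: a short and a long request share the batch without blocking each other.
serve(deque([Request("short prompt", 4), Request("long prompt", 64)]))
```

The gap-filling is the whole trick: the moment a short request finishes, its batch slot goes to the next waiting request at the very next iteration instead of sitting empty until the longest request in the batch completes.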
What Chun recognized next was the gap between the idea and the reality. The concept was out there. Implementing it reliably at enterprise scale - with the scheduling, caching, fault tolerance, and monitoring that actual production demands - was a completely different problem. That gap is what FriendliAI was built to close.
FriendliAI was founded in 2021, still technically running out of SNU, with an initial $6M seed from Capstone Partners. The company's early product was called PeriFlow - a name that pointed at the fluid, continuous nature of the scheduling approach underneath. By 2023, the company had moved its headquarters to California (first Redwood City, then San Francisco), rebranded to FriendliAI, and was doing real business with real enterprise clients. The name shift was deliberate: friendli as in approachable, low-friction, enterprise-safe.
The traction followed the credibility. ScatterLab - makers of the Luda Lee 2.0 conversational AI that topped Korean app store rankings - used FriendliAI's platform to deploy a model 17 times larger at comparable speeds and costs. LG Electronics became a client. Upstage, one of Korea's most respected AI labs, signed on. By mid-2025, FriendliAI had 25 to 30 large enterprise customers and revenue on track for 6 to 7 times its 2024 level. In August 2025, the company closed a $20M seed extension led again by Capstone, with Sierra Ventures, Alumni Ventures, KDB, and KB Securities joining the round.
80% to 90% of GPUs are dedicated to inference, with only the remainder used for training.
- Byung-Gon Chun, CEO, FriendliAI

That statistic matters because it reframes where the real compute costs in AI actually sit. Training grabs the headlines - $100M+ runs for frontier models, data center buildouts measured in gigawatts. But inference is where the bills land, every day, at every scale. Every chatbot query, every coding assistant suggestion, every enterprise search result - that's inference. And Chun's bet is that the company with the best inference engine wins that market.
FriendliAI's current platform covers an almost absurd range of models - 550,000 text, vision, audio, and multimodal models deployable directly from Hugging Face in one click. The engine claims 2x or better inference speed versus standard deployments, with enterprise-grade compliance: SOC 2 Type II and HIPAA. The pitch to enterprises is not just speed - it's the removal of the operational complexity that keeps AI pilots from becoming production deployments. GPU provisioning, auto-scaling, failure recovery, performance monitoring - all managed.
The academic pedigree that grounds the whole operation is formidable. Chun's research during his SNU years collected awards from every major AI lab - Google, Microsoft, Amazon, Facebook. He received the Microsoft Research Faculty Fellowship in 2014, the first Asian researcher based in Asia to get it. The ACM SIGOPS Hall of Fame Award followed in 2020. The EuroSys Test of Time Award in 2021. Earlier, his collaboration with Microsoft produced REEF - the Retainable Evaluator Execution Framework - which the Apache Software Foundation elevated to Top Level Project status, a recognition typically reserved for mature, production-critical infrastructure.
The man collects milestones the way other academics collect citations. But what distinguishes Chun is the rare combination of credibility in both directions: rigorous enough for OSDI, practical enough for LG Electronics. A researcher who ships. A CEO who can still read a systems paper and immediately see where the math breaks down in production. FriendliAI is the institutional expression of that combination.
The ambition is stated plainly: make inference a commodity, the way compute and storage became commodities. Not a black box that requires a team of PhD engineers to operate, but a utility - reliable, measurable, cheap enough to make the GPU tax irrelevant. The $250B AI inference market that Sierra Ventures projects for 2030 is the backdrop. Chun's bet is that the company with the deepest technical foundation and the broadest model coverage wins the enterprise slice of it.
He remains an Associate Professor at Seoul National University - on leave, technically, though FriendliAI has long since consumed the foreground. The lab that produced continuous batching still exists. The insight that powered it is now running inside virtually every serious LLM deployment in the world. And the CEO of the company that turned it into a business still answers to Gon.
Inducted into computing's most prestigious systems research recognition for sustained contributions to operating and networked systems.
Awarded for research that proved its lasting significance a decade after publication - the conference's highest retrospective honor.
First Asian researcher based in Asia to receive this fellowship, awarded to faculty making breakthrough contributions to their field.
The Retainable Evaluator Execution Framework, co-developed with Microsoft, became an Apache Software Foundation Top Level Project - a mark of production-critical software maturity.
Grants from Google (2020), Amazon ML Research (2018), and Facebook Caffe2 (2017) - collectively validating the relevance of his systems work to industry production challenges.
Published the paper that introduced continuous batching to LLM inference, now standard in every major framework including vLLM, TensorRT-LLM, and Hugging Face TGI.
Every AI application eventually hits the same wall. The model works in the prototype. The demo runs fine on a single GPU. Then it goes to production, and suddenly there are 10,000 concurrent users, inference costs 3x the model cost, latency spikes unpredictably, and the engineering team is spending more time managing GPU clusters than building features. That wall is what FriendliAI was built to knock down.
The platform operates on three layers. The inference engine itself - built on the ORCA continuous batching foundation, extended with speculative decoding, custom GPU kernels, and smart caching. A deployment layer that handles GPU provisioning, auto-scaling, failure recovery, and load balancing without requiring the customer to touch a YAML file. And a model coverage layer that connects directly to Hugging Face, making 550,000+ models deployable in a single click.
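Speculative decoding, one of the engine-layer techniques named above, is easiest to picture as a draft-and-verify loop. The sketch below is a generic illustration of that loop, not FriendliAI's implementation; the `toy_draft` and `toy_target` stand-ins and the exact-match acceptance rule are simplifying assumptions (production systems accept or reject draft tokens based on the target model's probabilities).

```python
def speculative_decode(target_model, draft_model, prompt_tokens, k=4, max_new=8):
    """Draft-and-verify loop: a small draft model proposes k tokens cheaply,
    the large target model checks them all in one forward pass, and only the
    agreed-upon prefix (plus one target token) is kept."""
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new:
        draft = draft_model(tokens, k)         # k cheap proposal tokens
        target = target_model(tokens, draft)   # k + 1 target tokens from one expensive pass
        accepted = 0
        while accepted < k and draft[accepted] == target[accepted]:
            accepted += 1
        # Keep the verified prefix, then the target's token at the first mismatch
        # (or its bonus token if everything matched), so every expensive pass
        # still yields at least one new token.
        tokens.extend(draft[:accepted])
        tokens.append(target[accepted])
    return tokens

# Toy stand-ins so the sketch runs: both "models" just count upward, so most
# draft proposals are accepted and each expensive pass emits several tokens.
def toy_draft(tokens, k):
    return [len(tokens) + i for i in range(k)]

def toy_target(tokens, draft):
    return [len(tokens) + i for i in range(len(draft) + 1)]

print(speculative_decode(toy_target, toy_draft, [0, 1, 2]))
```

The payoff is the same one the engine layer is chasing: when the cheap draft is usually right, the expensive model's cost is amortized across several tokens per forward pass instead of one.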
The enterprise pitch is concrete: up to 90% GPU cost reduction, industry-leading inference speeds, SOC 2 Type II and HIPAA compliance, 99.99% uptime SLA. These numbers are not marketing - they emerge from the same optimization techniques that Chun's lab spent years developing. When a customer like ScatterLab deploys a 17x larger model at comparable cost and speed, that is continuous batching, speculative decoding, and kernel optimization working in concert.
FriendliAI is not yet profitable. Chun says so directly. The focus has been on scaling efficiently while maintaining strong gross margins. The $20M seed extension goes toward North American and Asian go-to-market expansion, software development, and GPU procurement for the cloud service. The newest initiative, InferenceSense, aims to monetize idle enterprise GPUs by routing inference workloads to them - turning the compute slack in corporate data centers into revenue.