When you tap the voice button in ChatGPT, your phone is calling a server he built.
Subject 001 / d'Sa, R.
In January 2026, LiveKit took $100 million at a billion-dollar valuation and the founder's reaction was, more or less, to keep shipping. The round was led by Index Ventures, with Salesforce Ventures, Altimeter, Redpoint and Hanabi tagging along. Bloomberg wrote it up. So did half the trade press. Russ d'Sa wrote a blog post about voice as the next paradigm and got back to work.
This is the part of the AI stack nobody photographs. The brains - GPT, Claude, Gemini - get the cover stories. The hands - browsers, phones, headsets - get the launches. The nervous system, the thing that carries the signal between mouth and model and back again in under a hundred milliseconds, is open-source code maintained by a small team in San Francisco. The team is LiveKit. The CEO is d'Sa.
He likes that framing. "While companies like OpenAI, Anthropic, Google, and Apple are building the brain behind AI," he told Data Innovation, "LiveKit is building the nervous system." Then he goes and ships another release.
"Voice is the most natural interface we have. It's the one we use with each other every day."
- Russ d'Sa, on why LiveKit exists
LiveKit is an open-source framework for moving real-time audio, video, and data between humans, machines, and the things in between. It started as a WebRTC server you could self-host. Then it became a cloud. Then OpenAI showed up.
The story d'Sa tells - and his co-founder David Zhao tells it too - is that LiveKit became an AI company "by accident." They were building infrastructure for video calls and live audio rooms. Then customers started wiring large language models into one end of the call. The latency budget collapsed. The shape of the workload changed. The product, more or less, rewrote itself.
Today, when you press the voice button in the ChatGPT app, your phone opens a WebRTC connection to a LiveKit server. The audio streams up. A model listens, thinks, and speaks back. The whole loop, mouth to model to ear, lands inside the cognitive window of a normal conversation. That window is the product.
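The client half of that loop is a few lines of code. Here is a minimal sketch against LiveKit's published Python realtime SDK (the livekit package); the endpoint and token are placeholders, and exact signatures can drift between SDK versions:

```python
# Sketch: connect to a LiveKit room, publish a mic track, hear the agent.
# The URL is a hypothetical endpoint; a real app mints the token server-side.
import asyncio
from livekit import rtc

LIVEKIT_URL = "wss://example.livekit.cloud"   # placeholder endpoint
TOKEN = "<server-minted access token>"        # placeholder token

async def main() -> None:
    room = rtc.Room()

    # Fires when the agent's audio track arrives -- the "speaks back" half.
    @room.on("track_subscribed")
    def on_track(track: rtc.Track, pub: rtc.RemoteTrackPublication,
                 participant: rtc.RemoteParticipant) -> None:
        print(f"subscribed to {track.kind} from {participant.identity}")

    await room.connect(LIVEKIT_URL, TOKEN)

    # Publish the microphone -- the "audio streams up" half.
    # A real client pushes captured mic frames into `source`.
    source = rtc.AudioSource(sample_rate=48000, num_channels=1)
    track = rtc.LocalAudioTrack.create_audio_track("mic", source)
    await room.local_participant.publish_track(track)

    await asyncio.sleep(30)  # hold the session open briefly for the demo
    await room.disconnect()

asyncio.run(main())
```

Everything interesting happens on the other end of that connection, where the model lives.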
A timeline assembled from public interviews, press, and the company's own blog. The dates that matter are not the ones on the cap table.
LiveKit's cap table reads like a curve - quiet, then loud. The Series C lands in a market where voice AI is the only product category investors trust to grow inside a single budget cycle.
Round amounts are approximate, based on a $181M cumulative figure reported in company materials.
d'Sa argues the move to voice is not a feature; it's a substrate change. The web was built for clicks - request, response, render. Voice is a stream. It is stateful. It has turns. It has interruptions. It has the unforgiving timing of human conversation, where 300 milliseconds of dead air reads as awkward and 800 reads as broken.
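The arithmetic is unforgiving too. A back-of-the-envelope budget for a single turn, with illustrative component numbers (assumptions for the sake of the sum, not LiveKit measurements):

```python
# One conversational turn, mouth to model to ear.
# Every number below is illustrative, not a LiveKit measurement.
budget_ms = {
    "uplink (mic to server)": 40,
    "streaming speech-to-text (final transcript)": 150,
    "LLM time-to-first-token": 250,
    "text-to-speech (first audio chunk)": 120,
    "downlink (server to ear)": 40,
}
total = sum(budget_ms.values())
print(f"{total} ms")  # 600 ms: past 'awkward' (300), not yet 'broken' (800)
```

Shave any one line and the conversation feels better. Miss any one line and no model, however smart, can save it.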
Most of the engineering at LiveKit, in his telling, is the boring infrastructure work that lets a model behave like a person who is listening: turn detection, interruption handling, low-latency transport, region pinning, telephony bridges, the long tail of "what happens when the network drops for 90 milliseconds."
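In the Agents framework, most of that boring work hides behind a session object. A sketch following the shape of LiveKit's published Python quickstart - the plugin mix and model name here are assumptions, and names can drift between framework versions:

```python
# Sketch of a voice pipeline with LiveKit Agents.
# Requires livekit-agents plus the openai and silero plugin packages.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai, silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()  # join the LiveKit room

    session = AgentSession(
        vad=silero.VAD.load(),                # voice activity: who talks, when
        stt=openai.STT(),                     # speech in
        llm=openai.LLM(model="gpt-4o-mini"),  # model choice is an assumption
        tts=openai.TTS(),                     # speech out
    )
    # The session owns turn detection and barge-in, so the model
    # stops talking when the human starts.
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

The developer supplies the models. The framework supplies the manners.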
"Make building and scaling voice AI as easy as building and scaling on the web," he wrote in the Series C announcement. It is a sentence that hides a decade of engineering inside the word as.
He is @dsa on Twitter and dsa on GitHub. No numbers, no underscores. He was on the internet early.
The link preview cards that fill every embedded tweet - those are Twitter Cards. He built them as employee #75.
Evie Launcher was a beloved Android home-screen replacement before Niantic bought the team in 2019.
He's said his 20s wanted to be Steve Jobs. His 40s want to ship something that matters and go home. Plumbing fits.
LiveKit wasn't built for AI. AI showed up at the door. The founders have publicly described the pivot as customer-pulled, not founder-pushed.
By his own count this is his fifth company. Each one, he says, gets a little better.
LiveKit's Series C deck, distilled to a single bet, is this: the dominant way humans will instruct computers in 2030 is by talking to them. Not transcribed dictation. Not Siri-style command grammars. Real conversation, with interruptions and pauses and accents, mediated by a model on one end and a microphone on the other, with whatever latency budget the human ear allows in the middle.
If that is right, the company doing the carrying is worth a lot. The Series C is priced like d'Sa is right. Whether he is right is, as ever, a question of distribution and time.
What he is doing in the meantime: expanding into compute, storage and network services tuned for voice and computer vision. Building out the Agents framework. Holding a hiring bar in San Francisco. Writing the blog posts himself.
The brain gets the headlines. The nervous system does the work.