The Linguist Who Rewired the Phone Call
When you dial Domino's to place an order and the voice on the other end sounds oddly... normal - not robotic, not aggressively cheerful, just present - there's a decent chance Lily Clifford built that voice. She is CEO and co-founder of Rime, a San Francisco company that has quietly become the invisible voice layer for some of the busiest phone lines in America. Over 100 million conversations a month. Restaurant ordering, healthcare triage, telecom support, enterprise IVR. All of it running on models trained on conversations captured in a recording studio Lily's team built from scratch, filled with people talking over each other the way people actually do.
Lily came to this with a background most tech founders don't have: a deep, genuine obsession with how humans speak differently depending on where they're from, who they're talking to, and what they're trying to get away with socially. That's sociophonetics - the study of speech as a social act. It was her PhD research at Stanford. She left before finishing. Not because she failed at academia, but because she found something more urgent to build.
I ended up dropping out because I wanted to hack on speech synthesis models - specifically for customer support.
- Lily Clifford, CEO, Rime

The thesis that launched Rime is counterintuitive enough that most product teams would have killed it in a committee: a slightly bored AI voice outperforms a peppy, polished one. In real enterprise deployments, voices that sounded genuinely human - with the occasional flat affect, the slight hesitation before a proper noun, the breath before a long sentence - converted better, contained more calls, and got fewer hang-ups than the over-articulated, studio-perfect alternatives everyone else was shipping.
Three Co-Founders, One Recording Studio, and a Bet on Messy Speech
Lily graduated from Pitzer College in 2014 with a linguistics degree. She spent the next several years moving deeper into the academic study of speech - eventually landing at Stanford's linguistics PhD program, where she was researching how social and demographic factors shape the way people speak. People in Texas and California don't just have different accents; they have different speech rhythms, intonation patterns, and expectations about what an authoritative voice sounds like. She was studying all of that when it clicked: AI voices were failing not because the models were bad, but because the data was wrong.
Training on audiobook recordings - the industry standard - produced voices that sounded like someone reading aloud. Deliberate. Articulate. Fundamentally unlike a conversation. Lily's research instinct said the fix was obvious: get real conversational data. Full-duplex speech. People interrupting each other. Laughing mid-sentence. Trailing off. Nobody in the market was building that dataset because it was hard and expensive and the industry had convinced itself audiobooks were good enough.
In 2022, Lily cold-recruited two people with credentials that made no obvious sense together: Brooke Larson, a PhD linguist who had been building voice for Amazon's Alexa, and Ares Geovanos, a Stanford-trained engineer who had been working on brain-computer interfaces at UCSF - technology to help people who had lost the ability to speak. She pitched them on a company built not on the premise that AI voices should be perfect, but that they should be real.
They rented space in San Francisco, built a recording studio, and started capturing spontaneous, full-duplex conversations - the kind where people talk at the same time, lose their train of thought, and laugh in the wrong places. That dataset became the foundation for everything Rime has built since.
Linguistics as Infrastructure
Most voice AI companies talk about latency and naturalness as if they're in tension. Rime treats them as an integrated problem with a linguistic solution. The models are trained to understand not just what a word is but what a word does in context - how a list sounds different from a question, how regional pronunciation shifts when someone is tired, why the same sentence lands differently in an IVR versus a customer support chat.
Lily calls the approach "linguistics as a service" - making it easy for developers to adjust pronunciation and speech patterns without needing to know the International Phonetic Alphabet. And "demographics as a service" - because enterprise customers shouldn't need to become voice casting directors to get a voice that sounds right for their audience. People hear demographic cues in voices unconsciously, and they respond to them. A Texas-based customer service operation and a New York-based financial services firm need different things from an AI voice, even if neither of them can articulate exactly why.
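One way to picture the "no IPA required" idea is a plain-text respelling table that gets substituted into the text before synthesis. This is a hypothetical sketch of the concept only - the `apply_overrides` helper and the override table are illustrative, not Rime's actual API:

```python
import re

# Hypothetical sketch: developers supply plain-English respellings,
# and whole words are swapped out before the text reaches the TTS model.
# Nothing here is Rime's real interface; it just illustrates the idea.

def apply_overrides(text, overrides):
    """Replace whole words with caller-supplied respellings (case-insensitive)."""
    def sub(match):
        return overrides.get(match.group(0).lower(), match.group(0))
    return re.sub(r"[A-Za-z']+", sub, text)

overrides = {"pecan": "puh-KAHN", "caramel": "CAR-muhl"}  # illustrative respellings
print(apply_overrides("A pecan caramel sundae", overrides))
# -> "A puh-KAHN CAR-muhl sundae"
```

The design point is the same one Lily makes: the developer writes "puh-KAHN," not an IPA string, and the pronunciation layer handles the rest.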
Rime's Product Arsenal
- Mist - Next-gen TTS model family trained on massive conversational speech dataset; powers real-time voice applications
- Arcana - Expressive AI voice model; designed for emotional range and nuanced delivery
- Mist v2 - Fastest customizable TTS with sub-200ms end-to-end latency
- On-premises deployment - Only next-gen voice AI available on-prem; critical for healthcare & finance
- 200+ distinct voices - Including demographically specific voices and custom cloning via API
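Latency claims like "sub-200ms end-to-end" are usually stated as a percentile over many requests rather than a single lucky run. A minimal sketch of how a buyer might sanity-check such a claim, using invented sample numbers (not Rime measurements):

```python
# Check a latency budget against the p95 of time-to-first-audio samples.
# The sample values are illustrative, not measurements of any real system.

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

def meets_budget(samples, budget_ms=200, pct=95):
    """True if the pct-th percentile time-to-first-audio is under budget."""
    return percentile(samples, pct) < budget_ms

ttfa_ms = [142, 155, 163, 171, 149, 188, 158, 176, 167, 151]  # illustrative
print(percentile(ttfa_ms, 95), meets_budget(ttfa_ms))
# -> 188 True
```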
The technical differentiation shows up in the compliance stack too. Rime achieved SOC 2 Type II certification and HIPAA compliance, and it's the only next-generation TTS provider offering on-premises deployment. For regulated industries - healthcare, finance, government - that's not a feature. It's the only way they can legally deploy voice AI at all.
The Thing Nobody Warned Her About
Lily will tell you she came in as a researcher and a builder. She built the models. She hired the linguists. She ran the recording sessions. What she didn't expect was that the hardest part of building a voice AI company wouldn't be the voice AI - it would be sales.
One of the best lessons I've learned as a first-time founder is that there's no substitute for just diving in and doing the hard work when learning something new. I'm talking about sales.
- Lily Clifford, X (formerly Twitter)

In the early days at Rime, the instinct was to build the best model and collect the best data and let the technology speak for itself. The customers didn't show up on their own. Lily spent months calling enterprise prospects cold, learning what questions they actually had about voice AI deployment, learning how to talk about millisecond latency in terms of call containment rates and revenue per answered call. The scientific precision of her research background translated, eventually, into a very specific sales methodology: show the data, name the number, anchor on the customer's actual problem.
Twenty-plus enterprise clients later - including chains that together handle tens of millions of food orders annually - the methodology is working. The company hit $1.1M in revenue with a 10-person team in 2025. It was a lean, technical team that figured out the commercial side by necessity, not by plan.
Lily on Voice, Startups, and Getting It Right
- "People in Texas sound different from people in California, and we all pick up on these cues, consciously or not."
- "Voice AI should be as rich, diverse, and expressive as the people it serves."
- "We wanted voices that sounded like a friend, not a voice actor."
- "You can increase your IVR call containment rate from 85% to 95% just by making the bot sound better."
The Global Accent Problem and the Speech-to-Speech Future
The demand Lily is most focused on right now is India. The linguistic diversity of the subcontinent - hundreds of languages and thousands of dialects, each with its own phonology, its own intonation rules, its own sociolinguistic expectations - represents exactly the kind of problem Rime's linguistic-first approach is built for. Getting an AI voice right for Tamil-speaking customers in Chennai is a fundamentally different challenge than getting it right for English-speaking customers in Atlanta, and most voice AI companies are treating both problems as if they're the same thing. Rime isn't.
Further out, Lily's vision for the technology goes past text-to-speech entirely. The current paradigm - convert speech to text, process text, convert text back to speech - introduces latency and loses prosodic information at every step. The next phase, she believes, is direct speech-to-speech: models that process the acoustic signal directly, preserving the full richness of how something was said rather than just what was said. That's not a 2025 product. But it's the research direction.
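The latency cost of the cascaded pipeline is additive, which a few lines of arithmetic make concrete. The stage figures below are invented round numbers for illustration, not measurements of any real system:

```python
# Illustrative latency arithmetic: the cascaded pipeline the text describes
# (speech -> text -> processing -> speech) versus a direct speech-to-speech
# model. Stage numbers are made-up round figures, not real benchmarks.

cascaded_ms = {
    "speech_to_text": 300,   # wait for endpointing, then transcribe
    "text_processing": 400,  # dialog logic runs on the transcript
    "text_to_speech": 200,   # synthesize the reply audio
}

# A direct model has one stage - and because it never flattens audio to
# text, prosody (pitch, timing, emphasis) survives end to end.
direct_ms = {"speech_to_speech": 450}

def total_latency(stages):
    """Sequential stages: their latencies simply add."""
    return sum(stages.values())

print(total_latency(cascaded_ms), total_latency(direct_ms))
# -> 900 450
```

Even with generous per-stage numbers, the cascade pays the sum of every hop; the direct model pays once, which is the core of the argument for speech-to-speech.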
For now, the trajectory is straightforward: more enterprise customers, more languages, deeper integration into the contact center stack. Rime is hosting industry events in San Francisco. It is building in public. It is doing the work of convincing enterprise software buyers - some of the most skeptical, contract-heavy, procurement-dependent buyers in technology - that voice AI is ready for their most sensitive customer interactions. Given who's already bought in, the argument is getting easier to make.