Voice AI for the loudest place on earth
Most AI demos happen in a quiet room. Umut Isik's product lives at a drive-thru speaker box - the acoustic worst-case, where an engine idles, a kid yells in the back seat, and three lanes of orders bleed together.
That is the problem Incept AI exists to solve. Isik co-founded the New York company in 2023 with Justin Foster, and the pitch is deceptively plain: build voice AI that actually works outside the lab. The phrase he uses for it is the "last mile" of voice AI - the gap between a transcript that looks fine on a slide and a transcript that holds up when someone mumbles "no pickles" through a tinny microphone in a thunderstorm.
Restaurants are the proving ground. Incept's system takes drive-thru and phone orders, handles the combinatorial nightmare of menu modifiers, suggests the upsell, checks what is out of stock, and drops a finished order into the point-of-sale. Underneath sits the Incept Neural Engine, audio networks built to strip out noise, acoustic echo, and overlapping speakers before a foundation model ever sees the words. The company integrates with the plumbing of the industry - Toast, Square, PAR, HME - and reports 95%+ AI-only completion at roughly 812 milliseconds of conversational latency.
In February 2025 the bet got funded: a $3 million pre-seed led by Rally Ventures, with 10VC along for the ride. Ben Fried, Google's former CIO, took a board seat. By then Incept had been live since May 2024 and was running pilots across chains with more than a thousand locations between them. One coffee-chain CTO put it bluntly in a customer note: every new store starts with Incept from day one.
A geometer who learned to listen
Here is the detail that explains the rest of him: before the citation that reads "Better speech enhancement with frequency-positional embeddings," there is one that reads "Equivalence of the derived category of a variety with a singularity category." Same author. The second has been cited 120 times. It is pure algebraic geometry, the kind of mathematics with no obvious use and a deep internal beauty.
Isik earned his Ph.D. in mathematics at the University of Pennsylvania, then spent years inside the field - including a turn as a Visiting Assistant Professor at UC Irvine - working on algebraic geometry, category theory, and categorical complexity. His Google Scholar page carries the receipts of that double life: roughly 1,446 citations and an h-index of 16, split between abstract algebra and applied deep learning.
The bridge between the two careers was audio. As a Principal Applied Scientist at Amazon Web Services, Isik turned the rigor of a theorem-prover loose on a much messier object: sound. His most-cited papers are foundational speech-enhancement work - Attention Wave-U-Net, PoCoNet, channel-attention dense U-Net - the literature of teaching a network to pull a clean voice out of a dirty signal. It is the exact problem a drive-thru hands you, dressed in a lab coat. When he left to start Incept, he was not switching fields. He was taking the same problem out of the building.
Voice AI gets graded on a curve in quiet rooms. The harder a room gets, the wider the gap between providers. That spread is what Incept is chasing - the move from "good enough on a slide" to "good enough in the rain."
He makes art out of math
Somewhere between the theorems and the neural nets, Isik built mathvas.com - a tool that turns simple mathematical functions into images. He calls it a new medium for artistic and mathematical expression, and he has used it in math circles and undergraduate workshops. His pieces showed at the mathematical art galleries tied to the Joint Mathematics Meetings.
Uses probability distributions as "colors" - a uniform distribution inside a circle set against a bimodal one outside, the two blurring where the sampling rate climbs.
Algebraic curves slice the plane into regions, each filled with distributions that blend strong and faint color across the whole composition.
The other body of work
- Attention Wave-U-Net for speech enhancement (2019) - 187 citations
- Channel-attention dense U-Net for multichannel speech enhancement (2020) - 139 citations
- PoCoNet: better speech enhancement with frequency-positional embeddings (2020) - 133 citations
- A perceptually-motivated approach for low-complexity, real-time enhancement of fullband speech (2020) - 132 citations
- Equivalence of the derived category of a variety with a singularity category (2013) - 120 citations