It is two in the morning somewhere, and a medical student is grading a chest X-ray on her phone for a few cents and the small thrill of being right. A thousand miles away, a radiologist on a coffee break taps through ultrasound clips. Their verdicts pour into a server in Boston, get weighed against each other, and quietly become the ground truth that teaches an AI model how to spot a tumor.
That server belongs to Centaur AI - the company formerly, and still legally, known as Centaur Labs. It does the least glamorous job in artificial intelligence: it labels things. Images, audio, video, EEG squiggles, the messy raw stuff that medical models have to learn from. The twist is who does the labeling, and how the company decides whose opinion to trust.
Most AI companies fight over the model. Centaur picked the fight nobody wanted - the data itself, and specifically the human verdicts stamped onto it before any model ever sees it. It is an unsexy front line. It is also, the company argues, the one that decides whether a medical model is safe to ship. From a 50 Milk Street office in Boston, a team of roughly 78 people now sits between the world's healthcare data and the algorithms hungry to learn from it.
In healthcare, where AI hallucinations can cost lives, "garbage in, garbage out" data problems are unacceptable.
- Erik Duhaime, Co-Founder & CEO
The problem they saw
A model is only as smart as its worst label
Every AI system trained on human-labeled data inherits the judgment of whoever drew the boxes and ticked the boxes. In most industries, a sloppy label means a slightly worse product recommendation. In medicine, it means a model that confidently misreads a melanoma. The catch: even expert doctors disagree with each other - and with themselves - more often than anyone likes to admit. Hand the same scan to ten radiologists and you can get a surprising spread of answers.
The old fixes were unsatisfying. You could pay one expert and pray. You could outsource to a generalist labeling shop that had never seen a heart murmur. Either way, you were trusting a credential instead of a track record. Centaur's founders thought the credential was the wrong thing to trust.
Trust the labelers who actually get the answers right - not the ones with the most impressive titles.
- The Centaur AI operating principle, paraphrased
The founders' bet
From flash cards to collective intelligence
The idea arrived, as good ideas occasionally do, by accident. Erik Duhaime was a PhD student at the MIT Center for Collective Intelligence, studying how groups of people - nudged by the right technology - can be smarter than any individual in them. At home, he watched his wife grind through medical flash cards, and it clicked: all that practice judgment was, in effect, free training data going to waste.
His research had already turned up something almost rude to the medical establishment: groups of medical students, properly managed, could classify skin lesions more accurately than seasoned dermatologists. The trick was not crowd size. It was bookkeeping - measure each person's accuracy on cases with known answers, discard the people who are bad at the task, and intelligently pool only the ones who are good.
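For readers who want to see that bookkeeping in concrete terms, here is a minimal Python sketch. It is an illustration under stated assumptions, not Centaur's actual algorithm: the function names, the 0.7 accuracy cutoff, the toy data, and the choice to weight each vote by raw accuracy are all hypothetical. It scores each labeler on cases with known answers, drops anyone below the cutoff, and pools the rest.

```python
from collections import defaultdict


def labeler_accuracy(gold_answers, responses):
    """Fraction of known-answer cases each labeler got right.

    gold_answers: {case_id: true_label}
    responses: iterable of (labeler_id, case_id, label)
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for labeler, case_id, label in responses:
        if case_id in gold_answers:  # only known-answer cases count toward the score
            seen[labeler] += 1
            correct[labeler] += int(label == gold_answers[case_id])
    return {labeler: correct[labeler] / seen[labeler] for labeler in seen}


def pooled_label(votes, accuracy, min_accuracy=0.7):
    """Accuracy-weighted vote among labelers at or above the cutoff.

    votes: {labeler_id: label} for a single unlabeled case.
    The 0.7 cutoff is an arbitrary illustrative value.
    """
    totals = defaultdict(float)
    for labeler, label in votes.items():
        score = accuracy.get(labeler, 0.0)  # unscored labelers get no say
        if score >= min_accuracy:
            totals[label] += score
    if not totals:
        return None  # nobody qualified to vote
    return max(totals, key=totals.get)


# Hypothetical known-answer cases and four hypothetical labelers.
gold = {"g1": "benign", "g2": "malignant", "g3": "benign", "g4": "malignant"}
answers = {
    "A": ["benign", "malignant", "benign", "malignant"],     # 4 of 4 right
    "B": ["benign", "malignant", "benign", "benign"],        # 3 of 4 right
    "C": ["malignant", "malignant", "malignant", "benign"],  # 1 of 4 right
    "D": ["benign", "benign", "benign", "benign"],           # 2 of 4 right
}
responses = [
    (labeler, case_id, label)
    for labeler, labels in answers.items()
    for case_id, label in zip(gold, labels)
]

accuracy = labeler_accuracy(gold, responses)
new_case_votes = {"A": "malignant", "B": "benign", "C": "benign", "D": "benign"}
result = pooled_label(new_case_votes, accuracy)
print(accuracy)
print(result)
```

In this toy data, a simple majority would say "benign" three votes to one. The performance-weighted pool drops the two weakest labelers, sides with the one who has a perfect record, and returns "malignant".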
In 2017 he turned the research into a company with two friends from Brown, Zach Rausnitz and Tom Gellatly. MIT's Sandbox fund wrote an early check; the delta v accelerator followed in 2018; Y Combinator that same year. The name is a wink - a "centaur," in chess, is a human paired with a machine, a team that beats either one playing alone.
How a PhD side-project became a $35M company
The product
A game, a platform, and a very strict referee
The crowd-facing side is a mobile app called DiagnosUs, where experts compete to tag real medical data - dermoscopic images, heart and lung sounds, radiology, ECG and EEG traces, text - and earn small cash prizes for being accurate. Roughly half the players are medical students. One doctor in eastern Europe reportedly earned around $10,000. The platform quietly slips in cases with known answers to score everyone, then weights opinions accordingly. It is a game with a referee who never blinks.
The business-facing side is the part customers pay for: a HIPAA-compliant, SOC 2-compliant platform that turns those millions of weekly opinions into training labels, model evaluations, and ongoing monitoring across text, image, audio, video, and waveform data. A newer on-demand tier returns expert labels in under 30 minutes instead of days. The economics are tidy in their own way - experts are paid for performance, customers pay for clean data, and Centaur keeps the spread and the software around it.
What makes the platform useful is less the labeling than the scoring underneath it. Because every contributor is continuously measured against known answers, Centaur can tell a client not just what the label is, but how confident the crowd was, where the experts split, and which cases are genuinely ambiguous. For an AI team, that confidence signal is gold: it flags the edge cases worth a second look instead of burying them in an averaged-out spreadsheet. The same machinery lets the company evaluate finished models - including large language models - by pitting their answers against an expert consensus that has already proven it can be trusted.
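A rough Python sketch shows how such a confidence signal could be derived. Everything here is an assumption made for illustration: the function names, the 0.75 threshold, the toy data, and the definition of confidence as the winning label's share of total vote weight are not taken from Centaur's platform. The sketch separates settled cases from ambiguous ones, then scores a model only against the settled consensus.

```python
def consensus(votes, weights):
    """Return (label, confidence) for one case.

    votes: {labeler_id: label}; weights: {labeler_id: weight}.
    Confidence is the winning label's share of total vote weight
    (one possible definition, chosen here for simplicity).
    """
    totals = {}
    for labeler, label in votes.items():
        totals[label] = totals.get(label, 0.0) + weights.get(labeler, 0.0)
    total_weight = sum(totals.values())
    if total_weight == 0:
        return None, 0.0
    label = max(totals, key=totals.get)
    return label, totals[label] / total_weight


def triage(cases, weights, threshold=0.75):
    """Split cases into settled labels and ambiguous case ids."""
    settled, ambiguous = {}, []
    for case_id, votes in cases.items():
        label, confidence = consensus(votes, weights)
        if confidence >= threshold:
            settled[case_id] = label
        else:
            ambiguous.append(case_id)  # experts split: worth a second look
    return settled, ambiguous


def model_agreement(predictions, settled):
    """Fraction of settled cases where the model matches the consensus."""
    scored = [case_id for case_id in settled if case_id in predictions]
    if not scored:
        return None
    hits = sum(predictions[case_id] == settled[case_id] for case_id in scored)
    return hits / len(scored)


# Hypothetical labeler weights and heart-sound cases.
weights = {"a": 1.0, "b": 1.0, "c": 0.5}
cases = {
    "x1": {"a": "murmur", "b": "murmur", "c": "murmur"},  # unanimous
    "x2": {"a": "murmur", "b": "normal", "c": "normal"},  # 1.5 of 2.5 = 0.6
    "x3": {"a": "normal", "b": "normal", "c": "murmur"},  # 2.0 of 2.5 = 0.8
}
settled, ambiguous = triage(cases, weights)
agreement = model_agreement({"x1": "murmur", "x2": "murmur", "x3": "murmur"}, settled)
print(settled, ambiguous, agreement)
```

The split case, x2, is surfaced on its own list instead of being averaged into a label, and the model is graded only on the two cases where the weighted crowd was reasonably sure.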
DiagnosUs
The gamified app where experts compete to label data and get paid for accuracy. The crowd, with a scoreboard.
Annotation Platform
Multimodal labeling for model training, evaluation, monitoring and feedback - performance-weighted, not credential-weighted.
Model Evaluation
Expert benchmarking and validation - including LLMs and imaging models - to surface the edge cases that break things.
On-Demand Labels
Expert results in under 30 minutes. Because "we'll get back to you next week" is not a feature.
The trick was to continually measure each person's performance on cases with known answers, throw out the opinions of people who were bad at the task, and intelligently pool the opinions of people who were good.
- On the method behind Centaur AI's crowd
The proof
Numbers that survived a skeptic
The pitch invites eye-rolling - crowdsourced medicine sounds like a recipe for confident nonsense. The results are harder to wave away. When Centaur worked with Eight Sleep to teach a model to hear snoring in spectrograms and waveforms, detection accuracy climbed from 70% to 93%. In published comparisons, performance-weighted crowds matched expert radiologists on lung ultrasound and out-classified experienced dermatologists on skin images. Customers report annotation speedups that translate to as much as a 20X return.
The Eight Sleep snore test
The customer list reads like a who's who of places that cannot afford to be wrong: Mass General Brigham, Memorial Sloan Kettering Cancer Center, Microsoft, Eko, Paige, SciBite (Elsevier), Volastra, Activ Surgical, Medtronic. The Series B - an oversubscribed $16M+ round in October 2024 led by SignalFire, with Matrix, Accel, Susa, Omega, Y Combinator, Samsung Next and Alumni Ventures - was the market agreeing.
The mission
Make the data trustworthy first
Centaur's stated aim is narrow and stubborn: provide high-quality, trusted data annotation at scale so that AI in health and science can be relied on for decisions that matter. The company would rather you never hear the phrase "garbage in, garbage out" applied to a model that touches a patient. Everything - the scoreboard, the hidden test cases, the performance weighting - exists to keep the bad inputs out.
I realized my wife's studying could be productive work for AI developers.
- Erik Duhaime, on the origin of DiagnosUs
Why it matters tomorrow
The crowd in the loop
As medical AI moves from demo to diagnosis, the bottleneck stops being model architecture and starts being trust - in the data, and in the people who vouched for it. Centaur is wagering that the most defensible thing in AI is not a clever algorithm but a measured, motivated, performance-ranked crowd that can be pointed at any new modality the next model demands. It is an unfashionable bet: that human judgment, organized well, still has the edge.
So it is two in the morning, and the medical student is still grading scans on her phone. Nothing about the scene looks like the future of medicine. But her verdict, blended with thousands of others and filtered through a referee that only cares whether she is right, is now part of the ground truth a model will carry into a clinic. The least glamorous job in AI turns out to be the one everything else stands on.