Picture the typical AI user in 2025: fifteen tabs open, a ChatGPT subscription, half a Claude trial, a vague sense that Gemini might be better for this specific task and Grok for that one. Nobody really knew which model was best. The labs certainly weren't going to tell you.
Yupp had a different idea. Give everyone access to every model, side by side, for free. Then pay them to tell you which one won. It sounds almost too simple - until you realize it was sitting on top of what the entire AI industry desperately needed: genuine, real-world human preference data at scale.
The company launched publicly in June 2025 with $33 million in seed funding, 500 AI models in the catalog, and the backing of people who between them had built Google Pay, Twitter's recommendation engine, and the machine learning systems at Google X. It was a credibility stack that made even skeptical observers lean in.
Every AI builder wants to know how good their models are, and for which use cases. We built the system to answer that question - at the scale of a million real users.
- Pankaj Gupta, Co-founder & CEO, Yupp
The Problem Nobody Was Solving
Benchmarks in AI are a little like restaurant awards given by the chef's own mother: technically rigorous, structurally flawed. Academic benchmarks measure what labs decide to measure, and labs have a habit of training toward whatever gets measured. The results started looking suspiciously good across the board.
What nobody had built was a system where ordinary people - not PhDs, not cherry-picked evaluators - could just use AI models on their actual questions, in their actual workflows, and vote. Not on curated prompts. On their own prompts. The difference matters enormously.
This was the problem Yupp sat down squarely in front of. The insight was simple: AI evaluation is a consumer product problem, not a research problem. The consumers are already there. You just have to give them a reason to show up.
The Founders' Bet
Pankaj Gupta and Gilad Mishne were not first-time founders chasing an idea. They were two operators who had spent years inside the institutions shaping modern technology, and they left to build something that those institutions couldn't.
Pankaj Gupta: PhD in computer science from Stanford. Built the "Who to Follow" recommendation engine at Twitter. Led Google Pay Consumer engineering for a billion users worldwide. VP Engineering at Coinbase. Three prior acquisitions. The kind of resume that makes VC associates nervous about asking follow-up questions.
Gilad Mishne: Machine learning lead at Google X. Built recommendation and search systems at Twitter and Yahoo. Deep roots in bioinformatics and biotech. The person in the room who actually understood what the models were doing - and how to make their evaluation trustworthy.
Their bet was essentially this: human preference data is the scarce input in AI development. Labs can generate synthetic data, hire contractors, run A/B tests - but real users genuinely choosing between models, at volume, across diverse use cases? That's harder to fake and more valuable than it looks.
The Product
Yupp was not complicated to use. You typed a prompt, and two (or more) AI models answered it. You picked the better one. Yupp logged your vote, added it to an aggregated preference dataset, and gave you credits for your trouble. Credits could be spent on more AI usage, or cashed out via PayPal, Stripe, or - and this is the part that attracted a16z's crypto arm - stablecoins on Base L2 and Solana.
The stablecoin integration was deliberate and functional, not decorative. It solved a real problem: how do you pay 1.3 million users across 200 countries without banking infrastructure eating the margins? Base L2 answered that. Near-instant. Nearly free. Global.
Beyond simple head-to-head comparisons, Yupp built a public AI leaderboard ranked by "VIBE Score" - standing for Vibe Intelligence BEnchmark, which was either a genuine acronym or a successful act of branding, possibly both. The VIBE Score was crowdsourced and updated in real time, tracking which models users actually preferred for which types of tasks. It was the anti-benchmark benchmark.
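To make the idea of a crowdsourced leaderboard concrete, here is a minimal sketch of one common way to turn head-to-head votes into a ranking: an Elo-style rating update. It is an illustration only. Nothing here describes how the VIBE Score was actually computed, and the function name `rank_models`, the model names, the starting rating of 1000, and the K-factor of 32 are all assumptions.

```python
from collections import defaultdict


def rank_models(votes, k=32.0, base=1000.0):
    """Rank models from (winner, loser) votes using an Elo-style update.

    Hypothetical helper for illustration; not Yupp's actual scoring method.
    """
    ratings = defaultdict(lambda: base)  # every model starts at the same rating
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # Upsets (low expected probability) move more points than expected wins.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Made-up votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

leaderboard = rank_models(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.0f}")
```

Each vote transfers rating points from the losing model to the winning one, so the total is conserved and the ordering shifts as votes arrive. Ranking by task type, as the leaderboard did, would amount to running the same update over separate pools of votes.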
One of the things that meant the most to us was being able to give our community free access to the world's top AI models. That was never a feature - it was the point.
- Pankaj Gupta, wind-down announcement, March 2026
[Chart: Yupp's scale relative to comparable consumer AI platforms at seed stage. Data as of platform peak, late 2025.]
The Proof - and the Catch
The traction was real. For a seed-stage startup, 1.3 million sign-ups is not a vanity metric. Multiple AI labs paying for preference data meant the business model had legs, at least in theory. An active community of 90,000 suggested the product was genuinely useful and not just a novelty click.
The investor list tells its own story. Getting Jeff Dean - the chief scientist of Google - to write you a check is not easy. Getting him alongside Biz Stone, Evan Sharp, Aravind Srinivas, and the a16z crypto fund in the same round is something else entirely. These are people who have seen enough companies to know when a thesis is interesting.
The Mission - Then the Pivot That Didn't Come
Yupp's mission was stated plainly in its tagline: "Every AI for everyone." It was consumer populism applied to a field that had drifted toward being a specialists-only game. You shouldn't need a corporate subscription to try the best models. You shouldn't need a PhD to evaluate them. Yupp made both of those things free and then put credits in your pocket for helping.
The crypto angle was structural, not cosmetic. a16z's crypto fund led the round precisely because Yupp was building what DePIN advocates called "DePIN 2.0": the decentralized physical infrastructure network model, applied in this case to a globally distributed evaluation network and incentivized at the token level. It was a genuinely novel structure for a genuinely novel problem.
Where it ran out of road was the speed at which AI labs reassessed what they actually wanted from evaluation. The industry spent 2025 deciding that it preferred expert-curated data from specialists over broad consumer preferences. Simultaneously, the conversation shifted from "which chatbot is better?" to "which agentic system can complete a 47-step workflow?" Yupp was built for a layer of the stack that the industry proceeded to de-emphasize in real time.
The AI model capability landscape has changed dramatically in the last year alone and will continue to change quickly. The future is not just models but agentic systems.
- Pankaj Gupta, Yupp wind-down statement, March 2026
Why It Matters Tomorrow
It is tempting to file Yupp under "interesting experiment, didn't work." That would be wrong in at least one important way. The product worked. The community validated it. The technology - real-time, crowdsourced, crypto-incentivized model evaluation - was technically sound and genuinely novel. What didn't work was the timing of the business model against a market moving at historically unusual speed.
The question Yupp was asking - who should decide which AI models are good? - has not gone away. It just hasn't been answered yet in a way that sustains a business. Whoever answers it next will probably read the Yupp case study before they build.
The 1.3 million users were real. The preference data was real. The problem was real. In a field where half the startups are solutions looking for problems, Yupp was a solution that found exactly the right problem and ran into a wall when the problem moved.
Remember that person with fifteen tabs open, trying to figure out which AI to trust? Yupp built the thing that was supposed to solve that problem. For a while, 1.3 million of them agreed it helped. They compared models they'd never heard of. They earned credits. They contributed, without realizing it, to a dataset that AI labs found genuinely valuable enough to pay for.
The platform is gone. The tabs are still open. The problem persists - slightly better understood, at least in part because Yupp spent ten months mapping it with the help of a million voices. That is a strange and specific kind of contribution to make to a field. But it was a real one.