On a Tuesday in Redwood City, a research engineer at DatologyAI is staring at a 200-billion-token training corpus and deciding which 40 billion of them are worth keeping. The other 160 billion - duplicates, near-duplicates, garbage tokens, low-signal web pages, scraped advertisements, all the residue of a hungry crawler - will be discarded. This is what they do here. The unglamorous part of artificial intelligence. The part that nobody puts on a keynote slide.
And yet this is the company that Geoff Hinton, Yann LeCun, and Jeff Dean all personally wrote checks for. Not the next generative-art startup. Not another wrapper on GPT. This one. The data janitors.
01 The Problem They Saw
Here is the embarrassing truth of modern AI. Most of the money pouring into the field has gone toward making models bigger - more parameters, more layers, more GPUs in the rack. Far less has gone toward making the data better. Researchers will spend a year refining attention mechanisms and then train the model on whatever was sitting around in Common Crawl. The garbage-in, garbage-out joke has been a joke for so long it stopped being funny.
Ari Morcos noticed. He spent five years at Meta's AI lab and two before that at DeepMind, watching teams burn enormous amounts of compute to chase tenths of a percentage point on benchmarks while their underlying training corpora were, charitably, a mess. The dirty secret of frontier labs is that data work happens. It is just done by hand, by small heroic teams, with bespoke pipelines that never leave the building.
Matthew Leavitt was running data research at MosaicML when Databricks acquired it in 2023. Bogdan Gaza had built data infrastructure at Amazon and Twitter for a decade. The three of them looked at the landscape - everyone scaling models, almost no one scaling data craft - and decided to start a company on a single contrarian wager.
02 The Founders' Bet
The bet is this. Data curation at petabyte scale is not a chore. It is a competitive advantage that can be productized. Filter the right tokens out. Identify the highest-signal examples. Find the gaps. Balance the distribution. Generate synthetic data to plug the holes. Do all of this automatically, with deep learning techniques that scale to the size of the modern internet.
If you do it well, the same model architecture gets meaningfully better on the same compute budget. Or - more interestingly - you can hit the same quality with a much smaller model. Either way, given the price of a frontier training run, the savings add up to a very large number with very real implications.
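One item in that list, balancing the distribution, is easy to picture in code. What follows is a minimal sketch of temperature-style rebalancing, a common technique in multilingual training, and not a description of DatologyAI's own method, which is not public. The function name `rebalance`, the `alpha` parameter, and the token counts are all hypothetical, chosen only for illustration.

```python
def rebalance(token_counts: dict, alpha: float = 0.5) -> dict:
    """Return sampling probabilities proportional to count ** alpha.

    alpha = 1.0 reproduces the raw shares; smaller values flatten the
    mixture so that low-resource languages are sampled more often.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    weights = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical token counts (in billions); not figures from any real corpus.
counts = {"en": 900, "hi": 90, "tl": 10}

raw_shares = rebalance(counts, alpha=1.0)   # the corpus as crawled
flattened = rebalance(counts, alpha=0.5)    # English down, Hindi and Tagalog up

for lang in counts:
    print(f"{lang}: raw {raw_shares[lang]:.3f} -> rebalanced {flattened[lang]:.3f}")
```

The design choice is the exponent. Push `alpha` toward zero and every language gets a nearly equal share, which oversamples the small ones; leave it at one and English dominates exactly as it did in the crawl.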
DatologyAI came out of stealth in February 2024 with $11.65M and a quietly ambitious roadmap. Three months later, Felicis led a $46M Series A. Existing backers Amplify and Radical doubled down. Microsoft's M12 and the Amazon Alexa Fund joined. Elad Gil wrote a check. And then - perhaps the most unusual signal - the three angels at the very top of the field: Hinton, LeCun, Dean.
Ari Morcos
Harvard PhD in neuroscience. Five years at Meta AI. Two at DeepMind. Wanted to study how neural networks actually work, ended up productizing the answer.
Matthew Leavitt
Ex-head of data research at MosaicML (acquired by Databricks). Has opinions about token-level deduplication that you would not believe.
Bogdan Gaza
A decade of infrastructure at Amazon and Twitter, then CTO at Moonsense. The one who actually has to make the petabytes move.
The founding three: a neuroscientist, a data-research lead, and the engineer who keeps the floor from caving in
03 The Product
DatologyAI is sold as a platform, but it functions more like a service-with-software-attached. A customer brings a raw corpus - their crawl, their internal documents, their proprietary medical records, whatever. Datology runs it through a pipeline that does the boring, important things: deduplication, quality filtering, model-based scoring, distribution matching, source rephrasing, synthetic augmentation. The output is a curated dataset that, the company claims, lets a customer train a model faster, at a smaller size, and to higher quality than the raw input would have allowed.
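To make the first two stages concrete, here is a minimal sketch of deduplication and quality filtering in Python. It is not DatologyAI's pipeline, whose internals are not public. The function names, the thresholds, and the alphabetic-ratio heuristic are all assumptions made for illustration, and the pairwise comparison is a toy: at real scale, approximate methods such as MinHash would replace it.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles, used to compare documents by overlap."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def looks_low_quality(text: str, min_words: int = 5, min_alpha: float = 0.6) -> bool:
    """Hypothetical heuristic: too short, or mostly non-alphabetic characters."""
    if len(text.split()) < min_words:
        return True
    non_space = [c for c in text if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    return alpha_ratio < min_alpha


def curate(docs, near_dup_threshold: float = 0.8):
    """Drop low-quality docs, exact duplicates, then near-duplicates."""
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        if looks_low_quality(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= near_dup_threshold
               for other in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept
```

Calling `curate` on a list of strings returns the survivors in their original order. The later stages the paragraph names, such as model-based scoring and synthetic augmentation, need trained models and are out of scope for a sketch this small.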
The numbers they put on this are admittedly load-bearing. They have publicly cited equivalent performance at one-tenth the compute. That is a strong claim. It is also exactly the kind of claim a company in this space has to make - the value proposition is meaningless without it.
[Figures: "A Two-Year Sprint" and "Funding, Plotted". Caption: Eighteen months, two rounds, three of the most consequential AI angels alive.]
04 The Proof
It is one thing to claim that data curation matters. It is another to point at customers who paid for it and saw the difference. Arcee AI, a frontier open-model lab, has publicly built models on DatologyAI's curated data and reported quality gains. Thomson Reuters - a 175-year-old company that takes its data very seriously - has reported measurable improvements on legal reasoning and retrieval after running its corpora through the platform.
These are not warm testimonials about how friendly the team is. They are specific claims about model performance on real tasks, made by customers who have other options. Which, in this market, is the only kind of proof that counts.
05 The Mission
The phrase DatologyAI uses on its About page is "democratize AI data curation." That sounds, on first read, like a marketing line. Hold it up to the light, though, and there is something specific in it. Right now the small number of organizations that can train state-of-the-art models do so partly because they have the GPUs - and partly because they have the data teams. The GPUs are buyable. The data teams are not.
If DatologyAI succeeds, that second moat gets a great deal shallower. A mid-sized enterprise that has never employed a data-quality researcher in its life can suddenly produce a competitive corpus. A research lab can iterate on dataset design the way it iterates on model design. The barrier to building meaningful models moves down, and the centre of gravity in AI shifts a little.
That is what democratization actually looks like when you write the word into a deck. It is also, not coincidentally, the kind of mission that explains why Hinton and LeCun put their own money in. Both of them have been arguing for years that the field over-indexes on model architecture. Datology is, in a sense, the operationalization of their long-standing complaint.
06 Why It Matters Tomorrow
The next year of AI is going to be expensive. Frontier training runs are still climbing. Inference costs are eating margins at every company that ships an AI feature. Everyone is hunting for efficiency. Some of that efficiency will come from better silicon. Some from better algorithms. And, if DatologyAI is right, a meaningful slice will come from being smarter about what you feed the model in the first place.
The interesting thing about a data-curation company is that it gets more valuable as the rest of AI gets more expensive. Every dollar that GPU prices climb makes the case for trimming the corpus stronger. Every regulation that constrains data use makes the case for synthetic augmentation stronger. The thesis ages well.
Back in Redwood City, the engineer is still looking at her 200-billion-token corpus. She runs a final pass. The pipeline drops another six billion tokens, flags a billion as low-confidence, and rebalances the language distribution so that Tagalog and Hindi are not crushed under the weight of English. A model will train on this tomorrow, somewhere, on someone else's GPUs. It will get better than it would have. Nobody will see the work that went into the data. Which, here, is exactly the point.
Filed by the YesPress desk · Sources verified May 2026 · Numbers are public-record; opinions are ours