DatologyAI

On a Tuesday in Redwood City, a research engineer at DatologyAI is staring at a 200-billion-token training corpus and deciding which 40 billion of them are worth keeping. The other 160 billion - duplicates, near-duplicates, garbage tokens, low-signal web pages, scraped advertisements, all the residue of a hungry crawler - will be discarded. This is what they do here. The unglamorous part of artificial intelligence. The part that nobody puts on a keynote slide.

And yet this is the company that Geoff Hinton, Yann LeCun, and Jeff Dean all personally wrote checks for. Not the next generative-art startup. Not another wrapper on GPT. This one. The data janitors.

Better data, not bigger models, is the real scaling law no one wants to talk about. — The DatologyAI Thesis, paraphrased

01 The Problem They Saw

Here is the embarrassing truth of modern AI. Most of the money pouring into the field has gone toward making models bigger - more parameters, more layers, more GPUs in the rack. Far less has gone toward making the data better. Researchers will spend a year refining attention mechanisms and then train the model on whatever was sitting around in Common Crawl. The garbage-in, garbage-out joke has been a joke for so long it stopped being funny.

Ari Morcos noticed. He spent five years at Meta's AI lab and two before that at DeepMind, watching teams burn enormous amounts of compute to chase tenths of a percentage point on benchmarks while their underlying training corpora were, charitably, a mess. The dirty secret of frontier labs is that data work happens. It is just done by hand, by small heroic teams, with bespoke pipelines that never leave the building.

Matthew Leavitt was running data research at MosaicML when Databricks acquired it in 2023. Bogdan Gaza had built data infrastructure at Amazon and Twitter for a decade. The three of them looked at the landscape - everyone scaling models, almost no one scaling data craft - and decided to start a company on a single contrarian wager.

Every AI team has a data problem. Almost none of them have a data team. — DatologyAI's pitch deck, in one sentence

02 The Founders' Bet

The bet is this. Data curation at petabyte scale is not a chore. It is a competitive advantage that can be productized. Filter the right tokens out. Identify the highest-signal examples. Find the gaps. Balance the distribution. Generate synthetic data to plug the holes. Do all of this automatically, with deep learning techniques that scale to the size of the modern internet.

If you do it well, the same model architecture gets meaningfully better on the same compute budget. Or - more interestingly - you can hit the same quality with a much smaller model. Given the price of a frontier training run, that difference is a very large sum of money with very real implications.
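To put rough numbers on that wager, here is a minimal sketch using the widely quoted rule of thumb that training cost is about 6 x parameters x tokens in floating-point operations. The 7-billion-parameter model size is a hypothetical chosen for illustration; the 200-billion and 40-billion token counts are borrowed from the opening scene. The arithmetic shows only the compute side. Whether quality holds up on the smaller corpus is the company's claim, not something this calculation can demonstrate.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rough training cost from the common C ~ 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


# Hypothetical model size, for illustration only.
PARAMS = 7e9

# One pass over the raw corpus versus one pass over the curated subset.
baseline = training_flops(PARAMS, tokens=200e9)
curated = training_flops(PARAMS, tokens=40e9)

# Ratio of compute spent: how many times cheaper the curated run is.
savings = baseline / curated

print(f"raw corpus:     {baseline:.2e} FLOPs")
print(f"curated corpus: {curated:.2e} FLOPs")
print(f"compute ratio:  {savings:.1f}x")
```

At a fixed model size the saving is simply the ratio of token counts, which is why trimming the corpus is attractive only if the tokens that remain carry the signal.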

DatologyAI came out of stealth in February 2024 with $11.65M and a quietly ambitious roadmap. Three months later, Felicis led a $46M Series A. Existing backers Amplify and Radical doubled down. Microsoft's M12 and the Amazon Alexa Fund joined. Elad Gil wrote a check. And then - perhaps the most unusual signal - the three angels at the very top of the field: Hinton, LeCun, Dean.

CEO · Ari Morcos

Harvard PhD in neuroscience. Five years at Meta AI. Two at DeepMind. Wanted to study how neural networks actually work, ended up productizing the answer.

CSO · Matthew Leavitt

Ex-head of data research at MosaicML (acquired by Databricks). Has opinions about token-level deduplication that you would not believe.

CTO · Bogdan Gaza

A decade of infrastructure at Amazon and Twitter, then CTO at Moonsense. The one who actually has to make the petabytes move.

The founding three: a neuroscientist, a data-research lead, and the engineer who keeps the floor from caving in

03 The Product

DatologyAI is sold as a platform, but it functions more like a service-with-software-attached. A customer brings a raw corpus - their crawl, their internal documents, their proprietary medical records, whatever. Datology runs it through a pipeline that does the boring, important things: deduplication, quality filtering, model-based scoring, distribution matching, source rephrasing, synthetic augmentation. The output is a curated dataset that, the company claims, lets a model train faster, at a smaller size, and to higher quality than the raw input would have allowed.
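For readers who want a feel for the first two stages, here is a toy sketch of exact deduplication followed by a crude quality filter. It is not DatologyAI's pipeline, which is not public in this detail. The function names, the alphabetic-character heuristic, and the thresholds are all assumptions made up for illustration; a production system would use near-duplicate detection and learned quality models instead.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def quality_score(text: str) -> float:
    """Toy heuristic: share of alphabetic characters among non-space characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def curate(docs: list[str], min_score: float = 0.8, min_words: int = 5) -> list[str]:
    """Drop exact duplicates (after normalization), short docs, and low-score docs."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already processed
        seen.add(digest)
        if len(norm.split()) < min_words:
            continue  # too short to carry much signal
        if quality_score(norm) < min_score:
            continue  # mostly symbols, digits, or URL debris
        kept.append(doc)
    return kept


# Hypothetical mini-corpus, for illustration only.
raw = [
    "The court held that the contract was void for lack of consideration.",
    "the court held  that the contract was void for lack of consideration.",
    "BUY NOW!!! $$$ 50% OFF >>> http://x.example/?id=123&ref=456",
    "Click here",
]
clean = curate(raw)
print(f"kept {len(clean)} of {len(raw)} documents")
```

Hashing the normalized text catches only exact repeats. The near-duplicates the article mentions need fuzzier methods, which is where the real engineering effort goes.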

The numbers they put on this are admittedly load-bearing. They have publicly cited equivalent performance at one-tenth the compute. That is a strong claim. It is also exactly the kind of claim a company in this space has to make - the value proposition is meaningless without it.

Same model, same architecture, better diet. Suddenly it's a different animal. — What DatologyAI is actually selling

A Two-Year Sprint

SEPTEMBER 2023

Company founded by Ari Morcos, Matthew Leavitt and Bogdan Gaza in Redwood City, California.

FEBRUARY 2024

Emerges from stealth with $11.65M seed led by Amplify Partners.

MAY 2024

Closes $46M Series A led by Felicis. Hinton, LeCun and Dean among the angel investors. Total raised: $57.65M.

2024 - 2025

Headcount climbs past 50. Customers include Arcee AI and Thomson Reuters.

2025

Ships ÜberWeb, a multilingual curation suite designed to match top open models on a fraction of compute.

NOW

Petabyte-scale curation in production. Hiring across research, infra and go-to-market.

Funding, Plotted

DatologyAI capital raised (USD millions, public sources)

Seed · Sep 2023: $11.65M

Series A · May 2024: $46.00M

Total to date: $57.65M

Eighteen months, two rounds, three of the most consequential AI angels alive

04 The Proof

It is one thing to claim that data curation matters. It is another to point at customers who paid for it and saw the difference. Arcee AI, a frontier open-model lab, has publicly built models on DatologyAI's curated data and reported quality gains. Thomson Reuters - a 175-year-old company that takes its data very seriously - has reported measurable improvements on legal reasoning and retrieval after running its corpora through the platform.

These are not warm testimonials about how friendly the team is. They are specific claims about model performance on real tasks, made by customers who have other options. Which, in this market, is the only kind of proof that counts.

When a 175-year-old news wire decides your pipeline beats their pipeline, you stop calling it a hypothesis. — On the Thomson Reuters validation

05 The Mission

The phrase DatologyAI uses on its About page is "democratize AI data curation." That sounds, on first read, like a marketing line. Hold it up to the light, though, and there is something specific in it. Right now the small number of organizations that can train state-of-the-art models do so partly because they have the GPUs - and partly because they have the data teams. The GPUs are buyable. The data teams are not.

If DatologyAI succeeds, that second moat gets a great deal shallower. A mid-sized enterprise that has never employed a data-quality researcher in its life can suddenly produce a competitive corpus. A research lab can iterate on dataset design the way it iterates on model design. The barrier to building meaningful models moves down, and the centre of gravity in AI shifts a little.

That is what democratization actually looks like when you write the word into a deck. It is also, not coincidentally, the kind of mission that explains why Hinton and LeCun put their own money in. Both of them have been arguing for years that the field over-indexes on model architecture. Datology is, in a sense, the operationalization of their long-standing complaint.

06 Why It Matters Tomorrow

The next year of AI is going to be expensive. Frontier training runs are still climbing. Inference costs are eating margins at every company that ships an AI feature. Everyone is hunting for efficiency. Some of that efficiency will come from better silicon. Some from better algorithms. And, if DatologyAI is right, a meaningful slice will come from being smarter about what you feed the model in the first place.

The interesting thing about a data-curation company is that it gets more valuable as the rest of AI gets more expensive. Every dollar that GPU prices climb makes the case for trimming the corpus stronger. Every regulation that constrains data use makes the case for synthetic augmentation stronger. The thesis ages well.

The bet is asymmetric. If they are right, every training run on earth eventually runs through something like this. — The case for Datology, distilled

Back in Redwood City, the engineer is still looking at her 200-billion-token corpus. She runs a final pass. The pipeline drops another six billion tokens, flags a billion as low-confidence, and rebalances the language distribution so that Tagalog and Hindi are not crushed under the weight of English. A model will train on this tomorrow, somewhere, on someone else's GPUs. It will get better than it would have. Nobody will see the work that went into the data. Which, here, is exactly the point.
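The rebalancing step in that last pass can be sketched with a standard trick from multilingual training: sample each language in proportion to its token count raised to a power below one, which lifts small languages without inverting the ranking. This is a generic technique swapped in for illustration, not a description of DatologyAI's method, and the token counts and the exponent of 0.5 are hypothetical.

```python
def rebalanced_weights(token_counts: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    """Sampling weights proportional to count**alpha; alpha < 1 flattens the mix."""
    scaled = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Hypothetical token counts per language, for illustration only.
counts = {"en": 900e9, "hi": 90e9, "tl": 10e9}

# Share of the corpus each language holds before rebalancing.
grand_total = sum(counts.values())
raw_share = {lang: c / grand_total for lang, c in counts.items()}

# Share of training samples each language gets after rebalancing.
weights = rebalanced_weights(counts, alpha=0.5)

for lang in counts:
    print(f"{lang}: raw {raw_share[lang]:.3f} -> rebalanced {weights[lang]:.3f}")
```

An exponent of 1 reproduces the raw distribution and an exponent of 0 gives every language an equal share, so the value chosen is a dial between fidelity to the crawl and fairness to the long tail.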


Filed by the YesPress desk · Sources verified May 2026 · Numbers are public-record; opinions are ours