A 31-person company in a Menlo Park office is quietly assembling the largest fact-checked database of the public web. Most people have used it without knowing.
It is a Tuesday morning in 2026, and somewhere in Menlo Park a fleet of crawlers is reading the internet. Not skimming it. Reading it - rendering each page in a virtual browser, looking at it the way a human would, and writing down what it sees. Article. Author. Date. Product. Price. Person. Title. Employer. A patient act of comprehension, repeated billions of times before lunch.
The robots do not get bored. They have been doing this since 2008, before the iPhone had an App Store, before "AI" was a marketing department. They feed a database that now holds more than a trillion facts.
The company running them is called Diffbot. It has 31 employees.
Most AI companies make headlines by generating things. Diffbot's bet has always been quieter and stranger: that the more interesting problem is reading. That structure is the bottleneck. That if you can turn the web into a database, every downstream question - search, sales, research, grounded generation - gets easier.
Eighteen years later, that bet is paying off in a way the rest of the industry is now scrambling to imitate.
The pitch is almost embarrassingly simple. The web is the largest and richest dataset humanity has ever made. It is also a mess - text wrapped in markup wrapped in ads wrapped in JavaScript. To use it, you scrape. To trust it, you cross-check. To scale it, you give up.
Diffbot's answer is to treat web pages the way a person does: visually. Its extractors render a page in a real browser, look at the pixels, and decide what the page is about. Article? Product? Forum thread? A profile of a person? Then the relevant fields - byline, price, ratings, author - come out as structured JSON. No fragile CSS selectors. No site-specific scripts.
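For readers who want to see what "point at a URL, get JSON back" might look like in practice, here is a minimal sketch. It assumes a v3 Analyze endpoint that takes `token` and `url` query parameters and returns an `objects` list whose items carry a `type` field; those details, along with the field names and the sample response, are assumptions for illustration and should be checked against Diffbot's own API documentation. The sketch makes no network call. It only builds the request URL and tidies up a hand-written response.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against Diffbot's API documentation before use.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_analyze_url(page_url: str, token: str) -> str:
    """Build the request URL for one page (parameter names are assumptions)."""
    query = urlencode({"token": token, "url": page_url})
    return f"{ANALYZE_ENDPOINT}?{query}"


def summarize(response: dict) -> list:
    """Reduce a parsed JSON response to the page type plus a few common fields."""
    wanted = ("title", "author", "date", "offerPrice")  # assumed field names
    summary = []
    for obj in response.get("objects", []):
        row = {"type": obj.get("type", "unknown")}
        row.update({key: obj[key] for key in wanted if key in obj})
        summary.append(row)
    return summary


# Hypothetical response, hand-written for illustration only.
sample_response = {
    "objects": [
        {
            "type": "article",
            "title": "Example headline",
            "author": "A. Writer",
            "date": "2026-01-13",
            "html": "<p>...</p>",
        }
    ]
}

if __name__ == "__main__":
    print(build_analyze_url("https://example.com/story", "YOUR_TOKEN"))
    print(summarize(sample_response))
```

The point of the sketch is what is absent: there is no selector, no per-site rule, and nothing that would break when a publisher redesigns its pages. The caller names a URL and reads back typed fields.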
Run that across the entire public web, continuously, for almost two decades, and you do not have a scraper. You have a knowledge graph.
Diffbot's graph now spans more than 10 billion entities - people, organizations, articles, products, places - linked by over a trillion facts. In January 2026 alone the company added 50 billion new facts, 30 million new organizations, and 600 million new articles. Most companies celebrate quarterly product launches. Diffbot celebrates a slow news month.
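Monthly totals of that size are hard to picture, so it can help to convert them into a per-second rate. The short calculation below does that for the January 2026 figures quoted above. It assumes the additions were spread evenly across the month's 31 days, which is a simplification, and the five-minute reading time is likewise only an illustrative assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_IN_JANUARY = 31

# Monthly totals quoted in the article for January 2026.
added_in_january = {
    "facts": 50_000_000_000,
    "organizations": 30_000_000,
    "articles": 600_000_000,
}


def per_second(monthly_total, days=DAYS_IN_JANUARY):
    """Average additions per second, assuming an even rate all month."""
    return monthly_total / (days * SECONDS_PER_DAY)


def added_during(monthly_total, seconds, days=DAYS_IN_JANUARY):
    """Additions expected in a time window at that same average rate."""
    return per_second(monthly_total, days) * seconds


if __name__ == "__main__":
    for name, total in added_in_january.items():
        print(f"{name}: about {per_second(total):,.0f} per second")
    reading_time = 5 * 60  # assumed five-minute read
    facts = added_during(added_in_january["facts"], reading_time)
    print(f"facts added during a five-minute read: about {facts:,.0f}")
```

On those assumptions the graph gains somewhat under 20,000 facts every second, or between five and six million in the time it takes to read a short article.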
When generative AI arrived, the industry's first instinct was to make models bigger. Diffbot's instinct, predictably, was to make them honest. In January 2025 the company released Diffbot LLM - a fine-tune of Meta's Llama 3.3, plugged directly into the knowledge graph. Ask it a question and it answers with citations to specific facts, retrieved at query time. The company called it the first open-source GraphRAG implementation. You can try it at diffy.chat.
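The idea of answering from facts retrieved at query time, each with a citation, can be shown with a toy. The sketch below is not Diffbot's implementation and involves no language model: a tiny hand-written list stands in for the knowledge graph, a substring match stands in for retrieval, and a fixed template stands in for generation. The `Fact` class, the `retrieve` and `answer` functions, and the placeholder source URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    source: str  # where the fact was found


# Tiny hand-written stand-in for a knowledge graph; source URLs are placeholders.
GRAPH = [
    Fact("Diffbot", "founder", "Mike Tung", "https://example.com/about"),
    Fact("Diffbot", "headquarters", "Menlo Park", "https://example.com/contact"),
    Fact("Diffbot LLM", "base model", "Llama 3.3", "https://example.com/release"),
]


def retrieve(question, graph=GRAPH):
    """Return facts whose subject and predicate both appear in the question."""
    q = question.lower()
    return [
        fact for fact in graph
        if fact.subject.lower() in q and fact.predicate.lower() in q
    ]


def answer(question, graph=GRAPH):
    """Answer only from retrieved facts, numbering each one as a citation."""
    facts = retrieve(question, graph)
    if not facts:
        return "No supporting fact found."
    claims = [
        f"{fact.subject} {fact.predicate}: {fact.value} [{i}]"
        for i, fact in enumerate(facts, 1)
    ]
    citations = [f"[{i}] {fact.source}" for i, fact in enumerate(facts, 1)]
    return "\n".join(claims + citations)


if __name__ == "__main__":
    print(answer("Who is the founder of Diffbot?"))
    print(answer("What will the weather be tomorrow?"))
```

The design choice worth noticing is the refusal branch. When nothing in the graph supports an answer, the toy says so instead of improvising, which is the behavior the "citation instead of a guess" framing depends on.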
The model is not a chatbot dressed up to look like a research tool. It is a research tool that learned to chat.
Diffbot's customer list reads like the back of a cereal box you didn't know you'd been eating from. DuckDuckGo's instant answers. Snapchat's link previews. AOL. Bing. Adobe. Cisco. eBay. Salesforce. Samsung. CBS Interactive. If you have ever pasted a URL into a chat app and watched a tidy preview appear, the pipework was often Diffbot's.
The newer products - Enhance and LeadGraph - take the same graph and aim it at sales teams. Funding events become searchable. Org charts become reliable. The thing you used to pay three vendors for becomes one query.
Diffbot has been profitable. Diffbot has not gone public. Diffbot has not raised a Series B. It raised $10 million in February 2016, on top of an earlier $2 million seed from Matrix Partners and Tencent, and then it went back to work. Eighteen years. Thirty-one people. A trillion facts. In an industry that confuses motion with progress, this counts as a philosophical statement.
Founder Mike Tung - patent lawyer turned Stanford AI grad student turned engineer at eBay, Yahoo, and Microsoft - did not set out to build a unicorn. He set out to build a map of human knowledge. The map turned out to be the unicorn.
[Chart: a rough sketch of what Diffbot's crawlers added to the knowledge graph in a recent month. Bars are scaled relative to one another - the point is the order of magnitude, not the decimal.]
Knowledge Graph: 10B+ entities, trillion+ facts. Queryable, refreshable, sourced. The thing under everything else.
Extract: Article, Product, Discussion, Image, Video, Analyze. Point at a URL, get structured JSON back.
Crawl: Run extraction across entire sites at scale. The polite, distributed cousin of your homemade scraper.
Natural Language: Pull entities, relationships, and sentiment out of raw text - same ontology as the graph.
Diffbot LLM: Open-source GraphRAG model built on Llama 3.3. Cites the graph. Try it at diffy.chat.
Enhance and LeadGraph: Enrich CRM records, track funding events, find decision-makers - powered by the same graph.
Mike Tung, founder: UC Berkeley EECS. Stanford AI Lab. Stints as a patent lawyer and as an engineer at eBay, Yahoo, and Microsoft. Started Diffbot the year smartphones learned to multitask, and has spent the time since trying to teach robots to read.
It is still that Tuesday morning in Menlo Park. The crawlers are still reading. At January's pace, the graph grew by several million facts while you were on this page. Somewhere a salesperson opens a CRM and finds a lead enriched, a developer pastes a URL and gets clean JSON back, a chatbot answers a question with a citation instead of a guess. Most of the people on the receiving end will never know the name Diffbot. That is, in a way, the point.