Diffbot

The Scene

It is a Tuesday morning in 2026, and somewhere in Menlo Park a fleet of crawlers is reading the internet. Not skimming it. Reading it - rendering each page in a virtual browser, looking at it the way a human would, and writing down what it sees. Article. Author. Date. Product. Price. Person. Title. Employer. A patient act of comprehension, repeated billions of times before lunch.

The robots do not get bored. They have been doing this since 2008, the year the iPhone got its App Store, back before "AI" was a marketing department. They feed a database that now holds more than a trillion facts.

The company running them is called Diffbot. It has 31 employees.

"Diffbot's AI model doesn't guess - it knows, thanks to a trillion-fact knowledge graph." — VentureBeat, 2025

Most AI companies make headlines by generating things. Diffbot's bet has always been quieter and stranger: that the more interesting problem is reading. That structure is the bottleneck. That if you can turn the web into a database, every downstream question - search, sales, research, grounded generation - gets easier.

Eighteen years later, that bet is paying off in a way the rest of the industry is now scrambling to imitate.

The Idea

What if the web came pre-structured?

The pitch is almost embarrassingly simple. The web is the largest and richest dataset humanity has ever made. It is also a mess - text wrapped in markup wrapped in ads wrapped in JavaScript. To use it, you scrape. To trust it, you cross-check. To scale it, you give up.

Diffbot's answer is to treat web pages the way a person does: visually. Its extractors render a page in a real browser, look at the pixels, and decide what kind of page it is. Article? Product? Forum thread? A profile of a person? Then the fields that matter for that type - byline, date, price, ratings - come out as structured JSON. No fragile CSS selectors. No site-specific scripts.
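To make the two-step idea concrete (decide the page type, then emit the fields that type calls for), here is a toy sketch. It is not Diffbot's extractor: the visual classification step is replaced by a label supplied by hand, and `FIELDS_BY_TYPE`, `to_record`, and the sample values are hypothetical names invented for illustration.

```python
import json

# Hypothetical schema: which fields each page type is expected to yield.
FIELDS_BY_TYPE = {
    "article": ["title", "author", "date", "text"],
    "product": ["title", "price", "rating"],
    "discussion": ["title", "posts"],
}


def to_record(page_type: str, raw: dict) -> str:
    """Keep only the fields defined for this page type and serialize to JSON.

    Fields the schema expects but the page lacks come out as null, so every
    record of a given type has the same shape.
    """
    if page_type not in FIELDS_BY_TYPE:
        raise ValueError(f"unknown page type: {page_type}")
    record = {"type": page_type}
    for field in FIELDS_BY_TYPE[page_type]:
        record[field] = raw.get(field)
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Made-up values standing in for what a visual extractor might find.
    raw = {"title": "Example Widget", "price": "19.99", "sidebar_ad": "ignore me"}
    print(to_record("product", raw))
```

The point of the sketch is the output contract: the caller gets the same fields for every page of a given type, whatever the underlying markup looked like.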

Run that across the entire public web, continuously, for almost two decades, and you do not have a scraper. You have a knowledge graph.

A trillion facts is not a metaphor

Diffbot's graph now spans more than 10 billion entities - people, organizations, articles, products, places - linked by over a trillion facts. In January 2026 alone the company added 50 billion new facts, 30 million new organizations, and 600 million new articles. Most companies celebrate quarterly product launches. Diffbot celebrates a slow news month.
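For a sense of scale, the January figure can be turned into an average rate. The short calculation below assumes, purely for illustration, that the 50 billion facts arrived evenly across the month's 31 days; the function name is hypothetical. Under that assumption the intake works out to somewhere between 18,000 and 19,000 new facts per second.

```python
# Figure quoted in the text for January 2026.
NEW_FACTS = 50_000_000_000
SECONDS_IN_JANUARY = 31 * 24 * 60 * 60  # 2,678,400 seconds


def facts_per_second(new_facts: int, seconds: int) -> float:
    """Average intake rate, assuming facts arrive evenly through the month."""
    return new_facts / seconds


if __name__ == "__main__":
    rate = facts_per_second(NEW_FACTS, SECONDS_IN_JANUARY)
    print(f"about {rate:,.0f} facts per second")
```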

The grounded-AI bet

When generative AI arrived, the industry's first instinct was to make models bigger. Diffbot's instinct, predictably, was to make them honest. In January 2025 the company released Diffbot LLM - a fine-tune of Meta's Llama 3.3, plugged directly into the knowledge graph. Ask it a question and it answers with citations to specific facts, retrieved at query time. The company called it the first open-source GraphRAG implementation. You can try it at diffy.chat.
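The retrieve-then-cite pattern described above can be shown in miniature. What follows is a toy sketch, not Diffbot LLM's implementation: the three-fact `GRAPH`, the `Fact` class, and the `retrieve` and `answer` functions are all hypothetical, and no language model is involved. It only illustrates the ordering: look the facts up at query time, answer from what was found, attach the source, and decline when nothing matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    source: str  # where the fact was found


# Toy stand-in for a knowledge graph; the facts are taken from this article.
GRAPH = [
    Fact("Diffbot", "founder", "Mike Tung", "this article"),
    Fact("Diffbot", "employees", "31", "this article"),
    Fact("Diffbot LLM", "base model", "Llama 3.3", "this article"),
]


def retrieve(subject: str, predicate: str) -> list[Fact]:
    """Look up matching facts at query time instead of relying on memory."""
    return [f for f in GRAPH if f.subject == subject and f.predicate == predicate]


def answer(subject: str, predicate: str) -> str:
    """Compose an answer only from retrieved facts, each tagged with its source."""
    facts = retrieve(subject, predicate)
    if not facts:
        return "No supporting fact found."  # decline rather than guess
    return "; ".join(
        f"{f.subject} {f.predicate}: {f.obj} [source: {f.source}]" for f in facts
    )


if __name__ == "__main__":
    print(answer("Diffbot", "founder"))
    print(answer("Diffbot", "IPO date"))
```

In a real system the lookup would hit a graph with billions of entities and a model would phrase the reply; the sketch keeps only the part that makes the answer checkable.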

The model is not a chatbot dressed up to look like a research tool. It is a research tool that learned to chat.

Who uses it

Diffbot's customer list reads like the back of a cereal box you didn't know you'd been eating from. DuckDuckGo's instant answers. Snapchat's link previews. AOL. Bing. Adobe. Cisco. eBay. Salesforce. Samsung. CBS Interactive. If you have ever pasted a URL into a chat app and watched a tidy preview appear, the pipework was often Diffbot's.

The newer products - Enhance and LeadGraph - take the same graph and aim it at sales teams. Funding events become searchable. Org charts become reliable. The thing you used to pay three vendors for becomes one query.

The patient company

Diffbot has been profitable. Diffbot has not gone public. Diffbot has not raised a Series B. It raised a $10 million Series A in February 2016, on top of an earlier $2 million seed from Matrix Partners and Tencent, and then it went back to work. Eighteen years. Thirty-one people. A trillion facts. In an industry that confuses motion with progress, this counts as a philosophical statement.

Founder Mike Tung - patent lawyer turned Stanford AI grad student turned engineer at eBay, Yahoo, and Microsoft - did not set out to build a unicorn. He set out to build a map of human knowledge. The map turned out to be the unicorn.

What's on offer

Six products. One graph.

Knowledge Graph

10B+ entities, trillion+ facts. Queryable, refreshable, sourced. The thing under everything else.

Extract APIs

Article, Product, Discussion, Image, Video, Analyze. Point at a URL, get structured JSON back.

Crawlbot

Run extraction across entire sites at scale. The polite, distributed cousin of your homemade scraper.

Natural Language API

Pull entities, relationships, and sentiment out of raw text - same ontology as the graph.

Diffbot LLM (Diffy)

Open-source GraphRAG model on Llama 3.3. Cites the graph. Try it at diffy.chat.

Enhance & LeadGraph

Enrich CRM records, track funding events, find decision-makers - powered by the same graph.
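Of the products above, the Extract APIs are the easiest to picture in code: "point at a URL" amounts to one HTTP GET. The sketch below only builds the request URL and sends nothing. The base URL, the endpoint names, and the `token` and `url` query parameters reflect the v3 pattern as I understand it and should be treated as assumptions to check against Diffbot's current documentation; `build_extract_url` is a hypothetical helper, not part of any official client.

```python
from urllib.parse import urlencode

# Assumed base URL and endpoint names; verify against Diffbot's current docs.
BASE = "https://api.diffbot.com/v3"
ENDPOINTS = {"article", "product", "discussion", "image", "video", "analyze"}


def build_extract_url(kind: str, page_url: str, token: str) -> str:
    """Return the GET URL for one extraction request (no network call here)."""
    if kind not in ENDPOINTS:
        raise ValueError(f"unknown extract endpoint: {kind}")
    # urlencode percent-escapes the target page URL so it survives as a
    # single query parameter.
    query = urlencode({"token": token, "url": page_url})
    return f"{BASE}/{kind}?{query}"


if __name__ == "__main__":
    # "YOUR_TOKEN" is a placeholder, not a working credential.
    print(build_extract_url("article", "https://example.com/a?b=1", "YOUR_TOKEN"))
```

Fetching that URL with any HTTP client should return the structured JSON the text describes, assuming a valid token.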

Chart: The graph, in monthly intake - new entities added per month (relative scale).

The pipework behind names you know.

From patent law to a map of knowledge.

What Diffbot has been up to.

Going deeper.

Mike Tung on building an autonomous knowledge graph

Diffbot's official YouTube - APIs, Diffy, and walkthroughs
