Mike Tung, Founder and CEO of Diffbot
Profile • Founder • AI Researcher

Mike Tung

The Architect of the World's Largest Knowledge Graph

"He didn't scrape the web. He taught machines to read it."

Founder & CEO, Diffbot • Mountain View, CA • UC Berkeley EECS • Stanford AI Lab
10B+ Entities in Knowledge Graph
1T+ Structured Facts
$13M Total Funding Raised
400+ Enterprise Customers
15+ Years Building Diffbot
4-5 Days Per Full Rebuild

The Man Who Taught Machines to Read the Internet

His Twitter bio says it plainly: "CEO at Diffbot, world's largest knowledge graph. Mostly here to read papers." Mike Tung has been quietly building one of the most consequential AI infrastructure companies in Silicon Valley for 15 years, funded on less money than most startups spend getting to product-market fit, and powering the searches of hundreds of millions of people who have never heard of him.

Diffbot's knowledge graph was first built by crawling 50 billion URLs, and it is now rebuilt from scratch every four to five days. It contains over 10 billion entities - people, companies, products, articles, events - resolved, linked, and structured into over a trillion facts. It is, without serious dispute, the largest automated knowledge graph on earth. Microsoft Bing queries it. DuckDuckGo uses it for knowledge panels. Snapchat feeds it into trending news. Adidas runs counterfeit detection against it. And none of that required a unicorn valuation or a $500M Series C.

The Math

Diffbot raised approximately $13 million across its entire history. The average Series A in 2024 was $18.7 million. Tung built the world's most factually accurate AI database on less funding than a standard first institutional round - and reached profitability on it.

Tung graduated from UC Berkeley with a degree in Electrical Engineering and Computer Sciences in 2002, then moved through Microsoft, eBay, and Yahoo before landing in Stanford's AI Lab. The PhD program did not hold him. The specific problem did: machines couldn't read web pages the way humans could. HTML scrapers worked fine until the site changed its template. The rules broke. You rewrote them. They broke again. Tung decided to stop writing rules and start writing perception.
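A minimal sketch can illustrate why hand-written rules are brittle. Everything here is hypothetical and is not Diffbot code: the two page templates, the class names, and the `extract_title_by_rule` helper are assumptions made for illustration. A rule keyed to one template's markup returns nothing once the site renames a class, even though a human reader would see the same page.

```python
import re
from typing import Optional

# Hypothetical rule written against the site's first template:
# the title is expected inside <h1 class="post-title">...</h1>.
TITLE_RULE = re.compile(r'<h1 class="post-title">(.*?)</h1>', re.DOTALL)


def extract_title_by_rule(html: str) -> Optional[str]:
    """Return the title if the hand-written rule matches, else None."""
    match = TITLE_RULE.search(html)
    return match.group(1).strip() if match else None


# Template A: the markup the rule was written for.
old_page = '<html><body><h1 class="post-title">Hello, web</h1></body></html>'
# Template B: identical to a human reader, but the class was renamed.
new_page = '<html><body><h1 class="headline">Hello, web</h1></body></html>'

title_before = extract_title_by_rule(old_page)  # the rule matches
title_after = extract_title_by_rule(new_page)   # the rule silently fails
```

The failure is silent: the second call returns `None` without raising an error, which is why such rules tend to break unnoticed until someone rewrites them.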

He left Stanford's PhD program to pursue that perception problem. To pay rent, he filed patents - the kind of meticulous legal document work that earns around $20,000 per filing - while simultaneously developing the mathematical foundations of what would become Diffbot's computer vision engine. His meals were beans, rice, and ramen. His ambition was considerably larger.

"The technology is scouring the web and is trying to simulate what a human being is doing when they're on the page."
- Mike Tung, Diffbot Founder & CEO

The first public Diffbot APIs launched in August 2011. One call, one URL, structured data back. The price: $0.008. Less than a penny per page read. Tung called it a "decoder ring for the web." The model was deliberately transactional - no subscriptions, no minimums, pay for what you use. It turned out companies had a lot of URLs they needed decoded.
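The pay-per-call arithmetic is easy to sketch. The `estimate_cost` helper below is a hypothetical illustration, not part of any Diffbot client; it assumes only the $0.008 per-URL launch price quoted above and ignores any volume discounts or later pricing changes.

```python
from decimal import Decimal

# Launch price quoted above, in dollars per URL (assumed flat, no discounts).
PRICE_PER_CALL = Decimal("0.008")


def estimate_cost(url_count: int) -> Decimal:
    """Estimated bill in dollars for url_count pay-per-call requests."""
    if url_count < 0:
        raise ValueError("url_count must be non-negative")
    return PRICE_PER_CALL * url_count


cost_one_page = estimate_cost(1)
cost_million_pages = estimate_cost(1_000_000)
```

At that rate, a customer decoding a million URLs would owe $8,000, with no subscription or minimum on top.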

In May 2012, Diffbot raised $2 million in seed funding from a roster that read like a who's-who of early Silicon Valley infrastructure bets: Andy Bechtolsheim (who wrote the first check to Google), Joi Ito (MIT Media Lab director), Brad Garlinghouse (YouSendIt), Maynard Webb (eBay COO), Elad Gil (Twitter VP), and Jonathan Heiliger (Facebook VP). Sky Dayton, EarthLink's founder, joined the board. These were not people who invested in social apps. They were people who bet on infrastructure with long time horizons.

Stanford's StartX accelerator made Diffbot its first company investment. There is something appropriate about that. Diffbot is the kind of technical undertaking Stanford's AI lab would produce if it were running as a commercial operation: deeply academic in method, relentlessly practical in application.

"We're taking the Internet and converting it into semantic knowledge."
- Mike Tung

By 2015, Diffbot's extraction accuracy had reached 90-95%. Not rules-based. Not scraping. Computer vision trained to read arbitrary web pages at scale. In February 2016, Tencent led a $10M Series A, joined by Felicis Ventures, Amplify Partners, and Valor Capital. That brought total funding to roughly $13 million - a number Tung has never seemed particularly concerned about growing.

In September 2019, he published what he called "The Diffbot Master Plan (Part One)." The structure was almost deliberately Musk-like in its clarity and ambition, without the theatrics. Phase one: extraction APIs. Phase two: Crawlbot - whole-domain crawling built with Matt Wells, the founder of Gigablast, whom Tung hired as VP of Search. Phase three: the Knowledge Graph itself - 50 billion URLs, crawled over 50 months, entity-resolved, fact-checked, and continuously rebuilt. His framing was not about disruption. It was about completeness.

Google's Knowledge Graph, he noted quietly, relies heavily on human curation. Watson's too. "There's a lot of human beings behind the scenes creating the rules." Diffbot's is entirely automated. No human in the loop. The machine reads, structures, resolves, and updates. Every four to five days, from scratch.

The Knowledge Graph in Numbers

10+ billion entities across people, organizations, products, articles, events, and locations. 1 trillion+ structured facts. 98% of the internet covered in approximately 50 languages. 150 million new facts added monthly. Rebuilt completely every 4-5 days. No human curators.

The enterprise use cases are instructive in their variety. DuckDuckGo uses Diffbot for knowledge summaries - the panels that appear when you search a company or person. Snapchat uses it for trending news extraction. Adidas and Nike run counterfeit product detection against it. Nasdaq uses it for market intelligence. Crunchbase uses it for data enrichment. Adobe, Salesforce, and HubSpot are customers as well. The connective tissue of how companies know things about the world increasingly runs through Tung's infrastructure.

In January 2025, Tung launched the Diffbot LLM - an open-source language model built on Meta's LLaMA 3.3 architecture (70B and 8B versions) and grounded in the knowledge graph. The core design choice was deliberate: when the model doesn't know something with confidence, it says so instead of guessing. On MMLU-Pro, it scored 70.36% - best in its class among open-source models under 100 billion parameters. On FreshQA, which tests knowledge of recent events, it hit 81% - beating ChatGPT with web search, Google Gemini, and Perplexity.

The model is available as diffy.chat and open-sourced on GitHub. Tung describes it as the first open-source production GraphRAG system. In an AI landscape full of systems that confabulate convincingly, Diffbot built one that knows what it doesn't know.
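The "knows what it doesn't know" behavior can be illustrated with a toy sketch. This is not Diffbot's implementation: the `FACTS` store, the `answer` function, and the lookup keys are assumptions made for illustration, with sample facts taken from this profile. The idea is that an answer is produced only from retrieved facts, and a missing fact yields an explicit refusal instead of a guess.

```python
from typing import Dict, Tuple

# Toy stand-in for a knowledge graph: (entity, attribute) -> value.
FACTS: Dict[Tuple[str, str], str] = {
    ("Diffbot", "founder"): "Mike Tung",
    ("Diffbot", "founded"): "2008",
}

UNKNOWN = "I don't know."


def answer(entity: str, attribute: str) -> str:
    """Answer only from stored facts; abstain when nothing is retrieved."""
    value = FACTS.get((entity, attribute))
    if value is None:
        # Abstain instead of generating a plausible-sounding guess.
        return UNKNOWN
    return f"{entity} {attribute}: {value}"


grounded = answer("Diffbot", "founder")
abstained = answer("Diffbot", "headcount in 2030")
```

A real GraphRAG system would retrieve from a far larger graph and pass the retrieved facts to a language model for generation; the sketch shows only the abstain-on-miss decision.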

"We want to build the world's largest database of structured knowledge."
- Mike Tung

Tung's career choices have been consistently counterintuitive for the era they happened in. While his Stanford contemporaries were building social networks and photo-sharing apps in 2008, he was writing computer vision algorithms to parse arbitrary HTML at web scale. While the AI boom of 2021-2023 produced billion-dollar valuations for companies with little infrastructure, he was operating profitably with ~33 employees across North America and Europe. While every AI company in 2024 was racing to build the biggest model, he was building the most accurate one.

His team is small by design. The engineering team runs at roughly 13 people. There are approximately two people in marketing. The company generates around $3.1M in annual revenue. These are not the numbers of a company trying to become a unicorn. They are the numbers of a company that has decided its job is to get a very hard technical problem exactly right.

The DARPA Robotics Challenge gave him another data point for that disposition. He led Stanford's entry - a project defined by precision engineering under uncertainty, where the cost of being wrong was a robot falling on its face in front of judges. The parallels to building a knowledge graph that enterprises bet production systems on are not accidental.

Tung's aspiration, stated plainly and without embellishment: a complete, machine-readable database of all human knowledge on the internet - a factual backbone for AI systems that eliminates hallucination through grounded retrieval. He has been working toward it for fifteen years. The ticker says he is still going.

The Long Game

2001-2002: Software Development Engineer at Microsoft Corporation. Early industry role straight out of UC Berkeley EECS.
2002: Software Engineer at eBay. Early data-focused engineering on one of the web's largest commerce platforms.
2004-2006: Founding Machine Learning Engineer at TheFind, Inc. - a shopping search engine later acquired by Facebook.
2006: Data Scientist, Search and Marketplace Intelligence at Yahoo!
2006-2008: Graduate research at Stanford AI Lab. Simultaneously filed patents at ~$20K each to fund himself. Left the PhD to found Diffbot.
2007-2010: Founding Software Engineer at ClickTV (later acquired by Cisco).
2008: Founded Diffbot. Started with beans, rice, ramen, and a vision to teach machines to read the web.
Aug 2011: Launched first public Diffbot APIs. Pay-per-call at $0.008 per URL. The web as a structured data source.
May 2012: Raised $2M seed. Investors: Andy Bechtolsheim, Joi Ito, Elad Gil, Maynard Webb. StartX's first investment.
2015: Reached profitability. Extraction accuracy at 90-95%. Proof of concept becomes proof of business.
Feb 2016: $10M Series A led by Tencent. Felicis Ventures, Amplify Partners, Valor Capital also participate.
2018: Announced world's largest automated knowledge graph at O'Reilly Strata NY.
2019: Launched Knowledge Graph publicly: 2B+ entities, 10T+ facts. Published "The Diffbot Master Plan."
Jan 2025: Launched Diffbot LLM - open-source GraphRAG system. Best factual accuracy in class under 100B params. Outperforms ChatGPT on FreshQA.

Inside Diffbot's Knowledge Graph

10B+ Entities

People, companies, products, articles, events - all resolved and linked

1T+ Structured Facts

Machine-readable, queryable facts from across the open web

50B URLs Crawled

The original build: 50 billion URLs processed over 50 months

4-5 Days Per Rebuild

The entire knowledge graph is rebuilt from scratch every 4-5 days

98% Web Coverage

Approximately 98% of the internet covered across ~50 languages

150M New Facts / Month

Fresh facts continuously added as the web evolves

Trusted by 400+ companies including

Microsoft Bing
DuckDuckGo
Snapchat
eBay
Cisco
Adobe
Salesforce
HubSpot
Crunchbase
Adidas
Nike
Nasdaq
Yandex
Avast
AOL
Instapaper

What Mike Tung Says

"We want to build the world's largest database of structured knowledge."
- Mike Tung, Diffbot
"We're taking the Internet and converting it into semantic knowledge."
- Mike Tung, Diffbot
"Automatically extracting structure from arbitrary URLs works 10X better compared to manually creating scraping rules."
- Mike Tung, Diffbot blog
"From day one we made it an on-demand service. For every hit to our server we earn .008 cents."
- Mike Tung, early interview
"Google has this knowledge graph using human curation... there's a lot of human beings behind the scenes creating the rules."
- Mike Tung, on the automated advantage
"CEO at Diffbot, world's largest knowledge graph. Mostly here to read papers."
- Mike Tung, Twitter/X bio

Building Big on Small Rounds

$2M Seed Round - May 2012
$10M Series A - Feb 2016, led by Tencent
~$13M Total Raised (ever)

Seed investors included Andy Bechtolsheim, Joi Ito, Brad Garlinghouse, Maynard Webb, Elad Gil, Jonathan Heiliger, and Sky Dayton (who joined the board). The Series A was led by Tencent with Felicis Ventures, Amplify Partners, and Valor Capital.

Diffbot LLM: The Anti-Hallucination Model

In January 2025, Tung launched the Diffbot LLM - an open-source language model built on LLaMA 3.3 and grounded in the knowledge graph. The core design principle: when the model doesn't know something, it says so. No guessing. No fabrication. Just a trillion facts retrieved and verified.

Available at diffy.chat and open-sourced on GitHub. It's the first open-source production GraphRAG system - combining the knowledge graph's structured facts with a language model's generation capability. The approach: factual grounding over fluent confabulation.

70.36% MMLU-Pro Score
81% FreshQA Score
#1 Factual Accuracy Open-Source <100B
70B LLaMA 3.3 Base (also 8B)
