The Man Who Taught Machines to Read the Internet
His Twitter bio says it plainly: "CEO at Diffbot, world's largest knowledge graph. Mostly here to read papers." Mike Tung has been quietly building one of the most consequential AI infrastructure companies in Silicon Valley for 15 years, funded on less money than most startups spend getting to product-market fit, and powering the searches of hundreds of millions of people who have never heard of him.
Diffbot's knowledge graph is rebuilt from scratch every four to five days by crawling 50 billion URLs. It contains over 10 billion entities - people, companies, products, articles, events - resolved, linked, and structured into over a trillion facts. It is, without serious dispute, the largest automated knowledge graph on earth. Microsoft Bing queries it. DuckDuckGo uses it for knowledge panels. Snapchat feeds it into trending news. Adidas runs counterfeit detection against it. And none of that required a unicorn valuation or a $500M Series C.
Diffbot raised approximately $13 million across its entire history. The average Series A in 2024 was $18.7 million. Tung built the world's most factually accurate AI database on less funding than a standard first institutional round - and reached profitability on it.
Tung graduated from UC Berkeley with a degree in Electrical Engineering and Computer Sciences in 2002, then moved through Microsoft, eBay, and Yahoo before landing in Stanford's AI Lab. The PhD program did not hold him. The specific problem did: machines couldn't read web pages the way humans could. HTML scrapers worked fine until the site changed its template. The rules broke. You rewrote them. They broke again. Tung decided to stop writing rules and start writing perception.
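The brittleness of rules-based scraping can be shown in a few lines. The sketch below is illustrative only: the page markup, the rule, and the function name are hypothetical and are not drawn from any real site or from Diffbot's code.

```python
import re

# Hypothetical extraction rule, hand-written against one specific page template.
TITLE_RULE = re.compile(r'<h1 class="headline">(.*?)</h1>')


def extract_title(html):
    """Return the headline if the hard-coded rule matches, otherwise None."""
    match = TITLE_RULE.search(html)
    return match.group(1) if match else None


# The same article before and after an imagined site redesign.
old_template = '<div><h1 class="headline">Robots learn to read</h1></div>'
new_template = '<div><h2 class="article-title">Robots learn to read</h2></div>'

title_before = extract_title(old_template)  # the rule matches the old markup
title_after = extract_title(new_template)   # the rule finds nothing after the redesign
```

A human reader sees the same headline in both versions. The rule sees it only in the first, which is the gap a perception-based approach is meant to close.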
He left Stanford's PhD program to pursue that perception problem. To pay rent, he filed patents - the kind of meticulous legal document work that earns around $20,000 per filing - while simultaneously developing the mathematical foundations of what would become Diffbot's computer vision engine. His meals were beans, rice, and ramen. His ambition was considerably larger.
"The technology is scouring the web and is trying to simulate what a human being is doing when they're on the page."- Mike Tung, Diffbot Founder & CEO
The first public Diffbot APIs launched in August 2011. One call, one URL, structured data back. The price: $0.008. Less than a penny per page read. Tung called it a "decoder ring for the web." The model was deliberately transactional - no subscriptions, no minimums, pay for what you use. It turned out companies had a lot of URLs they needed decoded.
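To put the per-call price in perspective, here is a rough arithmetic sketch. Only the $0.008 figure comes from the text above; the page volumes are hypothetical, and the calculation assumes a flat rate with no volume discounts.

```python
from decimal import Decimal

# Launch price per API call, as quoted in the article.
PRICE_PER_CALL = Decimal("0.008")


def cost_for_pages(pages):
    """Total cost in dollars for reading `pages` URLs at the flat per-call rate."""
    return PRICE_PER_CALL * pages


# Hypothetical workloads, chosen only for illustration.
cost_thousand = cost_for_pages(1_000)      # 8 dollars
cost_million = cost_for_pages(1_000_000)   # 8,000 dollars
```

At that rate a thousand pages costs $8 and a million pages costs $8,000, which makes the pay-per-use model easy to reason about at small and large scale.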
In May 2012, Diffbot raised $2 million in seed funding from a roster that read like a who's-who of early Silicon Valley infrastructure bets: Andy Bechtolsheim (who wrote the first check to Google), Joi Ito (MIT Media Lab director), Brad Garlinghouse (YouSendIt), Maynard Webb (eBay COO), Elad Gil (Twitter VP), and Jonathan Heiliger (Facebook VP). Sky Dayton, EarthLink's founder, joined the board. These were not people who invested in social apps. They were people who bet on infrastructure with long time horizons.
Stanford's StartX accelerator made Diffbot its first company investment. There is something appropriate about that. Diffbot is the kind of technical undertaking Stanford's AI Lab would produce if it were running as a commercial operation: deeply academic in method, relentlessly practical in application.
"We're taking the Internet and converting it into semantic knowledge." - Mike Tung
By 2015, Diffbot's extraction accuracy had reached 90-95%. Not rules-based. Not scraping. Computer vision trained to read arbitrary web pages at scale. In February 2016, Tencent led a $10M Series A, joined by Felicis Ventures, Amplify Partners, and Valor Capital. That brought total funding to roughly $13 million - a number Tung has never seemed particularly concerned about growing.
In September 2019, he published what he called "The Diffbot Master Plan (Part One)." The structure was almost deliberately Musk-like in its clarity and ambition, without the theatrics. Phase one: extraction APIs. Phase two: Crawlbot - whole-domain crawling built with Matt Wells, the founder of Gigablast, whom Tung hired as VP of Search. Phase three: the Knowledge Graph itself - 50 billion URLs, crawled over 50 months, entity-resolved, fact-checked, and continuously rebuilt. His framing was not about disruption. It was about completeness.
Google's Knowledge Graph, he noted quietly, relies heavily on human curation. Watson's too. "There's a lot of human beings behind the scenes creating the rules." Diffbot's is entirely automated. No human in the loop. The machine reads, structures, resolves, and updates. Every four to five days, from scratch.
10+ billion entities across people, organizations, products, articles, events, and locations. 1 trillion+ structured facts. 98% of the internet covered in approximately 50 languages. 150 million new facts added monthly. Rebuilt completely every 4-5 days. No human curators.
The enterprise use cases are instructive in their variety. DuckDuckGo uses Diffbot for knowledge summaries - the panels that appear when you search a company or person. Snapchat uses it for trending news extraction. Adidas and Nike run counterfeit product detection against it. Nasdaq uses it for market intelligence. Crunchbase uses it for data enrichment. Adobe, Salesforce, and HubSpot are on the list as well. The connective tissue of how companies know things about the world increasingly runs through Tung's infrastructure.
In January 2025, Tung launched the Diffbot LLM - an open-source language model built on Meta's Llama models (a 70B version based on Llama 3.3 and an 8B version based on Llama 3.1) and grounded in the knowledge graph. The core design choice was deliberate: when the model doesn't know something with confidence, it says so instead of guessing. On MMLU-Pro, it scored 70.36% - best in its class among open-source models under 100 billion parameters. On FreshQA, which tests knowledge of recent events, it hit 81% - beating ChatGPT with web search, Google Gemini, and Perplexity.
The model is available at diffy.chat and open-sourced on GitHub. Tung describes it as the first open-source production GraphRAG system. In an AI landscape full of systems that confabulate convincingly, Diffbot built one that knows what it doesn't know.
"We want to build the world's largest database of structured knowledge." - Mike Tung
Tung's career choices have been consistently counterintuitive for the era they happened in. While his Stanford contemporaries were building social networks and photo-sharing apps in 2008, he was writing computer vision algorithms to parse arbitrary HTML at web scale. While the AI boom of 2021-2023 produced billion-dollar valuations for companies with little infrastructure, he was operating profitably with ~33 employees across North America and Europe. While every AI company in 2024 was racing to build the biggest model, he was building the most accurate one.
His team is small by design. The engineering team runs at roughly 13 people. There are approximately two people in marketing. The company generates around $3.1M in annual revenue. These are not the numbers of a company trying to become a unicorn. They are the numbers of a company that has decided its job is to get a very hard technical problem exactly right.
The DARPA Robotics Challenge offers another data point for that disposition. He led Stanford's entry - a project defined by precision engineering under uncertainty, where the cost of being wrong was a robot falling on its face in front of judges. The parallels to building a knowledge graph that enterprises bet production systems on are not accidental.
Tung's aspiration, stated plainly and without embellishment: a complete, machine-readable database of all human knowledge on the internet - a factual backbone for AI systems that eliminates hallucination through grounded retrieval. He has been working toward it for fifteen years. The ticker says he is still going.