Vik Paruchuri - Oakland / Brooklyn

AI Engineer · Founder · Open-Source Creator

The man who taught a million people to code,
then went back to school himself.

DATALAB FOUNDER · OPEN-SOURCE OCR PIONEER · SELF-TAUGHT · THE VIK LETTER

History major. UPS supervisor. Foreign Service Officer with a Top Secret clearance. Then, somehow, the guy who built AI tools beating Google on multilingual benchmarks. Vik Paruchuri's career is not a straight line. It's a Hacker News front page.

1M+ Students at Dataquest
85.9% Chandra OCR Benchmark
90+ Languages Supported
$3.5M Datalab Seed Round
3,400 GitHub Stars in 3 Days

The Unlikely Engineer Who Turned PDFs Into a Startup

Nobody hands you a roadmap that goes: UPS operations, State Department intelligence, one million coding students, viral open-source OCR tool, AI startup with Anthropic as a customer. Vik Paruchuri didn't get a roadmap. He got a history degree and an internet connection, and he built everything else from scratch.

Today, Paruchuri runs Datalab - a Brooklyn-based AI company training small, efficient foundation models to extract structured data from documents. The company raised $3.5 million from Pebblebed, a fund started by founding members of OpenAI and Facebook AI Research. Among Datalab's paying customers: Anthropic, the $61 billion AI company building Claude. Not a bad client list for a startup less than a year old.

But the story that actually matters starts earlier. In 2023, after eight years running Dataquest - the online learning platform that quietly taught over a million people to code and do data science - Paruchuri felt what he describes as "the itch to start building again." He wanted to understand deep learning, not just teach it. So he did what any self-respecting self-taught engineer does: he read the hard book.

"The angle at which you approach something makes all the difference."
- Vik Paruchuri

He worked through The Deep Learning Book cover to cover, sketching diagrams and coding every concept he didn't immediately understand. He implemented foundational papers from 2015 to 2022 in PyTorch. He trained dozens of models. Then he discovered a real problem nobody had properly solved: getting clean, structured data out of PDF files at scale. Existing tools were slow and inaccurate. So he built Marker.

Marker converts PDFs, images, PowerPoints, Word docs, spreadsheets, and HTML into clean markdown and JSON. It runs 10 times faster than the leading academic alternative (Nougat) and is more accurate on real-world documents. When Paruchuri released it, it hit number one on Hacker News within 72 hours. Three thousand four hundred GitHub stars in three days. The kind of launch most open-source projects dream about for years.

From Marker came Surya - an OCR engine supporting over 90 languages that outperforms Tesseract across almost every language it touches. From Surya came Chandra, a four-billion-parameter model hitting 85.9% accuracy on the olmOCR benchmark, the current state-of-the-art for open-source OCR. Chandra scores 89.9% on tables, 89.3% on math equations, and 92.5% on headers. Across 43 languages, it averages 77.8% against Gemini 2.5 Flash's 67.6%. A history major beat a Google frontier model on multilingual document understanding. That sentence deserves a moment.

Jeremy Howard - the co-founder of fast.ai and one of the most respected figures in applied deep learning - noticed the open-source work and offered Paruchuri a position at Answer.ai. That is what happens when you publish code that actually works.

VIK SAYS
"I didn't learn about tech and machine learning in school - I majored in American History for undergrad and failed quite a few classes."
- Vik Paruchuri, vikas.sh

From Top Secret to Top of HN

Most AI founders took the obvious path: Stanford, Google Brain, a prestigious PhD, then a demo at NeurIPS. Vik Paruchuri's path involved supervising logistics at UPS, then briefing diplomats as a Foreign Service Officer at the U.S. Department of State with an active Top Secret security clearance.

He discovered machine learning the old way: Kaggle competitions, forums, and sheer stubbornness. By 2012 he had taught himself Python and was winning competitions in automated essay scoring, bond trading, and stock market prediction. No computer science coursework. Just problems worth solving and the patience to solve them.

He codified this self-teaching philosophy when he built Dataquest. The platform's entire premise - learn by doing, on real data, not toy examples - was how Paruchuri had learned himself. Eight years and a million students later, the philosophy held up. He then applied it to himself one more time, learning deep learning in 2023 the same way he'd learned everything else: from the bottom up, building until it worked.

Small Models, Big Accuracy

The prevailing assumption in AI is that bigger is better. More parameters, more compute, more everything. Paruchuri's bet at Datalab runs in the opposite direction: train specialized models with 100 to 500 million parameters that genuinely understand their specific domain, rather than general-purpose billion-parameter behemoths that handle everything adequately.

For document intelligence, this means models that handle messy handwriting, complex tables, math equations in multiple scripts, checkboxes on forms - the stuff real-world documents actually contain. Chandra, the largest model in the line at four billion parameters, still runs on consumer-grade GPUs. Throughput on an H100: 122 pages per second.

The data thesis is equally grounded. Paruchuri has said data work comprises over 70% of his actual ML workflow. Not architecture tweaks. Not hyperparameter sweeps. Data. Clean it, label it, understand it. It's unglamorous. It also happens to be why his models work.

Key Projects

Marker

#1 HN

Converts PDF, PPTX, DOCX, XLSX, images, and HTML to clean markdown and JSON. Ten times faster than Nougat. Hit the top of Hacker News with 700 votes within 72 hours of release. 3,400+ GitHub stars in three days. v2 adds image extraction, commercial licensing, and improved OCR accuracy.

Surya OCR

90+ LANGS

Open-source OCR engine with layout analysis, reading order detection, and table recognition. Outperforms Tesseract across almost all of the 90+ languages it supports. Powers Marker under the hood and holds its benchmark performance a year after release - unusual in a fast-moving field.

Chandra OCR

85.9%

Four-billion-parameter model achieving state-of-the-art open-source accuracy on the olmOCR benchmark. Scores 89.9% on tables, 89.3% on math, 92.5% on headers. Averages 77.8% across 43 languages versus Gemini 2.5 Flash's 67.6%. Available via the Datalab API, with open-source weights on Hugging Face and vLLM support.

Dataquest

1M+

The online learning platform Paruchuri ran for eight years. Taught Python, data science, machine learning, and SQL to over one million students worldwide. Built on the principle that learning by working with real data beats video lectures. Still running as a business today.

Datalab

$3.5M

The current company. Trains specialized 100-500M parameter models for structured data extraction from documents. Backed by Pebblebed (OpenAI and FAIR founders). Customers include Anthropic. Founded June 2024 with co-founder Sandy Kwon. Based in Brooklyn, NY.

The Vik Letter

SEMI.

Paruchuri's newsletter covering semiconductors and technology. Bridges the gap between the hardware layer and AI applications. Published from his personal site vikas.sh, it reflects a technical breadth that extends well beyond the OCR tools he's most known for building.

Anecdotes

The Details That Don't Fit the Resume

The Deep Learning Book story. When Paruchuri decided to learn deep learning in 2023, he didn't watch YouTube tutorials. He sat with Goodfellow, Bengio, and Courville's textbook - the 800-page academic bible of the field - and worked through it deliberately. Every unfamiliar concept got sketched out by hand and coded in Python before he moved on. He then built and taught a "Zero to GPT" course simultaneously, forcing himself to explain concepts he'd just learned. This is the method he used to teach a million students. He used it on himself.

The Jeremy Howard moment. After releasing Marker and watching it go viral, Paruchuri received something many engineers would consider career-defining: an offer from Jeremy Howard - co-creator of fast.ai, one of the most influential figures in making deep learning accessible - to collaborate at Answer.ai, his AI research lab. The offer came not from a pitch deck or a warm intro, but from Paruchuri simply publishing code that worked. Open source as professional credential.

The security-clearance-to-startup arc. Before any of this, Paruchuri spent time as a U.S. Foreign Service Officer holding an active Top Secret security clearance. The specific nature of that work remains - fittingly - undisclosed. What followed was a pivot to machine learning, then education, then document AI. The through-line, in hindsight, is pattern recognition: extracting signal from complex information under constraints. That skill transfers regardless of whether the source material is classified briefings or messy PDFs.

Fun Facts
01

Majored in American History and failed multiple classes. No computer science courses. Learned ML entirely through Kaggle competitions starting in 2012.

02

Was a U.S. Foreign Service Officer with a Top Secret clearance before pivoting to tech. One of the more unusual origin stories in AI.

03

His PDF tool Marker earned 3,400 GitHub stars in just three days - faster growth than most open-source projects see in a year.

04

Anthropic - builder of Claude, valued at $61 billion - is a paying customer of his startup Datalab, less than a year after Datalab was founded.

05

His OCR models average 77.8% across 43 languages on multilingual benchmarks. Google's Gemini 2.5 Flash averages 67.6%. A small startup beat a tech giant.

06

Spent 8 years teaching 1 million people to code and do data science at Dataquest - then went back to being a student himself, learning deep learning from scratch in 2023.

Who He Is

Built on Fundamentals

Self-taught · Pragmatic · Open-source advocate · Bottom-up thinker · Transparent about failure · Data-obsessed · The itch to build

Paruchuri has a consistent and somewhat rare trait among founders: he says true things about himself in public. He admits he barely graduated college. He admits he dismissed deep learning for years before committing to learning it. He writes about the gap between what ML courses teach and what actual ML work looks like - that 70% data cleaning reality most practitioners know but few discuss openly.

His aspirations are equally concrete. He wants to build small AI models that solve document intelligence problems without requiring data-center-scale compute. He wants to advance open-source tooling in OCR and document understanding. He's not pitching a vision of artificial general intelligence. He's fixing the unsexy, pervasive problem of getting structured data out of unstructured files - and doing it better than anyone else right now.

From the Loading Dock to the AI Frontier

2004-2009 BA in American History, University of Maryland. By his own account, barely graduated. Failed multiple courses.
2006-2009 Operations Supervisor at UPS. Logistics, scheduling, team management - real-world systems at scale.
2010-2011 Foreign Service Officer, U.S. Department of State. Active Top Secret security clearance. The specifics remain undisclosed.
2012 Teaches himself Python and machine learning through Kaggle. Wins competitions. No formal CS training. Discovers that self-teaching works.
2012-2014 Machine Learning Engineer and consultant at VCP Analytics and the early edX team. 901 commits to Open edX. Starts building for real.
2015 Founds Dataquest - online learning platform for data science and ML. The teaching philosophy: learn by doing with real data, not toy examples.
2015-2023 Runs Dataquest for 8 years, growing to 1M+ students and 500,000+ career advancers. Bootstrapped, sustainable, and genuinely impactful.
2023 Feels "the itch to start building again." Commits to learning deep learning from fundamentals. Reads The Deep Learning Book. Implements two dozen foundational papers in PyTorch. Trains dozens of models.
2023-2024 Releases Marker (PDF converter), then Surya OCR engine. Both go viral. Jeremy Howard of fast.ai offers collaboration at Answer.ai. GitHub stars accumulate at remarkable speed.
2024 Founds Datalab with Sandy Kwon. Raises $3.5M from Pebblebed. Releases Chandra OCR - state-of-the-art open-source performance. Anthropic signs on as a customer.
2024-NOW Building Datalab's document intelligence API. Publishing The Vik Letter newsletter on semiconductors and tech. Advancing open-source OCR. Still building.
