Vik Paruchuri - Oakland / Brooklyn

AI Engineer · Founder · Open-Source Creator

The man who taught a million people to code,
then went back to school himself.

DATALAB FOUNDER · OPEN-SOURCE OCR PIONEER · SELF-TAUGHT · THE VIK LETTER

History major. UPS supervisor. Foreign Service Officer with a Top Secret clearance. Then, somehow, the guy who built AI tools beating Google on multilingual benchmarks. Vik Paruchuri's career is not a straight line. It's a Hacker News front page.

1M+ Students at Dataquest
85.9% Chandra OCR Benchmark
90+ Languages Supported
$3.5M Datalab Seed Round
3,400 GitHub Stars in 3 Days

The Unlikely Engineer Who Turned PDFs Into a Startup

Nobody hands you a roadmap that goes: UPS operations, State Department intelligence, one million coding students, viral open-source OCR tool, AI startup with Anthropic as a customer. Vik Paruchuri didn't get a roadmap. He got a history degree and an internet connection, and he built everything else from scratch.

Today, Paruchuri runs Datalab - a Brooklyn-based AI company training small, efficient foundation models to extract structured data from documents. The company raised $3.5 million from Pebblebed, a fund started by founding members of OpenAI and Facebook AI Research. Among Datalab's paying customers: Anthropic, the $61 billion AI company building Claude. Not a bad client list for a startup less than a year old.

But the story that actually matters starts earlier. In 2023, after eight years running Dataquest - the online learning platform that quietly taught over a million people to code and do data science - Paruchuri felt what he describes as "the itch to start building again." He wanted to understand deep learning, not just teach it. So he did what any self-respecting self-taught engineer does: he read the hard book.

"The angle at which you approach something makes all the difference."
- Vik Paruchuri

He worked through The Deep Learning Book cover to cover, sketching diagrams and coding every concept he didn't immediately understand. He implemented foundational papers from 2015 to 2022 in PyTorch. He trained dozens of models. Then he discovered a real problem nobody had properly solved: getting clean, structured data out of PDF files at scale. Existing tools were slow and inaccurate. So he built Marker.

Marker converts PDFs, images, PowerPoints, Word docs, spreadsheets, and HTML into clean markdown and JSON. It runs 10 times faster than the leading academic alternative (Nougat) and is more accurate on real-world documents. When Paruchuri released it, it hit number one on Hacker News within 72 hours. Three thousand four hundred GitHub stars in three days. The kind of launch most open-source projects dream about for years.

From Marker came Surya - an OCR engine supporting over 90 languages that outperforms Tesseract across almost every language it touches. From Surya came Chandra, a four-billion-parameter model hitting 85.9% accuracy on the olmOCR benchmark, the current state-of-the-art for open-source OCR. Chandra scores 89.9% on tables, 89.3% on math equations, and 92.5% on headers. Across 43 languages, it averages 77.8% against Gemini 2.5 Flash's 67.6%. A history major beat a Google frontier model on multilingual document understanding. That sentence deserves a moment.

Jeremy Howard - the co-founder of fast.ai and one of the most respected figures in applied deep learning - noticed the open-source work and offered Paruchuri a position at Answer.ai. That is what happens when you publish code that actually works.

VIK SAYS
"I didn't learn about tech and machine learning in school - I majored in American History for undergrad and failed quite a few classes."
- Vik Paruchuri, vikas.sh

From Top Secret to Top of HN

Most AI founders took the obvious path: Stanford, Google Brain, a prestigious PhD, then a demo at NeurIPS. Vik Paruchuri's path involved supervising logistics at UPS, then briefing diplomats as a Foreign Service Officer at the U.S. Department of State with an active Top Secret security clearance.

He discovered machine learning the old way: Kaggle competitions, forums, and sheer stubbornness. By 2012 he had taught himself Python and was winning competitions in automated essay scoring, bond trading, and stock market prediction. No computer science coursework. Just problems worth solving and the patience to solve them.

He codified this self-teaching philosophy when he built Dataquest. The platform's entire premise - learn by doing, on real data, not toy examples - was how Paruchuri had learned himself. Eight years and a million students later, the philosophy held up. He then applied it to himself one more time, learning deep learning in 2023 the same way he'd learned everything else: from the bottom up, building until it worked.

Small Models, Big Accuracy

The prevailing assumption in AI is that bigger is better. More parameters, more compute, more everything. Paruchuri's bet at Datalab runs in the opposite direction: train specialized models with 100 to 500 million parameters that genuinely understand their specific domain, rather than general-purpose billion-parameter behemoths that handle everything adequately.

For document intelligence, this means models that handle messy handwriting, complex tables, math equations in multiple scripts, checkboxes on forms - the stuff real-world documents actually contain. Chandra, the largest model in the line at four billion parameters, still runs on consumer-grade GPUs. Throughput on an H100: 122 pages per second.

The data thesis is equally grounded. Paruchuri has said data work comprises over 70% of his actual ML workflow. Not architecture tweaks. Not hyperparameter sweeps. Data. Clean it, label it, understand it. It's unglamorous. It also happens to be why his models work.

Key Projects

Marker

#1 HN

Converts PDF, PPTX, DOCX, XLSX, images, and HTML to clean markdown and JSON. Ten times faster than Nougat. Hit the top of Hacker News with 700 votes within 72 hours of release. 3,400+ GitHub stars in three days. v2 adds image extraction, commercial licensing, and improved OCR accuracy.

Surya OCR

90+ LANGS

Open-source OCR engine with layout analysis, reading order detection, and table recognition. Outperforms Tesseract across almost all of the 90+ languages it supports. Powers Marker under the hood and holds its benchmark performance a year after release - unusual in a fast-moving field.

Chandra OCR

85.9%

Four-billion-parameter model achieving state-of-the-art open-source accuracy on the olmOCR benchmark. Scores 89.9% on tables, 89.3% on math, 92.5% on headers. Averages 77.8% across 43 languages versus Gemini 2.5 Flash's 67.6%. Available via the Datalab API, with open-source weights on Hugging Face and vLLM support.

Dataquest

1M+

The online learning platform Paruchuri ran for eight years. Taught Python, data science, machine learning, and SQL to over one million students worldwide. Built on the principle that learning by working with real data beats video lectures. Still running as a business today.

Datalab

$3.5M

The current company. Trains specialized 100-500M parameter models for structured data extraction from documents. Backed by Pebblebed (OpenAI and FAIR founders). Customers include Anthropic. Founded June 2024 with co-founder Sandy Kwon. Based in Brooklyn, NY.

The Vik Letter

SEMI.

Paruchuri's newsletter covering semiconductors and technology. Bridges the gap between the hardware layer and AI applications. Published from his personal site vikas.sh, it reflects a technical breadth that extends well beyond the OCR tools he's most known for building.

Anecdotes

The Details That Don't Fit the Resume

The Deep Learning Book story. When Paruchuri decided to learn deep learning in 2023, he didn't watch YouTube tutorials. He sat with Goodfellow, Bengio, and Courville's textbook - the 800-page academic bible of the field - and worked through it deliberately. Every unfamiliar concept got sketched out by hand and coded in Python before he moved on. He then built and taught a "Zero to GPT" course simultaneously, forcing himself to explain concepts he'd just learned. This is the method he used to teach a million students. He used it on himself.

The Jeremy Howard moment. After releasing Marker and watching it go viral, Paruchuri received something many engineers would consider career-defining: an offer from Jeremy Howard - co-creator of fast.ai, one of the most influential figures in making deep learning accessible - to collaborate at Answer.ai, his AI research lab. The offer came not from a pitch deck or a warm intro, but from Paruchuri simply publishing code that worked. Open source as professional credential.

The security-clearance-to-startup arc. Before any of this, Paruchuri spent time as a U.S. Foreign Service Officer holding an active Top Secret security clearance. The specific nature of that work remains - fittingly - undisclosed. What followed was a pivot to machine learning, then education, then document AI. The through-line, in hindsight, is pattern recognition: extracting signal from complex information under constraints. That skill transfers regardless of whether the source material is classified briefings or messy PDFs.

Fun Facts
01

Majored in American History and failed multiple classes. No computer science courses. Learned ML entirely through Kaggle competitions starting in 2012.

02

Was a U.S. Foreign Service Officer with a Top Secret clearance before pivoting to tech. One of the more unusual origin stories in AI.

03

His PDF tool Marker earned 3,400 GitHub stars in just three days - faster growth than most open-source projects see in a year.

04

Anthropic - builder of Claude, valued at $61 billion - is a paying customer of his startup Datalab, less than a year after Datalab was founded.

05

His OCR models average 77.8% across 43 languages on multilingual benchmarks. Google's Gemini 2.5 Flash averages 67.6%. A small startup beat a tech giant.

06

Spent 8 years teaching 1 million people to code and do data science at Dataquest - then went back to being a student himself, learning deep learning from scratch in 2023.

Who He Is

Built on Fundamentals

Self-taught · Pragmatic · Open-source advocate · Bottom-up thinker · Transparent about failure · Data-obsessed · The itch to build

Paruchuri has a consistent and somewhat rare trait among founders: he says true things about himself in public. He admits he barely graduated college. He admits he dismissed deep learning for years before committing to learning it. He writes about the gap between what ML courses teach and what actual ML work looks like - that 70% data cleaning reality most practitioners know but few discuss openly.

His aspirations are equally concrete. He wants to build small AI models that solve document intelligence problems without requiring data-center-scale compute. He wants to advance open-source tooling in OCR and document understanding. He's not pitching a vision of artificial general intelligence. He's fixing the unsexy, pervasive problem of getting structured data out of unstructured files - and doing it better than anyone else right now.

From the Loading Dock to the AI Frontier

2004-2009 BA in American History, University of Maryland. By his own account, barely graduated. Failed multiple courses.
2006-2009 Operations Supervisor at UPS. Logistics, scheduling, team management - real-world systems at scale.
2010-2011 Foreign Service Officer, U.S. Department of State. Active Top Secret security clearance. The specifics remain undisclosed.
2012 Teaches himself Python and machine learning through Kaggle. Wins competitions. No formal CS training. Discovers that self-teaching works.
2012-2014 Machine Learning Engineer and consultant at VCP Analytics and the early edX team. 901 commits to Open edX. Starts building for real.
2015 Founds Dataquest - online learning platform for data science and ML. The teaching philosophy: learn by doing with real data, not toy examples.
2015-2023 Runs Dataquest for 8 years, growing to 1M+ students and 500,000+ career advancers. Bootstrapped, sustainable, and genuinely impactful.
2023 Feels "the itch to start building again." Commits to learning deep learning from fundamentals. Reads The Deep Learning Book. Implements two dozen foundational papers in PyTorch. Trains dozens of models.
2023-2024 Releases Marker (PDF converter), then Surya OCR engine. Both go viral. Jeremy Howard of fast.ai offers collaboration at Answer.ai. GitHub stars accumulate at remarkable speed.
2024 Founds Datalab with Sandy Kwon. Raises $3.5M from Pebblebed. Releases Chandra OCR - state-of-the-art open-source performance. Anthropic signs on as a customer.
2024-NOW Building Datalab's document intelligence API. Publishing The Vik Letter newsletter on semiconductors and tech. Advancing open-source OCR. Still building.
