Nobody hands you a roadmap that goes: UPS operations, State Department intelligence, one million coding students, viral open-source OCR tool, AI startup with Anthropic as a customer. Vik Paruchuri didn't get a roadmap. He got a history degree and an internet connection, and he built everything else from scratch.
Today, Paruchuri runs Datalab - a Brooklyn-based AI company training small, efficient foundation models to extract structured data from documents. The company raised $3.5 million from Pebblebed, a fund started by founding members of OpenAI and Facebook AI Research. Among Datalab's paying customers: Anthropic, the $61 billion AI company building Claude. Not a bad client list for a startup less than a year old.
But the story that actually matters starts earlier. In 2023, after eight years running Dataquest - the online learning platform that quietly taught over a million people to code and do data science - Paruchuri felt what he describes as "the itch to start building again." He wanted to understand deep learning, not just teach it. So he did what any self-respecting self-taught engineer does: he read the hard book.
"The angle at which you approach something makes all the difference." - Vik Paruchuri
He worked through The Deep Learning Book cover to cover, sketching diagrams and coding every concept he didn't immediately understand. He implemented foundational papers from 2015 to 2022 in PyTorch. He trained dozens of models. Then he discovered a real problem nobody had properly solved: getting clean, structured data out of PDF files at scale. Existing tools were slow and inaccurate. So he built Marker.
Marker converts PDFs, images, PowerPoints, Word docs, spreadsheets, and HTML into clean markdown and JSON. It runs 10 times faster than the leading academic alternative (Nougat) and is more accurate on real-world documents. When Paruchuri released it, it hit number one on Hacker News within 72 hours. 3,400 GitHub stars in three days. The kind of launch most open-source projects dream about for years.
From Marker came Surya - an OCR engine supporting over 90 languages that outperforms Tesseract across almost every language it touches. From Surya came Chandra, a four-billion-parameter model hitting 85.9% accuracy on the olmocr benchmark, the current state-of-the-art for open-source OCR. Chandra scores 89.9% on tables, 89.3% on math equations, and 92.5% on headers. Across 43 languages, it averages 77.8% against Gemini 2.5 Flash's 67.6%. A history major beat a Google frontier model on multilingual document understanding. That sentence deserves a moment.
Jeremy Howard - the co-founder of fast.ai and one of the most respected figures in applied deep learning - noticed the open-source work and offered Paruchuri a position at Answer.ai. That is what happens when you publish code that actually works.