She went into a neuroscience lab as an undergrad and came out with a thesis that would eventually be worth $1.3 billion. Her bet: that a few well-written rules could replace armies of human annotators.
Right now, Paroma Varma is running a company that just crossed a $1.3 billion valuation and closed a $100 million Series D. She is also a researcher who studies whether reinforcement learning from verifiable rewards works in low-data environments. Neither thing is a side project.
At Snorkel AI, where she is co-founder and Head of Solutions, Varma works on the part of AI that everyone finds boring until they realize it's the bottleneck. Training data. Specifically, the problem that 80% of AI development time gets consumed by data preparation, labeling, and management - before a single model trains. Her answer, built out of six years of Stanford research, is a platform that lets domain experts write rules instead of hand-labeling tens of thousands of examples. The rules are noisy. The algorithms handle the noise.
The company she helped build from a Stanford AI Lab experiment now serves Fortune 500 enterprises across banking, telecom, biotech, insurance, and government. Snorkel's latest products - Snorkel Evaluate and Snorkel Expert Data-as-a-Service - are pushing into LLM evaluation and fine-tuning, which means Varma's decade-long bet on data quality is increasingly being validated by the generative AI wave, not disrupted by it.
"You've gone from weeks or months of manual annotation required to label these data points to maybe a matter of hours or days by just writing a couple of rules."
- Paroma Varma, Snorkel AI
The number that frames her argument: 80% of practitioners report AI projects blocked by insufficient training data. The number that proves her answer works: a Fortune 500 bank reduced a labeling timeline from months to days using Snorkel Flow. These are not theoretical wins. They show up in contracts, in renewals, in a funding trajectory that has gone from seed to Series D in six years without a pivot.
Paroma Varma did not plan to found an AI company. She planned to study circuits. At UC Berkeley, where she completed her B.S. in Electrical Engineering and Computer Science, she ended up in a neuroscience lab - the kind that uses computational imaging to analyze brain signals. Machine learning was a tool she picked up because the data demanded it.
That particular entry point mattered. She learned ML not as an end in itself, but as a solution to a specific problem: a domain expert needed to make sense of complex data with limited labels. The experience stuck. When she arrived at Stanford for her Ph.D., working in Professor Christopher Ré's lab - home to the Hazy Research group, affiliated with DAWN, SAIL, and the StatML groups - she was already asking the question that would define her career: what if you didn't need perfectly labeled data to train good models?
The answer the lab developed was called weak supervision. Instead of labeling individual data points by hand, users write labeling functions - simple heuristics, regular expressions, rules-of-thumb from domain knowledge. The heuristics are noisy and sometimes contradictory. A statistical model resolves the conflicts and produces probabilistic labels good enough to train a classifier. The result: label quality that approaches hand-annotation at a fraction of the cost and time.
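The idea in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Snorkel's API: the spam/ham task, the three labeling functions, and the helper name probabilistic_label are all hypothetical, and a plain vote fraction stands in for the statistical model that resolves conflicts in the real system.

```python
import re
from collections import Counter

# Label values; ABSTAIN means "this heuristic has no opinion".
ABSTAIN, HAM, SPAM = -1, 0, 1

# Hypothetical labeling functions: noisy rules of thumb from domain knowledge.
def lf_contains_link(text):
    return SPAM if re.search(r"https?://", text) else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if re.search(r"\b(prize|winner|free)\b", text, re.I) else ABSTAIN

def lf_short_reply(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def probabilistic_label(text):
    """Return an estimate of P(SPAM), or None if every rule abstains.

    Assumption: an unweighted vote fraction replaces the learned label
    model, which would also estimate each rule's accuracy.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return None
    return Counter(votes)[SPAM] / len(votes)
```

When two rules disagree on the same example, as they would on a short message containing the word "free", the vote fraction lands between 0 and 1 rather than forcing a hard label. That soft label is what a downstream classifier would train on.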
Varma detected abnormal heart valve structures in clinical MRI data using weak supervision - work published in Nature Communications. The method wasn't a demo. It was a clinical pipeline.
Varma's specific contribution to this framework was opening it up to non-text data. The early Snorkel system worked well for natural language - you could write regex patterns, keyword searches, and simple rules. Images were harder. You can't write a regex for a pixel. She responded by developing what she called domain-specific primitives: interpretable building blocks that let a domain expert write meaningful labeling functions for image and video data. In medical imaging, that meant annotating image patches, shapes, and spatial relationships. In autonomous driving, it meant describing object orientations and bounding-box relationships.
She also built Snuba - a system that could automatically generate labeling functions using a small labeled seed dataset, then iterate. Published at VLDB in 2018 and cited 227 times, it remains one of the core references for automated weak supervision. The academic record shows someone who was advancing theory and, at the same time, building systems that solved real problems for real domain experts.
The Snorkel project started in 2015 - four years before the company existed. That gap matters. By the time Varma, Alex Ratner, Christopher Ré, Braden Hancock, and Henry Ehrenberg incorporated Snorkel AI in 2019, the research had already been deployed in production at Google, Apple, and several large hospital systems. They were not pitching a hypothesis. They were commercializing a proven method.
The company grew without a pivot. The core thesis - that programmatic, data-centric AI development beats manual annotation - has remained constant from the first paper to the latest Series D deck. What changed is the surface area: Snorkel Flow originally focused on classification and NLP. Today it covers LLM evaluation, fine-tuning, RAG optimization, and expert data-as-a-service.
Varma's role shifted as the company scaled. Her title is now Co-Founder and Head of Solutions, which means she sits at the intersection of research and enterprise deployment. She works with customers - banks, telecoms, government agencies, biotech firms - to understand where the training data bottleneck is most acute, and what labeling strategies actually hold up in production. That requires both technical depth and an ability to translate academic concepts into things a compliance officer or clinical director can act on. It is a specific kind of skill, and it is not common.
"It is challenging to replace domain expertise completely. The ideal solution is to inject the domain expertise and then automate different parts of this pipeline."
- Paroma Varma, on the limits of automation in AI development
The May 2025 Series D - $100 million, led by Addition, with participation from Greylock, Lightspeed, Prosperity 7 Ventures, BNY, and QBE Ventures - arrived in a market where every enterprise AI story was about model selection, not data quality. Snorkel AI is, essentially, making the contrarian argument: better data beats bigger models. The funding suggests that argument is landing. The $1.3 billion valuation suggests investors believe it at scale.
Varma's Google Scholar profile reads like a tour through two distinct literatures. The most-cited paper on her page - 2,342 citations - is about decomposing neural power spectra in computational neuroscience, co-authored during her Berkeley years. Her AI work follows: Snuba (2018), Training Classifiers with Natural Language Explanations (ACL 2018, 192 citations), and a cluster of papers on weak supervision theory and application.
The dual legacy - neuroscience and ML - is not a quirk. It is evidence of the same underlying question asked in two different domains: how do you extract reliable signal from noisy, limited data? The neuroscience paper is about isolating periodic brain rhythms from aperiodic background noise. The weak supervision papers are about extracting clean labels from imperfect human-written rules. Same problem structure, different substrate.
With an h-index of 16 and more than 3,789 total citations, Varma has built a substantial body of work. She was invited to deliver a keynote at the ICLR 2021 Workshop on Weakly Supervised Learning - a signal that the broader ML community considers her one of the field's leading voices.
"Pursuing a Ph.D. is incredible because it forces you to go through all these different phases and sharpen your skills in every single one."
- Paroma Varma
"80% of AI application development time is spent on data preparation, management, and labeling."
- On the core problem Snorkel solves
"My favorite part is that Snorkel AI is full of fantastic people. I do get to learn something new every single day."
- On company culture
"You've gone from weeks or months of manual annotation to maybe a matter of hours or days by just writing a couple of rules."
- On the efficiency of weak supervision
Snorkel AI closed a $100M Series D led by Addition, with participation from Prosperity 7 Ventures, Greylock, Lightspeed, BNY, and QBE Ventures. The round coincided with the launch of two new products: Snorkel Evaluate (AI model evaluation infrastructure) and Snorkel Expert Data-as-a-Service.
Varma is actively publishing on reinforcement learning from verifiable rewards (RLVR) effectiveness under constrained data and compute conditions - extending Snorkel's core thesis into the LLM fine-tuning era.
Varma spoke at SnorkelCon 2024, Snorkel AI's annual enterprise AI conference, covering advances in data-centric AI development and enterprise LLM deployment.
Featured as an expert panelist at the 2023 Enterprise LLM Summit and in a Microsoft-hosted interview series on the pace of AI development in enterprise settings.
Varma's most-cited academic paper - 2,342 citations - is about computational neuroscience, not AI. She was helping decompose brain rhythms before she was building data labeling platforms.
The name "Snorkel" comes from the idea of snorkeling through data - using lightweight signals to see just below the surface, without needing to dive deep with expensive annotations.
The Snorkel project started in 2015. The company didn't incorporate until 2019. Four years of Stanford research before a single pitch deck.
Snorkel AI grew from a 5-person founding team to 1,200 employees and $148M in annual revenue - without ever pivoting its core thesis about the primacy of training data quality.
Her personal website (paroma.xyz) was eventually hijacked and redirected to unrelated content - a small irony for a researcher whose life's work is about data reliability and provenance.