A San Francisco TechBio company turning the dirtiest data in pharma into the fuel that runs modern AI - and quietly powering drug programs at Pfizer, Genentech and a hundred others.
It is a Tuesday morning in a pharma data lake somewhere outside Boston. Twelve terabytes of gene-expression files sit in folders no one has opened since 2019. A machine-learning team is two months behind because half the metadata is in German and half is in Excel. Then, somebody on the Slack channel types one word: "Polly?" Within the hour, a curated, harmonized, model-ready dataset shows up in an S3 bucket. The team meets its sprint goal. The drug program does not slip. That, more or less, is the entire business model of Elucidata.
Elucidata sells the boring part of the AI revolution. While larger names fight over foundation models and benchmark leaderboards, this 170-person company has spent ten quiet years building the equivalent of a municipal water system for biomedical data. The pipes are not glamorous. They are, however, load-bearing.
Every pharmaceutical company on earth runs into the same wall. The science is genuinely hard. The data infrastructure is genuinely worse. A 2020 industry survey suggested that scientists spend something like 60 to 80 percent of their time finding, cleaning and re-formatting data before they can do anything useful with it. Generative AI did not fix this. It made it loud.
Co-founders Abhishek Jha and Swetabh Pathak noticed the gap in 2015. Jha had just finished a postdoc at MIT and a stint as a senior scientist at Agios Pharmaceuticals. Pathak came out of IIT Delhi with a mathematics-and-computer-science background. They began with a question that, in retrospect, sounds almost banal - what if you treated biology's data quality problem like a software problem?
Almost nobody had. Bio was full of brilliant tooling for analysis and almost nothing for curation. The plumbing was missing.
Jha and Pathak's bet was unfashionable in the way only good bets are. The idea was that a horizontal platform for cleaning, harmonizing and indexing biomedical data could become more valuable than any single drug it helped discover. Not because the platform itself was sexy, but because the alternative - every pharma company hiring its own private army of bioinformaticians - was both expensive and slow.
They called the platform Polly, after the parrot, because the system's job was to learn biological patterns and repeat them back, cleanly. Then they did the unglamorous work of getting Polly to read scientific literature, parse messy lab files, reconcile differing schemas, and spit out machine-learning-ready datasets across genomics, transcriptomics, proteomics, metabolomics and clinical records.
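To make "reconcile differing schemas" concrete, here is a minimal sketch of the idea in Python. It is an illustration only, not Polly's code or API: the alias tables, the function names (`harmonize_record`, `harmonize`) and the two sample records are all hypothetical, invented to show how two labs' metadata might be mapped onto one shared vocabulary.

```python
# Hypothetical illustration of schema harmonization; not Polly's actual code or API.
from typing import Dict, List

# Assumed mapping from lab-specific column names to one canonical schema.
COLUMN_ALIASES: Dict[str, str] = {
    "probe_id": "sample_id",
    "sample": "sample_id",
    "gewebe": "tissue",
    "tissue_type": "tissue",
    "organismus": "organism",
    "species": "organism",
}

# Assumed controlled vocabulary: per-field value synonyms mapped to one term.
VALUE_ALIASES: Dict[str, Dict[str, str]] = {
    "organism": {"mensch": "Homo sapiens", "human": "Homo sapiens"},
    "tissue": {"leber": "liver", "hepatic": "liver"},
}


def harmonize_record(record: Dict[str, str]) -> Dict[str, str]:
    """Rename columns to canonical names and normalize known values."""
    clean: Dict[str, str] = {}
    for raw_key, raw_value in record.items():
        # Normalize the column name, then look it up in the alias table.
        key = raw_key.strip().lower().replace(" ", "_")
        key = COLUMN_ALIASES.get(key, key)
        # Normalize the value only if the field has a controlled vocabulary.
        value = str(raw_value).strip()
        value = VALUE_ALIASES.get(key, {}).get(value.lower(), value)
        clean[key] = value
    return clean


def harmonize(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Apply the same harmonization to every record."""
    return [harmonize_record(r) for r in records]


# Two made-up records describing the same kind of sample in different dialects.
lab_a = {"Probe_ID": "S1", "Gewebe": "Leber", "Organismus": "Mensch"}
lab_b = {"sample": "S2", "tissue type": "hepatic", "species": "human"}
harmonized = harmonize([lab_a, lab_b])
```

After this pass both records share the keys `sample_id`, `tissue` and `organism`, which is the precondition for training anything on them together. A production system would presumably lean on ontologies and learned models rather than hand-written lookup tables; the tables here only stand in for that step.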
It took years. The first customers were small biotechs. The first big logos were Pfizer and Janssen. By 2022, the company had grown enough to attract a $16 million Series A led by Eight Roads Ventures, with F-Prime Capital, IvyCap and Hyperplane joining in. Total funding to date sits at about $23.5 million - modest by frothy AI-era standards, ample for a company that has chosen to grow on revenue more than on investor checks.
Elucidata founded in Delhi by Abhishek Jha and Swetabh Pathak.
Polly launches commercially. First biopharma contracts.
YourStory Tech30 list. First marquee customer: Pfizer.
Wins the NCI AI-Readiness Challenge.
$16M Series A. Eight Roads Ventures leads.
#8 Biotech on Fast Company's Most Innovative Companies.
Launches Elucidata AI Labs with hubs in SF, Boston, India.
Polly is a data-centric MLOps platform. Translated from the marketing dialect, it is three things stacked. A data lake of more than 115 terabytes of curated biomedical content, drawn from 30-plus public and proprietary sources. A Bio-NLP engine that reads scientific papers and structured files and turns them into something a model can train on. And a workbench where pharma data scientists actually run their experiments.
The data-centric platform. Ingests, harmonizes, curates. Used across genomics, transcriptomics, proteomics, metabolomics and clinical records.
The open-source library that lets data teams pull curated datasets into their notebooks the way they pull Pandas.
The managed-service offering. Elucidata scientists build bespoke datasets for a specific drug program, on top of Polly.
Launched in 2026. Combines agentic AI with the company's curated data layer to pursue "biomedical AGI". A big phrase. They are trying it anyway.
Elucidata is the kind of company whose customer list does most of the persuading. Pfizer. Genentech. Janssen. Alnylam. Eli Lilly. The Bill & Melinda Gates Foundation. Stanford. The platform supports 40-plus drug programs at or beyond the IND stage, which is the point at which a drug stops being theoretical and starts costing real money.
CHART · Bars scaled visually; numbers are absolute. Yes, datasets are several orders of magnitude bigger than headcount - which is the whole point.
The Pfizer collaboration is, by Elucidata's own accounting, one of the cleaner showcases. The two organizations used Polly's integrated-omics pipeline to study metabolic changes during T-cell activation - a corner of immunology so technical that the case study practically requires a glossary. Pfizer kept the IP. Elucidata kept the credibility.
The official mission statement is the kind of thing companies write on a wall. Elucidata wants to make biomedical data AI-ready so that drug discovery moves at the speed of computing rather than the speed of curation. The unofficial version is shorter. Make the data not embarrassing.
Culturally, the company sits in an unusual place. Scientists and engineers in roughly equal measure. Customers who are themselves PhDs. A San Francisco headquarters on Market Street, a Boston outpost, and a center of gravity in Delhi, where much of the engineering team still lives. The blend shows up in how the product gets built - it is one of the few platforms where a benchmark conversation can swing from "what's the F1 on entity extraction" to "what is actually happening in the cell" in the same standup.
The next decade of biomedicine will be shaped less by the model architecture race and more by the question of whose data actually trains anything useful. Most clinical and omics data still lives in proprietary silos, with idiosyncratic schemas, varying provenance, and no shared vocabulary. The labs that solve that problem will not get the Nobel. They will get the contracts.
Elucidata has chosen to be that lab. Whether the AI Labs experiment with agentic biomedical AI pans out is genuinely uncertain - "biomedical AGI" is a phrase that demands skepticism. But the underlying business is plain. Curate the data. Sell the access. Let the rest of the industry build on top.
It is now Tuesday morning, again, in a different pharma data lake. The same twelve terabytes of expression files exist somewhere on AWS. This time, a junior scientist does not have to know German or remember Excel. She runs a query against Polly. The dataset arrives clean. The model trains. The slide deck is ready by Thursday. The drug program does not slip. That is what a quiet revolution looks like before anyone calls it one.