Unstructured

Somewhere inside a Fortune 100 bank, a folder named "Q3_FINAL_FINAL_v7" contains 41 PDFs. Three are scanned. Two have handwritten margin notes. One is rotated ninety degrees for reasons nobody alive remembers. A data scientist somewhere wants to feed all of it to a large language model. This is where Unstructured walks in, rolls up its sleeves, and quietly does the work nobody wanted to do.

Who They Are, Right Now

Unstructured is a 120-person San Francisco company that built the boring, essential layer of the AI stack. Not the models. Not the chatbots. The pipe that gets enterprise documents into the models in the first place. Their open-source library has been downloaded more than six million times. Their enterprise platform sits inside one-third of the Fortune 500. In March 2024 they closed a $40 million Series B led by Menlo Ventures, with Databricks, IBM, and NVIDIA also writing checks. That is not casual money.

If you have ever watched a slick AI demo and wondered how the model got the data, the unromantic answer is probably some version of what Unstructured does for a living.

Every generative AI project starts with a beautiful demo and ends with someone arguing about PDF parsing. - Industry truism, observed in every conference Slack since 2023

The Problem They Saw

By some estimates, 80% of enterprise data is unstructured: emails, contracts, scanned forms, slide decks, meeting recordings, support tickets, that one Word doc the legal team has been editing since 2017. Language models are extraordinarily good at reading text. They are extraordinarily bad at reading a 400-page financial filing where half the data lives in tables and the other half lives in footnotes set in six-point Helvetica.

The polite name for the gap between those two facts is "data preprocessing." The honest name is "the thing that kills 80% of corporate AI pilots before they reach production." Companies were spending months writing custom parsers, only to have someone email a new file format and shatter the whole thing.

80% of enterprise data is unstructured

70+ file types supported

30+ enterprise connectors

6M+ OSS downloads

The Founder's Bet

Brian Raymond is an unusual person to be running an AI infrastructure company. Before Unstructured, he was the Iraq Country Director on the National Security Council, briefing President Obama on ISIS. Before that, the CIA. He spent years at the intersection of intelligence analysts and intractable document piles, and then at Primer AI watching the same problem at commercial scale: smart people, expensive models, and an awful lot of unread paper sitting in between.

Founder & Chief Executive

Brian Raymond

Ex-CIA analyst. Former Iraq Country Director, National Security Council. MBA. Spent the Primer years staring at the data layer until he could not stop staring. Founded Unstructured in 2022 because, in his telling, somebody had to.

The thing that kills enterprise AI is never the model. It is the file format from 2003 that nobody has the will to look at. - The Unstructured thesis, paraphrased

The bet was contrarian in 2022, when every check in Silicon Valley was chasing models, not pipes. Build the unglamorous layer. Open-source the core. Make it so good that paid enterprise adoption becomes the inevitable next conversation. It is the kind of bet that, with the benefit of hindsight, looks obvious. At the time, less so.

A Short, Slightly Crowded Timeline

2022

Unstructured Technologies is founded in San Francisco. Open-source library shipped.

2023

Series A from Madrona and Bain Capital Ventures. Library crosses 1M downloads.

2024 - MAR

$40M Series B led by Menlo Ventures. Databricks, IBM, NVIDIA participate.

2024 - LATE

Unstructured Platform launches - enterprise ETL for GenAI workloads.

2024-2025

Adopted by one-third of the Fortune 500. Platinum tier added for VLM and handwriting.

The Product, In Plain English

There are three things to know. First, the open-source library: a Python package that takes a document - PDF, HTML, Word, image, email, whatever your finance team last invented - and breaks it into typed elements. Title. Paragraph. Table. Figure. The model gets clean structured pieces instead of a wall of broken text.
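
To make "typed elements" concrete, here is a toy sketch in plain Python. It is not Unstructured's code or API: the `Element` class, the `partition_text` function, and the three classification rules are all assumptions invented for illustration, standing in for the layout analysis a real parser performs.

```python
from dataclasses import dataclass
import re


@dataclass
class Element:
    """One typed piece of a document (hypothetical type, for illustration)."""
    category: str  # "Title", "Table", or "Paragraph"
    text: str


def partition_text(raw: str) -> list[Element]:
    """Split raw text into blank-line-separated blocks and label each one.

    The rules below are deliberately naive stand-ins for real layout analysis.
    """
    elements = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        block = block.strip()
        if not block:
            continue
        lines = block.splitlines()
        if all("|" in line for line in lines):
            category = "Table"      # every line looks like a pipe-delimited row
        elif len(lines) == 1 and len(block) <= 60 and not block.endswith("."):
            category = "Title"      # short single line with no closing period
        else:
            category = "Paragraph"
        elements.append(Element(category, block))
    return elements


SAMPLE = """Q3 Results

Revenue rose in the third quarter, driven by lending.

Segment | Revenue
Retail | 120
Commercial | 95
"""

# Print each element's label next to the first line of its text.
for el in partition_text(SAMPLE):
    print(f"{el.category:<10} {el.text.splitlines()[0]}")
```

The point is the output shape, not the heuristics: each piece carries a label, so downstream code can treat a table differently from a paragraph instead of guessing from a wall of text.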

Second, the Unstructured Platform: a hosted, enterprise-grade version with 30+ source connectors (S3, SharePoint, Google Drive, Salesforce, Confluence, the usual suspects), 70+ file type parsers, automatic chunking, metadata enrichment, embedding, and a no-code UI for people who would rather not write Airflow DAGs at midnight. Three transformation tiers: Basic for clean text, Advanced for PDFs and images, Platinum for the truly cursed - scanned forms, handwriting, vision-language models doing actual work.
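
"Automatic chunking" is the least self-explanatory item on that list, so a rough sketch may help. The code below is an assumption-laden illustration, not the platform's algorithm: `Element`, `chunk_by_section`, and the `max_chars` budget are hypothetical names. It shows one common idea, which is to start a new chunk at every title and to split again whenever a section outgrows a size budget.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A typed document piece (hypothetical type, for illustration)."""
    category: str
    text: str


def chunk_by_section(elements: list[Element], max_chars: int = 200) -> list[str]:
    """Group element texts into newline-joined chunks.

    A new chunk starts at each Title, or when adding the next element
    would push the joined chunk past max_chars. A single element longer
    than max_chars is kept whole as its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for el in elements:
        # Length of the joined chunk if this element were appended.
        projected = sum(len(t) for t in current) + len(current) + len(el.text)
        starts_section = el.category == "Title"
        if current and (starts_section or projected > max_chars):
            chunks.append("\n".join(current))
            current = []
        current.append(el.text)
    if current:
        chunks.append("\n".join(current))
    return chunks


DOC = [
    Element("Title", "Q3 Results"),
    Element("Paragraph", "Revenue rose in the third quarter."),
    Element("Title", "Risk Factors"),
    Element("Paragraph", "Credit losses were stable."),
    Element("Paragraph", "Rates remain the main uncertainty."),
]

# Print each chunk with a separator so the boundaries are visible.
for chunk in chunk_by_section(DOC, max_chars=60):
    print(chunk)
    print("---")
```

Splitting on titles keeps each chunk about one topic, which tends to matter once the chunks are embedded and retrieved individually.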

Third, a serverless API for engineers who want the platform without the platform. Compliance coverage includes SOC 2 Type 2, HIPAA, and GDPR. VPC and on-prem deployment options exist for the customers whose lawyers would otherwise still be talking.

Field Note

What people actually do with it: Build RAG systems. Fine-tune domain LLMs. Power compliance and discovery workflows. Feed agentic AI workflows that need to read the same forms a human analyst would. Replace nine separate parsing scripts with one library call.

Adoption, Roughly To Scale

Approximate, public figures. The actual numbers are bigger by the time you finish reading.

OSS downloads: 6M+

Codebases: 12,000+

Organizations: 45,000+

Fortune 500: ~1/3

Funding: $65M

The Proof

The customer roster is mostly redacted, which is what happens when you sell into banks, hospital systems, and federal agencies. The investor roster, though, says enough. Menlo led the B. Databricks Ventures and IBM Ventures and NVentures - NVIDIA's arm - all came in. These are not the kind of strategics who write a check and disappear. Databricks already integrates Unstructured into its lakehouse story. IBM productized it inside watsonx as their document preprocessing layer. NVIDIA wants every piece of the AI stack accelerated; preprocessing is now in scope.

In a category where five companies make a polished landing page and two of them quietly evaporate, this is what traction looks like when it is real.

Six million downloads is not a vanity metric. It is what happens when engineers find something that works on a Friday and ship it on a Monday. - Observation from the open-source ecosystem

The Mission, Stated Without Embarrassment

The official line is "make enterprise data LLM-ready." The unofficial line is more interesting: every generative AI workflow worth running depends on data the company already owns but cannot read. Unstructured wants to be the boring middleware that closes that gap, the way Stripe became the boring middleware for payments. You do not think about it. You just use it. And then one day you cannot imagine building without it.

It is, to be candid, not the kind of mission that sells T-shirts. It is the kind that sells contracts.

Why It Matters Tomorrow

Two things are happening at once. First, agentic AI is moving from demo to deployment, which means systems that need to read documents on behalf of humans - in real time, at production volume, without supervision. Second, the easy data is gone. The internet has been scraped. The next frontier is the proprietary data sitting inside companies: the contracts, the claims, the clinical notes, the trading documents. None of it is in a useful format. All of it has to be turned into one.

If those two trends hold, the company that owns the preprocessing layer is positioned the way Snowflake was for the data warehouse decade. That is the Unstructured pitch in one sentence. The remaining question is execution, and the early returns suggest the team is doing fine.

Back to the Folder Named Q3_FINAL_FINAL_v7

The data scientist runs a script. Forty-one PDFs go in. Out comes a clean stream of titles, paragraphs, tables, captions, and metadata, tagged and chunked and ready to be embedded. The handwritten margins are caught. The rotated page is righted. The footnotes survive. By Tuesday the LLM is answering questions about the bank's Q3 with quotes that actually cite the source. The data scientist closes the laptop and goes home on time, for once.

Somewhere in San Francisco, Unstructured ships another release. Nobody throws a parade. That, more or less, is the point.