YesPress Profile · Company

The wordmark that quietly slipped onto the cap tables of half the AI labs you've heard of. No mascot, no fireworks - just a database that does the job.

LanceDB

The open-source lakehouse teaching databases to handle video, images and embeddings as easily as a spreadsheet handles numbers.

San Francisco · Founded 2021 · ~44 people · Series A ($30M) · Open Source

Dispatch · The data layer nobody sees

Somewhere right now a model is training on a few million video clips, and an engineer is not awake at 3am babysitting the pipeline that feeds it. That is the whole pitch. LanceDB sits underneath the AI you actually use - the image generators, the chatbots, the self-driving stacks - holding the messy, enormous, multimodal data those systems eat for breakfast.

It does not have a flashy demo. It is a file format and a database. But it has wedged itself into the plumbing of Midjourney, Runway, Character.AI and a long list of companies that move petabytes of training data and search across billions of vectors. By mid-2025 its open-source packages had been downloaded more than 20 million times. That is the kind of number you get by being useful, not loud.

Most infrastructure earns its keep by disappearing. LanceDB is very good at disappearing. - The editorial read

Above: A database with no logo animation and no growth-hack hoodie. Suspicious, in the best way.

The Problem · Why this exists

Databases were built for rows. AI eats everything else.

Here is the uncomfortable truth the data world spent a decade not saying out loud: the tools we built for tidy tables of numbers and strings are terrible at images, audio and video. And by 2025, video alone made up roughly 90% of the world's generated data - the kind of statistic that should have triggered a panic and mostly triggered a shrug.

So AI teams improvised. They kept their raw files in one place, their embeddings in a vector store, their features in a pipeline, their training data in yet another bucket. Five tools, four formats, and a standing army of engineers whose actual job was moving data between them. It worked, the way duct tape works.

Engineers today need to stitch together multiple tools, spending all their time with plumbing instead of experimenting, shipping features, or improving the cognitive layer in the agent they're building. - Chang She, Co-founder & CEO

LanceDB's bet is that this stitching is not a fact of life. It is a missing layer. Build one open format that natively understands every modality, and the plumbing problem - the thing eating the calendar of every ML team - stops being a problem.

The tension: 90% of data is video. Almost no database was designed for it. Someone had to volunteer.

The Bet · Two people, one format

The man who helped build the tables, now building tables for AI

If you have ever written import pandas as pd, you have used Chang She's work. He was one of the original co-authors of pandas, the library that a generation of data scientists treats as gravity. He spent two decades building the tools humans use to wrangle data. LanceDB is his answer to a newer question: what do the machines need?

His co-founder, Lei Xu, came from the data infrastructure team at Cruise, where the data was not tidy spreadsheets but oceans of sensor and video logs from cars trying not to hit things. Between them they had seen both ends of the problem - the friendly tabular world and the brutal multimodal one - and concluded they should not be two worlds at all.

Co-founder & CEO

Chang She

Original co-author of pandas, ex-CTO of DataPad, ex-VP Eng at Tubi. Now arguing that the spreadsheet needs a sequel.

Co-founder & CTO

Lei Xu

Built data infrastructure for self-driving at Cruise, where "the dataset" meant petabytes of video that refused to behave.

The bet, made in 2021 and backed first by Y Combinator and CRV: write a new open columnar format - Lance - from scratch in Rust, optimized for the random-access, version-everything, search-anything reality of AI. Not a fork. Not a wrapper. A new foundation.

Above: Two founders who looked at "this is just how AI data works" and decided it was a typo.

The Product · Three layers, one idea

Store it. Search it. Train on it. Stop moving it.

LanceDB is not one thing, which is precisely the point. It is a stack designed so that a clip of video and a row of numbers can live in the same place and play by the same rules.

LAYER 01 · FORMAT

Lance

An open columnar format on Apache Arrow. Convert from Parquet in two lines for up to 100x faster random access, plus vector indexing and Git-style data versioning baked in.

LAYER 02 · DATABASE

LanceDB

A developer-friendly, embedded/serverless vector database. Vector, full-text and hybrid search with SQL filtering - runs on your laptop or your cloud, fully open source.

LAYER 03 · PLATFORM

Multimodal Lakehouse

The enterprise tier: feature engineering with Python UDFs, curation, dedup, and training straight from curated data at up to ~70% MFU (model FLOPs utilization). One system, every modality.
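For a feel of what "vector search with SQL filtering" means in Layer 02, here is a toy sketch in plain Python. It is not LanceDB's API and not how LanceDB indexes anything: the CLIPS records, the cosine helper and the search function are hypothetical names invented for illustration, and the search is brute force over three rows.

```python
import math

# Hypothetical toy records: an embedding plus ordinary metadata columns.
CLIPS = [
    {"id": "clip-a", "kind": "video", "vector": [1.0, 0.0]},
    {"id": "clip-b", "kind": "video", "vector": [0.6, 0.8]},
    {"id": "img-c", "kind": "image", "vector": [0.9, 0.1]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(rows, query, where, limit=2):
    """Keep rows that pass the metadata filter, then rank them by similarity."""
    candidates = [r for r in rows if where(r)]
    ranked = sorted(candidates, key=lambda r: cosine(r["vector"], query), reverse=True)
    return [r["id"] for r in ranked[:limit]]


# "Find things that look like this query, but only among videos."
hits = search(CLIPS, query=[1.0, 0.0], where=lambda r: r["kind"] == "video")
print(hits)
```

The point of the sketch is the shape of the query: one call that combines a meaning-based ranking with an ordinary column filter, rather than a round trip between a vector store and a separate metadata database.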

Convert from Parquet in two lines of code. The most radical ideas in infrastructure usually look like a shortcut. - From the Lance docs

What can you actually do with it? Keep a media data lake that you can search by meaning, not filename. Run embedding pipelines without exporting to a separate store. Deduplicate and curate training sets in place. Evolve a schema - add a column to billions of rows - without rewriting the dataset. Version your data so a bad training run is a rollback, not a forensic investigation.
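Two of those claims, adding a column without rewriting the dataset and treating a bad write as a rollback, are easier to picture with a toy model. The sketch below is an illustration under loud assumptions: VersionedTable and its methods are hypothetical, they are not the Lance API, and the real format's on-disk layout is far more involved. It only shows why a columnar, versioned layout makes both operations cheap.

```python
class VersionedTable:
    """Toy columnar table: every write creates a new version, old ones stay readable."""

    def __init__(self, columns):
        # A version is a dict of column name -> immutable tuple of values.
        self._versions = [{name: tuple(values) for name, values in columns.items()}]

    def latest(self):
        return len(self._versions) - 1

    def read(self, version=None):
        return self._versions[self.latest() if version is None else version]

    def add_column(self, name, values):
        current = self.read()
        n_rows = len(next(iter(current.values())))
        if len(values) != n_rows:
            raise ValueError("new column must have one value per existing row")
        # Existing columns are reused by reference, not copied or rewritten.
        self._versions.append({**current, name: tuple(values)})
        return self.latest()

    def rollback(self, version):
        # Rolling back is itself a new version that points at the old columns.
        self._versions.append(dict(self._versions[version]))
        return self.latest()


table = VersionedTable({"path": ["a.mp4", "b.mp4"]})
v1 = table.add_column("label", ["cat", "dgo"])  # a bad write: typo in a label
v2 = table.rollback(0)                          # undo it without losing history
print(table.read(v1))
print(table.read())
```

Because each version only records which columns it points at, the new column never touches the existing "path" data, and the bad version remains available for inspection after the rollback.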

The trick: it is boring to describe and thrilling to use, which is the correct order for infrastructure.

The short, fast history of a quiet company

FROM A FORMAT REWRITE TO HALF THE AI LABS · 2021 → 2025

2021

Founded: Chang She and Lei Xu start LanceDB, backed by Y Combinator (W22). The plan: a new open format for AI data, written in Rust.

2024 · May

$8M raised, $11M total: CRV leads, with Essence VC and Swift. The world learns Midjourney and Character.AI are already customers.

2025 · Jun

$30M Series A: Theory Ventures leads; CRV, Y Combinator, Databricks Ventures, Runway, Swift and Zero Prime join. Total funding ~$41M.

2025

20M+ downloads: Lance is cited as the fastest-growing format in the data ecosystem. The Multimodal Lakehouse becomes the headline product.

Note: Four years from "new file format" to "the thing Databricks Ventures wanted on its cap table." Databases are not supposed to move this fast.

The Proof · Receipts, not adjectives

The customer list reads like a who's-who of things that scared you in 2024

You can claim a database is fast. It is more convincing when the companies generating the world's most demanding multimodal data quietly standardize on it. LanceDB's roster includes Midjourney, Runway, Character.AI, World Labs, Harvey, Hex, ByteDance's Volcano Engine, UBS, Netflix, Second Dinner and the self-driving outfit WeRide.

20M+ · OSS DOWNLOADS
$41M · TOTAL RAISED
~70% · TRAINING MFU
100x · RANDOM ACCESS

The WeRide story is the one that sticks. The autonomous-driving team reported roughly a 90x jump in ML developer productivity after moving to Lance, with a data-mining task that used to take a week collapsing to about an hour. That is not a benchmark in a vacuum. That is a week of human attention handed back, every time.

WeRide: one task, before and after Lance

DATA-MINING TIME · REPORTED CUSTOMER FIGURES (APPROX.)

Legacy stack: ~1 week
On Lance: ~1 hour

Read it sideways: the legacy bar is a workweek. The Lance sliver is a coffee break. Same task.

Runway didn't just use the database. It joined the Series A. The most honest review a customer can write is a wire transfer. - On the cap table

Backers: Theory Ventures, CRV, Y Combinator, Databricks Ventures, Runway, Swift Ventures, Zero Prime.

The Mission · What it's all for

Make multimodal data as boring as a spreadsheet

The goal LanceDB keeps repeating is unglamorous on purpose: make embeddings, images, video and documents as easy to store, search and train on as ordinary tables. Boring is the ambition. Boring means nobody is awake at 3am. Boring means the engineer spends the week on the model instead of the pipeline.

One open lakehouse where every modality of AI data lives together - no stitching, no plumbing, no separate tool for each shape of data. - THE LANCEDB THESIS

The business follows the open-core script that the data world has learned to trust: give away the Lance format and the LanceDB libraries, let adoption compound, and charge enterprises for the managed Multimodal Lakehouse when their workloads outgrow a laptop. Twenty million downloads suggest the giving-away half is working.

Translation: the dream is a database so unremarkable you forget it's there. Which is the highest compliment infrastructure can earn.

Tomorrow · The bet that's still open

If AI keeps getting hungrier, the kitchen has to change

The world generated something like 37 zettabytes of data in 2018 and was on track for roughly 156 by the mid-2020s, most of it video. Every one of those bytes is a candidate to train, search or retrieve against. The systems we built for the tabular era were not designed for that, and quietly admitting it is the first useful step.

LanceDB's wager is that the next decade of AI is bottlenecked less by model architectures and more by the unglamorous job of feeding them. Whoever owns the open format underneath multimodal data owns a very large, very load-bearing piece of the future. Lance being called the fastest-growing format in the ecosystem is either a head start or a coincidence, and the cap table is betting head start.

The model gets the headline. The data layer gets the bill - and increasingly, the leverage. - The closing argument

Back to 3am. The model is still training on those few million video clips. The pipeline still isn't on fire. The engineer is asleep. Somewhere under all of it, a database written in Rust by the guy who helped build pandas is holding the whole thing up - searched, versioned, and entirely unbothered.

That is the LanceDB version of success. Not applause. Just a system doing exactly what it promised, while everyone gets to think about something more interesting. Boring, on purpose, at scale.

Five things worth knowing

Co-founder Chang She was one of the original co-authors of pandas - the library a generation of data scientists treats as a law of physics.
Lance lets you version data like Git, with ACID transactions running on images and video, not just rows.
Co-founder Lei Xu came from the data infrastructure team at self-driving startup Cruise.
By 2025, video made up roughly 90% of the world's generated data - and almost no database was designed for it.
You can convert a Parquet dataset to Lance in two lines of code, which is either magic or just good engineering.
