The wordmark that quietly slipped onto the cap tables of half the AI labs you've heard of. No mascot, no fireworks - just a database that does the job.
The open-source lakehouse teaching databases to handle video, images and embeddings as easily as a spreadsheet handles numbers.
Somewhere right now a model is training on a few million video clips, and an engineer is not awake at 3am babysitting the pipeline that feeds it. That is the whole pitch. LanceDB sits underneath the AI you actually use - the image generators, the chatbots, the self-driving stacks - holding the messy, enormous, multimodal data those systems eat for breakfast.
It does not have a flashy demo. It is a file format and a database. But it has wedged itself into the plumbing of Midjourney, Runway, Character.AI and a long list of companies that move petabytes of training data and search across billions of vectors. By mid-2025 its open-source packages had been downloaded more than 20 million times. That is the kind of number you get by being useful, not loud.
Above: A database with no logo animation and no growth-hack hoodie. Suspicious, in the best way.
Here is the uncomfortable truth the data world spent a decade not saying out loud: the tools we built for tidy tables of numbers and strings are terrible at images, audio and video. And by 2025, video alone made up roughly 90% of the world's generated data - the kind of statistic that should have triggered a panic and mostly triggered a shrug.
So AI teams improvised. They kept their raw files in one place, their embeddings in a vector store, their features in a pipeline, their training data in yet another bucket. Five tools, four formats, and a standing army of engineers whose actual job was moving data between them. It worked, the way duct tape works.
LanceDB's bet is that this stitching is not a fact of life. It is a missing layer. Build one open format that natively understands every modality, and the plumbing problem - the thing eating the calendar of every ML team - stops being a problem.
The tension: 90% of data is video. Almost no database was designed for it. Someone had to volunteer.
If you have ever written import pandas as pd, you have used Chang She's work. He was one of the original co-authors of pandas, the library that a generation of data scientists treats as gravity. He spent two decades building the tools humans use to wrangle data. LanceDB is his answer to a newer question: what do the machines need?
His co-founder, Lei Xu, came from the data infrastructure team at Cruise, where the data was not tidy spreadsheets but oceans of sensor and video logs from cars trying not to hit things. Between them they had seen both ends of the problem - the friendly tabular world and the brutal multimodal one - and concluded they should not be two worlds at all.
Chang She: Original co-author of pandas, ex-CTO of DataPad, ex-VP Eng at Tubi. Now arguing that the spreadsheet needs a sequel.
Lei Xu: Built data infrastructure for self-driving at Cruise, where "the dataset" meant petabytes of video that refused to behave.
The bet, made in 2021 and backed first by Y Combinator and CRV: write a new open columnar format - Lance - from scratch in Rust, optimized for the random-access, version-everything, search-anything reality of AI. Not a fork. Not a wrapper. A new foundation.
Above: Two founders who looked at "this is just how AI data works" and decided it was a typo.
LanceDB is not one thing, which is precisely the point. It is a stack designed so that a clip of video and a row of numbers can live in the same place and play by the same rules.
Lance, the format: An open columnar format on Apache Arrow. Convert from Parquet in two lines for up to 100x faster random access, plus vector indexing and Git-style data versioning baked in.
LanceDB, the database: A developer-friendly, embedded/serverless vector database. Vector, full-text and hybrid search with SQL filtering - runs on your laptop or your cloud, fully open source.
Multimodal Lakehouse, the enterprise tier: Feature engineering with Python UDFs, curation, dedup, and training straight from curated data at up to ~70% model FLOPs utilization (MFU). One system, every modality.
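For readers who want to see what "vector search with SQL filtering" means mechanically, here is a minimal sketch in plain Python. It is a toy, not LanceDB's API or implementation: the rows, the `search` helper and the `duration_s` column are all hypothetical, and a real engine would use an index rather than scanning every row. The idea it shows is simply that an ordinary column filter and a similarity ranking run against the same table.

```python
import math

# Toy in-memory "table": each row carries an embedding plus ordinary columns.
# All names and values are hypothetical illustrations.
ROWS = [
    {"id": "clip-a", "vector": [1.0, 0.0], "duration_s": 12},
    {"id": "clip-b", "vector": [0.9, 0.1], "duration_s": 95},
    {"id": "clip-c", "vector": [0.0, 1.0], "duration_s": 20},
    {"id": "clip-d", "vector": [0.7, 0.7], "duration_s": 8},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(rows, query, predicate, limit):
    """Apply the column filter first, then rank the survivors by similarity."""
    candidates = [r for r in rows if predicate(r)]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["vector"], query),
        reverse=True,
    )
    return [r["id"] for r in ranked[:limit]]


# Rough equivalent of: nearest to the query WHERE duration_s < 30 LIMIT 2
hits = search(ROWS, query=[1.0, 0.0], predicate=lambda r: r["duration_s"] < 30, limit=2)
print(hits)
```

Note that `clip-b` is a close match to the query but never appears in the results, because the filter on `duration_s` removes it before ranking.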
What can you actually do with it? Keep a media data lake that you can search by meaning, not filename. Run embedding pipelines without exporting to a separate store. Deduplicate and curate training sets in place. Evolve a schema - add a column to billions of rows - without rewriting the dataset. Version your data so a bad training run is a rollback, not a forensic investigation.
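The versioning and schema-evolution claims are easier to picture with a small model. The sketch below is a toy in plain Python, assuming nothing about how Lance actually stores data: `VersionedTable` and its methods are hypothetical names, and it copies whole rows on every write for simplicity, which is exactly the rewrite a real columnar format is designed to avoid. What it does illustrate is the workflow: every write produces a new version, old versions stay readable, and a rollback is one call.

```python
class VersionedTable:
    """Toy versioned table: every write records a new immutable snapshot."""

    def __init__(self, rows):
        self._versions = []
        self._commit(rows)  # becomes version 1

    @property
    def version(self):
        """Number of the latest version (versions start at 1)."""
        return len(self._versions)

    def rows(self, version=None):
        """Return a copy of the rows at the given version (latest by default)."""
        snapshot = self._versions[(version or self.version) - 1]
        return [dict(r) for r in snapshot]

    def append(self, new_rows):
        self._commit(self.rows() + list(new_rows))

    def add_column(self, name, fn):
        """Schema evolution: derive a new column for every existing row."""
        self._commit([{**r, name: fn(r)} for r in self.rows()])

    def restore(self, version):
        """Roll back by committing an old snapshot as the newest version."""
        self._commit(self.rows(version))

    def _commit(self, rows):
        self._versions.append(tuple(dict(r) for r in rows))


table = VersionedTable([{"clip": "a.mp4"}, {"clip": "b.mp4"}])
table.add_column("is_short", lambda r: True)                # version 2
table.append([{"clip": "corrupt.mp4", "is_short": False}])  # version 3: bad batch
table.restore(2)                                            # version 4: rollback

print(table.version, [r["clip"] for r in table.rows()])
```

Because the rollback is itself a new version, the bad batch is still inspectable at version 3 after the restore, which is what turns a forensic investigation into a lookup.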
The trick: it is boring to describe and thrilling to use, which is the correct order for infrastructure.
Note: Four years from "new file format" to "the thing Databricks Ventures wanted on its cap table." Databases are not supposed to move this fast.
You can claim a database is fast. It is more convincing when the companies generating the world's most demanding multimodal data quietly standardize on it. LanceDB's roster includes Midjourney, Runway, Character.AI, World Labs, Harvey, Hex, ByteDance's Volcano Engine, UBS, Netflix, Second Dinner and the self-driving outfit WeRide.
The WeRide story is the one that sticks. The autonomous-driving team reported roughly a 90x jump in ML developer productivity after moving to Lance, with a data-mining task that used to take a week collapsing to about an hour. That is not a benchmark in a vacuum. That is a week of human attention handed back, every time.
Chart: time for WeRide's data-mining task, before and after Lance. The orange bar is a workweek. The teal sliver is a coffee break. Same task.
Backers: Theory Ventures, CRV, Y Combinator, Databricks Ventures, Runway, Swift Ventures, Zero Prime.
The goal LanceDB keeps repeating is unglamorous on purpose: make embeddings, images, video and documents as easy to store, search and train on as ordinary tables. Boring is the ambition. Boring means nobody is awake at 3am. Boring means the engineer spends the week on the model instead of the pipeline.
"One open lakehouse where every modality of AI data lives together - no stitching, no plumbing, no separate tool for each shape of data." - The LanceDB thesis
The business follows the open-core script that the data world has learned to trust: give away the Lance format and the LanceDB libraries, let adoption compound, and charge enterprises for the managed Multimodal Lakehouse when their workloads outgrow a laptop. Twenty million downloads suggest the giving-away half is working.
Translation: the dream is a database so unremarkable you forget it's there. Which is the highest compliment infrastructure can earn.
The world generated something like 37 zettabytes of data in 2018 and was on track for roughly 156 by the mid-2020s, most of it video. Every one of those bytes is a candidate to train, search or retrieve against. The systems we built for the tabular era were not designed for that, and quietly admitting it is the first useful step.
LanceDB's wager is that the next decade of AI is bottlenecked less by model architectures and more by the unglamorous job of feeding them. Whoever owns the open format underneath multimodal data owns a very large, very load-bearing piece of the future. Lance being called the fastest-growing format in the ecosystem is either a head start or a coincidence, and the cap table is betting head start.
Back to 3am. The model is still training on those few million video clips. The pipeline still isn't on fire. The engineer is asleep. Somewhere under all of it, a database written in Rust by the guy who helped build pandas is holding the whole thing up - searched, versioned, and entirely unbothered.
That is the LanceDB version of success. Not applause. Just a system doing exactly what it promised, while everyone gets to think about something more interesting. Boring, on purpose, at scale.