Co-founder and CEO of LanceDB. Quant desk to Cloudera to Tubi to a database built for the messy, multimodal way AI actually sees the world.
Ask Chang She what LanceDB is really about and you get a complaint disguised as a mission: working with a table of numbers is easy, and working with embeddings, images, and video is not. Decades into the data era, that gap never closed. His whole company is an argument that it should.
Today he is co-founder and CEO of LanceDB, the company behind the open-source Lance columnar format and a database built for multimodal AI. Its users are the names attached to the current wave of generative models. Runway, Midjourney, and Character.ai store and search their training data on Lance, some of them across petabytes and tens of billions of vectors. In June 2025 the company raised a $30M Series A led by Theory Ventures, with CRV, Y Combinator, Databricks Ventures, and Runway among the backers.
The pitch is blunt about what it wants to unseat. "None of that really fit into the traditional data stack: Pandas, Spark, Parquet, even Arrow," She has said. That is a striking sentence coming from him, because he helped build the first name on that list.
Alongside the Series A, LanceDB launched what She calls the Multimodal Lakehouse: one platform meant to hold every kind of AI data - text, embeddings, images, audio, video - and to serve every workload against it, from semantic search to feature engineering to model training. His objection to the current wave of tooling is that it is too small. "Vector databases tend to be very narrow solutions for a very narrow problem," he says. He is not trying to add a feature. He is trying to move the floor.
The reason is scale, and his favorite line for it has stuck: "We often say a trillion is the new billion. We have folks operating at roughly a thousand times the scale they were at just a year or two ago." AI agents, in his telling, will use however much capacity you hand them, so the honest move is to build for a size that looks absurd today.
Before the lakehouse: a builder who never quite left the data behind.
“In five years, ‘multimodal’ won’t even be a word anymore. It’ll just be data.”
- Chang She
The origin is almost too neat. Around 2006, Chang She was a research associate at the hedge fund AQR Capital Management. His colleague and roommate was Wes McKinney. "At that time, data scientist wasn't really a job title," She recalls. One day McKinney walked over with something he had been building in Python and said, in effect, look at this.
That something became pandas. She was one of its earliest users, evangelizing it inside the firm before it was even open-sourced, and he is credited as a co-author of the library that a generation of analysts would later learn on. It is a rare thing to be present at the birth of a foundational tool. It is rarer still to spend the next twenty years trying to build the thing that comes after it.
He came to code sideways. At MIT he took degrees in electrical engineering and computer science and in political science. He began his career in quantitative finance and now describes himself as "a former quant researcher-trader turned developer of data science platforms and tools." The math training left a fingerprint on how he thinks: "In math, it's like you always try to reduce a problem to a previously known or solved state."
MIT: Degrees in EECS and political science, including a master's in EECS.
AQR Capital Management: Research associate. Roommate Wes McKinney shows him an early pandas.
Co-founds the commercial vehicle built around pandas.
DataPad: Co-founder and CTO with McKinney as CEO. Acquired by Cloudera in 2014.
Cloudera: Engineering manager, leading Cloudera Navigator.
Tubi: VP of Engineering. Recommenders, ML-ops, experimentation - and the multimodal wall.
LanceDB: Co-founds with CTO Lei Xu. Y Combinator W22.
Series A: $30M led by Theory Ventures. Multimodal Lakehouse ships.
Between pandas and LanceDB there was a decade of shipping other people's infrastructure. DataPad, the Python-stack data product he co-founded with McKinney, was acquired by Cloudera in 2014. He stayed on to run engineering for Cloudera Navigator. Then came Tubi, the ad-supported streaming service, where he was VP of Engineering.
"I was VP of Engineering at Tubi, where I built a lot of the recommendation systems, ML-ops systems, and experimentation systems," he says. That is where the abstract complaint became a daily one. Recommenders live on embeddings. Streaming lives on video, images, audio, and subtitles. And every one of those objects had to be bent, awkwardly, into tools designed for rows and columns.
He had spent his career inside the tabular data stack. At Tubi he watched it fail to hold the shape of the data he actually had. The response, in 2022, was LanceDB, co-founded with Lei Xu, a former core contributor to HDFS who had led ML infrastructure at Cruise. They went through Y Combinator's W22 batch and launched the product publicly on May 1, 2023.
The wager underneath it: enough people were hitting the same wall that a better foundation would spread on its own. It did. By June 2025 the open-source project reported more than 20 million downloads, and the customer list had filled with the labs defining generative AI.
Why is working with embeddings, images, and video still so difficult, when compared to tabular data?
I've been building data and machine learning tooling for almost two decades at this point.
Vector databases tend to be very narrow solutions for a very narrow problem.
A trillion is the new billion. Folks are operating at roughly a thousand times the scale they were at a year or two ago.
None of that really fit into the traditional data stack: Pandas, Spark, Parquet, even Arrow.
In math, you always try to reduce a problem to a previously known or solved state.
His handle is @changhiskhan - a pun on Genghis Khan that has quietly outlasted several of his companies.
Among his GitHub repos sits a fork of an Advantage360 ergonomic mechanical keyboard config. The hardware itch runs deep.
He posts satirical engineering jokes on X, once riffing that LanceDB should be rewritten in assembly for "bare metal performance."
LanceDB did not arrive with a marketing blitz. It arrived as a file format and a library that solved a specific pain well enough that engineers passed it along. That is a strategy She has run before. pandas spread the same way, as did the open-source projects he shipped in the years between. Give people something that removes friction from their day and the adoption curve takes care of itself.
The pattern shows up in the numbers. By June 2025 the open-source project had crossed 20 million downloads, and the paying customers were not logos chasing a trend but teams operating at the edge of what the old tooling could bear. When Midjourney, Runway, and Character.ai standardize on your storage layer, it is because the alternative broke first.
His diagnosis of the industry's development experience is unsentimental. Machine learning engineers, he argues, are "often stuck with a subpar development experience," and "AI teams are spending most of their time dealing with low-level data infrastructure details." Lance is his attempt to hand those hours back. The Series A investor list reads like a vote from people who know the terrain: Theory Ventures led, with CRV, Y Combinator, Databricks Ventures, and Runway alongside, and angels including his old collaborator Wes McKinney.
There is a lesson buried in the DataPad chapter that seems to guide him still. A polished product can be acquired and absorbed; a foundational format embeds itself into how an entire field works. The second time around, he built the format first.
“AI teams are spending most of their time dealing with low-level data infrastructure details.”
- Chang She, TechCrunch, 2024
Runway. Midjourney. Character.ai. Petabytes of training data, tens of billions of vectors, one open format underneath.
His stated mission is to build "the most efficient and scalable data platform for AI applications." The bet is that the roughly twenty-year-old stack - the one he helped start - is buckling under AI workloads, and that narrow vector databases are a patch, not an answer. The fix he wants is a single lakehouse that holds every data type and serves every workload, batch and real-time, from search to training.
There is a tidy symmetry in it. He was in the room when pandas made messy data feel manageable for a generation of analysts. Now he is trying to do the same for the messier, richer data that machines learn from. Same instinct, larger canvas.