Scene-Setter
Somewhere in a government data center, a machine is watching two million hours of footage.
It takes two days. It doesn't get tired. It doesn't miss a detail. And when it's done, it can tell you exactly when the gray sedan entered the frame in camera 47, what the on-screen text said in the broadcast clip from three years ago, and where in the archive you'll find the other nine instances of that same face.
This isn't a science fiction scenario. It happened. And TwelveLabs built the infrastructure that made it possible.
The San Francisco startup occupies a strange and valuable corner of the AI landscape - one that most foundation model companies walked past without stopping. Video is the internet's dominant medium. It's also, historically, completely opaque to machines. You can search for the title of a video. You can search for keywords in a transcript. But searching for what actually happens inside a video? That was a human's job - until TwelveLabs decided it didn't have to be.
"Video is the richest form of human communication. The world runs on it. And yet, until recently, machines were essentially illiterate when it came to video."
- The core bet behind TwelveLabs
The Problem
800 million hours of video are uploaded every day. Almost none of it is searchable.
The scale is genuinely hard to process. Every day, humans upload enough video to keep a single person watching for over 91,000 years. This content contains faces, events, objects, conversations, emotions, and meaning - but standard search systems can only skim the surface: titles, tags, transcripts added by humans at enormous cost.
The result is that most enterprise video is dark data. Broadcasters have archives of thousands of hours they can barely navigate. Security teams watch footage in real time because indexing it after the fact has been impossible. Sports organizations manually clip highlights. Media companies pay editors to log content by hand. And government agencies - sitting on petabytes of surveillance video - have been unable to process what they hold.
Existing AI tools couldn't close the gap. Speech-to-text handles audio. Object detection flags pixels. But neither understands what's actually happening in a scene - the context, the sequence, the meaning. That required something closer to genuine video comprehension.
The Founding Bet
Jae Lee came from an unusual place for a startup founder: the South Korean Ministry of Defense.
Before TwelveLabs, Lee worked as a lead data scientist for the South Korean government, applying machine learning to national-scale problems. He graduated from UC Berkeley with a degree in computer science, interned at Amazon and Samsung, and somewhere along the way concluded that the world was generating more video than any team of humans could ever process - and that the tools to handle it simply didn't exist yet.
He co-founded TwelveLabs in 2021 with four others: Aiden Lee, Dave Chung (who became COO), SJ Kim, and Soyoung Lee. The founding team had a mix of ML research depth and engineering rigor, and a clear target: build foundation models that understand video the way humans do - not just from pixels, but from the combined signal of what you see, hear, and read on screen.
That multimodal approach - vision, audio, and text together - was the core technical bet. And it turned out to be the right one.
"The founders didn't just want to process video. They wanted machines to watch it the way a film editor would - with attention to narrative, motion, context, and time."
- TwelveLabs founding thesis
Company Milestones
TwelveLabs incorporated in San Francisco. Research begins on video-native multimodal AI.
Raises from Index Ventures, Radical Ventures, and Korea Investment Partners. Launches developer API platform to first 10,000+ users.
Series A extension led by NEA and NVIDIA NVentures. Validates video AI as a major frontier.
Presents TWLV-I video foundation model evaluation framework at NeurIPS 2024. Third straight year on CB Insights AI 100.
Databricks, SK Telecom, Snowflake Ventures, HubSpot Ventures, and In-Q-Tel invest. Yoon Kim (ex-Apple Siri, SK Telecom CTO) joins as President.
Major funding round. Hwang Dong-hyuk - creator of Squid Game - personally invests $3M via Firstman Studio.
Marengo 3.0 launches on Amazon Bedrock. Processes up to 4 hours of video per API call. Pegasus 1.5 adds video segmentation and structured outputs.
The Product
Two models. One very large idea about what video can become.
TwelveLabs ships two foundation models that work as a pair. Marengo handles embedding and retrieval - given a video library, it indexes everything into a searchable vector space, so you can query with natural language ("find moments where the quarterback is sacked from the left side") or with an image ("find scenes that look like this reference frame"). Marengo 3.0 handles up to four hours of video, or files up to 6 GB, in a single call, understands cinematography cues like zoom and pan, and outperforms the major cloud providers' models on standard video understanding benchmarks.
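To make the retrieval idea concrete, here is a minimal sketch of searching a vector space. It is not TwelveLabs code and does not call the TwelveLabs API. It assumes clips and queries have already been turned into vectors by some embedding model; the clip IDs, the three-dimensional vectors, and the function names `cosine_similarity` and `search` are all hypothetical, chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """Return the top_k (clip_id, score) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), clip_id)
              for clip_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(clip_id, score) for score, clip_id in scored[:top_k]]

# Hypothetical index: clip ID -> toy embedding. Real embeddings would
# have hundreds or thousands of dimensions.
index = {
    "game7_00:12:31": [0.9, 0.1, 0.0],
    "game7_00:45:02": [0.2, 0.8, 0.1],
    "press_conf_00:03:10": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a text or image query.
query = [1.0, 0.2, 0.0]

results = search(query, index)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

The point of the sketch is that text and image queries work the same way once everything lives in one vector space: the query becomes a vector, and search is a nearest-neighbor lookup. A library of real size would swap the linear scan for an approximate nearest-neighbor index.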
Pegasus handles generation. You give it a video; it gives you back structured prose, summaries, transcripts, timestamps, and - in its 1.5 version - segmented JSON output that can break a broadcast into labeled editorial units automatically. No pre-indexing required. Pegasus can watch a three-hour interview and return a structured breakdown of every topic change, speaker shift, and brand appearance.
Both are available via API, through the TwelveLabs platform, and via Amazon Bedrock. For government customers with strict data requirements, there's an on-premises option through a partnership with Vast Data. The developer experience is clean: a Python SDK (pip install twelvelabs), a straightforward API, and growing documentation.
The Proof
The customer list tells you who takes this seriously.
TwelveLabs has 30,000 developers on its platform - a number that covers everything from individual experimenters building side projects to major enterprise deployments. The verticals are telling: media and entertainment, sports analytics, government and security, and increasingly, defense and intelligence.
One government client used TwelveLabs to index two million hours of footage in two days. That's 228 years of continuous video, organized and queryable in 48 hours. The practical implications for evidence management, intelligence review, and security operations are significant enough that In-Q-Tel - the venture arm of the US Intelligence Community, which funds technology it wants available to federal agencies - became an investor.
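The arithmetic behind that claim is easy to check. The short sketch below converts hours of footage into years of continuous viewing and works out how much footage was indexed per hour of wall-clock time. The function names are illustrative, and the conversion assumes a 365-day year.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: ignores leap days

def hours_to_years(hours):
    """Years of nonstop viewing needed to watch `hours` of video."""
    return hours / HOURS_PER_YEAR

def indexing_rate(footage_hours, wall_clock_hours):
    """Hours of footage processed per hour of elapsed time."""
    return footage_hours / wall_clock_hours

footage_hours = 2_000_000   # figure cited in the article
wall_clock_hours = 48       # two days

years = hours_to_years(footage_hours)
rate = indexing_rate(footage_hours, wall_clock_hours)

print(f"{years:.0f} years of continuous video")
print(f"{rate:,.0f} hours of footage indexed per wall-clock hour")
```

This prints 228 years and roughly 41,667 hours of footage per elapsed hour. Applied to the 800 million hours uploaded daily, the same conversion gives a little over 91,000 years.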
Sports organizations use TwelveLabs for real-time gameplay analysis and highlight generation. Media companies automate content moderation, ad insertion timing, and social clip creation. Broadcasters search their archives with natural language instead of relying on manual logging. The Ecosystem Partner Program, launched in late 2025, lists CineSys, ScorePlay, Quickplay, Overcast HQ, and a growing roster of media infrastructure companies building on top of TwelveLabs models.
Hwang Dong-hyuk - the director and creator of Netflix's Squid Game, South Korea's most globally viewed series - personally invested $3 million in TwelveLabs in October 2025. Apparently when you build the world's best video AI, the people who make the world's most-watched video take notice.
The Investors
The cap table is a who's who of organizations that understand what video is worth at scale.
NVIDIA invested through NVentures - TwelveLabs runs on GPU infrastructure, which makes it a major customer for serious compute. Databricks and Snowflake invested strategically: their businesses run on enterprise data pipelines, and video is increasingly part of those pipelines. Databricks explicitly integrated TwelveLabs into its data intelligence platform.
SK Telecom's participation signals APAC expansion with a trusted carrier partner. HubSpot Ventures signals interest in video-powered sales and marketing intelligence. And In-Q-Tel means TwelveLabs has a pathway into federal agencies without the usual procurement friction - the company is already positioned as a trusted supplier to the national security apparatus.
New Enterprise Associates, Index Ventures, and Radical Ventures round out a list of mainstream institutional backers who see this as a foundational AI infrastructure play, not a niche vertical tool.
"When the CIA's VC arm, the GPU company, the data platform duopoly, and the Squid Game director are all in your cap table - you've either built something genuinely important, or you've thrown the most interesting dinner party in venture history."
- Reading the TwelveLabs investor list
Mission and Why It Matters
The mission is deceptively simple: make video as searchable as text.
The implications are not simple at all. If video becomes fully queryable - if you can ask "find every instance where safety protocol wasn't followed" or "show me all clips where this product is visible in the background" - entire industries change. News archives become navigable research databases. Security footage becomes actionable intelligence. Training data for AI becomes scalable. Corporate video libraries stop being storage costs and start being knowledge assets.
TwelveLabs positions its models as the infrastructure layer for this shift - the same way AWS is infrastructure for web applications, or Stripe is infrastructure for payments. They're not building the end applications; they're building the platform those applications run on. The TWLV-I evaluation framework, published at NeurIPS 2024, is part of this strategy: establishing the benchmarks for video understanding makes TwelveLabs the reference point others get measured against.
The competition includes Google, Meta, and OpenAI, all of which have video understanding capabilities bundled into larger multimodal systems. The TwelveLabs argument - and it's a solid one - is that video-native specialization beats video-adjacent generalization. A model trained specifically to understand the temporal dynamics of video, the relationship between audio and image, and the semantics of motion will outperform a model that handles video as one task among dozens.
Coming Full Circle
Back in that government data center, the machine has finished its work.
Two million hours of footage. Indexed, searchable, queryable by natural language. The analysts who used to spend weeks manually reviewing footage can now ask questions and get answers in seconds. The evidence that would have taken months to locate is a plain-language query away.
That's the difference TwelveLabs is making - not in a press release sense, but in a practical, measurable, operational sense. The world generates more video than any human workforce could ever process. TwelveLabs built the machine that can.
The company is four years old, has 170 people, and has raised $110 million. It's on its third-generation models, has Marengo running on Amazon Bedrock, and Pegasus producing structured video intelligence at enterprise scale. The Squid Game guy put money in. So did NVIDIA. So did the US intelligence community.
That's not a coincidence. That's a signal.