London • Founded 2020 • LLMOps
The company that made AI teams look like they had their act together
An enterprise LLM infrastructure platform built by UCL PhD students, backed by Index Ventures and Y Combinator, used by companies like Duolingo and Gusto - and eventually absorbed by Anthropic, the very AI lab whose models it helped teams wrangle.
The Origin Story
University research labs have given the world remarkable things. The World Wide Web came out of CERN. Google came out of Stanford. And in 2020, a team of machine learning PhD students at University College London decided they were tired of academia's pace and wanted to build something that worked.
Raza Habib and Peter Hayes, deep in their ML research at UCL, co-founded Humanloop through the Portico Ventures spinout scheme - the university's mechanism for turning research into real companies. Jordan Burgess, who had already been building and selling things since he was 15, joined as the third co-founder and product brain.
The founding thesis was clear enough: enterprise teams were starting to build with large language models, and they were discovering that building with LLMs was nothing like building with traditional software. You couldn't unit test a prompt the way you'd test a function. You couldn't version control it in Git. You couldn't tell, after a model update, whether your application had gotten better or quietly worse. These were not minor inconveniences. They were the things that kept AI projects from reaching production.
Humanloop set out to fix that. Three months after joining Y Combinator's Summer 2020 batch - the batch that ran entirely over Zoom - they had their first paying customers.
"Picks and shovels in an AI gold rush."
- Humanloop's own description of their business

The gold rush framing is apt, and Humanloop knew it. When everyone is racing to build AI applications, the people who sell the infrastructure - the tools for evaluation, monitoring, version control, deployment - don't need to pick winners. They benefit regardless of which applications succeed, because all of them need the same plumbing.
The Pivot
In 2022, something shifted. OpenAI released instruction-tuned versions of GPT-3 - the breakthrough that made language models dramatically more useful for practical tasks. Humanloop had been watching the space closely, and they made a call: pivot the company to LLM prompt management.
They moved fast. Ten paying customers signed up within two days. Four months later, ChatGPT launched and the entire world realized that this technology mattered. The timing wasn't luck exactly - it was the product of researchers who understood what was happening in the models before most people understood why it mattered.
By 2024, Humanloop was processing millions of LLM logs daily. The platform had grown into a full suite: prompt version control, automated evaluations, observability, and a model proxy that could route requests across eight different model providers. They were an operating system for teams that were building seriously with AI, not just experimenting.
What made Humanloop distinct in a crowded field was the depth of the evaluation tooling. Running an AI application in production is only half the job. The other half is knowing whether it's doing what you intended, catching regressions when models update, and proving to compliance teams that the system behaves predictably. Humanloop built pipelines for all of that - automated code checks, LLM-as-judge scoring, human review workflows - and wired them into CI/CD so a bad prompt couldn't reach production without triggering an alert.
"They got 10 paying customers in 2 days, and ChatGPT launched just 4 months later, validating their timing."
What It Did
Humanloop's platform addressed the full lifecycle of building AI applications - from writing and versioning prompts to monitoring what happens after deployment.
Git-like version control for prompts - diff views, staging vs. production environments, a collaborative UI that let engineers and non-technical teammates both work on the same prompts. Humanloop invented the .prompt file format to bring prompts into standard Git workflows. Think of it as making your prompts as manageable as your code.
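The exact schema of the .prompt format isn't documented here, but the core idea - treating a prompt as a diffable, versionable text artifact - can be sketched with Python's standard difflib. The file layout below is a hypothetical illustration, not Humanloop's actual spec:

```python
import difflib

# Two versions of a hypothetical prompt file (illustrative layout only;
# not the real .prompt schema).
v1 = """model: gpt-4
temperature: 0.2
---
You are a helpful support agent. Answer briefly.
"""
v2 = """model: gpt-4
temperature: 0.0
---
You are a helpful support agent. Answer briefly and cite sources.
"""

def prompt_diff(old: str, new: str) -> str:
    """Produce a unified, Git-style diff between two prompt versions."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="staging/.prompt",
            tofile="production/.prompt",
        )
    )

print(prompt_diff(v1, v2))
```

The same diff view that makes code review possible makes prompt review possible: a reviewer sees exactly which line changed between staging and production.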
Automated evaluation pipelines using code checks, LLM-as-judge scoring, or human review. Integrated directly into CI/CD: if your latest prompt change drops quality below a threshold, the deploy gets blocked. Includes dataset version control and custom metrics so teams can define what "good" actually means for their specific application.
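The gating logic can be sketched in a few lines. This is an illustrative quality gate, not Humanloop's API: the threshold, the exact-match check, and the dataset shape are all hypothetical stand-ins for whatever "good" means for a given application:

```python
# Hypothetical minimum pass rate before a deploy is allowed.
QUALITY_THRESHOLD = 0.8

def exact_match(output: str, expected: str) -> bool:
    """Simplest possible code check: normalized exact match."""
    return output.strip().lower() == expected.strip().lower()

def evaluate(outputs, dataset) -> float:
    """Fraction of dataset examples the candidate outputs pass."""
    passed = sum(
        exact_match(out, ex["expected"]) for out, ex in zip(outputs, dataset)
    )
    return passed / len(dataset)

def ci_gate(outputs, dataset) -> bool:
    """Return True if the deploy may proceed (score >= threshold)."""
    score = evaluate(outputs, dataset)
    if score < QUALITY_THRESHOLD:
        print(f"Deploy blocked: score {score:.2f} < {QUALITY_THRESHOLD}")
        return False
    print(f"Deploy allowed: score {score:.2f}")
    return True

dataset = [
    {"input": "2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]
ci_gate(["4", "Paris"], dataset)  # passes: score 1.00
ci_gate(["4", "Lyon"], dataset)   # blocked: score 0.50
```

In a real pipeline the exact-match check would be swapped for an LLM-as-judge call or a human review step, but the CI wiring is the same: score, compare to threshold, block or allow.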
Built on OpenTelemetry, the Flows feature traced multi-step AI applications from start to finish - tracking production logs, cost, token usage, and latency across the full pipeline. When something went wrong with a specific user interaction, teams could deep-link to that conversation and replay what happened. Human review spot-checks were built in.
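The shape of that tracing can be sketched without the OpenTelemetry SDK. This is an illustrative flow tracer, not Humanloop's Flows API - the class, field names, and cost figures are invented for the example:

```python
import time
from dataclasses import dataclass, field

@dataclass
class FlowTrace:
    """Toy trace of a multi-step AI pipeline: latency, tokens, cost per step."""
    steps: list = field(default_factory=list)

    def record(self, name, fn, tokens=0, cost_usd=0.0):
        """Run one pipeline step and log its latency alongside usage."""
        start = time.perf_counter()
        result = fn()
        self.steps.append({
            "step": name,
            "latency_s": time.perf_counter() - start,
            "tokens": tokens,
            "cost_usd": cost_usd,
        })
        return result

    def summary(self):
        """Aggregate usage across the whole flow."""
        return {
            "total_tokens": sum(s["tokens"] for s in self.steps),
            "total_cost_usd": sum(s["cost_usd"] for s in self.steps),
        }

trace = FlowTrace()
trace.record("retrieve", lambda: ["doc1"])
trace.record("generate", lambda: "answer", tokens=120, cost_usd=0.002)
print(trace.summary())
```

A production system would export these spans via OpenTelemetry so they can be queried, alerted on, and deep-linked from a specific conversation - the sketch just shows what gets measured.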
A unified proxy sitting in front of eight-plus model providers: OpenAI, Anthropic, Google Cloud, AWS Bedrock, Azure. Same-day SLA for adding new models. Switch providers without rewriting your application, test one model against another, manage API keys in one place. Useful when the best model for your task is not always the same model.
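The value of a proxy like this is that the call site never changes. A minimal sketch of the routing idea, assuming hypothetical provider functions (these stand in for real SDK calls; none of this is Humanloop's actual proxy code):

```python
# Stand-ins for real provider SDK calls (hypothetical).
def call_openai(prompt: str) -> str:
    return f"[openai] {prompt}"

def call_anthropic(prompt: str) -> str:
    return f"[anthropic] {prompt}"

# One registry maps model identifiers to backends; adding a provider
# is one entry here, with no changes anywhere else.
PROVIDERS = {
    "openai/gpt-4": call_openai,
    "anthropic/claude": call_anthropic,
}

def complete(model: str, prompt: str) -> str:
    """Route a request to the configured provider without changing callers."""
    try:
        backend = PROVIDERS[model]
    except KeyError:
        raise ValueError(f"Unknown model {model!r}") from None
    return backend(prompt)

# Switching providers is a one-string change at the call site:
print(complete("openai/gpt-4", "hello"))
print(complete("anthropic/claude", "hello"))
```

This is also what makes side-by-side model testing cheap: run the same prompt through two registry entries and compare the outputs.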
Case Study
Gusto, the payroll and HR platform, built an AI support agent called Gus using Humanloop as the underlying infrastructure. The results were significant enough to publish: case deflection - the share of support requests handled by AI without human intervention - went from 10% to 30%. Accuracy improved roughly threefold.
The improvement came from systematic evaluation. Rather than deploying changes to Gus and hoping for the best, Gusto used Humanloop's evaluation pipelines to test prompt changes against historical data before pushing them live. Bad changes didn't reach customers. Good changes could be validated quickly. The feedback loop got shorter, and the product got better faster.
This is what "LLMOps" actually means in practice: not just logging what your AI does, but having the infrastructure to know whether what it's doing is good, and to change it safely when it isn't.
The People
Raza Habib - ML PhD from UCL. Named Forbes 30 Under 30 Technology in 2022. Listed by Sifted as one of the most influential Gen AI founders in Europe in 2025. While running the company, also hosted a podcast called "High Agency: The Podcast for AI Builders" - because apparently running an AI infrastructure startup wasn't filling enough hours in the day.
Peter Hayes - ML PhD from UCL. Met Raza during their research, which is either a remarkable coincidence or exactly how great infrastructure companies tend to start - two people who understand the underlying technology deeply enough to know what's missing from it.
Jordan Burgess - Built and sold his first website at 15. By the time he joined Humanloop as the product co-founder, he had been building things on the internet for most of his life. That instinct for what users actually need - not what researchers imagine they need - gave the product its sharp edges.
Details That Stick
All three co-founders met through their ML PhDs at UCL. The spinout came via Portico Ventures, UCL's scheme for turning research into companies. Academia's loss; enterprise AI's gain.
In 2022, they spotted GPT-3 instruction-tuning breakthroughs and pivoted to prompt management. Ten paying customers signed up within two days. ChatGPT launched four months later. Timing.
Jordan Burgess built and sold his first website at age 15. By the time he was building Humanloop, he'd had more than a decade of instinct for how products should feel.
Humanloop invented the .prompt file format - a way to store and version prompts inside standard Git repositories. Quietly, this is the kind of thing that shapes how an industry works.
$3.8M ARR. 14 employees. That's $271K revenue per person - exceptional even by SaaS standards. The team ran lean by design, not by accident.
Raza Habib ran a podcast called "High Agency" while running the company. Guests included founders and AI researchers navigating the same landscape his company was building tools for.
Their self-description: "picks and shovels in an AI gold rush." Then Anthropic - one of the gold rush's most serious participants - bought the whole operation and put the team to work inside their enterprise console.
From Lab to Anthropic
WHAT ANTHROPIC ACQUIRED
The product DNA was integrated into Anthropic Console as the "Workbench" and "Evaluations" tabs - which is as close as a startup gets to immortality.
When Anthropic absorbed Humanloop in August 2025, TechCrunch called it evidence that "competition for enterprise AI talent heats up." That framing undersells it. Humanloop had spent five years building exactly the infrastructure that Anthropic needed to make its enterprise products credible. The three co-founders and twelve engineers didn't join a company to start over - they joined to keep building the same things, at larger scale, with the backing of a lab that can actually shape how foundation models work. The platform shut down. The work continues.
The Lasting Contribution
By 2024, "LLMOps" had become a crowded category. But Humanloop was building in this space when it was still called "that thing nobody has figured out yet." A few things they established that are now table stakes in the industry:
Prompts are code. They need version control, environment management (staging vs. production), and audit trails. Humanloop's .prompt file format was an early, practical answer to this problem.
Evaluation is not optional. You cannot ship AI features responsibly without knowing whether they work. Humanloop integrated evaluation into CI/CD before most teams were even thinking about that connection. Block the deploy if quality drops. Don't find out in production.
Non-technical users need access. One of Humanloop's design choices was making prompt editing accessible to people who were not engineers - domain experts, customer support leads, content strategists who understand what a good response looks like but can't write Python to test it. That's not a nice-to-have. That's how AI teams actually function at scale.
Observability for AI is different. Traditional application monitoring tells you if something is broken. LLM observability needs to tell you if something is wrong in ways that are subtle, gradual, and hard to catch without specific tooling. Humanloop built for that nuance.
None of these are controversial ideas now. But someone had to build them first and prove they were worth building. Humanloop did that - and then Anthropic decided the people who built them were worth bringing inside.
"You cannot ship AI features responsibly without knowing whether they work."
- The philosophy behind Humanloop's evaluation tooling