London • Founded 2020 • LLMOps
The company that made AI teams look like they had their act together
An enterprise LLM infrastructure platform built by UCL PhD students, backed by Index Ventures and Y Combinator, used by companies like Duolingo and Gusto - and eventually absorbed by Anthropic, the very AI lab whose models it helped teams wrangle.
The Origin Story
University research labs have given the world remarkable things. The World Wide Web came out of CERN. Google came out of Stanford. And in 2020, a team of machine learning PhD students at University College London decided they were tired of academia's pace and wanted to build something that worked.
Raza Habib and Peter Hayes, deep in their ML research at UCL, co-founded Humanloop through the Portico Ventures spinout scheme - the university's mechanism for turning research into real companies. Jordan Burgess, who had already been building and selling things since he was 15, joined as the third co-founder and product brain.
The founding thesis was clear enough: enterprise teams were starting to build with large language models, and they were discovering that building with LLMs was nothing like building with traditional software. You couldn't unit test a prompt the way you'd test a function. You couldn't version control it in Git. You couldn't tell, after a model update, whether your application had gotten better or quietly worse. These were not minor inconveniences. They were the things that kept AI projects from reaching production.
Humanloop set out to fix that. Three months after joining Y Combinator's Summer 2020 batch - the batch that ran entirely over Zoom - they had their first paying customers.
"Picks and shovels in an AI gold rush."
- Humanloop's own description of their business

The gold rush framing is apt, and Humanloop knew it. When everyone is racing to build AI applications, the people who sell the infrastructure - the tools for evaluation, monitoring, version control, deployment - don't need to pick winners. They benefit regardless of which applications succeed, because all of them need the same plumbing.
The Pivot
In 2022, something shifted. OpenAI released instruction-tuned versions of GPT-3 - the breakthrough that made language models dramatically more useful for practical tasks. Humanloop had been watching the space closely, and they made a call: pivot the company to LLM prompt management.
They moved fast. Ten paying customers signed up within two days. Four months later, ChatGPT launched and the entire world realized that this technology mattered. The timing wasn't luck exactly - it was the product of researchers who understood what was happening in the models before most people understood why it mattered.
By 2024, Humanloop was processing millions of LLM logs daily. The platform had grown into a full suite: prompt version control, automated evaluations, observability, and a model proxy that could route requests across eight different model providers. They were an operating system for teams that were building seriously with AI, not just experimenting.
What made Humanloop distinct in a crowded field was the depth of the evaluation tooling. Running an AI application in production is only half the job. The other half is knowing whether it's doing what you intended, catching regressions when models update, and proving to compliance teams that the system behaves predictably. Humanloop built pipelines for all of that - automated code checks, LLM-as-judge scoring, human review workflows - and wired them into CI/CD so a bad prompt couldn't reach production without triggering an alert.
"They got 10 paying customers in 2 days, and ChatGPT launched just 4 months later, validating their timing."
What It Did
Humanloop's platform addressed the full lifecycle of building AI applications - from writing and versioning prompts to monitoring what happens after deployment.
Git-like version control for prompts - diff views, staging vs. production environments, a collaborative UI that let engineers and non-technical teammates both work on the same prompts. Humanloop invented the .prompt file format to bring prompts into standard Git workflows. Think of it as making your prompts as manageable as your code.
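The exact schema of the .prompt format isn't documented here, but the core idea - treating a prompt as a diffable, versionable text artifact - can be sketched with Python's standard difflib. The file layout below is a hypothetical illustration, not Humanloop's actual spec:

```python
import difflib

# Two versions of a hypothetical prompt file (illustrative layout only;
# not the real .prompt schema).
v1 = """model: gpt-4
temperature: 0.2
---
You are a helpful support agent. Answer briefly.
"""
v2 = """model: gpt-4
temperature: 0.0
---
You are a helpful support agent. Answer briefly and cite sources.
"""

def prompt_diff(old: str, new: str) -> str:
    """Produce a unified, Git-style diff between two prompt versions."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="staging/.prompt",
            tofile="production/.prompt",
        )
    )

print(prompt_diff(v1, v2))
```

The same diff view that makes code review possible makes prompt review possible: a reviewer sees exactly which line changed between staging and production.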
Automated evaluation pipelines using code checks, LLM-as-judge scoring, or human review. Integrated directly into CI/CD: if your latest prompt change drops quality below a threshold, the deploy gets blocked. Includes dataset version control and custom metrics so teams can define what "good" actually means for their specific application.
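The gating logic can be sketched in a few lines. This is an illustrative quality gate, not Humanloop's API: the threshold, the exact-match check, and the dataset shape are all hypothetical stand-ins for whatever "good" means for a given application:

```python
# Hypothetical minimum pass rate before a deploy is allowed.
QUALITY_THRESHOLD = 0.8

def exact_match(output: str, expected: str) -> bool:
    """Simplest possible code check: normalized exact match."""
    return output.strip().lower() == expected.strip().lower()

def evaluate(outputs, dataset) -> float:
    """Fraction of dataset examples the candidate outputs pass."""
    passed = sum(
        exact_match(out, ex["expected"]) for out, ex in zip(outputs, dataset)
    )
    return passed / len(dataset)

def ci_gate(outputs, dataset) -> bool:
    """Return True if the deploy may proceed (score >= threshold)."""
    score = evaluate(outputs, dataset)
    if score < QUALITY_THRESHOLD:
        print(f"Deploy blocked: score {score:.2f} < {QUALITY_THRESHOLD}")
        return False
    print(f"Deploy allowed: score {score:.2f}")
    return True

dataset = [
    {"input": "2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]
ci_gate(["4", "Paris"], dataset)  # passes: score 1.00
ci_gate(["4", "Lyon"], dataset)   # blocked: score 0.50
```

In a real pipeline the exact-match check would be swapped for an LLM-as-judge call or a human review step, but the CI wiring is the same: score, compare to threshold, block or allow.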
Built on OpenTelemetry, the Flows feature traced multi-step AI applications from start to finish - tracking production logs, cost, token usage, and latency across the full pipeline. When something went wrong with a specific user interaction, teams could deep-link to that conversation and replay what happened. Human review spot-checks were built in.
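The shape of that tracing can be sketched without the OpenTelemetry SDK. This is an illustrative flow tracer, not Humanloop's Flows API - the class, field names, and cost figures are invented for the example:

```python
import time
from dataclasses import dataclass, field

@dataclass
class FlowTrace:
    """Toy trace of a multi-step AI pipeline: latency, tokens, cost per step."""
    steps: list = field(default_factory=list)

    def record(self, name, fn, tokens=0, cost_usd=0.0):
        """Run one pipeline step and log its latency alongside usage."""
        start = time.perf_counter()
        result = fn()
        self.steps.append({
            "step": name,
            "latency_s": time.perf_counter() - start,
            "tokens": tokens,
            "cost_usd": cost_usd,
        })
        return result

    def summary(self):
        """Aggregate usage across the whole flow."""
        return {
            "total_tokens": sum(s["tokens"] for s in self.steps),
            "total_cost_usd": sum(s["cost_usd"] for s in self.steps),
        }

trace = FlowTrace()
trace.record("retrieve", lambda: ["doc1"])
trace.record("generate", lambda: "answer", tokens=120, cost_usd=0.002)
print(trace.summary())
```

A production system would export these spans via OpenTelemetry so they can be queried, alerted on, and deep-linked from a specific conversation - the sketch just shows what gets measured.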
A unified proxy sitting in front of eight-plus model providers: OpenAI, Anthropic, Google Cloud, AWS Bedrock, Azure. Same-day SLA for adding new models. Switch providers without rewriting your application, test one model against another, manage API keys in one place. Useful when the best model for your task is not always the same model.
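The value of a proxy like this is that the call site never changes. A minimal sketch of the routing idea, assuming hypothetical provider functions (these stand in for real SDK calls; none of this is Humanloop's actual proxy code):

```python
# Stand-ins for real provider SDK calls (hypothetical).
def call_openai(prompt: str) -> str:
    return f"[openai] {prompt}"

def call_anthropic(prompt: str) -> str:
    return f"[anthropic] {prompt}"

# One registry maps model identifiers to backends; adding a provider
# is one entry here, with no changes anywhere else.
PROVIDERS = {
    "openai/gpt-4": call_openai,
    "anthropic/claude": call_anthropic,
}

def complete(model: str, prompt: str) -> str:
    """Route a request to the configured provider without changing callers."""
    try:
        backend = PROVIDERS[model]
    except KeyError:
        raise ValueError(f"Unknown model {model!r}") from None
    return backend(prompt)

# Switching providers is a one-string change at the call site:
print(complete("openai/gpt-4", "hello"))
print(complete("anthropic/claude", "hello"))
```

This is also what makes side-by-side model testing cheap: run the same prompt through two registry entries and compare the outputs.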
Case Study
Gusto, the payroll and HR platform, built an AI support agent called Gus using Humanloop as the underlying infrastructure. The results were significant enough to publish: case deflection - the share of support requests handled by AI without human intervention - went from 10% to 30%. Accuracy improved roughly threefold.
The improvement came from systematic evaluation. Rather than deploying changes to Gus and hoping for the best, Gusto used Humanloop's evaluation pipelines to test prompt changes against historical data before pushing them live. Bad changes didn't reach customers. Good changes could be validated quickly. The feedback loop got shorter, and the product got better faster.
This is what "LLMOps" actually means in practice: not just logging what your AI does, but having the infrastructure to know whether what it's doing is good, and to change it safely when it isn't.
The People
Raza Habib - ML PhD from UCL. Named Forbes 30 Under 30 Technology in 2022. Listed by Sifted as one of the most influential Gen AI founders in Europe in 2025. While running the company, also hosted a podcast called "High Agency: The Podcast for AI Builders" - because apparently running an AI infrastructure startup wasn't filling enough hours in the day.
Peter Hayes - ML PhD from UCL. Met Raza during their research, which is either a remarkable coincidence or exactly how great infrastructure companies tend to start - two people who understand the underlying technology deeply enough to know what's missing from it.
Jordan Burgess - Built and sold his first website at 15. By the time he joined Humanloop as the product co-founder, he had been building things on the internet for most of his life. That instinct for what users actually need - not what researchers imagine they need - gave the product its sharp edges.
Details That Stick
All three co-founders met through their ML PhDs at UCL. The spinout came via Portico Ventures, UCL's scheme for turning research into companies. Academia's loss; enterprise AI's gain.
In 2022, they spotted GPT-3 instruction-tuning breakthroughs and pivoted to prompt management. Ten paying customers signed up within two days. ChatGPT launched four months later. Timing.
Jordan Burgess built and sold his first website at age 15. By the time he was building Humanloop, he'd had more than a decade of instinct for how products should feel.
Humanloop invented the .prompt file format - a way to store and version prompts inside standard Git repositories. Quietly, this is the kind of thing that shapes how an industry works.
$3.8M ARR. 14 employees. That's $271K revenue per person - exceptional even by SaaS standards. The team ran lean by design, not by accident.
Raza Habib ran a podcast called "High Agency" while running the company. Guests included founders and AI researchers navigating the same landscape his company was building tools for.
Their self-description: "picks and shovels in an AI gold rush." Then Anthropic - one of the gold rush's most serious participants - bought the whole operation and put the team to work inside their enterprise console.
From Lab to Anthropic
WHAT ANTHROPIC ACQUIRED
The product DNA was integrated into Anthropic Console as the "Workbench" and "Evaluations" tabs - which is as close as a startup gets to immortality.
When Anthropic absorbed Humanloop in August 2025, TechCrunch called it evidence that "competition for enterprise AI talent heats up." That framing undersells it. Humanloop had spent five years building exactly the infrastructure that Anthropic needed to make its enterprise products credible. The three co-founders and twelve engineers didn't join a company to start over - they joined to keep building the same things, at larger scale, with the backing of a lab that can actually shape how foundation models work. The platform shut down. The work continues.
The Lasting Contribution
By 2024, "LLMOps" had become a crowded category. But Humanloop was building in this space when it was still called "that thing nobody has figured out yet." A few things they established that are now table stakes in the industry:
Prompts are code. They need version control, environment management (staging vs. production), and audit trails. Humanloop's .prompt file format was an early, practical answer to this problem.
Evaluation is not optional. You cannot ship AI features responsibly without knowing whether they work. Humanloop integrated evaluation into CI/CD before most teams were even thinking about that connection. Block the deploy if quality drops. Don't find out in production.
Non-technical users need access. One of Humanloop's design choices was making prompt editing accessible to people who were not engineers - domain experts, customer support leads, content strategists who understand what a good response looks like but can't write Python to test it. That's not a nice-to-have. That's how AI teams actually function at scale.
Observability for AI is different. Traditional application monitoring tells you if something is broken. LLM observability needs to tell you if something is wrong in ways that are subtle, gradual, and hard to catch without specific tooling. Humanloop built for that nuance.
None of these are controversial ideas now. But someone had to build them first and prove they were worth building. Humanloop did that - and then Anthropic decided the people who built them were worth bringing inside.
"You cannot ship AI features responsibly without knowing whether they work."
- The philosophy behind Humanloop's evaluation tooling