Judgment Labs

Somewhere right now, an AI agent is booking a flight, reconciling an invoice, or drafting a legal memo. It reasons. It picks a tool. It remembers what you told it last Tuesday. And when it quietly gets something wrong, almost nobody can say where, or why. Judgment Labs exists for that exact silence.

The scene: an agent humming along in production, three steps from a mistake nobody will notice until a customer does.

Who they are now

A lab that grades the machines

Judgment Labs is a San Francisco applied-research company building infrastructure for what it calls deep agents - the AI systems that don't just answer a prompt but run long, multi-step jobs with tools and memory. Its job is unglamorous and enormous: trace what an agent actually did, measure whether it was any good, and turn that into the next, better version. In May 2026 it announced $32 million in combined seed and Series A funding. The team is about two dozen people. The founders are barely old enough to rent a car.

"Judgment is solving the hardest problem in the agent stack - how do you measure and improve something that thinks, plans, uses tools, and remembers?"

- James Alcorn, Lightspeed Venture Partners

The problem they saw

Demos lie. Production tells the truth.

Here is the uncomfortable thing about agents: they are dazzling in a demo and unpredictable at scale. A model that aces a benchmark can still fumble the seventh tool call in a real workflow, and traditional software tests have no idea what to do with a system that improvises. Pass/fail doesn't fit a thing that reasons. You can't unit-test a hunch.

Teams shipping agents found themselves flying blind - drowning in logs, guessing at failure modes, unable to tell a clever recovery from a lucky accident. The data that could explain everything was right there in production, and almost nobody had the tools to read it.

"We tried other tools, but none could automatically point toward failures. Judgment is different; we see exactly where agents err."

- Aqil Naeem, CEO, E3 Group

A customer quote that doubles as the entire product roadmap.

The founders' bet

Three friends and one stubborn idea

Judgment Labs was started by three people who have known each other since childhood. Alex Shan ran research in Stanford's NLP group inside the Stanford AI Lab before becoming CEO. Andrew Li, now Chief Scientist, was an early research hire at TogetherAI. Joseph Camyre, the CTO, built large-scale infrastructure at Datadog - which is to say he already knew what it takes to watch millions of things at once without melting the servers.

Their bet was simple and a little contrarian: production data is the best teacher an agent will ever have. Not synthetic benchmarks. Not a tidy eval set. The messy, real record of what an agent did when a real person needed something. Capture that well enough and improvement stops being a guessing game.

Alex Shan, Co-Founder & CEO

Andrew Li, Co-Founder & Chief Scientist

Joseph Camyre, Co-Founder & CTO

Best friends since childhood, now responsible for grading everyone else's robots. Average age at the Series A: 22.

Milestones

A very fast eighteen months

EARLIER: Research roots

Founders converge from Stanford NLP, TogetherAI, and Datadog - NLP research, large-scale systems, and agentic evaluation in one room.

2026: Judgment Labs founded

From San Francisco, the company sets out to build the continuous-improvement layer for agents.

2026: Judgeval goes open source

The open-source post-building layer for agents ships on GitHub and PyPI - tracing and evals that power monitoring and post-training.

2026 · Q1: Seed round

Lightspeed Venture Partners backs the seed - the first half of a two-part conviction.

2026 · MAY 12: $32M seed + Series A announced

Lightspeed leads again, with Nova Global, SV Angel, Valor Equity Partners and Dynamic. Two rounds, less than six months apart.

NEXT: Scaling the lab

Expanding research, engineering, and a forward-deployed engineering team in San Francisco.

The product

From raw trajectory to better agent

The platform is a stack, not a single dashboard. At the open-source base sits Judgeval, the SDK developers drop into their own code to record what an agent actually does - the long reasoning, the tool calls, the memory. Above it, the work of making sense of all that:

Agent Search

Query across trajectories at a behavioral level - not keyword matching, but "show me every time the agent did this."

Evaluate: Agent Judge

Cheaper, more accurate trajectory-level evaluators built on harnesses and LLM-as-a-judge techniques.

Discover: Behavior Discovery

Surfaces failure modes and usage patterns from unlabeled production data - the stuff you didn't know to look for.

Score: AutoRubrics

Builds and refines evaluation rubrics automatically from verifiable signals, so grading gets sharper over time.
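To make the stack concrete, here is a minimal sketch of the underlying idea: record a trajectory step by step, query it by behavior rather than by keyword, and score it with a simple rubric. Everything in it is an illustrative assumption. `Step`, `Trajectory`, `search`, `retried_after_failure`, and `tool_success_rate` are hypothetical names and are not the Judgeval API, and the rubric is a hand-written success rate rather than an LLM-as-a-judge evaluator.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    kind: str          # "reasoning", "tool_call", or "memory" (hypothetical labels)
    name: str          # e.g. the tool that was called
    ok: bool = True    # whether the step succeeded
    detail: str = ""   # free-form note, such as an error message


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)

    def record(self, kind: str, name: str, ok: bool = True, detail: str = "") -> None:
        """Append one step to the trajectory in the order it happened."""
        self.steps.append(Step(kind, name, ok, detail))


def search(trajectories: List[Trajectory],
           predicate: Callable[[Trajectory], bool]) -> List[Trajectory]:
    """Behavioral query: keep trajectories whose steps match a predicate."""
    return [t for t in trajectories if predicate(t)]


def retried_after_failure(t: Trajectory) -> bool:
    """True if a failed tool call is followed directly by a call to the same tool."""
    calls = [s for s in t.steps if s.kind == "tool_call"]
    return any(not a.ok and b.name == a.name for a, b in zip(calls, calls[1:]))


def tool_success_rate(t: Trajectory) -> float:
    """Toy rubric: fraction of tool calls that succeeded (1.0 if there were none)."""
    calls = [s for s in t.steps if s.kind == "tool_call"]
    return sum(s.ok for s in calls) / len(calls) if calls else 1.0


# Two made-up trajectories, echoing the workflows from the opening scene.
booking = Trajectory("book a flight")
booking.record("reasoning", "plan", detail="search, then pay")
booking.record("tool_call", "search_flights")
booking.record("tool_call", "charge_card", ok=False, detail="timeout")
booking.record("tool_call", "charge_card")

invoice = Trajectory("reconcile an invoice")
invoice.record("tool_call", "fetch_ledger")
invoice.record("memory", "recall_vendor")

hits = search([booking, invoice], retried_after_failure)
print([t.task for t in hits])  # ['book a flight']
```

The point of the predicate is that "show me every time the agent retried a failed call" is a question about the shape of the steps, not about any string in the logs. A production system would presumably replace both the predicate and the rubric with learned or model-based components.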

There's also Judgment MCP, which wires the whole thing into coding agents like Claude, Codex, and Cursor. And because the environment data and evals can be exported, the same record that caught a failure can feed reinforcement learning and fine-tuning - improvement that compounds instead of resetting.
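As a rough illustration of how one record can serve both monitoring and training, the sketch below writes scored trajectories out as JSON Lines, a common input shape for fine-tuning and reinforcement-learning pipelines. The field names (`prompt`, `trajectory`, `reward`) and the `export_jsonl` helper are assumptions made for this example, not Judgment's actual export format.

```python
import io
import json


def export_jsonl(records, out):
    """Write one JSON object per line to `out` and return the number of rows written."""
    count = 0
    for r in records:
        # Hypothetical schema: the task, the ordered steps, and an evaluator score.
        row = {"prompt": r["task"], "trajectory": r["steps"], "reward": r["score"]}
        out.write(json.dumps(row) + "\n")
        count += 1
    return count


# Made-up scored records; in practice these would come from traced production runs.
records = [
    {"task": "book a flight", "steps": ["search_flights", "charge_card"], "score": 1.0},
    {"task": "reconcile an invoice", "steps": ["fetch_ledger"], "score": 0.0},
]

buf = io.StringIO()           # stand-in for a real file opened in write mode
count = export_jsonl(records, buf)
lines = buf.getvalue().splitlines()
print(count, "rows exported")  # 2 rows exported
```

Keeping the evaluator's score next to the trajectory is what lets the same row act as an alert today and a training signal later.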

"Your agents need better judgment."

- Judgment Labs, on the nose and proud of it

The proof

The numbers behind the conviction

Investors don't lead two rounds in six months on vibes. Here's the shape of the bet so far - funding, headcount, and the kind of customers betting alongside them.

Judgment Labs, by the numbers

Combined seed + Series A disclosed May 2026.

Total raised: $32M

Series A: $25.6M

Team size: ~24

Founders: 3

Lightspeed rounds: 2

Two rounds, one lead investor, less than six months apart. Either it's conviction or it's a very patient coincidence.

The customer list skews toward teams who live or die by agent reliability: agent-native startups in legal, finance, and customer support, plus names referenced in the company's own case studies - E3 Group, Monaco, Human Behavior, Contrario, Vigil Labs, and DoorDash. These are not tire-kickers. They are companies whose product is the agent.

The mission

Better with every interaction

The company states its mission plainly: give every team building agents the tools to make their product better with every interaction. It's a quietly radical framing. Most software ships, then slowly rots until someone patches it. Judgment Labs is arguing the opposite - that an agent should improve the more it's used, because every interaction is a labeled lesson if you have the instruments to read it.

"The hardest problem isn't building an agent. It's knowing whether the one you built is getting better or just getting different."

- The thesis, paraphrased

Why it matters tomorrow

The audit layer for an agent economy

If agents really do take over the long, consequential workflows - the legal review, the financial close, the support queue - then the ability to measure them stops being a nice-to-have and becomes the cost of doing business. You don't deploy what you can't inspect. Judgment Labs is betting the inspection layer is as important as the agents themselves, and that owning the open-source entry point is how you end up trusted to grade the whole industry.

So return to that opening scene. An agent is booking a flight, reconciling an invoice, drafting a memo - three steps from a quiet mistake. Except now the trajectory is being recorded, the failure mode is being named, and the next version already knows better. The silence Judgment Labs was built for is starting to talk back.

Same agent, same workflow - only this time someone's actually watching, and the agent gets the note.