What is the agent actually doing?
Most AI tools grade an agent by what goes in and what comes out. Alex Shan thinks that misses the whole story - the part in the middle, where an agent reasons, calls a tool, remembers something it shouldn't, and quietly goes off the rails. Judgment Labs is built to watch that middle.
The company calls itself "the continuous-improvement stack for agents." In plain terms: it follows long reasoning traces, tool use, and memory through a live system, flags the moment behavior drifts, and turns that production data into agents that get measurably better the next time around. You can investigate a failure from Slack, find the root cause across real traffic, test a fix against actual production cases, then ship it - all without guessing.
That is the unglamorous layer of AI. Nobody tweets about regression tests. But it is exactly the layer that decides whether a deep, multi-step agent is something you trust with real work or something you babysit forever. Shan picked the boring problem on purpose, because the boring problem is the one standing between agents and production.
"We set out to build Judgment because the teams building deep agents didn't have tools that understood what their agents were actually doing."
- Alex Shan, on why Judgment Labs exists
The first customer of a middle-school Python course
Judgment Labs has three founders, and they didn't meet in a dorm or a Slack channel. They have been best friends since they were kids. Andrew Li - now the company's Chief Scientist - taught Alex natural language processing when they were children. Joseph Camyre - now CTO - ran a Python course in middle school, and Alex was his very first customer.
Years later, the hobby turned into resumes. Alex became an AI researcher in Stanford's NLP group inside the Stanford AI Lab. Andrew landed as an early research hire at TogetherAI. Joseph built large-scale infrastructure as a systems engineer at Datadog. Three threads - research, models, and systems - that happen to be exactly what you need to build evaluation infrastructure for agents. The three founders pulled those threads together in 2025.
Alex Shan, 22
CEO. Stanford NLP researcher under Chris Manning. The thesis guy.
Andrew Li, 23
Chief Scientist. Early research hire at TogetherAI. Taught Alex NLP as a kid.
Joseph Camyre, 23
CTO. Systems engineer at Datadog. Ran the Python class Alex first signed up for.
One thesis
Agents should improve from production data, not just pass a test once.
Stanford freshman to Lightspeed term sheet
Most people who work in Chris Manning's NLP group arrive as PhD students. Alex walked in as a freshman. He stayed close to the research, eventually earning an M.S. in Computer Science from Stanford, then went looking for where agents actually break.
He found it at Juniper Networks, where he pioneered autonomous networking agents - software making its own decisions inside live infrastructure. That is where the abstract worry ("what is the agent doing?") became a daily, operational one. The lesson carried straight into Judgment Labs.
- As kids: Learns NLP from Andrew; becomes the first customer of Joseph's middle-school Python course.
- Freshman year: Joins Stanford's NLP group under Professor Chris Manning.
- Stanford: AI researcher in the NLP group within the Stanford AI Lab; earns an M.S. in Computer Science.
- Pre-Judgment: Pioneers autonomous networking agents at Juniper Networks.
- 2025: Co-founds Judgment Labs with Andrew Li and Joseph Camyre; becomes CEO.
- Oct 2025: Co-authors the company essay, "Climbing the Hills That Matter."
- May 2026: Closes $32M in combined seed and Series A, led by Lightspeed Venture Partners.
$32 million, and a thesis playing out in production
On May 12, 2026, Judgment Labs announced it had closed $32 million in combined seed and Series A funding. Lightspeed Venture Partners led both rounds. Nova Global, SV Angel, Valor Equity Partners, and Dynamic joined. The platform already runs at agent-native companies, including customers like E3 Group, powering the monitoring-and-improvement loop behind their agents every day.
"Lightspeed has been the right partner from day one: they backed us when we were a handful of researchers with a thesis, and they're doubling down now that the thesis is playing out in production."
- Alex Shan, on the round
Why input-output isn't enough
Here is the line that explains the whole company. Shan: "Input-output evals miss so much of where agents go wrong." A chatbot you can grade on its answer. A deep agent - one that plans, calls tools, holds memory, and acts over many steps - fails somewhere inside that chain, in a place a final-answer score never sees. Judgment's open-source layer, judgeval, exists to make that interior visible: it captures environment data and evaluations that feed both monitoring and post-training (reinforcement learning and supervised fine-tuning).
The name is a small joke that tells you who built it - "judgeval," judgment plus evaluation, shipped open-core so the people running agents can see the machinery, not just the marketing.
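To make the input-output point concrete, here is a minimal sketch in plain Python. It is an illustration only and does not use judgeval or reflect its API. The `Step` and `Trace` classes, the two evaluator functions, and the refund-policy scenario are all hypothetical names invented for this example. The idea it shows: an agent can return the right final answer while a step inside the chain misbehaves, and only a check that walks the trace will notice.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Step:
    """One hypothetical unit of agent behavior inside a trace."""
    kind: str        # e.g. "reason", "tool_call", "memory_read"
    name: str
    ok: bool = True  # assumed to be set by some per-step check
    note: str = ""


@dataclass
class Trace:
    """A hypothetical record of everything between input and output."""
    question: str
    steps: List[Step] = field(default_factory=list)
    final_answer: str = ""


def final_answer_eval(trace: Trace, expected: str) -> bool:
    """Input-output grading: looks only at the last thing the agent said."""
    return trace.final_answer.strip() == expected


def step_level_eval(trace: Trace) -> List[Tuple[int, Step]]:
    """Trace grading: returns (index, step) for every step flagged as bad."""
    return [(i, s) for i, s in enumerate(trace.steps) if not s.ok]


# Illustrative scenario: the answer is right, but one step read stale memory.
trace = Trace(
    question="What is the refund window?",
    steps=[
        Step("reason", "plan_lookup"),
        Step("memory_read", "cached_policy", ok=False,
             note="read a stale cached policy"),
        Step("tool_call", "policy_lookup"),
    ],
    final_answer="30 days",
)

passed = final_answer_eval(trace, "30 days")
flagged = step_level_eval(trace)

print("final-answer eval passed:", passed)
for index, step in flagged:
    print(f"step {index} flagged: {step.kind} {step.name} ({step.note})")
```

In this toy case the final-answer check passes while the trace check flags step 1. That gap is the one the article describes: a score on the output alone would report nothing wrong.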