ZeroEval founders Jonathan Chavez and Sebastian Crossa at Y Combinator
YC S25 New York AI Agents

ZeroEval

The self-improving layer for AI agents

Your AI agent ships and immediately starts accumulating failures. ZeroEval watches every call, scores every output, learns from every correction - and automatically ships better prompts. Agents get smarter. You get sleep.

60K MAU pre-launch
18% Accuracy lift
30+ Agent integrations
2025 Founded
4 Enterprise customers
2 Founders
$49 Starting/month

Trusted by teams at

DoorDash Datadog Hugging Face Harvard Medical School

Two Friends, One Very Expensive Problem

Jonathan Chavez and Sebastian Crossa met during their first year of college in Mexico. Seven years later, both were deep in the machinery of AI infrastructure - Jonathan on the LLM observability team at Datadog, watching enterprises struggle to understand why their models kept failing; Sebastian as a founding engineer building email at Micro (backed by a16z) and before that at Atrato (YC W21).

They had seen the same problem from different angles: companies build an AI agent, ship it, and then have no reliable way to know why it is performing badly or how to fix it. The evaluation tooling either doesn't exist, requires a small army of data labelers, or produces static judges that degrade the moment your production data drifts from your test data.

Before applying to YC, they built a side project together - llm-stats.com, an LLM leaderboard that quickly grew to 60,000 monthly active users and a third of a million unique visitors. It was proof they could build things people actually wanted. ZeroEval is the serious version of that instinct applied to a much harder problem.

The companies that win the next decade of AI won't be those that build the best agents. They'll be the ones whose agents get better over time.

- ZeroEval founding thesis

That single sentence explains everything about the company's positioning. ZeroEval isn't trying to make the world's best AI model. It's building the feedback infrastructure that makes your model - whatever it is - incrementally smarter every day without you having to manually audit thousands of outputs.


What ZeroEval Actually Does

AI agent evaluation is a genuinely hard problem. A chatbot that answers questions is one thing. An agent that makes dozens of tool calls across a long, multi-turn session to accomplish a complex task is another. There is no simple accuracy metric. The failure modes are subtle, numerous, and deeply task-specific.

ZeroEval attacks this in three stages: observe, evaluate, fix - and then does it all automatically.

01
Real-Time Tracing

Logs every LLM call, tool call, and session with minimal overhead. Automatic PII redaction handles emails, phone numbers, SSNs, and credit cards before data leaves your infrastructure.
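
To make the redaction step concrete, here is a minimal Python sketch of regex-based PII scrubbing applied to a span payload before export. It is illustrative only - the patterns and function names are assumptions, not ZeroEval's actual SDK.

```python
import re

# Illustrative patterns - assumptions for demonstration, not ZeroEval's SDK.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "phone": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Swap common PII for typed placeholders before a trace leaves your infrastructure."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

# Redaction runs on the span payload before it is exported anywhere.
span = {"input": "Reach me at jane@example.com or 555-867-5309", "output": "Done."}
span = {key: redact_pii(value) for key, value in span.items()}
```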

02
Calibrated LLM Judges

Custom evaluators that score quality on your criteria - binary pass/fail or 1-10 rubric. The key difference: these judges learn from human corrections and improve their alignment over time. Static judges don't do this.
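
The calibration loop can be sketched roughly like this: human corrections are stored and replayed to the judge as few-shot examples, and "alignment" is simply the agreement rate between judge and human verdicts. This is an illustrative sketch under those assumptions, not ZeroEval's implementation; call_llm is a stub for whatever model client you use.

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Stub standing in for a real model call (OpenAI, Anthropic, etc.)."""
    return "pass"

@dataclass
class CalibratedJudge:
    rubric: str                                       # e.g. "Pass if the answer cites a source."
    corrections: list = field(default_factory=list)   # (input, output, human_verdict) triples

    def add_correction(self, inp: str, out: str, human_verdict: str) -> None:
        """Human overrides are kept and shown to the judge as few-shot examples."""
        self.corrections.append((inp, out, human_verdict))

    def score(self, inp: str, out: str) -> str:
        """Score a new output against the rubric, conditioned on past human corrections."""
        examples = "\n".join(
            f"Input: {i}\nOutput: {o}\nVerdict: {v}" for i, o, v in self.corrections[-20:]
        )
        prompt = (
            f"Rubric: {self.rubric}\n"
            f"Human-corrected examples:\n{examples}\n\n"
            f"Input: {inp}\nOutput: {out}\nVerdict (pass/fail):"
        )
        return call_llm(prompt)

def alignment(judge_verdicts: list, human_verdicts: list) -> float:
    """Agreement rate between judge and human labels - the number calibration tries to raise."""
    agree = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return agree / max(len(human_verdicts), 1)
```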

03
Autotune

Automatically runs evaluations across multiple models and prompt variants, finds the best performer, compares candidates side by side, and deploys the winning version across your stack.
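
Stripped to its core, that is a sweep over model and prompt candidates scored against an eval set. A hedged sketch of the idea, with hypothetical function names standing in for whatever ZeroEval actually exposes:

```python
from itertools import product

def autotune(models, prompt_variants, eval_set, run_agent, judge):
    """run_agent(model, prompt, example) -> output; judge(example, output) -> score in [0, 1]."""
    best = None
    for model, prompt in product(models, prompt_variants):
        # Score every candidate pair on the same eval set so results are comparable.
        scores = [judge(ex, run_agent(model, prompt, ex)) for ex in eval_set]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[2]:
            best = (model, prompt, mean)
    return best  # (winning model, winning prompt, mean score): the version to deploy

# e.g. winner = autotune(["model-a", "model-b"], [PROMPT_V1, PROMPT_V2], eval_set, run_agent, judge)
```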

04
Built-in Safety Judges

Detects hallucinations, unsafe outputs, and user frustration signals out of the box. Define custom rubrics aligned to your quality standards - not generic benchmarks.
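
A custom rubric might be expressed as structured criteria along these lines - the field names are assumptions for illustration, not ZeroEval's schema:

```python
# Field names here are illustrative assumptions, not ZeroEval's actual schema.
support_rubric = {
    "name": "support_agent_quality",
    "scale": "1-10",
    "criteria": [
        "Resolves the user's actual question, not a nearby one",
        "Never invents order numbers, refund amounts, or policy details",  # hallucination check
        "Escalates to a human when the user sounds frustrated",            # frustration signal
    ],
    "fail_below": 7,  # scores under this threshold get flagged for review
}
```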

05
MCP Server

Plugs directly into Claude Code, Cursor, and 30+ coding agents via the Model Context Protocol. Your agents can literally ask ZeroEval what they are getting wrong and self-correct.

06
Easy Integration

Python and TypeScript SDKs, REST API, OpenTelemetry support, and a CLI. Five minutes from sign-up to first trace. SOC 2 Type II compliant for enterprise security requirements.
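
Because OpenTelemetry is an open standard, one plausible integration path is a stock OTLP exporter pointed at ZeroEval's backend. The endpoint URL and auth header below are placeholders, not documented values - check the official docs for the real ones.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans over OTLP/HTTP; endpoint and credential are hypothetical placeholders.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://api.zeroeval.example/v1/traces",   # hypothetical endpoint
            headers={"Authorization": "Bearer <YOUR_API_KEY>"},  # placeholder credential
        )
    )
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent")

# Wrap an LLM call in a span and attach whatever metadata you evaluate on.
with tracer.start_as_current_span("llm.call") as span:
    span.set_attribute("llm.model", "gpt-4o")
    span.set_attribute("llm.prompt_tokens", 412)
    # ... make the actual model call here ...
```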

An example from ZeroEval's own benchmarks: an 18% accuracy improvement across 2,400 traced production runs, driven by just 127 human feedback corrections. That's the leverage - a small amount of human signal producing a large improvement in agent accuracy.


Two Engineers Who've Seen This Movie Before

Most AI evaluation startups are built by researchers who understand benchmarks but have never shipped a production system under pressure. ZeroEval's founders are operators first.

Jonathan Chavez
Co-Founder

Early employee on the LLM observability team at Datadog - one of the few places where you watch enterprise AI break in real time at scale. Before that: undergrad research on vision transformers for particle physics and reinforcement learning for robotics at Tecnologico de Monterrey. The kind of background that makes "AI evaluation is hard" feel like an understatement.

Sebastian Crossa
Co-Founder

Founding engineer at Micro (a16z-backed), building the future of email - and before that, founding engineer at Atrato (YC W21). Two first-engineer roles at funded companies before 30. The pattern: join early, build fast, learn what actually matters. He brought the product instincts that turned a side-project leaderboard into 60K monthly users.

Their YC partner is Jon Xu. The batch: Summer 2025. The location: New York City, which has quietly become one of the better places to build B2B AI infrastructure - closer to the enterprise customers writing the big checks.


Why "Self-Improving" Is the Right Frame

The AI agent wave of 2025 created a new category of operational headache. Companies are shipping agents that do things - book flights, write code, triage support tickets, analyze medical data - and discovering that "it worked in testing" and "it works in production" are very different statements.

The evaluation tooling industry is responding. But most solutions look like observability products with an evaluation tab tacked on. ZeroEval's bet is different: don't just tell developers what broke - automatically fix it. Close the loop from failure detection to prompt deployment without a human in the middle.

The vision is explicitly ambitious. In the founders' words, they see a future where "developers define the evaluation criteria as a starting point and errors back-propagate to find the optimal implementation." That's less an observability product than a continuous training loop for production software.

We're building the second line of offense to fill foundational models' capability gaps and create AI products that actually work.

- ZeroEval Launch Post, 2025

The MCP server integration is an underrated detail here. When ZeroEval publishes a server that lets coding agents like Claude Code query their own evaluation data to understand where they fail - AI agents using an AI evaluation tool to get smarter - that's a glimpse of the feedback loop the company is trying to institutionalize.

DoorDash, Datadog, Hugging Face, and Harvard Medical School all trusted ZeroEval early. That's a remarkably diverse set of use cases for a two-person company from a single YC batch. Delivery logistics, developer tools, ML infrastructure, and medical AI - the common thread is agents making consequential decisions where getting it wrong has a cost.

How We Got Here

7 Years Ago
Jonathan and Sebastian meet during their first year of college in Mexico. Side projects begin.
Pre-2025
Sebastian builds as founding engineer at Atrato (YC W21) and Micro (a16z). Jonathan joins Datadog's LLM observability team. Both accumulate front-row seats to AI failing at scale.
Early 2025
They build llm-stats.com together - an LLM leaderboard that reaches 60,000 monthly active users and 333,000 total unique visitors. Proof of concept that they can ship quickly and build an audience.
Summer 2025
Accepted into Y Combinator's Summer 2025 batch. Photographed at the YC campus in August 2025. YC partner: Jon Xu.
2025
Launch: ZeroEval goes live with tracing, calibrated judges, Autotune, and MCP server. Early customers include DoorDash, Datadog, Hugging Face, and Harvard Medical School. Achieves SOC 2 Type II compliance.