Braintrust

Who they are now

The quiet workbench behind a lot of loud AI

It is a Tuesday afternoon. An engineer at Notion pushes a prompt change. Somewhere, an agent at Stripe answers a refund question it has never seen. A data scientist at Vercel compares two models, side by side, and learns - in under a minute - which one is quietly worse. None of this is glamorous. All of it runs through Braintrust.

Braintrust is the developer platform for shipping AI products that don't silently fall apart. It bundles evaluations, observability, prompt management, and an agentic optimizer called Loop into one workflow. Boring, on purpose. The point is the loop, not the magic.

That positioning - infrastructure rather than spectacle - is unusual for a 2023-vintage AI company. The company is three years old, has 160-odd employees, $121 million in funding, and a customer roster that reads less like a startup deck and more like a list of the companies you already think are good at AI.

Evals are the new PRD. - Ankur Goyal, Founder & CEO

The problem they saw

AI is the only software that gets worse on its own

For most software, "shipped" is a stable state. The code does what it did yesterday. It does not, generally, decide overnight to start hallucinating a refund policy. AI systems are different. Models drift. Prompts rot. Vendors silently re-train. A change that improves accuracy on the easy 80% can quietly torch the hard 20% that actually matters.
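
The 80/20 trap above is easier to see with numbers. What follows is a minimal sketch with invented pass counts (80 easy cases, 20 hard ones), not data from any real system. The helper names `make_cases` and `accuracy_by_slice` are hypothetical.

```python
from collections import defaultdict

# Hypothetical eval results, invented for illustration only.
# Each case is (slice_name, passed_before_change, passed_after_change).
def make_cases(slice_name, total, passed_before, passed_after):
    """Build `total` cases for one slice with the given pass counts."""
    return [
        (slice_name, i < passed_before, i < passed_after)
        for i in range(total)
    ]

cases = (
    make_cases("easy", total=80, passed_before=68, passed_after=76)
    + make_cases("hard", total=20, passed_before=14, passed_after=9)
)

def accuracy_by_slice(cases):
    """Return {slice: (accuracy_before, accuracy_after)}, plus an 'overall' entry."""
    counts = defaultdict(lambda: [0, 0, 0])  # total, passed_before, passed_after
    for slice_name, before, after in cases:
        for key in (slice_name, "overall"):
            counts[key][0] += 1
            counts[key][1] += before  # True counts as 1, False as 0
            counts[key][2] += after
    return {k: (b / n, a / n) for k, (n, b, a) in counts.items()}

report = accuracy_by_slice(cases)
for name, (before, after) in report.items():
    print(f"{name:8s} before={before:.2f} after={after:.2f}")
```

With these made-up counts, overall accuracy rises from 82 to 85 correct out of 100 while the hard slice falls from 14 to 9 correct out of 20. A single headline number would call that change a win.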

Ankur Goyal spent his years at his previous company, Impira, watching this play out in document AI - and saw it again at Figma, where he led the AI team after the acquisition. The pattern was always the same. Demos sparkled. Production sagged. Nobody could agree on whether the new version was actually better than the old one, because nobody had a measurement that everyone trusted.

The lightly ironic part: every AI team knew they should be running evals. Almost none of them were doing it well. Spreadsheets, half-finished notebooks, vibes-based judgments at 11pm. The discipline existed in machine-learning research papers and almost nowhere else.

Margin note: "Works on my prompt" is not a release strategy. It is, however, a remarkably common one.

The founder's bet

If evals are the workflow, build the workbench

Goyal's bet was specific and almost stubborn: if you make evaluation the center of how AI engineers work - not an afterthought, not a research artifact, but the actual default workflow - the quality follows. Every change gets measured. Every measurement informs the next change. Boring. Compounding.

In 2023, Greylock wrote a $5.1 million seed check on this thesis. A year later, Martin Casado at Andreessen Horowitz led a $36 million Series A, with Datadog, Databricks Ventures, Greg Brockman, Arthur Mensch, Guillermo Rauch, and Simon Last all piling in - investors who, between them, have seen what production infrastructure for AI actually has to do. In February 2026, ICONIQ led an $80 million Series B. Total raised: $121 million.

The pattern in those investor lists is not subtle. They are mostly people who build, or invest in, the systems Braintrust's customers use. They were buying picks and shovels from someone who had mined the seam before.

AI drifts, hallucinates, and regresses silently. The best teams observe production, evaluate against expectations, and iterate continuously. - From the Braintrust playbook

The product

One workbench, four loud parts

Evals. The center of gravity. Engineers define datasets, write scorers (in code, with LLMs, or by hand), and run experiments against any model or prompt. Compare two prompts side by side. Catch regressions in CI before a pull request gets merged.

Observability. Trace what production is actually doing. Every LLM call, every tool invocation, every agent step - logged, searchable, and turn-into-an-eval-with-one-click. The closing of that loop - production into evals - is the part teams stay for.

Loop. An agent that proposes prompt and scorer improvements based on your data. It is, fittingly, the most AI-flavored part of the platform: an evaluator that helps you evaluate.

Brainstore. A purpose-built database for AI traces, designed for the moment a customer realizes their AI app is producing fifty million log events a week and Postgres has feelings about that.
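
The Evals workflow described above (a dataset, a scorer, an experiment per variant, a regression check before merge) can be sketched in plain Python. This is an illustration of the idea only and does not use Braintrust's SDK. Every name here (`Case`, `exact_match`, `run_experiment`, `regression_gate`, the two stand-in tasks) is hypothetical, and the "tasks" are hard-coded rules standing in for a prompt plus a model.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Case:
    input: str
    expected: str

# Hypothetical dataset, invented for this sketch.
DATASET = [
    Case("refund after 10 days", "approve"),
    Case("refund after 45 days", "deny"),
    Case("refund, item damaged, 45 days", "approve"),
]

def exact_match(output: str, expected: str) -> float:
    """A code-based scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected else 0.0

# Stand-ins for two prompt/model variants. A real task would call an LLM.
def baseline_task(text: str) -> str:
    if "damaged" in text:
        return "approve"
    return "approve" if "10 days" in text else "deny"

def candidate_task(text: str) -> str:
    # The "improved" variant forgets the damaged-item exception.
    return "approve" if "10 days" in text else "deny"

def run_experiment(task: Callable[[str], str], dataset: List[Case],
                   scorer: Callable[[str, str], float]) -> float:
    """Run the task on every case and return the mean score."""
    scores = [scorer(task(c.input), c.expected) for c in dataset]
    return sum(scores) / len(scores)

def regression_gate(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Return True when the candidate is no worse than baseline minus tolerance."""
    return candidate >= baseline - tolerance

baseline_score = run_experiment(baseline_task, DATASET, exact_match)
candidate_score = run_experiment(candidate_task, DATASET, exact_match)
passed = regression_gate(baseline_score, candidate_score)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f} "
      f"gate={'pass' if passed else 'fail'}")
```

Here the candidate misses the damaged-item case, so the gate fails. In a CI job, a failing gate would typically exit non-zero and block the merge. An LLM-based or human scorer would slot in wherever `exact_match` is passed.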

      [ EDIT PROMPT ] → [ RUN EVAL ] → [ SCORE ] → [ COMPARE ] → [ SHIP ]

      ↑__________________________________ trace from production __________________________________↓

Exhibit B. The loop, drawn by someone who clearly believes arrows are a productivity tool.
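
The "trace from production" arrow is the least obvious part of that diagram, so here is one way it might look in code: a logged trace becomes a dataset row. This is a sketch under stated assumptions, not Braintrust's data model. The `Span`, `Trace`, and `EvalCase` shapes and the `trace_to_eval_case` helper are all hypothetical, as is the sample trace.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    """One logged step: an LLM call, a tool invocation, or an agent step."""
    kind: str    # e.g. "llm" or "tool"
    name: str
    input: str
    output: str

@dataclass
class Trace:
    trace_id: str
    user_input: str
    final_output: str
    spans: List[Span] = field(default_factory=list)

@dataclass(frozen=True)
class EvalCase:
    input: str
    expected: str
    source_trace_id: str

def trace_to_eval_case(trace: Trace, corrected_output: Optional[str] = None) -> EvalCase:
    """Turn a production trace into a regression case.

    If a reviewer supplies a corrected output, it becomes the expected value.
    Otherwise the logged output is kept as the reference.
    """
    expected = corrected_output if corrected_output is not None else trace.final_output
    return EvalCase(trace.user_input, expected, trace.trace_id)

# A made-up trace in which the agent ignored what its own tool returned.
trace = Trace(
    trace_id="trace-001",
    user_input="Can I get a refund on a damaged item after 45 days?",
    final_output="No, refunds close after 30 days.",
    spans=[
        Span("tool", "lookup_policy", "refund window", "30 days; damaged items exempt"),
        Span("llm", "draft_reply", "policy + question", "No, refunds close after 30 days."),
    ],
)

case = trace_to_eval_case(
    trace,
    corrected_output="Yes, damaged items are exempt from the 30-day window.",
)
dataset = [case]  # the next eval run now includes yesterday's failure
print(case)
```

Keeping `source_trace_id` on the case is a design choice of this sketch: when the case fails in a later run, someone can walk back to the original production steps.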

Milestones / 2023 → 2026

From a $5M seed to a quiet standard

2023

Founded. Ankur Goyal leaves Figma's AI team. Greylock leads a $5.1M seed.

2024

$36M Series A. a16z leads. Datadog, Databricks, Brockman, Mensch, Rauch and Last invest.

2024

Adoption inflection. Notion, Stripe, Vercel and Airtable standardize on Braintrust.

2025

Loop and Brainstore ship. Agentic optimizer plus a database tuned for AI logs.

2026

$80M Series B led by ICONIQ. Team passes ~160 people.

The proof

The customers, the cash, the receipts

The most useful way to read an infrastructure company is to look at who buys from it. Braintrust's customer list is a tell.

Notion

Stripe

Vercel

Airtable

Instacart

Zapier

Ramp

Dropbox

Cloudflare

BILL

Coursera

Replit

These are companies that have, between them, written about how they build AI. They have engineering blogs, internal platform teams, and the option to build evals in-house. They mostly didn't. They picked the workbench someone else was sharpening.

Funding, stacked. Or: who keeps writing checks.

Cumulative funding by round / USD millions

Seed 2023: $5.1M
Series A 2024: $36M
Series B 2026: $80M
Total raised: $121M

Source: Braintrust funding announcements, 2023-2026. Bars scaled to $121M. Bars do not, regrettably, scale to ambition.

A platform for AI engineers that AI engineers actually use - a sentence which sounds tautological until you remember how often it isn't true. - YesPress field notes

The mission

Make measurement the default, not the chore

Goyal has been admirably consistent in interviews about what he is trying to build. He wants evaluation to feel as native to AI engineering as unit tests feel to backend engineering. The metaphor is exact, and a little provocative: nobody high-fives the person who wrote the unit test. Also, nobody ships without one.

That is the cultural fight Braintrust is actually in. Not "which model is best" - the model leaderboards do that already - but "what does it mean to know your AI is getting better." The product is the answer; the workflow is the bet.

Founder: Ankur Goyal
Founded: 2023, San Francisco
Total raised: $121M
Latest round: Series B / $80M
Team size: ~160

Why it matters tomorrow

If agents are the future, somebody has to grade them

The next phase of AI is not chatbots. It is agents that book things, refund things, write code, file tickets, and occasionally do something stupid that costs real money. The blast radius is wider. The failures are more expensive. The need for measurement is no longer a nice-to-have; it is the thing that keeps you employed.

Braintrust's competitors - LangSmith, Weights & Biases Weave, Arize Phoenix, Humanloop, Datadog's LLM observability - all see the same horizon. The fight is less about whether evals matter, and more about whose workbench engineers actually open every morning. Braintrust's bet is that if you make the loop fast enough, the workbench wins by default.

It is a Tuesday afternoon. The engineer at Notion who pushed that prompt change sees the eval results before she finishes her coffee. The agent at Stripe answers the refund question correctly, and the trace is already an eval case for next week. The data scientist at Vercel picks the better model and moves on. None of this is glamorous. All of it runs through Braintrust. That, finally, is the point.
