When an LLM hallucinates at 3am, when an agent silently picks the wrong tool, when a model drifts a half-percent every Tuesday - somewhere on the screen of a tired engineer, Arize is the dashboard already open.
It is the spring of 2026 and a product manager at a global travel company is staring at a Slack thread. A customer-facing agent has been confidently recommending a hotel that no longer exists. Twelve times this week. Somewhere in a tab, a chart in Arize is glowing orange: a particular prompt template's hallucination rate has crept from 0.4% to 3.1%. Within an hour, an evaluator catches it, a guardrail ships, the agent stops lying.
This is the unglamorous, indispensable middle of the AI economy. Not the model. Not the chatbot. The thing watching both. Arize AI sells that thing - and is, by most reasonable measures, the company that most defined what "AI observability" even means.
Traditional software, when it breaks, has the courtesy of a stack trace. AI does not. A model can drift, a prompt can regress, a retrieval pipeline can pull yesterday's data into tomorrow's answer - and the only signal is a slow, mild rise in customer complaints. By the time anyone catches it, the model has been confidently wrong to thousands of users.
Arize's founders had spent careers watching this. Aparna at Uber had co-led the company's first model lifecycle management system - a now-famous internal tool called Michelangelo. Jason at TubeMogul had shipped AI strategy into production ad systems. Both had seen, up close, what happens when a model that worked last Tuesday quietly stops working this Tuesday. The lesson, in both cases, was unpleasant.
So in 2020 they did the deeply unsexy thing: they started a company about debugging.
In 2020, the wager looked questionable. "ML in production" was still, for most companies, a single engineer with a Jupyter notebook and a recurring nightmare. Calling that engineer's pain a "category" took some optimism. Seed money was small. The list of competitors was even smaller, which is the kind of detail that should worry a founder and almost never does.
Then ChatGPT happened. The single Jupyter notebook turned into a fleet of LLM-powered agents shipping to production every Friday afternoon. Every company in the Fortune 500 suddenly had the problem Aparna had seen at Uber and Jason at TubeMogul - except they had it everywhere, all at once, and nobody had a Michelangelo to help.
Arize had been quietly building exactly the thing that pain needed. The bet had paid off. The category was real.
The opinion is this: every interaction a model has with a user is a span, every span is data, and that data is the only honest record of how AI actually behaves. Everything Arize sells follows from that one belief.
Phoenix: the open-source library that started it all. LLM tracing, evaluation, experimentation. 2M+ monthly downloads. Free, self-hostable, no feature gates.
Arize AX: the enterprise platform. Online evaluations, drift monitoring, prompt management, retrieval debugging, agent tracing - at production scale.
OpenInference: an open standard for AI tracing built on OpenTelemetry. Arize co-authored it because someone had to. Now adopted across the LLM ecosystem.
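The belief that every span is data is easier to see in miniature. What follows is a toy sketch, not Arize's or Phoenix's API: the `Span` schema, the `hallucinated` label, the template names and the 1% alert threshold are all assumptions made up for illustration. It shows only the shape of the idea: record each interaction, attach an evaluator's verdict, and aggregate per prompt template.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One recorded model interaction (hypothetical schema, illustration only)."""
    trace_id: str
    prompt_template: str
    output: str
    hallucinated: bool  # verdict from some upstream evaluator, assumed to exist


def hallucination_rates(spans):
    """Return the fraction of spans flagged as hallucinated, per prompt template."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for span in spans:
        totals[span.prompt_template] += 1
        if span.hallucinated:
            flagged[span.prompt_template] += 1
    return {name: flagged[name] / totals[name] for name in totals}


def templates_over_threshold(spans, threshold):
    """List templates whose hallucination rate exceeds the alert threshold."""
    rates = hallucination_rates(spans)
    return sorted(name for name, rate in rates.items() if rate > threshold)


# Toy data: 1,000 spans per template, with made-up flags (4 and 31 respectively).
spans = (
    [Span(f"a{i}", "hotel_search_v1", "(model output)", i < 4) for i in range(1000)]
    + [Span(f"b{i}", "hotel_search_v2", "(model output)", i < 31) for i in range(1000)]
)

print(hallucination_rates(spans))
print(templates_over_threshold(spans, threshold=0.01))
```

With these invented numbers, one template sits at 0.4% and the other at 3.1%, so only the second crosses the assumed 1% line. A production system would do this over streaming data with real evaluators; the arithmetic underneath is no more exotic than this.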
A trillion spans a month is the kind of number that, in any other industry, you would politely assume is a typo. In Arize's case it is the actual rate at which its platform ingests AI behavior from customer systems. The category is new. The volume is not.
Who uses it: The customer roster reads like a list of companies you have probably interacted with this week without realizing.
That is the official phrasing, and it sounds modest in the way that genuinely large goals sometimes do. "Working" in the real world means a model that does not silently get worse over time. It means an agent that, when it does the wrong thing, can be traced, evaluated, and fixed. It means an engineering team that can ship AI on a Friday afternoon and still go home.
Arize's bet on open source is part of the same idea. Phoenix is free because Arize would rather every AI engineer had decent tools than only those at companies that signed a contract. The enterprise platform is for the ones whose trillion-span problems would crush a laptop.
The cap table is, conveniently, a list of believers. The Series C in February 2025 - the largest single round in AI observability history - was led by Adams Street Partners, with Microsoft's M12, Datadog, PagerDuty, Sinewave, OMERS and Industry Ventures joining. (Yes, Datadog. The observability incumbent putting money into the AI observability upstart is the kind of detail that tells you which way the wind is blowing.)
The next wave of AI deployments will not be chatbots. It will be agents - systems that take many steps, call many tools, spend real money, and make real decisions on behalf of real customers. That is, on a good day, a thrilling product. On a bad day, it is a long, expensive, undebugged loop.
Arize is building the supervision layer for that future: trace every step, evaluate every output, alert when behavior changes, and give the humans a way to intervene before the bill arrives. It is unglamorous work, in the way that brake systems on cars are unglamorous. Nobody buys a car for the brakes. Nobody drives one without them.
Back to that travel-company PM, still in her Slack thread. Six years ago she would have spent the next week reading raw logs and writing a postmortem. This afternoon, the chart is already green again. The agent is back to recommending hotels that exist. The hallucination rate is 0.3%. She closes the laptop. Somewhere in Berkeley, Arize ingests another billion spans, and nothing about that fact is interesting to anyone except the people for whom it is everything.