The AI company that decided the black box was a design choice - and then chose otherwise.
There is a quiet, load-bearing lie at the center of modern artificial intelligence, and it goes roughly like this: nobody knows why the model did that. It is said with a shrug, the way you'd explain weather. The largest language models are black boxes, the reasoning goes, and the black box is simply the price you pay for the intelligence. Guide Labs, a San Francisco company founded in 2023, exists to argue that this is not a law of physics. It's a habit. And habits can be broken by anyone willing to do the engineering.
The company builds interpretable foundation models - AI systems designed from the start so that a human can trace an output back to its causes. Its pitch is not that it can explain a model after the fact, which is the usual industry move and, per its own founder's academic work, frequently unreliable. Its pitch is that the model is interpretable by construction. The difference matters the way it matters whether a bank shows you the math on your loan denial or just says "computer says no." One of those is auditable. The other is a shrug wearing a lab coat.
"Training interpretable models is no longer a sort of science. It's now an engineering problem."
That line, from CEO Julius Adebayo, is the whole thesis compressed into a sentence. It sounds modest. It is not. Declaring that something has crossed from "open research question" into "thing we can reliably build" is one of the more consequential claims a founder can make, because it changes what you're allowed to expect. If interpretability is science, you wait. If it's engineering, you ship - and then someone has to explain why they didn't.
In February 2026, Guide Labs open-sourced Steerling-8B: an 8-billion-parameter language model released with weights and code under Apache 2.0 on Hugging Face and GitHub. The headline capability is that every token the model produces can be traced back to three things - the input context, the specific training data that shaped it, and a set of human-understandable concepts. It is, in the company's framing, the first large-scale inherently interpretable language model. You get the answer and the receipt in the same breath.
Under the hood it does two unusual things at once. It generates text via masked diffusion rather than the usual left-to-right prediction, filling in tokens by confidence rather than in reading order. And it inserts a "concept layer" that decomposes the model's internal representations into explicit, inspectable pathways. That decomposition is the part worth staring at.
The residual - whatever is left of a representation once the concept pathways are accounted for - is the tell. A credible interpretability story isn't "we understand everything." It's "here is exactly the part we don't."
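One way to picture a decomposition with an explicit residual is the toy sketch below. It is not Steerling-8B's architecture or API: the four-dimensional vectors, the concept names, and the `decompose` helper are all hypothetical, and it assumes the concept directions are orthonormal so that a plain dot product gives each activation. The point is only that the unexplained part is computed and reported rather than hidden.

```python
# Toy illustration only: NOT Steerling-8B's actual architecture or API.
# Assumption: concepts are orthonormal direction vectors in a tiny 4-d space.

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def decompose(hidden, concepts):
    """Split `hidden` into per-concept activations plus an unexplained residual."""
    activations = {name: dot(hidden, direction) for name, direction in concepts.items()}
    explained = [0.0] * len(hidden)
    for name, direction in concepts.items():
        for i, d in enumerate(direction):
            explained[i] += activations[name] * d
    residual = [h - e for h, e in zip(hidden, explained)]
    return activations, residual

# Hypothetical concept dictionary: two "known" concepts and one "discovered" one.
CONCEPTS = {
    "known:credit_history": [1.0, 0.0, 0.0, 0.0],
    "known:income": [0.0, 1.0, 0.0, 0.0],
    "discovered:c17": [0.0, 0.0, 1.0, 0.0],
}

hidden = [2.0, -1.0, 0.5, 0.25]
activations, residual = decompose(hidden, CONCEPTS)

# Share of the squared norm that the named concepts account for.
explained_share = 1.0 - dot(residual, residual) / dot(hidden, hidden)
print(activations, residual, round(explained_share, 3))
```

In this made-up example the three concepts account for nearly all of the vector, and the last coordinate is the residual: the part the sketch openly declines to name.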
The practical payoff is a set of things you can't do with an ordinary model. You can suppress or amplify a specific concept at inference time, no retraining required - which means alignment stops being a matter of generating thousands of safety examples and retraining, and starts being a dial you can turn. You can pull training-data provenance for any generated chunk. And Guide Labs says all of this costs surprisingly little: Steerling-8B reaches roughly 90% of the capability of larger, less interpretable models while using less training data. If that holds up, the long-assumed tax on transparency is a lot smaller than the field priced it at.
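The "dial" can be pictured with a minimal sketch, under the assumption that a representation has already been split into named concept activations plus a residual. Everything here is hypothetical (the `steer` and `recompose` helpers, the concept names, the three-dimensional vectors); it is not Guide Labs' API, only an illustration of rescaling a concept at inference time with no weights changed.

```python
# Toy illustration only: hypothetical names, not Guide Labs' API.

def steer(activations, scales):
    """Return a copy of `activations` with selected concepts rescaled.

    A scale of 0.0 suppresses a concept; a scale above 1.0 amplifies it.
    Concepts not listed in `scales` are left unchanged.
    """
    return {name: value * scales.get(name, 1.0) for name, value in activations.items()}

def recompose(activations, concepts, residual):
    """Rebuild the hidden vector from concept activations plus the residual."""
    hidden = list(residual)
    for name, value in activations.items():
        for i, d in enumerate(concepts[name]):
            hidden[i] += value * d
    return hidden

# Hypothetical concept directions and a hypothetical decomposed representation.
CONCEPTS = {"formal_tone": [1.0, 0.0, 0.0], "medical_advice": [0.0, 1.0, 0.0]}
activations = {"formal_tone": 0.5, "medical_advice": 2.0}
residual = [0.0, 0.0, 0.25]

# Turn one concept off and double another; no retraining, no weight updates.
steered = steer(activations, {"medical_advice": 0.0, "formal_tone": 2.0})
new_hidden = recompose(steered, CONCEPTS, residual)
print(steered, new_hidden)
```

The design choice worth noticing is that steering touches only the named activations: the residual passes through untouched, and the original activations are left intact for comparison.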
Guide Labs aims at the domains where a black box isn't a shrug - it's a compliance problem. Lending. Medicine. Drug discovery. Places where someone eventually has to answer, out loud, why the model did that.
Follow any generated chunk back to the training data and concepts that produced it. Provenance you can put in front of a regulator.
Amplify or suppress specific concepts at inference time - inference-time alignment without a retraining cycle.
Inspect known and discovered concepts to locate errors, spurious signals, and bias instead of guessing at a black box.
Steerling-8B ships under Apache 2.0. Read the code, run the model, look inside the concept layer yourself.
It's official: the first large-scale inherently interpretable language model is here.
The team carries an academic pedigree - PhDs from MIT, MILA, and the University of Maryland, with, by their own count, more than 20 years of combined interpretable-ML experience and two dozen-plus papers at top ML conferences. The founding thread runs through Julius Adebayo, who while at MIT co-authored a widely cited 2018 paper arguing that AI's popular explanation methods weren't actually reliable. Most people file a result like that under "known limitations." He filed it under unfinished business.
ML researcher, MIT PhD. Co-author of the influential 2018 saliency-map critique. Set out to build models interpretable by design rather than explained after the fact.
A veteran of the interpretable-ML research community, part of the founding team behind Guide Labs' approach.
Leads the science behind Guide Labs' interpretable foundation models, including the concept-decomposition work in Steerling-8B.
In 2024 Guide Labs closed a $9M seed round led by Initialized Capital to advance large-scale interpretable language models. The company came up through Y Combinator's Winter 2024 batch, and the cap table reads like a bet on interpretability as a business, not just a research agenda.
The model, code, and platform are public - go look inside.