There is a particular kind of patience that separates researchers from the people who merely publish papers. Yao Fu has it. In 2022, he tried an idea - using string-matching as a binary reward signal to train language models to reason. It failed. His friends tried the same idea in 2023. Failed again. Then in early 2024 with a different base model. Still failed. Most people would have moved on. Fu catalogued the failures and kept watching. Then, in late 2024, with Qwen 2.5 and DeepSeek V3 as base models, the approach suddenly worked - spectacularly. The R1 and K1.5 technical reports confirmed what three years of negative results had quietly suggested: the idea was never wrong. The models just weren't ready.
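The idea itself is simple enough to sketch in a few lines: extract the completion's final answer and return 1.0 on an exact string match with the reference, 0.0 otherwise. The sketch below is illustrative, not Fu's exact setup - the GSM8K-style "####" extraction marker and the exact-match rule are assumptions - but the scalar it returns is the kind of binary signal a policy-gradient method would then optimize.

```python
import re

def string_match_reward(completion: str, reference: str) -> float:
    """Binary reward for RL on reasoning: 1.0 iff the model's final answer
    string-matches the reference. Extraction via a GSM8K-style '####' marker
    is an illustrative assumption, not the exact scheme Fu used."""
    answers = re.findall(r"####\s*([^\n]+)", completion)
    if not answers:
        return 0.0
    prediction = answers[-1].strip().rstrip(".")
    return 1.0 if prediction == reference.strip() else 0.0

# The reward is deliberately crude: no partial credit, no learned verifier.
print(string_match_reward("Add the halves: 21 + 21. #### 42", "42"))  # 1.0
print(string_match_reward("Probably around #### 41", "42"))           # 0.0
```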
That story is, in miniature, the story of Yao Fu's career. He works at the intersection of what's mathematically true and what's practically possible - and he has learned to hold both without flinching. Born in China, educated at Peking University, Columbia, and the University of Edinburgh, Fu has accumulated a resume that reads like a tour of every consequential AI lab of the last five years: Allen Institute for AI, Alibaba, Google DeepMind (on both Gemini 3 and the still-mysterious Project Astra), NVIDIA, and now xAI, where he works on scaling research. Five top organizations in about as many years. That's not job-hopping. That's someone every major lab wants in the building.
It is only after late 2024 with Qwen 2.5 and DeepSeek V3 as base models that the string matching based reward approach began to work very well. The idea was always there - waiting for the infrastructure to catch up.
- Yao Fu, on X/Twitter, January 2025
Fu's PhD at Edinburgh, supervised by Prof. Luo Mai, was nominally about deep generative models and Bayesian learning. In practice it was a training ground for the question that would come to define his work: can we make small models reason like large ones? His ICML 2023 paper - presented as an oral, one of the conference's most selective presentation slots - answered yes, mostly. He showed that a fine-tuned T5 model of 11 billion parameters could be coaxed into multi-step reasoning by distilling capabilities from GPT-3.5 at 175 billion. The catch: you had to be precise about what you were distilling and how. Compress the wrong thing and the reasoning disappears. Compress the right thing and you get a model that punches far above its weight class.
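At the data level, that kind of distillation can be sketched as: sample chain-of-thought solutions from the large teacher, keep only the chains that end in the right answer, and fine-tune the small model on the surviving (question, chain) pairs. The sketch below is a loose illustration under those assumptions - `teacher_generate` is a hypothetical call to the teacher model, and the correctness filter here is simple answer matching rather than the paper's full selection procedure.

```python
def build_distillation_set(problems, teacher_generate, samples_per_problem=4):
    """Collect chain-of-thought traces from a large teacher, keeping only the
    chains whose final line contains the gold answer. Illustrative sketch;
    `teacher_generate` is a placeholder for the real teacher-model API."""
    distilled = []
    for problem in problems:
        for _ in range(samples_per_problem):
            chain = teacher_generate(
                f"Q: {problem['question']}\nA: Let's think step by step."
            )
            lines = chain.strip().splitlines()
            # Filter: imitate correct reasoning, not fluent mistakes.
            if lines and problem["answer"] in lines[-1]:
                distilled.append({"input": problem["question"], "target": chain})
    return distilled
```

The surviving pairs then become ordinary sequence-to-sequence fine-tuning targets for the smaller model; the "what you were distilling" question is decided almost entirely by that filter.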
He noticed something while doing this work. There was no reliable way to compare how well different models reasoned. Benchmarks existed, but they were scattered across papers, measured on different splits, reported with different methodologies. Fu built Chain-of-Thought Hub - a unified leaderboard tracking reasoning ability across GSM8K, MATH, TheoremQA, BBH, MMLU, HumanEval, and long-context datasets. The repository now has over 2,700 GitHub stars. In a field where everyone was claiming their model was best at reasoning, Fu created the shared ruler.
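The value of such a hub is less the numbers than the discipline: one prompt format, one answer-extraction rule, one accuracy metric applied to every model on every dataset. A toy version of that shared ruler looks something like the sketch below; the `model.generate` and benchmark interfaces are placeholders, not Chain-of-Thought Hub's actual code.

```python
def accuracy(model, benchmark, extract_answer):
    """One metric, applied identically to every model: exact match between
    the extracted answer and the gold label. Interfaces are placeholders."""
    correct = sum(
        extract_answer(model.generate(ex["prompt"])) == ex["answer"]
        for ex in benchmark
    )
    return correct / len(benchmark)

def leaderboard(models, benchmarks, extract_answer):
    # Same splits, same extraction, same metric for every (model, dataset) cell.
    return {
        (model_name, bench_name): accuracy(model, bench, extract_answer)
        for model_name, model in models.items()
        for bench_name, bench in benchmarks.items()
    }
```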
The DuoAttention paper starts from an observation that sounds almost too clean to be useful: not all attention heads in a transformer behave the same way. Some of them - the retrieval heads - attend broadly across the entire context window when needed. Others - the streaming heads - mostly care about the most recent tokens and a few fixed positions. If you could identify which is which, you could treat them differently: give retrieval heads full memory, give streaming heads a short cache. The result, published at ICLR 2025, is a system that cuts memory use by up to 2.55x and speeds up decoding by up to 2.18x. Simple observation. Non-trivial implementation. Considerable impact.
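A cartoon of the resulting cache policy, under the assumption that the retrieval-head assignment is already known (in the paper it is identified with a learned, optimization-based procedure, omitted here):

```python
class DuoStyleKVCache:
    """Illustrative split KV cache: retrieval heads keep every token's
    key/value entry; streaming heads keep a few initial 'sink' tokens plus a
    recent window. Head classification is taken as given in this sketch."""

    def __init__(self, retrieval_heads, num_heads, sink=4, recent=256):
        self.retrieval_heads = set(retrieval_heads)
        self.num_heads = num_heads
        self.sink, self.recent = sink, recent
        self.cache = {h: [] for h in range(num_heads)}  # per-head list of (k, v)

    def append(self, per_head_kv):
        """per_head_kv: mapping head index -> (key, value) for the new token."""
        for h in range(self.num_heads):
            self.cache[h].append(per_head_kv[h])
            if h not in self.retrieval_heads:
                entries = self.cache[h]
                # Streaming head: evict the middle, keep sinks + recent tokens.
                if len(entries) > self.sink + self.recent:
                    self.cache[h] = entries[: self.sink] + entries[-self.recent:]

    def cached_tokens(self, h):
        return len(self.cache[h])
```

Retrieval heads still attend over the whole context; streaming heads attend over a constant-size cache, which is where the memory and decoding savings come from.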
At Google DeepMind, Fu worked on the perception components of Gemini 3 and contributed to Project Astra - Google's research into persistent, real-time AI assistants. This is notable not because of the specific work (which remains largely undisclosed) but because of what it reveals about how Fu thinks. Most pure ML researchers, given access to a world-class lab, gravitate toward clean papers. Fu gravitates toward deployed systems. ServerlessLLM, his paper at USENIX OSDI 2024 - a systems conference, not an ML conference - is the clearest evidence of this. The paper solves a problem that anyone running LLM inference at scale knows intimately: models are enormous, cloud infrastructure is shared, and loading a 70-billion-parameter model on demand takes time that users don't have. ServerlessLLM achieves six to ten times faster model loading than SafeTensors, with latency reductions of 10 to 200 times. Publishing that work at OSDI rather than NeurIPS tells you something about where Fu thinks the real leverage is.
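The arithmetic behind that problem is worth making explicit. A 70-billion-parameter model in 16-bit precision is roughly 140 GB of weights, and how long it takes to pull that off storage depends entirely on where it sits. The bandwidth figures below are rough assumptions for illustration, not measurements from the paper.

```python
# Back-of-envelope load times for a 70B-parameter checkpoint in fp16 (~140 GB).
# Bandwidths are illustrative assumptions, not numbers from ServerlessLLM.
params = 70e9
checkpoint_gb = params * 2 / 1e9  # 2 bytes per parameter in fp16

bandwidths_gb_per_s = {
    "remote object store": 0.5,
    "local NVMe SSD": 3.0,
    "host DRAM": 25.0,
}
for medium, bw in bandwidths_gb_per_s.items():
    print(f"{medium:>20}: ~{checkpoint_gb / bw:6.1f} s just to read the weights")
```

Keeping checkpoints close to the GPU and reading them at full bandwidth is exactly the kind of gap a loading-optimized system can exploit.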
There's a Chinese character in his name - 符 - that means symbol, or talisman, or written charm. The kind of mark made to convey something that plain language can't quite capture. It's a coincidence so apt it borders on unfair. Fu's entire research program is about language and its limits: where symbolic reasoning breaks down in neural networks, where it doesn't, and what it would take to make machines that reliably think rather than fluently predict. He runs a newsletter, also called Yao Fu, hosted on Notion, where he writes about these questions with the directness of someone who finds the subject genuinely interesting and has no patience for hype. That's rarer than it sounds.
The key insight behind DuoAttention is that not all attention heads serve the same purpose - some are retrieval heads needed for long context, others are streaming heads that only need recent tokens.
- Yao Fu, on the DuoAttention methodology
Now at xAI, working on scaling research, Fu is closer to the frontier than he has ever been. xAI's Grok series represents some of the most aggressive scaling experiments in the industry, and Fu's particular combination of skills - mathematical rigor, systems intuition, empirical patience - is exactly what scaling research demands. In April 2025, he co-published a comprehensive survey on LLM reasoning frontiers, cataloguing the state of inference scaling, learning-to-reason approaches, and agentic systems. It reads like a field guide written by someone who has personally tried most of the things he's describing.
The broader picture of Yao Fu is someone who arrived at the right moment with the right training and has spent the years since making sure the field has the tools it needs to measure its own progress. He published the benchmark when there was no benchmark. He published the serverless paper when production was outrunning theory. He spent three years treating a negative result as provisional rather than final - and was right. In a field that routinely mistakes speed for progress, he is the kind of researcher who makes you wonder what took everyone else so long.