BREAKING: Yao Fu joins xAI as scaling researcher · DuoAttention accepted at ICLR 2025 - reduces LLM memory by 2.55x · ServerlessLLM published at USENIX OSDI 2024 · Chain-of-Thought Hub: 2,700+ GitHub stars · 5,000+ citations across 18 papers · Former Google DeepMind - worked on Gemini 3 & Project Astra · LLM Reasoning Survey published April 2025

Yao Fu

符尧 - The Reasoning Architect
xAI Researcher · LLM Reasoning · Edinburgh PhD · Open Source

He waited three years to prove a single idea right. That's not stubbornness - that's how you know someone is actually doing science.

5K+ Citations
18 Papers Published
2.7K GitHub Stars
4 Top AI Orgs

The Man Who Made Language Models Think

There is a particular kind of patience that separates researchers from the people who merely publish papers. Yao Fu has it. In 2022, he tried an idea - using string-matching as a binary reward signal to train language models to reason. It failed. His friends tried the same idea in 2023. Failed again. Then in early 2024 with a different base model. Still failed. Most people would have moved on. Fu catalogued the failures and kept watching. Then, in late 2024, with Qwen 2.5 and DeepSeek V3 as base models, the approach suddenly worked - spectacularly. The R1 and K1.5 technical reports confirmed what three years of negative results had quietly suggested: the idea was never wrong. The models just weren't ready.
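The mechanism is simple to state: extract the model's final answer, compare it as a string against the reference, and hand back 1 for a match and 0 otherwise - no learned reward model involved. A minimal sketch of that idea (the `Answer:` extraction format and normalization here are illustrative assumptions, not Fu's actual pipeline):

```python
import re

def string_match_reward(completion: str, gold_answer: str) -> float:
    """Binary reward for RL on reasoning: 1.0 if the model's final
    answer string matches the reference, 0.0 otherwise.
    (Sketch only; the real extraction format is an assumption.)"""
    # Assume the completion ends with a line like "Answer: 42".
    match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0
```

The resulting scalar would be the sole training signal fed to an RL algorithm - which is exactly why the approach demands a base model already strong enough to sometimes produce the right string.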

That story is, in miniature, the story of Yao Fu's career. He works at the intersection of what's mathematically true and what's practically possible - and he has learned to hold both without flinching. Born in China, educated at Peking University, Columbia, and the University of Edinburgh, Fu has accumulated a resume that reads like a tour of every consequential AI lab of the last five years: Allen Institute for AI, Alibaba, Google DeepMind (on both Gemini 3 and the still-mysterious Project Astra), NVIDIA, and now xAI, where he works on scaling research. Four top AI organizations in five years. That's not job-hopping. That's someone every major lab wants in the building.

It is only after late 2024 with Qwen 2.5 and DeepSeek V3 as base models that the string matching based reward approach began to work very well. The idea was always there - waiting for the infrastructure to catch up.
- Yao Fu, on X/Twitter, January 2025

Fu's PhD at Edinburgh, supervised by Prof. Luo Mai, was nominally about deep generative models and Bayesian learning. In practice it was training ground for the question that would come to define his work: can we make small models reason like large ones? His ICML 2023 paper - presented as an oral, one of the conference's most selective presentation slots - answered yes, mostly. He showed that a fine-tuned T5 model of 11 billion parameters could be coaxed into multi-step reasoning by distilling capabilities from GPT-3.5 at 175 billion. The catch: you had to be precise about what you were distilling and how. Compress the wrong thing and the reasoning disappears. Compress the right thing and you get a model that punches far above its weight class.
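One common recipe for this kind of specialization - not necessarily the paper's exact pipeline - is to sample chain-of-thought traces from the teacher, keep only those whose final answer is correct, and fine-tune the student on the survivors. A sketch, where `query_teacher` and `extract_answer` are hypothetical placeholders for the teacher API call and the answer parser:

```python
def build_distillation_set(problems, query_teacher, extract_answer):
    """Collect teacher chain-of-thought traces, filtered by answer
    correctness, as (input, target) pairs for student fine-tuning.
    (Illustrative sketch, not the paper's exact data pipeline.)"""
    dataset = []
    for prob in problems:
        trace = query_teacher(prob["question"])       # e.g. sample a CoT from GPT-3.5
        if extract_answer(trace) == prob["answer"]:   # keep correct traces only
            dataset.append({"input": prob["question"], "target": trace})
    return dataset
```

The filtering step is the "be precise about what you're distilling" part: the student imitates full reasoning chains, not just final answers, and only chains that actually arrive at the right place.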

He noticed something while doing this work. There was no reliable way to compare how well different models reasoned. Benchmarks existed, but they were scattered across papers, measured on different splits, reported with different methodologies. Fu built Chain-of-Thought Hub - a unified leaderboard tracking reasoning ability across GSM8K, MATH, TheoremQA, BBH, MMLU, HumanEval, and long-context datasets. The repository now has over 2,700 GitHub stars. In a field where everyone was claiming their model was best at reasoning, Fu created the shared ruler.

Inside the Lab

The DuoAttention paper starts from an observation that sounds almost too clean to be useful: not all attention heads in a transformer behave the same way. Some of them - the retrieval heads - attend broadly across the entire context window when needed. Others - the streaming heads - mostly care about the most recent tokens and a few fixed positions. If you could identify which is which, you could treat them differently: give retrieval heads full memory, give streaming heads a short cache. The result is a system that uses up to 2.55x less memory with up to 2.18x faster decoding, accepted at ICLR 2025. Simple observation. Non-trivial implementation. Considerable impact.
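The cache policy itself takes only a few lines to express. Head classification is learned via an optimization pass in the paper; the sketch below assumes it is already given as a boolean flag per head, and the sink and window sizes are illustrative defaults, not the paper's tuned values:

```python
def streaming_cache(tokens, num_sink=4, num_recent=64):
    """Streaming-head policy: keep the first few 'attention sink'
    tokens plus a recent window, so the cache is constant-size
    regardless of context length. (Sizes are illustrative.)"""
    if len(tokens) <= num_sink + num_recent:
        return list(tokens)
    return list(tokens[:num_sink]) + list(tokens[-num_recent:])

def cached_tokens_per_head(tokens, head_is_retrieval):
    """Retrieval heads keep the full KV cache; streaming heads keep
    only sinks + recent tokens. Memory savings come from how many
    heads can be safely demoted to the streaming policy."""
    return [list(tokens) if retrieval else streaming_cache(tokens)
            for retrieval in head_is_retrieval]
```

With a long context, a retrieval head's cache grows linearly while a streaming head's stays fixed - which is where the memory reduction comes from.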

At Google DeepMind, Fu worked on the perception components of Gemini 3 and contributed to Project Astra - Google's research into persistent, real-time AI assistants. This is notable not because of the specific work (which remains largely undisclosed) but because of what it reveals about how Fu thinks. Most pure ML researchers, given access to a world-class lab, gravitate toward clean papers. Fu gravitates toward deployed systems. ServerlessLLM, his paper at USENIX OSDI 2024 - a systems conference, not an ML conference - is the clearest evidence of this. The paper solves a problem that anyone running LLM inference at scale knows intimately: models are enormous, cloud infrastructure is shared, and loading a 70-billion-parameter model on demand takes time that users don't have. ServerlessLLM achieves six to ten times faster model loading than SafeTensors, with latency reductions of 10 to 200 times. Publishing that work at OSDI rather than NeurIPS tells you something about where Fu thinks the real leverage is.

There's a Chinese character in his name - 符 - that means symbol, or talisman, or written charm. The kind of mark made to convey something that plain language can't quite capture. It's a coincidence so apt it borders on unfair. Fu's entire research program is about language and its limits: where symbolic reasoning breaks down in neural networks, where it doesn't, and what it would take to make machines that reliably think rather than fluently predict. He runs a newsletter, also called Yao Fu, hosted on Notion, where he writes about these questions with the directness of someone who finds the subject genuinely interesting and has no patience for hype. That's rarer than it sounds.

The key insight behind DuoAttention is that not all attention heads serve the same purpose - some are retrieval heads needed for long context, others are streaming heads that only need recent tokens.
- Yao Fu, on the DuoAttention methodology

Now at xAI, working on scaling research, Fu is closer to the frontier than he has ever been. xAI's Grok series represents some of the most aggressive scaling experiments in the industry, and Fu's particular combination of skills - mathematical rigor, systems intuition, empirical patience - is exactly what scaling research demands. In April 2025, he co-published a comprehensive survey on LLM reasoning frontiers, cataloguing the state of inference scaling, learning-to-reason approaches, and agentic systems. It reads like a field guide written by someone who has personally tried most of the things he's describing.

The broader picture of Yao Fu is someone who arrived at the right moment with the right training and has spent the years since making sure the field has the tools it needs to measure its own progress. He published the benchmark when there was no benchmark. He published the serverless paper when production was outrunning theory. He waited three years before declaring a negative result might be provisional. In a field that routinely mistakes speed for progress, he is the kind of researcher who makes you wonder what took everyone else so long.

Timeline

2017
CCF Elite Collegiate Award at Peking University
2018
Wangxuan Scholarship, Peking University - B.S. Computer Science
2020
PhD begins at University of Edinburgh under Prof. Luo Mai
2022
Research internships at Allen Institute for AI and Alibaba Group. First attempt at string-matching RL reward - fails.
2023
ICML 2023 Oral: "Specializing Smaller Language Models towards Multi-Step Reasoning". Chain-of-Thought Hub repository launched.
2024
ServerlessLLM accepted at USENIX OSDI 2024. Joined Google DeepMind - Gemini 3 perception & Project Astra.
2025
Joined xAI as scaling researcher. DuoAttention at ICLR 2025. Comprehensive LLM Reasoning Survey published.

Key Publications

USENIX OSDI 2024
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
Achieves 6-10x faster model loading than SafeTensors. Reduces latency by 10-200x for on-demand LLM serving in cloud environments.
arXiv: 2401.14351
ICLR 2025
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Splits attention heads into retrieval (full context) and streaming (recent-only) heads. Reduces memory by 2.55x, speeds decoding by 2.18x.
arXiv: 2410.10819
ICML 2023 - Oral
Specializing Smaller Language Models towards Multi-Step Reasoning
Distills complex reasoning from GPT-3.5 (175B+) into T5 variants (up to 11B). Shows that targeted distillation preserves reasoning chains.
arXiv: 2301.12726
2023 Benchmark
Chain-of-Thought Hub: Measuring Complex Reasoning in LLMs
Unified benchmark tracking CoT reasoning across GSM8K, MATH, MMLU, HumanEval and more. The field's shared ruler for reasoning progress.
2,700+ GitHub Stars
NeurIPS 2024
AutoGuide: Automated Generation of Context-Aware Guidelines for LLM Agents
Automatically generates context-aware behavioral guidelines from offline agent experiences. Demonstrates strong out-of-domain generalization.
arXiv: 2403.08978
arXiv 2025
A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems
Comprehensive categorization of the LLM reasoning landscape as of 2025, covering inference scaling, learning-to-reason approaches, and agentic systems.
arXiv: 2504.09037

Things Worth Knowing

His Chinese character 符 means "symbol" or "talisman" - fitting for someone who studies language representations for a living.
📷
His profile photo was taken on an iPhone 13 in July 2022 and edited in Capture One 22. Science, meet aesthetics.
He tracked a failed research idea across three years and multiple base models before calling it validated. That's empirical patience most researchers don't have.
🎓
GRE 324 + 4.0 analytical, TOEFL 107 - the scores that got him from Peking University to Columbia and Edinburgh.
💻
GitHub handle: FranxYao - a mashup of Francis (English name) and Yao. He has 39 public repositories.
🏠
Four top AI organizations in five years: Allen AI, Google DeepMind, NVIDIA, xAI. That's not a resume. That's a tour of the field's frontier.