Tagged Content
Everything on the platform tagged with inference.

Yao Fu (符尧) is an AI researcher at xAI specializing in large language model reasoning, efficient inference, and distributed systems. A PhD graduate of the University of Edinburgh, he previously worked at Google DeepMind on Gemini 3 and Project Astra. With over 5,000 citations and key papers like ServerlessLLM (OSDI '24) and DuoAttention (ICLR '25), Fu bridges systems engineering and ML research. He writes the 'Yao Fu' newsletter on Notion and is known for the Chain-of-Thought Hub benchmark repository, which helped track LLM reasoning progress across the field.

OctoAI (formerly OctoML) was a Seattle-based AI infrastructure company founded in 2019 by University of Washington researchers — including Apache TVM creator Tianqi Chen and CEO Luis Ceze. The company built a generative AI inference platform that gave developers fast, affordable API access to leading open-source LLMs and image generation models, along with OctoStack, an enterprise-grade private AI deployment stack. After raising ~$132M and pivoting from ML optimization to GenAI infrastructure, OctoAI was acquired by NVIDIA in September 2024 and wound down its commercial services by October 31, 2024.

Fireworks AI is a generative AI inference platform founded in 2022 by seven engineers — five of whom built PyTorch at Meta — that gives enterprises fast, cost-efficient, and customizable access to hundreds of open-source models. The company's proprietary FireAttention kernels and speculative-execution engine deliver up to 40× faster inference and 8× cost reduction versus alternatives, while its fine-tuning and model-deployment tooling lets companies own their AI stack end-to-end. With $327M+ raised, a $4B valuation, 10,000+ customers including Samsung, Uber, Shopify, and Cursor, and a $315M annualized run-rate as of early 2026, Fireworks AI has become the go-to inference layer for production generative AI applications.
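Fireworks' speculative-execution engine builds on the general idea of speculative decoding: a cheap draft model proposes several tokens at once, and the expensive target model only verifies them, accepting the longest agreeing prefix. The sketch below is a minimal illustration of that technique under greedy decoding, not Fireworks' implementation; the function names and toy models are hypothetical.

```python
# Hypothetical sketch of greedy speculative decoding (illustrative only,
# not Fireworks' actual engine). target_next / draft_next map a token
# sequence to the next token, standing in for real models.

def speculative_decode(target_next, draft_next, prompt, max_new=8, k=4):
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposals; under greedy decoding a
        #    token is accepted iff the target would have emitted it too.
        accepted = 0
        for i, t in enumerate(proposal):
            if target_next(seq + proposal[:i]) == t:
                accepted += 1
            else:
                break
        seq.extend(proposal[:accepted])
        # 3. On rejection (or full acceptance), take one token from the
        #    target so the loop always progresses and the output is
        #    identical to decoding with the target model alone.
        seq.append(target_next(seq))
    return seq[:len(prompt) + max_new]
```

The key property is that the output matches what the target model would have produced on its own; the speedup comes from verifying k draft tokens in one batched target pass instead of k sequential ones.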

Predibase was a San Francisco-based AI infrastructure company (founded 2020, acquired by Rubrik in June 2025) that pioneered efficient LLM fine-tuning and serving at scale. Built by the creators of Uber AI's Ludwig and Horovod frameworks, Predibase made it easy for enterprises to fine-tune and deploy open-source LLMs using LoRA adapters — often outperforming GPT-4 on domain-specific tasks for under $8 of compute. Its open-source LoRAX inference server enabled serving thousands of fine-tuned models from a single GPU, dramatically cutting costs. After raising $28M from Greylock and Felicis, Predibase was acquired by cybersecurity firm Rubrik for over $100M to accelerate agentic AI adoption.
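The multi-adapter serving idea behind LoRAX can be sketched simply: the large base weights are loaded once and shared, while each fine-tune contributes only a tiny low-rank correction B(Ax) applied per request. This is an illustrative sketch of that pattern, assuming a single linear layer; the class and helper names are hypothetical, not LoRAX's API.

```python
# Illustrative sketch of multi-adapter LoRA serving (the pattern behind
# LoRAX, not its actual implementation). One shared base matrix W serves
# every tenant; each adapter stores only small low-rank factors A and B.

def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class MultiLoRAServer:
    def __init__(self, W):
        self.W = W          # shared base weights, loaded once
        self.adapters = {}  # adapter_id -> (A, B) low-rank factors

    def register(self, adapter_id, A, B):
        self.adapters[adapter_id] = (A, B)

    def forward(self, adapter_id, x):
        # Base projection shared by every request, regardless of tenant.
        y = matvec(self.W, x)
        # Per-request low-rank correction: W x + B (A x).
        A, B = self.adapters[adapter_id]
        delta = matvec(B, matvec(A, x))
        return [a + b for a, b in zip(y, delta)]
```

Because each adapter holds only rank-r factors (r × d and d × r) rather than a full d × d matrix, thousands of fine-tunes can coexist in the memory of a single GPU alongside one copy of the base model.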

Baseten is a San Francisco-based AI inference infrastructure company that provides dedicated and serverless GPU compute for running AI models at scale. Founded in 2019 by four ex-Gumroad engineers, the company has grown into a unicorn with a $5B valuation and $585M in total funding, backed by NVIDIA and other top-tier investors. Baseten powers inference workloads for 100+ enterprises including Cursor, Notion, HeyGen, and Clay, offering an inference stack with near-zero cold starts, proprietary networking, and open-source tooling like Truss for model packaging.

Modal (Modal Labs) is an AI-native serverless cloud computing platform that gives developers instant, elastic access to GPUs and CPUs through a clean Python SDK — no YAML, no Dockerfiles, no infrastructure management required. Founded in 2021 by Spotify ML veteran Erik Bernhardsson, Modal enables AI and ML teams to scale from zero to thousands of GPUs in seconds, paying only for what they use. With customers like Suno, Mistral AI, Harvey, Ramp, and Substack, Modal reached unicorn status at a $1.1B valuation in September 2025 and was reportedly in talks to raise at $2.5B just five months later.

RunPod is an AI cloud infrastructure company that provides on-demand GPU compute for training, fine-tuning, and deploying AI/ML models. Founded in 2022 by two former Comcast engineers who pivoted their Ethereum mining rigs into AI servers, RunPod grew to $120M ARR with just $22M raised by early 2026, serving 500,000+ developers across 183 countries. Its marketplace model, per-second billing, and support for 30+ GPU SKUs — from consumer RTX 4090s to enterprise H100s and B200s — make it a capital-efficient disruptor to hyperscaler GPU clouds like AWS, GCP, and Azure.