The Engineer Who Teaches Machines to Fail - and Humans to Learn
"You can't prevent your last outage, no matter how hard you try." That's not pessimism. That's the most useful thing anyone has said about software reliability in the last decade.
Most engineers spend their careers trying to prevent things from going wrong. Lorin Hochstein spent his figuring out what happens when they do - and why that question is far more interesting.
Today he is a Staff Software Engineer focusing on reliability at Airbnb. Before that came a career arc that reads like a tour of the most ambitious and turbulent corners of modern software: Netflix's chaos engineering and cloud operations teams, Coupang's platform engineering, SendGrid's infrastructure, academic labs at USC, and a stint as a professor at the University of Nebraska-Lincoln.
What ties it together is a single, stubborn conviction: systems fail in ways you cannot fully anticipate, and the real work is building organizations that learn from failure rather than organizations that pretend it won't happen.
At Netflix, Hochstein didn't just work on reliability - he wrote version 2 of Chaos Monkey, the tool that deliberately kills production servers to test whether the system can recover. The philosophy behind it is almost offensive in its directness: if your system can't survive a random server going down, you need to know that today, not during a 3am incident next quarter.
He also led the OOPS project at Netflix - a cultural initiative to make it safe, even rewarding, for engineers to report operational surprises. The name was intentional. Mistakes happen; hiding them is the real problem. The OOPS project reframed the question from "who is to blame?" to "what did we learn?" - a deceptively simple shift that requires enormous organizational trust to pull off.
Beyond the engineering work, Hochstein contributes to the broader intellectual ecosystem of incident analysis. His blog, Surfing Complexity, applies insights from resilience engineering, cognitive systems engineering, and human factors research to the messy reality of running software at scale. His GitHub repository of resilience engineering papers has become a canonical reading list for the field, curated with the care of someone who has actually read all of them.
He holds a PhD in Computer Science from the University of Maryland, an M.S. in Electrical Engineering from Boston University, and a B.Eng. in Computer Engineering from McGill. The academic background shows - not in jargon, but in a habit of treating questions about incident analysis the way a scientist treats an experiment: rigorously, with intellectual honesty about what the data does and doesn't support.
His newsletter and community contributions through Learning from Incidents have helped define a new genre of engineering writing: one that takes human cognition, team dynamics, and organizational culture as seriously as architecture diagrams and SLOs. The movement he is part of is slowly changing how the industry thinks about postmortems - away from root-cause theater and toward genuine inquiry.
The Mastodon handle says it plainly: @norootcause. Complex system failures don't have a single root cause. Finding one is a story we tell ourselves to feel in control. Learning to live with that uncertainty - and to be rigorous anyway - is what Hochstein's career has been about.
At Netflix, Hochstein rewrote Chaos Monkey - the tool that randomly terminates production servers - into its second version, integrating it with Spinnaker. The logic: if you want resilient systems, break them on purpose before reality does it for you. He also co-authored the paper "Automating Failure Testing Research at Internet Scale" at SOCC '16, taking the concept from practice to scholarship.
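The core mechanic is simple enough to sketch. Below is a minimal illustration in Python - a toy under stated assumptions, not Netflix's implementation: the instance inventory, the terminate() callback, and the termination probability are hypothetical stand-ins for a real cloud inventory and API.

    import random

    def chaos_round(instances_by_group, terminate, probability=0.2):
        """One chaos-monkey-style pass: for each group of redundant
        instances, randomly terminate one member with some probability."""
        for group, instances in instances_by_group.items():
            if len(instances) < 2:
                continue  # never kill a group's only remaining instance
            if random.random() < probability:
                victim = random.choice(instances)
                terminate(group, victim)

    # Toy usage: a fake inventory and a terminate() that only logs.
    inventory = {"api": ["i-01", "i-02", "i-03"], "worker": ["i-04", "i-05"]}
    chaos_round(inventory, lambda g, i: print(f"chaos: terminating {i} in {g}"))

The production tool's value lies in everything around this loop - opt-in configuration, scheduling terminations during business hours so engineers are awake to respond, and the Spinnaker integration the second version added - but the core gesture really is this small.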
Hochstein also led the OOPS project at Netflix - a cultural program encouraging engineers to openly document operational surprises. Not just incidents with a big blast radius, but the small weirdnesses that most organizations sweep under the rug. The bet: small surprises contain large lessons, if you bother to look. The challenge: convincing engineers it's safe to share, which is harder than it sounds.
Before the cloud giants, Hochstein was an assistant professor at the University of Nebraska-Lincoln and a computer scientist at USC's Information Sciences Institute. His paper "Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers" (SC'05) won a Best Student Paper Award. Academia shaped the analytical rigor that now distinguishes his engineering writing from most of the industry's.
His personal blog, Surfing Complexity, is a rare thing in tech: writing that applies ideas from cognitive systems engineering, resilience engineering, and naturalistic decision-making to real software operations. Not frameworks. Not hot takes. Just careful thinking, with citations. He writes the way someone writes when they actually read the papers rather than the conference talk abstracts.
Hochstein wrote the first edition of "Ansible: Up and Running" for O'Reilly, co-authored the second with René Moser, and contributed to the third edition with Bas Meijer and René Moser. Ansible was only a couple of years old when the first edition landed. The book helped the DevOps community actually use a tool they weren't sure how to approach. It remains a foundational reference.
From SREcon to QCon to the Learning from Incidents Conference, Hochstein speaks the way he writes: grounded, measured, and occasionally unsettling. His SREcon22 EMEA opening keynote and his LFI23 talk "Your Understanding of Reality is Wrong" cemented a reputation for talks that challenge the audience's frame rather than confirm it. He is scheduled to speak at SREcon26 Americas on "The Power of Stories."
You can't prevent your last outage, no matter how hard you try. The goal isn't zero incidents - it's building systems and organizations that recover well when they happen.
What can be learned from investigating an outage is not proportional to its impact. It can actually be easier to learn from smaller incidents - there's less pressure for closure, and more room for honest inquiry.
People use systems in surprising ways, and changes over time invalidate initial assumptions. You can only discover this when systems break down and you take the time to actually investigate.
The real value of incident review comes from reading postmortem writeups, attending review meetings, and having conversations with the people who were in the room when it happened.
The go-to reference for Ansible, the configuration management and automation tool. Hochstein wrote the first edition when Ansible was still a young project - before anyone else thought a book was warranted. Now in its third edition with co-authors Bas Meijer and René Moser. The book that helped the DevOps community stop fighting Ansible and start using it.
O'Reilly Media - 3 Editions (2014, 2017, 2021)
Co-authored with Casey Rosenthal, Aaron Blohowiak, Nora Jones, and Ali Basiri. The definitive O'Reilly text on chaos engineering - the practice of deliberately injecting failure into production systems to uncover hidden weaknesses. Hochstein's contribution brings the academic rigor of failure analysis to a field that badly needed it. Required reading for anyone running distributed systems at scale.
O'Reilly Media - 2017
Co-authored with Tom Fifield, Diane Fleming, Anne Gentle, Jonathan Proulx, Everett Toews, and Joe Topjian. A practical guide to running OpenStack clouds in production - the messy real-world challenges that the marketing materials don't cover. Published at a moment when organizations were trying to figure out whether private cloud was viable. For many teams, this was the document that made it possible.
O'Reilly Media - 2014
The handle is a philosophy. His Mastodon account is @norootcause - a statement of belief that complex system failures don't have a single root cause. Most incident postmortems are detective fiction, not forensic science. He's been making this argument for years, and the field is slowly coming around.
He built the monkey that breaks things. Chaos Monkey - Netflix's server-killing tool - is one of the most influential artifacts in the history of reliability engineering. Hochstein didn't invent it, but he rebuilt it. Version 2 integrated with Spinnaker and became the production-ready version the industry learned from.
Three universities, two countries. His degrees span McGill (Montreal), Boston University, and the University of Maryland. The Canadian engineering undergrad, American electrical engineering master's, and American computer science PhD pathway is unusual, and it shows in a career that doesn't fit neatly into any single category.
He maintains the resilience engineering reading list. The GitHub repository github.com/lorin/resilience-engineering is a curated collection of papers on resilience engineering, cognitive systems engineering, and naturalistic decision-making. It has become the field's unofficial syllabus - the first place practitioners go when they want to understand where these ideas come from.
He named a Netflix project after a sound people make when things go wrong. The OOPS initiative - designed to make it safe for engineers to report operational surprises - was named with precision. The name communicates the whole program: things go wrong, that's OK, tell us about it. Naming matters, and Hochstein knows it.
DevOps hat trick. Three O'Reilly books covering three distinct domains: configuration management (Ansible), cloud infrastructure (OpenStack), and reliability experiments (Chaos Engineering). Each was published at a moment when the field needed a reference book and didn't quite have one yet. That timing is not a coincidence.