Apache DataFusion Becomes Top-Level Project (June 2024)
1,000 DataFusion-Based Systems Predicted for 2025
Now at Apple: Principal Distributed Database Engineer
1.5k+ GitHub Followers • 43 Public Repositories
Author of "How Query Engines Work"
Andy Grove

The engineer who looked at Big Data's Rust problem and said "hold my query engine" - then proceeded to build three of them.

30+ Years Coding
8.7k DataFusion Stars
1,000 DataFusion Systems: 2025 Goal
3 Apache Donations

Most people don't wake up one morning and decide to rebuild the entire Big Data ecosystem in a language most database engineers avoid. Andy Grove did exactly that. He donated not one, not two, but three major open-source projects to the Apache Software Foundation: the Rust implementation of Apache Arrow, DataFusion, and Ballista. If you're counting, that's the memory format, the query engine, and the distributed execution framework. The full stack.

Here's the thing nobody tells you about revolutions: they start with someone quietly solving a problem nobody asked them to solve. In 2018, Grove was tinkering with Apache Arrow - a columnar memory format designed for analytics - when he noticed the Rust implementation was missing. Not incomplete. Missing. So he built it. Then he built DataFusion on top of it because, well, if you're going to have a memory format, you might as well have something that can query it efficiently.

"What started as a modest project to provide a simple and efficient query engine has evolved into a robust, high-performance system that powers data-centric applications worldwide."

The man currently serving as Principal Distributed Database Engineer at Apple holds a patent for scalable relational database replication (U.S. Patent No. 8,626,709, if you're keeping track). He founded two companies - one sold successfully, one didn't. He's worked in banking, media, insurance, hardware, and software. Thirty years of writing code, and what does he do? He writes a book called "How Query Engines Work" because apparently building them wasn't enough.

The DataFusion Origin Story

Picture this: it's 2019, and Grove is pitching venture capitalists on a vision - 1,000 projects powered by DataFusion. The VCs probably smiled politely. Fast forward to January 2025, and he's predicting the goal will actually be reached this year. Apache DataFusion became a top-level project in June 2024. The modest side project is now the foundation for hundreds of data-intensive applications.

What makes DataFusion different? It's not just another SQL query engine. It's embeddable, written in Rust, and designed from the ground up to work seamlessly with Apache Arrow's columnar format. Zero-copy data exchange. Memory safety without garbage collection. Performance that makes database engineers do double-takes. Grove didn't just build a better mousetrap - he built a mousetrap that other people could embed in their own mousetraps.
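To make the columnar idea a little more concrete, here is a minimal sketch in plain Python. It is not Arrow's actual memory format and uses no DataFusion API; the table contents and the function names (`sum_temp_rows`, `sum_temp_columns`) are invented purely for illustration. It only shows why an analytic query that needs two columns can skip everything else when data is stored column by column.

```python
# Toy illustration of row-oriented vs. column-oriented layouts.
# Assumption: plain Python lists stand in for real columnar buffers.

# Row-oriented: each record is stored together.
rows = [
    {"city": "Denver", "temp": 21, "notes": "sunny"},
    {"city": "Boulder", "temp": 18, "notes": "windy"},
    {"city": "Denver", "temp": 25, "notes": "clear"},
]

# Column-oriented: all values of one column are stored together.
columns = {
    "city": ["Denver", "Boulder", "Denver"],
    "temp": [21, 18, 25],
    "notes": ["sunny", "windy", "clear"],
}


def sum_temp_rows(table, city):
    """Walk every record, loading whole rows to read two fields."""
    return sum(r["temp"] for r in table if r["city"] == city)


def sum_temp_columns(table, city):
    """Read only the two columns the query needs; 'notes' is never touched."""
    return sum(t for c, t in zip(table["city"], table["temp"]) if c == city)


denver_rows = sum_temp_rows(rows, "Denver")
denver_cols = sum_temp_columns(columns, "Denver")
```

Both functions return the same answer. The difference is in how much data each one has to touch, which is the property a columnar format is designed to exploit at scale.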

The Three Pillars of Grove's Apache Legacy

2018: Rust Arrow Implementation - The foundational memory layer that enables everything else. Columnar data structures with Rust's safety guarantees.

2019: DataFusion - An embeddable SQL query engine that became the most starred Apache Arrow sub-project with 8.7k GitHub stars.

2021: Ballista - Distributed query execution, because sometimes one machine isn't enough. Think DataFusion, but across a cluster.

From CodeFutures to Apple

Before DataFusion, there was AgilData's distributed streaming SQL database. Before that, there was dbShards. Grove has been building query infrastructure since before it was cool. His career reads like a tour of every impossible database problem: Chief Architect at CodeFutures from 2007 to 2014, designing NewSQL solutions when NewSQL was just a buzzword people were starting to throw around. Then Chief Architect at AgilData, tackling distributed streaming - because apparently stationary data wasn't challenging enough.

In 2017, he co-founded Raven Data Security as CTO, building a data security platform MVP. The company didn't make it, but Grove did what good engineers do: he learned, moved on, and kept building. By 2020, he was at NVIDIA working on the RAPIDS Accelerator for Apache Spark. GPU-accelerated analytics. Because if you're going to make queries fast, why not make them absurdly fast?

The Plot Twist

Here's where it gets interesting. While at NVIDIA, Grove was simultaneously shepherding DataFusion as it grew up inside Apache Arrow. Working full-time on GPU acceleration while nurturing an open-source project that would eventually compete with the very stack he was accelerating. That's not career suicide - that's conviction. When Apple came calling in April 2024 with a Principal Distributed Database Engineer role, he didn't hesitate. The mission? Contributing to Apache DataFusion Comet, an accelerator for Apache Spark built on - wait for it - DataFusion.

"I predict 2025 will bring a significant acceleration in the number of systems built on DataFusion, and my focus this year is to help drive that growth."

Grove doesn't just write code. He writes ecosystems. sqlparser-rs, his open-source SQL parser for Rust, is described as "one of the leading open-source SQL parsers for the Rust ecosystem." It's the kind of foundational work nobody celebrates because it just works. Parse trees. Abstract syntax trees. The unsexy plumbing that makes everything else possible.

The Philosophy

Ask Grove about his work and he'll talk about performance improvements measured in orders of magnitude. Two orders of magnitude, specifically - that's 100x faster, for those keeping score at home. He doesn't do incremental. When he tackled interactive queries, he didn't make them 20% faster. He made them so fast that analysts stopped taking coffee breaks while waiting for results.

His GitHub profile tells a story: 1.5k followers, 43 public repositories, 156 starred repositories. The starred repos reveal the mind of someone who's constantly learning - embedded systems, query optimizers, distributed computing frameworks. His hobby projects combine embedded hardware, electronics, and fabrication. Because when you spend your days optimizing distributed databases at Apple, the natural hobby is apparently building things with your hands.

The Education Paradox

Grove's formal education consists of Coursera certifications in Kotlin (2017), Scala functional programming (2017-2018), and neural networks (2019). No Stanford CS degree. No MIT pedigree. Just three decades of showing up, solving problems, and learning in public. He gave talks at the Denver Rust Meetup. He appeared on "The Infra Pod" podcast to discuss his mission to build Big Data in Rust. He was also a guest on Data Science at Home and the LeanPub Front Matter Podcast.

The book he wrote - "How Query Engines Work" - isn't an academic tome. It's a practical guide that walks through building a SQL query engine in Kotlin, step by step, with full source code. He took everything he learned across banking systems, insurance platforms, and hardware companies, distilled it into a book, and made it freely available. Because what good is knowledge if you hoard it?
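For a flavor of what "building a query engine step by step" can look like, here is a rough sketch of three composable operators. It is not taken from the book, whose examples are in Kotlin; Python is used here only for brevity. The class names (`Scan`, `Filter`, `Projection`) and the sample `employees` table are assumptions made up for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


@dataclass
class Scan:
    """Leaf operator: yields rows from an in-memory table."""
    table: List[Row]

    def execute(self) -> Iterator[Row]:
        yield from self.table


@dataclass
class Filter:
    """Passes through only the rows for which the predicate is true."""
    child: object
    predicate: Callable[[Row], bool]

    def execute(self) -> Iterator[Row]:
        return (row for row in self.child.execute() if self.predicate(row))


@dataclass
class Projection:
    """Keeps only the named columns of each row."""
    child: object
    columns: List[str]

    def execute(self) -> Iterator[Row]:
        return ({c: row[c] for c in self.columns} for row in self.child.execute())


# Hypothetical sample data.
employees = [
    {"name": "Ada", "salary": 120},
    {"name": "Linus", "salary": 90},
    {"name": "Grace", "salary": 150},
]

# Roughly: SELECT name FROM employees WHERE salary > 100
plan = Projection(
    Filter(Scan(employees), lambda r: r["salary"] > 100),
    ["name"],
)
result = list(plan.execute())
```

Each operator pulls rows from its child, so a plan is just a tree of small pieces. A real engine adds a SQL parser, a planner, an optimizer, and columnar batches on top of this basic shape.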

The Numbers Don't Lie

DataFusion has 8.7k stars on GitHub. DataFusion Comet, the newer Apache Spark accelerator, already has 1.2k. These aren't vanity metrics - they represent engineering teams at companies worldwide who looked at their query performance, looked at DataFusion, and decided to rebuild their stack. InfluxDB rewrote their query engine on DataFusion. That's not an endorsement. That's a bet-the-company decision.

As a PMC member (that's Project Management Committee, for the uninitiated) of both Apache Arrow and Apache DataFusion, Grove doesn't just contribute code. He shapes roadmaps. He reviews pull requests. He mentors contributors. In January 2025, he was running Apache DataFusion Community Meetings, the kind of unglamorous work that keeps open-source projects from imploding under their own success.

Grove's Tech Stack

Languages: Rust (primary), Java/Kotlin/Scala, Python, C++

Specializations: Query engines, distributed systems, SQL optimization, Apache Arrow ecosystem

Tools: Apache Spark, Kubernetes, AWS, DataFusion, Ballista, Parquet, HDFS

Superpower: Making databases go fast without sacrificing correctness

What's Next

Grove's 2025 prediction isn't just about hitting 1,000 DataFusion-based systems. It's about what happens after. When a query engine becomes infrastructure - when it's so embedded in the ecosystem that people stop thinking about it - that's when the real work begins. Maintaining backward compatibility. Preventing feature bloat. Resisting the temptation to rewrite everything every three years.

At Carnegie Mellon's Database Group, he gave a talk titled "Accelerating Apache Spark workloads with Apache DataFusion Comet." Not "introducing" or "discussing" - accelerating. Grove doesn't show up to conferences to theorize. He shows up with benchmarks, performance charts, and production-ready code.

Based in Broomfield, Colorado, he's been building distributed systems while most of Silicon Valley chases the next AI wrapper. While everyone else was pivoting to LLMs, he was optimizing join algorithms and improving predicate pushdown. The work isn't flashy. It won't make TechCrunch headlines. But every data scientist running queries against multi-terabyte datasets is standing on infrastructure people like Grove built.

The Legacy in Progress

Here's what matters: in 2018, if you wanted to build a high-performance analytics system in Rust, you had limited options. Today, you have Apache Arrow, DataFusion, and Ballista - all bearing Grove's fingerprints. He didn't just contribute to the ecosystem. He built the ecosystem. Three major donations to Apache. A book. A patent. Decades of experience compressed into open-source projects that thousands of engineers depend on.

The man who shares a name with Intel's legendary CEO Andrew Grove (no relation) is writing his own legend. Not through billion-dollar acquisitions or unicorn startups, but through the patient, deliberate work of building infrastructure that lasts. Query engines that don't need to be rewritten every two years. Distributed systems that actually handle failure gracefully. Documentation that developers can actually understand.

"Becoming a Top-Level Project is a significant milestone, and I am excited to see how the project will continue to innovate and shape the future of data processing."

Andy Grove is 30 years into a career most people burn out from in ten. He's building the future of data infrastructure while maintaining hobby projects in embedded electronics. He's writing books while shipping production code at Apple. He's mentoring contributors while architecting the next generation of query engines. That's not work-life balance. That's someone who found the thing they're supposed to be doing and refuses to stop.

In an industry obsessed with disruption, Grove represents something rarer: consistency. Three decades. Multiple industries. Countless database problems. One unifying thread - making data systems faster, safer, and more accessible. No drama. No Twitter feuds. Just code, community, and a relentless focus on solving hard problems that matter.

The query engine revolution won't be televised. It's happening in GitHub commits, Apache mailing lists, and conference rooms where engineers decide what to build their next system on. And when they choose DataFusion, they're choosing Andy Grove's vision - that Big Data deserves better than garbage-collected languages and memory-unsafe C++. That Rust isn't just for systems programming. That query engines can be embeddable, fast, and correct all at once.

That's the story of the engineer who looked at the state of Big Data in 2018 and decided to rebuild it from scratch. Not because anyone asked him to. Because it needed to be done. And because he knew how.