How Facebook’s open source factory gave rise to Presto
Commentary: When Facebook solves technical problems, it defaults to open source solutions like Presto.
Facebook has been a bit of a punching bag lately, and for good reason. But for all its problems, Facebook continues to be one of the preeminent open source software factories on Earth. From React to Apache Cassandra to PyTorch, Facebook has open sourced some of the world’s most popular software, which, in turn, has given rise to companies built up to commercialize those projects.
One example is Starburst, a company started by Facebook veterans to commercialize Presto, an open source distributed SQL query engine for running interactive analytic queries against data sources of any size. Starburst just raised $42 million to further accelerate Presto development and commercialization. In an interview, Starburst co-founder and CTO Martin Traverso talked through how Facebook’s engineering culture gave life to Presto, and the open source ethos that powers it.
A culture of creation
Let’s rewind to 2012, when Facebook’s infrastructure team was still knee-deep in Apache Hive, a data warehouse project the company had created and open sourced back in 2010. Facebook had a massive 300 petabyte Hive data warehouse, which sounds great, and it was. But it was also incredibly slow. As Traverso related, a Facebook data scientist once quipped, “It’s a good day when I can run six Hive queries.” Hive, for all its merits, was a big productivity loss.
There was talk throughout the Facebook data infrastructure team about replacing Hive, but it was Traverso, along with Dain Sundstrom, David Phillips, and Eric Hwang, who got the nod to go build something better. Phillips, in particular, had used data warehouse engines and had both the incentive and the passion to do something about Hive, Traverso said.
If the foursome had waited, perhaps they could have used Apache Drill (whose first design meeting was in late 2012). But that’s not how Facebook engineering works. There were no obvious alternatives, and they had a need. “We had to do it by ourselves,” Traverso said. And so they did: In 2013, they released Presto.
A culture of open source
None of this explains why they open sourced it. It helped that Sundstrom had been involved in Apache Geronimo, but even that doesn’t adequately cover the rationale for opening it up. As Traverso related, the founders weren’t simply hoping to solve an immediate Facebook need; they wanted to build something that would endure and be broadly applicable:
We like open source. We believe in open source. We believe that the best software is written by passionate developers working in open source communities. We wanted to build something that would be usable for Facebook, but also something that could be used by everyone else in the world. Also, by making it available to other people, we can make it better because we can get other people involved that have other needs and thereby build something that is more broadly applicable than just a single company and single use case.
And so they have. Today there is a diverse and growing body of contributors, sparked early on by considerable involvement from Teradata, as well as Netflix, LinkedIn, and others. Teradata had roughly 20 people working on Presto at one point, with perhaps half of those working on the Presto core. Some of those, including Justin Borgman, who ran Teradata’s Apache Hadoop-related products, eventually left to work on Presto full-time under the auspices of Starburst, which was founded in 2017.
According to Traverso, the Presto team has worked hard to make it easy to contribute to the project. From a technical point of view, Traverso said, they’ve tried to make the code accessible and easy to understand. “It’s fairly uniform so as to make it easy to see what’s going on in the code. There are some projects where you jump in and it’s a big spaghetti plate, and it’s kind of hard to follow all the threads and make sense of it.” Presto, by contrast, is structured around the abstractions in the code, making it easier for someone to evaluate how and where they can make a meaningful contribution.
In addition, the Presto founders understand that users will likely give up if they can’t do something useful with the project within the first five minutes. Presto makes it simple to go from download to running the query engine in minutes.
Finally, there’s the community. The Presto Slack channel is currently 2,200 strong, with as many as 500 active at any given time. “It’s one of the most active open source projects I’ve seen,” noted Traverso. These people are happy to help new users get started with the project, or work with would-be contributors to facilitate their contributions.
Though Presto was originally used to query data in HDFS (Hadoop), Traverso and the other founders needed it to be able to query not only Facebook’s customized HDFS, but also the “off-the-shelf” open source HDFS. So they created an abstraction over the storage layer, then made it pluggable. Because there’s a very clean interface between the engine and the storage layer, it has allowed the Presto community to build connectors for a wide array of data sources, including Cassandra, MongoDB, Elasticsearch, and over 30 more.
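To make the idea of a pluggable storage layer concrete, here is a minimal sketch in Python. It is not Presto’s actual connector API (Presto itself is written in Java); the names Connector, InMemoryConnector, CsvTextConnector, and Engine are hypothetical and chosen only for illustration. The point it shows is that the engine talks only to a small interface, so a new data source can be added without touching the engine.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Row = Tuple  # a row is simply a tuple of column values


class Connector(ABC):
    """Hypothetical storage abstraction: the only thing the engine sees."""

    @abstractmethod
    def columns(self, table: str) -> List[str]:
        """Return the column names of a table."""

    @abstractmethod
    def scan(self, table: str) -> Iterator[Row]:
        """Yield every row of a table."""


class InMemoryConnector(Connector):
    """Serves tables held in a dict of (column names, rows)."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[Row]]]):
        self._tables = tables

    def columns(self, table: str) -> List[str]:
        return list(self._tables[table][0])

    def scan(self, table: str) -> Iterator[Row]:
        yield from self._tables[table][1]


class CsvTextConnector(Connector):
    """Serves tables parsed from CSV text; all values come back as strings."""

    def __init__(self, sources: Dict[str, str]):
        self._sources = sources

    def _reader(self, table: str):
        return csv.reader(io.StringIO(self._sources[table]))

    def columns(self, table: str) -> List[str]:
        return next(self._reader(table))

    def scan(self, table: str) -> Iterator[Row]:
        reader = self._reader(table)
        next(reader)  # skip the header row
        for record in reader:
            yield tuple(record)


class Engine:
    """Toy query engine: knows about catalogs and connectors, not storage."""

    def __init__(self) -> None:
        self._catalogs: Dict[str, Connector] = {}

    def register(self, catalog: str, connector: Connector) -> None:
        self._catalogs[catalog] = connector

    def select(self, catalog: str, table: str, column: str, equals) -> List[Row]:
        """Return the rows of catalog.table where column == equals."""
        connector = self._catalogs[catalog]
        index = connector.columns(table).index(column)
        return [row for row in connector.scan(table) if row[index] == equals]


if __name__ == "__main__":
    engine = Engine()
    engine.register(
        "memory",
        InMemoryConnector({"users": (["id", "name"], [(1, "ada"), (2, "bob")])}),
    )
    engine.register("csv", CsvTextConnector({"users": "id,name\n1,ada\n2,bob\n"}))
    print(engine.select("memory", "users", "name", "ada"))
    print(engine.select("csv", "users", "name", "bob"))
```

Adding a third data source in this sketch means writing one more Connector subclass and registering it; the Engine class does not change, which is the property the clean engine-to-storage interface is meant to provide.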
“The more people get involved, the better the software gets,” said Traverso.
It’s worth remembering that Facebook has made it the default for engineers like Traverso to build and open source software precisely to gather communities around these projects. They may be born at Facebook, but because of Facebook’s embrace of open source, they don’t die there.
Disclosure: I work for AWS, but the views expressed here are mine and don’t represent those of my employer.