Hello, everyone. Welcome to Open Source Summit 2022. I am Rohan, and with me is Philip, and we are here to give you an introduction to Presto and also share the product roadmap for the Presto project. To introduce myself, my name is Rohan Pednekar. I am a product manager here at Ahana, I am also the chair of the Conformance program within the Presto Foundation, and I work closely with the Presto community and its users. So let's get started.

Presto is a very popular open source project, but today I want to share why it matters and why it is the SQL engine for data platform teams, for both interactive and batch workloads. Let's quickly revisit what Presto is. Presto is a disaggregated, ANSI-SQL-compliant query engine originally designed to replace Apache Hive and achieve scalable interactive queries across large data sets, as well as across different data sources. Presto was created at Meta, and it is now one of the most popular open source projects hosted under the Linux Foundation. Today, at internet giants like Meta, Twitter, ByteDance, and Uber, Presto runs at massive scale and can analyze petabytes of data.

Here is a quick overview of the Presto architecture. Presto is a distributed system of coordinators and workers. The coordinator is responsible for query planning, query scheduling, et cetera, and the workers perform the actual work. The workers do the legwork, all in memory, across different data sets, different partitions, and different splits. You can connect to a Presto cluster from any reporting tool with the help of the standard JDBC or ODBC protocols, and you can use notebooks like Jupyter or Apache Zeppelin to access and analyze the data.
At the bottom, you have data lakes like S3, as well as a range of other data sources: NoSQL databases, relational databases, streaming data sources like Kafka, caching systems like Redis, and real-time data sources like Apache Pinot, et cetera.

These are some of the common questions we get asked, so let me talk a little bit more about them. Presto is essentially just a query engine. A more traditional database comes with compute and storage together, whereas Presto is really just the compute, running on top of storage that might be a database or a data lake. So it is just the query-engine part of a database. It also comes out of the Hadoop ecosystem: in the way it handles configuration management, and in that it can depend on the Hive metastore, it is related to Hadoop. However, it is its own system, so you no longer need Apache ZooKeeper or any other Hadoop components. If you are querying data lakes, you just need the Hive metastore or another kind of catalog like AWS Glue, but again, no other Hadoop components. And because it is just a query engine, it is also different from a data warehouse. The biggest advantage of Presto is that you can query data in place: there is no need to move your data, ingest it into yet another system, or create another copy of it.

So what makes Presto different from some of the other engines out there, like Spark? Spark was built to be a general-purpose computational engine, whereas Presto is meant to be a SQL engine. It is highly scalable, it has a pluggable architecture with a range of connectors you can use, and it is built for performance: it is an in-memory engine with a pipelined architecture. Let's take a look at the scalable architecture. As I mentioned earlier, we have two roles here: a coordinator and workers.
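To make "query data in place" concrete, here is an illustrative query sketch. The catalog, schema, and table names are hypothetical; the point is that the data stays wherever it already lives:

```sql
-- Hypothetical example: a Hive catalog named "hive" backed by S3,
-- with an "orders" table already registered in the metastore.
-- No ingestion step is needed; Presto reads the files in place.
SELECT orderstatus, count(*) AS num_orders
FROM hive.sales.orders
WHERE orderdate >= DATE '2022-01-01'
GROUP BY orderstatus;
```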
The workers are connected to the data sources and operate on top of the data, and they can push predicates, WHERE clauses, all the way down to the data sources, depending on the connector. So in the case of a data lake, the Hive connector is used, and predicates can be pushed down so that you read as little data as possible and pull it into the query engine to process in memory. And again, Presto is highly scalable: it can support a very large number of workers, and now, with the advancement of multiple coordinators, it gets even more scalable.

Now, let's take a deeper look at the architecture. If you look at the coordinator itself, it includes some of the typical components you would expect of a database. You have the parser, the analyzer, and the optimizer. The optimizer decides which plan is the most efficient and should be executed, depending on statistics and other information. The scheduler then schedules the query across the different workers. The data location API tells the scheduler where the data is actually located, and the metadata API gives the coordinator information about the tables, the columns, and the schemas of the data underneath. This might come via the Hive metastore or, in relational databases, any other traditional catalog built into the system itself. And at the bottom, the connectors connect to a whole range of different data sources.

So now let's talk a little bit about connectors. Presto comes with about 30 different connectors; here are some of the most popular ones that we often see in the community, and all of these plug into Presto. Presto allows a single query to join across different data sources, so you can correlate data across them. You just have to have the schema names appropriately qualified in the query itself.
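As a sketch of what such a cross-source join can look like, here is a hypothetical query that correlates a data lake table with an operational database. The catalog names "hive" and "mysql" and the table names are assumptions for illustration:

```sql
-- Federated query sketch: join lake data (hive catalog) with an
-- operational MySQL database (mysql catalog) in a single statement.
-- Each table is addressed as catalog.schema.table.
SELECT c.name, sum(o.totalprice) AS lifetime_spend
FROM hive.sales.orders o
JOIN mysql.crm.customers c
  ON o.custkey = c.custkey
GROUP BY c.name;
```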
For each connector you use, you have to have a catalog explaining to the system how to interact with it, and a schema for how the data is organized. So there are a few different concepts here. You have the connector, which allows Presto to connect to a particular kind of data source. Then the catalog, which contains the schemas from a data source: for every data source that you connect to Presto, you have to have a catalog properties file defined. This file contains information and configuration about the data source, like the endpoint URL, credentials, and a range of other configuration settings. And then the schema itself: tables are organized into schemas, tables have columns and rows, and Presto analyzes this data using a columnar approach.

Here you see a typical Hive connector setup if you are working with HDFS or S3. The coordinator talks to the Hive metastore, as an example; it could be AWS Glue as well. The splits are then assigned to the different workers by the scheduler, and the workers go and fetch that data from the data lake underneath.

Along with the data sources and connectors, Presto also supports a wide range of data types and data formats. What we often recommend is Apache ORC or Parquet, because these are the most efficient, most performant, and most compact formats, but you can also query data in JSON, CSV, and a range of other formats. The beauty of Presto is that you do not have to ingest data into yet another system; it can be queried directly where it lives in the data source.

So why is Presto fast? We often get this question. When you compare Presto with Hive or some of the other traditional query engines, we see anywhere from 10 to 15 times faster performance with Presto. The primary reason is that Presto is built as an in-memory engine.
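To give you an idea of what a catalog properties file looks like, here is a minimal sketch for a Hive catalog. The file name, metastore host, and credential values are placeholders, not real settings:

```properties
# etc/catalog/hive.properties -- one file per catalog.
# All values below are hypothetical placeholders.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083
hive.s3.aws-access-key=EXAMPLE_ACCESS_KEY
hive.s3.aws-secret-key=EXAMPLE_SECRET_KEY
```

Once this file is in place, the catalog shows up by its file name, so tables are addressed as hive.schema.table in queries.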
While you can spill to disk if the data you are processing is larger than the available memory, most often, in most deployments, we see users size their cluster to the amount of memory needed to perform their queries. So it is a very fast engine that pulls only the data it needs. It is also stream-based: it streams data into memory as it reads it, processes it, and moves on to the next stage. And it uses columnar execution, which makes it very well suited for analytical queries. Finally, last year Meta introduced support for multi-level caching with RaptorX. As a user, you can configure caching at different levels: at the metadata and metastore level, at the file header and footer level, or at the data block level. That improves performance to a great extent.

Now, let's take a look at what the life of a query looks like. This is a simple query: SELECT * FROM the orders table with a simple predicate. If you look at the EXPLAIN plan, you will see the different operators that are part of the plan. You have the scan operator, which reads the data from the underlying orders table, then the filter is applied, and the output is returned to the client.

This next one is a slightly more involved query plan. You have a couple of projections, an aggregate (the sum), a left join of the line item table with the orders table on the order key, a WHERE clause, a GROUP BY, and so on. In the plan, you have the scan operator on the line item table, which is filtered; you see the scan on the orders table; you see the left join; then the result is aggregated, the projections are pulled out, and the results are returned to the client.
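If you want to see a plan like this yourself, you can prefix any query with EXPLAIN. A sketch, again using hypothetical catalog, schema, and table names:

```sql
-- Asking Presto for the logical plan instead of running the query.
-- The output lists the operator tree: a table scan on orders,
-- a filter for the predicate, and an output node for the client.
EXPLAIN
SELECT orderkey, totalprice
FROM hive.sales.orders
WHERE orderpriority = 'URGENT';
```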
So far, I have covered what Presto is, the architecture of Presto, and why it is so fast. Now let's talk about how users are using it. These are some of the typical use cases of Presto, and reporting and dashboarding is one of the most popular. This is what the stack might look like, very similar to what we discussed earlier: you have a BI tool connected to Presto through a JDBC driver, and Presto sits on top of your data lake, such as S3, or data sources like MySQL, acting as the SQL engine.

Alongside reporting and dashboarding, another popular use case is data science. You have Jupyter notebooks, SQL-based notebooks where you might use Python or a Python SDK on top of Presto to connect to your data lake or the relational data sources underneath it.

And in the last couple of years, Presto has also been used for batch workloads. Examples of data engineering workloads now being run on Presto are ETL workloads, and one of the technologies underneath is Presto on Spark: Presto can now run as a layer on top of your Spark jobs and allow you to run more robust and reliable batch workloads. Philip will talk a little more about this later in this session, and there is also a whole session on Presto on Spark from an Intuit engineer; please check that out as well.

Finally, I would like to talk about the open data lakehouse with Presto and how you can use Presto to create your own open data lakehouse. As cloud data warehouses become more cost prohibitive, and the data mesh or data federation approach is not performant enough, more and more workloads are migrating to the data lake.
If all your data is going to end up in cloud-native storage like AWS S3, ADLS, or Google Cloud Storage, then the most optimized and efficient data strategy is to leverage an open data lakehouse architecture, which provides more flexibility with no vendor lock-in. It is not anything radically different from the data lake; what it means is that some of the capabilities of a warehouse are now being moved to the data lake. One of those is transactionality, and this is achieved using technologies and projects like Apache Hudi, Delta Lake, and others. Presto is the SQL engine that then sits on top of the cloud data lake and that you can query with. This is the architecture for the data lakehouse that is emerging, and we are starting to see more and more of it.

Finally, a little bit more about Ahana. I work for Ahana as a product manager. We are a Presto company and a premier member of the Presto Foundation, with strong database and Presto experts in-house. The product at its core is a managed service for Presto. It comes with a SaaS console and a managed service where we bring compute and Presto to your data in your own AWS account. A single pane of glass lets you manage multiple Presto clusters, fully integrated and cloud native, with security features like Apache Ranger, et cetera. And with that, I am handing it over to you, Philip. Thank you.

Thank you, Rohan, for that summary of what Presto is and where it can be used. Hi, I'm Philip Bell, a developer advocate at Meta focused on big data open source projects. I'm also a Presto Foundation board member, and I'll be talking to you about the future of Presto. I'd like to review Presto's roadmap, specifically recent accomplishments and the direction we're taking Presto through its next evolution. The current roadmap can be summed up as Presto for all. Presto is a top-tier interactive query engine.
It can unify various data sources in your existing architecture and provide a consistent interface for gaining insights from your data. The Presto Foundation's vision is that Presto will become an effortless experience for many large-scale and complex data-related use cases. We want Presto to be the default choice for reporting and dashboards, and to effortlessly handle whatever quantity and shape of queries dashboard tools might send it. We know use cases can range from daily to quarterly metrics checks, as well as queries you may want your BI tool to perform. We want Presto to be an even better ad hoc interactive query engine, performance-wise. We're aiming for Presto to be able to efficiently serve data without having to transfer it from your warehouse. This means handling extremely high queries per second, such as externally served requests, with minimal data migration to other storage solutions. Essentially, we aim for adding Presto to your ecosystem to be the only major change needed in your architecture.

Recently, we've been finalizing the ability to handle large batch workloads with ease. This has been challenging because Presto is an in-memory query engine. Most importantly, we continue to consolidate on the same language to accomplish all of this, with similar ETL and a unified backend. The reason for this is that the more tools you need to support these use cases, the more complicated and expensive your projects become. Ideally, Presto becomes a single addition that enables a wide variety of use cases within your organization.

So the big question is: what are we doing to help Presto scale to fit these different workloads and support them effectively? First and foremost is adding more connectors. Connectors are a seamless way to get Presto to plug into other data sources and open up new use cases. A quick path to accessing data in an unsupported data source is creating a connector that uses a manifest file through Hive.
Over time, we aim to migrate all connectors to be native so that they have better performance and reliability. Most recently, we completed a native, read-only connector for Delta Lake and a connector for Google Sheets. Coming soon will be write support for the native Delta Lake connector, as well as a native Hudi connector.

Next, we have to solve for workloads that don't fit within Presto's memory limits. We've put a lot of work into adding more operators that participate in spilling, into configuring where the spilling output can go, and into allowing customizable spilling strategies.

Next, we have workloads that span time rather than just space, surpassing minutes into hours. Presto originally had no fault tolerance, and this had to be overcome to support these use cases. The ultimate goal is to avoid having to rewrite the query in a different engine, which may have different semantics or different function support. That's why Presto has migrated its engine to a library that can run on Spark. By leveraging Spark's resource management, we can enjoy the benefits of disaggregated shuffle and recoverability from Spark without having to rewrite the query.

Finally, at the opposite end of the spectrum, we have Presto being used for online analytics, something that may be customer-facing. We want to use that same query engine here too. We've put lots of improvements into Presto's caching architecture. With the completion of the RaptorX project, Presto now has intelligent tiered caching, resulting in performance similar to having co-located compute and storage. This means you can operate using your existing cloud storage provider without any additional ETL. Adding materialized view support using standard SQL eliminates the need to repeatedly calculate metric rollups or create pipelines for derived tables.
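As a sketch of what such a rollup can look like, here is a hypothetical materialized view for a daily revenue metric. The names are made up, and the exact DDL and refresh behavior depend on your connector and Presto version:

```sql
-- Hypothetical rollup: precompute daily revenue once instead of
-- re-aggregating the orders table for every dashboard refresh.
CREATE MATERIALIZED VIEW hive.sales.daily_revenue AS
SELECT orderdate, sum(totalprice) AS revenue
FROM hive.sales.orders
GROUP BY orderdate;
```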
As much as we are working on growing the breadth of use cases Presto can help with, we are also investing a lot in making Presto even more efficient, reliable, and scalable than it is today. We've completed the implementation of disaggregated coordinators, eliminating a single point of failure in query execution as well as increasing scalability. We've improved scheduling using bin packing, reducing hotspots and failures. Just as important as reliability is efficiency when working at Facebook scale within Meta. One path to efficiency is native execution: we are finalizing the migration of workers from Java to C++, which grants us better control over the memory and instructions used. Early performance tests are showing orders-of-magnitude increases in throughput for certain use cases. Our Aria efficiency work adds expression pushdown all the way into the scan, and we're growing that to include Parquet files.

Our third focus, extensibility, ties into empowering a wider set of users to take advantage of Presto. One way of accomplishing this is SQL function support, which allows sharing complex mathematical expressions or other complex functions you can encapsulate. Lastly, we've added the ability to query external UDF servers for function execution, enabling users to use languages that are hard to port to Java or standard SQL. With this feature set, users can point functions to external servers, either for language specificity or to utilize other compute resources for particularly heavy functions. We are finalizing Apache Ranger support to enable more fine-grained access control for better security as well.

If you have any questions, don't hesitate to reach out on Slack or Twitter. I hope to hear from you soon and look forward to seeing what problems you will solve using Presto.