I'd like to thank you all for joining today's webinar, an introduction to Presto. I'm Ramapendran, Technical Product Marketing Manager at Ahana. Over the next 60 minutes, we'll explore what Presto is, look at the Presto architecture, and learn how to benchmark Presto.

So, the technology behind Presto: what is Presto? Presto is an open source, distributed, MPP SQL query engine. The key features of Presto are, first, query in place, which means you don't need to move data. In traditional data warehouse applications you have to move data to do analytics, whereas with Presto you don't; you query the data where it lives. The next is federated querying: your data may sit in different systems and formats, and you can query each of them and then join the results together. And Presto is ANSI SQL compliant, which means you don't need to learn a new language; standard SQL is good enough.

Presto was developed by Facebook. It was designed from the ground up for fast analytic queries against any data size, and they have proven it against petabytes of data. One idea behind the architecture: since they wanted query in place and federated querying, they built a pluggable architecture with connectors that can talk to many data sources and formats and query the data from them. This makes Presto literally "SQL on anything." And Presto is open source: Facebook open sourced it, it's hosted on GitHub, and you can find it at github.com/prestodb.

Let's take an overview of Presto. A Presto deployment, which we call a Presto cluster, consists of a coordinator and a set of workers. The Presto cluster can be reached by any application: BI tools, Jupyter or ML notebooks, anything that uses JDBC or ODBC can easily connect to the Presto cluster. Once a query reaches the cluster, Presto can query data in different formats using its connectors.

Presto is one of the fastest growing open source projects in data analytics. There are two drivers for this: business needs and technology trends. On the business side, it's data-driven decision making; businesses have to go through a lot of data to make such decisions. On the technology side, with all this cloud transformation comes the disaggregation of storage and compute, and since storage is so cheap, it's giving rise to data lakes. I have listed a few of the companies currently using Presto, and since it's open source, there are many more using it today.

A few common questions. Is Presto a database? No, Presto is not a database; it's a query engine. It doesn't have a storage layer of its own, it just queries datasets. Is it related to Hadoop? No, Presto was developed as a replacement for Hive, and it's not tied to Hadoop. How is it different from a data warehouse? Traditional data warehouses need to move data around to do analytics.
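To make the federated querying idea concrete, here is a minimal sketch of what a cross-source join looks like in Presto. The catalog, schema, and table names (a `mysql` catalog and a `hive` catalog over S3 data) are hypothetical, purely for illustration:

```sql
-- Join a dimension table in MySQL with order data on S3 (via the Hive connector).
-- Catalog/schema/table names here are made up for illustration.
SELECT c.customer_name,
       sum(o.total_price) AS lifetime_value
FROM mysql.crm.customers AS c        -- MySQL connector
JOIN hive.datalake.orders AS o       -- Hive connector over S3
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY lifetime_value DESC
LIMIT 10;
```

Both sides are addressed with the same catalog.schema.table naming, and standard ANSI SQL does the join; no data has to be copied into a warehouse first.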
Presto also does analytics, but without moving the data around; that's the difference between a data warehouse and Presto.

So, Presto use cases. Interactive querying, reporting, and dashboarding are the typical use cases where Presto can easily be used. The next one is very important: data lakes and data lakehouses are coming up, and analytics on top of them has become more prevalent and more important. Presto is designed for this, because a data lake basically contains data in different formats, and since Presto can query different data formats, it becomes a natural fit for data lakehouse analytics. Federated querying across data sources is also part of Presto's future, so it's well aligned there too. Transformation using SQL means you have a dataset, say a set of CSV files, and you want to move it to a target format; changing the data format like that is easy to do with Presto. These are the use cases for Presto.

Now, the Presto architecture: what makes Presto different? If you look at the architecture, it's a scalable architecture, it has pluggable connectors, and it has good performance. Let's look at the scalable architecture first. Presto has two roles: one is the coordinator and the other is the worker. The coordinator talks to the applications, the tools or notebooks connecting over JDBC or ODBC; it takes the queries and sends the results back. The worker is the one that goes and connects to the data sources: using the connectors, it talks to the different data sources and brings back the data. So as your data grows, you can add workers and scale. It has been proven at web-scale companies that you can start with two workers and grow to a thousand workers; that has been validated. So it has a scalable architecture.

If we look a little closer into the architecture: all these BI tools and notebook clients connect to the Presto coordinator and send it SQL queries. Once the coordinator receives a SQL query, it parses and analyzes the query, creates a plan, and that plan gets scheduled across the workers. The workers use the Presto connectors to reach out to storage, which could be a data lake or lakehouse, an object store, MySQL, Elasticsearch, Kafka; if there is a connector for it, Presto can talk to that data source and get the data. Since each worker brings back part of the data, the workers sometimes have to move data among themselves; they communicate with each other, create the resulting dataset, and send it to the coordinator, which sends it back to the client.

One more thing: since it's a query engine, it has a nice optimizer; it knows how to do efficient query transformations and how to run queries efficiently. Just to give an example: say you have a SELECT query against a table with a WHERE condition that filters some data. When the query reaches a worker and the worker talks to the data source, it's not going to fetch all the data and have Presto throw most of it away afterwards. Instead, it can do the filtering at the connector level, which means it reads only the relevant data in the first place, not all of it.
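If you want to see where that filter ends up in the plan the coordinator builds, you can ask Presto for the distributed plan with EXPLAIN. The table and column names below are hypothetical:

```sql
-- Show the distributed query plan. With a connector that supports predicate
-- pushdown, the WHERE condition lands in (or next to) the table scan,
-- rather than as a separate filter over all fetched rows.
EXPLAIN (TYPE DISTRIBUTED)
SELECT orderkey, totalprice
FROM hive.sales.orders
WHERE orderdate >= DATE '2021-01-01';
```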
Because the filtering happens at the connector level, the whole pipeline becomes more efficient. That is just one example of how the Presto query engine is effective; many operators and many efficient algorithms have been implemented so that it does the job faster.

As I said, Presto communicates with the data sources using connectors, and Presto has many different connectors. It can talk to MongoDB, Postgres, and others, whether the data is structured, unstructured, or semi-structured. There are a lot of open source connectors as well as proprietary connectors for Presto.

Let's take a look at the Presto connector data model. The connector is a driver for the data source, and the connector needs a catalog with schemas and tables, because Presto works at the SQL level, so the data has to be mapped into that model. Let's take the Hive connector as an example. The Hive connector can talk to a dataset in S3 storage. What happens in this scenario? If you have data on S3, you use the Hive connector, and in Hive you create a table which maps onto the data in S3. Once that mapping exists, the coordinator, using the Hive Metastore, can split the work across the workers and get the data from S3 storage. This is just an example of how the Hive connector works for object stores and file systems; I'll show a sketch of such a table mapping at the end of this section.

Since most data today resides on S3, the Presto Hive connector supports different file types. The commonly used ones are RCFile, Parquet, Avro, ORC, CSV, JSON, text files, and SequenceFile; all commonly used open formats are supported. And to support all of this, the Presto Hive connector, and Presto in general, does not need to do any data ingestion, duplication, or movement of data. It queries the data in place.

So let's see why Presto is fast. Presto does all its processing in memory. And it's a pull model, as I explained previously: it doesn't pull all the data and then discard most of it; when it pulls the data at the connector level, it selects only the data it wants, so the pull model is done efficiently. And it uses columnar storage and execution.

So we saw how Presto maps onto open data lake analytics. Currently there are data warehouse and cloud data warehouse solutions that are closed format, where everything goes through closed-format processing. With open data lake analytics, what's happening is that a lot of data is moving into cloud data lakes, businesses are building data lakes with open data formats, and that calls for SQL processing that is itself open source. Presto is open source SQL processing, so it becomes ideal to run Presto on top of a data lake when you have different, open data formats. This makes the case for open data lake analytics.
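As a concrete sketch of that Hive connector mapping, here is roughly what registering existing S3 data as a table can look like. The bucket, path, schema, and column names are hypothetical:

```sql
-- Map Parquet files already sitting in S3 onto a queryable table.
-- Bucket, schema, and columns are made up for illustration.
CREATE TABLE hive.datalake.page_views (
    user_id   bigint,
    url       varchar,
    view_time timestamp
)
WITH (
    format = 'PARQUET',
    external_location = 's3://my-example-bucket/page_views/'
);
```

No data moves here; the table is just metadata in the Hive Metastore pointing at the files.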
So Presto becomes the best fit for open data lake analytics. We've now seen what Presto is, what its architecture looks like, and how efficient it can be. Before you take it into production, it's always good to benchmark Presto; it helps you understand whether it's useful for you. Facebook has used it, there are a lot of companies using Presto, it performs well, but whether it matches your use case is what's important, and benchmarking Presto will help you make those decisions. So how to benchmark Presto is what we are going to see now.

Benchmarking is a critical component. It helps you identify the system and resource requirements: how many workers you need, what kind of infrastructure sizing, all of that can be determined by running a benchmark. You also get a better understanding of resource usage during various operations, since each query has different resource requirements, and collecting performance metrics helps you understand what each operation is doing.

The industry standard for analytics workloads is TPC-H. Since TPC-H is an industry standard, the spec is a 137-page document, but to summarize: it has 8 tables and 22 complex, long-running queries. TPC-H describes dataset sizes in terms of a scale factor: 1 GB of data is scale factor 1, 100 GB is scale factor 100, and 1 TB is scale factor 1000. The scale factor is simply the dataset size being used for testing. So we have TPC-H to do the benchmarking.

How do we run it against Presto? PrestoDB has a tool called Benchto for benchmarking. On the right-hand side of this slide you can see there are two components to Benchto: one is the benchmark driver and the other is the benchmark service. The benchmark driver executes queries on the Presto cluster and sends the results to the benchmark service. The benchmark service collects the results and stores them in a Postgres database, and the benchmark UI is a web-based UI which shows the benchmark results and history. There is also a cluster monitoring component to this setup: it collects data from the Presto cluster, such as CPU, memory, and network usage. A link ties the benchmark UI and the monitoring UI together, which gives you the whole picture: how much time the queries took to complete as well as what the resource usage was.

So let's take a step back and see how to set it up. Since Benchto is an open source tool, you have to build it. You need Java 8 and a JDK. You download Benchto from GitHub, then build the Benchto project using Maven. Once it's built, you step into the Benchto service docker directory, and in that directory you run the docker-compose command.
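Roughly, those build steps look like the following. The repository URL is the public one, but the exact docker directory and Maven goals may differ between Benchto versions, so treat this as a sketch and check the repo's README:

```bash
# Build Benchto and bring up its service stack (sketch; verify paths in the repo).
git clone https://github.com/prestodb/benchto.git
cd benchto
mvn clean install -DskipTests        # requires Java 8 / JDK 8

# Start the pre-configured service stack (Graphite, Grafana, Postgres, Benchto service).
cd benchto-service/docker            # directory name per the slide; may vary by version
docker-compose up -d
```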
What's going to happen is that docker-compose starts Docker instances for Graphite, Grafana, Postgres, and Benchto. Everything from the previous slide, the Benchto service, the Postgres database, Graphite and Grafana, gets a Docker instance. There is no configuration needed; everything is pre-configured, so you can use it directly.

Next, if you remember the earlier slide, there's the Benchto driver. For that, you have to download the PrestoDB source and build the component called presto-benchto-benchmarks; this is the Benchto driver for Presto.

Once you've built this, the next step is data generation. How do you do data generation? Presto itself has TPC-H built in: the data generation is built into Presto, so it's easy. You say CREATE SCHEMA, give it a schema name and an S3 location, an S3 bucket name, and it's going to create the dataset at that location. Then, for example, you run CREATE TABLE lineitem, set the format, say Parquet (you can use ORC or other file types, whatever you choose for your application), AS SELECT * FROM tpch.sf100. Once you say sf100, it's going to create a 100 GB dataset; if you say tpch.sf1000, it's going to create a terabyte of data. Run statements like this for all eight tables and the TPC-H dataset is ready to be used. Here is roughly what those statements look like.
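A minimal sketch of the data generation for one table, assuming a Hive catalog named `hive` and a hypothetical S3 bucket; repeat the CTAS for each of the eight TPC-H tables:

```sql
-- Create a schema whose tables will live in S3 (bucket name is hypothetical).
CREATE SCHEMA hive.tpch_sf100
WITH (location = 's3://my-benchmark-bucket/tpch-sf100/');

-- Materialize TPC-H lineitem at scale factor 100 as Parquet.
-- Swap 'PARQUET' for 'ORC' etc., or tpch.sf100 for tpch.sf1000 (1 TB).
CREATE TABLE hive.tpch_sf100.lineitem
WITH (format = 'PARQUET')
AS SELECT * FROM tpch.sf100.lineitem;
```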
Now that the TPC-H dataset is ready, you have to do some configuration for the Benchto tool. To configure the driver there is an application.yaml file, where you set the URL for your Hive location, wherever Hive is running. Normally Hive will be used as the catalog. Hive can run on the same container as the coordinator, or on a separate container; that is up to you. So you define the URL for Hive as well as the URL for the Presto cluster. You also set the environment name; in this example I've set it to PRESTO-DEV, but you can set anything here. And you say whether you want to collect the metrics; this is what gathers all the data and shows the results, so set metrics collection enabled to true.

Once that is set, there is one more configuration file to be modified, called tpch.yaml. This file is buried inside the PrestoDB source, under presto-benchto-benchmarks, in src/main/resources/benchmark/presto; you have to go into that directory specifically and edit this file. It has some keywords, and I'll explain them. You set the schema you're using for Hive (you can use Hive or Glue, but Hive is straightforward) and specify whatever schema you are using; you set the data source, which here is presto; and you point to the location where the TPC-H queries reside. Under the presto directory there is a tpch directory where all the TPC-H queries are. You can also set the runs: how many times you want to run the benchmark. You might say 10 times or 5 times, because if you run just once you're only warming up the cache and there may not be much activity; if you run multiple times, you might see some differences. So set runs to more than 5 or 10, whatever you choose. Then there are prewarm runs, which you can set to 2, and a before-execution setting where you can say sleep for 4 seconds, so there is some time between queries and no race conditions. Under the same presto directory there is also an SQL file where you can set some Presto session parameters; since Presto is a query engine, the optimizer has a few tunables you can use. All the TPC-H queries are complicated, with multiple joins, and you can select, for example, whether you want an automatic join distribution or a partitioned join; there are some options there. And you can set the frequency: how many times you want to run the entire suite, say 10 times or twice per week, depending on your application, and see how it works.

So here is the list of keywords that can be used to define the benchmark in the above file. We saw the data source, query names, runs, prewarm runs, and then there's concurrency. By default the Benchto tool runs a sequential benchmark, which means it runs one query at a time. If you want to see concurrent performance, you can set concurrency greater than one and the queries will run concurrently; I think it can run up to 3 queries concurrently, and there are other means to go beyond that, for example starting 3 Benchto drivers if you want more than 3 concurrently running queries. There are also before- and after-execution macros you can run, in case you want to clear caches before a run, run some commands on the Presto cluster, do some cleanup, or add something more around the benchmark. As I said, the frequency can be once per day or seven times per week; you set it. And finally you set the run environment: since you are using Grafana and the Benchto UI, you need to create a link between them, so you set the environment and the content path; this has to be done to satisfy the Benchto tool. Once you set all this, you will be ready to run the benchmark test; a sketch of such a descriptor follows.
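Here is an illustrative benchmark descriptor using the keywords we just walked through. This is a sketch, not the exact schema shipped with presto-benchto-benchmarks; the key names and values below are assumptions, so compare against the real tpch.yaml in the repo:

```yaml
# Sketch of a Benchto benchmark descriptor (key names are illustrative).
datasource: presto
schema: hive
query-names: presto/tpch/${query}.sql   # where the 22 TPC-H queries live
runs: 10                                # measured runs
prewarm-runs: 2                         # warm the cache first
before-execution: sleep-4s              # pause between queries
frequency: 7                            # runs of the whole suite per week
concurrency: 1                          # sequential by default
```

Once the descriptor is in place, launching the driver looks roughly like this; the jar name and flag names follow what's on the slide and are assumptions, so check the driver's help output for the exact options:

```bash
# Sketch of a driver invocation (paths and flag names are illustrative).
java -jar presto-benchto-benchmarks-executable.jar \
  --sql presto-benchto-benchmarks/src/main/resources/sql \
  --benchmarks presto-benchto-benchmarks/src/main/resources/benchmark \
  --activeBenchmarks presto/tpch \
  --overrides overrides.yaml
```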
So the Benchto benchmark driver is a Java application: you use the java command to run the driver jar, passing the location where the SQL files reside. Since there is a time budget, it can make sense to modify some of the TPC-H queries the way you want; as long as you don't break the specification in that 137-page document, you can sometimes adjust the queries. For that you can use an alternate directory: you say which directory the SQL files, the queries, are going to be in. The benchmarks argument says where the tpch.yaml file is located; this is the active benchmark, and you can have multiple tpch.yaml files so that you can run different variations. And even when you run with a given tpch.yaml, if you want to override something, you can use an overrides.yaml and override it.

Once you run the benchmark, it collects the results, and you use the benchmark user interface: open a browser, go to localhost, and you can see the test results there. If you notice here, if you created a tpch.yaml, the benchmark name is going to show up as presto tpch, along with the status, what schema was used, what join distribution was used, any specific options you used, and how much time the test took to complete. If you click inside a query, it gives more performance metrics, such as the duration (since you run multiple times, you get the mean, standard deviation, min, and max), as well as the output data size, how much data was processed, how much time was spent blocked, what the CPU time was; all these metrics are listed when you drill into the test. So for each test you have all this information.

Additionally, here is an example where I am looking at query three. I ran two different tests, one against a partitioned TPC-H table with the Ahana cache, and one with no cache, using two different tpch.yaml files. A good thing about this user interface is the compare option: you can export to CSV, or click compare, and it gives you a nice resulting graph of how the queries performed. You can see that with the cache it performed better than without the cache, and it gives a similar graph for all the metrics listed here, so you get more detail when you compare the results across all the metrics.

I spoke about the performance difference between running with Ahana's cache and without, so let me briefly introduce Ahana Cloud.
Ahana Cloud is a fully managed Presto cloud service. What we do is enable data platform engineers to run Presto in minutes rather than days. Since Presto is a query engine with an optimizer, it has a lot of tunables; understanding those tunables and then designing the infrastructure, how to set the memory tuning and all those options, takes time to learn and do. So what we have done is build a fully integrated and pre-configured solution; I'd say it's zero to Presto in 30 minutes, it just doesn't take much time. And there is no ETL, so there is no waiting for data to move; that time is eliminated because you do in-place analytics.

How does Ahana Cloud do this? Ahana Cloud has two components. One is the Ahana console, which has the cluster orchestration, consolidated logging, security and access, and billing support; Ahana is a pay-as-you-go model, so you pay for what you use. The other is the compute plane, which runs in the customer's account as an in-VPC cluster. We create pre-configured clusters and run them on the compute plane in the customer's account, so we have no access to the data: although we manage the clusters, we just provision and configure them efficiently, the data stays where it resides, and so security is not a concern; it's taken care of.

That's what Ahana can do; now I'll do a short demo of Ahana. I have a cluster here already created for the purpose of demonstration, and I'll walk you through the cluster creation screen to see what Ahana does. Here you can give a name, and there is an option for concurrency mode: when you run test clusters you can run in low concurrency mode, and for the production case you can choose high concurrency mode. Then you select the coordinator and the workers; you need a beefier instance for the coordinator, and you can select different instance types depending on the workload, and similarly for the workers. The important thing about Ahana is the scaling strategy. Say, for example, you want to start with four workers: when no work is being done, it will scale back to one worker, so you don't have idle instances lying around; after 30 minutes of idle time it scales down to one instance. There is also a scale-out option: when you run queries that take more CPU, you can set a minimum of whatever you want, set a maximum, and set a scale-out step size, and it's going to scale accordingly. You can start with four and say the maximum is sixteen; as the workload increases it's going to go up, and once the workload completes, based on the idle timer, it goes back to the minimum worker node count.

Since Ahana is for AWS and runs on AWS, it comes with a Hive Metastore by default, so you can use Ahana's Hive Metastore catalog as the connector for S3, or you can bring your own Hive Metastore; we are fine with either.
If you go with Ahana's Hive Metastore, there is an option to select the instance type for Hive as well. We also have options to collect the query logs, which can give you insight into what types of queries were run and how much time they took. And here is the option to enable IO caching: when you run queries that reuse data, we cache that data on an EBS SSD drive so that it becomes faster. That's why you saw the difference with the partitioned table earlier: partitions are many files, and when you are going to reuse the partitions, instead of going to S3 every time, you cache them on this EBS SSD and it's going to be faster. Finally, you set the cluster credentials, the username and password, to create the cluster.

Let me walk you through the cluster. Within the cluster view you get the cluster information: the version, the worker count; you can see that since I'm not running any work right now, it has scaled down to one worker instance; what the coordinator is, what instance types I'm using for the workers and for Hive; and IO caching is enabled. Similarly, the scaling strategy is shown, and in case I want to change the scaling strategy, that is possible. And this is the Presto endpoint, which is the user interface for Presto, where it shows the running queries, active workers, and all those things. As I mentioned, the Hive Metastore is listed here because I'm using Ahana's managed catalog, and I'm also using a MySQL data source.

Let me show you a simple example here. Ahana Cloud comes with Superset, a data visualization tool, built in. As an example, I'm selecting data from MySQL and joining it with data from a file on S3, then combining them; I'm going to run this now. Once it runs, you can see the federated join: I'm able to join data from MySQL with data on S3 and get the result I was looking for, movies with weapons, and I got three movies. So this is a small example of how Presto can do a federated query.

Coming back: as you know, Presto is open source software, and Ahana Cloud is a managed service for Presto. We are available on the AWS Marketplace; you can sign up for a 14-day free trial and see how it works. You can either take open source Presto, configure it, and run it yourself, or if you want to give it a try directly, take the 14-day free trial and see how Presto helps in your use cases for your environment. Thank you, thanks for listening to my presentation all this time.

And how to get involved with Presto: since Presto is open source, you can join the Slack channel and ask questions. If you run into issues when you try to run Presto, want to know whether some connector is supported or not, or have other questions, join the Slack channel and ask. Once you use Presto, if you feel happy about it, you can write a blog on how you were able to run Presto for your application. There are also virtual meetup groups for Presto where you can talk to other users and get to know more about Presto. And in case you want to contribute to the project, it's github.com/prestodb; I highly recommend you please contribute. The Presto open source project needs more help, and it would be nice to have more people contributing to it.
Now it's time for questions; if you have questions, I can answer them.

Okay, with respect to encryption: encryption will have some impact on operations, and it all depends on what type of encryption. Presto is a query engine, so whatever encryption you want to use on the storage layer, you can use; as long as your application can read that encrypted data, it is fine. We don't put any limits on encryption.

When it comes to PrestoDB and Snowflake, let me go back to this particular slide; this shows the difference between PrestoDB and Snowflake. On the data warehouse side, that is basically Snowflake: it's closed source and a closed format, so you cannot use any open format. And since it's a cloud data warehouse, you need to move data into it before you can do any analytics on it. With Presto it's in place: if you have data in MySQL and data on S3, as I just demonstrated, I was able to query MySQL and S3 directly, whereas with a data warehouse you have to move the data around. That is the major difference. Presto is open source and open format, which is its advantage; query in place and federated querying are the two things that differentiate PrestoDB from Snowflake. I hope that helps. Please reach out to us if you have any additional questions.

Okay, wonderful. Thank you so much, Ram, for your time today, and thank you everyone for joining us. A quick reminder: this recording will be up on the Linux Foundation's YouTube page later today. Thanks so much, have a wonderful day.