Hi everyone. This is Kyle Bader, architect out of the Storage BU. Today I'm going to be doing a demo of Presto on OpenShift with OpenShift Container Storage 4. Presto is an MPP engine, a massively parallel processing engine, that accepts ANSI SQL, and it doesn't actually have its own data storage engine. Instead, it has a variety of connectors so that it can query data in situ, where the data lives; you don't have to copy data into Presto in order to drive insight from it. And there's a wide variety of connectors available: whether you're going to query a Ceph or Amazon S3-based data lake, or say you have an AMQ Streams Kafka cluster you want to pull some data out of, or a MongoDB instance, or a relational database, whether an open source variant like MySQL or Postgres or a more enterprise database like Microsoft SQL Server, there are connectors that have you covered. Now of course you can interact with Presto via the CLI, but some folks would much rather use a more sophisticated data visualization tool like Tableau, MicroStrategy, or Superset, and all of those are available; you can connect to a Presto cluster over an ODBC connection. The Presto community is pretty vibrant, and it's used by a number of very large practitioners. We have the Airbnbs and the Netflixes and the Lyfts, or the NASDAQs and the FINRAs, using Presto at scale, but you also have folks like Amazon who have created a managed service based on Presto's technology: the Amazon Athena service is using Presto behind the scenes. And it's used by a number of Red Hat customers and partners as well, which is always interesting. So why Presto? It's a community-driven open source project, so they're our open source brethren.
There's this separation of compute and storage that I was talking about. Because compute and storage are separate, you can scale the compute very easily; there's no data that has to be reshuffled. The data stays in situ in a variety of data sources, so you don't have to pull it in or ETL it into a Presto storage engine in order to make use of it. It's also very high performance: there are practitioners with clusters that number into the thousands of nodes, and it's very fast. And because it's not tied to a Hadoop vendor, a particular storage vendor, or a cloud vendor, you can run it anywhere: on anybody's cloud or on premises. It's a technology that isn't associated with any sort of lock-in, which really resonates with the OpenShift story: you can use Presto anywhere you have OpenShift, which is any cloud, hybrid cloud, or on premises. Recently we've been doing some work with Starburst. Starburst is a company that employs a number of core contributors to the Presto upstream community. They provide a downstream distribution for enterprises that want paid support and services around Presto, and they have made a number of significant contributions upstream in areas ranging from security to performance to syntactical improvements. Now, the most recent contribution we've done in conjunction with Starburst was making an operator available in the OpenShift catalog. If you go to the big data section of the catalog in an OpenShift 4.2 cluster, there's an option to deploy a Presto operator through OLM, and that will add custom resource definitions to your Kubernetes cluster so that you can declaratively define custom resources that result in a Presto cluster. Now, a Presto cluster is composed of a number of parts.
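As a rough illustration of what that declarative definition could look like, here is a minimal sketch of a Presto custom resource. The API group, kind, and every field name below are assumptions for illustration only; the authoritative schema comes from the CRDs the operator installs, so check `oc explain` against your cluster for the real fields.

```yaml
# Hypothetical sketch only: these field names are illustrative,
# not the operator's actual CRD schema.
apiVersion: example.io/v1      # assumed API group/version
kind: Presto                   # assumed kind name
metadata:
  name: demo-presto
spec:
  coordinator:                 # the coordinator plans and distributes queries
    memory: 8Gi                # assumed resource field
  worker:                      # workers execute tasks against the data sources
    replicas: 2                # assumed replica-count field
```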
You'll have the coordinator, which receives the queries, plans them out, and distributes the tasks to a number of workers. Those workers source the data from the various data sources and do whatever shuffling and filtering is necessary for the query. There's also an auto-scaling component: if the query load is high, additional Presto worker pods can be provisioned automatically. And then finally, there's the metastore service. In a lot of cases, like data lake use cases, the data is not necessarily stored alongside its schema, so the metastore service gives you a separate place to store the schema for your particular tables. Now, in this demo, part of our data is going to be stored in a Ceph object store, and we're going to use the Hive connector to access that data. In the configuration for the Hive connector, you provide your endpoint and then your credentials: your access key and your secret key. Then we have our schema definition. You can think of a schema as like a database, so it's create schema or create database; in the Hive catalog, this particular schema we're going to call "s3export". You also provide a location: either the data is already located there, or, if you're going to create a new table and insert into it, the new data will be stored there inside this database. Once you've created that database or schema, you can create tables within it, and for each table you can specify a particular format. You can store the data as CSV (or it may already be stored as CSV), or store it as ORC or Parquet if you want a compressed, efficient columnar format. In this way you can define schema for data that already exists in the object store, or define schema for where you'd like to put data. So Presto is not just a read-only tool.
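As a sketch of what that looks like in practice: the catalog property names in the comments below are real Presto Hive connector settings, but the endpoint, bucket paths, credentials, and column definitions are placeholders I've filled in for illustration.

```sql
-- The Hive catalog file (e.g. etc/catalog/hive.properties) carries the
-- Ceph/S3 endpoint and credentials, roughly:
--   connector.name=hive-hadoop2
--   hive.metastore.uri=thrift://hive-metastore:9083   -- placeholder host
--   hive.s3.endpoint=http://rgw.example.com:8080      -- placeholder endpoint
--   hive.s3.aws-access-key=<ACCESS_KEY>
--   hive.s3.aws-secret-key=<SECRET_KEY>

-- Create a schema backed by a bucket location (placeholder path):
CREATE SCHEMA hive.s3export
WITH (location = 's3a://demo-bucket/s3export/');

-- Define a table over data that already exists in the object store,
-- stored as Parquet (the columns here are illustrative):
CREATE TABLE hive.s3export.orders (
    o_orderkey bigint,
    o_custkey  bigint
)
WITH (format = 'PARQUET',
      external_location = 's3a://demo-bucket/orders/');
```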
It's a read and write tool, combined with a variety of different connectors. So in addition to being used for data discovery and analytics, you can also use it to move data around. If you want to pull data out of a relational database and store it in the object store, or pull it out of an older, HDFS-based data lake and move it into an object store because you're trying to move to a more cloud-style approach, you can use Presto as a tool to do that. Here's a high-level diagram of what we're going to show in this demo. We have a SQL query that's going to join data from a TPC-DS dataset. We're interacting with two tables of that dataset: the orders table and the customer table. The customer table, along with many of the other tables, is stored in a Postgres database. That Postgres database is sitting on top of an RWO PV provided by the RADOS block device (RBD) from OpenShift Container Storage. And then we have the orders table, which is stored as Parquet files inside the Ceph object store; those will be accessed by the Hive connector over the S3 API, talking to the Ceph RADOS Gateway. So without further ado, we can run `oc status` here and see the Presto cluster that's been pre-provisioned, so that we didn't have to walk through the provisioning. Then we can `oc rsh` into the coordinator and start the Presto CLI. If you've ever used a MySQL or Postgres shell, you'll be right at home in the Presto CLI. We can run `show catalogs`, and we see we have the Hive and the Postgres catalogs. Then we can run `show schemas from hive`, and do the same for Postgres. So we have the public schema in the Postgres catalog and the ceph schema in the Hive catalog, and we can show the tables in each of those. Here's the ceph one.
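The exploration steps just described look roughly like this from inside the Presto CLI. The `hive` catalog and `ceph` schema names come from the demo itself; the `postgresql` catalog name is an assumption based on the connector in use.

```sql
-- After `oc rsh`-ing into the coordinator pod and launching presto-cli:
SHOW CATALOGS;                      -- lists the configured catalogs
SHOW SCHEMAS FROM hive;             -- includes the ceph schema
SHOW SCHEMAS FROM postgresql;       -- includes the public schema
SHOW TABLES FROM hive.ceph;         -- the orders table lives here
SHOW TABLES FROM postgresql.public; -- the customer table lives here
```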
Okay, there's our orders table that we're going to use for the query. Then we show the tables from the Postgres public schema, and here are the rest of the Postgres tables, including the customer table that we're going to interact with. So what we have here: we're selecting the market segment, we're going to count the matches we have from the orders table in the ceph schema of the Hive catalog joined with the Postgres customer table, filter to customers with an account balance over a thousand, and then group them by market segment. Boom. Like I said, Presto is pretty fast: we just queried just under 2 million rows and returned the results. So these are the top market segments. I guess it's not sorted, so these are the market segments that have balances over a thousand. And that's this demo. Thank you.
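For reference, here is a sketch of the join we just ran. The column names (market segment, account balance, customer key) follow the standard TPC-style orders/customer layout and are assumptions on my part; the demo doesn't show the actual column names.

```sql
SELECT c.c_mktsegment,               -- market segment (assumed column name)
       count(*) AS matches
FROM hive.ceph.orders AS o
JOIN postgresql.public.customer AS c
  ON o.o_custkey = c.c_custkey       -- join key (assumed)
WHERE c.c_acctbal > 1000             -- account balance over a thousand
GROUP BY c.c_mktsegment;
```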