 How many of you know about Presto? A little bit? OK, not too many. All right. So I'll just focus on giving a quick overview of Presto and its capabilities, and then we'll discuss how we can deploy it with OpenShift, query Ceph, and other great things. First of all, I'd like to talk about Presto as a SQL-on-anything engine. It's an open source project. It was first started about seven years ago at Facebook and then spread here in the valley and beyond very, very quickly, and myself and my team have been involved in this project for almost five years by now. What's unique about Presto is that it's a compute-only distributed SQL engine, which means you can deploy it almost anywhere, and you can allow Presto to access data from many, many different data sources. Some of those are object storage, like Ceph, Amazon S3, Google Cloud Storage, Azure Blob Storage, ADLS, and other technologies like this. You can also query HDFS and Hadoop, obviously known for storing big data. But you can also connect to a variety of different databases, like Oracle, Teradata, SQL Server, Postgres, and so on, and also NoSQL engines like Mongo, Cassandra, and most recently Elasticsearch as well. So it's a very powerful mechanism where you separate compute and storage, and you can provide scalable processing using multiple machines in your Presto cluster. And from the user perspective, this is very familiar, because you're sitting in front of your favorite BI tool or SQL editor. You can run things like Superset, or Redash, or a Jupyter notebook. It's a very powerful, well-known interface through JDBC and ODBC drivers. Now, the Presto community of users is actually very, very large. This is a small sample of it, but you can see some of the biggest companies in the world, from all over the world and many different industries. It's a very horizontally applicable project.
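As a concrete sketch of what one of those connections looks like: a Presto catalog is just a properties file placed on each node. The connector name below is real, but the host, database, and credentials are made up for illustration:

```properties
# etc/catalog/postgres.properties -- connection details are placeholders
connector.name=postgresql
connection-url=jdbc:postgresql://example-db.internal:5432/sales
connection-user=presto
connection-password=secret
```

Dropping a file like this into every node's etc/catalog directory is what makes a `postgres` catalog show up to queries and tools.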
And it's proven at scale for a variety of different use cases by all those different companies. So if you decide to apply it, you can sleep well; those guys are pushing it to the limit already. Now, why do people like Presto? Why are so many companies deciding to leverage Presto rather than alternatives? I think there are several different reasons, and some of them are summarized on this slide. First of all, it's a community-driven open source project used by a number of big players who bet their SQL analytics needs on Presto. So guys like Airbnb, Netflix, Lyft, LinkedIn, and many, many others. They're part of the community driving this forward, making sure that the project survives regardless of any single company deciding to go further with it or not. It's a very powerful, high-performance SQL engine, proven at scale: the largest deployments of Presto are approaching about 1,000 machines in a single cluster. And many companies are actually running many, many clusters, because since it's compute-only, it's very easy to spin them up and down and give access to certain data sources without creating data silos. It's meant to be interactive SQL, and to handle high concurrency, of course. As I mentioned, a fundamental piece of the architecture is the separation of compute and storage, which means Presto itself doesn't have any favorite storage mechanism. It doesn't come with its own mechanism to store the data. It relies on wherever your big data is, whether that's object storage or HDFS; you may keep some of your older data in Oracle, Teradata, and other data warehouses, and you can access some operational data from Postgres or SQL Server or anywhere it lives right now. You don't have to invest in moving data around to start getting insights, because you can access data where it lives once you've provided the proper configuration settings. With that, we also like to say it represents a big value in having no vendor lock-in.
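To make the "query data where it lives" idea concrete, here is a sketch of a federated query spanning two catalogs; the catalog names follow Presto conventions, but the schema, table, and column names are invented for illustration:

```sql
-- Join order data living in object storage (Hive catalog) against
-- customer records living in Postgres, in a single Presto query
SELECT c.name, sum(o.total) AS revenue
FROM hive.web.orders o
JOIN postgresql.public.customers c
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2019-01-01'
GROUP BY c.name;
```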
First of all, it's an open-source project, so you can run it and use it without any vendor. You're free from being tied to any Hadoop distribution; it works across all of them. You can change the storage underneath Presto, and your applications and your end users will still be interacting with the same data without knowing you actually moved from HDFS to object storage, for example, or moved from an on-prem deployment to the cloud or the other way around. Things do not change for them, because Presto is isolating them from that entirely. And again, you're not tied to any specific infrastructure, so you can move between clouds, for example, if that's your choice. So it provides great isolation, flexibility, and optionality in deploying and querying your data. Now, Starburst: as I mentioned, we have been involved in the Presto community for many years already. We have large customers in production, both on-premises and in various cloud deployments. With Kubernetes, we are now enabling a very similar experience across any cloud and on-premise environments like OpenShift, which is really great both for customers and for us as developers, because we don't have to handle custom deployment mechanisms for each cloud separately. And as an enterprise vendor, we have extras that you get in addition to just the core open source project. But we contribute heavily to the open source community; in fact, we probably represent over 70% of contributions to the project right now. So as I mentioned, Presto is a very, very high performance SQL engine, and it was built like that from the beginning. The objective for the team that was implementing it was to make interactive analytics at big scale a reality. Before Presto, there was Hive, obviously a very highly respected engine that can handle petabytes of data. But interactivity wasn't a strong point of Hive.
So with good design techniques from, sort of, Facebook's recipe for an MPP database, Presto took advantage of pipelined in-memory execution, columnar and vectorized processing internally, efficient memory structures, and exploiting modern hardware like multi-core CPUs and multi-threaded architectures. Combined with columnar storage under the covers, in the form of ORC and Parquet, we are now able to deliver really, really nice performance for analytical use cases. We then also added a cost-based optimizer, which now works across many different data sources, so you can arrive at very efficient query plans across your data that could be spread out in many different places. In terms of performance, this slide is just to show off the cost-based optimizer improvements we introduced some time ago. The primary target here was environments where your data is spread across many different sources and you have many tables involved in various queries. The decision of how to arrange the join order in a query was a really fundamental win, if done properly. So this is showing Presto before and after the introduction of the cost-based optimizer, and you can see we are enjoying the benefits of an order of magnitude faster performance for many queries. So with that, I think, hopefully, it's very clear how Presto works internally and what it's good for. I will let Kyle discuss how you deploy it on the OpenShift platform and how we can enable data analytics in that environment. Yeah, so when I was first exposed to Presto, it was probably a couple of years ago, two, three years ago. We were meeting with a number of customers who were in the middle of kind of switching their data processing architecture over to using on-prem object storage. And that's why I was there: to help them kind of adjust and use Ceph for those needs. And Presto was something that they really liked, because they were breaking up compute and storage.
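For reference, cost-based join reordering and join distribution selection can be toggled per session. The session property names below do exist in Presto, though the available values and defaults vary by version; the table name is the same hypothetical one used earlier:

```sql
-- Let the cost-based optimizer pick join order and join distribution
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SET SESSION join_distribution_type = 'AUTOMATIC';

-- EXPLAIN shows the resulting plan, so you can compare join orders
EXPLAIN
SELECT c.name, count(*)
FROM hive.web.orders o
JOIN postgresql.public.customers c ON o.customer_id = c.id
GROUP BY c.name;
```

Note that the cost-based strategies rely on table statistics being present in the metastore, which comes up again later with ANALYZE.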
And Presto didn't come with an opinion in terms of what storage would be used with it. A couple of years ago, a lot of these same customers were using Presto and were using object storage, and they basically wrote their own deployment tools for deploying these different Presto clusters. One of the things that's great about having OpenShift is alleviating this burden from folks, right? Instead of having to write scripting in some sort of configuration management tool, they can use something like an operator. And having written Ansible playbooks for Presto, I can appreciate not having to do that anymore. So all the things that Kubernetes is good at, you kind of get once you start using the Operator Framework to deploy clusters, and particularly Presto clusters. Instead of having to worry about provisioning new nodes or dealing with fault tolerance, Kubernetes kind of handles that for you. You can say how many Presto workers you want online, and it'll bring that many up. If one goes down, then it'll provision a new one, and it'll get bound to a different node. You can trivially scale it, right? So I can go in, I can change the number of replicas for workers up, and then I have more. And so you could potentially make it so that if you have a higher query volume, you scale out the cluster to keep your query response times low. And then if the volume of queries subsides, you can scale it back in. And because it's compute-only, you don't have to worry about it, right? One of the classical problems with database-type approaches where you have compute and storage together is that scaling in is usually prohibitive, because the data lives on those nodes, so you can't just take them away without complication. But being compute-only makes that a lot easier. So, Presto on OpenShift: it had been a little while, and so I reached out to Camille again, and I said, hey, I see that you guys are doing some work with Kubernetes.
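As a sketch of that elasticity: with an operator, scaling is just editing a replica count in a custom resource. The CRD group, kind, and field names below are hypothetical and will differ by operator version; the shape, not the exact schema, is the point:

```yaml
# Hypothetical Presto custom resource -- field names are illustrative only
apiVersion: example.starburstdata.com/v1
kind: Presto
metadata:
  name: analytics-cluster
spec:
  worker:
    count: 5   # raise or lower this to scale the cluster out or in
```

The operator watches this resource and reconciles the running worker pods to match the requested count.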
And I think it would be great if Starburst had an operator inside of OpenShift so that people could really easily provision Presto clusters to process their data. So we kind of connected the dots, and they made it happen with a little bit of help; but mostly, it was like 90% done by the time we started having the conversation with them. So what the operator does is deploy the coordinator and workers that then work together. You submit the queries to the coordinator; the coordinator comes up with the query plan and then distributes the tasks to the workers, which will source data, process it, filter it, do any sort of other more complex operations, and then return it back up. And they also have a Hive Metastore service bundled in, so that you can basically catalog your schemas there. And if you want to improve the query plans, you can analyze the tables; the results of that analysis will be stored in the Metastore, so that future queries that interact with those tables will be planned more efficiently. So at this point, this is a screenshot from one of my OpenShift 4.2 clusters. If you go into the catalog under the big data section, there's the Presto operator. It's under the OLM, and you can click and install, and then you can submit CRs to create a Presto cluster in your environment and begin to experiment with it. So where does Ceph come in? Well, I gave a little lightning talk earlier about the scalability of Ceph. Ceph and Presto actually work really great together, because Ceph is just an object store and Presto is just a compute engine. So there aren't really opinions around using a particular storage or a particular query engine, because it's not a verticalized stack. And originally, I learned about Presto by way of customers, right? We had a number of customers that were deploying very large Ceph-based object stores.
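The table-analysis step mentioned a moment ago is a single SQL statement in Presto; the catalog, schema, and table names here are placeholders:

```sql
-- Collect table and column statistics into the Hive Metastore;
-- subsequent queries over this table can then get cost-based plans
ANALYZE hive.default.orders;
```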
And they were using Presto to process that data for reporting and so on and so forth. And the things that Ceph provides, like erasure coding, really are great for dealing with high amounts, like many petabytes, of data; and it's also open source. And then, kind of the requisite plug, right? In an OpenShift environment, we have OpenShift Container Storage, which is the packaging of Ceph with an operator that can manage Ceph, called Rook-Ceph. Additionally, there's another component called NooBaa, which is kind of a multi-cloud gateway that can front multiple different object stores, on-prem or public cloud, and it can route and apply sophisticated policy around where data should be placed. So how do you use Ceph plus Presto? Well, there's a connector. Presto has this idea of connectors, and there are different connectors for connecting to different data sources. If you're connecting to Postgres, you might use the Postgres connector. If you're connecting to, in this case, an S3-compatible object store, you would use the Hive connector. And then you reference your metastore, where Presto is going to retrieve schema information about which buckets, and which prefixes within those buckets, map to particular databases and tables. This is also where you would configure your credential information, and an endpoint if you were not using Amazon S3 proper and instead were using an on-premises object store. And so if I'm just a data scientist interacting with the data, and the tables have already been created and I'm just doing SQL queries, I don't really know where the data is coming from. That's one of the nice things about Presto: it can have multiple different data sources. You can have some relational databases, you could have an object store, you could have some older data that's in HDFS.
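Sketching the Hive connector configuration just described: the property names below are real Hive connector settings, but the metastore address, endpoint, and credentials are made up, standing in for an on-prem S3-compatible (e.g. Ceph RGW) store:

```properties
# etc/catalog/hive.properties -- endpoint and credentials are placeholders
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.internal:9083
hive.s3.endpoint=http://rgw.example.internal:8080
hive.s3.aws-access-key=EXAMPLE_ACCESS_KEY
hive.s3.aws-secret-key=EXAMPLE_SECRET_KEY
hive.s3.path-style-access=true
```

The endpoint and path-style settings are what point the connector at an on-prem store instead of Amazon S3 proper.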
And from the data scientist's perspective, they don't know that it's being sourced from one place or another, and so this is kind of nice. If you wanted to create tables that map to the object store, it's as simple as running a few statements where you provide, basically, an external location. So for the path, you would give it s3 and then the bucket and then some path in there. And that maps that particular table, the partitions within that table, and the Parquet or ORC files that compose the table, to that location. So when the workers go to source data, if they have a query that pulls data in by date range, and you've partitioned your tables by date, then it'll filter on the path: it'll query for the list of all the objects in the bucket with a particular prefix matching the time range, and then it'll read in those files; Parquet has metadata and so on and so forth, so it'll bring all of that in. And the person writing the SQL has no idea; they don't need to know all of the intricacies around accessing the data in the object store. It's all being handled for you by the Hive connector and by Presto, and it's kind of abstracted away by this idea of the schema.
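A minimal sketch of the external-table setup described above; the bucket, path, and columns are invented, and note that in the Hive connector partition columns are listed last:

```sql
CREATE TABLE hive.web.page_views (
    user_id   BIGINT,
    url       VARCHAR,
    view_date DATE                    -- partition column, listed last
)
WITH (
    format            = 'PARQUET',
    partitioned_by    = ARRAY['view_date'],
    external_location = 's3a://analytics-bucket/page_views/'
);

-- A date-range predicate then prunes to just the matching partition paths:
SELECT count(*)
FROM hive.web.page_views
WHERE view_date BETWEEN DATE '2019-11-01' AND DATE '2019-11-07';
```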