Hello, my name is Karol Sobczak. I'm a software developer at Starburst and actually one of the co-founders of the company, and I'm also a Presto committer. With me is Michael McKeon, a software developer at Red Hat. Today we will talk about Presto, and also about Kubernetes and how these two play together.

So what is Presto, exactly? Presto is an open source, community-driven project which was started by Facebook. Facebook open-sourced it in 2013, and from that time the community grew around the project; there are more committers and more companies involved, so it's quite large at this point. In essence, Presto is a distributed SQL engine. There were two goals driving Presto: one was scalability and the other was performance. Because Presto was developed initially at Facebook, those two goals were the main motive for creating it, and Presto has been tested against those two key aspects throughout its lifetime.

The main feature of Presto is that it separates computation from storage, and this gives you great flexibility. For one, you can scale compute resources separately from storage resources, and you can have multiple clusters for computation. The other important factor is that because you separate compute, you can also federate data from multiple data sources. Presto can read data from sources like HDFS, but also from distributed storage in the cloud, and from other data sources like traditional relational databases.

Another thing is that Presto does not lock you in with any specific vendor, cloud or otherwise. Presto is Hadoop-distribution agnostic; it's not tied to any big Hadoop distro. And because it supports multiple storage systems,
you are also not locked into your storage: you can read from S3, Google Cloud Storage, Azure storage, and so on. If you want to move from one cloud to another, Presto is a good choice here, because it gives you a common interface to your data, independent of where the data lives.

There are a lot of companies that use Presto, and a lot of them are web-scale companies. The most famous one is obviously Facebook, but there are also others like LinkedIn, Twitter, Yahoo, and Netflix. We will get to more details later, when I describe the scale at which they run Presto.

Let's briefly describe the Presto architecture. Presto is essentially composed of two components: one is the coordinator, and the other is the workers. When a user issues a query, it goes to the coordinator. The coordinator parses the query, optimizes it, plans the work, and schedules that work on the workers. The workers are the nodes that do the heavy lifting: they read the data from the external data sources and do the computations, like
distributed joins, aggregations, whatever is needed to run the SQL query. And obviously, again, it's important that it can read from multiple data sources.

So why is Presto that fast? First of all, it's an in-memory processing engine: no intermediate results are written to any persistent storage, and that is one aspect of why it's so performant. Another important factor is that it's based on columnar processing; the internal structures within Presto itself store data in a vectorized format. We also put very great focus on details when tuning or writing performance-critical code like operators. We try to use data structures in such a way that we minimize garbage collection, and ideally Presto should never trigger a full GC in the JVM. We also run a lot of micro-benchmarks and macro-benchmarks, so when we change some aspect of an operator, like a join, we really try to make sure it's faster and that it doesn't break anything; we try to assert quality there. Another thing is that at runtime we generate bytecode for query execution. For example, if you have an expression or a projection in SQL, we generate bytecode for it so that we reduce the interpretation overhead. And a very important factor is that we have dedicated Parquet and ORC readers which read directly into internal Presto structures, and because those file formats are also columnar, we get really good performance here.

As I mentioned, Presto can read from multiple data sources, and this is implemented by a thing called a connector. Within Presto there is a connector SPI, and if you want to implement your own connector, you have to implement certain interfaces of that SPI. For example, you have to implement the metadata SPI to tell Presto what tables you have, what schemas you have, and what the table layouts are.
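To make the shape of that concrete, here is a toy sketch of the metadata idea. These are not the real Presto SPI interfaces (the actual SPI in Presto's Java codebase is much larger); the names and the in-memory "data source" below are invented purely for illustration:

```java
import java.util.List;
import java.util.Map;

// Toy sketch of the connector idea: the engine asks the connector
// which schemas and tables exist. Illustrative names only, not the
// real Presto connector SPI.
interface ToyConnectorMetadata {
    List<String> listSchemas();
    List<String> listTables(String schema);
}

// A trivial in-memory "data source" exposing one schema with two tables.
class InMemoryMetadata implements ToyConnectorMetadata {
    private final Map<String, List<String>> tablesBySchema =
            Map.of("demo", List.of("orders", "lineitem"));

    @Override
    public List<String> listSchemas() {
        return List.copyOf(tablesBySchema.keySet());
    }

    @Override
    public List<String> listTables(String schema) {
        return tablesBySchema.getOrDefault(schema, List.of());
    }
}

public class ConnectorSketch {
    public static void main(String[] args) {
        ToyConnectorMetadata metadata = new InMemoryMetadata();
        System.out.println(metadata.listSchemas());      // [demo]
        System.out.println(metadata.listTables("demo")); // [orders, lineitem]
    }
}
```

The real SPI additionally covers types, statistics, splits, and data streaming, which is what the rest of this section walks through.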
You can also implement the statistics SPI: Presto comes with an advanced cost-based optimizer, so if you provide statistics to Presto, it can optimize the query better and make it run faster. Another SPI you have to implement is the data location SPI, so you can tell Presto where the data lives. And you also have to implement the data streaming SPI, so that Presto can actually read the data from your data source. If you do all those things, then you have your own connector.

How can you deploy Presto? I think the most common deployment pattern now is to deploy Presto separately from the storage, and this is becoming more and more important as we move to the cloud. It gives great flexibility for upgrading clusters, having clusters for multiple different workloads, or just having a cluster per internal user group within the company. Historically, though, it is still possible to co-locate Presto with the storage nodes, and if you are a larger company like Facebook, there are even more advanced patterns, like a mixed deployment where you put Presto nodes in the same rack as the data nodes. This is for the really large on-prem clusters.

So what's the ecosystem of Presto? First of all, the most important connector is the Apache Hive connector. Using this connector, you can have a seamless experience when moving from Hive to Presto; basically, this connector tries to behave similarly to what Hive does with its storage. Despite its name, the connector can read metadata not only from the Hive metastore but also from Glue, for example, and it can be used to read data from HDFS, S3, Google Cloud Storage, Azure Blob, ADLS Gen 1 and Gen 2, or S3-compatible systems: anything you want. Now, it is important to understand that we don't use Hive underneath to execute the query.
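Concretely, hooking up the Hive connector is typically just a small catalog properties file on the Presto nodes; the metastore host below is a placeholder:

```properties
# etc/catalog/hive.properties -- registers a catalog named "hive"
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083
```

A second file, for example postgresql.properties, would register a relational catalog in the same way.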
It's the Presto engine that runs the queries, not Hive, and that's why we get great performance.

Apart from the big data connectors, you can also have connectors to your relational databases. To name a few, we support MySQL, PostgreSQL, Redshift, SQL Server, and so on; the list is large. We do things like pushdown of filters to the relational databases, and this will continue to grow as we move forward; we will push even more work down to the underlying databases. We also support a bunch of non-relational data sources, like Accumulo, Kafka, and Elasticsearch, to name a few. Because of this, you can have a really unified view over all of your data within the company, across various data sources, and this frees you from having to run your own ETL in some cases.

We also try to be, well, we actually are ANSI SQL compatible: the whole TPC-H and TPC-DS suites pass, and we don't need any hacks there; they just run. And we support some advanced SQL features, like very complex correlated subqueries or advanced window functions, to name a few.

Apart from that, Presto also comes with security features. I think the most notable one is user impersonation: Presto will impersonate the user that connects to it when talking to external data sources. For one, you can preserve your security enforcement, and the other aspect is that you can have auditing, for example. Another important thing, and this is actually exclusive to Starburst, is that we have Apache Sentry and Ranger support. What Ranger gives you is column masking and row filtering, and this really becomes important now that we are moving toward more restrictive laws about how you store data and who can access it.
Also, when logging into Presto you can use various mechanisms, like Kerberos, LDAP, or basic authentication, which is password-based, and we are constantly adding new things there.

So how do you actually talk to Presto? You can use the open source JDBC driver in your application, or you can use the CLI. There is also an ODBC driver, which is commercial; that one is from Starburst. And you can use various tools to query Presto: if you don't want to write your own query pipelines or ETL scripts, you can use one of the existing tools, for example, to generate reports. The list of such tools is quite large.

Let's go back to the companies that are using Presto. The most famous is obviously Facebook. They have really huge Presto clusters, 10,000-plus nodes, with hundreds of concurrent queries at the same time; I think they're the biggest user of Presto overall, and they have around 300 petabytes of data, which is astronomical. Another company that uses Presto is Netflix. They also have a lot of data, around 100 petabytes; what's different about them is that they run Presto completely in the cloud, with something like 300-plus nodes. Uber also uses Presto: 150 petabytes of data, 160,000 queries per day, and 2,000-plus nodes. Twitter also runs around 2,000 nodes. Comcast has around 100 nodes and is actually supported by Starburst. There is Lyft, which uses it mostly for interactive queries and reporting. There's Yahoo Japan, with 200-plus nodes, also Starburst-supported. And FINRA, a large U.S. organization that does financial auditing, has 100-plus nodes and uses our support.

If you want to join the community, I recommend joining the Slack; you can learn a lot of interesting things there.
We are all there, and we try to answer questions and help with problems, so I really recommend that. Presto is also available on GitHub, where you can file an issue if you find something interesting, and there is a mailing group as well.

I would also like to talk a bit about Starburst. The company was founded in 2017, and it consists of committers to the Presto project who generated about 50% of the commits within Presto SQL. We come from various companies; past companies include Teradata, Hadapt, which was actually the first iteration of a SQL-on-Hadoop engine, and Vertica. What we do is offer Presto on various platforms, and that list will grow: we offer Presto on Kubernetes, on Azure, AWS, GCP, and so on. The company is headquartered in Boston.

One thing we have built around Presto is Mission Control. Presto is really scalable and really fits the cloud, but there are some details that are very specific to Presto, so we try to lower the barrier to entry for the project. With Mission Control, which is a UI for Presto, we automate cluster creation and make it easy, and this will continue to grow into an integrated Presto platform. We also add security integrations, such as LDAP, Ranger, and Kerberos, so a lot of enterprise features around security. We constantly improve the SQL support of Presto: we add the latest features from the latest SQL standard, but we also enhance existing features, for example for correlated queries.
We support new use cases. We also provide ODBC and JDBC drivers, and these are certified with Tableau. Apart from that, there are connectors which are Starburst-specific: if you want to integrate Oracle, Snowflake, or Teradata, we have connectors for these, so that you can have a unified view, but you can also offload work from some of those databases to Presto.

We are also constantly improving query performance. We get feedback from users, we run internal benchmarks, and there are long-term projects within Presto that aim at making it even faster. One example of such a project was the cost-based optimizer; it was Starburst that added it. If you run the TPC-DS benchmarks with the cost-based optimizer, you get, I don't recall the exact number now, but many times the performance of running the queries without it, which translates to a lower cost of ownership.

If you want to try Starburst, you can go to our web page. With Kubernetes, for example, you can launch an actual Presto cluster in about five minutes, which is quite amazing.

Right, so what I would like to do now is show you a demo of how Presto works. Let me switch to the console. What I have here is a pre-existing Kubernetes cluster within AWS, consisting of six r4.8xlarge nodes, so quite beefy ones, and I have Presto deployed there. Let's see what kind of pods I have. Okay, so as you can see, our Kubernetes integration is based on a Kubernetes operator.
So it's really native to the Kubernetes cluster. The operator monitors the Presto resource type and will create a cluster if you create such a resource, so let me show you that. The operator watches for resources of type Presto; I created one, and the operator spawned a cluster for me. The cluster consists of a coordinator, but also of five worker nodes.

Now let's see what the definition of the cluster looks like within Kubernetes. What you can see here, for example, is that I defined a catalog pointing to a PostgreSQL instance in the cloud. Apart from that, I also have a Hive catalog configured: there is a Hive metastore on an EMR instance, and that metastore contains the TPC-H and TPC-DS tables, which are stored on S3. Just those two entries allow me to add connectors for those two very different data sources.

Okay, so what I will do now is run the Presto CLI and issue some queries. What I just did is connect to the Presto coordinator pod in Kubernetes and run the CLI there. Now I can check which catalogs I have within this Presto system. As you can see, I have a hive catalog and a postgresql catalog.
I also have a tpch catalog, which is for generating TPC-H data on the fly. I can also show the schemas from these catalogs. As you can see, the hive catalog contains the tables we usually use for benchmarking; for example, it contains TPC-H tables at a scale of 100 gigabytes in ORC format, like this one. Let's see what schemas I have in Postgres: the information_schema and the public schema, the usual Postgres schemas. And let's see what tables I have in the public schema in Postgres. What I did for the demo is this: the TPC-H data set consists of fact tables but also of dimension tables, and I copied some dimension tables into PostgreSQL. I will issue a query that uses the dimension tables from PostgreSQL but also the fact tables from the Hive connector. So, as you can see, I can combine S3 data with tables from classical connectors, and this shows the power of Presto.

Before I do that, I will forward the Presto coordinator UI locally. Okay, I have it; this is the Presto UI. You can see that I have five active workers, and I have already issued some queries to Presto. Let me now run the query. This is one of the TPC-H queries used for benchmarking; as you can see, I use the fact tables from the Hive connector but the dimension tables from the postgresql connector. So let's issue the query, and you will see now... yeah, it finished in 10 seconds, but you could see that it was running for a while. If you go to the query details, you can see... well, let's see if it produced a result.
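The exact demo query isn't reproduced in this transcript, but a federated query of that shape, joining fact tables read from S3 through the hive catalog with a dimension table copied into PostgreSQL, might look roughly like this (column names follow the TPC-H schema; the hive schema name here is illustrative):

```sql
-- lineitem and supplier are read from S3 via the hive catalog;
-- nation, a small dimension table, was copied into PostgreSQL.
SELECT n.n_name AS nation,
       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
FROM hive.tpch_orc_100.lineitem l
JOIN hive.tpch_orc_100.supplier s ON l.l_suppkey = s.s_suppkey
JOIN postgresql.public.nation n ON s.s_nationkey = n.n_nationkey
GROUP BY n.n_name
ORDER BY revenue DESC;
```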
It did, so here is the result. Here are details about the execution: how it was scheduled, how many stages it had, and the status of the tasks on the workers. Again, it's important to know that Presto runs its own execution engine. It's not just a semantic overlay over existing databases; it runs the computations itself, in a distributed way. Okay, that's everything I had. Let me go back to the presentation. Thank you, and now we will talk more about the operator.

Thanks. So, one of the things that I do at Red Hat is work with our partners who are also working with open source, to help them bring their technology onto Kubernetes and onto OpenShift. So let's talk a little bit about operators and how we did this. How many people in here are familiar with Kubernetes operators, or the operator pattern, I should say? All right, so I'll go pretty quickly through this, since probably most of you already know it, and when I get details wrong, you can just shout at me and let me know.

Okay, so in general, this is how we use Kubernetes: the user comes up and says something like "I want a pod", Kubernetes goes and does a bunch of stuff, and returns a pod to them in some manner or fashion. This little circle here indicates that somewhere inside
there's a loop watching for these things to happen and handling events as they come in. Internally there are these mechanisms: etcd is the storage we use for all the different objects that come into the system, and it's continually talking to the internal machinery inside Kubernetes, which in turn is watching for all these different types of objects. We mentioned a Pod before, but what I tried to do here was grab an A-to-Z of the different objects in Kubernetes, from APIService to VolumeAttachment. The internal mechanisms watch for updates, send the updated information to etcd, return it back to the user, and deploy containers, attach volumes, or create API services, whatever that happens to be.

So the question comes in: how do we extend Kubernetes? One of the things that happened very early on was that people said, well, Pod is great, but what I'd like is a Presto. And what does that even mean? It's a bit much for the Kubernetes maintainers to add special objects for everything, and it wouldn't scale well for the project, as an open source project, to just let everyone add their own stuff in. So this operator pattern started to emerge, and if I'm not mistaken, our friends at CoreOS really pioneered a lot of it and showed us how to create this pattern. In the beginning there was a distinction between controllers and operators as different internal mechanisms, but the idea was the same: we needed a way to customize these things. So how can I capture this inside of Kubernetes? One Presto might be a Presto controller, a Hive metastore, and a PostgreSQL database to store information. How do I encapsulate all of this together?
Custom resource definitions. These used to be known as extensions inside Kubernetes, and they give us a way to define the data, or the objects, that make up one Presto, for example. On the back end there is now an operator looking for these things; those loops are now watching for this special type of resource that I've made.

I'll just highlight a few sections here. At the top you can see the type of thing that I'm creating inside Kubernetes: I'm telling Kubernetes, please make a CustomResourceDefinition for me, and it knows how to watch for these. In this case I'm telling it what I want to call the new type: I want to call it a Presto. I also say what a list of them is called, and what the plural form is; these things help with the command-line tooling and with how you interact with the objects. Lastly, you can see down here that I can give a version to what I'm creating. The point is that once I create this resource definition, Kubernetes knows what a Presto is. It's not necessarily watching to do anything with it at this point, but it knows what it is.

Now, if I would like to make a Presto, I need to issue a custom resource manifest to Kubernetes, and this gets pushed in; this is how we actually deploy a Presto. Again, you can see the type here, which is what we put in the resource definition, and you can see the version number after it. The rest of the information in the manifest is really specific to Presto; it describes how Presto gets configured, and so on. At this point Kubernetes isn't really doing anything with it, because it doesn't yet have any sort of watcher looking for the Presto type.
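Put together, the two manifests just described might look roughly like the following. The group, version, and spec fields are illustrative sketches, not the actual schema used by the Starburst operator:

```yaml
# CustomResourceDefinition: teaches Kubernetes what a "Presto" is.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: prestos.example.com
spec:
  group: example.com
  version: v1
  scope: Namespaced
  names:
    kind: Presto
    singular: presto
    plural: prestos
    listKind: PrestoList
---
# A custom resource of that type: "I would like one Presto, please."
apiVersion: example.com/v1
kind: Presto
metadata:
  name: demo-cluster
spec:
  workers: 5          # operator-specific, illustrative fields
  catalogs:
    hive: |
      connector.name=hive-hadoop2
      hive.metastore.uri=thrift://metastore:9083
    postgresql: |
      connector.name=postgresql
```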
So the Presto operator can now watch for these resources as they come in, and then it can do whatever it needs to do: deploy the workers, deploy the Hive metastore, and so on.

Getting started building operators has been a long time coming in the community, and what I wanted to do here was share some resources and talk about some of the ways you could get started writing your own. One of the biggest places to go is the Operator SDK. This is on GitHub in the operator-framework group, where there are lots of different projects. You can write these operators in Go, Ansible, or Helm, and the tooling will create a framework for you, so you have all the pieces necessary to attach to Kubernetes, to put your custom resource definition in, and to watch for when those resources come in. You get a code entry point where you can just start putting in your own custom logic.

Another option is Kubebuilder. This is popular in the upstream Kubernetes community; I personally have not used it very deeply, so I don't know a lot about it, but from what I've heard, it's another way to create these things.

Another one that I'm familiar with, if you're into JVM-based languages, is the JVM abstract operator, whose author, Jirka Kremser, over here, has kindly put it on GitHub. We've seen many different JVM-language operators come out of that, so you could write one in Java or Scala; the list is pretty large, since anything that compiles to the JVM can work with it.

And lastly, if you're really excited about this and you just love to hack on Kubernetes, you could roll your own. There's a whole list of client libraries for many different languages, and what you'll learn during that process is how to register your custom resource definition and how to create a watch that looks for updates to that definition.
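Whichever route you take, the core of every operator is the same reconcile loop: observe the desired state, compare it with the actual state, and act to converge them. Here is a deliberately tiny, framework-free sketch of that loop in Java; there are no real Kubernetes API calls, and all names are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Toy reconcile loop: "desired" records how many workers each Presto
// resource asks for, "actual" tracks what currently exists. Each pass
// of reconcile() converges actual toward desired, which is exactly the
// job of an operator's control loop. Everything here is in-memory.
public class ReconcileSketch {
    private final Map<String, Integer> desired = new HashMap<>();
    private final Map<String, Integer> actual = new HashMap<>();

    void apply(String name, int workers) {   // like `kubectl apply`
        desired.put(name, workers);
    }

    void delete(String name) {               // like `kubectl delete`
        desired.remove(name);
    }

    // One pass of the loop: create, scale, or tear down to match desired.
    void reconcile() {
        for (Map.Entry<String, Integer> e : desired.entrySet()) {
            int have = actual.getOrDefault(e.getKey(), 0);
            if (have != e.getValue()) {
                // A real operator would create or delete pods here.
                actual.put(e.getKey(), e.getValue());
            }
        }
        // Garbage-collect anything whose resource was deleted.
        actual.keySet().removeIf(name -> !desired.containsKey(name));
    }

    int workers(String name) {
        return actual.getOrDefault(name, 0);
    }

    public static void main(String[] args) {
        ReconcileSketch op = new ReconcileSketch();
        op.apply("my-presto", 5);
        op.reconcile();
        System.out.println(op.workers("my-presto")); // 5
        op.delete("my-presto");
        op.reconcile();
        System.out.println(op.workers("my-presto")); // 0
    }
}
```

The frameworks above essentially wrap this loop with real watches, retries, and API plumbing so you only write the "act" step.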
You'll see how these things happen, but you also won't be taking advantage of the engineering that's gone into those other projects, so you'll have to do all of your own security testing, find your own bugs, and all that exciting stuff that we love about software development.

So the next part is: okay, I've created an operator. How do I get it into Kubernetes? How do I let Kubernetes know that my operator is looking for a certain type of resource? Well, one thing you could do is just manually insert it. These operators run as containers in Kubernetes, and depending on your Kubernetes, they might need a little special permission to be able to inject custom resource definitions. When I'm testing these things, I run them on my laptop and just connect them to a remote cluster, but that's not going to scale very well in production. So there's a project called the Operator Lifecycle Manager, or OLM. This is another operator of its own that helps you package your operators and get them injected into Kubernetes, and it has a whole lifecycle around keeping your operator up and running; you can also encode the dependencies that your operator needs. There's a way to discover operators in the OLM, so if users are looking for a Presto, they can search the OperatorHub for one, and hey, there's an operator for this. You can also get automatic updates by subscribing to channels through the OLM, and it will then upgrade the operators on your cluster. If you want to know more, I'd say go to operatorhub.io.
There are tons of operators there, community operators, and you can just explore.

Now, what I'm going to show you here: the Kubernetes that I use most frequently is OpenShift, and what we're looking at is just a screenshot I took of the OpenShift Container Platform. If you are an administrator, you can see over on the left-hand side I've got the OperatorHub open, and these operators are ready to be deployed directly from the console. So I could just log in, go to the Big Data section, look for Presto, click install, and now I'm ready to start deploying Prestos on my Kubernetes cluster; I can start playing with this technology right away.

The next part of this is: you've created your operator, you've got the OLM managing it, you understand how it works, you've fixed all the bugs. How do we publish this? How do I get my operator into the OperatorHub so that people can search for it and be installing Prestos in no time? There are a couple of questions here. First you have to ask yourself: are you going to make an open source or a closed source operator, and how will you manage those things? I won't get too much into that; obviously I'm going to advocate for open source, but the point is that if you are in an organization where you are creating closed source code, or you're not able to share your code, there is no requirement that your operator be open source.
You could create a closed source operator and run that.

The other thing you'll want to do is start automating your image builds, so that every time you update your operator and your testing passes (you are doing testing, right?), you're creating new images and automatically uploading them.

Continuous testing is where things get difficult, because you can imagine, for something like Presto, you saw that it deployed a lot of pods and there was a lot going on. So testing can be really hard in this respect. You could do unit testing for the small pieces: if I ask for a Presto, does it actually make the pods that I think it should? But then the next part is: how do I test that the Presto it deployed actually works? That's very specific to your organization and to the code that you're creating, because you might need data to query, and you need something that injects the queries. You can see this is where things get complicated.

The last part is that you can upload your operator information to the OperatorHub, and this is why I led with the Operator SDK: its tooling creates the files necessary to make this really easy. You have a definition file, the ClusterServiceVersion, that describes what your operator is, where the image comes from, and some bits and pieces of how it can be searched for; I think you can encode an image in there, too,
so that it shows the icon for your project.

Now, just to talk a little bit about the tooling that we use: when you're testing an operator, you may not necessarily want to deploy your own Kubernetes to AWS or something like that. So I wanted to highlight some tools that you can run on a, by today's standards, modest laptop with 16 gigs of RAM, depending on how complex your deployments are. There's Minikube, which is very popular; it uses virtual machines to spin up Kubernetes on your machine. There's another project called kind, which is Kubernetes in Docker: if you just have Docker on your machine, you can use kind to deploy Kubernetes as containers and interact with it that way. And the last one I'm highlighting is CodeReady Containers. This is a project from, in my mind, some absolutely brilliant engineers who have taken all the tooling that goes with OpenShift and put it together as an installer that can create an OpenShift instance on your laptop. OpenShift is a little different from Kubernetes in that it has a lot more health checking and autoscaling built in, so it needs a somewhat more robust deployment to really get the most out of it, and CodeReady Containers has been a wonderful way for me to interact with that.

If you're really curious about Kubernetes operators and you want to learn more, if this little appetizer was not enough, there's actually a workshop in a couple of hours. Go learn how to make an operator; they'll get you up and running, and you can really get into it and figure out how to do this.

I'll leave this slide up for a second; these are a bunch of useful links.
The first one is the CoreOS discussion of operators and the philosophy behind the operator pattern; there isn't necessarily an API for operators as such, but you can learn about how they work. The next one is the Operator Framework and the Operator SDK, which is where you go to get all that tooling. Then the Kubebuilder book, the OperatorHub, the JVM operators, and lastly some client libraries.

So, Karol, join me again. The point here is that Starburst and Red Hat are both hiring. Yes, so if you would like to work on an exciting project, we have a lot of opportunities. We will be building a modern platform based on Kubernetes, and we also have positions to work on the Presto project itself. We really have top people in the company, and we have offices in Warsaw, in Boston, and also in California. So yes, we are hiring. One more thing worth mentioning is that we recently raised what I think is one of the biggest Series A rounds ever for a big data company, so that might also be a recommendation.

And with that, we say thank you, and I guess we have some time for questions. Any questions? So, the question is: can Presto be used as a way to migrate from Oracle to other databases? I think Karol knows the answer to that. From Oracle to Presto, or from Oracle to something else? Yeah, as long as there are connectors, you can migrate your data from Oracle to something else, and you can also have a consistent view during the process, for example.
So we can have some tables here and some tables there. We've seen cases, actually in the big data space, where big data databases are being migrated from; for example, we've seen customers migrating from Teradata, wanting to offload their data from Teradata to Presto. So this is one of the patterns that we observe.

So the question was: what is the difference between the Starburst distribution and the open source distribution? I think some of the key aspects are that we provide the Ranger and security extensions, which are enterprise components, and those are often the enablers for customers to use Presto at all. Another thing is that, with Ranger for example, we can be a drop-in replacement for an existing Hive setup; Hive also uses Ranger, and if you use Ranger with Presto and the Hive connector, then you can enforce the same security policies. Apart from that, we give you additional connectors, like the Oracle connector; there are also connectors for other databases, like Teradata and Snowflake. Another thing is that we also have some performance improvements; for a long time the cost-based optimizer (CBO) was only in our Starburst release, and only later did it land in open source Presto SQL. We also give you the vehicles to deploy Presto, so you can deploy Presto using Kubernetes, but you can also deploy Presto on AWS with CloudFormation templates. So we give you this seamless experience, and this will continue to be more important as we go forward: we will build more around Presto so that, for example, you won't need to set up Ranger yourself, but we will have a policy engine integrated, which lowers the barrier even more for customers. Another thing is that we provide support, and it is often very important to have a supported solution.

So yeah, the question is, in the context of the operator
on Kubernetes, what is actually doing the work to spawn and scale out? Is it the operator, or is it actually Presto? So it's kind of mixed, because if you scale down, for example, you don't want to terminate existing queries, so we have an integration with Presto itself and we do a kind of graceful scale-down. Another aspect of this integration is that we do auto-configuration within Presto: when the pod is started, we try to set Presto's properties so that they match the pod capacity, like memory or CPUs. The operator also translates the spec properties from the operator into Presto's properties, so it provides a nice front end for Presto.

I expect there will be even tighter integration in the future between Presto and the operator. We are planning to support things like blue-green deployments and autoscaling, and then there will be parts where Presto would need to be adjusted or extended for a custom scheduler or custom autoscaler that would trigger some actions around that, like telling it whether we need more nodes or not. Maybe there will also be some required things around query dispatching if you have multiple clusters, right? So this will be more tightly coupled in the future. Thank you.
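The auto-configuration idea above can be sketched roughly as follows. This is an illustrative function, not the operator's actual code, and the fractions used to split pod memory between the JVM heap and per-query memory are made-up defaults for the sake of the example:

```python
def presto_memory_config(pod_memory_gb: int,
                         heap_fraction: float = 0.8,
                         query_fraction: float = 0.5) -> dict:
    """Derive Presto memory settings from a pod's memory limit.

    The operator reads the pod capacity and writes matching Presto
    properties; the fractions here are illustrative assumptions, not
    the real operator's tuning values.
    """
    # Leave headroom below the pod limit for non-heap JVM memory.
    heap_gb = int(pod_memory_gb * heap_fraction)
    # Cap per-node query memory at a fraction of the heap.
    query_gb = int(heap_gb * query_fraction)
    return {
        "jvm_heap": f"-Xmx{heap_gb}G",
        "query.max-memory-per-node": f"{query_gb}GB",
    }

# Example: a worker pod with a 32 GB memory limit.
print(presto_memory_config(32))
# {'jvm_heap': '-Xmx25G', 'query.max-memory-per-node': '12GB'}
```

The same pattern applies to CPU: the operator inspects the pod spec and renders Presto configuration files so that users never hand-tune them per node.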