So hi everyone, I'm Aditi Pandit, I'm a Principal Engineer at Ahana, an IBM company, and I'm going to talk about Prestissimo, which is Presto's next-gen query execution engine. To give a bit more background on Ahana: Ahana used to be a startup, and a lot of Ahana was about being the Presto company. What that meant is that many people from Ahana had leadership roles in the Presto Foundation, many were committers, and the product itself was a managed service for Presto on AWS. As of two months ago we were acquired by IBM, and now the Presto engine is heavily used in IBM's watsonx.data platform for the open data lakehouse, which was announced a few days ago, and Ahana and Presto are a huge part of it. So we're very excited about the IBM acquisition. IBM has a long history with open source, going back to the Eclipse Foundation days, OpenShift, OpenJS, and now they've joined the Presto Foundation, so we're very excited about what we can do together. Back to the talk: I'll give a very quick introduction to Presto, then go over the motivation for Prestissimo and what it involves, the components and the architecture. The query processing in Prestissimo is built over a library called Velox. You can think of Velox as a library of building blocks, or data processing primitives; it uses key ideas from vectorization and other techniques to provide accelerated components, and these components can be reused across different data engines. So there is a bigger community around Velox than just Prestissimo, even Spark and other engines use it, and I will close with some information about that broader community. Starting with Presto: what is Presto? Presto is a SQL query engine for data analytics and the open data lakehouse. The Presto project came out of Meta, which is probably the biggest installation of Presto, but it's also used at Uber, ByteDance, and others. To give an idea of the massive scale we're talking about: clusters of 10K-plus compute cores, millions of queries a day, 100 million queries a day at Uber, a 300-petabyte data lakehouse, 50 petabytes of HDFS bytes read every day, and thousands of active users. Every release of Presto comes battle-tested in this environment, so it's a very good query engine to have in your data platform, and of course the latest entry here is going to be IBM watsonx.data. The project started at Meta, originally for high-volume interactive queries, but since then it has been expanded to also work for batch workloads. There's an ETL-style Presto-on-Spark offering that combines the best of Presto and Spark, and Presto is used heavily in reporting and dashboarding platforms; that's what most of Ahana's customers had. Presto has a very nice architecture with a very good connector API, so it is used for federated queries: you can have multiple data sources in a single query that are actually different storage systems, combine data across those systems in one query, and get results. All of this lends itself very well to the open data lakehouse world.
Presto has a very vibrant, fast-growing open source community: thousands of downloads and some very big names involved, like Meta, Uber, Alibaba, Twitter, and ByteDance, among others. Intel is also heavily associated with the project, and they help us with a lot of benchmarking and the like. So Presto is a very good, heavily used open source query engine. So what was next for Presto? That's the Prestissimo project. It's the latest innovation in Presto, and it goes to the very core of the data processing and query processing, so I want to put some context around what happens before you get down to the query processing part of the stack. To set the context: where does Presto fit into your data platform? At the lowest level you have the storage, so your data could sit in S3 or Hadoop, there are SQL stores like MySQL and NoSQL stores like MongoDB. At the topmost layer you have your clients, like Looker, Superset, Jupyter notebooks, or Tableau, submitting queries to the Presto cluster. The Presto cluster does the query processing using two varieties of nodes: a coordinator node and multiple workers. The coordinator is where query processing starts; it receives the query, parses and optimizes it into a query plan with plan nodes and operators, and then the plan fragments are submitted to workers, which do the actual data processing, and results are returned back to the coordinator, which returns them to the client. As an example I've taken this simple query, a modified TPC-H Q1: you read from the lineitem table, there's a predicate on the ship date column that restricts the data to about three months, then the data is grouped by return flag and line status to compute some aggregates, and all the aggregated rows are returned to the user. What the coordinator does is receive the query and transform it into a logical plan: a scan-filter-project that reads from the lineitem table at the lowest, leaf level; the results from the scan are sent to a hash aggregation node that does the grouping and aggregation; and the results of the hash aggregation feed into a sort node that sorts the results by return flag and line status and returns them to the user. That is just an abstract query plan; the optimizer then transforms it into a plan with stages and fragments that are actually executed on the workers. So this is what the query execution plan looks like for the above query. It has four stages. At the lowest level, stage three is the scan, filter, and partial aggregation: the scan-filter reads from the lineitem table and does a partial aggregation. This feeds into the next stage, which does the final aggregation. That feeds into stage one, which does the sorting, and then the results are emitted to the output. So your query processing happens in these fragments, called stages, and the stages are scheduled by the scheduler in the coordinator onto different workers. What a worker gets is just a single fragment plan, which could be a leaf fragment or an intermediate fragment, and the worker does all the query processing on that data.
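To make that fragment structure a little more concrete, here is a rough sketch of the leaf-plus-aggregation pipeline for this kind of query, written against the PlanBuilder utility from the Velox C++ library that I'll get to later in the talk. The column names, the ship-date literal, and the exact builder methods are approximations from memory and shift across Velox versions, so treat it as an illustration of the plan shape rather than exact working code.

```cpp
#include "velox/exec/tests/utils/PlanBuilder.h"
#include "velox/type/Type.h"

using namespace facebook::velox;

// Sketch of a scan -> filter -> partial agg -> final agg -> order by pipeline,
// roughly mirroring the modified TPC-H Q1 fragment described above.
core::PlanNodePtr buildQ1LikePlan() {
  // Columns we pretend the lineitem scan produces (names and types are assumed).
  auto lineitemType = ROW(
      {"l_returnflag", "l_linestatus", "l_quantity", "l_shipdate"},
      {VARCHAR(), VARCHAR(), DOUBLE(), VARCHAR()});

  return exec::test::PlanBuilder()
      .tableScan(lineitemType)                           // leaf: read lineitem
      .filter("l_shipdate <= '1998-09-02'")              // predicate on ship date
      .partialAggregation(
          {"l_returnflag", "l_linestatus"},              // grouping keys
          {"sum(l_quantity)"})                           // partial aggregate
      .finalAggregation()                                // final aggregation
      .orderBy({"l_returnflag", "l_linestatus"}, false)  // sort for the output
      .planNode();
}
```

In a real deployment the partial aggregation, final aggregation, and sort would sit in separate stages connected by exchanges, as in the four-stage plan above; they are collapsed into one pipeline here just to show the operator chain.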
So when you put this into a Presto cluster, this is what it looks like: you have the coordinator node, which has the parser, optimizer, and scheduler, and those stages and fragments are sent to different workers for data processing and execution. The original Presto was all written in Java, which meant that the coordinator and the workers were both Java processes. And while Java is an okay choice for the coordinator, at the worker there were many operational difficulties. A lot of them came from the garbage collector, because it introduced pauses in the running of the system, and those pauses meant that external users saw cliffs in performance. From an engineering perspective, there was much more you could do with other languages like C++, so that was seen as a limiting factor for Presto, and that led to the Prestissimo project. So what is Prestissimo? Prestissimo is a new C++ Presto worker. It is pretty much a drop-in replacement for the Java worker: the Prestissimo C++ worker implements the original worker's HTTP interface, so it is the same API. The coordinator sends it plan fragments, and the worker executes the plan fragment, does all the data processing, and sends results back to the coordinator or to other workers, depending on whether an exchange is in the picture. It uses the same exchange wire protocol to serialize and deserialize data on exchanges, so the data format is all the same, the query processing is the same, and the status-reporting endpoints all look very similar to the Presto Java worker. A Prestissimo cluster looks like pretty much the same Presto architecture, except that the Presto Java worker is substituted with a Prestissimo C++ worker; if you look at the diagram, all these green blocks are Prestissimo C++ workers, and the protocols are all the same. The Prestissimo C++ worker, like I mentioned before, is built on top of the Velox library, which is a library of reusable data processing primitives. The Velox library is pretty sophisticated: it has extensive use of SIMD, a lot of runtime optimizations, features that help achieve memory arbitration between query operators and queries, and a lot of smart strategies around IO prefetching and caching. That lends itself to a much more sophisticated, performant, and operationally very good runtime. (An audience member asked whether that choice is left to the team deploying the cluster, since you could put in a mixture of workers and different connectors.) It does in theory; we've tried some things of that sort, but yes, in theory it does. There are some other designs we've explored as well, so we didn't really go down the route of mixing Java and C++ workers. On connectors: there is a Hive connector, and there are some other connectors. There is a connector API and people can write connectors; there are a bunch of them. Hive is the biggest one, and there are some simpler ones we have done for TPC-H data generation and the like. Hive can be reused across most of these lakehouse setups. And depending on what we're giving customers, Ahana for example even had a BigQuery connector, we will consider building the connector in C++ or reusing the Java one in some way.
So yeah, the benefits of Prestissimo: we see a huge performance boost, and the same query processing can be done with much smaller clusters. In terms of operational behavior, it avoids the performance cliffs: there are no Java processes, no JVM garbage collection, and memory management and SIMD are all under explicit control, which improves efficiency and makes it easier to build and operate at scale. And because of the use of Velox, the primitives are reusable and extensible, so the optimizations are shared across multiple engines and performance is better understood. To put some context on the kind of results we're seeing: this is a relatively old result, from two months ago, because it was run on Ahana clusters while we were still independent. I'm using the TPC-H benchmark here. The hardware was the Ahana managed service on AWS; the data is Parquet, uncompressed and partitioned, on S3, and the cluster was made up of AWS xlarge-class nodes: one coordinator, four workers, and 128 gigs of memory. We ran all 22 read-only queries and did multiple runs, so the numbers are aggregated across those runs; to get warm runs we did six warm-up runs and three measured runs. And we see really encouraging numbers. In terms of the power metric, the throughput of Prestissimo is much improved; for throughput, higher is better, and we see a 4.3x improvement. In terms of total elapsed time, the queries run much faster, and we see an improvement of 2.6x. In terms of total CPU time, where lower is better, we saw an improvement of nearly 3x, so at best you can run a cluster with one third the number of CPUs and still run the same set of queries. The memory usage is also better; we see 1.3x. These are the per-query results, and it's mostly all green across elapsed time, CPU time, and memory usage. Q15 was the only one that's red, and we understand what was limiting there and have fixed it. So that was Prestissimo and the kind of improvements we are seeing. The Prestissimo architecture as such is just a wrapper over the Velox library: we take the Presto plan fragment, convert it into a Velox plan, set up the exchanges, and hand the plan over to what is called the Velox driver for execution, and Velox does the rest. At this point I'm going to switch completely to Velox, the open source library, and what it is. Velox, again, came out of Meta, and last year there was a VLDB paper which goes into how Velox is used in all the different systems at Meta. The key idea behind Velox was that the Meta data platform had gotten pretty extensive, so there were multiple data engines in there: Presto, Spark, a streaming engine, engines for machine learning workloads like feature engineering, and so on. And all these data engines, at a high level, have a very similar architecture: you have some kind of query language, SQL in our case, that is parsed and optimized by a coordinator into some kind of intermediate representation, and then you have some kind of fragment plan that holds the data processing and is run in some data processing runtime. So all these engines had very similar architectures, and each one was rewritten in a different language and maintained by a different team.
And so that led to this whole big ecosystem of engines, and sometimes the knowledge was just not reusable. What the architects saw was that, at least for the execution part of the data processing, there was a lot of scope to have a common library, built using all the sophisticated techniques of vectorization and a sophisticated runtime, and to reuse it across all the different engines. That makes for a much better experience, because the primitives are the same, so the behavior of all the data processing is similar and well tuned across all these engines. And that's where Velox came in. So what does Velox look like? It's not an engine, it's a library. Similar to Presto plan fragments, you give it a graph of plan nodes; the plan nodes look fairly similar to Presto plan nodes, but other engines like Spark use similar plan nodes too. The API consists of this plan node tree and a driver: the user of Velox gives the driver a plan node tree and an executor, a CPU executor, and asks it to run the data processing. The first step Velox takes is to convert the plan nodes into operators, and the operators are the actual execution: the operator does the actual data reading and data processing. So in my previous example, the fragment had a table scan node, an aggregation node, and an order-by node; Velox translates this into a table scan operator, a filter-project operator, a hash aggregation operator, and an order-by operator. And if you dive into what is happening in the Prestissimo worker: the worker got this task, and because your data is well partitioned, you can exploit the parallelism by duplicating that operator pipeline in multiple drivers. Each driver pipeline is one thread of execution, but you can repeat it for different partitions within the same task. So the Velox driver takes the operator pipeline and, depending on the desired parallelism, repeats the pipeline across drivers. The driver is a single thread of execution of the operator pipeline. The driver loop uses the executor to get CPU, to get the threads it needs for execution, but essentially it is given an operator pipeline and it runs all the data transformations across it. In the diagram, data flows from bottom to top, with processing done in three operators. What the Velox driver tries to do is get data from the topmost operator that has data available: it keeps calling getOutput on the operators, and the topmost operator that has output rows returns those rows, which are either put on the output, onto the network, or fed into the downstream operator for it to continue its processing. So that's the orchestration of the data flow.
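To illustrate that pull model, here is a small conceptual sketch. This is not Velox's actual Operator or Driver code; the types and method names are made up for illustration, and a real driver also handles operators that are blocked on input or I/O, yielding the thread back to the executor, and so on.

```cpp
#include <memory>
#include <vector>

// Conceptual sketch only -- not Velox's actual Operator/Driver classes.
// It illustrates the pull loop described above: keep asking the operators,
// starting from the one closest to the output, for a batch of rows, and
// push batches produced by one operator into the next one downstream.
struct RowBatch { /* columnar vectors would live here */ };

struct Operator {
  virtual ~Operator() = default;
  virtual void addInput(std::shared_ptr<RowBatch> batch) = 0;
  virtual std::shared_ptr<RowBatch> getOutput() = 0;  // nullptr if nothing ready
  virtual bool isFinished() const = 0;
};

// One driver = one thread of execution over one copy of the operator pipeline.
// operators[0] is the source (e.g. table scan); operators.back() feeds the output.
void driverLoop(std::vector<std::unique_ptr<Operator>>& operators) {
  while (!operators.back()->isFinished()) {
    // Walk from the operator closest to the output toward the source and take
    // a batch from the first operator that has output available.
    for (int i = static_cast<int>(operators.size()) - 1; i >= 0; --i) {
      if (auto batch = operators[i]->getOutput()) {
        if (i + 1 < static_cast<int>(operators.size())) {
          operators[i + 1]->addInput(std::move(batch));  // feed downstream operator
        } else {
          // Last operator: hand the batch to the output buffer / exchange.
        }
        break;
      }
    }
  }
}
```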
But most of the IP in Velox is in all these building blocks. In terms of concepts, Velox is very much vector based: all your data processing happens on vectors. The vectors can have different encodings, so you have constant vectors, flat vectors, and dictionary encoding, which is the most sophisticated and performant. Your operators work on these vectors and do the processing that's needed. A lot of the processing is in tight loops that exploit CPU instruction pipelining. All the vectors are allocated from memory pools, and the memory pools are hierarchical across tasks and operators. A lot of the Velox code optimizes for zero copies in the processing, and vectors are reused across operators, so this gives very fine-grained tracking of memory usage, and vectors can be as big as needed. The buffer pool is based on the Umbra design, which is pretty state of the art. Because of this vector representation and the dictionary encoding, you can do very sophisticated processing. For expressions, if you know that your vector is dictionary encoded, it has only so many distinct values, and all your rows are just indexes into those distinct values; so you can compute an expression only on the distinct values, if your expression operates that way, and nulls can be handled in one go. There's a lot more detail in the slides, but in the interest of time I just want to give a high-level idea of the concepts in Velox. Velox also has building blocks with many runtime optimizations, and one example I wanted to give is the hash table. The hash table is used in the aggregation operator, the hash join, and window operators. The kind of sophistication possible here is that, because you're processing these vectors, you know a lot of stats about their contents: how many distinct values there are, how many nulls there are. And based on the distribution of data in the vectors, you can adapt your hashing strategy to the key distribution. So in this example, you start with just an array-based lookup if you know you have fewer than two million entries, and move to normalized-key hashing or a full-blown hash function based on the kind of data you see at runtime. Velox also has a spiller framework. In your hash aggregation, or during a hash join, you need to gather all your data and hold it in memory before the operator can actually generate any output. What the spiller framework lets you do is this: if that data is very big, and each partition of the data you operate on is also very big, you're not going to be able to hold everything in memory, but you can spill to disk. At runtime you can ask a partition to spill to disk, and what that does is sort and append its data into spill files, and your processing can just proceed. The partitions that fit in memory you simply process, and once you're done with what fits in memory, you go back, read the spilled partitions from disk, and continue processing.
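Going back to the dictionary-encoding point for a moment, here is a tiny conceptual sketch, in plain C++ rather than Velox's actual vector classes, of why an expression over a dictionary-encoded column only needs to touch the distinct values; the struct and function here are made up purely for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Conceptual sketch (not Velox's vector classes): a dictionary-encoded column
// stores the distinct values once plus an index per row. An expensive expression
// can then be evaluated once per distinct value instead of once per row.
struct DictionaryColumn {
  std::vector<double> dictionary;   // distinct values
  std::vector<int32_t> indices;     // one entry per row, pointing into dictionary
};

// Evaluate f(x) = exp(x) over the column, touching each distinct value only once.
std::vector<double> evalExp(const DictionaryColumn& col) {
  std::vector<double> perDistinct(col.dictionary.size());
  for (size_t i = 0; i < col.dictionary.size(); ++i) {
    perDistinct[i] = std::exp(col.dictionary[i]);   // expensive work: |dictionary| times
  }
  std::vector<double> result(col.indices.size());
  for (size_t row = 0; row < col.indices.size(); ++row) {
    result[row] = perDistinct[col.indices[row]];    // cheap lookup per row
  }
  return result;
}
```

If a column has a few thousand distinct values but millions of rows, the expensive part of the expression runs a few thousand times instead of millions of times, which is exactly the kind of win these encodings are after.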
And when you have a spilling framework like this, it means you can do something sophisticated like memory arbitration. If you have an operator that's blocking you because it's hogging all the memory, say an upstream operator is being held back because a downstream operator is hogging the memory, then your query doesn't make progress. But with memory arbitration you can reclaim memory from some operators and give it to other operators so that the query processing advances, and this memory can also be shared with other queries to improve overall usage. Velox also has a sophisticated IO subsystem: there is an async data cache with optimizations around prefetching, coalescing read requests based on runtime behavior, and a lot of other nice things. (In response to an audience comment:) Well, yes, but it's a trade-off between getting the query to complete and the performance. So this Velox library is a very generic framework that can be used across multiple data engines. It has been in the open source for a few years now, two or three years, and there are other data engines in the open source that have been built using Velox. The big one here is Gluten, a Spark SQL engine built by Intel, and they talk about it in other forums. The other two big ones are PyVelox, Python bindings for Velox done by Voltron Data, and TorchArrow, a library from Meta for its ML workloads; that's also out there in the open source. So that's what I had; please join our community, the Presto community and the Velox community, and I have all the links for GitHub and Slack. (Question: how does that work in terms of the operating system and process architecture? Are there different binaries for different Linux distributions, different binaries for ARM versus x86?) Yes, ARM versus Intel are different compilations, different binaries. Based on your deployment you build the library accordingly, and Prestissimo just links against that library, so you have different binaries for the different platforms, because the SIMD libraries are very different depending on the architecture. (Question about how Parquet statistics are used across the C++ and Java runtimes:) If I understand the question right, you're asking: say I'm reading a Parquet file that already has statistics, and there are the C++ and Java runtimes, how do they use them? I can speak to what's done in the C++ Parquet reader: it has a lot of optimizations to exploit those statistics if they are there. If you know the min and max column values, and you have a filter, and you know that you're going to filter everything out, you just don't read that row group. The runtime has those kinds of optimizations. The Velox library has a Parquet reader which does all of this; we re-did the Parquet reader because we wanted those fancy numbers, and a lot of the improvement in those numbers came from the Parquet reader. (Question: do you actually publish the performance profile of the difference between C++ and Java, showing where the time is spent and where the advantages come from, or do you just attribute it to not being Java, not doing garbage collection, and having a smart library with vectors?)
Yeah, so Ahana didn't do that, but I believe Meta did, because they had to make a case for this: they have two clusters, Java and C++, run the same workloads, and can demonstrate the behavior. (On how the benchmark data was produced:) We first ran queries to generate the TPC-H Parquet data, and that was written out to S3 from our cluster; we wrote it uncompressed and partitioned onto S3, that was done in a separate set of queries, and then the read queries just read what we had written. There's much more you can do with it, but this was about demonstrating the numbers at this cluster size. (Question: have you got results for compressed data, or with optimized partitions, to see what the difference would be when the file formats are optimized?) Because we wrote a new Parquet library, I don't think we handled all the compressions, and in terms of partitioning we didn't do a study, because this was in March, we got acquired, and they're working on more. (It would be interesting to see whether it shows the same improvements in those sorts of cases.) Yes, that's interesting. (On SQL coverage:) Right now we have a Hive connector, and we haven't done the full SQL spec yet. We want to get to the full TPC-H and TPC-DS benchmarks, but we don't have all the query processing for TPC-DS. It does a lot of it; there are some things that were left, like the decimal type and the char type, so there's a tail end of things that need to be done, but it's pretty far along. The Hive connector is pretty much feature complete. What is not feature complete? There are just a few things: we don't change the optimizer here, and the optimizer generates some operators for more optimal plans, like the mark-distinct node, and those operators need to be implemented in Velox. (On running the full TPC-DS benchmark:) I think ByteDance did do the whole TPC-DS run, but there are some parts of it that still need to make it into open source.