Okay, shall we start? Okay, so my name is Andrii Rosa. I'm a software engineer working for Starburst. I'm also a Trino maintainer and a Presto committer, and today we're going to discuss Trino as an engine for large-scale batch and ETL processing. In this presentation we're going to provide some insight into what Trino is and how it is related to Presto. We're also going to take a quick look at the history of Presto/Trino as an engine for large-scale batch ETL processing and dive deeper into some of the challenges associated with this type of workload. We'll discuss the new execution capabilities we are adding to Trino that are designed to address the challenges of batch ETL, cover the status of ongoing development — what we've already built and what we are planning to build in the near future — and leave some time at the end for questions.

First of all, for those who are not familiar with Trino, let me give you a very high-level overview of what Trino is. Trino is a distributed data processing engine that allows you to process large data sets distributed over one or more heterogeneous data sources, and it provides an ANSI SQL-compatible interface for doing computation. In other words, Trino is a data lake engine, so it supports all modern data lake formats, such as Delta Lake, Iceberg, Hive, and Hudi. It is also a query federation engine, so it allows you to query data from more than 40 popular data sources, such as MySQL, Cassandra, Apache Pinot, Apache Kafka, and many, many more. And it's also formerly known as PrestoSQL.

Before we dive deeper into the history of Trino, I think it's worth mentioning its direct predecessor. Hive was the first attempt to provide a SQL-compatible interface for large-scale data processing. It was built on top of Hadoop, an open-source implementation of MapReduce. The introduction of Hive greatly simplified the development process for large-scale data processing: with Hive, it was no longer necessary to write, compile, and deploy Java code every time you needed to do some analytics or exploration. What the user ended up writing instead was just a simple SQL-compatible query. However, Hive was still running on a MapReduce platform that was mainly designed for long-running batch processing, so it was fairly slow for interactive workloads, and the actual experience was nowhere near interactive. Querying in Hive was a very tedious and time-consuming process. Basically, how it usually looked: you submit a query and then you wait for a notification, in the form of an email or something else, telling you that your processing is done and you can go and check your result. And if there was any error or any problem, you had to start this process over again. A single iteration could take hours, depending on the load of the Hive cluster. There was this anecdotal observation from data scientists at Facebook, where Hive was initially introduced: in 2012 they were able to execute only around six queries a day on average. So it wasn't really convenient for interactive exploration, debugging, and building pipelines. It was clear that a better solution was needed for interactive querying, and in 2012, Martin, Dain, David, and Eric started working on Presto, which effectively provided an alternative to Hive for interactive querying at Facebook.
Some of the key points of the development philosophy were the open source model — the project is 100% open source — and a lot of effort was put into making the project just work: it doesn't have any external dependencies such as a MapReduce framework or anything else; you just deploy it and it runs. It was designed for fast and interactive analytics, and a huge effort was put into making it strongly adhere to open standards like ANSI SQL, JDBC, and so on. So pretty much everybody who knows SQL and knows how to work with a JDBC-compatible database was able to run their analytics.

In 2013, Presto went live in production for interactive use cases. Soon after, engineers outside of the Presto team built an integration with the batch processing platform and started submitting batch ETL queries to Presto as well. At first it might have seemed a little counterintuitive: why would somebody prefer Presto, which is designed for interactive use, to run any kind of batch processing? There were of course other engines like Hive and Spark that were specifically designed for batch processing and designed to be reliable and so on. There were mainly two reasons. The first was that Presto was significantly faster. And there was a second, I would say no less important, reason that I'm going to talk about. But before I describe that second reason, I'd like to walk you through an anecdote from my life working as an engineer on Presto.

Very often, if I wanted to debug some problem or better understand a slowdown in query processing, I had to pull up runtime information about a specific query. For example, a user reaches out to us and says their query is running slow or is failing: "Here is the query ID, could you please go and figure out what's going on?" We actually had this data set called Presto query events. It had a whole bunch of runtime information about all queries that were running in the cluster. So what I usually did was query this data set by query ID to extract runtime statistics, the query plan, and whatnot. These queries were usually very, very fast — maybe a second, maybe a couple of seconds — so we were able to iterate quickly. But sometimes the information in that table wasn't enough, and we needed to look at additional information provided by the systems we integrated with, like the metastore or, say, the storage subsystem. How did we usually do this? We wrote a join based on some trace token that joined these data sets and provided a consistent view of a specific query. For example, by looking at this data I was able to understand that for this query the calls to the metastore were slow, or maybe there was some problem with the storage subsystem. As we joined more and more of these different data sets to get more information about a query, the runtime grew significantly. Where the original query ran in only a couple of seconds, a query joining all these data sets could run for something like 15 to 30 minutes. That was no longer interactive, and it was not as easy to iterate fast when your query runs for 15 minutes. So what we usually ended up doing was creating a scheduled ETL job that ran once a day or once an hour, depending on the data freshness requirements.
This job joined all these data sets and created a single table that had all these events grouped by query ID. Then we were able to query this extended data set with the same latency of a couple of seconds. But converting an interactive query — let's say we have already built an interactive query that joins all these data sets and we decide to create a scheduled pipeline — converting this query from Presto to Hive was full of challenges. You had to convert the SQL, because the SQL semantics differ slightly. You had to convert UDFs; sometimes the exact UDF wasn't available, so you had to find a workaround. You needed to run it several times and verify the output data set to make sure you are getting the results you expect. And often, if you needed to query some other data source such as Cassandra or MySQL — basically anything that is not Hive — it wasn't even possible to convert it, because Hive may not have supported that specific data source. So this solution came about pretty naturally: you just copy your interactive query, paste it into a batch processing tool, and you have a pipeline. That integration was done not by the Presto team but by data engineers who basically wanted exactly this. So the second reason people wanted to run their batch ETL queries in Presto, in addition to Presto being much faster, was a single unified SQL interface.

However, very soon we realized that the ETL workload comes with its own challenges. Speaking very generally, ETL queries tend to be much larger than interactive queries. For example, people tend to create some very large dimension or derived tables. Very often these queries use a lot of memory because of the very large aggregations and joins involved. Some queries require a lot of CPU time for processing, because sometimes people end up transforming a less efficient file format into a more efficient one, so they end up scanning and transforming very, very large tables — hundreds of terabytes, maybe even petabytes. These queries tend to be very long running in certain cases. One more challenge we discovered is resource management: what if the scheduling system submits multiple memory-intensive or long-running queries concurrently? The scheduling system may not really know the resource requirements of these queries, so it was very easy for it to overload the cluster.

Also, to better understand why these large queries are challenging in Presto, it is important to understand Presto's execution model. In Presto, queries are split into multiple stages by the planner, and each stage has a whole bunch of tasks processing their own chunks of data. The data exchange between tasks in Presto is performed in a streaming way, so basically all the tasks are required to run at the same time. This execution model allows you to achieve very low latency and introduces no overhead — there is no checkpointing involved, nothing, just network exchanges between tasks.
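To make that stage structure a bit more concrete, here is a small sketch — assuming a Trino coordinator at localhost:8080 with the built-in TPC-H catalog enabled, which is my example setup rather than anything from the talk — that uses the Trino JDBC driver to print the distributed plan of a simple aggregation. Each plan fragment in the output corresponds to one stage whose tasks are connected by these streaming exchanges.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowDistributedPlan {
    public static void main(String[] args) throws Exception {
        // Assumes a coordinator at localhost:8080 and the built-in TPC-H catalog.
        String url = "jdbc:trino://localhost:8080/tpch/tiny";
        try (Connection connection = DriverManager.getConnection(url, "demo", null);
             Statement statement = connection.createStatement();
             // EXPLAIN (TYPE DISTRIBUTED) prints the plan fragments (stages)
             // that the scheduler turns into parallel tasks.
             ResultSet rs = statement.executeQuery(
                     "EXPLAIN (TYPE DISTRIBUTED) "
                             + "SELECT orderpriority, count(*) FROM orders GROUP BY orderpriority")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```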
But at the same time, with this streaming execution model a query is required to be executed all at once, in an all-or-nothing fashion, and this all-at-once aspect imposes very significant restrictions. For example, since all tasks must be running concurrently, it increases the amount of memory that must be available in the cluster for processing. It also makes it impossible, or very difficult, to tolerate failures: in case of a node crash, or maybe a very long GC pause or some other problem, if any of the nodes in the cluster crashes, that pretty much wipes out the entire query, because all tasks are interconnected, so as soon as one task is lost, the entire query is lost. And resource management was another big challenge: if the cluster is overloaded, it is not possible to suspend a query and resume it after resources become available. Instead, if the cluster is overloaded, we had to kill one or more queries to let the other queries succeed, and this could be really wasteful, assuming these queries are resource intensive.

The solution to this problem was to pretty much deploy very, very large clusters. We used to run 1000-node clusters with 256 GB of memory in each node, which makes it far less likely for the cluster to run out of memory. Each of these nodes had a lot of processing power, which reduced the number of queries that ran for a long time. And for resource management, what we figured out is that we can limit a single query to utilize at most 5-10% of the available memory in the cluster, and with that granularity we were also able to build a resource prediction system and resource-aware admission and preemption policies based on predicted resources. Basically, the idea was to keep queries relatively small with respect to the available resources and manage resources on a per-query basis. And with this model we were able to scale pretty well: we were able to support queries that need up to 10 terabytes of distributed memory, queries that run for several CPU-years, and we were also able to support quite high concurrency, with up to 80 queries running on the same cluster.
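Just to do a quick sanity check on those numbers (my own arithmetic, not from the slides): 1000 nodes times 256 GB is about 256 TB of memory in the cluster, and 5% of that is roughly 12.8 TB, so capping a single query at 5-10% of cluster memory lines up with being able to support queries that need on the order of 10 terabytes of distributed memory.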
So up until now we've been mostly talking about Presto, but the presentation says Trino, so you might be wondering: what is Presto, what is Trino, and how are they related? Presto was originally created by Martin, Dain, David, and Eric at Facebook in 2012, and in 2018 the original founders decided to leave Facebook and focus on building a community-centric project. They created a fork called PrestoSQL, and later on we rebranded it to Trino, but Trino at its core is still the same project, led by the same core engineers. At some point I also decided to leave Facebook and join the founders at Starburst to work on Trino. And when I joined Starburst, I was quite surprised to learn that Trino is not very widely adopted for batch ETL in the broader community. So we were trying to understand the hesitancy — what are the problems? Together with Brian John, who also happened to be my former Facebook colleague, we conducted a number of interviews with the Trino community as well as with Starburst customers to better understand the hesitancy, what problems people are facing, and why they prefer other engines to run ETL. We ran a bunch of these interviews, and then we started identifying common patterns, and those were pretty much the same problems. People mentioned memory-constrained queries: they need to scale their cluster up just to support a single memory-intensive query in their ecosystem, which is not very cost efficient. They also mentioned long-running queries: sometimes a query runs for a very long time and then fails, they have to restart it, and that breaks the landing time expectations. Another problem that was mentioned is resource management: there was a strong expectation that the engine should handle resource management for them — people basically just want to submit their queries and not worry about overloading the cluster; the expectation was that the engine does that for them.

Unfortunately, the workaround of deploying a very large cluster and managing resources at per-query granularity, or simply providing more memory to address memory-intensive or long-running queries, wasn't really feasible here, because a lot of companies in the community run very small clusters — clusters of like 10 to 20 nodes — so deploying a cluster with a thousand nodes and running 80 queries concurrently wasn't even up for consideration. That being said, it is worth mentioning that some companies are already running Trino for ETL quite successfully. For example, there is a very good article from Salesforce about how they run their ETL on Trino. The idea is that they try to avoid large queries: they structure their workload so that all queries are approximately the same in size, memory-wise and CPU-wise, and are fairly short — basically up to about 10 minutes. But the problem with this approach of avoiding large queries is that you need a team of highly skilled data scientists and engineers to properly structure your workload, and some workloads simply cannot be broken down into smaller parts, because some algorithms are simply not incremental. Sometimes you really do need to run a very large query to compute, say, a yearly report of some kind.
So we set our goal to improve Trino to provide an out-of-the-box ETL solution. The goal was to provide the necessary execution capabilities to handle queries of any size, no matter how memory-intensive or CPU-intensive they are, and to provide resource management built into Trino that just works. Very soon we realized that the existing execution model, relying on streaming exchanges, is just not flexible enough to solve all these challenges: with the streaming model the entire query has to be executed as a single unit, preventing us from applying more advanced scheduling techniques. After careful consideration we decided to remove this limitation.

The idea is to introduce an exchange buffer in between stages, which allows each task to run independently. This buffer is fully pluggable and abstract from the engine's point of view, but the point is that there is now a buffer in between the stages that allows scheduling tasks as atomic, independent units, so it is no longer necessary to schedule the whole query as a single thing. Being able to run each task independently opens a lot of interesting opportunities. One of them is that it allows us to reduce the memory required to process a query. If you look at this query — let's say it has six different tasks that use from 17 to 30 GB each — with streaming you would have to run them all at once, and you would need at least 212 GB of memory available in your cluster for this query to succeed. But with the new execution model, since you can run one task at a time, you only need 30 GB of memory at peak for this query to succeed. It also allows us to implement fine-grained failure recovery. With streaming execution, since all the tasks are interconnected and have to run all at once, a failure of any of them results in a failure of the entire query, so the entire query has to be restarted. But if each task is independent, we can restart at the task level, saving a lot of resources.
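Going back to the memory example from a moment ago, here is a toy sketch that spells out the arithmetic (the per-task numbers below are made up for illustration, not the exact numbers from the slide): with streaming exchanges the cluster has to hold the sum of all concurrently running tasks, while with a buffer between stages and one task scheduled at a time the peak is just the largest single task.

```java
import java.util.List;

public class PeakMemoryComparison {
    public static void main(String[] args) {
        // Hypothetical per-task memory requirements (GB) for one query.
        List<Integer> taskMemoryGb = List.of(30, 28, 25, 22, 19, 17);

        // Streaming execution: all tasks run at once, so the cluster must
        // hold the sum of their memory requirements.
        int streamingPeak = taskMemoryGb.stream().mapToInt(Integer::intValue).sum();

        // Buffered (fault-tolerant) execution: tasks can run one at a time,
        // so the peak requirement is just the largest single task.
        int bufferedPeak = taskMemoryGb.stream().mapToInt(Integer::intValue).max().orElse(0);

        System.out.printf("streaming peak: %d GB, one-task-at-a-time peak: %d GB%n",
                streamingPeak, bufferedPeak);
    }
}
```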
Speaking of resource management, there are two important parts: one is how you estimate how much resources a certain unit of work will take, and the second is how you make the scheduling decision — when, and whether, a certain unit of work will fit. On the estimation side, when each task is an independent unit, it simplifies resource estimation a lot. Think of a query being submitted to a cluster: pretty much the only thing you know about it is that it reads some number of tables, let's say two tables with a total size of around 20 terabytes in this case. When you try to predict resource utilization, it is very, very difficult to say how much memory this query is going to need or how much CPU time it will take. You can try to employ cost-based optimization techniques and so on, but these tend to be not very reliable, and the problem is that the cost of misestimating is very high: if we mispredict the resource utilization and schedule this query on a cluster that doesn't have the resources to process it, we will have to kill the entire query and start over. But when you are able to schedule each task independently, you can estimate resource utilization for a single task much better, because you know exactly how much data this task is going to read. For example, if it reads a total of 5 gigabytes of data, then it's very unlikely that it will use more than 5 gigabytes of memory. And if we are wrong — if it does end up using more memory than we predicted — it's not a big deal: we can kill the unsuccessful attempt, increase the reservation, and restart only a single task.

The second part is, of course, resource allocation. Let's say we have a Trino cluster with 5 nodes, and we know how much memory is utilized on every node. When a new query is submitted, we have to answer: can I run this query right now on this cluster? First of all, the resource estimate is difficult, but let's assume you build some external, history-based system that can provide resource estimates for a certain query or pipeline. You still need to answer whether this query is going to fit in a cluster that is currently utilized in a certain way, and this is not an easy question, because all the tasks have to run in parallel, you have to bin-pack these tasks efficiently, and so on. And of course, if you are wrong, you have to kill the query and start over, which is not ideal. But again, if you have the ability to run each task independently, it is a much easier decision to make and a much easier question to answer whether you can schedule a single task. And if you are wrong — if you decide to schedule a certain task on a certain node and realize there are not enough resources to run it — it's not a big deal: you can kill the task and start it over later.

What it also makes easier is resource sharing between queries. To share resources between two different queries with the new execution model, the scheduler simply has to maintain an equal number of running tasks for all the queries that are active, and since all tasks are sized to be approximately similar and are relatively short-running, we can balance resources between queries pretty easily without wasting a lot of work. With the old model, when queries run all at once, there is not much you can do if the cluster runs out of resources: either query one or query two has to be killed to let the other query proceed.

The new execution model also makes it much easier to implement adaptive optimization techniques, because it allows queries to be suspended at pretty much any point. What you can do is suspend a query at the point where you need it and re-plan it. One of the classic runtime optimizations is adaptive join reordering. Think of an incorrect join order picked by the cost-based optimizer — the cost-based optimizer is not always ideal, and it may pick an incorrect join order. What you can do is run the scan stage for one table, run the scan stage for the other table, then suspend the query at that point and check whether the join order is correct — whether the smaller table is on the right, which is what we usually try to do in Trino — and if the join order is incorrect, you can re-plan and resume the execution without wasting any work. With this capability we can do many interesting things: we can adaptively reorder joins, we can adaptively switch the join type from broadcast to partitioned and vice versa, and we can add adaptive skew mitigation and adaptive partitioning strategies to planning. There are many, many things we can do once we are able to suspend and resume queries at any point in time.
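Here is a rough sketch of what that join-order check could look like at the suspension point (the class and field names are mine for illustration, not actual Trino internals): once both scan stages have finished writing into the exchange buffer, their output sizes are known exactly, so it becomes a trivial comparison to decide whether the smaller input is already on the build side — the right side in Trino — or whether the plan should be rewritten before the join stage resumes.

```java
public class AdaptiveJoinOrder {

    // Stand-in for a completed scan stage whose output size is now known
    // exactly from the exchange buffer, instead of being estimated up front.
    record ScanOutput(String table, long bytes) {}

    // The probe side is streamed, the build side is held in memory; in Trino the
    // build side is the right input, so the smaller output should go there.
    record JoinOrder(ScanOutput probe, ScanOutput build) {}

    static JoinOrder chooseJoinOrder(ScanOutput left, ScanOutput right) {
        if (right.bytes() <= left.bytes()) {
            return new JoinOrder(left, right); // optimizer's order was fine, just resume
        }
        return new JoinOrder(right, left);     // re-plan: swap sides before resuming the join
    }

    public static void main(String[] args) {
        ScanOutput orders = new ScanOutput("orders", 900L * 1024 * 1024 * 1024); // ~900 GB
        ScanOutput nation = new ScanOutput("nation", 2L * 1024 * 1024);          // ~2 MB
        JoinOrder fixed = chooseJoinOrder(nation, orders); // cost-based optimizer guessed wrong
        System.out.println("probe: " + fixed.probe().table() + ", build: " + fixed.build().table());
    }
}
```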
So you may be wondering what buffer implementation we provide. From the engine's perspective it is an opaque distributed buffer. Stripping out all the boilerplate, the way the engine communicates with this buffer is pretty simple: writer tasks call write, providing a partition ID and some opaque bytes, and the reader tasks can read a certain partition by its ID. That's pretty much how the engine interacts with the buffer.

We also provide a very basic implementation for now that is based on a distributed file system — currently we support S3, GCS, and Azure. This is a very basic implementation, so it only supports up to 50 partitions, and it also requires a full barrier between stages: stage one has to fully complete before we can start running stage two. It also introduces important limitations that I will discuss a little later, but again, this is just a basic implementation. Further down the road we are building the next-generation distributed buffer. It will be a layered implementation that utilizes distributed memory and SSDs, and only then falls back to a distributed file system. This implementation will support thousands of partitions; it will also support low-latency use cases, since we will be using distributed memory and SSDs as much as possible; and it will significantly reduce overhead, as we will no longer need to always go through the distributed file system. We are also planning to remove the full-barrier limitation with the newer implementation. Avoiding the full barrier is very important to make queries like LIMIT efficient: if you have a query like SELECT * FROM table LIMIT 100, it is usually planned as a set of partial limits on every task plus a final limit, and if enough rows have been produced for the final limit to be satisfied, there is no need to run all the remaining tasks. But today, since there is a full barrier, we still have to run all the tasks. If we remove this limitation, we will be able to run LIMIT queries as efficiently as streaming execution does. It will also allow us to significantly reduce latency for queries that are not so resource intensive. Think of a query that has just two stages and six tasks in total, and let's say you have, I don't know, 100 task slots in your cluster — you could actually run all of them at once to maximize resource utilization, but currently, because of the full barrier, we first need to run stage one and only then can we run stage two. So removing this full-barrier limitation is also important to reduce the latency of smaller queries.

As for the current status: we fully support iterative scheduling, so all the primitives needed to schedule tasks as independent units are already in the engine. Failure recovery is supported at the task level. We also support advanced task-based resource management, with resource predictions and resource-aware scheduling. We provide the file-system-based buffer implementation — I call it a proof of concept, pretty much just needed for us to be able to iterate, but it is fully production ready, and it can be used and is already used by many Starburst customers and community members. And currently we are working on the multi-layered buffer implementation that will allow us to reduce overhead and improve latency.
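Going back to the write and read calls I described a moment ago, conceptually the contract between the engine and the pluggable buffer looks roughly like the sketch below. To be clear, this is my simplified illustration of the idea, not the real plugin SPI, which has more moving parts (exchange sinks and sources, per-attempt sink handles, and so on) — but the write-by-partition / read-by-partition shape is the essence of it.

```java
import java.util.Iterator;

// Simplified, illustrative view of the pluggable distributed exchange buffer
// that sits between stages: writers append opaque bytes to output partitions,
// and downstream readers pull a whole partition back.
public interface ExchangeBuffer {

    // Called by a writer task: hand over an opaque chunk of serialized data
    // destined for the given output partition.
    void write(int partitionId, byte[] data);

    // Called when a writer task attempt finishes successfully, so that its
    // output becomes visible to readers and failed attempts can be discarded.
    void finishWriter(String taskAttemptId);

    // Called by reader tasks of the downstream stage: stream back everything
    // committed to the given partition.
    Iterator<byte[]> read(int partitionId);
}
```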
Further down the road we are planning to focus on adding all the common adaptive optimizations and on implementing micro-batch scheduling to further reduce latency and close the gap between streaming execution and this fault-tolerant execution, to make sure it handles interactive use cases as well as the streaming model does.

For early results: we were able to run queries that would require 5 terabytes of memory with streaming execution on a single-node cluster with fault-tolerant execution. There are also community members that experimented with spot instances and CPU-based autoscaling in Kubernetes, and one company reported more than 60% in cost savings by utilizing spot instances and autoscaling. We also tested this deployment under high concurrency: on a TPC-H 10 terabyte data set with 20 queries running concurrently on a 5-node cluster, it always succeeds, it never runs out of memory, and it is able to prioritize resources in a fair way — so the resource management primitives do work.

That's pretty much all I have for today. If you're interested, please join our community: we have our Slack, and we also have regular broadcasts and virtual meetups. If you're interested in contributing to the project, visit our development page; you're more than welcome to write blogs or improve our documentation, and we are willing to give you some interesting swag as a thank-you. Don't forget to give us a star on GitHub, join the discussion in the Project Tardigrade channel in our Slack, and you're welcome to ping me directly on Twitter. You can also come and see me at the Starburst booth at this event. And yeah, I think it's time for Q&A. Any questions? Anything? Yep.

Okay, so the question was: for a 5 terabyte join and aggregation, since it runs on a single node, does the node have to have 5 terabytes of memory available? No. Basically, with these new execution capabilities we are able to split a query into multiple tasks and then run each task independently and iteratively, so effectively you just need one node with, say, 128 GB of memory, and then you can schedule pretty much one or a couple of tasks at a time, rather than all the tasks, as was previously required by the old execution model.

So it works as an edge node, right? Sorry — so that node works like an edge node: it can distribute tasks to other nodes and then aggregate all the results, rather than processing the entire thing itself, right?
So basically, yes, there is a single worker node, and you can schedule tasks iteratively onto this node. What it does is process one chunk of data, one part of the data, at a time and write intermediate results to this buffer, and then a subsequent task reads them back from the buffer and does the further processing, and so on.

Okay, yeah, that makes sense, thank you. I have two questions. The first one is: when you actually inserted the storage exchange barrier in between stages, have you considered splitting some of the more complex tasks? For example, you have two strategies to do it, right? The first is you split the task by partitions and then you schedule each of these tasks one by one; the other way is you chop these tasks into smaller tasks and then insert a logical barrier, or a storage barrier, in between. So how do you benchmark the two different types of logical plans, or physical plans, in these kinds of use cases? For example, if you're doing a join and after that you're doing a group-by on top of the already-partitioned join output — logically speaking you should group them into a single stage, but if you insert a barrier in between, then you can run both of those tasks faster. How do you...

So basically you're asking: if you have a query that has a join followed by an aggregation, you need to make a decision whether to add an exchange in between, assuming the partitioning matches. This is a very good question. Today, what Trino does by default is not add an exchange, so in theory you may run into a problem where you have skew or where you have a task that is too big. Currently there is a knob that allows you to change this behavior, so you can say: whenever you see an aggregation or a join, always add an exchange. This of course adds extra cost, because now you need to run an additional exchange. This is something we are planning to optimize by applying adaptive techniques: we can be optimistic and run it without an exchange, and then, if we see we've run into a problem, we will try to split it.

Right. And the second question is actually a follow-up to that. You mentioned in one of your slides that you are actually going to rewrite your logical plan, or your physical plan, at runtime. So do you actually go beyond the split boundaries in this case — are you actually planning to insert a barrier in between? Because that's pretty hard if you've already started the entire execution tree.

This is a good question. Basically, we are planning to adaptively split and merge stages. For example, if you decided to run a broadcast join and then you see that your build table is too big, you may need to split a stage into two. This is something we are planning on doing, and the same goes for merging: say you went with a partitioned join by a cost-based decision, then you may want to merge two stages into one. But this is still a stage boundary: you run your build stage to generate the build-side table, and then you can suspend at that point and decide what to do with the join stage and the probe stage, right?

Right, so the answer is that it is still done within the split, right? So your plan is still optimizing within the split boundary, within...
Yeah, within stage boundaries, yes.

And if you switch from, say, a broadcast join to a hash join, then you pretty much need to throw away what you've already done, and then you have to rerun the entire split?

No, you can preserve the build side. Basically, the idea is you run the build side, and then you see how big the build side is and what its characteristics are — maybe it is skewed, or whatnot — and then you suspend after that, after you've run, let's say, table two, right?

Oh yeah, so you are basically suspending on the stage boundary. You've already started one task, and you decide that, hey, this is not the right way to go, so then you basically need to go back and redo it in a different way.

Yeah, but basically we don't have to redo the build stage.

Oh, got it, that makes sense, I guess. Thank you. If no one has another question, there is one more thing I'm kind of wondering: what kind of ETL workload, in your customer research, actually requires a limit — a LIMIT clause? You mentioned that you are early-terminating some of the subtasks, right?

It does happen. For example, there are queries like ORDER BY ... LIMIT, where you want to extract the top rows and then maybe join with that top set. It's not very common in ETL use cases, but it does happen.

Right, but in that case you can't actually cancel the other tasks, right? Because you never know the order.

With the full stage barrier we have to run all of them, and this is a limitation today, but we are working on a better buffer implementation that would allow us to do a kind of pseudo-streaming, and then this issue will naturally resolve itself.

Any other questions? Thank you for joining this presentation.