Hi, guys. Welcome to the Snowflake lecture for 15-445/645. We're super excited today because we have an old friend of ours, Bowei Chen, who was... you were in ISR? What master's program were you in? SES. SES. Okay. Anyway, he did his master's in my research group, spent a lot of time working on our query optimizer, and was then hired by Snowflake, and he's been there ever since. He brought along friends Libo and Kavinder, also from Snowflake, who have also been there for a long time. So this lecture is basically an overview of what the Snowflake system looks like. And what you'll see connects a lot of the pieces we talked about throughout the semester, the different components of a database system, and shows how you actually build them in a cloud-native architecture, specifically designed for OLAP workloads. So hopefully I won't look crazy, because for all the things I said during the semester, these guys will say: oh yeah, this is a big problem, this matters a lot, here's how we've solved it. So with that, guys, take it away. Of course, in the class, if you have any questions, just click on the raise-your-hand icon and I'll interrupt them, and we can fire away at any time. Okay. All right, go for it. Thanks.

Yeah. Hi, everyone, I'll just get started. I'm Bowei from Snowflake, and today we're going to get a bit of an overview of Snowflake's architecture and do some mini deep dives into the cloud storage and compute layers, as well as the cloud services layer, and also workload optimization. Okay. So, we're in a time when more and more large organizations are moving from on-premise software to cloud-based solutions, because cloud-based solutions have the following advantages. First of all, they're elastic, meaning you only need to provision resources when you need them, and you get scalability: you can scale compute resources independently of, for example, storage resources. Also, when software is built in the cloud, it can be built as a service that is always on and always up to date with the latest release. And last but not least, cloud solutions provide an ecosystem for all customers to meet and exchange valuable information, as there are no physical barriers. Database systems also need to rethink their architectures when migrating to the cloud. Most of the last generation of on-premise large-scale data processing systems were designed with the philosophy of pushing compute to the storage, and those architectures have challenges when migrating to the cloud. Snowflake was the first to build a data warehouse system with a cloud-native architecture. We started building Snowflake around 2012, and now Snowflake runs on all three public cloud providers. So, let's look at Snowflake's architecture. The Snowflake storage layer uses cloud blob storage, for example AWS S3, for its scalability, availability, and durability. The compute layer, also known as virtual warehouses, uses cloud compute instances, for example AWS EC2 instances, for their scalability, meaning you can scale the number of workers in a virtual warehouse up to a decently large number, and also for performance isolation, meaning that workloads running on one warehouse do not interfere with workloads running on other virtual warehouses.
And also for its elasticity: our customers can usually create or resize a warehouse in seconds. The cloud services layer is responsible for interacting with the client; it also holds the metadata service and the query optimizer, manages transactions, and so on. We also built data sharing and a lot of other cool features into the cloud services layer.

Next, I'll touch on our software development philosophy a little bit. First, simplicity. This means that Snowflake generally does not expose tuning knobs to customers unnecessarily, and when we implement optimizations, we design them in a way that they can generally be applied to customer workloads without side effects. Adaptivity is also a pretty important philosophy. This means that query execution tries not to make assumptions based on optimizer statistics, and when we implement features, we try to consider the extreme scenarios that could affect performance drastically and come up with designs that are more stable during execution.

And we can see that under adaptivity, which says query execution tries not to make assumptions based on optimizer statistics. What that basically means is that they're not just doing the cost-based query optimization we talked about before in class: they make a good guess about what they think the query plan should be, but then, while they actually run the query, they make decisions. So it's a different way to think about query optimization. I guess right now we still do cost-based optimization. But not for everything, right? Like some of the push-down aggregation stuff, you figure that out while it's running. Yeah, for some optimizations, when possible, we try to defer the decision to execution time. Right, whereas in the class we taught that you generate the entire query plan all at once, then run it, and even if you're wrong, you don't change it. So this is a different way to think about query planning. Yeah, we still generate the query plan all at once, but the execution layer has the capability of making decisions and changing strategies during execution. There are ongoing projects that would allow us to rewrite the query plan during execution and so on, but we're not there yet.

And lastly, we want to provide a platform, a data cloud, that has the functionality of a lot of components: an enterprise-level analytical database, as well as data pipelines, data lake, streaming, and transactional database workloads, and so on. The query engine is the centerpiece that powers all of these components. When we add new components to the system, we build them on top of our query engine when possible, so a lot of these components can hopefully share code. I'll talk a little bit about cloud storage and compute, and then hand it over to Kavinder to talk about cloud services, and Libo will talk about workload optimization.

So, let's first get to cloud storage. Micro-partitions are immutable files for data storage on cloud storage. They're usually tens of megabytes each, and they're columnar, meaning that data in each column is grouped together. There's a file header, as you can see in the diagram, and the header stores the region of each column. When we download a file, we first download the file header to figure out where each column is, and then, based on the query, we can skip downloading the columns that aren't selected.
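To make that file layout concrete, here is a minimal sketch of the idea, not Snowflake's actual format: a header maps each column to its byte range, so a reader can fetch only the columns a query touches. With a real object store you would issue ranged GETs; byte slicing stands in for that here.

```python
import json
import struct

def write_micro_partition(columns):
    """Lay out columns back to back, preceded by a header that maps
    column name -> (offset, length) within the body."""
    header, body = {}, b""
    for name, data in columns.items():
        header[name] = (len(body), len(data))
        body += data
    header_bytes = json.dumps(header).encode()
    # 4-byte header length, then the header, then the column data.
    return struct.pack("<I", len(header_bytes)) + header_bytes + body

def read_columns(file_bytes, wanted):
    """Read the header first, then only the byte ranges of wanted columns."""
    (header_len,) = struct.unpack_from("<I", file_bytes, 0)
    header = json.loads(file_bytes[4:4 + header_len])
    base = 4 + header_len
    return {name: file_bytes[base + off:base + off + length]
            for name, (off, length) in header.items() if name in wanted}

part = write_micro_partition(
    {"age": b"25,30,41", "name": b"ann,bob,eve", "salary": b"70,80,90"})
print(read_columns(part, ["age", "salary"]))  # "name" bytes are never touched
```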
Column metadata for each file, or micro-partition, is also generated when the micro-partitions are written. This metadata includes min and max values, the number of distinct values, and so on, and it's used for data pruning, which means that during compilation or execution, based on the actual data, we can look at the metadata of each file and decide to skip the download of certain files. This avoids unnecessary file downloads and network IO, and when our files are on remote storage, this skipping of IO based on data pruning is really essential for query performance.

Yeah, let's move on to the query execution layer. Snowflake's query execution engine is columnar, meaning that data of the same column is grouped together, when possible, as it flows through the query execution engine. This allows effective use of CPU caches, SIMD instructions, and compression schemes. The query execution engine is also vectorized: each operator handles batches of a few thousand rows in columnar format. This avoids full materialization and improves cache efficiency. And we use a push-based execution model, which means that operators push results to their downstream operators, as opposed to pulling results from upstream operators. Data is processed in a pipelined fashion, meaning the control flow is arranged so that we always try to send data downstream whenever possible, without materializing everything. This enables efficient processing of DAG-shaped plans in addition to tree-shaped plans, because in a DAG-shaped plan you can have a split, and with the push-based execution model you just transfer the data to each branch of the downstream, which can be done relatively easily.

So, Bowei, I don't know if you can talk about this. In the class, somebody asked me once: why don't databases use GPUs more? They have a ton of little cores, they run a lot of threads, they can be vectorized. Has Snowflake considered this and decided it was a bad idea? And if so, why? Right now, we haven't started looking into GPUs yet. We're aware of them. But from what I've seen, GPUs are suitable for certain tasks that they're really fast at, for example evaluating a bunch of simple expressions. But our workloads are mostly really complex; they consist of CPU-intensive parts and also IO-intensive parts. I guess right now CPU-based instances are still the dominant solution, or the best solution we've found so far. We've also been looking at FPGAs, but they're pretty limited to certain scenarios. Awesome. Thanks.
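Before the walkthrough, here is a toy sketch of the push-based model Bowei just described. It is illustrative Python, not Snowflake's internal API, and it shows why a DAG split is easy when each operator simply pushes batches to a list of consumers.

```python
class Operator:
    def __init__(self, *consumers):
        self.consumers = consumers

    def emit(self, batch):
        # Push-based: forward results to every downstream consumer.
        for c in self.consumers:
            c.push(batch)

    def push(self, batch):
        raise NotImplementedError

class Filter(Operator):
    def __init__(self, predicate, *consumers):
        super().__init__(*consumers)
        self.predicate = predicate

    def push(self, batch):
        self.emit([row for row in batch if self.predicate(row)])

class Print(Operator):
    def push(self, batch):
        print(batch)

# One scan feeding two branches of a DAG-shaped plan: just push to both.
left = Filter(lambda r: r > 10, Print())
right = Filter(lambda r: r % 2 == 0, Print())
scan_consumers = [left, right]
for batch in ([5, 12, 18], [7, 20]):
    for c in scan_consumers:
        c.push(batch)
```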
Okay. All right, so let's look at a mini query as an example of how Snowflake's engine works. This query selects the name and the tax, computed as salary multiplied by a flat rate, from all the employees who are more than 25 years old. The first step is the table scan. As you can see in the diagram, we download the remote files and decompress them into three columns: age, name, and salary. Data of the same column is grouped together, and data is produced and sent to the next operators in small batches. So let's say we download the first file, decompress it, extract the first, say, 4K rows from it in three columns, and pass those three columns to the next operator. The next operator is the filter, which is responsible for evaluating the age > 25 predicate, and it produces a selection vector, which indicates the row indexes that qualify. Since data in the age column is grouped together, this simple expression can be evaluated using SIMD instructions to be more efficient. The next step is to evaluate the expression in the projection, namely the expression that computes the tax. Notice that we have a selection vector at this point, so we only evaluate the expression for the indexes that are selected. And finally we produce the final result, which is represented as the name and tax columns plus the selection vector. After one iteration of this processing is done, control flow transfers back to the scan to get the next batch of columns, and so on.

Okay. On this slide I'll talk a little bit about some major components of the execution layer and how we made certain design decisions. The table scan is a pretty essential operator; almost all queries have a table scan, because they need to get input from either remote or local storage. The table scan is responsible for downloading files from storage, decompressing them, and providing the input to the other operators. Firstly, files are distributed among a set of workers using consistent hashing. This minimizes file movement if, for example, we resize a warehouse. If some workers, or cloud instances, finish early, they can steal files from others. And the files are cached on local disk for later queries to reuse, to prevent repeatedly downloading remote files. Join is another really essential piece of the query execution layer. Snowflake implements a distributed hash join algorithm with runtime-adaptive selection of the distribution method, meaning that based on runtime statistics, we can choose how to actually execute the distributed hash join. As a byproduct of processing the build side of the join, we also build Bloom vectors and maintain value ranges for the keys, for effective runtime data pruning. And join skew is a problem we've seen a lot during execution, so we implement automatic skew detection and mitigation, meaning that if there's a hot key, whether on the build side or the probe side of the join, we can automatically detect it and mitigate it by distributing the work across different workers. Lastly, the scheduling of operators, meaning how we schedule the different operators in the same query. Mostly we do pipeline-wise lock-step execution: our workers all work on the same execution pipeline until they're done, and then they move to the next one. This exploits intra-operator parallelism. And plan fragments without much data can be identified and scheduled on single nodes in parallel, to avoid the scheduling overhead of involving all workers, thus exploiting inter-operator parallelism.
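To make the selection-vector idea from the walkthrough concrete, here is a minimal Python sketch. The real engine does this over columnar batches with SIMD; the flat rate is a made-up constant for the example.

```python
FLAT_RATE = 0.3  # hypothetical tax rate, just for illustration

def filter_age(age_col, threshold=25):
    """Return a selection vector: the indexes of rows passing the predicate.
    No surviving rows are copied; we only record which indexes qualify."""
    return [i for i, age in enumerate(age_col) if age > threshold]

def project_tax(salary_col, selection):
    """Evaluate salary * rate only at the selected row indexes."""
    return [salary_col[i] * FLAT_RATE for i in selection]

age = [23, 31, 45, 19]
name = ["ann", "bob", "carol", "dan"]
salary = [50_000, 60_000, 70_000, 40_000]

sel = filter_age(age)           # [1, 2]
tax = project_tax(salary, sel)  # computed only for rows 1 and 2
result = [(name[i], t) for i, t in zip(sel, tax)]
print(result)                   # [('bob', 18000.0), ('carol', 21000.0)]
```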
How do you guys manage memory on a compute node? Is there a traditional buffer pool where you're fetching something from S3, you bring it into memory, and if it gets evicted, you evict it to the local cache, and then there's another level of the hierarchy that says, all right, evict it from the local cache and just drop it? It's more complicated than a traditional buffer pool in a shared-everything single-node system, but you're still managing memory yourselves, correct? Yeah, we're still managing memory ourselves. We have buffer pools and so on, and a memory management system. And in extreme scenarios we can spill to local disk, and if there's no space left on local disk to spill to, we can also spill to remote storage and read it back later. Thanks.

All right, those are all the slides I have. I'll hand it over to Kavinder to talk about the cloud services layer.

Hi everyone, I'm Kavinder. I've been at Snowflake for a couple of years now, and I'll talk about the cloud services layer. Next slide. So, we can think of cloud services as sort of the brains of Snowflake. It's the single layer that acts as the coordinator between the client, metadata, compilation, the warehouses, and the cloud provider itself. So it is a coordinator, but it's a lot more interesting than just a middleman. The cloud services layer is designed to be a general-purpose layer, without exposing parameters to the customer. In traditional systems, the equivalent of the cloud services layer would have a lot of configurations, and you'd configure it specifically for your workloads and change those configurations as your workloads change. For instance, if you're running ETL at night and online processing during the day, you'd set up the system differently. But Snowflake doesn't expose any of those things, as Bowei mentioned, as part of our philosophy around simplicity. That makes the engineering challenge harder on our end, because the cloud services layer has to be everything to everyone. So, some responsibilities that fall to the cloud services layer, and this isn't a comprehensive list, just the interesting things. One is service management: keeping the entire Snowflake service up. Then metadata and concurrency control; query compilation, including parsing, optimization, and planning; and security, both at the cloud service level, in terms of authenticating users and ensuring every account is siloed from the others, and the traditional database security concepts of roles, access policies, and things like that. Background services are run by the cloud services layer, as well as cloud infrastructure, ensuring we have enough VMs available to scale up, and releasing VMs when we want to scale down. Next slide. So I'll talk about the different tracks, less about the things I don't know much about and more about the stuff I do know about. Service management you can think of as: how do you keep a cloud system up?
And it's a challenging problem for us, because Snowflake, across all of its deployments, handles billions of statements a day. And the cloud services layer is the only multi-tenant portion of the system, meaning multiple customers go through the same cloud services node for their workloads, and customers run different types of workloads. It isn't strictly just OLTP or OLAP; customers have very bursty workloads, or very long-running queries, and how that plays out on the system has different properties, and we have to be aware of that and handle it. The service also must be fault-tolerant and auto-scaling. Because we don't expose anything to customers, they can't ask us for a new node; they can't indicate to us that they're about to hit us with a bunch of queries and tell us to scale up. There are none of those facilities, so the service must be resilient in the face of that.

So, some interesting challenges within the service management area. One is ensuring that no single customer overloads the system; this falls into the bucket of auto-scaling. If you have a multi-tenant system and one customer has a conference running, with demos, and 5,000 people issuing Snowflake queries at the same time, we don't necessarily know which cloud services node that will go to, and we don't pin a customer to a specific node in most cases. So we have to be able to scale up so that no other customer is affected. We also have to ensure that failed nodes and jobs are retried. This is partly fault tolerance: if a worker or a single node fails during execution of a job, we have to be resilient to that. But moreover, in terms of self-healing: every customer is on the newest version of Snowflake, which makes releasing features challenging. So we have mechanisms in place to retry with new features disabled. To the customer it looks like their queries are just succeeding, but in the background there may be failures and such that we mask over. Providing predictable performance, especially at the cloud services layer, is a challenge, because compilation is just one aspect of the layer; there's a lot more going on, and having to compete for resources in a multi-tenant system makes predictable performance hard. Some customers don't care how much time their statements spend in cloud services, but other customers are very sensitive, and we can't really separate those two customers from the same node. And finally, ensuring enough cloud VMs are available. We have mechanisms in place to keep a free pool of VMs available, and the challenge is reserving neither too few VMs, so that we can always scale up, nor too many, so that Snowflake isn't wasting money holding onto VMs that never get used.
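Here is a hedged sketch of the self-healing retry described above: if a statement fails with new features enabled, retry with them progressively disabled, so the customer only ever sees success. The flag names and the run_query callback are hypothetical.

```python
NEW_FEATURES = ["new_join_rewrite", "new_prune_pass"]  # hypothetical flags

def run_with_retries(query, run_query):
    """run_query(query, enabled_features) returns a result or raises.
    On failure, turn off one more newly released feature and retry."""
    enabled = list(NEW_FEATURES)
    while True:
        try:
            return run_query(query, enabled)
        except Exception:
            if not enabled:
                raise      # nothing left to disable; surface the real error
            enabled.pop()  # retry with one more new feature turned off
```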
Next slide. Some metadata. Compared to the traditional database systems you'd read about in a textbook, I think Snowflake's use of metadata is kind of unique and interesting. I mean, it's not necessarily unique across all the newer systems, but compared to the textbook stuff it is pretty interesting and good to know. It's the right way to do it. How about that? The way you guys do it is the right way to do it. Well, yeah. And I think it comes with being cloud-native; it's a nice fallout of the design decisions made early on.

So, as Bowei mentioned, we create metadata for every data file, or micro-partition, and in that metadata we have good enough stats on every single column. Customers don't need to tell us, hey, build an index on this column; we pull this metadata for all columns, regardless of how wide the table is or how many tables a customer has. We generate a ton of metadata, and there's a lot of metadata interaction going on all the time with every statement that gets processed. So we decided to use a different database entirely to store metadata, and we use FoundationDB, which is an open-source distributed key-value store with ACID properties. I'm not too much of an expert on FDB, but this is the underpinning of all of our metadata, and every database object's metadata is stored in FDB. So beyond just tables and columns: users, roles, masking policies, views, schemas, it's all stored in FDB. And it's partitioned in a way that the key itself silos every customer's metadata from every other customer's, because we use the account as the top-level key. So there's no way for us to inadvertently pull metadata from a different customer while processing another customer's query. The cloud services layer retrieves metadata at different times during compilation. We don't just reach out to FDB once; there are different phases of compilation, and of the cloud services layer in general, that make different calls to FDB. And similarly, the background services that cloud services runs manage the FDB layer itself, so that metadata growth doesn't get out of hand: we compact metadata and clean up old and stale data in the background to keep it reasonably sized.

Below is a diagram of how we interact with FDB. It's a very simple example, but imagine the cloud services layer is processing a query. As you go through that query, you want different metadata at different points. Initially, you want table metadata, to verify that the tables exist and get their properties. Then, as you get to pruning, which I'll talk about a little more later, you want column-level metadata, so we reach out to FDB again. And finally, if you have a DDL or DML statement, FDB is the place where we persist that data, say for a create-table statement. But we also use FDB to grab locks on tables. We can go to the next slide.
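To illustrate the account-as-top-level-key idea, here is a small sketch. The key layout below is invented for illustration; it is not FoundationDB's or Snowflake's actual schema.

```python
def metadata_key(account_id: str, object_kind: str, object_name: str) -> bytes:
    # Account comes first, so all of one tenant's metadata shares a prefix.
    return b"\x00".join(p.encode() for p in (account_id, object_kind, object_name))

def scan_prefix(store: dict, account_id: str):
    """A prefix scan scoped to one account can never match another tenant."""
    prefix = account_id.encode() + b"\x00"
    return {k: v for k, v in store.items() if k.startswith(prefix)}

store = {
    metadata_key("acct_a", "table", "orders"): "table metadata for acct_a",
    metadata_key("acct_b", "table", "orders"): "a different tenant entirely",
}
print(scan_prefix(store, "acct_a"))  # only acct_a's keys match
```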
Cool. So, concurrency control. The fact that we use S3, or blob storage generally, and have immutable files makes snapshot isolation and multi-version concurrency control the natural choice, because every DML operation, whether it's an insert, delete, or update, generates new data files, and we can consider the sets of data files before and after a DML operation as two entirely different versions of the table. It works very nicely for us: when we process a query, we get the version based on the current state of the table and use that version moving forward. If there are other operations going on in the system, the query being processed is unaware of them, and there's no competition for grabbing table-level locks or anything; you have your version and you stick with your version. If you have a DML operation on a version, then you have the lock for that version; but if you're just a select, you don't see reads that are uncommitted or that are outside of your version.

Additionally, this versioning, which I think is understated, or maybe doesn't get talked about enough in textbooks, makes things like time travel an easy feature to implement in a cloud system. If you're at version 3 in the diagram below and you want to go back two versions, we support that, because the properties of S3 give us the failsafe retention of old data files. Even if data is deleted from the customer's perspective, it still exists in S3 for a certain amount of time, and within that window we can do time travel. So if you mess up a table and you want to clone your old database, you can clone based on some previous time. And what that does is it doesn't copy any data; there are no data copies or data loads. It simply creates a new database that points at an older version, and those files all still exist; we just register that we now have an active database referring to those files. Similarly, result reuse is also a simple enough feature to implement, because we use the table's version as part of the hash when we generate a hash over the entire query, the columns, and all of that, to figure out whether we can use a previous result that's already been computed. In those cases, if you haven't changed the table and nothing else has changed, and we've already run a very long query, say an hours-long query, and you run it again, we can just give you the result, because we can detect that nothing's changed through this version system.

How aggressive, how sophisticated, is your query result caching? Is it exact SQL query matching, or can you recognize the semantics of the query and exploit that? It's more in the realm of exact query matching. There can be slight differences, but it's pretty much the top-level result.
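A minimal sketch of that result-reuse key: hash the query text together with the versions of every table it reads, so any DML, which bumps a table's version, naturally invalidates the cached result. The exact hashing scheme here is an assumption, not Snowflake's.

```python
import hashlib

def result_cache_key(query_text: str, table_versions: dict) -> str:
    """Cache key = hash of the query text plus every referenced table's version."""
    h = hashlib.sha256(query_text.encode())
    for table, version in sorted(table_versions.items()):
        h.update(f"{table}@{version}".encode())
    return h.hexdigest()

k1 = result_cache_key("SELECT ... FROM foo", {"foo": 3})
k2 = result_cache_key("SELECT ... FROM foo", {"foo": 3})
k3 = result_cache_key("SELECT ... FROM foo", {"foo": 4})  # after a DML
assert k1 == k2 and k1 != k3  # reuse only while the version is unchanged
```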
Next slide. Now I'll talk about query compilation for a couple of slides. This is the general flow of query compilation; you can also call it optimization and query planning. It fits your standard textbook model, with slight variance. We start with the query text on the top left. We go through parsing to generate an abstract syntax tree. We go through semantic analysis, where we look up tables, columns, and other information, verify that the SQL query makes sense, and do type checking. One interesting thing about Snowflake is that because we have the metadata, and because that metadata gives us min/max values, we can indicate to the execution engine the specific widths of data types to use. This may seem like a small optimization, but in aggregate it can save the execution engine a lot of memory by not wasting unused memory during execution. Then we go through a phase of logical rewrites. These are rewrites primarily on the parse tree itself. Then we go through pruning. Pruning is the process in which we look at the predicates in a query and prune out data files, by pruning out micro-partitions. This is a huge cornerstone of Snowflake, I would say: pruning and how we leverage it. Then from pruning we go to logical plan generation, and we go through a standard set of heuristic-based rewrite rules. As the dashed lines indicate, when we run these rewrite rules, we also get new opportunities for pruning. For instance, if you push a filter down from a join to one or both sides of the join, and it's now sitting in your table scan, you have a new opportunity to prune data files in that table scan. That's why the lines point to both places. Then we go through cost-based optimization; we have a Cascades-style cost-based optimizer, nothing too different from what may have been presented in your lectures. We cover Cascades in the advanced class. What's that? We cover Cascades in the advanced class. Check out the YouTube video, then. Then finally we go through physical plan generation and generate the plan, which is then shipped off to the execution engine.

Rather than talk about the details of compilation, I would first say that the stuff you've learned in this class is very relevant to database systems. It's not old stuff, it's not outdated; the better you know it, the faster you can get up to speed with modern systems. I'm going to talk instead about the things I think are interesting, things that many other traditional systems don't do. One whole realm of that is data-dependent, or metadata-dependent, optimizations. I've talked about pruning, and I'll talk more about it later. Constant folding is another optimization: if you can detect that a column will always have the same value, you can substitute that column with a constant. You save execution time, and you save IO by not having to read that column. Constant folding itself can also unlock new opportunities for optimization, like constant value-set detection. If you look at number one there, it's a value set and not a table scan, because at some point the compiler was able to detect that all the data files remaining for this table scan have the same values for all the columns, so we just substitute the table scan with a constant set of values. That's one interesting optimization we have based on the metadata.

Then, as Bowei mentioned, we have many adaptive features. These are optimizations where the query plan indicates to XP, the execution engine, that, hey, you can make a decision; it doesn't give a fixed plan, but the plan is there and XP adapts. One is join filters. If you look at number two there, that join filter corresponds to the topmost join. Because we implement a hash join, we have a hash build, and once you've built the hash table on the build side, you have information on what columns you've seen and what their min/max values and ranges are. Before rows from the other side even get to the join probe, you can do an early phase of filtering, and you can aggressively push that filtering down, in many cases as close to the source of the data as possible. The benefit is that once you've built your hash table, you know the min/max values and statistics that the compiler just couldn't possibly have had, or didn't have correctly, so by having a join filter you can eliminate many rows at the table scan level, before they go through other potentially expensive operations. Is there a Bloom filter for that as well, or no? Yeah, it's implemented as a Bloom filter.
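Here is a toy illustration of the join-filter idea, with a plain Python set standing in for the Bloom filter, plus the min/max range built from the build side.

```python
build_rows = [(1, "a"), (7, "b")]                      # build side of the join
probe_rows = [(1, "x"), (3, "y"), (7, "z"), (9, "w")]  # probe side

# Build phase: hash table plus a compact key summary for early filtering.
hash_table = {k: v for k, v in build_rows}
key_filter = set(hash_table)             # stand-in for a Bloom filter
key_min, key_max = min(key_filter), max(key_filter)

# The probe-side scan applies the filter early: rows with keys 3 and 9
# are dropped before they ever reach the join or anything above it.
survivors = [(k, v) for k, v in probe_rows
             if key_min <= k <= key_max and k in key_filter]
joined = [(k, hash_table[k], v) for k, v in survivors]
print(joined)  # [(1, 'a', 'x'), (7, 'b', 'z')]
```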
Aggregation placement is another one. Aggregates can be expensive, and an aggregate can legally be evaluated in different places (some places are obviously illegal to move an aggregate to), and the cost of an aggregate changes depending on where it's evaluated. From the compiler's perspective, we just say: we don't know enough about the stats of this query to pick the exact place to execute this aggregate, so we indicate to XP, the execution engine, that it's allowed to evaluate the aggregate either earlier or later, depending on the cardinality of the rows it's seeing. In this case, with number three, the aggregate can be evaluated before the join, which makes sense when the join is explosive, but it can also be evaluated after the join, which makes sense in the other case, where the join is reductive; we let the execution engine figure that out. And finally, adaptive links are another, smaller adaptive optimization. In a distributed system, you either want to redistribute your data among your workers, or you may want to keep the data local to the worker you're on, and adaptive links allow the execution engine to determine what to do based on the cardinality of the data it's seeing. Sometimes it makes sense to broadcast your data set to all of the workers, and sometimes it makes sense to hash-partition it, so each row goes to a single worker. Next slide.

So, I'll talk about pruning more, because I think it's a cool feature. Let's go through an example. Say we have a query like SELECT * FROM foo WHERE c1 > 5, and on the left-hand side you have your data files for foo, showing the min and max values for c1. As you can see, there are two files whose ranges can't contain anything greater than 5. As part of the pruning pass, we take the predicate expressions, break them down into atomic expressions we can prune on, and then prune the data files, and we do multiple passes of this, one for every predicate expression. After one pass we've already eliminated two of the five data files; if there are other predicates, they may reduce this further. And this feature scales nicely with the complexity of the query. Maybe it doesn't seem like we're saving that much time or IO by eliminating two files, but when you have massive queries that are tens of thousands of lines long and reference hundreds of tables, we can do this granular pruning on every table, with every pruning expression we have access to near the table scan, and get a huge reduction in the data files we actually end up scanning in execution. Next slide.
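A minimal sketch of that pruning pass over made-up file metadata, plus the min == max constant-detection trick that shows up in the next example.

```python
# Metadata only: (file name, c1 min, c1 max). No file is ever read.
files = [
    ("f1", 1, 4), ("f2", 3, 9), ("f3", 6, 6), ("f4", 2, 5), ("f5", 8, 12),
]

# Prune for the predicate c1 > 5: keep files whose range could satisfy it.
surviving = [(name, lo, hi) for name, lo, hi in files if hi > 5]
print([name for name, _, _ in surviving])  # f1 and f4 are never downloaded

# Constant detection: if min == max across every surviving file, the
# column can be replaced by a literal during compilation.
mins = {lo for _, lo, _ in surviving}
maxs = {hi for _, _, hi in surviving}
if len(mins) == 1 and mins == maxs:
    print("c1 is the constant", mins.pop())
```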
And along those lines, with enough pruning you can unlock the opportunity for constant folding, and constant folding itself can lead to further optimization. Look at a query like this: we're projecting a single column from foo, but before that we join it with bar on foo.c1 = bar.c2, and we're looking at all rows of bar where c2 is less than 4. As we prune first on bar, we eliminate four of the five data files and we're left with one, and we can detect that that one file is constant: the min and max values of c2 in that file, regardless of how many rows it has, are both 2. That lets us substitute 2 for bar.c2, and effectively we can rewrite the entire query to selecting a single column from foo with a constant filter on foo. Perhaps this now unlocks a new opportunity to prune on foo based on that predicate, and perhaps it even leads to constant folding of the entire query, where we detect that foo.c1 = 2 also yields a constant set of data files, and we can just return the result without having to execute anything at all. That happens in some cases, but even in the average case we can do interesting optimizations like this. Pruning and constant folding were things I wasn't aware could be this powerful before I joined Snowflake, having seen other systems, so I can't highlight enough how much we leverage these kinds of optimizations.

Can you say roughly what percentage of queries can leverage this? Is it like 10%? This sort of optimization with constant folding, yeah, probably in the 10% bucket. But pruning having a dramatic impact on queries, I would say, is probably, I don't know, 75 to 100% of queries that actually execute benefit from pruning. Of course, yeah. And at this scale, you know, 10% is still a lot. Yeah, yeah. Next slide.

So, this is sort of an advertisement for compilation at Snowflake. These are interesting problems that we work on, and I'll go through them. One is compilation time. Obviously, you can run many optimizations and search your space of potential plans all day, but if you spend too much time there, you never return a result to the customer, or you return it way beyond their expectations. So compilation is always a balancing act: figuring out how much time to spend in optimization, what the benefit will be on execution time, and at what point we say, hey, we have a good enough plan, we've done enough pruning, let's just execute the plan as is. This is an ongoing challenge we face. Core optimizations are another area. Snowflake is relatively young compared to other systems, so there are a lot of optimizations that we don't have in the core optimizer itself. And it's not so much implementing these optimizations; understanding what to implement and how to implement it is fairly straightforward, as there are tons of academic papers and textbook material on it, and it's very easy to understand.
The challenge for us is that every customer is on the latest version of Snowflake; we don't pin customers to older versions, and we don't let them set knobs or disable optimizations. We have to figure out how to release an optimization to all customers without regressions, and when you're running billions of queries a day across different types of workloads, almost as a rule, at least one customer will find a regression with every optimization, regardless of how beneficial that optimization is for all other customers. So this is an interesting challenge from a software engineering perspective.

You've said this a couple of times, and the class may not appreciate it: the idea that you guys are only running one version of this database system for all customers is a huge, huge deal. With Oracle, for example, when you download Oracle, included inside of it is the query optimizer and query planner from the last 10-15 years, and you can go and manually set: I want Oracle 19 to use the optimizer from Oracle 10, because of these regressions. That means Oracle has to maintain 10-to-15-year-old versions of the software for all these old customers that don't want to upgrade. You guys don't have this problem; you force everyone to upgrade. Of course, you have to be cautious to make sure you don't shoot yourself in the foot. This is a big deal, and because of the cloud, you can do it. Yeah, exactly. And in terms of engineering, our engineering culture is very concerned about this, so we have a lot of mechanisms in place and a ton of testing tools. And I don't think we could do this otherwise; it all goes hand in hand with the cloud. Because we manage the service and can observe the query workloads, it's realistic for us to put everyone on the same version with all the same optimizations.

Adaptive optimizations are an area where we're continually growing, and I feel like our adaptive optimizations are just scratching the surface. There are a lot of ideas where you can treat optimizing the query and running the query as a joint phase and do the optimization then; I think there's a lot of literature on that.
Similarly, runtime pruning is another area: because pruning in the cloud services layer is all done on one node, maybe we can distribute pruning and have pruning passes that happen during execution. So there are a lot of areas of exploration here. And finally, optimizing non-OLAP workloads. Snowflake was designed as an OLAP system to begin with, and there are many assumptions you make, especially within a compiler, when that's the case, particularly in how free you can be with memory and how loose you can be with compilation time. But as we get new workloads, like OLTP-style workloads, external functions, or data science and Snowpark-like workloads, customer expectations change, the data patterns change, how customers access their data and how frequently all change. And our objective is to have one compiler, one optimizer for all workloads, and to figure out how to be smart and intelligent internally without having to ask the customers: hey, what workload are you planning to run? And I find external functions very interesting, because an external function is literally a Java or Python sandbox that gets stuck into an operator at execution time, and it's essentially a black box from a stats perspective, so being adaptive or more intelligent there is an area we still need to explore.

I taught UDFs on Thursday last week. What percentage of queries are using UDFs or external functions, roughly? It's probably in the low single digits right now. UDFs in general, there's a wide range of SQL UDFs, but external-function UDFs, we're slowly adapting to those. Got it, awesome, thanks. I think this is my last slide.

So, it's my part now. I'm Libo, from Snowflake's SQL workload optimization team, and I've been there for a few years now. Today I'm going to talk about workload optimization. Actually, Bowei, can you restart the slideshow? I made some updates. Yeah, so Bowei and Kavinder just gave an overview of the compute and cloud services layers, and I'm going to do more of a deep dive into one of our latest features, as a showcase of how Snowflake can benefit from this unique architecture. So, what does workload optimization do? We own a bunch of separate features that benefit the performance and cost not of single customer queries, but of customer workloads. Some of these features include clustering. I think this class knows what clustering is; basically, clustering is what makes the pruning Kavinder mentioned earlier much more efficient. We also have materialized views. Materialized views are very simple if there's no DML: you just cache the result. But unfortunately that's not the case in the real world. We take care of new micro-partitions and deleted micro-partitions by constantly doing incremental maintenance on top of the materialized views, and we keep the materialized view fresh from the customer's perspective by doing a runtime rewrite for the customer. For example, if the customer selects from a materialized view, their result will come from two different places. The first place is the materialized view itself, and the second place, for the new data or new updates on the base table that aren't yet materialized, is the base table directly. We combine those two parts together, so that from the customer's point of view the materialized view is always fresh.
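Here is a small sketch of that runtime rewrite, with invented data structures: the materialized part is combined with the not-yet-materialized base-table partitions, and the final aggregation runs over both.

```python
def scan_mv(mv_rows, base_partitions, materialized_ids, aggregate):
    """mv_rows: pre-aggregated rows from the materialized view.
    base_partitions: {partition_id: raw rows} on the base table.
    materialized_ids: partitions already covered by the view."""
    fresh = [row
             for pid, rows in base_partitions.items()
             if pid not in materialized_ids   # delta since the last refresh
             for row in rows]
    # Further aggregation runs on top of both parts combined.
    return aggregate(mv_rows + fresh)

base = {"p1": [1, 2], "p2": [3], "p3": [10]}  # p3 arrived after the refresh
mv = [3, 3]                                   # per-partition SUMs for p1, p2
print(scan_mv(mv, base, {"p1", "p2"}, sum))   # 16, same as summing all rows
```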
And in terms of maintenance: because of the constant DMLs on the base table, we cannot just materialize everything, because that way incremental maintenance would be very expensive. Instead, what we materialize is partial. You can think of it as a materialized view over the individual micro-partitions of the base table, so each row in the materialized view itself comes from at most one data file in the base table, and when querying the materialized view we do further aggregation on top of it. The whole challenge comes from the constant DMLs on the base table. We also own features like result reuse, which, as mentioned, is currently just caching the result and using exact matching to retrieve it. And the latest feature, which I'm going to talk about today, is our query acceleration service; more details to come. Some of these features kick in automatically, like result reuse, and query acceleration if the customer opts in. But some other features, like clustering and materialized views, today have to be set up by customers: the customer has to set up clustering keys on the table, and they have to manually create materialized views. One of our long-term visions is to implement automatic recommendation, or even automatic adoption, of those features, so that the workload learns about itself and optimizes itself.

Okay, so now more details about query acceleration. Next slide. So, what problem are we trying to solve with the query acceleration service? I know the name makes it sound like a super effective thing that just accelerates everything, but we all know some database stuff; we know it's not possible to just accelerate everything. Instead, the problem we're trying to solve is in a limited area, but it has a very big impact on query performance. There are two separate perspectives. One is that we have a heterogeneous workload, with both large and small queries running on the same virtual warehouse. If you have such a workload, some problems may arise: the large queries might have a long execution time, and while a large query is being executed, the smaller queries might queue up, because the large query takes up all the resources. There are several solutions to that. The customer could spin up a larger warehouse, but when there are no really large queries, that incurs unnecessary extra cost for the customer, and customers' costs are real money; they're sensitive about it. They could also spin up a multi-cluster warehouse, which basically spins up duplicate copies of the virtual warehouse. So, for example, if you have a small warehouse that contains two servers, you could make it multi-cluster, so there are multiple clusters of two servers each,
and each query is executed in at most one of those clusters. This doesn't solve everything, because, as I just said, a single query cannot be executed across multiple clusters, so the execution time of those large queries stays high. Okay, so the next perspective is that the warehouse size has to be decided by a human right now; some customer has to deal with it. So it isn't able to scale based on the actual compute demand of each individual query. The traditional model requires the customer to have some knowledge of the workload, of how much compute their workload is going to need. The customer's estimate is not always accurate, and that can cause either slow performance or extra cost for them. And speaking of those large queries, in many cases a large table scan is the culprit: when a customer query scans terabytes of files, the table scan can take a lot of time, and we do see a lot of such queries in real life. Okay, next slide.

So, given all those problems, here is what we're trying to do. The basic idea of query acceleration is to execute a fragment, a part of the plan, at a larger scale. We take a fragment of the query plan and execute it on additional, separate servers outside of the main virtual warehouse, the main server cluster. And there are some criteria for selecting the fragments, as shown in the sketch below. First, it needs a large table scan; otherwise it doesn't make sense to scale it up, and it isn't worth all the overhead we have. Second, we need fragments with a high reduction ratio, meaning a lot of data comes in from the scan and only a small amount of data comes out. A high reduction ratio can come from a highly selective filter, a highly selective join filter (a Bloom filter, in Snowflake's case), or a group-by with very low cardinality. The reason behind that is explained later. And the third criterion is that no data exchange between servers should happen inside those fragments. That's because in normal warehouse execution we have the servers talking to each other; we establish a mesh, and that mesh actually limits how many servers there can be. You don't want too many servers talking to each other and broadcasting stuff; that would add a huge overhead. For query acceleration we want to have many, many servers, so one criterion is no data exchange: everything has to be local until the very last step of the fragment. Next slide, please.
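Here is the promised sketch of those three criteria as a simple eligibility check; the thresholds are made up for illustration.

```python
def eligible_for_acceleration(scan_bytes, rows_in, rows_out, has_exchange,
                              min_scan_bytes=100 * 2**30,  # e.g. ~100 GiB
                              min_reduction=100.0):
    """Three criteria: large scan, high reduction ratio, no data exchange."""
    if scan_bytes < min_scan_bytes:  # small scans aren't worth the overhead
        return False
    if has_exchange:                 # the fragment must stay fully local
        return False
    reduction_ratio = rows_in / max(rows_out, 1)
    return reduction_ratio >= min_reduction  # selective filter / join filter / group-by

# A 1 TiB scan that reduces 10M rows to 5K rows, with no exchange: eligible.
print(eligible_for_acceleration(2**40, 10_000_000, 5_000, has_exchange=False))
```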
Yeah, so this diagram shows how we do the plan rewrite and execute everything. As I mentioned, we want some extra servers, and where do those extra servers come from? They come from the thing we call flexible compute. Basically, they're extremely scalable: you can grab a very large number of servers compared to the size of your warehouse. The server assignment is also flexible, which is where the name comes from: the servers can come and go at any time. A server can work on a job for some amount of time, and then maybe something changes, maybe there aren't enough servers in the system, or other jobs need the servers more, and it can just stop working on the current job, without interfering with the other servers on that job, and join another job or be repurposed. So that's flexible compute. And we have a design principle called "do no harm", which means the query execution time with query acceleration should not be more than if query acceleration hadn't kicked in. In other words, query acceleration shouldn't make queries slower.

So here is how we do the rewrite. Because of the do-no-harm principle, we actually schedule the fragment work on both the warehouse and the flexible compute servers. That way, even if you get no flexible compute servers, the warehouse is still working on the job, and it shouldn't be much slower than if query acceleration hadn't kicked in. So on the warehouse we also work on this part, and on the flexible compute fragments we execute this part and insert the intermediate results into what we call materialized result files. They're just intermediate results materialized as regular data files. The intermediate results are sent to the warehouse servers through cloud storage, like S3: they're scanned by the warehouse, and the rest of the work continues on the warehouse. And there's one possibility for those materialized results: currently their lifetime is only the lifetime of the single query, but as many of you might have noticed, these intermediate results could potentially be reused, and that would be a huge benefit for customer workloads. Okay, next slide, please.

I'd just point out to the class that this is another advantage of using shared disk. You can't do this easily with shared-nothing, because to borrow a new compute node you'd have to move data to it first before you start using it. But with shared disk you say: all right, spin up a compute node, it's stateless, and it starts pulling from S3 to scale out the shaded portion of the query plan here. Okay.

Yeah, so one thing to mention is join filters. As you might see, the Bloom filter is also part of the fragment. The way we implement this is that during the join build stage, we serialize all the Bloom vectors from the hash join build node and upload them to S3, and then the flexible compute servers can just download them from S3 as needed. Next slide. So next up I'm going to talk about how the query execution works. One key characteristic of query acceleration is that the files for that table scan are distributed in a different manner. Normally, the files for scanning are determined and distributed at compilation time; there can be things like file stealing, but the main shape of the file distribution is determined at compile time. In query acceleration we use this continuous scan set, where files are distributed in batches at runtime, upon request.
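A toy sketch of the continuous scan set: a coordinator hands out file batches on request instead of assigning everything at compile time, so servers that come and go simply stop asking for more. The class and method names are illustrative.

```python
from collections import deque

class ScanSetCoordinator:
    def __init__(self, files, batch_size):
        # Pre-slice the file list into batches to hand out on demand.
        self.queue = deque(files[i:i + batch_size]
                           for i in range(0, len(files), batch_size))

    def next_batch(self):
        # Any server (warehouse or flexible compute) calls this when idle.
        return self.queue.popleft() if self.queue else None  # None => done

coord = ScanSetCoordinator([f"file{i}" for i in range(10)], batch_size=3)
while (batch := coord.next_batch()) is not None:
    print("some server scans", batch)
```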
So when a server runs out of files to scan, it goes to the service we call the query coordinator, asks for another batch, continues processing that batch, and this keeps going and going until the entire file queue is exhausted. So, for query execution, we first create this continuous scan set from the input files; the scan set is created after some runtime pruning and some other optimizations. Next slide, please. Then the warehouse servers begin requesting those batches for scanning, and we have something called warehouse sampling: the warehouse branch of the query plan gets executed first, and it sends statistics to the query coordinator, and the query coordinator can then determine: does this fragment have a high reduction ratio? Is there enough work to spin up query acceleration? And so on. The query coordinator answers those questions from the stats collected by the warehouse servers. Next slide. Once the warehouse sampling passes, the flexible compute actually kicks in: the flexible compute servers are assigned to the job and begin requesting batches from the shared queue. Next slide. Once a batch is requested from the queue, it's put into a staging area; the staging area contains all the batches that are in flight. After a flexible compute server processes a batch, it registers the batch's output as materialized result files with the query coordinator, and the query coordinator puts them into another queue, the materialized result queue. Upon file registration, a checkpoint is performed: that batch is marked as done and removed from the staging area. Next slide. After all the batches have been distributed and processed, ideally every batch in the staging area should have been checkpointed and removed, so if everything went right, the staging area should be empty. But in case there's some random server failure, and with this number of servers glitches do happen from time to time, we don't want to fail the entire query, because accelerated queries can be costly to run. So instead, we take the batches from the staging area that are still there and send them to the warehouse part for another chance, a retry. If the retry still fails, then it's done; there's nothing we can do. But in many cases, especially with random server failures, the retry succeeds, and we don't waste a lot of compute resources on those glitches. Next slide. And the final step is that the warehouse fetches the materialized result files and does the rest of the work. The main overhead basically comes from the insert and scan of those materialized result files; that's the main reason we want the fragments to be highly selective. Okay, next slide.
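Putting the queue, staging area, and checkpoint mechanics above together, here is a toy sketch of the bookkeeping, all names invented, including the retry-on-warehouse path for batches stranded by a failed server.

```python
queue = ["b1", "b2", "b3", "b4"]  # batches from the continuous scan set
staging, done = set(), []

def take_batch():
    """A flexible compute server picks up a batch; it is now in flight."""
    batch = queue.pop(0)
    staging.add(batch)
    return batch

def register_result(batch, result_files):
    """Registering materialized result files checkpoints the batch."""
    done.append((batch, result_files))  # goes to the materialized-result queue
    staging.discard(batch)              # checkpoint: batch marked finished

register_result(take_batch(), ["mrf1"])  # b1 completes normally
take_batch()                             # b2's server dies mid-batch
register_result(take_batch(), ["mrf3"])  # b3 completes normally
register_result(take_batch(), ["mrf4"])  # b4 completes normally

# Anything still staged gets a second chance on the warehouse
# instead of failing the whole (expensive) accelerated query.
retry_on_warehouse = list(staging)
print(done, retry_on_warehouse)          # b2 is retried
```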
So, this is an additional use case of query acceleration. For normal select queries, we compute the intermediate results in the fragments. But for certain insert or other DML queries, the entire plan can look like this: a large table scan, then a bunch of filters, projections, or any other operators that don't require data exchange, ending with an insert. For such a plan, as you can tell, the entirety of the plan can be accelerated directly. This provides broader eligibility, since you don't really need the reduction ratio to be high; the results are inserted into the target table anyway. That's, you know, a benefit of using shared storage, and it makes some jobs potentially serverless. This in particular benefits some of our internal jobs, like materialized view maintenance: they can be serverless and use whatever compute is available, without requiring a dedicated worker pool, which saves us a lot of cost. And it's also beneficial for customers, since this shape of query is actually very common in real-life workloads. Okay, so that's everything in my presentation.

Okay, thanks. So we have about six minutes for questions. Anybody in the class want to go for it? I've asked a ton already. I guess one thing, and you guys know the answer to this, but I think it's helpful for the students to hear: Kavinder was talking about how query optimization, or query compilation, is super hard and super important. If someone shows up at Snowflake and applies for a job, and they have a background in query optimization, is that a huge win for a potential applicant? Yeah, I think it helps to have that previous background. Having any sort of familiarity with databases is helpful. I don't think we look specifically for optimizer people to work on the compiler, but if you understand the concepts, it can help you go a long way. Got it. And it's also worth mentioning, I'm glad you mentioned FoundationDB: you guys basically have teams working on two data systems. There's obviously the Snowflake database itself, but there's also a team that works with the FoundationDB team at Apple; you guys are building, or maintaining, a distributed OLTP key-value store, which is a very interesting system as well. So you have the opportunity to work on both types of systems, and lots and lots of problems.