Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. So today, again, we're continuing down the path of discussing real-world systems that are implementing the concepts and ideas we've talked about. Before we jump into today's lecture, some administrative stuff. I posted some of this on Piazza, but I want to put it all in one place now. The final presentations for the project are going to be when we have our scheduled final exam. I think it's in this room. It's scheduled to start on Thursday morning at 8:30 a.m. We're not doing that. Let's start at 9 a.m. and we'll do donuts and bagels; we can decide what we want to do. And then the written final exam will be given out on April 24th, which I think is next Wednesday. Again, that'll be a prompt asking you a question that tries to encompass all the ideas we've talked about throughout the entire semester. So it's not multiple choice. It's not, what does this paper say, what does that paper say? Anybody can go figure that stuff out. It's more about synthesizing the ideas of all the various things we've talked about, and applying them to a new situation. Which, again, is the main thing you want to get out of this course, in addition to all the development stuff you guys are doing for the project: how to take all these ideas and see how they fit into the bigger picture of some data processing system. Again, that'll be due the same day we have the final exam. You just show up and hand it to me. And then we'll do what we did last year: you can use ChatGPT to help you answer the question. But obviously, if you just copy the prompt, or copy the output and put it right into your response without checking it, you're going to have problems. At least last year, when we put the question in, it invented stuff that wasn't true. So be mindful of that. Okay? And then we'll also do what we did last year. You're not penalized for this. You use ChatGPT, because why wouldn't you? It's the way things are going. But William will set up a Google form that I can't see, and you just tell us whether you used ChatGPT or not, and then I'll try to predict whether you did or not. Okay? That's fun. Any questions? Okay. So do you know what s**t is? You ever heard of the s**t programming language? Okay. All right. Just checking. I'll explain it afterwards. All right. So again, last class we were talking about Databricks Photon. That wasn't a full-fledged system; it was an extension library that the Databricks people added into Spark. The Java code running Spark SQL would invoke, through JNI, a C++ vectorized engine that would try to offload the computation and the expensive tasks of running a query in Spark SQL, push that down to C++, and they showed pretty significant performance gains. So today, when we talk about Snowflake, and we'll see this also with Redshift, these are going to be full-fledged systems that look and smell a lot like Dremel and the other systems we've talked about throughout the entire semester, and Yellowbrick as well. All right. So before we jump into Snowflake, it's important to take a step back and understand what the database landscape looked like at the time Snowflake came on the scene.
And a lot of this repeats things we've talked about throughout the semester. So in the 2000s, that's when the special-purpose OLAP systems, built just to run the kind of OLAP queries we've discussed all semester, came on the scene. For the most part, a lot of them were pushing this idea of a column store. Vectorized execution came along a bit later, but everyone was already doing column stores and compressed data, and then Vectorwise showed how to do vectorized processing on these things. Of all of these, Vertica and Greenplum are probably the two biggest ones. MonetDB we've talked about, and we'll talk about it again with DuckDB: the early version of DuckDB was MonetDBLite, a fork of that system, and we talked a little bit about that. Vectorwise we covered earlier in the semester. ParAccel we'll see again when we talk about Redshift, because Redshift wasn't written from scratch at Amazon; they bought a license to the ParAccel source code, and we'll see the transition from the shared-nothing system that ParAccel was into what Redshift is today. Of all of these, except for Vectorwise and MonetDB, these are forks of Postgres: they ripped out the storage layer and rewrote a lot of stuff to make it operate on analytical workloads. And at the same time as all this is going on, Hadoop became popular. Everybody was trying to shove a lot of data onto HDFS; these were data lakes before that term came to prominence. Hive we talked about, and Presto and Impala. Stinger we actually didn't talk about; it's basically the same thing as Hive: SQL on top of MapReduce and Hadoop. So all of these systems at the time, in their various forms, were supporting analytical workloads, but their primary distribution model, the primary way you got access to these database systems from the vendor, was by downloading them and running them on-prem. Meaning you would buy a license to the database system, but then you would provision the hardware and you would be responsible for running it on your local machines. And this is pretty much how people ran database systems for decades prior to this. So, as we talked about last week, the Dremel paper comes out and shows that, okay, you can build something in a cloud-native environment that runs on a bunch of files sitting in these object stores; the storage is no longer natively managed by the database system. Facebook also starts building Presto in 2012. ParAccel, as we said, we'll discuss next week when we talk about Redshift, but AWS buys a license to ParAccel in 2011 and then releases it in 2013 as Redshift. They actually beat Snowflake to market by a couple of months. But Snowflake was written from scratch and Redshift was based on ParAccel. I'll cover this next week: ParAccel was basically going bankrupt and they were hoping Amazon was going to acquire them, but Amazon only acquired a license to the source code. I forget who ended up buying ParAccel; I think it was Actian. So it's still around, but it's basically a zombie database.
So around this same time, there's a VC firm out of Silicon Valley called Sutter Hill, and they decided: hey, we're going to build a new cloud-native database startup. So they got these two guys who were very prominent engineers at Oracle, and then the Vectorwise developer from the paper you guys read, Marcin Żukowski. Vectorwise was going under at the time, so they got him, combined these three together, gave them a ton of money, and said: go build a cloud-native data warehouse, and we'll call it Snowflake. Sutter Hill is different from most of the VC firms you're probably familiar with, like Andreessen Horowitz, Kleiner Perkins, Greylock and so forth. Their model is: you have an idea, you go to them and pitch them, and say, hey, give me money to go build a startup. Sutter Hill is basically like a record label putting together a boy band: let's get some good-looking people together, get them in a room, we'll give them money, and we'll dictate what they should build. That's how Snowflake came about. And obviously Snowflake was super successful, because here we are talking about it 12 years later. Again, all these guys are super nice. Marcin is much younger than the two French guys here, but this should show you how hardcore they are about databases, probably just as hardcore as I am. This is actually Marcin's leg: he got a Snowflake tattoo after the IPO. That's dedication to databases. I haven't even gone that far. All right, so what is Snowflake? Snowflake is a managed OLAP database system written in C++ that only runs in the cloud. And again, this seems obvious today, but back then, 10 or 12 years ago, this was sort of unheard of for database systems. The Snowflake guys have told me they got a lot of pressure in the early days: hey, great system, can we download it and run it on-prem? We don't want to run in the cloud. And they said no. And then eventually everyone just moved to the cloud. If you're a new database startup today and you don't have a cloud offering, with some exceptions like DuckDB (though we'll see MotherDuck next class as well), it's very hard to gain traction. So everything is written from scratch, and the paper talks about how they considered: should we take things from Hadoop, or take things from Postgres like other systems have done? They decided that, in order to have complete control of everything, they wanted to write everything from scratch. They're going to do a shared-disk architecture, as we talked about before. And then one thing that's going to be different from what we saw in Dremel (and I think the Photon paper doesn't really talk about this, but Spark SQL doesn't really do this) is that they're going to do aggressive compute-side caching on the worker nodes themselves. Since they're not the cloud vendor, it costs them real money to go get things from S3, which is obviously also very slow. They want to do as much caching as they can on their side, on the nodes they're already paying for, to avoid those lookups to S3.
The other interesting thing is that rather than doing what most people do today, which is taking the Postgres SQL dialect, the grammar file, and using that as the basis for what kind of SQL you're going to support, they wrote everything from scratch. And if you ever look at the Snowflake documentation, the SQL looks a little Oracle-y, a little enterprise-y. I think that's the lineage of the two French guys coming from Oracle. So I always have to say this, and obviously there isn't any impropriety, but Snowflake actually sponsored this class, 721, before they went IPO. Ashish was there then and still works there. He stood literally here and presented the Snowflake architecture back in the day, so if you want, you can watch that guest lecture. And last year, when I was putting this lecture together, I had a bunch of questions, because it's a closed-source system and you can't always infer exactly what they're doing from the blogs and the documentation. So I actually got on a phone call with Ashish on a Sunday night, the day before the lecture, and while he was cooking in his kitchen he answered all my Snowflake questions. So what we'll talk about today is a combination of the documentation, the paper you guys read (there's another paper as well), some blog articles, and then just asking Ashish, while he's cooking spaghetti, what does Snowflake do? He's a good dude. Okay, so here are the high-level bullet points of everything we care about in Snowflake, related to the other systems. Not surprising: shared-disk, disaggregated storage. They were one of the first systems to commercialize and pursue this, but that's building off of Dremel and off of what HDFS and Hadoop were doing at the time. They do push-based vectorized query processing, relying on precompiled primitives, similar to Vectorwise and X100. Again, not surprising: Marcin was the guy who built Vectorwise. They're not going to do any codegen, except for serializing and deserializing data moving from one worker to another. They use LLVM for this, and it's a very limited thing: I have some data I need to serialize to a binary format, ship over the wire, and send to another node. Again, this is 2013, 2014; this is before Apache Arrow. So to make this fast, they have this little codegen piece that compiles the code to send data over. It's somewhat similar to what Protobuf would do for you. They're going to separate the table data from the metadata, and we'll talk a little bit about this, because it opens up opportunities for other organizations. Now it's sort of commonplace (you can run something like the Hive metastore, and Databricks has their Unity Catalog), but this is certainly different from a single-node system. There's not going to be an explicit buffer pool on every node; they basically have an LRU cache that decides when things move in and out, but nothing fancy. Like everyone else, they're going to use PAX-style columnar storage, because, again, they started before Parquet and ORC were the big thing that they are now.
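To make the PAX idea concrete, here is a minimal sketch in Python, under a simplified page model of my own (the names are made up for illustration): rows are grouped into a page, like a Parquet row group or a Snowflake micro-partition, but within the page each column's values are stored contiguously.

```python
# Toy PAX-style layout: row-oriented tuples pivoted into per-column arrays
# within a single page, so a scan that needs one column only touches that
# column's region of the page.
rows = [(1, "alice", 9.99), (2, "bob", 3.50), (3, "carol", 7.25)]

def to_pax_page(rows):
    """Pivot row-oriented tuples into one contiguous array per column."""
    return {
        "num_rows": len(rows),
        "columns": [list(col) for col in zip(*rows)],  # one array per column
    }

page = to_pax_page(rows)
prices = page["columns"][2]   # scan only the price column
print(sum(prices))            # 20.74
```

This per-column contiguity is also what makes dictionary and run-length encoding effective, since each array holds values of a single type.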
They're going to have their own proprietary storage format for managed data, but for the last five years or so they've been supporting all the open-source file formats we'd expect. I think they can use a sort-merge join, but primarily, most of the time, they'll pick a hash join, which, as we talked about, is almost always going to be better. And they have a Cascades-style query optimizer that tries to leverage the adaptive optimizations we talked about before. We're going to mostly focus on these things, but I'll sprinkle in discussion of the other topics as we go along. So the first thing: what does the architecture look like? They're going to run on disaggregated storage, just using an object store. The paper talks about how they made this decision early on when designing Snowflake: should they spend time building the storage layer, or should they give up control, use S3, and let Amazon handle that for them? Part of that decision was engineering effort: you can only do so many things when you first start building a new system. They decided they would rather focus on the execution engine, leverage worker-side caching (compute-side caching) to speed things up, and just let Amazon handle all the replication and storage durability guarantees you would need. I think that turned out to be a smart choice. I think the original version only supported S3, and now they support all the other major cloud vendors for their storage. So they're going to have this notion of... yes? The question is: they're running on Amazon, and Amazon has a competing product; isn't that a conflict of interest? Should you be afraid of Amazon trying to do something to screw you over? Snowflake's a big company, too, and Amazon gets a lot of business from Snowflake. So, how do I say this: think about, from a publicity standpoint, how terrible that would be for Amazon. Hey, we just screwed over one of our biggest competitors. That would immediately panic all the other companies that rely heavily on Amazon and make them go elsewhere. From a business standpoint, a short-term benefit with long-term problems is just not worth it. Now, there are some companies we've talked to, at least as a startup, where they tell us: we don't run on Amazon because we consider them a competitor, so they have to run on GCP or something else. But again, Amazon's not stupid. I like to think they would not make decisions like that. I will say, though, here's how Amazon does get around this. Even though they're not going to make S3 slower because they recognize it's a Snowflake request (that would be terrible, and it's not scalable), they can do other things. Actually, for Redshift, we'll see this: they can add hardware accelerators and other layers above S3, running in the same data center, that maybe Snowflake can't easily do, to accelerate certain things. They do this for Aurora as well: they have a layer above EBS that does transaction log propagation to the replicas for PostgreSQL and MySQL. You just can't get that if you're external to Amazon.
So there are other optimizations you can do. Question: in terms of adding support for Azure Storage or Google Cloud Storage later. Are you referring to, internally, they were using S3 for all their managed storage and then switched over to a mix of the three? Or are you saying something more like they added support for Apache Iceberg and now you can use any of these? The question is: when I say they added support for, say, Google Cloud Storage or Azure, does that mean that if you're a new customer and you start storing data in Snowflake, it gets spread across the different clouds? No. You tell them when you sign up: I want to run on AWS, I want to run on this. But it's still managed through them. Who's "them"? Snowflake. It's Snowflake's managed storage; you're just saying, use this other cloud. But why would you ever choose one of the other cloud vendors? Well, there are companies we've talked to, like a Canadian grocery store that said: we see Amazon as a competitor, we don't run on Amazon, we run on Google. People have various reasons. But can you integrate it with existing data? We'll get to that in a second; we're leading up to it. The original version of Snowflake was: okay, we're going to store things on S3, but the data we're actually storing inside the S3 buckets is in our proprietary format. Because at the time, that's how everyone built these data warehouses. Dremel was doing its own thing, but it wasn't until Parquet and other formats came along that people realized: oh, I can have some disparate system generate a bunch of these files, and I want to be able to scan them with my database system. That precipitated them having to support external tables and other ways to read data from S3 that wasn't ingested through the front end of the database system. Redshift's going to go through the same transformation as well. Next question: at this point, why does Snowflake not build their own S3? Above my pay grade, I don't know. Yeah, you'd have to build data centers. It would be expensive. And they're not stupid; I guarantee they did the back-of-the-envelope calculations and asked, does this actually make sense or not? I think Netflix did something similar: I think they were running on-prem and then they switched over to the cloud. The big companies are always trying to figure out, is it cheaper to do this? But if you're Snowflake: Snowflake is a huge company, but it's not as big as Amazon. How many data centers could Snowflake build? One a year, maybe? I don't know. Amazon's probably spinning up a new one every few months. With those economies of scale, they can just do it way more efficiently than anyone else. Question: the internal data between two virtual data warehouses, is that being sent directly? I know there's no in-memory shuffle. Is the reason that they don't have control over things like hardware accelerators, so they decided, hey, these are all EC2 instances anyway, might as well keep things in memory and share data between them? We haven't gotten there yet.
Your question is: why do they allow worker nodes to talk to each other directly rather than going through a shuffle phase? I think it's just a philosophical decision about how they wanted to build it. There are pros and cons to both. Actually, this is a good discussion, we can have it now. The shuffle phase is nice because there's this abstraction layer that I can just write things to, and it'll have some fault-tolerance guarantees that you otherwise wouldn't get. So there's that. But now there's a whole other service I've got to run, with additional nodes, and it's actually making another copy of the data. So there are pros and cons to all of that; there's no free lunch. Question: there are companies out there that still want some databases on-prem, or want to move workloads between the cloud and on-prem. Is there still a big market for that, or is everybody just in the cloud? That's not exactly an architecture question, but: is there still a big market for people running on-prem databases? Yes. But the sales cycles for those things are way different, because you've got to fly out there, talk to the customer, take them out to dinner, go golfing, like how things were done in the '80s. Whereas with Snowflake and the database-as-a-service model, it's: hey, here's our website, just give us your credit card and you're up and running. For small startups, sure, you can do that. But obviously a bank can't just be like, oh, here's the credit card, just do it. So no, there is still a huge market. But I think the slice of the market going to the cloud is growing at a much larger rate than people spinning up stuff on-prem. And don't think of this just in terms of cost. If I ran it on-prem, I could certainly run it cheaper than what Amazon would charge me for machines, but then you've got to pay for the humans to actually manage those things. So there are pros and cons to all of these. All right. So, the abstraction they're going to have: they have data storage, and they have this notion of a virtual warehouse. This is how they first designed it. You don't say an exact number of nodes; you say, I want this much compute capacity, and they give you a virtual data warehouse with an endpoint where you can start sending data and running queries. And when you turn on a virtual data warehouse, whether or not you're running queries, you're always paying for it. We'll see how Snowflake leverages that when they do flexible compute, because they can steal idle cycles from these warehouses. In 2022, they added support for serverless deployments, so now the virtual data warehouse basically spins itself down if you're not running any queries. But obviously they charge a premium for that, because now you're using more shared infrastructure at the cloud services layer instead of provisioning it yourself.
And then the cloud services layer is just the catch-all phrase for the front end of the system, encompassing all the things we've been talking about the entire semester: the coordinator, the scheduler, the catalog, the query optimizer. That's the entry point for queries. And as we'll see at the end of the class, the catalog is interesting because it's built on another database system, FoundationDB, which provides them transactional semantics for doing updates. So, within the compute layer, they have a distinction between a worker node and a worker process. And again, this is from 2012, 2013. A worker node, at least in the original version, is just an EC2 instance. This is before Docker, before Kubernetes: Docker is 2013, Kubernetes is 2014. In 2012 you had raw EC2 instances. And on that instance is where they maintain this local cache, on the attached storage device. They're not reading or writing to EBS; it's always the locally attached storage, which nowadays would be a local NVMe SSD. This cache is going to be a combination of the intermediate results you're generating for a query while it's running, as well as some of the persistent files you may be retrieving from S3. The idea is that if another query shows up and reads the same data you just read from S3, you can read it from the local cache, which is going to be faster and cheaper than doing a round-trip lookup on S3 or whatever the object store is. We'll see this in a second: the way they keep everything consistent, keep data spread out evenly, and scale up and down without reshuffling everything, as you would in a shared-nothing architecture, is by relying on consistent hashing to track which worker node is responsible for which persistent data in its cache. And then, within the worker node, when a query shows up, they fire off a whole new worker process, literally spawning a new process in the OS, and that process executes whatever the tasks are for the query. It can read and write intermediate results to other workers or out to S3, and when the query is done, the process ends and goes away. Yes? The question is about EC2: can you have an EBS volume mounted as the root device? Yes, you have to anyway, because you need an AMI image to boot from. But you still have a locally attached SSD, NVMe or whatever existed in 2012, that you can read and write locally. That's just another mounted file system, and it's going to be way faster than EBS. Okay. I don't know whether they've switched over to Kubernetes now. When we talk about Yellowbrick, that paper is all-in on Kubernetes, and they make a big deal about how they're designed to run on Kubernetes infrastructure. I assume Snowflake isn't dumb and is doing something very similar these days; whether it's exactly Kubernetes or something else doesn't matter. All right, so when a query actually starts running, they're going to do push-based vectorized execution, using the precompiled primitives, templated in C++ based on the different data types, that we've talked about. And we've already mentioned they only do codegen for serializing and deserializing data going from one worker node to another.
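To illustrate what "precompiled primitives" means here, this is a minimal Python sketch of the Vectorwise/X100 execution style; in Snowflake these would be templated C++ kernels, and every name below is hypothetical.

```python
# Toy push-based vectorized filter: one pre-built kernel per column type,
# chosen once at plan time, with no per-tuple type dispatch in the hot loop.
VECTOR_SIZE = 1024  # tuples processed per primitive invocation

def gt_i32(col, literal, sel):
    """Selection primitive specialized for 32-bit ints: col[i] > literal."""
    return [i for i in sel if col[i] > literal]

def gt_str(col, literal, sel):
    """Same logical operation, separate primitive for strings."""
    return [i for i in sel if col[i] > literal]

PRIMITIVES = {("gt", "i32"): gt_i32, ("gt", "str"): gt_str}

def run_filter(col_type, col, literal):
    kernel = PRIMITIVES[("gt", col_type)]           # resolved once, up front
    out = []
    for start in range(0, len(col), VECTOR_SIZE):   # push one vector at a time
        vec = col[start:start + VECTOR_SIZE]
        sel = list(range(len(vec)))                 # selection vector: all rows
        out.extend(start + i for i in kernel(vec, literal, sel))
    return out

print(run_filter("i32", [5, 1234, 9999, 7], 1234))  # -> [2]
```

The design point is that the per-type kernels are compiled ahead of time, so the engine gets tight loops without doing any query-specific code generation.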
And as I mentioned, they're not going to do an explicit shuffle between stages. Instead, the worker processes are allowed to push data directly to the next node that's going to process it, or keep it locally and keep processing it further up the pipeline, as needed. So that means that if a worker node holds all the intermediate results and it crashes or there's a failure, those results aren't replicated and aren't stored in an external service outside the worker node: the computation is lost. And unlike Dremel and Spark, where if one worker goes down the coordinator just invokes a new one or hands the task off to another worker, what Snowflake will do is kill the entire query and restart it from the beginning. And that's actually how people built OLAP systems back in the day. I mentioned this before with MapReduce: they were storing things on disk as they went along, but they had the ability to kill tasks, re-execute them, and do basically partial retry. In the Snowflake world, they're not going to add any of that infrastructure, because that's additional engineering complexity. They just made the decision: okay, one worker failed, kill the whole thing and restart. And who pays for the restart? The customer, right? So obviously they're not killing nodes randomly. They have a blog article where they discuss that when a worker dies, they have to identify: is this something we did, or is this a transient network failure? In some cases they can actually automatically roll you back to a previous version of Snowflake and rerun the query to see whether that solves the problem. The tricky thing, too, is that if you're ingesting new data, you want to make sure that the query, when it reruns, sees the same data it saw before. Question? The question is: how often do queries fail? Not that often. They have a blog article I should link; I don't know the exact number, but it's not, like, one out of ten. It's some way smaller fraction. Because, going back to the MapReduce world: Hadoop was saying, we're going to run this MapReduce job on thousands of cheap pizza-box machines, the one-unit rack machines. In that environment, yeah, one of them is going to go down; if a job runs for an hour, certainly one will. But here, not only are these queries running fast, they're running on a small number of machines anyway, so the likelihood of a failure is quite low. Yes? The question is: the more machines you have, the more likely one fails, so does that put some upper bound on how many machines you would need? Again, these systems are so fast that I don't think you need thousands of machines to process petabytes of data. MapReduce was slow because there wasn't any query optimizer, and it was just doing this dumb map-shuffle, map-shuffle thing over and over again.
No matter what the query was actually doing, it had no notion of pipelines, unless someone wrote everything inside a single map job. Whereas, at the time, the other parallel database systems, the Verticas of the world, needed fewer nodes to compute the same results in less time. So one thing Snowflake does do, even though there's no shuffle phase, is work stealing, similar to the morsel stuff we saw before. Instead of a coordinator, like in Dremel, recognizing that this task is running slow, killing it, and firing it up on another worker, here the worker processes themselves look for work to do. Each worker was told at the beginning which input files it needs to process for this stage of the query plan. If it runs out of stuff to do, it can do quick lookups on other workers running the same stage, see whether they're falling behind, and steal files from them. But, to avoid burdening a slow worker with sending data from its locally cached files to the node stealing the task, the stealer always goes out to S3. The idea is that if a node is already running behind because it's slow for some reason, and now another node says, okay, you're running too slow, give me the data you have, that just makes it even slower and makes things worse. So they make the conscious decision that the stealing node goes out to S3, even though they pay for it and even though a locally cached copy exists, because that avoids slowing things down further. And when the stealing worker gets the data from S3, it can put intermediate results in its local storage just like before, but it's not going to keep the stolen persistent files in its cache, because there's a higher-level organization, through this consistent hashing, that decides which worker node is responsible for which file on S3. So the next time a query reads the same data, it wouldn't go to the stealing node anyway; it would go back to the original one. Yes? The question is: you said the intermediate results are stored on local disk. If a node is working on intermediate results that exist only on its local disk, how can the stealer get those from S3? So: if worker node X is running behind, but it got its data from worker node Y, and now worker node Z is going to steal that work, Z can go to worker node Y and get those intermediate results. Or, if it's reading persistent files, it goes to S3. I think the worker nodes can also spill to S3 as well: if a node runs out of space entirely, the fallback is to store things in S3.
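Here is a rough sketch, with invented names and toy data structures, of the file-level work stealing just described: a drained worker takes unprocessed input files from the most backlogged peer at the same stage, but reads the stolen files from S3 rather than from the straggler's cache.

```python
# Toy work-stealing loop. The victim worker is not shown actually running;
# this only illustrates where the stolen bytes come from.
from collections import deque

def read_from_s3(f): return f"s3-bytes({f})"
def read_from_local_cache_or_s3(f): return f"bytes({f})"
def process(data): print("processed", data)

class Worker:
    def __init__(self, name, files):
        self.name = name
        self.todo = deque(files)       # input files assigned for this stage

    def steal(self):
        """A peer takes roughly half the files off the back of the queue."""
        n = (len(self.todo) + 1) // 2
        return [self.todo.pop() for _ in range(n)]

def run_stage(me, peers):
    while True:
        if me.todo:
            # normal path: own files may already be in the local cache
            process(read_from_local_cache_or_s3(me.todo.popleft()))
        else:
            victim = max(peers, key=lambda p: len(p.todo), default=None)
            if victim is None or not victim.todo:
                return                 # nothing left anywhere: stage done
            for f in victim.steal():
                # stolen files are fetched from S3, never from the victim's
                # cache, so the straggler isn't slowed down any further
                process(read_from_s3(f))

fast = Worker("A", ["f1"])
slow = Worker("B", ["f2", "f3", "f4", "f5"])   # the straggler
run_stage(fast, peers=[slow])
```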
His question is: how can a worker know the progress of another worker and identify that it's running behind? I don't know whether they ask the coordinator or ask the other worker directly. It's not broadcast: there'll be a heartbeat every so often, like, hey, I'm still alive, but a worker wouldn't broadcast its progress as it goes along. So it's either the coordinator, or you ask the worker: give me something to do. Okay. So Snowflake does work stealing, and what I like about this paper is that they describe why they did it this way; they go into a bit more detail than Dremel and Databricks do. The other interesting thing they support is what they call flexible compute. The idea here is that the original model of Snowflake was: you define this virtual data warehouse, which sets the number of compute nodes you get at the beginning, and, unless you're using serverless, those machines are always running. So it may be the case that for some query that shows up, you're actually under-provisioned: you don't have as much compute capacity as you need to run the query in a timely manner. So what they'll do is, prior to running, they look at the query plan and identify: is there any portion of this query plan that's going to take longer than I would want? And can I hand off the tasks for that portion to other nodes that this customer is not paying for? Basically, think of other customers' compute nodes that are sitting idle; they can farm out part of the query plan to those nodes. It's a win-win for everyone. The query runs faster. Snowflake isn't spending any more money, because the customers are already paying for the EC2 instances anyway. And the customer you're borrowing the machines from doesn't even know their machines are being used this way, and when they run queries, they can leverage the same spare capacity themselves. So say on this side of the query plan, on the probe side of this join, it wants to do some large scan. Snowflake can split this up into two subplans that get combined together with a UNION ALL. Here we have the portion of the query that runs on the compute nodes of the customer that initiated the query, but this piece over here runs on the spare hardware: idle machines running in the Snowflake ecosystem. But because those machines aren't controlled by the customer invoking the query, you can't write any intermediate results to their local disks, because at any moment the customer who owns those machines could start running a query and you'd have to evict everything right away. It doesn't look good if it's like: hey, you want to run your query? Give me 20 seconds, I've got to finish up Joe's query first. People would get pissed, right? So instead of writing the data to local storage, they insert it as if it were a table back into S3.
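A minimal sketch of that plan rewrite, assuming a toy plan representation of my own (the Scan and Fragment types are made up): the scan's micro-partitions get split between the customer's warehouse and the spare nodes, the two fragments are glued together by a UNION ALL parent, and the spare fragment materializes its output to S3 instead of local disk.

```python
# Toy flexible-compute split of one large scan into two UNION ALL inputs.
from dataclasses import dataclass

@dataclass
class Scan:
    table: str
    partitions: list      # micro-partition file names

@dataclass
class Fragment:
    plan: Scan
    runs_on: str          # "customer" or "spare"
    output: str           # "pipeline" (stream upward) or "s3" (materialize)

def split_scan(scan, spare_share=0.5):
    """Split a scan's partitions between owned and borrowed compute."""
    cut = int(len(scan.partitions) * (1 - spare_share))
    return [
        Fragment(Scan(scan.table, scan.partitions[:cut]), "customer", "pipeline"),
        # the spare fragment writes its result to S3, since those machines
        # can be reclaimed by their owner at any moment
        Fragment(Scan(scan.table, scan.partitions[cut:]), "spare", "s3"),
    ]

union_all_inputs = split_scan(Scan("lineitem", [f"part_{i:03d}" for i in range(8)]))
for frag in union_all_inputs:
    print(frag.runs_on, len(frag.plan.partitions), "->", frag.output)
```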
And then, when I go to read it again, the query operator above it just reads from S3 as if it were a regular table. Actually, going back here: this plan is actually from Snowflake, and it's another example of the sideways information passing stuff we talked about before. You build the hash table for the join, and they have this operator that passes the Bloom filter from the build side over to the probe side. They explicitly call it out as a separate operator in the plan, called a join filter.
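Here is a small, self-contained sketch of what a join filter does, using a toy Bloom filter of my own; the real operator lives inside Snowflake's C++ engine, so this is just the idea. The build side populates the filter with its join keys, and the probe side uses it to discard rows that cannot possibly match before they ever reach the join.

```python
# Toy Bloom filter passed "sideways" from the build side to the probe side.
import hashlib

class BloomFilter:
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= (1 << p)

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

# Build side (small dimension table) populates the filter...
bf = BloomFilter()
for key in [10, 20, 30]:
    bf.add(key)

# ...and the probe-side scan applies it before the join, dropping rows
# (false positives are possible, false negatives are not).
probe_rows = [(10, "a"), (15, "b"), (20, "c"), (99, "d")]
print([r for r in probe_rows if bf.might_contain(r[0])])
```

With zone maps on the join key, the same filter can in principle rule out whole micro-partitions, not just individual rows.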
Again, this is only possible because it's managed storage, not on-prem: there's elasticity in the resources available, and they smartly recognize, okay, these machines are idle right now, let me use them to make queries run faster. And this is all transparent to the customer. You can also use this for what is basically query-result caching. It's not exactly a materialized view, because it's not going to automatically refresh, but since you're writing the output of this query plan fragment into S3, you can update the catalog and say: if we ever see this query plan fragment again on these files, here's a materialized result you can reuse. Yes? The statement is: wouldn't this be concerning for customers who don't want their data mixed with anyone else's? Well, one, they have compute isolation because, as I said before, the worker process gets killed after the query, so it's not like whatever task ran for your query wakes up and starts seeing the next customer's data. And it's a managed service; it's not running arbitrary code, it's all Snowflake code. If you're trusting Snowflake with your data anyway, you can trust them to run the compute side of things. It's not that big a secret, and to your original point, I don't think it's a real concern. I think you can opt out of this if you really want. This is not controversial: it's running Snowflake code. There are other things to be worried about. And it's like the give-a-penny, take-a-penny thing. Right now my data warehouse is idle, so sure, someone else can take advantage of it; then when I need it, I can leverage somebody else's, and it all works out. Yes? The question is whether anyone has done something like this before. There was a system out of the '80s that I actually worked on during what I call my pre-doc, this thing called Condor, and it was called a cycle scavenger. You had a bunch of machines in your computer science department; at night it would recognize that no one had touched the keyboard or the mouse, and it would start running compute-heavy jobs on those machines. When you came back the next morning and started using the mouse, it would stop. So the idea of cycle scavenging is not new. Specific to databases, I think what's different here is that because it's in the cloud, there's a single giant pool of compute capacity under Snowflake's control that lets you do this. Amazon does it differently: Amazon has spare EC2 resources and tries to sell them off as spot instances at a lower price, until, of course, someone else wants to pay a higher price. In terms of other database systems doing something similar, nothing exactly the same comes to mind, but taking advantage of idle resources is not a new idea. Because they're in the cloud, it opens up opportunities you would not easily get otherwise. Okay. So, as I said before, they're going to rely on some object store, originally S3. But of course there are downsides, because it's going to be slower. You can't do kernel bypass; you've got to go over the network, make an HTTPS call to their API, and the data has to get encrypted and decrypted when it comes back. That's expensive. We'll see next week, when we talk about Yellowbrick, that they're super hardcore about this: they wrote their own S3 drivers that use kernel bypass to talk to S3 as fast as possible. I don't know whether Snowflake does something similar, but you can imagine they have enough money that they could. So instead, what Snowflake does, rather than building their own object store or storage layer, is build their own caching layer on the worker nodes and make that as fast as possible. The benefit is that if they do a really good job caching, they pay Amazon less money because they make fewer requests to S3, and the queries also run faster because you're not going to S3. It's a win-win for everyone except maybe Amazon, but they have enough money. I think this was a smart engineering decision. Question: is it a separate layer of nodes that just acts as a cache? No. The worker nodes themselves each have a local cache, and if that cache fills up, they can spill to S3 if it's an intermediate result. And they're going to prioritize (another paper talks about this; I don't think the paper you guys read does) keeping the intermediate results local over maintaining the persistent files, because the intermediate results are ephemeral: you want to get at them really quickly to make queries run faster, so you want to use as much of your local storage for them as you can. And they're not doing anything fancy: they just use LRU as the cache replacement policy, and that's good enough for their environment. Yes? So it's not like a traditional buffer pool manager with two layers, either in memory or on disk; it's multi-layered? Yes: it's either in memory, on disk, or in S3.
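A minimal sketch of the cache policy just described, with made-up sizes and keys: plain LRU within each pool, but eviction drains cached persistent files before it ever touches intermediate results.

```python
# Toy worker cache: two LRU pools, with persistent files evicted first.
from collections import OrderedDict

class WorkerCache:
    def __init__(self, capacity):
        self.capacity, self.used = capacity, 0
        self.pools = {"persistent": OrderedDict(), "intermediate": OrderedDict()}

    def get(self, key, kind):
        pool = self.pools[kind]
        if key in pool:
            pool.move_to_end(key)      # refresh LRU position
            return pool[key]
        return None                    # miss: caller fetches from S3 / recomputes

    def put(self, key, kind, size):
        # Evict persistent files first; touch intermediates only as a last resort.
        for victim_kind in ("persistent", "intermediate"):
            pool = self.pools[victim_kind]
            while self.used + size > self.capacity and pool:
                _, (_, vsize) = pool.popitem(last=False)   # evict LRU entry
                self.used -= vsize
        self.pools[kind][key] = (key, size)
        self.used += size

cache = WorkerCache(capacity=100)
cache.put("s3://tbl/p1", "persistent", 60)
cache.put("q42/hashtable", "intermediate", 50)   # evicts the persistent file
print(list(cache.pools["persistent"]), list(cache.pools["intermediate"]))
```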
And I think they talk about, at least in a paper from 2020, back when Intel was putting out those persistent-memory devices, that persistent memory was seen as yet another layer: you had DRAM, then persistent memory like Optane, then SSD, then S3. Intel killed off Optane, so that's not a thing anymore, but they talk about how having a holistic view of a multi-level cache is something they're thinking about. Okay, so again, the original version of Snowflake, or I guess by default when you put data into Snowflake, uses their own proprietary storage format. This is before Parquet, before ORC, but at a high level it looks, as students who've seen it have told me, basically equivalent to Parquet and ORC. It uses PAX. It's columnar storage. It does dictionary encoding, and I think run-length encoding. So there's nothing dramatically different or special about what they're doing there, other than it being proprietary to them. They're not going to give you the binary files, because there are no external readers for these things. But one thing they do is break up any data that shows up into what they call micro-partitions, which are roughly like the row groups we talked about in Parquet. The original data for a micro-partition ranges from 50 to 500 megabytes, but after all the compression, including, I think, a block-based compression like Snappy or Zstandard, a micro-partition comes down to about 16 megabytes. Then, in the background, they periodically check whether the clustering of these micro-partitions is actually ideal, and they can reorganize and re-sort them based on what join keys or access keys people are using in their queries. So there's extra work they do in the background to optimize the storage when it's in their proprietary format. That's different from what we talked about with Databricks, Spark SQL, and Dremel, where they just assume people are putting random files on S3: they don't have the ability to go rewrite, modify, and reorganize them, and they have to run queries directly on the files as they exist. Whereas Snowflake, using its internal format, can spend the extra cycles to speed things up. But we'll see they also had to support external tables, where they can't do this; I'll talk a little bit about how they handle that. Now, one thing that's interesting about Snowflake's proprietary format is how they handle semi-structured data. They have three types that are unique to Snowflake: Variant, Array, and Object. Variant basically means anything: any kind of JSON hierarchy, or XML, something like that. Array is as it sounds: an array of values of arbitrary length. And Object, I think, is the degenerate case of Variant, a single-level hierarchy, whereas Variant can go to any arbitrary depth.
So, in the case of the Dremel paper, they talked about how they were processing all these protobuf files internally at Google. Well, if it's a protobuf file, you have the schema: you know the data types of the fields inside, so you know how to convert them into the proper binary format and do the shredding, breaking out the separate columns as we talked about. In the case of Databricks and Photon, they didn't have the schema for these files, so the way they handled it was with runtime adaptivity while the query was running: they would switch which primitive they used, saying, oh, I know I'm processing Unicode data, or ASCII data, or date data, versus random numbers. They were trying to learn, while running the query, what the data type actually was for the different fields. What Snowflake does is different: they figure things out upon ingestion. Again, this applies when you use their proprietary format: you're calling INSERT, or you're bulk loading some file, so as they're processing it and putting it into their internal format, they figure out the data types of the different fields. They'll do things like: if they identify a string with a year, a month, and a day, they'll parse it, recognize that it's in a proper date format, and convert it automatically into their binary date format. And they maintain the original unparsed version of all the strings in your JSON file, or whatever it is, in case they get it wrong, like if someone puts a poop emoji in there and you have to fall back and say, okay, this is actually not what I thought it was.
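Here is a toy sketch of ingestion-time typing for a semi-structured field, under an assumption of my own for illustration: a field is stored as a native date column only if every value parses cleanly, and the raw strings are kept as the fallback.

```python
# Toy shredding of one VARIANT field at load time: infer a stronger type,
# store the converted column, retain the raw values in case the guess is wrong.
from datetime import date

def try_parse_date(s):
    try:
        y, m, d = s.split("-")
        return date(int(y), int(m), int(d))
    except (ValueError, AttributeError):
        return None

def shred_field(raw_values):
    parsed = [try_parse_date(v) for v in raw_values]
    if all(p is not None for p in parsed):
        # every value parsed cleanly: store as a native date column
        return {"type": "DATE", "column": parsed, "raw": raw_values}
    return {"type": "STRING", "column": raw_values, "raw": raw_values}

print(shred_field(["2024-04-15", "2024-04-16"])["type"])   # DATE
print(shred_field(["2024-04-15", "not-a-date"])["type"])   # falls back to STRING
```

The payoff is that the parsing and conversion happen once, at load time, instead of on every query over the same data.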
So again, this is interesting: now you start to see the differences between these systems. Dremel does it one way, Photon does it another way, Snowflake does it upon ingestion. At a high level, they're all doing vectorized query execution on top of an object store, but the low-level details and nuances are slightly different. I'm not saying this is a good idea or a bad idea. I think if it's proprietary storage and you get the data as it comes in, yes, you should definitely do this, because then you don't need to redo the work of figuring out the data types over and over again while running queries. You do the parsing once, and you get all the advantages of compression and encoding. Yes? The question goes back to something asked in the Dremel lecture: Dremel's runtime type discovery is expensive, so is Snowflake's performance gain here that the conversion for the semi-structured data is already done? Right. So, when we talked about Dremel: rather than have people put data into a proper schema ahead of time, declaring exact data types and setting all that up when loading, so that queries run faster, Dremel decided that from an engineering and human-time perspective it was better to just let people store whatever, and at runtime the query engine figures out what the data types are. Snowflake is sort of the opposite. Well, it's not exactly the opposite, because it's JSON: you can throw any JSON you want at Snowflake, but they figure out the data types when you load it. At a high level they're achieving the same thing. Both say: throw whatever data at us and we'll figure it out. Snowflake figures it out when data is inserted; Dremel figures it out while the query is running. And again, if it's Parquet files or random JSON files sitting in S3, not in your proprietary format, then you've got to do what Dremel or Photon does. Ideally you do both: if you know some of your data is not going to be ingested directly into your file format, you need what Dremel does. All right, so, the consistent hashing stuff we mentioned before: this is how they organize the system to figure out which worker nodes are, quote-unquote, responsible for, the owners of, a micro-partition file for a table. We covered consistent hashing in the intro class. The basic idea is that you have a ring of nodes, and when you insert a new node into the ring, you only move the files from its neighbor, rather than reshuffling everything as you would with naive hashing. That means when a query shows up, the catalog looks at this hash ring, figures out which workers are responsible for which files, and when it hands out tasks it tells them: here are the files you need to process. It knows which workers can compute on that data, and it knows the likelihood that they'll have it locally cached, because the worker node responsible for a persistent file is the only one that can maintain a long-term cache of that data.
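A minimal consistent-hashing sketch (real deployments typically add virtual nodes for balance, which I omit here): files map to the first node clockwise on the ring, so adding a node only takes over files from one neighbor instead of reshuffling every assignment.

```python
# Toy consistent-hashing ring mapping micro-partition files to worker nodes.
import bisect, hashlib

def h(key):
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def owner(self, filename):
        keys = [p for p, _ in self.points]
        i = bisect.bisect(keys, h(filename)) % len(self.points)
        return self.points[i][1]       # first node clockwise from the file

    def add_node(self, node):
        bisect.insort(self.points, (h(node), node))

files = [f"part_{i:02d}" for i in range(10)]
ring = Ring(["worker-1", "worker-2", "worker-3"])
before = {f: ring.owner(f) for f in files}
ring.add_node("worker-4")              # scale up: only some files move
after = {f: ring.owner(f) for f in files}
moved = [f for f in files if before[f] != after[f]]
print(f"{len(moved)} of {len(files)} files changed owner")
```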
And when you add new compute nodes, which they say their customers do all the time, you don't have to fetch everything all over again from S3 or pass every worker node all of its files; you can retrieve things in a more fine-grained manner. This part is unique to Snowflake, and I think it's clever: this is the right way to do it if you're going to build an OLAP system like this. All right, so the query optimizer is a unified, Cascades-style, top-down optimizer. If you read the paper and the documentation, they refer to the query optimizer as the compiler. As I said before, that's a remnant of the vernacular of the 1970s: when people built the first C compilers, they were taking a high-level language like C and converting it to assembly. Same idea with SQL: you're taking a high-level language and converting it to the machine code, the executable code, of a database system. So for historical reasons, Snowflake calls theirs a compiler. Just like Dremel and Databricks Photon, they assume they're not going to have good statistics. The paper you guys read is from before they had external tables, but with external tables you potentially know nothing about the files; and even if the data was ingested into their proprietary storage, they assume the stats can be garbage, changing, and stale over time. So they built their optimizer to operate as much as possible without relying on high-quality statistics. They use some of the heuristic-based techniques we talked about before, like: if it's a star schema, do certain things, versus other organizations of the schema. The big goal the optimizer is trying to achieve at the beginning is deciding which micro-partitions, which files, it can throw away as soon as possible, before it even starts running. If you have some basic stats, like zone maps in the catalog, you can look at the query plan, look at your predicates, and decide: these are the files that could never contain the data I need, so go ahead and skip them. And like the other systems we talked about, they still rely on runtime adaptivity to adjust query plans as needed; we'll look at one example of what they do. So if you insert data into Snowflake and it ends up in their proprietary format, they maintain zone maps: min/max ranges for each column within each micro-partition. They don't maintain any histograms, and they don't maintain any sketches, and you only get this when using their proprietary format. So they have some really basic information, and at runtime they have triggers that decide whether to adjust things as needed. But one of the challenges they face, and this is a nuanced topic that only comes up if you're actually building a query optimizer, is that if you need to figure out which micro-partitions to skip based on the statistics you do have, you now have to start reasoning about your expressions to decide whether a micro-partition could possibly contain qualifying data. Simple things, like WHERE col > 1234, a single column by itself, sure: you can answer that with the min/max ranges in the zone maps of each micro-partition.
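Here is a tiny sketch of that kind of pruning, with invented micro-partition metadata: each partition carries per-column min/max values, and the planner keeps only the partitions whose range could possibly satisfy the predicate.

```python
# Toy zone-map pruning for WHERE price > 1234, before any data is read.
micro_partitions = [
    {"file": "mp_001", "zone": {"price": (1, 900)}},
    {"file": "mp_002", "zone": {"price": (1000, 2000)}},
    {"file": "mp_003", "zone": {"price": (500, 1500)}},
]

def prune_gt(parts, column, literal):
    """Keep partitions that *might* contain rows with column > literal."""
    return [p for p in parts if p["zone"][column][1] > literal]

# mp_001's max is 900, so it can be skipped outright.
print([p["file"] for p in prune_gt(micro_partitions, "price", 1234)])
```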
For something simple like WHERE column > 1234, a single column by itself, yeah, you can use the min/max ranges in the zone maps of each micro-partition to figure that out. But when you start writing more complex expressions, like column1 + column2 > 1234, now you kind of need to evaluate the expression to figure out what it actually covers. Or if you have a function, like truncating a date, extracting the year, and checking whether it equals 2024: if you just look at that blindly, without understanding the semantics of what it's actually doing, how could you ever reason about it? You'd have to go execute the function. What you really want is to be able to rewrite it into something you can check against the min/max ranges directly.

So they talk about how, rather than have two separate code bases, one for expression evaluation at runtime for running queries and one for expression evaluation inside the optimizer, they leverage the same code base for both, so that you're guaranteed to have the same semantics. Except you have to be mindful that you're not actually processing real data, or even sampled data: you're trying to reason about what the expression could possibly evaluate to. From an engineering perspective this is actually quite difficult, because you have to deal with things like the NULL semantics of whatever the data could be. I know in MySQL, for example, when they see a nested query that should produce a scalar value, in some cases the query optimizer will literally stop optimization, go run that subquery in the execution engine, get back the result, and inject it back into the query plan as a constant. That's an extreme case, but it's because they don't have a way to evaluate expressions directly inside the query optimizer; at least as of a few years ago, you could only use the execution engine within MySQL itself. So to avoid having to run queries just to figure out how to plan a query, Snowflake repackaged the expression evaluation engine into something that can be applied on top of statistics. Again, I'm being a bit hand-wavy here; this is hard, trust me, and anyone who has worked on a cost model will tell you the same.
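As a hedged sketch of that rewrite idea: a predicate like EXTRACT(YEAR FROM d) = 2024 can't be checked against a zone map's min/max on d directly, but the equivalent range predicate on the raw column can be. The function names below are made up for illustration.

```python
# Sketch of rewriting a function predicate into a zone-map-friendly range
# (illustrative; this is the idea, not Snowflake's rewrite engine).
# "EXTRACT(YEAR FROM d) = 2024" becomes "d >= 2024-01-01 AND d < 2025-01-01",
# and a partition whose [min, max] misses that range can never match.
from datetime import date

def year_eq_to_range(year: int) -> tuple[date, date]:
    return date(year, 1, 1), date(year + 1, 1, 1)  # half-open [lo, hi)

def partition_might_match(d_min: date, d_max: date, year: int) -> bool:
    lo, hi = year_eq_to_range(year)
    return d_max >= lo and d_min < hi  # the two ranges overlap

# A partition covering 2022-03-01 .. 2023-11-30 can be skipped for year = 2024:
print(partition_might_match(date(2022, 3, 1), date(2023, 11, 30), 2024))  # False
```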
Okay, so the one adaptive optimization they do, which I don't think is unique to Snowflake, but they make a big deal about it and it's kind of cool, is this. After they figure out the join order using basic heuristics and basic cost-model estimates in the query optimizer, they then want to decide when it's appropriate to push aggregations down below the joins. You want to do this when you recognize that the amount of data you'd process in the join could be reduced significantly by doing a partial aggregation right below it. In this example, they recognize that the aggregation can be partially computed on the probe side of the join, so that when the join runs, it's joining on the aggregation key rather than on all the keys coming out of the table scan. So you push an aggregation-child operator down above the table scan and update the top operator to be the aggregation parent. For some things this is very easy, like min and max, where you don't have to worry about duplicates; but for counts and averages you have to account for the partial results, and the parent aggregation at the top is a bit trickier to get right.

The way they do this is that, under the right conditions, the query optimizer always injects these special push-down aggregation operators into the query plan, but they're disabled by default. Then, just like the adaptive query optimization we talked about before, there's a trigger: if the amount of data coming through the operator is larger than the cost model anticipated, it enables the aggregation plan node instead of leaving it as a pass-through.

Yes, two questions. Question: why would you not always want to do this? Because computing the aggregation may itself be expensive. It depends on the number of distinct group-by keys; the case to think about is when the number of distinct values in the grouping column is equal to the number of tuples, in which case pre-aggregating buys you nothing. Your second question: do other database systems do this? I think Dremel might; the blog article mentions this.

Question: how do they cost the joins if they don't have sketches or detailed statistics? Again, you have a rough estimate of the data that might come out of the table scans, based on how many micro-partitions you can prune and the zone-map statistics. How they do cardinality estimation on the output of a join, I don't know. They probably do what Dremel does: recognize, oh, I have a star schema, so make the dimension tables the build side of a hash join and stream the fact table through as the last pipeline. A simple trick like that gives you a lot of benefit. But I don't think they're reordering joins at runtime; I don't think anybody does that. Comment: I thought they were doing distinct-value estimation; could that be what it says? I think they do distinct-value estimation, but after the join the stats are all garbage anyway.

Snowflake loves talking about this optimization. There's a blog article about it from last year, actually written by Bowie, who was my student, took 721 in 2016-2017, and went off and built this piece of Snowflake, which is pretty cool. I think the blog article mentions that a couple of other systems do this, but I don't know if they name names.
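Here's a rough sketch of what such a disabled-by-default operator could look like. This is my illustration of the trigger idea, not Snowflake's actual operator; the threshold logic and interfaces are invented.

```python
# Sketch of an adaptive partial-aggregation operator (illustrative only).
# The optimizer plants it below the join, disabled; at runtime, if more rows
# flow through than the cost model predicted, it switches itself on.
# MIN/MAX merge trivially, but COUNT must sum partial counts, and AVG must
# be carried as (sum, count) pairs so the parent aggregation stays correct.
class AdaptivePartialAgg:
    def __init__(self, threshold_rows: int):
        self.threshold = threshold_rows   # cost-model estimate of input size
        self.enabled = False
        self.seen = 0
        self.groups = {}                  # key -> [partial_sum, partial_count]

    def process(self, key, value):
        self.seen += 1
        if not self.enabled and self.seen > self.threshold:
            self.enabled = True           # trigger fired: start pre-aggregating
        if not self.enabled:
            return (key, value, 1)        # pass-through: one row per input row
        acc = self.groups.setdefault(key, [0, 0])
        acc[0] += value
        acc[1] += 1
        return None                       # buffered; flushed when input drains

    def flush(self):
        # Emit (key, partial_sum, partial_count); the aggregation parent
        # above the join combines these to finish COUNT/SUM/AVG.
        return [(k, s, c) for k, (s, c) in self.groups.items()]

op = AdaptivePartialAgg(threshold_rows=2)
rows = [("a", 10), ("a", 20), ("b", 5), ("a", 30)]
emitted = [r for r in (op.process(k, v) for k, v in rows) if r is not None]
print(emitted + op.flush())  # early rows pass through, later ones pre-aggregate
```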
Okay, so that's the high-level overview of what Snowflake does. Again, the idea here is to compare and contrast the nuances of what they're doing against Dremel, Photon, Redshift, Yellowbrick, and the others we'll cover, and I'm trying to highlight the parts I think are interesting.

So I want to go back to the thing I mentioned at the end of last class, about how Databricks came out with Photon, made a big announcement, had the SIGMOD paper, and also announced that they had audited TPC-DS numbers showing they were the fastest implementation ever. Along with this they put out a blog article announcing the paper and the new world record in audited TPC-DS, and in that blog article they included a graph comparing Databricks against Snowflake, run by researchers at the Barcelona Supercomputing Center: a power run of TPC-DS, which I think is the standard sequence of TPC-DS queries meant to push a database really hard, at the 100 TB scale. That blog article came out in November 2021. Two or three weeks later, the Snowflake guys came out and started going on about how the Databricks guys didn't run Snowflake correctly: their results are garbage, don't trust what they're saying, Snowflake is actually faster and not as expensive as Databricks claims. For example, Databricks ran on the enterprise version of Snowflake, but because they weren't using any of the enterprise features, they could have run on the regular version of Snowflake and cut the cost quite significantly. These are the two founders, the two French guys I mentioned at the beginning, the two who came from Oracle, coming out and saying all the Databricks numbers are garbage.

Well, two days later the Databricks guys came out again and said: we stand by our numbers, and the Snowflake guys are being disingenuous. What's really going on is that the numbers Snowflake published are for their own dataset. In the Snowflake blog article, they told you how to sign up for a Snowflake account, access their TPC-DS dataset, and run the experiment yourself to see why the Databricks numbers are garbage; that's what their result shows. But what the Databricks guys pointed out is that the dataset Snowflake is running on is in Snowflake's internal proprietary storage format, after they've already run that micro-partition rebalancing optimization we talked about before. So the data has been, not cooked, but prepared: it's been ingested through the system, and they've done the extra steps to get it into a form that is ideal for them. If you instead take the raw dataset from the TPC-DS data generator and run it on Snowflake without any additional preparation, you get the much worse result that the Barcelona researchers reported; whereas on Databricks, with no preparation, you get the number they published.

And in the official TPC-DS rules, you actually have to include the data preparation time in your measurements. Think about it: if my query runs for one minute but I spent 24 hours compressing the hell out of the data and re-optimizing it as much as possible, I have to report the 24 hours plus the one minute. That's what the Databricks guys are arguing: Snowflake's reported time should include whatever that preparation is, and if you just throw the raw files at it, the worse number is what you actually get. Right?
So I like to be the Switzerland of databases; I want to get along with everyone. But when this came out, setting the fight itself aside, I would say this was a big win for Databricks. Before this, Spark SQL was certainly not as sophisticated or advanced as Photon and the other systems at the time, but this put them in a different light. It showed you can use Databricks as a high-performance data warehouse, and it put them on equal footing in a competitive market against Snowflake and the other systems around at the time. So this was a win for Databricks. This was 2021, yes, during the pandemic.

Question: have they come out with any comparison since then? No, I don't think so. There's always some battle like this between vendors; the time-series database people fought over benchmarks the same way.

Question: why is it hard to re-run the Databricks numbers? Can't you just load the TPC-DS data and run the benchmark? Yeah, so if the raw files are in Parquet and you do zero preparation on both sides, that would be an apples-to-apples comparison. But here's where things get weird. It's a well-known fact that, over the years, the various database vendors have built in optimization tricks that are specific to the TPC benchmarks. The system recognizes, oh, you're accessing the new-order table or the warehouse table, I know what benchmark you're running, and gives you certain query plans or certain optimizations that you would not normally get. It's like the Volkswagen thing. Well, that was worse, because that was polluting the environment; this is more like gaming a speed test. Actually, no, it really is the Volkswagen thing: their cars would recognize, I'm running the emissions test, and run in a more optimized mode, but out in the real world they polluted a lot more. There are tricks like this you can do in TPC-C, for example. You're told to create an index on the warehouse table, but there are only a hundred warehouses, and the number of warehouses doesn't increase while the benchmark runs, so you don't need a full-fledged B+tree: you can make a static array and do really fast lookups (see the sketch below). But a normal workload would never get that optimization. So there are well-known tricks like this.
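As a sketch of that particular trick, under the assumption of TPC-C's fixed warehouse count; this is illustrative, not any vendor's actual code.

```python
# Sketch of the benchmark-specific trick described above (illustrative).
# TPC-C fixes the number of warehouses for the whole run, so instead of a
# general B+tree index on warehouse id, a system could use a dense array:
# O(1) lookups with no tree traversal. This is only safe because the
# benchmark guarantees the table never grows.
NUM_WAREHOUSES = 100

# warehouse id (1..NUM_WAREHOUSES) -> row, stored in a plain static array
warehouse_rows = [None] * (NUM_WAREHOUSES + 1)

def insert_warehouse(w_id: int, row: dict) -> None:
    warehouse_rows[w_id] = row

def lookup_warehouse(w_id: int) -> dict:
    return warehouse_rows[w_id]  # single array index, no B+tree descent

insert_warehouse(42, {"w_name": "WH-42", "w_tax": 0.07})
print(lookup_warehouse(42)["w_name"])
```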
Okay, let's finish up. As we've said many times, Snowflake started off with proprietary storage, but the world of data lakes has evolved, has expanded, and over the years they've added support for accessing data that isn't in their proprietary format. It first started with this thing called Snowpipe, which was basically a Kafka endpoint that let you continuously ingest data, in Apache Avro format I believe, which did actually get written into their proprietary format. Then in 2021 they added external tables. I actually don't know what those look like internally, because the definitions just say it's such-and-such data format; so I don't know what they're doing other than that they can read from external files. Then, to read Parquet files, they added support for Iceberg in 2022, which we talked about last class: it's basically Parquet files with additional metadata to keep track of schema information, and you can do simple inserts and updates on those files.

And then in 2022 they also announced support for what they call hybrid tables. This is a service that I think is still called Unistore, and it's basically a full-fledged transactional database system, a row store, running inside the Snowflake ecosystem, on which you can run SQL queries, TPC-C, and other transactional workloads. What happens is the data gets inserted in a log-structured format as rows, and then in the background they run compaction and convert it to columnar data stored in the Snowflake proprietary format. So when you run an OLAP query against a hybrid table, it's basically the fractured mirrors approach: the query shows up, and the execution engine has to recognize, some of this data I can get from the column-store side, but I also need to merge it with data I have on the row-store side, and it presents a single consistent view over all of it (see the sketch below). This is partly in response to things like Delta Lake; actually, I think they support Delta Lake now as well. But the idea is, again, ingesting data from different sources: it's one additional way to get data into Snowflake, and now you can run your transactional application on top of it too.
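Here's a minimal sketch of that fractured-mirrors read path, with made-up tables and rows; the real system's versioning and visibility rules are more involved.

```python
# Sketch of the fractured-mirrors read path described above (illustrative;
# not Snowflake's Unistore implementation). Fresh writes land in a row-store
# delta; background compaction moves them into the columnar base. A scan
# must merge both sides, letting the delta's newer versions win.
columnar_base = {1: {"name": "alice", "balance": 100},
                 2: {"name": "bob",   "balance": 250}}   # compacted, columnar side
row_delta     = {2: {"name": "bob",   "balance": 300},   # newer, uncompacted writes
                 3: {"name": "carol", "balance": 50}}

def scan() -> dict:
    # Single consistent view: base rows, overridden/extended by the delta.
    merged = dict(columnar_base)
    merged.update(row_delta)
    return merged

for key, row in sorted(scan().items()):
    print(key, row)   # bob's balance reads 300, not the stale 250
```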
Now, Snowflake is a great OLAP system, but how could they also build a transactional system, which is just as hard, and which has to be fault-tolerant, reliable, and safe? Well, it helps if you already have a transactional system that you're using for other parts of your infrastructure. Famously, Snowflake runs its catalog on this thing called FoundationDB. Who here has heard of FoundationDB before this class? Three, four, five; a small number. Basically, in the NoSQL days of the early 2010s, there were all these NoSQL key-value stores that didn't do transactions, and FoundationDB said, we're going to be a transactional key-value store. I think they were also backed by Sutter Hill, so basically the two companies were set up to work together, and Snowflake decided early on to use FoundationDB as their catalog. It's one less thing they'd have to build, because you need a transactional catalog.

The challenge, though, is that FoundationDB got bought by Apple in 2015, and it had always been closed source. What happened is that Snowflake had a clause in their contract with FoundationDB that the source code was held in an escrow service, so they would get access to it. By this point in 2015, Snowflake was huge; not as big as it is now, but growing quite rapidly, and Apple buying the thing that runs their catalog service would have been a huge problem. So they got access to the source code and kept maintaining it themselves over the years, and when Apple later open-sourced FoundationDB, they had to spend time merging their fork back in. Right now I think the number one contributor to FoundationDB is Apple and number two is Snowflake, but for legal reasons the Snowflake people can't commit code directly into FoundationDB; only Apple employees can do that. They literally showed me on Slack: they say, hey, commit this for me, and the Apple people do it. I don't know whether that's changed; that's what it was a few years ago. FoundationDB itself we don't have time to cover, but it's a very fascinating system, specifically its hermetic testing infrastructure, where they could inject faults into the disk, the network, whatever, and show that the thing could fail over and was fault tolerant. The people who built FoundationDB have a new startup called Antithesis, where they're trying to sell the testing infrastructure they built for FoundationDB to other people building databases.

Question: how do you make a key-value store transactional; doesn't the data model matter? So your real question is, how do I make a key-value store transactional. They probably used MVCC; I'd have to go look, I don't remember. But the fact that it's a key-value store rather than a relational database does not matter. Begin, put, put, put, commit: it's all the same. That's basically what InnoDB is doing underneath the covers; it's a transactional interface on a B+tree, and MySQL does the higher-level stuff on top. WiredTiger, RocksDB, it's all the same. It doesn't matter whether it's a key-value store or not.

Okay, all right. So this was a crash course on what Snowflake is. Again, I think it's a very fascinating system, and even though it's 12 years old now, I still consider it very much state of the art as a disaggregated-storage lakehouse system that does vectorized query execution. What Snowflake provided back then is common now, but it's still state of the art, although you can see, not cracks exactly, but aspects that are remnants of being designed 12 years ago, like bolting external tables on after you've already built proprietary storage. That's not how you would build a system today if you wanted to do lakehouses or data lakes. But again, I still think it's a very fast, very good system. The other challenge, from Snowflake's perspective, is that if you look at Velox, look at DataFusion, the core engine itself has become commoditized. So it's all the stuff above that, the Snowpark stuff, the Snowpipe stuff, that Snowflake does, because that's the user experience, and the runtime adaptivity, that's going to matter, rather than just how fast you can do a vectorized scan.

All right, so next class, on Wednesday, we're going to read the DuckDB paper. The paper I assigned is the demo paper, so it's only like two or four pages. Yeah, there isn't a canonical DuckDB paper out there; that's the best we can do. We'll also cover the MotherDuck paper that came out this year, but I don't think that one discusses the core architecture, so I'll go through what the DuckDB internals actually look like based on public talks and other documentation they've given. And this is going to be slightly different from everything we've talked about before, because we've been making a big deal about these OLAP systems running on lakehouses, and now we're going to talk about an embedded, in-process database system. But again, see ya.