OK, so today we're going to talk about Snowflake. Again, we're following along reading papers about individual OLAP systems that build upon the things we've talked about this semester. A quick overview of what's coming up as we finish the semester: for project two, the final submission will be due two weeks from May 1. Again, I will send feedback later this week on everyone's preliminary submissions. For project three, the final presentations will be on Friday, May 5 at 5:30 PM. I think there's a big room we can go to, or we can just come here; we'll figure it out, and we'll get pizza or whatever else people want to eat. Next class, Mark from CWI will be giving a guest lecture about DuckDB. I decided we'll all come here, meet up, and watch it over Zoom together. That way it'll be interactive and we can ask questions, rather than everyone sitting muted on Zoom separately. And then next week, Ippokratis will also be giving a guest lecture over Zoom, about Redshift. I didn't assign the Redshift paper to read for that class; there is one, and it's better than the Snowflake paper, but the lecture will cover all the main details. Any questions about these admin things? Who's giving the other guest lecture? It's going to be about Snowflake; I've already talked to Charlie. It's not on the website yet because Snowflake is still getting back to me about the name of the person. OK, so let's jump into our discussion today. We've covered this story before about what OLAP databases look like in industry and the open-source community; we covered this at the beginning of the semester.
We talked a little bit about this when we went over the background on Dremel, and the background of the system we talked about last time, which I'm drawing a blank on, stupidly. Oh, Spark, Databricks, sorry. So we talked a little bit about what was going on in the world, and I'm reiterating it here to understand why Snowflake appeared when it did and how it did. People didn't just magically think, oh, we're going to build this. There were clearly things going on, and people put the pieces together and came up with the Snowflake architecture. So again, the 2000s were when we started seeing special-purpose relational database systems designed specifically for OLAP workloads. Vertica was probably the most famous one. Again, that was Mike Stonebraker's work with Sam Madden and others; it wasn't a fork of Postgres, it was rewritten from scratch to be columnar. Greenplum is the other one that's probably the most famous of these; that was a fork of Postgres. MonetDB was from CWI, which is where DuckDB comes from. VectorWise was the optimized version of MonetDB, and it was the precursor of Snowflake's engine and also sort of the precursor to DuckDB. And then ParAccel was a parallel version of Postgres. So again, there were a bunch of these columnar OLAP systems around at the time. There was also work on building on top of Hadoop to add support for SQL, running queries as MapReduce jobs or reading data from HDFS. Hive and Presto are probably the two famous ones. Presto came out of Facebook; it got forked into two different versions, PrestoDB and PrestoSQL. PrestoSQL got renamed to Trino, and PrestoDB is still around. Impala we covered in an earlier class. Stinger was sort of Hortonworks' version of Impala, SQL-on-top-of-Hadoop stuff.
All of these systems have a lot of the early variants of the ideas we've been talking about this semester, mostly columnar storage, and vectorized execution for the first group; the others added it later on. But the big thing about all of them is that they weren't being sold or provided as a service. Instead, you had to download them, run them on-prem, and manage everything yourself. That's the way people had been selling DBMS software for decades, right? And obviously the cloud changes all of this, but again, this is the late 2000s, early 2010s; the mindset that of course there should be a cloud service wasn't automatic for everyone the way it is now. So the big changes that came in the early 2010s: the Dremel paper we talked about before came out in 2011. Facebook started building Presto internally in 2012, building on their experience of creating Hive, which converted SQL queries into MapReduce jobs; this time they built a specialized engine similar to Dremel. Amazon bought a license to ParAccel in 2011, and then they released Redshift as a service on AWS in 2013, which actually beat Snowflake to market. ParAccel the company was, I think, kind of dead at this point; they were trying to sell themselves off and hoping Amazon would buy them, but Amazon only ended up buying a license to the software. So while all this was going on, there was a bunch of VCs at a firm called Sutter Hill, which is probably not as well known as Andreessen Horowitz, or Greylock, or Accel, most of the other ones. Their game is that you don't come and pitch them, like, hey, this is what I want to do, give me money. Instead, they're like a record label putting together a boy band: they find people and say, OK, these folks seem smart, let's put them together, and they're going to build what we want them to build.
That's their operating model. So rather than random people coming and pitching to them, they say, this is what we want to build, who's going to help us build it, and they use their own money. If you go to their website, there's nothing there; it's literally just a logo, last time I checked. It doesn't say anything about what they're doing. So they recruited two top engineers working on Oracle's database system, these two French guys here, and then they had the guy that built VectorWise at CWI, the MonetDB/X100 stuff we talked about before, Marcin Żukowski. They got them together and said, hey, let's go build a VectorWise-style cloud service using a lot of the ideas from Oracle and VectorWise, and that became Snowflake. So that's the background here. It wasn't like the founders clearly saw, OK, this is exactly what Snowflake is going to be. My understanding is that this VC firm recruited them and said, this is what you guys are going to end up building; they gave them a lot of money and set them off, and obviously it was super successful. It's so successful that one of the co-founders actually has a Snowflake tattoo on his leg. To me, that's dedication to your database, right? I mean, yeah, they went IPO, so there's a life-changing amount of money involved, but that's dedication: you love your database so much you get a tattoo of it. I hope to get there one day; then all the students can get tattoos. OK, so what is Snowflake? Snowflake is a managed OLAP database system written in C++. I don't know if it's public that it's C++, but I've been telling you it is. And it's a shared-disk architecture, like we've talked about for all these other systems.
The one thing they do slightly differently, or that they're very explicit about, is aggressive compute-side caching of files. When we read the Photon paper or the Dremel paper, they talk about going to get things from S3 or your object store or whatever, and they make it sound like every query is actually doing this. It's unclear how much caching those systems actually do, because they want the compute nodes to be ephemeral and stateless. In the Snowflake world, they're not the cloud vendor themselves, so they're the ones paying the cost of doing all the reads to S3 to get data. To avoid increasing their operating costs for running queries, they do as much compute-side caching on locally attached disks as possible. And we'll see how they manage to scale compute out on the fly without blowing away the cache every single time you add or remove a compute node. Everything in Snowflake is written from scratch; unlike so many of the other systems, they don't rely on an existing code base. They mention Hadoop and Postgres: we said Greenplum is a fork of Postgres and ParAccel is a fork of Postgres, and a lot of the new systems today are forks of Postgres or use bits and pieces of it. Here, everything is written from scratch. The claim in the paper is that this gives them precise control over exactly how the system behaves and performs, without bringing in the legacy baggage of how, say, Postgres expects things to be done. They use their own custom SQL dialect; I haven't really looked to see what flavor it is.
I think it looks more enterprise-y, kind of more Oracle-ish than Postgres-ish, but that could be a thing. And they have their own custom client-server network protocols; again, everything's written from scratch. A disclaimer: Snowflake sponsored this course in 2018, and they were very aggressive in coming and hiring senior students. That link there takes you to a guest lecture Ashish gave, sitting right here or standing right there, about what Snowflake does, five years ago. And actually, last night I had questions about Snowflake and he got on a call with me. This is Sunday night; he answered my questions about how they do vectorized execution for certain things. Literally, he's cooking and answering my database questions. He's a good dude. All right, so a quick overview of everything in Snowflake. Again, these are the things we've talked about throughout the semester, and now we can apply them to an actual system. Of course, it's disaggregated storage: the final resting place of the database is always going to be the object store. They support AWS, Azure, and Google. But they use locally attached disks on their compute nodes as an ephemeral cache, both for spilling query results and operator intermediates, and for table scans, because they want to avoid the round-trip cost of reading from S3, for both monetary and performance reasons. Push-based vectorized execution: that's different from what we saw in Dremel and in Photon or Spark SQL, because those were all pull-based. They do the operator-fusion stuff we talked about before, and everything is done through precompiled primitives using C++ templates.
A big thing that's maybe not groundbreaking, but does help with a lot of things, is that they separate the table data storage from the metadata storage, with separate services managing each. In the paper they talk about how this allows a bunch of optimizations you wouldn't really think about right away. Like, if I now want to drop a table, I just change the metadata to say this thing no longer exists, and then something in the background will go remove the data from S3 at some later point when it expires. There's no explicit buffer pool; operators can spill to disk, either locally or out to S3 as needed. They use PAX-style columnar storage like everyone else. Now, the paper you read was from 2016, so they're very explicit about using their own internal proprietary file format. Later on they added support for Parquet, and I think Delta Lake and Iceberg, which we'll cover later on. The way they originally operated, versus what the Dremel or Photon/Spark SQL papers talked about, is that if you want to scan or query any data, you first have to load it into Snowflake, which then converts it into their proprietary format. They do this because then they can collect statistics, and it's a single code base operating directly on their expected file format. But again, later on they added support for external formats. I don't know how the code is implemented, whether they have different scan operators for Parquet-style formats versus their own internal format. In the case of Photon or Spark SQL, what they didn't talk about is that they'll actually transcode any file that comes into Databricks into their own internal format, so they have a single execution path that can operate on it.
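To make that metadata/data separation concrete, here's a toy Python sketch under my own made-up names (this is not Snowflake's actual service API): dropping a table is a metadata-only change, and a background sweeper removes the underlying files from the object store later.

```python
import time

class Catalog:
    """Toy metadata service: table definitions live here, separate from
    the data files themselves. Dropping a table just flips metadata; a
    background GC job deletes the files from object storage later."""
    def __init__(self):
        self.tables = {}    # table name -> list of file keys on the object store
        self.dropped = []   # (file_keys, drop_time) pairs awaiting cleanup

    def create_table(self, name, file_keys):
        self.tables[name] = list(file_keys)

    def drop_table(self, name):
        # Cheap from the user's perspective: no data is touched yet.
        files = self.tables.pop(name)
        self.dropped.append((files, time.time()))

    def gc(self, object_store, retention_secs=0.0):
        # Background job: remove expired files from the object store.
        now = time.time()
        keep = []
        for files, t in self.dropped:
            if now - t >= retention_secs:
                for f in files:
                    object_store.discard(f)
            else:
                keep.append((files, t))
        self.dropped = keep
```

With `retention_secs > 0` you get a Time Travel-style grace period for free: the table is gone from the catalog immediately, but its files survive until they expire.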
I don't know what Snowflake does on that front. I think they support certain merge joins; I messaged Ashish but haven't heard back. They obviously support hash joins, because that's the way to go. And they have a Cascades-style query optimizer with some adaptive optimizations, though as far as I can tell it's less aggressive than Dremel's and Databricks'. So those are the key topics I want to discuss as we go along. The high-level architecture is composed of three parts. First, there's the data storage. Again, that's just the cloud-based object store that the cloud vendor provides them. Then they have this notion of what they call a virtual warehouse, a virtual data warehouse. Think of it this way: the architecture was designed before data lakes were a thing; there was no idea of a lakehouse or data lake back in 2012, 2013. Snowflake was designed to replace things like Teradata, and people had the traditional model of data warehouses. The idea with a virtual warehouse is that you provision some number of worker or compute nodes (I'll explain on the next slide) that run your queries. You can have multiple virtual warehouses operating on the same shared storage, but with performance isolation between them, because they're on completely separate compute nodes. Originally, you wouldn't say I want five nodes or six nodes or whatever; they had this notion of a virtual warehouse size, abstracted away as some multiplier, like 2X, 3X, 4X. You don't know exactly how many CPUs or how much RAM you're getting; all that's hidden from you. And if you ever turned the thing on, you were paying for it, right?
So even if you weren't running queries, if it was running, you were paying for it, similar to the EC2 model. In 2021 or 2022, they did add support for serverless deployments, but they upcharge for that, because now you're not provisioning resources to do all the coordination stuff down here; you're piggybacking off resources that are always running, and you only pay for the compute time you actually use. Then they have what's called cloud services. This is a hodgepodge of things; it's the front-end layer that acts as the query coordinator and scheduler and maintains the catalog, which we'll talk about in a second. This runs on FoundationDB, a transactional key-value store that Apple bought; Snowflake was designed from the very beginning to use FoundationDB. Again, we'll cover these things as we go along. All right, so the execution architecture is comprised of two parts. There's the worker node; think of that as a single instance. Again, this is 2012, 2013 when they're designing this, before Kubernetes, before Docker. Docker was what, 2013, 2014, and Kubernetes 2014, 2015, a year or two later. So they were running on, at the time, bare VM instances on Amazon. This worker node is always running in your virtual data warehouse. It maintains its own cache on local disk of files that previous worker processes have accessed and retrieved, and potentially also intermediate results they've generated. And they talk about how the cache replacement policy does not need to be very sophisticated: LRU is good enough, and you don't really need anything fancier.
Now contrast that with a buffer pool replacement policy, where you're worrying about sequential scans and dirty pages and things like that; you don't need any of that here. LRU is good enough; simple is better. We'll cover the consistent hashing in a second, but basically the cloud services layer keeps track of the metadata about which files exist on S3 for the tables within Snowflake, and then it uses consistent hashing to assign some portion of those files to individual worker nodes. When a query starts running, the worker node will spawn, via fork/exec, a worker process that lasts for the life of the query. This worker process can execute multiple stages or multiple query plan fragments, and it communicates with other worker processes running on other worker nodes. So unlike in Dremel and in Spark SQL or Photon, where there's a shuffle stage in which workers write data to other nodes or to a shuffle service, here the workers communicate directly with each other and send data. Now, as I said, Snowflake is a vectorized engine. No surprise, because one of the co-founders and architects had built VectorWise as part of his PhD work. But the difference is that they're push-based. There's not a lot of information on how they actually do the fusion, but basically it looks a lot like the VectorWise design we talked about before: there are templated primitives for the different operators and for expression evaluation, and at compile time you fill in the different types you have in your system to generate the different variants of each primitive.
At runtime, the query plan then essentially figures out which specialized function to call for each primitive to process the data as you execute. And because you're running on batches or vectors, that amortizes the cost of jumping to the different functions. Somebody from Snowflake gave a talk with us during the pandemic, in 2020, and mentioned that Snowflake does do codegen in limited ways. The only place I was able to confirm they actually do this is LLVM-style compilation for the serializer and deserializer used when nodes send data to each other. So on the fly, they can generate code that takes the internal in-memory representation of a vector, converts it into a byte buffer to send over the network, and on the other side reverses that. Think of this as pre-Arrow: Arrow is how you'd do this now, but they designed this at a time before Arrow existed. Again, they're not doing shuffle between stages; the worker processes push data directly to each other. They also don't support partial query retries. This is different from Spark and from Dremel. In the case of Dremel, the workers would write to the shuffle service, and if one of the workers died, the query coordinator could say, OK, this worker died, let me fire up another worker to complete the task I'm missing, and it would pick up as if the worker never died at all. As far as I can tell, Snowflake can't support this, and I think it's a byproduct of their heavy use of the local cache. If you have a worker that's executing one pipeline after another, accumulating intermediate results on its local disk and not writing them to shared storage, then if that worker dies, you may have lost multiple stages of data, and the query just has to restart.
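A Python stand-in for those precompiled, per-type primitives (in the real engine these are C++ template instantiations; all names here are illustrative): the kernel is looked up once per batch, so the dispatch cost is amortized over the whole vector rather than paid per row.

```python
# One specialized kernel per (operation, type) pair, generated ahead of
# time; a stand-in for C++ template instantiations of each primitive.
def add_i64(a, b): return [x + y for x, y in zip(a, b)]
def add_f64(a, b): return [x + y for x, y in zip(a, b)]
def mul_i64(a, b): return [x * y for x, y in zip(a, b)]

PRIMITIVES = {
    ("add", "i64"): add_i64,
    ("add", "f64"): add_f64,
    ("mul", "i64"): mul_i64,
}

def eval_batch(op, dtype, a, b):
    # Resolve the primitive once per *vector*, not once per row: with a
    # batch of a few thousand values, the lookup and call overhead is
    # amortized away, which is the point of vectorized execution.
    kernel = PRIMITIVES[(op, dtype)]
    return kernel(a, b)
```

In an interpreter that dispatched per row, that dictionary lookup (or virtual call) would sit in the innermost loop; here it runs once per batch.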
On fault tolerance, again, what's the alternative to restarting? You could write intermediate results to S3, but then Snowflake pays for that, plus it's slower. So rather than do that, they just kill the whole query and restart it. All right, so they do two types of rebalancing that are worth mentioning, different from what we saw in Spark SQL and Dremel. First, they support work stealing between the different nodes while you're running a query, similar to the work stealing we saw with the morsels approach from HyPer. What happens is, when a query shows up, before it even starts running, the optimizer figures out: the query is accessing these tables, I know all the files that make up each table, and I'll use my consistent hashing scheme to decide which worker nodes operate on which files. All of that is figured out at the planning stage. Then the query starts running, and if a worker recognizes that it has processed all the files it was required to process, but the query hasn't moved on to the next pipeline or stage of the plan, it can go ask its neighbors, the other worker processes running for that query, and start taking files from them to process. The idea, again, is that if one worker is a straggler, running slower for whatever reason, another worker can steal its files and process them. But the key thing they point out in the paper is this: even though the slow worker may have a copy of the stolen file waiting in its local cache, so you could ask that worker for it and avoid paying for an S3 lookup (you pay per lookup on S3), they don't want to put more load on the slow worker. So the stealing worker goes down to the shared disk and reads the file from there instead.
And it won't cache the file locally, because its ownership of that file is only temporary; in the consistent hashing scheme in the catalog, if you ever need to access that file again, there's one worker node that should have it. So even though another worker processes it this time, the file is still assigned globally to the one that's running slow. It's a slightly different approach than Dremel: there, if you had a straggler, the coordinator would figure out that the task was running slow and give the work to somebody else, whereas here the workers proactively figure out what they should steal. The other thing they support is what they call flexible compute. The idea is that if the system recognizes that some portion of the query plan, like a pipeline, is going to process a large amount of data, it can decide to temporarily allow the virtual warehouse to use more worker nodes than the customer is actually paying for, to get additional compute capacity and make the query run faster. It's sort of a bookkeeping trick to give you more resources than you're paying for; assuming those resources couldn't be used for something else anyway, this makes you a happy customer, because your query runs faster than it otherwise would have, without you having to pay more. Now, I can't say exactly how they do this, but you can imagine: the original model wasn't serverless, you're paying for virtual warehouses, and even if you're not running queries, you're paying for them. Well, that's idle compute capacity. I'm not saying they're borrowing from customers that aren't using their nodes, but, you know. And you can do this because it's a managed service.
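Here's a minimal Python sketch of the two mechanisms just described, with made-up names and a deliberately crude simulation: consistent hashing assigns files to workers (so adding a node only remaps a small fraction of files, and most of each node's cache survives), and an idle worker steals files but reads them from shared storage and does not cache them.

```python
import bisect
import hashlib
from collections import deque

def _h(key):
    # Stable hash (Python's built-in hash() is randomized for strings).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring mapping file names to worker nodes."""
    def __init__(self, nodes, vnodes=64):
        self.ring = []  # sorted (hash, node) points; vnodes per node
        for n in nodes:
            for i in range(vnodes):
                bisect.insort(self.ring, (_h(f"{n}#{i}"), n))

    def owner(self, file_key):
        i = bisect.bisect(self.ring, (_h(file_key), ""))
        return self.ring[i % len(self.ring)][1]

def run_scan(ring, files, slow):
    """Simulate one scan stage with work stealing. The 'slow' worker
    handles one file per round, the others two. A thief takes from the
    tail of the most-loaded peer's queue, fetches the file from shared
    storage (s3_reads), and does NOT add it to its cache."""
    queues = {}
    for f in files:
        queues.setdefault(ring.owner(f), deque()).append(f)
    cache = {w: set() for w in queues}
    s3_reads, done = [], []
    while any(queues.values()):
        for w in list(queues):
            for _ in range(1 if w == slow else 2):
                if queues[w]:
                    f = queues[w].popleft()
                    cache[w].add(f)            # owners cache their own files
                    done.append(f)
                elif any(queues.values()):
                    victim = max(queues, key=lambda v: len(queues[v]))
                    if queues[victim]:
                        f = queues[victim].pop()   # steal a pending file
                        s3_reads.append(f)         # read it from shared disk
                        done.append(f)             # ...without caching it
    return done, cache, s3_reads
```

The stand-alone property worth noticing: when a node is added to the ring, the only files that change owner are the ones that now map to the new node, which is why the scheme is cache-friendly.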
All right, so say we have this query plan here, just a hash join between two tables, and this side here is going to scan a really large table on the probe side. Also, as a point of interest, they have explicit nodes to do join filters; this is the sideways information passing stuff we talked about before. When you build the hash table, you also build a Bloom filter, and then you can pass it over to the other side of the query plan. That obviously breaks the original notion of the plan as a DAG, but it's such a big win in many cases that you want to do it. I think they support Bloom filters, and also some basic range filters, on the join-filter side. So say this side of the query plan is a large scan that's going to take a long, long time. At the optimization stage, they can recognize this, because again they have statistics: they know how many files you're going to process, because everything was loaded internally. So they can rewrite the query to look like this, where this side is split into two parts. There's this path here that does the same work as before, using the nodes in the virtual warehouse that the customer has paid for. But then there's this other portion of the query plan that you can scale out and run on a free pool of additional resources. And this is opportunistic: if there are resources available, use them; if not, whatever, it all just gets processed on the paid side. I forget the exact reason they structure it this way, but what happens is this part runs on other nodes that aren't in your virtual warehouse, and those nodes can't store results locally, because they're ephemeral, right?
Say this operator here produces some results and you need to come back around and do the next scan. On your own virtual warehouse, it could write those results to its local disk, because that worker node isn't going away; it's assigned to your virtual warehouse. But in this case you're running on some arbitrary nodes temporarily: they'll run your query now, and a second later they'll run somebody else's query, so there's no guarantee your local storage is going to be there. Instead, it adds this little insert operator above here, which materializes the result out to the shared disk storage, and then the query plan scans it as if it were a regular table. Yes? [Student asks about shuffle: do they have to store everything in shared storage?] OK, so you're talking about broadcast join versus shuffle join, which is different from this flexible compute thing. Your original question is, how do they support shuffle joins? Again, they're using the local disk as a cache. So if you're doing a shuffle join and you have to spill, you try to spill to your local disk first, and then if you need to send those results to somebody else, you read them back in and send them out. [Student follows up: so when you do a join, don't you have to partition on a key in a distributed fashion?] You do, why wouldn't you? It's a distributed database, so yes. It's basically the same thing we talked about before for a single node: I hash on the join key, and then I know how to shuffle things and move things around as needed.
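That hash-on-the-join-key idea can be sketched in a few lines of Python (names and the size threshold are made up, standing in for a cost-based decision): partitioning both sides on the key guarantees matching rows land on the same worker, and a small build side can instead be broadcast whole.

```python
def partition(rows, key, nworkers):
    """Hash-partition rows on the join key, so all rows with the same
    key value land on the same worker."""
    parts = [[] for _ in range(nworkers)]
    for row in rows:
        parts[hash(row[key]) % nworkers].append(row)
    return parts

def plan_join(left, right, key, nworkers, broadcast_threshold=100):
    """Return per-worker (left_part, right_part) pairs. If the right
    (build) side is small, broadcast a full copy of it to every worker;
    otherwise shuffle both sides on the join key."""
    if len(right) <= broadcast_threshold:
        return [(lp, list(right)) for lp in partition(left, key, nworkers)]
    return list(zip(partition(left, key, nworkers),
                    partition(right, key, nworkers)))

def local_join(lp, rp, key):
    # Plain hash join within one worker: build on the right, probe the left.
    ht = {}
    for r in rp:
        ht.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in lp for r in ht.get(l[key], [])]
```

Either strategy gives the same answer; the broadcast plan just trades extra copies of the small side for skipping the shuffle of the big side.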
They do the standard tricks, too: if one table is really small and one is bigger, take the small one and broadcast it around so everyone has a complete copy of it. Otherwise, do a complete shuffle on the join key, so all the data for partition A goes to one node and the data for partition B goes to another. And you can do recursive partitioning if things are skewed. All the same tricks still work; they're just not doing an explicit shuffle after every stage the way Dremel and Spark do. Now, flexible compute is completely different: it's saying there's this fragment of the query plan that I'm going to run on machines I haven't directly provisioned or paid for, and because I can't rely on their local disk cache (it'll get blown away once my job is done), they all write the results out to shared storage as if it were data loaded in as a regular table, and then this operator just scans from it. They also talk about how you could potentially use these intermediate results as a query cache: if you ever see this fragment again, since you've written it out to storage, the optimizer could figure out, OK, I've already materialized this pipeline, let me just read that instead of doing the whole computation again. I don't know how often this kicks in; does every query get this, or only some queries? That's all proprietary and internal; Snowflake doesn't say how often it happens, and I don't even know whether you can see it as a user through the UI, like whether you'd know your query got the flexible compute treatment. OK, so in the paper you read, they talk about how in the early days of Snowflake they were trying to decide what the storage layer should actually look like. And again, flash back to 2012: Hadoop and HDFS were the hot thing.
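Going back to the join filters from a couple of slides ago, here's a toy Python version of the idea: build a Bloom filter on the hash join's build side and push it into the probe side's scan so non-matching rows are discarded early. The filter size and hash count are arbitrary, and this is a sketch of the general technique, not Snowflake's implementation.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""
    def __init__(self, nbits=1024, nhashes=3):
        self.bits = bytearray(nbits // 8)
        self.nbits, self.nhashes = nbits, nhashes

    def _positions(self, key):
        h = int(hashlib.sha256(str(key).encode()).hexdigest(), 16)
        for i in range(self.nhashes):
            yield (h >> (i * 16)) % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def hash_join_with_filter(build_rows, probe_rows, key):
    ht, bf = {}, BloomFilter()
    for row in build_rows:                  # build phase: hash table + filter
        ht.setdefault(row[key], []).append(row)
        bf.add(row[key])
    out = []
    for row in probe_rows:                  # probe-side scan
        if not bf.might_contain(row[key]):  # join filter prunes early,
            continue                        # before touching the hash table
        for b in ht.get(row[key], []):
            out.append({**b, **row})
    return out
```

Because Bloom filters never produce false negatives, the join output is exactly the same with or without the filter; the filter only saves work.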
Hadoop and HDFS were how you got scalable, cheap storage at the time. The trade-off they were trying to decide was: should they build something like HDFS, should they build their own proprietary storage layer, or should they rely on the cloud vendor's storage, in particular Amazon S3. And they talk about the obvious trade-offs: S3 is going to be slower than local disk. Not just because of the round-trip latency, which I think is on average at least 50 milliseconds, if not more, and can be up to 300 milliseconds; it's variable. There's also a higher CPU overhead for making these calls, because it's not like NVMe, where you can bypass the kernel, go directly to the storage device, and bring data back in with low CPU overhead. You're literally making a REST call over HTTPS, which means it gets encrypted, goes down through TCP/IP and over the network, and then you have to parse the response when it comes back. So the latency is higher, and the CPU cost of doing I/O is higher as well. You probably don't remember this, but at the beginning of the semester, I think in the second lecture, I mentioned another OLAP system called Yellowbrick that saw similar issues and basically rewrote their own S3 client to do kernel bypass, instead of using the one Amazon gives you. I don't know whether Snowflake does the same thing, but you can imagine that with the money Snowflake has, you'd rewrite a lot of stuff yourself. You obviously don't want to write the entire storage layer yourself, though, because Amazon does a really good job of making sure it's durable and always available; let Amazon do that, because they have economies of scale and can do it better and cheaper than anybody else.
So the other thing they say is: although this part sucks, it's actually okay for OLAP workloads, because we don't need to read the entire file every single time, right? Essentially what's going to happen is, for these queries we know what columns they're going to access, because it's SQL, it's declarative, we know what they want to do. So we can just go get the offsets from the headers of these different files (in the case of Parquet it's the footer, but same idea), figure out what portions of the file we need, and things like S3 support doing lookups on objects with byte-range offsets, right? So they decided that instead of trying to make HDFS work, which was the right call because HDFS, while still common, is not the default choice anymore for large-scale storage, and instead of building their own storage layer, they were going to spend their engineering time building a better caching layer on the local disks to amortize and hide the costs of doing lookups on S3. And that's something very different from what the Dremel and Spark SQL papers talk about, right? So the query operators are able to spill to local disk, and if necessary they can also spill to S3. But again, all of that is transparent to the user running SQL; they handle all of this under the covers. So it's this interesting dynamic where the better Snowflake gets at caching and avoiding S3 lookups, the cheaper it is for them to run the service and the more profit they make, because the customers still pay the same amount, and if Snowflake reduces its lookups to S3, it pays less to Amazon. So it makes things better for them, but it also makes the query run faster. It's a win-win for everybody.
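The ranged-read trick above, grabbing just a file's footer instead of the whole object, can be sketched like this. The `fetch` callback stands in for an S3 GET with a `Range: bytes=...` header; the in-memory blob mimics Parquet's real trailer layout (a 4-byte little-endian footer length followed by the `PAR1` magic), but everything else here is a simplified stand-in, not a real Parquet reader.

```python
import struct

MAGIC = b"PAR1"  # Parquet files end with a 4-byte footer length + this magic

def read_footer(fetch, object_size):
    """fetch(start, end) returns bytes [start, end) of the remote object."""
    # First ranged read: just the last 8 bytes of the object.
    tail = fetch(object_size - 8, object_size)
    assert tail[4:] == MAGIC, "not a Parquet file"
    footer_len = struct.unpack("<I", tail[:4])[0]
    # Second ranged read: only the footer metadata, never the column data.
    return fetch(object_size - 8 - footer_len, object_size - 8)

# In-memory stand-in for an object in S3:
blob = b"columnar-data..." + b"{footer-metadata}" + struct.pack("<I", 17) + MAGIC
footer = read_footer(lambda s, e: blob[s:e], len(blob))
```

Two small ranged GETs are enough to learn which row groups and column chunks a query needs, which is exactly why the declarative nature of SQL makes object storage workable despite its latency.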
So as I said at the beginning, Snowflake started off with, and it's still the case, the default choice that most of the data is stored in an internal format they designed, which at a high level, I've been told, looks somewhat similar to Parquet. Not exactly the same, and roughly comparable in complexity. This is my understanding, but it's just anecdote from students who work there and other sources. So it's their own proprietary format. They're gonna take any table and break it up into what they call micro-partitions: basically 50 to 500 megabytes of data before it gets compressed. That's how they segment things. And once it gets compressed, it's on average about 16 megabytes per file. So again, this is just PAX, and because it's sitting in S3, the files are immutable. So anytime I need to make a change, I have to write the file out again and update my metadata to say here's the new version of the file. One thing they do that's different from Dremel and Photon or Spark SQL is that they have a background service that's gonna go through and automatically rewrite micro-partitions multiple times to re-cluster them, reorganize them based on the query access patterns. So you can define a clustering key on a table, and Snowflake's gonna try to figure out, if I sort the data on that key, what are the right boundaries for the different micro-partitions. The idea here is that if you pre-sort the data, there's obviously a bunch of optimizations you can do to figure out what you need to read without doing a complete sequential scan, and that reduces the amount of I/O you have to do.
Now it might seem very wasteful: I'm gonna read a file in, reorganize or re-sort it and change the boundaries, and then write it back out. The reads aren't gonna be that bad because, again, they can mostly be absorbed by the local cache. And yes, you pay for the writes to go out, but that's gonna help you if you assume most of your queries are read-only, which in their case they will be because it's an OLAP system; reads are the primary thing you're trying to optimize for. So you're paying a small cost to do the writes for the re-clustering, but then you get a win by making the queries run faster and read less data. Again, all of this is internal: other than defining what the clustering key is, the user doesn't have to kick any of these things off. We'll see in a second how they have also expanded upon this to support Parquet files, external files. So like in the Dremel paper, they talk about how it's very important to support semi-structured data, right? People have a bunch of random JSON or XML files that they wanna dump into your database and start running queries on. So to handle this, they added support for three data types. They're not exactly unique to Snowflake, there are different versions of these things in other database systems, but the way I'm gonna describe them here is specific to Snowflake. So VARIANT is basically a JSON blob or some kind of semi-structured data. And then ARRAY and OBJECT are just restrictions of that: an ARRAY is semi-structured data, but only as an array of a given data type, and an OBJECT is a mapping. So this is unlike Dremel, where you had to give the schema of the JSON document ahead of time, right? Because you were defining these protocol buffers. You would say: here's the schema of the thing we're actually gonna store.
Then based on that, they could do what we call the shredding method, where they would break it up and try to store it as columnar data. In the Snowflake world, they're not gonna do that. Instead, because they assume people aren't gonna be able to give them the schema, they're gonna try to figure out what the schema is, what the data types are for these different fields, on the fly, automatically, as you load the data in. This is different from what Photon does or what Spark does, right? Spark had the same problem; we all have the same problem with these JSON things: here are the data files, I don't know what's in them. In the case of Photon, what they would do is batch-level adaptivity: as the query is actually running, they change which primitive they're gonna use for different operations based on what data type the system thinks it's seeing. So if I have a string and I see that I'm converting it to a date and then doing some kind of date operation, they would have an optimized version that takes a string and converts it to a date type, instead of running the generic version of the function. In Snowflake, they don't wanna pay the penalty of doing that conversion at query time, so they're gonna do it at write time. So again, you start loading up some JSON data, they're gonna look at the fields, and after they see a little bit, they'll start inferring what the type is. And if they identify something like: oh, there's some field in my JSON document that looks like year-month-day, then I'm gonna assume that field should really be a date type. So they will automatically parse and convert that into the binary representation of the date data type. So it goes from a string to a four-byte date.
But they always retain the original string version of the data, so that if the prediction was wrong and you realize it later on, you can always go back to the original data. And if someone does something weird and stupid, like put a poop emoji in there instead of a date, then you parse it, realize it's a malformed date, but you can still go back and get the original value. All right? So again, they do this when you load the data; under the original model, where they didn't support external files, you had to load everything in, and they'd parse it and put it into the proprietary format. That's why they can do this conversion. I don't know how they handle external tables; I think they just use the fallback method. So now you can start to see these different design decisions, the differences between the systems. At a high level you think: oh yeah, Photon and Dremel and Snowflake, they're all vectorized engines. But the nuances are going to be slightly different. I'm not saying this parsing thing is a huge game changer for one versus another, but you can start to see the nitty-gritty details of how these things can vary. And I would say, in the case of the Dremel world, they were dealing with internal data originally at Google, where everything was mostly written in protocol buffers. They had the schema, so that's why they could make the choice that way. Whereas Snowflake said: okay, I don't want to pay the penalty at query time because I want my queries to run fast, so I'll pay the penalty of parsing this at load time. And then in the Spark world, since Photon and Spark SQL don't have their own proprietary format, they only run on the open-source formats.
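The write-time inference described above, guess that a year-month-day string is really a date, store the typed form, but keep the raw string in case the guess is wrong, can be sketched like this. The function name, the single date pattern, and the output shape are all illustrative; Snowflake's actual inference is proprietary and surely handles far more types.

```python
import re
from datetime import date

# Hypothetical write-time type inference for one semi-structured field.
DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")

def shred_field(raw):
    """Infer a type for a raw JSON string value, retaining the original."""
    m = DATE_RE.match(raw)
    if m:
        try:
            parsed = date(int(m[1]), int(m[2]), int(m[3]))
            # Keep both: a typed value for fast scans, the raw string for safety.
            return {"type": "date", "value": parsed, "raw": raw}
        except ValueError:
            pass  # e.g. "2023-99-99": shaped like a date but malformed
    # Fallback: poop emoji and friends stay as plain strings.
    return {"type": "string", "value": raw, "raw": raw}
```

A value like `"2023-04-24"` comes back typed as a date, while a malformed or emoji value falls back to a string, and in every case the `raw` field preserves what the user actually loaded.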
They have to do the batch-level adaptivity stuff because they're seeing things for the first time. All right, for the consistent hashing scheme we talked about before: again, consistent hashing is just a way to do basically a hash map where, if I add or remove entries in the mapping from data to nodes, I don't have to rebalance everything; I can just shift things along the ring. This hash map is maintained internally and transactionally, so the worker nodes always know which node is responsible for a given file. There's never any consistency issue, like two worker nodes both thinking they're responsible for the same file; the consistent hash map always tells you exactly who's responsible for it. And now if I add new nodes to my virtual warehouse and need to rebalance, I don't have to shuffle everything around. I only have to move data at the point in the ring where I've added the new node: whatever comes after it sends some portion of its files over to the new worker node. This ensures that as workers come and go, you don't have to completely invalidate your local caches every time the mapping of which node is responsible for which micro-partition file changes as the virtual warehouse changes, right? Because again, they're trying to avoid doing lookups on S3 as much as possible, so this solves that problem. All right, for the query optimizer: it's a Cascades-style model doing top-down optimization. Again, there's not a lot of public information on exactly what they're doing, but I will say that sometimes in the Snowflake literature, I forget what the documentation says, they refer to what we've been calling the query planner or query optimizer as the query compiler.
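The ring behavior described above can be sketched with a minimal consistent-hash map. This is the textbook technique, not Snowflake's internal code; worker names, the hash function, and the lack of virtual nodes are all simplifications.

```python
import bisect
import hashlib

def _h(key):
    # Any stable hash works; md5 here just gives a well-spread integer.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring mapping files to worker nodes."""

    def __init__(self, workers):
        self._points = sorted((_h(w), w) for w in workers)

    def owner(self, filename):
        keys = [p for p, _ in self._points]
        # First worker clockwise from the file's hash (wrapping around).
        i = bisect.bisect(keys, _h(filename)) % len(self._points)
        return self._points[i][1]

    def add(self, worker):
        # Only files between the new worker and its predecessor move.
        bisect.insort(self._points, (_h(worker), worker))
```

The key property: after adding a worker, every file that changes owner is now owned by the new worker, so only that slice of the cache has to be repopulated; everything else stays warm.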
That's a remnant of the vernacular people used for databases in the 1970s: in the same way I would compile C into machine code, I would compile my SQL query into the "machine code" of the database system. So people historically used the term compiler, but nowadays the more common phrase is query optimizer. The big thing they care about is figuring out what data they don't have to read. So they're very aggressive about trying to prune and skip micro-partition files before the query even starts running. Right, again, they want to avoid lookups on S3 or pressure on the local caches. Once they figure this out, they can determine the complexity, the amount of work being done, for the query fragments, at least the lower portion of the query plan. Obviously as you go up, you have to do cardinality estimations, which they try to do like everyone else does, because they have some statistics, but not very sophisticated statistics, as we'll see on the next slide. That's when they make the decision: okay, I think this portion of my query plan is gonna take much longer than it normally would for this virtual warehouse, and that's when they can decide whether to employ the flexible compute scale-out. They do support query plan hints, and they have an online tool where you can muck around with the query plan and change stuff. And they have some support for runtime adaptivity. The example they always show is whether you push the group-by down below the hash join or leave it above. That's the example they love to talk about; I don't know if there's any other technique they support. Dremel supports the group-by pushdown below the hash join too. I don't know whether Spark SQL does. I don't know whether they can reorder join orders.
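The group-by pushdown mentioned above can be illustrated with a tiny sketch: partially aggregating the fact side on the join key before the hash join gives the same answer while sending far fewer rows through the join. The table contents and names here are invented for the example.

```python
from collections import defaultdict

# Illustrative tables: (region, sales) facts and a region dimension.
fact = [("us", 10), ("us", 5), ("eu", 7), ("eu", 1), ("us", 2)]
dim = {"us": "United States", "eu": "Europe"}

def sum_by_key(rows):
    acc = defaultdict(int)
    for k, v in rows:
        acc[k] += v
    return dict(acc)

# Plan A: join first (5 rows probe the hash table), then aggregate.
plan_a = sum_by_key((dim[k], v) for k, v in fact)

# Plan B: pushdown, aggregate first (only 2 rows probe), then join.
plan_b = {dim[k]: v for k, v in sum_by_key(fact).items()}

assert plan_a == plan_b  # same answer, fewer rows through the join
```

The rewrite is only legal when the aggregate distributes over the join like this (the optimizer has to prove that), which is part of why it's one of the few adaptive tricks systems call out explicitly.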
I don't know whether they can change the join plan or switch from a broadcast join to a shuffle join at runtime. That part, again, I didn't see any documentation about. So as I was saying before, Snowflake always wants to put things in their own proprietary format and get some statistics about them. But the statistics they maintain are not going to be as sophisticated as in a monolithic, shared-everything system like PostgreSQL, MySQL, or Oracle. The only statistics they compute are basic zone maps: min-max values for columns within each micro-partition. They're not going to compute any histograms. They're not going to compute any sketches. The idea here is that because those things are often horribly wrong and can get out of sync with the data very quickly, they said: we're just going to avoid all of that. That means they're relying on basic heuristics to figure out what should join with what, and they claim it works reasonably well for what they need. The statistics they maintain are hierarchical. At the table level you have some statistics, then for the micro-partitions you have some statistics, and then within the micro-partitions, for the columns, you have statistics. So for the table: the number of rows, the size in bytes, compression information. The columns are basically where the zone-map stuff is: a min-max on the column, the number of null values, and the number of distinct values. And again, this is enough for them to do some early pruning before the query starts running. The challenge with pruning is: how do you actually evaluate expressions to figure out what you don't need, right? Suppose I have this table here and I have these queries on these columns. For something like column-equals-constant, obviously a min-max would be able to prune out a bunch of stuff.
But when we start to have expressions like column one plus column two is greater than some number, in order to figure out what that resolves to, I have to actually evaluate the expression per micro-partition to figure out whether anything in there could match or not, right? Or even more complicated: here I'm converting some date field, a creation timestamp, extracting the year: find me all the entries where the year is 2023. The classic trick everyone does with this one is to rewrite it into a BETWEEN call, between the beginning of the year and the end of the year. That's what you would do at runtime, but if you're now trying to look at these statistics you've collected for these micro-partitions, you actually need to evaluate these expressions as if you were running a query. And that's basically what they do. They have specialized implementations of the expression evaluators, the ones we would normally use at runtime on the real data as you process it, but as variants that can operate on the statistics collected in the zone maps, to see whether the expression could evaluate to true and decide whether you'll actually need to access that micro-partition later on. How do I say this: it seems trivial, but it's actually not. It's very hard to do, because you're trying to infer what the boundaries could be, but you also have to actually evaluate these expressions. You need to consider things like null indicators: for column one plus column two, what if column two could be null? What does that mean? Well, you probably should look at the entire micro-partition.
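The statistics-based evaluation described above amounts to interval arithmetic over zone maps: for each micro-partition, bound the expression using the per-column min and max, and only scan the partition if the predicate could possibly be true there. A minimal sketch, with invented zone-map shapes (min, max, null_count):

```python
def maybe_matches_sum_gt(zmap1, zmap2, threshold):
    """Could col1 + col2 > threshold hold anywhere in this micro-partition?

    Works on (min, max, null_count) statistics, not real tuples. NULLs never
    satisfy the predicate, but zone-map bounds are conservative, so the max
    bound alone is enough to decide whether a scan might be needed.
    """
    _, hi1, _ = zmap1
    _, hi2, _ = zmap2
    return hi1 + hi2 > threshold

# Two hypothetical micro-partitions with zone maps for columns c1 and c2:
parts = [
    {"c1": (0, 10, 0), "c2": (0, 5, 0)},   # max possible sum 15 -> prunable
    {"c1": (50, 90, 2), "c2": (1, 9, 0)},  # max possible sum 99 -> must scan
]
to_scan = [p for p in parts if maybe_matches_sum_gt(p["c1"], p["c2"], 20)]
```

For `c1 + c2 > 20`, the first partition's bounds cap the sum at 15, so it's pruned without ever touching S3; the second could reach 99, so it stays on the scan list. Note this is a "maybe" test: it can only rule partitions out, never confirm a match.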
I can look at the zone map's null count, which tells me whether any value within a column could be null, and that tells me whether I need to look at it. But these are all the considerations you have to make. And again, you can't use the implementation you have for when you're actually processing tuples. You basically need a whole other expression evaluation engine to do this directly on the statistics. The MySQL guys gave a talk here during the pandemic, and they talked about how in their query optimizer, for some queries where you had a subquery that resolves to a constant, they would actually stop optimization, run that subquery on the database system itself, get back the constant value, and then substitute it back in, which seems kind of crazy if you think about it: for one query, I'm gonna invoke another query just to plan my query. This is essentially what Snowflake is doing, except rather than doing it the MySQL way. MySQL doesn't have a separate execution engine to do these mini-evaluations during optimization; they just ask the database system to do it for them and hand back results as if it were a regular query sent by a client. What Snowflake is doing is evaluating these things without having to fire up the full engine, because they do this in what they call the cloud services layer, not on the worker nodes themselves. Okay, so that's a quick rush through what Snowflake does. I wanna talk a little bit about the fight, the dust-up, whatever you wanna call it, between Databricks and Snowflake that happened in 2021, which I mentioned briefly at the end of last class.
So, again, remember that in the Databricks paper they had TPC-H results, but then there was this paragraph near the end that said: oh, by the way, we also have TPC-DS results that have been validated and audited by the Transaction Processing Performance Council, the nonprofit that runs TPC, and they were the fastest. So they had a blog article that came out in November 2021 that talks about the paper and about setting the new world record on the 100-terabyte TPC-DS run. They talk about why it's important to get this audited and so on. But then they also had this little piece that says some researchers at the Barcelona Supercomputing Center also ran a subset, what they call the power run of TPC-DS, a more restrictive version that doesn't do everything needed to get full auditing approval but looks like TPC-DS. They ran a comparison between Databricks and Snowflake and found that Databricks was two times faster than Snowflake at twelve times lower cost, right? Sure, okay, that's fine. So again, Databricks put out this blog article and they had this graph here, right? Well, two weeks later, the Snowflake guys saw that. They didn't like that. So these are the two French guys, the two ex-Oracle guys that I mentioned; these guys don't have the tattoo, it's Marcin who's got the tattoo. So Snowflake comes back and says that the Databricks numbers are junk: the cost is higher because they ran on the enterprise edition of Snowflake, which they charge more for, but there are no materialized views you would use in TPC-DS, so they could have run the standard edition at a lower cost. And when Snowflake runs the same experiment the Barcelona guys supposedly ran, they're twice as fast as what Databricks is showing in this graph here, right?
And in this blog article, they tell people how to log into Snowflake, and in like four clicks you can rerun and validate the experiments they did, showing they're not as slow as Databricks claims, right? Okay, that's fine. Well, Databricks came back three days later and said that the Snowflake claims are all bunk, that they still stand by the results they showed. And if you rerun the experiments that Snowflake ran, but without using what they call the pre-baked or prepared version of the TPC-DS data files, then you get the results that the Barcelona guys originally reported, where Snowflake is 2x slower than Databricks, right? So here's what the Barcelona guys claim over here. This is what Snowflake says in their blog article they can get on their version of running TPC-DS. And then this is what Databricks got when you load the official TPC-DS dataset into Snowflake manually. And this is what Databricks claims they can get with Photon, right? In the Databricks article, the reason they give for the discrepancy between what Snowflake claims and what you get if you just load the official dataset is that Snowflake's pre-baked version of the dataset has had that re-clustering stuff I talked about before: you let the automatic process run to rebalance the micro-partitions in a way that's optimized for the query access patterns. If you do that, you get the better results. If you just bulk-load it without any of those optimizations, this is what you get. And to get official TPC-DS results, you have to account for that data preparation time, the maintenance or setup time, in your analysis, right?
Whereas in this case, that process ran for something like 12 hours to compress and sort everything, so the queries run fast, but they're not reporting that time in their numbers, right? So who's right? Well, I wanna be the Switzerland of databases. I like everybody. I will also say that both the co-founder of Databricks and the co-founder of Snowflake are investors in my startup. But I'll have to cut this. Did Snowflake respond to that? No, as far as I know, Snowflake didn't come out with another blog article after the Databricks response, and the thing sort of blew over because of the holidays or whatever. Sorry, this? Were Databricks' numbers audited, but Snowflake's just them running it themselves? Correct, yes. The point is that the Databricks numbers are audited by TPC, with a third-party auditor that checked them, right? Whereas Snowflake's is just: okay, we ran it, here it is; though they also claimed you can go run it yourself and see the results. But the point is, if they wanted it to really match, they'd also have to get their result audited. Correct, that's what I was saying last class. To your point, yes, if you wanna have a true apples-to-apples comparison and say it's audited by TPC, Snowflake would have to do what Databricks did. Is it worth it? I would say no, right? Again, this is fun for memes, but in the end, does it matter? So Reynold did send me a text afterwards saying that, for them, this was a big win because it changed the narrative of what Databricks was meant for. You now saw it as a competitor to Snowflake as a data warehouse, whereas maybe prior to this, Spark SQL was pretty slow and pretty primitive compared to Photon. For them, this was a game changer. Yes? Did you have the slide where Snowflake sponsored a benchmark?
Snowflake sponsored one in 2018. This is 2021, during the pandemic. Yeah. Again, there's a long history of database vendors shooting at each other over benchmark numbers. There were a couple of these about time-series databases more recently as well. And then Mongo famously had amazing benchmark results, until you realized that nothing was persistent, nothing was durable, right? Because in the old version of MongoDB, when you did a write, you would get back an acknowledgement, but the acknowledgement wasn't "we got your write and we saved it on disk." The acknowledgement was "hey, we got your packets." It hadn't actually run the query and done anything yet. So then you had to send another request to see whether your change actually made it to disk, right? But all their benchmark numbers looked amazing, because it was basically a ping back and forth. What's nice about some of the newer distributed SQL systems like CockroachDB, Yugabyte, Fauna, and PlanetScale is that they treat strong consistency and transactional guarantees as a core design concept, and out of the box they provide these things. Whereas yes, MongoDB supports transactions now, but it took them a while to get there. Now, sitting in the ivory tower of academia, I can claim: oh yeah, of course they should have done transactions from the start. But if you're a business and you're trying to get your thing out the door, I can see the advantage of not doing the hard stuff right away, and instead doing the things people immediately notice, like getting up and running quickly and having a nice UI or user experience. Get the traction, get the product working and people actually using it, and then pay off that technical debt later on. MySQL did that in the late 90s and 2000s: MyISAM was a shit engine that didn't do transactions, and you could lose data.
Eventually they got InnoDB and that fixed things. MongoDB, same thing, right? And you could claim Neo4j is basically doing the same thing as well. History repeats itself. All right, so I want to finish up talking about a few additional things. Again, like Databricks and Dremel, if you want to be able to operate on files that aren't stored in your database, you don't have any statistics, and you need a way to figure out what's going on. So, Snowflake originally required all users to load data into the proprietary format before it could be queried, but since the paper you guys read, they've added support for additional methods of ingesting data. Snowpipe is basically a way to use Snowflake as a Kafka-style endpoint where you can do streaming updates, and they send data as Apache Arrow. I'll talk in a second about how they support external tables, at least using Iceberg, and then hybrid tables are their transactional engine that came out in 2022. So, a Parquet file by itself is just the format of the file; it tells you what's in that file. But there's additional context you want to have about how that file relates to other files, which you could use to speed up query execution. In particular, this pruning thing. If I know I'm doing a query that's going to look up data within some date range, I'd like to know which files cover that date range, and ideally I'd be able to do that without having to go peek inside each file. With a catalog, I could do a lookup, find the candidate files, and then prune them; that's the expression evaluation we just talked about. But again, Parquet by itself doesn't have that.
So, the way people sort of hacked it themselves is that, oftentimes, literally the directory name would tell you what partition or what date range a file was in, right? Like I'd have my directories be data/2023/April/ and so on, and then you'd have to have additional code that looks at the directory names to figure out which files to read, right? That's a very common scheme that people would implement; I won't quite call it a design pattern. So with Iceberg, the idea is: instead of people just putting files out there willy-nilly and then trying to recover the metadata from directory names, let's treat this as a first-class entity and have a service that keeps track of what files I have. It's essentially the same thing Snowflake built internally in their catalog service, but now it's something that can be reused by Spark and Dremio and all these other data lake query engines that people are building. So that's what Iceberg is: basically a way to keep track of the partitioning information, the versioning, and the schema information or schema changes for a bunch of Parquet files. Snowflake added support in 2021 to ingest metadata from Iceberg, as well as to create and write Iceberg tables and query Iceberg files. So that gets around the problem they had where you would have to load everything into the proprietary format. There's enough of a match between what Iceberg maintains and what Snowflake maintains internally that they can query directly on top of these things. I think they also announced preliminary support for Delta Lake, the thing from Databricks, where you basically have this storage interface that takes incremental updates and eventually spits out Parquet files.
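The directory-naming hack described above can be sketched in a few lines: encode the partition value in the path, then prune files by parsing paths instead of opening them. This is the poor man's version of what a table format like Iceberg tracks as first-class metadata; the paths and layout here are invented for the example.

```python
from datetime import date

# Hypothetical data lake laid out as data/<year>/<month>/<file>:
files = [
    "data/2023/03/part-0.parquet",
    "data/2023/04/part-0.parquet",
    "data/2023/04/part-1.parquet",
    "data/2023/05/part-0.parquet",
]

def prune(paths, lo, hi):
    """Keep only files whose path-encoded partition overlaps [lo, hi]."""
    keep = []
    for p in paths:
        _, year, month, _ = p.split("/")
        partition = date(int(year), int(month), 1)
        if lo <= partition <= hi:
            keep.append(p)  # never had to open the file itself
    return keep

april = prune(files, date(2023, 4, 1), date(2023, 4, 30))
```

A query over April 2023 touches only the two April files. The fragility is obvious: the "schema" lives in a path convention with no versioning or type information, which is exactly the gap Iceberg's manifest metadata fills.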
You can do the same thing now; there's early support for that in Snowflake as well. The other thing you can do now with Snowflake is transactions. And again, Snowflake always supported transactions in a sense; they talk in the paper about how they have snapshot isolation, because you're doing transactions on the key-value store as you make schema changes or changes to the database. But you wouldn't really want to run your website that's ingesting new orders, or a message board, off of Snowflake, the original OLAP engine. It's simply not designed for that. So in 2022, last year, they announced support for what they call Unistore. I won't say it's a separate database system; essentially it's a service within the Snowflake ecosystem that allows you to run operational workloads alongside OLAP workloads directly inside Snowflake. The way it works is that you declare a table as being a hybrid table, and it's gonna have a row portion and a column portion. The row portion will be a log-structured storage area, and then there's a background process that occasionally looks at the data and writes it out to the internal columnar format that Snowflake uses. It looks a lot like the Delta Lake stuff we saw in Databricks, same idea, right? But in the case of Unistore, I think you can do SQL, transactions over SQL, whereas with Delta Lake I think it's just a simpler CRUD-style API. So now when your OLAP queries run on these hybrid tables, the system's gonna know: depending on what data I wanna see and how fresh I want it, I can go retrieve whatever I need from the row store and merge it with any intermediate results I'm generating from the column-store portion of the table. And all of this, again, is transparent to the end user. Okay. So, building a transactional database system is hard, extremely hard.
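The hybrid-table read path described above is a merge-on-read: fresh writes sit in a row-oriented delta, compacted history sits in the column store, and a scan overlays one on the other. A minimal sketch, with invented table shapes; this is the general technique, not Snowflake's actual Unistore design.

```python
# Compacted, columnar portion of a hypothetical hybrid table:
column_store = {"id": [1, 2, 3], "qty": [10, 20, 30]}
# Recent transactional writes, still in row form (not yet compacted):
row_delta = [{"id": 2, "qty": 25}, {"id": 4, "qty": 5}]

def scan(fresh=True):
    """Yield (id, qty) rows, letting row-delta updates shadow the column store."""
    overlay = {r["id"]: r["qty"] for r in row_delta} if fresh else {}
    for i, rid in enumerate(column_store["id"]):
        # An update in the delta wins over the stale columnar value.
        yield rid, overlay.pop(rid, column_store["qty"][i])
    # Brand-new rows that exist only in the delta so far.
    yield from sorted(overlay.items())
```

A fresh scan sees id 2 updated to 25 and the brand-new id 4, while a scan that tolerates staleness could skip the overlay entirely and read pure columnar data; that freshness knob is exactly the trade-off the system manages for you.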
And Snowflake got this up pretty quickly. How did they do it? Well, they have another database called FoundationDB that they use internally. That is a transactional key-value store, and it is very robust. And so, I don't remember whether the paper says it explicitly, but they use FoundationDB for the catalog service. The catalog service has to be transactional. It has to be durable, but it's not columnar information; it's row-oriented. So they use FoundationDB for doing all the transactions internally: when files get added, when new nodes get added, when nodes go down, and so forth, right? Any schema changes that occur, all of this is being stored in FoundationDB. My understanding is that they wouldn't have one FoundationDB cluster or instance per virtual warehouse. They would have one FoundationDB instance, which could be multiple nodes, service multiple virtual warehouses across different customers, right? Because the amount of load you're putting on this thing is not as high as scanning data or running the actual workloads. So they were using FoundationDB from the very beginning, in 2012, 2013. Again, when you're designing a system, in the same way they didn't want to worry about building a storage layer and let Amazon handle that with S3, now they didn't have to worry about how to handle the transactional catalog; they relied on FoundationDB for that. But FoundationDB was closed source. Apple bought them in 2015, and the company basically ceased to do any operations for any external customers. Now their only customer was Apple. So that's a problem. And by 2015 Snowflake is becoming huge.
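The snapshot-isolation guarantee mentioned earlier, which Snowflake gets from doing catalog transactions on a key-value store, can be sketched with a toy multi-version store: each reader sees the database as of the commit version at which it started, so a concurrent schema change doesn't leak into it. The key layout below (e.g. `table/orders/schema`) is purely illustrative.

```python
# Toy multi-version key-value store illustrating snapshot isolation.
class MVStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_version, value)
        self.commit_ts = 0

    def commit(self, writes):
        # Each commit gets a new global version; writes keep old versions.
        self.commit_ts += 1
        for k, v in writes.items():
            self.versions.setdefault(k, []).append((self.commit_ts, v))
        return self.commit_ts

    def read(self, key, snapshot_ts):
        # Return the newest value committed at or before the snapshot.
        for ts, v in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return v
        return None

db = MVStore()
db.commit({"table/orders/schema": "v1"})
snap = db.commit_ts                          # a reader starts here
db.commit({"table/orders/schema": "v2"})     # concurrent schema change
print(db.read("table/orders/schema", snap))  # "v1": reader is isolated
```

This is the property you want for a catalog: a running query keeps seeing a consistent view of the metadata even while files are added and schemas change underneath it.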
You know, this key component of your system has now been bought by, not even a competitor, just some other tech company. What do you do? Well, they told me that they had the source code in an escrow service, so that in the event FoundationDB got bought, they would have access to the source code. So when FoundationDB got bought in 2015, they got the source code, which they could then maintain internally, expand upon, and keep using in Snowflake. But then in 2018, Apple announced, oh, we're gonna open source FoundationDB. I think it's now part of one of the foundations, the cloud-native one or the Apache Foundation. Snowflake then had to go merge a bunch of the stuff they'd built over the three years since Apple acquired them back into the open-source FoundationDB, so they could keep up with its development. So Snowflake is a database company; they spend a lot of time building the core OLAP engine, but there's actually a pretty sizable team at Snowflake that works on FoundationDB. The two main groups of FoundationDB developers are gonna be people at Apple and people at Snowflake. But for legal reasons, the Snowflake guys can't commit things to the FoundationDB source code. They have to send a Slack message to somebody at Apple to go do it for them, right? Because of lawyers, there's always a problem. Anyway, FoundationDB is an interesting system. Obviously we're not focusing on it in this class, but they did give a talk with us a few years ago; it's on YouTube as well. One of the things that is very fascinating about FoundationDB, which is much different from how other people build database systems, is that they didn't set about building the database system first. They built a simulator of the database system first.
That way they can inject failures and check the fault tolerance, robustness, and resiliency of the system in a simulated environment, like stress testing, to see what happens if a node fails. And then they have a way to verify that the implementation is actually correct for different corner cases of various scenarios. That ensures it's super reliable, which, of course, is what you want your transactional system to be. And that's the way they test it. Okay, all right, so to finish up. Snowflake is a very interesting system. They laid out what I'd call the roadmap of how to build a successful cloud database-as-a-service company, and that's now being replicated by a bunch of different startups. But they got there first. I would say that the main players in the space are gonna be Snowflake, Microsoft, Amazon, and Google BigQuery. And then all these other people that have raised a lot of money, these other startups, are trying to get to that Snowflake level. I would argue that Snowflake got there first, and it's gonna be very hard to beat them because they have so much traction already and a big head start on everyone else. Although there's not a lot of information available on how their internal system works, Snowflake has given a bunch of talks with us at CMU. They're giving a talk next week in the intro class as well. So you can get bits and pieces about how they do certain things from these talks, or even, like I said, by calling people there, but that's not an option for everyone. But there are still a lot of things where they publicly say, oh, we can do this, and it's interesting, but it's not really clear how they actually do it. And I would argue that the Databricks guys do a better job publishing than the others. Again, I think that's a byproduct of them coming out of Berkeley.
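Going back to the FoundationDB simulation-testing idea for a second, here's a hedged toy of what deterministic failure-injection testing looks like: a "cluster" of three counter replicas, a seeded schedule of writes, crashes, and recoveries, and an invariant checked after every step. Everything here (the replica model, probabilities, invariant) is made up for illustration; the real FoundationDB simulator is vastly more sophisticated.

```python
import random

def simulate(seed, steps=200):
    """Replay one deterministic schedule of writes/crashes/recoveries.

    Because the whole schedule is derived from the seed, any failing
    run can be replayed exactly -- the key property of this style of
    testing."""
    rng = random.Random(seed)
    replicas = [0, 0, 0]
    up = [True, True, True]
    committed = 0
    for _ in range(steps):
        event = rng.random()
        if event < 0.1:                       # crash a random node
            up[rng.randrange(3)] = False
        elif event < 0.2:                     # recover a node: catch it up
            i = rng.randrange(3)
            up[i] = True
            replicas[i] = committed
        elif sum(up) >= 2:                    # a write needs a majority
            committed += 1
            for i in range(3):
                if up[i]:
                    replicas[i] = committed
        # Invariant: every live replica agrees with the committed value.
        assert all(replicas[i] == committed for i in range(3) if up[i])
    return committed

# Stress many deterministic schedules; a failure pinpoints its seed.
for seed in range(100):
    simulate(seed)
print("all seeds passed")
```

The payoff is exactly what the lecture describes: corner cases like crash-during-write get exercised thousands of times in simulation, reproducibly, before the system ever runs on real hardware.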
So there's other things. One thing I actually don't know, and I asked them about: again, it's still state-of-the-art, but you can kind of see how it was designed for a different time. Even though it was 10 years ago, which is not that long, right? But in the tech world, that is actually a long time, because people thought about databases differently 10 years ago than we think about them now. In particular, this idea of the data lake, right? Databricks coming out, building on top of Spark, like I said, they don't have their own proprietary format. They're gonna run on these open-source formats. And a lot of the systems that are out today are doing the same thing. They may build proprietary indexes to speed things up, but they're gonna be mostly running on the open-source formats. The other thing is, I actually don't know whether they are running containers in Kubernetes, right? The paper talks about a process on a worker node, because you assume at the time they were running on bare metal or VMs. I like to think that over time that's been changed, but you wouldn't build a system like that today; everything would be containerized. I'm assuming, again, that they fixed this up, but I haven't seen anything public about it. So I think, again, this just shows you that, especially in databases, things change really fast, and what seems like state-of-the-art will be antiquated within five or 10 years. Snowflake is still state-of-the-art, but you start to see remnants of its era. And I'm not saying they're gonna fail because it was designed 10 years ago instead of eight years ago or five years ago. But over time, it's very hard to change these things, especially at their scale.
Although, you know, if you're a cloud service, you can hide all that; if it were on-prem, it would be a nightmare. But anyway, those are just some random musings. Okay, so next class, Mark is gonna come give a guest lecture about DuckDB. We'll project it here and sit together; we'll try to get the Owl thing, the speaker where the camera goes around. Mark's a great guy, it should be a good talk, okay? Okay, all right guys, see ya.