All right, so we're super excited to have Ashish and Jackie from Snowflake with us today. Ashish has been with Snowflake for five years — four years? five years — five years. They're going to talk about the data warehouse they've been building as part of their startup for the last couple of years. I actually interviewed at Snowflake — I had a phone interview with the French dudes. That was 2013, before I agreed to come to CMU, and I basically told them, look, this is very interesting, I'd love to come work with you, but if CMU makes me an offer, I'm coming here. That's why I'm here. If CMU hadn't made me that offer, I would have gone to Snowflake in 2013.

So I will say, what I like about this semester, having our friends in industry come talk about their data systems, is that a lot of the things we talk about will show up in their system, and you'll see how they solve the problems. They may not be exactly the things we talked about, because they're working in a different operating environment than the in-memory databases we assume — they're working in a shared-disk architecture. So it's good to get another perspective on how people are actually building modern database management systems. Okay? Okay, go for it. Thanks, Ashish.

Awesome. Hi guys, thanks for having us today. My colleague Jackie, shout out to him. And I walk while I talk — I can move this. Maybe that's not a bad idea. So we presented this paper at SIGMOD 2016. A lot has changed since then, but the core database has stayed mostly the same. What I'm here to present today is a little more than a high-level overview of Snowflake's architecture: what were some of the design principles, why did we build it the way we built it, what were the decisions we made — and then a bit about what we've done since then, lessons learned, and the future.

So how many people have read this paper? I was hoping everybody in the class would raise their hand, so I could close the laptop and walk out of here and say, this is great — but it seems like it's the opposite.

So let's start from the beginning. Snowflake is an elastic cloud data warehouse. A data warehouse, or an OLAP database, is different from an OLTP database because we are an analytics database. We scan large amounts of data — we deal with petabytes and petabytes of data — as opposed to OLTP systems, which tend to deal with a lot less than that. There are tons of data warehouses out there. Oracle has one, SQL Server has one, Vertica — which he was probably, okay, never really associated with. Okay, sounds good. Good for you. Netezza, Teradata — there are tons of databases that have existed for 30 years or so in this space. But what changed around the late 2000s was that the cloud was around, and tons of things were coming up on the cloud. And we thought that building a data warehouse on the cloud was a great idea. It would be a multi-tenant database. It would be transactional — there are a number of OLAP databases that are not transactional. Security on the cloud is a big thing, so that would be a key tenet; it would not be an afterthought but baked in from the very beginning. And it would be elastic. We had a couple of French founders, who he mentioned as the French dudes, and we had one Polish founder as well.
So when they started talking about this, they said, you know, the first thing that every new team that wants to build a database — including Peloton — would say is: let's take Postgres, rip out this piece, and put in a new storage layer. And we decided not to do that. We decided it would be a lot more effort down the road to deal with the limitations of what Postgres may have, in terms of the kinds of plans it may offer, in terms of the catalog it may have. So we decided to build it from scratch. There's no Hadoop, there's no other SQL database running this. What we did leverage a lot was the cloud, and I'll talk more about that. It currently runs on AWS and Azure. Before I left, I looked at our dashboard, and we've run two billion queries this year to date — on one data center, one deployment of ours. So we're running close to a few hundred million queries a day, and we have over a thousand customers. Thanks, Channing.

So the Polish founder, Marcin — he helped build VectorWise. That's right, and you may have read papers about that in this class. Yeah, Marcin was our rock star hire. And you'll see this as you go.

So we picked the cloud, obviously. Jackie and I both worked at Oracle before this. I worked on the in-memory database, Oracle In-Memory, and Jackie worked on the query optimizer. And I would say that at least 70% of Oracle's code was in the storage layer and everything around that — transactions, et cetera — and maybe 30% was in the query optimizer and the layers above that. So when the cloud came around, we realized that this 70% was actually written for us, or at least most of it was, and we didn't have to do all that much. Or at least, it didn't sound like all that much work at the time to build the rest of the database, since we'd built databases before. And that turned out to be true. We got an unlimited supply of compute, which was elastic: we could demand compute on the fly and relinquish compute on the fly. And it gave customers a model to pay for what they were using — they could stop the service when they said, I don't need to run any more compute. This was one of the biggest problems you saw at Oracle and other existing databases that are built and sold as boxes: you buy a quarter rack of Oracle, you grow out of it, and then what do you do? You buy another half rack, and it takes you three months to procure it, another three months to plan it, and three months of support to move things around. We didn't want any of that. It needed to be really simple and really elastic.

And it had to be software as a service. We wanted to be the DBAs. And you could see the mind shift — I'll talk about this a little later — the shift happening in analytics was that people were focusing more on actually analyzing data than on having to manage and massage the data itself. So we wanted it to be such that there are no knobs; there's nothing for you to really do except manage, visualize, and analyze your data. Right? So I talked a bit about traditional databases already: you bought boxes, you bought software licenses, and they were a pain to deal with.
So the other thing — and this is where I talk about the mind shift — is that classically, analytics in large businesses was done with a batch job that ran every night and transferred some amount of data from the transaction processing system over to the analytic system. It ran every night, so your data would be stale: the results in the dashboards and graphs you looked at were a day old. But that stopped being acceptable. People were generating a lot more data, and a lot of different kinds of data, and wanted all of it in their OLAP database. In addition, it wasn't just structured data coming through. There was data coming from Salesforce, data coming from logs — click logs or what have you. And that data was different: it was semi-structured. We'll talk more about that later.

To deal with this new mindset for data, of course, a lot more solutions got built. Companies like Google, Facebook, et cetera went out and built their own systems to deal with data at the scale they were dealing with. And we chose not to use any of the Hadoop-style systems underneath our database, primarily because we thought they were built for specific use cases at much larger scale and degraded the performance of analytics dramatically. When our first investor started talking to us, he met the French dudes and said, you guys believe you can build a database that's 100 times faster than Hadoop? And these guys said, of course we can. And he couldn't believe it — he said, how is that possible? So we questioned everything Hadoop was doing for the general populace at that time.

What year is this, 2013, 2012? This is maybe mid-2012 — mid-2012 is when the company started, so this must have been about six months in the making. But even then, in 2012, the guy was incredulous. But this is an investor you're talking about, so he's not necessarily that tech savvy, right? That's not completely true. But I would say that to build a cloud database in 2012 was still not a given. I was on the job market in 2013, end of 2012, and you could easily pick a dozen companies doing Hadoop-based solutions, and only one company that wanted to talk about doing a native database from scratch. So that kind of gives you the spread.

The other thing these solutions lacked — and this kind of refers to that — is that there was a lot more focus on building processing frameworks, and a lot less focus on manageability and security. So they weren't really built for an enterprise that didn't want to carry all that IT burden. If you wanted to hire and build a Hadoop team with 15 people like you guys, go for it. But if you didn't want to do that, there was nothing out there. So our vision was a straight data warehouse as a service, with SQL as your frontend, no knobs, no infrastructure that you had to deal with, elasticity that is multi-dimensional — which you'll see in a little bit; I'll talk a little more about this — and support for all kinds of data. The reason this slide exists is because we had to compare and contrast ourselves with some of our competitors, and we talked about this.
So there were two classic models: shared-nothing databases and shared-disk databases. And one of the problems we were seeing with shared-nothing architectures — people really liked them, me included back then — was that they came with a huge set of problems around local storage. They seemed very logical, they seemed great. You're saying, yes, I have to do a join, I can do this locally, I will do my join on the primary key locally, and then I will deal with joins on non-primary keys outside of that, and maybe I won't deal with those very well. But as soon as you co-located the compute and the storage, you ran into issues when you needed more capacity, or when you were running some background work and now you have to figure out how to schedule work around it, or when you had to say, I've run out of storage, I need to add two new servers, I need to redistribute the data. Right? And this was the dominant architecture. Oracle is the one that didn't exactly do this, but everybody else was doing it.

For some of the marketing slides we have this picture — I don't know how many of you used to read the cartoon Asterix and Obelix, and maybe the reference is a little lost — but it's basically a real fight for resources on these systems. And to deal with those resources, you want to separate the compute and the storage. That was one of the tenets Snowflake had from the beginning. This also speaks to the problems when you lose servers: as you can imagine, you would lose data. There are ways systems work around this with multiple copies of the data.

Anyways, this is one of our stock slides, and it highlights what the multi-cluster, shared-data architecture is. There's the data that sits in the center here, on a service like S3, for example. Various formats can be ingested into this; we turn all of them into our proprietary format. It's called FDN — flocon de neige, which means snowflake in French. And then what you do is provision a set of servers, a cluster, what we call a virtual warehouse. And you can run queries through this warehouse on this data. The data is fetched from S3 and brought here for the purpose of the query. When you're done querying, you can return the compute. So you're not paying for anything other than the compute that you used, plus the storage, which is dramatically cheaper on all cloud solutions.

You can run one user, you can run multiple users. You can also have different clusters for different kinds of workloads — and this has been a huge problem traditionally — where you say, look, I've got only so much CPU for my ETL jobs. ETL is the process that takes data from various other sources and pushes it into the OLAP database; it stands for extract, transform, load — effectively ingestion of data. And you can separate that. You can have a separate setup for your QA and test environment. What we also said was that you can take some of this data and make a copy of it — I'll talk about that in a little bit, bear with me for a second; that's a feature called clone. And you can have separate warehouses for dashboards; different business units can do different things. This worked really well for customers.
It was really convincing — and you'd see people smile — that, oh, I don't have to go fight with this department or that department to do this. And what we also did on top was say, aha, there are other data science solutions that exist, like Spark. We need to be able to connect with more than just SQL. We need to work with dashboards like Tableau and Qlik, and with Spark, which does data processing. So we built connectors for other things. And fortunately, we've now grown to the stage where other people build connectors for us too.

The one thing that I haven't mentioned here — the life of a query is that it comes in, and maybe that's actually better explained on the next slide — but there is a large metadata component here. While the data stored on S3, on the blob storage, is a set of immutable files, all the other catalog information — accounts, users, sessions, billing, what a SQL statement is, what the set of changes a transaction has made is — is really stored in metadata, which is a separate system and is available to all the services and warehouses. And that allows us to efficiently do something called a clone, which I will talk about again in a bit.

So you've got no data silos. Data was available to each one of these warehouses; data got moved from S3 and cached on these clusters. So while it was shared-disk at this point, it also had the advantages of shared-nothing. This workload over here was a loading workload and could cache the subset of data that was valuable to it, whereas this Tableau user — a dashboarding tool user — could cache the subset of data he or she was analyzing, whereas the QA department could do something completely different. And so we actually get really good cache locality. We get really close to shared-nothing-like cache locality on these warehouses, while all the data is still here.

So I'm curious — to support Spark, is there a generic JDBC connector? Our first version was a generic JDBC connector, maybe our 0.1 version. But now we do a lot more — we do pushdown of predicates, and we need to do a lot more. That's all on the connector side, and that's a very important piece for us. And what shows up at Snowflake from Spark — is it SQL, or is it something beyond SQL? SQL is still the interface. Whether we will do anything beyond SQL, that's a question that's yet to be answered. I mean, you could imagine more things, but some people believe in that and some people don't.

So this is a better view of the architecture. The cloud services are common to everybody; they manage everything other than the execution of the query itself. When you first create a Snowflake account, you really have to do nothing — all we've done is create an account and an entry for you in the metadata. You log in, you provision a warehouse, like so, and then you execute a query. Login, of course, goes through authentication and access control and stores your session information in the metadata. Provisioning a warehouse goes to the infrastructure manager, which says, I'm going to go to Amazon and get you four servers.
You execute a query, which sends a request to the optimizer, compiles the query, produces a plan, and the virtual warehouse is polling this service to see whether there is a plan to execute — whether there is work for it to do or not. All the nodes, the servers in this particular warehouse, will fetch the same plan, execute it, and return the result. They actually write the result to S3 as well, and then the client can poll for the result and get it directly. So that's really the life of a query. What you get here is a set of workers that are completely stateless. They know they've been given a plan, they get a set of files from here — they're told these are the files you need to read, partitions as we call them — and they cache them locally. The cache is probably the only piece of state that we have. And when that query is done, they go and ask if there's more work to do, another query to execute.

Do you write the intermediate results to S3, and then something is polling that location to see whether they exist and pulls them in? And by intermediate results, you mean for a single query? Yes. We do write intermediate results, but there's nothing that's polling — the query has to continue to execute. There's no ability to kill a query and restart it at the same point later. Let's say you exploded your join and ended up spilling to the local cache here and maybe then to S3: at that point, if that query failed, all the intermediate results are lost. Is it like an exchange operator that's waiting for the results from its children? Correct. We'll get a little bit into the plan and what kind of plan it is, but hold that thought for a second.

So we start from the bottom — I don't know why that doesn't show up very well, but fair enough. The storage layer, as you know, is S3 for Amazon, or whatever the cloud storage is for Azure, and it's built of immutable files. This was something that was kind of forced on us and turned out to be a huge blessing. With immutable files, it became a lot easier to cache them at the various tiered levels of caching that we have, on ephemeral disk as well as in memory when you're executing the plan. It also made transactions and snapshot isolation much easier. S3 is, of course, highly available. The performance is not that good, but we solved that problem by making sure there's local caching.

So this is how a micro-partition looks for us. It's a fairly well-known format called PAX, a hybrid columnar layout — I don't know if you read that paper, but he covered it somewhere. What it means is that each column is stored separately within the same file, one after the other, and there are offsets to each column. Within each column, we do various kinds of compression depending on the type of that column. For example, for numbers we do base encoding; for strings we do tries and dictionary encoding; for JSON we have a fancy trie as well. Each file is roughly 16 megabytes in size — some are larger — and there is a header that points to each column, so you don't have to fetch the entire file. You can fetch part of a file. Fortunately, this is also something S3 supported; otherwise we'd probably have had to split this into multiple files. So you can seek in and read a few blocks. So that's the data itself.
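To make that layout concrete, here is a minimal sketch of a PAX-style file, assuming a toy header of per-column offsets followed by each column's bytes; the names (write_micro_partition, read_column) and the JSON encoding are illustrative, not Snowflake's actual FDN format. The point is that a reader can grab one column with a ranged read instead of pulling the whole file.

```python
# Toy PAX-style micro-partition: a length-prefixed header of per-column
# (offset, length) entries, followed by each column's bytes, stored contiguously.
import json
import struct

def write_micro_partition(path, columns):
    """columns: dict of column name -> list of values (JSON-serializable)."""
    blobs = {name: json.dumps(vals).encode() for name, vals in columns.items()}
    header, offset = {}, 0
    for name, blob in blobs.items():
        header[name] = (offset, len(blob))          # where each column lives
        offset += len(blob)
    header_bytes = json.dumps(header).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header_bytes)))  # header length prefix
        f.write(header_bytes)
        for blob in blobs.values():
            f.write(blob)

def read_column(path, name):
    """Read a single column without scanning the whole file,
    analogous to a ranged GET against S3."""
    with open(path, "rb") as f:
        (hdr_len,) = struct.unpack("<I", f.read(4))
        header = json.loads(f.read(hdr_len))
        offset, length = header[name]
        f.seek(4 + hdr_len + offset)                # seek directly to the column
        return json.loads(f.read(length))

write_micro_partition("part_0001.fdn", {"c1": [1, 2, 3], "c2": ["a", "b", "c"]})
print(read_column("part_0001.fdn", "c2"))           # ['a', 'b', 'c']
```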
I briefly mentioned that we also store the results of a query in S3. And this really helped us. For example, if I did select star from T on a one-billion-row table — it happens, customers do this all the time, not expecting anything, maybe with some small predicate — what do you do at that point? Most databases would croak on this. What we do is simply push those results to S3 and return the first block to the client. The client and the result set know what the next set of blocks are — there's a manifest, and the client fetches the results. Not only was this useful for results, but if the format of your results is something you can query — which we can — then you can enable a feature like querying your own results. So some of these features were enabled by metadata and storage in ways classic databases just can't do, at least not very easily.

I mentioned our metadata store. This is a big part of our story. We store everything that is not data in that system: all our objects, all our catalogs, zone maps — I assume you guys are familiar with zone maps — and even transactions. They're not done by the workers themselves; transactions need to be handled at a global level if you've got multiple clusters, because the work of one cluster can interfere with the work of another. And as a result, a lock queue is implemented within this metadata store. The metadata store is FoundationDB, and it recently went open source again — we're very happy about that.

Actually, what is the pricing model? You charge for storage, which makes sense because it's sitting there, and then you charge on the compute side for the query. That's it. And you don't charge for data transfer? We assume that's noise, and it is. Oh, because you dump everything to S3, Amazon is going to charge that back to you... I mean, we don't charge any premium on the storage itself; it's a pass-through cost for us. And we believe that by charging a premium for the service, we should be able to cover the rest of the cost. Otherwise it's just a very complicated pricing model, and it becomes like checking through various menu items to say it's this, this, and this. In fact, we don't even charge for storing the results. It's something we can just absorb.

So we talked about this. The other thing we did was hide what warehouses really are in terms of size and resources. You basically order a t-shirt size and say it's X-Small or 4X-Large or 2X-Large, where each size really is a multiple of two in terms of the compute resources you get. And this allows us to change the instance types we may use in the future as well. So let me pause for a second and see if there are any questions — and I'll get a sip of water in as well.

The one thing I haven't mentioned yet is how these workers get a set of partitions to work with. When the query comes in, the compiler determines from the metadata which set of micro-partitions belongs to a table; it does what's called pruning, and using the zone maps it removes a number of micro-partitions; and then it consistent-hashes the remaining micro-partitions onto the nodes of the virtual warehouse.
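A minimal sketch of that prune-then-assign step, assuming each micro-partition carries a min/max zone map and the warehouse nodes sit on a simple hash ring; the predicate, the Ring class, and the partition records are all made up for illustration rather than taken from Snowflake's code.

```python
# Prune micro-partitions with zone maps, then consistent-hash the survivors
# onto warehouse nodes so the same file keeps landing on the same node.
import bisect
import hashlib

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        self.points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def owner(self, partition_id):
        i = bisect.bisect(self.keys, h(partition_id)) % len(self.points)
        return self.points[i][1]

def prune(partitions, col, lo, hi):
    """Keep only partitions whose zone map can overlap [lo, hi]."""
    return [p for p in partitions
            if not (p["zone"][col]["max"] < lo or p["zone"][col]["min"] > hi)]

partitions = [
    {"id": "p1", "zone": {"c2": {"min": 0,  "max": 4}}},
    {"id": "p2", "zone": {"c2": {"min": 3,  "max": 9}}},
    {"id": "p3", "zone": {"c2": {"min": 10, "max": 20}}},
]
ring = Ring(["node-1", "node-2", "node-3", "node-4"])
survivors = prune(partitions, "c2", 5, 8)            # only p2 can match
assignment = {p["id"]: ring.owner(p["id"]) for p in survivors}
print(assignment)   # e.g. {'p2': 'node-3'}; the same file maps to the same node every run
```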
It does this consistent hashing because you expect the same workload to come through, and the caches on the virtual warehouse to stay effective. What you'll often find is that resources in the cloud are flaky: servers can sometimes have poor IO performance, or are just degraded because you're actually sharing resources with others. So you need to deal with stragglers. This is a very typical problem in query processing. You'll find that seven out of your eight pipelines are actually done, but the eighth one is stuck, and it slows down the entire query because you have to wait for that one to finish. So what we did was add this concept of stealing. It's not super fancy — I think it's not super fancy, but the French dudes disagree with me. What we did is we also consistent-hash the stealing. You can think of it as another round of consistent hashing; it effectively gives you deterministic stealing, so you increase your chances: if one specific node is consistently slow, the node that steals its work does so deterministically. This is for the case of scans, but it's also true for ingesting external data.

What is usually the main cause of the slowness? If the data is skewed, stealing the work isn't magic — you still have to scan all the work. That's right, you still have to scan all the work. But you're running on a node which is on a rack where some other company is also running something and has saturated the network. So all of a sudden your network to S3 is terrible. So, you're screwed.

So we've moved from the storage in the last slide to the execution engine. And here this gets more into the weeds, the kind of stuff you've been talking about in this class. We're columnar, as we said, and our third founder, Marcin Żukowski, was one of the guys who built VectorWise and was one of the pioneers of vectorization for databases. So it was very clear to us that we had to vectorize this. Our use of SIMD instructions and vectorization was there on day one — we knew what we wanted to start with — as was our use of what we call selection vectors. You don't materialize intermediate results when you're, say, filtering; you maintain which particular rows you picked. These are some of the decisions we made very early on.

We also decided that we would do a push-based model, unlike a lot of databases that are Volcano-style, including Oracle, which are iterator-based. And this was, I think, fairly prevalent by the time we started working in this space; I think it was quite clear that all future databases should be written this way. This debate over push-based versus pull-based query execution is a fairly old one — the paper from Neumann made it quite popular, and it was a good paper. So what we do is we operate on blocks of rows, what are called batches of data. And the data moves from the bottom upwards if you visualize the plan, toward what are called downstream operators.
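On the selection-vector point, here is a minimal sketch, assuming a batch is just a dict of Python lists: the filter narrows a list of surviving row positions instead of copying the survivors into a new batch, and the pipeline pushes batches downstream to the aggregate. The function names and batch shape are invented for illustration.

```python
# Filtering with a selection vector: survivors are tracked by position,
# the underlying column is never re-materialized.
def scan(batch_size, column):
    """Yield (batch, selection_vector) pairs; initially all rows are selected."""
    for start in range(0, len(column), batch_size):
        batch = {"c2": column[start:start + batch_size]}
        yield batch, list(range(len(batch["c2"])))

def filter_gt(batch, selection, col, threshold):
    """Narrow the selection vector; no new column is built."""
    return [i for i in selection if batch[col][i] > threshold]

def agg_sum(batches, col):
    total = 0
    for batch, selection in batches:          # batches are pushed downstream
        selection = filter_gt(batch, selection, col, 5)
        total += sum(batch[col][i] for i in selection)
    return total

print(agg_sum(scan(4, [1, 6, 7, 2, 9, 3, 8, 4]), "c2"))   # 30
```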
At any given time, there's a set of operators at each level that is executing and sending data to the next, and once they're done, the plan moves forward. That's changed and improved over time. But what this meant was that we could also have plans that were not just trees — they could be DAGs. Your plan could fork your data and merge it back up later. And this was very helpful. For example, take our update. Because we're dealing with immutable data, if you decide to modify a single row of a table, you have to take that micro-partition, read the entire partition, write everything other than that row back, and write that row again with the modified data. So you can imagine there are two pipelines: all the data that needs to be updated goes on one pipeline, and all the data that doesn't need to be modified goes on the other. You can do this as part of your scan operator and split it at that point. So these DAG-shaped plans ended up being quite powerful for us, and without the push-based approach, I don't think we could have done the DAG-shaped plans either.

You seem to have a quizzical look. I'm missing the update part — I don't understand the splitting. So you have an update statement which modifies one row in a partition. Now we have to create a brand-new partition with all the rows except for that one row, which stay the same, plus that one row with the update applied. So what do you do? You scan this twice, or you create some intermediate results, or you say: I'm going to have two sets of operators run on this scanned data. I pivot based on whether a row needs to be modified: if it does, it goes to this operator, which writes it out and produces a new partition, a new file; if it doesn't need to be modified, it goes on the other pipeline. And we ended up using these DAG plans for view merging also. Right, Jackie? Right.

So there is no buffer pool. We realized — again, this was not new information at that time — that reserving large amounts of memory for micro-partitions in memory was not as valuable as giving the higher operators more memory. Right, and go ahead. Correct. So we send some data on one link and different data, possibly, on another link. One link says: modify the data that comes in here. The other link says: this doesn't have to be modified, it passes through as is — it's what we call a scan-back, in fact. Then you merge these back and produce a new file. So that's right, there is no buffer pool, and we'll talk a little more about our memory management in one of the slides — there's a design slide coming up pretty soon. That's it, this is the one.

This talks a little bit about how Peloton is thinking about its model as well, about self-driving. I can't say we're self-driving in the same sense that Peloton is, but to operate at the scale that we do, we're definitely self-tuning and definitely adaptive. For example, we do automatic memory management, where how much memory an operator uses is adaptive. The compiler decides how much memory it should get, but the operator tries to use as much memory as possible in the beginning.
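Before getting further into the memory management, here is a minimal sketch of that update-by-rewrite idea: one scan, two pipelines, one new immutable partition. The helper names and row layout are made up for illustration, not Snowflake's operators.

```python
# Update on immutable files: the scan forks each row into a "modify" pipeline
# or a "copy as-is" (scan-back) pipeline, and the two streams merge into a
# brand-new partition file.
def rewrite_partition(rows, needs_update, apply_update):
    modified, unchanged = [], []
    for row in rows:                                  # single scan, split at the source
        if needs_update(row):
            modified.append(apply_update(dict(row)))  # pipeline 1: rewrite the row
        else:
            unchanged.append(row)                     # pipeline 2: pass through as-is
    return unchanged + modified                       # merge into a new immutable partition

old_partition = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}, {"id": 3, "qty": 9}]
new_partition = rewrite_partition(
    old_partition,
    needs_update=lambda r: r["id"] == 2,
    apply_update=lambda r: {**r, "qty": 0},
)
print(new_partition)
# [{'id': 1, 'qty': 5}, {'id': 3, 'qty': 9}, {'id': 2, 'qty': 0}]
```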
So the operator starts by trying to use as much memory as it can, and then it can scale down if it needs to, because another query came in and you need to do workload management. What you want is: if you've got a single query running on a warehouse, on a set of nodes, you want it to take all the resources that are available and finish as fast as possible. Now, if another query comes in, then you say, oh, shit, what do I do? I've got to split this into two. So the operators can grow and shrink their memory based on the situation. We also pick the degree of parallelism based on what we see. If we say, oh, this is a really simple query, we believe it only needs to scan eight partitions — that's eight files — then it can run on a single process; it doesn't really need eight nodes, and we change that on the fly.

So the compiler needs to come up with some decision — compiler and query optimizer are the same thing for us, they're synonyms. It comes up with the estimation, but you don't really use it? You still give the operator whatever memory you have? The memory estimate is used for scheduling — for queueing the queries to run on a warehouse to begin with; it's used by the warehouse query scheduler, and to decide what the degree of parallelism should be. But once you decide that, you give all the resources to the query. You give all the available resources to the queries; you split them evenly across the queries that exist. There's a little bit of fudge factor and a few nuances there, but that's the higher-order bit. And why would you not? You have all these resources available, and you want your query to finish as fast as possible. In fact — and this is one thing we tell all our customers — if you find that your queries are scaling linearly, then use the largest warehouse you can.

It's kind of interesting to me that you use the memory estimation to decide how many queries to run, but you don't use it to give different amounts of memory to different queries or different operators. That's right. And you'll see that that's one of our ethos: our general belief is that the query optimizer can be wrong, and when it is wrong, it's horribly wrong. So you use it to get some sense of how things should work, but you don't rely on it at runtime. At runtime, you make adaptive decisions on the fly.

And at what granularity are you adjusting things — is it per vector, as it processes, that you decide whether memory should go up or down? It's per operator. Each operator has a budget for memory, and that budget is shrunk or grown, and the operators are actually monitoring the system: you see how much memory is actually available on the system, and whether you need to degrade and back off. They're polling and checking: should I back off? At what cadence — on the order of a few milliseconds? I would say hundreds of milliseconds. And again, if it's a hash join and you've built a hash table, you don't want to shrink it back — you'd have to write it back, or it spills to disk.

So there is no vacuuming, there are no statistics; the only thing we use is zone maps. At this level, there is nothing else the execution engine uses to decide to do something different. It's picked for you already by the compiler, based on what the planner came up with.
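Going back to the memory management for a moment, here is a minimal sketch of that grow/shrink behavior, assuming a per-operator budget that is re-checked every few hundred milliseconds against the node's free memory and the number of concurrent queries; the numbers and the OperatorMemory class are invented for illustration only.

```python
# Per-operator memory budget: start from the compiler's estimate, grow greedily
# when the node is idle, shrink (possibly forcing a spill) when other queries arrive.
class OperatorMemory:
    def __init__(self, estimate_mb, floor_mb=64):
        self.budget_mb = estimate_mb
        self.floor_mb = floor_mb

    def poll(self, node_free_mb, concurrent_queries):
        """Called by the operator on the order of every few hundred milliseconds."""
        fair_share = max(self.floor_mb, node_free_mb // max(1, concurrent_queries))
        if fair_share < self.budget_mb:
            self.budget_mb = fair_share                            # shrink: may spill to disk
        else:
            self.budget_mb = min(self.budget_mb * 2, fair_share)   # grow greedily
        return self.budget_mb

hash_join = OperatorMemory(estimate_mb=512)
print(hash_join.poll(node_free_mb=8192, concurrent_queries=1))  # 1024: alone, so it grows
print(hash_join.poll(node_free_mb=8192, concurrent_queries=4))  # 2048: still under its share
print(hash_join.poll(node_free_mb=2048, concurrent_queries=4))  # 512: other queries arrived
```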
So the join order is not yet adaptive? Got it. Okay. It can be.

So, other things — if you don't mind, I also saw some automatic distribution? Yes, and this next example is a much better way of showing that. These two slides I actually pulled from marketing, so they're a tad repetitive, but I like how they cover the picture. This is an example of a join. What we do is detect the popular values on, say, the build side of the join, which in this case is the left side of the join. And if there are very popular values — say we're in Pennsylvania, and your ticketing system has Pennsylvania 99 out of 100 times — then we want to broadcast that value as part of the join. For all the other values, we want to hash-distribute them. So at runtime we can make this decision; the compiler does not make this decision at all. We sample it, we get the number of values, and as such there is no performance degradation, because we can make the decision on the fly. This isn't a new technique per se, but it wasn't predetermined that you will do hash, or you will do broadcast, or you will do adaptive. It is always adaptive, it's always on. Otherwise you'd have to tweak and tune every query, which just isn't feasible.

Right, so we've now moved to the top layer. I can pause again, take a few questions. I think I'm doing okay on time — I've got about another 15 minutes and then maybe a few minutes for questions. That's right. Yeah.

So the cloud services are what we call the brains of the system. That's what you saw: the compiler, the optimizer, the transaction and metadata layer, and access control. This is the only part that is multi-tenant; everything else, as you noticed, is not. It's also highly available. We run this across different parts of the Amazon data center, what they call availability zones, so if one zone is unavailable or has reduced availability, we can switch over to another. When we started out building Snowflake, we actually thought this would be the easy part and the database would be the hard part. It turns out it's the opposite: there's a lot more work to do over here. I mean, there's obviously a lot of work everywhere, but it took us a lot more work to get this right and get this working. The analogy I give when I talk to people is that this service needs to be online 24-7, and it needs to support online upgrades. It's like you've got a dance floor with all these people dancing on it, and every week you need to build a new dance floor and slide it under all the dancers without them knowing it and without them ever stopping dancing. That's effectively what we do as part of the service.

So this talks a bit about our transaction model. As I mentioned, our transactions are done entirely in metadata, and we implement snapshot isolation. It turns out that these immutable S3 files make life much easier. What you need to do is keep track of the set of files that belong to a table at a point in time — which is after every DML operation that took place. If another DML takes place, then you keep track of the set of files, the partitions, that were added or removed, right?
And instead of maintaining an ARIES-style redo log or undo log, all you need to keep track of is the set of file operations you have to undo or redo on commit or rollback. So you keep these on the side, and when you say, aha, I want to commit, you make these files part of the set for the table. If you say rollback, you just throw them away. Now you can also imagine that if we keep track of the sets of files at various points in time, then we can do what we call time travel. It's a feature that some other databases also have, but it's really easy for us: these were the set of files one day ago, so you can run your query as of one day ago, or 30 days ago. All you're incurring is extra storage cost.

Is there a customer use case you can point to that leans on time travel queries? Because I know one, and it's the only one I ever hear about. There was a systems integrator who once showed us a live dashboard they had implemented with time travel: he would move a slider back and forth and the graph would change completely — it was issuing time travel queries to show, aha, this is how it was then, and this is how it was then. That's been one use case. I know it's used quite heavily; a lot of people use it to see what changes have taken place in a table — this is what the table was back then versus now. Other than that, people also use it to make copies of their data: you say, I want to create a clone of the data as of seven days ago. I think those are the big use cases. What vertical is that common in? The financial guys have, I would say, a slightly different set of requirements, and their requirements may go beyond seven days or 30 days or one year. I don't know if I have a good answer for that — I don't think I've dealt with that customer. Jackie? Sure.

And have you handled time travel across schema changes — do you record the schema history, so that if someone drops a column now and I run a time travel query into the past, I don't lose it? We don't version our schema today the same way time travel does. So you're a bit out of luck there, and you could run into situations where, say, you dropped the column and you may not be able to access it at that past point in time. Does that mean you cannot go back to seven days ago at all, or you can go back without the column? The compiler won't let you. The partitions and the data are there; the compiler says, this is the table, and it doesn't know how to go access that one thing.

So I also talked about cloning here — I briefly mentioned it and said I'd talk about it later — but you can imagine that cloning an object is now much easier: all you have to do is make a copy of the metadata that points to the underlying data, and then these objects can evolve independently.

One of the other key contributions Snowflake makes in the product is our zone map implementation. Since our zone maps are also part of metadata, the compiler itself can do a lot of work. For example, you can run a number of queries that can be answered completely from metadata — select count(*), or select max(C1) from T — without going to the execution engine at all.
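The transaction, time travel, and clone story above all reduces to versioned sets of immutable files. Here is a minimal sketch, assuming a table is just a list of frozenset versions; in reality this lives in the metadata store, but the shape of the operations is the same: commit adds and removes files, rollback throws the new ones away, time travel reads an older set, and clone copies a pointer.

```python
# Table versions as immutable file sets.
class Table:
    def __init__(self, files):
        self.versions = [frozenset(files)]             # version 0

    def commit(self, added=(), removed=()):
        current = self.versions[-1]
        self.versions.append((current - frozenset(removed)) | frozenset(added))

    def files_as_of(self, version=-1):
        return sorted(self.versions[version])          # time travel = pick an old version

    def clone(self, version=-1):
        return Table(self.versions[version])           # metadata-only copy, no data moved

t = Table(["p1", "p2"])
t.commit(added=["p3"])                         # DML: a new partition was written
t.commit(added=["p2_v2"], removed=["p2"])      # an update rewrote p2 into p2_v2
print(t.files_as_of())                         # ['p1', 'p2_v2', 'p3']  -- current
print(t.files_as_of(version=1))                # ['p1', 'p2', 'p3']     -- one commit ago
dev_copy = t.clone(version=1)                  # zero-copy clone as of an older version
```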
The compiler looks at the zone maps and says, aha, I'm just going to scan the zone maps; this one is the largest value, and I'm going to return that as the result. What we also did is implement pretty much every operator that exists to work with zone maps, so you can derive what the zone map values would be for every partition. Every partition has a zone map which says what the min value is, what the max value is, what the number of distinct values is — but let's focus on the min and max. If you had, let's say — substring is a bad example — but if you had a date_trunc function, then we would apply the date truncation even on the zone maps. So we implemented our evaluation of expressions both in the compiler and in the execution engine. And it gets a little tricky here because our compiler is written in Java and our execution engine is written in C++, so we had to work hard to make sure their behavior is the same. By the time the execution engine has received the plan, it either includes the values of the zone maps that are interesting to it, or, nine times out of ten, it doesn't need them at all because the set of files the execution needs to scan has already been determined. For join pruning, for example, we include the zone maps for the set of files we already picked as part of the execution plan. Basically, if there's any expression in the query that can be evaluated to a constant, then your problem is solved — right, from the zone maps. Right, and you may be able to do substring also; I'm not saying it's not possible, but in general it's a bad example.

So the zone maps are hanging out in memory, so at this point you're running at memory speed? You're running at memory speed. Our zone maps used to be in FoundationDB; we've since moved them out of FoundationDB, too. They're stored in separate files in S3, and we load them from there as well. So you're also storing zone maps on disk? Yes, on S3, and we cache them in memory. If you run a query which says select C1 from T where C2 greater than 5, then we will load the zone map for C2.

Is the entire cloud service written in Java? I'm wondering why, for people who wrote everything from scratch, you're using Java there. Only the brain is written in Java; the muscle is all written in C++. Could you remind me what the brain is? The brain is the cloud services on top: the compiler, the metadata manager, access control, security, transactions. The muscle is the virtual warehouses, the set of workers actually executing the plan. The thought process was that to build a RESTful service, a cloud service, there were a lot more libraries available in Java. That was true at that point. It's still true now, but I would say there are enough cloud services that have been built using only C++. We've had our fair share of issues and concerns with Java, and I would guarantee you that if we did this again, we would probably not pick Java for the brain either. But you've got to realize that the bulk of the time for all our queries is spent on the worker nodes, in the C++ code. When you have queries that run for a minute, or even two or three seconds, the time is spent down there.
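To go back to evaluating expressions over zone maps for a moment, here is a minimal sketch, assuming a monotonic function like a month-level date truncation, a min/max zone map per partition, and a toy equality predicate; the names are illustrative, not the compiler's actual interfaces.

```python
# Derive zone maps through a monotonic function, prune with them, and answer
# a MAX() query purely from metadata.
from datetime import date

def month_trunc(d):                      # stand-in for date_trunc('month', col)
    return d.replace(day=1)

def derive_zone(zone, fn):
    """fn must be monotonic for the derived min/max to be valid."""
    return {"min": fn(zone["min"]), "max": fn(zone["max"])}

partitions = [
    {"id": "p1", "zone": {"min": date(2018, 1, 3),  "max": date(2018, 2, 27)}},
    {"id": "p2", "zone": {"min": date(2018, 3, 14), "max": date(2018, 4, 2)}},
]

# Prune for: WHERE date_trunc('month', d) = DATE '2018-03-01'
target = date(2018, 3, 1)
kept = [p["id"] for p in partitions
        if derive_zone(p["zone"], month_trunc)["min"] <= target
        <= derive_zone(p["zone"], month_trunc)["max"]]
print(kept)                                          # ['p2']

# Answer SELECT MAX(d) FROM t without touching the execution engine
print(max(p["zone"]["max"] for p in partitions))     # 2018-04-02
```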
Whatever work is done in the Java layer is hopefully on the order of hundreds of milliseconds. Now, that's not always true, because evaluating these things and dealing with zone maps can take time, and if you get really large, complex query plans with thousands or tens of thousands of operators, then transforming those plans can also take time.

How do you determine the number of zones? It's predetermined for us, because every partition that was ingested has its own zone map. We determine the size of the partition — it's a few megabytes, tens of megabytes — and for every partition we have its map. We also used to maintain bloom filters on top of this. We stopped doing that for some things because our bloom filters ended up being too large and we found they were not that effective; at some point we'll have to revisit that. We do use bloom filters for other things, including joins. And there's a better one out there now — we told you, right? Yeah, we talked about that.

The one thing I didn't mention is that we don't have indexes. No indexes at all. And this is not uncommon: if you have a billion rows, you can have an index; if you have a trillion rows, maintaining indexes is a lot more painful. At that point you're much better off saying, I will create another table which is effectively an index table.

We built a UI — and I was shocked when we started doing this, I was like, why would we do this? — and I think it's one of the best decisions we ever made. It's full-fledged, almost an IDE in some sense, and for a lot of customers, that is pretty much the database. It allows them to easily do a lot of operations that were much harder to do through the command line: access control, creation of new users, et cetera. There is SQL for all of this, but doing it in the UI — managing how many warehouses there are, how do I see this, what is my billing, how well are my warehouses utilized — it's all possible through the UI.

The other thing we did: while we wanted to make sure our first cut was really good on performance, we also wanted to make sure we could analyze our performance really well. So things like sampling what the operators were doing — which operators were running, how much time each operator took — were there on day one. And not only that, we would send all these metrics back to the brain from the stateless workers, and we would show these metrics for every query. So the kinds of things Oracle added in 2008 were there on day one. Whenever a customer has a problem with a query, people from support are able to look at this and say, by the way, it seems like this is where your join is exploding, this is how many rows went up here, this operator took so long. It's really easy, and it's really important to be able to do this at the scale we're at. We're reaping the benefits of that today.

There was a focus on ease of use: no knobs to tune, no physical design, no storage to groom, no storage to manage. Are there views? Yeah, we had views before we went generally available. And I'll talk a little more about future features that are coming up. In terms of SQL support, we are probably up there, or way up there, compared to most other databases.
In fact, we even support dialects of multiple databases, just to make it easy for people to move over — Oracle, Netezza, some things like bind types, for example. That's the easy stuff. What did you start with as your SQL grammar — did you take the Postgres one and build off of that? We started with our own; we wrote our own, using ANTLR. When I said we started from scratch, we started literally everything from scratch. We call our grammar the Frankenstein grammar — that's what we call it internally; the marketing people don't call it that. It's Frankenstein because it's got a bit of everything in it.

Again, we talked a bit about this: we are continuously available, and our points of failure are not our own — they're external services. When you think about it, today on, say, US West, there are probably 1,000 brains running and probably 5,000 workers running, each one running multiple processes. And a failure of any single one should never bring the system down.

So, semi-structured data. I am running a little behind, so I'm going to try to go a little faster here. We realized we needed first-class support for this, and we did a fantastic job at it. This is one of our huge differentiators; I don't think there is any other database with JSON support as good as ours. And it works. We added three new types. We didn't call it JSON because we wanted to be able to ingest Avro or BSON or what have you. So we call it a VARIANT — you can think of it as JSON — plus an ARRAY and an OBJECT, which is really a map. And we support all the SQL operators on these: you can do joins on it, you can do group-bys on it, you can even flatten it, which means you can take the multiple array values or map attributes that were there and turn them into individual rows. There's a concept in databases called lateral joins, as I understand it, and that's effectively what we use. There were a number of other people who did this, but they required a schema definition up front. We require absolutely no schema definition; you can evolve it over time, and each partition can have its own schema. We will columnarize the most common paths in the data, so you get native performance on those paths. This was a key differentiator for us. For example, if you've got a property like 'courses' with '721' as the value, and 'cmu.courses' shows up in every single row, then we will turn that into a full-fledged column. Is this only for semi-structured data? Yes — because everything else is already structured and already columnarized. And not only that: for everything we columnarize, we also produce zone maps, and hence we get performance that's as good as a native column. I think Vertica does this too, right? Possibly.

So that's a little fancy picture: we turned things columnar. We usually don't push out a lot of benchmarks — that gets us into trouble, or could get us into trouble — but this is an example of how TPC-H performs with and without JSON, that is, if you turned it completely into JSON data. And you can see it's pretty similar. This is less interesting — we talked about time travel and cloning.

I'll talk about security very briefly. We encrypt all our data at rest; this was obviously a given for us, being in the cloud. A number of customers are very sensitive about moving into the cloud.
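Back on columnarizing the common paths in a VARIANT for a second, here is a minimal sketch, assuming flattened JSON paths as keys and a made-up 90% frequency threshold: frequent paths get pulled into real columns with zone maps, everything else stays in the residual variant blob. The threshold, names, and example rows are all illustrative.

```python
# Pull frequently-occurring paths out of semi-structured rows into columns.
from collections import Counter

def columnarize(rows, threshold=0.9):
    counts = Counter(path for row in rows for path in row)
    common = {p for p, c in counts.items() if c / len(rows) >= threshold}
    columns = {p: [row.get(p) for row in rows] for p in common}
    residue = [{k: v for k, v in row.items() if k not in common} for row in rows]
    zone_maps = {p: {"min": min(v for v in vals if v is not None),
                     "max": max(v for v in vals if v is not None)}
                 for p, vals in columns.items()}
    return columns, zone_maps, residue

rows = [
    {"cmu.courses": 721, "user": "a"},
    {"cmu.courses": 445, "user": "b", "extra": {"clicks": 3}},
    {"cmu.courses": 645, "user": "c"},
]
columns, zone_maps, residue = columnarize(rows)
print(sorted(columns))            # ['cmu.courses', 'user'] -- full-fledged columns
print(zone_maps["cmu.courses"])   # {'min': 445, 'max': 721}
print(residue[1])                 # {'extra': {'clicks': 3}} -- stays in the variant
```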
We had to prove to those customers that we would be way more secure than they could ever be on their own, and we did a fantastic job. Security, again, was not an afterthought for us; it was there on day one. We introduced role-based security very early on — in fact, we went GA with it. Fine-grained access is controlled by roles: privileges are granted to roles, and roles are granted to users, so there's a level of indirection, and privileges and roles are two separate things. We made sure we had things like two-factor authentication. Our key management is hierarchical: Snowflake can use a customer-managed key, which lives in the customer's own KMS, their key management system, and to decrypt the data the customer has to provide us that key, because we ourselves cannot read the data otherwise. That key wraps Snowflake's own keys, which are different for each database, different for each schema, different for each table — and for each set of files we generate new keys. So if you manage to crack one key, you still won't be able to read the other files. We also rotate these keys, and we decrypt and re-encrypt our files — what's called re-keying. And for S3 access and policies, we have this concept of stages for loading data, and of course the place where we store customer data, and we have to separate one customer's data from another's. So each customer has its own set of access policies, managed at a logical level.

So we implemented a lot of features after SIGMOD. We did this thing called data sharing, which again you can think of as a metadata feature: I can say that the CMU CS department, or this class, is a customer of ours and wants to share data with, say, another class that is also a Snowflake customer. All we really have to do is record in metadata and access control that this data is available to them, and the rest just works. We added an ingestion service that doesn't require you to provision compute up front: you just put the data in S3, we see it, and we ingest it. I talked about the Spark connector, support for the Azure cloud — I'm just scratching the surface. Jackie worked on what's called reclustering, which he's going to talk about tomorrow, which is approximate sorting of large amounts of data. And there are at least a dozen more features that I'm missing here. Jackie, you want to mention something?

Lessons learned. It was controversial to build a database that was not Hadoop-based, but this multi-cluster architecture really worked for us. The semi-structured extensions also turned out to be much more valuable than we thought. The service was much harder than the database itself. Everybody loves the no-tuning aspect — so you guys are in the right direction there, for sure. And the actual performance of the database — whether, oh my gosh, you're actually slower than this other database — didn't matter as much. While we never wanted to lose on performance, it turned out that a lot of these other things mattered more. In fact, today, if you come to us and say, this little trick can improve performance by 20%, we probably won't take it.
We'll probably be biased in favor of a new feature instead.

Ongoing challenges — I'll leave this at that; we can talk more about it later. There's still a lot more work to do, no doubt about that, and these are some of the things we're looking at and doing right now. We need to go global as a service. We will do replication of our service. We will have various kinds of advisors: for materialized views, for clustering of data, for determining how well the workload is running — some of the stuff you guys are also doing right now. Does the advisor mean you're essentially going self-driving? That's kind of to be determined. I wouldn't go as far as self-driving, because your definition of self-driving is probably a little different from what I would call self-driving. It's more self-tuning and self-healing. The approach is maybe a little different here, but I think the outcome is kind of the same. And it's not like we're against new approaches — we have to try them, we have to see them — so we're interested in hearing new approaches as well.

I think that was my last slide. I think you guys know who we are. We have a second office now in Seattle, where 80-plus engineers are working. It's still waffle day on Fridays. Wednesdays. All right, let's thank Ashish. Cheers. Thank you, guys.

I have one question, I'll ask it. You talked about all the things you had from the ground up when you first started — security, right? Is there anything that you regret doing, something that turned out to be a waste of your time? Anything that at the beginning you thought was a good idea?

Yeah — picking Java for the cloud services has been very painful in the long term. And I'll tell you why: when you have to compile queries, the compiler does a fair amount of work. It's fairly complex, it takes a lot of resources, and provisioning for those resources and separating those sets of resources is much harder, and it's a lot less deterministic how much memory you'll end up using for the compilation of a query — it's very hard to tell just by looking at the text of the query. And this part of the service ended up being multi-tenant. Java as a programming language makes it even harder to know where your memory has gone — is it this query or that query? Unless you've added that kind of accounting from day one, it's much harder to do now. And that's where we are, so we've had to turn around and do a whole lot of other things for that.

Okay. So I'll say real quick, Snowflake is hiring. Most of you are going away for the summer for internships; if you're interested in talking with Snowflake or interviewing there, you should do it while you're out there. Ashish is happy to take you out for a good time, and so is Jackie. I know a lot of you are looking for jobs. All right, guys, this is the end of the semester. Again, the code reviews are due two days from now; I'll send a reminder on Piazza. Good luck with the rest of your finals. And I'll see everyone again on May 14th in this room — not where they tell us to be, but in this room.
And we'll have database t-shirts and bagels and the regular breakfast. Okay.