Hey, everyone. Sorry for not being in Pittsburgh, as I said. I had to come out here to Seattle. I'm actually in front of the courthouse right now because Smooshu's boy, Three Fingers, and this guy, Jesus the Goose, got popped last year for passing bad checks, and they asked me to come in as a character witness. It's not looking good in there for them, but that's their problem, not mine. So I figured, let's come out here and do the class lecture. Let's talk about databases.

For today's class we're going to go through an overview of what these modern query engines, these modern OLAP database systems, look like at a high level. That will set us up for how we progress through the various topics over the rest of the semester.

This is the high-level outline of where we're going in this course. I didn't want to bring it up last class because I wanted to spend that time on the historical overview of the state of the database world, so we understand why we're at the point we are today and why we're focusing on relational database systems. But this is the high-level architecture, the roadmap, for how to build a modern OLAP database system. We're first going to talk about storage: essentially, how we're going to represent data on disk, compress it, and maybe build some indexes we can use to accelerate query execution. Then, building on those ideas about how we want to store the database, we'll spend most of our time discussing how we actually want to execute queries on that data. We'll talk about processing models, and how to schedule queries and the tasks within them. We'll spend a lot of time on how to vectorize execution and take advantage of SIMD; query compilation will be another major method we can use to speed things up. We'll spend a lot of time on different join algorithms. Something new this year that I haven't covered in previous years is materialized views — maintaining materialized views is honestly the part of database systems I know the least about, so we'll spend some time on that. Then we'll talk about query optimization: how you take a SQL query, convert it into a query plan, and optimize that plan; and how to design modern network interfaces. The rest of the semester after that will be an analysis and deep introspection of real systems — the Snowflake system you read about in today's paper, but also BigQuery, Spark, and a bunch of other modern systems.

The layer diagram I'm showing here is very similar to what we discussed in the intro class: you build the system from the bottom up as a set of abstraction layers, each exposing an API that the layer above it integrates with, until finally the client interface is exposed to the application. For the next three or four lectures we'll focus on the storage layer, and that will then define what API is exposed to us when we run queries. Okay. So for today, as I said, this is a high-level overview of what modern OLAP systems look like.
We're not going to go too deep into the algorithms you use for the queries, because that comes later. And this class isn't really about distributed databases, because I want to spend time on the modern techniques for processing data on a single node, and then we'll layer the distributed architecture on top of that. But it's good to understand a little of what a distributed modern OLAP system looks like, so we know the big picture of where we're going. As I've said in many other classes, we need to get the single-node system to be high-performance and highly optimized first, before we make it distributed. Making a database distributed isn't some magic wand that solves all your scalability problems: if you have a crappy single-node system, making it distributed just means you have a multi-node crappy system. So we need to nail that down first. And at the end I'll spend a little time on a bigger trend, the commoditization of the components of these OLAP systems. Most of the time I'll be talking about systems like Snowflake and Spark, which are monolithic systems where the vendor has implemented all the bits and pieces. But there's a trend now where people are building standalone pieces of the layer diagram I showed before, and in theory you could put those pieces together later to build a whole new OLAP system. We're not entirely there yet, but I'll talk about some early work in that space.

All right. So the first thing is to understand what a query looks like in a distributed database system. At a high level it's essentially the same as in a single-node system: a SQL query shows up, you run it through the parser, the binder, the optimizer, and so forth, and it generates a directed acyclic graph of physical operators — a scan, a join implementation, and so on — where each operator takes input from its children, does some computation on it, and passes the result along to its parent. The fact that these operators may be executing across multiple nodes rather than all running on a single box doesn't really change anything at a high level. There's the orchestration of the tasks and so forth, but that's outside the scope of what we're talking about today. At the end of the day there's a query plan, we break it up into tasks, and we distribute them across nodes; it's no different from taking a query plan, breaking it into tasks, and distributing them across cores on a multi-CPU box. All the operators we had in a single-node system — table scans, joins, aggregations, sorting — exist in a distributed system too.

So at a high level it's going to look like this. There's some notion of a worker node, and the worker node has a CPU, local memory, and local disk. Going from left to right in the diagram, as we execute a query, each of these nodes is responsible for taking some input, doing the computation for the operator it has been assigned as a given task, and producing output as its result.
That output then either goes back to the client or goes off to the next worker nodes. Say we start at the very beginning, at the bottom of a query plan, where we want to read what we'll call persistent data. The persistent data is the underlying tuples of the tables defined in the database. These are files containing data whose types we know how to interpret based on schema information in the catalog, so we know how to access it, scan it, and do something with it. I'm not showing in this diagram whether the persistent data is local to the worker node or sits in some shared-disk architecture or object store, which we'll cover in a second; for our purposes here, it doesn't matter. So we take the persistent data, do some kind of scan, and produce some amount of intermediate data. We then need to ship this intermediate data to the next stage of the query plan, which could be on the same node or a different node.

One way to do this is with what's called a shuffle operation: after running an operator, we pass the intermediate data along to another set of nodes that redistribute the results based on some hash key or some range partitioning, and then other worker nodes pull that data from these shuffle nodes. I'm marking the shuffle-node phase as optional here because I don't think Snowflake does this with explicit nodes, but BigQuery and Dremel do. The idea is that these shuffle nodes have a lot of memory, so they try to keep everything in cache without spilling to disk; it's a way to redistribute things. I'm also showing a one-to-one correspondence between the number of shuffle nodes and worker nodes, but that doesn't need to be the case — you could fan out or fan in at the different stages. Again, it's just a way to get the intermediate data from the previous stage to the next stage and, if necessary, redistribute it on some new partition key. The next worker node then does additional computation and produces more intermediate data — whether there's a shuffle phase in between or not doesn't matter — and sends it on to another worker node that does the final aggregation or coalescing of the data to produce the result for the application.

So again, there's nothing dramatically different from how we'd execute a query on a single-node system; we just may have to spend more time doing data movement, which nowadays is not as bad as it used to be. It used to be that going over the network was always the slowest, then reading from disk, then memory, but networks have gotten really fast. And in the case of BigQuery they have hardware acceleration for these shuffle operations, so in some cases it's actually faster to send things over the network than to read from disk. So this shuffle phase is not always a big deal.
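As a concrete illustration of the shuffle step described above, here's a minimal sketch (my own, not from the lecture) of hash-partitioning intermediate results so that tuples with the same key end up at the same downstream worker. The row layout, key name, and worker count are made up for illustration; range partitioning would work the same way, just comparing against partition boundaries instead of hashing.

```python
# Hash-partition intermediate tuples across downstream workers (illustrative only).
from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    """Assign each intermediate tuple to a downstream worker by hashing its key."""
    partitions = defaultdict(list)
    for row in rows:
        target = hash(row[key]) % num_partitions   # which downstream worker gets it
        partitions[target].append(row)
    return partitions

# Intermediate data produced by an upstream scan/filter task (made-up rows):
intermediate = [
    {"cust_id": 1, "amount": 10.0},
    {"cust_id": 2, "amount": 25.0},
    {"cust_id": 1, "amount": 5.0},
]

# Redistribute so that all tuples with the same cust_id land on the same
# downstream worker, which can then aggregate them locally.
for worker, rows in hash_partition(intermediate, "cust_id", num_partitions=2).items():
    print(f"worker {worker} receives {rows}")
```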
The main thing I want to draw out here is this notion of persistent data versus intermediate data, because what we actually do with these two kinds of data depends on whether the architecture is a shared-disk system or not.

This taxonomy between persistent data and intermediate data comes from the Snowflake paper you read. Persistent data is what they'd call the source of record for the database: if I insert a tuple, whether through bulk loading, a COPY, or an insert operation, I expect that tuple to reside in these persistent files. In a modern system we're going to assume that the files holding persistent data are immutable. That's a dramatic change from how people designed, or assumed you'd build, database systems in years past, where you assumed you were running on a local disk you could read and write as needed. In a modern cloud setting you're running on an object store, and even earlier predecessors like HDFS were append-only file systems where you couldn't do in-place updates. So a modern OLAP system assumes these files are immutable, but it still has to support updates, and it has to maintain metadata to keep track of whether a version of a tuple has been superseded by a newer version. We'll talk about that in the next class.

The other interesting part is what we call intermediate data. This is the output of the operators we execute in a query plan, which then needs to be sent along to the next operator in the plan. And this can be a lot of data. The Snowflake paper talks about how the amount of intermediate data a query generates has little to no correlation with the amount of persistent data it reads: a table could be really, really small and the query could still produce a lot of intermediate data because of the computation it's doing. It's also not correlated with execution time: a query taking a long time to run doesn't necessarily mean it generates a lot of intermediate data. Intermediate data is interesting because you need to move it along as quickly as possible, but you don't need to worry about durability or fault tolerance — as soon as the query, or the stage going from one operator to the next, is finished, you immediately throw it away. The paper talks about the trade-off in deciding what to spill to disk, persistent data versus intermediate data, and it's often the case that you want to keep intermediate data in memory as you move it along.

So now we need to talk about what the system architecture of a modern OLAP system looks like. There are a couple of ways to think about this, but at the end of the day it comes down to how the system moves persistent data — how it begins executing on the persistent data, like starting to scan it — and that also determines where the data is located. In the intro class we talk about shared-nothing versus shared-disk, and push-the-query-to-the-data versus pull-the-data-to-the-query, as if they were mutually exclusive, but that's actually not the case anymore; on modern platforms, systems don't cleanly divide along these lines — they do a little bit of everything. So we want to understand the trade-offs and where each approach applies.
So the first question is, again, if we want to start reading persistent data, how should we begin that process? And then, as we move from one stage to the next while executing operators, what do we transfer for the intermediate results? First, let's talk about how we begin the execution, the movement of data. The two choices, again, are pushing the query to the data or pulling the data to the query. These ideas aren't new; they go back to work done in the late 80s and early 90s on some of the first parallel and distributed database system architectures. But the trade-offs have changed in modern times, honestly because of how much better hardware has gotten and because of the proliferation of cloud services like object stores. In the past, shared-nothing was the dominant architecture, and the assumption was that the best approach was always to push the query, the computation, to where the data is located. In the cloud, that's not always the case anymore.

So the first approach is pushing the query to the data, meaning we move as much computation as we can to the location where the data lives. The thought process is that the cost of executing some portion of a query where the data is located will always be much cheaper than the cost of transferring the data to some other compute node. Think about it: if I have a disk on one node and computation on another, I'd rather send the query to the node that has the disk where the data is located than have to transfer the data out, because the network cost is so high. The alternative is to pull the data to the query: you leave the query at its computational resource, and whatever data it needs to access gets transferred over to that node. This can be necessary when you can't run any computation where the data is located — if I have a remote file system where I can only do gets, sets, and deletes, I have no choice but to pull the data to the query.

As I said, the hardware has gotten much better in recent years, so pushing the query to the data isn't always the obvious design choice anymore; pulling data to the query is viable because the network in cloud systems is pretty fat. The other interesting thing is that these approaches aren't mutually exclusive anymore, because there are cloud object stores where you can actually push some computation down to the data, even though you wouldn't think you'd be able to. On S3 they have this select operation where you can write something that looks like pseudo-SQL with a WHERE clause and send it to S3 to do some initial filtering directly in the object store, so you're only transferring back the data that satisfies your WHERE clause. The idea is that when I do a select, I can send a portion of the WHERE clause to S3; S3 natively understands CSV, JSON, and Parquet files, does that filtering, and then sends back only the rows that satisfy it.
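To make that concrete, here's a hedged sketch of what pushing a predicate down into S3 with S3 Select can look like from the client side. The bucket, object key, and column names are hypothetical, and error handling is omitted.

```python
# Push part of a WHERE clause down into the object store via S3 Select.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-warehouse-bucket",          # hypothetical bucket
    Key="lineitem/part-0001.csv",          # hypothetical object
    ExpressionType="SQL",
    # Only rows satisfying this predicate come back over the network.
    Expression="SELECT s.orderkey, s.price FROM S3Object s "
               "WHERE CAST(s.price AS FLOAT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; 'Records' events carry the filtered bytes.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```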
That's actually pretty powerful, because it's basically pushing the query to the data — not the whole query, but at least enough that you can get a big win from doing some early filtering. Microsoft Azure Blob Storage has something pretty similar, where you can again write something that looks like SQL to do some initial filtering. I don't know which file formats they support, but the basic idea is the same. Going forward we're mostly going to assume pulling the data to the query for the lower portions of the query plan — the access methods scanning the tables — and then, as we go through the intermediate phases, we're effectively moving data to wherever the query is running as well.

Now I want to talk about the two high-level approaches for the system architecture. Again, this isn't a perfectly clean dichotomy for systems today, but the first type of architecture, the more traditional way people think of distributed database systems, is a shared-nothing system. This is a phrase coined by Mike Stonebraker going back to the 1980s, and for decades — up until maybe ten years ago — it was presumed to be the better way to build distributed database systems. The idea is that each node has its own CPU, its own memory, and its own locally attached disk (think NVMe), and the only way nodes can communicate with each other and see other portions of the database is by sending messages over the network. If I'm running on the node on the left and I need to access data on the node on the right, I can't just peek into its disk; I have to send it a message saying, hey, send me that data. So the database has to be partitioned into disjoint subsets across these nodes, sometimes called shards, where each node has a unique portion of the database's tables. Ideally you pick a partition key that reduces the amount of data transfer you have to do when you do joins — we'll talk a little about this next class — which is historically a very hard problem, but it makes a big difference. The challenge is that since storage is tied to compute, I can't easily add more capacity on either the compute side or the storage side without bringing in a whole new node, and if I bring in a whole new node, I have to repartition and move data around to balance things out. We'll discuss the implications of this. And when I say a node, I mean one of these boxes: CPU, memory, disk. The other thing is that because data is considered local, on a locally attached disk, traditionally the database system accesses the disk through a POSIX API — it's a syscall to go read data directly off the disk. Yes, you can do kernel bypass with SPDK, the Storage Performance Development Kit from Intel, which is a nightmare and we'll talk about it later, but in general we're using POSIX to do reads and writes to the local disk, and if I need to send things over the network to communicate with another node, I make the appropriate calls to do that as well.

The alternative approach for distribution is called shared disk. Here, every database system node still has its own CPU, memory, and disk, but the disk on each node is essentially just a cache; the persistent data, the persistent files of the database, live down in a separate storage layer.
At the top here we have the compute layer; this is where we execute queries. The nodes can communicate with each other over the network, but their local storage is essentially used for ephemeral data, either as a write-through cache (in the case of Snowflake) or as a holding or staging area for intermediate data. The compute node is essentially stateless: if a node in the compute layer crashes, we don't lose the database, because it's all stored down in the storage layer. The storage layer essentially looks like one giant disk. That's not entirely true for an object store, because you don't access it through a POSIX API; you use a user-space API that the cloud vendor, or whoever runs the object store, provides. And the reason this matters — why not just a POSIX API — is that with POSIX the database system isn't aware of the different physical locations of the data; it just sees some file system with file paths it can read and write, and it has no way to push down the selects, the filtering we talked about before. There's no way with POSIX to say, read this, but also filter some stuff for me down in the storage layer.

To recap, the two choices are shared nothing and shared disk. The pros and cons of shared nothing: it can potentially achieve better performance, because if each database node only accesses data local to it, it performs better since there's less data movement between nodes. As I said, modern NVMe drives on PCIe 5 or 6 are pretty fat pipes — but modern networks are pretty good these days too, so while it traditionally was always faster to read from local disk than over the network, that's not always the case anymore. The big challenge is that it's hard to scale capacity, actually in both directions, up and down: I can't add new nodes without shuffling data around to rebalance and repartition, and I can't take nodes away without doing the same kind of coalescing.

With a shared-disk system in a modern cloud setting, the great advantage is that we can scale the compute layer and the storage layer independently. If I find myself CPU-bound on my nodes, I can add new compute nodes, and because they're stateless I don't have to move a lot of data around to rebalance. The Snowflake paper talks about how they use consistent hashing for this: adding a node changes the assignment of which files from the persistent storage layer get processed by which compute nodes, and the new node warms up its cache as it's told here's your data and starts processing — it's not like you always have to copy things from one node to another, which, if you want to do it safely, requires transactions and is a lot more complicated. The other nice advantage — and this matters a lot for database systems that want to claim they're serverless — is that I can shut down the parts of the compute layer I don't need, the nodes that aren't executing any queries, and the data doesn't go away. In a shared-nothing system, where compute is tied to storage, if I shut down a compute node, the portion of the database stored on that node is no longer accessible; whereas if I shut down a compute node in a shared-disk system, the remaining nodes can go get that data from the shared disk, no problem.
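To make the consistent-hashing point concrete, here's a minimal sketch — my own illustration, not Snowflake's actual code — of assigning persistent files to compute nodes on a hash ring. Adding a node only changes the owner of a fraction of the files, so there's no wholesale repartitioning. Node and file names are made up, and real implementations typically add virtual nodes for better balance.

```python
# Consistent hashing of persistent files onto stateless compute nodes (sketch).
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def owner(self, filename: str) -> str:
        """Return the compute node responsible for caching/scanning this file."""
        keys = [k for k, _ in self.ring]
        idx = bisect.bisect(keys, h(filename)) % len(self.ring)
        return self.ring[idx][1]

files = [f"part-{i:04d}.parquet" for i in range(8)]

before = ConsistentHashRing(["node-1", "node-2", "node-3"])
after  = ConsistentHashRing(["node-1", "node-2", "node-3", "node-4"])  # scale out

moved = [f for f in files if before.owner(f) != after.owner(f)]
print(f"{len(moved)} of {len(files)} files reassigned after adding a node")
```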
So in a serverless environment this can make a big difference, because you can shut down idle nodes. The last point is the disadvantage we talked about: any persistent data that isn't cached locally at the compute node has to be pulled over from the shared disk, the storage layer. Maybe I can do some of that early filtering we talked about with Amazon or Azure, but that's not always possible, because it depends on what the filter actually is — you can't run a complex regex or execute a UDF inside S3; they're not going to let you do that.

Our focus going forward, for the rest of the semester, is shared-disk systems. For some of the algorithms we'll talk about it doesn't actually matter: the join algorithms don't know or care whether the data is coming from a shared-disk or a shared-nothing system, because by the time you're actually computing the join, you've already scanned the table and fed it in through a scan operator, so it doesn't matter where it came from. But we'll keep in the back of our minds what it means for the data we're scanning to be coming from S3 or another object store.

As I said before, in a shared-disk architecture it was traditionally assumed that the storage layer would be some dedicated on-prem appliance, storage device, or NAS exposing something like a POSIX API you could read and write to. The more high-end systems, like Oracle Exadata, do use a storage appliance: the racks look like this, with a compute layer and a storage layer, but the database system is aware that it's talking to a special storage appliance — it goes over Fibre Channel and it's not making POSIX calls to read data, it's sending proprietary commands — and they can actually do some predicate pushdown in these systems. But the environment we're going to focus on, as I said, is the object stores, because this is how everyone is building these modern OLAP systems in the cloud. The nice thing is that they're "infinitely" scalable — put that in quotes, obviously — but as long as your credit card has enough money on it, you can keep putting as much data as you want into Amazon S3 and they will gladly store it for you with pretty high durability guarantees; same for Azure and GCP's Cloud Storage. So you no longer have to worry about provisioning storage appliances or storage space like you would in an on-prem system, and you don't have to scale the compute layer along with it; you just keep putting more data into your storage layer.

So what do these object stores actually look like? They expose a pretty simple API. We talked about S3 Select, but in general there are three commands: PUT, GET, and DELETE. What happens is you take the data you're ingesting, break it up into these large immutable blocks — files — and do a PUT to store each one in the object store, and then you have some catalog that keeps track of which file ID is found at which location in the object store. We'll talk more about what these files look like next class, but in general they're some binary format, and for a column store, a columnar format, they use a PAX layout: within each block, the attributes of the table are broken up in a columnar fashion, but all the attributes for a single logical tuple are found in that same block. That's different from how, say, Vertica stores data, where you'd have a separate file for each column of a table; with PAX you basically have all of a tuple's attributes within a single file. The footer — like in Parquet — contains metadata about the offsets in the file where the columns start and what compression schemes are used for the different columns (which can change per file), and sometimes there are pre-computed indexes for filtering, or zone maps; these are things we'll cover later in the semester. You store all that information within the file itself in the object store. Then you make a request to the object store to retrieve just the footer of the file block, which gives you all that metadata about where the columns are located and what compression schemes are being used, and then you can do selective GETs for just the byte ranges of the data you actually need. You don't need to bring in the full block from the object store and parse it; you bring a little bit in, figure out where you need to go to get the rest, and retrieve the minimum amount. That's different from how we talked about disk I/O in the intro class, where every page is 4 or 8 kilobytes and you fetch the whole thing even if you only need a portion of it. You can be a bit more intelligent here because the blocks are larger — in Snowflake I think the smallest is something like 50 megabytes, maybe up to 100 megabytes or so. These aren't small blocks; they're pretty big, so you don't want to fetch the whole thing if you don't need it.
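Here's a hedged sketch of that access pattern against S3, using ranged GETs to grab only a Parquet file's footer instead of the whole block. The bucket and key are hypothetical, and parsing the Thrift-encoded footer itself is left to a real Parquet library.

```python
# Read just the footer of a Parquet object with ranged GETs (sketch).
import struct
import boto3

s3 = boto3.client("s3")
bucket, key = "my-warehouse-bucket", "orders/part-0001.parquet"  # hypothetical

# A Parquet file ends with: [footer bytes][4-byte footer length][magic "PAR1"].
tail = s3.get_object(Bucket=bucket, Key=key, Range="bytes=-8")["Body"].read()
footer_len = struct.unpack("<I", tail[:4])[0]
assert tail[4:] == b"PAR1"

# Second ranged GET: only the footer, which says where each column chunk
# lives, its compression codec, min/max statistics, and so on.
footer = s3.get_object(
    Bucket=bucket, Key=key, Range=f"bytes=-{footer_len + 8}"
)["Body"].read()[:footer_len]
print(f"footer is {len(footer)} bytes")

# With those offsets in hand, the engine can issue further ranged GETs for
# only the columns (byte ranges) the query actually needs.
```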
One interesting system I like to bring up is Yellowbrick. This is a slide from a talk they gave for us during the pandemic a few years ago, and it shows how they use the object store, S3. The architecture looks a lot like everybody else's shared-disk design: you have these worker nodes with caches on NVMe SSDs. What's interesting about Yellowbrick is that they talked about how they tried using Amazon's own S3 access library to communicate with S3, found it was super slow, and so they wrote their own library and got roughly 3x better performance retrieving the same data from S3 than with Amazon's. I'm bringing this up just to say that you don't have to use whatever client-side libraries the cloud vendor provides; you can write your own thing and potentially get much better performance. I forget exactly what their secret was — they might be doing kernel bypass on the path to the object store; you can do that on the client side, but not on the server side, because Amazon controls the server. Another cool thing they were doing is communicating over UDP with Intel DPDK, the Data Plane Development Kit, which is a way to do kernel bypass when talking to the NIC. It's not easy to use — all this Intel kernel-bypass stuff is super tricky and super painful — and again, we'll cover that during the semester.

All right, there's a bunch of other stuff we're not going to talk about today but will start covering throughout the semester. There are the different file formats — I've already mentioned the PAX layout, but we'll talk about the different approaches.
All of this data is going to be very highly compressed, because it's columnar storage. We'll talk next class about how you decide to do horizontal partitioning on your tables and split them up based on values, so you can minimize the data transfer between nodes for intermediate results. Then there's the whole question of how you ingest new data — a big part of databases is putting data into them — and there are fast paths, rather than just running insert statements, to do bulk loading and get data into the system as quickly as possible. Obviously you also want to be able to update and delete data, so how do you handle that even though the files are immutable? And then — we're not going to talk too much about this, but it will come up when we talk about data lakes, especially with Photon and Databricks — in a modern cloud system you also want to support people who have existing files just sitting in S3, say in Parquet format, without having to first bulk load them into your database system; you want to be able to operate directly on the files as they exist. So how do you discover that those files exist? Then we'll talk about scheduling and adaptivity, which means: as the query is running, how do we maybe make changes on the fly to better utilize resources — if we've under-provisioned or over-provisioned the tasks or the workers, we can scale and adjust as we go along. Again, there are a lot of important things we're not covering here that we'll get to later in the semester.

Okay, so I talked about this a little at the beginning, and I'll make an observation here. The Snowflake paper you read, even though it's about Snowflake and the specific things Snowflake does, makes some important observations about what a modern OLAP system looks like — that's why I had you read it. But the thing to point out is that systems like Snowflake, Dremel, BigQuery, Yellowbrick, and Databricks are monolithic systems, meaning they build all the components you need for the database entirely in-house with their own engineers. And most of the non-academic systems we'll talk about this semester have pretty much the same architecture: you have a query engine, you have some storage layer, you have an orchestrator to move data around and execute tasks, a scheduler of some sort. They're all pretty much implementing the same thing. That basically means we have companies with a lot of money writing the same database system software over and over again — writing the same thing to do the same thing — and obviously that's a huge labor effort, and it may not be in the best interest of the database community, broadly speaking. If everybody is writing the same SQL parser, is that actually moving us forward? Maybe that time would be better spent doing other, more innovative things. So there's this interesting development in the last five years or so, where you're starting to see various projects in the open-source community, or from organizations, to break out portions of a modern OLAP system into standalone, open-source components that other people can build on and reuse.
People can take advantage of them for their own new OLAP systems. Again, the idea is that instead of everybody implementing the same redundant parts of the system, everyone works together on one thing, makes it really, really good, and everyone gets to reap the rewards. It's sort of like Linux: it would be kind of stupid now — insane, outside of research and academia — to try to build a new operating system from scratch to compete with Linux. The mindset is: Linux is what we're using, so let's make it really, really good. It's not exactly the same, because an OS is a major undertaking and the components I'm talking about here are smaller, but the idea is the same: instead of having two groups build competing parts that do the same thing, let's work together.

You see this in roughly four categories, and you can see how they tie back to the different layers I talked about at the beginning: people are building system catalogs, query optimizers (which is the hardest part of a database system), file formats and access libraries, and execution engines. I want to quickly go through these different components and the major projects, just so we're aware of them as we go through the semester. A lot of the systems we'll discuss don't use these things, but they're potentially open-source alternatives. And this idea of breaking the system up into reusable, modular components is not new: there's a paper by a famous researcher at MSR, Surajit Chaudhuri, from back in, I think, the year 2000, where they talk about what they call RISC-style database systems. The argument there was framed around making these sub-components modular and reusable to make it easier to automate tuning them — to make something like an autonomous database system — but the idea of breaking the system into separate layers implemented behind standard APIs is basically the approach they were describing long ago.

All right, so the first category is the system catalogs. We'll talk about this more next class, but the system catalog is basically the registry where the database system keeps track of what the schema looks like — what tables it has, what columns they have — and, if there are persistent files, where they're located in the object store or on disk. The way the catalog basically works is that if the system is ingesting the data for you, through an insert query or a COPY command, then it maintains the catalog itself, because it's creating the files as it goes along. If there's a discovery process, as I said, where there's a bunch of existing files in S3, then the catalog needs an API, a way to be told, hey, these files exist, so it can become aware of them. There are now several projects where people are building these cloud catalogs that, again, just keep track of what data exists and what tables and columns are in these files; sometimes there are some basic statistics about what's in them, but as I said before, oftentimes the statistics are stored in the files themselves. Probably the most famous and widely used one is HCatalog, which came out of the Hive project; it's a wrapper around the Hive Metastore. I know you can use it with Spark, and I think some other systems take advantage of it as well.
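Here's a toy sketch of what such a catalog boils down to: a mapping from table name to schema plus the persistent files that hold the table's data, including a registration call for the discovery path described above. All the names here are hypothetical.

```python
# A toy catalog: schema plus the object-store files backing each table (sketch).
from dataclasses import dataclass, field

@dataclass
class TableEntry:
    columns: dict                               # column name -> type
    files: list = field(default_factory=list)   # object-store URIs for this table

class Catalog:
    def __init__(self):
        self.tables = {}

    def create_table(self, name, columns):
        self.tables[name] = TableEntry(columns=columns)

    def register_file(self, name, uri):
        """Discovery path: tell the catalog about an existing file in the object store."""
        self.tables[name].files.append(uri)

catalog = Catalog()
catalog.create_table("orders", {"o_id": "int64", "price": "double"})
catalog.register_file("orders", "s3://my-warehouse-bucket/orders/part-0001.parquet")
print(catalog.tables["orders"].files)
```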
Again, think of it just like a key-value store that keeps track of the database schema. Google has its data catalog, Amazon has its data catalog; they're all doing basically the same thing, and HCatalog is really the only open-source one.

The next category is the query optimizers. We'll spend several classes talking about query optimization, but think of these as generic frameworks where you can define the heuristics or the cost-based search rules to do query optimization. Some of them will actually parse the SQL query for you, then convert it into a logical plan through transformation rules and run a cost-based search to generate a physical plan. This is the hardest part of building a database system, because there are all these corner cases in SQL and all these optimization rules you can apply — things like correlated subqueries — and it gets very messy very quickly. The idea is that rather than everybody building their own first crappy heuristic optimizer, which everyone does, in theory you could have a cost-based search that's pretty powerful out of the box. The two notable open-source implementations are Apache Calcite and Greenplum's Orca. Calcite came out of LucidDB — LucidDB was, I think, a commercial system in the late 2000s that failed, but for whatever reason Calcite was pulled out of it and reused as a standalone query optimizer, which is pretty cool. The Greenplum folks built this thing called Orca; it was designed to be the query optimizer both for Greenplum, which is the OLAP system we talked about last class that VMware owns, and for this other thing called HAWQ, which is like their version of SQL on top of Hadoop. So they tried to build a single optimizer framework that could be used in both systems. HAWQ is still around, though I don't think many people actually use it, and Greenplum uses Orca as well. The main things to know: Orca is less well known than Calcite and is written in C++; Calcite is written in Java and more systems use it. I don't know which one is more powerful; they both look roughly equivalent.

Then there are the open-source file formats, and this is actually a pretty novel idea that's definitely been made possible by modern cloud systems. Historically, database systems always used their own proprietary data format — SQLite, Postgres, MySQL, whatever you name — and when they write files to disk, it's in their own proprietary binary format. So you can't take Postgres data files and read them in MySQL or Oracle. (DuckDB can read SQLite files, but it was specifically designed to do that.) The only way to really take data from one database system and reuse it or do computation on it in another database system was to dump it out as a CSV or JSON file. So starting in the last decade there was this movement to say, let's actually build open binary file formats so we don't have to dump to text-based formats like JSON, and then we can build libraries that let these different database systems access the data very efficiently. These libraries usually give you an iterator to retrieve batches of columns from the files; you can do selective scans and pick exactly which columns you want, and some of them handle all the decompression for you. Again, this is the lowest level of the query plan: the access methods.
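As a hedged sketch of what these access libraries give you, here's what a selective, batched read might look like with pyarrow's Parquet reader; the file path and column names are hypothetical.

```python
# Selective, batched column reads from a Parquet file (sketch).
import pyarrow.parquet as pq

pf = pq.ParquetFile("part-0001.parquet")          # could also be an object-store path via fsspec
print(pf.metadata.num_row_groups, "row groups")   # footer metadata; no column data read yet

# Selective scan: decompress and decode only the two columns we care about.
for batch in pf.iter_batches(batch_size=4096, columns=["o_id", "price"]):
    _ = batch.num_rows  # stand-in for feeding the batch into an operator
```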
These are probably the six most widely used or best-known formats. Parquet and ORC came out around 2013, along with CarbonData. Parquet is probably the most widely used; it's a compressed columnar format from Cloudera and Twitter, and it has more aggressive compression than ORC. Whether or not it's faster depends on what the data looks like and how you're actually accessing it, so I can't say one is better than the other — it depends on a lot of things, but at a high level they're basically equivalent. ORC came out of the Hive project, which originated at Facebook. CarbonData came out of Huawei, and as I said in the intro class, I'm working with a former student of mine and we actually tried using CarbonData — the open-source libraries don't compile or don't work, and I don't know of anybody outside of Huawei that uses it. Iceberg is a newer one from Netflix, and we'll see a little of it next class; it's a way to support updates and schema evolution over Parquet files — they have something like a delta store, so you can make updates and then apply them on top of the Parquet files when you read the data. HDF5 is a much older format from the scientific community, from the late 90s, for multi-dimensional data; I don't know of any commercial database system that uses HDF5 — it's primarily used, as I said, for scientific computing and scientific instruments. And then Apache Arrow is an in-memory columnar format designed to be the exchange format for sharing data between processes and over the network very efficiently; it looks something like Parquet, but it's for in-memory data. For this class we're mostly going to focus on things that look like Parquet, ORC, and Arrow, and for Project #1, which we'll discuss next class, you'll be building something that parses a file format that looks a lot like Parquet.

The last category, which is super interesting, is the standalone libraries for query execution, and this is kind of the whole point of this class. As I showed at the very beginning, we spend a lot of time talking about how to build the execution engine: how to vectorize execution, how to do query compilation, how to do joins. It's not the hardest part of the system — that's always the query optimization — but it's certainly the part that can have a huge impact on performance. Obviously, if you generate a crappy query plan you're screwed anyway, but doing this part right can make a big difference. So there's now work being done to build standalone query engines that are vectorized and use all the modern techniques we'll talk about. Again, it's just a DAG of physical operators; they require something else to do the scheduling, the access to the data, the orchestration, the moving of data around — but at the end of the day the kernels are essentially the query operators. The most famous one of these is Velox, which we'll spend time on at the end of the semester. This is actually fascinating, because vectorized query execution is sort of what made Snowflake so good and so unique — at least it was the starting point of what made it unique ten years ago, when they were building one of the first vectorized execution engines. Now it's commonplace; pretty much everybody has vectorized execution, but ten years ago it was a big deal. And now Facebook is building Velox, there's DataFusion from the Arrow and InfluxDB folks, and Intel has one as well.
This is super interesting because Facebook is spending a lot of time building up this execution engine that anybody could come along, build a wrapper around, and use to make the next Snowflake without having to worry about writing an execution engine. Essentially this becomes a commodity: a vectorized execution engine using the modern techniques we're talking about today no longer differentiates one system from the next. The differentiation is going to be in things like the communication between the different nodes — like the Yellowbrick stuff we were talking about before — and in the query optimizers. So this is super fascinating, and we'll spend a lot of time talking about it.

All right. This was supposed to be a high-level overview of what these modern OLAP systems look like. Again, as I said, the way you build the query engine on a single node isn't really different from how you build a single-node database system; what changes is the data movement stuff we talked about, and the assumption that we're retrieving data from a cloud-based object store, a shared disk. For the rest of the semester we'll focus on how to build these different components we talked about using modern techniques. Next class we'll talk about storage models: how to represent the data within the columns themselves, down to the bytes, how to do table partitioning, and then system catalogs as well. Hopefully I'll get done dealing with these guys — I forget what's going on with them — hopefully they'll be done soon, and then I'll film the next class and post that on YouTube later on as well. And next class we'll also talk about Project #1. All right, guys, have a good night. See ya.