All right, yeah, well, thanks, Lynn. And yeah, like you said, I'm Justin. I've been working on databases for roughly 10 years now, and I decided to give a talk that ties together the last two lectures you had in this class, about two specific systems and about how OLAP and OLTP are done in the real world, in the cloud. So the outline of this talk: briefly I'll motivate the movement to the cloud — hopefully it doesn't need too much motivation — and then I'll talk about two systems in some detail, both of which I've worked on, either currently or in the past. The first is BigQuery, which is Google Cloud's native cloud data warehouse, and the second one is Amazon Aurora, on the OLTP side. And I should say I'm currently employed at Google, but everything I'm talking about today is in the public domain and has been published; I'm here to essentially add more color to it. And please, during this talk, feel free to ask questions. I'm here to talk about how things are done in practice. You have me for an hour, so please ask questions; I can manage time and so forth.

So without further ado, let's talk about the cloud. Once upon a time, in the old days, companies like Oracle and Teradata — and Pied Piper — would sell you a box, essentially a database appliance. This hardware sat in customers' on-premises data centers, and everything worked fine. But clearly there's a shift now to the cloud, for multiple reasons. Customers are shipping their data and their workloads to systems like BigQuery, Redshift on AWS, Synapse on Azure, and of course Snowflake and Databricks on all platforms. So why is this? Well, multiple reasons. The first is economies of scale: the hyperscale cloud providers can clearly provide hardware at very good cost. The second is innovation. A lot of these cloud companies are now building their own custom hardware and ASICs — examples are Graviton from AWS, TPUs from Google, and so on — so there's a lot of innovation going on in co-designing hardware and software for cloud workloads. The third is operations. It's really hard to operate these large-scale database deployments, for either data warehousing or transactional workloads, at scale, so consolidating operations around a single service is a huge benefit for businesses. The fourth is security, governance, and compliance. This is a really hard thing to get right, and the hyperscalers have very good folks who have worked on security, governance, and compliance for a long time, so being able to consolidate that in a single service is clearly a benefit. The next one is rich cloud ecosystems. Customers move database workloads to the cloud, but really they're moving all their workloads to the cloud, not just databases, so what they're buying into is a rich ecosystem of services. If you look across Azure, AWS, and GCP, they have unique and very large AI services, security, and databases for OLTP and OLAP that all work together. So this is another huge draw for large businesses as well.

So why should any of us care, especially us in this room? Well, when you're moving and re-architecting databases to work in the cloud, there's a different set of assumptions, and that leads to different architectures.
And I argue that when we're done with this in the next, say, five to ten years, a lot of the textbook basics about how we architect database systems will have been rewritten. Frankly, this is an exciting era for the field. So that's why we should care, and hopefully by the end of this talk I'll have shed some more light on why this is interesting. Again, ask questions as I go along.

OK, so the first system: BigQuery, OLAP in the cloud. In a nutshell, what is BigQuery? In a sentence, it's a serverless, highly scalable, and cost-effective cloud data warehouse provided by GCP. It's fully managed, serverless, and clusterless — and I'll shed some light on what that means. It's available 24/7 with four nines of uptime. We really strive for predictable performance, and we have key features like ML integrated within the BigQuery umbrella itself, easy-to-use SQL with hints, petabyte-scale storage, and queries that can scale massively. Data is encrypted and durable, which is important for enterprise customers, and we also provide real-time analytics on streaming data — that's one of the key features of BigQuery and really a differentiator.

So the BigQuery architecture looks as follows, and I'll spend a little time on this slide and then go into some of the details. BigQuery truly separates compute and storage in a unique way. On one side, from the right-hand side here, it looks like a data warehouse: it's SQL-compliant, you can access data and issue queries through REST APIs, and it has a web UI, a CLI, and client libraries in seven different languages, so it's very easy to work with. And storage is uniquely separated from compute. We have a highly available cluster for compute — the components doing the query processing — and this is separated from storage by a petabit-bisection-bandwidth network, with data at rest sitting on replicated, durable storage. There's also a unique distributed in-memory shuffle tier in between these components, for doing shuffles for massive queries and so forth. And we also support streaming ingest and reads directly against storage, which is quite unique to BigQuery, as well as bulk load and export. So that's the basic overview of how BigQuery is architected.

Some of the architectural design principles that BigQuery follows: a clean separation of storage and compute — again, petabyte-scale storage and highly available compute — and it's completely serverless and multi-tenant. We also aim for co-location and caching where possible, for high performance and low cost, and we have a nicely integrated hardware and software stack, taking advantage of key hardware primitives where necessary for high performance and low cost.

So this is roughly what a query looks like in BigQuery. If you were to issue a query, we'd set up a distributed query graph where data at rest in a distributed file system is first scanned by a set of workers. If it requires reshuffling, there's a shuffle tier; another set of workers in the next stage will process it and, again, send it to shuffle, where a final worker — say in stage three here — finalizes the query and writes the results to the distributed file system. And to coordinate all this, there's a separate coordinator component and a scheduler that provisions the resources to answer the query. So any questions here? Are we good? A shuffle? Yes, I have a couple of slides on shuffle.
Oh, the question was, can we provide more details on shuffle? Yes — in the subsequent slides. Yep. Okay, a question from this slide: the query does this, right? But where does it go first? I don't see an arrow for where the query comes in. Right, okay. So the life of a query starts at the coordinator, where essentially it'll parse the query and plan it out. At that point, a set of resources is provisioned — these workers are provisioned to answer the query — and in typical massive MPP style, the query is answered in a distributed manner. I'll get into more details of how exactly this works and what happens behind the scenes, but yes: the query comes in, it's planned, the graph is set up with the actual work that's to be done, and data flows essentially from left to right in this graph. Okay.

Before I go on, though, a key piece of BigQuery is that it's built the Google way. What I mean by that is that what BigQuery really is, is a set of components built around Google's core infrastructure. I'm happy to provide references after the class, but the distributed storage tier is Colossus, which is Google's globally scalable storage system. Essentially all jobs, all resources, are provisioned by Borg — there's a very nice paper, I think back in 2015, on what Borg does and how it works, and I think John Wilkes has visited here and given talks on it as well. It's also based on Dremel. Dremel was a paper published back in 2010, back when Dremel was internal Google infrastructure for querying logs data and so forth. It also relies on pieces of Spanner to store state. And the petabit-bisection-bandwidth network is actually Jupiter — there's a nice SIGCOMM paper written on this as well. So the key here is that it's not only the set of components that make up a cloud data warehouse; it's heavily based on — sorry — Google infrastructure as well. And that allows us to make different architectural assumptions about how the system is built and what we can assume about the underlying primitives we use.

But what I want to tell you a bit more about is Borg, because this is really what enables BigQuery to have a serverless architecture at scale — truly serverless. The basics of Borg are that any compute at Google, any resource allocation, is more or less done through Borg. Essentially you tell Borg what resources you want — how much CPU, how much memory, and so forth, plus some config — you package up a container, you hand it to Borg, and it'll place it on a machine and get it set up for you. So Borg is able to shape compute and resources in unique ways: you're able to use exactly what you want, Borg will figure out where to fit it and bin-pack it onto physical machines, and you go from there. I highly recommend reading this paper if you haven't.

So if you look at a comparison of how you would build a cloud data warehouse across the large cloud providers, usually what you do is use a VM cluster as the primitive. In the case of Azure's data warehouse,
they have SQL Server running alongside local storage in a VM cluster, they also have clusters for Spark, and data can reside on Azure Storage or Azure Data Lake Storage and so forth. But the primitive there is the VM cluster boundary. In BigQuery we don't need to worry about that, due to Borg. Essentially we have a set of jobs for metadata, for our storage APIs, for stream ingest and storage-management tasks, with data residing on Colossus, and Dremel — the Dremel workers — shipping as their own set of compute jobs. So this is essentially a picture that says BigQuery is truly able to be a serverless architecture, relying on Borg for compute.

So what are the serverless design principles and advantages? Well, again, the disaggregation of compute, storage, and memory is key. This allows us to scale each set of resources on demand — compute and storage separately. It allows on-demand sharing of resources, and this adapts well to truly multi-tenant usage, which lets us provide lower costs at the end of the day. It also allows for fault tolerance and restartability of BigQuery jobs. At scale — and especially at Google-sized deployments of data warehouses — you assume everything is eventually unreliable and slow, so BigQuery's sub-tasks are designed to be deterministic and repeatable, and multiple copies of the same task can even be dispatched to avoid the straggler problem.

Compute is provided in what we call slots, which are virtual scheduling units. A slot is really an abstract unit of compute and memory that the customer sees, and this maps well onto Borg and the flexible resource shapes you're able to ask it for. This really allows us to decouple the customer-visible resources from the actual machines, VMs, et cetera, that are being used under the hood to do query processing. We also built centralized scheduling. This gives our scheduler a global view of the entire service state, so it can make good decisions about how to schedule jobs and compute slots, and really make the best decisions for utilization, isolation, and running the service.

It also allows for flexible query execution. So again, going back to the original question about the life of a query: it happens in this diagram from top to bottom. At the top there's a coordinator; it's the first to receive a query. It builds an execution plan, which is really a directed graph of execution trees, and it orchestrates execution with workers provided by the scheduler. The scheduler provides the compute for the plan, and the query is deployed at that point. Workers are allocated as a pool without any predefined structure — these are essentially Borg jobs that we pool — and come query time, the scheduler gives a query a set of workers; after planning and coordination, the coordinator sends each worker its piece of the query plan, its sub-plan, to go off and execute.
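To make slots and centralized scheduling a bit more concrete, here's a minimal, purely illustrative sketch of fair slot allocation across concurrent queries. This is not BigQuery code — the names and the max-min policy here are assumptions for illustration only:

```python
# Toy max-min fair allocation of "slots" (abstract compute units) across queries.
# Illustration only -- not BigQuery's actual scheduler.

def allocate_slots(total_slots: int, demands: dict) -> dict:
    """Give each query up to its demand, sharing leftover capacity fairly."""
    allocation = {q: 0 for q in demands}
    remaining = total_slots
    active = {q for q, d in demands.items() if d > 0}
    while remaining > 0 and active:
        share = max(1, remaining // len(active))
        for q in sorted(active):
            give = min(share, demands[q] - allocation[q], remaining)
            allocation[q] += give
            remaining -= give
            if allocation[q] >= demands[q]:
                active.discard(q)
            if remaining == 0:
                break
    return allocation

# Example: 2000 slots shared by three queries with different demands.
print(allocate_slots(2000, {"q1": 1500, "q2": 400, "q3": 900}))
# -> {'q1': 800, 'q2': 400, 'q3': 800}: small demands are met, large ones split the rest fairly.
```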
So in this case, before I go on, are there any questions here? Is this pretty clear? Good question — the execution plan is... oh, does the existence of Borg make it hard to plan the query? No, it's typical query planning and optimization; once you're ready to execute the query, the scheduler provides the workers and you parallelize that way. The question is around co-location and SLAs? Yeah — again, this is where the network comes into play. Oh, does allocating too many workers in different places affect query processing? Right — so queries don't necessarily have SLAs. We have predictable performance, and what provides predictable performance is the petabit-bisection-bandwidth network and what we're able to read off Colossus and so forth. Oh, by predictable performance I mean: you issue the same query over and over again, you get roughly the same performance — exactly, very little variance. That's what we strive for. Why does that matter? Customers — especially large enterprise customers — demand predictability above everything else; that's one thing we've seen hold true over and over again. You're willing for it to run slower, as long as it runs at the same speed every time? Right, exactly. Why is predictability so critical? Because if customers have a workload of, say, 10 queries, and one gets faster but nine get slower, they'll take the predictability of everything running at the same speed over one getting faster and nine getting slower. Another way to think about it: you're not typing something into a terminal, you're running a program. For some applications it's the same run every day, and if today it's slow and tomorrow it's fast, you're left wondering: is it my application? Is it the database? Oh, I see — so it's one less thing you have to worry about. Exactly, right.

So in this case, say we have this query that's executed in three stages. The bottom workers here scan the data from distributed storage and apply a partial group-by; then it goes to shuffle, which provides dynamic partitioning for us into the next stage, which does the group-by, sort, and limit; and that's piped into the final stage, which does the final sort and top-K limit to answer the query. So that's the basic flow of how things work.
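As a toy illustration of that three-stage flow — partial aggregation, a shuffle on the group-by key, then a final merge and top-K — here's a small sketch. The stage structure mirrors the slide; the data and helper names are made up, and this is not how BigQuery is implemented internally:

```python
# Toy three-stage aggregation: scan + partial group-by -> shuffle -> group-by -> final top-K.
from collections import Counter
import heapq

def stage1_partial_groupby(rows):
    """Each 'worker' scans its shard and produces partial counts per key."""
    return Counter(rows)

def shuffle(partials, num_partitions):
    """Hash-partition partial aggregates by key so each downstream worker owns a key range."""
    partitions = [Counter() for _ in range(num_partitions)]
    for partial in partials:
        for key, count in partial.items():
            partitions[hash(key) % num_partitions][key] += count
    return partitions

def stage2_final_groupby(partition, k):
    """Each downstream worker finishes the group-by for its partition and keeps its local top-K."""
    return heapq.nlargest(k, partition.items(), key=lambda kv: kv[1])

def stage3_topk(local_topks, k):
    """A single final worker merges the per-partition top-Ks into the global top-K."""
    merged = [kv for local in local_topks for kv in local]
    return heapq.nlargest(k, merged, key=lambda kv: kv[1])

# Example: three input shards (one per scan worker), grouped and counted, top-2 overall.
shards = [["a", "b", "a"], ["b", "c"], ["a", "c", "c", "c"]]
partials = [stage1_partial_groupby(s) for s in shards]
partitions = shuffle(partials, num_partitions=2)
local = [stage2_final_groupby(p, k=2) for p in partitions]
print(stage3_topk(local, k=2))   # -> [('c', 4), ('a', 3)]
```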
Okay — in-memory shuffle. This is the basic overview of in-memory shuffle: BigQuery uses a separate storage tier for it, and this is a design decision described in the Dremel paper that won the VLDB ten-year best-paper award. The basics are as follows. In-memory shuffle isn't new, but the key finding is that in-memory shuffle coupled with compute presents bottlenecks: it's hard to mitigate the quadratic scaling characteristics of shuffle, and it leads to resource fragmentation, stranded memory, and, frankly, poor isolation. So BigQuery actually implements, on top of separate compute and storage, a disaggregated memory-based shuffle — RAM disk managed separately from the compute tier. It was found to reduce shuffle latency by an order of magnitude, enable order-of-magnitude larger shuffles, and reduce resource costs by 20%. So this is a key piece of the stack. It also added persistence in the shuffle layer, which allows stages of the query to be checkpointed — and that's really the key piece that allows flexible scheduling and execution. BigQuery can preempt workers after checkpointing state, then re-optimize and add a new set of workers, or reduce the set of workers in the next stage, based on how it sees the query executing.

This is all done through dynamic scheduling in BigQuery. The central scheduler allocates both slots and the number of workers — again, slots are the virtual compute unit used in BigQuery. It's used to handle machine failure: if a machine fails, the scheduler can spin up another one, after state is checkpointed, to redo the work. And it also allows fair distribution of resources between queries. So if we look at this graph here: queries one, two, three, and four are executing, and at a certain point it's found that query four needs fewer resources, so the scheduler and the workload manager can reduce the resources it's using and give them to another query, so they can all finish around the same time.

So why do you need dynamic query execution? If folks don't know who this person is: David DeWitt, famous for the DeWitt clause and many other things in the database industry and academia. He gave an interview a while back — I think late nineties, early 2000s — to the SIGMOD Record, back when he was working on query optimization and dynamic re-optimization, and he gives a pithy answer to a question that still rings true: optimizers are making assumptions about joins five or six levels up in the tree based on little more than wishful thinking. If you have a large enough query with a large number of joins, there's no way you can reliably predict what's going to come out six or seven levels up in the join tree. So BigQuery enables, on top of dynamic scheduling, dynamic query execution as well, and this is a key piece that we use.

The first piece here is dynamic partitioning. The goal is to load-balance and adjust parallelism while adapting to any query, data shape, or size. The planner and optimizer are actually quite primitive; we rely heavily on these dynamic query execution pieces for predictability and performance. So, the example here: say we have workers start writing to partitions one and two. At a certain point, the coordinator detects there's actually too much data going to partition two, so it dynamically repartitions on the fly into partitions three and four. At that point workers stop writing to partition two and start writing to partitions three and four, and the data already in partition two is repartitioned into three and four as well, to smooth out the partition stream for consumption by the next stage.
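Here's a minimal sketch of that repartition-on-the-fly idea — detect a hot partition and split it into new partitions, redistributing what was already written. Purely illustrative; the threshold and the splitting rule are assumptions, not BigQuery's actual policy:

```python
# Toy dynamic repartitioning: if a shuffle partition gets too hot, split it on the fly.
# Illustration only -- the threshold and split policy are made up.

HOT_THRESHOLD = 4          # toy value: rows per partition before splitting
NUM_PARTITIONS = 2         # initial partition count

def route(key, splits):
    """Route a key; if its base partition was split, spread it over the sub-partitions."""
    base = hash(key) % NUM_PARTITIONS
    if base in splits:
        subs = splits[base]
        return subs[hash((key, "salt")) % len(subs)]
    return base

def shuffle_write(keys):
    partitions = {}   # partition id -> keys written so far
    splits = {}       # base partition id -> list of replacement partition ids
    next_id = NUM_PARTITIONS
    for key in keys:
        pid = route(key, splits)
        partitions.setdefault(pid, []).append(key)
        # The coordinator notices a hot base partition and splits it in two.
        if pid < NUM_PARTITIONS and len(partitions[pid]) > HOT_THRESHOLD:
            splits[pid] = [next_id, next_id + 1]
            next_id += 2
            # Data already written to the hot partition is redistributed as well.
            for old in partitions.pop(pid):
                new_pid = route(old, splits)
                partitions.setdefault(new_pid, []).append(old)
    return partitions

print(shuffle_write([f"user{i}" for i in range(20)]))
```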
We also make use of join re-optimization. So just briefly: in a broadcast join — if you learned about that in this class — if one side of the join is small enough, it can be broadcast to all the other workers, and the join is done locally on each worker node that received the small side of the table. In a shuffle join, if both sides are large, the system uses shuffle to repartition both sides and then executes the join in parallel after the shuffle. So BigQuery takes advantage of dynamic join processing as follows. What percentage of all join queries in BigQuery use shuffle joins? So, how common are they? I don't know that offhand, sorry. In a simple example, we can start with a hash join by shuffling data on both sides, and we can actually cancel the shuffle if one side finishes fast and turns out to be below a broadcast size threshold — in which case it starts executing a broadcast join instead, which means much less data transfer and usually better performance. So this is an example of where BigQuery can take advantage of re-optimization on the fly to switch to a broadcast join. It can also decide the number of partitions and workers for a parallel join based on the input data sizes, occasionally swap join sides in certain cases, and there are also star-schema join optimizations that it takes advantage of — pretty typical stuff. Okay, at this point I'll go into other key features of BigQuery, but is the overall query processing flow fairly clear? Any questions? Good, okay.

Okay, the first feature I'd like to talk about is Capacitor. This is BigQuery's columnar format — like every other large-scale data warehouse, data in BigQuery is stored in a columnar, read-optimized format. A couple of the key features around Capacitor — and there's an interesting blog post written by Mosha, who was an engineer on the team, back in 2016 with more details, which is again very interesting. Essentially, vectorized evaluation is embedded in the data access library, to push compute as close to the data as possible. It provides things like typical partition and predicate pruning, with vectorization as close to the data as possible. Capacitor makes use of skip indexes within the file to do efficient data skipping. It also reorders predicates using heuristics based on things like dictionary usage, unique-value cardinality, null density of the data, and expression complexity. And it provides row reordering: underneath the hood, it can reorder data to take advantage of run-length encoding. So on the left here, data is originally loaded in a format where, say, the items column can't take advantage of run-length encoding. Capacitor reorders the rows — say by the state, quarter, and item columns — to lay out the data such that run-length encoding can actually be applied and the data compresses even further.
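Here's a toy example of why row order matters for run-length encoding: sorting rows by their low-cardinality columns makes the runs longer, so RLE stores fewer (value, run-length) pairs. The table and the chosen sort order are made up for illustration:

```python
# Toy illustration of row reordering for run-length encoding (RLE). Illustration only.
from itertools import groupby

def rle_runs(column):
    """Number of (value, run-length) pairs RLE would need for one column."""
    return sum(1 for _ in groupby(column))

def total_runs(rows, ncols):
    return sum(rle_runs([r[c] for r in rows]) for c in range(ncols))

rows = [  # (state, quarter, item)
    ("CA", "Q1", "pen"), ("NY", "Q2", "pen"), ("CA", "Q1", "pad"), ("NY", "Q2", "pad"),
    ("CA", "Q2", "pen"), ("NY", "Q1", "pen"), ("CA", "Q2", "pad"), ("NY", "Q1", "pad"),
]

print("runs before reordering:", total_runs(rows, 3))       # more, shorter runs
reordered = sorted(rows)      # sort by (state, quarter, item)
print("runs after reordering: ", total_runs(reordered, 3))  # fewer, longer runs -> better compression
```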
And this is a graph from the recent VLDB paper on Dremel that shows, for real datasets in BigQuery, the percentage of bytes saved by reordering the data to take advantage of run-length encoding and compression. So layout matters. When does the reordering happen? Oh, it can be continuously recompressed behind the scenes. And BigQuery also favors clustering, so it provides the ability to automatically recluster data. If you define a table and say, "I want to cluster data by this column," BigQuery will automatically recluster the data. So in this case, Block 1 has all the rows clustered on customer ID 1, the next blocks have IDs 3 to 4, and so forth. This gives advantages in query processing such as data skipping and co-location — if you're computing aggregates, it's a huge win there — and it also allows optimizations over star schemas and so forth.

How does that affect the typical workload, since it happens in the background? If my query benefits from how compressed the data is, isn't that unpredictable? Right — well, think about it: if you have new rows added to the table, you want to automatically recluster the data to take advantage of it. And this is where, behind the scenes — I won't get into details here — if new data is added to a table, then for a certain amount of time that data, the delta that was added to the table, will be semi-unclustered. But over time, yes, it'll eventually be reclustered. So it'll make things go faster, never slower. Got it.

You didn't mention: what is the pricing model for BigQuery? How do you make money — do you charge for queries, for data at rest, for data transfer? Right, so there are two models for charging for BigQuery. Customers can reserve slots, or they can pay on demand for the slots used to answer a query, up to some limit. A slot is essentially an abstract resource — it's a similar model to what other systems use; Cosmos DB on Azure, for example, uses something similar: some unit of resources used to answer a query. And if you compress the data, the query will be faster and customers pay less? They do.

So the question is: you present a serverless abstraction, but underneath the hood, how do you actually capture locality? Is it the automatic re-clustering, or when you're scheduling the jobs onto Borg, how do you represent that this data belongs with this compute? Right — so for the clustering part of that: clustering is a property of the data layout. If there's new data added to a table, eventually you want it to be clustered and sorted according to what the customer specified.
So yeah, you have to take care of the new data that was added to the table, and essentially, by nature, similar rows end up co-located at that point, because it's essentially a data-sorting problem. The question is more: since the scheduling of workers comes and goes, do the compute nodes have any locality — does a worker have any cache of the data, or does everything go over the network when you're actually looking for the data? Yeah — being able to rely on the bisection-bandwidth network gives us a lot of headroom there. There is certain caching in the system, at the storage tier and so forth, but nothing in particular at the worker nodes. That assumption will change when I talk about the BI Engine.

Another key feature, which was actually just published in VLDB by Pavan and Mosha, is quite interesting, and it's what we term "big metadata." Just to set up the problem, there's a distinction to draw between logical and physical metadata. Logical metadata tends to be small — by logical metadata I mean things like the metadata that defines the schema of a table, or row- and column-level access policies, and so forth. Physical metadata, which tends to be much larger, refers to the things a query processor needs to actually go and execute a query: file locations, multi-version concurrency control info about which data to read for a snapshot, column stats, and so on. So how big does this actually get? Well, in the paper I just referenced, BigQuery can see metadata in the tens of terabytes for certain tables — which can add tens of minutes of latency just to load the metadata before processing a query.

So the main idea of big metadata is to treat metadata management itself as a data management problem. What big metadata does is organize the physical metadata of each table as a set of system tables — themselves stored columnarly — that are derived from the original table. And this is built for both the batch and streaming workloads that BigQuery supports. During query processing, we defer reading the physical metadata of the tables until the query is actually dispatched to the parallel workers. The query planner first uses essentially only the logical metadata to generate a query plan, with typical things like constant folding, filter pushdown, and so on. Then the query is rewritten, effectively, as a semi-join of the original query with the metadata system table now part of the query — which effectively produces the list of block locators, the file locations the query will actually need to read. So it's a way of treating the physical metadata problem as itself just another query, or part of the query. There are a bunch of details in the paper; I highly suggest you read it to see how the state of the art in big metadata management is handled.
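Here's a tiny sketch of that rewrite idea: the physical metadata lives in a columnar system table of block locators plus per-block column stats, and the query's filter is first run against that table (the semi-join) to get the list of blocks that actually need to be read. The schema and names below are invented for illustration; the real system tables are described in the paper:

```python
# Toy version of the "big metadata" idea: filter a system table of block-level
# metadata first, then scan only the surviving blocks. Names and schema are made up.

# Physical metadata "system table": one row per block of the user table.
BLOCK_METADATA = [
    # (block_locator, min_order_date, max_order_date)
    ("colossus://tbl/block_000", "2023-01-01", "2023-03-31"),
    ("colossus://tbl/block_001", "2023-04-01", "2023-06-30"),
    ("colossus://tbl/block_002", "2023-07-01", "2023-09-30"),
]

def prune_blocks(filter_lo, filter_hi):
    """The 'semi-join' step: keep only blocks whose min/max range overlaps the predicate."""
    return [
        locator
        for locator, lo, hi in BLOCK_METADATA
        if not (hi < filter_lo or lo > filter_hi)
    ]

# Query: SELECT ... WHERE order_date BETWEEN '2023-05-01' AND '2023-05-31'
# Only block_001 survives pruning; the workers never touch the other blocks.
print(prune_blocks("2023-05-01", "2023-05-31"))
```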
Okay. So BigQuery also supports read and write APIs directly against managed storage — what BigQuery terms the storage tier. On top of Dremel, which accesses data on Colossus more or less directly, we also have high-throughput read and write APIs in both directions, so engines like Spark, TensorFlow, Presto, or Google Dataflow can read and write data to and from storage directly. And again, this is roughly the control flow, but it really speaks to how BigQuery is architected in order to serve parallel table reads and writes: metadata is a separate component, on top of Capacitor on Colossus and the stream-ingest components, and Dremel is completely separate as well. So we're able to provide this high-performance storage tier directly to other engines too.

In terms of the read API, we support multiple streams to read data in parallel, so you can read a disjoint set of rows across a table in parallel, and this really facilitates consumption by other parallel processing frameworks, such as Spark. We allow column projection and filtering, so we read only the data that's necessary, and there's vectorized processing pushed as close to the data, under the API, as possible. We also provide snapshot consistency: by default, a read is consistent as of the point in time at the start of the read session the engine sets up, or you can read a snapshot as of a prior timestamp.

The write API, which we just sent to GA, really supports best-in-the-industry stream ingest at scale, directly into managed storage. Some key features: it provides exactly-once semantics, which is actually a key feature for stream ingest and other scenarios customers want to build on top of streaming. It provides both stream-level and cross-stream transactions, which again is key — streams can commit as a single transaction and retry or abort on failure, and we handle the transactional semantics across that. And a key feature a lot of customers have started using is committing data across parallel streams: you can write across parallel streams and have that commit as a single transaction, which is fairly key. Right — so, Andy asked, you might want to describe what a stream is. It's not simply an insert statement; it's essentially append-only streaming of data into the engine, into storage directly. It also provides schema-update detection for schema evolution, and so on. Effectively, if you think of how such a system is set up, naturally data is ingested and ultimately lands in some sort of read-optimized format eventually, so there's a log-structured merge process that goes on there.
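Going back to the read API for a second, here's a sketch of what a parallel read session looks like from a consumer's point of view — split the table into streams, and each reader pulls only the columns and rows it asked for. This is a toy stand-in, not the actual Storage API client:

```python
# Toy sketch of a parallel read session with column projection and a row filter.
# NOT the real BigQuery Storage Read API client; it just mimics the shape of it.
from concurrent.futures import ThreadPoolExecutor

TABLE = [
    {"user_id": i, "country": "US" if i % 2 else "DE", "spend": i * 10.0}
    for i in range(1, 9)
]

def create_read_session(table, columns, row_filter, num_streams):
    """Split the table into disjoint 'streams'; remember the projection and filter."""
    chunks = [table[i::num_streams] for i in range(num_streams)]
    return [{"rows": c, "columns": columns, "filter": row_filter} for c in chunks]

def read_stream(stream):
    """Each reader applies the filter and returns only the projected columns."""
    return [
        {col: row[col] for col in stream["columns"]}
        for row in stream["rows"]
        if stream["filter"](row)
    ]

session = create_read_session(
    TABLE, columns=["user_id", "spend"],
    row_filter=lambda r: r["country"] == "US", num_streams=2,
)
with ThreadPoolExecutor() as pool:              # e.g. Spark tasks, one per stream
    for part in pool.map(read_stream, session):
        print(part)
```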
Yeah, question? When you want to do column projection, is that a property of the columns? No — so the question is, how do you support column projection in the read API? It's a property of the read session. We assume the engine issuing the parallel scan knows the table schema and is able to tell us which columns need to be read. Ultimately, when it's time to access the data, we have that projection and are able to just open the file and read the columns that are necessary. Same with filters and pushdown: pushed as close to the data as possible, and those are provided through expressions. Any other questions?

Another key feature we just built into BigQuery is an embedded BI engine, and this is actually fairly exciting. A lot of analytics queries tend to be repeatable and issued from a BI tool such as Looker or Power BI. Oh — BI stands for Business Intelligence. These are usually things like dashboards: people building queries and dashboards in BI tools like Power BI and Looker, which are popular BI tools. What this means is that these queries are usually canned, usually short-lived, don't change that often, and tend not to be super large scale, so they call for a different set of assumptions. If you look at what BigQuery provides as a feature set here: at the very base, you have partitioning and clustering of the data, to speed up queries by taking advantage of partition and cluster skipping and reading only what you need. On top of that, BigQuery provides materialized views, for materializing either sub-queries or whole queries for use in query optimization and so forth. And at the very top, we're able to cache data in what we call the BI Engine. This is really an in-memory, vectorized query engine for low latency and very high concurrency: data is cached in memory, a vectorized query engine sits close to that data, and it can answer queries at sub-second latencies. Yep? For the materialized views, is there a way to incrementally update them, or are they just periodically rebuilt? The question is, are materialized views incrementally updatable? The answer is yes.

And if you look at the right-hand side here, there are different use cases for different sizes of data. At the very bottom, cold, petabyte-size scans and queries go through our high-throughput scans and Dremel. Warm here is roughly hundreds of terabytes; hot, for materialized views, tends to be tens of terabytes; and very hot is gigabytes up to about a terabyte. That's what we're talking about in terms of data sizes. And if we redraw the BigQuery architecture here, the BI Engine — you can think of it as a set of stateful workers. Here's where data is cached in memory, with a vectorized query engine sitting close to the data, to efficiently answer queries that are memory-bound and take advantage of all the neat vectorization tricks you can do for query processing. And this is another way of looking at the BI Engine architecture — again, built on existing BigQuery storage; nothing has really changed there.
But really, if you look at the BI Engine coupled with our streaming ingest, it eliminates the need to manage a separate set of BI servers in a middle tier; this is completely embedded within BigQuery's query processing framework. And you can eliminate things like ETL pipelines or complex extracts: just stream the data directly into BigQuery, which can update materialized views, or again cache the data in the BI Engine for fast queries. And there's no need to manage traditional OLAP cubes and so forth — we don't need cubes anymore. Good. (And just disregard that if you don't know what cubes are.)

Another neat feature that we've had for about two years is BigQuery ML. BigQuery was actually a pioneer in this space of democratizing ML features and pushing them into the data warehouse itself. In terms of what we provide for ML model building: classification with logistic regression, DNNs and AutoML Tables via Vertex AI — that's the AI portion of GCP's services — linear regression, and so forth. For some of these you can build the model directly in BigQuery using Dremel, and we also support importing more advanced TensorFlow models that you want to bring into BigQuery to do bulk prediction on. Basically, the key here is the ability to use SQL to do all of this, so you can build fairly complex ML pipelines using SQL and work with them alongside the data already residing in BigQuery.

We also recently released BigQuery Omni, which is BigQuery shipping on other clouds, starting with AWS. The basic idea is that the left-hand side is BigQuery running on Google Cloud; we provide the control plane, which is able to talk to Dremel and our distributed memory shuffle tier running on AWS itself. It's powered by key pieces of Anthos, with which we're able to ship our containerization and resource management onto other clouds. The setup here would be customers with large data lakes on AWS: we're able to run on AWS, query the data there, and send the results back to either GCP or answer it directly. So where does the compute run — is the computation done back in BigQuery? Right — in this case Dremel and the in-memory shuffle tier run as close to the customer's data on AWS as possible. Essentially we're able to ship Dremel and shuffle onto other clouds, tied together by the BigQuery control plane, which spans both Google Cloud and AWS in this case.

Okay — how am I doing on time? Okay, all right. So that's it for BigQuery. Hopefully I've been able to relay how at least one major cloud data warehouse is architected and what the architectural assumptions are. Before I move on, are there any other questions? Okay, so let's talk briefly about Aurora. We're switching tracks to talk about how some assumptions change when OLTP is done in the cloud as well. If you think about how traditional relational databases are built, they tended to be tightly coupled:
SQL, transactions, caching, and logging were all co-located with attached storage. That was the fundamental assumption of relational database architectures going back 30 years or so. But this is rather monolithic — and here we're switching away from OLAP and data warehousing, which is heavily read-mostly, to where writes and operational workloads are in the picture, so it's really a different set of assumptions at this point. If you were to ship a traditional relational database in the cloud, say on VMs, this monolithic architecture has a rather large blast radius if something goes wrong — if the database crashes and so forth — and it requires a lot of time and energy for high availability, failover, and so on.

So when you think about shipping an OLTP database in the cloud, one set of things to think about is, again, that compute and storage have different lifetimes. Storage is the durable piece. Compute is needed on a varying basis: instances fail, you can shut them down, or you can scale them up and down depending on the workload, and instances can also be added to a cluster for things like read scale. So, much like the architectural principle in BigQuery and Dremel, compute and storage — even for OLTP — are best decoupled, for scalability, availability, and durability, all of which is what customers are buying when you provide an OLTP relational database service.

And I/O flow is another major bottleneck. As you might have learned in this class, databases actually write data twice: first to the write-ahead log, which provides the durability part of transactions, and then again when the data pages are checkpointed back to storage. If you look at a simple example of a MySQL instance on RDS — RDS is AWS's relational database service, but it's the same if you ship it on a VM on Azure or GCP — there's a heavy I/O flow if you have a primary and a standby instance. The write-ahead log, and maybe a logical log, is shipped to the secondary; pages are checkpointed back to EBS, the attached cloud disk; and underneath the hood there's also an EBS mirror. So there's a lot of I/O going on in this picture.

What the Aurora architecture really did was look at hosting a typical MySQL or Postgres instance, but cutting out the extra writes — essentially optimizing the stream of I/O around OLTP workloads. The principles here: the database tier — what you see as the compute tier — writes only the redo log over the network to what is essentially the abstraction of a shared disk, a custom storage tier, which I'll talk about in a bit. There's no checkpointing: as we know, the log is the database, so it's really just writing the write-ahead log to storage, with no need to checkpoint pages back. And log application is pushed down to storage, where it's applied to pages in parallel.
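To make the "log is the database" idea concrete, here's a minimal sketch of the write path under that design: the primary ships only redo records — no page writes, no checkpoints — and anything that needs a page materializes it by applying redo on top of an older version. Purely illustrative, not Aurora's actual protocol:

```python
# Toy "the log is the database": the primary emits only redo records;
# pages are materialized by replaying redo. Illustration only.

class RedoRecord:
    def __init__(self, lsn, page_id, delta):
        self.lsn, self.page_id, self.delta = lsn, page_id, delta   # delta: (key, value)

class StorageNode:
    """Holds durable redo records and materializes pages on demand."""
    def __init__(self):
        self.log = []                 # durable redo records, in LSN order
        self.base_pages = {}          # page_id -> dict (an older materialized version)

    def append(self, record):
        self.log.append(record)       # the ONLY thing the primary ever sends

    def read_page(self, page_id, as_of_lsn):
        page = dict(self.base_pages.get(page_id, {}))
        for rec in self.log:          # apply redo up to the requested point in time
            if rec.page_id == page_id and rec.lsn <= as_of_lsn:
                key, value = rec.delta
                page[key] = value
        return page

storage = StorageNode()
# The primary executes two writes: no page is ever written, only redo records.
storage.append(RedoRecord(lsn=1, page_id="p1", delta=("balance:alice", 100)))
storage.append(RedoRecord(lsn=2, page_id="p1", delta=("balance:alice", 70)))

print(storage.read_page("p1", as_of_lsn=1))   # {'balance:alice': 100}
print(storage.read_page("p1", as_of_lsn=2))   # {'balance:alice': 70}
```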
So you can think of the storage tier in Aurora as essentially a redo log machine. Each piece of storage receives the portion of the log for the pages it manages and just replays that log on top of those pages to bring them up to date. The master in this case can also replicate to read replicas, which update their caches: the master receives the writes, you can stand up read replicas for read scale, the master replicates the write-ahead log, and the read replicas apply just the portions of the log that touch pages in their buffer pool, dropping the rest of the updates, since they can always read pages from shared storage. So really what you're looking at here is a set of replicas attached to one large shared disk — think of it that way.

The storage tier is custom built. It provides highly parallel, scale-out redo processing on top of database pages, and data is replicated six ways across three availability zones. An availability zone, at least in AWS, is in the same region but separated for availability purposes — disasters and so forth. It can generate and materialize database pages on demand, as of a particular version, and it essentially provides instant database redo recovery — again, think of it as a redo machine. And it provides a four-of-six write quorum with local tracking, so if a whole availability zone plus one extra replica went down, you would still have a read quorum available.

One of the other advantages here is read scale. In the typical high-availability model, the log is shipped to the secondary, data has to be rewritten, and transactions have to be reprocessed; here, you remove that path altogether. Just the page cache is updated on the read replicas — these updates are physical, delta changes applied on top of pages — there are no writes on the replica, and it can read any data it needs directly from shared storage as of some point in time. So is the basic architecture fairly clear? Any questions here? Really, if there's one point to take away from this, it's that at the database tier there's no durable piece necessary for transaction processing; it's all pushed down to a scale-out, highly available storage tier.

Okay — in terms of why six copies and a quorum are necessary: in a large fleet of anything, like I said before, failures are really background noise; failures are happening all the time. So one Aurora design principle was to tolerate "AZ+1" failure: a whole availability zone goes down, plus one extra replica, and the service is still available — above all, it's a highly available OLTP system. So for three AZs, data is replicated six ways, with two copies in each AZ. This provides a write quorum of four of six and a read quorum of three of six. Storage then segments the data into chunks: it partitions a storage volume into n segments, which are the units of replication, and six replicas of a segment form what's called a protection group.
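The quorum arithmetic is worth spelling out. With V = 6 copies, a write quorum of Vw = 4 and a read quorum of Vr = 3 satisfy the usual conditions (Vr + Vw > V, so every read sees the latest write, and 2·Vw > V, so two conflicting writes can't both succeed), and losing an AZ plus one more copy still leaves a read quorum to repair from. A tiny check, using the quorum sizes from the Aurora paper:

```python
# Quorum sanity checks for Aurora's 6-way replication (2 copies in each of 3 AZs).
V, V_WRITE, V_READ = 6, 4, 3

assert V_READ + V_WRITE > V      # every read quorum overlaps every write quorum
assert 2 * V_WRITE > V           # any two write quorums overlap (no split-brain writes)

def survives(copies_lost):
    """After losing `copies_lost` copies, what can the volume still do?"""
    remaining = V - copies_lost
    return {
        "can_read_and_repair": remaining >= V_READ,
        "can_accept_writes": remaining >= V_WRITE,
    }

print("lose 1 AZ (2 copies):     ", survives(2))  # reads and writes both still fine
print("lose 1 AZ + 1 more (AZ+1):", survives(3))  # writes blocked, but data still readable/repairable
```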
So really the key question is: what is the Goldilocks segment size? If segments are too small, failures happen more often relative to the segment count and you're moving more data around; but if segments are too big, then repairing a segment — shipping its data from another replica — might take too long. Aurora chooses essentially the biggest size that still allows fast enough repair: segments are 10 gigabytes in size, and on a 10-gigabit network connection it takes roughly 10 seconds to repair a segment.

The storage tier also provides quorum sets for replica membership changes. So at the top here we have six replicas, and everything's healthy. At a certain point we detect that replica F is unhealthy, or in a suspect state. You can define a new epoch — epoch 2 in this case — and bring up a new replica G, during which both F and G are active in the replica set. At a certain point, in epoch 3, you decide that F is no longer needed, and you start anew with G as the new member of the replica set.

And I'll briefly cover what the storage node does and then conclude. If you look at what a storage node does in Aurora, it's a replay machine. It receives the log records written from the primary instance, writes them to an incoming queue, which is made durable, and at that point it acknowledges back to the primary instance that everything is durable. Behind the scenes it's doing things like coalescing the hot portion of the log, redoing those records on top of database pages and writing the pages back to storage for durability, and also backing up the data to S3 for disaster recovery. And it's also doing peer-to-peer gossip between replicas: if one replica is missing a piece of the hot log, another node can share data with it to bring it up to speed. But the key here is that steps one and two — receive and durably enqueue, then acknowledge — are the only synchronous part. That's essentially the equivalent of the write-ahead-log durability ack that you need for transaction processing.

And in terms of recovery, the key pieces here: there's a protocol between storage and compute based on what's called the segment complete LSN, the SCL. This is effectively the low-watermark below which all log records have been received by that segment. It's maintained for each segment on each storage node using backlinks in the log records, which are used to identify holes — missing writes — which are then filled in by gossiping with peers. The database knows about the SCLs, which gives it a global view of the log across the storage tier. There's also a protection group complete LSN, the PGCL, which can advance once the database tier sees the SCL advance at four of the six segments in the protection group. And there's a volume complete LSN, what we call the VCL, which can advance after the database sees the PGCL advance at all protection groups.
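Here's a small sketch of that storage-node behavior — the synchronous part is just "durably enqueue and ack"; applying redo, tracking the contiguous low-watermark (the SCL), and gossiping to fill holes all happen asynchronously. Again, an illustration of the idea, not the real protocol (a real segment only sees the LSNs for its own pages and uses backlinks; here the LSNs are just contiguous to keep it tiny):

```python
# Toy Aurora-style storage segment: ack as soon as a record is durably queued;
# compute the SCL (contiguous low-watermark) and fill holes via gossip in the background.

class Segment:
    def __init__(self):
        self.received = {}        # lsn -> record payload (the durable incoming queue)

    def receive(self, lsn, payload):
        """Synchronous path: make the record durable, then ack to the primary."""
        self.received[lsn] = payload
        return "ack"

    def scl(self):
        """Segment Complete LSN: highest LSN below which nothing is missing."""
        lsn = 0
        while lsn + 1 in self.received:
            lsn += 1
        return lsn

    def gossip_from(self, peer):
        """Fill holes (missing LSNs) from a peer that has them."""
        for lsn, payload in peer.received.items():
            self.received.setdefault(lsn, payload)

a, b = Segment(), Segment()
for lsn in (1, 2, 4):            # segment a never received LSN 3
    a.receive(lsn, f"redo-{lsn}")
for lsn in (1, 2, 3, 4):
    b.receive(lsn, f"redo-{lsn}")

print(a.scl(), b.scl())          # 2 4  -> a has a hole at LSN 3
a.gossip_from(b)
print(a.scl())                   # 4    -> hole filled by gossip
```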
So the VCL essentially provides the recovery point. A commit in Aurora happens as follows: the database acknowledges a commit when it can prove that all of the transaction's changes have met quorum — which just means the volume complete LSN is greater than or equal to the commit LSN of the transaction, as simple as that. All transactions are acknowledged asynchronously, so multiple transactions can piggyback on each other; there's no synchronous flush, consensus, or group commit required here. Essentially the database tier is waiting for the VCL marker to advance past the commit LSN before it acknowledges the transaction.

And this allows some key features for fast recovery. In a traditional database, as you might have learned, you checkpoint pages up to a certain LSN, at which point you can truncate the log, to bound the redo — the recovery time from the write-ahead log. In Aurora, the storage tier is the durable piece of the system, and it's constantly redoing the log, in parallel, on the various segments. So really the job of Aurora storage is to redo log records either on demand, as part of a read, or continuously in the background during regular processing — in a parallel, distributed, asynchronous fashion. So there's no replay necessary at startup. If the database tier crashes and comes back, a new master runs a recovery protocol that goes roughly as follows. The database tier establishes a consistency point, which is increasing monotonically and continuously returned to the database tier. Again, transactions can commit once the database can prove that all changes have met quorum up to some LSN, and that point is the volume complete LSN — the highest point at which all records have met quorum. But there's also a consistency point LSN, the CPL — essentially the highest point, at or below the VCL, that's safe to treat as consistent. This is how Aurora storage can handle things like system transactions: if you're splitting a B-tree, you're writing a new page, updating an existing page, and updating the parent page, and that has to be atomic across multiple pages — so it's encoded such that the consistency point only advances across whole system transactions. So on a crash, when the database comes back up, you establish the CPL, and everything past the CPL is effectively deleted — it no longer exists for the purposes of transaction processing. This removes the need for a two-phase commit that might span storage nodes, and there's no redo or undo log processing necessary before the database is open and ready for processing: it simply attaches to storage, establishes the consistency point, and the database is open for business.
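A tiny sketch of those two rules — commit acks wait for the VCL to pass the commit LSN, and crash recovery truncates everything above the highest durable consistency point. The structure is a toy; the terms (VCL, CPL) are from the Aurora papers:

```python
# Toy commit/recovery rules in the Aurora style. Illustration only.

def can_ack_commit(commit_lsn, vcl):
    """Commit is acknowledged once all changes up to commit_lsn have met quorum."""
    return vcl >= commit_lsn

def recovery_truncation_point(vcl, consistency_points):
    """On crash recovery: keep the highest consistency point (CPL) at or below the VCL;
    everything above it is discarded, so no redo/undo scan is needed before opening."""
    eligible = [cpl for cpl in consistency_points if cpl <= vcl]
    return max(eligible) if eligible else 0

# Example: storage reports VCL = 108; the database had marked consistency
# points (e.g. ends of system transactions) at these LSNs.
vcl = 108
cpls = [90, 100, 105, 110]

print(can_ack_commit(commit_lsn=100, vcl=vcl))   # True  -> safe to acknowledge
print(can_ack_commit(commit_lsn=112, vcl=vcl))   # False -> keep waiting
print(recovery_truncation_point(vcl, cpls))      # 105   -> discard everything above 105
```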
And of course there are similar architectures that have followed this basic design — one is Azure SQL Database Hyperscale, and another is Alibaba Cloud's PolarDB — and systems like Spanner follow a different architecture but go after the same key workload: OLTP in the cloud. So with that, I'll conclude the talk and answer any questions you might have. Thanks. Yeah, that's all — just any questions you may have.

Question right there. So my question is, we tried using Aurora Serverless, and it didn't seem to scale up the way we expected — it never seemed to find a scaling point. Oh, so the question is, you used Aurora Serverless — the serverless version of Aurora — and it didn't find a scaling point. Yeah, I don't know; we can take that one offline. It might have been that the scaling point wasn't found fast enough. The public documentation says this as well: the serverless version of Aurora does look for workload changes, to scale up and down appropriately based on the resources used. Okay, any other questions?

I have a few questions. Sure. The first one is: is there a possibility, or a necessity, for some sort of hybrid system — HTAP — that can do both efficiently, or is that even necessary? Yeah, so the question is, in the cloud, is there some HTAP or hybrid system that can do it all? I think the verdict's still out; we'll see if a system can be built like this. My guess is that if it were to happen, it would happen from the OLAP side — OLAP would grow to look more like the HTAP system — rather than evolving an OLTP system to become more OLAP-y. The design primitives are different: a relational OLTP database design is based on fixed-size pages, and the workloads there tend to be stickier than OLAP. But again, that's just my guess; anything can happen. If you look at systems like SingleStore, which are going after this space, they have interesting built-from-scratch hybrid architectures as well. But then again the question is, is that the right approach? Can you get away with two different systems that do efficient CDC — change data capture — to each other, where the database change log is shipped to the OLAP system, and the OLAP system is instantly, or close to instantly, updated for efficient querying?

Another question I have: you talked about these serverless architectures, and especially the pricing model in the cloud — this slot model, which essentially charges for resource consumption. Have you encountered cases where you could, for example, make a query finish faster, in a predictable way, but consume more resources — in which case you'd need to charge the customer more money? Right — so, yeah, in both of those systems I think that's basically true now. In most serverless models, the charging model is essentially some actual resource unit used to answer the query or perform the workload. And the question is, what's the interplay between performance and using more resources?
And the answer is, usually throwing more resources at a problem will make it run faster; the question is how much customers want to pay for that. In a perfectly scalable world, you can pay all the money you want for the query to be answered in one second instead of three. The question is, does the customer want to pay for it? And that's usually a lever provided by most of these vCPU or virtual-compute-unit models: you can cap a certain workload at a certain amount you're willing to pay, and that's the maximum amount of resources that will be thrown at the problem. Okay, so that's kind of up to the customer? Right — usually the model is that the customer can provide a cap: "Look, I'm only willing to pay this much; go do what you can." It's a simple but usually effective model.

Another question, yeah? I was going to ask about the cost of moving data across the network in these disaggregated designs — shipping data across the network is expensive compared to having compute and storage together in one service, in one building. Yeah, it really depends. For OLAP you're really throughput-bound, so you're bound by the throughput you get off of storage. For more operational workloads, which tend to be latency-bound, queries tend to be smaller, in which case reading data in a point-wise or smaller, page-based fashion across the network will affect performance more. So it really depends on what type of system you're building; it's a different set of assumptions at that point. Usually in an OLTP system like Aurora you'd like most of the data to be cached, because otherwise it will cost you a network trip to go fetch an 8K page. All right, good.

Question, right there? What's the next biggest challenge you expect to tackle in this space? What are the biggest — find me afterwards, there's a long list of challenges. Yeah, I mean, there will always be the continuous challenge of making things faster and more predictable. It's an age-old optimization question, one that's clearly not solved in any system yet. I'd say that's far and above number one. All right, thank you.

Many of the concepts that we discussed in class will definitely still influence, or be useful in, practice, but many of the assumptions could break, and you can see there are more challenges in different contexts in the cloud, driven by applications — many more challenges that you need to deal with. So there are still lots of exciting problems that you could work on in this space. So that's it for today. On Wednesday we'll have a pretty quick final review, and then good luck with the rest of the homework, projects, and the exams. All right, that's it for today — looking forward to seeing you guys on Wednesday. See you.