The Carnegie Mellon vaccination database talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at OtterTune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at StevenMoyFoundation.org.

So today we're excited to have Jagan, Indrajit, and Hakan from the Google Napa team. They're here to talk about a new database system they've been building at Google for several years now, and there's a VLDB paper this year that describes the architecture. So we're really happy to have them here to talk about this massive project that they've been working on. Jagan is part of the data infrastructure team, as are Indrajit and Hakan. He has a PhD in computer science from the University of Maryland. Indrajit is also on the Google data infrastructure team; he has a PhD from UT Austin. I believe Hakan has a PhD from UT Austin as well. For the audience: if you have a question while the speaker is giving the presentation, please unmute yourself, say who you are and where you're coming from, and ask the question. And feel free to do this any time; we want this to be interactive so that Jagan's not speaking into the emptiness of Zoom for an hour. So with that, Jagan, the floor is yours. Thank you so much for being here.

Thank you. So today I'm going to talk about a new data warehouse product at Google. We've been working on it for many years now, and many of the important Google systems actually use our data warehouse for their analytical needs. As Andy mentioned, I have Indrajit and Hakan on the Zoom call; let me also tell you a little bit about their work on Napa. Indrajit is a manager in the group. He's one of the first members of Napa, and he leads the effort to build the Napa controller and things like that. And Hakan is our senior director; the whole Napa team reports to him, as well as many other teams. If you would like to grab a copy of the paper, it is uploaded on the Google Research homepage. The material here is covered in that paper; I highly encourage you to grab a copy, read it, and send us your comments.

A little bit about myself: I'm the tech lead on Napa. I work primarily on query optimization, view maintenance, and view recommendation problems. For many years I worked at NEC Labs before I came to Google.

Having said that, let's get started. Traditionally, data warehouses are typically built for these so-called white whales: systems with uncompromising performance requirements. They want everything to be faster and larger, freshness to be great, and so on, and they skew the design in a certain direction. But if you leave them aside and look at the many other requirements inside Google, and I'm pretty sure outside Google as well, there are many new clients that have fairly sophisticated performance needs which are really hard to achieve. I'm going to tell you a little bit about them. This is the reason we actually went back to the drawing board and designed this new system, called Napa, that is able to satisfy many workload types. Now, if you zoom out and look at it from, say, a thousand feet, a data warehouse typically consists of a few boxes. I mean, no surprises here.
First of all, you have the three components of a data warehouse: ingest, storage, and query serving. An ETL pipeline brings in an enormous volume of data; in our case, it brings in trillions and trillions of rows every day. The way to think about it is that at any minute of the day, every second of the day, tens of gigabytes of compressed data are coming into the system. So as far as scale goes, it is quite extraordinary in some sense. The data comes in and lands in thousands of tables and their associated indexes, and we then serve billions and billions of queries every single day.

Now, what is important here is to really understand where our clients come from. What the clients do is build dashboards, analytical tools, widgets, and Colabs, and then they query the system using SQL. One thing to understand is that these are fairly interactive applications, and the goal is to serve them in less than a second, at sub-second latency. So one very critical aspect of what we do is to ensure that we are able to answer these clients really, really quickly. For the volume of data being ingested and the number of tables and the amount of data we deal with, that's a pretty challenging problem in itself. But really, one of the key realizations we had is that latency is only one aspect. What is very important for an interactive sort of application is robust query processing, which means that while the queries are fast, they should also have low variance in response time. It makes for a terrible user experience if one query runs in 300 milliseconds and the next takes three minutes; the variability makes for a really unpleasant experience. Now, that is easier said than done, for the mere fact that our workload has huge variations over time: day versus night, weekday versus weekend, and month ends, for example, are a very busy time for us. So as you can see, both the ingestion workload and the query workload put on the system are extraordinarily varied and change over time. That is one challenge.

The second aspect is that, unlike high-performance data warehouse vendors, who typically require you to buy software and the associated high-performance hardware, our data warehouse runs entirely inside Google's internal cloud, on general-purpose, heterogeneous machines. The challenge there is: how do you tame variance when you are running inside the Google cloud, which is a multi-tenant environment? That is the second challenge. Now, the most important thing...

Quick question about the variance: what is the scope, or the window, over which you look at variance? I understand one query after another in succession, you want those to have roughly the same response time. But are you talking about the same query having the same response time within a month, within a week, or over its lifetime?

It has to be stable, right? Because you're talking about clients that are running these critical workloads, which basically means your 99th percentile and your 99.9th percentile had better not show a huge exponential jump. So being able to tame tails is extraordinarily difficult.
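To make that robustness requirement concrete, here is a minimal sketch (mine, not the speaker's; the latency numbers are invented) of why tail percentiles, rather than averages, are what an interactive dashboard actually experiences:

```python
import random

# Invented latencies: most queries take ~300 ms, but 2% hit a slow path.
random.seed(0)
latencies_ms = [
    random.gauss(300, 30) if random.random() > 0.02 else random.uniform(3_000, 180_000)
    for _ in range(100_000)
]

def percentile(values, p):
    """Nearest-rank percentile; good enough for a sketch."""
    ordered = sorted(values)
    rank = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

for p in (50, 90, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p) / 1000:.2f} s")
# The median stays near 0.3 s while p99 and p99.9 jump by two to three
# orders of magnitude -- the "300 ms vs. three minutes" experience, which
# is why the talk tracks p99/p99.9 rather than averages.
```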
And as I will describe later in the talk, we go through enormous effort to make sure that the tails are abated. In the paper, we talk about literally 50 things that we need to do to make tails go away. So tail abatement is a serious business for us.

Oh, hello, I have another question, if I may. By the way, I'm from MIT, currently a student working on databases. You emphasize low variance in response times, and I just want to understand the intuition behind why variance is put before absolute response time. I'm just wondering about the exact intuition.

So I do have many slides on it. Okay, can I get back to you after I've presented those slides? It's later in the talk, but we can talk.

Okay, sure. Is there a security concern, for example, like avoiding timing attacks?

Not really. This is the data warehouse that people use to consume business data: you make business-critical decisions based on what this data warehouse tells you, and the user base is pretty much all of Google. What you want is for people to be able to consume insights from data in a timely fashion. That is basically what is going on here. Maybe we can talk a little more after I've presented some of the later portions. Thank you.

All right. Now, the second requirement, which is actually quite interesting and hopefully a little surprising: although the emphasis in what we've written is on the extreme scalability of our systems, one thing to understand is that not every application needs millisecond latency, not every application requires minute-level freshness, and most certainly not every application has an unlimited budget. You are talking about many teams, projects, and groups inside Google that are trying to optimize their business model. So as business needs change, they come back to us and say: I want to renegotiate my contract with the data warehouse. I want to make my queries slower or faster, I want to reduce cost, I want to change the freshness requirements, and so on. The challenge is that we have built this one data warehouse system with hundreds of clients that put their data inside Napa, and we have to change this production system to cater to these different, and changing, requirements. These things are not fixed in stone: a client could come back tomorrow and tweak their expectations of the data warehouse. That is the challenge Napa is able to handle.

To illustrate this point, let me give you a few examples that make clear where these needs come from. There is a cost-conscious Napa client; they run an experimental framework. They come to us and say: I'm cost conscious, which obviously means we need to control cost, and I'm willing to live with lower query performance. What they want to say is: can you give me these two things, moderate query performance and low cost? I'm willing to sacrifice data freshness. So they're basically saying: let me trade off data freshness and get the two things that are beneficial to my users. Another client comes and says: well, I run this fresh data analytics application.
The name itself says their analytics data should be very fresh, but what they're willing to trade off is query performance. I don't need absolutely fantastic query performance; I'm willing to give some of it up. Can I do that? That is the second kind of use case. And, as you would have guessed by now, the third client is one of those external-facing clients. They come to us and say: I'm external facing, I'm critically important, my business case is very important. I want extremely good performance and I want good freshness, but can I trade off on cost? I'm willing to pay a somewhat higher cost.

Now, as you can see, our clients come from every end of this triangle, and everywhere in the middle, and each of them wants to trade off one of the corners, to varying degrees.

Now, why is this hard? Why is providing this client flexibility hard? Let's look at a typical design. If you look at the designs out there at a very high level, we can agree that data warehouses typically consist of three boxes: ingest, storage, and query serving. Most designs either couple ingest with storage, or couple storage with query serving. In the first design, if you couple ingest with storage, your ingestion can only go as fast as you can index. What that means is that if you want great query performance, that is, if your indexing load is very high, you have to sacrifice freshness. On the other hand, if you couple indexing with query serving, meaning you do them together, in tandem, you can get great freshness, but you have to sacrifice some query performance. Of course, we know there is no free lunch, but in these designs the choice is already made for you. In the next slide, I'll explain, at least intuitively, how Napa is slightly different from the architectures I just showed you.

Okay, now let me tell you how Napa is different. Going back to the same familiar three boxes: ingest, storage and indexing, and query serving. I already told you about the ETL pipeline and the tremendous amount of data it brings in. One thing to note is that Napa's design is planetary scale. It's highly available, and there are extreme amounts of fault tolerance and redundancy built into the system: we can tolerate multi-cell, data-center-level failures, and so on. It's built to tolerate failures to a great extent.

The somewhat surprising thing about Napa is that we have bet the bank on materialized views. Our performance comes from materialized views: we build hundreds, if not thousands, of materialized views, on a per-table basis. And the second thing is that these are absolutely consistent. A view is consistent with other views, a view is consistent with its parent table, and a view is consistent across data centers. That is a very key part of Napa. The reason we insist on that is that the user should see no difference in query results regardless of whether the query binds to a root table or to a view, and no difference whether it binds to one data center or another. It should also make no difference to user queries if I drop a view or create a view: no database state can reveal that a view even exists in the system. The user queries the base table and magically gets the speedup by binding to the right view.
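To picture that transparency, here is a minimal sketch of view routing. It is entirely my illustration, not Napa's actual mechanism; the table, views, and matching rule are invented. A client always queries the base table, and a rewriter silently binds an aggregation query to a view that covers its grouping keys and measures:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    name: str
    group_by: frozenset    # dimensions the view is aggregated on
    aggregates: frozenset  # measures the view maintains

# Hypothetical views over a base table of ad events.
VIEWS = [
    View("v_by_campaign", frozenset({"campaign"}), frozenset({"clicks"})),
    View("v_by_campaign_country",
         frozenset({"campaign", "country"}), frozenset({"clicks", "cost"})),
]

def route(group_by, aggregates):
    """Pick the cheapest view that can answer the query; else the base table.

    A view qualifies if the query's GROUP BY keys are a subset of the view's
    keys (we can re-aggregate upward) and it carries all requested measures.
    """
    candidates = [v for v in VIEWS
                  if set(group_by) <= v.group_by and set(aggregates) <= v.aggregates]
    # Fewer grouping keys means a smaller, cheaper view to scan.
    return min(candidates, key=lambda v: len(v.group_by)).name if candidates else "base_table"

print(route({"campaign"}, {"clicks"}))  # -> v_by_campaign
print(route({"country"}, {"cost"}))     # -> v_by_campaign_country
print(route({"user_id"}, {"clicks"}))   # -> base_table (no view covers user_id)
```

Because the user's SQL never names a view, views can be added or dropped behind this kind of indirection without any query changing, which is the property the speaker emphasizes.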
Now, it is fair to say that the scale of the data we ingest, the number of tables we maintain, the number of views on those tables, and the consistency mechanisms we provide make this an extremely challenging problem. No system comes to mind that pushes views to the extreme we have, in terms of the volume of data and the number of views we actually maintain.

Moving to query serving, the key point is that we want to provide robust query performance: performance that avoids tails and offers sub-second latencies. Later I'll give you at least a flavor of the extreme lengths we go to to make sure query serving achieves that tail abatement.

Okay, now let me get to the difference. I've shown you the three boxes, and the thing that makes the difference, which hopefully I'll have convinced you of by the end of the talk, is that we have decoupled the ingest from the storage, and the storage from the query serving. What that means is that ingestion goes as fast as it can, up to some limits: as fast as the user can push it, we can take it, until it exceeds limits that we set, which are pretty generous. The indexing is decoupled, and it can scale up as far as the resources provisioned to it allow. And finally, query serving. This is an important point: one of the reasons we are able to abate tails is that query serving does a bounded amount of work regardless of the state of the system, and it is completely decoupled from indexing. It is not very important to understand exactly what that means right now, because I'm going to talk about it, but the key idea is that regardless of the state of the system, query serving does a finite, fixed amount of work, and that is one of the reasons Napa is able to provide very consistent, tail-free query performance.

Okay, now... Sorry, I have a quick question. My name is Slim, I'm a software engineer at LinkedIn. I want to double-click on one thing: the indexing there, decoupled from the ETL. Two questions. Is the ETL real-time, something like Google Pub/Sub, or is it a batch job? And second, when you are ingesting, does your data model allow actual updates on the table, or is it only an append-only kind of workload? Do you do any updates on fields, or deletes, and so on?

Right, so the pipeline is batch-based, yes. And in terms of the data model, at least in this paper we only talk about an append-only style of workload, although it's obviously possible to also do other kinds of things. Did I answer you?

Yes, I think you did answer the question. When you do the ETL, how fresh is the data?

Right, so it depends. Again, that is also per provisioning, per SLA; there's an end-to-end latency and things like that.
So the whole thing is governed by an end-to-end latency, of which we are one part.

What is the best latency that you can get, if that's okay to ask?

There is an SLA. Yeah, I'll point out that maybe it's best to talk about a spectrum instead of asking what is the best that can be done, because those are internal details. The paper talks about, and it's intentional why we are only talking about things that are written in the paper, things like order of minutes: some folks want order of minutes, some folks are okay with a larger number of minutes. To your previous question of whether we can do streaming: actually, all of our ingestion is streaming. What Jagan was pointing out as batch means that the ETL pipelines may themselves be batch-oriented. But from Napa's perspective, it is receiving trillions of rows per day, and they are all being streamed in. Some of the upstreams may choose to put in data in a more Pub/Sub-like fashion; there are some smaller examples like that. But the vast majority are slightly different: they have logs data which has been logged, they do some processing, and then they start streaming it into Napa.

Thank you.

All right, moving along. So we've talked about the decoupled nature of the system, and now I have to tell you what binds these things together. You cannot have a completely decoupled system with every component doing whatever it wants. So what is the mechanism that rules them all, that binds them together? The key principle is what is called the queryable timestamp. The queryable timestamp (QT) is the control mechanism, if you will, that keeps these different components working together with the common purpose of aligning Napa's performance with the client's expectations. Later I'll say more about this: ingestion and storage maintenance are pushing the system in very different directions. One is making things worse for the user, and one is making things better: ingestion makes things worse, storage maintenance makes things better. QT is the control signal and, interestingly, there is a control system that is constantly trying to correct the state of the system. For now, just the keyword "queryable timestamp" is sufficient, and I'll move on with the remainder of the talk.

All right. Now, this is roughly Napa's architecture. You can spot the three boxes: the ingestion, the query serving, and the storage and indexing. That forms the data plane of Napa. You can also see our familiar ETL pipeline bringing in data. What we built on top of it is a control plane, which consists of a controller and the queryable timestamp, the signal by which these three components can be controlled. What the queryable timestamp gives us is a way to orchestrate work and maintain the database at the client's requirements. So this is the control system, that is the input to the control system, and this is what the control system uses in order to maintain the database at the right, optimal level for the client.

Okay. Now, briefly, the ingestion. Let's go one by one. The first thing about ingestion is that your goal as an ingestion system is to ingest data as quickly as you can. You don't want to hold on to the data; you want to quickly get the data in.
You want to acknowledge it, and you want to give some sort of guarantee on the data that has been ingested. So what we do is quickly take the data, replicate it, commit, and then acknowledge, giving back a timestamp and things like that. The goal of the ingestion is very simple: go as quickly as you can, and don't block the ingestion pipeline, to the extent possible.

All right. Now then, what do you do? Note that there is something like a conflicting goal here. The ingestion's goal is to quickly write data out so that it can unblock the ingestion pipeline, so the output of the ingestion server is write-optimized: if you queried it, it would be terrible. What you need is a background operation that turns this into something read-optimized, something that is great for query. The key idea there is something very familiar: the log-structured merge (LSM) tree. That is the data structure, the mechanism, we use to organize the ingested data. The caveat, of course, is that there is no free lunch: the minute you have something like an LSM, you have to pay the cost of reading and writing data multiple times, which is popularly called write amplification. One of the things you will see is that the LSM is nicely tied into many things we do, including using the properties of the LSM to achieve robust query performance.

All right. Now let me show you how the whole thing works, picking up from the previous slide. The ingestion server dumps these tiny files onto our file system. As I already told you, this part is write-optimized; it is not for query. You cannot really query this data: if you did, your performance would be terrible, because the yield from opening every single one of these files is quite low. Now, there is a background process that takes these small files and builds larger files called deltas. The process is called compaction: we take the small deltas and merge them into larger ones. The things you see in green are read-optimized, and they are for query. Some of you might have noticed that this means whatever is ingested cannot be queried immediately: there is going to be a lag between ingestion and what is available for query.
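Here is a minimal sketch of that compaction step, my illustration rather than Napa's implementation, with an invented row layout: several small, sorted delta files are merged into one larger, read-optimized delta, the ordered merge at the heart of any LSM:

```python
import heapq

# Hypothetical write-optimized deltas: each is a small run of
# (key, timestamp, value) rows, already sorted by key.
delta_1 = [("ad42", 101, 3), ("ad77", 102, 1)]
delta_2 = [("ad42", 105, 2), ("ad99", 104, 7)]
delta_3 = [("ad13", 103, 5)]

def compact(*deltas):
    """Ordered merge of sorted runs into one larger, read-optimized delta.

    heapq.merge streams the inputs, so memory stays proportional to the
    number of runs, not their size. Note the caveat that comes up later in
    the talk: one slow input run stalls the whole merge, which is where
    merge tails come from.
    """
    return list(heapq.merge(*deltas, key=lambda row: (row[0], row[1])))

compacted = compact(delta_1, delta_2, delta_3)
print(compacted)
# [('ad13', 103, 5), ('ad42', 101, 3), ('ad42', 105, 2), ('ad77', 102, 1),
#  ('ad99', 104, 7)] -- a query now opens one file instead of three.
# Repeated compactions are the read-vs-write amplification trade-off
# inherent to the LSM design.
```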
That lag, it turns out, is the key idea behind the queryable timestamp. This is an important concept. Given a database, there is a queryable timestamp for the database, or for a table or a view if you will. It's a live marker, constantly moving: it is the point up to which you can query the database, and it is typically a few minutes in the past. If you query at that particular point, we give you a lot of nice properties. First, all the views and tables are consistent: it doesn't matter whether your query touches a table or a view of that table, the answer will be consistent. Second, the query will end up reading a very small number of deltas; we bound the number of deltas you will end up reading. This is where the decoupling comes in: regardless of the size of the system, a query reads a fixed number of deltas, which is very good for tail mitigation, as our experimental results will show. Third, those deltas are available in a quorum of data centers, which means you have the option of reading from a local data center rather than a really distant one. So the QT is the point at which, if you query, you are going to get really good performance from the database or the table. And now I'm going to tell you a little bit about how we control this QT and use it to give the client their desired performance. The gist of that is... maybe this is the next slide. Keep going, keep going, keep going. You had a question?

The claim that the number of deltas that need to be read for any query is less than X: that assumes your ingestion component can process them as fast as possible, right? If you fall behind, you can't guarantee that, right?

Sure, you can.

I think you're explaining that in the next slide.

Yes, I am. Sorry. All right. Now, let's look at this particular slide. The QT, as you rightly said, decides what you can query. It is fixed at a certain number, or bounded by some number; there are some subtleties to it that we obviously don't want to go into too deeply. Now, you can immediately see that the QT also implies a freshness delay, and you can see there is something going on here: a push and pull, which is perfect for a control system. Let me show you the push and pull. If you ingest more, if you push in a lot of data, you widen this span: you make the data inherently staler. The indexing effort, on the other hand, reduces the span; but if you add more views and things like that, you make pushing QT to the right much harder. And the client query always does fixed work. What that means is that here is a system being pushed in the left direction and the right direction, under a constraint, and that makes for a perfect control problem: a control system with an enormous amount of control over which table to work on and what work to schedule. It looks at the client requirements as the reference and says: I am going to maintain this table according to the particular requirement the client has given me. The one measure that tells me whether I am doing well or not is the QT: if QT is within the target, the client's performance is good, and if you can maintain QT with the provisioned resources, the cost is low. So you can see that we built a very nice control system on top of this, one that maintains the database against the required spec.
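Putting those pieces together, here is a rough control-loop sketch. The spec fields, units, and QT rule are my stand-ins, not Napa's actual design; only the idea comes from the talk: QT marks the most recent point whose data is cheap and consistent to query, and a controller schedules indexing work to keep each client's QT lag within the spec that client negotiated:

```python
import time
from dataclasses import dataclass

@dataclass
class ClientSpec:
    """A client's negotiated trade-off (all fields and units hypothetical)."""
    max_qt_lag_s: float    # freshness: how far in the past QT may trail
    max_query_deltas: int  # performance: deltas a query may touch
    worker_budget: int     # cost: indexing workers this client pays for

@dataclass
class TableState:
    last_compacted_ts: float  # everything up to here is read-optimized
    open_deltas: int          # small, write-optimized deltas not yet merged

def queryable_timestamp(table: TableState, spec: ClientSpec) -> float:
    # QT can only advance past data that is safe to query: either already
    # compacted, or covered by few enough deltas that a query stays within
    # its per-query delta budget.
    if table.open_deltas <= spec.max_query_deltas:
        return time.time()          # fresh data is cheap enough to read
    return table.last_compacted_ts  # otherwise queries stop at the last
                                    # read-optimized point

def control_step(table: TableState, spec: ClientSpec) -> int:
    """Return how many compaction tasks to schedule this round."""
    lag = time.time() - queryable_timestamp(table, spec)
    if lag <= spec.max_qt_lag_s:
        return 0                    # within spec: spend nothing extra
    # Behind on freshness: spend up to the client's budget to merge deltas
    # and push QT forward. A client who tolerates staleness (large
    # max_qt_lag_s) or pays little (small worker_budget) simply gets a QT
    # that trails further behind -- the push-and-pull just described.
    return min(spec.worker_budget, table.open_deltas)
```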
Okay, now let's look at things slightly more pictorially. Go back to our three clients: the experimental framework, the fresh analytics application, and the external-facing client. What you can see is that what is encoded for each of them is what is provisioned: some amount of ingestion, some amount of indexing, some amount of cost. And what is also fixed is the maximum span we are going to allow, which essentially determines the query performance on that particular database. What the QT controller does is take all these specs into account, along with the physically running system, its current performance and characteristics, and reconcile one against the other. That is how we are able to maintain databases at the right performance. Later, at the end of the talk, I'm going to come back to these clients and show you actual graphs from production traces, so you can see how we do these things. All right, are there any questions before I move on to some of the infrastructure problems? All right, perfect.

Now, the other thing that is very different about Napa is that we use a database system called F1 Query as the infrastructure both for maintaining tables and views and for client serving. What that means is that this database engine, F1 Query, has to be good at two very different kinds of workloads. In the first kind, you're talking about enormous data volumes, terabytes, petabytes if you will, that have to be transformed from some input representation to some output representation. F1 Query was modified in order to be very good at this kind of workload. What F1 Query has traditionally been good at is serving clients at sub-second latency; that is something it was built for, and one of the things we added is the ability to serve these queries with minimal tails. Using the same infrastructure both for maintaining the tables and views and for client query serving is one of the things we did in F1 Query that I am personally very proud of.

Now, when it comes to the indexing, there are basically two kinds of work we do. One is compaction, which takes an input LSM, takes a span of the input LSM, merges it, and updates the LSM. The other is view maintenance, which takes the root table's LSM, sorts, groups, aggregates, and updates the view's LSM. And note that we are not talking about trivial amounts of data; rather, enormous volumes of data flow through these query plans. If you look through it, you have the ingestion server, the root table, and the views, and the updates flow from one LSM to another, creating a forest of LSMs: a root table can update a view, and a view can update another view, and so on. All of this has to be done in bounded time, and the data volumes and the latency requirements on these operations are pretty stiff.

Now, let me briefly cover the challenges. One challenge is that when you merge, as the fan-in becomes wider and wider, you become more susceptible to tails: one slow input can slow down the entire query. That is problematic, especially because these are ordered merges, which means you cannot simply skip a slow input.
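Here is a small sketch of one hop of that LSM-to-LSM flow, with an invented schema; real Napa view maintenance, as the paper describes, is far more involved. A freshly compacted delta of the root table is sorted, grouped, and aggregated, and the result is appended as a new delta of the view's own LSM:

```python
from itertools import groupby

# A hypothetical freshly-compacted delta of the root table:
# (campaign, country, clicks) fact rows.
root_delta = [
    ("c1", "US", 3), ("c2", "DE", 1), ("c1", "US", 2), ("c1", "IN", 4),
]

def view_delta(rows):
    """Incrementally maintain SELECT campaign, SUM(clicks) GROUP BY campaign.

    We never recompute the view from scratch; we aggregate only the new
    delta and emit it as a new sorted run of the view's own LSM. Readers
    merge the view's runs at query time, so the partial SUMs combine
    correctly.
    """
    keyed = sorted(rows, key=lambda r: r[0])          # sort step
    return [(campaign, sum(r[2] for r in grp))        # group-by + aggregate
            for campaign, grp in groupby(keyed, key=lambda r: r[0])]

view_lsm = []                        # the view's own forest of sorted runs
view_lsm.append(view_delta(root_delta))
print(view_lsm)                      # [[('c1', 9), ('c2', 1)]]
# The same pattern cascades: this view delta could in turn feed a coarser
# view (say, totals per advertiser), giving the forest of LSMs the speaker
# describes.
```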
Now, when it comes to view maintenance, you're talking about petabytes' worth of data, so you have to be very careful about how you preserve orders. By orders, I mean sorting, partitioning, clustering information, things like that. One has to be careful here because creating a new sort order is extremely expensive, so you want to avoid that: you never want to destroy a sort order, and we also have ways to partially reuse sort orders and things like that. It's a very nice, interesting take on interesting orders, actually, that we have done here. The issues, which I won't cover in enormous detail, are sort efficiency, data skew, runtime tails, and so on.

Now, let me go to query serving. The goal, just to recall, is to serve queries fast and also to make sure the query performance is robust. One of the first things we did is introduce a new service that sits between the F1 worker and the raw storage. This layer does aggressive caching and many kinds of optimizations. The key idea is that if you want a piece of data and you have to fetch it from disk, you incur an enormous amount of latency. So you want to make sure that when the F1 worker reads, it is reading something that is cached in one of the many layers in this particular system.

Now, at a very high level: the common fallacy is that if you throw more parallelism at a problem, you can solve it. The issue is that the more parallelism you throw at a problem, the more you magnify the tails. Let me give you an example. Say you use 1,000 workers to read a piece of data. What immediately happens is that the 99.99th percentile tail gets magnified and shows up at the 90th percentile. The more parallelism you throw at it, the more you are at the mercy of these tails. That is a common design problem in many systems: the parallelism goes completely uncontrolled. It is important to understand that parallelism is not necessarily a great thing if you are worried about tails.

How do you mitigate this? Let me talk about three things, although if you look at the paper, there is a whole laundry list of different techniques. First, if you end up touching not the base table but the views, you immediately don't need that much parallelism, because the data has already been reduced for you; it is much more consumable than what you would have had from the base table. And the other things to keep in mind, like push-down filters and caching, are all great ideas: they are very good at reducing data access, which in turn reduces the tails. Second, you can remove unnecessary parallelism: you can combine small IOs into one single IO, you can do better allocation, and you can cluster data in physical storage so that you don't have to spawn completely separate threads. And the final thing is that you accept the fact that you are going to have tails, and you do some mitigation around them: you can detect slow reads early, you can restart them, you can issue competing requests.
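Two of those points can be made concrete; the numbers and code below are my illustration, not Napa's. First, the fan-out arithmetic: if a single read stalls past its 99.99th percentile one time in 10,000, a query that fans out to 1,000 reads waits on at least one straggler with probability 1 - (1 - 10^-4)^1000 ≈ 9.5%, so the p99.99 stall effectively becomes a p90 event, exactly the magnification just described. Second, a sketch of the "competing requests" mitigation, often called request hedging:

```python
import concurrent.futures as cf
import random
import time

print(1 - (1 - 1e-4) ** 1000)  # ~0.095: a p99.99 stall surfaces at ~p90

def read_replica(replica: int) -> str:
    """Stand-in for a storage read; 1 in 10,000 reads stalls badly."""
    time.sleep(60.0 if random.random() < 1e-4 else 0.02)
    return f"data-from-replica-{replica}"

def hedged_read(hedge_after_s: float = 0.05) -> str:
    """Issue a read; if it hasn't returned by ~p95, race a second copy.

    Whichever replica answers first wins (the straggler is simply left to
    finish in the background). This caps the tail at roughly the hedge
    delay plus one normal read, at the price of a few percent extra load.
    """
    pool = cf.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(read_replica, 1)]
    done, _ = cf.wait(futures, timeout=hedge_after_s)
    if not done:                                  # primary is straggling
        futures.append(pool.submit(read_replica, 2))
    done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    pool.shutdown(wait=False)                     # don't block on the loser
    return next(iter(done)).result()

print(hedged_read())
```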
So a big challenge with Napa, as you can see, is that tail abatement and robust query processing form one key part of the system, and we had to implement an enormous number of techniques before we could abate tails to the extent that the system is quite robust to query.

So of the architecture that you showed in the previous slide, which pieces do you guys actually control? Because F1 was existing infrastructure, right? And Colossus is also existing infrastructure. When you list these optimizations you apply, could you modify Colossus to reduce the amount of IO you're doing?

Right, and that is why the middle layer is important. The middle layer is completely our software. The reads are not done by the F1 worker; they are delegated to this intermediate service that does the reads for us. That means we can control the caching story, we can control the parallelism story, we can prefetch; we can do a lot of those tricks in that particular layer. If the F1 worker read the raw storage directly, then obviously we would be at the mercy of the tails; we couldn't do any of these things.

So to be clear: Colossus, as Napa uses it, is unmodified? Or maybe you told them, hey, fix these things?

Colossus, as Napa uses it, is completely unmodified.

OK. OK, I'm going to move on and present the performance results. Are there any other questions before I do?

You're doing OK on time; you have 17 minutes.

Yeah, I do. Yes. All right. OK, now let me talk a little bit about our performance experiences. I'm going to show you a few rather high-level figures: some simple experiments that show a few different things. One, targeted views result in better query performance; no surprises there. Two, a larger merge fan-in means you are at the mercy of tails. Three, the interesting one to me at least: different clients can realize their desired trade-offs. And finally, that decoupling is a good idea. That's the basic agenda; let me just walk you through it.

All right. Here is a real workload, and the goal was to figure out how many views we have to add before we completely cover the workload, so that every query hits a view. One of the things you can see is that it took about eight views to get to that sweet spot. And one thing that is noticeable is that the 99th percentile, the yellow bar, actually benefits from views, for the very reason I mentioned before: if you read from a view, less parallelism is needed, which means you are less at the mercy of the tail. So there is some tail abatement going on if you have more views in your system. OK.

All right. Now, the other one is... what happened? Yes?

For the previous slide: what is the order in which you're adding the views? The views one, two, three, four, up to eight.
Are they sorted, or is it random? Did you add the one view that would have the most impact first?

Yes. I think we just went by the number of query templates, so the first view added is the one with the most query templates binding to it.

All right. The second thing we want to show is that as you increase the merge fan-in, the tail increases fairly rapidly. Look at the 99th percentile: it increases at a much faster clip than the 90th or the 50th percentile. The tail is very sensitive to the number of things you need to merge while answering queries, and that is one of the reasons why having the QT bound the number of deltas to merge at query time is such a good idea. For a very high-performance system, we keep that number fairly tight, so that you don't have to merge enormous amounts of data on the fly.

Now I'm going to talk a little bit about the trade-offs. I'll go back to our three clients and show you some production data, and I'll show you that the clients really got what they needed. What you see is a graph with four panels: the ingestion rate in some units, the data delay in some other units, the query latency, and the resource cost. And you see three curves: a blue curve, a red curve, and a yellow curve.

Let's look at the blue curve first. That's our experimental framework. It pumps in an enormous amount of data, which you can see from the top panel. It gets reasonably good performance, as you can see from the third panel. And you can see that the cost is quite low: considering how much data they push in, they don't pay that much. What they gave up was freshness. They don't get a tremendous amount of freshness, but that is fine for them; if they wanted better freshness, the cost would obviously balloon up quite a bit.

Now let's look at the second curve, the red one. This is the analytical tool that had moderate ingestion, as you can see from the first panel. They get moderate freshness on the data; for the cost they pay, that's what they get, and from the last panel you can see they pay a low cost. What they gave up was performance: their query performance is worse than the previous client's. This is a state they are quite happy with; obviously they arrived at it after much trial and error. They're happy with the cost they pay, the freshness they get, and the performance they get; this works for them. Tomorrow they could come back and change all of this, every aspect of it, but at the moment this snapshot was taken, this was their state.

Okay, now the final one is our all-important external-facing client. They don't ingest a tremendous amount of data. They get super high freshness, and they get really, really high performance. The cost is also high.
For the amount of data they ingest, you can see that they pay a tremendous cost. Again, they are very happy with this particular setup, and nothing stops them from changing it tomorrow. So as you can see, clients can come to Napa, specify the constraints they want, and to a large extent actually get what they want. They go through a trial-and-error process and converge on something that suits their business state. And the nice thing about Napa is that nothing is set in stone: you can come back and change it yet again, and that is perfectly fine with us. We built a system that is able to cater to hundreds of these clients constantly changing their requirements, every day if need be. The system is able to handle it, and it's such a powerful idea for what we do.

Now, finally, I'm going to show you that decoupling is the right idea here. What you see are the familiar four panes: the ingestion, the indexing performance, the freshness, and the query latency. In the second pane, I'm showing an infrastructure problem during which the indexing is not moving. What happens is that until the indexing ramps back up, there is a brief hit in the data delay, but the query latency is rock solid: it hasn't budged at all. The system keeps giving perfectly good performance. The dashboards all keep working, except that the outage manifests as a freshness delay for a brief period of time. When the indexing ramps back up, you can see that the freshness delay is abated and things go back to normal.

Going back to the Napa architecture, you can see two things in action here. The mere fact that we decouple these components means that they can each go at their own pace. And the fact that we put a controller on top means that the system can constantly correct itself. It might have occasional outages, like every other system, but it can automatically correct itself: the controller can invest more resources as needed and bring the system back to a point that is in alignment with the client's required performance. And the most important thing here is that there were no dashboard outages. All the dashboards worked, except that you see a little bit of stale data, and even that is bounded: there is an SLA on it, and we never let the system go beyond that level. Okay.

All right, let me summarize, and then I'm happy to take more questions. Napa is a system that is enormously scalable. It handles a tremendously high query workload, billions and billions of queries, and takes in massive data volumes. We are a very important data warehouse inside Google. What is very unique about Napa is that we use materialized views to achieve sub-second queries. There are hundreds and thousands of materialized views, and we maintain them consistently despite ingesting trillions and trillions of rows. And clients have the flexibility to change their data freshness, query latency, and cost trade-offs. In the way we built Napa, we have ample opportunities for automation, tuning, and self-driving efforts.
The controller is one aspect of it, but there are many very nice, interesting automatic tuning problems present in Napa, many of which we have worked on but don't discuss in the paper; we will discuss them in future work. That's pretty much all I had. Thank you so much for your attention.

Great. I will applaud on behalf of everyone else. We have about 10 minutes for questions. If you have questions, unmute yourself, say who you are, and fire away. Otherwise, I'll be selfish and ask all my questions. All right, let's fire away. The QT, the main interface or the knob you're exposing to the users, lets them control the love triangle: freshness versus cost versus performance. Are people able to wrap their heads around that? Do you just tell them, you get this query at five minutes of freshness? Or is it a dial you expose to them?

No. So there's a difference between stating your expectation of a database and the current state of the database. QT is the current state of the database. It basically tells you: I am able to fulfill all the requirements you imposed on me, but I am only able to do it at a timestamp, say, seven minutes in the past. This one number tells you many things. It tells you that I have met your needs: whatever you told me, I am able to achieve, and the only caveat is that what I can give you is seven minutes old. Now you can look at that and say, yes, seven minutes is fine for my application's needs, or you can come back to us and say, no, that is unacceptable, I never want the database to be more than three minutes behind. I'm just throwing some random numbers at you. But there are two things to reconcile here. There is one side where the client states their expectation of the database, and that is codified in the system itself. The QT is the manifestation of that expectation: it tells you, very succinctly, the time up to which that performance can be achieved. Obviously, you would not be happy if I came back to you and said I'm able to meet all your demands, but the QT is two hours behind. So you can see how the QT very succinctly conveys many, many things, and it's a great way of building a control system on top.

I'm curious, because a lot of times in systems we say, well, expose something to the client and let them tell us what they want, but clients, people are stupid, right? They can't describe certain things. So I like this approach. But it sounds like, and maybe you said this during your talk, and maybe this is just a slip-up or maybe you meant it exactly, it sounds like someone can change what they want the freshness to be on a day-by-day basis, not only on a query-by-query basis.

Right, obviously. But if you change it, then everything changes, right? The cost profile changes, everything else changes. It's almost like we hit upon something very fundamental here, which is that in the love triangle, as you described it, you can choose two sides of it, but you are completely unbounded on the third one: the third one is what the system says it can or cannot do. So you can come back and say my freshness should be three seconds, obviously, but there's a cost to pay for it.
So the client basically has to figure out which three numbers make sense for them. They can specify two or three...

So I'm not sure I'm asking a question, but I was curious: would a Google-internal client do this on a day-by-day basis?

They don't; obviously, day-by-day would be a little too aggressive. But they have the option of doing it very often.

Okay, okay. What does your delta file format look like? Is it something like Parquet, like a sort of column store, or is it something else?

It's a column store. Yeah, a column store, and proprietary to Google.

Proprietary to Google, okay. How aggressively are you doing compression in it?

Pretty aggressively.

Okay, all right. Can't say any more? Okay. How complex are your materialized views? Do you support joins, left outer joins? And I guess the follow-up is: how are you actually maintaining them? Are you just re-computing the query from scratch? Probably not, because that would be expensive. Are you doing incremental updates?

Yeah, it's all incrementally updated. In this paper we don't talk about joins, but everything is incrementally maintained. The point is, if you are not super efficient, I don't think you can really keep up with the amount of data we are pumping into the system. So efficiency in every dimension is quite important; attention to detail is very important in what we do.

Of course. I understand that your third bullet point here says there's a bunch of automated stuff you've done that you can't describe right now. But at some point, I think the question is: how complex are the queries you are handling? Are they, say, at the scale of TPC-DS, very complex?

No, no, not that complex, because these are all known queries. The other thing to keep in mind is that there are ample opportunities to observe a query well beforehand and create views that align with those queries. Random ad-hoc queries are not our predominant use case here. You're talking about dashboards with well-established query patterns that you have ample opportunity to plan for and build views on top of. That does not mean there are no opportunities for quickly responding, self-tuning kinds of view design; opportunities like that do exist.

But simple heuristics are probably good enough, because you've seen the same query templates over and over again, right? I think you have a few hands that are raised.

Oh, sorry, I missed that. Pathanga, go for it.

Can you hear me?

Yes, I can.

Interesting paper; thanks for presenting it. I was wondering about the utility of the distributed cache. How much do you see the distributed cache being used, and how much variance do you see between the queries that hit the distributed cache versus those that go directly to the materialized views?

Right. So I think the comparison between the local cache and the distributed cache makes a tremendous difference. The local cache is obviously higher performance and smaller; the distributed cache is larger, but you have to make a network hop. So I would put them together as complementary to each other: think of the distributed cache as a slightly less performant cache, but one that can be really large, as opposed to a local cache.
Materialized views are, I think, slightly orthogonal here. What you cache doesn't really matter: you could cache tables, you could cache views, it doesn't really matter. Did I answer your question?

Oh, yes. Thanks, Jagan. I think I was confused: I was thinking the distributed cache was being used for one purpose, like always caching the results of a query that had previously been issued.

No, the views actually exist beforehand. Obviously result caching is possible, and that's a pretty nice way to do things, but it's not the focus here.

Thank you.

Okay, next is Amol, who I think you should know.

Oh, yes, of course. Hi, Jagan. My question was partially addressed, but let me ask it again; I have two questions. You talked about cost a lot, and I was curious: is this actual cost that the customer is somehow paying internally, or is it just a number?

Yes, we have fairly sophisticated ways of measuring cost, and of charging users and so on. There is a real cost, and everything we do is actually charged. The units of cost may not make a lot of sense outside of Google, but inside Google there is a very well-developed semantics of cost.

So the teams that are asking for these reports are actually paying in some sort of dollars that they care about?

In some units, yes, absolutely right.

I guess the second question, I was just trying to browse through your paper: what is the subset of SQL that you support? As a follow-up to Andy's question earlier, what do you support?

Pretty much the industry standard, right? I don't think our SQL is deficient in any sense.

Yeah, so pretty much it's the standard Google SQL dialect that you guys are pushing out, right?

That's right. Yeah.

Yeah, go ahead. I'm just curious about the incremental maintenance part of it. Is that something you had to innovate on, or were you able to use mostly standard techniques?

So I think in terms of the literature, in terms of the algebraic mechanics needed to maintain views, for the most part we could find things in the literature. What was really challenging is how you make these things work at scale, and the consistency guarantees that are needed; those two things were quite challenging. We do have some more publications in the works that talk more about view maintenance and the associated challenges.

Cool, thanks.

Any last questions? So they're completely different architectures, and they're solving different problems, BigQuery versus Napa, right? Because you're trying to do more real-time-ish things, and BigQuery is a more traditional style of data warehouse. But if we focus on the predecessor of Napa, which would be Mesa: what's one thing that you've found you're able to do with Napa that you weren't easily able to do with Mesa?

Oh, okay. One aspect is this. Mesa belonged to a style of data warehouse where you did everything at ingest, so the ingest and the indexing were coupled with each other. One thing about Mesa was that if you threw in a lot of views, you basically had a freshness problem: ingestion would get slower and slower, and the data staler, the more views you had.
The second aspect is the decoupling. It's somewhat intuitive, but the mere fact that we decouple, together with this aggressive consistency model, means that I can slip views in; and the fact that views are completely hidden from the user means that I can add views and remove views without affecting user queries at all. So we made architectural changes, we made automation changes, and we completely hid the views away from the user, which meant that Napa could do many of these things fairly easily that Mesa could not. And obviously, in terms of query performance, Napa is much, much higher performance than Mesa, but that is understandable, because Napa is a fairly new system.

Okay, awesome. All right, so again I will applaud on behalf of everyone else. Thank you so much, Jagan, Indrajit, and Hakan, for being with us today.