So today, what we're going to talk about, as I said, is a more forward-looking topic: automated database management systems and what we're doing here at Carnegie Mellon. In particular, we'll focus on what are called self-driving database management systems. Before we get into that, real quickly, I just want to go over what's remaining for us in the semester; I talked about this last class. I'm lame and didn't print it out, but I'll post it on Piazza later tonight. It's basically going to be: pick your favorite paper and write a short synopsis of why you think it could or could not fit into the system we're building here. And then... what's wrong with that? In a good way or a bad way? Good, interesting is good. Then Aneel is coming on Wednesday in class to give a guest lecture; he'll be talking about the HANA system. His background is that he worked on Sybase up in Canada, did his PhD at Waterloo, and then Sybase got bought by SAP and they transitioned over to work on HANA. So if you have questions about Sybase, the old-school stuff, or the newer stuff, he can answer both, which I find super fascinating. And then this is the stuff you have to do for the final project. Code review coming up. The final presentation will be on May 6th; we'll meet at 9 AM, and I'll send an announcement for that. Our friends at database companies are sending you guys t-shirts. Everyone gets a free database shirt, not necessarily from the CMU database group, but pick your favorite database company and we'll have shirts available for everyone. I know some of you are interested in TiDB because they're the Chinese database company; I contacted them this weekend and they're sending us shirts, so we'll have Chinese database company shirts as well. And then we'll do the code review on the 11th, and the final code drop will be on May 14th, because I have to get grades in by the 15th or the 16th since some of you are graduating. All right, any questions about any of those? The final code drop doesn't have to be the same as the final code review; they're different. The final code drop is like you giving me the final design document: you've addressed all of the concerns and comments in the reviews. But there can still be new functionality or features after the review. Yes. That's a bold move, but if you want to do that, sure. OK. Any other questions? All right. So today's agenda: again, I want to start off talking about what people have been doing in the space of autonomous database systems. Self-driving is the buzzword people use now because of self-driving cars, but people have been doing this, or at least trying to, for 45, 50 years. So we want to understand what people have tried in the past, and then we can see what's different about what we're trying to do in the modern era. Then we'll talk about the paper you read, which is basically a survey or overview of engineering guidelines for how to build a database system so that in the future it could be self-driving. These are things that we've come across and learned from building our own system here.
And part of the reason why we threw Peloton away and started over was because of some of the things in that paper: when we started trying to do the machine learning stuff on the code base we had before, it was just not possible because of the way the system was originally written. And then I briefly want to talk about the other major trend in applying machine learning to database systems, which is what are called learned components, where instead of using heuristics to make decisions, you learn some model that can tell you what to do. OK? All right, so why are we spending time talking about this? Why do we actually want to automate the management of the database system? I've talked throughout the entire semester about how, if you are able to work on the internals of a database management system, you can get paid lots of money and go anywhere you want to build these things, because this is a skill that's in demand. And not only can you make a lot of money building database systems, you can also make a lot of money maintaining them as a DBA, a database administrator. There was a survey done about 10 years ago about where people spend money when they install a new database system. Again, we're not talking about building the internals; we're talking about: if I'm a company selling widgets and I need to set up a database system, how much is it going to cost me to set that database up? The survey found that personnel accounted for about 50% of the total cost of ownership of a database management system. The TCO is the total cost of actually deploying the database. So it's not just what I pay Oracle or IBM or Microsoft for the software license; it's the cost of the machines I had to buy, the time I spent setting up those machines and the software, the energy I have to use to run those machines, and also the labor costs for the humans who maintain and administer the software, the DBAs. And these people get paid a lot of money. The average salary in 2017, according to the federal Bureau of Labor Statistics, was around $90,000 a year. That's the average across the entire United States; I've heard some crazy numbers on the East Coast and the West Coast, especially in New York City for the financial guys, where we're talking six figures. I remember when I was in grad school, my advisor got a call from a hedge fund in Connecticut; they were looking for a DBA who did high-performance transaction processing and stream processing systems, and they wanted to pay that person half a million dollars a year to manage software. So humans are expensive, and they don't scale, meaning if I have not just my one database instance but thousands or tens of thousands, there's no way any human can manage all those individual pieces of software. Typically what they do is figure out the lowest common denominator for how to configure the system, and then they just replicate that over and over again. So Oracle, Microsoft, and Amazon are not tuning each individual instance of RDS or Aurora or Redshift. They'll come up with a basic configuration, but they leave the nitty-gritty details, like how to actually set up the schema, to the humans. And of course, again, that's expensive.
People recognized in the early days of database management systems, especially relational ones, that once you have this abstraction layer between the logical configuration of the database, the schema and the queries, versus the actual physical organization of the data and the execution of those queries, and since we're decoupling those things, the assumption was that the database system should be smart enough to figure out the optimal way to configure itself. So people recognized in the early 1970s: we need to figure these things out automatically, so how are we actually going to do that and get the best performance? What's different from what we're going to talk about today is, at a high level, essentially nothing; people just came up with different labels and different terms over the various decades. Going back to the 1970s, with the first relational database management systems, the buzzword at the time was self-adaptive systems, self-adaptive databases. These early systems, or tools, were focused on solving what are called physical design problems, or physical database design problems. The classic one is picking indexes: if I have a bunch of queries, what indexes should I build to speed up those queries? They were also concerned with partitioning and sharding key selection: if I want to split up my tables across different physical disks or different machines, how do I pick some attribute to split them up on, whether I'm using range partitioning or hash partitioning, so that I get the best performance out of my system? And then data placement is what you do after partitioning: after I decide how to split up my table into partitions, where do I actually physically store them? At a high level, these early tools work the same way the tools work now; it's just that now we're doing machine learning magic to make it all work. The way it basically works is that you have some human DBA, and they prepare a sample workload trace comprised of the queries that the application executed. In our case, that's just the SQL statements the application submitted to the database system. Say in this example we want a self-adaptive tool that can pick indexes for us. We feed this training data, this workload sample, into our tuning algorithm. It does some kind of computation on the workload trace and tries to derive information about how the queries are accessing the data. The simplest way to do this, if I'm trying to pick an index, is to count the number of queries that touch each column; I can build a histogram of that. Then this tuning algorithm has its own internal cost model that tries to figure out which indexes could actually be beneficial and comes up with a bunch of candidate choices. It uses that internal cost model to say that this one will provide me greater benefit than another one: of all the possible choices for what index to build, I know this one is going to help me the most. The tool then spits that selection out as a recommendation to the human.
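To make that concrete, here is a very rough sketch, in Python, of what one of these early-style index advisors boils down to. Everything in it (the column-extraction heuristic, the "cost model" that is just an access count, the function names) is made up for illustration; real tools are far more sophisticated.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

def referenced_columns(query: str, schema: Dict[str, Set[str]]) -> Set[str]:
    """Very naive: a real tool would parse the WHERE / JOIN / ORDER BY clauses."""
    cols = set()
    for table, columns in schema.items():
        for col in columns:
            if col in query:
                cols.add(f"{table}.{col}")
    return cols

def recommend_indexes(workload: List[str],
                      schema: Dict[str, Set[str]],
                      budget: int = 3) -> List[Tuple[str, int]]:
    """Build a histogram of how often each column is touched by the sample
    workload, 'score' single-column candidate indexes with that count,
    and recommend the top-k to the human DBA."""
    histogram = Counter()
    for query in workload:
        histogram.update(referenced_columns(query, schema))
    return histogram.most_common(budget)

# Hypothetical usage: the DBA hands the tool a workload trace they captured.
schema = {"orders": {"o_id", "o_custkey", "o_date"}}
trace = [
    "SELECT * FROM orders WHERE o_custkey = 42",
    "SELECT * FROM orders WHERE o_date > '2019-01-01'",
    "SELECT * FROM orders WHERE o_custkey = 7",
]
print(recommend_indexes(trace, schema))  # [('orders.o_custkey', 2), ('orders.o_date', 1)]
```

The point is just the shape of the pipeline: workload trace in, histogram plus cost model in the middle, a recommendation out.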
But it's up to a human to take that recommendation and actually apply it to the database system. And of course this is where there's a disconnect between what the tool says and what the human actually needs to know to deploy it. It's not enough to say: I looked at my query trace from the past and I think this index is going to help me. The human also needs to reason about the right time to deploy this action. When is the right time to build the index? I know my demand is low on Sunday morning at 3 a.m., so that's when I'll go build and re-optimize all my indexes. They also need to be aware of the workload trend, because maybe the workload trace I gave the tool was from some low point during the year, and the types of queries I'm executing at that time will look a lot different as I get closer to the holidays, when maybe I'm doing more updates and therefore want a different set of indexes. So even though the algorithm can figure out, based on the workload trace you gave it, what the best index was, there's a bunch of extra information the human has to reason about to decide whether the recommendation is correct. The human also needs to reason over time: has my workload changed enough that the recommendation from the past no longer makes sense, because the best index today might not be the best index tomorrow? These tools can't handle that. Just to emphasize that people have been thinking about this problem for a long, long time: one of the earliest papers on self-adaptive databases, on index selection for self-adaptive databases, was in SIGMOD in 1976, and it was actually written by my advisor's advisor, who's dead now. So again, people have been doing this for a long time. The author was Michael Hammer; his daughter, Jessica Hammer, is a professor over in HCI here, so small world, I suppose. All right. That was for index selection, but partitioning, sharding keys, data placement, at a high level they all work the same way: there's some internal cost model that you derive from the sample workload trace, and you use some kind of algorithm, some kind of search, to make a decision. Then we entered the era, in the 1990s, where instead of calling them self-adaptive databases, they were now called self-tuning databases, but at a high level they still work the same way. You prepare a workload trace, you feed it to an algorithm, the algorithm crunches on it and spits out a recommendation, and it's up to a human to decide whether that recommendation was a good idea or not. The one thing that was different, and that was a big breakthrough in this era, came out of the Microsoft AutoAdmin project. Microsoft in the early 2000s, for a 10- or 15-year period, was at the vanguard of automated database optimizations, and it all came out of this AutoAdmin project. They were doing index selection, sharding keys, materialized views, a whole bunch of other things; it's really fascinating.
One of the key contributions they came up with was that instead of having the tuning algorithm use its own internal cost model to figure out what a good choice of index was, they could leverage the cost model built into the database system's own optimizer, the thing we talked about in the last lecture, and use that to figure out which indexes are actually going to make sense. The reason they did this is that otherwise there would be a disconnect between what the tuning algorithm thought was a good index and what the database system thought was a good index. I may say, oh, I think this index on (a, b) is a really good idea, and recommend it, but then when you run the real queries the optimizer says: that's garbage, I don't want that at all. So by leveraging what's inside the optimizer, which is essentially doing the same thing, deciding which index to pick for my different queries, you avoid reinventing the wheel. There's a great paper from Surajit Chaudhuri from 2007 called "Self-Tuning Database Systems: A Decade of Progress." It's a retrospective of the work they did over a 10-year period in this space, and it covers the main problems of trying to do automated database tuning and database design. So again, in the self-tuning database systems they were doing the same things the self-adaptive systems were doing: index selection, partitioning keys, and data placement. But the other thing these guys also worried about was knob configuration. A knob is a configuration parameter you can set in the system to control its runtime behavior, like a cache size or a buffer replacement policy, things that you can tune as a human. Part of the reason this knob configuration tuning problem came to the forefront in the self-tuning database era is that the complexity of these systems increased quite a bit. This is a survey that my PhD student, Dana Van Aken, did a year or two ago, where she looked at a 15-year history of releases of Postgres and MySQL, and for every single release she counted the number of configuration knobs they had. At the very beginning, around 2001, Postgres had 53 and MySQL had 75. After 15 years, MySQL is up to 540 and Postgres is at 291. So Postgres had a 5x increase and MySQL had a 7x increase. Now, not all of these knobs have the same impact on the performance of the system, but at this scale it's well beyond what humans can reason about, especially because these knobs are not independent: if you set one, that can change the effect of another knob. So trying to figure out the right combination of these different knobs can be really hard, and you have to do it on a per-application basis. This is what I was saying: what often happens in really large fleets is that they come up with a best-case configuration, one setting for these knobs, and they use that for all their installations. Even though they could be tuning on a per-application basis and getting better hardware utilization and better performance, nobody does that because it's just way too complex for humans.
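Here's a rough sketch of that "what-if" idea. The interface below is completely hypothetical, my own stand-in for however a given system actually exposes its optimizer; the point is only that the tuner asks the database's own optimizer to cost the workload as if a candidate index already existed, instead of keeping a second cost model that can disagree with it.

```python
from typing import List, Protocol

class WhatIfOptimizer(Protocol):
    """Hypothetical hook into the DBMS's query optimizer."""
    def estimated_cost(self, query: str, hypothetical_indexes: List[str]) -> float: ...

def benefit(opt: WhatIfOptimizer, workload: List[str], candidate: str) -> float:
    """Benefit of a candidate index = how much the optimizer thinks the total
    workload cost would drop if that index existed."""
    baseline = sum(opt.estimated_cost(q, hypothetical_indexes=[]) for q in workload)
    with_index = sum(opt.estimated_cost(q, hypothetical_indexes=[candidate]) for q in workload)
    return baseline - with_index

def pick_index(opt: WhatIfOptimizer, workload: List[str], candidates: List[str]) -> str:
    # The tuning tool's "cost model" is now just the optimizer's own estimates.
    return max(candidates, key=lambda c: benefit(opt, workload, c))
```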
All right, so then we hit the era we're in today, the 2010s, with what we'll call cloud-managed databases. At this point, with the rare exception of Microsoft, who just announced, they have a paper out this year, that they can do automatic index tuning for databases on SQL Azure in the cloud, and Oracle's self-driving database, which we'll talk about in a second, none of the major database vendors in the cloud are doing any per-application or per-database-installation tuning. They're all doing automation at the service-provider level, like tenant placement and migration. They're trying to figure out, as a cloud provider, where to actually store or run your database installation across their giant fleet. It's essentially a bin-packing problem: I have a bunch of customers, I have a bunch of machines, and I want to figure out the minimum number of machines I need to service all my tenants. You as the customer don't see this, because you don't know what machine your database is running on anyway; it's all virtualized away from you. But they're doing this underneath the covers, using machine learning and other techniques, to automate it and make their lives easier. Again, other than the SQL Server example I mentioned, nobody is tuning each individual installation; they're using a best-case configuration. So, given that I just spent the last 15 minutes talking about automated database management over the last 45 years, why is this all insufficient? Why don't we have a fully autonomous database system today, and why is this something I'm spending my time worrying about? I would say there are three major problems with this previous work. The first is that, with the exception of the cloud stuff I talked about at the end, in all of these scenarios the tools were essentially providing recommendations, and a human still had to make a final value judgment about whether the recommendations the tool generated were a good idea. They also had to know when the right time was to deploy them, to build that index or move data around, and then observe the changes over time to determine whether the application or the data had changed enough that choices that were good in the past are now bad. So there's still a human in the loop who has to figure all these things out to make it work. The second is that they're all reactionary measures, meaning they're only solving problems that occurred in the past; they're not looking into the future and saying, here's what's coming down the road, let me go ahead and prepare myself accordingly. I showed you that in the simple index-selection example: I gave the tool a workload trace of queries I executed in the past. My queries were running slow in the past; that's the only thing it knows about. It can reason about that workload trace and try to pick indexes to help you, but it doesn't know that tomorrow my workload is going to change or the workload pattern is going to shift, and that therefore I need to build a different set of indexes or choose other options to optimize for that. It only looks at things that happened behind you. The way to think about this in the context of a self-driving car is that this car can only look behind it; it can't look ahead at the road.
So it can see all the children it ran over, but it can't see the ones in front of it. And again, humans handle this: for Singles' Day or Black Friday, they know it's coming because it's circled on the calendar, so they prepare the database system accordingly weeks ahead. These tools don't know anything about that. The last one is a bit more nuanced, but the way to think about transfer learning is that, with some rare exceptions in the newer stuff, in all the examples I showed here every single database installation was treated as an island unto itself. The tools only reason about how to optimize that single database instance, and nothing they learn from optimizing that instance gets applied to optimize other instances. So if I come along today with my one database instance and use these tools to optimize it, and then I come along tomorrow with a different database that looks slightly different and has a different set of queries, the tools can't apply anything they learned from the first job to the second one. They're starting from scratch each time. This is slightly changing; some of my own research tries to handle this. Things also get super hard when you start dealing with variations in hardware, and there are some algorithms we can talk about later, but the bottom line is that all of these tools assume they're running a single job, a single search, to optimize your one database instance, and then tomorrow they start all over again. So, all right, if automated database tools are insufficient and we want a system that's completely autonomous, how can we understand what it actually means to have an autonomous system? The way I like to think about this is the same way as for self-driving cars: there's now an industry standard that defines the different levels of what it means for a car to be self-driving, to be autonomous. There are six levels, and as you go up the levels, the human has to do less. At the lowest level there's no automation: the user has to drive everything, control everything. At the highest level there's no steering wheel; you just tell the car where you want to go and it takes you there. Then there are all these intermediate levels where you have to ask: does the human have to keep their hands on the wheel, do they need to pay attention to the road, can they watch movies in the back seat and not die? Part of the reason they're trying to define these levels is that guy with the Tesla down south somewhere: he wasn't paying attention to the road because the car was in autopilot mode, but it wasn't truly self-driving, and he got killed. So we need to understand this in the context of a database system: what level of autonomy is the system providing, so that humans can reason about how much they still need to be involved in managing it? At the lowest level, what's called level zero, the manual level, we have no autonomy, and it's essentially how we think of most databases today: I have to tell the database system exactly everything I want it to do. I have to tune the configuration knobs, I have to build any indexes, and if I want to do partitioning or scale up or scale out, all of that has to be done manually.
The database system is essentially a tool that does whatever I tell it to do. If you start adding more levels of automation, the next level is that there are tools to assist the human with some of these tasks. The index recommendation tools, or the tools for picking sharding keys and things like that, would fall into this category. I feed one some data, it crunches on it, it hands back recommendations, and then a human has to reason about whether those recommendations are a good idea, when to apply them, and whether they're helping over time. And I would say this exists today; Microsoft AutoAdmin would be an example of this. Then you start getting into more interesting things. The next level is what I'll call mixed human/machine management. The idea here is that the database system can manage some parts of itself automatically, and the human can still manage all the other parts just as before, and the two need to make sure they don't trip over each other and put the database in a weird state. In this case, the system always defers judgment calls or changes to the human: if the human says, I want to do something, and the database system's policy says, I want to do something different, it always defers to the human. This exists now; some systems have components that support this kind of operation. Oracle has what it calls self-managing memory: basically, I tell it to figure out how to allocate the total amount of memory available to the system across different parts of the database. How much do I want to use for the query cache versus the buffer pool versus, say, the Java pool? They basically have heuristics that look at how your application is accessing the database, and rules to assign different allocation slices of that memory. Then we get into what are called local optimizations. This gets a bit more advanced: now the subsystems in the database can adapt themselves without any human guidance. A good example of this is that SQL Azure feature where it will automatically pick indexes for you if you let it: you just say, go figure this out for me, and it does it for you. The difference, though, is that there's no high-level reasoning across components, no understanding of how all the different parts of the system are working together. So I can have something that automatically picks indexes and something that automatically manages memory, but they're not coordinated with each other to keep track of what decisions each of them is making; the human still has to coordinate across all of them. The second-to-last level is called directed optimizations, and this is where the system can almost manage itself, but it still relies on the human to provide high-level direction and hints about what to optimize. You have to be able to tell the system: hey, Singles' Day or Black Friday or Cyber Monday is coming up, go ahead and start thinking about how to pre-allocate resources or scale out or scale up.
But there may also be cases where the database system ends up in a weird state and says, I don't know how to optimize this, I can't make things better, and then it can call out to the human and ask for help. Again, not to push the self-driving car metaphor too much, but this is equivalent to the self-driving cars people have envisioned where you don't have to keep your hands on the steering wheel, but if it's about to crash or something bad is about to happen, it flashes an alarm to wake the human up and says, figure out what's going on. And of course that could be bad, because you might be in the back seat watching a movie and not have time to get to the front and help the car. So it's debatable whether that's a good idea for automobiles. For a database system, the question is how far down a bad path or a bad configuration you can get before you have to ask the human for help. You may screw the whole system up so badly that no human can recover it; you've put the car in a ditch. So figuring out at what point you need to ask the human for help is challenging. The last level is what I'm focused on, what we'll call a self-driving database system, and this is where there's no interaction with the human at all. The long-term vision of how I think this could work is that you basically give the database system your credit card and walk away, and it does everything for you. Now, you may say Amazon can do that now: you give Amazon your credit card and they'll have fun with it and do everything for you. But underneath the covers, in today's systems, there are still humans managing that stuff. So I'm defining a self-driving database system to be a database system that can deploy, configure, and tune itself automatically, without any human intervention. That means figuring out which actions to apply, where an action is something like build an index, drop an index, change this knob, scale up, scale out, add more machines, without a human telling it anything. And that includes both solving issues from the past and preparing for issues in the future. The way it basically works is that you give it some objective function or service-level requirement or agreement that says: I need all my transactions to complete in 50 milliseconds, or I need to be able to process a million transactions a second. You provide some high-level objective function, and then it applies these actions to try to achieve it. It will select actions automatically. It will know the right time to apply them, based on what the workload demand is, what the action is actually going to do, how long it will take to deploy, and how it will affect certain queries. And then, after it applies the action, it can observe the change it made, learn something from it, hopefully, feed that information back into its models, and refine its future decision-making. So the idea is that we're completing the loop that the earlier guys couldn't.
They could select the actions and tell you which ones to apply to improve your objective function, but they couldn't tell you when to apply them, and they couldn't learn anything from them once you applied them. That's the second half of the loop that's missing. So our goal in building a self-driving database system is to complete the whole thing. Again, I'm going to be very vague about the machine learning side of things, because this is the part we're still working on and don't have an answer for yet, but at a high level we think the roadmap of the system we're building looks like this. You have the database management system executing SQL queries for the application, because that's what it normally does, and it's observing what happens when it runs those queries: internal metrics like how much CPU did I spend in this operator, how much memory did I use to do my sorting. As it executes the queries, it feeds that into a modeling component, which builds internal models that try to forecast what the workload could look like in the future and predict things like how much memory my hash join is going to take for queries that look like this when my database looks like that. We then feed the models we generate into a search-and-planning component, which does some kind of search to figure out: what actions are available to me to improve my objective function? What will my workload look like in the future, based on my forecast? How will the system react to different changes in my configuration based on the actions I apply? It then chooses some action, or sequence of actions, that it thinks will provide the most benefit to the system's performance according to its objective function. It takes the action it selected, deploys it, observes the change while the action is being deployed and afterwards, and then repeats the whole process over again. So I apply my actions, I continue to execute queries because that's what the application is asking me to do, I observe whether it's helping me or not, and I feed that back into my modeling phase. Right? So, again, what is different from what was done before? We need to know where to actually deploy our actions: what parts of the system should we target, based on what our objective function looks like? If it looks like we're CPU-bound, then maybe we want to choose optimizations that are less expensive in CPU but use more memory. We also need to know how to deploy them. In all the previous examples, the recommendation tool would say, build this index, but it wouldn't tell you how many cores or threads to use to build it, how aggressive to be, or where to actually store it. All of that needs to happen here. And we already talked about when to deploy. The "why" is a bit more tricky and nuanced, and the way to understand it is through this idea of metacognition in machine learning: instead of having the tool, this search-and-planning phase, be hard-coded to say, I want to build an index because indexes are good, you want the models to reason about their own learning processes, so that they learn that indexes are good, or learn why indexes are good. Oh, I have these queries.
They're accessing these attributes, and indexes can help me, therefore I should build an index. Because if the models can reason about these high-level constructs, then you can do that transfer learning thing: on this other system, the queries looked completely different from the queries I have here, but I saw, or I learned, that building an index made things go better. So now, for my new database installation with completely different queries, I know about indexes and that they can help improve performance. I'm being very hand-wavy on this one, but this is eventually where we want to go; we're just not there yet. So again, at a high level, is this clear? What's going to be tricky about this, and we'll see some examples in a second, is that we want to do it while the system is running. We don't want our production server over here and then a completely separate installation over here that just runs a bunch of experiments all the time to train the models; we want everything to be done online, obviously without slowing down the regular database workload. But that comes later. All right, so the paper I had you read is a publication we've been working on here at CMU for about half a year or a year now. It's the design principles we came up with while building Peloton and while building the newer system, about how we want to organize the architecture of the system so that it's prepared to be controlled by some kind of self-driving brain or pilot component. The design principles are broken up into three categories. The first is environment observation: how the database system is going to collect either the workload trace from the application or the internal metrics of the system, to train models that let it reason about what the performance of the system will be as it applies actions. Then there's metadata we want to maintain and expose to whatever is controlling the system, in such a way that we reduce the number of stupid or bad configurations we have to consider in our models. And the last one is how we actually take the actions that the planning component selects, apply them in an efficient manner, observe their effects, and feed that back into our models. The key idea that permeates the entire paper, which hopefully was conveyed, is that we're not just doing this for the sake of doing it, so that the system can be controlled by some external system or planning component. The real goal is to design the system in such a way that we reduce the amount of training data we have to collect and cut down the size of the solution space we have to consider in our optimization algorithms. The idea is that for my tables, for my database, I could build an index on every possible combination of columns I have, but obviously not all of those index combinations are going to be useful to me. So if there's a way to reduce the number of potential solutions I have to consider, that will let my algorithms converge to an optimal configuration more quickly. That's the high-level idea behind all this engineering stuff. So we'll go through each of these one by one, and I'll show some examples for some of them.
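Before we do that, here's a very rough sketch of the loop I just described and the kind of solution-space pruning I keep mentioning. Every name in it is a placeholder I made up; the forecasting, benefit estimation, and search are exactly the pieces we're still working on, so treat this as the shape of the thing, not an implementation.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Action:
    """e.g. 'build index on orders(o_custkey)', 'grow buffer pool to 8 GB', 'add a replica'."""
    name: str
    deploy: Callable[[], None]

def candidate_indexes(columns: Sequence[str], max_width: int = 2) -> List[str]:
    """Naive candidate generation: all column combinations up to max_width.
    Even this small cap is an example of pruning the solution space; without
    it the number of candidates blows up combinatorially."""
    combos = []
    for width in range(1, max_width + 1):
        combos.extend(itertools.combinations(columns, width))
    return ["(" + ", ".join(c) + ")" for c in combos]

def self_driving_loop(objective_met: Callable[[], bool],
                      forecast_workload: Callable[[], object],
                      enumerate_actions: Callable[[object], List[Action]],
                      estimate_benefit: Callable[[Action, object], float],
                      observe: Callable[[Action], None]) -> None:
    """One pass of the loop: forecast -> search/plan -> deploy -> observe -> refine."""
    while not objective_met():                    # e.g. "99% of txns finish under 50 ms"
        future = forecast_workload()              # look ahead, not just behind
        actions = enumerate_actions(future)       # pre-computed, pruned candidate actions
        best = max(actions, key=lambda a: estimate_benefit(a, future))
        best.deploy()                             # ideally with no downtime
        observe(best)                             # feed the measured effect back into the models
```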
So again, the first one is environment observation. The idea here is simply how we're going to record what's going on inside the database system as we execute queries, in addition to which queries we're executing. Database systems, to no one's surprise, are very good at collecting data, because that's what they're designed for, so they're very good at collecting data about themselves. Every major database system has these things called metrics, sometimes called statistics, sometimes called events, but at a high level they're just recording what the system is actually doing on the inside. And they're designed to expose this information to humans, so that humans can reason about the behavior of the system. But a lot of it is unnecessary, a lot of it is redundant, and if we feed it into our algorithms or train models on top of it, not all of that data is actually useful. So that's the runtime metrics. For the workload history, this is also very common in existing systems: you can record workload traces of what the application submits to you. I think in MySQL it's called the general log, sometimes it's called the query log, and in SQL Server it's called the event store. Basically, every single SQL query that somebody sends, you record in a table or a log file. But it's more than just recording what the SQL query was; you also need to record the execution context in which that query was invoked. That means: what isolation level was the client executing the transaction under? What other configuration parameters did the client set, since you can have per-session variables? All of those things you need to record, in addition to the query, as well as the result of the query. Was it part of a transaction that got aborted? Why was it aborted? You need to record all of this because you're basically going to replay it later on. And the reason you need the workload history and not just the metrics is that the workload history is abstracted away from the physical configuration of the database system and the hardware. The runtime metrics change as the system changes: if I start adding indexes, the runtime metrics will change. Say I have a runtime metric that counts the number of tuples I access per query. If my table has a billion tuples and I don't have an index on it, then the runtime metrics will say that per query you access a billion tuples. Then I come along and add an index, and now I only need roughly log-n reads. All the runtime metrics I collected in the past could potentially be invalidated, and I'd have to throw away all my training data and start over if I'm basing my models just on those metrics. So you need a combination of the two, because the workload history is independent of how any action you apply changes the system. The last one is the database contents. For this one I don't have an answer for you, because we don't know how to do it yet, but basically I need to know what the database looked like at a particular snapshot in time in order to reason about why queries performed a certain way. Let's say in my workload history I see a query that is a sequential scan on the entire table, and I see a runtime metric that says this query touched a thousand tuples. Then I see that same query again, and now it touches a billion tuples.
Has the query changed, has my configuration changed, or is it because the database contents have changed? So you need a combination of all three of these things. For runtime metrics, a really important thing you have to worry about is making sure you expose all of the information about any sub-components in the system, that you expose all the right metrics you need to figure out what's actually going on. I want to give an example of what I mean by this using RocksDB, and this is an example of not doing it correctly. I've told the RocksDB guys about this; they said they're fixing it, but we haven't checked whether that actually happened. RocksDB has the ability to set knob configurations per column family in a table. Think of a column family as a partition: for each partition in the table, I can tune or configure it in different ways. Each table can have multiple column families, and each column family in that table can be configured differently. That's fine. And they actually record statistics for each individual column family: how many compactions it did, the size of the column family, and so forth. So here I'm defining a new column family called default, over here I'm setting the configuration for that column family, and in here I can see all the different metrics and their values. This is RocksDB running in MySQL, which is called MyRocks. The problem is that, while they do expose some metrics about the sub-components, the key thing we need is the number of reads and the number of writes, because I need to know how often I'm going to disk, reading or writing, for this particular column family, since I can tune policies that affect that disk usage. The only way to get that information is from the global stats, and there they've aggregated the bytes read and bytes written across all column families. So now I need to see a lot of sample data, a lot of training data, to tease out what each individual column family actually did. This is a bad example of what you don't want to do: you don't want to aggregate your metrics. It's okay to still have the aggregates, but for the sub-components you also want the low-level metrics, because that's what our algorithms need to figure out what's actually going on. Postgres does this pretty well, and SQL Server doesn't have this problem either; this is the most blatant example. The next thing is that we want to expose the right information about the actions we want to deploy. The basic idea is that we can pre-compute a bunch of actions ahead of time, like here are all the indexes I think I might want to build, and then prune them down to the ones that actually matter to us. And we want to make sure we expose the right information to the tuning algorithms so that they don't consider bad configurations, and so we reduce the number of choices they have to make. So I'm going to look at two examples: how to deal with configuration knobs, and then dependencies between actions.
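Just to hammer the RocksDB point home, the difference we want looks something like this; the stat names and numbers below are made up for illustration, not MyRocks' actual output:

```python
# What you effectively get today: reads/writes rolled up across the whole
# instance, even though other stats are broken out per column family.
global_stats = {
    "bytes.read":    981_000_000,   # all column families combined
    "bytes.written": 442_000_000,
}

# What a tuning algorithm needs: the same counters kept per subcomponent,
# so it can attribute I/O to the column family whose knobs it is tuning.
per_column_family_stats = {
    "default":    {"bytes.read": 900_000_000, "bytes.written":  12_000_000, "compactions": 14},
    "hot_writes": {"bytes.read":  81_000_000, "bytes.written": 430_000_000, "compactions": 92},
}

def read_write_ratio(cf: str) -> float:
    """Only computable if the per-column-family counters exist."""
    s = per_column_family_stats[cf]
    return s["bytes.read"] / max(1, s["bytes.written"])
```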
So the most important thing, the most obvious glaring issue we found when we were trying to do automated tuning with OtterTune, which is another system we built, is that, again, these systems are designed for humans to manage. They have a bunch of knobs and they let anybody tune them any way they want. But some of these knobs require a human to make a value judgment about whether tuning them a certain way is a good idea or not. Some things, like file paths and network addresses, obviously just have to be set correctly, or the system doesn't boot or I can't talk to it. Other things, though, like durability and isolation levels, are a bit more nuanced, and the problem is that if you let some kind of algorithm tune them, it may end up choosing something that's not the right thing for your company or organization. Say your objective function is throughput, you want your system to run as fast as possible, and you let the algorithm tune durability. Well, the algorithm is going to learn that turning fsync off makes things run faster, so it's always going to turn it off. But now you might end up losing data. And again, the algorithms don't know and don't care, because losing data is a higher-level thing, a human construct, that they can't necessarily reason about. So essentially what you need is a little flag that says: this is something a machine cannot tune, this is tuned only by humans, and that way we don't consider it in our models. The next thing is that, for the knobs we can tune, we need hints about how to tune them. The most obvious things are min/max ranges. If a knob is bound by a hardware resource, like the number of CPUs, you may want to limit it to that number rather than 2^32. You also want separate knobs to enable or disable features. A lot of times we see systems that use zero or negative one to mean turn a feature off. One example: if you want to limit how much data the system can write to disk, there's a knob you can set to control this, but if you set it to zero, that turns off all disk writes, which again is effectively turning off fsync, and then everything's in memory and everything runs really fast, which may not be what you want. So you want a separate flag to enable or disable the feature, have that flag be banned from being tuned by the algorithms so humans reason about it, and then we have a reduced space of things we can deal with here. Another issue related to reducing the search space is non-uniform deltas. And I would say this is what I think is the right idea; my student disagrees, and we don't know the answer yet. Say you have a configuration knob that sets the amount of memory to use for some kind of cache, and it's a 32-bit integer, so I can set it anywhere from zero to 2^32. But we know that not every possible choice along that number line is dramatically different from its neighbors. If I have one terabyte of memory, setting my cache size to 900 megabytes versus 901 megabytes isn't a big difference, right?
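One way to picture the knob metadata we're arguing the system should expose: a humans-only flag, ranges bounded by real hardware, a separate on/off switch instead of a magic zero, and (the part I'll flesh out in a second) non-uniform step sizes. This is just a sketch of the shape of it, not any system's actual catalog format:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

KB, MB, GB = 1024, 1024**2, 1024**3
TOTAL_RAM = 64 * GB   # hypothetical machine

@dataclass
class Knob:
    name: str
    tunable_by_machine: bool            # False for value judgments: durability, isolation, paths
    min_value: int = 0
    max_value: int = 0                  # bounded by a real resource, not 2**32
    enable_knob: Optional[str] = None   # separate on/off switch instead of "0 means disabled"
    # (threshold, step): at or above `threshold`, move in steps of `step`.
    delta_schedule: List[Tuple[int, int]] = field(default_factory=list)

knobs = [
    Knob("fsync", tunable_by_machine=False),          # humans only; the tuner never sees it
    Knob("block_cache_size", tunable_by_machine=True,
         min_value=1 * KB, max_value=TOTAL_RAM,       # capped by the RAM we actually have
         enable_knob="block_cache_enabled",
         delta_schedule=[(1 * KB, 10 * KB), (1 * GB, 10 * GB)]),
]

def step_up(knob: Knob, current: int) -> int:
    """Advance the knob using the largest schedule entry at or below the current value."""
    step = 1
    for threshold, s in knob.delta_schedule:
        if current >= threshold:
            step = s
    return min(current + step, knob.max_value)
```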
So what you really want is to have deltas that say how the knob can be incremented and decremented, where the size of that step can change for different values of the knob. When I'm between one kilobyte and one megabyte of memory, maybe I want to move in increments of 10 kilobytes, but if I'm between one gig and one terabyte, maybe I want to jump around by 10 gigs. What this does is reduce the search space for the algorithms and allow them to converge more quickly to an optimal solution when they're up in that range, rather than looking at one-byte increments each time. In my opinion this is the right way to do it; my student says you should always have the same increment size and just have a multiplier that you keep re-applying over and over. We don't know the answer yet. All right, the last category I want to talk about is action engineering, and the idea here is simply how we're actually going to deploy the actions. The most important thing, if you don't remember anything else from this lecture, hopefully not from the class, but from this lecture, is that nothing in the database system should require downtime in order for an action to take effect. That downtime could be having to restart, or blocking all transactions while you make the change. Now, in some cases, like adding indexes, even if you build them concurrently there will at some point be a small window where you may have to block things, but we're talking milliseconds. When I say blocking transactions so that the system is unavailable, I mean for seconds, minutes, or hours. As far as we can tell, I can't think of anything that fundamentally requires the system to go down in order to take effect. The problem is that most systems don't do this. The commercial systems are slightly better than the open-source guys, but a lot of systems will tell you, if you change some buffer pool size or some cache size, that it won't take effect until you restart the system. The reason this is bad is that now our algorithms have to consider the restart time in their cost functions. If I say, I'm going to make this change, and the change will make me run two times faster, but I'm going to be down for 20 minutes while I apply it, that requires a human to come in and say, yes, it's okay for you to be down for 20 minutes. And that's bad, because the human is likely going to say no; they want to wait until some window like Sunday morning at 3 a.m. to apply all these changes. So if there's a bunch of changes you can't apply because you can only do them during a downtime window, you can only do so many things per day, and you're not going to collect a lot of training data. What makes this even harder, beyond just asking the human for permission to go down, is that the time it takes to restart or to complete the action and come back online can vary based on the state of the database and what action you're applying. The example I always like to give is the configuration knob (in MySQL, for example) for the log file size.
If my original log file size is five gigabytes and I set it to 10 gigabytes, I have to restart for that to take effect, but when I restart it comes back up right away, because it says, my max log file size is 10 gigs, I'm currently at five gigs, we're good to go, and it comes back online. But if I'm at five gigs and I set it to one gig, then when I come back online it starts compacting the log, and depending on the speed of my disk that can take minutes or even hours. So now I've got to build models that can figure out: my state looks like this, I'm trying to apply this action, here's how long it's going to take me to restart. And that's hard; it's extra complication we don't need. The next one is useful both when you have an external controller managing the system and for an internal controller: you basically want a pub/sub mechanism inside the database system to keep track of when an action starts and when it completes. We need this so that, when we record all that training data about the runtime metrics and what the system is doing, we know that if we see a degradation in performance it's because we're deploying an action, like building an index, and not because we're doing something bad. What makes this hard is that some actions, although not many systems actually support this, but there's no reason you couldn't, can actually be used by queries to speed things up before they're fully completed. How you account for that in your models is a bit tricky. For example, if I say create an index, some queries could start using that index before it's completely built. You have to make sure you don't get any false positives or false negatives, but there's no reason a query couldn't check the index to see whether the key it needs is there: if it's there, you're done; if not, you fall back to the sequential scan on the table. But now I'm getting the benefit of an action, and seeing an improvement in my training data, before the action has actually completed, and that's an example of something we don't know how to deal with yet. All right, the last one, which I'm really excited about and which we're going to work on over the summer, is the ability to leverage the fact that we're running replicas in high-availability configurations to collect more training data. Any database system that people actually care about, where there's money on the line, doesn't run on a single database machine and just assume everything goes well. There are always replicas, there's always something you can fail over to in case one machine goes down. I'm sure there are people with money on the line who are deployed on a single database, but you shouldn't do that. The idea here is that we want to use those replicas, which we have anyway for high availability, to explore different configurations with our machine learning components: collect new training data that helps us find new optimizations and new configurations that we can then deploy on the production machine, the master. Of course, this is problematic because we don't want to slow down the replicas, since they're still there as backups, and there are some other issues with discrepancies between hardware.
So this is hard, and we don't know how to do it yet, but let me go through an example and you'll see a bunch of the different problems that come up. Really simple setup: a master with two replicas. And then we have some self-driving manager, I think we're calling it the pilot in our system, and this thing is trying out different actions on the different replicas, to see whether there's some index it should be building that would help things, which we'd then want to push to the master. If we find a better configuration, we can push it to the front end. The first problem is that we want to make sure that whatever we're trying out doesn't cause the replicas to fall too far behind, because if I try a configuration on a replica that makes it run five times slower and now it can't keep up with the master, and the master all of a sudden dies, I have to spend five hours replaying the log to get back to where I should be, and that defeats the purpose of having a replica as a standby. We also need to keep track of things like: if I apply a change, how long should I wait before applying another change? That's more on the learning side. But there's a whole bunch of decisions about how we apply these actions on the replica and learn from them that we don't know how to deal with yet. The next issue is how we deal with getting changes from the master to the replicas. The application server sends all the reads and writes to the master, since this is a single-master setup, and then traditionally the master just sends the writes to the replicas. That can either be the SQL statements or the actual physical log, the write-ahead log, and the replicas are essentially always in replay or recovery mode: they're replaying whatever comes over the wire, and that's how they trail along behind the master. So now the first question is: what should that log be? Should it be SQL statements or the physical log? If it's SQL statements, then each one actually runs through the whole query execution pipeline, so as I'm collecting the runtime metrics for the different parts of the system, all those parts get tickled, get exercised, and we'll have training data for them. But if I'm just replaying the physical log, then I'm bypassing the execution engine, the query optimizer, the parser, all those high-level parts, and I'm applying the changes directly to the data tables. So how am I going to collect any training data for those parts of the system? The other issue is that I'm also only sending over the writes; all the reads go to the master, so the replicas only see writes. So now the problem is that if the replicas only see writes, they'll start making decisions about how to improve the write workload, because that's the only thing they're seeing. There may be some important read queries running on the master that need certain indexes the write queries never touch. The replicas will say, well, you don't need those read indexes, we're not seeing any read queries, go ahead and drop them. You could try to fix this by sending some reads over from the master to the replica; that is a common setup, people do this.
But now the problem is, again, if I'm trying out different configurations: before, the replica was only replaying writes, so maybe I could try different configurations and not fall too far behind. But now I'm also doing reads, right? And if I have a bad setup, a bad configuration, I may get too far behind and have to crash this thing and restart, and then if that other machine dies, I'm screwed. All right, so the last problem we have is, you know, we're gonna collect all this training data on these replicas. We're gonna train these component models that say how the system's gonna behave according to the actions we apply. And then I want to select actions based on what I learned over here on the replicas and apply them over here on the master. But the problem is that the hardware configuration could be different between the master and the replicas, right? If I'm running on premise, this one could be running on machines I procured this year, but these are running on slightly older machines. So maybe the disk speed or the amount of memory I have available is different. But even if I'm running in the cloud and I have the same instance type on Amazon EC2, we've seen in our own studies that the same instance type can have quite a bit of variation in the performance you get, because it depends on who else is running on the same machine as you, right? So on this machine here, this is the only tenant that's actually doing work, so it runs pretty well. But this guy here is running on a machine where somebody else is doing Bitcoin mining, and that's chewing up the CPU and you're contending for resources, so this one ends up performing slower. So I may end up making decisions based on the slow machine, decisions that help on slower instances but may actually be a bad choice over here on the faster one. We don't know how to deal with that yet either. So to put this all in perspective, one of the things you guys are working on in the class is the team doing logging and recovery and checkpoints. That's this piece here: we can feed it to the replicas to replay and instantiate the database changes over here, while we're also doing exploration with our models to collect training data. Okay, so hopefully I've conveyed that for a self-driving database, you can't just take Postgres or MySQL or pick your favorite database system and say, all right, we're gonna put some machine learning crap inside of it and it's gonna work. It's not, we tried it. There's a bunch of stuff you have to do that's different from how people normally build these systems, in order to expose the right information and the right controls to some kind of machine learning components that can figure these things out.
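One small piece of that hardware-variance problem can at least be sketched: before you train on metrics collected from a replica, normalize them against some calibration measurement taken on each machine, so the models don't bake in the fact that one box happens to be slow. This is a toy illustration of the idea; the calibration micro-benchmark and the linear scaling assumption are mine, not from any real system:

```python
import time

# Hypothetical sketch: normalize runtime metrics from different machines by a
# per-machine calibration score, so training data from a slow replica doesn't
# get misread as "this configuration is bad."

def calibration_score(iterations: int = 2_000_000) -> float:
    """Crude CPU micro-benchmark; higher score means a faster machine."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    elapsed = time.perf_counter() - start
    return iterations / elapsed

def normalize_latency(raw_latency_ms: float, machine_score: float,
                      reference_score: float) -> float:
    """Scale an observed latency to what we'd expect on the reference machine.
    Assumes performance scales linearly with the calibration score, which is
    a simplification made purely for illustration."""
    return raw_latency_ms * (machine_score / reference_score)

if __name__ == "__main__":
    master_score = calibration_score()     # measured once on the master
    replica_score = calibration_score()    # would be measured on each replica
    observed_on_replica_ms = 12.5          # latency of some exploratory query
    print(normalize_latency(observed_on_replica_ms, replica_score, master_score))
```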
So you may have also seen that Oracle announced that they have a self-driving database management system. This came out in September 2017, where they said they had the world's first self-driving database, completely autonomous, right? No human labor, half the cost, blah, blah, blah. I was slightly miffed about this, because their announcement came out in September 2017, but our paper on self-driving databases came out earlier in 2017, even though they're claiming they're the first. Granted, our system died, it's dead, it's not around anymore, but they didn't cite us, and Larry Ellison got on stage and said the Oracle self-driving database is the most important thing that Oracle Corporation has worked on in the last 20 years. A little shout-out would have been nice, but he didn't do that, and that's okay. So let's actually look at what they're doing and see whether it's truly self-driving. For their self-driving database, at least in its current incarnation, and it might have changed as of late 2018 but I don't think it has, they're claiming it does the following five things: automatic patching, indexing, recovery, scaling, and query tuning. Automatic patching is actually a big deal. The idea is that you can apply security updates to the binary without having to restart the entire database system. That's a good one, you definitely wanna do this, but I wouldn't say it's autonomous, right? It's just something you want to do. It's not easy, there's a lot of engineering work they had to do to make it work, but I wouldn't call that being autonomous. The ones that look most relevant to us are these three here: indexing, recovery, and scaling. Well, these are all the same things I talked about at the very beginning from the self-tuning world. And my understanding, from looking at the marketing literature and talking to some people who work at Oracle, is that these are just running Oracle's versions of the tools they developed 10, 15 years ago, the ones you run on premise as recommendation tools for human DBAs. They're just now running them for you automatically in a managed environment. It's not quite this simple, but it's basically as if the recommendation tool had a GUI that says, hey, do you wanna build this index, and instead of a human clicking yes with a mouse, they're clicking yes for you. So the problems I talked about before still all apply: these are only reactionary measures, they're only solving problems from the past because they're only looking at workload traces, and they can't transfer anything they learn about one database to another database. And so I don't think this is actually truly autonomous or self-driving. It's a solid piece of engineering that solves a real problem. But by how I'm defining self-driving, and whether or not I'm allowed to be the one to define that, who cares, right, I would say this is not self-driving. All right, the last one is that they're doing automatic query tuning as well. This is another important aspect of database management that humans spend a lot of time on, which we really didn't discuss. But it's what we talked about before: the query shows up, I run it through my optimizer, the optimizer says here's the best plan I can find, and I run it. Then, if the estimates that my cost model made about what the data looks like start to diverge from what I'm seeing in the real data, I can go back and ask for a new query plan. So as I said, this is not new and it's not unique to Oracle. SQL Server 2017 added this as its adaptive query optimization feature, and we've already talked about how IBM had something like this in the early 2000s as well.
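As a rough illustration of that feedback loop, and this is a sketch of the general idea rather than Oracle's or SQL Server's actual mechanism, you can imagine each operator in a running plan comparing its cost-model estimate against the rows it has actually produced, and kicking the query back to the optimizer once the two diverge by more than some factor:

```python
# Sketch of adaptive re-optimization: if an operator's actual row count drifts
# too far from the optimizer's estimate, ask for a new plan. Illustrative only;
# real systems suspend and re-plan mid-query rather than re-running from scratch.

REOPTIMIZE_THRESHOLD = 10.0   # re-plan if an estimate is off by more than 10x

def estimate_is_off(estimated_rows: float, actual_rows: float,
                    threshold: float = REOPTIMIZE_THRESHOLD) -> bool:
    """True when the actual cardinality diverges from the estimate by > threshold."""
    estimated = max(estimated_rows, 1.0)
    actual = max(actual_rows, 1.0)
    ratio = max(estimated / actual, actual / estimated)
    return ratio > threshold

def execute_with_reoptimization(query: str, optimizer, executor, max_replans: int = 2):
    """optimizer(query, feedback) -> plan; executor(plan) -> (result, stats),
    where stats maps operator -> (estimated_rows, actual_rows)."""
    feedback = {}                                   # observed cardinalities so far
    for _ in range(max_replans + 1):
        plan = optimizer(query, feedback)
        result, stats = executor(plan)
        mistakes = {op: obs for op, obs in stats.items() if estimate_is_off(*obs)}
        if not mistakes:
            return result
        feedback.update({op: actual for op, (est, actual) in mistakes.items()})
    return result                                   # give up after a few re-plans

if __name__ == "__main__":
    # Toy demo with fake callables: the second plan "fixes" the bad estimate.
    plans = iter(["plan_v1", "plan_v2"])
    fake_optimizer = lambda q, fb: next(plans)
    fake_stats = {"plan_v1": {"scan(orders)": (100.0, 50_000.0)},
                  "plan_v2": {"scan(orders)": (50_000.0, 50_000.0)}}
    fake_executor = lambda plan: ("rows...", fake_stats[plan])
    print(execute_with_reoptimization("SELECT ...", fake_optimizer, fake_executor))
```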
Can anybody think of another system that does something very similar to this, one we talked about the first time we discussed query optimizers? What was that early database system I mentioned that was doing query optimization on a per-query or per-tuple basis? It's Ingres. Ingres, at a high level, is essentially doing the same thing. It takes your giant query, decomposes it into single-tuple or single-table queries, and then runs the optimizer on those. So it's essentially doing the same thing, adapting the query plan on the fly as it goes, rather than generating it all at once and then seeing whether that matches up. So at a high level, this looks very similar. Again, the main takeaway is that automatic query tuning is not unique to Oracle; a bunch of other systems have tried this in the past. But the main thing is these ones here, the middle ones: because of the deficiencies in those approaches, I'd say it's not self-driving. Okay? All right, well, the next slide is missing something, but whatever, I'll talk through it. So my research has been about the self-driving database system that takes a holistic view of how to manage the entire system. The high-level goal is to remove the need for humans to babysit and maintain the software, and obviously machine learning is how we're going to make this work. The trend in research right now that's more common or prevalent is to replace existing components of systems with what I call learned components: machine learning models that provide the kind of functionality that human-engineered data structures or algorithms provide today. The most obvious one would be the optimizer cost model. I talked all last class about how error-prone it is, all these assumptions we make in the algorithms to estimate the cardinality and selectivity of predicates. We make them because they simplify the problem for us. So instead of having histograms or sketches to approximate the cardinality of a scan, what if we build a deep net that could figure this out? You also see this in compression algorithms and data structures; we have a paper in a workshop this year doing scheduling policies for transactions. And the idea is that rather than having a human design by hand the algorithm the database system uses for making decisions in its runtime components, I can train a model that makes better decisions on a per-application basis. So again, the term people use for these is learned components. I'd show you an example of one, but the slide's missing, that's fine; the histogram is the easiest one to understand. A histogram is basically an approximation function that says, for a given predicate, here's the number of tuples that'll match. So rather than having some formula that derives that information from the histogram, I can train a deep net that maps some predicate to some cardinality value. In my opinion, this is very interesting. The early work essentially shows that you can do slightly better than the human-engineered data structures or implementations. Where I think it's really gonna go, and what could be really exciting, is the ability to find weird and obscure correlations and dependencies between different input features that humans just haven't thought about. You can throw so much training data at these things that they start spitting out answers no human would ever have come up with, and that lets you do way better than any of these human-engineered data structures. That's where I think things are gonna go in the future.
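To give you a flavor of what a learned component like that might look like, here's a toy sketch, not from any real system, that trains a tiny regression model to map a range predicate on one column to an estimated row count, which is the job a histogram does today:

```python
# Toy sketch of a "learned histogram": a model that maps a range predicate
# (lo, hi) on one column to an estimated cardinality. Purely illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
column = rng.normal(loc=100.0, scale=25.0, size=100_000)    # fake column data

def true_cardinality(lo: float, hi: float) -> int:
    return int(np.count_nonzero((column >= lo) & (column <= hi)))

# Training data: random predicates labeled with their true cardinality.
predicates = rng.uniform(0.0, 200.0, size=(5_000, 2))
predicates.sort(axis=1)                                      # ensure lo <= hi
labels = np.array([true_cardinality(lo, hi) for lo, hi in predicates])

# Learn log(cardinality) so the error is relative rather than absolute.
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
model.fit(predicates, np.log1p(labels))

def estimate_cardinality(lo: float, hi: float) -> int:
    return int(np.expm1(model.predict([[lo, hi]])[0]))

print("estimate:", estimate_cardinality(90.0, 110.0))
print("actual:  ", true_cardinality(90.0, 110.0))
```

The selling point isn't this toy, of course; it's that a model like this can pick up correlations across columns and predicates that a per-column histogram structurally cannot see.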
The tricky thing is, I don't foresee these learned components being added to any major database system in any significant way anytime soon, just because the explainability is not there yet. You don't have an easy way to understand why these things are making certain decisions, and that matters a lot to humans. And I don't think people have really reasoned about how to handle the situation where things go bad: how bad can it actually get, and what do you do? If you train some model on garbage data and it starts giving you garbage results that feed back into your own garbage model, garbage in, garbage out, that's gonna be a big problem as well. So I think what'll happen, and I don't want to estimate how long it's gonna take, is that we'll still have the human-engineered components, but systems will bring in the learned components piece by piece, and they'll work in tandem with each other. And eventually, once we get to the point where people are confident that the learned components can supplant the engineered ones, they'll overtake them. But I don't see that happening this year. Papers are just now coming out in SIGMOD and VLDB and other conferences about how to build machine learning models for these different parts, one by one, right? For these different things here. But no one's really figured out the long-term maintenance and sustainability problem. Okay, so to finish up: I'm excited about this. This is what I'm staking my career on. This is why we ended up having to build a database system from scratch, besides it being fun; trying to automate as much as possible is a daunting challenge and we'll see whether we can get there. I remember one of the reviews I got for a grant proposal to the National Science Foundation, this was two or three years ago. The grant got rejected; it got accepted later, but that's fine. One of the reviewers said you can't build a database system in academia and you can't automate everything that we want to automate. And the fact that they told us we couldn't do it makes me really want to do both of those things, and that's why we're doing it. So I think in the next 10 years we'll have what I'd call a level-five self-driving database management system. I think the field is progressing fast enough that you'll be able to just give the system your credit card and it'll go off and do its own thing. The main takeaway I want you guys to get from today's lecture and the paper, whether you go off and build databases in the future or some other kind of system, is that any time you're adding a feature, don't just think about how it's gonna be controlled or managed by a human. Really think about how it's gonna be managed by a machine. Configuration knobs right now are basically a stop-gap solution for engineers: they say, well, I don't really know what this thing should be, so I'll just make it a knob, choose a decent default, and someone who knows what they're doing will come along later and set it for me. You can't assume that's gonna happen, and you wanna make the job easier for whatever algorithm is gonna tune it for you automatically. And that means doing a bunch of the things we talked about today: exposing the right controls, and exposing the metadata and low-level metrics in a meaningful way, okay?
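Just to make that knob point a little more concrete, here's a rough sketch, with names invented for illustration, of what it looks like to expose a knob with the metadata an automated tuner would need, such as its legal range, its unit, and whether changing it requires a restart, instead of burying that in a config file comment:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: registering a knob with machine-readable metadata so a
# tuning agent knows the legal range, the unit, and the cost of changing it.

@dataclass(frozen=True)
class Knob:
    name: str
    default: int
    min_value: int
    max_value: int
    unit: str                 # e.g. "bytes", "milliseconds", "threads"
    requires_restart: bool    # restart cost has to be modeled by the tuner
    description: str = ""

KNOB_REGISTRY = {
    "log_file_size": Knob(
        name="log_file_size",
        default=1 << 30,                 # 1 GiB
        min_value=64 << 20,              # 64 MiB
        max_value=64 << 30,              # 64 GiB
        unit="bytes",
        requires_restart=True,
        description="Max size of the write-ahead log before compaction.",
    ),
    "worker_threads": Knob(
        name="worker_threads",
        default=8, min_value=1, max_value=256,
        unit="threads",
        requires_restart=False,
        description="Number of query execution worker threads.",
    ),
}

def validate_setting(name: str, value: int) -> Optional[str]:
    """Return an error message if a proposed setting is illegal, else None."""
    knob = KNOB_REGISTRY[name]
    if not (knob.min_value <= value <= knob.max_value):
        return f"{name}={value} outside [{knob.min_value}, {knob.max_value}]"
    return None

# A tuner can now reason about actions: cheap online changes versus restarts.
print(validate_setting("worker_threads", 32))          # None -> legal, no restart
print(KNOB_REGISTRY["log_file_size"].requires_restart)
```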
All right, so that's it for the semester. Again, I will post the final exam, I'll send the PDF out on Piazza tonight, and please do come on Wednesday for the guest lecture from the SAP HANA developer, or engineer. I'll also send out a notice on Piazza that if you wanna meet with him one-on-one about internships or jobs on Tuesday morning, or sorry, Thursday morning, we'll have a sign-up sheet for that as well, okay? All right, guys, good luck with your finals and all your other classes, and I'll see you Wednesday.