All right, guys. So today we're going to pick up with scan sharing. This is another technique we can apply to try to speed up our queries after switching from a disk-oriented to an in-memory system. Last class, we talked about query compilation, and that was about making one single query go faster. Now we're talking about scan sharing to make multiple queries running at the same time go faster. Before we jump into all of this, I want to do some quick course announcements. First, someone posted a message to me on Piazza that you did not get a mid-semester grade. I apologize for that. We were going to submit grades, but I missed the deadline, and apparently you cannot submit grades after the deadline at CMU; for whatever reason, they wouldn't let me do that. So what we'll do instead is send an email out at the end of this week with your assessment of project one, project two, and the reading reviews that you've all done. The second announcement is that the MemSQL project machines are installed in the data center, and we should be able to provide everyone with information on how to log in, reserve them, and use them to scale up and play around with for your projects. And the last one: for the extra credit, the website should hopefully be online later this week, and we'll provide you at the very least with more information and the taxonomy you'll need to write the articles for each of your database systems. I was actually very surprised to see that everyone chose to do the extra credit assignment, so that was kind of cool. But I've noticed that everyone picked the easy ones: you picked the MySQLs and the Oracles and the Mongos. I think that's fine. This is the first time we're doing this, so we're going to be pretty lenient in how everything works out. Another metadata announcement is that we have the CEO of Splice Machine coming to give a tech talk here at CMU on Friday, April 15th, on the sixth floor of Gates. The CEO is Monte Zweben, and he's actually a CMU alum. He's listed on Forbes as a former billionaire, with a B, although he no longer is because of the dot-com crash. He's a really awesome guy and has this new database system startup called Splice Machine, and they've been recruiting heavily here at CMU. They're an HTAP system, the same thing we've been talking about this entire semester, but they're doing it on top of the Hadoop infrastructure: you run all your transactions on HBase, you run all your analytics on Spark, and they'll show how they handle that. I'll send a reminder out when it gets closer, and I encourage everyone to check this talk out. Then for project three, here's what I've decided. I know some of you have been meeting with Joy and with the other students to get things going. So instead of having class next Wednesday, everyone should sign up to come meet and talk with me about the status of your project, if you have any questions or want clarification or additional guidance. Joy will be there with me as well, so we can talk about the low-level implementation details. And I'm pushing back the in-class project presentations, the updates that you're all going to give each other.
They used to be on April 11th, which is a Monday, and now we're going to push them to April 13th, which is a Wednesday. So the idea is that you'll come meet with me in private on April 6th. If you have any questions or things you're unsure about that you don't want to discuss in front of the group, we can discuss them during this period, and then you should have a solid explanation of the status of your project the following week, seven days later. I'll post a link on Piazza; it's that same Google Docs spreadsheet you've been using to sign up for everything else. There's now a new sheet in there with all the time slots when you can come meet with me. It'll start at 11 a.m. and go until the end of my office hours at 2:30. So basically, instead of having class, we'll all meet during that time in my office. Any questions about this? Any questions about the extra credit? Any questions about Splice Machine? Everyone's doing okay on project three? No major disasters, no one has sent me any nasty emails, no one's cried, so... all right. From what I've heard from Joy and from discussing with him, it sounds like everyone's making good progress, so that's pretty exciting. Okay, so for today, like I said before, we're going to talk about scan sharing. The idea is basically that we want multiple queries running at the same time to share the work, the data, the operations, and everything else they're doing while they run. So instead of running every single query in isolation, if we can notice that there are parts that overlap, that two queries are trying to do the exact same thing, then rather than running them individually, maybe one after another, we can combine them in a smart way and cut down the total amount of work we have to do. We'll talk a little bit about the background of why this matters, then we'll talk about how to do scan sharing in an in-memory system, looking specifically at the IBM Blink project. Then we'll talk about some alternative methods that I think are kind of cool but are not as widely deployed, and that still follow the same flavor of shared scanning: continuous scanning, and then work or operator sharing. All right, so for the scan sharing stuff, we're obviously going to be focusing on OLAP queries, because in the OLTP world you're just doing point queries, traversing an index, and doing a small set of operations. So we're really talking about long-running OLAP queries that want to touch a lot of data and compute some kind of complex join or aggregate. If we have a lot of queries running at the same time and we're not careful with how we schedule them, we can easily have them saturate whatever our I/O bandwidth is, whether we're on a disk-based system or a main-memory system. If they all try to scan the same table at the exact same time, but possibly different parts of the table, then we'll be thrashing, and everybody's going to end up running slower.
So with scan sharing, again, the basic idea is that we want to be able to reuse data that we've either retrieved from an access method, whether from the table in storage or through an index, or possibly intermediate results that we've computed, like for a join or an aggregation, and share them across multiple queries that are running at the same time. The way to think about this is that queries come in and attach themselves to a single cursor that's scanning over a table or an index; the cursor publishes updates saying, here's the next tuple that I see, and then you can have multiple worker threads consuming those tuples and using them for their own query calculations. And we'll see that the queries don't have to be exactly the same; this gives us a lot more flexibility in what kind of scheduling we can do. We don't have to have the exact same queries doing the exact same thing. We can recognize that, oh, they're both scanning the same table, and maybe they do something different later on up in the query plan, but the bottom part is the same, and therefore we can combine them. This idea first originated in disk-oriented systems, and I think it's worth talking about what they did back then, or what they're doing now, before we see how to transition this to an in-memory system. In a disk-based system, it's pretty obvious why you want to do this: disk I/O is the most expensive thing, and if you have queries doing complete scans across all the blocks or pages of a table, then having every query do that separately would be really bad and really slow. The way we're able to do this is that we keep track of where each cursor is as it scans the table, so a query can jump on an existing cursor; then, when that first cursor finishes with everything the first query needs, if the second query still needs more data, we keep track of how to come back around to the beginning and read the rest of the things that it missed. I'll show an example of this in a second. And because relational databases are based on a bag algebra, meaning we don't care about the order in which we read tuples, we're able to take advantage of this without having to do an extra sort. We just care that we get the correct answer once we've seen all the tuples we're supposed to see. Scan sharing, I'm not actually sure when this first came about; it's probably within the last 15 or 20 years or so. As far as I know, none of the open-source database systems support this; it's only in the commercial guys. DB2 probably has the most advanced implementation, because they can support shared scans on indexes, shared scans on disjoint subsets of the same table, or scans of the entire table. SQL Server, I think, can only do shared scans across full-table scans. Oracle supports a limited version of what these other guys support, called cursor sharing: if you execute the exact same query, the exact same predicates, the exact same values, the exact same everything at the exact same time, then you can just piggyback off of it and reuse the data.
But if there's one minor variation in the query, then you have to do two completely separate scans. So let's look at an example of how we do disk-based shared scanning. Our first query is running when we start; it's doing a complete table scan on A, where all these pages on disk correspond to table A, and it's going to compute a simple aggregate, the sum of some value. When it starts executing, our cursor starts pointing at the first page; we read that page and add it to our buffer pool, and then we scan down one by one, fetching the pages we need and copying them in. Now, at the point when we get to page three, we don't have any more space in our buffer pool, so we have to pick a victim, in this case page zero, which gets swapped out so that page three can be added. Now query number two comes along, and it also wants to do a full table scan on A, but it wants to compute an average aggregate instead of a sum, so it can't use exactly the same result the first query is going to generate. Without shared scans, if we just let this query start running, it would say, all right, I need to start at the beginning of the table and scan to the bottom. It would start at page zero, recognize that it's not in the buffer pool, and swap it in. But query one just swapped that thing out. So now we're basically thrashing: we just had the page, we threw it out, and now we're going to try to put it back in. This would be bad, you'd have a lot of contention, and if this was a spinning hard drive, the arm would be going back and forth as you do all these seeks. With shared scanning, instead, we piggyback query two on query one's cursor, and they keep going down together, getting the rest of the pages. Then query one recognizes, well, I've seen everything I want to see, I can go away and do whatever it is I want to do up above in the query plan. But query two says, well, I know I missed the first three pages, so I'll restart my cursor at the top and just scan those in. So the database system ends up doing less work, because it fetched three fewer pages than it would have if the queries were executed separately, and we don't have the contention of everybody trying to do this at exactly the same time while the disk tries to keep up. The one thing to point out is that this scan sharing technique is actually easy to do in a disk-based system (ignoring transactions and concurrency for now), because the database system has complete control over almost all aspects of how it reads data and copies things into a cache. First, our buffer pool is completely managed by the database system; we're not using mmap or letting the OS control things, so we know what's going in and what's coming out. And second, we have a dedicated thread to do I/O: when we ask for a page, there's another thread going off and getting it for us, which makes it easy to share pages across multiple threads using something like a pub/sub interface. Whereas in the morsel-driven scheduling from HyPer, we saw that each thread was responsible for grabbing the data that was directly in memory, processing it, and then going on to the next piece.
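To make that wrap-around bookkeeping concrete, here's a minimal sketch in Python, assuming a simplified page-at-a-time scan loop. All the names here (SharedScan, attach, fetch_page, consume) are my own for illustration; this is not how DB2 or SQL Server actually structure it:

```python
# Minimal sketch of a shared scan cursor with wrap-around.
# One cursor per table; queries piggyback on it mid-flight.
class SharedScan:
    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.remaining = {}      # query_id -> pages this query still needs

    def attach(self, query_id):
        # A query can attach at any point; it needs every page, and it
        # will pick up the ones it missed after the cursor wraps around.
        self.remaining[query_id] = self.num_pages

    def run(self, fetch_page, consume):
        page = 0
        while self.remaining:                    # someone still needs data
            data = fetch_page(page)              # ONE disk read, shared by all
            for qid in list(self.remaining):
                consume(qid, page, data)         # hand the page to each query
                self.remaining[qid] -= 1
                if self.remaining[qid] == 0:     # seen every page exactly once
                    del self.remaining[qid]
            page = (page + 1) % self.num_pages   # wrap around for latecomers
```

If query two attaches while the cursor is at page three, it rides along for pages three through the end with query one, then gets pages zero through two by itself after the wrap, which matches the example above: three fewer fetches than two independent scans.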
So the problem we're going to have when we try to do scan sharing in an in-memory database management system is that we don't have full control over the CPU caches the way the disk-based guys control their buffer pool, because the hardware manages the caches for us. The second problem is that our caches are going to be significantly smaller than the amount of data we end up processing. In a disk-based system, if you have a lot of DRAM, then maybe 25% of your database could fit in DRAM and the rest stays on disk, but our CPU caches hold a much smaller percentage than that. Another observation I'd like to point out is that this scan sharing stuff is different from query result caching. With query result caching, if you execute a query and then a minute later come back and execute the exact same query a second time, you can reuse the cached result. But in scan sharing, we're talking about queries that are trying to execute at exactly the same time, so result caching won't help us, because we don't have the result for the first query yet. And the second query may not be exactly the same, so we may not be able to use all of what the first query was doing anyway. Another thing to point out: query result caching is sort of like materialized views, so if you have results cached, the database system has to track what the underlying tables are for those results, so that if they change you can invalidate the cache. In scan sharing, you don't do that; you're only generating things on the fly and sharing them among queries running at the same time. All right, so now we can talk about how to do this in an in-memory system. In a disk-based system, obviously, the goal was to reduce the amount of disk I/O. In an in-memory database system, when we do scan sharing, we want to reduce the number of cache misses, because the low-level L1, L2, L3 caches are way faster than having to go out to DRAM. But we have an additional problem when we schedule these guys: now we have to worry about the intermediate result sizes of all the queries running at the same time. In a disk-based system we can sort of ignore that, because we assume we have enough DRAM to store the intermediate results. But because our cache sizes are so much more limited, we have to worry about the size of these intermediate data structures. So we'll see that in Blink, they actually need to worry about how many queries are running at the same time, and they estimate the intermediate sizes of their results to make decisions about how many queries can run concurrently. Before we talk about how to do explicit scan sharing in an in-memory database system, though, there's something called the convoy phenomenon that can occur. This also occurs in a disk-based system, but it's probably more pronounced in an in-memory system: without doing anything explicit for scan sharing, we can actually get some benefit from reusing cached data. For this, assume that we have some table out in memory, and then these are our shared caches, like L1, L2, L3.
And so we'll have two cores running at the same time. Query one starts, and it wants to retrieve all the tuples from this table and bring them into the shared cache. So the cursor starts here and goes through, getting all these tuples one by one. Now query two starts, and it's doing the same scan over the same table, but it can rely on the fact that all these tuples have already been added to the CPU caches. So it doesn't have to stall to go fetch what it needs from the memory controller; it just uses the data that's right there. This is the convoy phenomenon: as they run at the same time, query two will eventually catch up to query one, because query one is the one paying the penalty, stalling while it waits on memory. It's called the convoy phenomenon because you can think of trucks driving on the highway. If you have a long convoy of trucks together, the lead truck has to spend more fuel or energy going against the wind, but everyone else can stay behind it, take advantage of its draft, and run more efficiently. That's the same effect we can see in an in-memory system, because the second query relies on things already being in the cache. Now, this requires that query two start within a reasonable time after query one does, because we would have that same thrashing problem if, by the time query one is down here, query two starts and we begin throwing away things that query two needs; then it has to pay for memory stalls. But if our queries are close enough to each other, we can take advantage of this. The convoy phenomenon is not new; it was discovered in operating systems in the late 70s. All right, so now we're going to talk about the Blink project, the paper you guys were assigned to read. The first thing to understand about Blink is that I don't think it actually exists anymore, but it was a research project out of IBM Research in Almaden, the big database group there, which eventually made it into the real DB2 product. It was an in-memory query accelerator for doing OLAP queries in DB2. You can think of this as the same thing as the Oracle in-memory column store or the fractured mirrors stuff we talked about before: you still have all the data in a row store, and then you make additional copies of it in an in-memory columnar data structure. Blink uses dictionary compression. I don't think it uses SIMD, because this work was done before all that came into vogue. The reason I say it doesn't exist anymore is that it's now been transitioned into a new project called DB2 BLU. The BLU doesn't stand for anything, apparently; they just thought they were being clever. This new version is akin to the Hekaton and Apollo stuff we talked about in SQL Server, where now the primary storage location of the database can be an in-memory column store, and it works alongside the regular row store in DB2. This is essentially what they're pushing now. Part of the reason they scuttled the query accelerator thing, they told me, is that IBM had already bought Netezza. Netezza was sort of the query accelerator you were supposed to use with DB2, and these were competing products within IBM. So again, this is just for your own edification; for the scan sharing stuff we're talking about Blink, and I think the scan sharing techniques we'll discuss are actually still in BLU.
All right, so I'm going to talk about two different types of shared scans that they support. The first is what they call naive sharing, and that's basically the convoy phenomenon I talked about before, where the database system doesn't do anything explicit in how it schedules queries running at the same time; it just lets the hardware do whatever it wants to do. Then, for full shared scans, we have the different worker threads execute the operations they need for their queries using the same sort of task model we saw with the morsels in the HyPer architecture. There's a dedicated reader thread that iterates through whatever data structure or table you want to scan, and it feeds blocks of tuples to the other worker threads, which then do whatever processing they want. Once each worker thread finishes processing a block, it sends a notification to the reader thread, and once the reader gets notifications from all the worker threads, it can move on to the next block. The idea is similar to the LLVM stuff we saw in HyPer last class: we try to do as much processing as we can for every block of tuples we pull from memory, across all the cores and all the queries running at the same time, before we move on to the next one. That way we avoid paying the big penalty of fetching things from the memory controller over and over again.
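Here's a toy Python sketch of that full-sharing handoff, where a threading.Barrier stands in for the acknowledgment messages between the workers and the reader. The structure and names are my own illustration, not Blink's actual implementation:

```python
# Sketch of the full-sharing handoff: one reader thread publishes a
# block, every worker processes it for its own query, and the reader
# only advances once all workers have acknowledged (via the barrier).
import threading

NUM_WORKERS = 3
blocks = [list(range(i * 4, i * 4 + 4)) for i in range(10)]  # fake table
current = {}                                   # slot for the shared block
barrier = threading.Barrier(NUM_WORKERS + 1)   # workers + reader

def reader():
    for block in blocks:
        current["block"] = block
        barrier.wait()    # block is published; workers may start
        barrier.wait()    # all workers acked; safe to fetch the next one
    current["block"] = None
    barrier.wait()        # tell workers the scan is over

def worker(qid, agg):
    while True:
        barrier.wait()                 # wait for the reader to publish
        block = current["block"]
        if block is None:
            return                     # scan finished
        agg[qid] += sum(block)         # each query does its own work
        barrier.wait()                 # acknowledge: done with this block

agg = {q: 0 for q in range(NUM_WORKERS)}
threads = [threading.Thread(target=reader)]
threads += [threading.Thread(target=worker, args=(q, agg)) for q in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(agg)   # every query saw every block exactly once
```

The barrier also answers the starvation question that comes up next: a query has to be registered with the reader before a block is handed out, so latecomers wait for the next block rather than pinning the current one.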
The problem with full shared scans, though, is that there's no governor to limit the number of queries that can run at the same time. They talk about how, if you have queries generating large intermediate data structures for hash tables or whatever else in your query plan, then with too many queries running at once you end up thrashing: you can't bring in new blocks of data because your CPU cache space is all taken up by intermediate data structures. So they came up with an extension, an optimized batched version of full sharing scans. Sorry, yes, a question? How do you know that all the workers are done with a block? Say the reader thread reads one block, three workers are processing that block, and then a new query comes in that also wants the same block; can that keep going on forever, so the block never moves? Right, so the question is: you read a block and hand it off to three worker threads processing three separate queries, and during that time a new query arrives that wants to read that same block. Could you have a starvation situation where new queries keep showing up, each attaching to the same block, and the block never moves forward? I think the way this works is that you have to tell the reader thread you want a block before it hands it out. You can't come back later and say, I know you're in the middle of this block and these other guys are already working on it, let me get in on it too; you just have to wait. And yes, as another student points out, in the paper they assume the workload is static and they do all the scheduling for that one workload; that's what all the experiments are based on. But he's right that in a real system you could have this: a guy typing random queries into a terminal or playing around with some dashboard. I think the window is quite small, though. You're talking about milliseconds, probably less than that, microseconds, to process a block, so the likelihood that you just keep getting more and more queries that want exactly that block is small. All right, so again, we had this problem where we could overflow our caches because we have so many queries running at the same time, and we have to maintain all their intermediate data structures, so we end up thrashing back and forth between the CPU caches and DRAM. In the batched approach, what they do is figure out all the queries that are going to run in a batch, per epoch, and group them together based on what they think the sizes of the intermediate results are going to be. This is pretty similar to the result size estimation we talked about last week with cost models: we don't know exactly what a query is going to do, but we look at our statistics, we look at the predicates, and we try to approximate how much space it's going to need. They then categorize each query based on its estimated result set size to determine whether it can share scans or not. If there's a query you think is going to be so large that you won't be able to take advantage of any kind of scan sharing, you execute it individually. Of course, as we talked about, doing this estimation at runtime is hard, and they don't have any magic trick for it; they rely on the same cost model stuff we discussed before. So, to pick the right number of queries in a batch, we estimate the size of the operators, and they classify each query based on operator selectivity. If the selectivity is really, really small, like you're going to access less than 1% of the tuples, then you know the intermediate result is going to be small, so it's always okay to have those guys share scans. If the selectivity is high, like you're doing a complete table scan with a very expensive, very large join, then you probably don't want to share scans for that query at all. And then there are the middle guys, where you say, well, I think the selectivity is going to be okay, but you take a sample of the data, the same way the HyPer guys do, to get a more accurate estimate of what the result size is going to be. I'm not going to talk through the specifics, but in the paper they have algorithms showing how to pack these things together at runtime using basic heuristics rather than running a very expensive bin-packing solver; the sketch below shows the flavor of this triage.
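The thresholds, field names, and greedy first-fit packing below are all made up to illustrate the idea; the paper's actual heuristics and constants differ:

```python
# Sketch of selectivity-based triage plus greedy batching. Hypothetical
# thresholds and names, not Blink's actual algorithm.

def classify(query, sample_selectivity):
    """Decide how a query participates in a shared-scan batch."""
    est = query["estimated_selectivity"]      # from optimizer statistics
    if est < 0.01:
        return "share"        # tiny intermediate results: always safe to share
    if est > 0.50:
        return "isolate"      # huge intermediates: run it by itself
    # Middle ground: sample the data for a better estimate (as HyPer does).
    return "share" if sample_selectivity(query) < 0.10 else "isolate"

def build_batches(queries, cache_budget, sample_selectivity):
    """Greedy first-fit packing instead of an expensive bin-packing solver."""
    batches, isolated = [], []
    for q in queries:
        if classify(q, sample_selectivity) == "isolate":
            isolated.append([q])
            continue
        cost = q["estimated_selectivity"] * q["table_size"]  # rough bytes
        for batch in batches:                 # first batch with room wins
            if batch["cost"] + cost <= cache_budget:
                batch["queries"].append(q)
                batch["cost"] += cost
                break
        else:
            batches.append({"queries": [q], "cost": cost})
    return [b["queries"] for b in batches] + isolated
```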
So they show some results on whether their technique works, comparing against what they call naive sharing, the convoy phenomenon approach where you basically let the hardware do whatever it wants. They have some sample queries from an IBM customer that they use to determine whether this approach works. One thing I'll point out is that if you look at the paper, they don't really have any absolute performance numbers; everything is always relative. They're not going to say they can run 10,000 queries a second; they'll say, here's the throughput speedup we get versus the default case. The reason they do this is that although it's a research paper and it is peer reviewed, it was submitted to the industrial track of one of these research conferences, and the bar is usually a bit lower on how rigorous the experiments have to be. Partly that's because the fact that this is running for real customers in the real world is significant, so it's good to get these ideas out. But part of the reason is also that if IBM put numbers in the paper saying they can run 10,000 queries a second, then all their competitors could use that paper in marketing literature to say, look how much slower IBM's system is than ours. That's why they only report things like 1x, 2x, 3x over the baseline. There's also this thing called the DeWitt Clause, named after David DeWitt, the famous database professor at Wisconsin, where you can't name a particular commercial database system by name in a paper; that's why you always see these things written as DBMS X or DB X, because there's a clause in the software license saying you can't name them. Okay, so in the first experiment, they show where all the CPU cycles are spent on a per-query basis. They break down the numbers into actual computation, penalties for branch mispredictions, L2 hits, TLB misses, and stalls from going to the memory controller. They basically have two sets of numbers: the system running with one core, where there's no sharing at all, and with eight cores, where you can have sharing across concurrent queries. As expected, with one core there are no sharing opportunities, so the naive scheduling (no explicit scheduling at all) and the batched sharing approach perform roughly the same. But when you go to eight cores, in the case with no shared scans there's a larger penalty for memory stalls, because the queries are all competing for the CPU caches and you're thrashing on things that could have been reused with shared scans. In the batched case, since they do support this, you get roughly the same per-query profile as when running on a single core. So this basically shows that, for this particular workload from IBM's customer, all the extra stuff you have to do at runtime, estimating the intermediate result sizes, figuring out how to schedule the queries together, making sure they can reuse the same data, is worth it, because you're not paying this huge memory stall penalty. Then they have an additional experiment where they scale up the number of cores executing queries concurrently, again comparing the batched sharing approach against no explicit sharing at all. What you see is that as you increase the number of cores up to eight, you get about a 7x improvement in throughput. And this is sort of the gold standard; this is what you want.
It shows that the technique can almost scale linearly. Any questions about basic scan sharing or this idea? As far as I know, no in-memory system actually does this other than DB2 BLU; VoltDB doesn't do this, MemSQL doesn't do this. That's because it's kind of hard to get right, and it relies on having good cost estimations, as always. Okay, so now let's talk about alternative techniques. Like I said at the beginning, these are things that, as far as I know, except for the one system we'll talk about from the Swiss people, aren't actually implemented in widely deployed systems. With continuous scanning, the basic idea is that instead of having our query scheduler and executors try to be smart about figuring out when we can share data across queries, we instead have a single reader thread scan through the entire table continuously, nonstop; even if you're not running any queries, it just keeps scanning. When it reaches the bottom, it loops back to the top and starts over. As queries arrive in the system, they jump on the train of the reader thread, ride along for as long as they need to, and then hop off when they've gotten enough data to do the rest of their query plan. We'll see in a second that you could do this without using any indexes at all, since you're just doing a sequential scan for everything, but in the Crescando case they actually build indexes on the fly. You can think of this as a pub/sub interface, where there's one reader thread per table, and it says, here's the data I'm going to read, and if you want it, register your thread with me and I'll hand it off to you as it comes in. Yes? Say it again? They still have to do some amount of scheduling, right, because you can't be scanning all the tables all at once? So, they are scanning all the tables at once, but, and I'll talk about the distributed side of things in a minute, they're doing this in a partitioned manner, with tables partitioned within a single machine and also across multiple nodes. That means each partition a core has to scan is relatively small compared to the entire size of the database. And they're doing this in an in-memory system, so you're not paying a huge disk I/O penalty. There was a research system from Chris Jermaine out of Florida that was doing continuous scans for a disk-based system; I don't remember what it's called, and there aren't very many papers about it. The particular system I'm talking about here is Crescando; I keep wanting to say "crescendo," and it's not. This is the system they actually built to do this. And the use case they were trying to solve was, I think, for a French telecom company. They had all these call records, and they would maintain all these different materialized views for every single type of query that could come along. What they found was that maintaining all these materialized views over and over again as new updates came in was really expensive. So instead, you just build everything on the fly over the different partitions of the table, using a single thread or single core to scan each one.
Part of the reason they want to do this is that they want queries to always take roughly the same amount of time to execute, regardless of the ingestion rate into the database system. That means if you're getting all these updates from the cell phone towers, all these new call records, you need to feed them into your database and be able to process them, but you don't want your reader threads to stall while you insert these new things, and you don't want them to see an inconsistent view. So they solve both problems at once using a clock scan approach that I'll talk about in a second. They have two cursors iterating over the table: a write cursor that applies all the new updates coming into the table, and a read cursor a little bit behind it, reading all those updates, building indexes, and handing tuples off to any operator or reader thread that's subscribed to its cursor, to process whatever query it wants to process. So I'll show an example here. Assume this circle represents the partition space for a single table. Again, the table is horizontally partitioned into disjoint subsets, and we're just dealing with one slice of that on a single core. Here at the top, we'll say this is record zero; this is the starting point for any scan we're going to do. The goal they want to support is to do one complete scan of the partition in less than one second, and they can do this because they're not scanning the entire table, only a portion of it. So you have the write cursor over here, going around in a clockwise fashion and applying updates that come in from the outside. You can think of the table space as being split up into snapshots, where the earlier snapshot is anywhere that has not yet been passed over by the write cursor, and the latest snapshot is anything behind it. Now, our read cursor is always back here, always reading changes that were already made by the write cursor, because it's always behind it; therefore it has a consistent view of the latest snapshot. As queries come along, we can build indexes using the read cursor and hand tuples off to any operator above us that subscribed to these notifications. They do whatever processing they want, and they finish when we reach the beginning of the table space, because we can notify them, hey, we finished a complete scan, you should not expect any new tuples from us. Then they can pass whatever data they want up to the operators and complete their queries, and then we blow away all the indexes, come back around, and do it again. I think this is a really interesting architecture; it's very different from everything we've talked about before, and like I said, this has been proposed in a small number of systems, but Crescando is the only one I know of that actually implements it and is actually deployed in the real world.
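Here's a single-threaded Python sketch of the clock scan idea: a write cursor applies queued updates in passing, and a read cursor trails a fixed distance behind it, so it only ever sees slots the write cursor has already finished with. The names and the fixed lead are my own simplification of what Crescando does:

```python
class ClockScan:
    """One partition, one core: a write cursor applies queued updates
    and a read cursor trails it, feeding every query in the epoch."""

    def __init__(self, records):
        self.records = records      # list of {"key": ..., "val": ...} dicts
        self.pending = {}           # key -> new value, applied in passing
        self.queries = []           # (predicate, results) for this epoch

    def update(self, key, value):
        self.pending[key] = value   # picked up when the write cursor passes

    def register(self, predicate):
        results = []
        self.queries.append((predicate, results))
        return results              # filled in by the end of the epoch

    def one_epoch(self, lead=8):
        n = len(self.records)
        for i in range(n + lead):
            if i < n:                             # write cursor at slot i
                rec = self.records[i]
                if rec["key"] in self.pending:
                    rec["val"] = self.pending.pop(rec["key"])
            if i >= lead:                         # read cursor trails behind,
                rec = self.records[i - lead]      # so it sees a consistent
                for predicate, results in self.queries:   # snapshot
                    if predicate(rec):
                        results.append(rec)
        self.queries.clear()    # epoch over: per-epoch indexes dropped here too

scan = ClockScan([{"key": k, "val": k * 10} for k in range(100)])
hits = scan.register(lambda rec: rec["val"] > 950)   # a query joins the epoch
scan.update(99, 12)                                  # an update is queued
scan.one_epoch()
print(hits)   # sees the post-update snapshot: records 96, 97, 98 only
```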
It's sort of like the batched shared scan stuff we talked about in Blink: at the beginning of every clock tick, every second, we look at all the queries we want to execute within this epoch and figure out exactly what indexes they need. What's cool about this is that you, as the database administrator, don't have to define any indexes ahead of time: the system just looks at the queries you're trying to run and picks the indexes that would be best for them at that particular moment, and then we don't have to maintain them after that clock tick. After we sweep back around, we just blow everything away and decide what new indexes we want. It may be the case that the first time around we build an index on two particular columns using a B+tree, and then we come back a second time and say, well, now our queries want those same two columns, but we don't want a B+tree, we want a hash table, because we're doing an equality predicate, not a range scan. So it blows away the B+tree and, on the next pass, builds a hash table index instead; I'll show a small sketch of this index-picking idea in a moment. So it's continuously figuring out the best way to organize the data for the queries you're trying to execute at that moment; it doesn't worry about what you've done in the past or what you think you're going to do in the future. And because we're in memory, we can do this very quickly, and because we're partitioned across multiple cores... yes? Can a query come in in the middle of a scan? So the question is: all right, my read cursor is over here, my write cursor is over here, and now a query shows up at this point in time; can it pick up mid-scan? No, because then it would read half of one snapshot and half of another. It's only when the read cursor reaches record zero that you blow everything away and start over; the epoch starts over, and you do all the processing again. Okay, the other nice thing about this is that because everything is partitioned, and because all the query processing within every single core for one loop around the circle is localized to that one thread, it works really well with the new architectures we talked about: we don't have to worry about sharing data across cores as we do the scan. We may have to combine things later as we go up the query plan tree, but the scanning part isn't worried about that. And because our tables are partitioned so that we have really small indexes, chances are most of the indexes we build within each partition can actually fit in L2. Whereas in a system like MemSQL, for example, on a single node, even with multiple cores, multiple threads, multiple sockets, you'd have one giant index shared across all of them; it's not partitioned per NUMA region, whereas in this case it explicitly is, because each read cursor generates just the indexes it needs for just the data it has looked at and ignores everything else. So obviously, you can't run transactions on this the way we talked about before. This is only for streaming batch updates into the system; you couldn't run something that would hold a lock for a long time and do a bunch of changes.
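And here's that tiny sketch of per-epoch index selection, with a made-up predicate encoding; Crescando's real planner is more involved than this:

```python
# Sketch: look only at this epoch's queries and pick an access structure
# per column, then throw it all away when the cursor wraps. Hypothetical.
def pick_index(epoch_queries):
    choice = {}
    for q in epoch_queries:
        for col, op in q["predicates"]:          # e.g. ("city", "=")
            if op in ("<", "<=", ">", ">=", "between"):
                choice[col] = "btree"            # range scans need ordering
            elif choice.get(col) != "btree":
                choice[col] = "hash"             # equality-only: hash wins
    return choice

# Epoch 1: equality lookups, so we build a hash table. Epoch 2: a range
# query appears, so the same column gets rebuilt as a B+tree instead.
print(pick_index([{"predicates": [("city", "=")]}]))          # {'city': 'hash'}
print(pick_index([{"predicates": [("city", "between")]}]))    # {'city': 'btree'}
```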
So, yes? His statement is that this approach appears to scale really well, because you can just double the number of cores and get double the processing, but it also has the negative effect that if you double the size of the database, each core ends up having to read more data. So for that case, yes: if you want to maintain the one-second SLA they're trying to support, then you have to scale out your hardware proportionally. But I think the argument they're making is that this streamlined architecture would have less resource consumption than running something like MySQL, Postgres, or Oracle; in the end, you'd use less hardware to process a larger database than you would with MySQL. Correct, yeah, and that's another big aspect of this: the kinds of queries running here touch large portions of the data. They're not going out and getting the one tuple you need; they're doing aggregations across all the call records, across different time slices and whatnot. All right, so one way we can extend this continuous scanning model is to use work sharing in the layers above it. I was hand-wavy before: there's this thing going around the clock, scanning stuff and feeding it up to worker threads that process it, but I didn't say what those worker threads actually do with that shared data. With work sharing, the idea is that we want to use multi-query optimization in the query planner to figure out, here are all the queries I want to run at exactly the same time (you can assume they arrive within one one-second tick of the clock scan), and then figure out the plan that's best for the entire system, one that can share results between those different queries. For this, the same people who made Crescando at ETH Zurich came up with a system called SharedDB that sits above the Crescando storage manager. The idea is that they take the batches of queries that come in within each time tick and figure out how to combine them, to reuse as much of the data as they can and reduce the amount of processing that has to be done. To show an example of what SharedDB can do, say we have three queries, all doing a join of A and B on the primary keys of those two tables; each join query has A.id = B.id. But then each has a conjunctive clause after that, doing different kinds of lookups or predicates on different parts of the tables and using them in different ways. Some are equality predicates; some are less-than or greater-than. Without multi-query optimization, you would generate a separate relational query plan for each of these three queries. And although you could reuse this part here, because the continuous scan is what's feeding tuples up into these operators, all the other stuff, especially the join, which is usually the most expensive part of a query, has to be executed separately.
What SharedDB does is combine the filters and predicates into a query plan that looks like this, so that in one scan of the tuples we apply all three queries' predicates at the same time and union the results. Then we join the union from this side with the union from that side using a special join operator that knows how to match tuples based on which query actually needs them. For each path through these filters, we tag the results with the query ID (or IDs) the tuples correspond to. Then, when we do the join, we check: this tuple is for query one from A, this tuple is for query one from B, so those two get joined together, and all the other combinations are just thrown away. If you think about this from an individual query's perspective, it's terrible, because query one is getting query two's and query three's data in its join operator, essentially increasing the number of tuples it has to evaluate to compute its join. But in aggregate, across the entire system, this is much better, because we built a single hash table, we applied all our predicates at the same time, and we don't have to worry about contention or thrashing across intermediate data structures on different threads. Then we do an aggregation, a group-by on the query ID, which routes the results to exactly the right query. So this is essentially what SharedDB is doing, and again, the thing that makes this work is that you have to be able to do the multi-query optimization at runtime to figure out exactly how to combine these guys together. What they do is rely on prepared statements, where you have predefined queries, so they can do whatever optimization they want offline. When they turn the system on, they generate this really large single query plan that looks like this, where you see all the access methods doing the scans at the bottom, then all the different kinds of joins taking the outputs of those access methods and combining them, and then feeding results up to the individual prepared statements at the top. So when I invoke one of these queries, I know how to go down and pull together all the data I need. You can think of this, as they describe it, like the circulatory system in a human body: for every pass around the circle in Crescando, it's emitting tuples up into this plan, and the connections between these operators get turned on depending on whether there's a query that needs the data they produce. It may be the case that, for one of these queries, if I don't need to execute these operators, this operator gets turned off and we just ignore its tuples, while the other data can still be used for other queries. This is what SharedDB manages for you automatically. Like I said, I think it's a neat idea. It only works if you have prepared statements; an ad hoc query has to be executed separately. But I don't think anybody actually does something like this in a relational system. It looks a bit like some of the stuff you see in stream processing systems, but for the most part, as far as I know, those are all very manual; the database doesn't figure out this plan for you. There's a small sketch of the tagging idea below.
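Here's that sketch: a toy Python version of the tag-union-join-route pipeline. The row layout, tagging scheme, and predicate encoding are my own illustration of the idea, not SharedDB's actual operators:

```python
# Tag each tuple with the set of query IDs whose predicate it passed,
# do ONE hash join over the union, and route joined rows back to
# queries by intersecting the tags from both sides.

def tag(rows, predicates):
    """predicates: {query_id: fn(row) -> bool}. Keep rows any query wants."""
    tagged = []
    for row in rows:
        qids = {q for q, p in predicates.items() if p(row)}
        if qids:
            tagged.append((row, qids))
    return tagged

def shared_join(tagged_a, tagged_b, key_a, key_b):
    """One hash table serves every query; tags decide who gets each match."""
    ht = {}
    for row, qids in tagged_a:
        ht.setdefault(row[key_a], []).append((row, qids))
    out = {}                                   # query_id -> joined rows
    for row_b, qids_b in tagged_b:
        for row_a, qids_a in ht.get(row_b[key_b], []):
            for q in qids_a & qids_b:          # queries that want both sides
                out.setdefault(q, []).append({**row_a, **row_b})
    return out

# Three queries share one join of A and B on id:
a = [{"id": 1, "x": 5}, {"id": 2, "x": 50}]
b = [{"id": 1, "y": 7}, {"id": 2, "y": 70}]
preds_a = {1: lambda r: r["x"] < 10, 2: lambda r: r["x"] > 10, 3: lambda r: True}
preds_b = {1: lambda r: True, 2: lambda r: r["y"] > 60, 3: lambda r: r["y"] < 10}
print(shared_join(tag(a, preds_a), tag(b, preds_b), "id", "id"))
```

Note how query one's probe still touches tuples that only query three wanted, which is exactly the per-query overhead mentioned above, but there's only one hash table for all three joins.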
Okay, so that's all I really have to say about scan sharing and this technique. Like I said, I think this is one of those cool ideas that can help a system once you go beyond all the other optimizations we talked about at the beginning of the class. Obviously, the amount of benefit, the amount of reuse you'll get from your scans and your operators, depends on what the queries in your workload look like. If they're completely different and completely disparate, then there aren't going to be many sharing opportunities, and you won't get any benefit. How to do this in an HTAP system is, I think, an open research problem, because all the systems we've talked about are either doing read-only operations, like Blink, or batch updates, like Crescando, and no one has really looked at how to do all this with transactions. The answer is probably that you can't do much, because a transaction's footprint is so small that the number of sharing opportunities is limited. Okay, so for next class we'll spend time talking about vectorized execution models and using SIMD. Remember that there's no class on Wednesday, because we'll have the one-on-one project meetings, and then I'll spend a little time talking about different recipes for making your favorite foods while in prison, using the limited resources available to you and whatever you can steal from the kitchen; you can still make corn flakes, you can still make alcohol, things like that. So, any questions? All right guys, thanks.