Everyone, hello, welcome to Jon Gjengset's thesis defense. And thank you for attending. I want to say just a couple of words before handing over to Jon. In his research life, Jon's been pursuing a really neat vision for how busy websites ought to manage their data. It's been exciting to watch him develop this vision and impressive to see his skill in realizing it. But in parallel with research, Jon's been a hugely energetic contributor to every community he touches, from his infinite patience as a TA to organizing graduate student events to live streaming systems programming tutorials. Jon's sure to be valuable and valued whatever he sets his hand to. And assuming his committee lets him go, many people at MIT are gonna miss him. I certainly will. So with that, I'll hand over to Jon. Thank you, Robert, I appreciate that. So thank you everyone for coming, and welcome to my doctoral dissertation presentation, where I'll be presenting my thesis on partial state in data flow based materialized views. Now, my goal with this presentation is to try to summarize the past six years of work or so in about 45 minutes, and hopefully I will succeed. And at least hopefully, by the end, you will understand what the words in the title mean and why I think it was worthwhile to spend six years doing this work. Now, don't worry, this work was supervised. My committee, Robert, who you just saw, but also Frans, Sam, and Malte, have been working with me, basically trying to figure out that this work is worthwhile and to make sure that what comes out of it is something that is valuable. So I wanna start this presentation with: why are we here? And not in the existential sense of why we're all on this planet, but more in the sense of why am I here? Why am I talking to you at all? I'm here because I want to make databases better.
And in order to understand the way in which I want to do that, we first need to talk a little bit about what databases are, in order to understand the problems that some applications have with them. So let's do a little bit of database 101. In databases, you have some tables. In this case, we're gonna work with a stories table and a votes table, and these two tables hold stories and votes. The idea is that as an application, you can query these tables by issuing queries. These queries might do all sorts of things: they might aggregate and combine values, or they might join different tables together, in order to produce some application data that it might care about. The query operations are here shown in orange. If you want to modify the data, then you perform an insert or an update or a delete directly against the tables at the bottom. So here, if you wanted to insert a new vote, then you just make that insert operation directly to the votes table. And even though inserts, updates, and deletes are queries in SQL, I will not be referring to them as such in this talk. Rather, I will be talking about queries as things that read the data, things like selects, whereas inserts, updates, and deletes are writes or updates. Now that we've looked at this picture, you might see that there's a lot more orange here than blue. And what this indicates is that if you do a read, you have to do more work than if you do a write. But this is unfortunate because in practice, many applications, and in particular web applications, are read heavy. They do a lot more reads relative to the number of writes. And so it seems unfortunate that the reads are also the most expensive operation.
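To make that asymmetry concrete, here's a toy Python sketch of the stories and votes tables from the talk. This is not Noria's API and the table contents are made up; the point is just that a write is a cheap single append, while every read has to re-join and re-count:

```python
# Toy in-memory "tables" (contents are illustrative only).
stories = [
    {"id": 7, "title": "Partial state"},
    {"id": 8, "title": "Dataflow"},
]
votes = [
    {"story_id": 7, "user": 1},
    {"story_id": 7, "user": 2},
    {"story_id": 8, "user": 3},
]

def insert_vote(story_id, user):
    """A write: a single cheap append to the votes table."""
    votes.append({"story_id": story_id, "user": user})

def story_with_vote_count(story_id):
    """A read: joins stories and votes and aggregates, scanning every vote."""
    count = sum(1 for v in votes if v["story_id"] == story_id)
    story = next(s for s in stories if s["id"] == story_id)
    return {**story, "vcount": count}

insert_vote(7, 4)
print(story_with_vote_count(7))
# {'id': 7, 'title': 'Partial state', 'vcount': 3}
```

Every call to `story_with_vote_count` redoes the full scan and aggregation, which is exactly the repeated work a read-heavy application ends up paying for.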
And so ideally we want some way to mitigate this problem, because it becomes pretty severe once you look at applications that issue many, many read queries that are often either identical or similar, where there's a lot of work that's the same in executing any given query. And it'd be great if we didn't have to do this repeated, unnecessary work. Now, you might say, well, Jon, this problem has been solved. You just use a cache. And it's true that caches are great. A cache is essentially a storage system that you place in front of your database. The idea is that queries look in the cache first, and if the result is there, they just return immediately, and the reads are now fast. Only if you miss in the cache, if the result you're after is not there, do you issue a query to the backend database. And it's true that this does make queries fast. There are some other similar schemes, like denormalization, where you store and maintain derived values, like the number of votes for a story in the stories table, that have a similar effect. So it's true that caches are great, but unfortunately they're also really hard to get right. This diagram is an attempt to summarize some of the things that you have to get right in the application in order to use this kind of caching strategy correctly. The cache is great, but as the data in the tables changes, the cache becomes out of date, and it needs to somehow be refreshed. So imagine that we insert a new vote into the votes table; the cached vote count is now wrong. And so the code path in the application that inserts this new vote also needs to invalidate, or somehow mark the cache as outdated, and maybe refresh that value. Furthermore, even though it's easy to say that if you miss in the cache you query the database, it's hard to figure out how this should work in practice. Imagine that a hundred thousand people want to read a particular story, but that story is not present in the cache.
Do all those application queries swarm the database and all execute the same query? That seems wasteful too. And even if they do that, we still need some way to fill the cache. If all of them read the database when they miss, someone has to write that value back into the cache, and orchestrating who should do that is not entirely trivial. Beyond this, over time the cache is gonna start to accumulate more and more values, and we only have so many resources on the server that's serving these requests. So we need some way to evict entries from this cache that are old or unpopular. But then the question arises again: who does this eviction, and how? Hopefully this image gives you an idea of just all the stuff that needs to be figured out by the application developers in order to make databases serve these kinds of read heavy workloads. And so in my research I've been looking specifically at this problem of automatic database caching. Specifically, is there a way that we can get all this cache logic to not reside in the application, and instead be provided by the database? That is what my thesis implements and what we're going to be looking at for the rest of the talk. Now, the way we're going to do that is we're gonna be walking through the title of the thesis, which, to remind you, is partial state in data flow based materialized views. We're gonna parse the title from right to left, and then move through the words and figure out what they mean and how they fit together. So let's start with materialized views, and in particular why they're useful in this context. Materialized views have been around for a long time; they were invented by the database community in the early 1980s. Essentially, a materialized view amounts to running the query and then remembering the result, which sounds like a fairly straightforward thing. It sounds very much like a cache. And the key question with materialized views, just like with caches, is how to maintain that materialization.
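As a rough illustration, here is a minimal cache-aside sketch in Python. All names are hypothetical, and this deliberately ignores the thundering-herd and eviction problems just described, which a real application would also have to solve; the point is how much bookkeeping already lands on the application even in the simplest case:

```python
cache = {}                       # story_id -> cached vote count
votes_table = {7: 100_000}       # stand-in for the backend database

def read_vote_count(story_id):
    if story_id in cache:              # hit: the fast path
        return cache[story_id]
    value = votes_table[story_id]      # miss: query the backend...
    cache[story_id] = value            # ...and someone must fill the cache
    return value

def cast_vote(story_id):
    votes_table[story_id] = votes_table.get(story_id, 0) + 1
    cache.pop(story_id, None)          # every write path must also invalidate

read_vote_count(7)                     # miss, then fill
cast_vote(7)                           # write + invalidate
assert read_vote_count(7) == 100_001   # miss again, re-fill with fresh value
```

Note that invalidation logic like `cache.pop` has to appear on every write path that can affect a cached value; forgetting one such path is exactly how stale caches happen.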
What happens if the data in the tables changes? This is what's called materialized view maintenance. And ideally we want this maintenance to be incremental: we want it to be so that we don't have to re-execute the query that is materialized every time the underlying data changes. For example, if some story has 100,000 votes and then one more vote comes in, we'd like to not have to count to 100,001, and instead just take the current count of 100,000 and have the system realize it can increment that count by one rather than recomputing the query from scratch. We also have this question, when it comes to materialized view maintenance, of whether we maintain on write, so proactively, when a new change happens to the underlying data, or whether we do it lazily and on demand, on a later read. And this brings us to the next part of the thesis title, which is how we maintain these kinds of materializations. The title gives us a clue: data flow based. Now, data flow has a lot of meanings in different parts of academia, and in particular within computer science, but essentially it's a system architecture. In this talk we're gonna be talking about data flow as having data move to compute. The idea here is that instead of code fetching data from tables, you're gonna have the tables send data towards the compute. You can think of this a little bit like push-based computation. These data changes are gonna propagate through a graph of operators. These are relational operators like joins, aggregations, and filters; you can think of them as whatever operators you might use in SQL. This might sound a little hard to get your head around, so I'm gonna give an example shortly of how this works out in practice. And then each edge of the data flow is gonna indicate a data dependency. So for example, a join is gonna depend on all of the inputs of that join; an aggregation is gonna depend on the table or the data that it's aggregating over.
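The incremental-count example above can be sketched in a few lines of Python. This is a toy illustration of maintain-on-write, not how Noria actually stores state:

```python
# The materialization: story_id -> vote count (starting values are made up).
vote_counts = {7: 100_000}

def on_new_vote(story_id):
    # Maintenance on write: apply the one-vote change to the stored result...
    vote_counts[story_id] = vote_counts.get(story_id, 0) + 1
    # ...rather than re-executing SELECT COUNT(*) over all 100,001 votes.

on_new_vote(7)
assert vote_counts[7] == 100_001
```

The work done per write is constant here, independent of how many votes the story already has, which is the whole appeal of incremental maintenance.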
And then the messages that flow over the edges of the data flow are going to be deltas. A delta here is a full row: it has columns just like the rows that are coming out of the operator above, but it also has a sign, which is either positive or negative. If a delta has a positive sign, you can think of it as an additional query result row, in addition to what came before. If it has a negative sign, it's a revocation of some result that was previously issued. You can think of it as "add this" or "remove this". And in particular, we're gonna be looking at this whole data flow model in the context of Noria, which is an eventually consistent materialized view system that is built using data flow. So let's take a look at a concrete example of how data flow can be used to maintain materialized views. Here on the left, I show you a query, and on the right, I show you the data flow that Noria constructs for that query. The query here creates a materialized view by the name of story with VC, VC for vote count, and then defines a query that does a join between stories and votes on the story ID and then counts the number of votes for each story; you can see it groups by story ID. And if you squint at this, you can see how some of the parts from the relational query appear in the data flow graph. For example, there's an aggregation here, a count, indicated by the sum operator in the data flow, and it groups by the story ID, which we know from the join clause is the same as the story ID column of votes. Similarly, there's a join between stories and votes, and that's represented by the join operator in the data flow graph. You might notice here that these seem a little out of order: in the query, the join is between stories and votes, not between stories and the count, yet in the data flow the count sits above the join. What we see here is the effect of query optimization: Noria has decided that it's gonna do the aggregation first for optimization purposes.
And now the question becomes: what if the application wants to read from this view at the bottom, story with VC? Well, what happens is the application issues a query over that view, which might look something like this: select all from story with VC where the ID is question mark, where question mark here means some parameter that's input by the application at runtime. That parameter is essentially going to be a lookup key into the materialized view, to get out just the subset that the application cares about. Now imagine that some data changes in the underlying table. Imagine, for example, that a new vote comes in for some story. What Noria is going to do is take that input change and inject it into the data flow at the votes table node in the data flow graph. And then it's gonna stream that as an update through the data flow graph, along each of the edges, down the data flow, all the way down to any materialized views we might have. The idea here is that the update is modified by the operators that it passes through. So for example, when the vote, let's say it's for story seven, passes through the count operator, the count is gonna convert that plus one vote into a negative delta of the old count and a positive delta of the new count. In this case, that's a negative delta with story ID seven and the old count of 42, and a positive delta with seven and the new count of 43. You can think of this message as communicating: it used to be that the count for story seven was 42; forget that, it is now 43. These two deltas then flow along to the join, and the join needs to convert that again into some kind of delta that can be applied to the downstream materialized view. From the query, we know that the materialized view has a bunch of additional columns that come from the stories table. And so the join performs a join by doing lookups into the stories table and then stitching together the output deltas so that they have the additional columns.
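Here is a toy Python sketch of that delta propagation, following the numbers in the example (story seven, old count 42). The operator and state representations are simplified stand-ins for Noria's internals:

```python
# The count operator's internal view of its own past output, plus a toy
# stories table for the join to look into (contents are illustrative).
count_state = {7: 42}
stories = {7: {"title": "Partial state"}}

def count_operator(delta):
    """Turn a signed vote delta into (-old_count, +new_count) deltas."""
    sign, story_id = delta
    old = count_state.get(story_id, 0)
    new = old + sign
    count_state[story_id] = new
    out = []
    if old:
        out.append(("-", story_id, old))   # revoke the previous result
    out.append(("+", story_id, new))       # emit the new result
    return out

def join_operator(count_deltas):
    """Stitch story columns onto each count delta by lookup into stories."""
    return [(sign, sid, stories[sid]["title"], cnt)
            for sign, sid, cnt in count_deltas]

deltas = count_operator((+1, 7))           # a new vote for story seven
print(join_operator(deltas))
# [('-', 7, 'Partial state', 42), ('+', 7, 'Partial state', 43)]
```

The output pair is exactly the "forget 42, it is now 43" message from the talk, with the story's columns attached by the join.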
Those deltas then flow through the data flow again, down to the story with VC materialized view, where they update that view in place. Now subsequent reads from that materialized view are gonna see the updated result, the result that has the count of 43 rather than 42. Now that we've walked through data flow based materialized views at a relatively high level, let's start to look at the last, or first, part of the title: partial state. Partial state is the core contribution of this thesis and what we'll be looking at for the rest of this talk. Partial state can be summarized as learning to forget. The observation here is that chances are most of the entries in any given view are not accessed. Old and unpopular stories are just sitting there wasting memory: if no one ever reads story seven, why are we storing its result? Furthermore, why are we even computing and keeping the results up to date if no one reads them? Similarly, if we don't have the memory to keep all the results for every story, we might need to choose to only keep some of them, say, only the popular ones that speed up reads the most. We want to trade off in favor of the most popular and most frequently accessed things. But traditional materialized views don't really give us this opportunity: as we saw before, you just have a query that gets all of the results, and then you have to query over that view. What we need is some way to evict entries from materialized views and only add new ones on demand. Which brings me to three of the main contributions of this thesis. The first is the notion of missing state in materialized views, the ability to mark some state as not being there and not expend memory on it. The second is the mechanism of up queries, which allow populating missing state using data flow. And finally, an implementation and evaluation of partial state in the Noria data flow materialized view system.
So in order to get partial state, one thing we first need to figure out is the separation between the definition of the view and the query over that view. Unfortunately, that separation doesn't give us quite enough information to make the view partial. We don't know what key to use as the basis for deciding whether a subset of the view should be made missing or not; we don't know the query key that the application is going to use. Therefore, in partial state and in partially materialized views, what we're going to do is introduce the parameter, the query parameter, into the view definition itself. So instead of writing what you currently see on screen and having them be separate, we can have a single view definition that also includes this clause of where the story ID equals question mark, and then have the question mark be a parameter for the view. We're gonna use this key to determine if a given result is present or missing. Then, when the application wants to query over this view, it's gonna execute that view without any additional query, and just supply the parameters that are identified by the question mark placeholders. And what's neat about this is that when an application wants to execute a given view, let's say for story seven, that gets sent as a request to Noria, which receives it and looks into its materialized view, into the index for story seven, and it might find that that entry is missing, indicated here by a hollow green box. Now, when this happens, some mechanism needs to be in place to fill that result, that subset of the view, so that we can respond to the application. The way we do this is using an up query, and an up query is, as the name implies, a query that goes up the data flow graph, in this case immediately to the join operator. You can think of an up query as a request for a summary of all the relevant past deltas to be retransmitted through the data flow.
You can think of it sort of as the downstream saying: tell me about seven again, because I forgot all about it. Now, the join in this case is stateless; it doesn't have a way to send such a summary. And so, in order to support this, up queries can also recurse. In this case, the join might choose to forward this up query to one of its ancestors, which might be stateful. Let's say it decides to forward that up query to the count operator on the right. You can think of this as asking: can you tell me about seven, so that I can tell the downstream about seven? Now, a quick aside here: we've only talked so far about state in tables and in views, but there is other state too. Remember how I mentioned that the count takes a plus one vote and turns it into a negative delta for the old count and a positive delta for the new count? Well, in order to do that, you need to remember that the current count was 42. The way it does this is it also keeps its own little materialized view internally; it's a materialized view of its past output. And so this might actually be state that is present in the data flow and that can be used to satisfy these up queries. In this case, it might be that story seven is known in that materialized state, and so we don't need the up query to recurse all the way back to the individual votes in the votes table. When the aggregation responds to the up query, it sends that response directly through the existing data flow. And this is a key point: there's no need for special operators or different forward and backward query processing at all in this design. Instead, the response that gets sent through the data flow is just a single message that holds the current state of the source for the requested key, and it's processed just like any other data flow update.
When that response arrives at the join, the join does the same thing it would have done for any other update: it looks up into stories, patches together the output, and then forwards that back to the materialized view. And that materialized view, when it receives this up query response, takes it and uses it to fill in the hollow box, because this represents the answer for seven. At this point we now have enough state to respond to the application query for seven, so we can just send the response back. What's nice about this is that the state has now been populated. So if we get later queries for the same value, seven, we can serve them directly from the same materialized view, because that state is no longer missing. What all of this enables, and the key feature of partial state, is that at some point later, if story seven falls out of favor, becomes old, we can evict that entry. In fact, we can evict any state we want internally in the system related to that key. So for example, we might choose, a year later, to evict story seven from both the aggregation state and also from the materialized view. And this saves memory that can then be used to materialize other results that are popular. I mentioned also that we might waste compute keeping materialized view results up to date that are no longer being accessed, and eviction and missing state allow us to work around this. Imagine that after evicting seven, some change happens to the stories table where, say, the title of story seven changes. This is gonna introduce a delta in the data flow graph that changes the story. And when that arrives at the join, and the join does a lookup into the state of the aggregation to figure out what the count is, it's gonna encounter missing state. And when it does, we can safely discard that update to the story.
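Putting the pieces together, here is a toy Python sketch of partial state in a single view: a read that misses triggers an up query that fills the entry, eviction forgets it again, and updates for missing keys are simply discarded. `upquery` here is a hypothetical stand-in for recomputing one key's result through the data flow, and the discard rule is the simplified version described above:

```python
view = {}                      # story_id -> row; absent keys are "missing"

def upquery(story_id):
    # Hypothetical stand-in for recomputing one key via the dataflow.
    return {"id": story_id, "vcount": 43}

def read(story_id):
    if story_id not in view:   # missing: fill on demand with an up query
        view[story_id] = upquery(story_id)
    return view[story_id]

def apply_update(story_id, row):
    if story_id not in view:   # missing downstream as well: safe to discard
        return
    view[story_id] = row

def evict(story_id):
    view.pop(story_id, None)   # forget the entry, freeing its memory

assert read(7)["vcount"] == 43             # miss -> up query -> filled
evict(7)
apply_update(7, {"id": 7, "vcount": 44})   # discarded: 7 is missing
assert 7 not in view                       # no work spent on unread keys
```

The last two lines capture the compute-saving argument: once seven has been evicted, updates for it cost nothing, and correctness is preserved because the next read re-runs the up query anyway.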
The reasons for this are a little subtle, and the thesis goes into more detail, but essentially, if you observe missing state for a given key in a sibling in the graph, it generally also means that that state is missing downstream. And therefore there's nothing to update: there is no state for seven downstream, so we don't need to update it and can discard this update entirely. At this point I've talked enough about materialized views and data flow and partial state that I think it's time for an intermission where we talk a little bit about related work. And there's been a lot of related work in this general area. The first, of course, is materialized view maintenance. Materialized views in general have traditionally been used for a slightly different workload than what I'm going for with this thesis work. In general, materialized views are used for something like an analyst that comes in and checks results occasionally. It would be too slow to execute the complex query the analyst wants to run those few times they come in, and therefore, over time, we just keep this materialized view up to date. Then, when the analyst comes in, they can just open the view and it opens immediately. And so the focus has been more on the maintenance than the reads, because the reads are infrequent. Think of it as: there's a high frequency and volume of writes, and so we want to make sure that the maintenance is as efficient as possible, but it's okay if the read has to do a bunch of work when the analyst sits down, as long as it's much less work than executing the query from scratch. These systems also generally have little or no support for on-demand queries. The queries are often compiled into a program or something similar, and it's not really built for the kind of dynamic query setting that we often see in something like a web application. And for similar reasons, these systems rarely have support for eviction, because it's not really needed. The analyst query is what it is.
That is the view, and there's no subset of the view that's more important than the others. Another area of related work is automated caching systems. We've seen a number of these come out of both industry and academia. Unfortunately, especially the ones out of industry, these tend to be very tailored for a particular purpose. They're not really general purpose things that you can plug and play into your own application. What usually happens is some large company has a particular caching problem, they build a solution that works for them, but you can't just spin that up in your Ruby on Rails application. These systems also often only support invalidation and not incremental updates. They just focus on evicting things from the cache that are related, and not on updating them in place, which has the downside that now you might miss a lot and have to go to the backend more often than is necessary. Furthermore, these systems are often limited to specific database interactions. They require that you go through a framework or an ORM, as opposed to what we're targeting here, which is general purpose SQL, where you just write SQL queries, you write views, and they just work, similar to parameterized prepared statements in SQL. And finally, there's been a lot of work on data flow and stream processing systems. These systems are also usually focused on write performance, similar to materialized views: there's a data pipeline, and you want to perform all these ongoing computations over that data pipeline. These tend to focus on strong consistency, which usually comes at the cost of read latency. The reads generally have to coordinate with the data pipeline somehow; often the reads even have to go through the write processing path in order to give consistent results. Whereas with Noria, we can give much faster reads by leveraging eventual consistency, by leveraging the fact that the results are allowed to be stale.
These data flow systems and stream processing systems also tend to have limited support for on-demand compute and eviction. Here too, you set up some queries and they run for a while, and you can't evict partial results from a given materialized view. If you wanted to, you would sort of have to terminate the process, or something along those lines. So now that I've talked to you for a bit about what the system does, you might wonder: well, are we done? Is everything you told me everything I need to know, and I can just go run this in my application? Unfortunately, that's not the case, although if it were, this work wouldn't be very valuable, and so I'm kind of glad that there are more challenges, because it turns out that in practice things are hard. In particular, we need to ensure that data changes to the tables take effect exactly once. For example, if you do an insert to a base table, well, that insert has to happen. The row you insert can't just vanish. And if the database applied the insert multiple times, so the table now contains that row multiple times, that would also not be great. Now, this might strike you as weird if you come from a traditional database world, because there the only way that could really happen is through a bug. The database goes through and looks at indices and scans tables and such, and would only encounter any given row once. But in this model, it's a little bit harder, and the reason is twofold. First, up queries are summaries of past state, things like all the state, or all the deltas, in the past for story seven. But up queries can happen concurrently with updates that flow through the data flow graph, and those updates might be for the same state. Imagine, for example, that there's an up query for story seven at the same time as story seven is being updated. We need to make sure that these don't conflict with each other and end up violating this exactly once rule.
Similarly, we want the ability to discard updates early, to not maintain state unnecessarily. However, if we discard things erroneously, if we encounter missing state but there is downstream state that actually depended on the update that's flowing through the graph, then that would be really unfortunate: that result will be permanently stale. There are a lot of hazards that can cause these kinds of problems, and the thesis goes through a fair number of them, but in this talk I'm gonna focus on one, just because of limited time. The one we're gonna focus on is incongruent join evictions, and if you don't know what these are, it's not that weird, because it's a term that I made up, but hopefully you're gonna see in a little bit why I think the name makes sense. So let's start with: what is an incongruent join? What I'm showing you here is another query and data flow side by side. It's a different query and data flow from the ones we looked at before, even though they look kind of similar. Here, we have a stories table like before, and then we have a users table. The idea is that the stories table has an author ID column, and in order to display a given story, we probably want the author's name rather than just their user ID, so we have a separate table that holds the user IDs and various information about each user, such as their name. And so in this stories with author view, we're doing a join between stories and users, which ends up combining the results such that the output rows actually contain the author's name, pulled from the users table. What makes this join incongruent is the fact that the query key is different from the join key. The query key here is the story ID, whereas the join key is the author ID of the story; that's what we end up looking into users with. Now, in general, this is not a problem. Let's first work through the correct case of what happens when an up query occurs.
Well, let's say that someone queries this view, say for story seven. An up query goes to the join, the join is stateless, it forwards to the stories table. The stories table looks up the state for seven and sends back a message along the data flow saying: here's story seven, the author is 42. It might include some other columns too, but let's just consider these for now. The join then dutifully does the lookup into the users table, wanting the user information for user 42. And let's say it finds that the name for user 42 is Lena. In that case, it takes that information, stitches it together with the update that flowed in as a response to the up query, and then forwards the response downstream to the materialized view to fill the hollow box, the missing state. So this includes seven, the various columns from the stories table, and the author name fetched from users. So far, so good. But now recall this figure, where if we encounter missing state in a sibling, we end up just discarding the update, assuming that there's no state downstream we might have to update. This ends up causing us a problem in this particular case. Let's consider what happens if the author changes. This would be represented in the data flow as two deltas: one that removes the story with the old author, and one that adds the story with the new author. These two deltas flow down through the data flow graph, and when they arrive at the join, the join needs to do the lookups as before. So it first does a lookup for 42, it again finds the state for Lena, and populates the author name column for that update in preparation for sending it to the materialized view. And then it needs to do a lookup for author 43, to do the same for the positive delta. However, what if that lookup misses? The lookup misses in the users table, and now we don't know what the author's name is.
Now, you might wonder how this could happen if users is a table, and there are two answers I can give you. One is, you could imagine that the users table is only partially stored in memory, and 43 would have to be fetched from disk. Now, Noria doesn't actually support this notion of keeping materialized state partly in memory and partly on disk, but in practice there are other ways this can occur. For example, here I've given you a simplified view where users is just a table; you can imagine that users is actually a view in and of itself, with some large data flow upstream, and then there could totally be results in users that are missing. Regardless of how this happens, the join now has a problem. It needs to complete the processing of that update, but it doesn't have the state that's required to do so. So what do we do? We can't produce the needed update. We cannot forward just the negative and discard the positive, because if we did, the downstream materialized view would now have no rows for story seven, and any subsequent query for story seven would get no results, which is obviously not correct. We also can't just drop the update altogether, because if we did, any subsequent read for seven would see the old author name rather than the new one, and this would also be a problem. So what do we do? Well, your first instinct might be that we can just fill the missing state, right? We just have the users table do an up query or something, fill in the state for 43, and then everything is great. And while this seems like an attractive solution, in practice it doesn't work so well, because we would have to hold up the data flow at the join until that up query finishes, and that might potentially take a long time. Remember, the users table might be a view that has a huge data flow above it, and satisfying that up query might take forever. Well, at least a very long time.
And during that time, the join can't process any more updates, and we're stuck holding up the data flow, not processing more writes. It turns out this is even worse: there are cases, which I won't go through here, where this might end in deadlock, where the join can't process more updates until the upquery finishes, but the upquery cannot finish until the join processes more updates, and now we're stuck. You might think, well, can the join just process later updates to avoid this problem? But we can't process them out of order, because there might be later updates that depend on this one, and so we need to finish processing this one first. Okay, so that's not a viable solution. What do we do instead? Well, partial state actually gives us the mechanism to solve this problem, and that is evictions. It's true that we don't know what the author for story seven now is, but we can just communicate that downstream with an eviction: we evict story seven from the downstream materialized view. And now all Noria has to do, when a later query comes in for seven, is fill the required state through the normal upquery flow, which we already showed works just fine. And it turns out the system can actually detect the joins where this situation can arise and only send evictions in those cases where it's necessary; it doesn't have to do this for every join. So you might wonder now: does all of this work? It seems like there's a bunch of mechanism internally. Does it end up just killing performance? Is this even worthwhile? In order to evaluate that, we need a realistic test subject. And for this thesis, I chose Lobsters, which is a Hacker News-like news aggregator. Users submit stories, and other users can vote for and comment on those stories, look at top lists of the most popular stories, that kind of stuff. And I chose Lobsters for two reasons.
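The eviction-based way out can be sketched as follows. This is a hypothetical illustration, not Noria's code: on a miss the join drops the update and instead evicts the affected key downstream, and a later read repopulates it through the normal upquery path. The replacement author name "Cora" is invented here purely for the example.

```python
# Hypothetical sketch of eviction as the resolution to a join miss:
# rather than blocking the dataflow on an upquery, evict the entry we
# can no longer keep up to date, and let a later read repopulate it.

view = {7: {"id": 7, "author_name": "Lena"}}  # downstream materialized view

def on_join_miss(story_id, downstream):
    # Evict the stale entry instead of holding up update processing.
    downstream.pop(story_id, None)

def read(story_id, downstream, upquery):
    row = downstream.get(story_id)
    if row is None:
        row = upquery(story_id)   # fill missing state on demand
        downstream[story_id] = row
    return row

on_join_miss(7, view)
assert 7 not in view  # the stale row is gone rather than wrong

# A later read triggers an upquery (here a stand-in lambda) and fills
# the fresh value back in.
row = read(7, view, lambda k: {"id": k, "author_name": "Cora"})
print(row["author_name"])
```

The key property is that the view is never wrong, only missing, and missing state is exactly what the upquery machinery already knows how to repair.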
One is that it's open source, which allows us to see the queries that are issued for different page requests. And the second is that the data statistics for Lobsters are available: things like how many requests come in per second, how those requests are distributed across different pages, and which users and stories are most popular and most active. Ultimately, all of this data allowed me to build a workload generator for Lobsters that can synthesize Lobsters-like requests. The reason we want a generator here is so that we can change the load on the system artificially and see whether the system keeps up. The generator also has a pluggable backend, so that we can choose to run it against MySQL or against Noria and see how they perform. It also lets us bypass Ruby on Rails, so that we can benchmark the true backend performance rather than the language that's serving the front end. So let's first look at how MySQL does. This is MySQL running on Amazon EC2, on a 16-core instance. And this is showing the throughput that the workload generator can get to before the system falls over, that is, before all the cores on the machine are saturated and the latency starts to spike. So this gets to about 400 pages per second, and this is across all the different page types; it generates a mix of page requests. Now, I want to point out that this is already with a denormalized schema. The Lobsters developers have done a decent amount of work on their queries to make sure that things like the vote count, or their equivalent of it that they call story hotness, is actually a column of the stories table. So they don't generally have to do joins and aggregations in their queries, although their update paths in the application have to do a lot of work to make sure those values are kept up to date. Without this kind of manual denormalization, MySQL can't even keep up with a single page request per second.
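The shape of the workload generator described above can be sketched roughly like this. Everything here is illustrative: the page mix and weights are made up, whereas the real generator derives its distributions from Lobsters' published traffic statistics, and a real backend would issue MySQL or Noria queries instead of counting.

```python
# Hypothetical sketch of a pluggable-backend workload generator. The
# page types and weights below are invented; the real generator models
# Lobsters' actual request distributions.

import random

PAGE_MIX = {"frontpage": 0.55, "story": 0.30, "submit_vote": 0.15}

class CountingBackend:
    """Stand-in backend; a real one would run MySQL or Noria queries."""
    def __init__(self):
        self.counts = {}

    def handle(self, page):
        self.counts[page] = self.counts.get(page, 0) + 1

def generate(backend, n_requests, seed=0):
    rng = random.Random(seed)  # seeded for reproducible benchmark runs
    pages = list(PAGE_MIX)
    weights = list(PAGE_MIX.values())
    for page in rng.choices(pages, weights=weights, k=n_requests):
        backend.handle(page)

backend = CountingBackend()
generate(backend, 1000)
print(backend.counts)
```

Swapping the backend object is what lets the same request stream be replayed against MySQL, against Noria, or against anything else with the same interface.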
Now let's contrast that with running Noria without partial state. Noria without partial state already does significantly better; this is about an order of magnitude improvement. And this shows you the power of materialized views, especially in a case like Lobsters where reads are the majority: all those reads don't have to do joins, they're essentially all just key-value lookups. Here, however, we run into another bottleneck. We're not CPU bound; we are memory bound. When Noria without partial gets to about four and a half thousand pages per second, it runs out of the 128 gigabytes of memory on this machine. And in fact, it's a little worse than that, because the memory use tends to increase over time, so if you ran the benchmark for longer, it might not even keep up with this load for longer periods of time. And this is already with some amount of optimization within Noria to make sure that we only materialize requested values; if no one ever asks for a particular query, we don't materialize results for it. And now let's look at what happens if we run Noria with partial state. It does significantly better: about 67% more than Noria without partial, and about 18 times MySQL. And here there's actually a little bit of room to grow. This benchmark falls over not because of saturating all the cores, and not because of memory; it falls over because of processing upqueries. It turns out that there's a particularly update-heavy path through the data flow that currently bottlenecks Noria, and this might be somewhere where additional optimizations could help. And we can see how Noria with partial state gets to this higher performance number by looking at the memory use. So here I'm showing you the memory use in gigabytes for Noria without partial and Noria with partial, at the capacity just below where Noria without partial falls over.
Noria without partial uses over a hundred gigabytes of memory, and Noria with partial state uses only a third of that. And you might say, well, John, forget about MySQL, no one does that in practice. If I implemented caching myself, is Noria really going to keep up with that? Is Noria really a serious competitor for that? And this is a really hard question to answer, because it depends on the implementation, it depends on the application in question, it depends on the workload. So instead, what I did was construct a benchmark that is an idealized caching setup: one where there are no evictions, there are no misses, almost all requests are reads, and almost all requests turn into a single key lookup. And there's only a single query in this benchmark; this is not all of Lobsters, but instead just the stories and vote count query I showed you in the very beginning. And I ran this against Redis, which is a popular key-value store that's often used to back things like caches. In this idealized workload, Redis gets to just under a million requests per second, which might sound pretty good. However, Redis is single threaded; it can only run on one core. And so if we really wanted to compare these, we have to assume that someone implemented perfect scalability across cores for Redis. So here, what I'm going to show you is 16 times that number. The number on the right here is not a real benchmark number; it's just 16 times the number on the left. Think of it as a theoretical maximum for what you could get with Redis on this machine. And then I ran the same benchmark against Noria. Now, before I show you the result, remember that Noria provides SQL and automatic maintenance of this cache, and it doesn't require that everything be implemented in the application logic; Noria does it automatically for you. And Noria here gets pretty close to this theoretical maximum for Redis.
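For context, the hand-rolled logic that a Redis-backed application typically implements, and that Noria subsumes, is the classic cache-aside pattern. Here is a hedged sketch with a plain dict standing in for Redis and for the database; all names are invented for illustration. The application must both populate the cache on misses and remember to invalidate every affected key on writes.

```python
# Hypothetical sketch of hand-rolled cache-aside logic: the application
# populates the cache on misses *and* must invalidate affected keys on
# writes. A dict stands in for Redis; another stands in for the database.

cache = {}
database = {7: {"id": 7, "title": "Example story", "votes": 3}}

def get_story(story_id):
    if story_id in cache:            # hit: a single key lookup
        return cache[story_id]
    row = database[story_id]         # miss: fall back to the database
    cache[story_id] = row
    return row

def cast_vote(story_id):
    database[story_id]["votes"] += 1
    cache.pop(story_id, None)        # manual invalidation, easy to forget

get_story(7)
assert 7 in cache                    # populated on the first read
cast_vote(7)
assert 7 not in cache                # the write invalidated the entry
print(get_story(7)["votes"])  # → 4
```

Forgetting the invalidation in `cast_vote` silently serves stale vote counts, which is exactly the class of ad hoc bug that automatic view maintenance is meant to eliminate.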
Noria uses as many cores as it needs to satisfy reads, and it handles all the eviction and cache updating for you. And yet it gets within a factor of two of what you could get with the theoretical Redis deployment on this machine. Now, I've been talking for a while, so I'm going to start to wrap things up. I want to start by talking a little bit about future work, because Noria is neither perfect nor complete. There are a number of things that are missing, both from Noria without partial state and from partial state itself. For example, there's no support currently for things like range queries, cursors, and timed window operators that might be useful for applications, and this is somewhere where there's room for innovation, both within Noria's data flow and within partial state. It would also be really nice if there were a way to integrate Noria with an upstream database. Imagine that the tables in Noria were not stored in Noria itself, but were stored in Postgres or MySQL or some other source-of-truth database that the user or application developer is already running. Similarly, there's an attractive prospect here: because Noria uses data flow to manage all the updates to materialized views, there's no real reason it has to stop at the materialized views. Imagine that the application developer also has client applications running on devices that users are holding. It might be that those are showing subsets of views that are on the server, and you could imagine that the deltas are allowed to flow all the way to the end-user device and update the views there in place. This is sort of reactive programming, or reactive applications, which are becoming a bit of a trend in web application development. And finally, fault tolerance is something that Noria has a somewhat weak story for, but partial state might be able to help here.
The idea would be that if a given node goes away, we can just mark the state that that node held as missing, and then have the partial state and upquery mechanisms take on the role of repopulating that state. Now, before I end, I want to acknowledge the influence of a couple of people on this work. And I want to start with the committee. Robert and Frans have, over the years, been continuously pushing me to build a better story for Noria and for partial state, and I think this is one of the reasons why the work is where it is today. I think initially I had these sort of hazy ideas for something that might be cool, and the two of them have really helped hone the story, hone the argument for why this is useful and what the work should focus on. This project wouldn't be anything like what it is today without their invaluable input. Malte as well was a postdoc in my group during many of the years of Noria's development. Without Malte, Noria wouldn't have SQL, it wouldn't have query optimization; it would essentially just be a data flow system. His contributions have been invaluable, and without him, this work would not be in its current state. I'm also very glad to have Sam on my committee so that I can draw on some of his database experience, which is something he has a lot more of than us mere systems people. I also want to thank the people of the Parallel and Distributed Operating Systems group at MIT. They have this endless curiosity, insight, and support that I think has made me the researcher I am today, and I don't think I would be here without having them around me throughout the years. I also want to thank my family, my parents and step-parents, who have always encouraged me to pursue my interests no matter how geeky they might be.
And I think they've sort of been nudging me along all the way until I got where I am today, where I'm now giving a very geeky presentation, and I think they'd be happy with that. I also want to thank my girlfriend, Talia. I'm amazed that she's tolerated me locking myself in my room for hours on end, working on this work, on the evaluation, on the thesis, on this presentation. I am so grateful and happy to have her, and I love you. To conclude, my thesis enables materialized views to be used as caches. It does so by allowing state to be missing from materializations and by using upqueries to populate missing state on demand. The resulting system provides automated caching for SQL queries and reduces the need for complex ad hoc caching logic. Thank you all for listening, and please ask any questions you might have.