Folks, welcome back. This is gonna be a little bit of a different stream than the ones that I usually do, because normally I do these like Rust teaching streams, usually live coding, and they're focused more on like Rust code. In this stream, I want to focus more on my thesis work, which happens to be implemented in Rust, but this is not gonna be a Rust stream. In some sense, this is more gonna be like an academia stream. In particular, I'm currently writing my PhD thesis, and as a part of that, it's just been a really interesting process to go through of like, how do you write a PhD thesis and what kind of things do you need to think about? And my guess is there might be multiple of these streams because there are many facets to a thesis that might be interesting to work on. In particular, what I wanna focus on this time around is, let me pull this up to the start, the evaluation section. This is the first part of my thesis that I'm writing. I have some text for the other sections, but the evaluation is the one that I've been like iterating on over and over for the past few weeks. And I wanted to take some time to basically work through my eval section and talk a little bit about why it is the way it is, why it has the graphs that it does, what it's trying to say, how it's trying to say it, and how I arrived at its current form, which I still think is not its final form. I'm gonna do this relatively informally. I'm not gonna be like reading my eval section out loud; I'm more gonna try to give you an argument about my work and try to present the argument that the section is making. And then I really would just wanna hear what you think about it.
I also want to approach this a little bit like a sort of Q and A type thing, where if you have questions about, I mean, the eval section, of course, or about my thesis, or just about like writing a thesis in general, or about PhDs in general, any of these questions I'm happy to try to take over the course of the stream. I'm gonna zoom in once we actually get closer so that you can read the text and such. The text won't be too important here; I'll focus more on the figures as we go later. So someone just asked a question of whether this is a practice for my presentation, and this won't really be that. Like, this won't be a thesis presentation. I'm gonna talk relatively little about what my thesis work is. I'm more gonna talk about the argument for why it's a good thing. But of course I will need to do some setup to like just explain what my work is at all. I'm gonna do like the brief intro I sometimes do, just because it's useful for people who may not know who I am outside of this particular stream. So I do a lot of different streams, usually focused on the Rust programming language. I upload all the videos, and we'll also upload this one, to my YouTube channel. You can also follow me on Twitter if you want to sort of hear about some of the work that I'm working on, or if you just wanna stay up to date about new streams that come out. Sometimes I do polls and stuff here as well about what stream topics I'm gonna cover next. And in fact, this stream came about because I tweeted out like, would anyone be interested in looking at sort of a walkthrough of a thesis evaluation? And there was a lot of interest, so that's why we're doing this. So my thesis work revolves around a project called Noria that I've worked on now for a little over five years. And at a high level, Noria is a database. It's a relational database, sort of like MySQL or Postgres or Oracle DB. Very similar in terms of how you as an application should think about using it.
But Noria is trying to solve the problem that traditional databases are really slow for workloads that are pretty common. In particular, you end up in the situation where you have your database and you do lots of reads and you do some inserts and updates. But like reads, or like select queries, are the things you issue most frequently. But the database is built in such a way that the reads are relatively slow. Every read, every time you do a select, the database has to like plan the query, it has to execute the query by looking up all these base tables and doing joins and aggregations and whatnot. And this means your reads are slow even if the results haven't changed since the last time you issued the same query, for example, so the performance of your application ends up not being as fast as you want it to be. And the solution that many people have to this is you stick a cache in front of the database. You use memcached or redis or something like that, and then you have your application check the cache and hopefully hit in the cache, in which case the result arrives immediately, or it misses in the cache, goes to the database, and then fills the cache. This is a fairly standard setup that many people use to boost their database performance. And it turns out, and this is part of the argument we're gonna be making in the eval section, it turns out it's just really hard to get caching right. If you've tried to implement caching yourself, you might already have realized this. And the observation behind Noria is, why doesn't the database just solve this problem for you, right? Like, the database already knows all your queries and how they're arranged, so why can't it maintain the cache? And that is what Noria does.
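By the way, if it helps to see that cache-aside pattern in code, here's a tiny Rust sketch of what applications typically do by hand. This is purely illustrative, not Noria's API or anything real: a HashMap stands in for memcached/redis, and another HashMap stands in for the database.

```rust
use std::collections::HashMap;

// Stand-in for the real database: maps article IDs to vote counts.
// In a real system this would plan and execute a SQL query (the slow path).
fn query_database(db: &HashMap<u64, u64>, article_id: u64) -> u64 {
    *db.get(&article_id).unwrap_or(&0)
}

// The classic cache-aside lookup: check the cache first, fall back to the
// database on a miss, and fill the cache with the result for next time.
fn read_with_cache(
    cache: &mut HashMap<u64, u64>,
    db: &HashMap<u64, u64>,
    article_id: u64,
) -> u64 {
    if let Some(&count) = cache.get(&article_id) {
        return count; // cache hit: no database work at all
    }
    let count = query_database(db, article_id); // cache miss: slow path
    cache.insert(article_id, count); // fill the cache
    count
}

fn main() {
    let db: HashMap<u64, u64> = [(7, 4)].into_iter().collect();
    let mut cache = HashMap::new();
    assert_eq!(read_with_cache(&mut cache, &db, 7), 4); // miss, then fill
    assert_eq!(read_with_cache(&mut cache, &db, 7), 4); // hit
}
```

Notice that nothing here invalidates or updates the cache when the database changes; that's exactly the part that's really hard to get right by hand, and the part Noria takes over.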
I'm gonna pause here before we dive into the sort of evaluation, just to get questions about what I've mentioned so far, of like the setup of the project, what I'm roughly working on, and also just questions about what this particular stream is gonna be. Let's see here. Does GraphQL help with that? So GraphQL, I've gotten this question a couple of times actually, of like, why SQL? Like, why is Noria a SQL database? Does that make any sense? Because we see all these like NoSQL databases or GraphQL or datalog based databases. And these definitely have their uses, right? Like, you want a query language that is well suited for the task at hand. And that's not always SQL. But that said, SQL is by far the most common query language in use today. And when I initially set out to build Noria, it was sort of with the idea that I wanted to make things better for application developers. And realistically, they're not gonna rewrite their whole application. Like, if I told GitHub, you're using SQL today, stop using SQL and use Noria instead, they're just not gonna do it if they have to change their query language. The only reason to use something like GraphQL, or to make up your own query language, would be if you think it gives you a significant benefit in terms of the system that you implement. And at least it wasn't obvious to me at the time that a different query language would have given additional advantages, or given additional insights into the way the application functions or the things that the application is interested in, beyond what SQL gives you. And so it's sort of like, SQL is the default, so let's make the default better. The two hard problems of computer science, caching and naming things. Yeah, caching and concurrency, cache invalidation and naming things. It is definitely very common for people to run into problems with caching.
And part of it is because it's just really hard to get it right, because you now have two systems where one is sort of a mirror of the other, but you need to keep these mirrors in sync. And we'll see a little bit about that argument in the eval section. Is there a team working on this or just me? There are a bunch of different students at MIT, and at other universities actually, who have worked on Noria over the years. The primary developers for much of Noria's lifetime have been me and Malte Schwarzkopf, who's now an assistant professor at Brown. And we've been like the main people behind the project. And then of course my various advisors, or the professors that I work with. Over the years, we've had many other master's students and undergrads and PhD students who have sort of gone on and off the project. And so in that sense, it's very much sort of a team effort. But the core people behind it are Malte and myself. And we've all worked on somewhat different aspects of Noria. Noria's a pretty big project and it would be hard to do it all by yourself. And so my thesis is specifically on one aspect of Noria, which I'll get into in a second. Don't Postgres and MySQL already do query caching? So they don't really; they cache things like query plans. They don't really cache results. And where there has been support for actual query result caching, usually it just gets invalidated wholesale. Like, if a write happens to the table, the whole query cache gets invalidated. Which of course is really bad if popular keys keep being updated, for example. Is it similar to Google Spanner? No, it's very different internally. How does Noria differ from row or column-based databases? How are you getting your efficiencies? So Noria primarily gains its efficiency from caching results. It caches the full result sets for queries. And that means your results can just be fetched directly from cache. And it keeps the cache up to date as you go.
So if some data in one of your base tables changes, Noria will actually compute how that changes every cached result, rather than just invalidate them. And it can do this because it knows all the different application queries that are currently installed in the system. Is a cache miss costlier than a database read? I mean, a cache miss requires that you do a database read, so yes, it must be costlier. Is Noria ACID? So ACID is a term used to describe certain databases: it stands for atomicity, consistency, isolation, and durability. Noria provides some of those, but it certainly has much weaker consistency guarantees than traditional databases. If your project specializes in read heavy workloads, how does it hold up and fare in write heavy situations? This is another thing that we'll look at in the eval section a little bit. Basically, the project is entirely targeted at read heavy applications, but obviously there has to be some amount of writes, otherwise the system is sort of uninteresting. And we'll look at how that works out in a while. Let's see here. Is Noria named for the water wheel? Yes, the idea, this came from Malte actually. The name Noria comes from, like, it's using the flow of data to bring something to a higher level. It's a very indirect name, but as many academic names are. At what layer of the database is the caching mechanism implemented? So Noria is implemented as a very different type of database. Internally, it looks nothing like MySQL or Postgres. Internally, Noria is a data flow system. Essentially what it does is it constructs a data flow program that keeps every single cached query result up to date. So as data sort of flows into the base tables, changes flow through this graph of data flow operators that ultimately compute the changes to the materialized views, as they're called, the cached query results at the bottom. What is your hope with Noria? Are you looking for wide scale adoption? It would be really cool.
Noria is not currently production ready. It is very much a research prototype, but my hope would be that if not Noria, the Noria implementation specifically, then something that provides this would actually see adoption, because I think it's a really valuable thing, as you'll see from the eval later on as well. How does differential data flow from MSR fit in there? Differential data flow is a little bit different in a couple of ways. One big difference with differential data flow and timely data flow, if you're familiar with those, is that they target much more sort of batch computation and much higher consistency than Noria does. Like, they're built much more for write pipelines than for read pipelines. And they target computations where the consistency of your results is vital to the application's correctness. And so this is things like, if you have loops in the data flow graph, you wanna make sure that you always compute the right result. If you do a read, you wanna make sure that it represents strictly all the writes that happened before and none of the ones that happened after. Noria does not quite give the same guarantees, but what Noria does give, that differential data flow does not, is that for Noria, reads and writes are separate things. The reads do not go through the data flow at all; instead they just read from the cache. And then they have a way to sort of poke the system to fill in things if they miss in cache. Do you believe your approach is the future? I think it's a good idea for some applications. I don't believe in this like, there's one solution that fits everyone; that seems unlikely. Is the cache always in memory? Yeah, Noria does always cache everything in memory. And this is gonna be an important point: the memory use of Noria. Because Noria caches the results of every query you give it, that obviously means that your memory use is gonna balloon, and we'll talk a bit about that later.
What happens if somebody were to put this into production? They'd probably be making a mistake. Just because, like, I mean, it depends on what they expect from the system, but Noria has only been tested in the context of academic research. And so there's a bunch of properties that you sort of want from a production ready system that Noria can't really guarantee. For example, Noria does not guarantee that it won't crash if it gets a bad query, which is never what you want in a production system. And fixing this shouldn't be too hard, but it just requires a bunch of engineering effort that isn't really warranted in a research prototype. Is it scalable? We'll get to that when we get to the eval. How will it cache something like a random value? So Noria is based only on relational algebra, so all of the operators are deterministic and commutative. So there's no random value in that sense. Like, if you're thinking of MySQL's RAND, Noria doesn't have that operator. In theory there might be a way to implement it, but that's not something we have. Is Noria multithreaded? Yes. Is there a built in way to do subscriptions? Sort of. Like, because Noria is a data flow system, any changes to the base tables are basically gonna propagate as sort of incremental deltas down towards the views at the bottom. And ultimately you could feed those to the application as well. I have no opinions on Clojure or Datomic. Is the long-term goal to productify Noria, or is it meant to be a proof of concept for others to reimplement for production? I don't know yet. It sort of depends where the future goes. I want this to exist as a real thing. The path there I don't quite know. I'm also in a position where I've worked on this for six years now and I certainly need a break from it. Not because I don't think it's a good idea, because I really do, but more just like my brain has been thinking of nothing but this for six years and I need a break. Is it distributed?
Yes, it is distributed. Did you try out any other distributed systems ideas for your thesis? I did work on some other projects initially that didn't pan out, but this is what I've worked on for a very long time. So someone asked, is this your thesis? And Noria itself is not my thesis. I'm focusing on a particular part of it, because Noria is a fairly big beast and there are a lot of sort of different corners of it that you could write a thesis about. And actually that's a good segue, so I'm gonna do one or two more questions and then we'll dive into the actual thesis part of this. What's happening on the chart at 5,000 views per second? Was that a crash? It was not a crash. Well, I'll explain this kind of graph a little bit later when we get to the eval. Can it be implemented for backend systems written in Go, Node, and Java? Sure, yeah, you can use this as a normal database. Can you also explain in brief what Dataflow is? Yes, so Dataflow is a pretty poorly defined term, but it basically means that instead of having code sort of fetch data and then operate on it, the data flows through, think of it as a sort of tree or graph where data comes in at the top. Let's imagine the top of the graph, and then it flows through the graph along the edges, and every node in the graph does some kind of processing on that data, and then ultimately the leaves of the graph are the ones that see the result of the computation. That's a very high level description of Dataflow. Noria was written in Rust from the beginning, yep. Is this something like Kafka Streams? You can think of Dataflow a little bit like Kafka Streams, or alternatively you can think of Kafka as implementing one type of Dataflow, but Noria has nothing to do with Kafka. If writes would only return once the cache is updated, would Noria provide ACID? No.
The details for why the answer is no to that are a little bit outside the scope of this particular stream, but we might get into it if we do another thesis stream on the design of Noria, for example. There is no master-master or master-slave in Noria's distribution. It's not a primary backup type setup. It is completely distributed in the sense that every operator of the Dataflow graph can be on a different machine. And yes, I think Rust was the right choice for this, and no, it can sadly not make toast. Okay, so let's try to actually dig into what part of this my thesis is about. So one thing we talked about was how we're in this world where we want to cache results so that we can get the results faster, but caching results costs you memory. Every time you cache the result of a query you need to store it somewhere, and if you're not storing it on disk, which would be fairly slow, you're storing it in memory, and there's a limited amount of memory. And so one of the key contributions of Noria is that it supports something we've called partially stateful Dataflow, or just partial state. And the idea behind partial state is that you teach the Dataflow how to compute results after the fact. So normally in a Dataflow system, every time new data arrives, it flows through the whole graph, the whole Dataflow program, and then it updates whatever leaf nodes have state, and those are the ones you read from. And this works pretty well, but it does mean that you can never throw anything away. So with Noria, we teach it something called an up query. The idea behind an up query is basically that you have a way in the Dataflow graph to send a query up the graph. So normally in Dataflow the data flows down; an up query flows up the graph, and it's basically a way of asking your ancestor to resend data it sent in the past. And it turns out that this is actually sufficient to give you the ability to evict.
It gives you the ability to only materialize, to only keep, the parts of the cache that are important to you. So you don't need to cache the results for a query that was issued 10 days ago, and you also don't need to maintain the number of likes for some tweet that no one has looked at for 10 years. And so this really lets you bring down the memory use a lot, because suddenly now you only need to cache the things that are actively being accessed by the application right now, or in this short period of time, rather than any query that's ever happened. And so that is what my thesis is about. It's specifically about this support for up queries. And it turns out that that little mechanism, the up query, is sufficient to give you full eviction support, even internally in the graph. So imagine that you have this Dataflow program that computes lots of SQL queries. Every node in that graph is basically one operator in your SQL query. So if you do something like join A with B, that join will be an operator, a node in that graph. If you do something like count votes, then the count operator is gonna be one node in that graph. And some nodes have to be stateful, right? So the count operator, if it's told that there's now one more vote for article seven, then it needs to know what the previous vote count was in order to be able to say what the current vote count is to whoever is below it. And this means that that node needs to have state. It needs to have memory. And partial state actually works not just at the leaves, which hold the full query results, but all the way through the graph. So if you have a count over votes that's grouping by article ID, for example, it doesn't need to store the count for every single article that is in the database. It only needs to store the counts for the articles that have been asked for recently. That was a lot of sort of technical detail in a short amount of time. So let's do questions about partial state and stateful data flow.
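To make that stateful count operator a bit more concrete, here's a toy Rust sketch of the idea. To be clear, this is not Noria's actual code or its real operator API, just an illustration I made up: a count operator that keeps per-article state and, when a vote arrives, emits a delta downstream that retracts the old count and inserts the new one.

```rust
use std::collections::HashMap;

// The deltas a count operator forwards to its children: retract the old
// count row, insert the new one.
#[derive(Debug, PartialEq)]
enum Delta {
    Retract(u64, i64), // (article_id, old count) flows down as a negative
    Insert(u64, i64),  // (article_id, new count) flows down as a positive
}

// A toy count operator, grouped by article ID. It must be stateful:
// to say what the new count is, it has to remember the old one.
#[derive(Default)]
struct CountOperator {
    counts: HashMap<u64, i64>, // article_id -> current vote count
}

impl CountOperator {
    // Process one incoming vote and produce the downstream deltas.
    fn on_vote(&mut self, article_id: u64) -> Vec<Delta> {
        let count = self.counts.entry(article_id).or_insert(0);
        let old = *count;
        *count += 1;
        let mut out = Vec::new();
        if old > 0 {
            out.push(Delta::Retract(article_id, old));
        }
        out.push(Delta::Insert(article_id, old + 1));
        out
    }
}

fn main() {
    let mut op = CountOperator::default();
    // first vote for article 7: just an insert of count 1
    assert_eq!(op.on_vote(7), vec![Delta::Insert(7, 1)]);
    // three more votes bring the stored count to 4
    op.on_vote(7);
    op.on_vote(7);
    op.on_vote(7);
    // fifth vote: retract the old count (4), insert the new one (5)
    assert_eq!(op.on_vote(7), vec![Delta::Retract(7, 4), Delta::Insert(7, 5)]);
}
```

Partial state is then about letting that `counts` map hold entries only for the articles someone has actually asked about, instead of every article in the database.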
I realized that I'm running through a lot of fairly complicated database setup here, and it's because I'm trying to get to the eval section; a lot more of the sort of deeper design details I'll go through in another stream. And you basically only need to understand the high level of what I just explained in order to understand the eval section, is at least the hope. Let's see. It feels like an eventual read model composed through message passing of data changes between some objects. That's sort of true. So, well, kind of. Okay, so let me try to rephrase. Actually, let me try to draw this. That might help. All right, so we're in a position where, imagine that you have, I'm gonna go back to a classic example here, which is you have an article table, article, and you have a vote table, right? And let's imagine that what we wanna do is we want to fetch a given article along with its vote count. So we're gonna issue a query like select article.* and the count of vote from article. Sorry, my writing is terrible. Left join vote using article ID. And then we're gonna group by the article, oh, I didn't set up my pen correctly before the stream, my bad. We're gonna group by the article ID where the article ID, I'm just gonna shorten it, is some parameter to the query. So this is a question mark; this is a prepared statement. And what Noria is gonna do when you give it this SQL query is it's gonna look at the base tables and go, how can I set up a data flow program that computes the result for this query? And the way it's gonna do that is it's gonna create a count node here, a count operator over vote, that is sort of told to group by the article ID. Then it's gonna create a join node here, which is gonna be a left join between article and the vote count. And then below that it's gonna create the actual cache that the application ends up reading from if it issues a read to this query.
And notice that all of this is sort of keyed on the article ID. If I execute this query with article ID seven, then I'm gonna look into this cache for article ID seven. And then the system will give me only the results for this query with that index or with that article ID. And you can see from this setup what happens if, for example, a new vote comes in. A new vote comes in, it flows down the data flow along this edge, gets to the count operator. The count operator increments the count for article seven. So maybe it remembers that seven used to be four, like the count for article seven used to be four. And then it's gonna now update that to be five. And then it's gonna send along this line like a minus four and plus five for article ID seven. And then this is gonna join against articles, it's gonna do a look up into article. And then ultimately it's gonna forward an update down to the view that says how the view should change. And then of course the problem here is if we do this for every different article ID, then this view becomes very large, right? It becomes sort of, if we wanted to use that notation O of N, it stores every article, it stores one row for every article. Whereas in reality, like many articles, like basically any article that's more than like a week old is probably never gonna be looked at again. And so why are we continuously maintaining that state and keeping it in memory and spending memory on it? And what partial state lets you do is that instead of storing all of these, we're initially gonna start with storing nothing in here. And then if a query comes in for seven, then this view, when it realizes it doesn't have seven, it's gonna send an up query for seven to this join. It is gonna send an up query to article for seven. That's gonna flow back. It's then gonna try to do a look up here for seven. This doesn't have seven, so it's gonna do an up query there for seven. 
And the result is gonna come back, and ultimately the full state for seven ends up coming back into the view, and the cache gets populated with number seven, at which point we can reply to the user. Okay, so let's see whether this explanation of partial state roughly makes sense. How do you define recent in the setting of databases? Well, recent in what sense? So in terms of whether we decide to keep it in cache, you can use basically any caching strategy here. It can be like least recently used. You can do something like least recently written. In Noria currently, the strategy is random. Like, it just evicts a random entry whenever it has to evict, which is not perfect, but Noria's not trying to innovate on caching strategy. It's trying to innovate on caching mechanism. And you could always implement a different caching strategy if you wanted to. How does Noria or a system like it perform when it has to query a huge range of data? Let's say you have a node or query that uses a couple of dozen million rows. So Noria is definitely not written for applications that are sort of analytics query heavy. So if what you're doing is scanning your whole data set every time you issue a query, that's not what Noria is for. Then you want more of a batch processing, analytics processing system, which Noria is not targeted at. Noria is built for applications that are more likely to do point queries or narrow range queries into the data, because that is when you can use this partial mechanism of, you don't need to store the whole result set. Think of it this way: if your query depends on every single piece of data in a given table, or a very large portion of it, that means that if any of that data changes, the whole cache now needs to be updated, which is not great. Let's see here. Does the cache have a TTL? There's no TTL on the cache. There could be. Like again, the caching strategy here is sort of up to you. Noria is basically for the web.
Yeah, that's a pretty accurate way to say it. It doesn't have to be; there are some use cases on mobile, for example, but it is certainly geared towards basically read heavy web applications. Does that mean that the more cached queries you have, the slower writes will be? Yes, that's a very good observation. So the more queries you have in this data flow, like imagine that there were tons of different queries coming off of here, then whenever a vote enters the system, it would be more expensive, because the vote would have to be processed by each of those child operators. Yeah, so instead of having the data flow graph keep the read cache up to date for every change to the base data, the up query sort of is an on demand pull of the data. You can think of this as serving a similar role as in a normal application that does caching, where if you miss in the cache, you have to query the database, except that it happens internally in the system. Are there situations where Noria performs worse than an uncached database? Oh, almost certainly. First of all, Noria is relatively new and it's a research system, so it hasn't seen as much query optimization work as traditional databases have. But also, because Noria has this push, this data flow model, it means that there are a bunch of optimizations that traditional databases can do to the query that we can't do. So for example, if you were given a query and you could choose exactly how to scan the base tables, you could implement some joins a lot more efficiently. And Noria doesn't get to do that; it does sort of one-update-at-a-time processing. Does Noria maintain a change history? No, Noria does not really maintain a change history. If this is a high performance system, does it make sense to use this on top of any other database, like a cache or layered approach?
Noria currently assumes that it owns the data, and part of this is because it needs to ensure some amount of consistency between what's computed in the data flow and the data that's stored in the base tables. And part of the reason for this is, when an up query comes in, there are still, so imagine that this blue up query for seven goes up to article. There can still be writes in transit here while that up query is ongoing. And Noria needs to ensure that it doesn't duplicate a given piece of input data, or end up not including a particular piece of input data in some result. And so to get that, to keep that consistency, Noria does assume that it owns the data. But in theory, each of these could be implemented as database tables; they just aren't currently. So up queries go to the ancestors, which further resolve their dependencies? Yeah, so they recurse, and then the nodes hold partial state instead of full state. That's exactly right. There's no formal verification or model checking in Noria. There are a lot of places where I could see it being used, but that's just not what I focused on. This reminds me of MeteorJS. It's interesting you say that, because the application model of MeteorJS feels very similar, but the implementation is very different. How would that compare with Materialize, which is a streaming database also written in Rust? Yeah, so Materialize is built on differential data flow and timely data flow, so the answer there is pretty similar. Like, I interact a lot with the Materialize people and they're doing some really cool work. They're targeting slightly different application needs than Noria is. Noria will not work well if you need strong consistency in your query results, for example, but it is real fast if you have a read heavy application, because the reads don't have to go through the data flow. And also because Noria lets you evict, which as far as I'm aware, Materialize does not really let you do at the moment.
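And just to tie the up query and eviction ideas together in code form, here's one more toy Rust sketch. Again, this is my own illustrative simplification, not Noria's real implementation: a partially stateful view that starts empty, up-queries its parent on a miss, and can evict entries precisely because any evicted entry can be recomputed on demand later. The parent is just a closure standing in for the upstream join and count operators.

```rust
use std::collections::HashMap;

// A toy partially stateful view, keyed by article ID. On a miss it
// "up-queries" its parent to recompute the missing entry, then caches it.
struct PartialView<F: Fn(u64) -> i64> {
    cache: HashMap<u64, i64>,
    upquery_parent: F, // recomputes the result for one key on demand
}

impl<F: Fn(u64) -> i64> PartialView<F> {
    fn read(&mut self, key: u64) -> i64 {
        if let Some(&v) = self.cache.get(&key) {
            return v; // hit: served straight from the materialized view
        }
        // miss: ask the ancestors to resend/recompute state for this key
        let v = (self.upquery_parent)(key);
        self.cache.insert(key, v);
        v
    }

    // Eviction is what partial state buys us: drop a key, and a later
    // read will simply up-query again.
    fn evict(&mut self, key: u64) {
        self.cache.remove(&key);
    }
}

fn main() {
    // the "parent" here is a made-up stand-in for the upstream operators
    let mut view = PartialView {
        cache: HashMap::new(),
        upquery_parent: |article_id: u64| (article_id as i64) * 10,
    };
    assert_eq!(view.read(7), 70); // miss: up-queries, fills, replies
    view.evict(7);                // safe to drop the entry at any time
    assert_eq!(view.read(7), 70); // miss again: just up-queries again
}
```

The hard part that this sketch completely glosses over is the consistency issue I just described: in the real system, writes can be in flight through the data flow while the up query is ongoing, and Noria has to make sure nothing gets double-counted or dropped.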
All right, do you ever evict a query from cache? Yeah, you can remove queries. Okay, so let's now try to get to the eval. I think we have enough context now. And even if you haven't followed everything we've done so far, I think the eval questions are very related to the kind of things you've already asked. Okay, so the eval section. Yeah, as someone pointed out, I have to read up, everything's going over my head right now. And I think there's a good comment here, which is basically, you have to realize that this is the result of many years of work. So the fact that you don't immediately get it is not weird. Like, there's a lot of complexity here, and I'm explaining it very rapidly and at a very high level, when there are clearly a lot of underlying technical nuances. But it's okay that you don't understand it, right? Part of that is like, this is why I'm writing a thesis on it, right? It's because I need to take all of the knowledge I have of this problem and the solution and put it on paper so other people can understand it. And I'm not going to be able to do that in half an hour on stream. But what I want to do for this stream is to specifically look at, how do we even evaluate this as a solution? The evaluation has very little discussion of the internals of Noria. It has very little discussion of how Noria is built. It's just looking at, let's assume that you solved all the problems, or let's assume that you present Noria as a complete solution. How do we figure out whether Noria actually is a correct solution? Whether it is an appropriate solution, whether it's a good solution, irrespective of what happens under the hood. The eval section, like, sometimes it goes into some low level questions about how does this mechanism work? But very often it's just like, now that we have this as sort of a candidate solution, how can I try to convince people that it's actually a good solution?
And that is very much what an eval section centers on. Eval is just short for evaluation. Okay, so I don't know if you can read this text. It's not terribly important that you can read it exactly. But the evaluation section pretty much tries to make one argument, and that argument is that Noria is a good solution to a useful problem. Now, of course, there are many facets to that statement, but that is the high-level point that the eval section is trying to make. And the way it goes about it is first to state the problem that Noria is trying to solve. Specifically, it starts out by saying that the thesis is built on the belief that view materialization is useful. View materialization is sort of the database way of saying query caching. And then it goes on to say that it's prohibitively costly to use it with current solutions. So this is the high-level point, right? This thing is useful, and we're gonna demonstrate that, but you can't do it at the moment, and that is sad. And then it says, well, the thesis presents partial state as a solution to this problem. And then the first section of eval (well, the second section, we'll get to that in a second) is gonna evaluate the validity of the assumption that view materialization is useful, and also the efficacy of partial state as a solution to that problem. So this first paragraph, even though it's currently a little poorly written, is basically trying to get at the first question in the reader's head: why should I care about this thesis? Why does it matter? Why is it important to me or to anyone else? We need to convince the reader of that before we can really do anything else. Any other evaluation will just be ignored by the reader until they feel like there's something there that's worth reading.
And the reason that this big question is answered in section 7.2, rather than first, is that the first section of the eval, as we'll see in a second, is the experimental setup. And this is extremely important when you write a solid evaluation section: you need to explain what experiments you ran and how you ran them, because without understanding exactly what the experiments are and the context in which they were run, it's impossible to evaluate your results. If you show a graph that goes up and to the right, even if there are numbers like throughput or latency or whatever, you can't evaluate whether that's good or bad without knowing: what experiment was this? What were the inputs? What machines were they run on? What was the problem? How complex was the solution? There are all these other things about your experiments you have to explain, and those go in experimental setup, which usually comes first in the eval. And then the second argument that this makes is that with partial state, only a subset of each view is materialized. So this is the whole partial state idea that we've talked about, and missing results are computed on demand. So this is the upquery business that I was on about. And the thesis then makes the claim that this reduces memory use, and we talked about why that might be, but it also means that some queries take a while to be satisfied. The intuition here is that if you try to read from one of the sort of query caches at the bottom, but your result isn't there, Noria now has to go in and do all these upqueries, which means that your result is gonna take a long time to actually be computed. And this is just inherent in how Noria's designed. In some sense it's not even a problem; it's just the trade-off that Noria presents you with.
It's saying that rather than computing everything and fully materializing all of the results, which takes too much memory, we're gonna not store some of them. And then the obvious follow-on from that is, well, if some of them aren't stored, you need to compute them on demand, and that will be slower. And so what we're saying is that this is a trade-off between your cache size and your miss rate. Or in other words, your memory use and your tail latency. Tail latency is just a way to talk about your latency for the slowest requests. Imagine that you have a million requests; if 999,000 of them are fast, but the last 1,000 are slow, then those last 1,000 would be your tail. And you can talk about various amounts of tail: the highest 10th, the highest 100th, and we'll see some examples of this later. But we basically wanna talk about the worst-case scenario. And in Noria, the worst case is that you miss and have to upquery. And we want the thesis to be honest and evaluate: what is the cost of these misses? Because if every read was extremely fast, but when you do miss it takes a hundred seconds to compute the result to your query, that would obviously not be okay. And so the thesis needs to look at, well, how expensive is it actually? That's what we're gonna look at in 7.3. And then one thing that's nice about partial state, which I touched on briefly, is that once you have partial state, you can start every piece of state out as empty. This means that if someone gives you a new query, you don't have to do any work, because the result set is just empty, and we're gonna fill it in on demand. You can think of this in terms of normal caching: we're gonna start with an empty cache and then fill it as the application does stuff. It seems pretty reasonable.
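Since tail latency comes up repeatedly below, here is a minimal sketch of how you might compute it from recorded latencies. This is illustrative, not the actual benchmark harness; the `percentile` function and all the numbers are made up for the example.

```rust
/// Nearest-rank percentile over recorded request latencies (in ms):
/// sort the samples, then index at ceil(pct/100 * n).
fn percentile(latencies: &mut Vec<u64>, pct: f64) -> u64 {
    assert!(!latencies.is_empty() && pct > 0.0 && pct <= 100.0);
    latencies.sort_unstable();
    let rank = ((pct / 100.0) * latencies.len() as f64).ceil() as usize;
    latencies[rank.max(1) - 1]
}

fn main() {
    // 999 fast requests (1 ms) and a single slow one (100 ms): the
    // median looks great, but the very top of the distribution hurts.
    let mut lat: Vec<u64> = vec![1; 999];
    lat.push(100);
    println!("p50  = {} ms", percentile(&mut lat, 50.0)); // prints 1
    println!("p99  = {} ms", percentile(&mut lat, 99.0)); // prints 1
    println!("max  = {} ms", percentile(&mut lat, 100.0)); // prints 100
}
```

The point of the example is exactly the trade-off in the text: a system can look fast at the median while its tail, the requests that miss and have to upquery, is orders of magnitude slower.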
But remember that if you don't have partial state, you don't have this option. You can't have things be empty, because they're either non-existent or full. And so what happens in a traditional view maintenance system is that when you add a query, you need to compute all of its results. And that obviously takes a while. Even if you don't care about most of those results, even if the application only cares about a subset of them, you have to compute them. And so section 7.4 is gonna be looking at how much partial helps in this context. If you add a new view, a new query, how much does partial help speed that up compared to having to compute everything ahead of time? And then finally, and this is sort of a big one that was touched on in chat too (actually, this one is a little bit different), one of the things that's interesting about partial is that it only helps if you only need a small amount of your computed state. If your queries are just touching all of your data most of the time, like every other second someone is reading every article, then it doesn't really make sense to only cache some articles, because every article is gonna be requested again shortly thereafter. And so partial being useful basically relies on there being skew in the data. That is, if you have a million articles, some articles are much more likely to be read than others. It turns out that this is true in many data sets, but we sort of need to convince the reader that this is true for applications that they care about, because if the skew wasn't there, if the application just sampled the data uniformly at random, then it doesn't make sense to only cache some things, because the application would be slow a lot of the time.
And so 7.5 is gonna go into how confident we feel that a small amount of your data set is usually what you care about, and that therefore caching, or rather eviction, makes sense. If we can't show that, then no one is gonna believe that partial state is useful. If almost every application needed all of its data all of the time, there'd be no reason to use partial state, because you'd either need to compute everything or you'd basically always be computing on demand. So this brings us to... actually, there's a paragraph missing here, I just realized; we'll get into that later. Okay, so I've walked through the high-level argument that the evaluation is making. Let's pause here and see whether this made sense, and also whether there are other questions you have about why this might make sense, or why it's trying to make these arguments, or why these are real problems, before we move on. "Is there any way to have a look at the current version of your thesis? Is it going to be published?" The source code for my thesis is online. The benchmark results are not, just because I've been lazy, but they will be, and the final thesis, the PDF of the final thing, will be published. "How is Noria handling caching in a distributed scenario?" So we're not actually gonna talk much about distribution in the eval section, and that is because we're focusing on partial state, right? Noria does a lot of different things, and I'm basically trying to say my thesis is about only partial state. And the reason for this is that it would be too difficult to write a thesis about all of Noria, and it would also have too many obvious holes. There are a number of things in Noria that we haven't quite figured out yet, and it's a little unsatisfactory to write a thesis where a bunch of the answers are: this doesn't quite work, but we don't have a better solution.
And distribution is one example where it works pretty well for many things, but once you start to do sharding you run into some really weird corner cases that we just haven't looked at as much as we've looked at partial. "What is the timeline on the thesis being finished?" I am hoping to finish by the end of October. It's very hard to say; working on a thesis is weird because you're in a position where there aren't any set rules for when it's done. Instead, the thesis is basically done when your PhD committee thinks that the work is sufficient and well enough explained. And those are very subjective criteria to work towards. You're dealing with people who, it's not that they're bad, right? They're not trying to get you or anything. It's just that it's very hard to know in advance what is going to be needed from a thesis until you see the results. And you'll see this for some of the graphs later too, where we went through so many iterations even just to get to this particular graph, and you can't know ahead of time how long that will take. So it's pretty hard to estimate how long writing the thesis will take. I will say that, at least for me, the evaluation section is the one with the most uncertainty about how long it will take, because for the other sections, I know how the design of Noria works, and that is not going to change for the thesis. I'll have to actually write the text, but that's just writing, which only relies on me, and I can evaluate pretty well how far along I am.
It's much harder to figure out how long it will take to write the evaluation section, because you run an experiment and it doesn't quite show what you want to show, or it doesn't show it in the way that you wanna show it, so you need to design a different experiment, run that experiment, and then maybe that doesn't quite give the results you wanted either, and you keep iterating, and you don't know how many iterations it will take until you're happy. And so this is also why I started writing my thesis with the evaluation section: because that is the one where I have the least knowledge of how it's gonna pan out. "Does the evaluation propose a specific data model, and how do those connections affect the data flow process?" I'm not sure I follow. The evaluation section is not exploring alternatives; that's not really its role. Sometimes it is, and we'll see an example of that later, but generally the evaluation section is: I have now explained the full system to you, let me demonstrate that this is a good solution. So it's not sampling lots of things and saying, oh, maybe this could be interesting. It's really saying: here's a good solution, and let me tell you why. "What will happen to your streaming schedule after this is over?" I wanna keep streaming. Once my thesis is done, I think I'll be able to stream more, because I won't be in thesis crunch mode. "The explanation in the eval made total sense, and I'd guess that giants like Reddit, Twitter, and Facebook are running into the problem your thesis describes." Yes, actually, and this is something the eval will touch on a little bit later, and something I want to move earlier in the thesis, but I haven't had a chance to yet. So I'll get back to that question when we get a little bit further along. "I would like to hear your advice on PhD and thesis work in general, maybe at the end." Yeah, we'll probably do some of that at the end.
"I'm sorry, but reading it again, I'm unsure how this enables fast adoption of new queries." So this part. That sentence is about what happens if you add a new query to the system, like you issue a query the system has never seen before. In the traditional data flow systems or materialized view systems, you would need to compute all the results for that query and store all of them immediately. Whereas with partial, you can start out with an empty cache, and only when particular parts of that query are executed do you need to fetch them on demand. A good example of this: if we go back here to this query, imagine that the application tries to prepare this query, and it's a new query the database has never seen before. With full materialization, it would have to compute the results for every article ID, because it doesn't know the article ID yet, or it can do no work; it has to choose one or the other. With partial, we can set up the data flow, and then when a query comes in for, say, seven, we do this upquery business for seven, and from that point forward, any query for seven will be fast. With full materialization, when seven comes in, either you've already computed the results of this query for every possible question-mark value, or you haven't computed any of them and you decide to compute it just for seven. Now imagine another query comes in for eight, right? In the full materialization world, either you already have the result because you computed it for every question mark, or you have to go ahead and compute all of this again for eight; you can't really reuse any of it, because it's all specific to seven. With partial, this is just an upquery for eight, if that distinction made sense. "Is the created data flow static? What if you run into situations where suddenly another join order makes more sense?" The data flow isn't static, you can add to it, but you can only append to it.
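To make the empty-until-queried idea concrete, here is a small sketch of partial state for a single view, in Rust since that's the project's language. This is not Noria's actual API: `PartialView` is hypothetical, and recomputing from a single base table stands in for the real recursive upqueries through the data flow.

```rust
use std::collections::HashMap;

// Sketch of partial state: a per-key view cache that starts out empty
// and fills itself on demand. A miss triggers an "upquery" (here, a
// recomputation from the base table); a hit is just a map lookup.
struct PartialView {
    // base table: article id -> user ids who voted for it
    base_votes: HashMap<u64, Vec<u64>>,
    // materialized view: article id -> vote count, only for queried keys
    cache: HashMap<u64, usize>,
}

impl PartialView {
    fn vote_count(&mut self, article: u64) -> usize {
        if let Some(&n) = self.cache.get(&article) {
            return n; // hit: cheap lookup, no recomputation
        }
        // miss: "upquery" to the base data, compute, and materialize
        let n = self.base_votes.get(&article).map_or(0, |v| v.len());
        self.cache.insert(article, n);
        n
    }
}

fn main() {
    let mut base_votes = HashMap::new();
    base_votes.insert(7, vec![100, 101, 102]);
    base_votes.insert(8, vec![100]);
    let mut view = PartialView { base_votes, cache: HashMap::new() };

    assert_eq!(view.cache.len(), 0);   // new query: no precomputation
    assert_eq!(view.vote_count(7), 3); // first read of 7: upquery, then fill
    assert_eq!(view.vote_count(7), 3); // second read of 7: cache hit
    assert_eq!(view.cache.len(), 1);   // 8 was never asked for or computed
}
```

The key property this sketches is the one in the paragraph above: adding the query costs nothing up front, and each parameter value (seven, eight, ...) is only ever computed once it is actually requested.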
So Noria can't currently change the join order, for example. That's not a thing it can do without rearranging the cache, basically clearing it and setting it up again. You could imagine being able to do some transformations internally, but it's not something Noria currently does. "Are you planning a postdoc when this is done, or do you have other plans?" I'm not planning a postdoc. I'm planning to probably go into industry, but I haven't figured out my exact plans there yet. "Is materialization similar to building indices?" It's not quite the same, but it's similar. Indices are usually added when you already have the data that you're creating an index over. Materialization is computing a derived value of some data you already have. "So you need some kind of stored procedure in advance?" Well, Noria does rely on prepared statements, which are pretty common in how applications use databases today. So it does assume that it gets to know the queries in advance, and it also sort of assumes that the application is gonna keep issuing similar queries. This is another case where Noria sort of assumes that applications are basically web applications, because web applications traditionally behave this way, right? When you deploy a web application, it's gonna issue the same queries over and over and over again. So it makes a lot of sense for the database to take advantage of that fact. "Aren't databases doing the data flow already? That actually makes so much sense." Databases today might describe what they do internally as data flow, but they're executing it sort of bottom-up, not in the data-flow-down way. When a query gets executed, the database usually ends up finding the data that it needs and then building the results piece by piece, which is a little bit different from what we're doing here. "Are you going to stay in the U.S. after completing your PhD?" At least for a few years.
My girlfriend is getting into voice acting, and so she's moving to L.A., and so I'm moving to L.A. too. I'll be there for at least a few years. All right, so let's start to actually walk through the eval and some of the graphs and such that are there. Let me zoom this out a little. Okay, so first is the experimental setup, and this is important for understanding what comes later in the eval. Basically, one of the primary parts of the evaluation for Noria is the Lobsters website. If you're not familiar with it, Lobsters is a site that's a little bit like Hacker News, but it has a few key differences that made it more viable for use in evaluating Noria. The first of these is that Lobsters is open source. It's a Ruby on Rails application, and we can actually inspect the source code, and crucially, we can look at which queries are being executed, which matters a lot for Noria. And second, I interacted a little bit with some of the Lobsters administrators a few years ago, and I managed to get anonymized statistics about the Lobsters database. Things like: how many users have 10 votes or fewer? How many users have between 10 and 50 votes? How many articles have between this many and this many votes? Which pages are most frequently accessed, and how often? What's the distribution of popularity for articles? What's the distribution of the number of comments for different articles? So I got all these statistics about Lobsters, and that let me build a workload generator. A workload generator is basically a way to construct a benchmark that is artificial, right? It's not running the actual Lobsters code base, because Ruby on Rails is pretty slow for something like that. But it means that I can run an experiment that shares many of the properties of a real dataset.
In particular, what I did was build a workload generator that produces page view requests that resemble the ones you would see on the real Lobsters. So most of the page views it generates will be requests for a particular story, and it will request particular stories more often than others. So this is the skew that we talked about. And all of the parameters for how these are skewed match the statistics we have about the actual Lobsters dataset: if the workload generator decides that this user is gonna vote for this story, it chooses which story in a way that matches those statistics. And there are a couple of reasons to use a workload generator. One is that we can avoid Ruby on Rails. And the second is that we can scale up load. If we just worked from the base dataset and ran that, we wouldn't be able to experiment with what happens if there were twice as many users, or a hundred times as many users. When load increases, a bunch of other properties of the system change, and a workload generator lets us artificially turn the heat up and down, so to speak, to evaluate how Noria performs in different settings and at different load factors. All of the experiments run on EC2 on multiple machines; generally the server is given a single machine, and then there are multiple clients issuing requests. And interestingly, all of the benchmarks are open loop. So what this means is, and this is a pretty important thing that I think systems researchers need to think about in their evaluation sections: a common way to benchmark is to have just a while loop that does the thing, measures how long it took, and then does it again, over and over. And this is kind of a bad way to benchmark things. In fact, it is a bad way to benchmark things. And one reason why is that that is not how a real application would work.
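As a rough illustration of what a skewed workload generator does, here is a sketch that samples article ids from a Zipf-like distribution, so a few hot articles get most of the page views. Everything here is illustrative: the `ZipfSampler` type, the exponent, and the tiny embedded random number generator are made up, whereas the real generator fits its distributions to the anonymized Lobsters statistics.

```rust
// Sketch of a skewed (Zipf-like) request generator: article at rank r
// is requested with probability proportional to 1 / r^exponent.
struct ZipfSampler {
    cdf: Vec<f64>, // cumulative probability per article rank
    state: u64,    // tiny LCG so the sketch needs no external crates
}

impl ZipfSampler {
    fn new(n_articles: usize, exponent: f64, seed: u64) -> Self {
        let weights: Vec<f64> =
            (1..=n_articles).map(|r| 1.0 / (r as f64).powf(exponent)).collect();
        let total: f64 = weights.iter().sum();
        let mut acc = 0.0;
        let cdf: Vec<f64> = weights.iter().map(|w| { acc += w / total; acc }).collect();
        ZipfSampler { cdf, state: seed }
    }

    fn next_article(&mut self) -> usize {
        // LCG step, then map the top bits to a uniform value in [0, 1)
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let u = (self.state >> 11) as f64 / (1u64 << 53) as f64;
        // binary-search the CDF for the sampled rank
        self.cdf.partition_point(|&c| c <= u).min(self.cdf.len() - 1)
    }
}

fn main() {
    let mut sampler = ZipfSampler::new(1000, 1.08, 42);
    let mut hits = vec![0usize; 1000];
    for _ in 0..100_000 {
        hits[sampler.next_article()] += 1;
    }
    // Skew: the hottest article alone gets a large share of all views,
    // which is exactly why caching only a subset of state pays off.
    println!("views of hottest article: {}", hits[0]);
}
```

With a uniform sampler instead, every article would get roughly 100 of the 100,000 views and partial materialization would buy you very little; the skew is what makes a small cache cover most requests.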
In a real application setting, requests come in at some rate. And if your system takes a really long time to answer one request, more requests are gonna continue to come in while that request is being serviced. And those requests are gonna be delayed by however long it took to run the slow request. If you just benchmark by timing an operation in a loop, you will never see that. You'll just see that one request was slow. When in reality, if you had real users using the system, they would see many requests being slow, because any request that comes in while a slow request is ongoing also has to wait and is experienced as slow. And so what an open-loop benchmark does is generate requests at some relatively fixed rate, or some random rate. It measures the time when each request was sent, and then it separately measures the time when that request completed. And what this means is that now, if a request is just sitting in a queue waiting for some slow request to finish, it's still gonna be accumulating slowness. That queuing time is also gonna be measured. And so the benchmarks in this eval section are all open loop and measure the queuing time as well. Questions about experimental setup before we get into the actual results? "Something like JMeter?" I'm not familiar with JMeter, but it may very well be. Open-loop benchmarking is a decently well-known concept, and if you're running a good benchmark, it probably should be open loop. Oh, let me zoom in if any of you are trying to read the text. The text here is not finished; it's just my current draft. And another important point here is that the server has 128 gigabytes of memory. Remember how I talked about earlier that one of the big wins with partial is that you are able to evict, which means you're able to use less memory. That 128 gigabyte limit is gonna come up later.
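The difference between open- and closed-loop measurement can be shown with a tiny queueing simulation. This is a sketch, not the real harness, and all the numbers are invented: one server, a request arriving every 2 ms, each taking 1 ms of service except one slow 100 ms request. Because open-loop latency is measured from arrival to completion, the requests queued behind the slow one also record large latencies, which a closed-loop timing loop would never see.

```rust
// Simulate an open-loop benchmark against a single server. Requests
// arrive at a fixed rate regardless of how the server is doing, so a
// slow request delays everything queued behind it.
fn open_loop_latencies(n: usize) -> Vec<u64> {
    let mut latencies = Vec::with_capacity(n);
    let mut server_free_at: u64 = 0; // time the server drains its queue
    for i in 0..n {
        let arrival = 2 * i as u64;                  // one request per 2 ms
        let service = if i == 10 { 100 } else { 1 }; // one slow request
        let start = arrival.max(server_free_at);     // may wait in queue
        let done = start + service;
        server_free_at = done;
        // Open loop: latency is measured from *arrival*, not from when
        // the server started working, so queueing time is included.
        latencies.push(done - arrival);
    }
    latencies
}

fn main() {
    let lat = open_loop_latencies(200);
    // A closed-loop harness would record exactly one 100 ms sample.
    // Open loop shows that dozens of requests behind it were also slow.
    let slow = lat.iter().filter(|&&l| l > 50).count();
    println!("requests with >50 ms latency: {}", slow); // prints 50
}
```

In this toy model, request 10 takes 100 ms, and every request that arrives while the queue drains inherits most of that delay, so fifty requests (not one) exceed 50 ms; that's the overstatement of performance that closed-loop benchmarking hides.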
"Doing the same queries over and over again will also train the cache and branch predictor." That's at a much lower level. The branch predictor does end up mattering for Noria. We won't really see it in the eval section, but the reads in Noria are basically just hash map lookups, so the branch predictor ends up mattering there. But so much of the system's bottleneck, in a large sense, is sending and receiving bytes over the network and serializing and deserializing them. The bottleneck is not generally inside of Noria, although sometimes it is. Oh yeah, open loop is definitely what you should be doing if you can. And the reason is that if you do what's known as closed-loop benchmarking, the one that's just a tight loop that measures things, you're overstating the performance of your system for real applications, which is obviously not great. "Would you be interested in doing a live coding session about benchmarking?" Maybe one day. It's unclear that there's that much to say about it, but maybe. Okay, so let's dive into the actual experiments. It took us a while, but now we're here. So, as I mentioned, there's something the evaluation section needs to do first, because few people will read the whole thesis before the eval section. In practice, what usually happens is people read the introduction section, then they read the evaluation section, and then they decide whether they care enough to read anything else. And so my guess is that someone reading this is gonna read chapter one, maybe just the first part of chapter one, which is gonna be the introduction and motivation. They may just read the abstract. Then they'll jump to the evaluation section. They might read the first page, they'll skip experimental setup, and then they'll read the first part, basically this page. And here we need to convince people that this is useful.
And this is also what the text in the beginning of the evaluation section was trying to set up, right? Stating what the core tenet of the thesis is, and in what sense we believe that this is a solution; what the eval has to do is convince the reader that it's a good solution. And so this first and arguably most important part of the evaluation starts out by saying that the core argument of this thesis is that partial state makes view materialization feasible. The idea being that before, it was not feasible, but with partial state it is. And it then makes the point that bundled up in this argument are a number of different questions that we need to answer before we really look at partial state and micro-benchmarks of it. Such as: why do you want view materialization in the first place? Why is this a useful problem to even be working on? Because if the reader doesn't believe in point one here, if they don't believe that view materialization is useful, then the whole thesis is not worth reading for them. They're not gonna care about partial, because they don't believe that it's solving an important problem. And then, by saying that partial state makes it feasible, we're also saying that it's not currently feasible, and we need to demonstrate that that is the case. And then finally, we need to demonstrate that partial state is indeed a solution to this problem. And so we want to convince the reader of all three of these points in order to convince them that the rest of the thesis is worth reading. Does that make sense? "Can you explain the structure of a thesis?" I mean, it's roughly: intro, motivation, related work, design, evaluation, discussion, conclusion. There isn't that much that's standardized, really, but those are the main points you need to touch on over the course of the writing.
Okay, so we've set up the reader with, basically, an acknowledgement that these are the three questions that they have in mind. And then it goes on to say that figure 7.1, which you can't see on this page, which is a little sad (I'm considering moving it up), attempts to give insight into all of these questions by comparing the highest sustainable request load for three different systems: MySQL, Noria without partial state, and Noria with partial state. Notice here that MySQL is run entirely in RAM, by running it on a RAM disk, and at its lowest isolation level. Because MySQL has all these built-in transaction features that might slow it down, we disable all of that to give as even a comparison as we can. And the figure shows the highest Lobsters throughput that each system achieves before its median latency exceeds 50 milliseconds. There's some text down here; I'm gonna show you the figure first. So this is the figure, and this is arguably the key figure of the thesis. I'm gonna give you a little bit of time to look at the figure and the caption before I start talking about it. So the core point of this figure is that Noria is good, right? The top part of the graph here is the throughput, the number of pages per second that the back end can support, or that the workload generator can push through the system. And then the bottom part of the plot is how much memory it uses at that throughput level. So, okay, this is the number of pages per second that MySQL can support of the Lobsters workload. And notice that these two plots are with view materialization, and this one is without. And the top part here is really just saying: caching is good, right? If you cache, you can do way better than if you don't cache.
But then the second point that's interesting here is that with partial, you can support many more pages per second than you can without partial. And the reason for this becomes obvious when you look at the memory use. With MySQL, because you're not caching anything, the memory use is obviously very low. But with full materialization, so this is without partial, you need to materialize every result for every query, and you can never throw anything away. There's no eviction. And therefore it ends up using a lot of memory. So at about 4,600 pages per second, it just runs out of memory. It can't go any further, because you're introducing new data to the system and the system just doesn't keep up. Whereas with partial, you can evict any state that's not constantly being accessed, and therefore it uses much less memory, which means that it can go further. Okay, I'm gonna stop there to see whether the explanation makes sense, and then talk about some of the things that this figure does not answer, or some additional questions that might come up when you look at it. "Have you considered adding a Postgres adapter in addition to the MySQL adapter?" So the thesis will not actually talk about the adapters, because they're not in use here. This is interacting directly with Noria through the Rust API, and directly with MySQL. There's no adapter involved. The adapter was something we built for the previous paper, because it meant that we could compare the two a little bit more directly, but it's not really relevant to the argument that this thesis is trying to make. So there's no adapter here. "That's why I left academia. There's always this bias towards, or need for, positive results. I would be perfectly happy to read a paper or thesis that clearly explains why some concept does not work." There are theses and papers that do that, that give negative results, or that are more survey papers.
It is true that they're pretty rare in computer science, and that's a little sad. It is generally harder to publish a paper on "we tried this and it didn't work", even though those are also important results, because they mean that other people won't waste time trying the same thing. But it is certainly true that it's much easier to write a paper or a thesis on "we tried this and it worked". "Why was MySQL chosen?" It doesn't really matter; the right bar here just has to be a database that doesn't cache. We could certainly try some other one. MySQL is just by far the most common one. We couldn't use something like IBM DB2 or Oracle, because they all have no-benchmarking clauses; if we used them, we couldn't name the system, which is just useless. Postgres would probably be about the same here. It's not clear it would be any faster, because ultimately the win here comes from the caching, which Postgres would also not have. "What eviction strategies are used?" This is randomized eviction. I mentioned this a little earlier too: this thesis does not try to innovate on eviction strategies. You could implement any eviction strategy you want in Noria; currently it's just randomized, because that was easy, and it still demonstrates the value of the system, but obviously you could do better with a smarter eviction strategy. "Why isn't caching everything faster?" So remember how this benchmark works: we're just measuring throughput here, not latency. And as the workload generator generates more and more requests per second, the configuration that doesn't have partial runs out of memory. It can't go any higher than this, because it ends up with just too much stuff. So the takeaway is not that with full caching your requests are slower than they are with partial. It's that you can't scale as far in terms of throughput.
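For concreteness, here is a sketch of what randomized eviction for a bounded partial view might look like. `BoundedCache`, the entry budget, and the embedded random number generator are all hypothetical; Noria's real eviction also has to coordinate with upstream and downstream data flow state, which this ignores entirely.

```rust
use std::collections::HashMap;

// Sketch of randomized eviction: when the cache grows past its budget,
// evict an arbitrary resident entry. A later read of the evicted key
// simply misses and triggers an upquery, so correctness is preserved.
struct BoundedCache {
    entries: HashMap<u64, Vec<u8>>,
    max_entries: usize,
    rng_state: u64, // tiny LCG; a real system might use the `rand` crate
}

impl BoundedCache {
    fn insert(&mut self, key: u64, value: Vec<u8>) {
        while self.entries.len() >= self.max_entries {
            self.evict_random();
        }
        self.entries.insert(key, value);
    }

    fn evict_random(&mut self) {
        self.rng_state = self
            .rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Pick a pseudo-random resident key. O(n) scan: fine for a sketch.
        let idx = (self.rng_state >> 33) as usize % self.entries.len();
        let victim = *self.entries.keys().nth(idx).unwrap();
        self.entries.remove(&victim);
    }
}

fn main() {
    let mut cache = BoundedCache {
        entries: HashMap::new(),
        max_entries: 100,
        rng_state: 42,
    };
    for key in 0..10_000u64 {
        cache.insert(key, vec![0u8; 64]);
    }
    // Memory stays bounded no matter how many keys pass through, which
    // is the property that lets partial keep scaling in figure 7.1.
    println!("resident entries: {}", cache.entries.len());
}
```

Random choice is clearly not optimal (it may evict a hot key), but as the text says, even this crude policy is enough to demonstrate the value of being able to evict at all; a smarter policy only improves on it.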
Is Noria's data flow agnostic to the different types of database? There is no external database in Noria. Noria does not contain MySQL or any other database. Are there trade-offs on request latencies? Yes, and we'll get into those in a second. That's the second paragraph that the introduction to the eval was trying to set up. Why is partial state better than caching pages in Redis, memcache, et cetera? I'll get to that towards the end of the eval section. It's an important point. The high-level answer is that doing manual caching in Redis or memcache or something is extremely complicated to get right, especially for anything but the most trivial applications. And with Noria, you don't have to do any of that. Would this scale well with much bigger data sets? It's hard to say. One challenge here is that the really big applications aren't open source and don't have their data sets open, so I just can't test them. The best I can do is scale up a data set that I do have. What's the rate of change in the data? How many changes occurred during the test? So this is doing the normal lobsters evaluation, the lobsters workload. And what that does is it generates different page views in lobsters, and to execute any given page view there are a number of queries that are issued behind the scenes, which are basically the queries that the real lobsters Ruby on Rails application would issue. So for example, if you issue a vote, it's gonna do an insert into votes. If you read a story, it's mostly just select queries, although there is one insert in there that logs the fact that you've now looked at this story for notification purposes. So I don't have a number for exactly how many modifications were made, because these are page views; they're not individual atomic operations. We'll look at that a little bit later though.
What is lobsters and how do we know it doesn't favor Noria over MySQL by design? So lobsters is an application that I did not write. It's a production website that mirrors the design of many other websites (although those aren't open source, so it's hard to say for sure). It was built by other people, and basically what I did was build a workload generator that runs the same queries. Those application authors did not know about Noria at all. So that's the best answer I can give: there should be nothing inherent there that makes it favor Noria. Have you tried to calculate the potential cost savings of using Noria over MySQL, i.e., the number of MySQL instances needed for MySQL to perform similarly? I don't think MySQL can perform similarly to Noria here. If you tried to run a cluster, you'd end up with a lot of overhead in the clustering in MySQL as well. Is this basically partially practicing defending your thesis? Not really, this is just me trying to walk through an eval section. Where's the code? This is not a live coding stream, like many of my streams are. Have you compared it to Facebook's database RocksDB? RocksDB is just a key-value store; Noria is a full SQL database. So the comparison isn't really meaningful. There's RocksDB with MySQL support; I have not compared against that. All right, so this is one of the key figures that the thesis uses to argue why Noria is useful. There are a couple of questions you should be asking yourself about this figure. One of them is where do these memory numbers come from? Why is it that it uses 115 gigabytes at 4.6 K pages per second? Why does that mean that it would use more memory if this was pushed higher? Basically, why can't we push the number of pages per second higher without increasing the amount of memory we use? And the answer to that is two-fold.
First, as you issue more requests per second, you're also generating more data, right? You're adding comments to stories, you're adding stories, you're adding votes to stories and comments. So you're adding data to the system that then has to be represented in every query result when you don't have partial. That is one reason why increasing the rate of change increases the memory usage. The other thing is that even at 4.6 K pages per second, if you kept running for longer, you would increase the memory use as well, for the same reason: if you kept running at that rate, you're still generating more data. And so the memory use actually goes up over time. And that makes you think, well, why are these numbers even useful, right? Because for any given input rate, the memory use is not constant. It's gonna be going up because you're increasing the size of the dataset. The answer there is basically that all of these experiments run for two minutes. So if you run this for two minutes, you get this memory use. If you wanted to run at a higher rate for two minutes, you would get a higher memory use, and because we don't have more than 128 gigs of RAM, you can't go higher. And that does mean that this is a little artificial, right? If we ran it only for one minute, then you could go higher, because you haven't generated as much data in one minute. So there isn't really a good answer for why exactly two minutes. I could make it longer and this would go up and this would go down. What is important though is that for Noria with partial, the increase in memory use is much, much slower. So if I went to, like, three minutes, Noria with partial would increase memory use a lot more slowly than the other experiments would. I don't have a good answer for this. I don't know quite how to illustrate this without being dependent on how long you run the experiment for, because the input rate just generates data.
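The runtime dependence described above can be made concrete with a toy model: base-table data grows linearly with runtime under every configuration, but full materialization additionally retains every derived result ever computed, while partial's cache stays bounded by eviction. All the constants and units here are invented for illustration; they are not measured numbers from the thesis.

```rust
// Toy model of the memory curves discussed above (arbitrary units).
// Full materialization keeps base data plus every derived result;
// partial keeps base data plus a bounded working set.
fn full_mem(pages_per_sec: f64, secs: f64) -> f64 {
    let base = 0.001 * pages_per_sec * secs; // base tables grow with runtime
    let derived = 0.004 * pages_per_sec * secs; // every result ever computed is kept
    base + derived
}

fn partial_mem(pages_per_sec: f64, secs: f64, working_set: f64) -> f64 {
    // the cached portion is capped by eviction; only base data keeps growing
    0.001 * pages_per_sec * secs + working_set
}

fn main() {
    // doubling the runtime roughly doubles full materialization's footprint,
    // while partial only grows by the new base data
    let full_delta = full_mem(4600.0, 240.0) - full_mem(4600.0, 120.0);
    let partial_delta = partial_mem(4600.0, 240.0, 5.0) - partial_mem(4600.0, 120.0, 5.0);
    assert!(full_delta > 4.0 * partial_delta);
}
```

The comparison only makes sense at equal runtimes, which is exactly the point made in the next paragraph.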
The best I can do is compare them for the same amount of runtime. Why does the caption say that full materialization blows up at 2.3 K pages per second but the picture says 4.6 K? Because the caption is outdated. Does Noria put constraints on what MySQL features can be enabled? There is no MySQL in Noria, so Noria does not put any constraints on MySQL features; you can't use any. The more serious answer is that Noria does not support MySQL queries; it supports just SQL. So if you're using fancy MySQL-dialect features of SQL, you couldn't do them in Noria. This is partially just because Noria is a research system. If you wanted to add support for those features, you probably could support many of them; it's just not been a focus of the work. So the argument we're making here is that Noria uses less memory and is significantly faster than MySQL. If we look back at the questions, right? Why is view materialization desirable? Well, that's the throughput difference. It's significantly faster, like 9x faster. Why is view materialization not feasible currently? The answer is basically that it uses so much memory that either you can't run for as long, or you can't run at a high enough rate. And so you would just be really limited in what kind of application scale you could run at. And then, does partial state improve on this situation? Well, yes, it does. It uses significantly less memory, which also means that you can push throughput higher. Or, the sort of inverse of this argument: you can use a smaller instance, for example, a smaller machine with less memory. Because here with Noria, imagine that you wanted to run at 4.6 K pages per second; that was just the load that your number of users generated. Then Noria would use significantly less memory, and so you could get away with having less memory in the machine, which is cheaper. How about if you ran it till it reached a steady state of queries?
So the challenge is that it's not a steady state of queries. There's no steady state here, because most of the queries do some kind of insertion into the dataset. The dataset will keep increasing. Shouldn't you run the test on different EC2 instances then, with different parameters, to show how much Noria improves the throughput in each scenario? I'm not sure I follow. Which EC2 machine you run on won't really make much of a difference here. Did you look at the rate of change of resource usage and throughput over time, rather than just instantaneous results? Yeah, you'll see those a little bit later. Remember that this is just the first graph, and the idea behind this graph is to immediately show why you should read the rest of the thesis. This graph cannot answer all of the reader's questions. There's just no way to do all of that in one graph. All it can hope to do is convince the reader to keep reading. Imagine the first graph wasn't compelling enough: say these throughput lines were at like 1 K. The reader might be like, why do I care? They use 60 times as much memory for a 2x increase in throughput. I don't care about this paper. I don't think it matters. Whereas this graph clearly demonstrates that there's something here. And the hope is that the reader keeps reading and then gets a more refined image of what the system provides you with. You could add memory over time, capping at two minutes, instead of the lower bar chart. Not sure I followed that. What if you use SSD as memory? SSDs are much slower than memory, so that would not really work. I wonder how Postgres would do if rustified. I don't think it would make a difference. I wonder how Noria would fare compared to MySQL if you turned off view materialization completely. If you turned off view materialization in Noria, it would be much slower than MySQL.
The reason for this is that MySQL has seen decades of optimization work on its query execution, whereas Noria's upqueries I've built over five years, so they're just much less efficient. And sort of by design too: they are requests for particular subsets of the data, which means you can only execute queries in certain ways. So Noria doesn't even have the same flexibility MySQL does to optimize queries. Noria without view materialization doesn't make much sense. Okay, so the hope of this graph is to convince the reader that this thesis is worth reading. And there are some challenges with it, some of which I'm still working on. But it's interesting, because I've gone through a lot of iterations of this graph, and I think this is the best one. There might be a better one, but I'm not quite sure what it is. And the reason I think this is the best one, even though it has the challenges we discussed, is because it very quickly gives many of the takeaways that I want the reader to have in mind when reading the rest, right? Which is: higher throughput, lower memory use, and view materialization is good. Ravi: the comparison to Materialize and differential dataflow I went over earlier. What is the rate at which memory usage of Noria keeps increasing as more page views are performed? Actually, this is something I can draw out. Give me a second here. Also, I wanna set this to not be silly. Oh no, is this my one? Zero? One? That's awkward. Fine. Okay, so let me demonstrate this. If you looked at memory use over time, and this might be something that's worth plotting, I'm not entirely sure, it's a graph like this, where you have time here and you have memory up here. What you'll see is that with full materialization, the graph will just go up like this, for some arbitrary scale. With partial, what you'll see is that it'll go more like this.
And so you'll see that down here there's sort of a log component, which is basically filling the cache, and then there's a linear component here, and that linear component is the data increase, right? With full materialization, this delta is the data plus all of the cached results, because we have to cache every result for every query for each part of the data. Whereas with partial, well, actually, I guess the difference between these lines at any given point is gonna be stuff that's cached but not accessed, right? So the difference between these is basically a bunch of stuff that didn't need to be cached, but with full materialization, everything must be cached. And so this is where the memory saving comes from. I feel like that graph mostly shows it's better than full materialization, but it's hard to see that it's better than MySQL. It would be interesting to see what memory usage it has at the same throughput as MySQL. So, I mean, I could show that. But the takeaway here is that MySQL can't do better than this. This is the highest I could push MySQL on the server, because at that point all 16 cores are busy. And so Noria clearly can scale much further, right? And it does that by utilizing memory; that's the takeaway. This is sort of the core trade-off of materialization, or of caching in general: you use memory to make things faster. The lower chart is how much memory is used after two minutes, which is arbitrary, but you could have a line chart with three lines that shows how much memory each config uses over two minutes. That's true. It's a little harder to sample, and it's also harder to read, but it is true that this could be a line graph, and that might help. And I do like that this is very visually simple. You sort of want your graphs to make a single point if you can.
This one is making slightly more than a single point, but once you introduce a timeline as well, then time seems like it should be relevant, but in some sense the argument this graph is making is unrelated to time. Is the memory saving due to random eviction? Well, it's due to eviction. It doesn't matter that it's random. One thought on the first graph: if there's a strategy currently in common use, MySQL plus an additional caching layer, I would find it odd that it's not included. Okay, so this is another important point. Very commonly, as I mentioned, people will use MySQL plus some cache like Redis or memcache. The challenge here is that implementing caching for lobsters would be a huge undertaking, right? It would be really complicated to add caching to an existing application, because not only do you need to add all the necessary caching layers, but you would also have to augment every part of the application to keep that cache up to date whenever the data changes. You would need to add a lot of mitigations for things like thundering herds, which I'll get to a little bit later. So adding a fourth column that's MySQL plus memcache, I agree, would be great, but in practice it's really hard to do. And it would also mean that we're basically evaluating my way of doing caching, some arbitrary way I made up for how you should be caching lobsters. The way the thesis actually goes about this, and this is the paragraph I mentioned was missing from the first part of the eval, is that the last part of the eval compares to basically rolling your own cache. And so that will get at a little bit of that difference. I wonder if a more sophisticated cache eviction policy would make it feasible to run Noria fast on cheaper machines; 64 gigs of RAM is no joke. Yeah, smarter eviction will get you down a little bit.
There's also an additional point here, which is that this 64 gigs of memory includes all the base tables. Remember, MySQL does not store the data in memory; it stores it on disk. Noria stores all the data in memory, including the base tables. And so that means that a big chunk of this, probably about 50 gigabytes, is only due to the base table data being in memory. If Noria stored those on disk instead, then the actual memory overhead of Noria would be very small indeed. MySQL does not support materialized views, no. And Postgres also does not support materialized views. Let me rephrase: they support materialized views, but they're useless. Materialized views in existing databases are basically "compute the result whenever this happens", which means re-execute the query when something happens. This is usually trigger-based materialized views. And if the trigger is "someone writes to this table", you're basically gonna be re-executing that query all the time, and that's just not useful. It means that every write causes a bunch of work as well. With Noria, you don't need any of that. We did some experiments on this in the previous Noria paper. I meant instance types. My question goes towards this: are you trying to answer the question of whether materialization is useful at all, or are you trying to answer when it is useful? I'm trying to say that materialization is always useful. And I believe this is true even here. If your dataset is much smaller, then Noria will also use less memory. What happens when full materialization runs out of memory? Does it slow down, or does it stop accepting further queries? Well, Noria currently crashes. It gets killed by the out-of-memory killer in the kernel. And yet materialized views in Postgres are also not in memory, I think.
Do you see a future where enabling partial materialization for some queries on a production database would magically improve your throughput? I think it's very hard to add good materialized views to an existing database, like MySQL or Postgres. Oh yeah, the other reason why materialized views in databases are a problem is because they re-execute the entire query every time, even if only a small part of the data has changed. Whereas Noria will incrementally keep the result up to date. So if one new vote comes in, it doesn't have to count all the votes; it will just add one to the current count. MySQL is measured with an in-memory file, basically with a RAM disk, but this is measuring the resident memory of the process, and so the RAM-disk part of MySQL is not measured. I could include that, but it seems a little unfair, because MySQL doesn't necessarily optimize for the size of its on-disk storage; the trade-offs for disks are a little different than for memory. I could certainly add the storage cost of the base tables here by measuring the size of the RAM disk. Okay, so let's move on from the first graph. It basically makes this argument about instance types, how much memory you need, that sort of stuff. The next section is on the memory-latency trade-off. I mentioned this a little bit in the setup: one of the advantages of partial state is that you can throw stuff away. But the downside is that sometimes a query will be executed, and the application is waiting for the response, but that result isn't cached and we need to go compute it. We basically have to do an upquery and wait for it to come back. When you have to do that kind of on-demand computation, there's a latency trade-off inherent in it. Now the result doesn't come back immediately; you have to wait for it.
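The miss path just described, where a lookup finds no cached result and has to recompute it from base data while the caller waits, can be sketched roughly like this. The types and names are hypothetical, not Noria's real API, and real Noria upqueries travel back through the dataflow graph rather than scanning a base table directly:

```rust
use std::collections::HashMap;

// A toy partially-materialized view: vote counts per story.
// On a hit we return the cached count; on a miss we recompute it
// from the base "votes" rows (the analogue of an upquery) and may
// evict another entry to stay within a crude memory bound.
struct PartialView {
    cache: HashMap<u64, u64>, // story_id -> cached vote count
    capacity: usize,          // stand-in for a memory limit
}

impl PartialView {
    fn new(capacity: usize) -> Self {
        PartialView { cache: HashMap::new(), capacity }
    }

    fn lookup(&mut self, story_id: u64, votes: &[(u64, u64)]) -> u64 {
        if let Some(&n) = self.cache.get(&story_id) {
            return n; // hit: no recomputation, no waiting
        }
        // miss: "upquery" the base data, which is the slow path
        let n = votes.iter().filter(|&&(s, _)| s == story_id).count() as u64;
        if self.cache.len() >= self.capacity {
            // evict an arbitrary entry (Noria currently evicts randomly)
            let evictee = self.cache.keys().next().copied();
            if let Some(k) = evictee {
                self.cache.remove(&k);
            }
        }
        self.cache.insert(story_id, n);
        n
    }
}

fn main() {
    let votes = vec![(1, 10), (1, 11), (2, 12)]; // (story_id, user_id) base rows
    let mut view = PartialView::new(1);
    assert_eq!(view.lookup(1, &votes), 2); // miss: computed via the "upquery"
    assert_eq!(view.lookup(1, &votes), 2); // hit: served from the cache
    assert_eq!(view.lookup(2, &votes), 1); // miss that also evicts story 1
}
```

The latency experiments that follow are essentially measuring how often requests land on the slow path here, and how much slower it is.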
And the easiest way to evaluate this is basically to start something like lobsters with all of the views empty, the whole cache empty, and then you start the application, you start the workload, and you just see how the latency changes over time. And that's this plot, which some of you may have seen if you follow me on Twitter, because I've worked a lot on this particular plot. And this one, let me tell you, is a doozy. This is plotting the page latency over time. Time zero here is when Noria starts and the workload generator starts, with all the caches empty. And what you see is that over time, as the workload keeps going, initially the page load latency is very high. Notice that the y-axis here is log scale, so going from here to here is an increase of 10 times, and from here to here is another increase of 10 times, and so on. So this means that in the beginning, requesting any given page is gonna take on the order of a second, which is a very long time to wait for a page to render. But then over time, as Noria starts to cache the most popular query results that the different pages request, the latency starts dropping, right? Because the cache starts filling. And then the cache keeps filling and the latency keeps dropping. And then what this plot shows, what the colors are for, is that the brighter the color, the closer to white, the higher the part of the tail it is measuring. Very often, graphs will measure the mean, or maybe the median, maybe show a standard deviation, that kind of stuff. And that's useful, but it doesn't really give the full image of what's going on. So what this does instead is tell you: the 50th percentile here is the fastest 50% of requests. 50 to 90 is everything between the half fastest requests and the 90th percentile of requests.
So imagine you sort all the requests by how long they took. Then the 50th percentile is the middle, and the 90th percentile is nine tenths of the way up. So if you look at the darkest purple together with the next purple, that says nine out of 10 requests took less than this time, right? If we look at the beginning, nine out of 10 requests took two seconds. Some of them took less, but two seconds was sort of the cap on how long nine out of 10 requests took. And then after about a second, that's one second; about two seconds later, that's 10 milliseconds. You can see the black bar here is the mean. The white goes all the way up to the max, so that's the slowest request measured in that timeframe at all. And you see all of these go down over time as the cache starts to get populated. Do you think Noria could become competitive with MySQL or Postgres without materialization? No, the computation, the query model, is very different, and it's not built for on-demand computation. The outlier in the 95th percentile at two seconds is interesting. What outlier? You're talking about this? There isn't really an outlier here. It's just that some requests do not hit in cache, right? What's the typical query like in lobsters? So lobsters' queries are all over the place. They're joins and aggregations and stuff. In general, they try to avoid aggregations because they're too slow in MySQL. So there's already a bunch of manual materialization work they did. I'll get to some of these in a second, but take keeping vote counts: usually, instead of actually doing a count in SQL, you would have a column on the article table that's the vote count, and whenever a vote comes in, you insert it into the votes table, but you also update the vote count column of the article. And that way your reads can just read directly from that column. But there isn't really a typical query.
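That manual vote-count materialization can be sketched like so: the write path keeps a denormalized count in sync, and the read path never aggregates. The schema and names are made up for illustration, and this hand-maintained bookkeeping is exactly the kind of thing Noria's incremental views are meant to derive automatically:

```rust
use std::collections::HashMap;

// Hand-rolled materialization as the lobsters authors do it: rather
// than COUNT(*)-ing the votes table on every read, each write also
// bumps a denormalized vote_count, and reads just fetch that value.
struct Db {
    votes: Vec<(u64, u64)>,              // (story_id, user_id): source of truth
    story_vote_count: HashMap<u64, u64>, // denormalized "column" on the story
}

impl Db {
    fn cast_vote(&mut self, story_id: u64, user_id: u64) {
        // INSERT INTO votes ... plus
        // UPDATE stories SET vote_count = vote_count + 1 ...
        self.votes.push((story_id, user_id));
        *self.story_vote_count.entry(story_id).or_insert(0) += 1;
    }

    // the read path never touches the votes table
    fn vote_count(&self, story_id: u64) -> u64 {
        *self.story_vote_count.get(&story_id).unwrap_or(&0)
    }
}

fn main() {
    let mut db = Db { votes: Vec::new(), story_vote_count: HashMap::new() };
    db.cast_vote(7, 100);
    db.cast_vote(7, 101);
    assert_eq!(db.vote_count(7), 2);
}
```

The cost of this approach is that every read path the application might add needs its own bookkeeping like this, which is the "rolling your own cache" burden the end of the eval compares against.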
The most common queries are things like: give me the comments for this story, give me the front page, the most popular stories, that kind of stuff. The 99th-percentile-to-max color is a little difficult to distinguish; it might be on purpose, but I think it should be a bit darker to distinguish it from the white background. I think that's just the stream encoding. In reality, the background is gray and the color is like yellow. I don't know if you can see it better if I zoom in. So I think that's just the stream video encoding. At least it's better in the PDF. Why do you use log scale for the plot? So there are two questions here: why is the y-scale log, and why is the x-scale log, which you may or may not have noticed. The y-scale is log because otherwise one second is so much longer than 10 milliseconds. If I didn't plot this log scale on the y-axis, then you would just see that it drops basically to zero. Once you get past the first couple of seconds, it would look like it's all zero. But in reality, there are things happening at the lower latencies here too, and that would just be invisible if this wasn't log scale. As for why the x-axis is log scale, it's a little bit of an artifact of the measuring technique, which is that I'm measuring the latency in buckets of increasing size. I'm measuring the latency in the first second, then the next second, then the next two seconds, then the next four seconds, and so on. And this is because I wanted to not sample too frequently, because sampling adds a bunch of overhead. There might not be a good reason to do this really; I could probably plot them on a linear scale, but it comes back to: if this was plotted on a linear scale all the way up to 128 seconds, the thing happening in the first few seconds would be invisible. It would appear as just a diagonal line at the far left of the plot. And if I cut it earlier, then you wouldn't see that it keeps dropping.
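The bucketing scheme just described, where the sampling windows double in length (1 s, 1 s, 2 s, 4 s, ...), so that log-scale time gets evenly spaced points, might look something like this. This is a reconstruction of the idea, not the actual harness code:

```rust
// Produce the (start, end) boundaries, in seconds, of latency-sampling
// windows that double in length: 1s, 1s, 2s, 4s, 8s, ... up to the
// total experiment duration.
fn doubling_windows(total_secs: u64) -> Vec<(u64, u64)> {
    let mut windows = Vec::new();
    let (mut start, mut len) = (0u64, 1u64);
    while start < total_secs {
        let end = (start + len).min(total_secs); // clamp the final window
        windows.push((start, end));
        start = end;
        // the first two windows are 1s each; after that the length doubles
        if windows.len() >= 2 {
            len *= 2;
        }
    }
    windows
}

fn main() {
    // first second, next second, next two seconds, next four seconds
    assert_eq!(doubling_windows(8), vec![(0, 1), (1, 2), (2, 4), (4, 8)]);
}
```

With windows like these, a linear walk over the plot's x-axis in log space visits one data point per window, which is why the x-axis ends up log scale.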
So that's the log x-axis. Do you have a graph that shows how the write performance changes depending on the amount of items in cache? Data still gets updated even in read-heavy systems. So I don't have a graph directly on write performance. Actually, no, I do; you'll see one a little bit later. Yeah, so this is basically trying to make the argument that it is true that Noria increases the tail latency, but in practice, the tail latency is pretty low, because most things hit in cache. Any given page in lobsters issues, I think, on the order of 10 queries. So even if one of those 10 misses, it doesn't add that much to your page load time. If all of them miss, which is what happens in the beginning, it certainly takes a long time, but what this is showing is that, at least with this workload generator, that just doesn't really happen. Even your max is just not that high once the cache has been warmed. So this is trying to demonstrate that trade-off. And then, of course, the next part of that trade-off is what happens if you try to evict more aggressively. Basically, how aggressively you evict affects how much of an impact there's gonna be on your latency. If you evict more things, more things will miss, so your latency will go up. And that's what this next plot gets at. This is showing a CDF. If you're not familiar with CDFs, it's basically the same thing as what I encoded in the colors of the graph above. So this is saying that, for this line, which is without partial, six out of 10 requests took less than three milliseconds. Or if we look here, eight out of 10 requests took less than eight milliseconds, is the way to read this. The place on the y-axis is what fraction of requests, and the point on the x-axis is how much time that fraction of requests took.
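Both the percentile bands in the earlier plot and this CDF come down to the same computation: sort the sampled latencies and index by rank. A minimal nearest-rank sketch; a real harness would typically record into a histogram rather than sorting raw samples:

```rust
// Nearest-rank percentile: the smallest sampled latency such that
// p percent of the samples are less than or equal to it.
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    samples.sort_unstable();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

fn main() {
    // latencies in milliseconds for ten requests; one of them missed
    // in cache and paid the slow path
    let mut lat = vec![3, 3, 4, 5, 5, 6, 8, 8, 10, 2000];
    assert_eq!(percentile(&mut lat, 50.0), 5); // half the requests took <= 5 ms
    assert_eq!(percentile(&mut lat, 90.0), 10); // 9 of 10 took <= 10 ms
    assert_eq!(percentile(&mut lat, 100.0), 2000); // the max: the one miss
}
```

Plotting the full sorted list (rank fraction on y, latency on x) gives exactly the CDF being described, which is why the mean alone would hide that 2000 ms tail.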
And what this is showing here is, well, I'm not actually quite sure why turning on partial slows things down this much, because this is steady state. This is after the cache has been populated. I don't know where these gaps come from, and it's something that has to be fixed. But certainly there are a couple of things to take away from this. The first is that with partial, your tail is longer, right? You see that in the very top parts of this graph: the last few requests with partial take a decent amount longer than the slowest requests without partial. And that is because sometimes you just miss; you get a cache miss and now you need to upquery. So those will take longer, and this shows you how much slower the upqueries are. You can see that the upqueries are slower by about a factor of 10. So it's a decent amount slower when you miss in cache, but that's sort of as expected. And then the other takeaway from this graph is that as you evict more aggressively (darker colors here mean more aggressive eviction), your tail suffers, right? And this is because if you evict more aggressively, you miss more often, and therefore more requests will take longer. And you see here, this is also a log x-scale. The reason for this is the same as for the other one: if it wasn't log scale, this would just look like an S with all the lines on top of each other, and you wouldn't really see these differences, because the range of the tail is so long. Let me see if I can get there. So this is, as I mentioned, steady state. Basically, the benchmark runs for two minutes and then I measure the CDF.
And you see here also the names of these lines, and I don't know whether this is the way I wanna go with it, but they're basically showing what the actual memory use you end up with is at this amount of eviction. You see that going from not using partial to using partial saves you like 10 gigabytes; these are things that are just never accessed. Turning on eviction saves you another seven gigabytes or so, and then you can keep making eviction more aggressive to save more, but there are diminishing returns once you start pushing it down. And so for those of you who were asking how much slower cache misses are, this figure also tries to answer that question, right? This sort of tail end here is about 10 times slower than over here, and that happens when you miss. So it goes from like 10 milliseconds to 100 milliseconds. Anecdotally, some of you may have followed me on Twitter and seen some of my work on trying to amortize data structures. When I was initially plotting this graph, the tail here was like 10 seconds instead of 100 milliseconds. And that was because one request every so often would end up having to resize a hash map of millions of elements, and that just takes a really long time. So the tail would just be really long. A lot of the work that I've done on the amortization stuff was just getting this tail to be just the upqueries. I would change the bounds on the x-axis to zoom in a bit more. Yeah, I could do that. I mean, it's a little tricky because it would mean that it doesn't start on a round number, but it's totally doable. All right, so that's the argument about tail latency. Another interesting point here is that you can only push this so far, and the text makes the argument too that if you try to evict even more aggressively, then the system won't be able to keep up.
You end up missing so often, basically on every request if you evict aggressively enough, that at that point the system is just not gonna keep up with load, because it's always filling the cache. It's just constantly serving cache misses. So there's a limit to how aggressively you can evict. Basically, this line is just gonna keep being pushed to the right. And there's sort of a cliff: imagine that there's some story or some article that is very popular. If that article ends up being evicted frequently, then your whole performance goes over a cliff, because you're constantly gonna be recomputing that article. Why are the memory gains with more aggressive eviction so low? The reason for that is, remember, this also includes the base tables. Noria has the base tables in memory, and we can't evict from the base tables, because they're the durable storage of our data. Upqueries ultimately end up at the base tables, so they need to have all the data. And this value includes the base table storage, which is unfortunate, because it means that the gains are all in the cache size above the base tables. But the base tables are pretty large and end up taking up a huge chunk of this. I think in reality the base tables here are like 40 gigabytes, which of course means these savings are much higher than they seem from this graph. I'm still trying to work out the best way to articulate this point, which is a little tricky. Okay, so this is the query I showed you earlier. Oh sorry, what's the difference between no eviction and no partial? Does no partial mean we compute vote counts for all articles in the beginning? Yeah, no partial means we're doing full materialization, so every query result is always kept and computed. And no eviction means we have partial enabled, but we never evict anything.
So one thing you might wonder is that the numbers here are sort of low. If you look at this graph, this is like 3,000 pages per second, and even if every page issues maybe 10 queries, that's still only 30,000 queries per second, which may seem like a lot, but you might wonder what happens if we go further. And we can't really do that with lobsters, because if you look at the plot up here, above about 7,000 pages per second Noria can't keep up anymore even though it has memory to spare. The reason is that it ends up being bottlenecked by one particular query: we just can't service the writes that have to go through that path any faster. We basically get limited by the throughput of the one thread that services that dataflow path. And this means we can't see what happens at, say, a million requests per second, but we kind of want to know that too. So what we're gonna do is extract this particular query from lobsters. It isn't exactly a query that appears in lobsters, but it's pretty close: it's the query I talked about earlier over articles and votes, where you just want to fetch a given article along with its vote count. And then we're gonna build a workload generator for this query alone. It's not gonna do any of the other stuff that lobsters has; it's a micro-benchmark of just this query. And then we can push things a lot further. All right, so this is a weird graph, and it's probably not how it's gonna look in its final form. In fact, I don't like this graph. So this graph, which I actually plotted just before the stream started because I got some new data, is a throughput-latency graph. These are a little bit weird, and they deserve some explanation, because if you read this graph naively, you might see some really strange things going on.
In a throughput-latency graph, you run the benchmark at a particular load factor. You target, say, a million operations per second, open loop, and then you measure: when I tried to run at a million operations per second, how many operations per second did the system actually achieve, and what was the latency of those requests? In this case we're plotting the 90th-percentile latency, but it could be the mean or median or whatever. In fact, let me make this plot the mean just to see whether that's a more useful plot. Yeah, it's a little more useful. What this means is that if you look at a given point, the Y value is not a function of the X value. Rather, the X and Y values are both functions of a hidden parameter for that point: the input rate that was offered. That's the missing piece of data the plot doesn't show. For many of these points, the achieved throughput matches the input rate: this point was offered 250 operations per second, achieved 250 operations per second, and the latency was very low, close to zero. This one was offered 500 operations per second, achieved 500 with low latency, and so on. The reason throughput-latency plots are useful is that they tell you when a system falls over. You keep increasing the input rate, and at some point the achieved throughput stops going up because the system can't keep up anymore. At that point the system starts building a backlog of requests that aren't being serviced, so it ends up with a queue, and the latency goes up. And you get this hockey-stick effect where, once the system no longer keeps up, the curve bends sharply upward, which is what you see here, right?
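That hockey stick can be reproduced with a toy queueing model. This is an assumption-laden sketch, not the actual benchmark harness: requests arrive at a fixed offered rate regardless of completions (open loop), the server drains at a fixed capacity (the 4M ops/s figure is made up to echo the plot), and latency is just queueing delay.

```rust
/// Toy open-loop model: requests arrive at `offered` ops/s for `secs`
/// seconds; the server drains at most `capacity` ops/s. Returns the
/// achieved throughput and the mean queueing delay in seconds.
fn run(offered: f64, capacity: f64, secs: u64) -> (f64, f64) {
    let mut backlog = 0.0; // requests waiting in the queue
    let mut done = 0.0;
    let mut delay_sum = 0.0;
    let mut arrivals = 0.0;
    for _ in 0..secs {
        backlog += offered; // open loop: arrivals never slow down
        arrivals += offered;
        // a request arriving now waits behind the current backlog
        delay_sum += offered * (backlog / capacity);
        let served = backlog.min(capacity);
        done += served;
        backlog -= served;
    }
    (done / secs as f64, delay_sum / arrivals)
}

fn main() {
    let capacity = 4_000_000.0; // assumed: server falls over at 4M ops/s
    for &offered in &[1e6, 2e6, 3e6, 4e6, 4.5e6, 5e6] {
        let (achieved, delay) = run(offered, capacity, 60);
        println!(
            "offered {:>9.0} -> achieved {:>9.0}, mean delay {:.3}s",
            offered, achieved, delay
        );
    }
}
```

Below capacity, achieved throughput tracks the offered rate and delay stays flat; past capacity, achieved throughput pins at the capacity while the backlog (and hence delay) grows without bound. That is exactly the vertical part of the hockey stick.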
At some point the system is no longer keeping up, so the achieved throughput stops going up even though the input rate does. For example, this point has an input rate of about four million operations per second, and this one about four and a half million, but you see it doesn't actually manage to get much beyond four. In fact, sometimes with these throughput-latency plots the curve starts going backwards: you increase the input rate, but the achieved throughput goes down. That's often because the system starts to get overwhelmed; systems generally become a little less efficient once they run at capacity, and you can see that effect in the plot. What this plot in particular is trying to show is that the darker lines are more aggressive eviction, and as you run with more aggressive eviction, you can't achieve as high a throughput. The darkest purple line has very aggressive eviction and falls over at about four million operations per second. This line has slightly less aggressive eviction and falls over at about 5.5 million requests per second. This one has less still and falls over later, and you keep seeing this effect. I'll explain why in a second, but does the plot make sense? "If you have the base tables on disk, that would mean the tail latency would be a lot bigger, but at the same time you would have much more budget for caching." Exactly, that's basically the trade-off. And I don't know what the right configuration to run with for the thesis is, because the base tables are sort of irrelevant to the point the thesis is trying to make. Ideally, in all these plots, I'd mark the part of the memory use that's for base tables in a different color or something.
The problem is it's really hard to know what that is: you can measure the overall memory use of a process, but it's hard to measure how much of it is due to one particular set of hash tables, for example. "Is no eviction much the same as thrashing for paging?" Not sure I follow. "Are you required to have a larger left margin than right?" No, this is because it's book format. This is a left page and this is a right page, and this is just standard typography for books, because the thesis will be in book form. Okay, so does it make sense why we're doing this with a throughput-latency plot, and does the basic idea behind one make sense? It's weird, I agree. The y-axis here is also really strange, and I'll talk about that in a second. Okay, so the intuition behind why more aggressive eviction falls over earlier is this: if your throughput is higher, then in a given period of time you're accessing more keys, right? At two million operations per second, say you're accessing 100,000 distinct keys in a given second. At four million operations per second, you're accessing, say, 200,000 keys in that second. In the first case you can't evict any of those 100,000 keys or you'll end up missing a lot; in the second case you can't evict any of the 200,000. So as your throughput increases, you need to keep more of your data set hot, so to speak, in your cache, and therefore the cache size has to increase; the amount of stuff you can evict shrinks. That's what this plot is showing: as your throughput increases, you need less aggressive eviction. It's a subtle point, but an important one. "What causes the bend in the bright pink line?" When a system runs at capacity, it's really hard to estimate its performance in any reasonable way.
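That working-set intuition can be sanity-checked with a short simulation. The key count (100,000) and Zipf exponent (1.15) here are assumed for illustration, not the benchmark's exact parameters; the point is only that doubling the request rate in a fixed window grows the set of keys that must stay resident.

```rust
use std::collections::HashSet;

/// How many distinct keys does a Zipf-skewed workload touch in a
/// window of `samples` requests? `n` keys with weight rank^-s.
fn distinct_keys(samples: usize, n: usize, s: f64, seed: u64) -> usize {
    // cumulative distribution over key ranks 1..=n
    let mut cdf = Vec::with_capacity(n);
    let mut acc = 0.0;
    for i in 1..=n {
        acc += (i as f64).powf(-s);
        cdf.push(acc);
    }
    let total = acc;
    let mut state = seed;
    let mut seen = HashSet::new();
    for _ in 0..samples {
        // simple deterministic LCG; good enough for a sketch
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let u = (state >> 11) as f64 / (1u64 << 53) as f64 * total;
        let key = cdf.partition_point(|&c| c < u); // inverse-CDF sample
        seen.insert(key);
    }
    seen.len()
}

fn main() {
    // Double the requests in a one-second window: more distinct keys
    // are hot, so less of the state is safe to evict.
    let low = distinct_keys(100_000, 100_000, 1.15, 42);
    let high = distinct_keys(200_000, 100_000, 1.15, 42);
    println!("100k reqs touch {} keys; 200k reqs touch {} keys", low, high);
    assert!(high > low);
}
```

The same seed is reused, so the larger window is a strict superset of the smaller one; the distinct-key count can only grow with the rate, which is why the eviction line keeps moving right as throughput goes up.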
The way I actually want this plot to go: I don't want it to go this high. Let me run this at a slightly more sane scale. There's no reason for it to go quite this high; this, for example, demonstrates the same point. Really, what I need here is just more data points to make it look reasonable. Because all that matters is that at some point you get a vertical line, and it doesn't really matter what happens on that vertical line, because the system is not keeping up. That is the only point this is trying to make: where the system falls over. We could do that just with a table, but a plot is usually a nicer way for the audience to experience it. "Evicting too much is similar to thrashing for page misses." Yeah, that's right. "Could you use a separate process which only contains the base tables, and share with copy-on-write pages?" Not easily. Remember, Noria is a pretty complex system, like 80,000 lines of Rust code, and it'd be tricky to just say we're gonna have another process hold the base tables. "Warrants some explanation, but totally makes sense." Yeah, so obviously there's a lot of text in the section as well that I'm just summarizing, but the text basically makes the same arguments I'm making here. "Is this experiment run with writes turned off?" No, sorry, I forgot to mention that. This is run with one in ten requests being writes, so 90% reads, which means that at this point there are 400,000 writes per second going to votes, and they're skewed. It's a Zipfian distribution, which might not mean much to you, but basically about 90% of votes go to 1% of articles. So there's a pretty significant skew in the data. "Measuring the base table memory versus the cache memory usage is an interesting problem."
"It could use a customized allocator that adds the bare minimum you'd need to instrument that." Maybe, but part of the challenge is it would mean I couldn't use any standard Rust data structures, because they're not generic over the allocator yet. So it's a little challenging. "It'd be nice if you talked about your automation in that Makefile." I can do that a little bit towards the end, and we're getting towards the end anyway. All right, so then we get to this point about bringing up new views: your application has been running for a while, you add a new query, now what happens? This is where the big difference between full and partial shows up: with full, you have to compute everything at once, whereas with partial, you can compute results on demand and fill your cache incrementally. So what we're gonna do is introduce a new query into this articles-and-votes setup that computes a slightly different result over the same data. The exact query isn't that important, but we end up with this plot. It shows, at the red line at time zero, the moment we introduce this new query, and then we measure, in the top plot, the fraction of reads that hit in cache for the new view, and in the bottom plot, the write throughput of the system. There are some interesting aspects to this. First, as you see, with full materialization, without partial, you can't read from the new query at all for the first 20 seconds or so after you add it; notice that there are no reads in that window. That's because with full materialization you have to compute the entire result of the query before you can answer any questions about it. Whereas with partial, the view starts out empty and we fill, on demand, only the keys the application is asking for. So our hit rate is initially pretty low, but it increases over time as particular articles are requested.
But crucially, the query immediately starts to become useful. The other thing you'll see, in the bottom plot, is that the write throughput is actually higher with partial than without partial, even after the migration is finished and the query is fully up and running. This is because with partial, you only need to maintain the values for the things that are cached. If there's a particular key whose query results aren't cached, then any write that affects those results requires no computation, because there's nothing to update. And that's this delta over here. We also see an interesting point here: while the full materialization version is building the new query's results, write throughput drops by a lot, because it has to spend a bunch of resources computing everything that goes into that new view. With partial you don't have that problem, because it doesn't have to do everything upfront, so the only drop in throughput is due to the few upqueries that do happen. So this plot is also a little subtle. Maybe you've realized by now that many of these plots are kind of subtle, and they really are. It's trying to make two points, but the key one is that with partial materialization, when you add a query, you can immediately start serving it, because you immediately have an (empty) cache, whereas with full materialization, you have to do a lot of work before the view is even available. Let me make sure you get the caption. I realize we're running a little bit long. We only really have two more graphs, and I'm gonna skip one because it's less interesting. But does this graph roughly make sense? All right, so I mentioned how the, actually, we're just gonna skip this graph because it's important, but not important enough. So the last part of the eval is about rolling your own.
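Before moving on, the partial-versus-full behavior just described can be sketched in miniature. This is a hypothetical toy, not Noria's API: a partial view starts empty, fills each key from the base data on first access (an upquery), and writes only do work for keys that are actually materialized.

```rust
use std::collections::HashMap;

// Base data: article id -> vote count.
type Base = HashMap<u64, u64>;

/// Partially materialized view: starts empty, fills on demand.
struct PartialView {
    cache: HashMap<u64, u64>,
    upqueries: usize, // misses that had to go back to the base data
}

impl PartialView {
    fn new() -> Self {
        PartialView { cache: HashMap::new(), upqueries: 0 }
    }

    /// Read a key; on a miss, "upquery" the base table and fill.
    fn get(&mut self, base: &Base, k: u64) -> u64 {
        if let Some(&v) = self.cache.get(&k) {
            return v;
        }
        self.upqueries += 1;
        let v = *base.get(&k).unwrap_or(&0);
        self.cache.insert(k, v);
        v
    }

    /// A write only does work if the key is materialized.
    fn on_write(&mut self, k: u64) {
        if let Some(v) = self.cache.get_mut(&k) {
            *v += 1;
        } // absent key: nothing cached, nothing to update
    }
}

fn main() {
    let mut base: Base = (0..1_000).map(|i| (i, 10)).collect();
    let mut view = PartialView::new();

    // The view is usable immediately; only requested keys get filled.
    assert_eq!(view.get(&base, 7), 10);
    assert_eq!(view.cache.len(), 1);

    // A vote for an unmaterialized article costs the view nothing.
    *base.get_mut(&42).unwrap() += 1;
    view.on_write(42); // 42 isn't cached, so this is a no-op

    // First access to 42 upqueries and sees the current count.
    assert_eq!(view.get(&base, 42), 11);
    assert_eq!(view.upqueries, 2);
    println!("filled {} of {} keys on demand", view.cache.len(), base.len());
}
```

A "full" view in this sketch would have to loop over all 1,000 keys before serving its first read, and every write would touch it; that gap is exactly the two effects the plot shows.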
And this is a point that came up when I showed the very early graph: for lobsters, all we've really evaluated is Noria versus MySQL. In reality, what many people do once they try to run their application at really high scale and really high throughput is stick their own cache in front of the database. You set up memcached or Redis or something like that, you stick it in front of your database, and lo and behold, your performance is better. And that's great. And that is something Noria has to compete with, right? We have to be able to articulate why Noria is better than that. And Noria is better than that primarily because it's so much easier to use. Of course, this is a very hard point to make, because it's a soft point. Many of the other arguments have been quantitative, where we can show that, look, we're higher on the line than this other thing. But this is a qualitative one: if you have to implement caching yourself, it's really complicated. Caching is hard. Imagine even just the article-with-vote-count view from above. Imagine that in memcached, you kept the current vote count. Now you need to make sure that everywhere in your application that wants to read the vote count, it checks the cache first. If it hits, that's great. If it misses, what does it do? Well, it needs to go to the database and get the result. But imagine that some popular article has its vote count invalidated, because a new write happens or it gets evicted. Now imagine that lots of clients all try to read that same vote count, because it's a popular article. They all miss, they all go to the database at the same time, and the database gets overwhelmed by all these clients asking for the same result. This is called the thundering herd problem, and it's really hard to solve correctly.
Similarly, imagine that an application tries to read from the cache and misses, so it goes to the database and gets the answer. Now obviously it needs to put that answer into the cache so that future reads will hit. But imagine a write comes into the database after the application read from it. If the application now sticks the value it read into the cache, the cache is holding a stale result, an incorrect old result, because the data in the database has changed. So there's a race condition here that the application needs logic to handle, and it turns out it's really complicated to solve this problem correctly. I think many application developers just sort of ignore it: when you write, you invalidate the cache; when you read and miss, you go to the database and put the result in the cache. Normally that works fine, but it's going to sometimes yield incorrect results, and application developers might not be aware of this. There's a paper from a few years ago from Facebook where they basically built a system to solve this on top of memcached, and it's extremely complicated. It's a big systems paper on integrating MySQL and memcached; it requires code-level changes to both MySQL and memcached to get it to work, and it requires a lot of application-level logic. And the point is that application developers should just not have to deal with that. Furthermore, you need to remember everywhere in your application that touches the vote count for a given article: anywhere that adds a vote, removes a vote, or changes a vote, like when a user changes an upvote to a downvote. All of those code paths need to make sure they update the cache.
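That read-then-fill race can be made concrete with a deterministic replay of one bad interleaving. This is a sketch of the classic cache-aside bug; the key name and values are made up, and the "database" and "cache" are just maps standing in for MySQL and memcached.

```rust
use std::collections::HashMap;

/// Deterministic replay of the cache-aside race. Returns the final
/// (database value, cached value) for the key.
fn simulate_race() -> (u64, u64) {
    let mut db: HashMap<&str, u64> = HashMap::new();
    let mut cache: HashMap<&str, u64> = HashMap::new();
    db.insert("article:1:votes", 100);

    // 1. A reader misses in the cache and reads the database.
    assert!(cache.get("article:1:votes").is_none());
    let read_value = *db.get("article:1:votes").unwrap(); // sees 100

    // 2. Before the reader fills the cache, a write lands: the vote
    //    count changes and the cache entry is invalidated (a no-op
    //    here, since the cache is still empty).
    db.insert("article:1:votes", 101);
    cache.remove("article:1:votes");

    // 3. The reader now fills the cache with the value it read back
    //    in step 1, which is already stale.
    cache.insert("article:1:votes", read_value);

    // Nothing will ever fix this: future reads hit the stale entry
    // and never consult the database again.
    (
        *db.get("article:1:votes").unwrap(),
        *cache.get("article:1:votes").unwrap(),
    )
}

fn main() {
    let (db_value, cached) = simulate_race();
    println!("database says {}, cache says {}", db_value, cached);
    assert_eq!(db_value, 101);
    assert_eq!(cached, 100); // stale until the next invalidation
}
```

Note that the writer did everything "right" (it invalidated on write); the bug is purely in the ordering between the reader's database read and its cache fill, which is why it survives in production for so long.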
And if you add another cache later, you need to go back and check every code path in your application to see whether it should also be invalidating that cache in certain places. So rolling your own is really complicated, and that is one argument the thesis is trying to make. It's hard to make this point; it currently makes it just by explaining, like I just did, how complicated it is. Implementing caching correctly is really complicated, see all these papers. But despite how error-prone it is to implement this yourself, it's still extremely common, and Noria still has to have a story for how we measure up against it. We can't just say it's super hard, so just use Noria, even though that is compelling, right? With Noria, you just don't have to do any of this; Noria does everything for you. We need to be able to say that even if you did it yourself, it would still be worthwhile to switch to Noria. And the way we're gonna do that is to compare against Redis. So we're gonna construct a somewhat artificial benchmark where we're really just testing key-value lookup performance in Redis, and compare it to key-value lookup performance in Noria. And what you see is this. This plot is a little weird. This pink line over here is Redis on a single core, because Redis is not multi-threaded, and this is running just a single key-value lookup per request. And you see, this is a throughput-latency graph, right, that it falls over at around 2 million requests per second, which is pretty good for a single core. Quite good, actually. But remember, Redis here is not implementing any kind of cache invalidation, and there are no misses. This assumes everything hits in cache and every request is a single lookup, always. It's the best-case scenario.
It's assuming you have perfect caching. So the best you can do with single-core Redis is this much. And then this line over here is 16 times that, because the server we're running on is a 16-core machine. So if you had a workload that could perfectly shard Redis across the 16 cores with no overhead, you would end up at about 25 million requests per second on this server. With Noria, which, remember, is running the full SQL query over votes and doing all the cache invalidation and everything, you get to about 12 million requests per second, about 12 and a half. So you get within 2x of the perfect ideal you could get with Redis. Now, this is a really weird benchmark, because we're benchmarking against a hypothetically perfect system, which means all the numbers look kind of strange. But it does make the point that even if you rolled your own, you couldn't do better than this pink line, and Noria gets you this blue line with no effort from you. And that is how we try to tackle the question of "why don't I just use MySQL plus memcached the way I currently have been?" Okay, complex argument, but in theory the last argument. So let's see whether this made sense. "People think caching is easy and do it themselves, but every single time I tried it, it caused another 10,000 problems three months into production." That's exactly right, and that's part of what the text above tries to get at as well. It cites, among other things, a survey paper from a few years ago. I think the results are up here somewhere, let me pull them up. A survey from 2016 found that anywhere from 0.3 to 3% of application code, spread across 2 to 10% of the source files, is caching-related.
So these first two numbers say that not only is a chunk of your code caching-related, it also ends up spread throughout your code base, because everything needs to know about the cache. And cache-related issues make up 1 to 5% of all reported issues for an application. This really just shows that caching is complicated to get right. So that's the soft argument for why you should use Noria, and then the graph makes the harder argument: even if you've already solved all these problems, it's still not clear you shouldn't just be using Noria instead. "It works right in the first few weeks and then it explodes. 'Just put it into Redis, why would it be so complicated?' is a pretty common sentiment." It's true. "That last graph is really impressive." I'm glad; I've worked a lot to make this graph look the way it does. It's weird, though. This benchmark is super weird and unrealistic: it has a read-write ratio of something like 10,000 to one, because we wanted to emulate the case where there are no invalidations, and it has a lot of skew. It's also weird because Noria is given 16 cores and Redis runs on one core, but hopefully it tells the right story. "Redis might hit a different bottleneck even if perfectly multithreaded. How much data is 26 million requests per second? Maybe this is already memory-bandwidth limited earlier." Could totally be. This is why it's a weird benchmark: this line is just 16 times the single-core line, and realistically, you will never get to this line. You just won't. It could be that you could build a better caching system with a higher single-core baseline, but realistically you're gonna end up somewhere in between, because all these other bottlenecks crop up: the network, memory bandwidth, serialization performance. And if you actually built this multithreaded, then either you need perfectly shardable cache keys, which is rarely the case; usually some keys are hotter than others.
Or you need some serious concurrency primitives, which Noria does have, but Redis does not. "If Noria abstracts away the caching logic, even for the general case, at this performance point, it makes a very good case." I mean, yeah, that's the hope, right? With Noria, there's no caching logic in the application, and it gives you this performance. Now, realistically, as we saw from the earlier graphs, you're probably not gonna get to 12 million per second, but this is just trying to show the best case for Redis, and that Noria measures up to Redis's best case. "You've gone to great lengths to explain this to the reader or the committee, but I feel like every web developer just sees immediately why Noria looks so promising." I'm glad to hear that, but unfortunately, when you write a thesis, you need to convince people that what you're saying is true. It's not about convincing people that something might be useful; for a thesis, you need to convince them that it is useful. And that's part of why I'm working so diligently, in a sense, to make all these points carefully. "That plot, those numbers, even if artificial, and that exact argument is the story for Noria." I agree. The argument Noria makes is: we do the caching for you, and we do it well. "Migrating a workload to a cache is messy and error-prone and requires upkeep: monitoring the cache hit rates, detecting possible bugs, especially when dealing with loosely-typed languages." Yeah, caching is really, really hard, but unfortunately, unless you've worked with caching, you probably don't know how hard it is, and that's why the thesis needs to demonstrate that it's hard. "Noria is really interesting; however, I don't understand what the base data that you refer to in your benchmarks is, and why it is in memory in Noria right now." So Noria does support keeping the base data on disk using RocksDB, but it's a configuration we've exercised much less.
Maybe it's worthwhile running these benchmarks with durable base tables just so the memory use goes down, I'm not sure. It would also increase the tail latency pretty significantly, because now your tail has to hit disk, though maybe that's the more realistic use case, I'm not sure. Certainly it's currently a little awkward to work around the fact that the base tables are counted in the memory use, I agree. "Could you please slowly tell us about this topic?" I can, but I've talked about it for about two hours now, so it's hard for me to explain everything again slowly. I'd recommend just watching the video again afterwards. "Cache problems are like 75% of my debugging time too. We almost always search all other possibilities before we say it was a caching race condition. We expect it to be correct, but it rarely is." Right, but again, it comes back to: the thesis needs to demonstrate that this is true without relying on just saying "I have experienced it to be true." It's very easy to write "obviously, caching is hard," but in a thesis setting that just isn't gonna fly. "We use homegrown caching at work; it's the source of almost all our bugs." Sounds about right. And another argument the thesis tries to make is that there's a reason there isn't automated caching built into ORM layers and frameworks: it's really hard to generalize across applications, in part because caching is really hard. But part of the challenge here is that all of this is anecdotal evidence, and anecdotal evidence is pretty hard to use in a relatively formal setting, which a thesis is. I hope I've managed to get pretty close, though. My email address is on my website.
Okay, so we've now worked through it. It's funny: the last line, which ends up on its own page, is "and Noria achieves this performance while providing rich SQL queries without application-specific caching logic," which is just the point of Noria, but it takes like 25 pages of evaluation to be able to make that point. "I agree with your point about also measuring with on-disk data, but since it uses RocksDB, I don't know how ready it is for that. Also, it might not even be the point of the thesis." That's sort of what I'm getting at: it's not the point of the thesis that Noria supports durability, but at the same time, the current measurements obscure the point because of this implementation detail, which is awkward. "Why would you not recommend someone use Noria, just for the sake of argument?" Don't use it if your application's access patterns don't fit what Noria's built for. Noria is built for read-heavy web applications with skewed workloads, and if that's not you, you shouldn't be using it. Noria is not built for cases where you need strong consistency in your query results; it basically assumes that you're fine with cached, possibly stale results. And Noria does not support transactions. So if you need those, you can't use Noria. In theory, some of these things could be added; we just don't currently support them. Those, I think, would be the key points. "Noria for me looks like, CPU from some point of view, data go brrr." I don't know what to say about that, sure. All right, so I think we're then through the eval section, which is all I wanted to cover today. Hopefully you'll realize it takes a lot just to craft the story, and hopefully you feel that the story I've been telling you over these two and a half hours is decently compelling. There are definitely some things that are not satisfactory here, like one we talked about: this graph, which is currently pretty hard to read.
It's not clear it makes the right point; it might be that it shouldn't even be a graph. The other one, of course, that we discussed at length was this one, and whether these different values make sense. There's also the point about memory use. So some of these things I will have to fix up for the thesis to be reasonable in the end, but I think it's gotten to a narratively pretty good progression of points, with pretty strong support for the main arguments of the thesis. It ran a little longer than I wanted, but I'm happy to take a few questions, either about the thesis, or thesis writing, or why I visualize the data this way, or even just about PhDs in general: some quick-fire Q&A. "Noria does not save time-series data?" That is correct, although in theory you can operate on any data you want, as long as you can write SQL queries over it. "Do you have a plan to use GPUs for acceleration?" No, but you could. "For production use, it's certainly necessary to have good performance with on-disk data." Totally true. We went with RocksDB just to have something, because it didn't really matter. It could be that RocksDB is good enough for this; maybe I should just run the experiment and see what happens. But I agree that's more for the post-research phase of Noria, if that's ever gonna happen. "How many pages do you think it will be when it's done?" Good question. Ignoring references, it's currently 45 pages, and that's with only the eval and section headings. So my guess is it'll be like a hundred; sort of depends, hard to say. One section I wanna write that might be pretty long is gonna be an appendix where I try to give a non-technical explanation of Noria, which I think is gonna be pretty cool.
So the idea is that if you have no idea how databases work, you read this section first and it tells you what Noria is and why it's useful, in ways that make sense to someone who doesn't know computer science. This is basically the section I'm gonna tell my mom to read, and I think it's gonna be great. The automation, yeah. So for my thesis, if you look at the Makefile, the Makefile is pretty straightforward, but I use a lot of Python scripts to generate the graphs and such, and I've set it up so that whenever my benchmark data changes, the relevant graphs are regenerated and the paper is rebuilt. I also have orchestration scripts that run all my evaluation on EC2. So, for example, if I wanted to run the vote benchmark again, I could run it on EC2 with just this command; it's really just this. If I run that, it's gonna spin up EC2 instances, run the benchmark at all the different data points, download all the results, and stick them in this directory, in the format that the Makefile and the Python scripts accept. So this makes it really easy for me to rerun experiments, try different things, and have them all run in the same setup. I highly recommend automating your evaluation this way, especially if it's as execution-heavy as mine is; it's super helpful. "How'd you zero in on your research topic? You mentioned doing other things at the beginning of your PhD." It's funny, because right before I started this project, one of the last things I said to my advisor was "I don't think I really wanna work on a database," and look where it got me. Realistically, though, what happened was driven by a desire to make something better. I'd done a lot of web development in the past, and I just felt like databases aren't really built for what people actually use them for.
I wondered if I could do better, and then I just sort of started building something, and dataflow seemed cool, and I like concurrency primitives, and it just sort of built up from there. So it was very problem-driven on my part. Do you like writing the thesis? Like, writing the LaTeX document, is the process enjoyable for you? Yes, I quite like writing, especially sort of educational writing. I think I will like writing the other sections more, because in the evaluation section you need to work really hard to make points that I feel are obvious, and I think it's for good reason. I wish I could just say "obviously caching is hard", and I understand why I can't just say that, but it's a little frustrating to write convincingly about something that you feel is obvious. And so that adds a little bit more friction. Also, sometimes the graphs just don't show what you want them to show. The number of times I've run this experiment and tweaked parameters and changed the way it's plotted is just mind-boggling, and I still can't quite figure it out, and that process gets a little tiring after a while. But the writing itself, and especially writing about the system, I like a lot. Look for references to work related to your caches your paper refers to. I don't know what that means. When is your thesis due date? There isn't one. I'm currently aiming to graduate by October 31st, but it's weird, because your PhD doesn't have an end date. It's just done when your committee says you're done. So there is no due date per se. What kind of work are you considering in industry; what's your ideal job? I really want to keep teaching, but I don't think I want to do it in an academic setting. So I'll probably just keep doing live coding streams and stuff; that's probably my plan. And then as for my actual work, ideally I want to work in the Rust ecosystem. I like that a lot.
Maybe internally at a company, where I'd basically be sort of the Rust point person for things like infrastructure, but also interacting with the Rust ecosystem and the open source space, contributing to core libraries, the standard library, and the compiler as needed by the people who use Rust at that company. Or I'd just work on Rust open source in general, but it's a lot harder to be hired to just do that. Do you enjoy working on challenging problems themselves, or do you prefer the practical applications? I need there to be both. It's hard for me to work on something that I don't think is important or that I don't care about, but it's also hard for me to work on something I care about if I don't think it's interesting or hard. I don't just want to write some code, although sometimes I do; usually I want my brain to have to be engaged in the process. But I also don't want to build things just because they're hard; I want it to be useful to build them too. All the things you wish you knew when you started your PhD and your thesis? That's a lot to ask, but I'm gonna go with this: when you start a PhD, one of the things that's gonna be weirdest to you is that it is only you. When you're doing a PhD, it's very rare that people tell you what to do; it's all gonna be driven by yourself. You need a decent amount of self-discipline to make sure that you continue to do work, go in interesting directions, meet with your advisors, and make progress, and that's an interesting way of working. And I think the second one is that writing a thesis is weird, because there are no, or very few, guidelines. It's a very long-term and winding process, so I would recommend trying fairly early on to start a dialogue with your advisor, your professors, and the people around you about what thesis you're working towards.
You're not gonna figure it out for a while, and that's fine, but when you're three or four years in, if you have a project you've worked a decent amount on, start the conversation about what your thesis should be, because it sucks to spend a year on something that turns out not to be related to your thesis, or not to be useful for your thesis. And trying to build up a better mental map of your path towards graduating is useful. For the graph that has Redis compared to Noria, what would it look like if they were given the same number of threads? You can't do that. Redis is single-threaded; you can't run it with multiple cores. That's why there's a 16-core line. In theory I could run Noria on one core, but I think its performance would be pretty bad, because it's just not optimized for that use case. Noria has a lot of internal concurrency mechanisms because it is multi-threaded, and if you run it on a single core, those translate just to overhead. And so it would be a hard comparison. MIT pays for the Amazon instances. Yes, basically we have a grant for the project I'm working on, and the grant pays for the Amazon instances. Are you working alone on this, or is it a group or pair project? Well, I'm working alone on my thesis, but as I mentioned earlier, there's a bigger team, which has sort of rotated over the years, of other people who have also worked on this project, usually on different parts of it; partial state has always been my thing, if you will, in Noria. What is the most time-consuming part, right now and in general, in your thesis? The implementation? The most time-consuming part right now is just getting the evaluation results where I want them to be, and to some extent figuring out how to even plot them. But often it's just designing the right experiments, making sure they are the right experiments, and iterating until you get the results you want.
Very often what will happen is you think you have an idea of what the right experiment is. You design it, you build it, you run it, and the results are just weird, and then you need to figure out why they're weird and run the experiment again. That process takes a really long time, because really you're doing performance debugging. In general, debugging is most of what you do: either performance debugging or just debugging regular problems. Sometimes it's more architectural debugging, like, currently we can't support this kind of operator; how might we theoretically support such an operator? And then you work on a whiteboard for a while. It's all in the flavor of debugging, but that's very systems-oriented; I think that's not a general thing for PhDs. Will you open source the automation for the data? All of what you've seen today is open source, including all the scripts for automating on EC2 and all the benchmarks; everything is open source. The section for non-computer scientists is the number one thing I'm waiting for now; I want to see these real-world metaphors. It's gonna be interesting. It's basically an explanation of how you could make a library more efficient. It'll be weird. I like it. Do you struggle finding good resources that are acceptable for a scientific document like a thesis? Lots of good explanations are in blog posts, which are hard to cite properly and get accepted. In terms of references, it's not that hard. There's just a lot of academic research out there. Sometimes searching for it takes a little bit of time, but it's not too hard. Citing blog posts is fine too; it really depends on what argument you're trying to make. If you're trying to point out that Twitter handles a certain number of tweets per second, and they wrote it in a blog post, you can cite the blog post and it's fine. So that hasn't really been a problem. I really loved your course.
The Missing Semester of Your CS Education; I'd love to see more of that. It was a really fun class to do. We're probably gonna do it again next year. I don't know whether I'll be around to do it, but I do like that kind of education, and my hope is there'll be more. I got my PhD title on the eighth of this month. Congratulations! Do you have some background in education, or does it just flow naturally? I don't know. I've done it for a long time, right? I've TA'd, I've been in academia for a long time, I've been a student for a long time, and I also just really enjoy it, so it's probably a little bit of a mix. Trying to explain things has always come pretty naturally to me, so that might be part of it. As a Rust programmer, what do you think about using Rust in the Linux kernel? It's cool. I'm not gonna go too much in depth on it, because this is not really a Rust stream. You mentioned that you are on your own. How do you cope with that, and how do you drive yourself? Sometimes it's hard. Sometimes you're just not feeling productive. Usually I haven't found it to be that much of a problem, because I've usually found the work pretty interesting, but there are certainly times where I'm just not feeling productive, either because I'm not excited by the current problem I'm working on, or because I've been stuck on a bug for a week and I'm just not making progress. Some of the way you get through it is by recognizing that it's okay to get stuck. It's okay to not work for a week when you're doing a PhD and say, "I need to take a break." That's just a natural part of being your own boss: you need to manage your own sanity as well. It's okay to not be productive for a bit and do something else.
In terms of how I drive myself: partially it's driven by interest, partially, at this point at least, by a desire to graduate, and partially it's that I do a bunch of other things that I think are interesting, and that gives me enough distractions that I don't end up getting burned out on the main project. If I didn't have my live streams and some of the open source work I've done, I would probably have gotten pretty burnt out by this, and I'm still a little tired of having worked on the same thing for so long, but having all these side projects has helped a lot. Are you working full-time on your thesis? Yes, I mean, I'm a full-time PhD student, so yeah. Can you share the papers this thesis refers to? They're all in the bibliography, which is also in the repository for the thesis, which I'll include in the video description, and I think someone posted it. It's just github.com/jonhoo/thesis; if you look at the bibliography file, it has all the references. I think that's all the currently outstanding questions, so I think we're gonna end it there. My hope is that this was still useful. It's a little bit of a weird stream, very different from the ones I normally do, and I didn't have a good sense going into it of what it would turn into. I think maybe it was useful; hopefully it was a little bit interesting. I might do more of these, depending a little bit on the extent to which people found this useful. If you did find it useful, please let me know. And if there's enough interest in this, then what I'll probably do is a stream on the introduction and motivation section at some point, or maybe on the design section; we'll have to see. But yeah, thanks all for joining in, and I'll see you next time. Stay safe, everyone. Bye.