Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Welcome to another Vaccination Database seminar talk. We're excited today to have Malte Schwarzkopf. He is an assistant professor of computer science at Brown University. Prior to that, he was a postdoc at MIT for several years with the PDOS group, and that's where he worked on Noria, which I hope he'll talk about today. And then prior to that, he has a PhD from Cambridge University in computer science. And then presumably he went to undergrad sort of on the continent in Europe. But it doesn't matter. Once you have a PhD, it doesn't matter where you went. OK, awesome. So with that, again, we thank Malte for being here. Do you have any questions for Malte? As he gives the talk, please unmute yourself, say who you are, say where you're coming from, and ask questions. Let this be a conversation and not have Malte talk to himself for an hour. OK, awesome. Thank you so much for being here. Go for it. All right, awesome. Thank you so much, Andy. Let's get going. So as Andy mentioned, I run a research group at Brown that does systems research. And my group is sort of broadly working on four different topics at the moment, two of which I will talk about today. The first one is data flow systems. And Noria is the concrete system that I will talk about today, a database that enables fast websites via materialized views maintained through incremental streaming data flow. And then a more recent direction is systems that enable privacy compliance by construction, so systems that allow us to comply with privacy legislation like the European Union's General Data Protection Regulation, the GDPR, by design, out of the box.
We are also building Tuplex, a new Spark-like data analytics system that takes Python code and just-in-time compiles it to native code with performance competitive with C++. And finally, we're interested in data center memory efficiency. So with all these in-memory systems, in-memory databases, in-memory analytics systems, et cetera, memory is often the most contended resource in the data center, and we develop new abstractions for accessing remote memory efficiently. If you're curious about these projects, check out our website. But in this talk today, I will primarily talk about Noria and then briefly talk about our next system in the privacy-compliance-by-construction direction. So without further ado, let's get going. What happens when you access a website? So when I go to the CMU database group website, I get a rather nice website showing up in my browser. But really what's going on behind the scenes, of course, is that my computer talks to a web server which runs something like Apache or Nginx, and then some application code, probably some PHP or Python. And that application server talks to a database which actually stores the content, which actually stores the data that makes up the contents of the website. So this is a rather nice design, and it's been very, very successful because it separates the concerns of computation and storage. But one consequence of this separation into a stateless front-end web server and a stateful database is that writing an application that tolerates the increasing load that we get when many users are accessing a website is actually really difficult. And to understand that, let's look at how scaling a web app usually works today. So as the load increases, a single web server serving requests is likely soon going to be insufficient.
So if, for example, the CMU database group website got really popular because you guys published your latest paper, then the site might go down and Andy might deploy a couple of additional servers to handle the higher load. And this is very easy because the front-end servers are stateless. You can just spin up more virtual machines, and everything will scale nicely. However, things get complicated when the database back-end can no longer keep up. At that point, probably what you would do if you were faced with this problem in industry today is you would deploy a key-value cache. So the sort of standard solution to this is to deploy something like memcached or Redis to take some of the load off the database and serve some of the common query results directly from an in-memory cache in order to avoid having to go to the database. Now, these in-memory caches don't offer strong consistency, but they can take a lot of the read load that then the database doesn't have to serve. And that makes the whole system tolerate a higher load overall. But of course, soon enough, if there's even more traffic, a single key-value cache is not going to be sufficient. And you're going to start sharding your key-value cache and your database, in fact, replicating it, making multiple instances, and building a complicated distributed storage back-end. Now, this is a hugely complex software stack to solve a relatively straightforward and common problem, namely, serving web applications fast. So let's dig into the specifics of what engineers who deploy this kind of software stack actually have to deal with. And to do that, I want to look at what Facebook does. Facebook actually wrote a paper about their version of this back-end stack in 2013 and published a lot of the details of what they're doing. So at the bottom of this slide, in yellow, is the application code. That's the stuff that actually displays the website.
And then in gray are the various boxes that make up the distributed storage back-end. So in particular, you will see that there's a lot of boxes here. And this is, in fact, even a simplified view. I actually simplified this down so that it fits on a slide. You have primary and backup databases. You have multiple different memcached pools. You have custom libraries, like mcrouter and mcsqueal. And you have lots and lots of arrows going in between all these system components. In fact, Facebook had to build this complex interaction pattern in order to tolerate the high load that their service experiences. But also, and this is a little bit more hidden in the paper, but really crucially important, they had to do this in order to achieve even eventual consistency. Eventual consistency, of course, is sort of the lowest acceptable consistency level. It's often good enough for web applications. But without some of this complexity, Facebook's systems would not even have achieved eventual consistency. They would have ended up with permanently stale information being served to clients. Now, you can imagine, if you look at this picture and you see the many arrows coming in and out of the application code, that this actually gets rather annoying for application developers who have to think about all these arrows that interact with application code. And indeed, this huge complexity leads to bugs. It leads to wasted cycles because there are lots of RPCs going back and forth. And it wastes a ton of programmer time because somebody has to implement all of this stuff. Moreover, if you're not Facebook and you don't have mcsqueal and mcrouter and libraries that can hide away some of this complexity, then you're really out of luck. If you're a small or medium-sized enterprise, not a tech company, but you have a website that needs to handle a lot of load, then you're pretty much on your own.
So what I'm gonna show to you in this talk is how we can take this complexity and use a new database, a system called Noria, based around the abstraction of incrementally updated materialized views maintained through structured data flow, to take away the complexity from the application developer, who simply uses SQL queries like they're used to but gets the performance of a caching key-value store backend and a database that can tolerate the load of a popular website. So the overall goal of the Noria system is to take complexity and reduce it, but also at the same time deliver equal or better performance to this complex forest of interacting boxes that people developed over time. And there are a couple of assumptions that Noria makes in achieving this. One is that Noria targets web application workloads. So this is a system that is really designed to serve the kind of database workloads that web applications generate. So workloads that are extremely read-heavy, where the queries are mostly point queries that are satisfied from indexes, no expensive OLAP, no big scans typically. And importantly, eventual consistency is acceptable for these applications. Now, we know it's gonna be acceptable because when they were using memcached before, that's all they got. So it's clearly the case that web applications can work with eventual consistency. Higher levels of consistency, of course, would be desirable, but they always come with a trade-off in performance. And in Noria, we shoot for maximum performance, but reduced developer complexity. To achieve this, Noria, unlike some other database systems, or I would argue most other database systems, is structured with materialized views as first-class citizens. So in Noria, whenever we need to answer a query, we generate a materialized view. And this design decision is motivated by the read-heaviness and by the fact that Noria must scale. So the whole point of this is, of course, that we are trying to scale the web application.
So Noria must be designed fundamentally in such a way that this application will scale out as the web application grows. And materialized views help with that on the read side. Dataflow-based updating of materialized views, as we'll see, helps with this on the write processing side. So let's dive in and start from first principles of how Noria works. And the first starting point is that Noria, to the application code, that's the yellow box at the bottom of the screen, looks just like an ordinary SQL database in terms of the interface. Now, I'm careful to say interface here, because as we said before, it's eventually consistent. So even though the interface is SQL, things like transactions that you would expect to see in something like MySQL or Oracle are not supported, but SQL queries that do typical joins and aggregations and selections and filters and all these things are supported. And the storage layout of the database follows a relational schema just like in a normal SQL-based relational database. Now, consider a classic database receiving a query like this query in the example, which joins stories with their vote count. So we have two tables here, a stories table, think of this as an application similar to Hacker News where people can post stories, and a votes table where people's upvotes get entered as they vote for popular stories. And the point of this example query on the slide is to count the votes for each story. So this is a group-by count, and then join it with the story, and then filter, perhaps, according to a block list that the user has specified. So in this example, for example, the outcome would be that the yellow story has two votes, the green story has one vote, and the blue story has two votes. Now, again, in a classic database, the way this executes is the application code dispatches a query and, on query receipt, the database computes the query results over the contents of the database.
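To make the example concrete, here is a minimal sketch of this compute-on-read execution using SQLite from Python. The schema, table contents, and column names are my own illustration of the stories-and-votes example, not Noria's actual schema.

```python
import sqlite3

# In-memory database with the talk's two example tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stories (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE votes (story_id INTEGER, user_id INTEGER);
""")
conn.executemany("INSERT INTO stories VALUES (?, ?)",
                 [(1, "yellow"), (2, "green"), (3, "blue")])
conn.executemany("INSERT INTO votes VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 10), (3, 10), (3, 12)])

# The classic execution: recompute the group-by count and join on
# every read, over the full contents of the database.
rows = conn.execute("""
    SELECT s.title, COUNT(v.story_id) AS vote_count
    FROM stories s JOIN votes v ON s.id = v.story_id
    GROUP BY s.id
    ORDER BY s.id
""").fetchall()
print(rows)  # [('yellow', 2), ('green', 1), ('blue', 2)]
```

Every read re-runs the aggregation and join from scratch, which is exactly the inefficiency the talk turns to next.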
Of course, this is terribly inefficient in a web application where you might get the same query over and over and over again, and just recomputing the vote counts for stories every time the application asks is clearly ridiculously inefficient and will be very slow. But we can observe that web application workloads, as mentioned before, are mostly reads. So a typical web application will have 90% reads and only 10% writes. So that suggests a way of making the system much more efficient by doing what we as computer scientists are very good at doing, which is to focus on the common case. So instead of doing slow reads and all this repeated work, why don't we flip this around and, instead of computing on reads, compute on writes? That's the essential idea of a materialized view in a database. You compute the results of a query, you store them in a materialized view, and then when you want to read the results of this query, you can read them efficiently. When a write comes in that updates some data that affects the contents of the materialized view, like in this example where the application is inserting a vote for the blue story, of course the materialized view needs to be updated. So the trade-off that we're making here is that we make reads fast, but in exchange we have to compute on writes rather than just inserting into a table. Now the materialized view has to be updated. So the count operator has to consider this new blue vote, update some internal state to say that the blue story now has three votes, and then compute the join for that particular blue story, run through the filter, and update the contents of the materialized view. This makes sense because writes are comparatively rare in these web applications. So putting a little bit more work on the write side in exchange for very fast reads is a good trade-off in terms of the overall load that our system can tolerate.
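The flip from compute-on-read to compute-on-write can be sketched in a few lines. This is a toy illustration of the idea, not Noria's API; the class and method names are my own.

```python
# Compute-on-write: the vote count is maintained incrementally as each
# vote arrives, so a read is just a dictionary lookup.
class VoteCountView:
    def __init__(self):
        self.counts = {}          # story id -> vote count (the "view")

    def on_vote(self, story_id):
        # Write path: do a little work now to keep the view current.
        self.counts[story_id] = self.counts.get(story_id, 0) + 1

    def read(self, story_id):
        # Read path: O(1), no recomputation over the base tables.
        return self.counts.get(story_id, 0)

view = VoteCountView()
for sid in ["yellow", "yellow", "green", "blue", "blue", "blue"]:
    view.on_vote(sid)
print(view.read("blue"))  # 3
```

The 90/10 read/write split is what makes this trade pay off: the extra work happens on the rare writes, and the common-case reads become memory lookups.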
And again, I will remind you that any read that the application code makes from this materialized view is going to be incredibly fast. If you think about it, this materialized view is stored in memory, in DRAM, and that's the fastest way you can possibly get your query result. There's no way a computer can supply an answer to a query faster than having it already stored in memory or in cache and sending it back to the application that's asking. So this is the overall plan. We're going to compute on writes and make the reads fast. But there's still a problem with this, which you might have observed. And that problem is that materialized views can get rather large if they contain all the results for all possible queries that the application could send. Moreover, some of the internal state in this computation, namely the counts associated with the count operator in this materialized view definition, is not actually ever exposed to the application. And in particular, we're computing a vote count for the green story here that doesn't actually end up in the materialized view, because the filter throws away this row. So this extra state, state for stories that the application never asks about and counts that will not actually make it to the materialized view, is using memory, using space that is completely wasted. So this approach of building fully materialized views for everything that the application might ask for is not only wasting memory, it also would not scale, because it would use an enormous amount of memory. This is of course also not how caching solutions like memcached or Redis work, right? These things will only keep a subset of the contents of the database in the cache. So it depends also, if your filter is filtering based on the output of the materialized view, you don't know what's going to get filtered until you materialize the view. That is correct.
So you're saying, Andy, that the count operator couldn't possibly know that the green story is going to get filtered away, right? Because- In some cases, yes. If it's like a HAVING clause or a WHERE. Yeah, absolutely. And we will actually solve that problem in Noria's design. So may I ask a question about the drawing? So in the drawing, okay, actually, if you do all kinds of these operators, where if you want to do them in parallel, then I think there'll be lots of consistency issues, right? Yes. So our join, we'll get into the weeds with this later, but our join is an incremental join, and to parallelize the join operator, we have to shard it by key. So that all the joins for the same keys are always processed by the same thread, but joins for other keys can be processed in parallel. So how can you guarantee the consistency among different threads, right? So let's say, even if you do the key partition, you will have consistency issues among different threads, right? So let me convince you later in the talk that we don't actually have consistency issues between different threads. Because the different threads in the incremental join execution here, and remember, it's all happening on writes, right? This only happens when the contents of the tables get updated. The different threads will be processing entirely different keys on both sides of the join. So your keys A to L and, like, M to Z will be processed by completely different threads. But, you know, A would never join with Z, right? A would only join with A. So we can make sure that updates for the same keys always get processed by the same thread and therefore always get processed sequentially. Okay, does that answer your question? Yeah, yeah. Okay, great. All right, so our plan is going to be to, instead of just willy-nilly building huge materialized views that contain way more data than the application actually needs, implement caching semantics where we compute only what is actually needed.
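The key-sharding argument in that answer can be illustrated with a small sketch: route every update by a stable hash of its join key, so all updates for one key, from either side of the join, land on the same worker and are processed sequentially, while different keys can proceed in parallel. This is my own toy illustration, not Noria's actual routing code.

```python
import zlib
from collections import defaultdict

NUM_WORKERS = 4

def worker_for(key):
    # Stable hash: the same join key always maps to the same worker.
    return zlib.crc32(key.encode()) % NUM_WORKERS

# Route updates from both sides of the join into per-worker queues.
queues = defaultdict(list)
for side, key in [("stories", "A"), ("votes", "A"),
                  ("stories", "Z"), ("votes", "Z")]:
    queues[worker_for(key)].append((side, key))

# Each worker's queue contains every update for its keys, from both
# join sides, in arrival order -- so per-key processing is sequential
# even though distinct keys may run on different workers in parallel.
for worker, updates in queues.items():
    for _, key in updates:
        assert worker_for(key) == worker
```

Since "A" only ever joins with "A", a worker never needs state held by another worker, which is why the sharded execution stays consistent without cross-thread coordination.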
And at any point in time, any materialized view or any internal state of our materialized view update system only refers to rows that the application is actually currently interested in, which in this example would be the yellow story and its vote count. So the key abstraction that we introduce in order to achieve this is the idea of partial state. And partial state you can think of as sort of the idea of a partially materialized view, which is a long-standing, well-known idea in database systems. But in the specific case of this data-flow-driven updating of materialized views, this incremental streaming data flow computation, that was actually a new idea. Prior to Noria, none of the existing data flow systems could handle partially materialized state. And therefore these systems would all build huge materialized views and huge internal state. But Noria can have these conceptually absent entries in that state, represented in my visualization by the bottom symbol, this inverted T, that are simply not present at the moment. If the application were to later need them, the system will compute them, but right now they are not being maintained, and therefore the system is not using any space to store them, reducing its memory usage, and it uses no compute cycles to update these entries. Therefore it becomes more compute-efficient on write processing, which is really important because writes are the thing that Noria already makes somewhat more expensive than a normal database would. So this new abstraction of partial state is really at the core of the system design, and any state in Noria, with the exception of the tables, the actual on-disk tables, can be partial. So any state apart from the on-disk tables can have absent entries that are not currently maintained but can be fetched lazily on request. Now you might wonder, is this something that we could just use an existing system for?
And there are plenty of streaming data flow systems, and there are plenty of materialized view maintenance systems, just to name a few. There's old Aurora and Borealis, sort of the early-2000s stream processing systems. There's Apache Flink and Spark Streaming, which are widely used in industry, Naiad, MillWheel, DBToaster, you name it. There exist plenty of systems that do some sort of incremental materialized view update. But as I mentioned on the previous slide, these systems fundamentally don't support partial state. You cannot make a partially materialized view in any of these systems. In addition, many of them only support windowed joins and windowed aggregates and sort of don't have exactly the same semantics as a relational database would, which is something that Noria also changes. But the key sort of fundamental limitation that they all have is that their processing models simply do not support partial state. A second issue that these systems have is that when the queries change, so when you add a new query, which in Noria involves adding a new materialized view, these systems actually require restarting the system. So there's downtime involved in adding a new materialized view for the system to maintain. And that's a complete non-starter when you want to run a website that always needs to be on. You cannot shut down your system, add a new materialized view, and then bring it up again. So our goals in Noria are to avoid these fundamental limitations, and we achieve this with the abstraction of partially stateful data flow. The way we think about this partially stateful data flow is that it's a computation model that can reason about absent state. And it's supposed to be somewhat principled in the sense that we can define a variety of different queries and state representations, and yet all of them support this notion of partial materialization. And we hope to extend good ideas from existing data flow systems.
Data flow systems have been tremendously successful in scalable big data processing. If you think about Spark or Dryad, Naiad, these kinds of MapReduce-inspired data flow systems, and of course also the stream processing systems I mentioned on the previous slide, these have really great scalability, and there are some great ideas there that we are trying to build upon, while adding caching semantics and adding the notion of partially stateful data flow that is required to make websites work efficiently. The outcome is the Noria system, which gives you a high-performance web application backend that you can use as a drop-in replacement for a MySQL database, with some asterisks that I will get to. And Noria computes only the necessary state to make your application run efficiently with the current load pattern that it generates right now, with the current queries issued right now. And the data flow adapts live, without any restart, as the application's queries or the keys that it accesses or the data that it accesses change over time. So I will spend a little bit of time talking about four key design elements that make Noria work. The first one is up queries, which are the way Noria recomputes missing entries on request, the sort of lazy computation part of Noria. Then I'll talk about how Noria can change a live, executing data flow and add new materialized views with very low overhead and without any pause to system execution. And then I'll talk about correctness. I'll talk about how we design partial state in Noria to achieve actually correct outcomes. In particular, how we achieve eventual consistency. We can guarantee that the system will be eventually consistent, which turns out, as Facebook found in their work, not to be as trivial as you might think. And finally, we obviously had to add some concurrency to get good performance and tolerate high loads. So let's start with up queries.
So if you buy into the Noria model that I've described to you so far, then you will know that we have these absent entries in materialized views and, of course, also in internal state of the data flow graph. But you might be wondering what happens if the application actually requests this data. So in particular, in this example, what would happen if the application actually wants to read the details and the vote count of the blue story, but the blue story's entry in the materialized view is not currently maintained? In that case, Noria issues what we call an up query in order to fetch the missing entry and compute it on demand. In particular, clearly the Noria system needs to give some answer to the application. And the way to do that is to compute it lazily. So Noria defers computing the contents of the blue entry in this materialized view until the application actually asks for it, until it is actually needed. When that happens, Noria sends an up query through the data flow. It sends a message corresponding to this green, excuse me, to this pink dotted arrow to the closest stateful operators, the closest materialized operators in its data flow graph. Now, in this example, the closest materialized operators are the ancestors of the join, which are the stories table on one side and the count operator over the votes table on the other side. In particular, the filter, being a stateless operator, does not have a materialization. So the up query passes it by and goes to the closest materialized operators. Now, it turns out in our example here, the stories table and the count operator actually have the information required for the blue story. So these two materialized operators can satisfy the up query and send a response back down the data flow in the forward direction to fill in the missing state in the materialized view. And you'll see this in the animation shortly. I wanna point out to you, though, that this response is not a direct answer.
It's not the actual entry that will go into the materialized view; rather, the stories table and the count operator each send their current state for the blue story's key, and then the data flow actually computes the entry for the materialized view as it processes these entries in the forward direction, just like it would when it processes normal updates that flow through the data flow graph. There are some differences in how up query responses are handled, and we'll get into those when we talk about correctness, but in principle, this is the same sort of feed-forward computation that you do with normal updates. Now, when the response arrives at the materialized view at the bottom of the data flow graph, it fills in the missing entry, that absent bottom symbol. And now Noria can respond to the application with a response that says the blue story has three votes. Now, before I talk about query change and adding new materialized views, are there any questions about this sort of basic mechanism of up queries and lazy recomputation of missing entries? Yeah, I got a question for you. Go ahead. So when the application performs an update or an insert, does it go to the underlying tables, or does it go to the materialized views and propagate down? Right, so the updates and inserts are always done to the underlying tables. So the tables that are at the top of my slide, and then the materialized views get updated in response to the changes that happen to the tables. So the materialized views are always automatically updated as derived computation from the contents of the underlying tables. It's not possible to update a materialized view directly. And for good reason, because the only way we can ensure correctness is if the materialized view at all points in time represents a computation over the content of the base tables.
And again, because of eventual consistency, there can be some lag in what base table entries the materialized view represents. But the data that is in the sort of more refined entries that sit in the materialized view has to be present in the base tables and the updates that are currently flowing through the graph. If you updated the materialized view directly, that would not be the case. And then we could get some really bizarre correctness issues. So all updates always go to the underlying on-disk tables and then propagate through the data flow towards the materialized views. So do you handle universal quantification, like NOT EXISTS, or left outer join or full outer join? No, well, we do left join, but we don't do EXISTS. And I'd be happy to chat offline about the exact reasons why EXISTS is challenging, but it turns out that supporting EXISTS in incremental data flow processing can be very difficult, particularly with the up query mechanism. Fortunately, in the web applications that we looked at, there were no queries that required an EXISTS. And we believe that even if such queries did exist, you could rewrite them to be satisfiable from materialized views. But we do support left joins. So what if I do an update or delete on the base table? How do you handle it? And let's say we have min and max, how do you handle this? Okay, so there are two parts to your question. The first one is about deletes and updates. The updates that flow through the data flow graph, like the blue box that was flowing through in the animation, are actually signed. They can be positive or negative, where a negative revokes prior positives. So you can model a delete as a negative update that flows through the data flow and removes any derived state related to a prior positive. So in the count operator, for example, it would decrement the count; in a materialized view, it would remove the entry, et cetera. So that's how we handle deletes.
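The signed-update idea in that answer can be sketched for the count operator: an insert flows through as a positive record, a delete as a negative record that revokes it. This is a toy illustration with made-up names, not Noria's actual operator code.

```python
# Signed updates through an incremental count operator: sign = +1 for
# an insert, -1 for a delete (a revocation of an earlier positive).
class CountOperator:
    def __init__(self):
        self.counts = {}  # key -> current count (derived state)

    def process(self, key, sign):
        self.counts[key] = self.counts.get(key, 0) + sign
        if self.counts[key] == 0:
            # No derived state left for this key; drop the entry.
            del self.counts[key]

op = CountOperator()
op.process("blue", +1)
op.process("blue", +1)
op.process("blue", -1)  # a vote is deleted: the count decrements
print(op.counts)        # {'blue': 1}
```

An update to a row is then just a negative for the old value followed by a positive for the new one, exactly as the talk goes on to say.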
Updates are simply a negative followed by a positive. So you revoke first and then put the new data in. You can optimize this into a replace if the view has a primary key, but for views that don't have primary keys, you do a negative followed by a positive. And then you asked about max and min. So this is actually somewhat tricky to handle efficiently, and we apply a heuristic solution, which is to say that for max and min, we don't just keep the max or the min, but we actually keep the top K, so that if the max or min gets revoked, Noria can substitute it with the next best value, so to speak. So the max-but-one or the min-but-one, without having to recompute the entire computation over the base table contents. However, if you get extremely unlucky and all your top-K max values get revoked one after another, then Noria would have to go and execute the query over the base tables again from scratch. In practice you can size K appropriately so that this isn't usually a problem, but there are some situations where up queries might have to do a lot of work if you got a very unfortunate sequence of revokes when you do max or min. Thank you. Okay, and there was a question in chat: how would Noria handle the thundering herd problem, an often-cited problem in caching solutions? Yeah, so this is an excellent question, and part of the motivation for Noria is actually handling thundering herds, because this is the one thing that the Facebook folks had to put so much work into getting right in their system. So in Noria, note that if you get a request for some data that's not currently in the cache, so that was the example we saw here, right, in the previous slide, we do not actually send multiple up queries. We send a single up query, because the materialized view knows that it has already up queried this blue key.
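That single-in-flight-up-query idea can be sketched as follows: a miss marks the key as having an up query outstanding, and further reads for that key are parked instead of issuing more up queries. This is a single-threaded toy illustration with made-up names, not Noria's actual implementation.

```python
# Thundering-herd avoidance: at most one up query per missing key.
class PartialView:
    def __init__(self, upquery):
        self.entries = {}       # key -> value; absent = not maintained
        self.pending = {}       # key -> readers parked on an up query
        self.upquery = upquery  # sends one recomputation request upstream
        self.sent = 0

    def read(self, key, reader):
        if key in self.entries:
            return self.entries[key]
        if key not in self.pending:
            # First miss for this key: issue exactly one up query.
            self.pending[key] = []
            self.sent += 1
            self.upquery(key)
        self.pending[key].append(reader)  # park this reader
        return None

    def on_response(self, key, value):
        # Up query response arrives: fill in the entry, wake readers.
        self.entries[key] = value
        for reader in self.pending.pop(key, []):
            reader(value)

woken = []
view = PartialView(upquery=lambda k: None)
for _ in range(3):                 # three concurrent misses...
    view.read("blue", woken.append)
view.on_response("blue", 3)
print(view.sent, woken)            # 1 [3, 3, 3]
```

However many readers pile onto the missing blue key, `sent` stays at one, which is the property the answer describes.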
So that solves the thundering herd problem, because we are not going to issue 10,000 queries to the database because the application suddenly gets very excited about reading the blue key. Rather, we send a single up query to the database and then wait for the response to come back, holding any requests for the blue key until the response comes back, which typically takes a few milliseconds, and then we respond to all of them. So no thundering herd builds up. I hope that answers the question. All right, so let's talk about adding new materialized views to the system. I said materialized views are, you know, the sort of first-class primitive of Noria. So when the application changes, for example because developers are rolling out a new release that might add new queries, then Noria better find a way of adding materialized views that can satisfy these new queries. And in particular, the way we do this is we receive the new query, and then Noria analyzes it to identify any overlap with the existing data flow graph that maintains existing materialized views. So in this example of a new query that computes the karma score for users, which is the number of upvotes that their articles or their stories have received, there's actually a shared prefix with the existing query, which is the count operator that counts the number of votes by story and the join with the story details. So Noria can reuse two operators without having to install new operators in the data flow, including the stateful count operator. After identifying shared subgraphs, Noria then adds any missing operators, any new materialized views, and operator state. But importantly, to make this fast, Noria starts with the operator state completely empty to begin with. So all entries in the operator state are conceptually bottom; all of them are absent.
This means that Noria can deploy the infrastructure for a new materialized view within milliseconds, because it only needs to modify its data flow graph; it does not need to actually process any data. That allows Noria to avoid downtime, because it can do this in the blink of an eye. And then, as the application queries the materialized views, Noria uses up queries to fill in the missing state lazily over time, again processing only the entries that are actually needed by the application, which might be a small fraction of the conceptual contents of these materialized views. So in this example, the application has read Alice's story count but did not read the vote counts for other users, who may not have posted in many years or might not be popular at the moment. So this way Noria can support live query change. But one big challenge with all of this, and we've touched on it at various points in the talk already, is: how do we ensure correctness with this partial state? There are all these pieces of state that can have absent entries, and there are all these updates flying around in the data flow graph; of course there's concurrency, there's multithreading. How do we make sure we get this right? To figure that out, let's first say what right means here. If we think about the model that the system is based on, what we need for correct state is this: when one of these lazy up queries goes upstream to fetch some state that previously was not processed, where Noria deferred the processing, then the result of that up query had better be the same as if Noria had continuously maintained an entry in each materialized view for this data.
So in particular, in this example, if Noria up queries the blue story and finds that its current vote count is four, then that had better be the exact same result that Noria would have gotten if it had continuously maintained an entry in each materialized view for the blue story, incrementing the vote count every time a vote for that story came in. And we formulate a bunch of invariants, the details of which you can find in our OSDI paper, that ensure that this representation of state is always correct in Noria. At an intuitive level, what these invariants say is that it cannot be the case that a materialization further down in the data flow, a piece of state closer to the materialized views that applications read from, contains a key that is missing from a materialization upstream of that state. Because if that key were missing upstream, it turns out that in many cases the updates would not reach the derived piece of state, and therefore that state would not get updated, and therefore it would not be correct. This is more formally stated as four different invariants in our paper, but I want to dive into two particular examples that illustrate the challenges and also how Noria resolves them to make sure that the results are indeed eventually consistent. The first one relates to concurrency. You have concurrent up queries and update processing in the system, which of course you will, because the up queries get triggered by application read requests and reach up the data flow graph, while the update processing, the processing of writes in the forward, downwards direction in my slide diagrams, gets triggered by application writes. In a multi-threaded system these things can happen at the same time. So we must ensure that in this situation we maintain correct state in our materialized views.
Again, the goal is that an up query should restore state as if it had been present all along. But consider this situation, where the count operator maintains state for the yellow story saying it currently has two votes. That means that at some point in the past the count operator sent an update downstream that said the yellow story has one vote, and then it sent an update downstream that said the yellow story has two votes. If an up query now arrives from downstream asking for the vote count of the yellow story, the response is that the yellow story has two votes, which matches the computed result that was sent downstream in the sequence of updates that occurred prior to this up query. Now, if concurrently a new vote for the yellow story arrives, we have got to be careful to get this right. In particular, we need to make sure that we send downstream an up query response, that's the yellow box with the pink border, saying that the yellow story has two votes; that is, we source the up query response from the state before processing the new third vote for the yellow story. Afterwards we can send an update saying the yellow story now has three votes. But the up query response is conceptually an aggregation of the two shaded updates in the pink box on the slide, the one that said the yellow story has one vote and the one that said the yellow story now has two votes. That is what the up query response includes and represents; importantly, it does not include the third vote that just arrived. And if we were not careful about how we structure the processing and operators in the data flow itself, this could easily get violated. The solution, on a conceptual level, is this: when we have an up query response and an ordinary update that was processed at the source of the up query response after the response was generated, we must maintain the order between them in the data flow as processing goes forward.
In particular, we must maintain the order in which the up query response, the pink box in this example, always gets processed before the update that says that the yellow story now has three votes. And Noria, being a scalable system, has no global coordination: it does not have global locking, it does not have timestamps, it does not have watermarks. It ensures this purely by its operators being written to maintain a set of invariants that result in a correct total order of processing. So in particular, this order is correct and produces a correct answer in the materialized view. But a reordering like this one, which could easily happen if, for example, you had a fork and then a join in the data flow, could end up with the up query response getting reordered after the update saying that the yellow story now has three votes, which results in state saying the yellow story has two votes. This is most definitely not correct, because that third vote will never arrive again after the reordering, and the result would not even be eventually consistent: even if you let the system run to an infinite time horizon, you would never get the missing update that got reordered. So what we need to ensure is that the up query responses always remain in the same total order with the surrounding data flow updates. And again, we use the invariants in our operator implementations to ensure that this is always the case. This is easy for operators that just process updates one after another in a straight line, like filters, but it can get quite tricky for joins, for example, where you need to make sure that when you've received an update on one side, you process it to completion, including any nested up queries triggered on the other side, before you process the next update that the join receives.
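A tiny simulation, with invented message shapes, shows why the ordering matters: an up query response carries absolute state that summarizes everything its source processed before it was generated, so applying it after a later delta silently loses that delta forever.

```python
def apply(view, msg):
    """Apply one data flow message to a downstream view (toy model)."""
    kind, key, value = msg
    if kind == "response":
        view[key] = value                       # up query response: absolute state
    else:
        view[key] = view.get(key, 0) + value    # ordinary delta: incremental

# Source history: votes 1 and 2 were summarized into the up query
# response, then a third vote arrived and became a delta of +1.
in_order  = [("response", "yellow", 2), ("delta", "yellow", 1)]
reordered = [("delta", "yellow", 1), ("response", "yellow", 2)]

good, bad = {}, {}
for msg in in_order:
    apply(good, msg)
for msg in reordered:
    apply(bad, msg)

print(good["yellow"])   # 3: correct, the third vote survives
print(bad["yellow"])    # 2: wrong forever, the third vote is lost
```

Since no message carrying the third vote will ever be sent again, the reordered run never converges, which is exactly the eventual-consistency violation described above.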
There's also some merging of up query responses, and there can be evictions, because this is a caching system, and those can be cascading evictions; these lead to some more complexity that we discuss in the paper. Now let me home in on the second challenge, which is processing updates that touch absent state. This could happen in a situation where, for example, the count operator receives a vote for the green story but the green story's state is currently absent. Now the count operator cannot emit a useful update for the downstream operators, right? Because it simply does not know what the vote count for the green story currently is. When we designed the system we spent a lot of time trying to wrap our heads around what the right thing to do is in this situation. Eventually what we found is that the right thing to do is to just throw away the green vote and ignore it completely: send no update downstream and do nothing for it. We rely on the lazy execution triggered by a future up query to recompute the vote count, including this vote that we just discarded, when it is actually needed by a downstream materialized view or by an application querying a downstream materialized view. All right, so that was the end of the correctness discussion. Any questions before we move on to the execution model, the concurrent execution model of Noria? I think you're good, keep going. I'd also like to ask a question. Maybe I missed this earlier, but how do you determine... so initially you said when you have a new materialized view it's all absent. Basically, is that always how you start? So initially, if you start up the system, you have no state, and only when a view is actually queried do you start filling it in. That is correct. Okay, thanks. Now, you might be thinking about the implications of this.
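That discard-and-recompute behavior for the count operator can be sketched as follows; the names are invented, and the real operator emits updates into a data flow rather than returning them:

```python
class PartialCount:
    """Toy partially stateful COUNT operator: updates for absent entries
    are dropped, and a later up query recomputes from the base table."""

    def __init__(self, base_votes):
        self.base = base_votes   # base table: list of story ids, one per vote
        self.counts = {}         # partial state: story -> count (may be absent)

    def on_vote(self, story):
        self.base.append(story)  # the base table always records the vote
        if story in self.counts:
            self.counts[story] += 1
            return ("delta", story, self.counts[story])   # forward downstream
        return None              # state absent: discard, emit nothing

    def upquery(self, story):
        # Lazy recomputation over the base table; this picks up any votes
        # that were discarded while the entry was absent.
        self.counts[story] = sum(1 for s in self.base if s == story)
        return self.counts[story]
```

Discarding is safe precisely because the base table retains the vote: the eventual up query scans the base table and therefore counts the dropped vote anyway.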
This does have some implications: you need to warm up the system, and you need to make sure that when you actually expose this to applications, they can deal with the miss rate they're experiencing; you can use standard approaches like warming up clusters. Noria also has support for fully initializing some materialized views, which is a big batch computation in that case, but that is something you explicitly specify on view creation, that you actually want to keep all the contents of the materialized view. All right, let's go on. Oh, sorry, go ahead. So, about snapshot isolation: let's say in your stories view you have the yellow and the blue story. Would the user always see a consistent snapshot, or can it just be... yeah. Noria does not guarantee a consistent snapshot. It guarantees eventual consistency, meaning that if you let the system quiesce and process all the updates that are in the data flow right now to completion, and no new updates arrive, then you get a consistent answer, but there's no guarantee that you read a consistent snapshot. In practice, this is actually not as big a problem as you would think, because you can always define a materialized view for what you want to read, and materialized views de facto oftentimes give you consistent reads. There are some exceptional cases when you have very complicated view definitions, but for most web applications that's actually sufficient. I'll talk at the end about a system that Facebook recently published, called FlightTracker, where they describe how they achieve read-your-writes consistency for their pipelines. That idea would actually apply to Noria, and I'm kind of bummed that we didn't come up with it when we built the system, but right now the system, as I described it to you, is just eventually consistent; no snapshot isolation. Okay, cool, thank you.
All right, so the concurrent execution model. It's a pretty simple story: we have multi-threaded processing at materialized views and at data flow operators. In particular, for materialized views we use a lock-free data structure, a lock-free hash table that we heavily optimize towards reads. It can support reads completely without any atomic instructions or taking any locks, but write processing is slower; it needs atomic pointer swaps. It's basically a double-buffered hash table that gets swapped every time a write gets applied. This allows us to scale completely linearly for reads, which are the common case in our applications, and when writes happen, those are a little more expensive than they otherwise would be, but similar in cost to a hash table with fine-grained locking. The other dimension of concurrency is concurrency in the data flow itself, and as mentioned before, we use sharding there. We shard the data flow by key and execute different shards in different threads, but they touch disjoint data, so there are no consistency issues when we multithread, beyond the eventual consistency that the system already signs up for. So, to summarize the implementation-detail part of the talk: Noria is at its core based on this idea of up queries, of lazy execution in partially stateful data flow. Up queries also enable live data flow change, which allows us to add new materialized views to the running system; all of the correctness is formalized through the invariants of partially stateful data flow; and we support concurrent processing for high performance. In terms of the actual implementation, we have a MySQL-compatible SQL interface. We in fact have a proxy, a MySQL adapter, that makes Noria appear to legacy web applications just like a MySQL database.
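Very loosely, the double-buffered read view mentioned above works like the following single-threaded sketch. The real structure is lock-free and handles concurrent readers with atomic pointer swaps and epoch tracking, none of which is modeled here; this only illustrates the swap-and-replay discipline.

```python
class DoubleBufferedMap:
    """Single-threaded sketch of a double-buffered read view: readers
    only ever touch one side, the writer stages changes on the other
    side and swaps the two when it publishes."""

    def __init__(self):
        self.read_map = {}    # readers only ever see this side
        self.write_map = {}   # the writer stages changes here
        self.log = []         # pending writes to replay after a swap

    def get(self, key):
        return self.read_map.get(key)   # read path: no locks, no staging

    def put(self, key, value):
        self.write_map[key] = value     # invisible to readers until publish
        self.log.append((key, value))

    def publish(self):
        # Swap so readers see the new state, then replay the logged writes
        # so the old map (now the write side) catches up.
        self.read_map, self.write_map = self.write_map, self.read_map
        for key, value in self.log:
            self.write_map[key] = value
        self.log.clear()
```

Note how a write is invisible to readers until `publish` swaps the maps, which is what lets the read path stay entirely free of synchronization in the real system.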
And this MySQL adapter gives us queries, which Noria turns into view definitions that it then transforms into the data flow graph we've seen already. This is sharded for parallel processing, and the whole system is implemented in 60,000 lines of Rust. We rely on RocksDB for storing our persistent on-disk tables, and we use ZooKeeper to elect a leader and to monitor the distributed system for failures. So, for the Noria evaluation, I'm going to talk about three questions today. The first is: can Noria actually improve the performance of a real web application? Then we compare Noria to a bunch of state-of-the-art setups that people use to run web applications, and finally we check whether Noria can indeed support query change with no perceptible downtime. In the paper, we also have a case study of how complex the applications are that Noria developers write; the answer is not as complex as what you would have to write if you did the caching manually, somewhat unsurprisingly. We evaluate scalability across many machines, and we compare with differential dataflow, another system that is similar to ours but does not support partial materialization or any notion of partial state. Our experimental setup is a single server with 16 virtual CPUs on an EC2 instance running Noria, and many open-loop clients that send requests to the server. We measure latency and throughput. The application that we look into is a real web application called Lobsters. It's sort of like Hacker News, a similar forum-type application. It's a Ruby on Rails application with a MySQL backend. And the Lobsters developers actually hand-optimized the schema and the application queries in order to maintain cached upvote counts on stories and comments, because those were too expensive to compute on the fly when the application queries the database. So they had to do significant manual work to actually get the performance that they wanted.
Noria turns this application's queries into a data flow with 235 operators representing 35 SQL queries, and it requires none of the manual work that I just talked about. Our load generator emulates the Lobsters production load, so it emulates the access distributions that you see in the real Lobsters application. On this graph, on the x-axis, I'm showing you the offered load: the number of page views per second that our load generator sends to the application front end. Each page view corresponds to about 10 SQL queries in the median; it differs a little by endpoint, I think the range is between seven and 18. And each SQL query gets handled by the baseline system or by Noria. On the y-axis, we show the 95th-percentile latency. An ideal line here would be flat all the way across and very low, right? The further to the right you are on the x-axis, the higher the load the system can tolerate, and the lower you are on the y-axis, the lower the latency. So the bottom right is the best place for a system to be. Here's what we found when we ran this with MySQL, which is the database that the real Lobsters production deployment uses. At about 1,000 page views per second, MySQL saturates the 16 CPUs on our machine, because it is simply overwhelmed by the compute work required to satisfy all these queries. Now, on the exact same hardware, using Noria with its materialized views, we can support 5,000 page views per second, which is a 5x improvement over the load tolerated on the same hardware resources by the MySQL database. This is not completely free, because of course the materialized views require space, in particular memory. The materialized views and the data flow state that Noria maintains in this experiment amount to a 3x space overhead over the size of the MySQL tables on disk.
Now, we believe that's a reasonable trade-off. It's in the same ballpark as the space overhead of manual caching solutions. But of course, if you keep in-memory caches, there is going to be a price, and in this case the price is a 3x space overhead for a 5x improvement in load tolerated. Now, for the second experiment... oh yeah, go ahead. What happens at 5,000? Why is this skyrocketing? Yeah, excellent question. The reason we bottleneck there, and we're not using all the CPUs on the machine at that point, is the thread processing a particular part of the data flow graph that's involved in maintaining a particularly expensive materialized view definition. That single thread, which processes a particular slice of the keys in this materialized view, is fully loading a single CPU core at 100%. So we are bottlenecked by the fact that we cannot further parallelize the processing within a single data flow operator, because we are partitioning by key. You could do more sophisticated partitioning than what we are doing; we're just doing equal range partitioning. If you shrank the ranges and made sure that more resources are available for popular keys, you'd distribute the load a little better, and I suspect we could move the green line further to the right. But we are definitely not using the full CPU of the server at this point; we are bottlenecked by the data flow at 5,000. So are you using a single thread to ingest the data? I mean, to handle the write transactions? No, no, no. Every on-disk table is also sharded in the same way that the data flow is sharded, and there's a single thread for each shard. So in this experiment there were four shards, and therefore four threads processing updates for each table. So basically no contention, right?
Well, I mean, if all the updates were for the same key, then you would have contention, because they would all be processed by a single thread. But in practice there's no contention there; the bottleneck is actually further down in the data flow. Okay, so it's still MySQL-style transaction processing, right? Right, it's all conventional; we are not innovating in the on-disk table processing. We do group commit and all the kinds of stuff that you do in a database, but nothing different from MySQL in principle. Okay, but do you have threads dedicated to processing a transaction? No, for processing the data flow, right? That's correct, yeah. Okay, okay, cool. All right, so for the next experiment, I picked a single query from the Lobsters workload to dig into a comparison with other systems. The loads now shown on the x-axis are larger, because we are only processing a single query, while previously we were processing 35 queries at the same time in our data flow. And again, the y-axis is the 95th-percentile latency. This particular query gets executed in 80% of Lobsters page views. The first observation is that we tried a bunch of baseline systems: the hand-optimized MySQL deployment that the Lobsters developers made, System Z, a commercial database, and a combination of MySQL and memcached. What you can see is that all of them basically drop out at about 100,000 to 200,000 requests per second, because they cannot keep up with the load anymore. Noria, by contrast, scales up to 14 million requests per second for this single query, which is a 70x improvement over these other systems. Now, of course, this comes about because Noria is super duper optimized for reads and this workload is 95% reads, but that represents the kind of read-heaviness that real web applications see.
So we believe this is a good optimization. We also compared to a baseline that is just memcached, maintaining only vote counts for stories in memory, which is what this query reads. And this memcached-only baseline, which is unrealistic because it really only maintains the vote count, achieves 8 million requests per second. Now you might be saying: hang on a second, Malte, you're saying you're outperforming an unrealistic, idealized baseline that does the same thing you do but stores less data? Yes, we do, but the reason is an implementation-level choice, which is that we use lock-free data structures and memcached uses per-bucket locking in its hash table. So we imagine that memcached could achieve the same throughput we achieve, or higher, if it also embraced lock-free data structures. This is not a fundamental difference; having competitive or better performance than memcached was our goal, and we achieved that. So the bottom line here is that Noria provides a much simpler interface than a key-value store, because it's SQL-based, and it provides better performance in this particular benchmark, competitive performance being our goal. Now, the final experiment I want to show you is one where we actually add a materialized view live at runtime. The materialized view we're adding introduces ratings for stories: instead of just having upvotes, users can now leave star ratings. On the x-axis in this graph, we now see a timeline. Time zero, where the vertical gray line is, is where the new materialized view gets added to Noria's data flow graph. And the y-axis shows write throughput. An ideal result here would be a flat line at the top of the graph with no interruption; remember, our goal is that we should have no downtime. What we actually see when we run this experiment is pretty close to that ideal.
In particular, observe that there is no gap in the blue line. The new view is live instantaneously; there's no downtime for write processing. This graph only shows write processing; reads continue the whole time at a much higher rate than 300,000 per second. They're not shown on this graph, but they also see no interruption. So we see no interruption for read or write processing. And importantly, just one second after the new materialized view is added, 80% of reads for the new query actually proceed without triggering an up query, because all the popular keys are already present in the materialized view and queries for them are satisfied directly from it. The drop in throughput after the new view is added is simply due to the fact that we use a fixed amount of hardware resources, and now an additional materialized view must also be maintained by Noria. So the bottom line of this experiment is that Noria can change your set of materialized views without requiring a restart, and it is therefore suitable for real web applications. And that brings me to the end of the material on Noria. We built the system; it's rather cool. It reduces complexity compared to the manual caching solutions that people implement today. It's efficient: up to 14 million requests per second on a single server. I think it's pretty easy to use, because developers can continue to write SQL queries. And partial state was really the key idea that made this possible: it allows us to build a very fast, albeit eventually consistent, SQL database for serving web application workloads. Now, as Andy mentioned in the beginning, this is a project that I did about two years ago, so I figured I would include a little retrospective material: what, in retrospect, did we do well, and what could we have done better? What we did well, I think, is that the use case is very real.
We certainly found that after we published this paper, we were inundated with people who wanted to use Noria, who had tried to deploy their own caching systems and had hand-rolled their own infrastructure, which was a giant pain. They really hated it and they wanted to use our stuff. Out of that was born a startup company, readyset.io, which I am not at all involved in for visa-related reasons, but they are commercializing Noria. I know nothing deep about what they're doing, but you can check them out and contact them if you're curious. We also found that the dataflow paradigm at the heart of our system was really a good idea, because it gave us great performance and it made parallelism easy by providing a blueprint for how to make the system process data in parallel. And the fact that we built a complete system rather than just a toy research prototype was a really good idea, because it allowed us to test this with real loads and the real MySQL protocol, and it made it a lot easier for people to try it out in the real world. Now, what could we have done better? I think we could have had a better consistency story, and as I mentioned in passing earlier, I'm kind of bummed that the FlightTracker paper at OSDI sort of scooped us on this. Of course, I did not have the ideas that these folks had, but FlightTracker is a really elegant way of getting read-your-writes consistency in dataflow-oriented systems, such as Facebook's index update pipelines. It has basically no overhead at runtime: it propagates information with the data and gives the client what they call the Ticket abstraction, which allows a client to check, with very low overhead, whether the read result it received is consistent or not. So it's really cool. I would love to combine those ideas with Noria; I think that would work rather well. The other thing is more of an implementation point.
We used a lot of Rust async libraries that we were very excited about. I think they were a mixed blessing, because they actually made the system harder to debug. In the next system that I'm building now, I'm back to synchronous C++, and I find the debugging experience a lot nicer than dealing with these async Rust libraries. So, just to finish, this is my final slide before the slide I end on. For the next system, we're building something called Pelton; that's a code name. It's another data flow system, but this time in service of a different goal: GDPR compliance by construction. We're trying to solve privacy compliance through a paradigm where we break the database apart and say each user gets their own database, their own what we call micro-database, containing only the data related to them. This makes it very easy to be compliant with things like the right of access or the right to be forgotten in the GDPR, because you can simply send a copy of the database to the user when they invoke their right of access, and you can simply delete the micro-database if they invoke their right to be forgotten. But of course, the application still needs to consume data combined across users in order to display shared content, and for that we use materialized views; again, we use data flow for efficient maintenance of those materialized views. This system is written in C++17, and we're going for a memory-efficient design, trying to address some of the memory overheads that we saw in Noria. Our current single-threaded prototype achieves between two and 14 million data flow updates per second, depending on the operator being processed. So that is something that I'm very excited about and look forward to hacking on this summer. But that's the end of my talk for today, and I'd be excited to take any further questions that you might have. Thank you.
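As a toy illustration of the micro-database idea just described (all names invented; this is not Pelton's actual API, and Pelton itself is written in C++):

```python
class MicroDBStore:
    """Sketch of the one-micro-database-per-user design: per-user storage
    makes GDPR rights trivial, while cross-user content is derived by
    combining micro-databases (in the real system, via materialized views
    maintained by incremental data flow)."""

    def __init__(self):
        self.users = {}   # user id -> that user's micro-database

    def insert(self, user, table, row):
        self.users.setdefault(user, {}).setdefault(table, []).append(row)

    def right_to_access(self, user):
        # GDPR right of access: hand the user a copy of their own data.
        return dict(self.users.get(user, {}))

    def right_to_be_forgotten(self, user):
        # GDPR right to erasure: drop the whole micro-database.
        self.users.pop(user, None)

    def shared_view(self, table):
        # Shared content combines rows across all micro-databases; here a
        # full scan stands in for an incrementally maintained view.
        return [row for db in self.users.values() for row in db.get(table, [])]
```

Deleting one user's micro-database removes their rows from the shared content without touching anyone else's data, which is the compliance-by-construction property the design is after.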
Okay, so I will applaud on behalf of everyone else. We're a bit over time and I have a baby I've got to go deal with, so we have time for maybe one question from the audience. Do you have a mechanism for evicting entries that are infrequently used in the partial state? Yeah, we do. We have a very simple mechanism, which is that when we get close to the memory ceiling that we configure for Noria, Noria starts evicting random entries, but you could make something more sophisticated using LRU or your favorite cache replacement algorithm. And when Noria evicts an entry, the eviction point sends a negative update down the data flow, which evicts the derived entries as well. Okay, awesome. So, thank you so much for being here. One quick comment: the name is very similar to our former database, Peloton. I think you're going to have a search issue, because everyone knows the bike now, Peloton with an O after the L. What do you call it, Pelton DB? No, no, it's Pelton, not Peloton; it's named after a turbine, a high-performance turbine. And it's not going to be the final system name anyway. Okay, you're good. We ditched the name Peloton ourselves because of the popularity of the bike, so I just know you'd have issues. No, we're not going to get sued either, hopefully. Yeah, well, that's our story. Anyway, Malte, thank you so much for being here. We appreciate you spending time with us.