We're excited today to have Chenggang Wu from Berkeley come give a talk for us about the Anna key-value store that they've been building at Berkeley as part of the RISE Lab. Chenggang is a fifth year — this is finishing your fifth year, correct? Yeah, yeah. He's finishing his fifth year as a PhD student at Berkeley with advisor Joe Hellerstein, and prior to that he did his undergrad at Brown University. I don't think you and I overlapped, did we? No, unfortunately. That's okay. So with that, the way we'll organize this: if you have questions, don't raise your hand, just unmute yourself and shout out and interrupt him as he goes along. This is supposed to be interactive. So go for it, man. Okay, cool. So hello, I'm Chenggang. I'm a final-year PhD student from the RISE Lab at Berkeley with Joe Hellerstein. I'm very excited to be here remotely, and thanks, Andy, for inviting me, and also thanks for pronouncing my name correctly and for arranging all of this. So I'm going to talk about the Anna key-value store. I tagged it v0 on the slide because there's actually a series of work that we've done with Anna, and today I'll be focusing on the initial piece of work, which put more effort on the theoretical side and some design principles. Hopefully it will be more interesting than some of the follow-on papers, which focus more on the auto-scaling engineering side of things. So let's get started. Conventional wisdom — or at least what Jeff Dean said in his talk about the challenges of building large-scale systems — says that whenever you want to scale your system by an order of magnitude, you will have to redesign the entire system and come up with new architectures and execution models. But as researchers, we want to ask the following countercultural question: can we build a system that achieves very good performance at any scale?
So in this project, we're going to explore the answer in the context of the Anna key-value store. Anna is a distributed in-memory key-value store. And sorry, are you guys seeing the slides advance? I'm not seeing... I see a research question: can we build a system that delivers? Yeah, it's weird, because I'm advancing on my side but not seeing yours advance. Oh, there it goes. Yeah, there's some lag sometimes. Yeah. So Anna is a distributed in-memory key-value store. The name actually refers to a California-native hummingbird, which is the fastest animal relative to its size. And indeed, our key-value store — at least the initial version of it — is pretty lightweight, with only a few thousand lines of C++. And despite the size of the code base, Anna delivers very high performance across any scale, from a single multi-core machine to a NUMA architecture to a geo-distributed deployment; Anna can outperform many state-of-the-art key-value stores by over an order of magnitude. In addition, Anna supports a wide spectrum of consistency guarantees, which includes things like last-writer-wins and eventual consistency, as well as causal consistency, and also some transactional isolation levels such as read committed. So let's talk about why the two aspects Anna focused on — namely, any-scale high performance and flexibility in consistency — are important. In terms of scalability, in a distributed setting it's already very well understood that we want our system's performance to grow as we add more and more machines. But scalability is actually also crucial even within a single node. Modern cloud providers like AWS now offer beefy multi-core servers with very high computing capability — they now offer machines with 64 CPU cores or more.
So if we're going to use these beefy servers for our applications, it's very important to be able to efficiently utilize those computing resources as much as possible. There are systems like Redis Cluster that can exploit multi-core parallelism, and things like Cassandra that have good scalability in a distributed setting, but at least two or three years ago, there was no key-value store that excelled at all scales with a unified architecture. And also in terms of consistency: a high-performance key-value store should ideally be able to benefit a wide range of applications. The problem is that these applications may vary in their consistency demands. For example, if we use the key-value store as a caching service, we may want to tolerate certain degrees of staleness, but at the least we would like eventual consistency, where the replicas eventually converge to the same state. Sometimes we need stronger guarantees. For example, in an online shopping cart, we may want to track the causal dependencies between updates from different users that may share the same shopping cart account, in which case we may need stronger guarantees like causal consistency. And finally, for large-scale indexing, we may want the updates to our index and to the data to appear atomically as a unit, so we need some form of transactional isolation in these cases. So basically, that's why any-scale high performance and flexibility in consistency are important for a key-value store. So you listed that you guys support item cut isolation. Do you actually have an application that wants item cut? For item cut isolation, we actually didn't find an application that can really take advantage of that specific consistency level. It's listed in Bailis's highly available transactions work, but I think for that specific level, it remains pretty theoretical. That's why I asked. Awesome. Yeah.
So let's focus on the scalability side first. I'm going to first discuss the fundamental limitation of the state-of-the-art shared-memory model within a single machine, and then present how Anna addresses this limitation with its coordination-free actor model. I'll then talk about how to extend Anna's execution model from a single node to a distributed setting. The shared-memory model is actually very widely used within a single multicore machine. Basically, we spin up one thread per CPU core, and the threads can issue reads and writes to a shared memory space; popular in-memory storage systems like Memcached and Masstree use this model. The problem is that whenever write operations are involved, we need to apply some form of thread synchronization to prevent memory corruption. Synchronization usually takes the form of locking, like mutexes and spin locks. But more recently, lock-free data structures such as those in Intel's Threading Building Blocks (TBB) have also been introduced, which use the single-instruction atomic compare-and-swap to serialize conflicting operations. And there are studies showing that synchronization using the lock-free approach can be cheaper than locking. But the question is: is lock-free really good enough? So here we show that even state-of-the-art lock-free synchronization can have pretty poor scalability on a multicore machine. Basically, what we did is a microbenchmark that uses the TBB hash map to build a very simple in-memory key-value store, and we benchmarked its performance against a single-threaded baseline with a write-heavy workload. Under high contention, where only a small subset of keys is being accessed, the aggregate throughput of the TBB hash map actually decreases very significantly as we increase the thread count.
And this is pretty obvious, because with the shared-memory model, concurrent updates to the same data have to be serialized. In addition to that, the majority of the CPU time is actually devoted to retries when the atomic instruction fails. As a result, not surprisingly, the throughput is even a lot lower than using a single thread. This observation also agrees with a recent study on lock-free synchronization by Faleiro and Abadi. So the way Anna addresses this limitation is via what we call its coordination-free actor model. In this model, instead of using shared memory, each thread only has access to its private memory view. Because of this, very straightforwardly, there's no need for locks or for lock-free synchronization at all, because every thread just accesses its own state. And in case threads perform updates to replicas of the same data, Anna just lets them propagate these updates asynchronously in the background via explicit message passing. And because gossip only happens periodically in the background, the overhead doesn't occur on the execution's critical path. So in this model, there's no overhead due to waiting whatsoever, which is one of the key design insights that makes the system fast. Now, by employing this wait-free execution model, Anna is able to achieve good scalability within a single machine. But next, we want to go distributed. So what changes do we have to make to extend this model to a distributed setting? The answer is that actually no change is required at all, because since we're using async message passing instead of shared memory, we've essentially just built a small distributed system within a single machine.
And extending this model to a real distributed setting requires no work, because we're simply adding more and more machines without making any change to our execution model, and the system just scales very naturally. So, hi, can I ask a question? Yes. Hi, this is Ling. I think we chatted over Signal. Yeah, yeah, yeah. Nice to talk to you. Yeah, so my question is: when you say Anna is using this propagation of gossip to apply the changes, I was thinking you eventually need to apply the writes at some point anyway, right? Yes. So when you eventually apply the writes, even though it's asynchronous, you still don't need any lock, any protection? So whenever we talk about protection, we're talking about the model where there are multiple threads concurrently updating a single shared memory region. In this model, when they exchange messages, what happens is that one thread will send a message through either a TCP channel or an IPC channel to another thread. In that sense, we don't need to apply anything like locking or atomic instructions. Another subtle but important point is that this gossiping is done periodically, say every 10 or 100 milliseconds. So imagine you have a workload that's really write-heavy. During that gossip period, the same data may get updated 100 or 1,000 times, but at the end of the period, the thread only gossips the end state of the object to the others. I see. I see. Thanks. I have a follow-on question to that. Okay. So you can imagine you're in a bank, and there's a bunch of bank tellers, and you're giving very good service at that first row of bank tellers. The issue is behind the scenes, at that second level. And in a database, this creates a situation like write skew, where two people read something and then write, and then that write is a problem.
So when you push that thing off, there has to be some kind of resolution, which in Jim Gray's book he talked about as writing wormholes to the log. In other words, the solution to it requires time travel. So the question is, how do you deal with that? I guess you're going to talk about this. Yeah, so basically the straight-up answer is that Anna only supports a limited set of coordination-free consistency models. The phenomenon you're talking about, like write skew — which is definitely important for use cases like banking applications — typically requires stronger consistency guarantees, like serializability or a stronger version of snapshot isolation, which is actually beyond the scope of the consistency levels that Anna can support. So you have no ACID characteristics at all, right? No, no ACID transactions. Definitely no ACID transactions. Basically, the design principle in our minds is that ACID transactions — consistency levels that require strict coordination — are at odds with auto-scaling, super-high-performance scaling. There's a hard trade-off between those two. Okay. Yeah. Do you keep track of — I figure you'll talk about the gossip protocol, but this sounds like NuoDB: do they keep track of, like, the home node or thread for data, so they know who needs to be told, "Hey, here's an update"? Are you maintaining any of that, or do you just propagate to everyone? Yeah, so for now — first of all, Anna does multi-master replication, which means if the data is replicated three ways, those three replicas can all accept updates, and they gossip to each other. And for now, we're not super optimized for, or paying special attention to, how efficient this gossip process is. Ideally, you can imagine that if we have a 10-way replica, it would be nice to apply some form of tree-based gossip.
And there are even other optimizations you could do, like selectively replicating keys that are more important, and varying the gossip period per key. But for now, we're not optimized for that. That's definitely a pattern we're observing that could be very useful in a lot of applications. Okay. Just another quick one — Charles. Just for everyone, when you ask a question, say who you are and where you're coming from. Okay, my name is Charlie Johnson. I used to work at Tandem and then HP Labs, and now I'm at Nutanix. Awesome. And so when you're gossiping, do you at all talk about these kinds of conflicts, so that they might be raised at some level with some kind of resolution later? Yeah, yeah. That's the second part. Okay. All right. Yes. Cool. Any other questions? No. All right. Let's go ahead. Yeah. So that's basically the slide. But this execution model obviously introduces a new challenge, because we propagate those updates asynchronously, so the same set of updates may arrive at each thread in a different order. For example, here we have a value replicated across three threads. At the beginning, thread T1 may write value A, T2 writes B, and T3 writes C, respectively, to the replicas. Later on, T1 may receive the gossip from the other two threads in the order A, C, B; T2 in the order B, A, C; and T3 in the order C, B, A. If we naively just let the gossip override all existing values, then obviously the states across replicas will diverge, which is bad. So at this point, I want to shift the discussion from scalability to consistency, and talk about how Anna uses lattices to achieve replica convergence and implement a wide spectrum of coordination-free consistency models. Anna addresses the consistency issue by encapsulating the replicated data in a lattice.
We can think of a lattice as just a very simple data structure, which contains an element of some data type and has a merge function that updates this element in a way that is associative, commutative, and idempotent — we call these the ACI properties. In the database community, the idea of leveraging ACI for conflict resolution actually has a long history, and the concept of a lattice is also the basis of the CRDTs introduced by Marc Shapiro et al. For example, here we have a set lattice, whose element is just a set and whose merge function is a very simple set union, whose ACI properties can be very easily verified. And what the lattice ACI properties achieve for us is that they shield the application from anomalies due to message reordering and duplication, both of which happen very frequently in a distributed setting. So we use a max lattice to represent a monotonically growing timestamp, and the way the merge logic works is that, given an input, we first compare the two integers; if the input's max lattice dominates, we override the value, and otherwise we keep the current value. It's easily verifiable that this satisfies the ACI properties, and for the same set of updates, regardless of their ordering, the update with the largest timestamp always overrides the others. The figure on the right basically shows how we compose lattices to implement last-writer-wins. The key-value store itself is implemented as a map lattice, whose element is an unordered hash map; the keys can be any type, but the value is the last-writer-wins lattice that I just discussed, with the max lattice representing the timestamp.
So when the input key doesn't exist in the map lattice, we simply update the map with the input key-value pair; otherwise, we invoke the merge function of the last-writer-wins lattice to resolve the conflict and achieve replica convergence. I think Tianyu has a question. Yeah, sorry, a quick question. So I'm Tianyu, currently at MIT — I think we met at Berkeley last year. Yeah, yeah, yeah, sure. So yeah, I guess one question here is that you seem to rely on some sort of timestamp to resolve write conflicts here, but doesn't that sort of require coordination outside of your normal gossiping, just to get that right? Yeah, so in the current implementation, the timestamp is generated on the client side using the wall-clock time. In that sense, it doesn't require strict coordination, but that at the same time means the timestamp may not truly respect the real order in which the events happened, because different clients' clocks may be slightly skewed. So the only thing that last-writer-wins accomplishes is to guarantee eventual replica convergence without coordination. It doesn't necessarily guarantee that the last value you've seen is really the latest update in terms of wall-clock time. That's the trade-off. Okay, cool. Doesn't that open you up to, like, someone fucking around with, like, a malicious client giving you funky timestamps? Yeah, so by client, I mean the Anna client that we embed in the user applications. But if they hack the Anna client, yeah, currently that can happen. But I feel like, if you're geo-distributed, though, doesn't that give you all kinds of fairness problems across replicas, unless the data centers coordinate the clock drift or something? Fairness in what sense are you thinking about? Basically, if you have a data center whose clock runs faster, they basically always win out, right?
Yeah, that could be. So if a certain region of clients somehow always generates timestamps ahead of the rest of them, that could be an issue. You're right — if that kind of issue arises, we'd definitely need to look into it and come up with some solution to periodically sync those client-side clocks and stuff like that. Okay, but basically what you're saying is, if you just happen to have TrueTime, this is not a problem. If you happen to have TrueTime, yeah, this is not a problem. TrueTime requires coordination in reality, though. So let's go through the same example — because the previous one was a little abstract — and see how last-writer-wins resolves the conflict among replicas. The threads still receive updates in different orders: T1 in order A, C, B; T2 in order B, A, C; and T3 in order C, B, A. But this time, the updates are timestamped: A with 101, B with 142, and C with 123. So although T2 receives C at the end and T3 receives A at the end, the value corresponding to the largest timestamp — B with 142 — will always dominate. So the use of lattices in this case guarantees eventual replica convergence. And note that last-writer-wins is actually the simplest and weakest consistency model supported by Anna, because, as we discussed before, it only guarantees convergence; it doesn't guarantee any real-time ordering. But there are definitely stronger alternatives that we support. So back in 2013, Bailis summarized a wide spectrum of coordination-free consistency models, and we found that by carefully composing these different lattices together, Anna is able to implement pretty much all of them with very little code change, as shown in this figure.
But due to time constraints, I'm not going to go through the detailed definitions of all of these alternatives for now. Instead, I'm going to talk about some evaluation results. First, I'll present the performance of Anna's coordination-free execution model, and then I'm going to show how Anna smoothly scales from a single-node deployment to a distributed setting. We also benchmarked against other state-of-the-art key-value stores at the macro level, but due to time constraints, I'm probably going to skip that part. So in the first experiment, we want to answer the question: can Anna's coordination-free execution model achieve high scalability on a multicore machine? As discussed before, Anna replicates keys for performance, and in this work, we use a single replication factor for all keys — I'll return to that at the end of the talk. To benchmark the key-value store's full capacity, requests are pre-generated on the server side. We compare the performance with other single-node systems, including the TBB hash map and Masstree, which is another pretty popular in-memory key-value store. As a performance baseline, we also implemented a shared-memory key-value store that doesn't use any thread synchronization whatsoever. Because of that, this key-value store is not even correct — memory corruption can happen. But nevertheless, it represents the fastest performance one can get with a shared-memory model, and we denote this key-value store as "ideal" in the figure. Under high contention, for Anna, we see the performance scales nicely up to its replication factor, and this is because different threads are able to process updates to the same data concurrently on different replicas. And all the shared-memory key-value stores don't scale at all, due to the serialization overhead and, as I mentioned before, the atomic instruction overhead.
And interestingly, although the "ideal" shared-memory key-value store performs better than TBB and Masstree, it still fails to scale because of the significant cache invalidation problem. At thread count 32, Anna can outperform Masstree and TBB by up to several hundred times. A quick question on your previous graph: what coordination-free consistency model are you using for this experiment? Because it sounds to me like the other ones will have different consistency models, right? Yeah, in all of them, Anna's configuration is last-writer-wins. For the other shared-memory key-value stores: as I mentioned, the "ideal" one has no synchronization, so no consistency — it's not even correct. Masstree and TBB use the shared-memory model, and the shared-memory model gives you linearizability for free, so in that sense, they're definitely achieving stronger, coordination-based consistency levels. So you can think of it as: although Anna achieves super-high performance, it's definitely relaxing the consistency model it's trying to achieve. Okay, thanks. And why is the performance of "ideal" not ideal? Yeah, the term "ideal" is a little misleading, in the sense that it's ideal only within the shared-memory regime — basically, the fastest thing you can achieve if you let different threads simultaneously update things in a shared-memory region without any protection. It's "ideal" because it's better than TBB and Masstree and represents the highest performance you can get in that regime, but you're still bottlenecked by cache invalidation. That's the point — it's not the true speed of light. I see. So it's basically bottlenecked by — it's almost worse than Anna just because of the cache invalidation. Just because of the cache invalidation. Is it because of cache invalidation among different NUMA regions, or even within the same NUMA region?
Based on our observation, both factors are definitely present, but even within the same NUMA region, there are definitely cache invalidation costs that get exacerbated. Okay, sure, thanks. Sorry to keep you on this slide, but do you observe any difference depending on what you configure the gossip period to be? Yeah, sure. So for here, I believe the gossip period is once every 10 to 100 milliseconds. But you can imagine, on a high-contention workload, on one extreme, if you keep the gossip period really, really short — like one microsecond — that's no different from full replication, because you're essentially gossiping every update to the other two or three replicas. On the other extreme, if the period is very, very high — like gossiping every minute — then you'll see perfectly linear scaling, because the gossip overhead is just unnoticeable. In a real application, it really depends on how tolerant your application can be to staleness, and you should be able to tune your gossip period accordingly. But this is just the benchmark setting. Just another point on this: gossiping — in other words, the more you do of it, the more you're effectively sharing memory, because every cache invalidation is a memory write, right? So if you're doing high contention, then that whole thing is going to come down. Exactly, yeah.
So that's what we call the compress-on-sender trick. In some sense, it's taking advantage of the lattice's associativity: if your application tolerates a certain degree of staleness, say 10 milliseconds — it relates to the question I answered about 10 minutes ago — then during those 10 milliseconds, you can batch up writes locally and only gossip the end state, to reduce the amount of contention incurred. Now, this isn't an advertisement for Gen-Z, but if you had multiple nodes and you were sharing memory across Gen-Z — because there's no coherency, your cache invalidations would just affect your local node. And so you might actually still get this scaling, since you're not coordinating anyway — in other words, no ACID. Okay, so you're referring to the Gen-Z stuff — I'm not super familiar with it. Yeah, that's an HP Labs thing. Sorry. Okay, yeah, maybe we can talk a little more about it offline, but I'm not super familiar with all the details. Yeah, cool. Okay, so low contention. Under low contention, all the key-value stores scale linearly, which is good. But Anna with replication factor one can still significantly outperform all the alternatives. The reason is that even without contention, the overhead of an atomic instruction is just much higher than a regular instruction, and this overhead occurs, importantly, on the critical path of every read and write to shared memory. And interestingly, we see that once we start to increase Anna's replication factor, its performance starts to go down and down. This is because, within a gossip period under low contention, the number of distinct keys being updated grows significantly, which in turn increases the gossip overhead. So that teaches us a lesson: key hotness can have a very big impact on performance.
And probably having a single replication factor across all keys is not a good idea, which I'll come back to towards the end. So next we answer the question: can Anna's execution model provide smooth scaling across different scales? Here, the first 32 threads are on a single machine, shown in blue; the 33rd through 64th threads are on a second machine, shown in yellow; and threads from 65 onwards are on a third machine, shown in green. We do observe a small performance degradation between the 32nd and 33rd threads, because this is the point where we start to introduce distributed gossip overhead. But in general, the performance scales linearly, so Anna's execution model does let the performance scale smoothly from a single node to a distributed setting. So, two high-level takeaways. First, the shared-memory model introduces high synchronization overhead, so we instead use one actor per core with private memory to eliminate synchronization. Second, lattices provide a neat way to achieve replica convergence, and lattice composition — carefully composing these different lattice pieces together — allows Anna to support a wide spectrum of consistency models. And of course, the third bullet point that I didn't put up is that we achieve all of these performance and consistency gains by sacrificing the coordination-required consistency levels — what we call strong consistency, like serializability, linearizability, and snapshot isolation. Those consistency levels are not supported by Anna. So before I go on: any questions at this point? Yeah, sorry, just curious — do you have any numbers measuring the difference between the overhead of gossiping within a single node and distributed gossiping?
So for example, 32 cores on a single machine versus 32 servers, each with one core — can you quantify that performance difference? Yeah, so we didn't put that figure in the paper, but we definitely measured it, because implementation-wise, when we're gossiping, we're using ZeroMQ — either TCP message passing or inter-process message passing. Within each node, we use the in-process message passing mechanism, where underneath it's just putting to and getting from a shared queue, I guess. And across nodes, we use the TCP transport. So there is definitely a significant difference in gossip overhead. But what happened in the macrobenchmark is that we noticed the performance is actually bottlenecked by the network side, so the difference between intra-node gossip and inter-node gossip is actually not as pronounced as in our microbenchmark. That's our observation. Okay, cool. Next. Any other questions before I move on to what happened next? Just maybe a suggestion: if you did this with the RDMA thing that allows you to do memory-to-memory transfers via mmap, which can operate at full memory speed across the network, that would be an interesting experiment. Exactly, exactly. So I think, theoretically — and I'm very curious to do the experiment — it would definitely boost the transfer efficiency a lot. So, did you guys, when you first started the project, know at the beginning that you were going to go with an actor model, or is that something you sort of organically discovered over time? Yeah, at the beginning, I don't think I even knew about this actor model thing. The beginning of this project was that we noticed — because we started with what everybody else did, the shared-memory model, right? We just started with that.
And as we benchmarked with varying workloads, varying contention levels and numbers of concurrent threads, I just noticed that the KV store fails to scale. So how do you solve that? That's sort of what motivated this work. And the simplest thing is you don't coordinate. If you don't coordinate, you go fast, but then you have to deal with consistency, and that's what introduced lattices and so on. So that's how everything panned out. Is there a framework you're using for your actor model implementation, or did you roll your own? Actually, for the initial Anna, we wrote everything ourselves. We didn't use any existing actor libraries. We implemented our own lattice composition library and so on, which is open source. Cool. So I guess a related question: you use ZeroMQ as your messaging framework, right? Are you differentiating between local and remote messages at all, or do you just pay the same serialization overhead for both? So implementation-wise, yes, we are differentiating. When we're gossiping within each node, we're just passing the pointer around, whereas for inter-node communication, we're actually serializing, deserializing, and sending messages. So yes, we are differentiating. Cool. So basically, I'm going to talk a little bit about what happened next, because this first Anna version was around 2017, 2018-ish. In 2019, what we did is address one of the key limitations that I mentioned before: Anna is very sensitive to this replication factor setting under workload skew. When the contention is high, you want to aggressively replicate your data to spread the load. But when the contention is low, ideally you want to minimize your replication factor to reduce the gossip overhead. The problem is, you don't know your workload ahead of time.
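The local-versus-remote distinction described a moment ago, passing a pointer for intra-node gossip but serializing for inter-node messages, can be sketched as a toy model. This is illustrative Python, not Anna's actual ZeroMQ-based C++ implementation; the class names are made up:

```python
import pickle
from queue import Queue

class LocalChannel:
    """Intra-node channel: hand the receiver a reference, no copy."""
    def __init__(self):
        self.q = Queue()
    def send(self, msg):
        self.q.put(msg)  # pointer/reference passing, zero serialization cost
    def recv(self):
        return self.q.get()

class RemoteChannel:
    """Inter-node channel: pay serialize/deserialize cost, as over TCP."""
    def __init__(self):
        self.q = Queue()  # stand-in for the network
    def send(self, msg):
        self.q.put(pickle.dumps(msg))   # serialize before "the wire"
    def recv(self):
        return pickle.loads(self.q.get())

update = {"key": "k1", "value": 42}
local, remote = LocalChannel(), RemoteChannel()
local.send(update)
remote.send(update)
assert local.recv() is update   # same object: no copy was made
assert remote.recv() == update  # equal, but a distinct deserialized copy
```

The point of the sketch is just the asymmetry: the local path is a queue handoff, while the remote path pays serialization on both ends.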
So it would be very nice if Anna could just dynamically adjust each key's replication factor based on the workload skew. And at a higher level, 2018 and 2019 were when the concept of serverless became more and more popular. So going even further, it would be nice if developers didn't have to worry about how many nodes to deploy or what the optimal configuration of replication factors is. Instead, they should just specify high-level goals, like a latency SLO and a cost budget, and ideally the system should tune itself dynamically to meet those goals within that budget. So the second push for Anna, Anna v1, was to make it a serverless key-value store that can dynamically adjust its deployment based on the workload skew and achieve the high-level goals specified by the end user. That's what happened in 2019. Then going into 2020, serverless computing was taking off more and more, and people started using it more and more. Especially at Berkeley, there's been a tremendous amount of attention put into serverless, and we're really strong believers that serverless will take off in the next five or ten years; everybody will just start using serverless, everything will become serverless. So if you look at what exists today, there are serverless computing platforms provided by cloud vendors: AWS Lambda, Google Cloud Functions, Azure Functions, and so on. They're really good at handling stateless workloads, in the sense that you just upload a function, it gets executed for you, and it returns the result without accessing any external shared mutable state.
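The skew-driven adjustment just described, replicating hot keys aggressively while keeping cold keys at a minimal replication factor to limit gossip overhead, might look roughly like this threshold policy. This is a hypothetical sketch; the counter, thresholds, and scaling rule are illustrative, not Anna v1's actual mechanism:

```python
def choose_replication_factor(accesses_per_sec, rf_min=1, rf_max=8,
                              hot_threshold=1000):
    """Pick a per-key replication factor from an observed access rate:
    hot keys get spread across more replicas to absorb contention,
    cold keys stay at rf_min so gossip overhead stays low."""
    if accesses_per_sec <= hot_threshold:
        return rf_min
    # Scale roughly with load, capped at the cluster-wide maximum.
    rf = rf_min + accesses_per_sec // hot_threshold
    return min(rf, rf_max)

assert choose_replication_factor(10) == 1       # cold key: minimal gossip
assert choose_replication_factor(5000) == 6     # hot key: spread the load
assert choose_replication_factor(10**6) == 8    # capped at rf_max
```

A real policy would also need hysteresis so keys don't oscillate between replication factors, and a way to fold in the user's latency SLO and cost budget, which is exactly the tuning problem the serverless version takes on.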
But if we believe that serverless computing is really going to take off and everybody's going to use it, then your workload is inevitably going to consist of some accesses to remote shared mutable state. So currently, how it's done is that, for example, if you use Lambda, you can do a remote call to your data that's stored in either DynamoDB or S3. That introduces two issues. The first is that existing key-value stores like S3 and DynamoDB don't have consistency levels as rich as those provided by Anna. The second is that once you leave your function executor, once you make that remote call, you incur an extra network round trip, and especially when your data is large, that overhead can be very significant. So what we wanted to achieve in the next project, called Cloudburst, is to build a serverless computing framework using Anna as the storage backend that simultaneously achieves low-latency function serving and, at the same time, the rich consistency levels provided by Anna. The one-sentence summary of the key insight is that we just attach a cache co-located with each function executor. When you cache stuff, things become faster; that's the obvious gain. But the challenge is that whenever you have a cache, you always need to worry about cache staleness and how the cache collaborates with Anna to provide meaningful consistency levels. So the real research challenge for the Cloudburst project is how to simultaneously achieve pretty strong consistency levels, and we actually pushed Anna's consistency even further, to what we call transactional causal consistency, which is the strongest consistency level that you can achieve in a coordination-free setting, while eliminating the network round trips as much as possible. So that's the Cloudburst project.
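What makes a co-located cache safe to combine with Anna despite staleness is the lattice structure mentioned earlier: merge functions are associative, commutative, and idempotent, so replicas and caches converge no matter the order or repetition of gossip. Here is a minimal last-writer-wins register as an example; this is an illustrative sketch, not Anna's actual C++ lattice library:

```python
class LWWRegister:
    """Last-writer-wins lattice: merge keeps the higher timestamp."""
    def __init__(self, timestamp, value):
        self.timestamp, self.value = timestamp, value

    def merge(self, other):
        # Deterministic tiebreak: self wins on equal timestamps.
        winner = self if self.timestamp >= other.timestamp else other
        return LWWRegister(winner.timestamp, winner.value)

a = LWWRegister(1, "old")
b = LWWRegister(2, "new")
# Commutative: merge order doesn't matter, both sides converge to "new".
assert a.merge(b).value == "new"
assert b.merge(a).value == "new"
# Idempotent: re-applying the same gossip message is harmless.
assert b.merge(b).value == "new"
```

Anna composes small building blocks like this (counters, maps of lattices, vector clocks) to get the different consistency levels, which is what "lattice composition" refers to in the talk.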
And on sort of the last slide, what's happening next and now: this is not so much my grand vision as Joe's grand vision. He's been trying to answer the question of how we program the cloud for the last 20 years or so. So he wants to build a project, now called Hydro, that encapsulates both Cloudburst and Anna and is meant to be a platform for programming the cloud. Basically, we've been building things from the ground up. We have Anna, which acts as the storage layer, built with lattice composition, and then Cloudburst, which is a containerized serverless computing platform with consistent hashing and fault tolerance built in. So the next step is to build further upward and take a look at the programming language side. There's definitely a polyglot of programming models available and becoming popular: logic programming a la Bloom, functional reactive programming like Rx, actors, and more recently futures, as in Ray. And who knows what the best programming model will be in the next five or ten years. So given all this variety, we want to make a cloud compiler toolkit, tentatively named Hydrolysis, that takes in all of these models and emits an intermediate representation in Hydrologic, which is a universal disorderly algebra for cloud computing, and that can then be compiled down to the Cloudburst runtime for efficient execution. So that's sort of our group's grand vision for what's happening next. Probably I won't be part of the journey of developing Hydrologic and Hydrolysis, but hopefully the younger PhD students will take it on from that point. So that's pretty much all I have today. I'm happy to take questions. Awesome. So we obviously can't applaud. You could hit the applaud button, but in my opinion, that's an empty gesture.
All right. So any questions? Yeah. So I guess this is not necessarily related to this presentation, because I just happened to read the Cloudburst, and was it HydroCache? Yeah, yeah, yeah. Paper. So when you say you support fault tolerance, can you maybe go a little deeper into what kind of model you have in mind? Yeah. So the fault tolerance work is actually not part of the HydroCache paper; it's a separate paper called AFT, for atomic fault tolerance in serverless computing. In that paper, we're focusing on supporting what we call Bailis's RAMP protocol, which stands for Read Atomic Multi-Partition transactions and avoids both dirty reads and fractured reads. We basically extended that protocol to the serverless setting to achieve exactly-once function execution. One thing worth mentioning is that the paper is not limited to Cloudburst and Anna, because it's built with the mindset of being a shim layer that can have any kind of function executor on top, be it Lambda, Google Cloud Functions, or Cloudburst, and any storage system underneath, like S3, DynamoDB, or Anna. So it's a plug-and-play middleware layer; hopefully you don't have to make any change to your application, you just insert our layer and it guarantees you richer fault tolerance. This is Vikram's paper, right? Yeah, so all of the papers are on my web page, and I think it's the second or third one; it's called AFT, atomic fault tolerance in serverless computing. Okay, and if I remember correctly, this is just a shim layer, basically, that you put on top of a storage layer. Right, it's a shim layer that sits between the fast compute layer and the storage layer. And without that shim layer, you're basically still requiring serverless functions to be idempotent, because when something fails, you don't roll back any intermediate state.
Yeah, okay, the middle layer itself doesn't do rollback. We rely on two important things: we rely on the underlying storage system to provide the persistence guarantee, and we keep whatever retry policy you have for the function execution layer above. But if you only do retries, how do you handle, I mean, when you fail in the middle of, say, a function execution, you might need to redo stuff, but you also might need to undo stuff, right? Yeah, so the middle layer does push all the changes made within a function or a DAG atomically at the end. So there's some buffering going on; it's a little detailed to get into, but it avoids pushing partial updates that could later be seen as anomalies by other function executors. So it's like another cache layer, essentially, in front of your HydroCache layer? Yeah, so the AFT shim layer itself also acts as a caching and buffering layer. Okay. Cool. Any other questions? I have a point of interest, perhaps. And that is, you're working on Amazon AWS, right? Yeah, for now, all of our deployments run on AWS. Yeah. Having spent an unfortunate amount of time trying to get RDMA working on AWS: it doesn't, and per their plans, it won't. The two things you want to do with RDMA are this memory-to-memory thing, which the InfiniBand people offer, and also remote access to EBS disks. Neither one of those is in plan for the foreseeable future. Whereas on Azure, of course, because Azure is supporting provisioned SGI systems from HPE, you have full InfiniBand access and you will be able to run at memory speed with low latency. If you try it on Azure, you will actually be able to do the full RDMA thing and get the full performance that they're getting in high-performance transaction systems at these other places.
So if you want the fastest performance, that's what you do, and you should be able to do that on Azure without any trouble. Right, right, right. So having support for multicloud, GCP, Azure, and AWS, is definitely good, and it gives us more comparison points as well. And there's a talk whose slides are online from the HPTS conference in 2019, by Tim Kraska of MIT, called Fast Networks and the Next Generation of Transactional Database Systems. You might want to look at that, because he shows dual XDR InfiniBand performance that's just faster than memory speed, which is what you want. Yeah, I should definitely talk with Tim, and also, what's the name of the person working on it? I think Irfan has been working on it, but yeah, definitely we're thinking about it. Yeah, he's at DSAIL, the Data Systems and AI Lab. Yeah, Tim was my undergrad advisor. Yeah, we know Tim. We'll leave it at that.