Hi, good afternoon. Thank you for showing up for the last talk of this QCon. My name is Suryan, and I'm working as a software engineer at Google. And I'm Bogdan, I'm a software engineer at Apple. Today we're going to talk about unleashing the power of etcd: the potentials and constraints of an extensible etcd. The idea of this talk is this: we all know that etcd uses bbolt as its persistent storage, but we were wondering why, and whether it could be something else. So this talk entertains the idea that maybe we should make the backend of etcd more extensible and try something else. We'll cover some background knowledge, then introduce how we extended the etcd backend on our side, then share benchmark results for the three backends we tried, and finally dive deep into the CPU and memory profiles of those three backends. Okay, so first we're going to go through some background for those who are new to etcd or don't work on etcd all day. etcd is a strongly consistent key-value store, and it uses Raft to achieve that strong consistency, so a lot of etcd's architecture is built around Raft. Since it's a key-value store, it also has its own MVCC implementation, backed by an embedded BoltDB, or bbolt, database. A lot of the logic in etcd also deals with peer communication, which is part of Raft as well. So let's look at the typical write flow, which is on the leader. I'm not going to go through all the numbers; what I really want to showcase is that there are two streams of work. There's step number three, persisting the transaction to the WAL, which is basically writing to the write-ahead log; that's a requirement coming from Raft. And then there's also a write to the MVCC backend, which then asynchronously writes to bbolt. MVCC stands for multi-version concurrency control, and it basically gives us a history of the revisions of the keys. Let's also look at the typical read flow. This one is simpler: of course we don't need to write anything to the WAL, but we still need to make sure that we are reading at the correct index, and that's why there is that ReadIndex request. Most of the work is done in the MVCC layer, and that's where we get our key-values. Next, we're going to dive into the MVCC portion a little bit more. Here is the data model that etcd's MVCC implementation uses. etcd has its own in-memory tree index that maps keys to revisions and manages the history of the revisions for each key. The backend, which is bbolt, doesn't really know about versions; we just store the revisions as key-values in bbolt. This index will be referenced further in the presentation during some of the performance benchmarks.
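To make that data model concrete, here is a minimal Go sketch of the idea. It is loosely modeled on etcd's mvcc package, but the types and names here are our simplified illustrations, not the actual implementation:

```go
package main

import "fmt"

// Revision identifies a point in MVCC history: a main store revision
// plus a sub-revision within a single transaction.
type Revision struct {
	Main int64
	Sub  int64
}

// keyIndex records every revision at which one key was modified.
// etcd keeps structures like this in an in-memory B-tree (the
// "key index" we refer to throughout the benchmarks).
type keyIndex struct {
	key       []byte
	revisions []Revision // ordered modification history
}

// The backend (bbolt in stock etcd) knows nothing about keys or
// versions; it simply stores revision -> key-value pairs.
type backend map[Revision][]byte

func main() {
	idx := keyIndex{
		key:       []byte("foo"),
		revisions: []Revision{{Main: 2}, {Main: 5}},
	}
	be := backend{
		{Main: 2}: []byte("bar-v1"),
		{Main: 5}: []byte("bar-v2"),
	}
	// A read first consults the in-memory index to resolve the key to
	// a revision, then fetches that revision's value from the backend.
	latest := idx.revisions[len(idx.revisions)-1]
	fmt.Printf("key=%s rev=%d value=%s\n", idx.key, latest.Main, be[latest])
}
```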
Okay, so let's talk about why bbolt was chosen as the MVCC backend for etcd. etcd is a pretty old project, and the decision to use bbolt was made back in 2015. Some of the requirements were to have the backend in Go, to have a reliable backend with predictable performance, and to support non-blocking snapshots, for the situation when you need to snapshot the backend while writes or reads are going on; snapshotting is one of the design features of etcd. And then last, we wanted to store about 10 million key-values of approximately one kilobyte each, which adds up to roughly 10 gigabytes of data. This was, I think, mainly driven by Kubernetes requirements back in the day. People then did some performance testing in which bbolt was compared with Bitcask and LevelDB, and they went with bbolt. One advantage is that bbolt uses B-trees, and they are better for random reads than the LSM trees that many other databases use. Let me also highlight some key concepts of bbolt. There's maybe a little bit of name confusion, because we call it bbolt but it's sometimes also referred to as BoltDB. etcd maintains a fork of BoltDB, which we call bbolt. The original BoltDB project is archived, and since it was archived, people in the etcd community have fixed bugs and added new features and improvements, so the fork is actually actively developed. So what is bbolt? It's a key-value store. It has transaction support, which is important for etcd. It uses B-trees, as I mentioned before. And, as a very interesting feature, it uses a memory-mapped file as the core mechanism for how it stores values; it was largely inspired by LMDB, the Lightning Memory-Mapped Database.
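As a quick reference for those concepts, here is a minimal, self-contained example against the public go.etcd.io/bbolt API (the file path and bucket name are arbitrary, just for illustration). Note that the slice returned by Get points directly into the memory-mapped file and is only valid inside the transaction; this is the mmap behavior we discuss next:

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Opening the database memory-maps the file into the process
	// address space.
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Mutations run in a single writable transaction at a time:
	// bbolt allows one writer and many concurrent readers.
	if err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("key"))
		if err != nil {
			return err
		}
		return b.Put([]byte("foo"), []byte("bar"))
	}); err != nil {
		log.Fatal(err)
	}

	// Reads run in a read-only transaction. The returned slice points
	// into the mmap'd file, so no copy is made on the read path.
	if err := db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("key")).Get([]byte("foo"))
		fmt.Printf("foo = %s\n", v)
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}
```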
So now we're going to focus on that memory-mapped file feature and the actual memory usage of bbolt. Nowadays people face new challenges with etcd and Kubernetes. On one hand, we have systems trying to get by with requirements as minimal as possible, for example edge systems, or systems with special constraints, maybe embedded. On the other hand, there's AI/ML, where people just want more storage for their special workloads. Let's dive in a little bit more here, starting with the systems with low requirements. An important part that we cannot change is that fast disk performance is required by Raft, because we have to fsync on every write; if you don't do that, you break Raft's consistency guarantees. So we can't remove that. But most people nowadays run etcd so that the backend fits in memory: that memory-mapped file is always resident, you don't do any page eviction, and so on. So then the question becomes: can we reduce the memory consumption? I want to highlight one of the recent papers published by a group at CMU. I didn't reproduce the results, but it's interesting that they argue memory-mapped files may not be that good for databases. One thing they highlight is that it's hard to achieve transactional safety; I believe this doesn't actually apply to etcd, because we have a single-writer situation. But the two other items would potentially apply if we tried to go beyond the memory available on the system: slower I/O on page faults, and the cost of evicting and then accessing evicted pages. They also did some testing on NVMe drives, and they claim that memory-mapped databases can have throughput issues related to CPU caching. This would be interesting to try to replicate. For AI/ML systems, the problem is the opposite: people want more storage, and while you can provision systems with much more than 10 gigabytes of memory, it's hard to do just for this. There are also some other issues, which we'll talk about during the performance testing, that make it difficult to go higher. So, in order to plug in different backends for etcd, what we have done is, first, we made an interface to extend the etcd backend; we call it extensible etcd. This work was contributed by my colleague Han. Basically, we extracted three backend interfaces out of the etcd code base: the first one is the database interface, the second one is the transaction interface, and the lowest one is the bucket interface, which talks directly to the backing database.
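To give a feel for the shape of those three layers, here is an illustrative Go sketch. The names and signatures below are hypothetical, our own simplification rather than the actual extensible-etcd interfaces:

```go
package backend

import "io"

// Database is the top-level handle to a storage engine. Each engine
// (bbolt, SQLite, Badger, ...) would provide its own implementation.
type Database interface {
	// BeginTx opens a transaction; writable selects a read-write
	// transaction instead of a read-only one.
	BeginTx(writable bool) (Transaction, error)
	// Snapshot streams a consistent copy of the database, ideally
	// without blocking concurrent readers and writers.
	Snapshot() (io.ReadCloser, error)
	Close() error
}

// Transaction groups reads and writes with commit/rollback semantics.
type Transaction interface {
	Bucket(name []byte) (Bucket, error)
	Commit() error
	Rollback() error
}

// Bucket is the lowest layer: a flat keyspace that talks directly to
// the backing database.
type Bucket interface {
	Get(key []byte) ([]byte, error)
	Put(key, value []byte) error
	Delete(key []byte) error
	// Range returns up to limit key-value pairs in [start, end).
	Range(start, end []byte, limit int64) (keys, values [][]byte, err error)
}
```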
So with these interfaces, we implemented support for two other database backends, and then we ran a bunch of benchmarks to evaluate their performance. The three backends we're comparing are bbolt, SQLite, and Badger. The setup we tried is a single-server cluster; we ran the different servers in Docker under different memory constraints. One configuration has very limited memory, 4 gigabytes, for up to 10 gigabytes of data; the second one has enough memory, 12 gigabytes, for up to 10 gigabytes of total data. We also ran the benchmarks with different key-to-value size ratios: in the first, the key and value sizes are comparable; in the second, the key size is one tenth of the value size; and in the last, you have a very small key and a very big value. The first slide is the write benchmark. Overall, bbolt has similar performance to Badger, and SQLite is much slower at writes. This might be a little unexpected, because we would expect Badger to have better write performance thanks to its append-only writes. Another interesting thing is the graph on the top left, which is when memory is really constrained. The special thing about that graph is the key index: once you've written a total of four or five gigabytes of data, the key index itself takes almost all the memory, and that's why things slow down for all three backends. I forgot to mention that in all these benchmarks, we are not experts in Badger or SQLite, so their settings might not be optimized, and these results might not represent the best performance of Badger or SQLite. The second slide is the range benchmark. From these graphs, we can see that bbolt is actually the best at random reads; SQLite is really slow, and Badger sits at about 50 to 60% of bbolt's read rate. And again, the same pattern shows up: when the key index is consuming all the memory, everything just slows down. And this is another important benchmark: when you start a server directly from already existing database files, how long does it take to boot up? For Badger and SQLite, it's pretty much linear in the size of the DB file. What's interesting is that for bbolt, when memory is small, once the total DB file size gets close to the memory size, it takes forever to load the database from the file; in a lot of cases, it takes hours for the server to start up. And when memory is really constrained, approaching the key index size, everything takes forever, which is the top-left graph. So there were several questions we were baffled by. Why is Badger not faster at writes? And what explains the difference in range speed between Badger and bbolt; where's the gap? So we looked deep into the CPU and memory profiles of these operations.
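Before the profiles, it may help to see how Badger's design differs at the API level. Here is a minimal sketch against the public github.com/dgraph-io/badger API (the path and keys are arbitrary; note that whether a given value actually lands in the value log depends on Badger's configured value-size threshold):

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Badger is an LSM-tree store. Values above a size threshold live
	// in a separate value log; the LSM tree then holds keys plus
	// pointers into that log.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writes append to the value log and update the LSM tree.
	if err := db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("foo"), []byte("bar"))
	}); err != nil {
		log.Fatal(err)
	}

	// A read is two steps: find the key in the LSM tree, then follow
	// the pointer to fetch the value. That second hop is the extra
	// cost that shows up in the range CPU profile below.
	if err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("foo"))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			fmt.Printf("foo = %s\n", val)
			return nil
		})
	}); err != nil {
		log.Fatal(err)
	}
}
```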
So the first one is the write CPU profile. For bbolt, you can see that the majority of the time is spent on etcd-related operations rather than on Raft or other system calls; only 1.5% of the time is spent talking directly to the bbolt backend. And it's very similar for Badger. That explains why, as long as the backend is fast enough, it's very hard to further increase the write speed. So there's not much room for improvement on writes, but you can really slow them down, as with SQLite, because with SQLite we have to issue a new query to the backend for each write. Then there's the range CPU profile. For bbolt, about a third of the time is spent searching for the keys and getting the values. For Badger, a chunk of the time is spent finding the key, and another big chunk is actually spent retrieving the values from the value log, because values are not stored in the LSM tree, they're stored in the log. We think that's why reads are slower for Badger. And again, SQLite is just a very slow database here. We also looked at the memory profile. This is the memory profile after loading a database file of 18 million key-values. You can see that the majority of the memory goes to the key index, plus the memory used to rebuild that index from the entire DB file. Across the three different backends, the memory usage is pretty similar. What's not reflected here is the mmap used by bbolt: the operating system decides what stays mapped in, so Go doesn't really have any visibility into how much memory the mmap uses. With all this, we have some takeaway messages. The first question is: is there any value in an extensible etcd? We think so, because the use cases for Kubernetes are very diverse right now. If you're optimizing for speed, bbolt is still the best choice, but you might have different requirements: maybe you can tolerate somewhat lower performance but you really care about memory footprint, and then you might want to try some other backends. Or maybe you don't want to use mmap, for the reasons that Bogdan mentioned; that's also another reason to use extensible etcd. The second question is: with an extensible etcd, can we scale etcd up into the terabytes region? We think the answer is "to some extent", but there are still limits. The first limit is in the way etcd operates: the leader sends a snapshot of the database to a follower when a new follower joins or when a follower is falling behind, so there's still the latency of sending large snapshots between members. The second, as we have seen in the benchmarks, is that etcd still uses a lot of memory for the key index; if your memory is small, that really slows things down. So to scale etcd up, another approach might be sharding, for example by Kubernetes resource. And then, lastly, about the future plans for extensible etcd: currently, the etcd community is focused on stabilizing the whole code base, so it will not be a priority to get this into 3.6. But if you have a really strong use case for it, then we can talk, and maybe we can come up with a plan. Thank you. I want to reiterate that we're looking for feedback from the community. If enough representatives from different companies reach out to the community and say that this is a valid use case for them, this will probably get a bump in priority. And we're also looking for contributors. The etcd community has a roadmap right now and maybe doesn't have time to focus on this, and I think this is pretty interesting work, so if people want to contribute and have the use case, please reach out. Thank you. Thank you. Okay, any questions? Does this work? Okay. So, one of the problems that I was facing at my earlier employer was that we were trying to run multiple control planes using the same etcd, but we were worried about the noisy neighbor problem, because it was not possible to limit the resource usage of a specific control plane in that etcd. Would the extensibility story help us make sure one tenant uses only so many resources and doesn't make noise for the others? I just want to get an idea of how extensible or how configurable that would be. So this effort is mainly about extending the backend, making it more accessible for different use cases; the multi-tenancy use case is more about resource control within etcd itself. So that's not part of this extensible etcd? Yeah, the answer is probably not. But there are efforts, and issues open, about rate limiting, so if you're interested in that use case, you should definitely comment on those issues. Thanks. Anybody else? In general, raise your hand if you'd be interested in trying out etcd with a different backend if it were available. Okay. Okay. Thank you.