So I work on the storage side of things in TiKV. Prior to this, I worked for Oracle on MySQL, where I was the lead on InnoDB. I recently joined PingCAP, so my experience here is limited, but I do work on the storage side of TiKV. This is my colleague Andy; he's been working on TiKV longer, so he'll take it from here, and we'll talk about how we make TiKV more cost-effective.

Hello, everyone. My name is Andy. I'm a software engineer at PingCAP, and I work on TiKV. Prior to PingCAP, I worked on Google's banner team, and so here I am. Our topic today is making TiKV, our distributed storage engine, cost-effective on the cloud.

Here's the agenda. First, I'm going to give you a brief overview of TiKV. Then I will share some of our experience building a cloud service. And last, I will give some examples of the cost-efficiency work we've done, as well as work we're doing now and will be doing in the future.

So first, what is TiKV? TiKV is an open-source, distributed, transactional key-value store. It has been written in the Rust programming language from day one, around seven years ago. We're glad we made the right decision, and it was a tough decision to make. Since this is KubeCon, a CNCF conference, I'm honored to mention that PingCAP donated TiKV to the CNCF about three years ago, and it is now a graduated CNCF project. The GitHub link is below; you can also search for it online.

TiKV as a storage engine has been adopted globally by many users in many different scenarios. Here are some examples; we are infrastructure for infrastructure. There's a file system called JuiceFS that uses TiKV as its metadata storage, and the JD Cloud object storage also uses us for metadata. A blockchain company called Harmony uses TiKV to store blockchain data as well, probably for their super nodes. I'm not sure whether you play mobile games, but Niantic, the company behind Pokemon Go, uses TiKV as a transactional key-value store. We're an open-source project with a large community, and many community members have built great things around TiKV, such as the Redis-compatible products Tidis and Titan. And of course PingCAP, as a company, provides a MySQL-compatible SQL database called TiDB. That's our main product, and it uses TiKV as its underlying storage engine.

This is an overview of the TiKV architecture. Data in TiKV is partitioned into multiple shards, each shard is replicated across multiple nodes, and the Raft algorithm is used to maintain consistency between replicas. A "region" here is essentially a Raft group. I know it's a terrible name; some people confuse it with a cloud region, but it's a Raft group. In addition to raw KV APIs, we also provide transactional APIs. The transaction implementation we use is called Percolator; there's a paper about it that you can search for online. Through this architecture, TiKV scales to petabytes of data effortlessly, and more importantly, all of the scaling and rebalancing work is done by TiKV itself automatically. (A small client-side sketch of what the raw and transactional APIs look like follows below.)
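As a rough illustration of those two API surfaces (not something shown in the talk), here is a minimal sketch using the community Rust client, the `tikv-client` crate. It assumes a PD endpoint at 127.0.0.1:2379 and a Cargo project with `tikv-client` and `tokio` as dependencies; the method names follow the client's published examples, but check them against the client version you actually use.

```rust
// Minimal sketch of TiKV's raw and transactional APIs via the Rust client.
use tikv_client::{RawClient, TransactionClient};

#[tokio::main]
async fn main() -> Result<(), tikv_client::Error> {
    // Raw KV API: simple key-value put/get, no transaction semantics.
    let raw = RawClient::new(vec!["127.0.0.1:2379"]).await?;
    raw.put("hello".to_owned(), "world".to_owned()).await?;
    let value = raw.get("hello".to_owned()).await?;
    println!("raw get: {:?}", value);

    // Transactional API: Percolator-style transactions layered on the Raft groups.
    let txn_client = TransactionClient::new(vec!["127.0.0.1:2379"]).await?;
    let mut txn = txn_client.begin_optimistic().await?;
    txn.put("hello".to_owned(), "transactional world".to_owned()).await?;
    txn.commit().await?;
    Ok(())
}
```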
In case you're curious, if we zoom out from the TiKV architecture a little bit, here's the architecture of TiDB, the whole system. As I mentioned before, TiDB is a SQL query engine sitting on top of the storage. Similarly, we have another query engine called TiSpark, so you can run queries using Spark syntax. On the side, we have a component called Placement Driver (PD), which is a centralized service that stores some of the metadata of our system. And there's a newer component in our system called TiFlash. TiKV is the row-based storage engine, aimed at OLTP workloads, but more and more customers also want an OLAP system, so we built TiFlash for that scenario, and it uses a column-based data structure.

TiKV and TiDB can be deployed both on-premises and on the cloud. Actually, many of our bank users, the big bank users, prefer to deploy TiDB and TiKV in their own data centers. That's okay, but I have to say it's hard to support them, because when anything happens we have to go to their site to do the work. It's not a scalable solution. The good side is that these big customers are very rich, so they don't care about cost very much. They give us the hardware, we use it, and they don't ask questions like "do you have some room left, is that resource enough?"

But now our focus is the cloud; we're a cloud-native database. By the way, we are deployed on the cloud using Kubernetes. I'm not sure how familiar you are with Kubernetes, but we use the Kubernetes operator pattern to deploy our product. The good thing about the cloud is that it's a more scalable business, especially for us. But when we provide our service as software-as-a-service on the cloud, we have to care about cost, resources, efficiency, all those things. Nothing is free on the cloud, right? You pay for the computational resources, like virtual machines and EC2 instances, and you also pay for the storage. With EBS, for example, you're not only paying for the space, you're also paying for the number of IOPS and for bandwidth. And if you have experience building a service on the cloud, you'll know there's a price difference between cross-region traffic, cross-zone traffic, and in-zone traffic. So there are many things that are specific to the cloud.

What can we do to reduce the cost or make resource usage more efficient? Here are some examples I can think of. At the business level, you can use savings plans, because if you do your budget planning ahead you get a discount, just like in regular life. At the technical level there is much more we can do, such as operational work: if we can reclaim or shrink unused resources, we save a lot of money. But in our talk today, I mainly want to focus on the architectural changes and optimizations we made to our main component, which is TiKV. On the right here you can see a resource analysis of TiDB on the cloud; for example, 75% of the budget goes to EC2 instances. So how can we make the resources more efficient?
For computational resources, in general we can write more efficient code; that's probably the most effective thing. But we can also do work like reducing unnecessary processing. For storage, we can keep the system smaller in size, reduce the number of IOs, or use a storage service with less bandwidth. And for the network, as I mentioned, the prices differ, so we want to reduce long-distance traffic as much as we can.

So first, computational efficiency. As mentioned before, we use the Raft algorithm for data replication, or more specifically Multi-Raft. That means each Raft group stores a piece of the data, and the whole dataset of TiDB is stored across many Raft groups. You can imagine that inside each Raft group, all the peers have to communicate with each other to make sure they stay in sync, and that process relies on heartbeats. You don't need to know more details about it, but I'm sure you can understand heartbeats. Well, if you go with a naive implementation of Raft, you will find it's not super efficient. Why? Because when the dataset becomes large, there will be millions of Raft groups in your system, and in reality most of those groups store cold data. So what we can do is put those inactive groups to sleep; we call them hibernated regions internally. By doing this we were able to save a lot of CPU resources and, of course, some network as well. That's the first thing we did: optimizing the Multi-Raft process. (A small sketch of the idea follows below.)
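The talk doesn't show code for this, so what follows is only a minimal sketch of the hibernate-regions idea under stated assumptions: a leader tracks when it last saw a proposal, skips heartbeat ticks once the group has been idle past a threshold, and wakes up again on the next request. The struct and function names, and the 60-second threshold, are invented for illustration; TiKV's actual Hibernate Regions feature is considerably more involved.

```rust
use std::time::{Duration, Instant};

// Hypothetical per-group state; real TiKV tracks far more than this.
struct RaftGroup {
    region_id: u64,
    is_leader: bool,
    last_proposal: Instant,
    hibernated: bool,
}

// Illustrative threshold: a group with no writes for this long goes to sleep.
const HIBERNATE_AFTER: Duration = Duration::from_secs(60);

impl RaftGroup {
    // Called on every heartbeat tick; returns true if this group should
    // actually broadcast a heartbeat to its followers.
    fn on_tick(&mut self) -> bool {
        if !self.is_leader {
            return false; // only leaders send heartbeats
        }
        if self.last_proposal.elapsed() > HIBERNATE_AFTER {
            self.hibernated = true; // idle group: stop heartbeating entirely
        }
        !self.hibernated
    }

    // Any new write wakes the group up so normal Raft traffic resumes.
    fn on_propose(&mut self) {
        self.last_proposal = Instant::now();
        self.hibernated = false;
    }
}

fn tick_all(groups: &mut [RaftGroup]) -> usize {
    // With millions of mostly cold regions, most ticks become no-ops,
    // which is where the CPU and network savings come from.
    let mut active = 0;
    for g in groups.iter_mut() {
        if g.on_tick() {
            active += 1; // a real implementation would send heartbeat messages here
        }
    }
    active
}

fn main() {
    let long_ago = Instant::now()
        .checked_sub(Duration::from_secs(600))
        .unwrap_or_else(Instant::now);
    let mut groups = vec![
        RaftGroup { region_id: 1, is_leader: true, last_proposal: Instant::now(), hibernated: false },
        RaftGroup { region_id: 2, is_leader: true, last_proposal: long_ago, hibernated: false },
    ];
    // Only region 1 keeps heartbeating; region 2 has hibernated.
    println!("groups still heartbeating: {}", tick_all(&mut groups));
    groups[1].on_propose(); // a new request arrives: region 2 wakes up again
    println!("after a new write: {}", tick_all(&mut groups));
}
```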
Second, I see many of you use MacBooks, probably running Apple M1 or M2; I have a laptop running an Apple M2 myself. So you all know the power of ARM compared to other CPU architectures. We have been migrating our service from x86 to ARM, and our testing shows that by using ARM we can save around 20% of the budget while achieving the same level of performance. The same level of performance matters, of course; otherwise it would be pointless.

That's computational resources; next is storage efficiency. Here is a diagram showing the life of a write in TiKV. The heartbeats we just talked about are more about the Raft core; this is more about the Raft apply side. We can divide the whole process into three steps. A request comes to the leader of the Raft group; the leader replicates the Raft log to its followers, and all the followers persist the Raft log in their local storage engine, which we call the Raft storage; and then there is another process, the log applying process, which fetches logs from the local Raft storage, does some processing, and puts the real user data into the user data storage engine. So there are mainly two types of IO here: one is the Raft log, and the other is the real user data. We will discuss them separately.

First, let's see what we can do for the real user data step. On each single node, TiKV uses RocksDB as the local storage engine. RocksDB is an LSM tree implementation, and an LSM tree roughly looks like this: a write coming into an LSM tree is put into both the memtable, which sits in memory, and the write-ahead log (WAL), which sits on disk. When the memtable grows beyond a certain limit, it is flushed to disk as a sorted string table (SST). Now, if the engine crashes, the data that was only in the memtable would be lost, right? That's when the write-ahead log kicks in: the recovery process reads the log and rebuilds the memtable. This is a typical LSM tree implementation. Well, in our system the Raft log actually plays the same role as RocksDB's native write-ahead log. So in our case, instead of using RocksDB's native WAL, we can just rely on the Raft log. This optimization is kind of a no-brainer, and you may ask why we didn't do it in the first place. I'll give you two reasons, one truth and one lie: the first reason is that with RocksDB's native WAL we get faster recovery time, and the second reason is that we didn't have the manpower. You can guess which one is the truth and which one is the lie.

Okay, that's the real user data step. As for the Raft log, in the past we actually stored both the user data and the Raft log in the same RocksDB instance. LSM trees, or RocksDB in general, have better write performance compared to B-trees, but an LSM tree usually needs background activity to keep reads performing well, and that process is called compaction. There's a well-known problem with compaction called write amplification: you flush data from the memtable into sorted string tables, a lot of files accumulate on disk, and sometimes you have to merge them together, loading them into memory, merging, and writing back to disk. So a single entry gets rewritten to disk multiple times because of this process; that's write amplification. But one characteristic the Raft log has is that entries arrive in strict sequential order, which means we don't really need the merge or compaction functionality of an LSM tree for them. So we wrote our own storage engine for this case; we call it Raft Engine. Through Raft Engine we were able to reduce the total number of IOs by around 20 to 40%, which was a huge win for us. That's about IO. (Below is a small sketch of how the Raft log and the WAL-free apply path fit together.)
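To make the WAL point concrete, here is a minimal sketch, not TiKV's actual code, of what the apply path looks like once the RocksDB WAL is dropped: durability already comes from the persisted Raft log, so the apply write can skip the WAL entirely. It assumes the `rocksdb` Rust crate; disabling the WAL per write is a real RocksDB option, but everything else here (the names, the toy log engine) is invented for illustration.

```rust
use rocksdb::{DB, Options, WriteBatch, WriteOptions};

// Toy stand-in for a dedicated, append-only raft log engine (like Raft Engine):
// log entries arrive in sequential order, so no LSM compaction is needed for them.
struct RaftLogEngine {
    entries: Vec<(u64, Vec<u8>)>, // (index, payload); a real engine appends to files and fsyncs
}

impl RaftLogEngine {
    fn append(&mut self, index: u64, payload: Vec<u8>) {
        // Persisting this entry is what makes the write durable.
        self.entries.push((index, payload));
    }
}

fn apply_entry(db: &DB, key: &[u8], value: &[u8]) -> Result<(), rocksdb::Error> {
    let mut batch = WriteBatch::default();
    batch.put(key, value);

    // The entry is already durable in the raft log, so the RocksDB WAL would
    // only duplicate that IO; skip it for this write.
    let mut opts = WriteOptions::default();
    opts.disable_wal(true);
    db.write_opt(batch, &opts)
}

fn main() -> Result<(), rocksdb::Error> {
    let mut db_opts = Options::default();
    db_opts.create_if_missing(true);
    let db = DB::open(&db_opts, "/tmp/tikv-apply-sketch")?;

    let mut raft_log = RaftLogEngine { entries: Vec::new() };
    raft_log.append(42, b"put hello=world".to_vec()); // step 1: persist the raft log
    apply_entry(&db, b"hello", b"world")?;            // step 2: apply without a second WAL
    Ok(())
}
```

On restart, anything that was only in the memtable is recovered by replaying the Raft log from the last persisted apply point, which is why the talk treats the RocksDB WAL as redundant here.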
Then for the network, as mentioned, cross-region and cross-zone traffic are priced significantly higher than local network traffic, so our work here is mostly about reducing long-distance traffic. TiKV, or TiDB, is a distributed system, and we usually deploy the whole system across different geographical zones and scatter the replicas, one replica per zone. If that's the case, by default there isn't much room to improve, because we have to replicate the data across zones at least once. But again, in reality some users have more than one replica in the same zone. A typical case is the new TiFlash component I mentioned: sometimes users deploy a TiFlash replica in addition to TiKV in the same zone. In that case, if we don't do anything about it, those two replicas may each try to fetch their own data from the leader, which is in another zone. Obviously we can optimize that: instead of both getting the data from the leader, we can have the second replica in the same zone fetch the data from the first one, so the data is copied across zones only once. The same principle can also be applied to the read flow: in one sentence, read locally as much as possible. This requires the follower read functionality, which is enabled by MVCC, multi-version concurrency control, plus the Raft algorithm. (A small sketch of the zone-aware idea follows below.)
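Again, this is just a sketch of the principle, not TiKV's actual routing code: given the zone labels attached to each store, a selector prefers a replica in its own zone for reads, falling back to the leader only when no local replica can serve a consistent snapshot. All names here are made up for illustration; in real TiKV this is the follower-read machinery built on MVCC timestamps plus Raft.

```rust
// Minimal sketch of zone-aware replica selection for reads.
struct Replica {
    store_id: u64,
    zone: String,    // comes from the per-server labels mentioned in the Q&A
    is_leader: bool,
    applied_ts: u64, // highest MVCC timestamp known to be applied on this replica
}

/// Pick a replica to serve a read at `read_ts`, preferring the local zone.
/// A follower may serve the read only if it has caught up to `read_ts`.
fn pick_read_replica<'a>(
    replicas: &'a [Replica],
    local_zone: &str,
    read_ts: u64,
) -> Option<&'a Replica> {
    // 1. A replica in our own zone that is sufficiently caught up:
    //    no cross-zone traffic for this read at all.
    if let Some(local) = replicas
        .iter()
        .find(|r| r.zone == local_zone && r.applied_ts >= read_ts)
    {
        return Some(local);
    }
    // 2. Otherwise fall back to the leader (correct, but crosses a zone).
    replicas.iter().find(|r| r.is_leader)
}

fn main() {
    let replicas = vec![
        Replica { store_id: 1, zone: "us-west-2a".into(), is_leader: true,  applied_ts: 100 },
        Replica { store_id: 2, zone: "us-west-2b".into(), is_leader: false, applied_ts: 100 },
        Replica { store_id: 3, zone: "us-west-2c".into(), is_leader: false, applied_ts: 90  },
    ];
    let chosen = pick_read_replica(&replicas, "us-west-2b", 95).unwrap();
    println!("read served by store {} in {}", chosen.store_id, chosen.zone);
}
```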
Here's a summary. For computational resources, we optimized the Multi-Raft heartbeat flooding problem, and we were also able to reduce cost by using ARM. For storage, we wrote our own Raft log engine; it's also open source, and using it gives us around 10 to 40% less IO, plus some computational savings as well, because we don't need to do compaction for the log. And we dropped the RocksDB write-ahead log, which saves a lot of IO too. For the network, the main principle is to copy data across zones only once.

Here is some future work we are planning, basically leveraging more cloud services. For example, AWS Lambda: it's an on-demand computing service, so we could use it to do compaction. Second, we can use the EBS snapshot functionality to do backups. And last, currently we mainly use EBS and we don't use S3. S3 has large throughput, but its latency is not good; in the future, if we can distinguish cold data from hot data accurately enough, we may use S3. That's it, that's our topic today. Questions?

Thank you, Andy. Does anyone have any questions? I can walk around with a mic. Oh, sure, thank you. Here we go.

For keeping traffic in the same AZ, if I were deploying my own TiKV, is there a way I would need to configure that?

No, that's handled by TiKV itself.

So it's just aware of which zone it's in?

There is one place where we may need your cooperation, which is labeling the servers.

Okay, so you tell it which zone each server is in, and then it takes care of it. That's cool. Thank you.

Anybody else? Thank you. I have two questions. The first is about reducing the heartbeats of Raft. Those heartbeats are for different consensus groups, and they have different terms, for example. So how does it actually work? Who is the sender, what information is combined together, and who are the receivers? I imagine the sender would be every leader, or basically every node, because every node is likely to be a leader of some consensus group.

Yes, exactly. The picture I was drawing there was from a per-group perspective; this is just one group. The leader sends heartbeats to its followers, right? What we did was simply stop the heartbeat process when there are no writes to that group. If a request comes in, we resume the whole Raft process.

Oh, so it's not piggybacking a bunch of heartbeats into one message; it's more like reducing the heartbeats for the idle groups.

Right. What you're describing is another optimization.

Got you, thank you. I think that's it.

Okay, sure. Anybody else? There, please.

Sorry, I joined late, but can you describe the architecture? If I need to use this in my AWS tenancy, how would I use it?

I think this is what you're looking for. As I said before, we're deployed on the cloud using Kubernetes, more specifically the Kubernetes operator pattern. This is the whole system. By the way, we also have our own deployment tool called TiUP, T-I-U-P. If you want to play with it in your local environment, you can use that as well, but on the cloud we mainly use Kubernetes.

Right, and I would encourage you to use TiUP. It will even download the binaries for you, copy them around everywhere, start them all up, and get them talking to each other.

Yes, yes. Thank you.

To allow the TiFlash node to fetch the data from other followers is pretty different from the Raft protocol, where the leader sends all the messages to the followers in a push manner. So I guess the TiFlash nodes can only fetch the committed data, in learner mode. Is that how it works?

Yeah, exactly, it's a learner.

Gotcha. So the followers still have to get their messages from the leader directly; they cannot follow each other?

Well, implementation-wise there is more to it; we have another internal component called the Raft proxy. You can have a chat with my colleague afterwards and ask him all the details. But in general, you're right, it's a learner. Good question.

Anybody else? That's it. Great talk, Andy. Thanks. Okay, thank you so much for coming.