Hi everyone, I am Tao Xingye. I'm a senior engineer from PingCAP, which is the company behind TiKV. In the past few months, my team and I have put in a lot of effort to make TiKV work better on cloud storage, and today I'm going to share with you some of the new improvements that we are very excited about.

Before we dive in, we should get a basic understanding of the situation first. Let's take a look at TiKV. Long story short, it is a distributed storage engine. Unlike traditional storage engines that serve from a single machine, it has the capability to scale out, normally to hundreds of nodes. And we need to replicate both the WAL (write-ahead log) and the data files to provide high availability, which is actually one of the reasons we are willing to tolerate the complexity of a distributed system.

Now let's move on to the hardware design. Nowadays, most public cloud vendors provide virtualized disks. Those disks can be mounted to a local file system, and they appear like a local disk as well. But internally, they're forwarding the IOs to multiple remote disks that are potentially shared by multiple users. EBS, for instance, will replicate any write IO to three different locations. This internal complexity can be a real problem for our systems, because the latency, for starters, is obviously higher than a local disk. Also, since you are sharing hardware with other users, anything you use will be charged, and that includes disk bandwidth and IOPS. Finally, we should all know that cloud infrastructure is not always as reliable as it claims to be. Service degradation is relatively frequent, and it should be accounted for in our system design.

Ideally, we want to make a large TiKV cluster behave like a traditional RDBMS. Unfortunately, that is actually hard to accomplish on cloud storage. First, we want to build a scalable service, but scale means more failures. To be more specific, we are worried that storage hardware performance is more likely to degrade, because at a larger scale a rare event is not so rare anymore. Cost is a problem too, because every storage operation is charged by the cloud vendor. Users now have more reasons to care about exactly how and why our system is using those resources. By that, I mean read and write amplification, which is the amount of I/O the system needs to issue to finish one user request. For example, if a user writes 1 GB of data and the system ends up writing 5 GB to disk, the write amplification is five.

Here I drew a simple graph to demonstrate our system's runtime usage of IO resources. Over time, the user writes are very stable, as you can see from the yellow bar there. But they are amplified multiple times by background writes, which include compaction and garbage collection. In addition to that, large events incur extra IOs that are usually not so predictable. From this graph, we can see that the most straightforward way to reduce cost and improve scalability is to keep IO usage under the hardware watermark at all times. For that, we introduce two new features, which I'm going to talk about in detail.

First, we have Raft Engine. It is a new log store for TiKV that is written in Rust, just like TiKV. For those who don't know, we previously used RocksDB to store all transaction logs. Clearly, that is not an optimal choice, but it was a decent solution at the early stage of development. Now we want to replace it and improve on it. The primary goal here is to write less than RocksDB does. Consequently, we can reduce IO cost and reduce the possibility of hitting storage performance limits. Of course, we also have a secondary goal, which is to improve performance. But it is not being actively worked on at the moment, and we are really hoping that more contributors can join us to improve it in the future.

Now let's talk about how exactly we accomplished the primary goal, which is actually very simple. Raft Engine maintains an in-memory index of all log entries. The reason we do that is not to improve read performance; it's actually about reducing background work. In RocksDB, compaction is needed to keep all data sorted and to clean up deleted data. But in Raft Engine, we don't need to sort anything, and garbage collection doesn't need to read out obsolete data, because we already have a map of all active data in memory. Then Raft Engine further reduces foreground IOs with compression: all log entries are compressed with LZ4 before they are actually written to the log files. Mainly with those two techniques, we are able to reduce nearly 30% of all writes on the server, and in practice that is a very good improvement.
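To make the in-memory index idea a bit more concrete, here is a minimal sketch in Rust. It is illustrative only, not Raft Engine's actual data structures; MemIndex, FileLocation, and the file layout are all invented for this example.

    use std::collections::{BTreeMap, HashMap};

    // Location of one log entry inside an append-only log file.
    #[derive(Clone, Copy)]
    struct FileLocation {
        file_id: u64,
        offset: u64,
        len: u64,
    }

    // A minimal in-memory index: for every Raft group, log index -> location.
    #[derive(Default)]
    struct MemIndex {
        groups: HashMap<u64, BTreeMap<u64, FileLocation>>,
    }

    impl MemIndex {
        fn insert(&mut self, group: u64, log_index: u64, loc: FileLocation) {
            self.groups.entry(group).or_default().insert(log_index, loc);
        }

        // Drop entries below `first_kept`: pure map surgery, no file reads.
        fn compact(&mut self, group: u64, first_kept: u64) {
            if let Some(entries) = self.groups.get_mut(&group) {
                let kept = entries.split_off(&first_kept);
                *entries = kept;
            }
        }

        // GC consults the index alone to decide whether a file still holds
        // live entries, instead of scanning the file for obsolete data.
        fn file_is_live(&self, file_id: u64) -> bool {
            self.groups
                .values()
                .flat_map(|m| m.values())
                .any(|loc| loc.file_id == file_id)
        }
    }

    fn main() {
        let mut idx = MemIndex::default();
        for i in 1..=10u64 {
            idx.insert(1, i, FileLocation { file_id: i / 6, offset: i * 64, len: 64 });
        }
        // The state machine has applied up to index 5; forget entries 1..=5.
        idx.compact(1, 6);
        // File 0 held only entries 1..=5, so it can be deleted unread.
        println!("file 0 live: {}", idx.file_is_live(0)); // false
        println!("file 1 live: {}", idx.file_is_live(1)); // true
    }

Because the index alone tells us which files still hold live entries, garbage collection can delete or rewrite files without ever reading obsolete data back, which is where the background IO savings come from.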
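The compression step can be sketched just as simply. This assumes the lz4_flex crate for LZ4 (the actual binding, framing, and checksums used by Raft Engine differ); append_log_batch and the record layout here are invented for illustration.

    use std::fs::OpenOptions;
    use std::io::Write;

    // Compress a batch of log entries once and append it to a log file.
    // Record layout (invented for this sketch): [u32 len][lz4 payload].
    fn append_log_batch(path: &str, entries: &[Vec<u8>]) -> std::io::Result<()> {
        // Concatenate the batch with a small per-entry header.
        let mut batch = Vec::new();
        for e in entries {
            batch.extend_from_slice(&(e.len() as u32).to_le_bytes());
            batch.extend_from_slice(e);
        }
        // Compress the whole batch in one shot; one compression header per
        // batch amortizes better than compressing entries individually.
        let compressed = lz4_flex::compress_prepend_size(&batch);
        let mut file = OpenOptions::new().create(true).append(true).open(path)?;
        file.write_all(&(compressed.len() as u32).to_le_bytes())?;
        file.write_all(&compressed)?;
        file.sync_data()?; // fsync so the batch survives a crash
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        let entries = vec![b"put k1 v1".to_vec(), b"put k2 v2".to_vec()];
        append_log_batch("raft.log", &entries)
    }

Compressing before the write means fewer bytes hit the virtualized disk, which directly reduces both the charged bandwidth and the foreground write IOs.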
After that, we have another feature, called priority IO scheduling. It is not a new thing; many systems already have it. What we managed to do is add this functionality without introducing major architectural changes. By that I mean we did not change the internal tasking system of TiKV, no additional IO queuing is required, and there is no extra overhead at all.

The algorithm is, again, very simple. We first trace and categorize all system IOs into three different priorities; we can call them A, B, and C here. During execution, we periodically assign individual IO limits to those priorities. In the beginning, the IO limits are all very high, and all IOs run with no restrictions. Eventually, as you can see here at epoch two, the IO usage exceeds a predetermined global limit, which is what we call an overflow. After the overflow epoch, we adjust the IO limits for the lower priorities to make sure that in the next epoch the system will not use so much IO resource. Specifically, we make the IO limits for priorities B and C much smaller than in epoch two. After that, in epoch three, the global IO usage of the system drops back under the predetermined limit. As you can see, the algorithm is not perfect; it tolerates short periods of overflow. But in practice, it works exceedingly well.

Internally, we conducted a test to simulate large events during an online workload. In this test, a large table is imported while a TPC-C workload is running. After applying priority IO scheduling, as you can see on the left-hand side, the system performance is much more stable than before.

Well, that's pretty much all the features I want to cover today. Other than that, there are a few more important things we are experimenting with. The CPU limiting feature, for instance, is a new strategy we are pushing for low-resource environments such as four-core machines. Basically, we want to smoothly apply back pressure to the user before the system is overloaded. And the Raft Witness is an attempt to reduce replication cost by using a write-only node that only replicates the transaction log but holds no readable data. In the future, we want TiKV to further adapt to cloud hardware, and we truly hope that community users can benefit from our work here. So thanks again for joining me here today, and goodbye.
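As a footnote to the priority IO scheduling part: the epoch-based feedback loop described earlier can be sketched roughly as follows in Rust. This is not TiKV's actual scheduler; the priority assignments, numbers, and the particular shrink-and-relax policy are all invented for illustration.

    // All names and numbers here are illustrative; TiKV's real scheduler
    // hooks into existing IO paths instead of running a standalone loop.

    struct Scheduler {
        global_limit: u64, // bytes per epoch the whole system should stay under
        limits: [u64; 3],  // per-priority budgets: [0]=A (foreground), [1]=B, [2]=C
    }

    impl Scheduler {
        fn new(global_limit: u64) -> Self {
            // In the beginning the limits are all very high: no restrictions.
            Scheduler { global_limit, limits: [u64::MAX; 3] }
        }

        // Called once per epoch with the bytes each priority actually used.
        fn adjust(&mut self, used: [u64; 3]) {
            let total: u64 = used.iter().sum();
            if total > self.global_limit {
                // Overflow epoch: shrink the lower priorities so the next
                // epoch fits under the global limit. A is never throttled.
                let budget = self.global_limit.saturating_sub(used[0]);
                self.limits[1] = budget / 2;
                self.limits[2] = budget / 2;
            } else {
                // Under the limit: relax lower priorities gradually (+25%)
                // so background work can catch up once the spike passes.
                for l in &mut self.limits[1..] {
                    *l = l.saturating_add(*l / 4);
                }
            }
        }
    }

    fn main() {
        let mut s = Scheduler::new(100);
        // Epoch 2 overflows: 60 + 50 + 30 = 140 > 100.
        s.adjust([60, 50, 30]);
        println!("B/C limits after overflow: {:?}", &s.limits[1..]); // [20, 20]
        // Epoch 3 lands back under the limit, so budgets relax again.
        s.adjust([60, 20, 20]);
        println!("B/C limits after recovery: {:?}", &s.limits[1..]); // [25, 25]
    }

The point is simply that limits are recomputed once per epoch from observed usage, which is why no per-IO queuing and no architectural change are needed.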