Hello everyone. My name is Wilson Wang, and I'm from ByteDance's infrastructure computing team. Today I'm going to present the performance improvements in the etcd 3.5 release. As we know, etcd 3.5 was released in mid June. It has a lot of new features, bug fixes, and performance improvements, and today we will cover some of the important performance improvements in this new release.

Before we start, let me first cover today's agenda. The first part is a brief introduction to etcd, and the second part is about the problems and optimizations in etcd 3.5. Today we're just going to cover two of them: the first one is about the concurrent read transaction, and the second is about the inefficient warning log. After that we will cover the performance change in 3.5, then we will mention the new benchmark command that we added to the etcd repository, and finally we will talk about the ongoing and future work that we are planning.

Now let's begin. First, a brief introduction to etcd. What is etcd?
etcd was introduced in 2013 by the CoreOS team. It is a distributed key-value store which provides strong consistency. It is written in the Go language, and the consensus algorithm it uses is the Raft consensus algorithm. Its best-fit scenario is when you have more reads than writes in your database environment. The graph on the right side shows what an etcd cluster looks like. The Raft protocol running on each etcd cluster node behaves like a state machine: it changes its state based on the messages coming from other peers. In etcd everything goes through a single Raft group, and the Raft entries are applied in first-in, first-out order.

etcd also supports multi-version concurrency control, which is MVCC. There are two important data structures supporting MVCC. The first one is the keyIndex, a btree data structure that holds the mapping from keys to revisions. The second data structure is the boltdb backend; boltdb is a simple B+ tree data structure that maps revisions to key-value tuples. With these two data structures, when you have a key, the key is first mapped to a particular revision, and using that revision you can then go into boltdb and find exactly the value you are looking for. In this way, multi-version concurrency control is supported. Finally, there is the WAL, the write-ahead log. The write-ahead log is mainly used for data recovery in case of failures.

Now let's talk about some of the important etcd APIs. etcd clients talk to the etcd server using gRPC. There are five important unary gRPC calls we need to pay attention to here. The first one is Put, which puts a key into the etcd server database. The second one is Range, which can be used to read one or more keys from the etcd server database. The third one is DeleteRange, which can be used to delete one or more keys from the etcd database. The fourth one is Txn, a transaction, which is basically a mix of read and write operations in one single transaction. The fifth one is Compact, which can be used
to compact the current etcd database. The streaming gRPC call Watch is also very important: etcd clients use Watch to watch all the key-value changes in the etcd server database.

At ByteDance, etcd is widely used to manage different Kubernetes clusters. Our cluster sizes continue to increase beyond 15,000 nodes, with an average of 8,000 nodes among the top 20 clusters. The reason we have these huge clusters is the co-location of Kubernetes and YARN tasks. Going forward, we will have super-large Kubernetes clusters with 20,000 or even 100,000 nodes in the near future. So etcd performance is one of the important factors that limits our Kubernetes cluster size. To meet our ambitious goals for super-large Kubernetes clusters, we need to resolve some of the problems in the earlier versions of etcd.

Now let's talk about some of the optimizations we have in etcd 3.5 to improve the performance. In the first part we will cover the concurrent read transaction. What is a concurrent read transaction? The concurrent read transaction was introduced in etcd 3.4. It is used to avoid holding the read mutex, so that write transactions will not be blocked; in this way the concurrency can be improved. With the concurrent read transaction, the etcd read/write latency can be significantly reduced, by around 90%. However, the throughput gets reduced due to the txReadBuffer deep copy. The txReadBuffer is a field in the concurrent read transaction data structure; it holds a list of sorted key-value pairs that are not yet committed to the storage. We observed that the Kubernetes API server throughput with etcd 3.3 is actually higher than with 3.4 because of this txReadBuffer deep copy.

In this graph we are showing the throughput difference between the etcd 3.3 release and the 3.4 release. The test bed we have is a cluster with 5k nodes, and we run a scheduler that schedules 200k pods in this cluster. As we can observe from this graph, the 3.4 release has a much lower throughput than 3.3. Although with the concurrent read
transaction the 3.4 release has a lower latency than 3.3, the throughput is much lower.

In order to improve the concurrent read transaction performance, the first thing we did here is to avoid the txReadBuffer copy in the Txn call. The Txn call uses a concurrent read transaction for the read-only operations inside the transaction. Even if the Txn call contains write operations, a read-only concurrent read transaction is always created, and each new concurrent read transaction gets a private copy of the txReadBuffer. This is very costly. So the solution here is: since the execution of the read-only operations is very short, we use a plain read transaction instead of a concurrent read transaction in the Txn call.

On the right side of this page is the set of graphs showing the performance improvements when we use a read transaction instead of a concurrent read transaction. The subplot at the top is mostly write operations; the subplot at the bottom is mainly read operations. As we can see here, with different combinations of read and write operations the performance improvements can be different, but generally from the graph we can see that the Txn call throughput can be improved to around 2.7 times the original performance. In each of the subplots, the x axis is the number of connections from the client, while the y axis is the size of the value that we are saving to the database. As we can see from the first one, when we have more clients coming in, and also larger values for the keys to store, we get higher performance improvements.

Now the question is: can we further reduce the overhead of the txReadBuffer deep copy? The answer is yes. How do we do that, then?
Well, the solution is to share the txReadBuffer between different concurrent read transactions. As we mentioned before, each concurrent read transaction has a private deep copy of the txReadBuffer. However, when there are no write operations happening between different concurrent read transactions, the data they are holding is actually the same. So here we made the change so that different concurrent read transactions, when there is no write in between, share a single txReadBuffer. In this way we don't have to copy the txReadBuffer every time a new concurrent read transaction is created.

On this page we are showing the performance improvements from sharing the txReadBuffer between different concurrent read transactions. The graph shown here is very similar to the one before: the subplot at the top has a higher ratio of write operations, while the one at the bottom has a higher ratio of read operations. As we can see, with this new change the Txn call throughput can also be improved, and the maximum improvement we get is around 2.2x.

Now let's talk about the inefficient warning log problem. The etcd server needs to print out warning logs. Sometimes the warning log needs to print out the size of the gRPC message by calling proto.Size. However, when you make a call to proto.Size, it actually needs to marshal the whole message before it knows the actual size. This is very expensive. So there are two solutions here. The first one is to use the gRPC message type's Size method instead of the proto.Size function call, so that it directly returns the size of the message instead of marshaling it to calculate the size. Second, if you cannot use the first solution, then avoid calling proto.Size in the warning log function if the log is not going to be printed anyway. In this way you do not do the marshal call, and you save CPU time. The first solution was done by Chao from AWS, and here is a link to his pull request.
In the pull request description, we can see that CPU usage can be reduced by up to 50%. This is a great improvement. Solution 2 was done by us at ByteDance, and here is also the link to our pull request. With our solution, etcd throughput can increase by around another 4%.

Okay, now let's talk about the new benchmark command we contributed to the etcd repository. The etcd repository contains a tool under the tools directory which is called benchmark. The benchmark tool contains several commands which can be used to test different aspects of etcd. It has operations such as put, range, txn-put, watch, and so on. However, in etcd we didn't have a straightforward way to test mixed read and write operations. In a real-world environment, it is common to see mixed read and write operations sent to etcd. So in order to measure the performance in a mixed read and write environment, we added a new benchmark command to etcd; here is the link to our pull request. Besides that, we also added a Python script which can be used to compare two different benchmark results and generate a graph, so that from the graph we can see the difference between different branches. On the right side is a plot of the performance difference between two different etcd branches. The first column is the performance of the first branch, the second column is the performance of the second branch, and the third column is the difference between the two. The different rows represent different read/write ratios.

Now let's talk about the performance. On this page we are showing the graph comparing the memory and CPU usage of 3.5 versus previous versions. In the script that we are running, we are running mixed transaction operations, and the mixed transaction operations contain different lengths of keys and values. We can see from the result that, with the same script running, the CPU usage and memory usage of 3.5 are actually lower than in previous versions. Please note that in this graph we don't show the full line for 3.3.
The result of 3.3 is very similar to 3.4, and because we don't have enough space to show them both, here we just display part of the 3.3 performance.

In the last part, we're going to talk about the ongoing and future work. In order to evaluate an etcd cluster, we need a Kubernetes environment. However, we cannot have a full setup of Kubernetes each time we are doing an evaluation. So how do we resolve this issue? We proposed a new dump-and-replay feature to the community. With this new feature, we can dump client gRPC requests to a file and later replay them on another etcd cluster, so that we can compare the performance difference between different clusters.

We're also working on further improvements in different etcd components. The first one is watch. We know that the watch performance can be significantly affected when we have a large number of clients watching at the same time. Is there any way we can improve the performance here? We're actually working on that. The second one is marshaling and unmarshaling. Clients send messages to the server, and the server does a lot of marshaling and unmarshaling on these messages. However, many times we are doing marshaling and unmarshaling operations that are not actually needed. So we can probably provide a cache, so that we don't have to do the marshal and unmarshal operations each time. We had a prototype, and we can see that with a caching layer we can improve the etcd cluster performance by removing some of the unnecessary marshaling and unmarshaling. However, we're still working on that right now. The last one is multiple DB backend support. Currently etcd only has the one boltdb backend, and people are discussing the possibility of generalizing this database layer so that we can support different backend databases.

If you have any questions, feel free to contact me at the email address shown here. Here is the list of references we used in our talk today. Thank you, everyone. Bye.