So, okay, good morning everyone, my name is Mingran and I work at Wavefront. Today I'm going to talk about the new memory storage engine we have been coding and debugging over the past few months. Here's the agenda. We start with a brief introduction to the current FDB memory storage engine, its main components and functionality, and then move to the motivation behind the radix tree based storage engine: why would we want a new one? The last thing is some of the optimizations we came up with during the development process and, of course, the most important part, the performance results.

Let's see how the storage engine fits into the big picture. In FoundationDB we have this distributed log system and we have a storage server role. The storage engine lives inside the storage server, and each storage server contains exactly one instance of it. The storage server receives mutations in version order from the distributed log system and constantly applies them to the storage engine. The main purpose of the storage engine is to persist key-value pairs to disk. For the memory storage engine specifically, data is stored directly in a key-value container, the green box here, in memory, but all the operations are logged to disk. That's how durability is guaranteed. The key-value container is the component inside the memory storage engine that stores and retrieves the data. The storage engine is used by a single process from one thread. This makes development a lot easier, right? Because we don't need to worry about concurrency issues. All the storage engine types, including SSD and memory, implement an abstract interface called IKeyValueStore. This interface supports operations for recovery, set, read, read range, and commit.

Then we come to the lowest level and also our main focus today: the key-value container. In the current implementation, it's a balanced binary search tree where, in each node, we store the key-value pair directly. Keys are stored in order, of course. Both key and value are essentially StringRefs. StringRef is a data type that has been widely used inside FDB. You can think of it as a reference to heap space, and we will get to the details later.

The motivation: why would we want a new memory storage engine? What's the obvious weakness of the existing one? First of all, we try to be more space efficient. If we can build a new engine with less memory usage, that will bring benefits to all the clusters in Wavefront. In order to do so, we probably need to revisit the Wavefront point structure. What's the key format? What is really stored inside the database? Keys have a large amount of prefix overlap. Keys are lexicographically ordered over the underlying bytes. For example, character zero is always stored before character one. Keys starting with the same prefix are stored together. This is just the key part. Values are tiny compared to the keys. The problem is that with the current memory storage engine, we're storing the same prefix multiple times. There's literally no prefix compression at all. That's the reason why we decided to adopt a different key-value container, using a radix tree.

So what is a radix tree? If you are familiar with a trie, or dictionary tree, a radix tree is a compressed, space-optimized version of it. The difference is that in a radix tree we store the common prefix of keys, usually as a string instead of a single character. There are two types of nodes inside a radix tree. Inner nodes map partial keys to next-level nodes, and leaf nodes store the value corresponding to the key. I have a quick demo to show you how the radix tree works at a very high level.
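To make those two node types concrete before the demo, here is a rough C++ sketch of what an inner node and a leaf node hold. This is just an illustration of the idea; the names and layout are mine, not the actual FDB code.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative sketch only. Each node contributes a partial key; the full key
// is the concatenation of partial keys along the root-to-leaf path.
struct Node {
    std::string partialKey;
    virtual ~Node() = default;
};

// Inner node: maps the first byte of each child's partial key to that child.
struct InnerNode : Node {
    std::map<uint8_t, Node*> children;
};

// Leaf node: stores the value for the key spelled out by its path.
// (A leaf with an empty partial key covers the edge case where one key is also
// a prefix of other keys, which the talk handles with a special sentinel.)
struct LeafNode : Node {
    std::string value;
};
```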
Let's start with an empty tree and insert the first key-value pair. The key is smile and the value is 10. It's pretty straightforward, but remember to update the parent. In this case, it's the root node, so the parent knows it has one child with a prefix starting with s. Then we insert the second pair, with key smith and value 20. We compare smith with the existing key and find the common prefix here; I highlighted it in red in the slides. The next step is to split the original leaf node into one internal node whose key is the common prefix, then update the old leaf node and finish inserting the new key. The main logic of the insert operation is to find the right insertion point, split if necessary, and insert with the correct key. But there are many corner cases and details involved. Let's keep going and insert the key smiley with value 30. You see a split, and we end up having this leaf node with an empty key. The challenge is how to deal with the edge cases. In my implementation, I use a special number, minus one, to represent the empty key. Delete is relatively simple compared to the insert operation. You find the leaf node and delete it. But the tricky part is to merge if possible. In this case, the parent node has only one child left, so it can collapse the two into one to save space.

So what are the advantages of a radix tree over a comparison-based binary tree? The first is that it's space efficient. That's especially true for use cases involving a large amount of prefix overlap, because you store the common prefix only once. The second is that the height of a radix tree depends on the length of the keys but, in general, not on the total number of elements. The third is that a radix tree requires no rebalancing operation: all insertion orders result in the same tree. But what are the trade-offs? First, keys inside a radix tree are stored implicitly. You must reconstruct the key from the path when you do a read, so this could lead to increased read latency. The second is possible memory fragmentation, because the radix tree tends to split long keys into smaller parts, and we use the FDB memory pool to store the key fragments. But the memory pool in FDB allocates in chunks with a smallest size of 16 bytes, growing in powers of two: we start with 16, followed by 32 and 64, leading to variable padding overhead for each key-value pair.
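To see how that padding adds up, here is a small back-of-the-envelope sketch, assuming a power-of-two allocator with a 16-byte minimum chunk as just described. The fragment sizes are made up for illustration, not measurements from the cluster.

```cpp
#include <cstddef>
#include <cstdio>

// Round a request up to the allocator's chunk size: 16, 32, 64, ...
// (illustrative arithmetic, not the actual FDB memory pool code).
static size_t chunkSize(size_t bytes) {
    size_t chunk = 16;                 // smallest chunk
    while (chunk < bytes) chunk *= 2;  // powers of two
    return chunk;
}

int main() {
    // Example: a 50-byte key that the tree happens to split into 20 + 30 byte fragments.
    size_t fragments[] = {20, 30};
    size_t padding = 0;
    for (size_t len : fragments)
        padding += chunkSize(len) - len;  // (32-20) + (32-30) = 14 bytes wasted
    std::printf("padding overhead: %zu bytes\n", padding);
    return 0;
}
```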
In this section, I will focus on the optimizations we came up with after the first round of results, and of course the latest results as a comparison. One thing worth mentioning is our test environment. It has six DB nodes in total. For each storage server process, memory is set to 22 gigabytes, and storage space is set to 14 gigabytes. One of the nice properties of our testing cluster is that it has a four-shard configuration, which means we send Wavefront data to four memory shards, using murmur hash to calculate the distribution. This configuration gives us the ability to see how the radix tree behaves against a regular one under the same workload. Memory shards zero and one run the radix tree storage engine, and memory shards two and three run the regular storage engine; the workload is evenly distributed among the four shards. However, the performance results of the first-round design were not good enough. We ended up using more storage space and total memory.

So what could cause the memory usage gap between theory and production? Let's look at our key format again. The average key-value size on our testing cluster is around 50 bytes, so in the best-case scenario, for each input key we can have 34 bytes in common, excluding the value part. What's more, for the radix tree the ratio of total nodes to key-value pairs is 2.2, whereas for the regular storage engine it is always one, because it's a binary tree where you store the key-value pair directly. The node overhead for both node types is 72 bytes. And now let's do the math together. For each key-value pair inside the radix tree storage engine, we save 34 bytes on the key but create 1.2 extra nodes, which is roughly 86 bytes of additional node overhead. The bottom line here is that in order to match the regular tree, we need to save another 52 bytes or so, somewhere, somehow. Based on this analysis, the solution is simple. We can try to reduce two things: the first is the node overhead, and the second is the total number of nodes.

This is the initial design of the node structure. There are seven member variables in total, and the number inside the parentheses indicates how big each one is in bytes. I'll go through them one by one, trying to find the potential optimizations we can make. The first is to differentiate between leaf nodes and inner nodes. I can't believe I didn't do it in the first round. Both leaf nodes and internal nodes share some of the member variables, but leaf nodes don't care about the children info. By extracting this part out, we can save 24 bytes for this specific node type.

The second is to change the map into a vector. A standard map is a sorted container, usually implemented as a red-black tree, which requires additional memory for maintaining the structure. For example, when you insert an element, it will bring at least 16 bytes of additional memory usage. A vector is a sequence container: you simply append at the end. But in this case, we need to keep the elements in order when doing insertions and deletions.

The next idea is to use a bit field for the metadata. This is the metadata part. We have a boolean, leaf, to indicate whether this node is a leaf node, and we have an integer to keep track of the common prefix length shared with an ancestor. The value of the boolean can either be zero or one, so one bit is sufficient. The value of the prefix length will not exceed 100k, which is a hard limit for FDB, so 17 bits are sufficient. One plus 17 equals 18. By using a bit field, the space of one integer can hold both of those variables. That reduces the size from 8 bytes to 4 bytes.

The next idea is to use inline data and a union type on StringRef. In the radix tree, we use StringRef to store the key prefix. It consists of two parts: an 8-byte pointer that points to contiguous heap space assigned by the memory pool, and an integer to define the length. So the total is 12 bytes. One observation we have is that around 80% of the nodes in the radix tree under Wavefront workloads have a data length smaller than 12 bytes. Hence, union plus inline data. For keys smaller than 12 bytes, we use inline data: a byte array that stores the key directly inside the node. For keys larger than 12 bytes, we switch to StringRef and store the key inside the memory pool.

The next idea is inspired by a paper, the Adaptive Radix Tree. Instead of always using a vector to store the children info, we have two types of inner nodes with different capacities. The first one can store up to three children, key and pointer pairs, as in my example on the right. It currently holds two elements with keys 0 and 4, using a fixed-length array. When the children size exceeds 3, we switch to the original design using a vector. Theoretically, the vector can hold as many elements as needed. The motivation behind this is the observation that the number of inner nodes is 1.2 times the number of leaf nodes, so there is a good chance that most of them have a children size smaller than 3. And why choose the number 3? It's a balance between node overhead and children capacity.
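Putting the node-overhead ideas together, here is a rough sketch of what the optimized node layout could look like. It's a sketch under the assumptions above; the names, the extra key-length bit field, and the simplified StringRef are my own illustration, not the actual FDB radix tree code.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// StringRef-like reference into pool-allocated memory: an 8-byte pointer
// plus a 4-byte length, 12 bytes of payload in total.
struct StringRef {
    const uint8_t* data;
    uint32_t length;
};

struct NodeBase {
    // Bit fields pack the metadata into a single 32-bit word. The talk packs
    // a leaf flag (1 bit) and a prefix length (17 bits, since FDB limits the
    // size to under 100k); the key-length field here is an extra assumption
    // so the union below has a discriminator.
    uint32_t isLeaf : 1;
    uint32_t prefixLen : 17;  // common prefix length shared with ancestors
    uint32_t keyLen : 14;     // length of this node's partial key

    // Union plus inline data: partial keys of at most 12 bytes (about 80% of
    // nodes under the Wavefront workload) live directly inside the node;
    // longer ones go through a StringRef into the memory pool.
    union {
        uint8_t inlineKey[12];
        StringRef refKey;
    };
};

// Leaf node: no children info at all, just the value.
struct LeafNode : NodeBase {
    StringRef value;
};

// Small inner node: a fixed-capacity array for up to three children,
// enough for most inner nodes observed in practice.
struct InnerNode3 : NodeBase {
    uint8_t numChildren = 0;
    uint8_t childKeys[3];    // first byte of each child's partial key
    NodeBase* children[3];
};

// Large inner node: falls back to a sorted vector once more than three
// children are needed; insertions keep it ordered by the key byte.
struct InnerNodeVec : NodeBase {
    std::vector<std::pair<uint8_t, NodeBase*>> children;
};
```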
Those are the major changes that help reduce the node overhead. As for reducing the number of nodes, this part is actually interesting, because I think I picked the wrong design in the first place and didn't realize it for a long time. My original design did not have the concept of a value: I stored everything as a key. Here's an example. When you store the key smile with value 10, I create an internal node with key smile and a leaf node with key 10. But after some experiments, I finally decided to collapse the two into one by adding a value field inside the leaf node. This is my latest design.

That's all the optimization we have so far, and finally we come to the most exciting part. As a summary, the node overhead for the radix tree used to be 72 bytes for all node types, but now for leaf nodes and vector-based inner nodes it's 56 bytes, and for the fixed-capacity inner node with up to three children it's 64 bytes. Also, the ratio of total nodes to key-value pairs has been reduced from 2.2 to 1.3.

As for the performance results, we used two types of datasets. The first one is Wavefront workloads, and there are three metrics to pay attention to. The first is storage space as calculated inside FDB. Besides the keys and values themselves, it also includes node overhead and memory padding. When stabilized, the radix tree uses 25% less storage space than the original one. The second is total memory usage. This metric is gathered by the Linux operating system and basically includes everything. Same conclusion here: the radix tree takes up 25% less memory for each process. The third is read latency, a histogram gathered by Wavefront. This data actually makes sense to me because, remember the trade-off we talked about in the previous slides: the radix tree needs to reconstruct the key and thus tends to have higher read latency.

The second dataset is a Wikipedia title benchmark. This one is more generic and, of course, smaller. The keys are all the titles from the main namespace of Wikipedia archives in different languages. The value is some random double, small compared to the key. The total size is 300 megabytes, and I shuffled before insertion to avoid the bias of a sorted dataset. The table here shows the performance results, and for memory usage specifically, the radix tree takes 20% less than the regular one.

Yeah, I believe that's everything I have. Thank you so much. If you have any questions or comments, come talk to me afterwards. Thank you.