Let's start. I'm Diego Didona from IBM Research in Zurich. I'm going to talk to you a little bit about our work on performance profiling of FoundationDB. This is joint work with our colleagues at IBM Research in Zurich, with IBM Cloud, and also with Snowflake. Before starting with the technical content, you may be wondering what the involvement of IBM Research with FoundationDB is. Well, you have most likely seen, in the first technical presentation of the day, Adam talking about IBM Cloud and how IBM Cloud is now using FoundationDB in production to support strongly consistent distributed transactions. What we are doing there, as IBM Research, is helping deploy FoundationDB in IBM Cloud in an efficient way. In particular, we are looking at end-to-end performance issues and optimizations that we can do in FoundationDB, especially at the storage engine level.

So in this talk, I'm going to tell you how we use the Linux perf tool to analyze, improve, and monitor the performance of FoundationDB. In particular, I will cover three use cases: the first one is bottleneck identification using perf. The second one is overhead analysis and performance improvement of the Redwood storage engine, about which we will hear more later. And the third use case is code instrumentation for monitoring the performance of FoundationDB. This last bullet was kindly contributed by the Snowflake team.

Just one slide of introduction to perf, for whoever is not familiar with it. Perf is the performance analysis tool that is part of the Linux kernel. It's very powerful and very versatile: you can do a lot of things with it, among which tracing performance counters, setting and monitoring trace points, and doing profiling. In this presentation, I'm just going to cover three basic features of perf: system call analysis, CPU profiling on a per-function basis, and registering and monitoring trace points. If you're interested in other use cases of perf, there are plenty of resources online that you can check out. There is a word about it outside.

So let's get started with the first use case for today, bottleneck identification. The first thing we did when we started our project with FoundationDB, of course, was to get a sense of the performance that FoundationDB can achieve, so we ran a very simple benchmark. We deployed a server with the memory engine, the simplest one, and we loaded it with some very simple uniform key-value pairs. Then we ran another machine, a client, that injects the simplest transactions we can think of: read-only transactions on a uniform workload. We deployed several threads within this client process, and we want to see, as we vary the number of threads, how the performance delivered by the server varies correspondingly. The result we got is the plot in this slide. On the x-axis, we have the number of threads that we deploy within the client process. On the y-axis, we have the throughput of the system in terms of operations per second. What we can see is that, after an initial scalability, we hit a bottleneck: the performance stops growing, and we want to understand what's going on at this point. To do this, we looked at the server and the client. What we saw first is that on the client, despite having 320 threads active, as you can see, basically only one core is active at a time. The other ones are mostly idle.
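To make the setup concrete, here is a minimal sketch of what such a client benchmark could look like against the FoundationDB C API. The API version, key space, and thread and operation counts are illustrative assumptions, not the exact harness we used, and error handling is omitted.

```cpp
// Minimal sketch of the client benchmark (illustrative, assumes the
// FoundationDB C API; error handling omitted for brevity).
#define FDB_API_VERSION 610
#include <foundationdb/fdb_c.h>

#include <string>
#include <thread>
#include <vector>

int main() {
    fdb_select_api_version(FDB_API_VERSION);
    fdb_setup_network();
    // The client library runs all networking on one dedicated thread.
    std::thread network_thread([] { fdb_run_network(); });

    FDBDatabase* db = nullptr;
    fdb_create_database(nullptr, &db);  // nullptr = default cluster file

    const int kThreads = 320;           // application-level client threads
    const int kOpsPerThread = 100000;   // read-only ops per thread
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&, t] {
            for (int i = 0; i < kOpsPerThread; ++i) {
                // One read-only transaction per op, uniformly random key.
                std::string key = "k" + std::to_string((t * 7919 + i) % 1000000);
                FDBTransaction* tr = nullptr;
                fdb_database_create_transaction(db, &tr);
                FDBFuture* f = fdb_transaction_get(
                    tr, reinterpret_cast<const uint8_t*>(key.data()),
                    static_cast<int>(key.size()), /*snapshot=*/0);
                // Blocks until the single network thread delivers the reply.
                fdb_future_block_until_ready(f);
                fdb_future_destroy(f);
                fdb_transaction_destroy(tr);
            }
        });
    }
    for (auto& w : workers) w.join();

    fdb_database_destroy(db);
    fdb_stop_network();
    network_thread.join();
    return 0;
}
```

With all 320 application threads funneling their requests and wakeups through the one network thread created by fdb_run_network, a harness like this is exactly where the behavior described next shows up.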
So to find out what's going on, we run perf. Specifically, we use the perf trace -s command on the binary of the client. This command gives us a trace of the system calls invoked by a binary, on a per-thread basis. So for each thread of the binary we get, first of all, the load of the thread, up here, and then the system calls invoked by that thread, with some useful statistics, like the aggregated CPU time and the average CPU time per system call invocation.

When we run this on the client binary, we get 320 very similar traces, one for each of the 320 client threads of the application that we spawned, and then one different trace corresponding to the thread that is spawned by the FoundationDB client library transparently to the user. Looking at these traces, we find out two interesting things. The first one is that this client library thread is taking up almost 50% of the CPU, I don't know if this is visible, and this corresponds to most of the load that we see at the top of the slide. The second thing we see is that the system call that consumes the most CPU is the futex system call, which is a synchronization system call, and it shows up both in the client threads and in the library thread. This means that most of the CPU time is spent in synchronization between the library thread and the application threads. Overall, this means that the bottleneck we are hitting here is this single networking thread: it is doing too much work handling the requests of the user-level threads and waking them up to process the replies it receives from FoundationDB.

The solution to get around this bottleneck is, instead of spawning a single client process, to spawn multiple client processes, each with its own library thread. We did that, and indeed we saw that performance increased a lot. We have here an example of this. Again, this is the same test that we ran earlier, with the same configuration as before, where we have all the threads in one single process, and in blue we have the same test in which we deploy the same number of threads, but spread across 20 FoundationDB client processes. We see that both performance and scalability are much, much better.

Now, whoever of you is familiar with the forums knows that this very same issue has been reported, and discussed, at least a couple of times. And to answer it, somebody who is very well versed in FoundationDB has to tell the user: listen, yes, there is this issue with the single FoundationDB client thread. That answer is possible because the person who replies is an expert in FoundationDB. We were at the beginning of our journey with FoundationDB and didn't know about this specific issue, and this demonstrates how perf can give you very good insight into the performance and the issues of a piece of code, even without being an expert in that code and without really having access to its low-level details.

On to the second use case, overhead analysis and performance improvement in Redwood. As you should know, Redwood is the next-generation storage engine of FoundationDB. It's going to be, well, it's already in pre-release, I guess, and we are working with the Apple team, in particular with Steve and Evan, to improve its performance. What we are doing is we profile a benchmark that uses Redwood with perf, and we identify the hotspots, meaning the functions that consume the most CPU.
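As a rough sketch of that workaround, and under the assumption that each process initializes its own copy of the client library, the benchmark driver could simply fork a fixed number of worker processes. The process and thread counts below are illustrative, and run_client_workload is a hypothetical stand-in for the per-process benchmark loop sketched earlier.

```cpp
// Illustrative sketch of the multi-process workaround (not our exact harness).
// Each forked process loads and initializes its own FoundationDB client and
// therefore gets its own dedicated network thread.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <chrono>
#include <thread>

// Stand-in for the per-process benchmark body: in the real harness this would
// initialize the client library and run the worker threads shown earlier.
static void run_client_workload(int process_id, int threads_per_process) {
    (void)process_id;
    (void)threads_per_process;
    std::this_thread::sleep_for(std::chrono::seconds(1));  // pretend work
}

int main() {
    const int kProcesses = 20;          // 20 client processes, as in the blue curve
    const int kThreadsPerProcess = 16;  // 320 application threads in total

    for (int p = 0; p < kProcesses; ++p) {
        pid_t pid = fork();
        if (pid == 0) {
            // Child process: its own address space, its own network thread.
            run_client_workload(p, kThreadsPerProcess);
            _exit(0);
        }
    }
    // Parent waits for all benchmark processes to finish.
    for (int p = 0; p < kProcesses; ++p) wait(nullptr);
    return 0;
}
```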
We try to improve those functions and then we repeat the process. So what we do is use the perf record command. This command profiles the binary and measures the CPU overhead corresponding to each function, and then with perf report we get a very nice view of these functions and the corresponding overhead. We run this on a benchmark that uses Redwood. In the rightmost column you can see the names of the functions, and in the leftmost column you can see the CPU overhead for each function, in order, so the top function is the one that consumes the most CPU. In this particular case, what we see is that there are two main sources of CPU overhead: memory comparison, the first two entries, and memory management, meaning memory allocation and deallocation, namely malloc and free.

To improve the performance of Redwood, we want to address these two overheads, starting with the memory comparison one, because it's the one that has the most impact on CPU. Of course, to address the overhead, we first have to understand where it comes from. What we find out is that Redwood uses a temporary in-memory sorted buffer, called the mutation buffer. In this buffer, the incoming updates are kept sorted, and then they are flushed all together in a batch at the end, upon commit. This mutation buffer is implemented as a standard library map, which in turn is implemented as a red-black tree. Now, a red-black tree is basically a balanced binary tree in which each node holds a key. So whenever you want to find a key, or you want to insert a key, you have to traverse the tree top down, and at each level, to determine where to descend next, you have to compare the target key with the key in the node. This means that at each comparison we are comparing the whole target key with the whole key in the node, and if the keys have a size of roughly B bytes, each comparison is an O(B) operation. The number of comparisons you have to do is asymptotically logarithmic in the number of elements in the tree, so the total cost of finding or inserting a key is O(B log N).

The red-black tree is not aware of the fact that, in many workloads that are interesting for FoundationDB, keys share prefixes, or in any case share common parts. We want to leverage this characteristic of the keys. So instead of a red-black tree, we propose to use a trie. A trie is a tree-like structure, but where each node, instead of storing the whole key, stores just the portion of the key that corresponds to the prefix shared by the keys in its subtree. So whenever we have to find or insert a key, we again descend the tree top down, but at each level, instead of comparing the whole key, we just compare a portion of the target key with the corresponding portion stored in the node. The total cost of finding a key is then just O(B): a logarithmic-factor improvement. In particular, we propose to use the adaptive radix tree, in short ART. It is a state-of-the-art trie-like data structure; it is very compact, very cache-friendly, and it also implements a couple of tricks, such as prefix compression, that further reduce the number of bytes that are compared when you are inserting or looking up a key.
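To see where the O(B log N) cost comes from in practice, here is a small self-contained sketch, unrelated to the Redwood code itself, that counts the bytes touched by key comparisons when inserting keys with a long shared prefix into a std::map. Every level of the red-black tree path re-scans the common prefix, which is exactly the work a prefix-aware structure like ART avoids.

```cpp
// Illustrative sketch (not Redwood code): count the bytes inspected by key
// comparisons when inserting keys with a long shared prefix into std::map.
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>

static long long g_bytes_compared = 0;

// Comparator equivalent to the default lexicographic operator<, but counting
// the bytes it looks at. Each level of the red-black tree costs one comparison.
struct CountingLess {
    bool operator()(const std::string& a, const std::string& b) const {
        const size_t n = std::min(a.size(), b.size());
        size_t i = 0;
        while (i < n && a[i] == b[i]) ++i;  // re-scan of the shared prefix
        g_bytes_compared += static_cast<long long>(i) + 1;
        if (i == n) return a.size() < b.size();
        return static_cast<unsigned char>(a[i]) <
               static_cast<unsigned char>(b[i]);
    }
};

int main() {
    std::map<std::string, std::string, CountingLess> mutation_buffer;
    const std::string prefix(100, 'x');  // 100-byte common prefix
    for (int i = 0; i < 100000; ++i) {
        // Every comparison along the O(log N) tree path re-reads the prefix,
        // so the insert costs O(B log N) bytes of comparison work.
        mutation_buffer[prefix + std::to_string(i)] = "value";
    }
    std::printf("bytes compared: %lld\n", g_bytes_compared);
    return 0;
}
```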
So what we did is we created a version of the mutation buffer that, instead of using the map, uses this ART data structure, and we compared the two variants using a microbenchmark on the storage engine. This microbenchmark is very simple: it ingests 2 gigabytes' worth of key-value pairs in 100 iterations. The workload is very simple as well, with random keys and 500-byte values, and we experiment both with small keys and with large keys. The results are in this slide, where we report the speedup over Redwood in terms of commit rate to disk. On the left we have the results for the small keys, on the right the results for the large keys. In blue we have the baseline, which is 1, meaning, of course, that there is no speedup with respect to itself, and in orange we have the speedup that we obtain with our modified version. The speedup ranges between 11% and 15% for our use cases.

Now, changing from the map to ART reduces the memory comparison overhead, but it doesn't get rid of the other overhead, the memory allocation overhead. This is because also in ART we have to allocate and deallocate the internal nodes of the tree. So we profile the system again, and we see that, indeed, the memory allocation overhead still persists. What we want to do here is, instead of using malloc and free as they are used now, use a slab allocator, as is done elsewhere in FoundationDB using the Arena. Of course, instead of implementing everything and only then measuring the gain of slab allocation, we first want to get an idea of the improvement we can expect. So instead of implementing everything with the Arena, we just link our binaries against tcmalloc, which is a slab allocator for memory, instead of the standard malloc, and then we measure the performance we get. This gives an upper bound on the performance we would get if we implemented our own allocator for ART with the FoundationDB Arena. And these are the results we get: there is an additional column here, for both small keys on the left and large keys on the right, the dashed column, which represents the speedup over the baseline when we use both our ART data structure and slab allocation. We can see that the improvement with respect to the baseline ranges from 22% to 28%. So by using perf as an indicator of where the overhead is, we could improve the performance of this specific part of FoundationDB by up to 28% over the baseline.

The last use case I want to talk to you about is code instrumentation for monitoring performance. What the Snowflake team did is add support for USDT probes to FoundationDB. USDT probes are user-level trace points that provide a hook to call an arbitrary function at a given point in the code flow. So what is new now in FoundationDB is this macro, FDB_TRACE_PROBE. It takes as first parameter the ID of the probe that you are defining, followed by the arguments for the hook function that you want to call at that specific point in the code. With perf, you can enable and disable these probes at runtime. So whenever you're not interested in something, for example in production, you just leave these probes disabled, and the corresponding tracing cost is very, very low. When you are interested in something, say you see a hiccup in performance and want to understand what's going on, you enable them; you then pay a little bit in performance, but you get an idea of what's going on in your code.
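As a rough illustration of the underlying mechanism, and not of the actual FDB_TRACE_PROBE implementation, a USDT probe on Linux can be emitted with the DTRACE_PROBE macros from SystemTap's sys/sdt.h header; the provider and probe names below are made up for the example.

```cpp
// Illustrative sketch of USDT probes via <sys/sdt.h> (SystemTap's sdt header).
// This shows the general mechanism only; it is not the FoundationDB
// FDB_TRACE_PROBE implementation.
#include <sys/sdt.h>

#include <chrono>
#include <thread>

static void process_task(int task_id, int queue_size) {
    // Probe "task_start" under the made-up provider "demo", with two arguments.
    // When no tracer is attached, this compiles down to a single nop.
    DTRACE_PROBE2(demo, task_start, task_id, queue_size);

    std::this_thread::sleep_for(std::chrono::milliseconds(1));  // pretend work

    DTRACE_PROBE1(demo, task_done, task_id);
}

int main() {
    for (int i = 0; i < 100; ++i) process_task(i, 100 - i);
    return 0;
}
```

Once such probes are compiled into a binary, perf can discover them, register them as events, and record them with timestamps and arguments; taking the difference between matching start and done timestamps then gives per-task latency, which is essentially what the actor and run loop probes described next enable for FoundationDB.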
So you can get live measurements of performance metrics that are defined through these probes. Let's see with a couple of use cases how this works.

The first use case is monitoring actor invocations, through actor enter and exit probes. To do this, there are now two new invocations of this FDB_TRACE_PROBE macro: the first one with the name actor enter, which takes as parameter the actor name, and the second one, actor exit, which takes the actor name as well. These are added to the code automatically; it's transparent to you, because they live in the actor compiler code. To use them, you have to activate them on the FoundationDB binary using the perf command shown here, and then you just run your FoundationDB server normally. Then, whenever you're interested in looking at these probes, you use perf record, specifying with the -e flag the probes you're interested in, and then you run perf script. perf script gives you, for each of these probes, the invocation time in the leftmost column, which probe is being triggered, and its parameters, such as the actor ID. This wall of text can then be post-processed to obtain interesting measures: for example, you can compute the execution time of an actor between an enter and an exit just by taking the difference of the two timestamps.

The second use case that has been added is monitoring of the run queue in FoundationDB. A FoundationDB process is mostly single-threaded: there is a queue of tasks where events are placed, and a single thread that goes through this queue, processes the tasks, and then loops over it again and again. So, to understand the load on a specific FoundationDB server, what we may be interested in is how many tasks are in this queue and how much time it takes to go through all of them once. To measure this, a couple of probes have been added: one before the start of this loop, a run loop task start probe, which takes as input parameter the current queue size, and another probe at the end of the loop, a run loop done probe, also with the current queue size. Again, you can activate these two probes on the binary, and then at runtime query them to obtain these two statistics: the number of elements in the run queue, and how much time it takes to go through the run queue once.

A nice way to visualize this has also been contributed by the Snowflake team, with a script that uses BCC, a framework for building BPF-based profiling tools. I will not go into the details of this, but the code, I think, will be made available online for everyone to see. What this script does is consume the traces generated by these probes and, at runtime, build histograms showing the distribution of both the size of the run queue, on the left, and the time it takes to go through the run queue once. Every second this gets refreshed, so you have a very neat visualization of what's going on in your FoundationDB deployment.

I think this is the last technical slide I have. I just described three use cases for the perf tool that we use to analyze, improve, and monitor the performance of FoundationDB. Other than using these for our own purposes, we are trying to build a set of tools that we can then release to the public for you all to use. The disclaimers that have to be there. And thank you. If you have any questions,
I'd be happy to take them now or offline.