The Databases for Machine Learning and Machine Learning for Databases seminar series at Carnegie Mellon University is recorded in front of a live studio audience. Funding for this program is made possible by Google and by contributions from viewers like you. Thank you.

Welcome, everyone. We're here for another talk in the Carnegie Mellon University Database Seminar Series. We're excited today to have Stefano Cereda. He is a senior data scientist at Akamas, which is an automated optimization platform. His PhD dissertation at the Politecnico di Milano was based on Akamas, which was then spun out as a startup. As always, if you have any questions for Stefano as he's giving the talk, please unmute yourself and say who you are, and feel free to do this anytime, so that he's not talking to himself for an hour on Zoom. And with that, Stefano, the floor is yours. I should also add, you're in Italy right now, just outside Milan, so you're five hours ahead. It's 9 p.m., but we appreciate you staying up late for us.

Thank you. Thank you for having me here today and for the very kind introduction. Yes, I will talk to you about Akamas, which is a software for autonomous performance optimization. With this presentation, we will start by looking at what we mean by configuration auto-tuning: what is a configuration, why we should tune the configuration, and what it means to tune it automatically. Then we will explore Akamas. We will see what it does, how it works, and how it differs from other automatic tuning solutions. As we go on, we will make some practical examples using MySQL to demonstrate Akamas' capabilities.

So let's start by looking at configurations. We know that modern IT systems have a complex and multi-layered structure. As an example, we can have a look at the layers that allow the Cassandra database management system to work. Cassandra is written in Java, so this means that it runs inside the Java Virtual Machine, or the JVM. The JVM in turn can be executed inside a Kubernetes or Docker container, or you can skip the container and go directly on top of the operating system, like Linux or Windows. And at the bottom, you have the physical layer. This can be either a physical machine where you run the database or a virtual machine; in that case, you will clearly have another layer for the physical machine.

Now, each layer has a set of tunable configuration parameters. These are options that you can set in that particular software, and they control the behavior of that software. And by controlling the behavior, they have the potential to shape the performance of your entire system, your database. As an example, we can start by looking at the machine layer, which is the bedrock of the IT infrastructure, and so the choices that we make at this level have really far-reaching consequences on the database performance. We can make different hardware choices or we can select different cloud instances. The basic idea is that if you select a different physical disk, you have different input-output performance, and so you will have a faster or slower database. It's a very simple example. If we go up, we have the kernel, and the kernel configuration impacts the system performance as well. As prime examples, we can consider the file system or the settings of the CPU governor and the CPU scheduler.
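To make that kernel layer concrete, here is a minimal sketch, in Python, of how such tunables are exposed and set on Linux. This is purely illustrative and not Akamas code: the paths are examples, reading them requires a Linux machine, and writing them requires root.

```python
# Minimal sketch: inspecting and setting Linux kernel tunables.
# Paths and values are illustrative; writing requires root privileges.
from pathlib import Path

def read_param(path: str) -> str:
    """Read a kernel tunable exposed under /proc/sys or /sys."""
    return Path(path).read_text().strip()

def write_param(path: str, value: str) -> None:
    """Set a kernel tunable (roughly equivalent to `sysctl -w`)."""
    Path(path).write_text(value)

if __name__ == "__main__":
    # The CPU frequency governor mentioned in the talk (may be absent in VMs):
    gov = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
    print("governor:", read_param(gov))
    # A classic memory-related tunable with a debatable default:
    print("swappiness:", read_param("/proc/sys/vm/swappiness"))
    # write_param("/proc/sys/vm/swappiness", "10")  # uncomment to set, as root
```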
Moving up in the stack, we can look at the JVM, and the JVM settings play a really crucial role in the performance of the system, since an ill-configured JVM will lead to excessive time spent doing garbage collection, and this has a really severe impact on the performance of the database. Finally, we have the database itself, and the database parameters are used to fine-tune the behavior of the database, so they clearly have a direct influence on the performance of the system.

Therefore, by configuration tuning, we mean the strategic selection of all of these parameters at each layer of the stack to optimize your system, your database. By optimize, I mean optimizing a certain metric. This could be enhancing the throughput of the database, reducing the response time, or improving the efficiency of your system, so basically reducing the infrastructure cost.

However, in today's complex IT landscape, achieving optimal performance through manual tuning has become a formidable challenge. The reasons for this are manifold, but it's essential to understand why traditional methods are no longer effective. First, the number of layers has expanded: we keep adding abstraction layers to the system. And inside each layer, the number of available parameters has multiplied. This proliferation makes it increasingly unfeasible to manually fine-tune each parameter, basically because we have too many parameters to control. Also, the effect of many parameters is unclear or poorly documented. Even when documentation exists, it often lacks any insight into how a given setting might impact the performance of your system, so if you do manual configuration tuning, you can just pick a certain value and see how it behaves. And to make matters even more perplexing, the default values which are provided for many parameters are not always optimal, and sometimes they are quite far from the optimal value.

As an example, we can look at the middle figure in the slide, where we have the throughput of a database as a function of the cache size. You might expect that sticking with the default value for the cache size is a safe choice. But counter-intuitively, in this particular example, we have seen that both reducing and increasing the cache size led to performance improvements. And this is a problem, because if you're doing this experiment manually, you don't have a clear linear effect of the parameter, so you really need to explore the entire domain of the parameter to see how you should set this value. So this is not nice. And finally, we have an acceleration in the release cycle. Basically, the time between each release is getting shorter, and so there's no more time to do manual tuning before each release. IT teams are facing a race against the clock to keep up with evolving technology, and releases are so frequent that we can no longer do manual tuning.

So in this evolving landscape, the need for an automated and intelligent approach to configuration tuning becomes evident. It's no longer feasible to rely on manual efforts. The solution lies in tools like Akamas that can adapt, optimize, and continually fine-tune configurations for us. So let's now look at Akamas. Clearly, Akamas is an auto-tuner, and the idea is that it modifies the configuration of a system to meet some optimization goal. This can be improving the service quality, reducing the cost and resource consumption, or increasing the service resilience. We can do many different things.
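As a minimal sketch of what setting such an optimization goal can look like, here is an illustrative scalar score over collected metrics. The metric names and weights are assumptions made for the example, not Akamas' actual API; the numbers echo figures that appear later in the talk.

```python
# Minimal sketch: an optimization goal expressed as a single scalar score.
# Metric names and weights are illustrative assumptions, not Akamas' API.

def goal_score(metrics: dict) -> float:
    """Higher is better: reward throughput, penalize latency and cost."""
    return (
        metrics["throughput_rps"]
        - 10.0 * metrics["p99_latency_ms"]
        - 0.5 * metrics["hourly_cost_usd"]
    )

baseline = {"throughput_rps": 130.0, "p99_latency_ms": 24.0, "hourly_cost_usd": 1.2}
candidate = {"throughput_rps": 220.0, "p99_latency_ms": 15.0, "hourly_cost_usd": 1.2}
print(goal_score(candidate) > goal_score(baseline))  # True: candidate wins
```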
Our goal with Akamas is to create a machine-learning-powered optimization platform which supports as many applications and technologies as possible. So Akamas is not strictly a database tuner. We do not offer database-specific services like query optimization; we are not specialized in databases. However, we do support some databases, and so we can use Akamas, and we have used Akamas, to tune the configuration of a database. Also, we support different layers of the stack, so we can use Akamas to tune the operating system or the instance where a database is running.

Clearly, Akamas has some key capabilities which make it different from other auto-tuners. In my opinion, the most important one is that it is a full-stack auto-tuner, so it can simultaneously tune different layers of the IT stack. Also, we are technology agnostic. We want to be as versatile as possible; we don't want to be constrained by any specific technology or platform. Now, tuning different layers of the stack is important not only to have more parameters to tune, and so to potentially extract more performance, but also to take advantage of some interdependencies between layers. As a simple example, if you select a larger machine with more memory, you can give more memory to the database. This example is simple, but there are more complex interactions that you can take advantage of when you tune multiple layers.

The second most important thing is that Akamas is a goal-driven auto-tuner. There are some auto-tuners which are strongly opinionated, and they have an idea about how a specific application should perform. Instead, with Akamas, the user sets the optimization goal. It can be, again, boosting the throughput, reducing the response time, or whatever function. And when the user specifies a certain goal function, Akamas will navigate the parameter space to find the configuration which optimizes that specific metric.

So, Stefano, can you talk about what you're actually doing for that? Assuming someone signs up and says, I want to optimize throughput. Do you change what parameters you're going to tune? Do you change how aggressively you're tuning things? What changes for throughput versus latency versus cost?

Well, for the goal part, depending on what you set as an objective, that will be the metric that we monitor during the optimization process. Basically, we have a machine learning model that maps the configuration that you apply to a specific scalar value, let's say, and so this value depends on the function that you specify. However, for the second part, the next point, which is the safety: by safety I mean that we have many different constraints on other performance metrics, like service level objectives or maximum infrastructure cost. These constraints depend on what application we are tuning, if we have some knowledge about that application, or actually on how we are tuning the application; but we will talk about this in a moment. And by these safety features, I mean that during the optimization process we make every effort to respect these constraints, and clearly, at the end of the optimization, we make sure that the final configuration aligns with the specified constraints.

The final point is that Akamas is workload-aware. We keep measuring the workload that comes into the system, and we incorporate this information into the recommendation.
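A minimal sketch of the workload-aware idea, assuming scikit-learn: the model sees both the configuration and the measured workload, so the same configuration can be scored differently under different loads. The features and the synthetic response below are illustrative, not Akamas' actual model.

```python
# Minimal sketch of a workload-aware performance model: the predictor sees
# both the configuration and the measured workload, so a recommendation can
# depend on the incoming load. Illustrative only, not Akamas' model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
# Features: [buffer_pool_gb, redo_log_gb, incoming_requests_per_sec]
X = rng.uniform([0.1, 0.1, 50], [8.0, 4.0, 250], size=(40, 3))
# Synthetic response: throughput benefits from memory until the load saturates.
y = np.minimum(X[:, 2], 60 * np.log1p(X[:, 0]) + 20 * np.log1p(X[:, 1]))

model = GaussianProcessRegressor(kernel=RBF(length_scale=[1.0, 1.0, 50.0]),
                                 normalize_y=True).fit(X, y)
# Predict the same configuration under two different workloads:
mean, std = model.predict([[4.0, 1.0, 80], [4.0, 1.0, 200]], return_std=True)
print(mean, std)
```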
Being workload-aware is especially vital when we are tuning real environments, production environments, where we are exposed to the real workload, which is noisy and varies over time. So we must measure it and keep track of it.

This actually brings me to the last point of the introduction, which is the two Akamas flavors, let's say. There are two ways in which you can use Akamas. The first one was our initial offering, and it was designed to tune replica environments. The idea here is that Akamas controls the configuration, and we also control a load injection tool, which is a tool that creates a synthetic workload for the system. In this way, we can easily compare different configurations, because we are always testing against the same workload. In the second flavor, instead, we tune real production environments. Here we work in the real world, with the real system and the real workload. Clearly, there is no longer a load injection tool, because we are exposed to the real load. In this second flavor, it becomes paramount to respect the constraints that we were talking about earlier, and also to move in a much smoother way, because you don't want to make abrupt changes in a production environment, and also to take the workload into consideration, because it's the real workload and we need to keep track of it.

So we can now have a look at the first example, where we see an optimization of MySQL and where we assume that we are tuning a replica of the system. As an example setup, we have the MySQL database running on top of Ubuntu inside an Amazon EC2 instance. We have used Prometheus to gather performance data, and we have used BenchBase to create a synthetic load for MySQL.

The first step of the tuning process consists in applying the configuration to the system. To do that, we offer some configurators which are tailored for supported technologies. This makes the configuration process really straightforward, because you basically tell Akamas that you have a MySQL database and an Ubuntu machine, together with some credentials, and then Akamas automatically knows which parameters are available, which are the most important parameters (because we support that technology), which domains are feasible for the parameters, and, more importantly, how to apply the parameters to the system. Clearly, if a technology is not supported by Akamas, since we want to be a generic auto-tuner, we try to make it fairly easy to integrate a new technology with Akamas by providing some custom scripts, basically. However, in this example, both MySQL and Ubuntu are supported technologies, so there was no need to write custom scripts. Notably, EC2 is also supported by Akamas, but we have decided not to include it in this example. Tuning this layer means selecting which instance to use to run your database, so if you switch the instance, you need to migrate the database, and this would make the example more complex. So we have decided not to do that. However, tuning the instance selection is a thing that we do quite often. Most of our real optimizations are with Java applications running inside Kubernetes, and if you have a container, it's extremely easy to move it to another instance. In any case, once the configuration is applied, we need to start the synthetic workload, and for that we use BenchBase.
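Here is a minimal sketch of one "apply the configuration, then drive the load" step, assuming the mysql-connector-python package and a local BenchBase build. The two parameter names are real MySQL 8 variables; the credentials, paths, and overall structure are illustrative assumptions, not Akamas' configurators.

```python
# Minimal sketch of one tuning step: apply a MySQL configuration, then run
# a benchmark against it. Assumes mysql-connector-python and a BenchBase
# build; credentials and paths are placeholders.
import subprocess
import mysql.connector

def apply_mysql_config(params: dict) -> None:
    conn = mysql.connector.connect(host="127.0.0.1", user="root",
                                   password="secret")
    cur = conn.cursor()
    for name, value in params.items():
        # SET PERSIST survives restarts; the buffer pool resizes online in MySQL 8.
        cur.execute(f"SET PERSIST {name} = {value}")
    conn.close()

def run_benchbase() -> None:
    # Drives a TPC-C run like the one in the talk (paths are assumptions).
    subprocess.run(["java", "-jar", "benchbase.jar", "-b", "tpcc",
                    "-c", "config/mysql/sample_tpcc_config.xml",
                    "--execute=true"], check=True)

apply_mysql_config({"innodb_buffer_pool_size": 4 * 1024**3,
                    "innodb_redo_log_capacity": 2 * 1024**3})
run_benchbase()
```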
We use a TPC-C workload with 100 warehouses and 50 terminals, and we try to achieve the maximum throughput possible. We let the test run for 20 minutes, and during these 20 minutes we collect performance data using Prometheus. In this example, we have used the node exporter for the instance and the MySQL exporter for MySQL. That's just because Prometheus is really easy to set up and it made a nice example; in reality, we support many other monitoring solutions. At the end of these 20 minutes, Akamas collects the results from BenchBase and the metrics from Prometheus. Reading the output of BenchBase is really straightforward, because there's a nice CSV file that Akamas just imports automatically. Whereas for Prometheus, we have MySQL and Ubuntu, which are supported technologies, so again, Akamas knows which metrics are important for these layers and how to query Prometheus to get those metrics.

With this information, we use the machine learning model to understand how the configuration behaves, basically. We get a new suggested configuration, we apply it to the system, we run a new experiment, we get new performance data, we update the model, and we have a new configuration. So we go on iteratively, and then we can either stop the process manually, if we have found a good configuration and we are happy with that result, or we can let Akamas go on until we exhaust the optimization budget.

Now, with this example, we have run two optimization studies. The first one focused on two very important parameters of MySQL: the buffer pool size and the redo log capacity. By tuning these two parameters, we increased the maximum throughput by a notable 65%, from 130 requests per second to 220 requests per second. However, it's important to say that achieving this result was relatively straightforward, because the default buffer pool size for MySQL is 128 megabytes, which for this particular workload is really not enough. In practice, most database administrators would have opted for a larger buffer pool size, and essentially they would have obtained the same result as this study. Now, clearly, with Akamas you can tune many more parameters for MySQL. Here we have tuned just these two to show what I told you at the beginning, that the default configurations are indeed suboptimal for many realistic workloads, and you can achieve substantial improvements even with a very simple optimization.

What version of MySQL is this?

The last one, I don't remember.

Eight is the latest. Yeah, I'm guessing that. And then it says experiment number 13. So it took you 13 iterations to achieve like double performance. Is that correct or not?

Yeah, it's the last version of MySQL, because I ran the experiment last week. And basically 13 experiments because that's the time that it took me to stop the optimization study, and so the best result was there. But actually, already from the third or fourth iteration, we were already up to 60%. Basically, because you have this domain for the buffer pool size, and whatever you do, apart from making it smaller, the majority of the domain is bigger than the default. So it's easy to increase the throughput here.

Got it, awesome, thanks.

And this same result holds for other layers of the stack. There are a lot of parameters which have suboptimal defaults, and we should tune them. So from this configuration, we ran another experiment.
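Before looking at that second study, here is a minimal sketch of the iterative loop just described, assuming scikit-learn: fit a Gaussian process on (configuration, measured score), pick the next candidate with a simple acquisition rule, and repeat. The measure() stub stands in for a real 20-minute BenchBase run; this illustrates the idea, not Akamas' actual algorithm.

```python
# Minimal sketch of the iterative tuning loop: model, suggest, measure, repeat.
# The measure() stub replaces a real 20-minute benchmark experiment.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def measure(buffer_pool_gb: float) -> float:
    """Stub for 'apply config, run TPC-C for 20 minutes, read throughput'."""
    return 220 * (1 - np.exp(-buffer_pool_gb)) + np.random.normal(0, 3)

domain = np.linspace(0.125, 8.0, 200).reshape(-1, 1)   # 128 MB .. 8 GB
X, y = [[0.125]], [measure(0.125)]                     # start from the default

for _ in range(12):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mean, std = gp.predict(domain, return_std=True)
    nxt = domain[np.argmax(mean + 2.0 * std)]          # UCB-style acquisition
    X.append(list(nxt))
    y.append(measure(nxt[0]))

best = int(np.argmax(y))
print(f"best buffer pool: {X[best][0]:.2f} GB -> {y[best]:.0f} req/s")
```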
So we start now from a MySQL configuration with a big buffer pool size, and we add 27 parameters from the Linux kernel. With these parameters, we are able to go up to 255 requests per second. This is a 15% improvement if we compare it to 220, but it's almost a 2x improvement if you compare it to 130, which was the original performance. So, of the entire performance improvement that you can obtain by tuning MySQL and the operating system, roughly two thirds come from MySQL and one third comes from the operating system. It's still a substantial improvement. Clearly, tuning two MySQL parameters can be done by an expert human database administrator, but tuning 27 Linux parameters is an entirely different story: it's much more complex and it involves really different skills. Nonetheless, there are potential performance gains that you can achieve.

Actually, at the end of the optimization process, Akamas gives you some insights into which parameters contributed the most to achieving the optimized score. In this example, clearly the two MySQL parameters are the most important ones, but then in third place we have the latency of the CPU scheduler. And this is really interesting, because this result is very different from what we observed in the past, when we ran similar experiments and this parameter was not among the most important ones. So this led us to dig deeper, to understand what was going on and why this parameter became important. And...

Sorry, are you computing that relevance? Are you computing that per thing you're tuning, or is it pre-computed ahead of time? Like, how are you computing that?

This is computed by the machine learning model. Basically, it's linked to the relevance that the parameters have in the model, so it's based on the result of the optimization process.

Can you share what model you're using?

We use a lot of models. The core of the optimization is Bayesian optimization with Gaussian processes, so this is linked to the automatic relevance determination that you have in the kernel of the Gaussian process.

Got it.

But it's not exactly that. Actually, one of the things that I would like to do is to introduce a final step in the optimization where you actually compute the relevance by starting from the baseline and changing one parameter at a time to its optimal value, to see which ones are actually the most important.

Got it. Thanks.

Okay, going back to the reason for this result: we have observed that version 5.13 of the Linux kernel, more specifically this commit here, moved the parameters of the scheduler under /sys/kernel/debug/sched, so they were marked as debug parameters. Now, this commit didn't change the default values of the parameters, but by changing the path that you need to use to update the parameters, it broke some user-space tools, like Red Hat's tuned, which are used to update these values. What happens, basically, is that when you boot a Linux machine, the kernel sets a value, and then the distribution, during the boot process, sets another default which they think works better for your particular use case. If you break the user-space tool, the distribution no longer updates the default values, and so you are stuck with the defaults from the kernel. We have observed this behavior with Ubuntu 22.04 and Red Hat 8.1, where by this behavior I mean that the default value is different between one version of the distribution and another. And actually, this observation was not unique to us.
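To see the regression being described here, one can check both the pre-5.13 and post-5.13 locations of a scheduler tunable. This snippet is illustrative; migration_cost_ns is one example of the moved parameters, and which path exists depends on your kernel.

```python
# Minimal sketch: since kernel 5.13, scheduler tunables moved from
# /proc/sys/kernel/sched_* to /sys/kernel/debug/sched/, which is what broke
# user-space tools that set distribution defaults at boot.
from pathlib import Path

CANDIDATES = [
    Path("/proc/sys/kernel/sched_migration_cost_ns"),   # pre-5.13 location
    Path("/sys/kernel/debug/sched/migration_cost_ns"),  # 5.13+ location (debugfs)
]

for path in CANDIDATES:
    if path.exists():
        print(f"{path}: {path.read_text().strip()}")
    else:
        print(f"{path}: not present on this kernel")
```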
If you follow these links, you will see that VMware actually reported this bug, and they have seen performance degradations of up to three times due to this new default in the parameters of the scheduler. The point that I'm trying to make here is that stuff like this happens a lot. There are these undocumented performance regressions where something in the system breaks: you have some external library that depends on some other library, somebody changes something, something breaks in the system, and you have a performance degradation, basically. So this highlights the importance of having tools like Akamas. First of all, because they help you understand that you have a problem, because suddenly the performance of your system is different from what it used to be. But also, with Akamas you really don't need to understand why you have a problem, because you can just repeat the tuning of your application and solve the issue. Actually, in this example we had to look at this because we wanted to understand why the parameter became important, but in reality we could have just kept the result given by Akamas, and we would have had all the performance that we wanted.

So, until now, our discussion has been based on the assumption that we have a replica of the real system, and in this replica we can conveniently test and fine-tune configurations. However, the real world is really different. To be more precise, our assumption has been that we have a replica environment which is a perfect replica of the real system. This is not the case, for two reasons. Sometimes a replica environment, a pre-production environment, does not even exist. And most of the time, when a pre-production environment is available, it's not a perfect replica of the real system. Clearly, if you have two different environments, they might require different optimal settings. So if you run your experiments in a pre-production environment and then you move the configuration to the real environment, you might end up with bad performance.

The second problem lies in the workload. So far, we have used a synthetic workload; we have used BenchBase to test a configuration. And again, clearly, if you use different workloads, you might end up obtaining different optimal configurations. So you need to create a synthetic workload which is a really good approximation of the real workload. However, this again is very complex, because first of all the real workload is noisy, but the real workload also changes over time, so it's always different from itself. You cannot replicate it, because it's always different; you can just create a good enough approximation of your workload. In practical terms, replicating the real workload conditions is a very complex endeavor, because you need to prepare the necessary test infrastructure and the workload scenarios, and this is resource-intensive and really time-consuming. This often is not acceptable.

So, to solve this issue, we introduced the second flavor of Akamas, which is called Akamas Live, where we tune production environments. From the point of view of the Akamas user, we want to make this as simple as creating a live optimization instead of an optimization study. However, beneath the surface, we had to make several adjustments to the algorithm. The most important consideration is the increased emphasis that we place on adhering to the specified constraints during the optimization process.
We do not want to violate any safety constraint. This is crucial to prevent any disruption to the production environment. The second point, which is important, is...

So for databases, it's clear that you don't turn off fsync for transaction commits, because otherwise nothing makes it to disk. For the other things you're tuning, like the kernel parameters or the JVM, can you give examples of other safety constraints you have to put in place?

Well, basically, there are many different things that we do for safety. First of all, it's kind of what you were saying: we restrict the domains of the parameters to make them safer. We also have some constraints between parameters. A very simple example: if you are tuning a JVM inside a container, you don't make the heap of the JVM bigger than the memory of the container. Then, during the optimization process, in a live optimization, we further restrict the domain of a configuration, so that the change between two consecutive configurations is really smooth; we don't want to change the configuration abruptly. Also, there are the constraints proper: basically, a constraint is a performance metric which is an unknown function of the parameters. So again, you can create a machine learning model to map this function, and you can try to create a configuration which is expected to satisfy that constraint. Basically, the safety is how far you want to stay from the boundary of the constraint. And a nice thing that you can do is play with the uncertainty of the model: if your model for a constraint tells you that a configuration is really far from the boundary, but the model is not sure about this prediction, you don't try that configuration; you play it safe and you gather more data. The optimization process will be slower, but for sure it's safer.

So it sounds like you're maintaining multiple Bayesian optimization models, one for the target objective like throughput or latency, but then also for memory consumption and other things as well, right?

Yes. It's not Bayesian, it's not a Gaussian process, but the idea is that.

Okay. Can you say what models you're using?

Sorry?

A lot, and we decide them dynamically depending on the metrics that we get.

Ah, got it, an ensemble method. All right, so can you describe who comes up with the constraints, and how do they do that? Is this the customer of Akamas? And how do you measure the completeness of all the constraints?

Well, for supported technologies, there are some predefined constraints that are inserted when you select a specific application. We also have constraints that are cross-layer: if you select, let's say again, the JVM and the Kubernetes container, you get a constraint that links the JVM heap with the container memory. Then there are some constraints on the response time, which shouldn't degrade with respect to the baseline, and some constraints on the resource utilization, which are always beneficial. On top of that, the user can specify some other constraints. Actually, even the ones that are added by Akamas can in reality be removed by the user if they want to. I don't know why they should, but as a user, you can do whatever you want. And then, on top of that, we add some syntactic sugar, let's say, to the creation of a study, and some constraints are added automatically. I hope that answers the question.

Okay, so that's it for the constraints.
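A minimal sketch of the uncertainty-aware safety check just described, assuming scikit-learn: model a constrained metric (here, a synthetic p99 latency against an SLO) and treat a candidate as safe only if the prediction plus a margin of k standard deviations stays inside the constraint. The thresholds, features, and k factor are illustrative assumptions.

```python
# Minimal sketch: a learned constraint model with an uncertainty margin.
# A candidate is "safe" only if predicted latency + k*std stays under the SLO.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0.5, 8.0, size=(30, 1))            # e.g. buffer pool in GB
latency = 30.0 / X[:, 0] + rng.normal(0, 0.5, 30)  # synthetic p99 latency (ms)

slo_model = GaussianProcessRegressor(normalize_y=True).fit(X, latency)

def is_safe(candidate_gb: float, slo_ms: float = 20.0, k: float = 2.0) -> bool:
    mean, std = slo_model.predict([[candidate_gb]], return_std=True)
    # Uncertain predictions are treated as unsafe: the margin grows with std.
    return bool(mean[0] + k * std[0] < slo_ms)

print(is_safe(1.0), is_safe(6.0))
```

With this rule, an uncertain model makes the tuner conservative: it gathers more data rather than risking a violation, which is exactly the slower-but-safer trade-off mentioned above.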
As for the workload, we said that we need to monitor the workload evolution, and actually here there are two technicalities that we need to talk about. Basically, when you design an auto-tuner, you are creating something that chooses a configuration, and this configuration should optimize an unknown function of the parameters and the workload. And you also need to be safe. You can control the configuration, but you cannot control the workload, and so you need to decide which workload you want to consider. You might have the night, where the system is doing nothing, or the day, where the system is doing a lot of work. For this, you have two orthogonal decisions to make. You can opt for local safety, where basically the auto-tuner will give you a configuration that is expected to be safe only for the current workload, or for global safety, where the configuration is expected to be safe for all the possible workloads. The same decision can be made about the optimization. You can decide to have a configuration which optimizes the score only for the current workload: you are basically chasing the workload, and the auto-tuner will keep changing the configuration to keep up with the workload evolution, so you will always have the best-performing configuration. Or you can choose an average optimization: in this case, you want the auto-tuner to find a single configuration which will not be the best one under every possible workload condition, but which is kind of an average configuration, the best trade-off that you can make to optimize the score under all the various working conditions of your system.

Now, when we started working on Akamas Live, we tried to create an auto-tuner with the chasing behavior, so one that keeps changing the configuration. What we have seen is that all of our users prefer the simplicity and the stability of a single, unchanging configuration. So we have switched to the global safety and average optimization scenario, and that's what we will see now. Clearly, to do all of this, you need to have some components in the system that track the workload evolution and do some clustering and forecasting.

Okay, so with that we can move to the second example, which is a live optimization of MySQL. Here we have kept the same system to run the example; however, we modified the configuration of BenchBase. The most important thing is that we are no longer striving for the maximum throughput, but we are using a workload pattern that we will see in a moment. So the workload varies over time, and also BenchBase is no longer connected to Akamas: we keep making queries to MySQL even when Akamas is not running, and there is no connection between Akamas and BenchBase. Clearly, this is not a production environment with a production workload, but it is a good approximation, because the workload varies over time and it's not connected to the tuning tool.

As for the optimization loop, the only thing that Akamas has to do is to update the configuration, which is the same as before; then we wait for a certain amount of time, and we gather the performance metrics. So basically, there is no connection with BenchBase now. When we apply the configuration, we can do two different things, depending on how you set up Akamas. You can decide to have an automatic approval, where Akamas proceeds to implement the suggested changes without any human intervention, or you can opt for a manual approval step.
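Before looking at the approval flow, here is a minimal sketch of the workload-tracking component mentioned above, assuming scikit-learn: cluster per-interval workload samples so the tuner can reason about recurring regimes such as "night" and "day". The features and the number of clusters are illustrative assumptions, not Akamas' components.

```python
# Minimal sketch: clustering observed workload samples into recurring regimes
# (e.g. "night" vs "day") so the tuner can condition its decisions on them.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Features per interval: [requests/s, read fraction] (illustrative).
night = np.column_stack([rng.normal(30, 5, 60), rng.normal(0.8, 0.05, 60)])
day = np.column_stack([rng.normal(130, 10, 60), rng.normal(0.6, 0.05, 60)])
samples = np.vstack([night, day])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(samples)
print("regime centers:", km.cluster_centers_)
print("current sample belongs to cluster:", km.predict([[125.0, 0.62]]))
```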
With the manual approval step, Akamas provides the user with the opportunity to manually review and modify the configuration before applying it to the system. This ensures that human expertise remains an integral part of the decision-making process. We also offer the option to propose the new configuration to an external configuration management tool that the user already has in place, and then we wait for the configuration to be deployed to the system before measuring the new performance. Clearly, this level of control is critical in live production environments, because configuration changes can significantly impact the performance of the system. Especially in the first iterations of tuning, the user has to gain some trust in Akamas, so they want to remain part of the loop and check what Akamas is doing.

As for the workload pattern, here we have in blue the throughput that we ask BenchBase to achieve, and clearly it varies over time. In yellow, we have the response time for the baseline default configuration. The workload has some temporal variations: we have a low-work zone during the night, from 10 p.m., which is here (I don't know if you can see my pointer), up to 8 a.m. Then there is an increase in the workload, and it stays high until 2 p.m.; then there is a slight dip until 4 p.m. and another peak until 8 p.m. The highest throughput that we achieve is 135, which is the same value that we saw in the very first experiment, so this is the saturation point of the system; we are really stressing the database. This pattern lasts for 24 hours (the x-axis is the time), each day we repeat it, and we also add some noise to make it more realistic.

With this scenario, we have simulated three optimizations, which are three ways in which Akamas is commonly used. The first one tries to optimize the performance by reducing the response time of the database. In the second one, we want to increase the efficiency of the system, so basically to reduce the resource consumption so as to be able to downscale the infrastructure. And in the third one, we start from a badly behaving baseline, which cannot sustain the required throughput, and we use Akamas to fix this problem.

Let's start with the first one. This is the same figure that we saw a moment ago, and it shows the behavior of the baseline. From this figure, we move to this one here, which is the same thing, but we have more percentiles for the response time. Also, we have switched to a logarithmic axis for the response time, so as to be able to see all the percentiles and understand a little bit better what is going on. Actually, we want to focus on this area, which is the first peak, the highest peak. As we said before, we are close to the saturation point for this configuration: we see that the response time is high, especially the upper percentiles, and they also explode when the throughput goes up. Here we have the behavior of a configuration after one day of tuning, 24 hours. Actually, comparing different configurations with multiple workloads in a single figure is very complex, so here we have decided to zoom in on the first peak to see better what is going on. What we see is that all the percentiles are lower, and also, if we look at the darkest line, which is the maximum response time, we see first of all that it is lower than in the baseline, but also that, as the throughput rises, the maximum response time is not exploding.
It goes up, but not as much as it did in the baseline. This indicates clearly that we are away from the saturation point of the system. In these tables, we have again the response times at the various percentiles. In the first table, we have the highest peak, when we reach 135; basically, it's the average of these minutes here, the real peak moment. As we can see, Akamas reduced the response time for all the percentiles. In the second table, we have the same measurement, but for the night, so it's the average of all these 10 hours, the low part of the workload. Again, also in this situation, there is a significant reduction in the response time, even when the system is far from the saturation point. Now, this is similar to the first study that we saw, the first example study, because we are starting from a minuscule default configuration, with a buffer pool which is really small, and so by increasing the buffer pool size you can get a really good performance improvement. So again, this is a fairly easy optimization, even though here we are actually tuning the two MySQL parameters and the 27 Linux parameters, so we have 29 parameters in total. But still, it's fairly easy.

The second example is a little bit more complex, because now we start from the tuned configuration for both MySQL and Linux, so it is already a really good configuration. And we have also modified the workload pattern: we no longer go up to 130, but we go up to 200. If you remember, the tuned configuration could go up to 255, so this is a system which is not close to saturation, and so there are resources which are basically wasted. Our goal here is to tune the MySQL configuration and the Linux configuration to reduce the resource consumption of our system.

In these two figures, we have the baseline on the left and the optimized configuration on the right. Again, comparing different configurations across figures is difficult, also because I had to make two figures, so there are some differences in the scale of the axes. But still, we can see, by looking at the blue line, which is the throughput, the two peaks during the day. In black, we have the memory utilization, which is 90% for both configurations. So there's no real difference there, which is reasonable for a database, because it will try to use all the memory, and if it's not used by the database, it will be used by the operating system. Then, in red, we have the CPU utilization, which basically matches the throughput. The more interesting thing is that in the baseline we have a CPU utilization which is close to 84, 85%, and in the tuned configuration we are down to 75%. Also, from the yellow line, which is the response time, we start from a baseline which goes up to 24 milliseconds during the highest peak, and with the tuned configuration we go down to 15, or a little more than 15, milliseconds. So basically, after one day of tuning with 29 parameters, we are able to sustain the same throughput with a lower response time and with 10% lower CPU utilization, which is quite a lot. This means that if you're running in a data center, this directly translates into a lower power consumption and a lower carbon footprint, whereas if you have a cloud deployment, it allows you to reduce the resources that you have allocated to this database, and you can basically spend less money.

As for the third example, we have a so-called resiliency scenario. Here, the goal is to fix a bad configuration.
So basically, I started from the default configuration, with a small buffer pool size, and I used the higher workload pattern. Clearly, the configuration is not able to sustain this throughput, because here there is a clear cap on the maximum throughput that we can achieve. With the blue, or maybe black, vertical line I have marked the point where we start the tuning, and in orange we have the maximum throughput that you can achieve with the baseline configuration. Now, this is not a typical use case for Akamas Live, because typically you don't have a problematic configuration in production; if you have this problem, you try to solve it quickly, not with Akamas. So here we would like an optimization which is as fast as possible, and that's not what we typically want to do with Akamas Live, because, as we said before, we want the configuration to change smoothly over time, so this will take some time. Nonetheless, we can see that already in the first afternoon, when we start the tuning, we are able to get a higher throughput, and then we have the night, basically. If we look at the next morning, we see that we are able to achieve a higher throughput, and this throughput is even higher than the one that we reached during the afternoon. This basically means that even during the night, Akamas is changing the configuration. Again, we are changing the configuration every 20 minutes here, and by measuring the workload that is coming into the system, we are able to create a model that maps configurations, workloads, and performance, and we are able to use even these low-workload areas to gain insights into the behavior of the system at a higher throughput. And so the next day, basically, we have a better configuration. Then we go on with the tuning, we have another night when we do the same stuff, and at the third afternoon, which is after two days of tuning, basically one day here and a second day here, the system is behaving well, because here there is no longer a cap on the throughput. Actually, from this... yeah?

It sort of sounded like you said that you recognize that for the nighttime workload, even though the submission rate is way lower, the workload essentially is the same as the daytime peak, and therefore you can leverage that information. Meaning, if at nighttime they start running reporting jobs that don't look anything like the day, you can automatically say: oh, this is different, therefore I don't want to learn from it. Is that what you're saying?

Yeah, well, in this example we are always running a TPC-C workload, so it's always the same kind of stuff, and we just measure the throughput of the database. Basically, what we are trying to do is to say: if the system behaves better with this configuration even when the throughput is very low, we can imagine that this configuration will also be good for the higher throughput. In your specific setting, which is more realistic, you would also need to track the kind of operations that you are running on the system, so you have different kinds of throughput, let's say. Then, basically, you start by measuring the baseline over all the possible workloads, and you understand how the workloads impact the performance, because you are not changing the configuration, and so you just learn the effect of the workload. Then, even if you tune with the night workload, you have a machine learning model that is able to extrapolate some insights into the other workload, the daily workload.
Clearly, it's only a prediction, so here the safety part of the algorithm becomes really important, because you have a prediction and you need to be able to understand how reliable that prediction is. If you only see the night, and you do a lot of tuning only during the night, maybe when the day comes you have a totally wrong configuration, and so you really must take care of this.

Yeah, so I understand that. I'm asking, it sounds like you guys are doing this in Akamas Live now: you do identify that the workload hash is different at night, therefore you don't tune based on that. Is that correct?

Yeah.

And then, can you share how you're doing that?

Not really.

Okay, that's fine.

No, basically, it's with these machine learning models: they map the configuration, the workload, and the performance, and then we do some reasoning on that.

Okay, I understand.

Also, another thing that we do is that during the day, well, for this particular example, during the day the system is really close to saturation, because we have a badly behaving baseline, and so we are really close to the constraints that we have for the system. And if the system is close to a violation, we don't modify the configuration as much, because that's really risky. So in this particular setting, we do most of the tuning during the night, because it's safer there. What we do, basically, is tune during the night, then maybe go back to the baseline, or really close to the baseline, when the workload goes back to the daily workload, and then we try to understand whether the models are correct and whether we can use the results from the night to get a better configuration.

All right, awesome, thanks.

And so that's it for the third example. Basically, we have seen that Akamas is a generic optimization platform: we try to tune as many applications as possible. We are a full-stack optimization platform, so we really want to take advantage of every layer of the stack. We have seen some examples of the live optimization, which is the most recent version of Akamas, and within the live optimization we have seen that Akamas tries as much as possible to be a safe auto-tuner, which doesn't create problems in the production environment, and also that it is workload-aware: as we were discussing, it is able to use the night period to gain some insights for the daily period. And that's it. Thank you.

Thanks, I will clap my hands for everyone else. We have time for one or two questions from the audience.

Yeah, maybe I might ask one. Hi Stefano, nice to talk. This is Jignesh Patel, a colleague here. Two questions, both related. Do you at any point in time see that applying the DML causes the workload, the performance, to actually change in significant ways until it stabilizes, or do you nearly always have an uptake? And the second part is: what's the worst workload? Is there a broad characterization that you've seen of workloads that are super hard, or does everything seem optimizable with the techniques you have?

Well, I'll start from the second question. So far, we have always seen that there is some performance improvement that you can achieve for the system. As for the most demanding workload, the most difficult workload, the real problem is when you have a situation like the one in the last example, where basically you have no clear way to understand what the workload is. Because here we have BenchBase, so I know that the workload would like to go up, but in reality, you don't have access to this information.
So you don't know whether the system is really just a little bit above the threshold or miles away from the threshold, and so you need to look at the response times and try to understand that. So it's really more related to how bad, or how close to the constraints, the system is.

Great, thank you.

For the first question, I think that you really need to consider the technologies that you are tuning, because they have different times to show the effect of a configuration change. And so, basically, you need to have this knowledge when you create the optimization study.