All right, let's start. Last couple of talks for the day. We have a talk from Dimitri, a senior engineer at Twitter. He's going to talk about oversubscription and how they did it in a production-quality way at Twitter. So let's start.

Thank you. I work on the Twitter Compute Platform team, which is responsible for running Aurora and Mesos on Twitter's clusters. I'm going to introduce our solution for idle resource oversubscription, which we are using to try to improve our cluster utilization. This is still work in progress; it's not deployed in production yet, but we thought it would be interesting to share our challenges and solutions with the community.

A little bit of background on Twitter clusters. We have a cluster of more than 30,000 nodes, which is a really huge cluster, and it basically runs two types of jobs: production and non-production. Most jobs are, of course, production, and they account for about 80% of allocated CPUs, while non-production jobs are currently about 10%. The P95 CPU utilization is also different for those jobs: production jobs are more efficient, utilizing more than 30%, while non-production jobs utilize just 20%. Production and non-production jobs also have different SLAs: we are more flexible with non-production jobs, for example we can preempt them, while production jobs are mission critical, so we treat them accordingly.

So we are focusing here on non-production jobs; we want to oversubscribe the resources of non-production jobs. Although they are only 10% of allocated CPUs, given the cluster size, even if we manage to reclaim a few percent of that, it's a lot of hosts. Our most limited resource is CPU; we're not so much bound by network, disk space, or memory. And the problem is that both production and non-production resource allocation keeps growing, so to keep up with this growth we either have to add new machines to the cluster or find some other solution, like oversubscription.

We also observed that many non-production jobs are actually idle: their CPU utilization is really low, somewhere near zero. So we collected statistics about non-production jobs and found that many of them are barely using any CPU. Some jobs have occasional bursts in CPU utilization, but basically almost all of those non-production jobs are idle. These are development, testing, or batch-processing jobs, which are not that critical. We also computed utilization metrics for all the jobs. We didn't include short-lived jobs, because we cannot win much from them; we're focusing on long-running non-production jobs, which account for about 80% of the CPUs allocated to non-production jobs.

Here's an example of an idle task. This chart shows the CPU utilization of a service which has four CPUs allocated, but in reality its CPU utilization is less than 10% of that. You can see some occasional bursts, but even with those bursts it never reaches the allocated number of CPUs. Here's another statistic: the cumulative distribution of CPU utilization across all jobs. It shows that 60% of all jobs use only 25% of their allocated CPU when measured with a metric like the P95 used here; in other words, 95% of the time such a job is using less than 25% of its CPU. We can also see a small spike near 100%, which just means there are very, very few jobs that ever reach their full CPU allocation.
If we zoom into the left side of the chart, we find some interesting insights. A P95 CPU utilization of 10% is reached by fewer than 50% of jobs; in other words, more than half of the jobs stay below 10% utilization 95% of the time. So if we use such a metric to detect idle jobs, we can reclaim a large number of machines that can be used to run other non-production jobs through oversubscription.

So what we are trying to do is detect underutilized CPU resources based on the metrics we collect, and then offer these resources for oversubscription to non-production jobs. We also separate how we run production and non-production jobs: production jobs will use non-revocable resources, while non-production jobs are going to use only revocable resources. We need some kind of intermediate stage while we move away from the current state, where non-production jobs use non-revocable resources, so temporarily we could have non-production jobs using both revocable and non-revocable resources.

We also have some limitations and constraints. Non-production jobs must not affect production jobs, because production jobs are mission critical; we cannot preempt production jobs. However, we can preempt non-production jobs, which means we can kill a job, move it to some other node, and relaunch it there. We also expect some impact on the performance of non-production jobs, because we need time to detect contention between non-production jobs: collecting the metrics and seeing that there is contention takes a while, and during this time non-production jobs can be affected. But non-production jobs must never impact production jobs.

Mesos has had a solution for oversubscribing resources for quite a while, which is represented here; this diagram is taken from the Mesos documentation. The two major components are the resource estimator and the QoS controller. The resource estimator is responsible for estimating the amount of resources available for oversubscription. The QoS controller monitors the state of the running containers and decides whether it needs to take action to ensure quality of service; currently the only supported action is killing a task. Both of them can consult the resource monitor to obtain statistics from the containers, like CPU usage, memory usage, and so on. When the resource estimator produces an estimate of available resources, it sends it to the agent, which forwards it to the master, and the master can offer these resources to frameworks as revocable resources.

This is how the resource estimator interface is defined in Mesos: there is an initialize function which accepts a callback that can be used to obtain resource usage, and an oversubscribable function which returns the available resources. The QoS controller is very similar: the same initialize function, plus a corrections function which returns a list of corrections, basically which action to take on which container. Also, frameworks must opt in to receive revocable resources by declaring the REVOCABLE_RESOURCES capability; by default a framework will not receive them. There are also agent flags to specify the resource estimator and QoS controller modules, and two flags defining the intervals at which Mesos queries the resource estimator and QoS controller, 15 seconds by default. So we came up with the idea of idle resource oversubscription, which in the Mesos model can be presented like this.
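For reference, the two module interfaces just described look roughly like this; this is a simplified sketch based on the Mesos module headers, and exact signatures can differ between Mesos versions.

```cpp
// Simplified sketch of the Mesos agent module interfaces described above
// (based on include/mesos/slave/resource_estimator.hpp and
// qos_controller.hpp; details may differ between Mesos versions).
#include <list>

#include <mesos/mesos.hpp>                  // ResourceUsage
#include <mesos/resources.hpp>              // Resources
#include <mesos/slave/oversubscription.hpp> // QoSCorrection

#include <process/future.hpp>

#include <stout/lambda.hpp>
#include <stout/nothing.hpp>
#include <stout/try.hpp>

namespace mesos {
namespace slave {

class ResourceEstimator
{
public:
  virtual ~ResourceEstimator() {}

  // `usage` is a callback the module can invoke to fetch per-container
  // statistics (CPU, memory, and so on) from the resource monitor.
  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage) = 0;

  // Amount of resources the agent may advertise as oversubscribable.
  virtual process::Future<Resources> oversubscribable() = 0;
};

class QoSController
{
public:
  virtual ~QoSController() {}

  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage) = 0;

  // Corrections to apply, e.g. which containers to kill.
  virtual process::Future<std::list<QoSCorrection>> corrections() = 0;
};

} // namespace slave
} // namespace mesos
```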
For resource estimation, we detect CPUs that are not being utilized by non-production jobs. That starts with the allocation slack, the resources not allocated to any type of job, and we add to it the underutilized resources that are allocated to non-production jobs. QoS control is responsible for killing non-production jobs if there are no longer enough available resources. This can happen in different cases: for example, we detected that a non-production job was idle, and some time later it became active and started using CPU again, so now there may not be enough resources, and we should kill some of the jobs to make sure there is no contention. And since keeping production jobs safe from contention is critical, we must also use isolation and isolate all non-production jobs in a separate group.

The basic rule for detecting idleness is this: a container is idle if p% of the samples over a window of some duration show CPU usage below a threshold. In other words, if during a window of, for example, one day, a container's P95 CPU usage stays below 10% of its allocation, the job is idle. It can have some bursts, but the P95 metric tolerates those.

Based on our data, we made some estimations. Using different metrics, we found that more than 50% of tasks are idle, and using different methodologies to estimate idle CPUs, we came up with 30% to 50% of CPUs being idle. The lower bound comes from summing idle CPUs rounded down to whole CPUs: if 0.8 CPUs are available, most likely no job will be able to use them, because people tend to request integer amounts like 1 or 2 CPUs. The upper bound simply sums all the idle CPU fractions available.

Let's see how we detect available resources. First of all, we always ignore resources allocated to production jobs; they are not available for oversubscription. If there are no non-production jobs running, the available oversubscribable resources equal the allocation slack, meaning the unallocated resources. Then, if we launch a non-production job, the allocation slack is reduced, and we make that reduced slack available as oversubscribable resources. If we detect that the job is idle, we can additionally offer its unutilized resources as oversubscribable, so we just increase the amount.

Let's see how QoS corrections work. Suppose we have an idle non-production job running on revocable resources, and we launch a production job. That means our allocation slack is reduced, and we have to reduce the amount of oversubscribable resources. In this case the non-production job no longer has enough resources; basically, it's starving, so we need to kill it and relaunch it somewhere else. Another case is when a non-production job becomes non-idle: we have two jobs, one idle and one not, and then the first job becomes non-idle too. Again, there are no longer enough resources to keep them both on the same host, so we should kill one of them. Our QoS correction strategy is to kill non-idle jobs and preserve idle ones. This might seem counterintuitive: why would we kill the non-idle job if it's actually doing something? The reason is that if we kill the idle job, it will be rescheduled somewhere else and, having no usage history yet, will be treated as non-idle, so in the end we would have two non-idle jobs running. But if we kill the non-idle job, we keep one idle job here and relaunch the non-idle job somewhere else.
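As an aside, the idleness rule above boils down to something like this minimal sketch; the function and parameter names are made up for illustration and are not Twitter's actual code.

```cpp
// Illustrative sketch of the idleness rule: a container is idle if at
// least p percent of the CPU usage samples collected over the detection
// window fall below `threshold`, i.e. its p-th percentile usage is below
// the threshold. Names and types are assumptions for this example.
#include <algorithm>
#include <cstddef>
#include <vector>

bool isIdle(
    const std::vector<double>& cpuSamples, // CPU usage samples over the window.
    double p,                               // Percentile, e.g. 95.0 for P95.
    double threshold)                       // Idle threshold, e.g. 0.1 CPUs.
{
  if (cpuSamples.empty()) {
    return false; // No data: conservatively treat the container as busy.
  }

  // Count samples below the threshold; the container is idle if they
  // make up at least p percent of the window.
  const auto below = std::count_if(
      cpuSamples.begin(),
      cpuSamples.end(),
      [threshold](double usage) { return usage < threshold; });

  return 100.0 * static_cast<double>(below) >=
         p * static_cast<double>(cpuSamples.size());
}
```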
Preserving idleness also requires taking care of agent recovery, because we want to start with an idleness window of one day or maybe one week, and that's a long time: we need to update agents on the hosts, and many different things can happen. If we don't preserve this information, every agent restart could cause jobs to be restarted. Checkpointing all the samples collected from a job over a day or a week would be a huge amount of data, so instead we checkpoint statistics and a few timestamps, and when the agent restarts, we reconstruct the samples.

This can be illustrated with this graph; suppose it shows CPU utilization over some period of time. How are the statistics obtained? Basically, we sort the samples by value and take metrics like P50, P90, P95, and the maximum. When we reconstruct, we do it conservatively: we don't have any data about the samples between those metrics, so we just assume the CPU usage was at the maximum possible value, and then we convert back from statistics space to time space to get samples. One more thing: we need to shift the samples by the agent's downtime. Here we just assume that while the agent was down, the CPU usage remained constant at our maximum observed sample. This tends to overestimate CPU utilization, but it's conservative: non-idle jobs will remain non-idle. Due to the way samples are reconstructed, idle jobs may temporarily look non-idle, but that's OK; if there are enough resources they keep running, and if not, we kill the job and relaunch it somewhere else in the cluster. Also, because we rearranged the samples from the lowest value to the maximum, a job can remain classified as non-idle for longer after such a reconstruction; that's also fine.

We implemented this using the Mesos model and ran into some difficulties. First of all, there was an issue with decreasing oversubscribable resources. It happens because there is a delay between killing a non-production job and the update of the resources: there are intervals in Mesos for QoS corrections and for oversubscription estimates, and if the scheduler manages to squeeze in between those two actions, it can launch a task against the old amount of resources. In more detail: when the resource estimator is called, it must return the additional amount of oversubscribable resources, not the total amount, and that's one of the issues. In this case we have a production and a non-production job, and we use the allocation slack as the oversubscribable resources. When we launch another job, the amount of oversubscribable resources effectively becomes negative, because the containers' CPU allocations now overlap, but we cannot return a negative value from the oversubscribable function. The best we can do is return zero, meaning we have no oversubscribable resources anymore, and then issue a QoS correction to kill a container. But now there is a delay between killing the container and Mesos calling the oversubscribable method again. During this time the master is not aware of the change in resources, so it can offer them to the framework, the framework accepts the offer and launches another task, and we end up in the same situation, looping: kill a job, hope the resource update reaches the master before the framework launches on the stale offer, and so on.
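To make the recovery step above more concrete before moving on, here is a rough sketch of the conservative reconstruction; the checkpointed fields and the percentile boundaries are illustrative assumptions, not Twitter's actual checkpoint format.

```cpp
// Illustrative sketch: rebuild a window of CPU usage samples from
// checkpointed percentile statistics. Each reconstructed sample takes
// the value of the next checkpointed percentile, so the result can only
// overestimate real usage (conservative: non-idle jobs stay non-idle).
// The struct layout is an assumption for this example.
#include <cstddef>
#include <vector>

struct CheckpointedStats
{
  double p50;
  double p90;
  double p95;
  double max;
};

std::vector<double> reconstructSamples(
    const CheckpointedStats& stats,
    size_t count) // Number of samples the window is supposed to contain.
{
  std::vector<double> samples;
  samples.reserve(count);

  for (size_t i = 0; i < count; ++i) {
    // Percentile rank this sample would have in the sorted window.
    const double rank =
        100.0 * static_cast<double>(i + 1) / static_cast<double>(count);

    if (rank <= 50.0) {
      samples.push_back(stats.p50); // True value was somewhere <= p50.
    } else if (rank <= 90.0) {
      samples.push_back(stats.p90);
    } else if (rank <= 95.0) {
      samples.push_back(stats.p95);
    } else {
      samples.push_back(stats.max);
    }
  }

  // To account for agent downtime, the window would additionally be
  // extended with samples at `stats.max` covering the downtime period.
  return samples;
}
```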
The workaround we came up with builds on how the resource estimator and QoS controller interfaces are defined: they return futures of resources and of QoS corrections, so we don't have to act immediately. When we're asked for oversubscribable resources, we don't have to go estimate them right then and return a value; instead we make a promise to return resources and basically do nothing. When we see a change we must react to, we complete those futures, so our reaction happens immediately, as soon as we detect the change, without waiting for another request cycle from Mesos. This way we control resource estimation and QoS corrections by completing the futures at the right time, when we detect a change. To detect the changes, we hook into executor and task lifecycle events. We also have to adjust how Mesos works by setting the polling intervals to zero: this means that as soon as we complete a future, Mesos immediately requests the next value, which we can again hold on to until we detect another change.

We also found that there was duplicated work between the resource estimator and the QoS controller, and really high coupling between them. So instead of two separate modules running independently, we came up with a structure where an oversubscriber component combines both the resource estimator and the QoS controller. There is an oversubscriber factory to ensure that we create only one oversubscriber and that this single instance is used by both the resource estimator and QoS controller modules. From that point on, it's pretty standard Mesos structure: there is an interface and a libprocess process responsible for handling the requests, so the idle oversubscriber just dispatches requests to the idle oversubscriber process. We also added a hooks implementation to catch task status transitions. This makes the configuration a little bit weird, because we have to put all the parameters on the hook module just because it's created first, and then the QoS controller and resource estimator modules are declared by name only, without any parameters; their parameters actually live in the hook's parameters.

We also implemented isolation. This picture shows the current implementation of isolation in Mesos: it doesn't differentiate between tasks running on revocable and non-revocable resources, so when we run too many tasks on revocable resources, there is potential for contention between production and non-production tasks. We wanted to avoid that, so we introduced a revocable cgroup that groups all non-production containers underneath it. Even if all of those non-production containers become non-idle, the cgroup limits prevent them from affecting production containers. We cap the revocable cgroup at the allocation slack, which ensures that production containers get their share of CPU time. This has some issues: if we have too many non-production containers running, there could be contention that we cannot detect, simply because there are not enough resources for all of them to appear non-idle. This can be addressed with techniques that detect contention based on how much CPU is throttled and so on, or by making sure we don't launch too many non-production containers. There's also the limitation that enabling revocable isolation only takes effect for new tasks, so we would have to restart non-production tasks to make sure they are isolated. That's fine, because non-production jobs that are currently running on non-revocable resources cannot be oversubscribed anyway.
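A minimal sketch of the promise-based workaround described at the start of this part might look like the following; the class and member names are invented for illustration and are not Twitter's actual module.

```cpp
// Minimal sketch of the promise-based trick: oversubscribable() hands the
// agent an unfulfilled future instead of computing an estimate, and the
// future is completed from task/executor lifecycle hooks the moment the
// estimate actually changes. With the agent's polling interval set to
// zero, Mesos re-requests a new future immediately after each completion.
// All names here are illustrative, not Twitter's implementation.
#include <memory>

#include <mesos/resources.hpp>

#include <process/future.hpp>

class IdleResourceEstimator
{
public:
  // Called when the agent asks for the current estimate; we only promise
  // to answer and otherwise do nothing.
  process::Future<mesos::Resources> oversubscribable()
  {
    promise.reset(new process::Promise<mesos::Resources>());
    return promise->future();
  }

  // Called from lifecycle hooks (task launched/terminated, idleness
  // changed) when a new estimate needs to be pushed to the master.
  void estimateChanged(const mesos::Resources& estimate)
  {
    if (promise != nullptr && !promise->future().isReady()) {
      promise->set(estimate); // The agent re-requests right after this.
    }
  }

private:
  std::unique_ptr<process::Promise<mesos::Resources>> promise;
};
```

The same trick applies on the QoS controller side: the corrections() future is completed only when a kill decision is actually pending.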
Another issue is disabling or downgrading the isolation, where we basically have to restart all tasks as well, to make sure they cannot affect each other.

We also ran scale tests of the implementation and found some issues. One of them was the increased frequency of agent resource updates: the resource estimator monitors active and idle tasks and re-estimates the amount of available resources, which causes an agent update that is reported to the master; the master then rescinds the outstanding offer and sends a new offer to the framework. It turned out that Aurora, which we use as the scheduler, received too many rescinded offers and kept using old offers to launch tasks, so we saw many failures where the scheduler tried to launch a task on a stale offer. What we did is add a tolerance: if the change in available resources is minor, we just ignore it and do not report it to the agent and the master. This significantly reduced the traffic and the number of lost tasks. We also changed Aurora to make sure that rescinded offers are handled as soon as possible. Before, there was a single queue for processing offers and rescinds; now rescinds are handled out of order, with the highest priority.

Another issue is that the scheduler is unaware of the relation between revocable and non-revocable resources. This means that if we launch a production task, we know it will reduce the allocation slack and therefore reduce the amount of oversubscribable resources available, but the scheduler doesn't know this. Potentially it could launch a production task and a non-production task on the same offer, and the non-production task would just be killed because the allocation slack was reduced.

So the current state: we have tested this on our scale-test cluster, which somewhat simulates our production cluster; it's not in production yet. We still have to fix a few issues, for example the scheduler's unawareness of the relation between revocable and non-revocable resources. Our deployment plan is to fix those issues and then migrate non-production jobs to revocable resources. That's pretty easy to do with Aurora, because all we have to do is change the tier configuration so that the tier uses revocable resources; new tasks will then automatically launch on revocable resources, and the tasks that are already running we basically have to restart. So that's it.
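As an aside, the tolerance check mentioned above, which suppresses minor estimate changes so the master isn't flooded with updates and rescinded offers, boils down to something like this sketch; the function name and threshold value are made-up examples.

```cpp
// Illustrative sketch of the tolerance check: only report a new estimate
// to the agent/master if the CPU amount moved by more than a small
// threshold. The function name and threshold are assumptions.
#include <cmath>

#include <mesos/resources.hpp>

bool shouldReportEstimate(
    const mesos::Resources& previous,
    const mesos::Resources& current,
    double toleranceCpus = 0.1) // Ignore changes smaller than this.
{
  const double previousCpus = previous.cpus().getOrElse(0.0);
  const double currentCpus = current.cpus().getOrElse(0.0);

  return std::fabs(currentCpus - previousCpus) >= toleranceCpus;
}
```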
Any questions? Production versus non-production is a pretty broad brush to categorize with, which may be applicable, and that's fine, but I'm curious: have you felt the need for even finer granularity, or do you think that adds complexity and that's the reason you're not looking at finer granularity? Well, I don't think we need finer granularity. Production jobs are the mission-critical jobs, which have quotas and all that; non-production jobs are usually development or testing jobs, or jobs that simply don't require the high level of SLA that production jobs do. Maybe once we start oversubscribing we will find something different. Any other questions? Can you repeat the question?

I was just wondering if you ever observed something that made all the idle jobs spike at the same time, and then maybe even fail because they weren't able to get the resources they needed in a short period of time, just because some external event made them all spike at once. Well, this shouldn't be an issue because of the isolation. We group all jobs running on revocable resources in the same cgroup, so they get a limited amount of CPU time and cannot impact the production jobs. If they all spike at the same time, there will be contention among them; we're fine with that contention for some period of time, it's acceptable, but once we detect it, we will reschedule some of the non-production jobs onto other hosts where the situation is probably better.

How is your experience with the QoS controller? Do you see the kill rate being too high? Do you see kills happening when the system is actually not under heavy load? Have you looked into that aspect? Sorry, could you repeat, please? The QoS controller, the killing of jobs: how is your experience, do you see the kill rate being too high, is it killing too many containers? Well, based on the experiments, I don't think the rate would be too high, because the statistics show that many jobs really are idle, and they stay idle over a very long window. So a kill can basically only happen when we launch a production job, and that's the issue we have with the scheduler not being aware of the relation between revocable and non-revocable resources. This can be fixed at the scheduler level, so that it prefers to place production jobs and non-production jobs on different hosts, basically filling the cluster from different directions, if you see what I mean. The other potential cause is killing a job when an idle job becomes non-idle, but that doesn't happen frequently. So I don't think that, given the current situation, we would have too many preempted jobs.

How do you detect that a job picks up and becomes non-idle? What is your detection mechanism, is it per container? Sorry, can you repeat, please? How do you detect that a container is idle versus non-idle: do you look at it per container, or is it based on complete system-level statistics? Idleness is basically the rule I showed. We collect metrics for each container; the Mesos API only allows collecting such metrics at the container level, not for each individual process. So we collect this per container, and the parameters, the percentile p, the window duration, and the threshold, are the same for all non-production jobs.

Any other questions? You applied this only to CPUs as a resource; have you applied it to other resources as well? And if not, would the strategy work as well, or would you need to tweak it for oversubscription of other resources? Currently we only apply this to CPUs. Potentially there is an issue with network bandwidth: if we oversubscribe CPUs, we would implicitly also oversubscribe the network. But since most of our jobs are not currently network bound, we believe this wouldn't affect jobs in any way. For memory there is no oversubscription, obviously, and if there is a memory shortage, we just don't launch non-production jobs on those hosts. And then, maybe I missed it.
Given this new method that you use now, what kind of cluster CPU utilization improvement did you achieve? We don't have this in production yet; our estimate is that we would be able to reclaim 30% to 40% of the CPUs used. OK, cool, thank you. Any other questions? Going back to memory oversubscription: the Linux memory controller provides mechanisms to kill a particular task, so have you started looking into memory oversubscription, or is it just not in the plan? Memory oversubscription we do not consider. You're not planning it, it's not in the future plans? No, because we're not memory-bound right now. Our main issue is that CPU allocation is quite high while there are still idle resources, so oversubscription lets us postpone buying more hardware for the cluster. For memory, there is just no need for now. Any other questions? Cool. Thanks, Dimitri. Thank you.