So, this talk will be about quota, which is part of the built-in allocator. If you've been at the previous talk, we talked about this a little bit already, and you know me. If not: I'm Alex, I'm an Apache Mesos committer. I work at Mesosphere, and as I like to say, my job is to search for the right abstractions and make sure Mesos gets better over time, like good wine and cognac. So, let's set the goal for this talk. The goal is to share some operational experience that we have with quota, to talk about limitations, and about future improvements that we are going to make. And I will also try not to fall asleep, because I'm still a little bit jet-lagged; that's my own personal goal for this talk. So, the basic question is: why quota? Why do we need quota? Why have we invested time to implement it? If you don't know what quota is at the moment, that's fine, I will get to it later, but first: why? Usually every solution starts with a problem, and we had a problem. Suppose we have a Mesos cluster with four CPUs and four gigabytes of RAM. It doesn't matter how many nodes; think of it as one big computer, four CPUs, four gigabytes of RAM. There is a Mesos master, and there are two frameworks. One framework is your production application. The other framework is, say, Spark analytics. And as you may guess, Spark may use all the resources that it gets, but more about that a little later. So now we are at time zero. That's the situation: two frameworks, all resources free. The next second, with the allocation interval set to one second, the Mesos master allocates one CPU and one gigabyte of memory to each framework. That's fine. The frameworks accept these resources and start running tasks on them. We're still at time one.
At the same time, the rest of the resources are being allocated by the Mesos master fairly, equally, to these two frameworks. So our production application gets an offer of one CPU and one gigabyte, and our Spark application gets an offer of one CPU and one gigabyte. Everything is fair, everything is fine. But our production application says: I don't need any resources now. So it declines the offer, with a default filter timeout of five seconds. The next tick, time two: the Mesos master thinks, okay, fine, I have free resources in the cluster and these resources are idle; let me offer them to the other framework, the one that didn't decline them. So it offers these resources to Spark, and Spark gladly accepts every offer that Mesos makes. And now we have a status quo: the production app uses what it wants to use, and Spark uses everything else. Everything is fine. However, some time later, the production application needs to scale, because it gets more requests. That can happen easily. What this application would like is to receive an offer from Mesos. But unfortunately, Mesos doesn't have any resources to offer. So nothing happens, and the framework becomes an unhappy framework. This is the problem, or one of the problems, we wanted to address with quota. This problem can be solved differently; quota is just one of the solutions. So what is quota? Quota is a resource guarantee that allows operators to lay away some resources. It's expressed via... oh, it got cut off. So there are two links here. One is a link to the Mesos documentation; you will get access to these slides afterwards and you can click that link. The second link is a link to the talk by Jörg Schad and Joris Van Remoortere, my colleagues, the talk they gave at MesosCon in Chicago, I believe, last year. You can also watch that talk; it gives an introduction to the quota feature.
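The starvation scenario above can be sketched as a toy simulation. To be clear, this is not the real Mesos allocator, which uses DRF, decline filters, and per-agent offers; every class and name below is made up purely to illustrate the dynamics of a greedy framework soaking up everything a polite one declines.

```python
# Toy model of the starvation scenario: 4 CPUs, a production framework that
# only wants 1 CPU for now, and a greedy Spark framework that accepts all.

class Cluster:
    def __init__(self, cpus):
        self.free = cpus
        self.used = {}          # framework name -> CPUs currently held

    def offer_round(self, frameworks):
        """Offer free CPUs one at a time, round-robin, to willing frameworks."""
        while self.free > 0:
            takers = [f for f in frameworks if f.wants()]
            if not takers:
                break
            for fw in takers:
                if self.free == 0:
                    break
                fw.accepted += 1
                self.free -= 1
                self.used[fw.name] = self.used.get(fw.name, 0) + 1

class Framework:
    def __init__(self, name, demand):
        self.name = name
        self.demand = demand    # None means "greedy": accept everything
        self.accepted = 0

    def wants(self):
        return self.demand is None or self.accepted < self.demand

cluster = Cluster(cpus=4)
prod = Framework("production", demand=1)   # needs just 1 CPU for now
spark = Framework("spark", demand=None)    # greedy analytics job

cluster.offer_round([prod, spark])
print(cluster.used)        # Spark soaked up everything production declined

# Later, production wants to scale out, but nothing is free anymore:
prod.demand = 2
cluster.offer_round([prod, spark])
print(prod.accepted, cluster.free)   # still 1 accepted, 0 free: unhappy
```

Running this ends with production holding one CPU and Spark holding three, and production's scale-up request going nowhere, which is exactly the situation quota is meant to prevent.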
That talk also covers how quota is implemented and how it works. So, quota provides a REST API for operators, not for frameworks; quota is only for operators. This REST API allows operators to lay away resources for certain roles. Okay? How does it work? It works very simply. A request comes in, and we check the capacity: whether there are enough resources in the cluster to satisfy the request. Then we persist this request in the registry, which is necessary for failover. And then we basically execute the request, if we can, and everyone is happy. That's the theory. That's what we expected: after we shipped quota, I opened a bottle of my favorite wine, and we expected that people would start using quota, that they would come to the dev list asking how it works. Unfortunately, it didn't happen. And we were thinking: why didn't this happen? Why didn't quota see wide adoption? I know that some people do use quota, but not as many as I wanted. So, there are some limitations, and here are just some of the limitations that the quota feature has. Limitations are fine; almost every feature and every project has limitations. Let's go through them. First, resources that are laid away for quota are not offered to other frameworks. Which means if you lay away, say, two CPUs in your cluster for future use by that production application, these resources currently will not be offered to anyone else. Which is probably fine; that's how static reservation works, and that's how dynamic reservation works. But it is a limitation. Another limitation is that quota is currently a single value, instead of a tuple of a guarantee and a limit. Another limitation is that we support only scalar resources. And there are no atomic updates. Now, what does that mean, a limit and not a guarantee? Quota can be understood in two different ways: as a limit and as a guarantee.
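Before the limitations, here is roughly what a request to that operator REST API looks like. The JSON shape follows the Mesos quota documentation (a role plus a list of scalar resource guarantees), but double-check it against the docs for your Mesos version; the master address in the comment is a placeholder.

```python
# Build the JSON body for a quota request to the master's /quota endpoint.
# Shape based on the Mesos quota docs; verify against your Mesos version.
import json

def quota_request(role, cpus, mem, force=False):
    """Return the JSON body for POST /quota on the master."""
    req = {
        "role": role,
        "guarantee": [
            {"name": "cpus", "type": "SCALAR", "scalar": {"value": cpus}},
            {"name": "mem",  "type": "SCALAR", "scalar": {"value": mem}},
        ],
    }
    if force:
        # Operator override for the capacity check (discussed later).
        req["force"] = True
    return json.dumps(req)

body = quota_request("production", cpus=2, mem=2048)
# Then, for example: curl -d "$body" -X POST http://<master>:5050/quota
print(body)
```

Removing quota for a role is a DELETE on the same endpoint with the role name, and the current quotas can be read back with a GET, again per the quota documentation.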
A limit means you say that some entity, a framework, is not allowed to use more than a certain amount of resources. Alternatively, you can say: I guarantee that this framework is allowed to use at least this many resources. These are two different things, and in English, "quota" can refer to either of them, or both. I will come back to this second limitation later. For now, just understand that quota can be understood as a limit and as a guarantee, and intuitively, you may want to specify both. You may say: I have a framework, and I would like this framework to be able to use at least 10 CPUs but not more than 100. That's a guarantee plus a limit. And there are some other limitations. For example, quota is gameable; I won't go into details. There is no granularity support. By granularity I mean that you cannot say, for example: I want to reserve 10 CPUs as part of the quota, and these 10 CPUs should come in five chunks of two CPUs each; or, vice versa, I want those two CPUs in one chunk, coming from one node. You cannot express this currently. That's one of the limitations. So, we were thinking about why quota didn't see wide adoption, and we learned some lessons. Of the limitations that I listed, and there are more, I just didn't put them all here, two are very important: the first two. They are probably the main reason why quota didn't see wide adoption. Looking into the future, I still think we'll fix those soon, and we will get to the moment when quota is widely adopted. But that's another story. So first, why haven't we addressed limitation A? Because addressing it means implementing oversubscription of the laid-away resources.
If we would like to offer the resources that were laid away for quota to someone else, we need to implement preemption. Because suppose you said you need two CPUs laid away for a certain framework, for a certain role. While this production framework doesn't use these two CPUs, what do you want to do? You would like these two CPUs to be used by someone else, right? Now this someone else uses these resources, and at some point in time, as we've seen before, the production framework says: I want my two CPUs back now. And here we come to preemption: you need a mechanism to preempt these resources, reclaim them, and give them back to the production framework. We don't have this in Mesos now. We are currently working on it, but we don't have it. And because we don't have it, we were not technically able to implement that oversubscription of laid-away resources in quota. The second limitation is a little bit more interesting, I would say. When we were thinking about how to solve the problem we talked about before, we said: let's introduce a guarantee. Remember this tuple, limit and guarantee; we were thinking in the guarantee direction. Let's say our production framework needs a guarantee, like those two CPUs and two gigs of memory; if you remember that example, we would like to set this guarantee. However, there are two different approaches. If you set a guarantee, what you try to express is that you want to protect the production framework: you set a guarantee for it. The alternative approach is to limit a greedy framework, which Spark is. If you set a limit, you can limit the amount of resources that a framework can use. The question is: if you have time and engineering resources for just one, what shall it be, a limit or a guarantee? That's the first question we were asking ourselves.
And that's why we decided to conflate those two, and implemented something that acts both as a guarantee and as a limit. How does it work? There are two different types of resources in Mesos: revocable and non-revocable resources. I think I have time to dig into this a little bit. Non-revocable resources are the default type; these are resources that we cannot easily take back. Once we offer these resources and a framework accepts them, we cannot take them back. Revocable resources are resources that we can take back at any time. So while we implemented just one thing in Mesos, and it's called the guarantee, it technically also acts as a limit. These resources are guaranteed to your framework, they are laid away; but once you set the quota for that particular role, the role is not going to get any non-revocable resources beyond that. Sounds like a good thing. However, people often ask me: why don't we have both a separate guarantee and a separate limit? Why haven't we implemented a tuple? And the answer is: I don't know whether you're familiar with this, but we are going to move into a so-called, how do we call it, revocable-by-default world. What's that? Currently the default type of resource is non-revocable, which means everything that Mesos offers, everything that Mesos allocates, cannot easily be taken back. And this is a little bit unfortunate, because it complicates rebalancing in Mesos: once we, let's say, made a mistake, or simply didn't know in advance what the fair allocation is, and several new frameworks joined the cluster, we can't rebalance, because the resources that were offered are non-revocable and we can't simply take them back. So what we want to do is make resources revocable by default. And in that world, technically, we don't really need both a guarantee and a limit.
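The conflated guarantee-as-limit behavior just described can be sketched as a simple check: the same quota number is both a floor (resources are laid away for the role) and a ceiling on how many non-revocable resources the role may receive. This is an illustrative model with made-up names, not the master's allocator code.

```python
# Sketch: with quota set, a role's quota value caps its non-revocable
# allocation; without quota, today's default imposes no such cap.

def may_offer_nonrevocable(role_quota, role_allocated, amount):
    """May `amount` more non-revocable CPUs be offered to this role?

    role_quota is None when no quota is set for the role."""
    if role_quota is None:
        return True   # no quota, no ceiling (non-revocable-by-default world)
    return role_allocated + amount <= role_quota

# A role with quota of 2 CPUs: guaranteed 2, but also capped at 2.
assert may_offer_nonrevocable(role_quota=2, role_allocated=1, amount=1)
assert not may_offer_nonrevocable(role_quota=2, role_allocated=2, amount=1)
# Without quota, there is no cap today.
assert may_offer_nonrevocable(role_quota=None, role_allocated=10, amount=5)
```

In a revocable-by-default world the `None` branch would effectively become a limit of zero on non-revocable resources, which is why a separate limit field becomes unnecessary, as explained next.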
And that's why we didn't implement it. In the revocable-by-default world, here is how quota will look: you will just set the quota, which is simultaneously the guarantee for non-revocable resources. And because you're not getting non-revocable resources by default, you don't actually need a separate limit, because the limit is already there: the limit on non-revocable resources is zero if you don't set quota. So most probably, and that's my guess, we won't be introducing a limit-plus-guarantee tuple. We will simply wait long enough until we move into that world, and then this limitation will be naturally solved. Next, I promised we would talk a little bit about common pitfalls and share some operational experience. First, a very important thing: we decided to make quota set per role. Quota is not for a framework, it's for a role. One of the reasons we did that is that we had the future of roles in mind. Imagine you have a hierarchy of roles; it will most probably come one day. In that hierarchy, a framework may be attached to multiple roles, and it makes more sense to attach certain resources to a role, to a particular node in the tree of roles, than to a framework, an entity which lives outside of the tree. This is very important, because roles can be implicit now in Mesos. Technically, you may set quota for a role that doesn't really exist in the cluster, and if you don't double-check, you basically lock away a lot of resources for a role that doesn't exist. So it's important. If you have multiple frameworks in the same role, they share the resource guarantee that quota gives you. So if you set quota for a role with different frameworks in it, all of them will benefit from the quota. Second, we don't have atomic quota update yet. It will be coming soon, but not yet.
And the combination of remove and set is not exactly the same as an update, because since it's not atomic, after the remove there may be events happening in the cluster that render the set impossible, or meaningless, or whatever. Another pitfall is that the act of setting quota may lead to rescinding of outstanding offers in your cluster. And it's not that easy to calculate upfront which offers, and how many offers, will be rescinded. Which means if you have frameworks that rely on those resources and hold on to the offers Mesos gives them for some time, you may impact the operation of those frameworks. This is something to keep in mind if you use quota. Next, we've talked about this already, but resources that are laid away for quota and not used are idle. Which is not efficient, and that's something to think about: you should be very careful about the amount of resources you set as part of the quota. The more you set, the higher the probability that resources will sit idle. The next thing is how quota is technically implemented. If you remember the steps of the quota processing pipeline: we store the quota, and then we enforce it. And the enforcement happens during the next allocation interval. It doesn't happen sometime in the future, it happens right now; once we store the quota, it's applied immediately. However, it may happen that the quota cannot be satisfied right now, and you have to wait for resources to be freed. Remember, we are still in the non-revocable-by-default world. Which means, for example: suppose you have a busy cluster with two CPUs free, and you set quota of 10 CPUs for a particular role. That doesn't mean that eight more CPUs will be freed. The two free CPUs will be allocated immediately, but for the other eight you will have to wait.
It does mean that the first eight CPUs that become free will be allocated to the role for which you set that quota, but you don't know when that will happen. Why? Again, because we don't have preemption yet. It will be fixed once we have preemption; there will probably be a user option, something like "wait for resources to be freed" or "free some resources". What you can do, if you know that you would like this quota to be satisfied right now, is manually free some resources. You decide which tasks, which frameworks, you would like to kill, and once you kill those tasks, their resources will be allocated straight away as part of the quota. Another thing: if you remember, there is the capacity check that I mentioned before. The capacity check prevents an operator from blocking cluster operation. What do I mean by that? Suppose you have a cluster with 100 CPUs, and you set quota for 1000 CPUs. Once you set quota for 1000 CPUs and you have just 100, all 100 CPUs will be laid away as part of your quota. And even if you keep adding agents to your cluster, all new agents will also be blocked: frameworks outside of the role you set quota for will get no resources at all. Now imagine, since we now have implicit roles, that you make a typo in the role name, and you do this on Friday evening, right before you go home. You set the quota, you make a small mistake in the role name, and basically all free resources in your cluster are laid away for a non-existent role. So by default you can set quota only up to the currently free resources in the cluster. But clearly this doesn't address all the cases, because you may know: okay, I will set quota now, and I will be adding 10 or 100 new nodes to the cluster right after. So I do know that I will have these free resources; I know better.
And we all know that one of the worst things that can happen to a software engineer is the feeling that the computer thinks it knows better. To prevent that, we introduced the force flag. If you know that this is exactly what you would like to do, to lay away more resources than are currently available in the cluster, you should use the force flag. Another very confusing thing is the difference between quota and dynamic reservations. These are two things that may look similar but are very different. Quota is about quantity; dynamic reservations are about identity. A dynamic reservation is attached to one specific node, with all the consequences: if you dynamically reserve resources for a particular role on a particular node, and this node goes away, it dies, the role loses that reservation. The reservation is not transferable, because it's attached to the node. That's not true for quota. With quota you say: reserve resources somewhere in the cluster; and if a node running tasks counted towards a role's quota goes away, you will get another place in the cluster where these resources will be reserved. Another difference is that quota can be set only from free resources, while dynamic reservations can be made on free resources and also on offered resources. Currently quota is for operators only, so frameworks cannot set quota using the framework API. Of course, your framework can at the same time be an operator; whether that's a good idea is a different question, but a framework can use the operator API to operate on quota. It's just not available in the framework API. The most important thing here is that quota and dynamic reservations behave differently in the presence of failures. If you lose nodes with dynamic reservations, you lose the dynamic reservations. If you lose a node with quota'ed resources, you will get these resources somewhere else in the cluster.
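The capacity check and the force flag described above can be sketched in a few lines: by default, a request beyond the cluster's currently free resources is rejected; force tells the master the operator knows better. This is toy code with invented names, not the master's actual implementation.

```python
# Sketch of the quota capacity check: reject requests beyond free capacity
# unless the operator explicitly forces them (e.g. new agents are coming).

class QuotaCapacityError(Exception):
    pass

def check_capacity(requested_cpus, free_cpus, force=False):
    if force:
        return  # operator override: "I know better"
    if requested_cpus > free_cpus:
        raise QuotaCapacityError(
            f"requested {requested_cpus} CPUs, only {free_cpus} free; "
            "use force to override")

check_capacity(50, free_cpus=100)                 # fine: within capacity
check_capacity(1000, free_cpus=100, force=True)   # fine: forced
try:
    check_capacity(1000, free_cpus=100)           # rejected by default
except QuotaCapacityError as e:
    print(e)
```

This is exactly the Friday-evening-typo protection: without force, a quota for a misspelled role can still lock away at most what is currently free, not the whole future cluster.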
Some future improvements to the feature: we will do atomic update very soon, to address one of the limitations. We'll introduce quota support for hierarchical roles, once we have hierarchical roles. And we will definitely implement oversubscription of quota'ed resources once we get preemption; that addresses the first limitation. The second limitation, that quota is not a tuple, will I hope become non-existent: it will go away once we enter the revocable-by-default world. That's it. If you have any questions, ask them now, or later on you can catch me at the booth. Yes. So the question is: there are two nodes; on one node a container, say container A, is running; the other node is fully booked, there are no free resources. Now the first node goes away, it fails. How can we make container A run on the second node? Well, quota is the answer, obviously. If you think that your container A is a production container and has priority over other containers or tasks running on the second node, you should reserve resources in the cluster for it. Alternatively, you can limit the other workloads running on that second node so that there are free resources on that particular node. And then the framework that you use will most probably reschedule the container; if you use something like Aurora or Marathon, they already do that. However, there is one important detail here. If the other node is completely booked and all the workload is running on non-revocable resources, you cannot simply run your container there; you cannot simply kill a random task. It depends on how it's fully booked, and that's the point I'm trying to make. If it's booked with revocable resources, we can easily revoke them and run your container there. But if there are non-revocable resources, we cannot simply randomly kill applications, because we don't know what the SLAs for the applications on those non-revocable resources are.
And that's why we want to transition to the revocable-by-default world: so that we have more workloads running on revocable resources and can do this more easily. We may also introduce different tiers of quota, and priorities. If you have priorities for roles, then you can deduce the importance of a workload, and, for example, if the priority of your container is higher than the priority of some tasks running on that node, you will be able to kill those tasks and run your container. But currently, if all resources are non-revocable, they have the same priority, and we cannot randomly kill tasks in order to run your container. So the next question is: is there a simple way to reserve, for example, 30%, a certain percentage, of all resources on each node? The answer is yes, and it's called static reservation. It's not in percentages, it's in absolute numbers, but you can reserve a certain amount of resources on every node for a particular role. It's probably not desirable, because it's a static reservation. What else can you do? You can create a dynamic reservation on every node, and we were considering introducing dynamic reservation templates, which means once a new node comes in, a template is applied automatically and some resources are reserved dynamically. But yes, the answer to your question is static reservation, even though it's probably not desirable. Exactly, you cannot do that; and memory is not compressible, so either you can't do anything, or you have to kill. There is no other way. Currently, yes, if I understood you correctly. So I used quota like three months ago, so I don't know the latest status, but one difficulty I had using quota is that once I set a quota and then I want to know the status of the quota, I mean, how much has been satisfied now, there is no straightforward API. Okay, so the question is: once you set the quota, there is no straightforward API to understand whether the quota is satisfied.
How much has been satisfied? I set it for eight CPUs; how much is currently satisfied? Six are already gone, two are left. Okay: there is no easy way to understand how much of this quota, how many of these resources, are already used. Well, we thought about that, and currently you have to query two endpoints. The reason why we haven't done it, and actually there is some history behind it: I was arguing for exposing all this information. I was also arguing for giving notifications when you are about to reach your quota, or when quota cannot be met. We decided not to do that, because it's probably not Mesos's job to give you notifications; you probably know better, and you have your own monitoring system. Another reason why we didn't put it into the same endpoint is that there should be one source of truth. The quota endpoint gives you the status of the quotas that are set, and if you would like to get status and statistics about your tasks, you should query another endpoint. The problem is, if you take part of this information and deliver it as part of the quota endpoint, how do you guarantee that in the future this information doesn't go out of sync? Then people will ask: okay, if I hit this quota endpoint, is this set of running tasks the same as I get from the tasks endpoint? That's why we decided not to put it in the same endpoint. You don't even have to give the list of running tasks when you hit the quota status endpoint; you could just say that of 8 CPUs, 6 have already been used and 2 are left, something like that. You don't have to give more granular resource information, just the overall numbers. Yeah, it's definitely possible. Because you seem to calculate that every time you do an allocation, so it should be easy to simply serve it out. Yeah, it's definitely possible.
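Doing what the questioner asks yourself, combining the two endpoints, might look roughly like the sketch below: take the quota status (what is set per role) and a roles/state-style view of current allocation (what is used per role), and subtract. The JSON shapes here are simplified examples of that idea, not the exact master output; check the endpoint documentation for your Mesos version.

```python
# Sketch: compute remaining quota for a role by combining the quota status
# endpoint (what is set) with per-role usage (what is allocated).

def cpus_of(resources):
    """Sum the scalar 'cpus' values in a Mesos-style resource list."""
    return sum(r["scalar"]["value"]
               for r in resources
               if r["name"] == "cpus" and r["type"] == "SCALAR")

def quota_remaining_cpus(quota_status, role_usage, role):
    granted = next(cpus_of(q["guarantee"])
                   for q in quota_status["infos"] if q["role"] == role)
    used = role_usage.get(role, 0.0)
    return granted - used

# Example data in the simplified shapes described above:
quota_status = {"infos": [{"role": "production", "guarantee": [
    {"name": "cpus", "type": "SCALAR", "scalar": {"value": 8}}]}]}
role_usage = {"production": 6.0}

print(quota_remaining_cpus(quota_status, role_usage, "production"))
```

With 8 CPUs set and 6 allocated, this reports 2 remaining, which is the "six gone, two left" summary from the question. As noted below, such a number is only a snapshot: by the time you read it, the allocator may already have moved on.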
There are other opinions, though; some people think it's probably a bad idea, because it can give you the feeling that you have 2 CPUs which are by now already allocated to someone, and it's just that the endpoint hasn't been updated yet. But I would encourage you: we can talk about this later, or we can start a thread on the dev list and come up with a proposal, so that other people, not just me, can comment on it. That would probably be the best way, the Apache way. Any other questions? Yes, go ahead. The question is whether I have a roadmap or a timeline for revocable by default. I don't have an exact timeline. I can definitely say that we all want it. It will be very hard to transition, because almost all clusters now operate under a different assumption. So we were thinking about the best way to transition, and there are different strategies; for example, you make 80% of the cluster non-revocable by default, and the next year 70%. I think it will take some time for this to happen. Most probably we will introduce a switch, like a master flag, for how to treat resources by default, so that people who start new clusters and are interested in this feature can use it. That's probably the way. And I think it doesn't really make sense to introduce this switch until after we have finalized preemption and the handling of revocable resources. That work is currently in progress, so once we are sure that everything works for clusters with revocable-by-default switched on, we will introduce the switch. I mean, I could, but I don't want to. Okay, if there are no other questions, I think we're done. Thank you.