Sorry guys, we're a little late, but I think we're going to try and get started now. Our session is enhancing the OpenStack projects with advanced SLA and scheduling. Basically, we're going to talk about the scheduler a bit. My partner in crime is Sylvain Bauza. He's a senior software engineer at Red Hat. I am Don Dugger. I'm also a senior software engineer, but I'm at Intel. I personally have been working on OpenStack since the Diablo release. Go figure that. Sylvain, how long have you been involved in this? I've been involved in OpenStack maybe since Diablo as an operator, and then I moved to being a developer. Okay. So we have some background on the subject anyway. So anyway, let's talk about the scheduler. We're going to start out by going over a little bit of scheduler background, kind of a scheduler 101, to give everybody some idea of what we're talking about, and I'll go over that. And then Sylvain will start talking about some of the future stuff that we want to do and maybe give you some ideas about where we're going in the future on the scheduler. I should point out all the good things from this presentation come from Sylvain. He's the one who put most of the content together. Any problems with it? Blame me. So, SLA: what are we talking about here? The scheduler is kind of the first point where you have an agreement between a service provider and the customer. This is where the customer wants something to run, and we guarantee that that thing will run. So the scheduler is kind of the linchpin of trying to provide that kind of service agreement. And that's why the scheduler is a fairly important piece of the entire OpenStack project. So having said that, let's go under the hood a little bit and look at the details of exactly how the scheduler works. This is kind of the big picture. Probably everybody who has anything to do with OpenStack has seen this picture before.
We're showing the different components that make up at least the Nova part of it. And the important thing to take from this picture, I think, is those two things in the center, the queue and the database. The queue is a message queue, which is the way the different components talk to each other, okay? And the database is the persistent storage, where information is maintained. And everybody has access to all that stuff. You can look at the lines to see how everybody communicates with each other and whatnot. It can be a fairly busy graph, but that's just the way it is. When we talk about the scheduler, there's kind of an interesting point to note. Almost everything in OpenStack has a pluggable architecture. You can take a component out and plug in a different component, and so on and so forth. The same thing can be said about the scheduler. We don't have just one scheduler. We have many schedulers. The filter scheduler is actually what I'm going to be talking about a lot, because it's kind of the main workhorse for what we're trying to do. But we have some other schedulers, like the chance scheduler, which is one of my favorites. It basically selects a host at random, and that's where it tries to start your instance when you boot one. It's just completely random where it'll wind up in your cloud. Surprisingly enough, it works fairly well in the grand scheme of things, but not as well as something a little more knowledgeable. The caching scheduler is an attempt to, rather than hit the database all the time, save some of that information in memory and cut down on the overhead associated with looking things up in the database. And having said that, if you want to provide your own scheduler, that is perfectly feasible. You as a provider, not necessarily you as a customer. There is a parameter line in nova.conf. There's an example of it right there.
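The chance scheduler really is as simple as it sounds. Here's a minimal Python sketch of the idea (the host names are made up for illustration; this is not the actual Nova code):

```python
import random

def chance_select_host(hosts):
    """Pick a host completely at random, as the chance scheduler does:
    no filtering, no weighing, just a random choice."""
    if not hosts:
        raise RuntimeError("No hosts available")
    return random.choice(hosts)

# Hypothetical host names, purely for illustration.
available = ["compute1", "compute2", "compute3"]
print(chance_select_host(available))
```

Over many requests this spreads load roughly evenly, which is why it works better than you'd expect, but it can't avoid a host that's nearly full.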
And you specify where the scheduler code lives with that parameter. And you can change it to your own. So if you think you've got a better scheduling idea, knock yourself out. You can do just that. So let's talk about the filter scheduler. As I say, it's kind of the main workhorse of what we're doing right now. And it basically consists of two parts. There's a filter, which is basically a yes-no question. And there are multiple filters that can be lined up in a row. Each one provides a yes-no answer about whether or not it's acceptable to schedule an instance on this particular host. And what happens is every time a filter says no, that host is thrown out of consideration and will no longer be considered for this particular scheduling request. All the hosts that get a yes out of a filter get passed on to the next filter. So you can have multiple filters: do you have enough RAM? Yeah, that's fine. Are you in the appropriate aggregate? Yeah, that's fine. Do you have the appropriate affinity? Yeah, that's fine. Those are three different filters that would run. When you're all done, you will have pared down, in this particular situation, six hosts. You've thrown out two of them, so you only have four hosts left. Now the question is, which of those four hosts do you want to use? There's a weighing function that will take those hosts and rank them in order from the most appropriate to the least appropriate. In this case, 5, 3, 1, 6: 5 is the most appropriate and 6 is the least. So this is how, hopefully, we come up with the best scheduling decision we can, based upon the information and what the user is looking for at this point in time. So when we talk about filters, as I was saying, there are multiple filters available. They are applied in sequence, one after the other, in the order of the filters that you choose. You can actually pick and choose from a set.
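To make the two phases concrete, here is a minimal Python sketch of the filter-then-weigh pipeline. The host data, request, and filter logic are all made up and heavily simplified; real Nova filters like RamFilter and AvailabilityZoneFilter do considerably more:

```python
# Made-up host state: free RAM and availability zone per host.
hosts = {
    "host1": {"free_ram_mb": 2048, "az": "zone-a"},
    "host2": {"free_ram_mb": 512,  "az": "zone-a"},
    "host3": {"free_ram_mb": 4096, "az": "zone-b"},
    "host4": {"free_ram_mb": 8192, "az": "zone-a"},
}

request = {"ram_mb": 1024, "az": "zone-a"}

# Phase 1: filters. Each is a yes/no question; any "no" removes the host.
def ram_filter(host, req):
    return host["free_ram_mb"] >= req["ram_mb"]

def az_filter(host, req):
    return host["az"] == req["az"]

filters = [ram_filter, az_filter]
candidates = {name: h for name, h in hosts.items()
              if all(f(h, request) for f in filters)}

# Phase 2: weighers. Rank the survivors, best first (here: most free RAM).
ranked = sorted(candidates, key=lambda n: candidates[n]["free_ram_mb"],
                reverse=True)
print(ranked)  # ['host4', 'host1']: host2 failed RAM, host3 failed the zone
```

The filters prune, the weighers order what's left, and the scheduler takes the top of the list.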
You'll notice down here, you have to tell the scheduler where to find its filters. Basically, it's a directory somewhere containing all the filters. And then you have to enable which of those filters you want to use in your environment. And so you can enable these filters in the order that they're given here. It would first try the retry filter, then go through the availability zone filter, and so on and so forth. You don't have to use them all, but you do have to use at least the retry filter, for reasons that we'll get into in a little bit. But that's the basic idea behind it. And you can go to this particular URL to get more detailed information that you can examine at leisure. So, as I said, after you've come up with a set of hosts that are acceptable, you then run those hosts through a set of weighers. And the weighers, based upon some metric, be it the amount of memory available, the number of instances running on this particular host, whatever metric is appropriate for that weigher, will rank the particular host and come up with a value for it. And then you'll come up with a list of your hosts at the end, from the most acceptable to the least acceptable host to try and satisfy that request. And that's the basic idea behind the filter scheduler. You filter things to get rid of the hosts that you really don't want for whatever reason, and then you weigh them to find out which of the remaining hosts are the most appropriate ones to be utilized. As I said, there are different weighers available. One of the weighers we have right now is a RAM weigher. It asks, which particular host has the most free RAM available? That would potentially be a good one. Or maybe the host that is using its IO bandwidth the least is the one you want. You can come up with different kinds of weighers to throw into this. And again, the weighers are completely pluggable, just like the filters are pluggable.
You can change them as you wish. Resource tracking is fairly intriguing because, in order to do the weighing, and in many respects to do the filtering, you have to know information about all the hosts in your system, okay? So the resource tracker is the way we do that. And basically what happens is, on a periodic basis, where the period is 60 seconds, the resource tracker will get information from the hypervisor about this particular host: how much RAM is available, how many instances are running, whatever. And then the resource tracker will send that information to the database through the conductor, since the compute nodes can't write to the database directly. I should point out that we did make one minor optimization: every time the resource tracker goes to update the database, it checks to see whether or not the information has changed. And if the information hasn't changed at all, it doesn't bother to update the database. It turns out it used to do just that, and there was an awful lot of updating the database with the same values, which was a little bit silly. So we did get rid of that problem, okay? So now let's examine in a little more detail exactly how a boot request, i.e., the instantiation of an instance, works. The request comes in from the API, goes to the conductor, and then the conductor sends that request off to the scheduler saying, tell me the best host to run this on. So the scheduler does its work. It might be the chance scheduler, it might be the filter scheduler, whatever scheduler you've configured into your system. The scheduler will eventually come back and say, run it on this particular compute node. So the conductor will then send off to the compute node: okay, fine, start this instance.
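The change-detection optimization just described can be sketched in a few lines. The names and data here are illustrative stand-ins, not the actual resource tracker code; the database write is stubbed out as a list append:

```python
# Sketch of the resource tracker optimization: only push an update to
# the database (via the conductor) when the stats actually changed.

_last_reported = {}   # last stats we sent, per host
db_writes = []        # stand-in for conductor/database calls

def report_host_stats(hostname, stats):
    """Called periodically (every 60 seconds in Nova) per compute node."""
    if _last_reported.get(hostname) == stats:
        return False                      # nothing changed: skip the write
    _last_reported[hostname] = dict(stats)
    db_writes.append((hostname, dict(stats)))
    return True

report_host_stats("compute1", {"free_ram_mb": 2048, "instances": 3})
report_host_stats("compute1", {"free_ram_mb": 2048, "instances": 3})  # skipped
report_host_stats("compute1", {"free_ram_mb": 1024, "instances": 4})
print(len(db_writes))  # 2: the unchanged report was never written
```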
And the conductor of course updates the database to indicate that I've started an instance over on this host, and the hypervisor on the host starts the instance, and everybody's happy and all's right with the world. What happens if it doesn't work? What happens if the hypervisor, for whatever reason, can't start that particular instance on that particular node? What will happen is the hypervisor will send a response back to the conductor saying, you need to reschedule because I can't do it. So that's where the retry filter is a crucial member of the filter chain, because the message going back to the conductor to say retry will also say, remove me from consideration. Okay, so remember the yes-no filters. The retry filter will say, okay, that particular host is out of contention, so we won't even consider him. We'll do all the other things that we did. Now find the best host, go back and do everything that we did before, and hopefully this time, on compute node two, you'll wind up executing your instance, okay. So that is kind of the background. As I said, that's scheduling 101 for the current scheduler and whatnot. Sylvain will now talk about how you might want to partition your cloud, and some future stuff about where we think we're going with scheduling in Nova. Thank you, Don. So before going into the details about all the future that we'll have for the scheduler, I'm taking the opportunity to discuss with you what is possible with the scheduler if I want to split my cloud. The main problem that we have is that we know that if you have multiple compute nodes within your cloud, then there's a high chance that you would like a specific set of compute nodes to be targeted by your boot requests. So basically, if we're talking about, I want to direct this boot request to this specific set of compute nodes, there are multiple solutions with regards to the scheduler.
But when we talk about segregation of our cloud, there are actually five different options. The first one is using regions. But since regions are only something related to Keystone, not related to the scheduler at all, sorry about that, I won't explain exactly how regions work, because they are a pure Keystone concept. But for your purpose, there are actually a few more solutions if you want to split your cloud. The first one is using what we call cells. There was a presentation this morning about that; I recommend you look at the video if you did not have the chance. The second one is using what we call aggregates. And the third one is using what we call availability zones. I will explain those further later in the slides. One last possible segregation is to use server groups. Please bear in mind that server groups are not related to compute nodes. That's the main difference between server groups and the others. Using server groups means that you want to provide a separation between instances, not compute nodes. That's basically the idea. I will come back to that later. So let's talk about aggregates. Like I said, this is about grouping compute nodes. This is totally implicit and not explicit, which means that an end user who doesn't have admin rights for the API doesn't know about the aggregates at all. They are totally invisible to him. How does that work? Aggregates are a concept in Nova with a specific API: you can create an aggregate, read an aggregate, and add a host to an aggregate. Plus, there is also something holding the information of the aggregate, which we call metadata. That's basically keys and values relative to that aggregate, which means that all the compute nodes within that aggregate will carry those tags. For example, I just listed a few filters related to those aggregates. How do they work?
Basically, when the request comes into the scheduler and hits the filter, the filter looks at the aggregates: it checks which aggregates the host it's currently verifying belongs to, and then it verifies whether the metadata relative to that aggregate is valid or not. If it's not valid, it rejects the host. If it is valid, it accepts the host. Availability zones. How many people here know Amazon availability zones? I mean, AWS availability zones. A few hands, okay. Just keep in mind, this is totally unrelated, which means that whatever you're thinking about availability zones in AWS, this is not the same thing. I won't explain those AWS objects; I would rather explain availability zones in OpenStack. An availability zone is a way to provide visibility to any user of an aggregate, which means it's basically an aggregate which has a specific metadata key called availability zone. Well, that's not exactly true. The main difference between an aggregate and an availability zone is the fact that you can actually add the same host to different aggregates, but you cannot add the same host to multiple availability zones. Which means that, for example, if I'm a user and I want to use a specific set of compute nodes, but I don't know exactly which ones they are, then I will use the --availability-zone flag in my boot request, and my operator and I can be pretty sure that I will get one of the hosts belonging to the aggregate carrying that metadata key. Let's talk about server groups. As I explained, availability zones and aggregates are compute-related: you add a host to an aggregate, or you add a host to an availability zone. Well, that's not true for server groups.
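The aggregate lookup just described can be sketched roughly like this. The aggregate names, host memberships, and metadata keys are all invented for the example; real aggregate filters in Nova also handle things like flavor extra specs:

```python
# Sketch of aggregate-based filtering: find the aggregates the host
# belongs to, then check their metadata against what the request needs.

aggregates = {
    "ssd-agg":  {"hosts": {"host1", "host2"}, "metadata": {"disk": "ssd"}},
    "prod-agg": {"hosts": {"host2", "host3"}, "metadata": {"env": "prod"}},
}

def aggregate_filter(host, required_metadata):
    """Accept the host only if one of its aggregates carries all of the
    required key/value pairs; otherwise reject it."""
    for agg in aggregates.values():
        if host in agg["hosts"]:
            if all(agg["metadata"].get(k) == v
                   for k, v in required_metadata.items()):
                return True
    return False

print(aggregate_filter("host1", {"disk": "ssd"}))  # True: in ssd-agg
print(aggregate_filter("host3", {"disk": "ssd"}))  # False: only in prod-agg
```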
The main difference is that it's also another API in Nova, which is currently a work in progress, because there are still some concerns about race conditions that we could have. Since those are instances, you can probably migrate them, so we want to make sure that you won't run into problems if you're migrating your instances. So for the moment, what you can do with instance groups, or server groups if you prefer, is create an empty server group, and then you can say when booting an instance that you want to have this specific server group UUID for your request. The thing is, when you create your server group, you define a policy for it. At the moment in Nova, there are two policies. The first one is anti-affinity and the second one is affinity. What does that mean? That means that if, for example, your server group is created with an anti-affinity rule, Nova will make sure that when you are booting two different instances using that specific UUID, at the end, your instances won't be on the same compute node, which is pretty cool if you want to make sure that specific instances are not on the same host. Or, for example, for some specific use cases, you would like to make sure that, even when migrating, you will have two distinct instances located on the same host. Then you define another server group with the affinity policy, and Nova will enforce that, and there will not be any possibility to have the two instances on two separate compute nodes. Last point: cells. Well, I won't paraphrase what has been said already, I will just give you a brief overview of what cells are. Cells are a way to shard your cloud using separate databases and separate message queues.
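As a rough sketch of how those two policies translate into a scheduling decision (the group structure and host names here are invented; in Nova this is the job of the ServerGroupAntiAffinityFilter and ServerGroupAffinityFilter):

```python
# A server group, simplified: a policy plus the set of hosts already
# running a member of the group.
group = {
    "policy": "anti-affinity",
    "members_hosts": {"compute1"},
}

def group_policy_filter(host, grp):
    """Yes/no answer for one host against one server group's policy."""
    if grp["policy"] == "anti-affinity":
        # Reject any host already running a member of the group.
        return host not in grp["members_hosts"]
    if grp["policy"] == "affinity":
        # Only hosts already running a member qualify; any host is
        # acceptable for the group's very first instance.
        return not grp["members_hosts"] or host in grp["members_hosts"]
    return True

print(group_policy_filter("compute1", group))  # False: already has a member
print(group_policy_filter("compute2", group))  # True: safe for anti-affinity
```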
The main thing is that you will have a parent cell which will actually proxy the requests down to the child cells by using a separate API. For that purpose, in order to select a specific cell, there is not only a filter scheduler driver and a scheduler on the child cell, there is also a specific scheduler called the cells scheduler, running on top of the parent cell, just for making sure that it can actually pick a cell. Like I said in my slide, cells v1 are considered experimental in Nova. Why? Just because there is not exact feature parity with what the regular Nova API provides, in terms, for example, of security groups, and that means that we prefer to leave it as experimental, just to make sure that you will be aware of its limitations. I would just like to mention one thing: since we know that this is experimental, we are working on a separate effort called cells v2, which will take into account all the limitations that we have, and we are totally redrawing how it will work, and part of the work is planned for Liberty. At the URL here, you'll find all the motivations for that effort, and you will find, like it says, the manifesto for that effort, and I hope that you will have some remarks if you're interested. So now that I've presented the concepts for segregating your cloud, I thought it would be interesting for you to know the possible drawbacks, the possible limitations that the scheduler has. I will take these as exhibits that something has to be changed within the scheduler. The first one: how many people have had problems when booting a request, and how many people have tried to troubleshoot a failing boot request? Have you tried this without using the debug flag? I mean, there is a huge problem there that we are aware of, and that's one of the things that we are planning to work on for Liberty.
The second one: have you ever felt that even if your scheduling request was valid, there were some cases where the compute node was denying your boot request? Like Don explained, we heavily use the retry filter because, and that's really important to understand, it is by design: we consider it a trade-off that the scheduler does not lock the request when it comes in. When you have concurrent requests coming into the scheduler, there is at the moment no locking system, for performance reasons. So at the end, two distinct concurrent requests could end up on the same compute node. That's why we have a specific system called claims in Nova, and that's why we have the retry filter: because we have to account for that. The main problem is that this probably gets worse if you want to have two separate schedulers running at the same time. In general, what's recommended is to use one single scheduler, unless you know exactly what you're doing, because at the moment, two schedulers don't share their state. So one scheduler doesn't know that the other picked a host. That's another aspect that we would like to work on for Liberty, but I will come to that later. A third problem is scheduler performance. Why? Like Don explained, all our requests are verified by checking the status of the hosts in the DB, which means that the scheduler pays that cost for every request. The more requests you have, the worse the scheduler's performance will be. So that's exactly why the caching scheduler is there, since the caching scheduler reduces the number of calls made to the DB. That said, again, that's a trade-off, because as the scheduler won't refresh its state from the DB for every request, there could be, again, some race conditions. So keep that in mind.
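The interplay between a stale scheduler view, the compute-side claim, and the retry can be sketched as follows. Everything here is simplified and illustrative: in Nova the claim happens on the compute node and the RetryFilter excludes previously failed hosts on the rescheduled request:

```python
hosts = {"compute1": 1024, "compute2": 2048}   # real free RAM per host
stale_view = dict(hosts)                        # the scheduler's stale view

def schedule(req_ram, excluded=()):
    """Pick a host from the (possibly outdated) view, skipping hosts a
    previous attempt already failed on."""
    candidates = [h for h in stale_view
                  if h not in excluded and stale_view[h] >= req_ram]
    return sorted(candidates)[0] if candidates else None

def claim(host, req_ram):
    """The compute node re-checks its real resources before accepting."""
    if hosts[host] >= req_ram:
        hosts[host] -= req_ram
        return True
    return False   # lost the race: ask the conductor to reschedule

def boot(req_ram):
    excluded = []
    while True:
        host = schedule(req_ram, excluded)
        if host is None:
            raise RuntimeError("No valid host found")
        if claim(host, req_ram):
            return host
        excluded.append(host)   # what the RetryFilter effectively does

first = boot(1024)    # claim succeeds on compute1
second = boot(1024)   # stale view still offers compute1; claim fails, retry
print(first, second)  # compute1 compute2
```

The second request is "valid" from the scheduler's point of view, yet the compute node rejects it, and only the retry makes it land somewhere else: exactly the behavior the talk describes.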
That really depends on your cloud, because there is a possibility to reduce the number of DB requests, but that means that the number of retries will be higher. Keep that in mind; you have to benchmark it for your cloud. The last thing: the technical debt. I mean, the scheduler is maybe one of the smaller components that we have in Nova, and I really like working on the scheduler because it's fairly straightforward to work on. But the main problem is that it's not exactly the scheduler itself which has, in my opinion, the technical debt. The main problem is that, like I tried to explain and like Don said, there is some back and forth between Nova and the scheduler for booting a request. And that's the flaw, because there is some synchronization between Nova and the component at the end. And that's why we want to make sure that this interface, and the way requests are expressed to the scheduler, are well written and well defined. That's something that is currently a bit limited in the scheduler. So, I know that I just drew a bad portrait of the scheduler, sorry about that. So maybe let's get some better insight, in order to let you know what we're currently working on. We put down a few notes about what was delivered for Kilo, and on the next slide, I will show you what will hopefully be the next efforts that we'll make for Liberty. So, about Kilo. Well, you probably understood that due to the technical debt, we were struggling with the interface between Nova and the scheduler. It was a bit unclear. So that's why we actually worked on trying to draw a line between the scheduler and Nova. For example, the filters were previously looking at the DB for some concepts like aggregates and instances to know their status. We modified that. That's fairly different now.
Now, the compute nodes report the status of their aggregates and their instances to the scheduler. In my opinion, that's a major change, because now there is a clear line for how the scheduler learns that state. But we also drew a second line by creating an internal Python client, not used externally, which means that if I want to talk to the scheduler, I won't talk to the RPC API directly. Instead, I use the scheduler client. So, hoping for Liberty. I said hoping. We all know that we have the design summits and some talks have to happen. So here, I just tried to provide a few notes about what we would like to talk about during the summit. During the summit, we'll still be focusing on the technical debt. Even though we have provided some split between Nova and the scheduler, we'll continue to work on that. We'll also work on a specific problem that we know about: how we store the request information for the scheduler, because at the moment, it is not persisted. For example, if you are using a scheduler hint, the information is not persisted, which means that you could end up with an issue when you want to migrate. Since the migration doesn't know the previous scheduler hints, the scheduler has no way to know if it can enforce the previous request. Like I said previously, remember that you could be in trouble if you want to troubleshoot a request. That's why an effort will be made to provide better visibility for operators about why a request is failing, at least providing you, for a request, the UUID of the instance which is failing. I mean, that's something that needs to be discussed during the summit and maybe later on, but the idea is that you will have some way to understand exactly why your request failed.
And last point: like I explained, remember that a host can only be in one availability zone, but it can be in multiple aggregates. This situation is leading to some problems within the code. What we want to do is take the opportunity of the design summit to discuss all the problems that we have with availability zones and see if this model is still valid or not. Also, we'll be discussing the relationship between an instance and an availability zone, just to see if that is still valid or not, since availability zones are related to hosts. One further effort will be about trying to see how we can actually have multiple schedulers in Nova. That's why we'll try to discuss whether it's possible to have multiple schedulers sharing the same state. We'll also be discussing the possibility for the compute nodes to define their own allocation ratios. I don't know if you guys have played with the allocation ratios. At the moment, this is a configuration flag for the scheduler, which means that you cannot have different allocation ratios for different compute nodes, which is a bit bizarre, since you could probably have different workloads on different compute nodes. So what we want to do is move those allocation ratios to the compute nodes, so then you would be able to set different flags for those allocation ratios, and of course, have a way to schedule based on that. We'll also be working on, I would say, a dry-run command for the scheduler. For example, say that you want to migrate an instance. What would be cool for you is maybe, before migrating, to just verify that it will work. At the moment, the only way to verify is to do the migration. If the migration fails, then you get an error. What we would like to have is a way for users to know exactly whether the request is valid before doing the action.
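To see why per-compute allocation ratios matter, here is a tiny sketch of how an overcommit ratio changes the capacity the scheduler considers available. The numbers are made up; the formula (physical capacity times the ratio, minus what's already committed) is the idea behind Nova's ram_allocation_ratio:

```python
def schedulable_ram_mb(total_mb, used_mb, ram_allocation_ratio):
    """RAM the scheduler would consider still available on a host:
    physical RAM scaled by the overcommit ratio, minus committed RAM."""
    return total_mb * ram_allocation_ratio - used_mb

# The same physical host under two different overcommit policies:
print(schedulable_ram_mb(16384, 12288, 1.0))  # 4096.0: no overcommit
print(schedulable_ram_mb(16384, 12288, 1.5))  # 12288.0: 1.5x overcommit
```

With a single global flag, both of these hosts would have to use the same ratio; moving the setting to the compute node lets a latency-sensitive host run at 1.0 while a batch host runs at 1.5.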
And one last point I'd like to discuss is possibly splitting out the scheduler. Wait, what? The scheduler split? Possibly some of you guys have heard the word Gantt. I won't explain that further because that's another slide, but there are some talks within the community about whether it is necessary or interesting, for various reasons, to split out the scheduler. And that brings me to my last section: the manifesto for better scheduling. Like we said in the title, we promised advanced SLA; for the moment, I don't like the word advanced. So I will explain to you why I think we need an advanced scheduler. Let me tell you some stories. What if I want golden VMs? What if I want highly available VMs? I know that some work is currently being done on providing highly available VMs, but perhaps I would like a possibility within the scheduler to have this kind of golden rule that I want to enforce. About affinity: I discussed server groups. For server groups, you can define affinity and anti-affinity, but maybe affinity between instances is not exactly what I want. Maybe I want affinity between my instance and my volume. Or, for example, my NIC: if I have a network card, maybe I want to make sure that I will be using a compute node with a specific bandwidth. Or if I want to boot containers, I want to make sure that perhaps they are pod-related, which is a pure container concept. Also, why should I send all the information to the scheduler if it doesn't need it? For example, say that I have a cloud which is not using aggregates. Why should I send the scheduler information about aggregates? Or say I don't care about CPU, which is probably not insane. Why should I provide information about CPU to the scheduler? That's something we should also be addressing.
And one last point: when I'm talking with some people during the summits, some people say, well, the scheduler is pretty cool, but I don't want to schedule now. I want to schedule in the future. I want to be able to say to the scheduler, please give me an instance. Well, please schedule me, not give me. Please schedule me an instance. Please make sure that there will be some room for booting a request in one month. Since, for example, if I'm a cloud operator and I know that my workloads are very different based on specific timing, maybe I could have some way to express that to the scheduler. Well, I would like to do some capacity planning. That's something that we would like to add. Based on what I said, and based also on what Don explained, what we need is not actually features. What we need is a way to implement those features, because we're not lacking ideas. I think what we need to do is find a way to improve our velocity for adding more features. Given that, the main problem is also the technical debt. Since, you know, it's like pulling on a wire: if you work on one part of the scheduler, you probably cause some problems elsewhere. We need to be more agile. We need to be more flexible. The main problem is: okay, that's cool, we say we need to be more flexible, we need to have better velocity. That's nice, but how do we do that? That's, I think, the most crucial discussion that we'll have at the design summit. How can we do that? I personally have an opinion, but I think that's probably something that needs to be settled during the design session. But the thing is, we need to discuss it, and now. That's why there was an initiative called Gantt, where the idea was to split out the scheduler. To be honest, that's something which still needs to be discussed, so that's why we're coming back to the idea of a Nova scheduler.
Probably that's the best idea. If we're able to provide those features that I mentioned within Nova, maybe that's fine. But then we have to discuss it. And like I said in my last slide, in my last point, I think the most crucial thing is not losing the operators' feedback. I mean, you guys are the ones who use the scheduler. To me, that's the most important thing. I need to understand what your needs are. So I leave it to you. You know that the scheduler team is meeting every Tuesday, and maybe that's my fault, I didn't mention that. If you have something to say, don't hesitate. Jump into the weekly meeting. If you don't know when it is, come to the IRC channel and discuss with us. We are always keen to discuss. So that brings us to the end of our presentation. If you have any questions, ask me. Oh, ask Don, sorry about that. No, ask us. Any questions? We're over time anyway, so thank you guys. Thank you.