Hi, welcome to a session on scheduling. I'm Gary Kotton from VMware. I've been working in OpenStack for just over three years now; prior to joining VMware, I was working at Red Hat. I'm a core reviewer in the Neutron project, and in Nova I've been contributing for nearly a year and a half now.

And this is Gilad. I'm Gilad Zlotkin. I'm working for Radware, and I've been working in OpenStack for maybe two, two and a half years. At Radware, we mainly focus on network services, on Neutron, on LBaaS. And actually, our LBaaS solution brought us to explore scheduling requirements, and that's how I became involved in the Nova scheduler as well. I think we've been working together on that for the second year now. Yeah. OK.

So actually, if any of you attended our session at the Hong Kong summit, you may recognize two or maybe three of the slides. There is new material besides that; we are basically continuing the work we did for Havana, and we will share with you what has already been done and what we are planning for the roadmap going forward.

OK. So the scheduling we are talking about is all about moving away from best effort. Best effort is not something that mission-critical and performance-critical applications can live with when deployed in the cloud. The more we analyze the requirements of mission-critical and performance-critical applications, like availability, performance service levels, and security, the more we see a requirement for a Nova scheduler, a Neutron scheduler, and a storage scheduler, and you will see how those things tie together.

When we talk about mission-critical applications, we are mainly focusing on our customers' attempts to migrate existing mission-critical applications to the cloud. Some of our customers are building applications from scratch, very much tailored to run on the cloud. But we also see customers looking to deploy existing applications, built as multi-tier systems with a legacy fault-tolerance model, and they would like to migrate those applications to the cloud without changing them. This also applies to the scheduling policies we need to apply in order to help those customers migrate applications without changing them.

OK. So we are talking about availability. There are actually various definitions of what an availability service level means. We like to see two different levels, two different layers. The first one is fault tolerance: the application continues to be available after the first fault, and some applications may be built to sustain even a second fault. High availability usually means that the application recovers from a fault but is unavailable for a very short period of time; typically it means restarting the application from a persistent state. And disaster recovery means the application restarts on a different site, typically from a persistently replicated state.

When availability is addressed from a cloud point of view, the availability zone is usually the main tool, and it is something that allows you to deal with all levels of availability. But the availability zone is traditionally more related to disaster recovery, actually, although you can see many applications using availability zones to address fault tolerance as well. We will see that sometimes an availability zone is not sufficient; you will need a more sophisticated scheduling policy to ensure fault tolerance.
Performance: when we talk about performance, there are usually two main metrics, transaction latency and transaction throughput. Transaction latency has to do with network connectivity, and it also has to do with compute capability. Bandwidth likewise has to do with both network capacity and compute capability, and we will see how it relates to scheduling.

The last one is security. There are several aspects of security, and most of them can be addressed by scheduling: data privacy, data integrity, and denial of service. It is also a combination of network connectivity and network services that protects an application from denial of service. Those are the service levels we are looking at when deploying mission-critical and performance-critical applications. So, next.

You may be wondering what all this has to do with the Nova scheduler, the network scheduler, and the storage scheduler. So let's start to explore how it relates to scheduling. For example, availability means that you may want anti-affinity. If you deploy, say, two instances of a database for fault tolerance, you don't want those two VMs to be placed on the same host, because if you lose that host, you lose both copies of the data and you don't really gain any fault tolerance. So availability mainly relates to anti-affinity, and we will explore anti-affinity; it was actually already implemented as part of Icehouse, and we will talk about that.

Next is performance. Performance is mainly about network proximity. You may want to place VMs that depend on network capacity closer to the network, on hosts that are better connected. For example, on the Cisco UCS system, there are hosts that are better connected to the network, specifically to deploy network services, than other hosts that sit deeper in the network tree. So you want to take those network proximity considerations into account when you deploy applications that are sensitive to network capacity and latency.

Another aspect is host capability. For some customers, the cloud environment is not homogeneous. You may have different types of servers, and newer servers may be more capable than older ones. You may need to take those capabilities into consideration when you are deploying VMs. For example, if you have a multi-tier application, you may want to put the database on a server that has more memory capacity, faster memory, and a stronger CPU. Currently in the Nova scheduler, at least up to Havana, each server could only be chosen on a best-effort basis.

The same goes for storage proximity, and we will show some examples; a database may be a good one. You may want to place the database instance on a host that has closer connectivity to your SAN or to your shared storage, and we will show examples of that as well.

Regarding security, it is all about isolation: resource isolation and exclusivity. Exclusivity can be either on the compute side or on the network side, and we will show examples of that too. So as you can see, availability, performance, and security map very well to different strategies or policies for scheduling. Next.

Let's use this as an example. This is a typical three-tier (or two-tier, or multi-tier) application. You have two instances of a load balancer for availability, for fault tolerance; you have several web server VMs; and you have two database instances.
And when you are deploying this seven-VM subsystem, you need to take the interrelationships between the components into consideration. What we suggested, I think back in Grizzly, was that you cannot just schedule each one of those VMs independently and hope for the best. We will show the example. If you do, next slide.

So for example, we have three hosts, where the empty boxes mean available capacity, and you just run the Nova scheduler on those seven VMs independently. You may end up with scheduling like this: the two load balancers might both be scheduled on host one, which means that a single host failure is guaranteed to take down the entire application. That is what best-effort scheduling of this application could mean. The alternative is scheduling that takes anti-affinity into consideration, and that is what you would expect anti-affinity scheduling to result in: the application is resilient to any single host failure. As you can see, with any failure here, the application continues to work, and that is exactly the fault tolerance that needs to be achieved when you migrate such an application to the cloud. This is the motivation behind the anti-affinity support we have already implemented in Icehouse. Next slide. This is your slide.

So basically, as Gilad mentioned, towards the end of the Grizzly cycle we came up with the notion of doing group scheduling. The idea was to be able to deploy a multi-tier application in one shot and end up with a highly available, high-performance, very resilient application running in the cloud. Sadly, none of that was really accepted by the community, and we have had to do it in a very piecemeal fashion. So up until now, what we were able to get into the Icehouse release is the initial server groups implementation.

What does that mean? It means that with server groups, we are able to implement anti-affinity. Essentially, for the application Gilad just described, we are able to create three server groups, with anti-affinity among the load balancers, anti-affinity among the database instances, and anti-affinity among the web servers. This basically enables us to have a highly available three-tier application running in the cloud. Throughout the Havana and Icehouse cycles, this was work done jointly with Debo Dutta and Yathi Udupi from Cisco, Mike Spreitzer, and myself.

So essentially, what has been added in the Icehouse release is a new table called server groups, where the user is able to create a server group and assign a policy to it. At the moment, there are two supported policies: anti-affinity and affinity. Prior to the Icehouse release, in Havana, a similar feature was supported, but the admin would have to go and edit the Nova scheduler configuration file and add the two extra filters to it. Now, the affinity and anti-affinity filters are default filters, so out of the box, anti-affinity and affinity scheduling already work. In addition, a very useful property is backward compatibility: if somebody was using this with their Havana release, the same support will continue. So how does one go about booting an anti-affinity setup?
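Here is a minimal sketch using python-novaclient, assuming Icehouse-era APIs; the credentials and IDs below are stand-ins, not real values:

```python
from novaclient import client

# Stand-in credentials and IDs; replace with real deployment values.
USER, PASSWORD, PROJECT = "demo", "secret", "demo"
AUTH_URL = "http://controller:5000/v2.0"
IMAGE_ID, FLAVOR_ID = "<image-uuid>", "<flavor-id>"

nova = client.Client("2", USER, PASSWORD, PROJECT, AUTH_URL)

# Create a server group carrying the anti-affinity policy.
db_group = nova.server_groups.create(name="db-tier",
                                     policies=["anti-affinity"])

# Boot the two database instances into the group via the scheduler hint;
# the anti-affinity filter keeps them on different hosts.
for name in ("db-1", "db-2"):
    nova.servers.create(name=name, image=IMAGE_ID, flavor=FLAVOR_ID,
                        scheduler_hints={"group": db_group.id})
```

The web and load balancer tiers would get their own groups in the same way. (On Havana, the rough equivalent meant enabling the group filters, such as GroupAntiAffinityFilter, in scheduler_default_filters in nova.conf, as mentioned above.)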
In more detail, one makes use of a scheduler hint where the keyword is group, and the user can pass either the name of the server group that was created previously or its UUID. How it all works is that the scheduler knows on which hosts instances are deployed, and according to the scheduling policy, the scheduler will, in turn, decide which hosts are available for selection. Those hosts then go through the scheduler's additional filters to decide on which host to deploy, whether it be the host with the most free resources or the one with the most available capabilities. And thanks to Russell Bryant from Red Hat and contributors from Cisco, we were able to get this implementation done in Icehouse.

So up until now, we have explained what exists in the Nova scheduler today. In order to provide availability, performance, and security, we would like to show different ways of scheduling that can provide enterprise-grade scheduling. We are going to show, through a number of use cases and examples, how one can make use of hierarchical scheduling, cross-scheduling, and rescheduling to have a highly available, high-performance, and secure environment in the cloud.

The first example I would like to discuss is storage and compute cross-scheduling. The example I will use for this is VMware Storage Policy Based Management (SPBM). What does this mean? It means that there are two levels of scheduling that take place. The first is to select the datastore where the virtual disk for the running instance will be stored, and this will be done, in the code we have proposed upstream at the moment, via flavor metadata. Essentially, storage policies can be defined via the SPBM APIs. One could, for example, have a number of datastores: say a super-fast datastore, which could have a gold tag, and a very slow, old, legacy datastore, which could be bronze. So when somebody wants to deploy a VM, they can say that this VM should be deployed, with a specific flavor, on a gold datastore.

In addition to the datastore selection, the specific host that the instance will run on has to be selected too. This basically requires that all of the hosts be connected to the aforementioned datastores. So here there are two levels of scheduling: the first for the datastore, and the second for the host to run the instance. In this diagram, we can see the two parts that take place: the VM that is going to run on a specific host, and the virtual disk that is going to be placed on a specific datastore. These essentially go through the storage policy-based filtering provided by the back-end VMware driver and, according to their profiles, are placed on a specific datastore.

An additional advantage of this approach is that the datastores can be highly available and mirrored, so the data will be replicated: if the host backing one of the datastores fails, the virtual disk will still be available. So in addition to being able to offer preferential service regarding the placement and performance of the disk, there is also high availability in the event of a host failure.
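Since this SPBM integration was still in review upstream at the time, the flavor key below is hypothetical, purely illustrative of the flavor-metadata mechanism being described, reusing the nova client from the earlier sketch:

```python
# Hypothetical wiring for the proposed SPBM support: the extra-spec key
# name here is illustrative, not a confirmed upstream name.
gold = nova.flavors.create("db.gold", ram=8192, vcpus=4, disk=80)
gold.set_keys({"vmware:storage_policy": "gold"})  # hypothetical key

# An instance booted with this flavor would have its virtual disk placed
# only on datastores whose SPBM profile matches the "gold" policy.
nova.servers.create(name="db-gold-1", image=IMAGE_ID, flavor=gold.id)
```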
Another example, closer to home, is performance. Here we want to show that if somebody uses cross-scheduling with storage and compute resources, we are able to achieve improved performance. What that means is that if somebody is attaching an additional volume to an instance, we would like that volume to be as close to the instance as possible; performance for that instance will be a lot better than if the volume is on a datastore that is far away. In addition, one of the things we have seen is that if the Glance image is stored closer to where the running instance will be, the boot time of that instance will be a lot quicker. For example, in the VMware case, if we are able to store the Glance image on a VMware datastore, then instead of copying that image from Glance to Nova, we can do a direct copy on the datastore, which means improved boot times for instances.

In addition to this, we would like to show how rescheduling can also provide high availability and other features. Say, for example, an application is running on a number of hosts and one of those hosts fails. What we would like is to ensure that those instances can be restarted on hosts that are up and running, so that the application remains highly available in the cloud.

Another two examples. The first is distributed resource scheduling. Say, for example, too many VMs are running on a host. One is able to make use of vMotion to move VMs which, say, have a lower priority to other hosts. So the applications that are mission-critical can receive more compute power, and those that are disturbing the high-performance applications can be moved to hosts that are not heavily loaded at the moment.

Another option is to do instance evacuation from a specific host. There are a number of use cases for this. One is that you would like the host to be shut down, say for power maintenance. Another is that you would like to upgrade the software running on the host. For example, take a case where we have a KVM compute node and we would like to upgrade the hypervisor. We are able to make use of live migration to move the running instances to another host, upgrade the software, and then move the instances back to that host if necessary.

Let me show you another example of rescheduling that also relates to availability and performance. This is actually the way Radware deploys load balancing as a service. We deploy a pair of load balancers per tenant, running in a specialized project, and those pairs are connected so they can share state. If one of the load balancers goes down, the load balancing service is tolerant to this failure. However, the service has now lost its spare wheel: it is still running, but it is not fault tolerant anymore. So what we do is orchestrate what we call fault-recovery rescheduling. We basically create a second, failover instance and wire it to the surviving instance, such that the load balancing service becomes fault tolerant again. This is an example of rescheduling for availability.
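Going back to the host-maintenance case for a moment, here is a minimal sketch of draining a compute node with live migration before an upgrade; the host name is an assumption, and the calls are ordinary novaclient admin APIs:

```python
# Drain a compute node before maintenance by live-migrating everything off.
# "compute-1" is an assumed host name; passing host=None lets the scheduler
# pick the targets, so anti-affinity policies are still honored.
doomed_host = "compute-1"
for server in nova.servers.list(search_opts={"host": doomed_host,
                                             "all_tenants": 1}):
    server.live_migrate(host=None, block_migration=False,
                        disk_over_commit=False)
# ...upgrade the hypervisor, then migrate instances back if desired.
```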
The second example is actually rescheduling for performance. Assume the application was deployed with some load balancing capacity, but the application has become very popular and that capacity is insufficient. There is a need to scale up the load balancing capacity, and you want to do that while the application continues to run. What we do in this case is a controlled failover. We create a larger load balancing instance, we connect this larger instance as a failover, we perform a controlled failover to it, and we upgrade the first one. So after this complicated rescheduling and scheduling, we started with two small instances and we end up with two large instances, increasing the load balancing capacity while the application continues to run uninterrupted. Rescheduled instances may actually end up on totally different hosts, especially if we are deploying a larger instance and there is no capacity on the original host, and we still need to maintain anti-affinity between all those instances throughout the process. So this is another example of rescheduling.

An additional example is hierarchical scheduling. What we would like to show here is how to use host exclusivity to get secure tenant isolation. At the moment, through Neutron, one is able to have network isolation, meaning that each tenant can run their traffic on their own networks. But what happens if you have a host that can be compromised, so that one VM running on the host could gain backdoor access to another VM? An additional level of security is to have host isolation. Say, for example, tenant one can only run its instances on a specific host, while tenant two and tenant three can share their hosts. So essentially, tenant one has compute security: in addition to the network isolation, it also has host isolation, which gives it a more robust and secure environment in which its VMs can run. (A sketch of one way to approximate this with today's host aggregates follows below.)

So, just to sum up what we have and where we are hopefully going in the future: as of Icehouse, we have the server group implementation, with anti-affinity and affinity support, and, as mentioned previously, backward compatibility with a Havana installation.

So what does the future hold? I am going to go over a few things that are being discussed at the moment; hopefully at the summit we will be able to discuss and bash them out, and in the coming versions we will be able to provide this kind of support in the scheduler. First and foremost, there are a number of new filters we would like to add to server groups. One of those is network proximity. Say, for example, one host is connected to a specific NIC which is super fast and another host is connected to a NIC which is not that fast, but those two NICs are connected to the same virtual network. If the scheduler knew about those capabilities, the response times, and the number of hops between all of the instances, it would be able to select the host which provides network proximity with respect to the various characteristics of the network. In addition to this, we would like to be able to add rack affinity and anti-affinity. Similar to host affinity, we would like rack affinity so that certain instances could run in the same rack for improved performance, maybe security, et cetera.
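Returning to the host-exclusivity point from a moment ago: while a first-class proposal was still under discussion, one existing approximation in Nova uses host aggregates with the AggregateMultiTenancyIsolation scheduler filter. A sketch, assuming a host named compute-7 and that the filter is enabled in nova.conf:

```python
# Reserve a host for a single tenant: hosts in an aggregate carrying a
# filter_tenant_id key only accept instances from that tenant (requires
# AggregateMultiTenancyIsolation in scheduler_default_filters).
TENANT1_ID = "<tenant-uuid>"  # stand-in
agg = nova.aggregates.create("tenant1-exclusive", None)
nova.aggregates.add_host(agg, "compute-7")  # assumed host name
nova.aggregates.set_metadata(agg, {"filter_tenant_id": TENANT1_ID})
```

Note this isolation is one-way: it keeps other tenants off the host, but by itself it does not pin tenant one's instances to that host.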
And as Gilad mentioned at the very beginning, we would like to be able to take advantage of host capabilities. Hardware vendors regularly come out with new and improved servers. So how are we going to leverage the fact that we have a powerful new server added to our range of hosts, alongside the old legacy ones? How can we enable the scheduler to take advantage of that for special applications that we would like to run with high performance? And how can we use the spare capacity; for example, in Amazon you can run spot instances, so how can we leverage functionality like that?

Another thing being discussed at the moment is something that Mike Spreitzer from IBM proposed: simultaneous scheduling. One of the discussions is whether that will be done at the Heat layer or within the Nova scheduler. The idea, similar to what we mentioned at the beginning of the talk, is to have one initial placement where the scheduler has a holistic view of all of the resources that are available. In addition to this, there is host exclusivity. At the previous summit, this was proposed by Phil Day from HP, and I know that is still in discussion.

A few extra things are in flux and under discussion at the moment. The first and most interesting is the scheduler-as-a-service project, nicknamed Gantt. The initial steps with this will be driven by Sylvain from Red Hat at the summit: essentially to do a forklift of the existing Nova scheduler, and hopefully we will be able to discuss APIs. Going forward, this could become a platform to enable cross-scheduling between a number of projects. At the moment, each project has its own scheduler, mutually exclusive from all the other projects. So, for example, the Nova scheduler can select on which host to run an instance, and the Cinder scheduler can select a volume, and that may completely negate the policy that the end user wishes to achieve. Some back-end drivers, using their own technologies, are able to ensure that this works magically out of the box, but in the open source community that is not really there yet.

One of the pain points of the scheduler at the moment is performance: if there is a large number of schedulers running and a considerable number of hosts in the setup, the scheduler can become the bottleneck. One of the features that has been proposed, by Boris from Mirantis, is a no-db scheduler: basically, each scheduler keeps an in-memory picture of the current situation, which improves scheduling times.

OK, just before we get to the questions, let me summarize. We identified a mapping between service levels and scheduling policies. This is not an exhaustive table, but as you can see, availability maps to anti-affinity, with rescheduling also required to really ensure availability; performance means proximity (network proximity, storage proximity, host capability) and, as we saw in some examples, may require cross-scheduling and rescheduling; and security implies resource exclusivity and also hierarchical scheduling. Those are just a few examples, and as Gary mentioned, the scheduler is becoming a bottleneck, and it is expected to get even more complicated.
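To make the no-db idea mentioned above concrete, here is a toy sketch of the in-memory approach; it is purely illustrative and not the actual Mirantis design:

```python
# Toy "no-db" scheduler: host state lives in memory and is refreshed from
# compute-node notifications, so each request is a pure in-memory
# filter-and-weigh pass instead of a database query.
class InMemoryScheduler:
    def __init__(self):
        self.hosts = {}  # host name -> {"free_ram_mb": int}

    def update_host(self, host, free_ram_mb):
        # Called on, for example, periodic RPC updates from compute nodes.
        self.hosts[host] = {"free_ram_mb": free_ram_mb}

    def select_host(self, ram_mb, avoid_hosts=()):
        # Filter pass: enough RAM and not excluded by anti-affinity.
        candidates = [(h, s) for h, s in self.hosts.items()
                      if s["free_ram_mb"] >= ram_mb and h not in avoid_hosts]
        if not candidates:
            raise RuntimeError("no valid host found")
        # Weigh pass: prefer the host with the most free RAM, then claim it.
        host, state = max(candidates, key=lambda c: c[1]["free_ram_mb"])
        state["free_ram_mb"] -= ram_mb
        return host
```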
The challenge of scheduler performance is going to grow. So I think the Mirantis initiative to speed up scheduling with an in-memory picture is a good move forward, and it allows us to push scheduling complexity even further. Any questions? Yeah, would you mind going to the microphone so everybody can hear you and you will be recorded? Yeah.

OK, so when you were talking about anti-affinity, you mentioned that you can place a VM on a given host. Now, in regard to VMware, if you did that, does the scheduler then know, if DRS kicks in and moves the VM to a different host, that the VM now exists on that host? I mean, how is it keeping track of the VMs?

So anti-affinity is an attribute that you don't forget after the initial scheduling. You continue to remember the anti-affinity relationship between the different VMs, so you use it also when you reschedule. OK, and that's the way it's currently implemented. And regarding your question: the anti-affinity support here is more relevant to, say, the libvirt driver, where there is a one-to-one mapping between the hypervisor and the actual host. In the VMware case, if one were using anti-affinity scheduling, it would basically pick another cluster. A cluster, basically? Yeah. One of the things on our to-do list is to make the anti-affinity filter percolate through to the VMware driver and ensure that there is anti-affinity between the ESX hosts.

I've got one more question. Sure. This is related to the rescheduling. You mentioned that if a VM fails, for the load balancer, for example, the rescheduler sees it and then kicks up another one, right? What happens if you start spinning up another one and that fails too? Does it continuously do that, or is there some sort of logic to say don't just shoot yourself in the foot by spinning up, like, 100 VMs?

OK, typically with fault tolerance, you design your fault-tolerant system for a given number of failures. Typically it's one, OK? You can design the system to tolerate more than one failure. But of course, if the system was designed to resist a single failure and there is more than one failure, the system will fail, unless it was pre-designed to sustain more than one failure. But the design you were talking about here covers just one? Yes, yes. Typically, when you say fault tolerance, it is tolerance to a single failure, OK? It may be resilient to multiple failures, but not at the same time, not simultaneous failures. OK? Yeah, more questions? Yeah.

You mentioned the possibility of scheduling with rack affinity. My question is, at the moment, Nova does not take rack location into account. So, in order to get this feature quite quickly, do you plan any cross-project integration, like with TripleO?

I haven't really thought about TripleO yet, but the closest thing to rack affinity in Nova at the moment makes its selection based on whether hosts are running on the same subnet. The idea here was to add to server groups the ability to do rack affinity, and there would have to be some kind of ID by which we would know which hosts are in the same rack. We have yet to think that out and decide what the best way of going about it is.

OK, my question was about the definition and the discovery of the location of the hosts within the racks, hence the proposal of TripleO. That's a good idea, yeah. Yeah, OK. Thanks.
Yeah. That's kind of best-effort, first come first served. I was just wondering if anyone's thought about profiling the VMs so you can do riskier scheduling: knowing something is pretty idle in the morning and more active in the evening, how much CPU and network it uses, so you could take a riskier placement and get more out of your resources. I'm curious about interest in the community.

I think that's a very good idea. At the moment, that's not really taken into account by the existing scheduler, but it's certainly something worth discussing, I think.

One example I'm aware of: in the VMware vCenter scheduler, you can set a priority. You can have a VM priority of one, two, three, maybe four, I think that's the highest, and the priority is taken into account when you do vMotion: when you need to evacuate a server or you want to do workload balancing, you tend to move the low-priority VMs.

No, the notion here is not about priority. Assume both your VMs have the same priority, but maybe together they can't coexist because they are both equally active at the same time. Knowing that one is active in the morning and one is active in the evening, they can coexist on the same physical resource and each expand to its full capacity. That's the idea I want to explore here.

I'm not aware of this type of attribute. We can innovate there. Yeah, that's a nice idea.

Hi, so this one's about quantity of information and complexity, and it's an open question. You've got networking, and you mentioned knowing how many routers are in between things so you can make decisions about network usage. You've got compute nodes and racks and being aware of that, and then you've got PCI stats and things like that about each node, and then you've got the affinity and the server groups you were talking about. We're starting to get a service in the middle, if we follow that path, that knows all things about everything. How far do you think we should go down that path, versus having something with very coarse-grained abstractions? Or do you think there's enough room to deal with the complexity?

I think as more applications become cloud-ready, designed to run on the cloud and to take best-effort scheduling into account in the design of the application, you will need these fancy scheduling attributes less and less, OK? I think the requirement for fancy scheduling is a phenomenon very much related to the problem of migrating existing applications to the cloud, OK? If I just tell you all you have is availability zones, OK, design your application accordingly and don't assume any fancy scheduling, you will probably be able to deal with performance, with availability, with everything. And see how many customers are deploying mission-critical applications on Amazon with best-effort scheduling, unless you buy their VPC service, OK, where you get control at a finer grain. But I think it's a two-way movement, OK? On one end, you can live with best-effort scheduling if you take it into account when you design your application, OK? And the context of this talk is mainly around taking existing applications, with existing fault-tolerance methods, and mapping them to the cloud without requiring you to redesign the application. That makes scheduling more complex. In the future, I think we won't really need that level of complexity, to answer your question, OK? I think we are running out of time.
OK, thank you, bye.