So it's 2:40, so let's get started. For whoever is here from the last talk, some of these first slides will be repeats, and I'm sorry. In this one we're talking about profiling the Nova scheduler, as a case study: we had some problems with it, and we're going to go through why those problems happened. I want to make this an interactive session, so if you have any questions, please ask throughout.

A bit about me: I'm an employee at Cloudscaling, and an OpenStack contributor and developer. I've been working primarily on Nova, and I've been very active in the Folsom timeframe.

Some of you have seen this slide before since you were just here, but there are a lot of numbers out there about Nova and Folsom, and who did what, and that kind of thing. I want to show a different view of those numbers. This is the number of lines of code from Diablo to Essex to Folsom. You can see it has steadily increased, but the amount of code churn, which I'm defining as the number of lines inserted or removed according to git log, has actually gone down. In Essex there was enough churn to rewrite all of Essex once, which is crazy. That has gone down a lot in Folsom, so the amount of code has gone up and the amount of churn has gone down. I think this is a sign of maturity, and hopefully things will get more and more stable over time. You can also see the number of non-Python files going up; a lot of those files are used for testing and documentation, so that's also been on the rise.

I'd like to take a minute here to thank all the other developers who worked on Nova Folsom in the past six months. This is a list of everybody who has more than 10 commits on Nova in Folsom, and there are 27 of them (there was a typo on the last slide), from companies all over the world. We wouldn't be here today without all these people.

So, two problems we're trying to address in this talk. We had a case where somebody was running a cloud, you could do run-instances -n 1000 on an OpenStack cloud, and it broke in horrible, tremendous ways, and you want to know why it happened. Two things happened. One, you try launching 1,000 VMs and you get more than 1,000 VMs running. The other problem is you try launching 100 VMs, a smaller case, and it takes 24 seconds, which is too long. So we're going to go through why it's so slow and, potentially, how that is changing in Folsom.

We're going to talk about how the Nova Essex scheduler works; the first problem I mentioned, why too many VMs were scheduled; some details about Nova performance, how to measure it, some of the problems with measuring it, et cetera; the second problem, why the Essex scheduler is so slow; and how all of that relates to Folsom.

So, how the Essex scheduler works. Everybody here, I hope, has seen this slide before, so we're going to take an example of how it works when you want to schedule a single VM. Say you run a nova boot and you want to launch a VM. The command goes into the API. The API does some processing, figures out what kind of command it is, records some information to the database, and sends a message through the message queue to the scheduler. The scheduler looks at all the available compute nodes, picks which one the VM should run on, records that information in the Nova database, and sends a message through the queue to the compute node.
The compute node retrieves the information from the database and actually launches the instance on the hypervisor. There's a lot here that I haven't mentioned, such as Nova networking, how that interacts, and when the actual return message gets sent back to the API, and we'll get into some of that a little later.

So now that we have a general picture of where the Nova scheduler fits in with everything else in Nova, let's take a look at how the scheduler itself actually works. This is the default scheduler, which is the filter scheduler, and this diagram is taken right out of the OpenStack docs; the link is on the bottom. The filter scheduler is a filter-and-weigh scheduler. It takes all the compute nodes you have available, filters them based on whatever parameters you set, such as the amount of RAM free, the number of CPUs free, the amount of disk free, et cetera, and removes any that don't have enough free resources. Now you have a list of all the potential nodes you could actually schedule something on, and you want to pick one of them, so you weigh the options. The default weighing is by the amount of free RAM. You have the option of spreading out first, saying I want to fill up all the machines slowly, so one VM on each machine and then two VMs on each machine, or you can say I want to fill up each box one at a time.

One of the limitations of the Essex scheduler in particular is that you can't run more than one of them, and this is actually a bug that's been fixed in Folsom. It involved a race condition where each scheduler was not aware of the other schedulers, so one would say I want to run a VM on that machine, another would say the same thing, and you'd end up with too many VMs running on a machine.

The weight is actually the amount of RAM that is free on the machine. The idea is that if you want to fill up a machine first, you say I want to pick the machine with the least free RAM. You schedule one VM on it, now it has less free RAM than anything else, so the second VM will go on it too. Or you can say I want to pick the machine with the most free RAM first.

You raise a very good point, and the answer is that we don't actually rely on the compute node, on free -m or whatever you use with top, to report how much memory is free. We don't trust the kernel to tell us how much free memory there is, because a lot of these VMs will balloon memory over time and things like that. We actually look at the database and how big the VM was supposed to be. Say you have an 8-gig machine and you launch a 4-gig VM on it: we count it as using 4 gigs. It doesn't matter what the machine actually reports it's using; we simply don't trust it.

Yes, you could filter on just about anything. You can write your own filters, and it's fairly straightforward to do, so you could add something like that. I don't think anything is in there today to say I want all the same images to be on a box, but that would be easy to add. The filters themselves can make database calls directly, or there's some information already available in the scheduler. So if it's not available already, you could call the database and ask which machines are running what, what image type it is, et cetera, and say I want to go to the one that already has that image.

The second limitation, as we'll talk about later, is this big O of N times M.
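Conceptually, the filter-and-weigh loop looks something like this. This is a simplified sketch, not the actual Nova code; the host and request structures are made up for illustration.

```python
# Simplified sketch of a filter-and-weigh scheduler -- not the actual Nova code.
# Hosts and the request are plain dicts made up for illustration.

hosts = [
    {"name": "node1", "free_ram_mb": 3072, "free_disk_gb": 80},
    {"name": "node2", "free_ram_mb": 2048, "free_disk_gb": 40},
    {"name": "node3", "free_ram_mb": 1024, "free_disk_gb": 100},
]

request = {"ram_mb": 2048, "disk_gb": 20, "num_instances": 2}

def ram_filter(host, req):
    return host["free_ram_mb"] >= req["ram_mb"]

def disk_filter(host, req):
    return host["free_disk_gb"] >= req["disk_gb"]

def weigh(host):
    # Default behaviour: weigh by free RAM.  Picking the host with the *most*
    # free RAM spreads instances out; picking the *least* fills one box first.
    return host["free_ram_mb"]

def schedule(hosts, req):
    chosen = []
    # One pass over every host for every requested instance: O(N * M).
    for _ in range(req["num_instances"]):
        candidates = [h for h in hosts
                      if ram_filter(h, req) and disk_filter(h, req)]
        best = max(candidates, key=weigh)        # spread-first weighing
        best["free_ram_mb"] -= req["ram_mb"]     # consume the resources
        best["free_disk_gb"] -= req["disk_gb"]
        chosen.append(best["name"])
    return chosen

print(schedule(hosts, request))   # -> ['node1', 'node2']
```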
Some of you may have a computer science background, and we all love big O, but in real life it doesn't necessarily mean what you think it means. Here N is the number of nodes and M is the number of VMs requested, and the complexity comes from the fact that every VM is scheduled separately, and each time you have to consider all the physical compute nodes that are available. So if you want to launch 100 VMs and you have 100 nodes, that's a lot of scheduling.

So, the first problem: why are too many VMs scheduled? We tried launching 1,000 and we got more than 1,000. That's not a good thing; it shouldn't happen. This happens for three basic reasons. First, the scheduling is slow, as we'll talk about later in this presentation. Second, there's an RPC call instead of a cast, and this is the RPC command from the API to the scheduler. A call says I'm calling something on the scheduler and I want to wait for a return code, so if the scheduler is slow, the API won't respond until the scheduler has finished scheduling. If you do a cast instead, you get a result back immediately. And lastly, Boto, which is the Python library for EC2 that works with OpenStack, retries upon an API timeout. Here's an example from a real system: you do run-instances with m1.tiny, try to launch 100, and it takes 25 seconds of wall clock time, which is crazy.

This is the actual source code for the RPC call instead of the RPC cast. This is in the compute API itself, so you talk to an API endpoint and this code gets called in the API server. You can see here that there's an RPC cast if it's only one VM, but if we're launching more than one VM, it's an RPC call. That's due to some reasons that aren't necessarily true anymore; it was a bug, it's already been fixed in Folsom, and it only exists in Essex. The second thing I'd like to point out is the number of instances at the bottom here. You have one RPC message to the scheduler even if you're going to launch a million VMs; it's all pooled together into one call. And on the scheduler side, you can see it takes that number of instances and reruns the scheduler for every single one: for num in xrange(num_instances), run the scheduler. So if you launch one, it runs one time; if you launch 100, it runs 100 times.

So now we have a concept of how the scheduler hooks into the API in Essex: we launch 100 VMs and we get stuck waiting in a slow scheduler before the API can return. Now let's address why Boto is retrying. Say you're using OpenStack as an EC2 user, so you talk to the Nova EC2 API. In our case we actually had a Pound server in front, which had a low timeout in this example; that's there for scale-out, for having multiple API endpoints. The user tries to launch 100 VMs, but the Nova API server is slow because the scheduler is slow for some reason, so it doesn't respond. It sits there, and it sits there, and Pound says 15 seconds have gone by, I'm sending a timeout back to the user, so we know something didn't work; it's a timeout. Then the EC2 user running Boto will actually send a retry, because Boto says, oh, my command timed out, let me try again.
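Those retries come from Boto itself, not from anything the user typed. Hedging a bit, since the option names are worth checking against your Boto version, the relevant knobs live in the [Boto] section of boto.cfg, and you can also set them programmatically, something like this:

```python
# Sketch of taming Boto's silent retries (old standalone boto library).
# Equivalent to putting this in /etc/boto.cfg or ~/.boto:
#   [Boto]
#   http_socket_timeout = 120   ; wait longer before giving up on nova-api
#   num_retries = 0             ; and don't silently re-send run_instances
import boto

if not boto.config.has_section('Boto'):
    boto.config.add_section('Boto')
boto.config.set('Boto', 'http_socket_timeout', '120')
boto.config.set('Boto', 'num_retries', '0')
```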
And that retry is done behind the scenes; the user isn't aware of it. Okay, thank you. So I know that it can be fixed. In our case we actually fixed it in Pound, by just making the timeout there bigger, but it's good to know that the timeout exists on the Nova API side as well, so now we know why it was running slow. If you use Pound, make sure to set the timeout high, and if you're using something else, make sure to configure it properly too.

So let's talk about how to measure Nova performance itself. Nova uses eventlet, which is a coroutine library. Coroutines are a way of running a function asynchronously: you run the function, you don't wait for the result right away, the rest of the code continues, and the function runs in the background. If you didn't have this, you could do some standard Python tracing in your code to see what happens, because it runs sequentially. But now, within a single process, you have the code jumping around all over the place, so it's much harder to just trace the Python code line by line.

Additionally, as we talked about in the last talk for whoever was here, there are some problems between the standard MySQL client, which is a compiled binary, and eventlet. Eventlet works by monkey patching all the code. For all the Python developers out there, monkey patching is crazy and it's evil, but more importantly it doesn't work on binaries. Monkey patching says, I want to take this other module and put my own code into it, and you can't do that with compiled extensions; it only works with Python code. This means eventlet can't actually make the MySQL driver coroutine-friendly, so every time you do a MySQL command it blocks, and all the calls get serialized. All the parallelization you have with coroutines stops, and everything has to wait while the MySQL work happens. This slows things down, and we've seen that if you replace the binary MySQL client with a native Python one, you get very different performance characteristics. It's not necessarily faster or slower, it's just very different. But this also makes it very hard to figure out what's going on, because you have serialization here, parallelization there, and everything acts strange.

Additionally, you have RPC. If you're running this whole system on a single box, you can just profile that single box and get some nice numbers. But running it on three boxes, a hundred boxes, a thousand, all of a sudden you have to measure things all over the system. The RPC can be running slow. If you have a centralized queue, the queue could be slow; there's been some work done on how to measure how slow the queue is. So you have a lot of other components that are hard to see into. If something goes wrong behind the scenes in the queue and it misbehaves, you may not know that it's misbehaving, and it'll be hard to figure out what's going on and where it breaks.
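To make the eventlet point concrete, here's a toy demonstration. A busy-wait loop stands in for a blocking C-extension call such as a MySQLdb query, since that's roughly how it behaves from eventlet's point of view; nothing here is Nova code.

```python
# Monkey patching swaps out the pure-Python socket/sleep modules so green
# threads can yield to each other, but it cannot reach inside a compiled
# extension like MySQL-python, so every such call blocks the whole process.

import eventlet
eventlet.monkey_patch()        # patches socket, time.sleep, etc. (pure Python)

import time

def friendly_worker(n):
    time.sleep(1)              # patched: yields, so all of these overlap
    return n

def unfriendly_worker(n):
    # A C-extension call (e.g. MySQLdb executing a query) is invisible to the
    # monkey patching; busy-waiting here stands in for that blocking behaviour.
    end = time.time() + 1
    while time.time() < end:
        pass
    return n

pool = eventlet.GreenPool()

start = time.time()
list(pool.imap(friendly_worker, range(5)))
print("patched sleeps:      %.1fs" % (time.time() - start))   # ~1s, overlapped

start = time.time()
list(pool.imap(unfriendly_worker, range(5)))
print("blocking 'queries':  %.1fs" % (time.time() - start))   # ~5s, serialized
```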
And lastly, we had an interesting case where we tried CPU frequency scaling on the compute nodes. Normally, CPU frequency scaling lowers the speed of a CPU when it's not being used, which saves power and all that nice stuff. But because of the strange properties of the Python load on nova-compute, it doesn't actually interact well with CPU frequency scaling. When we turned it off, we got a 2x speedup on the compute nodes, which we did not expect at all, and it was very interesting to see. There's also a trade-off to make there, because you're burning extra power by always running the CPU at full speed.

Another important part of trying to understand what's going on in an OpenStack deployment and measuring it is the log files. The log files are big and scary; they can generate gigabytes an hour, or even a minute, and sometimes it's hard to make heads or tails of them. At the top here is the log format, the standard format for most of the logs, and it can be changed if you want for some reason. The first parameter is the ASCII time, and you can see in the example below the time of what's going on. Unfortunately it's only at the second level; there's nothing more fine-grained than seconds. Next you have the log level. There are different log levels, debug, warning, info, et cetera, and these don't all necessarily make sense. Nobody has really gone through and said this is a debug message, this is a warning, this is an error, so that needs to be worked on, and there's talk of doing that in Grizzly. Then there's the name of what's running the command, the actual service. Then the request ID, and this is an important one. Every request that comes into the API service is given a request ID, and you can track it throughout the code and throughout the services; it keeps the same request ID everywhere. So if you grep the logs for a request ID, you can see all the commands across all the log files that happen for a single request. That's nice because you can track across services and across logs, and it's an easy way to follow a single request. Next you have the user ID and project ID, which are fairly straightforward; that's the user and project running the command. And then there's the header information and the log message itself. In this case the input is the action, run instances, so this is the actual API command that comes in for launching instances; it says I'm trying to do a run-instances command. And it tells you exactly what file it's logged from and the line number, so if there's an error or a stack trace, you can go in and see, oh, this is exactly the line it came from.

So now that we've talked a bit about how to profile Nova and try to figure out what's going on, here's a fairly simple example of why the scheduler is so slow: 24 seconds in the life of Nova. These are examples taken straight from the logs. The logs have been cleaned up, because as you can imagine the log files are very big, and I looked for the request ID to track a single command across the different log files. Starting at the top, we have the nova-api log. You see a POST for a run-instances request; that's the client saying I want to run instances. Then nova-api says, okay, now I need to talk to the scheduler, so it writes something to the database and does an RPC round-robin call to the scheduler, which means: I need any scheduler, I don't care which scheduler it is, just give me a scheduler. And then, 24 seconds later, we have the POST response. So something happens, the RPC round-robin call responds, and the API responds to the user.
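That kind of timeline is what grepping by request ID buys you. A few lines of Python, or plain grep, will pull one request's history out of every service's log. This is just a sketch; it assumes the default log locations and the standard format shown above.

```python
# Pull every log line for a single request out of all the Nova service logs.
# Assumes the default log directory; adjust the glob pattern for your setup.

import glob
import sys

def trace_request(request_id, pattern="/var/log/nova/*.log"):
    for path in sorted(glob.glob(pattern)):
        with open(path) as log:
            for line in log:
                if request_id in line:
                    # keep the file name so you can see which service said it
                    sys.stdout.write("%s: %s" % (path, line))

# e.g. trace_request("<request ID copied from the nova-api log>")
```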
So now we want to see what's going on in those 24 seconds while the scheduler is sitting there twiddling its thumbs. The first line that shows up in the scheduler log is attempting to build 100 instances; that's the command we ran, run-instances -n 100. Then we go through the scheduling process, the filtering and weighing for each instance across all the compute nodes, and that takes about one second, maybe less, for 100 instances. Then, now that we know where everything should run, we need to record that in the database and send off the casts to the compute nodes, saying I want to run this instance on that compute node, do it in the background, asynchronously. That takes 19 seconds. After that, it responds to the API call saying I'm done, nova-api responds to the user saying everything is scheduled, and now we have our response in the REST API.

So we have these 19 seconds, and there are a lot of database calls in all of this, so let's look at the database. This is the MySQL log, collected by enabling the slow query log with the time threshold set to zero: if a query takes zero seconds or more, it gets logged, so now we have a nice list of every database call that was made. For the actual filtering and weighing process, there are 733 database calls for 100 instances, and out of the roughly one second that part takes in the Python code, the database spends about 0.07 seconds. So it's crazy fast. Databases have been optimized for years; the database itself is fast and has no problems here. This is also not a terribly big deployment, so we didn't have a million compute nodes or anything like that. Then we look at the second component, recording all the information about which instance should run on which compute node, and that has a lot more database calls, 3,563, and it takes 1.7 seconds. That's a relatively long time, but for that many calls, and out of 19 seconds, it's not really significant and it's not really the problem. So the big O of N times M we saw before turns out to account for a fairly small amount of the time, and the actual database is very fast. Even though there are a lot of calls, something else is going slow, and it's not the database.

Yes, it was everything, the scheduling. There's also something I don't cover here, which is that Nova networking does some work, and I believe the networking calls are intertwined in there. Right, but that's fairly small. I looked through the logs, and I'd have to double-check, but I don't think the compute nodes were doing very much, and there was nothing else going on in the system at the time. Unfortunately I had to count everything, because it's not as easy as just looking for a request ID through the database calls, so these numbers include everything.

Right, that's a very good question. I don't know; I was wondering about that myself, and for some reason it takes a few seconds to come back. My guess was that the RPC layer could potentially be slow, but I don't think it's that. I looked into it, and it's very unclear what's going on there. And in the grand scheme it's only a few seconds, so we decided to just ignore that anomaly and focus on the biggest chunk of time.
But that's a very good question, and it's something that needs to be figured out going forward. So from here we see the database is fast: under two seconds out of the whole 24 seconds in Nova. The filtering and weighing, the actual scheduling algorithm, is also very fast; it's not the best, but it's not slow either. But the Python code around the database is actually pretty slow. It turns out we use SQLAlchemy to allow us to use any database backend, but that code isn't optimized at all and could be done much better. And you see that not a lot of time is spent in the actual RPC layer; that's a cast, it's pretty straightforward, you just send a JSON blob out to something else and move on. A lot of the time is spent around the database, either running Python code or preparing the SQLAlchemy Python code for the database. So it appears the actual problem is the SQLAlchemy code around the database. One potential solution is to reduce the number of database calls and make them more efficient, so instead of 3,500 database calls you have far fewer. Another thing that needs to be done in Grizzly is to optimize the SQLAlchemy code itself. That's been done a little bit in Folsom, but there's much more to do.

So, Folsom. Unfortunately I haven't had a chance to run this on real hardware; I've run it at a small scale, and it's better. The short of it, though, is that this RPC-call-instead-of-cast problem has been fixed, and that was fixed early on. That means no matter how slow everything is, you get a response to your run-instances request right away, so if things are slow, it's not as big a deal as it used to be. Additionally, the number of database calls in the scheduler itself has gone down. Here's a simple example: in Essex there are 29 database calls in the Nova scheduler driver code, and in Folsom there are only 12. So hopefully that means less time spent in the SQLAlchemy code, because there are far fewer database calls.

I think we're a little early, but any questions? No, I haven't had a chance to yet. Yeah, so the SQLAlchemy code is used as an ORM, and a lot of time is spent building the models of everything and translating them into the actual SQL queries. A lot of this has to do with just poorly designed SQLAlchemy code. The code around the database has been a major point of contention in Folsom; everybody's realized it's a problem, and there's been a lot of work to try to improve it. Part of it was that on big systems there were problems with indices in the database. At a small scale you wouldn't see this, but if you have a thousand compute nodes or more, some things would be slow because an index was missing. Additionally, we do a soft delete: every time you want to delete something, you just mark it as deleted in the database and move on, so over time your database gets much, much bigger. Both of those things have been partially addressed in Folsom, and there's more work to do in Grizzly.

Are you suggesting that moving away from the ORM, and instead using core SQLAlchemy SQL commands, is something we're looking at to speed things up? The short answer is maybe. I don't think using it as an ORM is necessarily the only problem.
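For anyone not familiar with the distinction the question is drawing, here's the same lookup both ways, sketched with a made-up table and current SQLAlchemy syntax rather than Nova's real models. Both hit the database with a single SELECT; the ORM version also builds and hydrates full Python objects for every row, which is the kind of Python-side overhead being discussed here.

```python
# ORM versus core-style querying in SQLAlchemy -- illustrative table only.
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base, Session

engine = sa.create_engine("sqlite://")
Base = declarative_base()

class Instance(Base):
    __tablename__ = "instances"
    id = sa.Column(sa.Integer, primary_key=True)
    host = sa.Column(sa.String(255))
    memory_mb = sa.Column(sa.Integer)
    deleted = sa.Column(sa.Boolean, default=False)

Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Instance(host="node1", memory_mb=2048))
    session.commit()

    # ORM style: one mapped Instance object per row, with all the bookkeeping.
    orm_rows = session.query(Instance).filter_by(host="node1",
                                                 deleted=False).all()

    # Core style: the same filter, but plain rows and much less Python work.
    core_rows = session.execute(
        sa.select(Instance.memory_mb).where(Instance.host == "node1",
                                            Instance.deleted == sa.false())
    ).all()

print(len(orm_rows), core_rows)
```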
I think the bigger question is how you make the code itself faster, and the different options need to be evaluated against that. Getting rid of the ORM layer may be a nice move; there are a lot of pros and cons to it, and they have to be weighed against how much of a speedup you actually get. The ideal scenario would be not to move away from the ORM, but just to make the code better and more efficient. Okay, thanks.

Right, so we didn't actually... part of it was a very weird access pattern. The question was, why does CPU frequency scaling have such a big impact? Part of it is that the workload is very strange. The Python code isn't that heavyweight, it doesn't really do that much, but there are big things going on on the compute nodes: you're launching instances, you're running KVM. So you have this pattern of really heavy work mixed with almost nothing at all, a lot of work, then not much, then a lot of work again, rather than one sustained burst. That's our hypothesis for why frequency scaling is such a problem: every time you get ready to do a lot of work, the CPU has slowed back down.

Were you using the Essex database from a pure install, or did you add indexing to it? Because the indexing isn't done by the plain install, right? If you want that, you have to go and add it yourself. I think, fortunately, they've fixed that now; there's some stuff in the migration scripts. In this case we didn't do it, but this was a small scale; we did this on a fairly small-scale system, and it was a new system, so it didn't have a lot in it. Can you say what small scale means in this case? In our lab environment we had two compute nodes, and the actual production environment where we ran the test initially had, I believe, a few racks; I don't remember the exact number.

And the other thing is that, intuitively, three and a half thousand calls for doing the scheduling of... That was for scheduling a thousand separate VMs? No, this is 100. 100, okay. So that's roughly 35 per VM, which still intuitively sounds like an awful lot of work at that stage. So I'm wondering, did you look at those calls to see what the pattern was? Because I'm suspicious there might have been stuff from the periodic tasks kicking in there, which we've seen cause huge database loads, and which has nothing really to do with scheduling; it's just an artifact of other database work going on in the system. There were a lot of calls for the security groups; I don't remember the exact components, but security groups kept coming up. So you were creating all of these in a security group that had a group rule? It was the default security group. Right, but the key thing with security groups is whether you've got a group rule referring to itself or not. If you just have a plain group, it's fairly easy, but if you're adding VMs into a security group that has a group rule, you end up generating a huge amount of updates. Right, there were two group rules. So that kind of suggests it's more of a security group configuration thing than a scheduler thing. Yes, but either way, we still saw the scheduler take 19 seconds to do all this, and it's not completely clear exactly why. Part of the nature of this is that it's hard to instrument, and we have plans to further instrument it and look into it.
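Even something as simple as a timing wrapper around the suspect entry points (the scheduler driver, the database updates, the casts) would help narrow it down. This is a hypothetical helper, not anything that exists in Nova.

```python
# Hypothetical instrumentation sketch: log how long each wrapped call takes,
# then grep the logs by function name alongside the request IDs.
import functools
import logging
import time

LOG = logging.getLogger(__name__)

def timed(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            LOG.info("%s took %.3f seconds", func.__name__, time.time() - start)
    return wrapper

# Usage sketch: decorate the pieces you suspect are slow.
@timed
def schedule_run_instance(request_spec):
    time.sleep(0.2)   # stand-in for the real filtering / db / rpc work

logging.basicConfig(level=logging.INFO)
schedule_run_instance({})
```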
The short answer is no. We use rsyslog to collect logs, but for some of this, in a small environment, I would just SSH around to the different boxes. One of the problems with rsyslog is that it may drop messages: a lot of stuff happens at once, a stack trace shows up, and depending on how you configure rsyslog, it may see too many logs coming in and drop some of them. But for a lot of this, grep worked really well, because you have the request ID; you can just grep across a bunch of logs and see everything that's going on.

It was a dedicated, clean system. I'd have to double-check, but I think work had been done on it in the past; we cleaned it up, deleted everything, and reran everything. Unfortunately, it's just the code. Definitely. Yeah, I was running a little late on this presentation, and unfortunately the network here wasn't good enough to actually get something up. Right, so let's say we find there's no way to actually speed up the scheduler: in Folsom you can now run more than one of them and just throw more boxes at the problem, which is cheaper than hiring more developers. There's a limit on how many retries Boto does; I think it's three by default, and then it fails. Anything else? Great, thank you.