Well, good morning, everybody. We're here for the OpenStack Compute State of the Project presentation. There are two of us? Well, there are two of us. So this is Vish Ishaya. I'd be surprised if you don't already know Vish. Vish has been involved in Nova and in OpenStack since the very beginning of its existence. He's been the leader of the Nova project since there was a leader of the Nova project, and I think this would be a really great time to thank Vish for all the hours he's put into the project.

And for those of you who don't know, this is Russell Bryant. Russell works for Red Hat. He started with OpenStack Compute, what, about a year ago? Yeah, a little over a year. A little over a year ago, and he basically jumped right in, started refactoring everything, and really annoyed me. No, he's actually been an incredible contributor to the project. And during the past six-month cycle he also took on a lot of the load of being PTL. We have an incredibly huge number of contributors at this point, so managing, herding, all of the various felines that contribute to this project is quite hard. Having him help run the meetings and deal with blueprints and bugs has been a huge benefit and has kept me sane over the past six months. I'm really glad that someone as committed as I have been is willing to step up and take over, and I think we're going to have a great next six months with Russell at the helm. So give him a round of applause too. Thank you very much.

And while we're on the theme of thank-yous, there are probably a lot of people in this room who deserve to be thanked. Obviously I don't write all the code, and Vish hasn't written all the code; there are quite a lot of people contributing. We both like this graph quite a bit: it's a graph of contributors per month over time. You can see we have a really good trend, and we're getting up towards 100 people in a given month. Some more data: in the past 12 months there have been over 6,000 commits, and even more interesting than that, the number of people who have contributed code in the past 12 months is over 300. So I'm kind of curious how many people we have in the room: if you have code in Nova, could you throw your hand up? Great, a good number. Thank you all very much; we certainly wouldn't be where we are without you. A round of applause for all the contributions.

So basically what we're talking about is Grizzly to Havana. We're going from being grizzled to being svelte cigar smokers, I don't know. Essentially we're going to split this up: I'm going to take the lead on talking about the past, since I am the past now, and Russell's going to be talking about the future. I'm going to go over a Grizzly post-mortem, talking about what went well in Grizzly, what we could maybe improve, and then all the features that went in, which I know a lot of you are curious about. Then Russell's going to talk about the things we've been discussing at the design summit this week and what we're going to be doing over the next six months.

So, one of the things that went really well. We had a problem; you'll see the spike at the top of this graph around the middle of last year. This is our untriaged bugs in Nova: bug reports that come in that no one has looked at yet. We got up over 200 at one point last year.
During the last cycle we identified that we needed to pay a bit more attention to bug reports so that we could stay on top of things, and as you'll see, we made a lot of progress. We kept the count much lower; at its maximum, right before the grizzly-3 milestone, we had up to about 60 or so bugs, and we got rid of those as soon as we went into bug-fixing mode. In general we managed to keep the load of new bugs down, so at least we know when people are having problems, which I think has led to a much more stable product overall.

Another thing that went really well is our feature merge cadence. In the last release, Folsom, we had a problem where a whole bunch of really big features got proposed right at the last minute, and so we were doing all sorts of feature freeze exceptions and trying to get those features in, and it was kind of a mess. So this time we focused on getting a better handle, earlier, on which features were coming in when. And I think we only ended up with one feature freeze exception? Yeah, and I don't think it was anything too significant either. So we did a lot better at getting things in early and getting them reviewed quicker. That all went really well.

Another thing, and this isn't just the Nova team, is that our testing and CI have really improved. A number of major testing changes have gone in. We're up to about 5,600 unit tests now, which is one large change. We now have upgrade testing in the gating system, so that we can be sure we're not breaking upgrades from release to release. The Tempest coverage has improved a lot, so we're now actually testing a bunch more APIs than we were via our functional testing. And I actually want to give a round of applause to the CI folks; I don't know if any of you are in here, but the people working on CI have really made this a lot better and it's getting really, really good. So thanks, CI. Stable releases have been going well too.

So, what went wrong? Some of these I've actually only become aware of recently. One of the things we haven't been focusing on, and this hasn't been a huge problem to date, is scale. There are people out there saying: okay, OpenStack's been around for a while; what's going to happen if I put a 16,000-node cluster together, is that even going to work? And apparently it does, but there are some major issues when you get to that kind of scale. One of the things we haven't really been paying attention to is what breaks at large scale, and how we can verify with performance testing what kinds of problems we're going to hit and make sure we don't regress. We actually had one regression during the Grizzly timeframe in database performance that we weren't aware of until someone took the code, actually deployed it, and went: yeah, performance is a little worse here with the database, like it's 10 times as slow. So something went wrong there. We need some kind of general performance testing, and we have some ideas for how that might work. It's not something we've really been focusing on, and we need to, because now is when people are coming in and trying to do these really, really large-scale deployments with OpenStack, and we want that to be seamless.

The other thing: last release we focused a lot on how to split everything out.
How do we separate the different projects into their own kind of silos? We took Nova volume, completely got rid of it, and moved it out into the Cinder project; we're trying to move everything over to Quantum. All of that is great from a project management perspective, and for actually getting work done, but it means there's not as much focus on things that cut across projects. How do we deal with quotas in a way that makes it easy for an administrator to manage quotas across all the projects together? How do we manage scheduling across the different individual silos of work? And how do we drive forward the shared work? Now that we have these little silos, it's easy to forget there are other groups out there working on other projects, and we need to maintain that cross-project communication. I think this is just a natural result of separating, and now we need to find ways to build bridges between the projects so that we're solving the right shared problems. We've done a lot of work moving things into openstack-common and things along those lines, so we have some of the primitives in place to do this sharing; we just need to make it a priority to focus a bit more on it. Those are the major areas I see for improvement. I'm sure other people would have other ideas of things we could do better, but those are the ones that have been popping up for me.

So now I'm going to talk a little bit about the Grizzly features, and there were a ton of them: 66 blueprints went in and something like 700 bugs were fixed during the Grizzly timeframe. There's no way I can cover all of them. I've put a lot of them on these slides, and I don't think I can even give detailed descriptions of every slide, because it would just be me droning on endlessly about this feature and that feature. So I'm going to go through the slides relatively quickly, give you some key points about things I think were interesting, and not try to get too detailed about everything that went in.

One of my favorites, because I really like deleting code: we deprecated Nova volume during the last release, so this time we actually removed it. The code is no longer in Nova, which means all of the volume code is in one place, in Cinder, thank God. And we had a really nice, easy migration path for people who were on Nova volume to move over to Cinder: you essentially just replace the API endpoint and it works the same. The Quantum migration is going to be a lot harder, so I want to take a moment where something went well and be happy about it before we have to deal with that.

One other thing we did well: we've got feature parity between the Quantum API and the Nova network API really, really close to complete. We actually proxy floating IP requests that come into Nova over to Quantum, and we proxy security group support over to Quantum. We improved the VIF model so that API requests that come in can go to either backend and work. That was really important, so that people who are using Nova network, and eventually want to move over to Quantum, can use the same set of primitives they're used to.
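To make that concrete, here is a minimal sketch of what that compatibility means for an end user, using the Grizzly-era python-novaclient; the credentials, endpoint, and server name are placeholders. The point is that the same floating IP calls work whether Nova network or Quantum is the backend.

```python
from novaclient.v1_1 import client

# Credentials, endpoint, and server name are placeholders.
nova = client.Client('demo', 'secret', 'demo-project',
                     'http://keystone.example.com:5000/v2.0')

server = nova.servers.find(name='web-1')

# Both calls go through the Nova API; with Quantum configured as the
# backend, Nova proxies them, so scripts written against Nova network
# keep working unchanged.
ip = nova.floating_ips.create()
server.add_floating_ip(ip.ip)
```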
Users don't necessarily get the same feature set, though. For example, Quantum supports egress security group rules in addition to ingress rules, so if you want to use those extra features you have to go talk to the Quantum API directly. But at least a user can go: oh, I've been using this Nova floating IP add command, and I can still use the same command and it will work. So that's really nice for end-user compatibility.

I mentioned this a little bit already, but we did a lot more improvement on testing. One thing we did in Nova is what we call API sample tests: a set of tests that run as part of our unit test suite and spin up an entire fake version of all the components. There's a Nova API running, Nova compute, Nova network, with some things faked out in the backend; we then make actual API requests generated from templates and capture the results, and we use those to feed the docs on api.openstack.org, so the site shows real API calls as they would run against a cluster. We verify that every extension that gets added has tests to support it, so we can actually see end-to-end use of the APIs affecting the system. It will also catch it if we ever accidentally change the format of a response or something: that causes a test failure and the patch can't get in. So it's really nice that we have coverage across the whole API for that now.

Also: more complete tests in Tempest, and we improved our database migration testing quite a bit. A number of errors cropped up when we started trying to do DB migrations with real data. An empty database migrates perfectly, but once you throw data in there, there were a few places where we had bugs. So we now have DB migration testing where we put real data in before the migration, run the migration, and then verify that everything is in the right state, which is going to make things a lot easier. A couple of people worked on Grenade upgrade testing, which is basically what I mentioned earlier: you install stable Folsom, run a bunch of stuff, upgrade it to an install of Grizzly, and make sure everything is still working.

We did a lot of incremental database improvements. Specifically, we've never really had unique keys in the database, and you might wonder why, because databases are pretty good at that. We have this idea of soft delete, so that we don't lose data, and the way we implemented it initially precluded having unique keys. We've now modified it in a way that gets us soft deletes and still lets us have unique keys (there's a sketch of the trick below). That's going to make our data model a little more consistent and eliminate some of the race conditions we've been dealing with. Some support also went in for archiving data: because we do soft deletes, the database grows endlessly unless an administrator goes in and cleans out all the old records. Now there's actually a command that will do that part for you, so you don't have to have a manual cleanup script. And we improved the Postgres support and are actually testing it, which is good.
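Here is a simplified sketch of the soft-delete-with-unique-keys trick. This is the idea rather than Nova's actual models: instead of a boolean flag, the `deleted` column defaults to 0 and is set to the row's own id on soft delete, so all live rows share `deleted=0` and a unique constraint over `(name, deleted)` applies to them, while any number of soft-deleted rows with the same name can coexist.

```python
from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Network(Base):
    __tablename__ = 'networks'
    # Unique among live rows only, since they all have deleted=0.
    __table_args__ = (UniqueConstraint('name', 'deleted'),)

    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    deleted = Column(Integer, default=0)

    def soft_delete(self):
        # Setting deleted to the row's own id frees the unique slot,
        # so a new live row can reuse the same name.
        self.deleted = self.id
```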
Some scaling features went in too. I don't know how many of you have heard of cells; we've been talking about it for something like three summits. The code is finally in. It's experimental, and there are some features that don't work, but for a limited use case it works very well. The no-db-compute work has also been going for about three summits; Russell and Dan Smith from IBM led that, and it's been pretty much eight months solid of gradual refactoring. The idea is that, for security reasons and for scaling reasons, we wanted to remove database access from the compute nodes themselves and put it up into the control plane. That actually works now: you can run Nova so that your compute nodes can't access the database at all.

A bunch of scheduling features went in. You can live migrate without picking a host now, and let the scheduler pick one for you. These are all relatively minor little features, but as a whole they're interesting. You can now boot multiple instances and have them get different names; before, you could say "give me a hundred instances" and they'd all be named the same thing, so it was kind of hard to figure out which one was which. Availability zones are now based on our more general structure called host aggregates, which means they can be changed dynamically. Before, you had to specify in your config file which availability zone a node was in, and if you ever wanted to change that, say you plugged the machine in somewhere else, you actually had to go in, manually change the config, and restart the service. Now there's an administrator API for changing availability zones (there's a short sketch at the end of this section).

A bunch of small Nova network features went in as well. We're not trying to improve Nova network dramatically, because we're trying to move everything over to Quantum, but there were a few warts or inefficient things in Nova network that we cleaned up. One is that we optimized it quite a bit, so it works a lot more quickly than it used to. A couple of small additions: the internal DNS, compute node to compute node, can now be shared across multiple hosts. Internally to the cloud you have a DNS record for the name of your host, so if you launch a server named foo you can get to it at foo.novalocal; but in multi-host mode that only worked for VMs on the same host, and now you can configure it to work for VMs across your system. Just a little convenience thing. Another little convenience: you can now share the DHCP IP address across multiple nodes in multi-host mode. So, minor Nova network cleanups, nothing dramatic and amazing, because all of that is happening in Quantum, which means all the people who used to help with Nova network cleanups are now in the Quantum sessions. So it's much quieter; maybe Russell will talk a little bit about that.

A bunch of API extensions got added. Most of them were about moving functionality that was in nova-manage into actual APIs. nova-manage, for those of you who don't know, was a little script you could run locally, but you had to have direct database access to use the commands, which made it sort of hard to script: you had to log into one of your hosts that had database access and run the commands there. So we moved all that functionality up into APIs, so you can do things like creating networks through the API, et cetera. All the stuff nova-manage used to do is in APIs now. There are a few other interesting extensions, like instance actions, which basically lets a user come in and see all the actions that have been performed on their instance: at this time it was rebooted, at this time snapshotted, et cetera. So you can get a view of what happened, and of any errors that occurred along the way.
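Going back to availability zones for a second, here is the kind of thing the new admin API enables, sketched with the Grizzly-era python-novaclient; the credentials, endpoint, and host names are placeholders.

```python
from novaclient.v1_1 import client

# Admin credentials; endpoint is a placeholder.
nova = client.Client('admin', 'secret', 'admin',
                     'http://keystone.example.com:5000/v2.0')

# Create an aggregate that doubles as the "rack1" availability zone,
# and put a compute node in it.
agg = nova.aggregates.create('rack1-hosts', 'rack1')
nova.aggregates.add_host(agg, 'compute-07')

# Later, re-home those hosts to a different zone: an API call instead
# of a config file edit plus a service restart.
nova.aggregates.update(agg, {'availability_zone': 'rack2'})
```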
This one, default security rules, has been asked for forever, and it's another little additional feature for people using Nova network who are experiencing pain. Every time someone new comes into an OpenStack cloud, they forget, or they don't know, that they need to allow SSH to get into their instance. An administrator can now say: okay, by default, put an SSH rule in all the security groups, so people don't have to do that manually. It's a nice little thing.

Get-password allows you to securely retrieve a Windows password from an instance that's generated by the guest. Theoretically it works for Linux as well, although most people using Linux don't need a password because they have SSH keys. And you can actually list availability zones through the API now, which you couldn't before; that was kind of an oversight.

Libvirt got a bunch of minor features: SPICE support, and configuration of NIC drivers, which is for older guests. If you happen to have a really old guest from some legacy application that doesn't have virtio drivers, you can launch it and say: don't use virtio, use a SCSI or an IDE connection instead. We have events now, so state is mapped directly out of libvirt: if a state change happens in libvirt, it's reflected immediately. It used to take a periodic task, sometimes up to 10 minutes, to notice that, for example, your VM had crashed; now, as soon as it changes state in libvirt, it's updated in the database. Live snapshots: if you have a new enough version of QEMU, you can snapshot an instance without interrupting it. It used to be that the instance was paused while the snapshot was taken; now it happens in the background while the instance is running, which is kind of cool. And you can hot plug IPs and network adapters now: if you're using Quantum and you want to plug a new network into your VM while it's running, you can actually attach a new adapter (there's a small sketch of that below). You can also add or remove an IP, and the security rules get configured properly so that you can still talk to the instance at the new IP.

Xen: much less feature work went on in the Xen driver, because it's already pretty good. It gained config drive support, which is a great compatibility feature, and support for BitTorrent image download, to spread the load of downloading images across your cluster a little better.

VMware is only one bullet point, but actually a lot of work went into the ESX driver during this cycle. It's getting to the point where it's a first-class driver, which is really nice, and from what I understand there's more coming. We had a checkbox ESX driver before, and now we actually have one that's functional. So that's great.

Hyper-V has gained a huge number of features; there's a lot of active work going on there. Some of the lesser-used APIs are implemented now, like live migration, and it has Quantum support and Cinder support. So that's going really well.
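As a quick illustration of the hot-plug API mentioned above, here is a sketch with the Grizzly-era python-novaclient; the credentials, server name, and network UUID are placeholders, and the call assumes a Quantum-backed deployment.

```python
from novaclient.v1_1 import client

nova = client.Client('demo', 'secret', 'demo-project',
                     'http://keystone.example.com:5000/v2.0')
server = nova.servers.find(name='web-1')

# Attach a new adapter on an existing Quantum network to the running
# VM; passing a port_id instead would plug in a pre-created port.
server.interface_attach(port_id=None,
                        net_id='11111111-2222-3333-4444-555555555555',
                        fixed_ip=None)
```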
So that was my 15-minute overview of all the features. I hope I didn't bore the crap out of you by monotonously saying: oh yeah, this feature and that feature. I don't know, man, it's pretty exciting to me. Well, you're supposed to be excited. Yeah, I get it. So, that's not working yet; okay, on to Havana.

Just to recap, the way we do development in OpenStack projects is on a six-month release cadence, and the very beginning of it is this event: we have a design summit, where we've been hiding away in some rooms all week going through all the things we want to do in the next six months. The Nova sessions span all four days, so really we're only about three quarters of the way through; it's still going on now and throughout the afternoon. So what this part of the talk is, is an attempt to capture the themes of the week and some of the things we're talking about that we expect people to be working on.

First, some big themes. When we both sat back this morning and thought about the discussions we've been having so far this week, some things keep coming up. Live upgrades are at or near the top of the list of things we need to keep chasing and make work for people, so that it's much easier to do rolling upgrades of an existing deployment without affecting your existing user base. Security is a huge thing: no-db-compute, which Vish mentioned earlier, was something we did for security, and we're continuing to look at all the ways to make a Nova deployment more secure. One of the biggest questions we look at is: if someone breaks out of the hypervisor, what is the impact? Having direct database access is one particularly bad thing, and we're looking at all the other bad things you could do and trying to lock that down some more. Scale and performance: as Vish mentioned, we recognize that we haven't had as much or as good a focus on that, so we've been talking about ways we can improve it. And reliability as well: it's not just about adding features, but what if a service crashes while it's in the middle of doing something? How can we do better at cleaning up, and so forth?

On to some more specific things we've been talking about. What is the internal object model? This is related to the no-db-compute work, and it's also related to upgrades. One of the things that has limited us in the upgrade space is how tied we are to the database and its schema. Eliminating direct database communication from the compute nodes helps us quite a bit, since there are a whole lot of those, but for everything else too, the API services and the scheduler and all the other services, we want to continue to decouple our code from the database layer. So we've been talking about an object model that separates that out. That's been very important.

Also in the upgrade area, and reliability too for that matter, is graceful service shutdown, for when people need to upgrade. Right now, if you want to upgrade a compute node, the service doesn't have anything built in to track what it's in the middle of doing; if you upgrade the package and tell it to restart, it just gets killed, and if it was in the middle of something, too bad, things are going to be left in a weird state. So we're trying to get better about handling in-progress operations more gracefully.

RPC version control. This is another case that's really, really big for upgrades. When I say RPC, I'm talking about all the ways Nova services talk to each other, and we already have versioning on all of those interfaces, so we already know when incompatible services are talking to each other. The next step is that, as you're doing a rolling upgrade, we need to pin everything: lock all the services into speaking the old versions until everything is upgraded to the new hotness, and then flip a switch so they all start talking the new protocol. That's something we're going to get done this cycle, and it gets us even closer to live upgrades.
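A toy illustration of that pinning idea, in the spirit of Nova's versioned RPC but with invented names (`ComputeRPCClient`, the injected `transport`, the version numbers) rather than the project's actual code:

```python
# Every RPC client stamps messages with an API version; during a
# rolling upgrade, an operator-configured cap keeps upgraded services
# speaking the old version until the whole cloud is ready.

class ComputeRPCClient(object):
    LATEST_VERSION = '3.0'   # newest version this code can speak

    def __init__(self, transport, version_cap=None):
        # version_cap would come from configuration during an upgrade.
        self.transport = transport
        self.version_cap = version_cap

    def _version(self):
        # While pinned, keep sending the old version even though we
        # know how to speak LATEST_VERSION.
        return self.version_cap or self.LATEST_VERSION

    def start_instance(self, ctxt, instance):
        self.transport.cast(ctxt,
                            {'method': 'start_instance',
                             'args': {'instance': instance},
                             'version': self._version()})
```

The operator would set the cap to the old version everywhere, upgrade all the services, and then remove the cap to flip the switch.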
Better state handling. That's actually tied to the graceful service shutdown work: what if a service doesn't shut down gracefully? We're talking about better ways to keep track of operations in progress, so we understand how far we made it through something, and have a good way to clean up or, if necessary, resume where we were and continue. There's actually some security work tied in with this too. To give an example: right now, when you do an operation like a migration, there are a lot of places where compute nodes talk directly to each other and tell each other to do things, and that's not so good for security. So we want to take some of that control logic and move it up a layer, so that something else orchestrates these activities and the compute nodes have less and less power to tell each other what to do. We're going to be doing a lot of refactoring in that area.

There was a lot of discussion about novaclient, generally just giving it more love and making things a bit more consistent, and, related to the next slide, working with the effort to rev our API to be more consistent. In the new version of the API we're talking about fixing a lot of things that are inconsistent, like return codes. We also want to do a much better job of versioning API extensions, and of making features discoverable. That's just important stuff. And we want to re-evaluate what we consider the base API, what a lot of people have called the core API; but "core" is a very overloaded term in our community, so I'll say the base API.

And Nova API extensions. I don't know how many people here have worked on writing API extensions, but there's a desire to do some big cleanup in this area to make it a bit more maintainable. It's going to be based on entry points, which is a Python mechanism; we're moving lots of things within Nova, and within OpenStack in general, over to entry points as the way to load things. And it may end up being a shared framework for all the projects: the idea of doing extensions to the REST APIs is not specific to Nova, so that's a cross-project effort we can work on.
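For those who haven't used them, entry points let a loader discover plugins by a registered name instead of scanning the source tree. A minimal sketch of the idea, with an assumed namespace rather than Nova's actual loader:

```python
import pkg_resources

def load_api_extensions(namespace='nova.api.extensions'):
    """Load every extension class registered under the namespace."""
    extensions = []
    for ep in pkg_resources.iter_entry_points(namespace):
        # Each entry point names a class in some installed package;
        # load() imports it without the loader knowing the file layout.
        extensions.append(ep.load()())
    return extensions
```

A package would advertise its extension classes under that namespace in its packaging metadata, and the loader picks them up automatically.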
Scheduling has been a big area of discussion: on Tuesday we spent at least half the day, maybe a little more, on various scheduling topics. In some cases it's smaller scheduler features, incremental things that people need, and then there are some really big problems we need to solve. An example of a simple feature: there are additional bits of information people want to be able to schedule on, for example CPU utilization, not just how many cores a host has and how many you've allocated based on the flavors you've booted. That's something we'll add. The ability to reserve a host: some people would like to come to the cloud as a customer and say, I want to pay to have a box sitting there that's mine, know that that capacity is there, and know that only my VMs are going to run on it. So we're going to add some scheduling magic to make that possible. And cross-project scheduling: this is a really good example of how we've broken up the projects and now need to do a really good job of working together. A good use case: you have an instance and you have a volume, and you'd like them as close together as possible. We need to work across projects to come up with a good way of handling that; we've talked about it a lot this week, there are some good ideas, and hopefully we'll see some real progress this time around.

Group scheduling is another one. There's a desire to add the concept of a group of instances to a Nova deployment, and once that concept exists, you can add policy around it, such as: I don't want any of the VMs in this group to run on the same host, for failure isolation. For example, if you have two instances for high availability, you want to reduce the risk that if one dies, the other dies with it. I think that's another addition that will be very useful.

Cells. Cells is a really important thing that landed in the Grizzly release for scale reasons. We'd like to keep pushing on it and get more developers involved. It works well for the particular use case of the people who developed it, and there are some additional things we need to fill in to make it useful for more and more people. I expect more people will start using it, so I expect more feedback; we talked about it a good bit this week and laid out the specific items to work on.

APIs for block devices: there have been a lot of complaints about those, and we're going to work on making them easier to consume and more predictable in how they behave.

We also talked about some major cleanup around migration. We have all these operations with completely different code paths in Nova: there's migrate, then there's live migration, then there's resize, and migrate is actually a resize without the resize, and then you have evacuate, which is yet another take on a very similar operation. There's quite a bit of code that can really be unified, and if we unify it, we can get these things much better tested. This is one of those areas that breaks more often than others, and part of that is because it's hard to get good test coverage over it: it's a whole lot of stuff across a lot of different code paths that really shouldn't be different. So we've got a lot of work to clean this up and make it more reliable.

Mothballing a server. The idea here is: right now, if you have a VM that you don't want to delete, but you don't want to run either, and you stop it, then as far as the Nova infrastructure is concerned you're still consuming those resources. In reality you're not consuming the RAM and you're not consuming the CPU and so forth, but the logic treats you as if you were. We want to be able to not count those resources as in use, because you want to be able to take them, use them for other virtual machines, and pass the savings on to your customers. So we have some work to make all of that happen.

Refactoring periodic tasks. Periodic tasks are sort of a, I don't know if sore subject is the right word, but they come up a lot, and in high-scale deployments they become a performance problem: things that run every minute to make sure the state of the world is still what we expect. Deploy 15,000 nodes, and it turns out things like that aren't so good. So changing the way we do all these various periodic cleanup tasks to be more scalable is something we'd like to do (there's a toy sketch of the pattern below).
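To make the pattern concrete, here is a toy sketch of how tasks like these get registered and run. This is not Nova's actual implementation; the decorator, class, and method names are invented.

```python
import time

def periodic_task(spacing):
    """Tag a method with how often, in seconds, it should run."""
    def wrapper(fn):
        fn._periodic_spacing = spacing
        return fn
    return wrapper

class ComputeManager(object):
    def __init__(self):
        self._last_run = {}

    @periodic_task(spacing=60)
    def _heal_instance_state(self):
        # Reconcile the hypervisor's view of the world with the
        # database. With thousands of nodes each doing this every
        # minute, the cost adds up, which is why this machinery is
        # being reworked for scale.
        pass

    def run_periodic_tasks(self):
        # Called in a loop by the service; runs whatever is due.
        now = time.time()
        for name in dir(self):
            fn = getattr(self, name)
            spacing = getattr(fn, '_periodic_spacing', None)
            if spacing is None:
                continue
            if now - self._last_run.get(name, 0) >= spacing:
                fn()
                self._last_run[name] = now
```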
Quantum. A single word on a slide, but it's actually a pretty huge deal. For a while now we've had these two parallel network stacks going: we have Nova network, and we have Quantum coming along, and the Quantum team has done an amazing job. The main thing that held us back from making Quantum the one and only thing was feature parity, and that's not really an issue anymore: it's either been addressed, or the final pieces are being addressed this release cycle. So now we have the hard questions of what's next, and the biggest one in this area is the upgrade path. If you're already using Nova network, how do you migrate to Quantum, and how do we do that without just telling you to start over, which no one is going to like as an answer? So we're going to work hard on making that as seamless a transition as possible. We've got some hard work ahead of us in that area, but we think it's really important.

And the virt drivers. I'm sure there's going to be a ton of work; all of the sessions on the virt drivers are today, so these discussions are still in progress. There's a lot of activity and discussion around libvirt; there's a VMware session this morning, and it sounds like some good features are coming there. There's been a ton of discussion about bare metal, even though the specific bare metal driver session is later this afternoon, and a ton of discussion all week about OpenStack-on-OpenStack, the TripleO effort. That's a really hot thing right now.

And more. These are just the things that came up in discussions this week. We only had so many time slots, even though we had all four days; there were still a lot of ideas proposed that didn't fit, because we only had so much time. Based on past experience, just because we didn't talk about it here doesn't mean it's not going to get done. Anyone here has just as much ability to influence the direction of the next six months; you just have to show up with discussion and code. If we come back in six months, I'm sure the list of things that actually got done will be triple this size, because this community is pretty amazing.

And with that, on to questions. How much time do we have? About five, maybe five to seven minutes. There's a microphone here in the middle of the room; if you'd like to ask something, please come up to the microphone so that it gets on the video.

Question: regarding the API extensions, have you guys looked at extending the get-password API to retrieve key pairs, to access Linux VMs?
So, the way we store key pairs currently, you actually upload or create a key pair before you launch the VM, and get-password leverages that key pair to work securely; part of using get-password is that you've already provided a key pair. There are two ways you can do it. One, you can have the server create a key pair for you and then give you your private key, which is maybe not the most secure thing, but it's convenient. Or two, you can create your own private key and then upload the public key to the server. So that exists already. Is that what you're asking?

Follow-up: I know there have been discussions around secrets as a service, and right now this component is kind of outside of OpenStack, so I was trying to see if you're maybe trying to address it.

Once we have secrets-as-a-service more fleshed out, I would hope that the key pair related stuff we do in Nova would move there, because there's no reason for us to also have our own way of storing key pairs. But right now Nova stores its own key pairs.

Thanks.

Question: when you use "experimental" to describe compute cells, what does that mean? Does it mean it's not mature yet but will be improved, or that it might go away at any time?

So, what does experimental mean for cells? We do not expect it to go away at any point. The point of calling it experimental is, well, I know it's reliable in the sense that it's already being used at significant scale, and it works great for those deployments. It's experimental because some features aren't supported in cells right now, security groups being a big one that's important to a lot of people. So we just don't consider it complete enough for everybody yet, and we still have some significant work to do before we'd consider it the answer that everyone should be using. So it's sort of a work-in-progress deal, but what's there is solid and it's great.

Anything else? All right then, thank you very much.
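For reference, the two key pair flows described in that first answer look roughly like this with the Grizzly-era python-novaclient; the credentials, endpoint, key names, and file paths are placeholders.

```python
from novaclient.v1_1 import client

nova = client.Client('demo', 'secret', 'demo-project',
                     'http://keystone.example.com:5000/v2.0')

# 1) Let the cloud generate the pair and hand back the private key
#    (convenient, but the service briefly held your private key).
kp = nova.keypairs.create('generated-key')
with open('generated-key.pem', 'w') as f:
    f.write(kp.private_key)

# 2) Generate the pair locally and upload only the public half
#    (more secure: the private key never leaves your machine).
with open('/home/demo/.ssh/id_rsa.pub') as f:
    nova.keypairs.create('uploaded-key', public_key=f.read())
```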