Hello, everyone. Welcome. My name is John Garbutt. I'm currently a principal engineer working at Rackspace on the public cloud. I've been the Nova PTL for the Liberty release, and I'm also the PTL for the Mitaka release. Today I want to give you an update on what's been happening in the Nova project over the last few releases, particularly over Liberty. So I'm going to start with Nova's mission. One of the things we've been very conscious of over the Liberty release is making sure that we share with the whole community where the consensus is currently going, what our goals are, and why we do what we do, just to try and get people more aligned on exactly where things are going and to spread the word about what we're up to. So this is a quote from a YAML file in Git where we keep the project mission statements. The mission statement hasn't changed; the key bit is that Nova is all about compute. I want to take a moment to talk about priorities. One of the things we realized over the course of working on different releases is that we need to focus ourselves on particular things, particular focus points, and make time for them so that we get them done. So I want to share with you what these focus points are. The first one is a good API. The API is key for Nova. The reason for the pretty picture is me trying to describe an ecosystem; it's what came up when I Googled for rainforest. Anyway, the idea is that we need to have a strong ecosystem around a strong API. Part of this is that the API has got to be good, and it's got to be the same across all the different deployments that you're using. It's got to be a familiar thing so that this ecosystem can build up around it. We're really focusing on doing a better job with that; I'll dive into more details in a second. The next piece is making sure that we stay robust and reliable.
It's really important that when you make an API call, the right thing happens. We're focusing on making sure that we do that: good testing, getting the bug fixes in, and generally looking at the bug themes that come up, identifying those, and working on the key refactorings so that we can go and fix those areas and really dig into them. So we're listening to the operators, listening to the user groups, really concentrating on staying robust and reliable, and on ways to make that easier to do. The next piece we had a lot of feedback on was making sure that upgrades are easy and that they work, so they don't impact your production when you make the upgrade. It's taken us many, many releases to get to where we are today; I'm going to dig into more detail on upgrades. We're starting to get quite a good story now on being able to upgrade, to make it easier for people to keep up with releases. Linking back to the API piece, it's really important that we try and get people onto these newer releases, to make sure that the API is available in the same form everywhere, supporting the ecosystem and keeping it vibrant so that we keep moving forward. And scale is another piece of this, right? We have lots of users, and we have to keep this working: as our users' needs grow, we need to be able to support those needs. There's also a key piece here, not really a background piece: we need to work as a community to stay open, to keep this good open culture. We talk about the four opens: open source, open design, open development, open community. We keep the roadmaps open, we have the design summit across the way, and we make sure that we keep this innovation going and really build this strong ecosystem around the API and all those bits and pieces. The other piece you'll see is this focus on Nova not expanding our scope too much.
Nova is a huge project, a really huge project. We want to make sure that we can focus our efforts and do those things really well, while at the same time supporting innovation within the whole of OpenStack. We have many projects that have spun out, ideas for expanding Nova that really fit best with their own dedicated community outside of the Nova project, and they're starting to flourish. Heat is a great example. So I wanted to dig into a specific example that's happened recently. We quite frequently talk about pets versus cattle, and one of the things you need if you want to have lots of pets is this concept of external server HA. People want this: the idea that you look after the instance, you monitor it, and when the instance goes down, you try and bring it back up somewhere else to keep it alive. This kind of "semi-HA", in quotes. Now, this is a complex thing to do, and it's one of those pieces where there's a lot of orchestration needed to make it really work well. We don't really want all that complexity within the Nova project, because it expands the scope so much; we just don't want to go there if we don't have to. What we've done here is work with several teams looking at this problem, and working with those teams, we've identified ways in which we can extend Nova by adding additional APIs. For example, one that lets you say: we know this host is dead, don't wait for any periodic task, we just killed it. That's the mark-host-down API; that's a new one. And there are other pieces these external server HA tools need as well: they need to evacuate, they need live migrate, they need to be able to disable hosts and report that to the users. There's this sort of supporting structure that we need in Nova to support an external tool.
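To make the mark-host-down idea concrete, here is a minimal sketch of the request an external HA tool would send. If I remember right, this landed as the os-services force-down call in microversion 2.11; the host name is a placeholder, and this sketch only builds the request rather than sending it:

```python
# Sketch of the "mark host down" call an external HA tool would make.
# We only construct the request here; the endpoint and host name are
# illustrative placeholders, not a real deployment.

def build_force_down_request(host, binary="nova-compute", down=True):
    """Return the method, path, headers and body for a force-down call."""
    return {
        "method": "PUT",
        "path": "/v2.1/os-services/force-down",
        "headers": {
            # Opt in to the microversion that introduced this API.
            "X-OpenStack-Nova-API-Version": "2.11",
            "Content-Type": "application/json",
        },
        "body": {"host": host, "binary": binary, "forced_down": down},
    }

req = build_force_down_request("compute-01")
```

Once the host is forced down, the HA tool is free to evacuate the instances elsewhere without waiting for Nova's periodic health checks to notice.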
So the key thing here is that we need to solve the problem for these users, but it doesn't necessarily mean we have to have the code inside Nova. There are ways of adding APIs, making sure those work, and solving the problem within the OpenStack ecosystem; it doesn't actually have to be within the Nova project. Okay. If you read the title of this talk, I mentioned Liberty, so I should probably talk about Liberty, otherwise you'd feel hard done by. Let's have a look at what's happening in the Liberty release. Probably the key thing is an awful lot of architecture evolution. That doesn't sound very sexy. I actually find it really quite exciting, but that's because I'm a developer and love getting into these things. What we're doing is kind of rewriting the bowels of Nova from the inside to make sure that we can keep scaling, keep this project going, and increase velocity while keeping stability. Now, for those of you that are eagle-eyed, you'll see I'm saying maintaining stability and increasing velocity, so you're thinking: he must be completely and utterly crazy and lying to me, because these are competing concerns. But it's all about trade-offs. We change the architecture so that it's easier to not screw up, for want of a better expression: we change things so that we can expand and add, and do all this while keeping the upgrades working and keeping a stable REST API. We're focusing on a lot of this work. So there are three themes I wanted to pick out about the architecture evolution: some API work, some upgrade work, and some scheduling and resource tracking work. Those are probably the big pieces. Let's dig into a few of those now. I'll start with the API, since it was at the top of all the lists and priorities. As a community, we've had this API evolution, and we stepped back and said: we need to really understand our API users and what they need from the API.
So I wanted to share this idea with you a little bit. The first API user persona is named "the absent". The idea of this user is they write their script to run their app. It starts up these VMs, runs the snapshots. Their script is amazing, it's beautiful, and it's working, and that's the key thing. They want to keep the script amazing and working all the time, regardless of all these upgrades that we're doing, regardless of all the changes and versions and all the bright ideas we have about consistency and everything else. That script still needs to work. Otherwise, they're screwed, and that's nasty, so we don't want to do that. What's the second one? The second one's called "the active". This person is super keen. They're like: I like rewriting my script 17 times a year because it's amazing. So they have a look at the API docs. Even before we've written the docs, they get the code up and go: yeah, there's a new API, I want to try that, because I want to see if I can do a snapshot with QoS, because it's cool. They might have a problem to solve too, and that's fine. But there are people who want to use the newest stuff. They want to know if it's available, they want to have a play with it, check its availability, and that's great. So we should help that user too. And by the way, if we can't ever change our API, we can't add anything, and then they'll get bored because there's nothing to play with. Then there's the other case, which is when Monty shouts at me. The infra team is an interesting example of this multi-cloud case. A more regular example is: you've got your OpenStack private cloud, you've got your OpenStack public cloud, and you've got your friend's OpenStack private cloud, the one under your desk that you accidentally kick the power cord out of.
And you kind of want that magical script you have, all the SDKs, all this ecosystem built around it, to work in all of these cases. That might mean the clouds are on completely different versions. As you might know, with Nova we allow people to deploy off trunk, not just released versions. So one public cloud might be halfway through the release cycle, like 17 versions behind over here and 17 ahead over there. It's kind of complicated, but that should work too. The other side of the story: I said that the active person wants new APIs to be added. Well, the ops people and the dev people probably want that too: new features you need to expose, for whatever reason, new problems to solve. As an operator, it's really good to know what people are actually using, like what version of the client they're expecting, and how many absent people versus active people you actually have. That's an interesting problem, and it'd be great if we found a solution for it too. So let's look at the actual APIs. First of all, we've got the current API, the V2 API. Made a mistake already: we've got the V2 API, which is our first API, and it's actually an alias for V1.1. It's a bit confusing; V2 was our first one. Don't go there. The basic idea we had, which seemed sound at the time, was: we've got this base API, and then we've got all these extensions, and you can query the API to see what extensions are available. Now, this led to a whole load of interesting problems. Let's just take one. I go review the code and think: yes, this instance stuff here is really good, let's get that in the API. We check it in. It gets deployed; people deploy off master. And then we go: oh, we call those servers, not instances, in the API, don't we? So it's easy to make mistakes, and now all these people writing against the API are relying on it staying the way it shipped. We kind of need a way of evolving this, and we were very stuck in this world.
We actually came up with a V3 idea, but if I went through all the ideas we came up with, we'd be here all day. So let's cut to the chase. We now have the V2.1 API, which is exactly the same as the V2 API, if we've got it right. Again, you're thinking: he's clearly lying, because he's just talking nonsense again. So the first step was: we want the same API everywhere. There's this multi-cloud case, and DefCore is pushing interoperability in a great way and really brings this to the top of people's minds. What we want is the same API everywhere, the same core API everywhere. So we've kind of got rid of the idea of extensions. If you look at the API, it still lists all the extensions; they're just all turned on, all there. But we got rid of that mechanism, and we replaced it with this idea called microversions. I don't remember who came up with the name. The basic idea is that the initial release of the V2.1 API is called V2.1. We add a little bit, and we call that V2.2. We add another bit... you see where I'm going, right? Then a little further down the line you go: well, actually, we'll remove that, and you call that V2.3, and then in V2.4 you add another bit. The key piece is that when you make a request as an API user, you actually say: I want to talk to the V2.3 API, please, because that's my favorite one. You can find out from the cloud whether it actually supports the V2.3 API and whether you're allowed to use it. And the idea is you keep using the V2.3 API for as long as you like, and all the changes that happen afterwards don't affect you. While we're talking about APIs, there's another piece. For the people taking photos, this slide has more builds to come, because I'm confusing like that. So for third-party APIs, the EC2 API is the key one that we've had in-tree. We've been struggling to keep it working and to get people to care about it. It's been hard.
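The version-picking logic described above can be sketched in a few lines. The cloud advertises a minimum and maximum microversion, the client checks its favorite version falls in that range, and then pins every request to it (via the `X-OpenStack-Nova-API-Version` header). One subtlety worth showing: microversions compare as (major, minor) integer pairs, not as strings or floats. The version range used here is illustrative:

```python
# Minimal sketch of client-side microversion negotiation.
# Version strings like "2.3" must compare as (major, minor) tuples.

def parse_version(v):
    major, minor = v.split(".")
    return (int(major), int(minor))

def supported(requested, min_version, max_version):
    """True if the cloud will accept the requested microversion."""
    return (parse_version(min_version)
            <= parse_version(requested)
            <= parse_version(max_version))

# e.g. a cloud advertising a range of 2.1 .. 2.12:
assert supported("2.3", "2.1", "2.12")
assert not supported("2.30", "2.1", "2.12")
# naive string comparison would get "2.3" vs "2.12" wrong:
assert "2.3" > "2.12" and parse_version("2.3") < parse_version("2.12")
```

Once negotiated, the client sends the chosen version on every request and its behaviour stays stable no matter what later microversions add or remove.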
What's happened is that some kind people have created a separate project that layers the EC2 API on top of the Nova API. We've been working with that project to ensure that we add the extra APIs into Nova that it needs for good EC2 compatibility. It's an external project right now that you can use today; I think it's the Kilo version of Nova you need, but you'd have to ask them. The idea is that we're moving in that direction, so we're actually deprecating the in-tree EC2 API, which has been neglected for some time, and we may well remove it in Mitaka. I'll do questions at the end so you can quiz me on this. The other piece, going a bit deeper into the code: right now we have two parallel code paths, a whole load of stuff for V2 and a whole load of stuff for V2.1. Every time we fix a bug in the base API, we've got to change two places, which sucks. We've done some unification of the testing and everything else, but the key point is that if you've tried Liberty, you're no longer using the old V2 code. By default, Liberty serves everything through the V2.1 code: the V2 API and the V1.1 API are both handled by V2.1's V2 compatibility mode. I'm not very good with names. Either way, if you use Liberty by default now, the V2 code is hidden away and deprecated, likely to be removed in the N release if we get our own way, which never happens, but we can hope. So let's talk about upgrades. I'm going to start with the key tenets of upgrades that we've had for a long time. One of the key things that many OpenStack projects do, and Nova is no exception, is this concept of the data plane and the control plane being independent. Sounds fancy, eh? What that really means is: if Nova dies in a fire, we haven't killed the hypervisor; it's still running your VM. That's the key piece.
So the idea is that you can have a little bit of downtime on the control plane without ever affecting the VMs, and that's an important concept that we build in. I mentioned briefly what we support in terms of how people deploy, and it really boils down to what we support for the upgrade case. We support upgrading from the latest stable branch to the next release, which makes sense. We don't support skipping a release: you can't jump straight to a release further ahead, you have to go one release at a time, right now. The other piece is we actually allow you to take any commit within the same cycle and upgrade between them, so you can upgrade from the master branch. You can do a little bit more than that, but the details get hairy and wouldn't fit on the slide. Another key upgrade requirement is that when you upgrade, you shouldn't suddenly have to rework the whole of your configuration. What we aim for is that when you do the upgrade, your existing configuration should just work. It may spit out a whole load of log warnings saying: hey, you're not going to do that next release, are you? Because we're removing that. Or: hey, this has moved over here. Those are warnings you have to work through before you do the next upgrade, but you shouldn't be forced to deal with them during the upgrade process itself. And as I hinted, if we're going to remove something, we give you lots of warning and try to get feedback that the replacement works. We never really want to remove something; it's genuinely a last resort. We want to keep the stuff there, but we want to make sure we message people, and if it's a real problem, we can work around it and come up with something else to do. So, I'm the Nova PTL, so I feel like a big, complicated, scary diagram is the right thing to do.
Now, there's a session here at the summit that goes into a deep dive on how the Nova architecture actually works. The basic idea is: REST requests at the top, they filter through the system, you get a VM at the bottom. Simple. Let's dig a little more into this diagram. When you upgrade, one of the things you have to worry about is the database. There are two parts to this: the database has a schema that needs to change, and there's data in it that might need to move. We'll talk about those two separately. I spoke about magic happening between the top and the bottom, and what that's all about is stuff talking to the database. Right now, all the things at the bottom talk to the database via the conductor. That's a whole talk on its own. The next piece to think about is that all those communications happen over the message queue. If you look at this diagram, I've been very odd with my colorings. All the things in red are new nodes, and all the things in the other color, which you can decide is blue or green, but we could be arguing about that all day, are the old nodes. The idea here is that you replace the control plane first, and then you slowly replace all the compute nodes. In order to make this magic work, to have old nodes and new nodes all mixed together, we say: hey, everybody, start talking the same message version over the bus. That's what I mean by RPC version pinning. I'm not going to keep talking about this forever; I'll speed up. The next piece is that it turns out the RPC version is really about the signature of the method: how many arguments there are and what order they're in. The data inside them is also formatted, using versioned objects. When the wrong version gets to the wrong node, we can use the conductor to make sure those versions all match up. Don't worry about the details. There's magic.
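That "magic" can be sketched very roughly: each payload carries a version, and when a new-format object reaches a node pinned to an older version, it gets backported by dropping the fields the old version doesn't know about. The field and version names below are made up for illustration; Nova's real versioned objects (oslo.versionedobjects) are considerably richer:

```python
# Toy model of versioned-object backporting during a rolling upgrade.
# Versions and fields are illustrative, not Nova's actual schema.

COMPAT = {
    # version -> the set of fields that exist at that version
    "1.0": {"uuid", "host"},
    "1.1": {"uuid", "host", "flavor"},
}

def backport(obj, target_version):
    """Return a copy of obj containing only target_version's fields."""
    allowed = COMPAT[target_version]
    data = {k: v for k, v in obj["data"].items() if k in allowed}
    return {"version": target_version, "data": data}

# A new control plane emits a 1.1 object; an old compute node is
# pinned to 1.0, so the conductor backports it before delivery:
new_obj = {"version": "1.1",
           "data": {"uuid": "abc", "host": "c1", "flavor": "m1.small"}}
old_obj = backport(new_obj, "1.0")
# old_obj -> {"version": "1.0", "data": {"uuid": "abc", "host": "c1"}}
```

The point is that old and new nodes can coexist on the same message bus: the pin fixes the method signatures, and the object backporting fixes the payloads.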
The other key piece here: I spoke about the priorities, that we have to keep things stable, and we have to be efficient about it. One of the key parts is testing. So we have tests that spin up the whole system with a new control plane and old computes, to make sure that all this magic actually happens. So, I did lots of waffling there. That was nice; I enjoyed that. How do you actually use this? Well, right now there's basically a four-step process. First, we expand the database so it can take all the new pieces. Then we restart all the API and control plane pieces; because the DB schema is already updated, we just take them down and start them back up. In previous releases, you had to take everything down, do the DB migration, and bring it all back up, and that was very slow. So: expand the DB, restart the control plane. Now we can take our time restarting all the compute nodes; there will be many more of those, and we need to be graceful in how we do that, due to other details. As you can tell, this could be like four talks on upgrades. But what we're trying to do here is make sure there's a great way to reduce the downtime. Here we've got a very small blip on the API and control plane, and you slowly restart the compute nodes at times people won't notice. When you're finished, you tidy up, which is what updating the RPC pins means. So I wanted to bring this back to the diagram. We update the database, we update all the control plane nodes, the key one being the conductor, then we slowly update the compute nodes as we need to, and then there's the magic I spoke about: the RPC pinning tidy-up, to make sure we enable all the new features and everyone can start talking the new version, because everyone's new. Now, I wanted to highlight something for those of you who are thinking: he said this didn't mean much downtime.
He also just said: take down the whole control plane and bring it all back up. Good point. The process I was just describing was available in Kilo and has been stabilized in Liberty. What we're working on in Mitaka is the idea that you can keep the old API nodes running, and if you're using load balancers, you bring up your new API nodes and drain the connections over to them, so you get a smooth API node transition. The ordering would then be a bit more like: do the database, do the control plane, do the API, and roll the computes; do the API later. There are still details to work out on this, but we're working on reducing the downtime even further. The other thing is that I actually have to explain this process, and there's a bit of it that takes me an awful long time to explain, which I haven't bothered with here: this number four piece, the RPC pins. We're adding features into Nova in the Mitaka release to alleviate some of this complexity, so that we can, to some extent, automatically sort out all these versioning inconsistencies and you don't have to deal with them. So we're trying to simplify the process while also reducing the downtime. Now I wanted to quickly talk about reducing scope creep. Really caring about the API and really caring about upgrades takes a long time, and the way we make time for it is by not increasing the scope. I just like this picture because it looks like the box is about to fall off, but it's kind of an interesting analogy. So the first thing we started doing was trying to define our project scope better. Before, it was more of an unwritten rule. We're trying to be better at communicating the consensus that we come to within the community and making those documents widely available.
It's all about being as open as possible. The next piece, because everyone seems to be asking about it, is containers. We have support for containers via libvirt LXC. It's very much support for using a container like you'd use a VM; it's not very container-centric. Magnum is doing a much better job of looking at how you do an API for containers that's designed for containers and works the way you expect containers to work. We don't want to duplicate that, so the Magnum project makes a lot more sense. For those of you that really like the film Inception, Magnum creates lots of interesting combinations, because Magnum has these bays, and the bays can be any kind of Nova resource. So you can have an LXC container bay with containers inside it. Let's not go there. Obviously, we also support Ironic nodes, which are bare metal nodes, so you can have bare metal as the bay, or you can have a VM as the bay, and all these options get really exciting. Exciting if you like inception, which I do. Another thing that comes up is nova-docker. nova-docker has been an unfortunate stepchild. What happened was, once upon a time in Nova, in a land far, far away, we decided that we should probably have some testing, because every time we don't test things, they break horribly in a fire. Actually, they just break and no one notices; that's the main problem. We needed to lock down the testing, and as part of that we set an ultimatum: if you're not tested, we're kicking you out of the tree, so we deprecated everything that wasn't tested. This all happened, and nova-docker went out. There's some other history, but it's not inside our tree right now, and I'm not sure it's working right now, because we changed things and it broke. Anyway, let's wrap up on Liberty. The key thing is we're working hard on the architecture evolution, and we're working hard to make sure upgrades are less impactful.
We're getting there. There's going to be loads more work in the future, but we're getting to a place where you can do a nice upgrade, even if it's a bit complex. We're working hard to build a great API ecosystem and create the frameworks to facilitate that. And we're making sure that we reduce the scope creep so that we can actually get these really important things done and focus on what we really need to do. The scale of this is actually quite surprising: at one point we had about 100 blueprints approved, of which we managed to get about 60 merged, plus 400 bug fixes. There's loads of stuff going on, but I just talked about the stuff that was really interesting, at least to me. So this is the time I make a complete prat of myself on video by predicting the future. I'll only talk about the stuff that I'm relatively confident about; I put "and beyond" in the title because the timescale is probably the most indeterminate piece. So, cells. The most recent time I remember the developer team getting together, it was like: hey, we should probably all talk about cells and try to understand each other. About two hours later we got some consensus on what we all thought. So it's kind of a bit complicated right now, but let me try to summarize it in the little time we have left. What does cells look like, and why is it an interesting thing for me to talk about? The basic idea of cells is that you have an API layer that does API things, and you have compute cells owning the hypervisors. Now, there are limits on the size of a compute cell right now. We want to do work to make sure that we can have larger compute cells, but there are limits. What are the limits? It kind of depends what you do, but going above 500 or 1,000 nodes in there, most people reckon, is a bit tricksy. We'd love to add a zero onto all those numbers, but that's life.
But it's not really about retrofitting scalability onto something that doesn't quite scale. The really useful thing about cells is when you come to think about production. Imagine you have cell one and cell two in production, with your API on top. Everything's happy, everything's working. Awesome. Overnight, you become really successful, everyone wants to use your compute capacity, and you're running out. You're like: oh no, I'm too popular. So you need a way to expand your deployment, right? Cells is actually a really useful tool for this. You can set up cell three, test it, make sure you're happy, and then add it into production. And because of the way the system works, that shouldn't affect the scalability of cell one and cell two. You might need to expand the API a teensy bit if you need it, but the idea is that you can expand horizontally by adding these cells in. It's kind of an interesting way of expanding capacity. If nothing else, when you expand you often buy new hardware, so you know cell three is slightly different from cell two, and when there's a bug that people are only seeing in cell three, you know what to blame. Which turns out to be surprisingly useful. So, enigmatically, I suppose, I had this v1 and v2 here. v1 is what we have today. Cells is optional, and not all features are supported. It basically boils down to the blue line that's between the API and the compute nodes; you see there's that interconnect. With cells v1, a default deployment is an API cell with a whole load of compute nodes shoved in it. Then when you turn cells on, the magical blue line appears. It's not really a blue line, obviously, but there's a separate code path that appears to talk to cell one and cell two and cell three. That's an if statement, and you have to test both sides of the if statement. We do now, but it's an extra complexity that's all very annoying.
One of the other pieces here, and there are loads of pieces, is that in cells v1 the API cell has a record of the instance: there's a row for the instance in its database. And compute cell one also has a row for the instance in its database. So across that blue line we're busy going: the power state's changed, you need to tell the API; the API's changes need to go down to the compute cell. There's a lot of syncing going on, which is a pain. With all this in mind, and with the discovery that cells is really useful, we came up with the cells v2 architecture. This is the piece where we have some bits in Liberty, we're hoping to get some more bits in Mitaka, and by the N release, hopefully, the picture below there can work in the cells v2 world. The key point about cells v2 versus cells v1 is that in cells v2, the default deployment is the API, the blue line, and compute cell one. Going back to what I said about v1: v1's default is the API cell with compute nodes in it; v2's default is the API plus a compute cell. So basically, you may not know it, but eventually everyone by default will be using cells v2. When you add an extra cell, you've got all the cell infrastructure already working, and you just add it in. We've done that in a way that shouldn't affect performance, certainly for the one-cell case. So it basically gets rid of an if statement, which sounds super simple, doesn't it? It's not; it turned out to be really hard. But let's move on. I don't really know why I had two yaks attacking each other in this picture. One of the key things I like about Nova, and want to keep with Nova, is that we care about our users. The upgrades work is because we care about our operators, the API work is because we care about our API users, and we care about our developers being able to work within all this space. There are lots of things we're going to be doing in the future to help with this. API documentation is one of the key things.
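A toy model of the cells v2 idea described above: instead of syncing duplicate instance rows across the blue line, an API-level mapping records which cell owns each instance, and the API layer looks up that cell's database and message queue when it needs to act. All of the names and endpoints here are illustrative, not Nova's actual schema:

```python
# Toy model of cells v2 instance-to-cell mapping.
# Cell names and connection strings are illustrative placeholders.

cells = {
    "cell1": {"db": "mysql://cell1", "mq": "rabbit://cell1"},
    "cell2": {"db": "mysql://cell2", "mq": "rabbit://cell2"},
}

# API-level mapping table: instance uuid -> owning cell.
# The instance row itself lives only in that cell's database.
instance_mapping = {
    "inst-a": "cell1",
    "inst-b": "cell2",
}

def connection_for(instance_uuid):
    """Find the cell that owns an instance and return its endpoints."""
    cell = instance_mapping[instance_uuid]
    return cells[cell]

assert connection_for("inst-b")["db"] == "mysql://cell2"
```

Adding a new cell is then just a matter of registering its endpoints and routing new instances to it; existing cells are untouched, which is why expansion doesn't hurt their scalability.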
We want to improve stability by improving all the error handling, and I really want us to get better at messaging about what works and what doesn't. There are lots of things we'd like to do. As a developer team, and this is less interesting if you're a user, there's lots of work we're doing to keep evolving our process and to make sure we tell people what we're doing and why we're doing it, so people can get engaged better. We want to work more with the product working group, to help get resources onto the right things and to help increase alignment between the people sponsoring the development, the developers, and everyone else. If any of you are developers here and want to do more reviews on Nova, I would come and hug you, but I'm a little bit too British for hugging. More reviews are fantastic. I don't necessarily mean more core reviews; I just mean more people doing reviews, more people understanding the system. People helping with that will help us so, so much. I also said releasing more often; there are lots of things we're talking about, and that's one of them. We let people release off trunk, and we might want to try some other methods. So, I've waffled on for a long time. I thank you all for listening, and I hope that gave you a flavor of what's happening in Nova. Thank you very much.