So, to kill time while we're waiting for the slides to come up, we're going to do some improv games. Are you ready? Right. Just kidding. Boo! Boo! Boo! So, how many people in here have been to at least one of the design summit sessions upstairs? Raise your hand. All right. Okay. He's going to make it work for me. Yes.

So, usually I have a really easy job, which is that I get to go to all the design summit sessions and then summarize at the end for all the people who didn't get to go. It's not like that this time, for two reasons. One is that you all are apparently attending the design summit sessions, which I was not told about, and I'm quite upset. And secondly, we're running the conference and the summit concurrently this time, so we've only been through one day of sessions so far. So I am going to use my powers of foresight to see into the future, pretend I know what we're going to be talking about for the next three days, and tell you about it. So I'm going to do a pre-summary. That should be exciting.

We actually did have this working right before everybody walked in, so it's one of you causing RF interference or something, ruining it. I'm not mirroring, but it was working before; I can switch to it. Oh, is it possible that someone, yeah, he plugged it into the wrong one. That probably wouldn't make a difference. There we go. Okay.

So welcome, everyone. My name is Vishvananda Ishaya, or Vish, as everyone calls me, or sometimes Vishy on IRC. I'm the technical lead for OpenStack Compute, and I always like to start out with a brief history. Now, some of you have probably seen this before, so I'm going to keep it shorter than I usually do. Besides, we lost a little bit of time due to the keynotes going so long.

Before I was working on OpenStack, I used to live in Iowa. We have a lot of corn there; in fact, I call it the Silicon Valley of corn. So a couple of years ago, I had the great opportunity to move out to San Francisco to work at NASA, at the NASA Ames Research Center, for a little company called Anso Labs, because nobody actually works directly for NASA; they're all contractors. We were working on a project that was supposed to unify all of the various 3,000-plus websites at NASA into one overarching platform, and in order to do that, we had to have some sort of infrastructure underneath that was dynamic and scalable. We attempted to use open-source solutions and found that the existing ones at the time didn't work. So the day that I showed up in San Francisco, Jesse, who's actually sitting right there, and Josh McKenty, who is not in here, sat me down and said: this weekend, we're going to write an infrastructure-as-a-service layer, and we're going to demo it next Tuesday, so good luck. Four days of Red Bull and burritos later, we had a working prototype. The codename at that time was Pinet, but we quickly renamed it to Nova because it's much cooler, and that codename has stuck. We open-sourced the code, and we managed to start using it with our alpha customers at NASA within three weeks. Then Rackspace found out about it, and we collaborated, taking the Nova code that we had created at NASA plus the Cloud Files code that they had created at Rackspace, which became Swift, to found OpenStack a couple of years ago.
Since then, we've had two years' worth of releases, and we've grown at kind of an astronomical rate, and I don't know if I'm allowed to use astronomical considering I used to work at NASA, but it's been an amazing ride. We had six people working on the code originally, and now, just in OpenStack Compute, I think we had over 75 different individual contributors over the past six months, and we've had something like 150 in the lifetime of the project.

What I'm going to do, basically, is give you an overview of what we did in the past six months and what the current state of things is, and then try to use a little bit of my precognitive powers to tell you what we're going to be working on over the next six months.

First of all, we did a massive amount of work. This is a view of the 2012.2 release, the Folsom release: 47 blueprints and 674 bugs, which is kind of mind-boggling, that there were that many. Now, some of those we created in the last six months, so thank God we fixed those.

Just to briefly cover some of the really nice things that happened during the Folsom release; this isn't going to be an exhaustive list, because a whole bunch of work went into the project, but here are some of the main points that I think are really interesting. We had this thing from long ago, when we first started: an auth system inside of Nova. In fact, it talked to LDAP, because that's what we needed inside of NASA to do authentication and identity. Then eventually we decided, well, this doesn't really belong in our compute layer, because the other services are going to need to use it too, so we broke it out into this cool project called Keystone; that's a keystone there. But because we had people using the existing auth system, we had to go through a very slow process of deprecating it while allowing people to keep using the old internal auth. We initially got Keystone out around the Diablo timeframe, so two releases later, in this release, we've finally managed to remove the deprecated auth, so we don't have people dealing with it anymore. Thank God. Yes, that was a big success.

Another one was Hyper-V coming back in, just under the wire at the end of the release. Microsoft put in a lot of effort, along with help from some other companies, to get the Hyper-V code back up to speed. We initially had a somewhat functional driver a couple of releases ago, and we pulled it out because it wasn't being maintained. The press releases sparked some interest from the giant in Redmond, and they put some resources on it, and so now we have working Hyper-V code back in the core code base. It's actually pretty well tested, so it should continue to function well for people who want to deploy OpenStack and Hyper-V together.

Another really big win: how many of you were here for my talk six months ago at the last summit? Okay, about half of you. So the topic of my last talk was essentially that our goal for the last six months was minimizing the scope of Nova. Nova was the original project, and it did everything. It did block storage, it did networking, it did compute, it had its own built-in auth system. So what we were trying to do, because so many contributors were trying to work on Nova, was pull everything out into separate projects, to minimize the scope and make it easier to take new contributions, so that the block storage people could focus specifically on block storage, the compute people could focus specifically on compute, and so on.
And that was quite an involved process; it involved slow deprecation and trying to pull things out with migration plans, et cetera. So we managed to create a new project called Cinder, which is now the block storage service. The PTL for Cinder, John Griffith, will be talking later today, I believe; he's right over there. And it's been really great. It's actually quite amazing, because when the volume code was in Nova, it saw very little contribution and use. And I felt like in order to really kickstart it and get it going, it needed its own umbrella, its own project, for people to get together around and really focus on it. And since the breakout, there's been a ton of activity. A whole bunch of vendors have contributed back-end drivers for it, and it's just moving very quickly, so I'm pretty excited about that.

One of the other big changes that went in during Folsom, and it was somewhat of a painful change, is that we actually started versioning our RPC messages. So the messages between the different internal Nova components now have versions associated with them, which means that in theory we can run different versions of the workers and have them still communicate in a backwards-compatible manner. This lays a lot of the groundwork for a future where we might at some point be able to do live upgrades of the API system. Right now the upgrade plan in OpenStack is that your virtual machines, block storage devices, and networking have no downtime, so they stay up when you do an upgrade, but the API has to come down for some period of time while you're upgrading your database and moving things around. Long term, we're looking toward a point where we can possibly do live rolling upgrades, where you could have some parts of the system upgraded while others are still on the old version, and then gradually bring everything forward without any API downtime. So that was a huge step, and it has required a lot of focused effort, especially from someone at Red Hat named Russell Bryant, who did a whole bunch of that code and has made sure that people don't break it.

Another really interesting thing: towards the end of the release, I sent out a very controversial email about XML support. We've had XML support in the OpenStack API for quite some time, but it was relatively untested. And so I basically said, if we're just going to keep putting out an API that we don't even know really works, maybe we should just deprecate it and start telling everybody to use JSON, with the assumption that end users hopefully don't really care whether they're using XML or JSON; they just care about having a library that can talk to the API effectively. Well, that created a huge uproar. And so I said, hey, great, if everybody really cares about XML, how about some help getting it tested? Because we have a lot that we need to do, and we don't have enough developers to work on it. And I actually got some incredible contributions from people working on Tempest, and especially from IBM. We had three or four people from IBM do a heck of a lot of work. And we did a couple of things. We have more testing going into Tempest to test XML support, completely black-box testing from the outside. And we also created a framework inside of Nova to do API sample testing. So all of the API samples that you see on api.openstack.org, like this one here, are now being automatically generated from a set of integration tests inside of Nova.
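To make that idea concrete, here's a heavily simplified sketch of what template-based API sample testing looks like. This is illustrative Python, not the actual Nova framework; the names and structure here are made up.

```python
# Illustrative sketch only -- not the actual Nova sample-test framework.
# The idea: canonical response templates are stored with the tests,
# variable fields are placeholders, and every response from an API
# running against a faked-out compute worker must match its template.

import json

# A stored response template; %(id)s marks a field that varies per run.
TEMPLATE = json.dumps({"server": {"id": "%(id)s", "status": "BUILD"}})

def matches_template(template, actual_body, subs):
    """Substitute the known values into the template, then compare."""
    expected = json.loads(template % subs)
    return expected == json.loads(actual_body)

# Pretend this response came back from the API with a fake virt driver.
actual = json.dumps({"server": {"id": "abc-123", "status": "BUILD"}})

assert matches_template(TEMPLATE, actual, {"id": "abc-123"})
```

Any response that stops matching its stored template fails the test, which is what makes the samples trustworthy enough to publish as documentation.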
That framework actually runs as part of the unit test suite. It'll spin up a sort of fake compute worker that doesn't actually do any real virtualization work, and then it'll make templated XML requests and JSON requests to the API, record the responses, and test to make sure that each response is actually what we expect it to be. What this means is that any future change to Nova that changes any of the API calls will automatically break one of these tests. And so it'll have to go through extensive review to decide whether we're actually willing to make a change to the API, whether there was a mistake in the spec, or whether there was something wrong in the test. So essentially we can guarantee that our APIs are going to remain stable in the future.

That was a really big step, and all of those automatically generated API samples are now in the process of being synced; all the core API samples are synced right now onto api.openstack.org. So these are things that would actually get returned from a real OpenStack cluster, instead of just things that we expect should be returned according to the spec. That's really a big plus as well: you can see, here's the XML response you can expect when you make this call to a real OpenStack environment. A lot of work went into that. We have all the core APIs tested, and we have most of the really commonly used extensions tested. The rest of those are all going in early in Grizzly, and they will be backported to the stable Folsom branch, so that we can provide the full set of APIs and testing for all of them. So that was a lot of effort right at the end, and we got some great help.

Another big thing that went in: we spent a lot of time working on the state management in Nova. If you used Essex, you'll have noticed that there were a lot of times, especially in the Diablo timeframe but a little bit in Essex too, where certain things would get into a state you couldn't really recover from. Your instance would get stuck in some task, and you couldn't even delete it; you'd actually have to go into the database and reset the state so that you could delete the thing. That's kind of an annoying user experience. So we spent a lot of time on this. Yun Mao actually went through an analysis of all the different state transitions in the system, created a big Excel spreadsheet, and found a bunch of places where we weren't being consistent with our state transitions, and a lot of work went into making that more consistent. So we now have much better state management code around virtual machines, and they don't get into these unknown states as often.

Another very big piece is general host aggregates. Excuse me for a second. So we had a method for providing metadata about the individual hosts in the system to the scheduler, but it was a clunky way of configuring metadata: you had to essentially specify a configuration option on the host, and then it would be sent up to the scheduler on a regular basis. The problem with that is that if you ever wanted to change the metadata, you actually had to go to the host that was running the service, change the configuration option, and restart the service, which is not incredibly useful. So now we have something called host aggregates, which is an admin API that allows you to associate metadata with a group of hosts.
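Conceptually, aggregate-based scheduling boils down to something like the following. This is just an illustrative sketch, not Nova's actual scheduler code; the data layout and names are invented for the example.

```python
# Illustrative sketch of aggregate-based scheduling -- not Nova's code.
# An admin associates metadata with a named group of hosts, and a
# scheduler filter only passes hosts whose aggregate metadata matches
# what the requested instance type asks for.

aggregates = {
    "fast-storage": {"hosts": {"node1", "node2"}, "metadata": {"ssd": "true"}},
    "xen-hosts": {"hosts": {"node3"}, "metadata": {"hypervisor": "xen"}},
}

def host_metadata(host):
    """Collect metadata from every aggregate the host belongs to."""
    meta = {}
    for agg in aggregates.values():
        if host in agg["hosts"]:
            meta.update(agg["metadata"])
    return meta

def filter_hosts(hosts, required):
    """Keep only hosts whose aggregate metadata satisfies the request."""
    return [h for h in hosts
            if all(host_metadata(h).get(k) == v for k, v in required.items())]

print(filter_hosts(["node1", "node2", "node3"], {"ssd": "true"}))
# -> ['node1', 'node2']
```

The important property is that the metadata lives behind an API, so changing it is one admin call rather than a config edit and a service restart on every host.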
And the advantage of this is that you can make interesting scheduler decisions. For example, you could say this group of hosts has GPUs attached; then, if you have an instance type that should only run on GPU hosts, so that you can offer a GPU-enabled workload, you can make sure it only goes to the hosts that have that metadata associated with them. That's a very simple example. Another one would be the hypervisor: you could have two different clusters, one running the Xen hypervisor and one running the KVM hypervisor, and be able to specify via an instance type where you want your image to run, whether on Xen or KVM. So it gives us a lot more flexibility in making scheduling decisions, by allowing us to take metadata and associate it with a group of hosts.

Another big one was a bunch of XenAPI features. These are all features that previously existed only for KVM, and Citrix and some of the Xen people at Rackspace worked very hard, with a little help from others, to get boot from volume, live migration, and block migration working for Xen as well as KVM. So that was a big plus for operators who are deploying on XenServer.

And that was just jumping through a few of the really key ones. There were a whole bunch of others; I put a few more up here, and we had something like 60 blueprints, so there was even more that I didn't have time to cover. But here are just a few. The API server can run multiprocess now, which is better for scaling. We now have an option for LVM-backed images instead of QCOW2 for the libvirt backend. We did a lot of cleanup in the database, making sure everything was converted to UUIDs; for example, volumes back in the Essex timeframe had integer IDs, which is not very useful when you're talking about scaling and being able to migrate things across different data centers, et cetera. So everything's UUID-based now. There was some work put into something called the config drive. Config drive allows you to configure an instance without needing network connectivity; it provides an alternative to the Amazon-style metadata server, which is how a lot of guests configure themselves now. And we made a bunch of improvements to the scheduler: different filters, different ways of configuring it, allowing for affinity, et cetera.

So when we're talking about the successes, we should also spend a little time looking at what didn't go as well. And I hate to admit it, because some of this stuff is my fault. One of the things that we had trouble with was keeping up with bugs. We currently have about 100 new bugs a month being reported just in Nova, and keeping up with that is hard. This is our triage list, not our fixing list. You'll notice this period right at the top: there were over 200 bugs that had come in that none of the core members had looked at to find out how critical they were. And the reason for this was that right during that period, we were working on a whole bunch of features. Then we went through a large bug triage, and we managed to get the queue back down to zero at some point. But it's really not ideal for that to happen. We shouldn't have a long period where bugs are sitting in the queue and then try to knock them all out at once.
So what we've actually started is that in our regular weekly Nova meetings, we're going to have a standing topic about bugs, making sure this doesn't get out of hand, so that we can stay on top of it in the future.

Migration testing: we had a couple of nasty bugs that we missed in terms of upgrades. I probably should have called this upgrade testing. There were a couple of them, and they were kind of edge cases in one sense, but we didn't do enough testing of actually having a running live system with a bunch of real workloads and then upgrading and making sure that everything still exists. So we missed a bug in the compatibility of EC2 IDs, which we've since fixed and backported into stable Folsom. It basically means that if you have a bunch of EC2 IDs in Essex and then you upgrade without the fix, you end up with new IDs. Another one: the volume upgrade has a similar bug related to integer-ID-to-UUID conversion. If you're running a new cluster, no problem; but if you had a bunch of old volumes, they end up with new IDs, and you have to un-export and re-export them to actually use them, which we're in the process of fixing now.

Another bummer, though I don't know if it's a huge bummer, because you kind of expect this to happen when you're doing time-based releases: there were some big features that we were hoping would make it in that didn't quite make it. One of them is cells, which we've talked about at almost every summit, and we just had a session about it today. Another one that we actually got in, but then had to back out because it created a few bugs too close to release to fix, was per-user quotas. Right now, your quotas are per project, and there's no way to say that a particular user in a project has a smaller quota than the whole project; any user in the project can use the whole quota. There was a really nice patch put in to fix that, but it had a few issues that broke existing quotas, so we had to roll it back out. And then bare metal provisioning: there's been a lot of effort going into that, and I'll actually talk about it a little later, but it was too big a patch coming in too late for us to really review it well and be sure that we wanted to put it in.

So that's where we're at and where we've been; here's where we're going. We're all in the design summit right now: four days, and we've been through one of them. One of the really important things we're talking about and finalizing: the other big breakout was switching all of the networking control over to Quantum, and we're doing it a little more slowly than the volume-to-Cinder transition, just because migrating between the old networking style and the new networking style is going to be more difficult, and we want to make sure that we don't break existing networking deployments. The hope was that in this release, Quantum would become the default for all new installs. We didn't quite get to that level of completeness with it. We got to the point where we recommend it for advanced users, but we still have Nova Network as the default. And there are just a few small gaps that we need to close before we can actually switch over and make Quantum the default. The first is that the existing calls that Nova is responsible for need to be proxied over.
So for example, you might use the Nova command line client to create a new floating IP or associate a floating IP with an instance. It's not very user friendly if a user is used to doing that and then suddenly you install Quantum and those calls don't work anymore. There are floating IP calls in Quantum now, but the link to proxy the existing calls over to Quantum wasn't in. That actually just got added; it's in, or will be in, the first milestone of Grizzly. So that one's pretty much checked off. Another big one is security group support. Right now, all the security groups are still managed by Compute, and security groups are a networking concern, so they need to move into Quantum; the Quantum team just didn't have time to get all of that in for this release. And then finally, multi-host support for Quantum, for Layer 3 networking. Nova Network has a mode where there isn't a central master host responsible for the networking of all the compute nodes, and Quantum hasn't added that kind of redundancy yet. So those are the gaps that Quantum needs to address, and they're talking a lot about them during their sessions. I'm sure Dan will give an update on the progress of those when he talks. But once those are in, then we can actually switch the default over to Quantum and go through the long process of deprecating Nova Network, so eventually we don't have to mess with that code anymore, hopefully in a year or so, thank God. I was the original author of a lot of that code, and I'm ready to see it go away.

So, cells. One way to think of what cells enable is trivially scalable clusters. The idea is that there are a lot of different ways you can scale a system when you're trying to get up to, say, 10,000 nodes, and some of those scaling problems are difficult, but you can make them a lot easier if you limit the size of a single install. So what cells give you is a way to make an install that's limited to, say, 200 to 400 nodes, and then replicate that same install, with a very small subset of the data replicated up to a centralized controller sitting above all of these different clusters. It has a lot of very nice scaling characteristics and a lot of good failure characteristics: if something goes wrong, say with your database or your queue, you're only losing part of your infrastructure instead of the whole thing. So it essentially makes a lot of the scaling and redundancy problems easier. It doesn't necessarily solve them, and long term we may say, hey, cells is not really the right way to do this; in the future we just want to have 100,000 nodes all in one system, with a highly scalable, super-redundant database, and we've solved all of those problems. But until we get there, it'd be nice to be able to get up to 10,000 nodes without having to worry about those scaling concerns. It's already running in production at Rackspace, so they're already using this code. It has a few gaps and needs some cleanup: the parts of Nova that Rackspace isn't using, they haven't really addressed. One example is security groups; they don't use those, and so security groups don't work with cells, and that needs to be fixed up. We just had a great session on this. It was probably the most packed session I've seen at a design summit; I think there were about 100 people spilling out of the room trying to get in to hear about cells.
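To give a rough feel for the shape of the thing, here's a conceptual sketch; this is not the actual Nova cells code, and the classes and numbers are invented for illustration:

```python
# Conceptual sketch of the cells idea -- not the actual Nova cells code.
# The top level sees only a small, aggregated summary of each child
# cell. Scheduling happens in two steps: pick a cell from the summary
# data, then pick a host inside that cell, using the cell's own
# database and message queue.

class Cell:
    def __init__(self, name, free_slots):
        self.name = name
        self.free_slots = free_slots  # aggregated capacity reported upward

    def schedule_locally(self, instance):
        # Inside a cell, this is an ordinary scheduler run over a
        # bounded set of hosts (say, 200 to 400 of them).
        self.free_slots -= 1
        return "%s placed %s on a local host" % (self.name, instance)

def schedule(cells, instance):
    # The top-level cell only needs the replicated summary data.
    best = max(cells, key=lambda c: c.free_slots)
    return best.schedule_locally(instance)

cells = [Cell("cell-a", free_slots=120), Cell("cell-b", free_slots=310)]
print(schedule(cells, "instance-1"))  # -> cell-b placed instance-1 ...
```

Because each cell has its own database and queue, a failure in one of them only takes out that cell's slice of the infrastructure, which is exactly the failure property described above.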
But we should be getting the cells code cleaned up and merged fairly early in Grizzly. There's a public branch up, and various people are already using it. Some people in Australia have already deployed it, and I think MercadoLibre is also working with it. So there's definitely a lot of momentum behind that effort.

No-database compute. This is something that we've been working on for a very long time, and thanks to the RPC versioning and a lot of the cleanup that happened in the communication layer, we're actually at a point where we're starting to make progress on it. This is a really big deal for a lot of reasons. One is scaling: if you only have a few centralized nodes that need to talk to the database, versus all of the compute nodes, then scaling the data store becomes much easier. Another is security. One of the long-term goals is to be able to say, hey, if someone breaks out of a hypervisor, it would be really great if they didn't have access to the rest of the system and couldn't poison the whole cloud. That's very difficult to do when every node running compute has access to the database. There are two parts to that piece: one is to remove the database access, and the other is to have the messages on the queues be signed, so that we know exactly where each message is coming from and so that any given message can only affect a small portion of the system. And then for upgradeability, it helps a lot: all of the concerns about different workers talking to different versions of the database shrink dramatically if we go from 10,000 compute nodes touching the database to five API and scheduler nodes. Suddenly upgrading becomes a heck of a lot easier too. So this is a really important piece that's not necessarily very user-facing; you're not going to see a difference from it going in, but it's going to help operations and lay the groundwork for a lot of future features that we want to bring in.

Bare metal provisioning, which I mentioned didn't quite make it, has been a collaboration between NTT Docomo, VirtualTech Japan, and USC-ISI. There's been a lot of work on it. It does some pretty neat stuff with PXE booting, essentially making it look like you're booting a virtual machine when it's actually controlling real hardware. We're going through the process of cleaning it up and figuring out how we can integrate it without disrupting the other core drivers in Nova. And we are working with them; there's a session either today or tomorrow about how we can start getting the pieces in and actually have this as a feature: Nova controlling bare metal.

This is another big thing. At the end of the Folsom release, we had Dean Troyer, who's essentially the de facto PTL for DevStack, although DevStack doesn't officially have a PTL, start working on a project called Grenade. Grenade is basically our way of ensuring that our upgrade process works. What it does is install one version of all of the OpenStack components, run a bunch of stuff, then install the next version of the OpenStack components and check to make sure that the system is still working. We call it Grenade because, hopefully, we'll know when things blow up. We're working with the CI team to get this included in the actual gating tests. We ran it a bunch, outside the gating tests, right before Folsom.
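The core of what Grenade checks is easy to sketch. Grenade itself is a set of shell scripts built around DevStack; in this illustrative Python, a fake in-memory cloud stands in for a real deployment:

```python
# Illustrative sketch of what an upgrade test like Grenade verifies.
# Grenade itself is shell scripts built around DevStack; here a fake
# in-memory "cloud" stands in for a real deployment.

class FakeCloud:
    def __init__(self):
        self.version = None
        self.servers = set()

    def install(self, version):
        self.version = version

    def boot_server(self, server_id):
        self.servers.add(server_id)

    def upgrade(self, version):
        # A real upgrade migrates the database in place; bugs like the
        # EC2-ID and volume-UUID issues mentioned earlier show up here.
        self.version = version

    def server_exists(self, server_id):
        return server_id in self.servers

cloud = FakeCloud()
cloud.install("folsom")
cloud.boot_server("instance-1")            # create resources on the old release
cloud.upgrade("grizzly")                   # upgrade in place
assert cloud.server_exists("instance-1")   # resources must survive the upgrade
print("upgrade smoke test passed on", cloud.version)
```

The real thing exercises actual services end to end, of course; the point is just that resources created before the upgrade must still exist and still work afterwards.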
We actually picked up quite a few upgrade bugs right at the last minute, so it was really good that we did that. And we're going to include this in the gating tests, so that essentially we won't be able to merge anything that breaks the upgrade process, which will make us feel a lot safer when the next release comes around.

Another thing we're working on is a better user experience for booting from volumes. Right now there's this horrible way that you do it, which is essentially specifying this large thing called a block device mapping, which is something we inherited from the EC2 API. It's just not user friendly at all. You don't want someone to have to specify that arcane syntax just to boot off a volume. So we're discussing what the best user experience for booting from a volume is. How can we make it seamless and easy for the user? Get it into the CLI that way, get it into the APIs that way, and get it into the dashboard that way, so that for a user, booting from a volume is basically just: oh, there's a volume, I'd like to boot this image onto it, let's go, click. It's not that easy right now, and it needs to be.

A bunch of other cool features, because I don't have time to go into detail on all of them. There's a session on configuration cleanup. We have something like 600 different configuration options in Nova now, and probably only about 30 of them are actually used with any regularity. So we're talking about minimizing all those extra ones that don't really need to be configuration options, coming up with a set of genuinely useful ones, documenting those, and deprecating the unused ones. Policy improvements: again, configuring policy for authorization is a little complex and difficult, sort of arcane, so we're trying to improve the syntax and the documentation there to make it easy for deployers to configure who's allowed to do what in the system. DNS integration: this actually got a mention in Chris's talk. There's one large session about the path forward for DNS. We have stuff inside of Nova that was contributed by Wikimedia, and then we have a couple of other external DNS projects, so we need to come up with a standard plan for how to do DNS, because it's something that everybody wants: it's much easier to deal with names for instances rather than IP addresses, and especially as we move to IPv6, DNS becomes pretty much a requirement. So that's a big one. We're going to talk about database improvements and keep working on those. And we already had a session about the EC2 API, and there's a new API, the Google Compute Engine API, that Cloudscaling has worked on, and we talked about how we can integrate that in. So there's a whole bunch of really cool stuff going on.

And I'm almost out of time, so I want to leave a second for questions, just in case anybody has anything they wanted to ask before we break up. Anyone?

So, it depends on exactly what you're trying to do. If you're trying to get code into the core code base, which is generally what people do, they'll develop something on their own in their company and then say, okay, we really want to contribute this back, then essentially you propose it through our code review system, which is called Gerrit. And then it goes through a review process.
The review process is that Nova Core, which is about 15 core contributors to Nova, will go through and review and comment on the structure of the code, whether it's valuable, how it could be changed or cleaned up, et cetera. Then hopefully you come back, make those fixes, and go through another review. Once there are two positive reviews from Nova Core members, we approve it, and it goes through the automated testing and CI infrastructure and into the core code base. So that's the basic process. If it's a larger feature, it's often nice to do what we call a blueprint, which is a spec describing what the feature is going to do. You can get some feedback in advance of writing the code by sending it to the mailing list and saying, hey, here's what we're planning on doing. What do you think? Does anybody else want to work on this? Or maybe have a design summit session about it, et cetera. So that's a very, very brief overview of how it works. Yeah.

Oh, that's a good question. Like, not have any new features ever again? Yesterday? It's already done, no new features? Yeah, that's a good question. It seems to me like there's always going to be a need for new stuff, but it definitely could get to a point where new features are much rarer than they are now, and maybe they come in the form of plug-ins as opposed to changes to the core code. One thing we aren't discussing yet, you'll notice, is a new API, and I think that's something we're going to have to face in the next year or two. Right now we're just getting the testing of all the APIs and the extensions down, but at some point we'll have to have a discussion about when we're going to do a new version of the API, and I think locking things down probably won't happen until after that's done. So we've got some pretty big things still ahead: we've got the default network moving over and Nova Network being deprecated, and we have a new API coming at some point. So really locking down is going to happen after that; I think we're probably a year or two off before things get very locked down.

All right. Anything else? Thanks, Alex. Yay. Thanks. Thanks, everyone, for coming. Enjoy the rest of the sessions. We're probably overlapping into the next one now, since we started late.