Hi, my name is Joe Gordon. We'll talk about Diablo versus Havana: how OpenStack has matured. So, a bit about me. I'm a full-time developer on OpenStack for HP. I actually don't do anything internal to HP, or not too much; I work all upstream. I'm a Nova core, I've been involved since Essex, and I've been trying to work on making OpenStack scale and actually work, so people can actually use it.

So, a brief history; hopefully you know most of this by now. A history of the integrated release: what OpenStack has been from Diablo to Havana. Back in Diablo there was Swift, and Nova did everything else: Nova had volumes, network, and identity. Keystone was there, but it wasn't integrated at the time, and we had Glance. Next came Essex. We added Keystone and Horizon, and we got rid of identity in Nova, because it didn't really belong there. Folsom came along and we added Quantum and Cinder to replace Nova Network and Nova Volume, although Nova Network is actually still there, and Neutron is still working on getting rid of it right now. In Grizzly we didn't do too much more; we just kept working on what we had. And this is what it looks like at this point: nice and big and complex, and it's only getting bigger and more complex. This diagram is Grizzly; unfortunately I don't have one for Havana yet. But Havana added two more services, Ceilometer and Heat. So now we have, I believe, nine projects with over 32 services in this big, complex system, and it's getting bigger.

So looking back, we're trying to compare Diablo versus Havana: what's changed? A bunch of these projects didn't exist in Diablo, so we're going to mostly ignore them for this presentation. Horizon, although it existed, wasn't integrated; Neutron wasn't there; Ceilometer and Heat weren't there. So we get back to that same original list we had before, and that's what we'll focus on for a good Diablo-versus-Havana comparison. OpenStack has grown, as you saw: huge numbers, lots of new services. We've grown out: we have Heat and Ceilometer, we have Trove coming out in Icehouse, and we have Ironic, Marconi, and Savanna in incubation; hopefully those will graduate in a release or two. So we're growing the project out in many ways, doing more things, getting involved in more components. But it's also been growing up: each project is actually getting richer. Part of that has been splitting Nova into smaller projects that can focus on the important pieces. For example, we split out Cinder. When it was in Nova, it wasn't much: it had one backend, it didn't really do too much, it was sort of second-class. Now it's a first-class citizen doing great things; there's a new API and lots of feature-rich aspects to it. We've also been adding extensions, so the APIs have been growing; Nova has been growing its API in huge amounts. We have a core API and we have extensions. This is a list of all the extensions in Diablo; it's not too many. And this is the list of all the extensions we've added since Diablo: a lot more, about 60 or so. So we're really growing out the APIs in big ways. One way to put it is that we're getting bigger, we're growing out, and we're also growing up. That extension model didn't scale well, by the way, but we're going to ignore that.
So once we normalize for that growth in project scope, what else have we been doing over the past four releases? Part of the inspiration for this talk is that OpenStack infrastructure uses a Diablo-based and a Havana-based cloud interchangeably. That's Rackspace and HP Cloud: HP Cloud is on Diablo, I believe, and Rackspace is on trunk, which is essentially Havana. They're used interchangeably, and they both mostly work, right? But since they're interchangeable, infra isn't using any new features; it needs the same feature set from both of them. So what else have we been doing all this time? I wanted to know the answer, so I thought I would give this presentation and find out myself.

It turns out we've done a lot, and it falls into some big categories. There are new drivers: lots and lots and lots of new drivers. You usually don't actually see these drivers; they're all in the back end. Lots and lots of testing and fixing bugs. Performance and scalability, the big one everybody cares about. Also a lot of forward-facing changes that plan for the future and make things easier later, which you don't necessarily see until that future arrives, but which involve many cycles and many man-hours. And we've changed the process a lot. We've grown from a small project to, I think, 1,600 committers since the beginning and 1.7 million lines of code. We've grown a lot, we had to change a lot of the process to adjust for that, and we're still working on it.

So, more drivers. This is a pretty straightforward thing to look at; here are two examples, from Nova and Cinder. Nova gained Bare Metal, Docker, Hyper-V, and PowerVM, which all came in different releases. And this is only a small sample from Cinder; none of these existed in Diablo. Back when volumes were in Nova, I think it had LVM and NFS, and maybe one more, and that was it. Now we have all these other great vendor plugins, so we have a lot more drivers. As a user you don't actually see that; someone using a cloud doesn't know about new drivers, because that's just how the cloud is run in the back end.

So, better testing and fewer bugs. That's the goal: make sure everything works, and don't have bugs for users to see. There are some big categories here: unit testing, integration testing, and fixing all the bugs we keep finding with both, and then some. It turns out that in unit testing we're actually no better than we were before. I got these numbers by comparing the lines of code in unit tests against the lines of actual project code, and the percentage hasn't grown by much: the code has increased in all dimensions, and we've more or less maintained the ratio of test code to working code. In retrospect this makes a lot of sense, because we had good coverage from the beginning, about 80 or 90 percent in Nova, I think somewhere in the mid-80s. So we had pretty good code coverage already, and whenever we add a new feature we make sure there are tests for it, but we don't spend a lot of time adding unit tests just for the sake of unit testing. We try to make sure they come in with each feature and each bug fix, and that's been true pretty much across the board.
So there haven't been too many big changes in unit tests. That actually caught me off guard: I figured since we're growing and doing more testing, we'd do much better on this. It turns out unit tests are great, but they're not enough. So, Tempest. Tempest is the OpenStack integration test suite, and this is the one where we've really grown in leaps and bounds. Back in Diablo it was this little project called openstack-integration-tests. It was written by Soren, I believe; I think he's somewhere at the summit. It had 107 tests, more or less. They didn't do too much, but they covered the real basics, they caught a lot of bugs, and they were great. Still, it was only 107 tests covering mostly Nova: flavors, floating IPs, image metadata, images, key pairs, and volumes, which at the time were in Nova. Very basic things, and only about 2,600 lines of code. The first release of Tempest was actually in Diablo, and we weren't gating on it at that point; it was just another project.

Looking at what we have today, it's grown by an order of magnitude; it's ten times bigger by most definitions: 1,200 tests and 24,000 lines of code. We've really grown it out in big ways, and it's actually still too small: it should probably grow by another order of magnitude over the next year or so, if all goes well. There are 1,100 API tests. The API is important: we like to test it and make sure it doesn't break, because not breaking the APIs is really important for our users, so we have lots of tests for them. That now covers all the main projects: Compute (Nova), Image (Glance), Identity (Keystone), Network (Neutron and Nova Network), Object Storage (Swift), Orchestration (Heat), and Volume (Cinder). So we cover all the big ones, across 102 files. We also have some CLI tests, because we want to make sure our CLIs don't break either. Mainly, you run a CLI command and you want to make sure it doesn't stack-trace on you; we'd had problems with that in the past, which is why we added these tests. And there are some new kinds of tests coming in that aren't listed here. We have scenario testing, where you test a big, complex scenario, a whole story: you spin up an instance, attach a volume, do something with the volume, detach it, delete the instance, that kind of thing. We're also working on performance and stress testing, because we want to make sure the cloud works on all metrics, under stress and otherwise. This all runs in parallel now; that was a big push in Havana, and we did a great job of it. It takes only 20 to 30 minutes to run in the gate, so you push your code up for review and within 20 or 30 minutes you get results back saying whether it passed. That's not much overhead for a developer in exchange for such great testing.

To break the API tests down a bit more: Nova is a huge project, one of the oldest; Nova and Swift were the first two projects in OpenStack. So we have a lot of tests for it, 785 API tests. Keystone is also an old project and very big, and there are a lot of new tests here; there's a v3 API that came out that has full coverage in Tempest, I believe. And we have tests for all the other projects: image, network, object storage, and volume all have tests.
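To give a flavor of the scenario tests mentioned a moment ago, here is a minimal, self-contained sketch of the "boot, attach, detach, delete" story written as a test. This is illustrative only: the FakeCloudClient and its method names are stand-ins I made up, not Tempest's actual client API.

```python
import unittest


class FakeCloudClient:
    """Stand-in for the real Nova/Cinder clients; illustrative only."""

    def __init__(self):
        self._servers, self._volumes, self._attached = {}, {}, set()

    def create_server(self, name):
        self._servers[name] = 'ACTIVE'
        return name

    def create_volume(self, size_gb):
        vol_id = 'vol-%d' % len(self._volumes)
        self._volumes[vol_id] = size_gb
        return vol_id

    def attach_volume(self, server, vol_id):
        assert self._servers[server] == 'ACTIVE'
        self._attached.add((server, vol_id))

    def detach_volume(self, server, vol_id):
        self._attached.remove((server, vol_id))

    def delete_server(self, name):
        # Refuse to delete a server that still has a volume attached.
        assert not any(s == name for s, _ in self._attached)
        del self._servers[name]


class VolumeAttachScenarioTest(unittest.TestCase):
    """One whole 'story' exercised end to end, scenario-test style."""

    def test_boot_attach_detach_delete(self):
        client = FakeCloudClient()
        server = client.create_server('scenario-server')  # boot an instance
        volume = client.create_volume(size_gb=1)          # create a volume
        client.attach_volume(server, volume)              # attach it
        client.detach_volume(server, volume)              # detach it
        client.delete_server(server)                      # clean up
        self.assertNotIn(server, client._servers)


if __name__ == '__main__':
    unittest.main()
```

The point of the style is that the test fails if any single step in the story breaks, which is exactly the class of cross-service bug that per-call API tests miss.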
We're going to start adding more for things like Heat and others as we grow out. API testing is a really big part of making sure everything works, because the APIs are supposed to stay the same: we've added some APIs, but we still support all the APIs that were in Diablo. So the question I had was: could we run trunk Tempest, Havana Tempest, against Diablo? The environment in this case is DevStack, our development environment. It turns out you sort of can. It took me a few days to get it up and running. The basic principle is that we have new APIs plus the old APIs that were there in Diablo, so let's run the new Tempest against the old APIs and see how well they actually work. Diablo went end-of-life at least a year ago at this point, so it isn't supported; there's no branch, just a tag upstream, and it took a bit of tweaking to get it working. In the results we ignore a lot of the tests, because many features, all those extensions and all the new APIs and projects that weren't there in Diablo, are being tested, and we can discard those results. I got 622 of those 1,100 or so tests to run, 154 of those passed, and 110 of them were compute tests. That's actually more tests than we had in all of Tempest back then.

So, some results from that. I thought it was a little surprising that it worked at all; I didn't think it would. But it didn't work as well as I'd hoped. Part of what I found is that Diablo had really poor parameter validation. Pass a bad parameter and it would just return a 500 and wouldn't tell you why: a really bad user experience, and we've done a lot of work on fixing that. A good example is an invalid key name: you'd pass a bad key name, something would fail quietly in the background, and you'd curse to yourself, not knowing what was going on. Now we actually tell you: hey, you can't do that. For example, if your key name is too long, it says so instead of just truncating it and not telling you. Another example is the min and max count for multiple creates. You say, I want to create between this many and that many instances, and it used to accept negative values, which makes no sense. Now we actually tell you that's silly, don't do something that doesn't make sense, and you get a better user experience.

There are some bad things I found too, where we actually changed the APIs and broke things. We're sorry for that, and we're not going to do it again, we promise. Well, we'll try. We changed a lot of the negative cases, where you do something and it's going to fail: it used to be a 400 error saying this is invalid, I can't do it, and now it just returns a 200 and stays quiet about it. I didn't dig into how this changed, but it was one of the big things we noticed; Tempest changed for it along with the projects. That's something we have to be more careful about going forward, making sure the APIs actually stay stable. There was also really poor pagination in Diablo. This really matters for a big cloud: if you have 5,000 instances or 10,000 volumes, you don't want to fetch all the results every time; you want pagination. It didn't actually work in Diablo, and we didn't test it, so no surprise there. We did a much better job in Havana, both testing it and actually having it work.
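As a toy illustration of the validation fixes just described, here is roughly what "reject nonsense up front with a 400, instead of failing quietly or with a 500" looks like, using the min/max count example. The names (ValidationError, validate_multi_create) are hypothetical, not Nova's actual code.

```python
class ValidationError(Exception):
    """Maps to an HTTP 400 at the API layer in this sketch."""


def validate_multi_create(min_count, max_count):
    # Diablo-era behavior: negative values were accepted and something
    # downstream blew up with a 500. Havana-era behavior: tell the user
    # up front what doesn't make sense.
    if min_count < 1:
        raise ValidationError('min_count must be >= 1, got %d' % min_count)
    if max_count < min_count:
        raise ValidationError('max_count (%d) must be >= min_count (%d)'
                              % (max_count, min_count))
    return min_count, max_count


if __name__ == '__main__':
    print(validate_multi_create(1, 3))    # fine: (1, 3)
    try:
        validate_multi_create(-2, 5)      # nonsense: rejected with a 400
    except ValidationError as e:
        print('400 Bad Request: %s' % e)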
Another big one: in Diablo we only tested JSON. We have two formats for the APIs, JSON and XML, and they're supposed to be exactly the same thing, one in JSON and one in XML. As developers we generally like JSON, but we understand the value of XML, so we want to support both. We did not test any of the XML in Diablo. Now we test it all: all the Tempest tests in Havana run in both XML and JSON versions, from the same test code, which is actually really great, and we're trying to make sure the two actually look the same from the user's point of view. There are a lot of false negatives in these results, where the API is fine but a lot of corner cases aren't clearly classified. A good example is metadata on images: we changed how we use that slightly, and that breaks things. So even though the APIs have stayed stable, we've adjusted a few corner cases that make it hard to keep your code, your tests, really portable from Diablo to Havana. There have been some problems there, and that's something we're working on. It's a really hard one for us, and we're slowly trying to fix it: keeping the APIs the same is important to us, and we're going to keep working on it.

In conclusion, Diablo was pretty bad, and I think it's safe to say that now. I think we knew it at the time, but we were always making progress, always better than before. We did a bad job of a lot of things. We did a bad job of testing it: Tempest wasn't actually running in the gate. The APIs were sort of okay, but historically we haven't done a great job of making sure we don't break them. We're much more aware of this now, and we're really putting a lot of effort into testing and making sure things actually stay the same.
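To make the "same tests, both formats" point above concrete, here is a small sketch of the subclassing pattern: write the test logic once, then run it against both the JSON and XML forms of a response. This is a simplified stand-in with a fake endpoint, only loosely modeled on how Tempest's Havana-era tests were structured.

```python
import json
import unittest
import xml.etree.ElementTree as ET


def fake_server_api(fmt):
    """Stand-in for a real endpoint that can answer in either format."""
    if fmt == 'json':
        return json.dumps({'server': {'name': 'demo', 'status': 'ACTIVE'}})
    return '<server name="demo" status="ACTIVE"/>'


class ServerShowTestJSON(unittest.TestCase):
    _interface = 'json'

    def _get_server(self):
        body = fake_server_api(self._interface)
        if self._interface == 'json':
            return json.loads(body)['server']
        return dict(ET.fromstring(body).attrib)

    def test_server_fields(self):
        # The same assertions run regardless of serialization format.
        server = self._get_server()
        self.assertEqual('demo', server['name'])
        self.assertEqual('ACTIVE', server['status'])


class ServerShowTestXML(ServerShowTestJSON):
    _interface = 'xml'  # reuse the identical test body against XML


if __name__ == '__main__':
    unittest.main()
```

One test body, two subclasses, and both formats get exercised on every run, which is how a JSON/XML divergence gets caught before it ships.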
Critical bugs are another big part. We don't like critical bugs; they're bad, right? It turns out we have more of them now. I didn't see this one coming. In Nova the count has actually gone up; we've stabilized, I think, which is good, but overall there's really no trend. There are some big reasons for this. As the project has grown, it's been harder to track all the bugs and classify them consistently. The amount of code has grown too: what's not shown here is that the lines of code and the features have grown in massive ways, so the fact that critical bugs aren't exploding actually means we've been keeping up in quality with the scale, just not doing any better, which is actually pretty good. We're not reducing the number of critical bugs, but we're keeping it roughly stable, maybe slightly higher or lower, while expanding the scope of the project.

Some big categories of these bugs. First, non-gated features: you add a feature, it doesn't work, and you didn't add a Tempest test. When you write code, we run all of Tempest on it before it merges to make sure it doesn't break anything; but if a feature isn't in the integration tests, it may break without us knowing. Those are hard ones to catch, and they can be critical for an important feature: we didn't test it, we didn't know it broke, and so it breaks and we have to fix it. Another one is dependencies. We float our dependencies upstream. This has been a very long and painful battle for everybody, and not everybody agrees with floating dependencies, but the big reason is that as an upstream project we want to let people downstream create packages. We don't want to say you have to use a dependency that's four years old because that's all we tested. So we leave a lot of dependency versions open-ended, and then a new version comes out and breaks everything. That sneaks past the gate, because it's a dependency change and we don't check those; that's actually on purpose, so we can fix them right away. These happen every few months, everybody freaks out, they're fixed within a few hours, and everything's fine. But they're always critical: they block everybody's work, they're fixed very quickly, and then we move on. The gate being down is another one: that's critical, nobody can do any work, no patches merge, and we've got to fix it. Another hard one to catch in the gate is race conditions. If something fails 10% of the time, it's probably going to get in: it passed all the jobs you ran against it, but it fails 10% of the time and you only ran nine tests, so you missed it. We're trying to do a better job of identifying those and not letting them in, but race conditions can be really big and really hard to detect in gating. Performance is another hard one. We run really small OpenStack clouds to test: it's DevStack, all-in-one, not what anybody should be deploying at scale, and I don't think it works at scale. So we really don't do a good job of testing performance; that's another thing we're working on. A big category of critical bugs is performance: somebody running a continuously deployed cloud tries it out and says, hey, this doesn't work. This has happened quite a few times. Somebody from Rackspace or wherever comes along saying, this isn't working, guys, please fix it, and they tell us right away: we put the code in last week, they ran it the week after. We fix it pretty quickly, but it's a critical bug, and that means all hands on deck until it's fixed.

So, performance and scalability. This has been a hard one for me to measure: performance is a little easier, but scalability is hard to test on a few VMs somewhere. There are some big things underneath that we've done to make this better. The Nova scheduler is still not great, but it was much worse in Diablo. It didn't support active-active; you could only run one scheduler. There was a race condition where two schedulers aren't aware of each other: you have two of them running, one schedules something onto a node, the other does the same thing, they don't know about each other, and the compute node ends up oversubscribed, which is bad. We fixed that, or at least worked around it, so you can have as many schedulers as you want, which makes a lot of sense when you want to support scale.
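Here is a toy sketch of that scheduler race and the shape of the workaround: rather than trusting a possibly stale view of free resources, the chosen node atomically claims them, and the scheduler retries elsewhere when a claim fails. The classes and names here are illustrative assumptions, not Nova's actual scheduler code.

```python
import threading


class ComputeNode:
    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb
        self._lock = threading.Lock()

    def try_claim(self, ram_mb):
        # The atomic check-and-subtract is what stops two schedulers from
        # both landing an instance on the last free slot of the same node.
        with self._lock:
            if self.free_ram_mb >= ram_mb:
                self.free_ram_mb -= ram_mb
                return True
            return False


def schedule(nodes, ram_mb, max_retries=3):
    for _ in range(max_retries):
        # Pick the node that *looks* emptiest; this view may be stale.
        best = max(nodes, key=lambda n: n.free_ram_mb)
        if best.try_claim(ram_mb):
            return best.name
        # Claim failed: another scheduler got there first. Try again.
    raise RuntimeError('No valid host found')


if __name__ == '__main__':
    nodes = [ComputeNode('node1', 2048), ComputeNode('node2', 1024)]
    # Multiple schedulers can now run concurrently: a stale pick just
    # fails its claim and gets retried instead of oversubscribing.
    print(schedule(nodes, 2048), schedule(nodes, 1024))  # node1 node2
```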
There are some other smaller things we've done, and a lot more work to go. One big one: we used to use RPC fanout. At one point in Havana it was still there but wasn't doing anything; one piece of the scheduler was using it and nothing else was. So all this extra traffic was going through, the schedulers had to process each message one by one, and Python's not really that fast. We try to reduce unnecessary overhead, so we got rid of it, and that fixed a lot of things. I think there's an example from Bluehost where the schedulers were working so hard just to keep up with the RPC broadcasts that they couldn't do anything else: it was consuming all the CPU and nothing was happening.

A really big one we've worked on over the past few releases is the database. We added indices, unique constraints, and smarter queries: don't fetch everything all the time if you don't need to. And fewer queries: don't keep doing the same thing over and over and over again; pass the data around when necessary. Just don't be stupid about it. A lot of people have put effort into that, and we've come a long way. One example is the service group. By default, OpenStack services write to the database every 10 seconds saying, hey, I'm alive, don't forget about me. We need to know who's alive and who's not, so we need the heartbeat, but writing to the database every 10 seconds isn't great: if you have 10,000 nodes, that's 1,000 database writes every second, which seems like a lot of overhead for such a small thing. So there are a few things we've done. There's an adjustable interval: maybe you don't care about 10-second granularity and five-minute intervals are fine, which reduces the frequency. Or you can use things like memcached or ZooKeeper, because this data is mostly ephemeral anyway: you don't need long-term records of how alive somebody was or how long ago they started listening, so something like ZooKeeper or memcached is a better way to scale this. We don't actually test those backends by default, so they hopefully work, but they may not; that's probably something we need to test going forward. Still, this has been a big help in making things scale a little more easily.
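Here is a minimal sketch of that service-group idea, assuming made-up class names rather than Nova's actual servicegroup API: the report interval is a knob, and the backend is pluggable so the ephemeral "I'm alive" data can go to a memcached-style store with expiring keys instead of the database.

```python
import time


class MemcacheLikeBackend:
    """Stand-in for a memcached-style store with expiring keys."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl):
        self._store[key] = (value, time.time() + ttl)

    def get(self, key):
        value, expires = self._store.get(key, (None, 0))
        return value if time.time() < expires else None


class ServiceGroup:
    def __init__(self, backend, report_interval=10):
        # report_interval is the knob: 10 seconds by default, minutes if
        # you don't need fine-grained liveness and want less write traffic.
        self.backend = backend
        self.report_interval = report_interval

    def heartbeat(self, host):
        # Entries expire after a few missed reports, so dead hosts age
        # out on their own; no long-term records for ephemeral data.
        self.backend.set('alive:%s' % host, time.time(),
                         ttl=self.report_interval * 3)

    def is_up(self, host):
        return self.backend.get('alive:%s' % host) is not None


if __name__ == '__main__':
    group = ServiceGroup(MemcacheLikeBackend(), report_interval=10)
    group.heartbeat('compute-1')
    print(group.is_up('compute-1'), group.is_up('compute-2'))  # True False
```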
Another example is the workflow for moving around images, or any large pieces of data. It turns out that's inherently slow: no matter how fat your pipe is, it's going to take too long, and we're all impatient; nothing's ever fast enough. There's been a lot of work across all the projects on this; Glance, Nova, and Cinder have all worked on moving large data around more efficiently. It turns out we were pretty bad at this in Diablo. Diablo was a very early release, it was rough in many ways, and we didn't think too closely about doing things efficiently. The short answer is that you want to move data around as little as possible: don't copy it to four locations and mount it in four places to do one thing; move it once at most and leave it there. We've gotten a lot better at that, and I think we've seen some real results. Another big one is in Keystone: PKI tokens. Back in Diablo, every time a user had to be validated there was a round trip to Keystone before the user could talk to the service. So talking to Nova or Glance meant a round trip to Keystone each time, with Keystone validating you every time: a whole extra round trip you shouldn't have to make. There's been some great work recently to fix that. Now you have the token, you keep it around long-term, and because it's signed using PKI, the service can validate it without talking to Keystone every time. Keystone used to be this big choke point for the whole system, and now it has essentially removed itself from that role: you get authentication and all those wonderful things without a lot of traffic to Keystone, which has been really great.

So here's a cool thing we have. I work at HP, and we actually have a public cloud that's been through all of this. I'd like to bring up Tom, who's actually worked on it, to talk about his findings. He's run a public cloud on both Diablo and trunk, so he knows all about this. Too much, probably.

Yeah, too much, thanks Joe. So yeah, our current public cloud that's available is running a version of Diablo. We took the decision to essentially fork Diablo back in late 2011, and as you can see from some of the stats on the board, we made 1,477 separate patches to Diablo, some 40,000 lines of code. Joe's already talked about a lot of the things we added in the previous couple of slides. We had to do a lot of work to get volumes working properly for us. The performance around security group rules was very troublesome, so we had to do a lot of work around that. We had to do a lot of work to get Windows instances working, and to get rescue working properly. And then a bunch of stuff around security: we found a bunch of security vulnerabilities around file injection, around denial-of-service attacks, that kind of thing. Quota handling, validation at the API layer. A whole bunch of stuff: 1,400, 1,500 patches. We also spent hundreds of hours analyzing the database and putting indexes in; there were no indexes shipped in the Diablo database, so we put all those in. In Havana, it all just works. We get all of this stuff for free. We've got, what, a thousand lines of patches, which we're in the process of upstreaming, and we'll maybe have five lines of patches permanently applied to our Havana code base. We're now operating our Havana-based, or actually Icehouse-based, cloud in private beta, looking to roll that out soon. So overall, our experience is like night and day between Diablo and Havana. Thank you.

So you can see it was pretty bad at one point. Some other big changes are forward-facing internal changes that you hopefully don't notice much, because we don't really want users to have to care about them. We split out projects; you probably know about that one. It's a big thing for development: it's really hard to have one project, one repo, with a thousand developers on it, so we split them out. It makes things a little easier, and people can focus on what they know. Cinder's a great example: it was a second-class citizen as Nova Volume, and now it's a great project with a lot of extra features that never would have existed in Nova Volume. You have people focusing on the individual components, which makes them a little easier to manage, and we have APIs between them, so we can change the internals of each project without affecting the related OpenStack projects too much. And we have this project called Oslo.
We have something like nine integrated projects right now, and that's a lot of duplicated code; a lot of things do very similar things. So we have the Oslo project, which pulls the common code out into one place so you don't have to cut and paste everything everywhere, which never works. Oslo is still a work in progress: in the places where we haven't started using it yet, we have nine copies of the same thing, all slightly different, in slightly different states of working, and that's really painful to deal with. So Oslo has been a great help for us: getting rid of code, making it all look the same, less work for everybody. It's been really wonderful.
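As one small example of the shared plumbing Oslo provides, here is what configuration handling looks like through oslo.config, assuming the package is installed (this uses the Havana-era `oslo.config` import path; newer releases use `oslo_config`). The specific options shown are just for illustration. Every project gets the same config-file and CLI parsing instead of maintaining its own cut-and-pasted variant.

```python
from oslo.config import cfg

# Declare typed options once; oslo.config handles defaults, config
# files, and command-line overrides uniformly across projects.
service_opts = [
    cfg.IntOpt('report_interval',
               default=10,
               help='Seconds between node heartbeats to the servicegroup.'),
    cfg.StrOpt('servicegroup_driver',
               default='db',
               help='Backend for liveness reporting: db, mc, or zk.'),
]

CONF = cfg.CONF
CONF.register_opts(service_opts)

if __name__ == '__main__':
    # Parse (empty) CLI args the same way every OpenStack service does.
    CONF(args=[], project='demo')
    print('%s %s' % (CONF.report_interval, CONF.servicegroup_driver))
```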
Another thing we're always working on is live rolling upgrades and continuous deployment. We don't want you to have to take your cloud down every time there's a new release. And people want to deploy OpenStack continuously, all the time. Part of this is because upgrades are hard, so the simple answer is: upgrade more often, which sounds like it makes no sense. The idea, though, is that if you upgrade in smaller increments, each one is easier; you spread the burden out over a longer period of time. You don't have one half-million-line patch you're applying to everything; you have lots of small changes you can understand a little better. We want to support both of these, and it's been a long work in progress. Companies are doing parts of this today, but I think everybody would say it's not easy, and we're going to try to make it better. We want to support this upstream, stock, so it just works for you. There's a lot of work going on here, and lots of sessions on it this week; it's been a big theme for the past two or three releases and probably will be for the next few.

Iterating on the APIs is another big one, and this one you actually see. The APIs were never really that great to start with; we don't always get them right. We learn a lot from people using them, and from using them ourselves. So we have a lot of lessons learned and we try to fix them, but we can't just change the APIs, because you're using them and we don't want to break you. So we have to iterate on the APIs and bring out new versions. Nova has a v3 API, Glance had a v2, maybe a v3 now, Cinder got a v2 recently, and Keystone got a v3 recently. We're all iterating on these APIs to make them better: add more features, make them make more sense, simplify things, and generally make the experience much better.

So, process. This has been a really big one for us. It turns out we're the biggest open-source Python project in the world, as far as I know; if anybody knows of a bigger one, I would love to hear about it. Define it by number of users or lines of code, whatever you like; I think we're bigger. That's really scary, it turns out. Python's not really what you'd think of for this: "let's write a large, massive thing to deploy a data center, with 1,000 developers, in Python" sounds like a terrible idea. It doesn't have some really great things you want, like static analysis and compile-time checking, all those really cool things where you can compile the code and find problems up front. We don't really have any of that, so we have to work around it. And we have people deploying trunk all the time, which you would think is a terrible idea, but we want to support it.

So there's actually a lot of work we have to do to make sure that happens, and a lot of big processes have evolved, especially since Diablo. In Diablo we had nothing: no gating, I think under 100 developers, and everything done by hand. We've really evolved the process into this nice automated system that allows us to keep growing, and we're always working on it. Before we get into the process, some big principles we have, our driving tenets. Never break trunk. There are a few reasons for this. People want to deploy off it, so it should always work for anybody pulling trunk; it may not today, but that's the goal. And developers should never be blocked. When you're working on a patch or a new feature, you take trunk, work on your code, and push it back up; but if the code's broken upstream, you can't do anything. So we don't break all these 1,000 developers. If people break trunk, people get pretty pissed off really quickly, and 1,000 angry developers is not what you want, so we try to make that not happen. We try not to let any breaking patches in at all; as I mentioned, we do a pretty good job, but we're always looking to do better. Transparency: we're open source, and we're also an open ecosystem, so we try to be very open about things. Anybody can see everything, and we try to have no back-room conversations. We have a few big venues for conversation, the mailing list, IRC, and the review process, and all of that is out in the open, so it's easy for everybody to see what's going on. Automate everything. It's hard to do something twice the same right way, at least for me, so we automate it; why have a person do something a computer could do? When your project is this big, a good example is reviews: does the unit test suite pass or not? You don't want humans checking that every time, because somebody's going to forget, so have the computer do it for you. That's where the gating comes in: our Gerrit and our infra systems actually run all the tests for you. You push a patch up and all the tests run. It's really great: you write a piece of code, you hope it works, you push it upstream, and you find out whether it does. It makes life really easy, and it means you don't have to have DevStacks lying around all over your machines; everything is done for you. We also try to automate the review process itself as much as possible, which is hard to do. We make sure all the unit tests pass, and we have some basic style guidelines that we enforce automatically, because a human telling you that you missed a space somewhere is not a good use of anyone's time. The biggest limiting factor in OpenStack development right now is the human being, the reviewer and the developer, so we should make things as efficient as possible for both of them.
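On those automated style checks: the real ones live in OpenStack's hacking project as flake8 plugins, but a toy standalone version shows the shape of the idea, a machine flagging the trivial nits so a human reviewer never has to.

```python
import re


def check_no_trailing_whitespace(lines):
    """Yield (line_number, message) for each violation found."""
    for num, line in enumerate(lines, start=1):
        if re.search(r'[ \t]+$', line):
            yield num, 'trailing whitespace (a bot should say this, not a human)'


if __name__ == '__main__':
    source = ['def foo():   ', '    return 1']
    for line_no, message in check_no_trailing_whitespace(source):
        print('line %d: %s' % (line_no, message))
```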
Egalitarian: anybody here can submit a patch if they want, and anybody can review anything; the whole process is out in the open. You have a cool thing you want to do, you push it upstream, and we'll review it. If you want to review somebody else's code and say that's a great idea, or I think there's a bug there, you can. It's an open process, anybody can take part, and we want to keep it that way. We have no benevolent dictator for life; that's by design, and we try to keep it that way. So it's a very open, egalitarian model. And be strict, to reduce the burden on the reviewers. One of the ways to make this scale is to be strict, somewhat arbitrarily sometimes, though usually not. By being strict, you take decisions away from the reviewer: instead of asking, is this way right or is that way right, you say, we're going to pick this one, and then you automate that as much as possible. As I said, the reviewer is the big limiting factor, and one thing there's a lot of talk about at this summit is how to make the process scale even further when we don't have enough reviewers. Part of the answer, obviously, is more people reviewing code, but there are a lot of other things we can do.

So back in Diablo, what did the process look like? There was no automated gate; it was all just sort of wide open. In the month of the release, September 2011, there were 87 developers on it. That's not bad; at that point everybody could still talk to each other, and I think everybody probably knew everybody else on the project by name or by IRC handle. There was no gate, no Git, no Gerrit review process; we were using Bazaar and Launchpad for everything. It was a very different world: people were fixing silly things by hand, nothing was automated, and it didn't really scale well. So we've done a lot of work on making the process a lot more efficient. Where we are today: we have a lot more developers. This is just the last release; over the lifetime of the project it's been, I think, 1,600 developers or so, which is a huge number of people working on this, a little insane in some ways. In the last month alone we had 346 developers. So we've really had to scale the development process up. A big part of that was breaking the projects down and automating the workflow. We have this big integrated gate; before, we had no gate at all. The gate makes sure nothing breaks, integrated across nine separate projects and the 30-odd services in those nine projects. We have one merge pipeline, because something in Nova may break something in Cinder, or vice versa, or Swift, or Glance. For a given patch, you have to make sure that none of the patches in the pipeline ahead of you will break your code; you can't merge things independently. We have one integration test suite that tests everything together, and we test everything in the pipeline, in order, so we never sneak in a breaking change; there's a sketch of the idea below. Every so often something does sneak past the gate and we get ourselves into a wedged state: we can't undo it, because the gate is broken and the fix itself can't get through the broken gate. We have ways around that, but it happens maybe once or twice a release, so almost never, and it's always fixed very quickly.
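Here is the promised sketch of the merge pipeline: a toy model in which each change is tested against trunk plus every change queued ahead of it, so nothing merges based on a stale assumption. The real system (Zuul, in OpenStack's infra) is far more sophisticated; this only illustrates the speculative-ordering idea, and the "test" is a trivial consistency check.

```python
def run_tests(state):
    """Pretend integration test: the combined state must stay consistent."""
    return state.get('api_version') == state.get('client_expects')


def gate(trunk, queue):
    merged = []
    for change in queue:
        # Speculative state: trunk + changes ahead of us + this change.
        speculative = dict(trunk)
        for ahead in merged:
            speculative.update(ahead['diff'])
        speculative.update(change['diff'])
        if run_tests(speculative):
            merged.append(change)   # would merge, in this order
        else:
            print('rejected:', change['name'])
    for change in merged:
        trunk.update(change['diff'])
    return trunk


if __name__ == '__main__':
    trunk = {'api_version': 2, 'client_expects': 2}
    queue = [
        {'name': 'bump-to-v3',
         'diff': {'api_version': 3, 'client_expects': 3}},
        {'name': 'bump-api-only', 'diff': {'api_version': 4}},
    ]
    # 'bump-to-v3' passes. 'bump-api-only' is then tested against trunk
    # PLUS 'bump-to-v3' -- and fails, so it never lands on top of it.
    print(gate(trunk, queue))
```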
Running this kind of thing is a little crazy when you think about it. We have nine projects and 30-odd services running as one large system, in Python, across hundreds or thousands of machines, and debugging it is really hard. If your patch doesn't work, that's pretty easy to figure out. But a race condition, something weird and spooky failing on you in the dark, failing 10% of the time, where you don't know what it is: that's really hard to debug. We've done a lot of work on making this tractable; there's actually a session upstairs about it right now. We classify Tempest tests not just as passing and failing anymore, but by how often they fail. This has been really painful, because the gate's not perfect: anything that fails a fraction of the time can sneak past, and that happens a lot with nine projects, 1,000 developers, and 30 services running across many servers. So we have a problem: if some call fails even 0.5% of the time, that's terrible for us, because we're running hundreds of OpenStack clouds a day, and each run executes the thousand-plus tests we now have in Tempest, about 1,100. So something that fails 0.05% or 0.5% of the time is still really bad; we'll see it several times a day, if not several times a week. We're not yet at the point where we can fix all of those bugs, but we're working on being able to fix all these strange transient failures, and we're getting better at it. Debugging them is really hard: all the logs look a little different right now (we're trying to fix that), you have to trace things across separate projects, the failure may not reproduce every time, and identifying the root cause is really hard. We had a bug recently, in httplib2 I believe, where the result was crosstalk between REST calls, HTTP calls from the command line, and RPC, which is AMQP. It was very hard to debug; I think it took about a week of five or six guys working full-time to get it sorted out. We have very strange bugs these days, and they're really hard to catch, so we're trying to do a much better job there.

Part of that is a project I've worked on called Elastic Recheck. The background is that at one point we put all the logs into Elasticsearch, which is a great tool: we classify them, we can search on them, and we have these great parameters to search on. We keep two weeks of logs around in Elasticsearch. That means if you find a query that fingerprints a bug, you can actually see how often it occurs. It also means that when your job fails in the gate, when something goes wrong in your check, we can say: is this a failure we know about or not? We're actually doing that now; it's running in OpenStack infra today, and it's been really great at helping people know what's going on. This broke, it's not your fault, it's a problem in the system; just letting you know, you don't have to figure it out, we'll figure it out separately. And now we have a list of all these bugs and how often each occurs, so we can identify the critical ones to fix and work our way down these hard bugs.
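Here is a toy version of that Elastic Recheck idea: each known gate bug gets a "fingerprint", in reality an Elasticsearch query against the stored logs, here reduced to a regex, and a failed job's console log is matched against all of them. The bug numbers and patterns below are made up.

```python
import re

# Known gate bugs and their fingerprints (fictional examples).
FINGERPRINTS = {
    'bug/1234567': re.compile(r'Connection to \S+ timed out'),
    'bug/7654321': re.compile(r'libvirtError: .* domain is not running'),
}


def classify_failure(console_log):
    """Return the bug ids whose fingerprints match this failed job's log."""
    return [bug for bug, pattern in FINGERPRINTS.items()
            if pattern.search(console_log)]


if __name__ == '__main__':
    log = 'ERROR nova.compute ... libvirtError: the domain is not running'
    hits = classify_failure(log)
    if hits:
        print('Known transient failure(s): %s -- not your fault.' % hits)
    else:
        print('No known fingerprint matched; this may be a new bug.')
```

Matching also gives frequency counts per bug for free, which is what lets the team rank the transient failures and work down the list from the worst offender.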
And I think this is going to be the big challenge of Icehouse from a QA point of view: how to get rid of these transient bugs. It's a very asynchronous world out there in OpenStack: nine projects, 1.7 million lines of code, in Python no less, and it's really hard to deal with. I think we're actually breaking new ground on this, which is a little scary, very scary for me, because I don't think anybody else is doing this. We're trying all kinds of crazy things; we're talking about Bayesian filters for the logs right now, to identify anomalies in them. You name it, we're trying it. And we need better ideas: this is going to be a really hard problem for us to deal with, and I think it's going to be a big push in Icehouse.

So, how are we going to keep maturing? More testing. We never have enough; there is never enough testing, and testing is never aggressive enough. We're strong believers that if it's not tested, it's broken. That means anything you push in that's not tested is, odds are, probably broken. So we try to stick by that mantra and make sure we do a better job: every API should be tested extensively, and then some. Live rolling upgrades are still a big thing; I think it's still a release or two down the road before they work really well. We believe this is a great feature: all the big clouds have been pestering us about it for years, in fact, and we've been taking it seriously for a long time, and there are a lot of big steps involved. There's actually a session upstairs just now about a small step that makes this a little easier to test. As an example, you're going to be able to upgrade compute nodes slowly over time instead of all at once: in Icehouse, you can upgrade all of your code and your database, and then upgrade the compute nodes gradually. So you don't have to upgrade all 5,000 compute nodes at once, or 1,000, or 100; we can actually let you do that slowly now. And we're working on getting a gate test in for that, to make sure it doesn't break and we can actually say it works. Improving quality while the project grows as a whole has been a really hard one. I've focused a lot on what was there in Diablo, but as the project grows, we want to improve the quality of the whole system even as it grows, and that's been really hard for us. Part of it is that the QA and infra teams aren't that big, and they need to grow; if you have developers out there who want to work on this, we would love to have them. Keeping a project this big running, with all these separate components and the integrated gate and the integrated release, has been hard for us, and I think we're doing a great job of it now, but we're always looking to do better. I think any critical bug after a stable release is too many, and I think we all agree on that, so we're trying to make sure that never happens and to get better and better at making sure this all works. And scalability. I think the biggest cloud today is Bluehost, at 20,000 nodes. That's on Folsom, actually, which nobody thought was possible, at least I didn't: one cloud, one AZ, one deployment, no cells. They made a bunch of changes to get that to work, and we're working with them to learn what they did and how to fix it upstream, along with all the other public clouds, to make things really scalable, beyond what you can imagine. So that's another big challenge we're working on: performance, not doing silly things, making sure all the pieces scale out, all of that. Overall, between Diablo and Havana a lot has changed, as you saw. We have a lot more to go, I think, but I think we've done a great job of making OpenStack something actually deployable. Thank you. Any questions? Thank you.