Good morning everybody. My name is Joe Gordon, I work for HP, and today we're going to talk about very large development: how to run code review for a thousand open source developers.

About me: I'm a full-time developer on OpenStack at HP, which means I work almost entirely upstream. I know almost nothing about the rest of HP, so don't ask me any questions about that. I work mainly on OpenStack, and I'm available on IRC and so on.

This slide is taken straight from Ohloh, and I think some of them are here, so if anybody from Ohloh is in the room: great job, guys. So what is very large development? I'm using Ohloh's numbers. In the past 12 months, OpenStack has had 1,241 developers, which puts it in the top 2% of all projects on Ohloh, and 1,600 developers over its lifetime. So this is a massive project, all open source, all Python, which is not really the language you'd expect a massive open source project to be written in. It's not compiled. It's a nice language, but it's not compiled, and that leads to some problems.

So what is OpenStack? It's a lot of Python. These numbers were taken yesterday: 1.4 million lines of Python, of which just under a million lines is actual code. We have a bunch of other languages in there too, XML, JavaScript, and a few others, but it's mostly Python at the end of the day. As far as I know, that makes it the biggest open source Python project out there; if anybody knows of a bigger one, I'd be happy to update this. When I gave this talk a while back, PyPy was bigger at the time, and now we've passed it. PyPy is a Python implementation written in Python, which is why the logo is the snake eating itself, and we're now bigger than PyPy, at just under a million lines of code. So we're huge. We're in uncharted territory for Python itself: a massive project doing massive things at a massive scale.

So what is OpenStack? A cloud operating system, to use the marketing phrasing. It's a set of APIs for consuming different resources: compute, storage, networking. But we're not really talking about OpenStack itself today; we're talking about how we develop it. It turns out that under that nice simple picture, it's complicated, and under that picture it's even more complicated still. We have a lot of moving parts, tons of services. This diagram is actually very outdated at this point; there are many pieces missing, as Florian is discovering while trying to find them all, so even this is a simplified picture. So we have this big, complex piece of software we're working on.

So who is OpenStack? A lot of big companies work on it: HP among them, AT&T, IBM, Ubuntu, Rackspace, Red Hat, SUSE, and a whole bunch of others. A bunch of smaller companies as well, and some other big ones like Cisco. Those are the contributors, and there are even more users. This is a small subset of the users, and there are many, many more out there now; this is also an old slide. I don't know the full list of users anymore, but it's hundreds and hundreds, and you hear about more of them every day. I think the biggest deployment to date is 16,000 nodes in a single data center, probably closer to 18,000 now, and that's Bluehost. So we're looking at data center scale. That's what we're designing the software for: this big, massive piece of software with all these big moving components, trying to run a data center.
This is actually Rackspace's data center; the caption's cut off. You can see it's pretty empty in this picture, but it's a real OpenStack data center at Rackspace. So we have this big project, and we're deploying it at massive scale. But it turns out it's also a massive development scale: a project of a million lines of code built for the data center, with over 1,000 developers working on it.

These are the latest numbers. In the past 30 days we've had just under 4,000 commits; in the past 12 months, just under 40,000. We've had hundreds of contributors in the past month and 1,200 in the past year. So this is a big project with a lot of contributors, bigger than most other projects out there. The biggest Python project by lines of code, by number of contributors, and by any other metric I can think of; if you can think of a metric by which there's a bigger one, I'd be happy to hear about it. Over its lifetime, which is just over three years now, there have been 90,000 commits from 1,600 contributors, for 1.7 million lines of code overall. So this is huge. We have exponential growth, which is hard to handle when you're trying to run code review and a development process. We have more contributors every day, and it keeps going up. In January 2011 we had 61 contributors and 71,000 lines of code, which is a nice-sized project by any definition, and now we're way past that, with millions of lines of code and thousands of developers. It's been hard to sustain this exponential growth in the development process, and we're still struggling with it, but we've made it this far and we hope to make it much further.

So, Python's a great language. I don't know if you've ever used it; hopefully some of you have. Okay, at least one person has, that's a good sign. It's a fun language, we like it a lot. It's fast to develop in. If you come to Python from something like C or Java, there are no semicolons, no curly braces. It's a nice language to look at, very approachable; it mostly just makes sense when you read it. But there are some cons when you're doing a big, massive project. We don't have type checking in Python. We don't have the compile-time static analysis that's really great for big projects. There are no header files, so if somebody changes a function signature somewhere, we may not catch it until that code path actually runs. We have to work around these problems. It also turns out concurrency isn't that great in Python yet, and this is a very concurrent project, so we have some problems there, but overall we've overcome most of them. And we don't get the other kinds of static analysis for free: with a compiled language you run GCC or make and you get all those errors and warnings up front, and we don't really get that chance, so we work around it and do what we can. These problems aren't a big deal when you have five or ten developers and 50,000 lines of code; you can manage around the lack of compile-time analysis. When you have a million lines of code and no one person knows the whole project or the whole architecture, that kind of static analysis becomes really valuable.
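To make that concrete, here is a tiny, made-up illustration (the function and module names are hypothetical, not OpenStack code) of the kind of breakage a compiler would catch at build time but Python only surfaces when the code path actually runs:

```python
# manager.py (hypothetical): someone adds a required parameter.
def attach_volume(instance_id, volume_id, mountpoint):
    """Attach a volume to an instance (stand-in for real logic)."""
    return "%s attached to %s at %s" % (volume_id, instance_id, mountpoint)


# api.py (hypothetical): an old caller nobody updated.
def attach_handler():
    # No compiler, header file, or link step flags the missing argument;
    # this only blows up when the line actually executes.
    return attach_volume("instance-1", "volume-7")


if __name__ == "__main__":
    try:
        attach_handler()
    except TypeError as exc:
        print("caught only at runtime:", exc)
```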
So, on to the development process. We have a fairly unusual process for developing this. Some people don't like it, and I understand that; it's a bit unusual and it can feel like a burden at first, and we're trying to make it easier. But most importantly, we've made it actually work at the massive scale we operate at.

Just for comparison, this is the life of a GitHub patch today. There are many ways of using GitHub and Git; this is a very common one. You fork the repo by pressing the little button, you write your code, you test it locally, and you push it up to your GitHub repo. You submit a pull request. Hopefully Travis CI, or something like it, runs tests on your patch. The Travis CI piece is fairly new, and that's because this whole workflow is new; this workflow is actually newer than OpenStack, which is one of the big answers to the question of why we don't just use GitHub. OpenStack is about three years old; this workflow is about a year and a half old. The patch is reviewed by some people, you fix any problems the reviewers find, and it gets merged. Then Travis CI runs on trunk, and trunk may fail. Hopefully you didn't make it fail and everything works, but it might, and if you ever look on GitHub, it fails all the time.

Where we started, a bit over three years ago now, was on Launchpad with Bazaar and all of that. If you've ever used Launchpad, it's a great tool; we still use it today, we just don't use it for code review. This is how it looked initially, and it wasn't great, but it was pretty good. That was with roughly 61 developers, maybe even fewer at that point, so it was fairly easy to manage with a very simple system. It worked well, but it wasn't enough for us.

The first step was moving to Gerrit. Gerrit's a great tool; it came out of Google, I believe, from the Android project, and it's a nice code review tool written in Java. This is what an early code review looked like: somebody made a change, somebody else ran the unit tests by hand and said, "you broke the unit tests." That's a really silly review. We don't want to have to do these things manually; it means that if somebody didn't run the unit tests and reviewed the change anyway, trunk broke, and we don't like breaking trunk.

So the first thing we did was add gating tests. Before anything actually merges, Gerrit and Jenkins merge it for you; at this point no human merges anything. There's a set of reviewers who are privileged and have approval power, and approval means the tests are run and then Gerrit and Jenkins actually merge the patch. The first thing we added was basic unit tests: we want to make sure we don't break the unit tests, which is a pretty simple requirement. We also want to make sure the change actually merges; that fails a lot, or rather, nowadays it fails a lot because the velocity of the project is so great that you have to rebase a lot, unfortunately. And we wanted some basic style checks, in this case pep8. This was really great as a first step, but it wasn't enough. It turns out we also want to support Python 2.6; Red Hat still ships 2.6 in RHEL, so we want to support it too. And we want integration tests: this is a big project with a lot of different moving components.
Integration tests are essential for this. Unit tests are great, but we really want integration tests too. So we wrote a small development environment for integration testing and we run against that. We kept all the same tests as before; now the unit tests run twice, on 2.6 and 2.7, and we have somewhat better coverage.

We still had a problem, though, which is that we only ran tests at merge time: on approval we'd try merging the code, run the tests, and merge if everything passed. The reviewer still had to look for routine issues manually. Here's an example: after all that, somebody noticed, "hey, I think you forgot something," a PEP 8 violation. For some reason, in Python we don't like tabs, we use spaces. Fine, we do that, but at this point a human had to catch it.

So the next step was adding check tests. You push your patch up, just like with Travis CI today, and we run all the tests on that patch before it's ever approved. So now the tests run twice. If you have a perfect patch, you push it up, there are no problems, everybody loves it, and it gets reviewed by two core reviewers. The core reviewers are a privileged set of people who've been around and whom the community trusts; to get into the core team, somebody in core nominates you. You push your patch up, it gets reviewed, everything looks good, the unit tests have already run against it, so the reviewer knows you didn't break anything. I don't have to care about your syntax, and I don't have to check whether you broke the existing unit tests; I just have to look at the new unit tests you added and make sure those look good. I know you're not going to break anything we cover in integration testing, so we can approve it, and then the tests run again at the gate. Every so often something will pass the check and fail the gate, because the project moved underneath it; the velocity is so fast now that you have to rebase every so often.

This list of tests is actually fairly old; we've grown it much further today. We now have over ten jobs. We have a docs job; we like to make sure the docs build, so we rebuild them on every change, and if you click the link you can see the proposed docs rendered, which is great: if somebody's working on a docs patch, you can see whether it rendered properly. We have pep8, though that label actually needs correcting, because it's full style checking now: pep8 plus a whole bunch of other things. One of the problems with a big project like this is that we want all the code to look the same. There's nitpicking; I'm sure you've run into this before: somebody likes camel case, somebody likes Hungarian notation, people like different things. You want your code to look the same. It doesn't really matter exactly what it looks like as long as it's consistent; that's the important part. So we have a big set of checks to make the code look as uniform as we can. We run the unit tests. And we're working on Python 3.3 support, though I don't think it's gating yet, and PyPy support as well. In this project we don't gate on those yet, but other projects already gate on Python 3.3 and on PyPy.
We also have a whole bunch of integration environments. It turns out that when people run OpenStack, they don't all run it in the same environment; there are many supported environments, so we try to support at least a few of them upstream and in the gate. We run on MySQL, which is the de facto standard today, and on Postgres, because people like Postgres too and we want to support it. Every so often we hit a case where something works on MySQL and not on Postgres, or vice versa, so it's an important test for us. (Someone asked: can you explain what platforms the devstack-gate checks actually run on? Yes, I'll get to that, let me get through this first.) We have two supported networking pieces, Neutron and Nova Network, so we run both: the devstack-gate "VM full" and "VM postgres" jobs are both on Nova Network, and we have a Neutron job as well. We have a large-ops job, which is brand new; I worked on it a couple of weeks ago. The idea is that we want to make sure this stuff works at scale. Unfortunately we can't test at real scale in the gate, so we use a fake virt driver: we swap out KVM or QEMU for a stub driver that just writes to the database, and we make sure things don't break at that sort of scale. It turns out sometimes they do; it's a big, complicated puzzle. So we want at least the scale testing we can do on small hardware, and we do this for both Neutron and Nova Network. (There's a toy sketch of the fake-driver idea just below.) And we have a Grenade job, which is upgrade testing.

As for the platforms: we have a tool called DevStack. DevStack is a development environment, not for production, but it's what we use for testing. It's a simple environment, great for testing, and it's the same environment every time, so it's our standard, de facto test environment. We have this nice integration environment you can stand everything up in, and we run integration tests on top of it. Grenade is the upgrade test: people have existing deployments and want to upgrade them, and we want to support that, so we deploy an old version and make sure we can upgrade it to the newer one. We're slowly expanding all of these tests; there are several others in the works. We have something called Cells, for scalability, and that test is coming online very soon. We're trying to expand the upgrade testing; upgrades have been a big struggle for us in the past and we're trying to get better at them, and the way you do that is you fix it, you gate on it, and then nobody can break it again.
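Here is the toy sketch of the fake-driver idea mentioned above. This is not Nova's actual virt driver interface; the class and method names are made up. It just shows the trick: replace the hypervisor calls with a stub that records state, so the scheduler, database, and message paths still get exercised at volume without any real VMs.

```python
class FakeVirtDriver(object):
    """Illustrative stub 'hypervisor' that only records what it would do."""

    def __init__(self):
        self.instances = {}

    def spawn(self, instance_id, image_ref, flavor):
        # No KVM/QEMU here: we just note that the instance "exists", which
        # makes thousands of boots per run cheap enough for scale testing.
        self.instances[instance_id] = {
            "image": image_ref,
            "flavor": flavor,
            "state": "ACTIVE",
        }

    def destroy(self, instance_id):
        self.instances.pop(instance_id, None)


if __name__ == "__main__":
    driver = FakeVirtDriver()
    for i in range(10000):
        driver.spawn("instance-%d" % i, "cirros", "m1.tiny")
    print("%d fake instances 'running'" % len(driver.instances))
```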
All of this adds up to the life of a patch today, which is similar to the GitHub model but slightly different. Just like the GitHub model, you clone, write the code locally, test it locally, and submit your patch for review. The code is automatically tested on submission, just like on GitHub. The code is reviewed, you fix whatever problems come up, and the code is approved; approved, not merged. Then the code is automatically retested and merged if it passes. If it doesn't pass, because it fails the merge or fails a test, it comes back to you saying "this failed, take a look at what happened." You look and realize that somebody who was working on the same piece of code got theirs in first, so now you have to rebase yours on top of theirs. You push it up again, try again, it gets approved again, and hopefully this time it merges. The big difference here is that we actually gate on things, so no human has merge power. That keeps us all honest and makes it easier for everybody.

So where does this leave us today? There are some basic principles all of this adds up to. Never break trunk: we have 1,000 developers, and if you break trunk, somebody somewhere in the world is going to want to kill you. We're spread across every time zone I know of, which means if something breaks in the middle of the night, somebody is going to notice. So you want trunk to never break, no matter what. It turns out it does break; this is very hard to do. One of the big ways it breaks is floating dependencies. You want to keep dependencies floating, because distributions don't like it when you say "I need this three-year-old version of a library that we happen to have tested against," so sometimes that breaks things. Nowadays I'd say we break trunk maybe once a month, for about three hours, which is incredible compared to where we were even two years ago, when things would break for a couple of days at a time every other week. So we've come really far on this. Never break trunk as far as possible, though there are always exceptions. The big reason is that developers should never be blocked: we all work on trunk and pull from trunk, we don't have an unstable branch, everything is on trunk, and we want to keep trunk green because, crazily enough, people want to deploy trunk. Rackspace is deploying trunk, HP is trying to deploy trunk, and there are projects I'm working on to make it easier for other people to deploy trunk. That means continuous deployment and continuous upgrades; if you saw Monty Taylor's talk yesterday, that's what this is about. So we want to keep trunk green and deployable, and we never want to break it, because developers are pulling from trunk, and if you have a thousand developers trying to work on something that's broken, you get a lot of angry letters.

Transparency: we try to keep everything out in the open. Code review is very open, and we want to make it as open as possible; this is an open source project, and keeping things open keeps it egalitarian for everybody. Egalitarian meaning equal for everybody: anybody can do anything, anybody in this room could land a patch. There are only three or four steps to getting a patch in. You have to sign the CLA, because there are some legal issues around contributions and we need to make sure that's in place; then you push a patch up, and if it's a good patch, we'll merge it.

Automate everything: it turns out people don't like doing chores, so you automate them and it gets easier. People don't like running unit tests or integration tests, so we do it for them, and that has really helped us increase velocity and move ahead. To submit a patch, for example, you don't actually have to run the integration tests or the unit tests or the style checks yourself; they may fail, and you can look at the logs on the log server linked from Gerrit, see what's failing, and fix it then.
That said, we do recommend running tests locally when possible; you just don't have to.

And be strict. It turns out that in this project we have hundreds and hundreds of developers, but the number of active reviewers is much smaller, so the limiting factor is the reviewer. To help with that, we're as strict as possible and try to reduce the burden on reviewers. We have lots of developers, we love developers, and we want more of them, and we want more reviewers too; but where we can, we push work from the reviewer to the developer, or better, to a machine. Style checking is the obvious example: that's not something a human should ever have to do. Staring at whitespace is tedious and nobody likes it; we all want everything to look perfect, so we have a computer do it for us. Same thing for integration tests: we don't want them to fail, so the computer runs them and makes sure they never break.

We have a bunch of different tools for all this, and there have been other talks here going into more detail on how the tooling works underneath. Broadly, we have workflow tools, testing tools, integration testing tools, and, most importantly, communication tools.

For workflow, we try to keep things as streamlined as possible. We use Gerrit for code review. It's a great tool; if you've never used it and you're looking for an internal code review tool, I highly recommend it. We have a little tool called git-review. The problem with Gerrit is that you have to cut and paste a strange URL to push things up to it, so we have this simple tool: you type "git review," it pushes your change up and gives you the URL of the review on the server. It makes things nice and easy: you work on a patch, the patch looks good, git commit, git review. So instead of git push you do git review, and that's the big change in the Gerrit workflow. If you want to fix your patch, you just amend the commit and do git review again. And we have Jenkins behind the scenes. We're actually getting rid of Jenkins; it turns out Jenkins doesn't scale to what we need. We have three Jenkins servers today and we're looking to get rid of them altogether, but that's a separate talk. For now, Jenkins runs our tests: automated testing from Jenkins, and a streamlined workflow from git-review and Gerrit for code review.

Testing. Testing is really important to us, as you've hopefully figured out by now. For the unit and style checks we use a Python tool called tox. Tox is a great tool that manages virtualenvs for us: we have all these dependencies, and we don't want everybody to install them system-wide every time; I don't want to do that myself. So you run tox, it installs the dependencies locally, runs the unit tests under Python 2.6, runs them again under 2.7, again under 3.3 if you want, and again under PyPy. One tool to run, and you can make sure you support different versions of Python. Then there's testrepository, a great tool written by Robert Collins, who also works on OpenStack now. One of the problems with having so many tests is that they're slow. Unit tests should be parallelizable in theory; in practice they aren't if you don't write them correctly, like we didn't. So the first step was to fix them up and make them parallel, and now we use testrepository to run the unit tests and the integration tests in parallel.
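testrepository does the real work for us; the following is only a toy illustration, plain standard-library code with made-up test names, of why spreading a suite across a handful of workers cuts wall-clock time:

```python
import concurrent.futures
import time


def run_test(name):
    # Stand-in for a real unit test; real tests spend their time in I/O,
    # database fixtures, and CPU, which is why parallelism pays off.
    time.sleep(0.1)
    return (name, "ok")


if __name__ == "__main__":
    tests = ["test_%03d" % i for i in range(40)]
    start = time.time()
    # Four workers, roughly analogous to running the suite with a
    # concurrency of four instead of serially.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_test, tests))
    print("%d tests in %.1fs" % (len(results), time.time() - start))
```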
That dropped our test times way down. Nowadays, running everything, ten or so jobs all in parallel, takes under an hour. That's thousands and thousands of tests running very fast, across VMs running, I think, four threads each, if I'm not mistaken. Switching from single-threaded to parallel testing has really helped us. If you're using Python, it's a drop-in replacement for a nose runner and it works great. It has some other nice features too: future work is that it can run the unit tests across multiple machines, which would be even better; if you have a bunch of cloud servers, you can spread the tests across five machines and go even faster. We're hopefully going to do that in the near future.

And we have a whole bunch of automated code quality checks. We use flake8, pep8, pyflakes, and hacking, and it all runs under one umbrella tool, flake8: you just type "flake8" on your machine and everything runs.

Going into that: style guides. Everybody hates them, everybody loves them, and we feel the same way; we hate them and we love them. Not everybody agrees on what everything should look like, so we try to pick a consensus that most people can live with and we go with it. The main reason, again, is so there's less nitpicking about style in review: "you missed a space here," "there are too many spaces there," "I don't like how you did this." We solve that by enforcing it automatically. It doesn't matter so much what the style guide says; the important part is that everything looks the same. A good example: when you're reading a new piece of code, you want it to look like the rest of the code, because you don't want to be dealing with gratuitous differences. If it all looks the same, it's a little easier to understand.

Here's a short list of some of the checks. From pep8 (PEP 8 being the big Python style standard, which is great and we like a lot, but it's not enough for us, so we have a bunch more on top) we get indentation, whitespace, and line length. We use an 80-character limit because people are old school; I assume many of you are too, with your 80-character terminals, and anything wider makes you upset. Exactly. So we do 80 characters for everything. That can get a bit annoying in Python, because with four-space indentation you can end up with some awkward wrapping, but overall we like it. Whitespace: we just want it to look the same, we don't really care exactly how, so we pick PEP 8. It's nice and easy, it's a standard, and you can't really argue with a standard too much. That's a short list of the pep8 checks. We also use pyflakes, via flake8, which looks for actual bugs in Python and does a great job of it. One example is unused imports: you import something, then you change your code, refactor it, make it cleaner, and you're left with an import you don't need. It doesn't look good, and in some cases it can even slow your code down. This check tells you it's unused so you can get rid of it. So now we can say we have no unused imports anywhere in our code, which is a great thing.
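A trivial example of the kind of thing pyflakes (via flake8) flags for us automatically, so no reviewer ever has to spot it by eye:

```python
import os    # flake8 reports: F401 'os' imported but unused
import json


def load_config(path):
    # After a refactor only json is still needed; the leftover 'os' import
    # above is exactly what the automated check rejects before review.
    with open(path) as f:
        return json.load(f)
```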
Undefined variable names: we've all done this. You write some code, you mess something up, you delete a line, and now you use a variable before assigning it. This check tells you, so it's a great tool for us. This is an example of the static analysis we do have: it's nowhere near as good as a compiled language, but it's not bad.

Then we have some stranger ones; now we're getting into our own checks. We decided we want Apache 2 licensing on everything: every file has an Apache 2 header, for the standard reason many projects do it: if you're looking at a single file, you know it's Apache 2. It turns out people don't copy the Apache 2 header correctly every time; I don't do it correctly every time. So now there's an automated check for it: two checks, H102 and H103, which make sure the Apache 2 header is there and that it's spelled correctly with no typos.

We have a whole bunch of import rules, six or seven now, and the short version is that we want imports to all look the same, and we also want to avoid merge problems. (Sorry, could you repeat that? All of them.) Yes, this is all done automatically, and you can run it on your own machine. It's all under flake8, which has plugins for Emacs and Vim and so on, so when I'm developing, I run these checks right in my editor. That's the nice thing: every check we run in the gate can be run by the developer locally, especially the unit tests and the style checks. There's nothing worse than writing the code, writing the tests, running everything locally, pushing it upstream, and then having it fail on some check you couldn't run; that just makes you furious. Good question.

So, the import rules. A big one is avoiding merge conflicts in import blocks, so we require alphabetical imports, we break them into groups (standard library, third party, and project imports) to keep things cleaner, and we have a bunch of other rules, like one import per line. One import per line and alphabetical ordering both exist so that if two people working on the code import the same thing, you don't get a merge conflict; if the imports are all in alphabetical order, you won't conflict. This has worked really well for us, and we haven't had one of those merge conflicts in a very long time. The nice thing is you don't have to know the alphabet: the computer tells you where things go. When I'm reviewing a change that removes a line, I can't tell at a glance whether something is still in alphabetical order. That's one of the reasons we started this project: manually reviewing ABC ordering is not fun, and you don't want to do it. You want everything to look the same, you want it enforced, but you don't want to do it yourself. Moving all of this enforcement onto computers has freed up a lot of reviewer time.

We have some docstring rules too. Just one example: we don't want docstrings to start with a space. It's sort of silly, but once again, it's about everything looking the same. If you've ever looked at code with a whole bunch of different docstring formats, or any documentation formats, you get a little confused and quietly furious about why things are formatted differently. So we picked one style and we enforce it automatically. A rough sketch of what one of these custom checks looks like is below.
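Our real checks live in the hacking project as flake8 plugins; this is just a rough sketch of the shape of such a check, essentially a function that examines a line and yields an offset plus a message, using the leading-space docstring rule as the example. The check code "H9XX" and the wiring are placeholders, not the real ones.

```python
def check_docstring_leading_space(logical_line):
    """Illustrative hacking-style check: docstrings must not start with a space.

    Real OpenStack checks are registered as flake8 plugins in the 'hacking'
    project; the numbering and registration here are simplified placeholders.
    """
    stripped = logical_line.strip()
    if stripped.startswith(('"""', "'''")) and len(stripped) > 3:
        if stripped[3] == " ":
            yield (0, "H9XX docstring should not start with a space")


if __name__ == "__main__":
    # One offending line flagged, one clean line passes.
    print(list(check_docstring_leading_space('""" Bad docstring."""')))
    print(list(check_docstring_leading_space('"""Good docstring."""')))
```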
So that covers the unit tests and the style checks, which are quite extensive. In Nova, the compute project and the big one in OpenStack, we have several thousand unit tests; I don't remember the exact number, but it's massive and it grows every day. On top of that we have a whole bunch of integration tests, somewhere between one and two thousand, and that's growing by leaps and bounds. We have a great team working on integration tests; there's a whole project dedicated to them, and they do a great job of writing them. The integration tests we had two years ago are nowhere near as good as what we have today.

We have a bunch of tools to support integration testing. As I mentioned, we have this simple environment called DevStack for testing; the first thing you do when you want to start working on OpenStack is set up DevStack in a VM somewhere. It's relatively easy to use, and it's also how we test things, so it's a nice environment to start with. And we use a tool called Zuul for integration testing and the rest of the gate. All of these tests take about half an hour; at one point, before we went parallel, they took about an hour each. We want to merge everything serially, because we gate on it and we don't want the case where the patch ahead of you breaks something and your patch merges anyway because the tests ran separately; so we have to test everything in queue order. The problem is that if you do that naively with an hour-long test run, you can only merge 24 patches a day, and we run all the test jobs together because it's integration testing: a compute change can affect identity, or volumes, or storage, so all the projects are tested together. During a busy period we routinely merge well more than 24 patches a day. So we have a tool that optimistically pipelines things; that's Zuul, and it does many other great things I'm not going to go into, but it lets us merge as many patches as we need.

I think in the last release cycle, integration testing cost about $100,000 of cloud time alone. Our servers are donated by HP; that's not a plug for HP, we're just grateful they did that. This all runs in the HP public cloud and the Rackspace public cloud, and they both donate roughly the same amount of hardware. When you tell people that number, some are impressed, and most say "that's not so much, you could do better," and we're trying to spend more: we keep adding tests, and with all this pipelining we run hundreds and hundreds of test jobs a day and plan to run hundreds more.
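Zuul deserves its own talk, but the core trick, optimistic pipelining, can be shown as a toy model (not Zuul's actual code): every change in the gate queue is tested as though everything ahead of it had already merged, so the test runs overlap instead of serializing.

```python
def speculative_states(queue):
    """For each change, return what it is (speculatively) tested on top of."""
    return [(change, ["trunk"] + queue[:i]) for i, change in enumerate(queue)]


if __name__ == "__main__":
    queue = ["change-A", "change-B", "change-C"]
    for change, base in speculative_states(queue):
        print("test %s on top of: %s" % (change, " + ".join(base)))
    # If change-A fails its run, it is ejected and B and C are re-tested
    # without it; in the common case where everything passes, the three
    # hour-long runs happened in parallel rather than one after another.
```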
With integration testing, when we started out it was all fairly black and white: things either worked or they didn't. Now we have transient failures. It's a big, complicated, moving puzzle, and you get a transient failure: something underneath fails maybe 2% of the time, a race condition or something else, and we need a way to deal with that. So we have a tool called Recheck. If Gerrit and Jenkins say "hey, your patch failed" and you don't think it was your change, you look at the failure, and if it matches a known bug you comment "recheck" with the bug number, and the tests are rerun for you. The same is true for merging: if your gate run fails and the failure isn't related to your patch, you look it up, find the bug, and recheck against that bug. It turns out this wasn't enough, because humans are really bad at diagnosing things when it's not their problem, so we recently wrote another tool called elastic-recheck.

(Someone asked: as far as the change rate is concerned, out of curiosity, have you compared these numbers to what the Linux kernel is doing?) I have; they're bigger. The Linux kernel is much bigger than us, and that's a great question. They're also, how many years old now? Twenty, thank you. We're three years old, so we're still working on it. A big difference is that we have an egalitarian model, which means we keep things as flat as possible. We don't have lieutenants or anything like that; all you need is two core reviewers to review your patch, and core reviewers are, for the most part, interchangeable: me reviewing your patch counts the same as any other two cores reviewing it. So we have a very flat model, and that's one thing that makes us unusual. That, and we also aim for continuous deployment; we want trunk always green. Those two things make us a pretty unusual project: you don't have the case where a lieutenant owns next-generation networking for some specific subsystem. And that's being in the top 2% of all open source projects, not the top 1%, I guess.

So, the problem is the failures. Humans are really bad at this. Besides "recheck bug X" we actually have another command, "recheck no bug." That's for when something fails and you know it's not a real bug: a glitch in the system, infrastructure broke for a minute, whatever it was. Maybe the run was against an old patch set and you think something underneath has changed and you just want to confirm it still works against trunk; recheck no bug is what you want. People tend to run that when they're not supposed to, and we wanted to fix that without making people spend hours debugging.

It turns out the infrastructure team, some of whom are here this week, send all our logs to Logstash, Kibana, and Elasticsearch, that whole suite. Every Tempest run (Tempest is our integration test suite) generates tens of megabytes of logs, 10 or 20 megabytes, and that's growing all the time; we run everything in debug mode, because the logs are for debugging. So we have lots and lots of logs. You do not want to read those by hand; I've tried, it's slow, and on a slow connection it takes forever. So everything goes into Logstash, and I can use Logstash and Elasticsearch to run queries. Our integration tests do fail, and we measure that as a failure percentage: right now I think it's around 1% on average. At one point we had a feature freeze, just before Havana, which is our latest release.
Things got terrible. The failure rate was around 10%, and nothing could get merged: if you have 20 things in the queue, something is going to fail, and you're rechecking all the time; it was a nightmare. Elastic-recheck is one of the tools that helped us get it back down to around 1%, where it is today. I don't know the exact numbers, but it's very low and we're always working to make it lower. You want to keep it as low as possible; we're never going to get to zero, because this is a big, complicated system with tens of individual services running and talking to each other, and it's hard to debug by hand.

So what we do is classify each bug once. We go through the logs once, and we find something that identifies that bug. For example, we had one of the services returning a 503 error, a service that sits underneath a lot of other things, and that was making all kinds of unrelated things fail. So we look through the logs for that signature, the line saying it returned a 503, and we write an Elasticsearch query for it. From then on, every time a test run fails, that query runs against the logs from that run. So you only have to manually classify a failure once; once you've identified it and written the Elasticsearch query for it, every future occurrence is classified automatically. And you get the nice results you can see at the bottom: a list of how often we've seen this bug in the past two weeks. We only keep two weeks of logs in Elasticsearch, because we already have several terabytes of them and keeping logs forever isn't a priority right now. You can see this particular one has been a pretty common bug; it's out of context here, but it looks like it happens a fair amount, and it turns out it does. It's one of the bugs we need to dig into, find the root cause, and fix.

This has really helped us see how frequent each bug is. It helped us realize that we didn't have one or two bugs causing transient failures; we had a whole bunch of them, all separate, and we have to debug them all. We had one recently, a bug in httplib2 I think, that was causing failures in all kinds of nasty ways, and without something like this we'd never have known that it was the one bug we were looking for or how frequent it was; we'd have assumed it was ten different bugs. Because you look at the results and you see "test X failed," and the natural conclusion is that test X has a problem, and it turns out that's rarely the case anymore. Usually the failure is much deeper in the stack, something that can cause any test to fail at any time. So this has really helped us be lazy, the way we want to be, and automate things. A rough sketch of what one of these queries looks like follows.
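The real queries live in the elastic-recheck project; this is just a rough sketch of the idea, a query against the indexed logs that fingerprints one known bug, here the 503 example. The host, index, field names, and message text are all made up for illustration.

```python
import json

import requests

# Fingerprint for one known bug: a specific log line plus a failed build.
query = {
    "query": {
        "bool": {
            "must": [
                {"match_phrase": {"message": "returned with HTTP 503"}},
                {"match": {"build_status": "FAILURE"}},
            ]
        }
    }
}

resp = requests.post(
    "http://logstash.example.org:9200/logstash-*/_search",  # hypothetical host
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
print("hits matching this bug's signature:", resp.json()["hits"]["total"])
```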
(Someone asked: so you don't use Sonar or those types of tools?) I'm not familiar with Sonar, actually. (It's a way to view test results, static analysis results, those types of things; it's used in a few open source foundations, I was just curious. I guess it's all custom-made here.) Mostly, yes. We use Jenkins to run everything, and there's a coverage tool in Python called, simply enough, Coverage; we use that, and we have about 80 to 90% unit test coverage today, which is pretty good. On a big project you generally want something like 110% coverage, so we're never really happy with it, but it's not bad and we're pretty happy with how far we've come.

(And you said earlier that you're moving away from Jenkins. Where are you going?) Homebrew. As I mentioned at the beginning, we're at a scale where most things haven't been done before. It turns out Jenkins wasn't designed for gating, and it doesn't handle the load we have; we run several hundred Jenkins slaves at any given time, and there are a bunch of complexities in the system that mean Jenkins just isn't the right tool for us anymore, so we're moving away from it. One thing is pipelining: Jenkins doesn't support it, so we have Zuul for that. And we don't use Jenkins for logs anymore. I don't know if you've ever used Jenkins for looking at logs; the console output is pretty good, but we collect 20, 30, 40 log files per run now, and we put them all on an Apache server somewhere and into Logstash. So over time we've grown our own system in just about every way, and I think we're at step 19 of 20 in removing Jenkins. Jenkins is a great tool, and for the record I'm not saying don't use it; it just turns out that at this scale, with hundreds of test runs a day and thousands of patches and developers, it doesn't scale well for us.

Elastic-recheck has been a really great tool for us, and I'm proud of this one in particular because I helped work on it. It has really helped us see what's going on and made things easier for the reviewer. Before, you'd see that something failed and you wouldn't want to spend the time going through the 20 or 30 log files to find out exactly what it was; you want the computer to tell you, and now it can. The other thing it does is announce new failures: like everything else, we live on IRC, and we have an IRC bot that reports every failure we've classified and every unclassified failure. So as a developer you can say, "oh, I think I saw a new bug today, let's take a look at what it was." Sometimes it's a genuine failure caught by the tests, which is great; we like to see those, because that means the system is working. Sometimes it's a new bug we haven't analyzed before, so we analyze it, add a query to the query list, and from then on we can track it.

Communication is important for an open source project; we're in every time zone I can think of. We use Launchpad for bug tracking, and it's a great tool for us. This is actually another area where we're going homebrew: we like Launchpad for bug tracking, we think it's really good, but Thierry, our fearless release manager, has some problems with it. The way we use it, or the way it's designed, doesn't completely suit our needs, and at the scale we're at we can actually afford to think about building our own. But we've been really happy with Launchpad so far. We use Etherpad for sharing things; hopefully everybody here has used Etherpad before. It's great.
We use a pastebin for passing things back and forth: you have a failure, you want to share it with somebody, a pastebin is a great tool for that. And IRC is where we all hang out at any given time of day: we're all on Freenode in #openstack-dev, and there are a few other sub-rooms, with a wiki listing all of them. This matters because there are people in Europe, Australia, China, all over, East Coast, West Coast. For example, my boss is in New York, sometimes at least; right now I think he's in Chicago, or flying there. He was here yesterday. The tech lead on my team is in New Zealand. So we're all over the world, and it's hard to keep track of things any other way; IRC is a great tool for that. We have a mailing list too. The mailing list can be a little slow for a quick chat about something, so IRC is for that, and the mailing list is for the big conversations. And we do the code review on Gerrit, which is another big communication tool. All of this is about keeping everything visible and in the open, and we keep things in the open whenever possible: we log all our IRC conversations, and we hold all our meetings there. There are two meeting channels now, and there are some big meetings today, actually; I think I'm going to skip them because I won't be available. Having all our meetings on IRC means anybody can join in, anybody can lurk and watch, and we have logs of all of them, which makes it really easy to see what's going on in the ecosystem.

(Someone asked what happens if those hosted services go down.) Cry. It hasn't happened yet; so far so good. That's correct: Launchpad is run by Canonical and IRC is Freenode. We've had a few split-brain cases on IRC, and generally you figure out the split brain, wait a few minutes, and it heals itself. Launchpad, I don't think we've ever had a failure with; I'm probably wrong about that, but it's been really good. We actually don't use GitHub anymore for our infrastructure, because we don't want to be continuously pulling from GitHub, and we've had problems in the past with GitHub being attacked or whatnot. So we have our own copies of most services. We have our own PyPI mirror: we have a big problem with dependencies, because with 16 different repos and 20 or 30 services running, keeping all the dependencies in line is very hard, so the mirror helps with that. We run our own version of almost every other service, except the bug tracker and IRC, because those have worked for us. GitHub will rate-limit you if you pull too many times; I don't know if you've ever hit that. It's a great tool, I love GitHub, but it rate-limits you, so it isn't great for this, and if GitHub goes down we don't want everything to stop. So we have our own Git servers as well. And we're always looking for more tools to make things easier for ourselves. Humans are the most critical resource in the project, whether developers or reviewers, and we don't want to spend them where we don't have to, so we automate everything we can. We do this for the infrastructure, for the project, and in reviewing. If anybody has ideas for great tools out there that you think we're missing, I'd love to hear them, and maybe we can use them.

And lastly, there's one other way we make things easier for ourselves: the other way to make development scale is to split things up. When OpenStack started, it had two projects, Nova and Swift.
Nova was compute and Swift was object storage, so think Amazon EC2 and Amazon S3. Nowadays, Swift is still Swift, still storage, but it turns out that creating compute resources is complicated: there are a lot of moving parts. You have volumes, you have networking, you have identity management, you have image management, so what used to be two projects is now six. Nova is still compute, Swift is still Swift, Glance is images, Keystone is identity, Cinder is volumes, and Neutron is networking. So the other approach is breaking things down: we have dedicated teams working on each one, with contractual REST APIs between them. Those APIs are really valuable for us, but it also means we change them very slowly, because they're concrete. We're currently working on the V3 API for Nova; we're over a year into it, it's a fairly minor change, and we probably have another six months to go, because we're that slow at making these changes. We need to support backwards compatibility and make sure nothing breaks.

And thank you. Any questions?

(Question about how quickly patches get reviewed.) Excellent question. We're working on that. It's actually one of the problems we're trying to track better today, because when you have hundreds of developers submitting patches every day... I think in Nova right now there are about 200 to 300 outstanding patches. The Nova core team is about 19 or 20 people, so that's 19 or 20 people who can actually approve patches for merging. The number of people actively reviewing a large number of patches a day, say more than one a day, is fairly small, maybe 30 or 40. So we're still limited by reviewers, and unfortunately we can't yet say "every patch will be reviewed within a day," because we don't have enough reviewers for that. We're trying to facilitate the process better, and we're looking for more reviewers all the time. It turns out that if you're a full-time reviewer you go a bit stir crazy; you go berserk looking at code all day and never writing any, so you really can't tell reviewers like myself "no coding for you, keep reviewing." That's been a big problem for us: there are patches that have been out for several days, several weeks, unfortunately even several months. As the project grows, this is one of the big challenges we're trying to deal with.

(Question about how the IRC rooms are organized.) Yes, we have several rooms. There's the main development room, and we actually have a non-development room too: if somebody asks "hey, how do I set up this config file," we generally say that's not a development question, please ask it over there, so we can stay focused on development. There's the general development room, and some teams work in there, but there are separate rooms for each of these projects except Keystone, which I think uses the main room. So Nova has its own room and its own sub-team, Swift has its own team and room, Glance has its own, Cinder has its own, and so does Neutron.

(Question, roughly: with one workflow covering everything from initial coding through review to landing, is it the case that everyone just picks some part of the project and works on whatever he or she wants?) Yeah, that's about it.
So the answer to that question is yes; there's no formal way to coordinate. If you're working on, say, the database layer in Cinder or in Neutron, you might talk in the Neutron room, you might have your own room if you want somewhere quieter because it's too loud in there, you might use the mailing list, you might use the code reviews themselves. You can use whatever tool you want. As a project overall, we do time-based releases, just like a whole bunch of other projects, because there's a PTL, a project team lead, for each project, and they can't tell anybody to do anything. They can ask, but they can't tell, unless that person works for the same company. So we have the problem that we can't actually say "in the next release we will have X, Y, and Z," because if nobody does the work, nobody does the work, and you can't make them. It's like herding cats. It's a big open source project across a lot of different companies. I rarely work with somebody from my own company these days; we have a fairly big team at HP, so I do work with them, but mostly I work with people from all sorts of companies, and none of them can tell me what to do. They ask, we talk, and we work it out. There's no "do this now."

(Question: what happens when you need to override a warning or a reported error, a false positive, or a corner case you need to accept anyway?) Yes, we handle that in two ways. One is the transient failures, where you just recheck, and hopefully somebody eventually comes along and actually fixes the underlying bug. The other is when it's a real limitation. A good example is the integration tests, which are generally black box: the test authors take the APIs and write tests against them, without really caring what things look like underneath. Every so often they find bugs in the actual code, and they'll say, "we have this test we want to merge, and it won't pass because of this bug in the project." So you mark it: skip this test because of that bug. And when the bug is fixed, we have a bot that notices and says, "hey, delete this skip, it's fixed." (There's a rough sketch of what that looks like in a test just after this answer.)

(Question: can you explain what you're doing to extend these processes to projects that aren't formally part of OpenStack, third-party or community contributions, things on StackForge, and how they can use these tools?) That's a great question, and you gave part of the answer in the question. We have this thing called StackForge. There are many projects working on becoming OpenStack projects or otherwise related to it, for example Chef cookbooks for deploying OpenStack. It's all targeted at the same community, so you want the same tools. We have several domains: openstack, openstack-dev, openstack-infra, and lastly stackforge. StackForge is the one that's open to anybody; anybody can put a project there. So far it's been almost all Python; I think we have one other language in there today. But it's a place where you can use all the tools we have: you use them for your project, our infrastructure team will support you, and the idea is to make your project look like OpenStack if it's meant for the OpenStack community.
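In practice the "skip because of a known bug" pattern mentioned above looks roughly like this; the bug number and test are invented for illustration. The point is that the skip carries the bug reference, so once the bug is closed the skip is easy to find and remove (or for a bot to nag about).

```python
import unittest


class VolumeAttachTest(unittest.TestCase):

    @unittest.skip("Skipped until bug 1234567 is fixed")
    def test_attach_volume_while_rebooting(self):
        # Would exercise the code path that the known bug currently breaks.
        self.fail("known-broken path")


if __name__ == "__main__":
    unittest.main()
```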
(Follow-up question: what can we do to improve those? One thing I happened to run into is that I think DevStack has just been broken for two weeks, but only on one particular platform.) The answer is that we need to do a better job. DevStack is actually an official OpenStack project now; that's a whole separate can of worms. More gating: we need more gating. We have those 10 or 12 jobs, and that's not nearly enough; ideally we should have infinitely many. I'm blindly assuming we have infinite cloud resources. I know that's not true whatsoever, but it's not my concern, so to speak: we have big companies behind us, and if we need more resources, we can find them. So we're always looking for more tests, and yes, something like running DevStack on that platform can get hairy.

On dependencies: to get a new dependency in, we now have a separate requirements repo. This is new as of this summer, for exactly the reason you describe. You're working on a project and you want to add a dependency on some great new library, or a new version of one; you go to the separate requirements repo and make the change there first. That repo is what we run our integration tests against: during testing we check out the versions pinned in the requirements repo and make sure they all work together, and then you can make the change in your own project. The idea is that we're getting better at making sure all the projects agree on the same requirement versions. We also have a bot now that automatically proposes patches: if a dependency changes in the upstream requirements repo, the bot pushes the update out to all the projects. So we have a bot pushing patches, a bot reviewing patches, a bot merging patches, and a bot detecting failures in the tests. The only thing we haven't automated is the rest of the commits and the reviews, and we're working on that too.

(Question about external code that might not follow these requirements.) We don't care, is the short answer. We just pull down third-party code: import whatever, pip install in our case, could be anything. We don't impose our requirements process on third-party libraries; we're consuming them as a third party ourselves, so we just import them and that's it. We've had problems in the past with unsupported libraries that are slowly losing maintenance, and that has hurt us. But overall, if there's a bug upstream in a third-party library, we push patches: we've pushed fixes into eventlet, pip, tox, and a whole bunch of other projects I can't think of right now. We're just another consumer in the Python ecosystem.

Great, thank you.