I'm Joe Gordon. I'm a developer at HP working on OpenStack. Can I have a show of hands: who here knows anything about OpenStack? Okay. So this isn't about OpenStack, luckily; it's only partially related to OpenStack. OpenStack is a cloud platform - a cloud operating system, I think, is the marketing term we're using now. APIs to provision resources, fairly straightforward: Amazon Web Services, GCE, that kind of thing. In a nutshell, OpenStack is a really big project. It's Python, lots and lots of Python - one of the latest numbers is 2.3 million lines of Python in one project. I think that's the biggest open source Python project in the world; I don't know of anything bigger. We also have a big number of developers. In the last year we've had 2,000 developers and 64,000 commits, with almost 500 developers active in the past month, and these are pretty average numbers. As you can see here, we actually had a drop in contributors recently; that's been due to a big release - I think a few people are burnt out, and there are vacations and New Year's going on. But overall it's been growing and growing. This is a massive Python project, and Python is not really the language you use for big projects, so we have a few problems with that.

We have some basic design principles we try to follow. Never break trunk: when you have 1,000 or 2,000 developers, you don't want to break the code for all 2,000 other people. You get a lot of people angry at you on IRC for hours and hours, and that's no fun; nobody likes that. We also have people deploying this continuously - Rackspace's cloud and HP's cloud are deployed continuously, and I think Rackspace is about a month behind trunk right now, give or take. So you put a patch in and they're running it in production a month later, which means we have to make sure this never breaks. Transparency: we try to do everything in the open; we're big believers in open source and all that. Automate everything: at this scale you can't do things by hand, so we try to automate as much as possible and reduce the burden on people. Egalitarian: this is an open project, and we want to be strict while reducing the burden on reviewers, so we automate as much as possible here.

So what happens when you put a patch up? We run a lot of tests on it. We have a style guide check - PEP 8 for Python. We run unit tests; we support Python 2.7, 3.3 and 3.4 now, and 2.6 for some older versions. We have an integration environment called DevStack, which is our development and testing environment, and we run Tempest, our integration test suite, on top of that. We run that a few times across the different configurations we support: in our testing we cover MySQL and Postgres, and a few networking configurations - we have two networking drivers right now and we test both of them. So there are a lot of different configurations we're testing all the time. And furthermore we have upgrade testing, called Grenade. Users don't like upgrades failing on them, so we test upgrades all the time to make sure we never break those either; when you're doing continuous upgrades and continuous deployment, that becomes really important really quickly. Inside this integration environment we're running hundreds and hundreds of secondary VMs - all these jobs run inside VMs, and those VMs spin up VMs inside themselves: QEMU, small empty VMs for testing.
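To give a rough sense of how one patch fans out into many jobs, here is a toy sketch of a job matrix along the lines just described. The job names and the exact combinations are invented for illustration; the real gate's job list is larger and changes over time.

```python
from itertools import product

# Hypothetical simplification of the test matrix described above.
databases = ["mysql", "postgresql"]
network_drivers = ["nova-network", "neutron"]

static_jobs = ["pep8", "py27", "py33", "py34", "grenade-upgrade"]
tempest_jobs = [f"tempest-{db}-{net}" for db, net in product(databases, network_drivers)]

all_jobs = static_jobs + tempest_jobs
print(len(all_jobs), "jobs triggered per patch:")
for job in all_jobs:
    print(" -", job)
```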
So you put a patch up. This is a real example: you fix a spelling mistake in a comment, it goes up for code review, our Jenkins system runs some tests on it and reports back to you - and it fails. But that doesn't make any sense, because this isn't an actual code failure. Nothing about this change could have caused any failure, ever; it's just a comment, somebody made a spelling error, and no tests cover it. So what happened here? We see this a lot, and it's really frustrating for developers: you make a change and it fails, not because of you.

So what happens when you're actually running these tests? We have anywhere from 5 to 10 DevStacks running, depending on the project - that's our integration environment - and anywhere from 10 to 1,000 integration tests running inside those integration environments. I think we're at about 1,000 integration tests per Tempest job, and we have a few of those running on some projects. That's a lot of data: we have a huge stream of data coming through, and a lot of patches coming through. Our high-water mark is 561 patches merged in one week - roughly 80 patches a day - with thousands of reviewers and developers working on this. So this is a really big problem: we have a lot of these failures happening all the time. You put a patch up, something fails, and it's not your fault. And furthermore, we have a lot of patches going through: every 40 days or so we go through 10,000 new patches, and that's not including revisions of old patches, so there are a lot more coming in on top of that. Our test node cluster is now, I think, 800 machines, and we're using most of it most of the time - 800 machines or so running all our tests for us, pretty continuously, at high capacity.

So let's do some basic probability with large numbers. You have the chance of an event, the number of events per run - say booting an instance, which you may do a lot of times in a test run - and the number of runs. Here's a real example we had. GitHub is down 0.05% of the time. That's actually a great uptime, really quite good. But you run 20 clones per run - that number is actually much higher now; I'm not sure exactly what we're at, but it's quite a bit higher - and 1,000 runs or more per week. That means we're seeing GitHub failures 15 times a week: 15 people pushing patches up and failing because of GitHub. And as a result we don't clone from GitHub; we have our own massive Git server farm that does all of this for us.
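As a rough sketch of that arithmetic - the inputs below are the approximate figures from the talk, so the output is an order-of-magnitude estimate rather than the exact 15-a-week number:

```python
# Back-of-the-envelope version of the GitHub example.
p_down = 0.0005        # GitHub unavailable 0.05% of the time
clones_per_run = 20    # git clones per test run (the talk notes it is higher now)
runs_per_week = 1000   # test runs per week, "or more"

# Chance that at least one clone in a run hits the outage window.
p_run_fails = 1 - (1 - p_down) ** clones_per_run
print(f"per-run failure chance: {p_run_fails:.3%}")                   # ~1%
print(f"failed runs per week:   {p_run_fails * runs_per_week:.0f}")   # ~10
```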
Where do these failures come from? Some of them are just the internet breaking: transient timeouts, all kinds of things. But OpenStack is also a big, complex piece of code - thousands of developers, 2 million lines of code, and I think 30 or 40 binaries running if you're running all the services. This is actually a very outdated image, I think two years old at this point, and it's twice as big now if not bigger. It's this big, complex thing with a lot of services and a lot of developers working on it, and there's a lot of hidden complexity in it.

So how did we manage these failures before? You push a patch up and something fails - a race condition in the code, say: one service talks to another service but doesn't handle the race properly. You get a failure back, you look into it, and you realize it's not your problem. You run recheck - we have a recheck command to re-trigger the test jobs if you realize the failure wasn't caused by your change - and that's okay. But then how do you actually manage these failures, fix them, and see what's going on? You ask around: hey, have you seen this? You point to a URL with all the logs - we collect all our logs for record keeping and analysis - and you look at it. It turns out humans are not good at looking at this stuff: our brains are not big data solutions. You can remember a little bit at a time; you can't remember 100 bugs at a time.

And then a few years ago we turned on parallel testing. Before, we ran a single test runner for everything; now I think we run two or four tests concurrently at any given time. That broke everything even more: we started hitting a lot of race conditions, and everything started failing. Now we had tons of failures and couldn't figure out what was going on. You can't remember all the failures, and having 1,000 developers each remember which stack trace goes with which bug doesn't work. People get really grumpy when they waste time looking at these bugs, and you don't even know which bug it is until you've gone through a gig of logs and sorted it all out. It was really unpleasant.

So we built a tool called Elastic Recheck to try to address this problem. The key question we're trying to answer is: have you seen this recently? We used a basic ELK stack - Elasticsearch, Logstash and Kibana. We have a nice Kibana interface at logstash.openstack.org where you can go and type a query in. We collect all the logs there, so you can paste in a log line, or a new stack trace you've started seeing, and see when it started happening. It's been really great for visualizing these failures and getting a better grip on these logs. And we removed the "hey, have you seen this?" step, where a single person or a small group of people knows what all the failures are and everybody has to go ask them what happened. Instead we report back to the user: hey, we think you saw this bug; go ahead and validate that it's the bug you actually hit. That way they can see, oh, this failed not because of me, it's another bug in the system that we know about, and they can run recheck. Now they don't have to go through several gigs of logs and spend hours sorting out which service failed. Is it a race condition? Is there a stack trace? If there's no stack trace, it could be something else - there are some really simple bugs in there sometimes - and this makes it much easier to figure out what's going on.

The "recently" part is really important for us, because bugs appear and disappear. Here's an example - actually a really bad example of a bug: it's a bug about logging into a VM, which could mean 100 or 200 different possible things going wrong. But you can see it over 10 days, with some trends in it. The trend here isn't actually that important - I believe that dip is the weekend. But sometimes you'll see a bug that only started occurring in the last 10 days. Something happened - maybe it was something in an underlying system, sometimes it's a new patch that went in - and you can see the bug started five days ago. Then you can go back in the logs and figure out, oh, it must have been one of these three patches, and look at each patch to figure out what happened.
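If you wanted to do that "when did this start happening?" lookup programmatically rather than through the Kibana UI, it might look something like the sketch below: a minimal example against a plain Elasticsearch search endpoint. The URL, index pattern, field names and message text are all placeholders, not the real openstack.org setup.

```python
import json
import requests

# Hypothetical Elasticsearch endpoint; the real cluster is normally driven
# through its Kibana frontend at logstash.openstack.org.
ES_URL = "http://elasticsearch.example.org:9200/logstash-*/_search"

query = {
    "query": {
        "query_string": {
            # Paste the interesting line of a new stack trace here (made up).
            "query": 'message:"DBDeadlock" AND loglevel:"ERROR"'
        }
    },
    "sort": [{"@timestamp": {"order": "desc"}}],
    "size": 10,
}

resp = requests.post(ES_URL, data=json.dumps(query),
                     headers={"Content-Type": "application/json"})
for hit in resp.json()["hits"]["hits"]:
    # Print when each matching log line was indexed.
    print(hit["_source"]["@timestamp"], hit["_source"]["message"][:80])
```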
So what happens now when you submit a patch? The same things happen as before - a lot of tests run - and Elastic Recheck sits behind all of that. When the tests complete, the results are sent back to Gerrit for the user to see in Gerrit's review interface: Jenkins reports back and says, hey, these are the results. At the same time we throw all the logs onto a static log server - and I think that's still... is that on Swift yet? Not yet. We store all the logs there for record keeping. Our logs are actually so big now that they sometimes crash Firefox, so we added optional server-side filtering, because loading a 40 MB file in Firefox doesn't always work, unfortunately - even the logs are big enough that we're crashing browsers now. We don't collect all of the logs in Logstash: we drop the debug logs, because there's just too much information and we can't store that much in our Logstash cluster currently. Anything at info level or higher goes into the Logstash/Elasticsearch cluster.

And then Elastic Recheck does two things at the same time. One part listens on the Gerrit event stream - that's where Jenkins reports back, hey, these are the results of your job. At that point we go through looking for a failure, and we have a bunch of fingerprints, known patterns for failures. We run those patterns over the logs via Elasticsearch and see whether we find any bugs we already know about. A simple example: we see a stack trace in the Nova compute service, say, and we know it's associated with a certain bug - we build those associations by hand currently. So we can say, hey, we saw this job fail, we saw this stack trace in this log, and it's probably this bug that you hit. At the same time we also want long-term statistics that we track at a higher level: how often has this bug happened? That's the big-picture view, not the per-patch view. We have a script that runs every 30 minutes, I believe, that updates a website listing all the bugs we have.

So this is that page here. We have about 80 or so known bugs today, and this list is up to date, I believe. You see a whole bunch of bugs here, different patterns, different frequencies. Things are actually working pretty well right now: the worst bug has only 62 failures in the past 24 hours, which is actually pretty low considering how bad some of these bugs can get. You see all kinds of different trends in here for different bugs, and a lot of the bugs are really infrequent. Here's a good example: an issue came out, you see this massive spike of failures, we fixed it, and it mostly went away. On January 9th a new version of Boto came out; we fixed it shortly after, the same day, and it's now mostly gone.
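Pulling those pieces together, here is a toy, self-contained sketch of the per-patch half of that flow. Everything in it - the helper names, the bug numbers, the event dictionary, the plain substring matching - is invented for illustration; the real elastic-recheck daemon listens to the actual Gerrit event stream and runs its fingerprints as queries through Elasticsearch.

```python
# Toy sketch: take the result of a finished job, check the collected logs
# against known fingerprints, and tell the reviewer which bug they probably hit.

FINGERPRINTS = {
    "1234567": "DBDeadlock",   # invented bug number -> pattern seen in the logs
    "7654321": "SSHTimeout",
}

def classify_failure(log_text):
    """Return the first known bug whose fingerprint appears in the logs."""
    for bug, pattern in FINGERPRINTS.items():
        if pattern in log_text:
            return bug
    return None

def handle_job_result(event):
    """React to one 'job finished' event reported back to code review."""
    if event["status"] != "FAILURE":
        return
    bug = classify_failure(event["logs"])
    if bug:
        print(f"change {event['change']}: looks like known bug {bug}; try 'recheck'")
    else:
        print(f"change {event['change']}: unclassified failure, needs a human")

# Purely illustrative event.
handle_job_result({"change": 12345, "status": "FAILURE",
                   "logs": "... SSHTimeout: timed out waiting for 10.0.0.3 ..."})
```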
So when we started this out, we thought there were maybe 6 to 10 major bugs at any given time. At one point we had so many bugs and things got so bad that we actually had to stop development and fix them: we had to tell everybody to stop pushing patches, because nothing was passing. When you get enough bugs in the system, the odds of a single job that runs a thousand or two thousand tests passing are almost zero. And we have a bunch of these jobs running, so the combined effect was that it was really hard to land a patch. People were grinding away saying recheck, recheck, recheck, nothing was merging, and the backlog got to 70 hours just to report back whether your patch passed or not, that kind of thing. What really happened is that it turns out we have a lot of bugs - a lot more than 6 to 10 major ones. Right now we have 80 or 90 bugs, and that's a pretty typical number for us. It turns out we're really bad at recognizing lots and lots of patterns: humans can only keep track of a few. Phone numbers are 10 digits for a reason; we can't remember a 50-digit number that easily, that kind of thing.

We found a whole bunch of different categories of bugs here. There are upstream service providers: we've had PyPI with bad certificates, and apt mirrors that fail - every so often some mirror doesn't work that well and it breaks us. The HP and Rackspace infrastructure-as-a-service clouds that our infrastructure runs on will break sometimes. Access issues, GitHub outages. We've had bad upstream pip releases - we had one a couple of days ago, in fact. All kinds of problems like that: upstream things that are out of our control. As a result, we cache as much as we can and try to stay away from the network; having a network call in a test turns out to be really unstable sometimes. Our own infrastructure breaks every so often too - it doesn't happen a lot anymore, thankfully, but it happens. The images we build things on may break occasionally, our mirrors can break, that kind of thing, and we're in the process of making that better all the time.

And we have a lot of bugs in OpenStack itself. That comes from all the inherent complexity of a thousand developers writing a big asynchronous system in Python, with no compile-time static analysis. It's a big, complex system with no single designer; it's very organic, so we get all kinds of bugs. There's a lot of asynchronous work - operations are slow, you don't get a result right away - so there's a lot of asynchronous communication happening across a lot of services. We have database deadlocks every so often. We have race conditions with multiple workers: we had a bug in the Nova scheduler where two schedulers could race each other, that kind of thing. We had bugs in the tests themselves: when we went to concurrent testing, we found all kinds of unsafe global state. We also had things like timestamp assumptions in the tests - you assume the timestamp isn't changing while your test is looking for something with a timestamp in it, and every so often you hit a boundary, the timestamp changes, and the test fails. Then there are the really fun bugs, which are in our dependencies. We've had all kinds of libvirt bugs - maybe half a dozen at this point. We had a great one where network block devices and Open vSwitch were breaking each other: if you ran our volume storage service and our networking system on the same machine, it would break. I'm not sure where the root cause ended up, but it was some bizarre thing that got fixed. We've had all kinds of errors like that, and we're constantly working with upstream providers and upstream packages to fix things.

Contributing a pattern is actually really easy, and that's one of the important things. At one point we thought about trying to automate the process of identifying them. A fingerprint is a piece of a log that you can search for in Elasticsearch and that matches a bug. We write these manually right now; it's not too hard if you understand the log formats and are reasonably familiar with them. We've tried some things to automate it and gotten mixed results. Here's a great example. We have a whole bunch of these bugs - this is actually an older one, and we've had it three or four more times in different variations; it turns out we're not always great at managing our database issues, and this is a good example of that. It's also a good example of how sometimes these failures are pretty easy to understand: you find the test that failed in our integration suite, you trace that back through the logs and the code - this is the Neutron server - and you see a stack trace from the database layer. That's not supposed to happen, so you file a bug for it, you add a query like this that says whenever that happens it's this bug, and then we detect it automatically in the future.
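Roughly what such a contributed fingerprint looks like: a small query file named after the bug, whose query is a Lucene-style search string run against the indexed logs. The message text, the tag name, and the layout below are illustrative rather than copied from a real elastic-recheck query file.

```python
import yaml

# Hypothetical fingerprint for an invented bug number: match the database
# error message in the Neutron server's log.
fingerprint_1234567 = """
query: >
  message:"DBDeadlock" AND
  tags:"screen-q-svc.txt"
"""

# The daemon would run this query via Elasticsearch; here we just parse it.
print(yaml.safe_load(fingerprint_1234567)["query"])
```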
We also try to keep up with our classifications. We know the number of jobs that are failing, and we track how many of those failures we can classify, to make sure we don't have too many unknown failures in the system. This is, I believe, the current status of it. You can see our classification rate isn't that great - it's about 90%. When it's lower, it generally means there's a bunch of bugs in the system that we don't know about. We can also filter this down to the past few days, so sometimes we'll just look at the past two days or so to see if there are any new bugs that have started happening that we don't know about yet.

Some next steps. We don't actually have a good reference for how bad these bugs are: we can't answer the question, what percentage of jobs is this bug hitting? It turns out that's not as easy to do as we thought it would be, and we haven't gotten around to implementing it. The UI is a bit wonky, and we could do a better job of making it more intuitive and more useful; we've talked about that for a while. We hit Elasticsearch pretty hard, and our logs are really bad in general - there's a lot of work to clean up our logs and make them more sensible, all kinds of things like that. This is really just the beginning of our process to help us better manage our failures.

This is also really useful for downstream deployers of OpenStack. If you're running this big, complex system, we probably already know most of the failures you're going to see in production. If your system is even remotely like ours, you may see these bugs every so often, and we can tell you they're known bugs, and you can track how often they happen, things like that.

I'd like to thank everybody who's worked on this - we had a lot of people working on it. A lot of them are submitting fingerprints, which is a lot of the hard work of actually making this useful. It's a constant challenge to keep up with all the failures in the system; I think every week there are probably three or four patches coming in to add new bug fingerprints to the system. Thank you. Any questions?

I have a question. This Elastic Recheck - is it hard-coded for OpenStack, or can it be used by other projects as well, with the Elasticsearch and all that? Yes.
The code base is somewhat tied to our infrastructure, but the code is actually really small - maybe a few hundred lines of Python - and the basic model and a lot of the code can be reused. I would love to work with somebody from another project to help separate the parts that are specific to us from the parts that aren't, to make it more accessible. And it checks the bugs in Launchpad, right? Yes, we use Launchpad, although we have our own bug tracker coming, so we're probably going to have to add support for another bug tracker as well. Any other questions? No, but it wouldn't be hard to add that kind of support to it. As I said, when we wrote it, it was mostly a proof of concept that sort of stuck, and we haven't really gone back and fleshed it out and pulled out the parts that are tied to our infrastructure - the format of Gerrit, that kind of thing. But that's something I'd be very interested in doing, working with somebody on it, if they're interested in adopting it.

The hard work of this seems to be building fingerprints and search queries for Elasticsearch. Do you have any insight into how you might automate that part away, so that you could be automatically pulling tracebacks out of bugs? Right, there are a few answers to that. That is now the hard part - it wasn't the hard part before. We've moved the problem from figuring out what happened every time you see a failed job to just writing these fingerprints. If it's a stack trace, it's actually really simple: you look through your logs and you find the stack trace. Our system does emit stack traces every so often when there isn't a failure, but usually you can figure those out. If it's not a stack trace or an obvious error - a network outage, that kind of thing - those are also pretty straightforward to understand. If it's more complex than that, we've tried things like CRM 114, the spam filtering tool, and we've gotten mixed results with that; it actually works okay. I think the answer we found is that if you're comfortable with the logs you generate, it's not too hard. One of the problems is really complex, really subtle failures. We had one that was crosstalk between RPC - our message bus - and our REST APIs: RPC calls were interfering with REST calls and vice versa, and that took us about a month to figure out. In the end we could say this job failed for this reason and track the failure on the test side, which isn't great - it's not the root cause - but it was something. So I don't think that part is actually as hard as you might think, unless you don't understand your log format at all, in which case that should be worked on as well.

Just in terms of filing the fingerprints, it's more monotonous than challenging. If I decide I'm going to spend some time and file some bugs, I go to the unclassified page and just go down it. Say there are 20 total failures: I open all of those in separate tabs, find the stack traces, and find the similar failures. It may be that 15 of those are all the same failure, so one fingerprint will get rid of 15 of them, and then you have five where you actually have to look a little bit deeper.
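As a rough sketch of that "find the stack traces and group the similar failures" step - assuming ordinary Python tracebacks in the collected logs; the sample log text below is made up:

```python
import re
from collections import Counter

# Made-up log excerpt containing two occurrences of the same failure.
log_text = """
2015-01-10 12:00:01 ERROR nova.compute [req-1] something went wrong
Traceback (most recent call last):
  File "/opt/stack/nova/compute/manager.py", line 100, in _do_thing
    do_thing()
TimeoutError: timed out waiting for instance
2015-01-10 12:05:42 ERROR nova.compute [req-2] something went wrong
Traceback (most recent call last):
  File "/opt/stack/nova/compute/manager.py", line 100, in _do_thing
    do_thing()
TimeoutError: timed out waiting for instance
"""

# Grab everything from "Traceback" through the exception line that follows it.
pattern = re.compile(r"Traceback \(most recent call last\):\n(?:  .*\n)+(\S.*)")
exceptions = pattern.findall(log_text)

# Group by the final exception line: one fingerprint candidate per group.
for exc, count in Counter(exceptions).most_common():
    print(f"{count}x {exc}")
```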
Since we're talking about habits, I'll talk a little bit about that. Part of the problem is that it's monotonous, and you don't have any way of showing people the work you did. If you're writing code, or doing something else, you get to wave a little flag at the end; whereas if you're classifying bugs, then other than Joe saying thank you very much and putting your name on a slide, you really don't get much to show for your time, so it's hard to motivate people to do it. The monotony is why I asked the question, because I understand that that's the difficult part, and I think the CRM114 answer is possibly a really good way forward - Bayesian classification is a pretty interesting approach. I was just curious whether you had thought about automating away this part of the pipeline that you've now built. It probably didn't come out in Joe's talk, but this is a very, very complex problem with a lot of moving parts, because you have to fully understand all of the infrastructure interactions, which is a 20-minute talk in itself, and then you have to understand the additional parts, Logstash and Kibana. There are probably about five people who have that knowledge, and the thing is that they're so busy fixing and classifying bugs, helping people, and communicating that they don't have the time to go back and fix the infrastructure. Well, we should probably not go too far into afternoon tea, which I think is now. So, thank you very much. APPLAUSE