All right, I think we're going to get started here. Welcome, everyone. My name's Sean Dague. I wear many hats within the OpenStack community: you'll see me reviewing DevStack code, Tempest code, Nova code. But the thing that ties it all together for me is our continuous integration testing system, and especially figuring out the things that make it go funny in ways we did not expect. The talk today is about Elastic Recheck, a tool that grew organically out of issues we were having with how something as complicated as OpenStack gets validated in our system. I'll give you the back story of the problem we're trying to solve, what we've done to solve it, and where we're headed.

How many people here have submitted code to OpenStack before? OK, so maybe 20 or 25% of the room. You'll probably be familiar with what happens when people submit code; people are often pretty surprised by it the first time. If Jane Developer has an interesting change she wants to submit to OpenStack, the act of pushing it into our code review system immediately triggers a ton of work in the back end. We do validation for style checking, we run whatever unit tests are appropriate for that project, and we use DevStack and Tempest, and sometimes DevStack with other test systems, to bring up full versions of OpenStack: what would OpenStack look like with this one patch in place, and how would it interact with all the rest of the services? Each of those steps is a ton of work. Right now, a standard set of Nova tests ends up creating 180 guests in the course of a normal run. So there are lots of moving parts, lots of actions triggered in our system by one uploaded change.

So what if this is your change? This is a real change, pulled out of the code review system, and this is all of it, the entire change: they added an 'r' to fix a spelling error in a docstring, and they failed a huge number of tests. You could almost imagine the doc tests failing over it, but no, DevStack didn't even start on this one. I used to spend a lot of time on this particular problem.

Since we're talking about numbers, let's talk about numbers. A single change will generate something like 5 to 10 DevStacks, which means in aggregate about 10,000 integration tests run for that change, and about 1,000 second-level guests get created in the process. On the order of a gigabyte of uncompressed log data gets generated for every proposed change: all the service logs for all the services, plus the test logs. That's one change, and we have high velocity. In any given week we're merging 250 to 500 changes, and each change usually goes through several revisions before it lands, so we're actually generating a much larger number of revisions. Just in terms of growth, our Gerrit system currently sees 10,000 newly proposed code revisions every 42 days. So there is a ton of inbound.

All that inbound leads to an interesting aspect of the statistics of large numbers: highly infrequent events do happen. How often you see a given failure depends on the probability of the event, how many of those events occur in one test run, and how many test runs you do.
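To make that arithmetic concrete, here is a minimal sketch; the rates and the weekly run count are illustrative assumptions chosen to match the rough figures in this talk, not measured values:

```python
# Expected spurious failures = P(event) * events per run * number of runs.
# Illustrative numbers: a dependency with three-and-a-half nines of
# availability is down 0.05% of the time; the weekly run count is assumed.
failure_probability = 0.0005
events_per_run = 1        # times one test run touches that dependency
runs_per_week = 30_000    # assumed aggregate job runs across all revisions

expected = failure_probability * events_per_run * runs_per_week
print(f"~{expected:.0f} spurious failures per week")  # ~15
```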
So I've given you the relevant numbers. Now let's say, for the sake of argument — it may or may not be true — that GitHub has three and a half nines of availability, which is probably about what they actually have. That means they're down five hundredths of a percent of the time. Based on how much we clone from them, how many runs we pipe through the system, and the fact that we basically have a huge continuous stream, that means we will fail 15 tests every week because of an intermittent GitHub issue. The failing change could be a spelling fix; it could have nothing to do with GitHub at all. If you were running a small enough number of things, this would just be an aberrant event: oh, that didn't work for whatever reason, retry and move on. But with this much throughput, it means we just failed two Glance changes, a Horizon change, and three Nova changes in a given week for something completely unrelated to the work people were trying to do. As we started seeing these patterns emerge, we realized this was a problem we had to go after. For this particular problem, we don't clone from GitHub anymore, for exactly this reason.

It turns out, as I said in a session yesterday, that the OpenStack infrastructure is the Nagios of the internet. We typically know that one of our cloud providers has a network outage in a region before they do, because it failed a code review. The reason is that even a super simplified version of the OpenStack architecture, not including all the currently integrated projects, has a lot of moving parts. Those parts communicate via asynchronous messaging mechanisms, which means small timing changes can have interesting effects. There are lots of places for things to arbitrarily slow down or get impacted in any particular sequence of events. If the test order changes, it could mean a Cinder test is now running alongside this test where it wasn't before, which slows down the IO path, which means all the daemons run a little slower, which means we get a timing delay, which, if everything's not exactly right, can trigger some sort of failure. One of the reasons Linux won out over the Hurd was that the Hurd had an internal architecture like this: lots of little daemons, asynchronous messages, a microkernel. It was really hard to debug, and they got bogged down because figuring out what patterns were happening was something they just couldn't handle. We realized we were hitting similar issues and needed tooling to approach them.

Here's how we handled this in Grizzly, because that was the point at which we had enough tests and enough complexity to experience this issue at scale. We know that a test failure does not necessarily mean the code is bad, so we have a facility where people can leave a comment with 'recheck' and a bug number: this test actually failed because of a bug I know about, and that signals to us that the bug needs to be looked at. Sometimes people used it. Sometimes people put numbers that weren't numbers into it. Sometimes people put whatever random bug number they happened to remember, because they didn't really care; they just wanted to get their code back into the system.
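Mechanically, that convention is simple enough to sketch; this is a hypothetical illustration of the comment format, not the actual infra code:

```python
import re

# Hypothetical sketch of the Grizzly-era "recheck bug NNNN" convention:
# a review comment of this form re-triggers testing and attributes the
# failure to a known bug.
RECHECK_RE = re.compile(r"^recheck bug (\d+)$")

def parse_recheck(comment):
    """Return the bug number if the comment is a valid recheck directive."""
    match = RECHECK_RE.match(comment.strip())
    return int(match.group(1)) if match else None

print(parse_recheck("recheck bug 1253896"))   # -> 1253896 (bug number made up)
print(parse_recheck("recheck bug whatever"))  # -> None, the "not a number" case
```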
And with single patches, people would often, when the patch bounced, just recheck it; they sometimes wouldn't tell us, and we would have no idea an issue was emerging. But if someone was pushing a 15-patch series into the system, and we had grown a bunch of these race conditions, they would see three of their 15 patches fail, at which point they'd realize it wasn't just a one-off event: there's something else going on. They would come into IRC and say, have you seen this failure before?, and point me at a log file. I am a poor big data solution.

Then in Havana we had this really great goal of turning on parallel testing, which meant doing the incredibly crazy thing of sending more than one request to the API server simultaneously. And things got really interesting. There was actually a point in time where it was unclear we were going to make Havana, because these things kept cropping up and we couldn't see what was going on. So, like we often do, we built a tool to solve the problem, and a whole bunch of people pitched in.

The crux of the thing we built comes back to the fundamental question we're trying to answer: have you seen this recently? How do you figure out what you've seen? You have to have some way to search for it. I'm going to flash a few of these UI screens and then we'll come back and go through the architecture of what Elastic Recheck is. This is Logstash with Kibana, and shoving all of our test logs into Logstash lets us see things. We need the bottleneck not to be me, or Joe, or the couple of other people who actually see these failures regularly enough to tell people about them. People are not elastic, and when you have thousands of developers bringing code in, 'have you found my bug today?' is not a question you want humans answering. But bots do this really well. So if you've pushed a patch and it failed for something that didn't look like your code, you may very well have seen this handy little message come back: we think you failed for this reason, and it's not your fault. Here's some information about how you can learn more about it, and here's the command to put your change back in, attributed to what was probably your bug.

And 'recently' is important, because it turns out there were a lot more of these things than we realized, they come and go over time, and when we think we've fixed one of these issues, we need to know whether we actually have. You can see that sometimes they spike very quickly. That's actually great when it happens, because if we see a huge spike in something, we can typically narrow it down to a dozen commits that might have caused it and walk back through them. The thing that's hard about all of these bugs is that they are small-statistical-probability bugs. We're not talking about something that breaks all the time; these fail at very low rates.

This is what the architecture of the system looks like, and the components within it. Think of it as two sides: a synchronous model, and an asynchronous bulk data collection model that gives us our long-term statistics.
When your test completes, there's a normal process that has existed forever that takes all the artifacts — the service logs, the test logs, and a whole bunch of things like the configuration values, so that we could roughly reproduce the environment — and puts them on logs.openstack.org, which has the maximum number of maximum-sized volumes on Rackspace Cloud that we can attach in a single environment, and we still can't quite keep six months' worth of logs. Starting in Havana, we got Logstash, and we take a subset of that data, largely so we don't kill our Logstash, and we end up with about two weeks of data in an index in that Logstash environment.

When test results happen, they post to Gerrit. We've got a bot listening to the Gerrit event stream and responding to things. When it sees a failure, it goes and grabs hold of the test results so it can understand which test failed. Now, this process is asynchronous and takes time, so the bot has to be smart enough to wait around a little before the logs enter our search engine and become available.

Then we have a set of known patterns: distinct Elasticsearch queries, Logstash queries, such that a given query uniquely identifies a bug. Sometimes coming up with that query is easy, because it's a very specific traceback: oh, when this test failed, what actually happened is that nova-compute tripped something and tracebacked in a terrible way, and there's a really distinct traceback signature we can key on. Sometimes they're really not easy to pull out, because you can have situations where you can only see the pattern by jumping across a whole bunch of logs, like something got lost in the flow somewhere. We can't currently do much with those, because of a limitation in how we use Logstash: documents in Logstash are line-oriented, so the fundamental unit of search is one log line.

That has had an interesting effect as developers have started to engage with this tool: an error message like 'networking failed' is terrible. I think one of our signatures today is basically that. And the Neutron dev team says, well, that's like 20 different bugs that could have caused it. That's great, but there's nothing in your logs today which distinguishes between them, which means we can't figure it out, which means operators can't figure it out either. So this provides an interesting back-pressure on the dev teams to make these things really easy to find, because that helps them get code through, and it's also really good for everyone else trying to understand what's happening in their environments.

That real-time piece then posts the message, as I said, right back to Gerrit. The theory is to provide actionable information as close as possible to the person who can deal with it, which is right there in Gerrit: my thing failed, I go to Gerrit. Our attempted SLA on this is 15 minutes. We assume that if we go past 15 minutes, something is terribly wrong with the backend data store, and we don't try anymore. That means this is lossy and we miss stuff, but most of the time we're pretty good about it. And now we notice when we miss stuff: when Logstash has weird outages, we can actually flag that to the rest of the world.
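A minimal sketch of that real-time loop, assuming hypothetical helpers (`logs_indexed`, `match_query`) standing in for the real Elasticsearch plumbing:

```python
import time

# Hypothetical stand-ins for the real Elasticsearch plumbing.
def logs_indexed(es, build_uuid):
    """Have this run's log lines appeared in the search index yet?"""
    ...

def match_query(es, query, build_uuid):
    """Does this stored bug signature hit within this run's logs?"""
    ...

def process_failure(es, known_queries, build_uuid, sla=15 * 60):
    """Wait for a failed run's logs to be indexed, then try every known
    bug signature against them. Illustrative sketch, not the real code."""
    deadline = time.time() + sla
    while time.time() < deadline:
        if logs_indexed(es, build_uuid):
            # Report every known bug whose signature matches this run.
            return [bug for bug, query in known_queries.items()
                    if match_query(es, query, build_uuid)]
        time.sleep(30)  # indexing is asynchronous; poll and wait
    return None  # past the SLA: assume the backend is unwell, stop trying
```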
We also dump results into IRC channels in configurable ways; the QA channel actually gets a message every time a gate job fails for an unknown reason, because those are really interesting to us. Our goal is to get things categorized as much as possible, so that every single random failure is something we know about. If we know about them all, then, most importantly, we know there isn't a big hidden one in there that's actually killing us.

And then we have the bulk model. This produces a set of pages — those frequency graphs, plus an uncategorized list I'll show you in a second — that ends up at status.openstack.org/elastic-recheck. This is a way to figure out not just what was important to you on your patch, but what, in aggregate, is affecting the whole project right now, so we can provide some sense of urgency: these five bugs are the worst race bugs we're seeing in OpenStack this week, we should try to get people on them; these ones have been going on for a long time, which of them do we need to dive on?

When we started this process — well, it was a long process. It started with a conversation back in San Diego, where Clark Boylan pulled me aside and said, I've got this idea, we could use Logstash and go find some of these issues we keep hitting. I said, that's great, and didn't quite understand what it meant. Four months later we talked about it again, and by the summer we actually had something. I had noodled around on a little sample code for how we would interact with that API and pull some things out, then something else got busy and I put it aside. Then, as I'm about to get on a plane to LinuxCon in New Orleans last fall, everything is tanking in the gate, we're in late Havana RC phase, people can't get their bugs merged, and Joe Gordon says, hey, that thing you were talking about for a long time, do you have any sample code yet? Yeah? Okay, give it to me. I went to the conference, came back a week later, and Joe and Matt Treinish had written the first version of this, because we needed it.

We really thought what we were going to find was a few bugs, because we kept seeing them. Joe and I kept seeing these bugs: oh yeah, this is that thing. And they would have relatively high frequency rates, like 1% of jobs failing for a given reason. In hindsight, there's a reason we thought that: there's only so much information a human brain is good at holding, and once a pattern drops below a certain signal threshold, you don't see it as a pattern anymore. Because that's not what we found. Right now — well, a couple of days ago when I looked — we're tracking 100 unique race condition bugs in the system this way. And that means all of these frequencies are super low; calling it "less than a tenth of a percent occurrence rate" doesn't really capture it. We see certain things that show up once every week, but it's the same thing each time.

A great instance of this is the Keystone port. Keystone has an IANA-assigned port, but it's up in the high port numbers, which means it's a totally valid ephemeral port for other services on the system.
So once in roughly 60,000 runs, given a random distribution, Keystone fails to start because something is already on its port. With our throughput numbers, that turns out to be once every two weeks for us. That's why Keystone should be a WSGI app behind Apache, and we're going to make that the default soon.

We largely have five major classes of issues, with different ways we attack them. First, upstream services break. This goes back to my earlier comment that OpenStack infra is the Nagios of the internet. We see a GitHub outage before they've told anyone else — although not anymore, because we mostly stopped depending on them. We see PyPI issues; we have found every PyPI issue before they found it themselves, especially when they changed their SSL terminators and it turned out one node in their CDN didn't have the right certificate. They said, no, the vendor says they're all right. I said, I don't know. Four hours later, dstufft was like, yeah, yeah, they missed one.

Second, the infrastructure itself breaks. We have all this code in openstack-infra that does all these amazing things. Sometimes it has bugs. Sometimes it has scaling limits we didn't know about. Sometimes we fall over. These kinds of breaks aren't statistically random; they show up, they're hard to find, and they typically stop us dead, but we have to filter them out, otherwise we don't have a sane data set for looking at the rest.

Third, straight-up bugs in OpenStack: state corruption issues, where if you do API operations in a certain sequence you get a sort of corrupted compute instance that can't be gotten rid of; database deadlocks; multiple-worker issues. We've seen all of these classes of things. For example, there was an events API added to Nova at the very end of Icehouse because we finally got to the crux of a weird timeout issue we were having with Neutron: some things in our environment were taking longer than we thought, and by the time the connection had been made, the DHCP agent on the computes had given up, and the rest of the test failed. We never had a signal to say exactly when things were done, so they fixed that.

Fourth, the tests themselves have bugs. We found this turning on parallel testing, which is what we do now: we typically run two or four test workers simultaneously against the endpoints, so you have tests running concurrently, which is great. But the order they run in is arbitrarily picked by the test runner, and there are times when you get cross-test or just normal timing interactions. With things like aggregates, you hit a global state issue: you really can't legitimately create, list, and delete aggregates in one worker while another worker is doing the same, because they're changing something that is not tenant-level isolated, right? So when we moved to parallel testing, we found a bunch of things, largely around aggregates, availability zones, and other global structures, where our test suite wasn't able to do the right thing. But also, people write tests like: I'm going to get this response, I'm going to get this response again, and I'm going to compare the two dictionaries. Those two dictionaries include a timestamp. Our timestamps have second-level granularity. That API call took 20 milliseconds. So one out of every 50 runs, the two are not the same.
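That timestamp failure mode is easy to reproduce in miniature; this is a contrived sketch, not an actual Tempest test:

```python
from datetime import datetime, timezone
import time

def get_server():
    # Stand-in for an API call whose response embeds a
    # second-granularity timestamp.
    return {"id": "abc123",
            "updated": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")}

first = get_server()
time.sleep(0.02)        # the ~20 ms the second call took
second = get_server()

# Flaky: fails whenever the two calls straddle a second boundary,
# i.e. roughly once per 50 runs at a 20 ms gap.
# assert first == second

# Deterministic: compare everything except the volatile field.
strip = lambda d: {k: v for k, v in d.items() if k != "updated"}
assert strip(first) == strip(second)
```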
And it's like, oh, right. We've got signatures that catch these now, but they show up in these weird ways.

And fifth, dependencies. OpenStack itself pulls in a ton of Python dependencies. A ton of Python dependencies. And not just Python ones. We had this one issue we couldn't figure out, and Salvatore on the Neutron team dug into it, and it turned out that on the Ubuntu Precise kernel, if you make certain NBD calls for file injection in Nova at the same time that you make an Open vSwitch network namespace call, you get a kernel panic. Great. There wasn't a whole lot we could do about that, so we basically stopped using file injection — which we'd wanted to do for a long time anyway. We found interesting things in libvirt. It was great when we tracked down one bug in libvirt and Dan Berrangé said, oh wait, I think I vaguely remember this, and pointed to the changelog commit, which was basically 20 patches past the version Ubuntu had shipped. So close. And that fix got backported; they did a good job backporting it. We try to get these bugs reported upstream, and we find this stuff all the time. We've found kernel bugs, libvirt bugs, KVM bugs in this system that those teams had never seen before, because when it failed once in someone's test, they tried to reproduce it, couldn't, and concluded it wasn't a real issue. But we have this throughput of data, and you can do statistical analysis on it. You can say: no, it's a real issue, especially if I see it come back every other day. It might be a low-frequency occurrence, but it's a real thing. And once it's established as a real thing, people stop dismissing it, and we can dig in, get to the bottom of it, and hopefully change the tests in a way that makes it show up a lot more often.

On contributing: this all lives in an OpenStack infrastructure project called elastic-recheck, which runs through our code review system just like everything else. The code and the queries are all in the same tree. Each query is literally a file, with the bug ID number in the name of the file, containing a Logstash query that basically uniquely maps to that bug. They can have various levels of complexity: a query might match on a log message, but only in certain logs we're interested in, or only if it shows up in some other particular log. These are pretty easy to contribute. It is hugely awesome when people come help us write these things and find bugs, because it helps us figure out these intermittent, race condition, flaky bugs, whatever you want to call them — the things that don't happen every time. We love contributors here. We're pretty fast on code reviews here, and we land your stuff quickly because we really want it in.

If you want to come help us, we have the uncategorized list at status.openstack.org/elastic-recheck. That's everything that's failed in the gate in the last two weeks that we don't have a signature for, sorted by the job that failed. It's regenerated once an hour.
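To give a feel for the shape of a signature, here is a rough sketch; the bug number and query text are made up, and the exact layout of the real files in the elastic-recheck tree may differ:

```python
import yaml  # the signature files are YAML; pip install pyyaml

# Roughly the shape of a signature: the filename carries the bug number
# (hypothetical here), the body holds the query that maps to the bug.
example_file = "queries/1234567.yaml"
example_body = """
query: >
  message:"Connection to the hypervisor failed"
  AND filename:"logs/screen-n-cpu.txt"
"""

signature = yaml.safe_load(example_body)
print(example_file, "->", signature["query"].strip())
```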
It's not real time, but you can go through it, dig into some log files, and say: oh, this failure is really this bug, and this query is really its signature, and help us get rid of these things. This is kind of our to-do list, and we keep track of where we are on categorization. When I took this screenshot last week, we were at about 75% categorized. Our experience is that we really want to be above 90%, because below that there is typically a big killer bug hiding in the data that we haven't found yet; the odds of that get pretty low once about 90% of failures are characterized. So help us there. And again, we're mostly happy people and we really love contributions here.

We have a whole bunch of next steps. I think I was supposed to update this slide after the design summit session, but these are things we were planning to do anyway. This whole tooling basically evolved between about two weeks before Havana shipped and now, so for roughly one release we've been working on a thing that was written in place to work really well in our system. The pattern analysis model has been super useful, and what we actually want to do is bundle it up so that other people can use the same methodology in other environments. In iteration one, that's going to mean you have to have a Logstash setup that looks kind of like ours, but the rest of the tooling should otherwise work. And if people become interested in applying this model of finding infrequent events to their own data sets in a cloud environment, we want to have you part of this conversation, and help.

There are some basic data cleanup things too: there are genuinely better statistics we could be providing that might help narrow things down a little more. We're not quite there; we're mostly at the counting stage, but even that has proved really useful. People may have seen the old Rechecks page. It still exists — actually, I think it's been broken since the Gerrit update, so what still exists is a really stale version that's not being updated. We're trying to figure out what value remained in the data we were getting from users there and what we can still collect going forward.

And finding patterns in the patterns: it turns out that one of the first things we do when we find an issue is ask a second question — does it happen in only one of the cloud providers we run on, or only one availability zone? Can we narrow it to a physical event immediately? If it only happens in some subset, then typically it's not an OpenStack problem. So there's a layer of analysis we could do that would trim things even further and point us in a direction right away. The other thing is that it turns out we generated more load on Elasticsearch than we thought we would in this process, so we need to be a little smarter about some of these things; we've been upgrading some of the Elasticsearch backends and being better about our query models and whatnot. So, a whole bunch of work here. We had a design summit session yesterday, and there'll be a whole bunch of diving in over the next cycle.

I wanted to thank a whole bunch of people: in basically a single cycle, we've had 33 contributors to the project.
All of these people contributed either code or queries to help make the whole system work, and that was really awesome, because these people contributing to this code are a much better big data solution than my brain. So thank you. And with that, we have a few minutes left, so I will stop and take questions and show you the pretty picture again. There's a mic there, or you can yell at me and I'll repeat your question. Come on, Thierry, you must have something to say.

Yeah, so the question is: we have a whole bunch of different queues in our environment. We have the check queue, where patches land initially, and the gate queue, which things go through before final merge. And Thierry's question is whether this page covers only the gate queue, or also the check queue and other queues. This is only the gate queue. The reason is that the check queue is full of noise: people push code that isn't valid Python into the review system very regularly, right? We don't really want to spend our time writing signatures for 'this is not valid Python'. It's not really useful. I suppose it could be — you could use it as a proxy for, hey, you don't have valid Python, though hopefully they wouldn't just run recheck on that. They do, though. It's amazing sometimes. But the gate queue we consider pretty clean data, because by that point a change has passed all the tests at least once — especially with the new clean check model, it can't get to the gate queue without that — and it's had two humans say this is right and push it forward. So if there's an issue there, we assume it's a real issue.

With this whole pattern, we used to assume, back when we had a smaller test bucket, that we would catch everything on the way into the code. The thing is, in a largely asynchronous environment, you don't. These bugs are infrequent enough that they can slip through the five to ten iterations' worth of tests on their way in without being seen, and then you get emergent behavior, especially when they interact. So this flips the problem: it says these bugs are actually already in OpenStack, and asks whether we can use the existing test infrastructure to data-mine out the bugs we already have, and then get those fixed. And inverting it that way has actually worked out really well.

I think we have four minutes, so we can take another question. There's one back there, and one here. If you can go to the mic, it might be easier than shouting at me.

So, do we ask engineers to submit a pattern with every bug fix? We do not, because we're interested in a specific class of bugs: things that are not easy to find. Typically, when we file a bug and a pattern for one of these, the root cause is not known. These are bugs we have to dig on for a long time; the patches may not come for a long time. So this is more about exposing a particular class of problems as bugs, and using the tooling to prioritize which ones need to be addressed first, because they're the ones breaking us the most.

And there's one more. Yep. So, you said that you have found certain known patterns from the logs. Could you give some details about what kinds of techniques you used to find those patterns?
And the other question is: given a new issue, how do you match it to the known patterns? Yeah, so on the first part: the way we find these things is smart humans. It really is a manual process at this point. We have some ideas — we've actually been running some Bayesian analysis and such to figure out whether we could make things pop out — but the signal's not there yet. Right now, smart humans can find these things faster than anything else, so we rely on some smart humans to do that.

On the second part — given a new test failure, how do we match it to the known patterns — when something fails, we run every single one of those current 100 search patterns against the logs for that failure. If any of them hit, we say, oh, you hit this thing, because we know about it. That means if we knew about the bug before your test failed, you get this response. If we only figure it out later, you'll never get that feedback directly, but we will catch it in the bulk collection over here. Okay, thank you. Yep.

All right, I believe we are at time, so thank you all for coming. And feel free to come ask other questions.