All right, it's 5:30, so I'm going to get started right away because we have a lot to cover. Let's dive in.

My name is Sean Dague. I'm a senior software engineer at the IBM Linux Technology Center. You may also know me from the community: I'm the PTL for the OpenStack QA program, and I'm +2 (a core reviewer) on Nova, DevStack, and a bunch of other things. If you've ever tried to land a DevStack patch, I've probably commented on it at some point, and maybe on some other projects as well.

The talk today is "Jenkins failed my patch, now what?" We do a lot of interesting, complicated things in our continuous integration system when it comes to the rigorous testing we put people through, and as a first-time contributor, the first time you push code up you will inevitably have your patch failed by Jenkins. That first time, it's often confusing what you're supposed to do to solve the problem. What I'm going to attempt to do today is break down the classes of tests we run in our gate, how you would actually resolve an issue in each of them, and give you a roadmap to make it past our first line of defense against bad code landing in OpenStack.

This is the workflow for OpenStack; it's available on the wiki. I'm actually the one who drew this diagram and put it there. The basic workflow around Gerrit is that all our code lives in git somewhere; you work in a local git repository and push it up to Gerrit for review, and we do a bunch of automation around that. There's a whole lot of complexity in here, but today we're talking about just this piece: when code is pushed to Gerrit, it immediately kicks off a set of jobs that verify the code, and Jenkins will return, in less than an hour, a +1 or -1 on your change. If you're a new developer, that will most likely be a -1, because you will inevitably have tripped over something we try to prevent.

You don't just get a -1; you get a little more feedback than that. This is what a Gerrit post by Jenkins looks like. Up at the top there's a scoring section (I've snipped out pieces of a Gerrit review page), and down below you'll see a post that says "this doesn't work," with some detailed documentation on figuring out what's wrong, everything we just ran, what worked and what didn't, and timings. These are all hotlinks, and they take you into our log system so you can start digging into the results in detail.

This is a Nova patch from last night, I think; I just grabbed it. Nova runs probably more jobs than any other project within OpenStack, but depending on the project you'll see anywhere from four to a dozen jobs here, and they fall into different classes. What I'm going to do now is walk through what some of those classes of jobs are, some of the common failure scenarios, and how you get through them, realizing that not all projects implement all of these things, so some parts will be more applicable than others.

The first class of failures is actually requirements. All of the OpenStack projects are written in Python, and they all define their underlying Python requirements as part of their bootstrapping mechanism: what are the things they require?
And here's the problem: we don't just care about one OpenStack project working. At the end of the day you have to build a cloud, which means you have to have all of the integrated OpenStack projects working together at the same time. Due to limitations of Python's packaging system itself, you can only have one version of a Python library in your global namespace at a time, so it's really important that all the projects can function on a common set of requirements.

We used to manage this in a very ad hoc manner. When we had five integrated projects, that was manageable; when we had seven it was manageable but painful. We're at nine integrated projects now, and this is no longer manually manageable. So I spent a bunch of time over the summer helping automate what we do here, and we have a whole separate OpenStack project which is just the list of global requirements. If you want to update a requirement in a project, you have to update it in global requirements first, and that has to be approved. The approvers on that include basically the project technical leads for all the projects, so they have some idea whether this is a reasonable or unreasonable piece of software to include, as well as some of the Linux distributors, so they can say that, for technical or license reasons, a piece of software is not shippable and we can't make it a dependency of OpenStack.

This was an instance, I think from last week, where someone tried to land an incompatible local requirements change, and we checked it and said that's no good. This one will not happen to you very often, but when it does happen it's very cryptic, which is why I wanted to include it.

Next up are things that will happen all the time: we enforce style guidelines with software. OpenStack is about half a million lines of Python code, and in the Havana release we had over 900 developers land code in OpenStack. When you have half a million lines of code across nine integrated projects, it is really important that there is some consistency across them, so that the context switching of understanding code in different projects is not huge. Python itself has a set of best practices called PEP 8; OpenStack has a set of extensions to that which we call hacking, based partly on things like the Google Python style guide and partly on things we have found over time that, if we let them into the code base, make the code less manageable, and so we kick them out. This entire rule set is managed in a project called hacking, which is available from PyPI itself. It will hit you on whitespace use, it will hit you on some function-naming issues, it will hit you on ambiguous ways of writing Python code, and it happens automatically.
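To make that concrete, here's a tiny sketch of the kind of code the style job rejects. The specific check codes in the comments are the PEP 8 and hacking identifiers as I remember them, so treat them as approximate; running the project's style job (typically `tox -e pep8`) gives you the authoritative output.

```python
import os, sys          # H301-style: one import per line (and the unused
                        # "sys" import would be flagged separately)


def lookup(value):
    if value == None:   # E711: comparisons to None must use "is None"
        return {}
    try:
        return os.environ[value]
    except:              # H201-style: bare "except:" is not allowed
        return {}
```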
Jenkins will fail you, it will give you errors like this, and then you have to go back and fix them before you can get past. This has been so successful on the Python side that, for the few OpenStack projects written in shell, we're actually writing equivalent tooling for shell style issues. This is about saving code reviewers time and mental energy: if all of the nitpicky stuff is handled by a tool, and you have to get through that gauntlet before someone looks at your patch, that saves me, as a code reviewer, a ton of time.

Unit tests. The next class of failures — and actually, the example I showed you earlier was failing on unit tests; exactly why, I don't know. OpenStack has a very deep test culture; it is a foundational principle that you do not deliver code without tests for that functionality. So if you look at any project — if you look at Nova — the lines of unit test code in the tree are about the same as the lines of functional code. It's roughly half tests, half function, which is a good place to be.

For most of our projects, the test-running infrastructure we've moved to is a thing called testrepository — testr for short — which is a runner of test runners, and there are complexities around that. What's most important is that we run these unit tests in parallel, which means their order is not guaranteed and may change over time. When we converted to this, having multiple unit tests running at the same time showed that people were actually very, very bad at writing isolated unit tests that did not corrupt some other test's state — which becomes very evident when you start running in parallel, because things now do or do not work depending on what ran first.

When unit tests fail, you look at them in the following order. Did you just write bad tests or bad code for what you just landed, or did your code break existing validation that was there for a reason? It is probably your code. If it's not your code, it could be your tests corrupting some global state, or it could be somebody else's tests that we hadn't ferreted out yet. Or it could even be this other class of things where, because everything runs at the same time, you sometimes do need to manage global state within these unit tests — you might actually need legitimate locking in unit tests, which is supported. (There's a small sketch of this shared-state failure mode just below.)

testr itself has functionality you can run explicitly, --analyze-isolation, which will start with all your tests, hit a failure, figure out all the tests that were running in that group, then start bisecting — getting smaller and smaller and smaller — to find the smallest set of tests that still generates the failure. That should let you see that these two or three tests are coupled in a way they should not be, and then you can debug from there.

Now, these first three classes of things are relatively straightforward: this kind of testing is somewhat isolated to a single environment, and it's relatively easy to run it all on your laptop. You should just run the tox jobs on the projects — they run the style checks and the unit tests — before you push.
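Here is a minimal, invented sketch of that shared-state failure mode: one test mutates module-level state and never restores it, so a neighbouring test passes or fails depending on scheduling order once the runner starts executing classes in parallel workers.

```python
import unittest

# Hypothetical module-level state shared by every test in the process.
DEFAULTS = {"retries": 3}


class TestTuning(unittest.TestCase):
    def test_disable_retries(self):
        # BAD: mutates the shared dict and never restores it, so any
        # test that happens to run after this one sees retries == 0.
        DEFAULTS["retries"] = 0
        self.assertEqual(DEFAULTS["retries"], 0)


class TestDefaults(unittest.TestCase):
    def test_default_retries(self):
        # Passes or fails depending on whether TestTuning ran first in
        # the same worker -- exactly the kind of coupling that
        # "testr run --analyze-isolation" helps you bisect down to.
        self.assertEqual(DEFAULTS["retries"], 3)


if __name__ == "__main__":
    unittest.main()
```

The fix is for each test to leave the world as it found it — save and restore (or better, never touch) shared state, typically in setUp/tearDown or with fixtures.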
But as we've talked about, OpenStack is a whole lot of projects. This is not even all of it — this is the OpenStack architecture circa Grizzly, when we had seven integrated projects, each represented by one of these big boxes, and each piece in between is some moving part of that project, be it a daemon, a required queue system, or a required database system. And they all cross-talk all the time. While we do care that any particular project by itself is not broken, OpenStack is an integrated thing — we call them integrated projects for a reason — and what we actually care about is that all of OpenStack, running together, works the way you expect it to work. There's a lot of complexity here, and there's also a lot of interesting asynchronous behavior when you have this many daemons largely communicating with one another through an asynchronous queuing system with message passing.

So on every proposed change to any of the integrated projects, we run a series of integration tests where we actually stand up a cloud and ensure that it all works. I'm going to go through the flow of how that works, and then figure out what the artifacts look like afterwards.

Zuul is a project that was created within the OpenStack community; it's our gatekeeper. It's a way to manage getting multiple git trees, with multiple upstream references, all built into a system so we can stand it up and do interesting things with it. Everything we do in OpenStack, we do in OpenStack clouds — we're running on either Rackspace or HP public cloud. They have very generously provided us with lots of cloud credits, and they don't really ask how many clouds we build — or they do ask, but they don't beat us up for it, because we do a lot.

On every proposed change, we hit a cloud, we bring up a guest, and we put DevStack in it. DevStack is an opinionated development install tool for OpenStack that pulls down all of the projects from git and starts all the services as a single-node cloud. It's a somewhat synthetic environment — in a real production environment you're not going to run all the services in one place — but for simplicity reasons, this is the way we run the system right now.

Then there's a project called Tempest, which is our integration test suite and part of the OpenStack QA program. Tempest is 1400-ish (depending on the day) API tests across the whole slew of OpenStack integrated projects, plus a set of things we call scenario tests, which build up a complicated state in the cloud — a couple of guests or volumes, a particular workflow — and ensure that it works end to end. Tempest only touches the OpenStack API; it treats OpenStack as a black box. That's important, because the behavior of OpenStack should be defined by its API surface and not by underlying implementation details.

Over the course of its run, Tempest starts bringing up guests, tearing down guests, bringing up other resources. In our maximum configuration we bring up over 120 guests over the course of a DevStack/Tempest run. At the end of the day it spits out a whole bunch of output about what happened: how did this work, did all these tests pass?
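To give a flavor of what "treating OpenStack as a black box" means in practice, here is a heavily simplified, hypothetical sketch of a Tempest-style test. The FakeComputeAPI class exists only so the snippet runs on its own; in real Tempest the client talks to the live REST API. The point is the shape of the test: create through the API, assert only on API-visible state, clean up through the API.

```python
import unittest


class FakeComputeAPI(object):
    """Stand-in for an OpenStack compute API client (illustration only)."""

    def __init__(self):
        self._servers = {}

    def create_server(self, name, flavor, image):
        server = {"id": str(len(self._servers) + 1),
                  "name": name, "status": "ACTIVE"}
        self._servers[server["id"]] = server
        return server

    def get_server(self, server_id):
        return self._servers[server_id]

    def delete_server(self, server_id):
        del self._servers[server_id]


class ServerLifecycleTest(unittest.TestCase):
    """Tempest-style: drive the cloud purely through its (fake) API."""

    def setUp(self):
        self.compute = FakeComputeAPI()

    def test_boot_and_delete_server(self):
        server = self.compute.create_server(name="smoke-test-vm",
                                            flavor="m1.tiny",
                                            image="cirros")
        # Only API-visible behavior is asserted; no peeking at the
        # database, the hypervisor, or any other implementation detail.
        self.assertEqual(
            "ACTIVE", self.compute.get_server(server["id"])["status"])
        self.compute.delete_server(server["id"])
        self.assertRaises(KeyError, self.compute.get_server, server["id"])


if __name__ == "__main__":
    unittest.main()
```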
[Audience question: how long does a run take?] About 30 minutes. Yeah, we have done some substantial optimizations; the guests we bring up are very, very lightweight. [Audience question about second-level guests] I'm not sure I got the question — yes, we're starting second-level guests, right: we are running as a guest in a cloud, so all the guests we create are QEMU guests, second-level guests. Yes.

So this is basically what I just talked about: the artifacts. All those links that are shown in a Gerrit comment when Jenkins reports are links to somewhere in logs.openstack.org, where we have the logs for every single run we do. We generate somewhere around a terabyte of compressed data per six months of OpenStack, and we keep the last six months of runs for historical and trending purposes — basically, we keep every test run from the last release.

What this ends up looking like, if you have to start debugging a failure: at the top level you have the console output, you have the test results in a slightly prettier HTML format, and you have this logs directory, which contains logs for every single service that was running in the environment. We run all of the OpenStack services at debug level when we run these tests, so that we have the highest level of tracing we can for understanding an issue. Our intent here is first-failure data capture: you should be able to debug a Tempest failure in the gate just from this set of artifacts, and if you can't, well, then the artifacts are wrong. At the end of the day this is very similar to being an operator when something goes crazy in your cloud: if you can't figure out what it is from the logs we provide, that's no good. So we impose the same restriction back on the development team — we have the logs, you'd better be able to figure it out from them, and if you can't, we have to fix how the logging happens for that project so that you can.

There is a general pattern for figuring out what's going on when a run fails. Reading through the console, the first question is: did you actually ever get as far as the Tempest tests themselves? Did you manage to break a project with your patch in such a way that a service didn't start — the basic setup? The bulk of this console file is actually the DevStack installation process, which runs at bash tracing level, so it's a huge amount of output — again, for first-failure reasons. If it didn't fail there, then it failed during a Tempest test, which is actually where it's most likely to fail; it takes a lot of work to break OpenStack at a fundamental enough level to kill the installation. When that happens, you work with an outside-in model. First go look at the API service where things failed. If that class of service also has a scheduler, the scheduler may very well be to blame — maybe you blew a resource quota, maybe you blew something else in your test. Then dig deeper: in the Nova case, get down into nova-compute itself — did something happen, did you end up breaking the way libvirt functions? — and dive in there.
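When I'm doing that outside-in pass on a downloaded logs directory, a crude first step is just to pull the ERROR and TRACE lines out of each service log in roughly that order — API first, then scheduler, then compute. This is a minimal sketch of that; the file names are the ones I'd expect from a Nova gate run, but check what your job's logs directory actually contains (in the gate the files are usually gzipped as well).

```python
import os

# Outside-in order: API, then scheduler, then the compute node itself.
# Adjust the names to whatever your job's logs/ directory really holds.
SERVICES = ["screen-n-api.txt", "screen-n-sch.txt", "screen-n-cpu.txt"]


def scan_logs(logs_dir):
    """Print ERROR/TRACE lines per service, in outside-in order."""
    for name in SERVICES:
        path = os.path.join(logs_dir, name)
        if not os.path.exists(path):
            continue
        print("==== %s ====" % name)
        with open(path) as f:
            for line in f:
                if " ERROR " in line or " TRACE " in line:
                    print(line.rstrip())


if __name__ == "__main__":
    scan_logs("logs")
```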
Reading these Tempest console logs is not always the most fun thing in the world, so we're going to talk about it a little bit. This is what a few lines of Tempest output look like — this is a failure from a week ago. You'll see the full test name in here, and then somewhere in here it says it's OK, or it failed.

So how do you get from this failure to the interesting part of the logs? For the Tempest runs, we run all of the DevStack configurations in the gate — except for Neutron — with parallel testing, which means we are hitting OpenStack simultaneously with four tests at once, all the way through. This is a relatively new change for Havana, where we actually run everything in parallel. In order to run in parallel, we run Tempest in what's called tenant isolation mode: for every test class, we build a brand new tenant in the cloud using the administrative API, and all the tests in that class run as that tenant. This is the only way you can run safely in parallel, because if you let your tests run all over each other, the resources within one tenant could be stomped on by tests in another thread. This way we ensure that when we're running four different things, they're actually running as four different tenants, each with their own resources — and if one tenant could go affect another tenant's resources, that would be a huge security problem in OpenStack, so we rely on that fact as part of our prevention of race conditions in the tests.

Because we create a new tenant for every test class, and most OpenStack services include the tenant ID of the incoming request at their logger level, you can actually match the test class name — the one in funny camel case in the failure string — into the logs themselves. So if this is the test that failed, you can start looking through the Nova logs — and you're looking at Nova because this was a compute API test, so it probably broke somewhere in Nova — and here are the relevant log lines. Realize that these other requests in here are probably a different test entirely, because we're running four things at once: you're not going to see only your test in the logs; this is a real environment running multiple things at the same time. (There's a small log-filtering sketch at the end of this section.)

Debugging a Tempest failure is often genuinely hard. We do try to capture enough data for first-failure capture, so you can debug everything from the logs, but realize there are definitely places in OpenStack where you don't yet have enough data to do that. So if you're running into problems figuring out what happened, and it's not deducible from the code you pushed, step one is local replication. The project that sets up this whole test environment is called devstack-gate, and within its README there is documentation on how to replicate exactly that environment in a local VM, so you can run this yourself. You can also try running smaller subsets of the tests to figure out whether it's all the tests or just a small set — again, to isolate what went wrong.
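The sketch below is the log-filtering trick mentioned above, reduced to a few lines: find the tenant ID that your failing test class was issued, then keep only the log lines carrying that tenant ID so the other three parallel workers' traffic drops out. It assumes the tenant ID shows up literally in the request context of each relevant log line, which is generally true at debug level, but treat the matching as approximate.

```python
import sys


def filter_by_tenant(log_path, tenant_id):
    """Print only the log lines that belong to one tenant's requests.

    In a parallel gate run, four tenants are active at once; filtering
    on the tenant ID of the failing test class strips out the noise
    from the other three workers.
    """
    with open(log_path) as f:
        for line in f:
            if tenant_id in line:
                sys.stdout.write(line)


if __name__ == "__main__":
    # Usage: python filter_by_tenant.py screen-n-api.txt <tenant-id>
    filter_by_tenant(sys.argv[1], sys.argv[2])
```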
Moving forward, we're actually trying to make this whole problem better. One of the biggest problems is that logging between projects is not entirely consistent today, which is something we need to do better at. A few of us have kicked off log normalization work over the course of Icehouse to try to at least build some common standards and patterns, so this gets a little cleaner by the end of Icehouse. There will also be a design summit session later in the week about other ways we could increase the debuggability of this scenario.

The last class of things I want to talk about — because it's probably the one least understood by people — is Grenade. On every patch we run this tool called Grenade, which is for upgrade testing, and I'm going to walk through the workflow of what it does, because realistically it's reasonably complicated; last night over dinner with some of the other core folks it took us many tens of minutes to explain the scenario behind a particular failure condition that was happening within it.

Grenade is a project that uses DevStack extensively. Here's what it does — let's pretend this is the week before the Havana release. When you run Grenade, it first uses the stable Grizzly version of DevStack to stand up a stable Grizzly version of OpenStack. So first we start the last stable release of OpenStack. Great. Then we run the basic exercises that DevStack itself includes, just to make sure things look like they started correctly — it's not extensive testing, it's just "OK, we look vaguely sane." Then we have this custom thing called javelin, which sets up a small number of resources in the environment that we expect to survive the upgrade: we start a VM, we start some volumes, we set up some specific networks and security groups. We're changing the state of the environment, and that state should be in exactly the same shape once we've completed the upgrade; if it's not, that's no good.

We then shut all of this down — we shut down the control plane. All the resources are still running, but all the OpenStack services are stopped. This is not an online upgrade model, it's an offline upgrade model. Everything's down, and we make sure everything is down; if we failed to shut something down, then clearly we have a different class of bug and we have to deal with that too. Then we start a second DevStack cloud — and again, a week before the Havana release, this would be on the master branch — but differently than the first time through: we don't re-initialize the environment. Your database is the database you had. We don't overwrite your config files — the config files are the Grizzly config files. We have a deprecation rule in OpenStack that says you can't just pull things out of the configs; you have to deprecate them over one full release. If you want an option to go away, you have to flag it deprecated, give it a release, and then you can remove it. So a Grizzly config should work in a Havana environment. It will give you lots of warnings saying this stuff is going away and you have to go fix it, but it should not break you.
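As a concrete illustration of that deprecation rule, this is roughly how a renamed option is typically handled with oslo.config: the new name is the real option, and the old-release name is registered as a deprecated alias, so old config files keep working while logging a warning. The option names here are invented for the example; check the oslo.config documentation for the exact keyword arguments in the version you're using.

```python
from oslo.config import cfg

# Hypothetical rename: "scheduler_host_filter" (old name) becomes
# "scheduler_default_filters" (new name).  The deprecated_name alias
# means a Grizzly-era config file that still sets the old name keeps
# working in the new release, with a deprecation warning in the logs.
opts = [
    cfg.ListOpt('scheduler_default_filters',
                default=['ComputeFilter'],
                deprecated_name='scheduler_host_filter',
                help='Filters applied when picking a host for a guest.'),
]

CONF = cfg.CONF
CONF.register_opts(opts)
```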
Now, we do realize there are edge conditions where not absolutely everything in a Grizzly environment will roll forward to a Havana environment, so we do have a mechanism for exceptions. In a lot of ways, the small number of exceptions we land in Grenade are code-as-documentation of the incompatible things that would require a manual change to upgrade OpenStack. Our intent is to make that zero; it hasn't been zero yet. But for you to land an incompatible change in OpenStack at this point, you would get failed here, and you would have to convince us to let the incompatible upgrade bit into Grenade before you could land it in a project. We are pretty hard on people about that.

Once we're back up, we make sure everything javelin created is still running — realistically, the first time we did this, it was not. Your VMs need to survive when you're upgrading your cloud, as do all your other resources. And then we run Tempest — the same Tempest we run in the normal run. Not all the tests, but what we consider a good representative set to exercise what's going on, and that's mostly for time reasons: we've found that when test runs become very long, other issues show up in the feedback loop to developers. So there are time budgets here, but we get pretty good coverage.

Debugging Grenade is a lot like debugging a Tempest/DevStack run, except we have two trees: an old one and a new one. At the top-level directory there's a bunch of files related more specifically to Grenade, but then there are also all those service logs for all the OpenStack services — an "old" directory for what happened on the old side and a "new" directory for the new side — so you can see the difference and figure out what was appropriate in there. Realistically, the biggest issue this prevents from landing is incompatible changes to configs. We had the principle for a long time that you had to deprecate config options out, and people didn't always do it. For the set of config options that actually matter for running an environment like this, we now have some enforcement to ensure that they do, which is a good thing.

[Audience question: how long does the Grenade job take?] Yes — 24 minutes-ish, because we don't run all of Tempest. It's about 10 to 15 minutes for the Grenade part and 10 to 15 minutes for the Tempest part; I don't remember which is which. [Audience question about which projects Grenade covers] So, Grenade itself right now is not running everything — not all the integrated projects are part of the Grenade testing today, which is a limitation. In Grizzly there were seven integrated projects; Neutron is not in here, and we don't test Horizon. The Neutron issue we are definitely fixing in Icehouse.
We will also get Ceilometer and Heat in here, but it's lagging a little bit. Not as of yet — we've been talking about that within the TC about what we would add there, but let's take that conversation offline afterwards, because there are a couple more things and then maybe we'll hit questions quickly at the end, it being the end of the day.

The last thing I want to talk about is the fact that this is a really complicated environment, which means you might push some code — it might be fixing a typo in a comment — and your job fails, which means you can guarantee it was not your code that failed the job. Complex, massively asynchronous environments have race conditions; that's the nature of the beast, and with as many moving parts as OpenStack has, there is a class of those in there. So if you push a code change, you've gone through all the mechanisms, and you say "no, no, no, I'm sure this is not my fault," we have an escape valve for that. We let you put a special comment in the Gerrit review: "recheck bug" plus the bug number. It's important that you either found an existing bug that represents this race condition or filed a new one yourself, because that data is important to us. We try to keep track of those things on status.openstack.org: we have a rechecks tab, and based on the content people put in those comments, these are the classes of race conditions people are seeing in the gate, and we try to categorize them that way.

A bunch of this comes down to the law of large numbers. This is one week of OpenStack CI: we build 25,000 clouds a week as part of our normal process, and we boot half a million guests as part of our normal process over one week — and this was not a busy week; this was the week after release, during planning for the OpenStack Summit. So consider that this volume probably doubles at some of the critical points of a release. When this happens, you get some interesting things going on: scenarios where maybe a particular race condition shows up in one out of two thousand runs, but if we're running at our throughput of three thousand runs a day, that means we're seeing that thing every 15 or 16 hours — we get an event.
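The arithmetic behind that is worth making explicit, because it's what turns a "rare" race into a daily event at gate volume; the numbers are the ones quoted above.

```python
# A race that trips one run in 2,000 sounds rare, but at gate volume
# it becomes a routine event.
failure_rate = 1 / 2000.0      # chance a single run hits the race
runs_per_day = 3000            # rough gate throughput per day

events_per_day = runs_per_day * failure_rate
hours_between_events = 24 / events_per_day

print("~%.1f hits per day, one every ~%.0f hours"
      % (events_per_day, hours_between_events))
# -> ~1.5 hits per day, one every ~16 hours
```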
And that's when this starts to become something where we can accumulate a lot of data about particular race conditions: how often they show up and which are the worst ones we're seeing.

So during the Havana freeze cycle, we actually started building a cool little new tool. Six months ago we started using Logstash — we pipe all of our test data through Logstash, which is a tool based on Elasticsearch that lets you search your logs in a natural, search-engine kind of way. What we then built is a small thing called elastic-recheck, which has a hand-curated set of queries that people have figured out: this Logstash query is that bug; that bug can be uniquely identified by this Logstash query. So when anything fails, we immediately run through all our queries, figure out whether we know what the bug is, and categorize it. That lets us build dashboard elements, which we used very heavily during the Havana release to figure out what is killing us right now, which race conditions are actually our most problematic issues, and to focus the development teams on them. I honestly don't think we would have gotten through the Havana release successfully without this: we saw a bunch of issues creep in kind of late, and this gave us incredible focus to get them down. This is what it looks like when someone fixes a race condition: it just flatlines, we stop seeing it — and because we have enough volume, we can really see these events all the time. This is one of our cool new pieces going forward, and it's going to become even more important to the overall OpenStack process.

So with that, I'll give you a couple of links: some things about how the Gerrit workflow works and the Gerrit/Jenkins environment. I tend to write about some of these issues around OpenStack, the QA process, and the things we're doing on my blog, dague.net, and you can follow me on Twitter. With that, I think we have two minutes left, so if there's a final question, I'll take it. Sure.

[Audience question, inaudible] That's a long, complicated question — why don't we take it as an offline discussion afterwards? I'm not sure I can give you a concise answer. Okay.

[Audience question, inaudible] Yes. Yes. So that's true: we build patterns over time where we realize that certain behaviors within OpenStack projects — like everything talking to the same database at the same time, all throughout the cluster — are a bad idea, right? Yes, right. So let me get off the mic and we can chat more; it's a little harder to hear up here. Any other final questions?

[Audience question about using KVM in the gate] So, no cloud provider today offers nested KVM. My understanding is there are general concerns among many of the cloud providers about whether nested KVM functions correctly under load. And the reality is that our guest boot time, for the image we're using, is something like seven seconds. The only real difference between QEMU and KVM for us is that the I/O instruction path is much slower, but we don't actually generate a lot of I/O when we do these boots, so I'm not convinced it would gain us much. Yeah, I think there's not as much difference as you'd expect for the environment we have, given what we've optimized for.

Okay, thank you, folks.