Okay. Well, it's 4:40, so I am going to start off with introductions. My name is Matt Riedemann. And I'm Matt Treinish, and we're going to talk about the gate today. At a very high level, we're going to talk about what the gate is, because it's a very loose term, then what can go wrong, debugging it, lessons we've learned, and then hopefully questions.

So, a little bit about ourselves. I'm the QA PTL; I've been in that position since June, and I've continued to get elected unopposed. I don't know if that's the right way to describe it. As part of that, I'm a core reviewer on a bunch of projects, including all of the QA projects, which I guess makes sense, and I work for HP on their OpenStack upstream team, working to make OpenStack a better project for everyone.

And I am a core reviewer on Nova and elastic-recheck, and also stable branch core. I work for IBM on the IBM Cloud Manager with OpenStack team. It's been about two and a half years now that I've been involved. A little background: I initially got involved with packaging, and as part of the packaging we had a rudimentary CI system which was running Tempest. When I had issues with Tempest, I got talking to people in the QA channel, and that's how I got sort of roped into working on some of this.

So the first thing is: what is the gate? Because it's different things to different people, depending on the context you're talking about. Basically, it's a pre-merge CI system. But you can be talking about the infrastructure, the tests, the jobs, the configs; it's lots of different things, and there are different test jobs for different projects. It can also be thought of as a reference config. For the purposes of this talk, the gate is basically the things that are hosted on the community infrastructure.

The reason we point this out is that third-party CI is not hosted on community infrastructure. Most people probably know what third-party CI is; that's mostly vendor drivers that are run by vendor companies. These are usually closed-source backends, so the community is not going to be running closed-source, proprietary vendor testing, and we don't gate on those, meaning we don't restrict approval of a change to get merged on third-party CI results. We gate on unit test jobs, but the majority of testing happens with integrated testing using DevStack and Tempest.

Another sort of confusing thing when you're talking about the gate is that there are multiple queues. For the majority of what people care about, it's the check and gate queues. There's also the experimental queue, which is where you put jobs for experimental configurations: Ceph with shared storage was on the experimental queue for a long time, Cells was in the experimental queue for a long time, and there's a Fedora 21 job. There are lots of these different jobs that aren't voting on every change; you actually have to specifically ask to run those experimental jobs. And then there are periodic jobs that run overnight and post results to the mailing list.

So this is a very simple picture of what happens when you submit a code change. All these different test jobs run: there's pep8 for static analysis, unit tests, and then different combinations of DevStack and Tempest configurations, all running in parallel; these aren't sequential jobs.
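All of those project-to-job mappings live in one big YAML layout file in the infra configuration, which we'll come back to later. As a rough sketch of the shape of that file, with illustrative job names rather than the exact current config, a project's entry looks something like this:

    projects:
      - name: openstack/nova
        check:
          - gate-nova-pep8
          - gate-nova-python27
          - gate-tempest-dsvm-full
          - gate-tempest-dsvm-neutron
        gate:
          - gate-tempest-dsvm-full
        experimental:
          - check-tempest-dsvm-cells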
And each DevStack/Tempest job is kicking out about a hundred and thirty instances per run. This, to give just sort of an overview of the development Gerrit workflow, shows that there's a cycle here. In your local environment, if I'm working on Nova, I'm cloning down Nova, working on a bug, making a change. I push it up to Gerrit, where we're going to run all of these tests on the change, and it goes through review. Somebody's going to minus-one it, you make a change, push it back up as another patch set, and it all gets retested again. Eventually, when it's approved with a Workflow +1, and as many +2s as you need, it goes into the gate queue. Assuming it gets through the gate queue, it passes all the tests there and it merges; then it's cloned, or mirrored, out to GitHub, you start the cycle over again, and everybody's rebasing on your change once it's been merged.

So when we talk about the gate, it's always fun to talk about numbers, and the gate at scale has some very interesting implications when it comes to trying to figure out why things go wrong, so I thought I'd describe a little bit about how big the gate is. In the past six months we've run over 80 million Tempest tests in total, just for the gate queue. That's after the change has been approved by two core reviewers and is ready to be merged: we ran over 80 million tests for that. As part of the check queue, each proposed commit spins off between four and twenty DevStack environments to run tests on at the same time, in parallel; it depends on the configuration for the individual project the change is pushed to. And each full Tempest run starts about a hundred and thirty second-level guests in the DevStack cloud that spins up, which is not a small amount of work for a single VM in a public cloud provider.

Then, when you look at that whole data set for the gate queue and you look at how many things fail: an individual test run in the gate has about a 0.77% chance of failing, and if you look at each test individually, as a single unit, one individual test has a 0.015% chance of failing. That's not an entirely fair way to measure, because some tests are far more likely to fail than others, but in aggregate that's a way to describe it. The tests run concurrently, right: the test environments run with four workers concurrently hitting that DevStack cloud, four API calls at the same time, so you get a lot of interesting behavior. And this is about 1,300 tests in a full run; the number of tests varies between about 1,200 and 1,600 depending on configuration and some other factors.
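As a back-of-the-envelope on what those percentages mean at this volume, treating runs as independent (which, as we said, isn't entirely fair):

    P(failure seen at least once in N runs) = 1 - (1 - p)^N

With p = 0.0077 for a full run, a single run fails 0.77% of the time, so you will almost never see a given race on your own patch. But across N = 1,000 runs, 1 - (1 - 0.0077)^1000 comes out to about 0.9996, so at gate volume that same race is a near-certainty.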
So when we look at these gate runs, what could go wrong? That's an interesting question. We've got a lot of things that can go wrong, to be honest: we've got dozens of jobs with different configurations, and because of the pre-merge nature of the CI, things pass most of the time. When you have race failures at rates of less than a percent, you're very unlikely to see them with one job run. So oftentimes we don't catch these issues while they're going through review, or even in the test results for that change itself, and when that happens we don't see it until later in the process. These race conditions bubble up when we run at that big scale, where we're running all of those tests all the time. It's a question of seeing something that happens less than 1% of the time: you're not going to see that on one run; you're going to see it out of a thousand.

And then we also catch issues in dependencies. OpenStack uses a lot of external projects and a lot of external libraries to do its work, and we often catch issues in those, and that's not going to be caught by our code review of our own code. I mean, we'll catch it with our testing, but we won't see it in review.

We found that the failures we hit in the gate break down into about five categories. The first is upstream service breaks. The way the gate is configured, we rely on pulling packages from PyPI and running on public clouds, and oftentimes there are issues with those upstream services we're using, and we catch that almost immediately. Sean has a nice statement: he likes to call the OpenStack CI system the Nagios of the internet, and it's very true. We catch a GitHub outage or a PyPI outage immediately, because we're running tests all the time, and when that happens everyone runs around like chickens with their heads cut off. What happened?
Why did my change fail? Why didn't my change pass? Second, we have OpenStack infrastructure failures: the machinery behind all this is very complicated, and sometimes something goes wrong, like any other operational service. When that happens, it causes failures and impedes developer throughput, because changes can't land. Then we find a lot of bugs in OpenStack itself, where we have race conditions or state corruptions or database deadlocks, and that's what gate debugging is really about: fixing those issues. We also have issues in the testing itself, of course, and we try to fix those as they come up; I like to say there are fewer of those than in OpenStack, but who knows. And then we have bugs in dependencies, which we hit a lot, because we have a pretty robust upstream testing environment, but not all open-source projects have the same awesome CI setup. We catch a lot of bugs, and we have to get them fixed in the upstream projects, either by reporting the bug, working around it, or fixing it ourselves.

So when we're talking about the gate, we have lots of different configurations that we run in, and knowing these configurations can actually be important for debugging the failures as they come up. We run tests with different database backends, mostly MySQL and PostgreSQL. We use different storage drivers, like Ceph and LVM, in different configurations. We differ between nova-network and Neutron, and a lot of other things like that. And then there are other jobs which do other things, like upgrade testing, testing operations at a large scale using a fake libvirt driver, and multi-node environments. Knowing the environment your test runs in is kind of important if you're debugging a failure in it.

Here's a tree diagram that explains how some of these jobs are configured, and it is a nice little web, because we have lots of different configurations, and this is just a small subset, the common ones. We actually have a lot more job configurations; there's a file in the infra repository that defines the whole job configuration map, which is very, very long. It's a big YAML file, of the shape sketched earlier. But the key thing is that the job a test is running in actually has an important impact on how you debug these failures. Specifically the MySQL/Postgres one, which is why it's got a bigger box: there are additional differences between those configurations when you run those jobs. The MySQL job runs with Keystone under Apache and does not use a metadata service, while the Postgres job runs with the metadata service and Keystone under eventlet, and that actually impacts when things fail.

So, before we deep-dive into a debugging example and how you actually look at this, I thought it would be good to call out some things we've hit in the upstream CI system that have had an impact on real people running OpenStack clouds, to highlight the importance of doing this for everyone, not just developers. The first one was the number of CPU workers that we ran with in different projects. This bug was specifically about the gate, but it turned out to kind of compound across all the projects. We were running all of the API services in the gate with CPU workers equal to the number of CPUs, and we run in a single-node, all-in-one environment. That meant the, I think it's six or eight, API services were each running that many workers, and it ended up eating all the memory on the DevStack nodes, and things were crashing randomly.
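The arithmetic on that is pretty stark. These are assumed round numbers for illustration, not measurements from the gate: with 8 API services each forking one worker per CPU on an 8-vCPU node, that's 8 x 8 = 64 API worker processes, and at something like 100 to 150 MB of resident memory per Python worker, that's roughly 6 to 10 GB for API services alone, on a node that also has to run the databases, the test runner, and about 130 nested guests.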
It turns out, when we were looking at the projects, a lot of them didn't really have sane defaults; DevStack was the one setting the CPU count. What that ended up making us do was go into all of the projects and set a sane default for the worker counts, to try to help the deployer story.

Another one: LVM operations were, well, they weren't timing out, they were taking longer to complete than the RPC timeout in Nova when it was making a call to Cinder. That was because Cinder was looking at every single logical volume on the machine, instead of just the ones Cinder created, and that was causing Nova to fail when it was doing volume operations. That's something that could easily hit anyone who's running a deployment with an LVM backend.

And a fun one, which took us a long time to figure out, or I should say took Salvatore a long time to figure out, was a kernel panic when you ran with NBD and network namespaces in Neutron. It would just randomly hang the box, and there was no more logging and no indication as to why things failed. That was a simple fix: we just stopped using NBD, because it didn't actually make much sense in our environment, or arguably a lot of environments.

The last one on this little sample list is a current bug, actually, that we're hitting right now in the gate and tracking: when you resize or restart a running compute instance in a cloud, Neutron breaks connectivity, and you can no longer talk to the instance after the resize or restart. We're currently tracking that failure and the bug is still open. That's something that could hit anyone who's running Neutron and tries to do a resize or restart. It's not a common failure, just occasional.

So now Matt's going to go through an example of how we debug a failure.

Yeah. At the beginning, I actually wanted to ask, by show of hands, who's ever proposed a change and gone through this in the community? Okay, that's good, because I've given this talk where, like, two hands went up, and you can't really appreciate a lot of this unless you're working on it. So I wanted to go through an example. This actually happened, like, two hours before I first gave this presentation: there was an upstream library that released, and it broke everything. All of a sudden there's somebody in an IRC channel saying: Jenkins failed on my change, I have no idea why, I'm pretty sure it's not related to my thing, because I'm not changing Swift, and all of a sudden Swift doesn't start. Can you help me out?

So there's usually a set of steps, at least the ones I go through, in debugging a thing like this. In Gerrit, you're going to see something like this. And, I guess I didn't point out earlier about non-voting jobs: there are jobs that run that aren't voting, meaning Jenkins won't minus-one the change and prevent you from approving it. But this is a voting job, in this case the devstack full job for Tempest.
This one's actually running nova-network, but anyway. So, it's a failure. I click the link and I go into the console log. I start in the console log, where all of the tests are dumped out, and 99% of them are saying okay, the test passed. I'm searching for "FAILED". I see this delete-server test failed, and then I go down further and there's eventually a stack trace. I don't have the entire stack trace here, but there is a Tempest stack trace that says BuildErrorException: the server failed to build and is in an error status, and it's dumping out the message that's coming back from the Nova API, which says "No valid host". Which is a super helpful, useful error message.

But the useful bits here, the ones I highlighted: if you look at the Tempest tree, the API tests are broken down by service. There's a compute service, a network service, image; these correspond to Nova, Neutron, Glance, and there are identity-service tests for Keystone. So I know that something in Nova is busted, which is kind of obvious with this example. There are also scenario tests running; they'll bring up a VM, attach a volume, resize it, try to attach an interface, something like that, trying to hit all these different services in a run, and a scenario test will point out that it's hitting Glance, Nova, and Neutron, or something like that. The other important piece of information for me is the instance UUID, because when I need to start digging into the Nova logs, I need the instance UUID.

From here I go into the Nova logs. Because I work on Nova, I know that "No valid host" is generally going to come with some terrible stack trace in the nova-compute log, so I would start there. A lot of times you go into the nova-api log to see where the request came in and where it went, and whether there's an error in the scheduler; usually there's not a thing in the scheduler, but the compute log is where the interesting stuff usually happens. So in this case, I'm looking for the instance UUID, I find a big ugly stack trace like I expected to find, and at the bottom it's showing us that there was this libvirt error, and it said it received a hangup/error event on the socket. That sounds, yeah, not good. But this is a thing that people could actually be hitting in their real clouds. So from here, I've got an error; I've got a fingerprint for an error.
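As an aside, if you'd rather do that digging locally than in a browser, the idea is just: fetch the log, grep for the UUID. A minimal sketch, where the URL and UUID arguments are placeholders for a real job log and the instance from your failure:

    #!/usr/bin/env python3
    """Scan a plain-text gate log for lines mentioning a given string,
    e.g. an instance UUID pulled out of a Tempest failure."""
    import sys
    import urllib.request


    def grep_log(url, needle):
        # Gate job logs are served as plain text, one event per line,
        # so a simple substring match per line is enough here.
        with urllib.request.urlopen(url) as resp:
            for raw in resp:
                line = raw.decode("utf-8", errors="replace")
                if needle in line:
                    print(line.rstrip())


    if __name__ == "__main__":
        grep_log(sys.argv[1], sys.argv[2])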
The next step: I would go out to logstash.openstack.org, which has a Kibana dashboard, and plug it in. Probably nobody can see this, but basically I take that error message from the compute log, put it in there, and run it over the last seven to ten days; the logs are only stored for up to ten days. I run it through there, and I want to see how many hits it's getting, because a lot of the time somebody has a failure and they don't think it's their problem, they don't think it's their change, but it actually is their change, and they'll argue with you in IRC. Then you go off to logstash and say: well, yeah, it's exploding terribly, but there is actually a view that shows which changes it's hitting, and it's, like, that one guy's change. So you can say: it's just you; please get better at debugging your own failures.

The other, nicer thing to do: if the hits are all in the check queue, that's a little suspect, and then you have to start looking at what the changes are that are failing. If it's the gate queue, though, it means it's hitting on something that's actually merged, most of the time. If it's unit-test jobs, those fail in the gate for different reasons sometimes, but if it's a Tempest job or a Grenade job, usually it's something that's been merged that's racing and failing.

After this, you check Launchpad to see if anybody's reported the bug. In this case, it's a reported bug. From the bug, we have this thing called elastic-recheck. You take that fingerprint you built with logstash and write a query against it; we store them as YAML files in this project called elastic-recheck, with the fingerprint for the bug. Once that query merges, every time Jenkins fails on a change, elastic-recheck will scan the logs, check them against the fingerprints we know about, and then come back and comment on your Gerrit change: Jenkins failed, and we think it's this bug, because the fingerprint showed up in the logs.
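For reference, those fingerprint queries are tiny. A sketch of what one of those YAML files looks like, with a made-up bug number and message rather than the real query for this failure:

    # queries/1234567.yaml -- the file is named after the Launchpad bug
    query: >-
      message:"error event on socket" AND
      tags:"screen-n-cpu.txt"

The query string is the same Lucene-style syntax you'd type into the Kibana dashboard, so you can develop it against logstash first and then commit it.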
And we've got this dashboard on status.openstack.org for elastic-recheck that shows, for the check queue and the gate queue, all of the different race failures and their hits. When we were looking at this the other night, we were thinking: did we actually fix this for, like, a day? It turns out that was when we did the Gerrit upgrade. We can also see open reviews, when people are actually pushing changes against these bugs to try to fix them. The bugs are ordered by number of hits, so if you're out there looking for reviews and you see somebody actually trying to fix one of these things, maybe we should give priority to that review. And the link will take you off to Gerrit.

So, I already kind of went into this, but this is an example of the comment you would see. I actually like this example because it's showing two different bugs across three different test job failures, and in one of the tests there's no recognized fingerprint, which means we should probably go off and check the logs and see if we can categorize whatever is failing in that job.

And this is a view of a page of all of the things failing in the gate that we don't actually have categorized. You probably can't see it, it's probably hard to see, but there is an overall categorization rate, which in this case is 70.8%. For me, if I come in in the morning and it's less than 70, it usually means there is something pretty bad. If it's, like, 45, usually you can look at any of the logs and there'll be something like some service isn't even starting because of something; usually, in my experience, that's an upstream release. Sean wrote the, what's it called, the "what broke" script? It basically shows everything that released on PyPI in the last set of time, and we can say: oh, Boto released a thing again and broke us. Then we can take that, classify it through elastic-recheck, get these numbers down, and when we can get the percentages up, we're usually in good shape.

Yeah, this is a good thing that I know both of us have in our daily routine: we look at this page and look at the uncategorized failures, because getting that categorization rate up helps us prioritize the bugs we're actually hitting, and lets us identify real failures, real bugs, in incoming patches. With the throughput we have, even low percentages of race failures compound, and they cause real development workflow issues, especially around release time, when everyone's trying to push their patches in and the system is completely strained. So keeping that categorization rate up helps everyone with development.

Yeah, and the other important piece is the ten-day window, which is critical, because we can spot when a failure starts spiking in logstash. The picture I had with all the green bars, that's a thing with older libvirt that we're probably going to have to work around, or something; it's an older version of libvirt on Trusty. But there are times where you start seeing a failure in logstash ramping up, and then within the last 24 to 48 hours there's this giant spike, and that means somebody merged a change. Most of the time it's either a library release or somebody merged a change, and all of a sudden we're back to what we were talking about: the small percentage of failure on your one change, and then once it's merged and it's running on thousands of patches a day, you hit it tons more.
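For the library-release case, the "what broke" idea is simple enough to sketch. This is not the real script, just the gist of it against PyPI's JSON API, with an illustrative package list:

    #!/usr/bin/env python3
    """List which of a set of packages cut a release on PyPI recently."""
    import datetime
    import json
    import urllib.request


    def recent_releases(packages, hours=48):
        cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=hours)
        for pkg in packages:
            url = "https://pypi.org/pypi/%s/json" % pkg
            with urllib.request.urlopen(url) as resp:
                data = json.load(resp)
            # Each release maps a version to its uploaded files; compare
            # the upload timestamps against our cutoff window.
            for version, files in data["releases"].items():
                for f in files:
                    uploaded = datetime.datetime.strptime(
                        f["upload_time"], "%Y-%m-%dT%H:%M:%S")
                    if uploaded > cutoff:
                        print("%s %s uploaded %s" % (pkg, version, uploaded))


    if __name__ == "__main__":
        recent_releases(["boto", "requests", "six"])  # illustrative list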
For the merged-change case, we can start looking at the git history in a specific project and see: oh, the IPv6 tests are failing, and somebody just merged this thing in Neutron for IPv6 in the last 24 hours, so let's see what it is. Oh, it's in the stack trace. Revert it immediately, and then ask questions later, because it's blocking everybody. Maybe we do a postmortem after that, but it's up to whoever merged the thing to sort it out and figure it out.

So, lessons learned from this. The nCPU-workers thing: we need to keep sane defaults, especially considering how terrible it is to configure OpenStack, with its billion options.

Yeah, and that comes back to DevStack being a reference config. For values we set in DevStack, we try to push them back to the projects, to make sure their defaults make sense for everyone, not just the gate environment, right?

So, regarding rechecks. This is where, again, on a single change, if a failure happens more than 1% of the time: there was the comment shown earlier from elastic-recheck that says we hit these bugs, and if you don't think they're your problem, you just put in "recheck" and it'll run everything through the check queue again. There are times where people will just think: well, it's not my problem, and they'll keep rechecking and rechecking and rechecking until it eventually gets through, and then it's merged, and it's in the wild, and it's hitting everybody, like I was saying. And this has come up as an issue on the mailing list and in IRC, and people will say: well, if you're already categorizing these things and you know about them, why don't you just do auto-recheck, so that when I hit a known bug, as a human I don't actually have to type "recheck" in Gerrit, I don't have to care, you just keep running stuff. Generally, I mean, there are lots of reasons why we don't want to do this, but basically you don't want to perpetuate races, for good reasons.

Another thing that has come up before, too, is: well, these flaky tests are kind of bad, we shouldn't be adding more tests, we already test so much stuff, we shouldn't be testing so much, because it's failing so much. Well, it's kind of a bad idea to cut down on the testing. In Kilo, with the multi-node job, we just started getting live-migration testing, and we made the Ceph job with shared storage voting. So now we actually have a voting job with a shared-storage backend, which was a thing we weren't actually testing in Nova outside of the experimental queue for a long time. And with cells, we're working on getting an actual voting cells job, which is important for people that care about cells, which would be lots of people.

Another thing: keeping stable branches stable is really hard, but there are people that actually care about stable releases. The biggest thing with stable branches that I think of, the thing we were talking about earlier today, especially with these random SSH issues, is that in one release they'll be really bad.
We might skip the tests for a while, or do something to try to curb them a bit, because we're at the end of the release and it's blocking everything, and then we'll come back to them and turn them on, and it's not such a problem anymore. What happened? What changed? We should probably backport the fixes to stable. Well, if we knew what the actual changes were that fixed the thing, we could do that, but we don't know, and people don't really keep an eye on it. So stable kind of doesn't benefit from the fixes that people are working on in trunk.

Adequate logging is critical. The example I have for logging is the live-migration thing. We got live migration voting, and I think within the same week we actually merged a change that broke live migration, because at the time we merged it, the multi-node job was non-voting, so it showed up as red, but Jenkins didn't minus-one it. As a reviewer, you don't pay attention to things all the time, or you didn't think of them. So live migration starts failing, and this was, like, Kilo RC; this was after RC1 for Kilo, so between RC1 and RC2 we knew that we broke live migration. And you'd go into the logs, and there was basically nothing: it would say pre-live-migration, and then failure. What's going on? Then we start looking at the code, and there's this horrible pre-live-migration method in the libvirt driver that is passed a dictionary of boolean flags: are you doing shared storage, are you doing block migration, are you doing volume-backed, all this different stuff. We weren't logging pretty much any of that, and there are twenty different conditionals in that path. So: maybe we should log the input parameters to this method, to figure out what happens between the time we come in and the time we explode. A simple little change like that, and then it was obvious, it was immediate. We hit it, and we were like: oh, we were passing None to this method, and if it's going to work, you've probably got to provide the flags to make it work. As an operator, I would imagine this is the same: if you're running the cloud and someone says my live migration failed, and you look at the logs, and it says trying live migration and then live migration failed, that's not entirely helpful for figuring out why.
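The shape of the fix is worth showing. This is not Nova's actual method signature, just an illustration of logging a method's inputs up front so the log tells you what you were asked to do before one of those twenty conditionals blows up:

    import logging

    LOG = logging.getLogger(__name__)


    def pre_live_migration(context, instance, block_device_info,
                           migrate_flags=None):
        # Hypothetical signature. The point is logging the inputs on
        # entry: when the failure happens somewhere below, the log
        # shows which combination (shared storage? block migration?
        # volume-backed?) the method was actually called with.
        LOG.debug("pre_live_migration: instance=%s block_device_info=%s "
                  "migrate_flags=%s", instance, block_device_info,
                  migrate_flags)
        if migrate_flags is None:
            # This is the kind of case that bit us: a caller passing
            # None where a dict of boolean flags was expected.
            raise ValueError("migrate_flags must be provided")
        # ... twenty-odd conditionals follow in the real driver ...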
And so we get a fix where it Naturally to us needs to be And maybe it can't be done in code I mean there there was some stuff with the way like pi pi libraries or gets the Python packages get set up and stuff And you got to do that a certain way You're not going to do it like in Nova because it doesn't have anything to do with Nova But in the case of like the LVM bug there was there was a change to the cinder To configure how LVM of the I'm ran with cinder and then there was a change to Dev stack to use this Config file in cinder. I was like, oh, this is great And it had like doc impact and it said, you know, if you're going to be running without LVM the metadata statement or whatever without LVM you probably want to use this config file and After kilo is released I was looking through the kilo release notes and I saw that it doesn't say anything about like using if you're using LVM That you should be using this config file if you're not running this other service so We get to get changes at least documented because people that are running ansible or chef or puppet You need to know about this stuff so they're not reinventing the wheel and fixing these same bugs over and over again So just a couple of places to get more information about this because this forum, you know There's a lot of learning you have to know how open stack works in operation So having channels to get more information about this and feedback and help with debugging So there's the open stack QA channels where most people who spend a lot of time doing gate debugging Hang out on IRC You can also go to the mailing list, but there's a lot of noise there already. So maybe not sending a general mailing list post about how to debug the gate may not be the the most productive Then there's the elastic recheck page on status.openstack.org. That's where all of those Fun graphs are in the uncategorized page are all there Then as part of the open stack bootstrapping hour, which disappeared sometime around January. I think there was one topic where Matt Dan Jay Sean all talked about how to debug the gate and that's up on YouTube For anyone to watch where they go through a little bit more detail Yeah, it's actually stepping through the logs going to log stash doing a doing a query. It's more of an actual demo Yeah, which is kind of hard to do in a presentation and then the the infra team Maintains a all of their presentations that they present around the world all the time They put them on a page that anyone can view if you want to learn more about how all of the machinery of the gate Works which having that background actually helps a lot when you're trying to debug these failures figuring out how these environments are configured and deployed because you know open stack is you know kind of malleable and how you can deploy it and use it and Knowing that background helps when you're trying to figure out where things went wrong And with that are there any questions? I don't know how much time we have left But if you have a question, there's a microphone right there And just walk up to the mic and ask a question six minutes or Unless this is gonna be a question Matt Matt When you find something that you need to revert in a project like neutron because it's breaking the gate and neither one of you Are core in that project. What is your process to be able to very quickly? Revert something like that. You have a process where you got to go get a hold of neutron cores or do you like super? I yell at Kyle Mestre. It's his fault ATL of neutron. 
Yeah, we hang out on IRC; we know a lot of people. You go to the Neutron channel, you go to the dev channel, you ping cores, and you say: look, this is breaking the gate, it's provable, revert this change. So you just go find cores for that particular project.

Okay.

There's actually, in the newer Gerrits, a revert button, which makes it super easy: I find the thing in the git history, go to Gerrit, hit the revert button, and then show up in the Neutron channel and say, you know, this all blew up, here's the revert, I think this is probably it, and then they go sort it out.

So you have speed in getting the revert patch out, and then they just have to figure out whether it's actually that.

Yeah. Okay. And then there's the mailing-list shame afterwards: the gate's broken, and we think it's this. It's an FYI to everyone, because otherwise people will show up and ask the same question, like, ten times: oh, it failed on my change, why did it fail? Well, because it's failing for everybody.

Yeah, I've noticed, Matt, you're really shy about the mailing-list shame, so if you want to feel more proactive...

It's funny, because Joe Gordon is usually calling dibs; we're calling dibs on who actually gets to call it out: yeah, we broke everybody, so let's let them know. We can see that.

So, for the recording, the question was: if you see a bug on your change, and you look at logstash, do the due diligence, and you see that it's fixed, should you still open a bug for that failure if there isn't one already? And the answer is actually yes, and the reason is the uncategorized rate. We want to make sure that categorization is high enough, so we're not missing issues that are more critical and still open. So if a bug hasn't been filed and there isn't an elastic-recheck query for it already, but you've noticed the failure, going to the effort of filing the bug and submitting a query helps us. That query will be removed after the ten-day window for our logstash data evaporates and we're not seeing it anymore, but it's good to do the record keeping, so everyone can see it.

Hi. Is there a plan to run different Neutron configs, like VLAN, VXLAN, OVS, and Linux bridge, as part of the gate? Right now it's only OVS in the gate. I'm asking because...

There have been a lot of community discussions about what we should be running in the gate, what the defaults should be, and what Neutron in DevStack should be using as a default. The current plan, I think, is to move everything to Linux bridge. As for the Neutron team, if it's an open-source driver, they can push their own jobs to use whatever open-source backend they want, jobs that only run on Neutron changes.
I haven't seen any patches to the configuration to do that. Other drivers and other configurations that rely on proprietary systems go to third-party CI.

Yeah, that's not an issue. I think what I'm saying is: I see the discussion about moving to Linux bridge, but my request would be to keep some configurations, one or two jobs, running VXLAN... I'm sorry, OVS with VLAN, just to make sure it doesn't break, or vice versa.

Yeah. And that's something that the Neutron team will have to decide for their own gating jobs, because the configuration is per project. Something like this: we have Ceph in the gate; Ceph is gating on Nova, Cinder, and Glance changes, because there's a Ceph backend in all of them, but Cinder also has jobs for Sheepdog and GlusterFS that run just on Cinder changes, I think, and not on Nova changes. So the projects do have some control over what they're testing on their own changes.

The unit tests and the integration tests run in parallel. Has there been much discussion about running the unit tests first, and if the unit tests fail, stopping immediately and not running the integration tests?

Oh, staged gating. That is something that has been discussed before. There are reasons why we haven't done it; I don't actually recall them off the top of my head. I think it has something to do with our utilization rates, and it not actually helping throughput as much as people think. I don't actually know the answer off the top of my head, but there have been mailing-list discussions about that, and I'm sure it will come up again.

Yeah, the typical situation you see is: if you submit a change, you have to wait for the unit tests to fail before you submit another change...

I mean, you can see all of the results in progress on the status.openstack.org page. It shows every job that's running on every change, and you can see a unit test fail when it happens, while all the integration tests are still running, at least as a quick workaround.

On that same note, the docs only ask you to run a few tests, the pep8 and the tox py27 and all that, but the gate has a lot more. I mean, there's not much documentation around those, and I always see them failing.

Yeah. So, I thought there was decent documentation on setting up a DevStack environment and running Tempest on it. We might need to put that in a more global location, like in the infra manual for developers or something, because it is something that's doable, as long as you have a dedicated machine for running the tests, because DevStack does kind of take over a node.

We do that already, but, like, the Grenade DSVM tests, those keep failing, and I couldn't find anything.

Grenade documentation, that is a weaker point, because Grenade is a very special subset for doing upgrade testing. That's probably a good to-do, because I do agree the docs on setting up a Grenade environment are a little weak. There have been some recent changes to make it easier, so that's something we'll work on in the community to improve.

Okay, we're out of time. So, yeah.