Yes, I wanted to give you a short talk on something about our Jenkins installation that you probably all know and love: how and why it kills your runaway test jobs, and what information you can then get out of it when it kills your beautiful Gerrit change. Maybe you're even responsible for that, but most of the time it's just these flaky tests that cause the jobs that won't stop and need to be killed. May I ask who of you is comfortable with seeing such a kill wrapper incident, looking at the data, and making any sense of the data that gets presented? Okay. So I hope that after today that works out a bit better, and you're less scared of the monstrous output you see there. I'll split this into two parts: first I'll give a short explanation of what the issue is, and in the second part I'll talk about the information that you see there, and how to make sense of it, maybe.

So, yes, Jenkins. Most of you, all of you, are developers using Gerrit, using our CI, using Jenkins. That's good. And sure, Jenkins is just working wonderfully for us, with all these builds, tons and tons of builds that it does every day. We use it for Gerrit; I think we use it for all the tinderboxes by now; and on all the platforms, of course, that we support: Linux including Android, Mac, Windows. So most of the time it's just fair weather and it works.

But there are a couple of issues with Jenkins, of course. If you submit too many Gerrit changes, then Jenkins gets overloaded. If one of the Windows or one of the Mac builders has an issue and a hiccup and goes offline, then we have too few of those, and then Cloph needs to hit the reset button and everything is fine again. But there's one intrinsic issue with every kind of CI job: we still can't solve the halting problem, so we still don't know whether your Gerrit change will execute and finish in finite time. We have to work around that somehow. And even if we could solve the halting problem, and we knew that this job would finish, say, sometime before the end of the universe, that wouldn't help: we want to have these results quickly. So we need to have some upper limit on how long these jobs run. What we actually implement there is that we check for fresh output on standard out or standard error. If your job is still running and generating more output, fine; if it doesn't do so for a specifiable amount of time, then we declare it dead and to be killed. That works quite reasonably well. As I said, there are occasional flaky runs where some test hangs for whatever bad reason, because there's still some race in the code that just hits one out of a hundred or one out of a thousand builds, and you experience that every once in a while. But of course you could also have written a Gerrit change that introduces something that systematically hangs, and you would then experience that in your Gerrit change's Jenkins build, and would have to look into the output of the CI to see why it hangs, if you can't reproduce it locally: maybe because it only hangs on Linux, say, and you work on Mac, or whatever.
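By the way, to make that no-fresh-output watchdog from a minute ago concrete, here is a minimal sketch in C of the idea, assuming the job's output is funneled through a pipe; the one-hour timeout and all the details are made up for illustration, this is not the actual wrapper's code:

```c
/* Minimal sketch of a no-fresh-output watchdog (illustration only, not
 * the real wrapper): run a command, tee its output through a pipe, and
 * declare it dead if nothing new arrives for TIMEOUT_MS. */
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define TIMEOUT_MS (60 * 60 * 1000)   /* hypothetical limit: 1 hour of silence */

int main(int argc, char **argv)
{
    if (argc < 2) return 2;

    int fds[2];
    if (pipe(fds) == -1) { perror("pipe"); return 1; }

    pid_t child = fork();
    if (child == 0) {
        dup2(fds[1], 1);              /* child: stdout and stderr go into the pipe */
        dup2(fds[1], 2);
        close(fds[0]);
        close(fds[1]);
        execvp(argv[1], argv + 1);    /* e.g. "make check" */
        _exit(127);
    }
    close(fds[1]);

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    char buf[4096];
    for (;;) {
        int n = poll(&pfd, 1, TIMEOUT_MS);
        if (n == 0) {                 /* no fresh output: declare the job dead */
            fprintf(stderr, "no output for too long, killing job\n");
            kill(child, SIGKILL);     /* the real wrapper kills a whole
                                         process group here, as shown later */
            break;
        }
        if (n < 0) break;             /* poll error; give up in this sketch */
        ssize_t r = read(fds[0], buf, sizeof buf);
        if (r <= 0) break;            /* job finished and closed the pipe */
        write(1, buf, (size_t)r);     /* pass the output through */
    }
    return 0;
}
```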
So if you look into a Jenkins job, it's effectively the make call that you also do for yourself in your local build environment. What I want to get at is that there's a huge number of processes around while such a job is going on. There's the first make, which recursively invokes make again, for some reason. Then you have the recipes, which are all executed more or less in parallel, so there are lots of these recipes being executed, and each of them has its shell script lines, so you spawn a shell process. Then there are parenthesized parts in those, for example, so you spawn another shell. And you have a JUnit test, say, that spawns a Java process, and that Java process then goes on, in these remote tests, to fork a full office and tries to tell the office via UNO commands what to do. So the Java process spawns an soffice process; that one then execs that oosplash thing that shows the splash screen, which we have for, like, historic reasons, nobody ever cleaned that up; and that then spawns the true soffice.bin executable. So yeah, lots and lots of processes hanging around there.

And you can imagine that if we have a job that stops working, so there are hung processes, then there might still be lots of these hung processes, and we need to make sure that all of them get killed. We want to get to a clean state again; of course we don't want to have any leftover hung processes. And I'll show you how the way Jenkins natively does that is not quite up to the task.

We had issues with that in the past, because of what a test looks like: like this Java test I just showed, where you have the Java process spawning the soffice / soffice.bin thing, and these two communicate with one another via a pipe. And whenever you start an soffice, it looks whether there's already another office running, because we can only ever have one office running per user installation. Each test has its own user installation, in the workdir somewhere, but that one gets reused. So suppose you have two Jenkins runs, and there's a leftover soffice process from the old one, and it still has its user profile's named pipe open, where it listens for UNO connections. If another soffice starts for the same user installation, then it sees: oh, there's already another process that has this pipe with the special name open, so I'm not going to do anything but terminate, and tell the other one over the pipe what it shall do. But that other one is the hung process from the old Jenkins job, so it won't do anything. So in the end, the Java test will tell the hung old soffice to do something, and it will still not do anything, so the new test hangs as well, and your job will also get killed: because there was another, completely independent Jenkins job running before you that happened to get killed, but Jenkins didn't kill everything of it, because Jenkins didn't know how to kill everything, and it just happened that there was an old soffice process hanging around from the previous Java unit test, and now your Java unit test hangs as well. Bad luck. What we needed to do then was to manually log into those machines and kill off all the processes that still hung around, the zombie processes, which was not that much fun.
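As an aside, the one-office-per-user-installation handshake that causes this cascade looks roughly like the following sketch; I'm using a Unix socket and a made-up pipe path in place of LibreOffice's real named pipe and protocol, so this only shows the shape of the problem, not the actual code:

```c
/* Schematic of the single-office handshake (assumed shape, not
 * LibreOffice's actual implementation): if someone already listens on
 * the user installation's pipe, hand the request over and exit. */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical pipe path derived from the user installation */
    const char *path = "/tmp/user-installation.pipe";

    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0) {
        /* Another office owns this user installation: tell it what to do
         * and terminate.  If that other office is a hung leftover from an
         * earlier Jenkins job, it never reacts, and now we hang, too. */
        write(fd, "open-document ...\n", 18);
        close(fd);
        return 0;
    }
    close(fd);

    /* Nobody there: become the one office and listen on the pipe. */
    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    unlink(path);
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 8);
    /* ... run the office, accept() requests forwarded by later starts ... */
    return 0;
}
```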
So, the issue with Jenkins is that it's written in Java, which is nice, it can run anywhere, everywhere, but the problem of course is that it has rather poor capabilities for terminating all these processes that I showed. Each Jenkins job has, like, maybe even thousands of these processes, and how does it kill all of them? Well, what it does is set a BUILD_ID environment variable to a value specific to this Jenkins job, and then it iterates over all the processes on the system. It does your ps equivalent on each platform, scans the textual output of that, looks at the PIDs, and then asks, for each of these PIDs, whether that process has this specific BUILD_ID variable with the specific value. If that's the case, then it knows this is one of the processes from this job, and then it sends that process a SIGTERM. Not even a SIGKILL, just a SIGTERM, which tells the process: please go away. But it doesn't actually kill it. So there are tons of things that can go wrong there, like new processes getting spawned while it processes that ps output of all the processes, so it doesn't even know there are new processes coming in, and stuff like that. So when I looked into that code and wondered why we sometimes have these hung processes, I learned: yeah, that's the reason, and that's not gonna work.
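Roughly, what that scan amounts to is the following, here sketched in C against Linux's /proc instead of Jenkins' actual per-platform Java code; buffer sizes and error handling are simplified:

```c
/* Rough sketch of Jenkins' process killing on Linux (simplified): walk
 * /proc, look for a matching BUILD_ID=... entry in each process's
 * environment, and SIGTERM the matches.  Anything spawned after the
 * walk started is simply missed. */
#include <ctype.h>
#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

static int has_build_id(const char *pid, const char *wanted)
{
    char path[64];
    static char env[1 << 16];         /* big enough for a sketch */
    snprintf(path, sizeof path, "/proc/%s/environ", pid);
    FILE *f = fopen(path, "r");
    if (!f) return 0;                 /* already gone, or not ours to read */
    size_t n = fread(env, 1, sizeof env - 1, f);
    fclose(f);
    env[n] = '\0';
    /* environ(5) is a list of NUL-terminated KEY=VALUE strings */
    for (char *p = env; p < env + n; p += strlen(p) + 1)
        if (strncmp(p, "BUILD_ID=", 9) == 0 && strcmp(p + 9, wanted) == 0)
            return 1;
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2) return 2;          /* argv[1]: this job's BUILD_ID value */
    DIR *proc = opendir("/proc");
    if (!proc) return 1;
    struct dirent *d;
    while ((d = readdir(proc)))
        if (isdigit((unsigned char)d->d_name[0]) &&
            has_build_id(d->d_name, argv[1]))
            kill((pid_t)atoi(d->d_name), SIGTERM);   /* a polite request only */
    closedir(proc);
    return 0;
}
```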
So what I did was just come up with a little C program. Luckily, most of the time these hung jobs are on Linux, because on Linux, for the Gerrit CI jobs, we run the full make check; on the other platforms we don't run the full make check. And the full make check is the one which is notorious for spawning all these JUnit and Python unit and UI test remote tests, where you have these multiple processes: a Java process or a Python process that spawns an office and then tells it to do something. And these are the processes, or the process groups, that are most in danger of running into this issue of hanging, of having leftover processes, when we killed them via the old, original Jenkins way. So I concentrated on Linux and created something that at the moment is only used on Linux. It might work on Mac; I never tried it, I think. And I think there was the idea to do something for Windows at some point, but nobody ever came around to it, so we don't have anything for Windows at the moment.

Condensed down, what the kill wrapper does is this. The Jenkins job starts with a first process, which used to be the make that you also type in your shell when you build the office. Instead of that, as the first process we now have the kill wrapper program, which then invokes the make and all the processes hanging off it; but it invokes that make in a so-called new process group. That's a POSIX concept where you need to do some voodoo with your process, and then you have a new process group. And the great thing about these process groups is that you can kill all processes in that process group at once, atomically, and none can survive. So you can't have this issue that there are new processes being spawned while you kill the other ones. This is a sure-fire way to kill all of the processes that got spawned by that one Jenkins job. And yeah, that works remarkably well, I think. I can't remember that we ever ran into any issue that could be explained by some bug in the kill wrapper, ever since, which I guess is like two or three years now; I can't remember one. So that's kind of problem solved there: we now know we can get rid of that process group.
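Condensed into a sketch, the process-group idea looks like this; it's an illustration of the POSIX mechanism, not the real wrapper, which also runs the data-gathering script described next before it kills anything:

```c
/* Core of the kill wrapper idea, as a sketch: put make into its own
 * process group, and on a hang take the whole group down at once. */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 2;

    pid_t child = fork();
    if (child == 0) {
        setpgid(0, 0);                /* become leader of a brand-new group */
        execvp(argv[1], argv + 1);    /* e.g. "make check"; all descendants
                                         inherit the group */
        _exit(127);
    }
    setpgid(child, child);            /* set it from the parent too, to close
                                         the startup race */

    /* ... the no-fresh-output watchdog from before; on timeout: ... */

    /* One killpg() reaches every process in the group atomically; no
     * window for freshly spawned children to slip through. */
    killpg(child, SIGKILL);

    int status;
    waitpid(child, &status, 0);
    return 0;
}
```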
So when you do a Normal make job then we don't print out all the output that just that gets generated, but would only print it When the job Fails, but when we kill it and it doesn't fail so I never prints anything It doesn't come around to the point where we would print the failed Locks Output because we just kill it hard So we need to do a trick To find the lock files and print them and that works nicely on on Linux as well, and that's what we do I introduced that a bit later So yeah, that's again What we do before we terminate everything We run this little script that is supposed to generate all the data or gather all the data I Did that not in the in the seed program itself because it's easier to do such things in a script and I took pearl for no particular reason except that it is available on all these Jenkins machines and it's not Python so That's a little plus one there and what it does is or If you can think of new things that we would show that there then it's relatively easy to do that You just need to to add to this. It's It's in the load Repository so you can extend that script Just need to make sure that it's something that's available on these centers seven Build machines So does this ps3 thing of showing all the processes Then for all the interesting processes So I don't do that for the shell scripts that execute the recipes because that's uninteresting for us But for the Java and the S office and of the S office bin processes It prints all the back traces the raw ones as well as the Python ones in case it's a Python process so gdb has this nice pi VT which Looks if this process is running some Python and then it prints the Python back traces and otherwise It just says sorry. This is not a Python process So that works rather nicely to print all the information that we can even if it's a Python process Then you get the raw order the real Python back traces in addition to the raw ones and It dumps all the lock files as I said and for that it looks into the proc tree to find the Standard error some links of all the processes that we are still running Because that's the that's these some links are the same as the lock files Or they point to the lock files that we actually write So that's a nice way of gathering all the lock files because otherwise it would be hard to to get an idea We have these process they're running, but what are their lock files? So that's a trick I use there to gather all the lock files that are being still open But as I said, it's quite tons of data and I already Mentioned this in passing so the Jenkins machines are just still sent or seven baselines So they don't have some of the features that would be nice of the ps3 Command that is there for example doesn't has a minus t so it shows all the threads per process with all their threat IDs Which would be suppressed with the minus t So there's even more Output and I tried to GDP backtrace full which also prints local variables But I think either it was too slow in the end or it had some issues with pretty printing and Python I can't remember but I had to to disable that again, and we could also try to maybe to use J stack to print Java Stacks or readable Java stacks of process that are running Java. 
But as I said, it's quite tons of data. And I already mentioned this in passing: the Jenkins machines are still on CentOS 7 baselines, so they don't have some of the features that would be nice. The pstree command that is there, for example, doesn't have the -T option, so it shows all the threads per process with all their thread IDs, which would be suppressed with the -T, so there's even more output. And I tried gdb's backtrace full, which also prints local variables, but I think either it was too slow in the end, or it had some issues with pretty-printing and Python, I can't remember, but I had to disable that again. And we could also try to maybe use jstack to print readable Java stacks of the processes that are running Java. We don't do that at the moment; I can't remember whether that wasn't working on those machines, or we just never needed it until now, but that's something one could look into, for example.

And with that, and a few minutes left, we'll go over to the real meat. So I have prepared, I hope that's readable at that size, some rogue Gerrit change that introduces a deadlock into some test code. And of course what I then get is that the Jenkins job says that it got killed by the kill wrapper. Of course it's the Linux one, because that's the one that runs the make check, and I poisoned some of the make check tests. So if you daringly click on this, it brings you to, yeah, the tons and tons of output that you see there. And especially because I made the screen size larger, it now shows this pstree output in a rather, let's see if I can, that's still, even at a hundred percent, these lines are so long that you don't nicely see that it's a tree. But it prints all the process arguments for all of the processes. So it shows the makes at the top, and then the shells that run the recipes, and then here we get to a Java that is probably a JUnit test. And if you look at the end of the line, somewhere there it says where the log file is going, and the user installation files: so this is the JUnit test framework complex. That's kind of a first tip for you on which process fails: always look at the end of the lines, where you see the workdir directories belonging to this test, and then you have an idea, ah, it's that test. And then there's the soffice, the soffice.bin, and, as I said, the pstree, lacking the -T, also shows all the other threads per process. And then we come to a second block, and there are again the shells, and then we see a Python somewhere, there's a Python up there, and the long line ends with UITest writer_tests5. So we get an idea that there's a second test also hanging, and that's a UI test named writer_tests5. And then the pstree trails off, and what comes next, is that readable at that size? I'll leave it at that; I think the rest is not that demanding, so I can go on. Then comes, for every process that is not a shell process: it calls gdb on that process to print all the Python and all the raw backtraces, and it also gives the registers, just in case. So the first one that it picks is the Java one, and this is completely uninteresting, because it just shows you how Java is implemented, and we'll have to scroll through that. So you have to look at the PID, look at the process tree, to see that this is a Java process; that's uninteresting at the moment, I'll just wade through that. And then comes the next one, which is a short one with just two threads, and that's because it's that oosplash thing that sits between the Java and the real soffice.bin. So that's also uninteresting: if you see something with two short threads, then it's an oosplash, and you scroll past that as well.
And the third one is then the first soffice.bin, the one from the Java test, and now you can look into that. You don't see that much interesting here, and I'll give a spoiler: the issue is, I think, in this thread number two, because from Java we do a remote binary URP incoming request, you see that here, down there, and that's the place where I poisoned the Java test to call something in the soffice. And that something in the soffice, that must be another thread doing it, or maybe it is this one, yes. I'll hurry up. So we didn't find out anything about the Java test; that's an outcome as well, sometimes it's too confusing to learn anything. But we still have the other test, which was a Python one, and here we have a Python backtrace. Python does its backtraces the other way around, so in this case the most interesting line is now the topmost one. It says that we are in this writer_tests5 xwindow.py file, line 143, and that's exactly where I introduced that poisoning, where I call into the soffice to do something bad. So if you then see this, you can go: aha, the Python job stopped there; then I can look into the Python file and see why it stopped there, and debug from there. And this is what the Python file does: here, down here, is how I poisoned the Python file to instantiate not the Toolkit UNO service, but the Deadlock UNO service that I introduced, and the Deadlock UNO service just tries to lock a mutex twice, which doesn't work, which deadlocks. So that's how I managed to create this rogue Gerrit change for our testing purposes. And we're over the end now. So: no questions, no answers. Enjoy!