Hello, I'm Lukas Doktor from Red Hat, from the virt team, and today I'd like to talk a bit about bisection, and not just for Git. So first, I'm not sure how familiar you are with git bisect; who here is familiar with it? Like half-half, cool, that's a great beginning. So I'll do a short introduction for those who are not that familiar with it. Then I'll describe our usage and the things we were solving that may be reused; I'm trying to be really practical here. And then we'll move to places where git bisect actually didn't work for us and why, in the end, we had to create something new, pretty much inspired by git bisect. So bisection, what is it? In short, it uses interval-halving to quickly find the place where things changed. In Git terminology, you have a good commit and a bad commit, and you want to quickly find out which commit in between caused the issue. So you start in the middle; if the result there is the same as on the good side, you jump to the right, into the middle of the remaining range. If it's the same as on the bad side, you jump to the middle on the left. Bear with me, it's not exactly great on the picture, but plus minus. And you keep jumping until you find a pair of adjacent commits where one is good and one is bad, and return the first bad commit. By the way, bad doesn't mean it's bad, it just means it has the same output as the bad end, and not as the good end, okay. Very quickly: on Git, which is a version control system, I guess you're familiar with that, you can start a bisection on a repository. From then on you can tag a certain commit as good, bad, or skipped. Skip is very important, I'll talk about it on the next slide. And once you specify one bad and one good commit, bisect will automatically start checking out the commits it thinks you should check next.
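The interval-halving described above can be sketched in a few lines of Python (a toy illustration, not git bisect itself; is_bad is a hypothetical callback that tests one revision):

```python
def first_bad(revisions, is_bad):
    # Interval-halving: everything before some point behaves like the good
    # end, everything from that point on behaves like the bad end.
    low, high = 0, len(revisions) - 1      # low: known good, high: known bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(revisions[mid]):
            high = mid                     # same as the bad side -> go left
        else:
            low = mid                      # same as the good side -> go right
    return revisions[high]                 # the first bad revision

# Toy example: "commits" 0..9, regression introduced at commit 6
print(first_bad(list(range(10)), lambda rev: rev >= 6))  # -> 6
```

Note the assumption that there is a single switch-over point; that is exactly why an unrelated failure in the middle can mislead the search, which is what the skip mechanism below addresses.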
Afterwards, you can manually mark that, okay, this one is good, this one is bad, and keep jumping until Git is happy and tells you, okay, this is your first bad commit. Alternatively, you can use git bisect run, where you can just hand it a script or a command line. It will execute that script on each revision it thinks needs to be checked. And based on the return code, it either assumes it's a good result or a bad result; there is a skip return code of 125, and everything from 128 up means a critical failure, which interrupts the bisection. So don't be scared if something interrupts immediately and you don't know why; it's probably because the script returned, for example, minus one, which ends up above 128. You can address the issue and resume the bisection from there. Afterwards, you can use git bisect log to see whether it's sensible, because sometimes Git can wander off, and the log is useful for that. I mentioned skip is very important. This is because bisection is a fast method to converge. Take this example: when you start on this commit and it immediately fails, you leave out this whole part and never touch it, because it was bad, right? You skip it and never look at it again, which means you jump here and here and return this as the first bad commit. In reality, this may have been an unrelated failure, for example a setup issue, and not a real failure of this case. Instead, what you can do is skip the commit: you say, okay, I wasn't able to test this one commit. What Git does is it jumps very close, tries that commit, and if it works, it jumps to the right or to the left and finds the real first bad commit. So it's important; you can use it for setup issues or for uncertainty, like let's say you have a certainty level and you say, okay, I'm really not certain whether this failed or passed, skip it. Worst case, you can just bisect again without it.
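The exit-code convention of git bisect run described above can be summarized as a tiny sketch (just an illustration of the mapping, the names are mine):

```python
def classify(code):
    # Exit-code convention of git bisect run, as described above:
    #   0      -> good
    #   125    -> skip this revision
    #   >= 128 -> critical failure, the bisection is interrupted
    #   1..127 -> bad (everything else)
    if code == 0:
        return "good"
    if code == 125:
        return "skip"
    if code >= 128:
        return "abort"
    return "bad"

# A script returning -1 is seen by the shell as 255, hence the surprise aborts:
print(classify(255))  # -> abort
```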
The sample workflow looks pretty much like this; we will need it in the second part of the presentation. You start a bisection, give it a bad commit and a good commit, and then just run the bisection, where the script can look something like this. It can be just a simple wrapper; you don't need to check out the commit because Git does that for you, so you can assume you will always be on the commit you are testing. What you do need to do is, for example, deploy your application. Why? Because Git doesn't know anything about how to deploy your application. Sometimes people forget about this, and you can see how easily you can skip the commit if the deployment fails. Unless it's expected, then you can just return, for example, one. And then you run your test suite. You don't need to read this slide, just focus on the colors. The red lines are the lines you manually enter. The yellow lines are the output of git bisect, which tells you, okay, I'm now on this commit, I have this many revisions left to test, et cetera, so you can see how it's progressing, for example if it takes too long. And then you have the blue output, which is the output of the script. So you can see it tries one commit and there is some failure, another commit fails there, then I actually execute false and then true. You can perhaps guess how it ends up, right? In the end Git tells us, okay, there are only skipped commits left to test. What does that mean? I had a good commit, then a couple of skipped commits, and then a bad commit. Git won't test those skipped commits again. Why? Because they are skipped, and it's on you to decide which of those three commits was the first failure. Pretty useful. A good note about merge commits: it works, it descends into them pretty well, so you don't need to care about that part; everything works seamlessly.
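The wrapper described above (deploy, skip on setup failure, then run the tests) could be sketched roughly like this; deploy_cmd and test_cmd are hypothetical placeholders for your own commands, not anything from the talk:

```python
import subprocess

def check(deploy_cmd, test_cmd):
    # Git has already checked out the commit under test; git bisect run
    # calls this wrapper on every revision it wants checked.
    if subprocess.call(deploy_cmd) != 0:
        return 125                # deploy failed: a setup issue, so skip
    if subprocess.call(test_cmd) != 0:
        return 1                  # test suite failed: bad revision
    return 0                      # good revision

# Used e.g. as: sys.exit(check(["make", "deploy"], ["make", "check"]))
# behind something like: git bisect run ./check.py
```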
git bisect log, again, no need to read it now, but it's there and it's available. Now, if you remember, I mentioned I'm from Red Hat, from the virt team, and the project I'm working on is the performance QEMU CI. It's called CI, except each build takes 8 to 12 hours, which means I cannot really afford running a CI on a per-commit basis. But I'm faking it pretty well; nobody actually noticed, because I'm running daily jobs and sometimes weekly jobs, and in case of failure I just rerun the same job with a limited set of tests, using git bisect to speed up the process. There were three little issues with that; I solved them in the check script, and I think you can take inspiration from it, that's why I'm here. The first problem: I mentioned performance testing. It's not feature testing, which means I don't have good/bad, I just have one throughput and another throughput. How to deal with that? You'll see on the next slide. Then reproducibility is an issue: even though we usually run five samples and use the middle one to improve reliability, it's still not that reliable. So I usually use a two-out-of-three mode, but if the reliability is under 50%, we can, for example, switch to "does it fail in three consecutive rounds", et cetera. It helps a lot and can be implemented in the same place. The second thing we do is related to the good and bad part, because we are actually reusing the already assessed results to further improve the next assessments. And last but not least, not just with perf testing, sometimes you may want to plot some outputs, and it's good to have them sorted according to the git log and not according to how git bisect was jumping. It's pretty simple but maybe not that obvious, so I'm including it here as well. So this is a slightly simplified part of the script from our project, and it basically shows how we drive the execution.
Again, you've already seen this on my slides before, except we don't specify the revisions here. We just check out the good revision. Then we run our check script, telling it this will be the good baseline. That's important; you'll see how it's treated on the next slide. The result is that you get two directories, directory one called good-1, directory two called good-2, and in there you'd find JSON results with multiple throughputs or whatever is currently measured. Then we tell Git that this one was actually a good one. We check out the bad commit and do the same, except we tell the check script that this will be the bad baseline, and it creates bad-1 and bad-2, again with JSON results. We tell Git that this is bad, which means we are ready to run a bisection. We run the bisection, this time telling it, okay, don't just run the test but also check whether the results were closer to good or closer to bad, and afterwards we generate a report. So this is the bisection script. Don't try to figure out which language it is; it's simplified to fit on the slide. What we do is execute run-perf, which is the tool we use for testing; it generates the current results using this suffix. If it's a good or bad run, which means we're generating the baseline, we simply run it again with the suffix 2, generating current-result-2, and afterwards we just move the results: we take everything called current-result and move it to the name good or bad with the same suffix. So all files or directories called current-result are moved to good-1/good-2 or bad-1/bad-2. That's it, those are the baselines. Next comes the check part: we're not in the baseline branch there, so we skip that part, right? We still execute run-perf, get current-result-1, and then we execute a tool called diff-perf that understands the directory layout, so it looks at all directories named good or suffixed G and bad or suffixed B and creates two groups of results.
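The baseline step described above (renaming the fresh results to good-1/good-2 or bad-1/bad-2) could look something like this sketch; the exact directory names are my assumption based on the description:

```python
import os
import shutil

def store_baseline(workdir, verdict):
    # Move every entry named current-result-N to <verdict>-N, so a good
    # baseline run produces good-1/good-2 and a bad one bad-1/bad-2.
    for name in sorted(os.listdir(workdir)):
        if name.startswith("current-result-"):
            suffix = name[len("current-result-"):]
            shutil.move(os.path.join(workdir, name),
                        os.path.join(workdir, verdict + "-" + suffix))
```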
And it tries to assess whether current-result-1 is closer to this group or closer to that one. That was the first implementation; now we are actually using standard deviations to assess which is more probable, but it works the same way. I mean, you can take inspiration from the places where you can plug that in. And yeah, the return code is zero or one: closer to good, closer to bad. So that's the first question, how do we check whether it's good or bad: we just look at which is more probable, is it more likely to be good or more likely to be bad. Then we execute run-perf a second time. Remember the two-out-of-three mode? If the two results match, we are done, no need to do anything else, because why would you execute a third run if you already know two of them match? But if they don't match, we execute run-perf a third time, and our return code is based on this one, because if you already have one good and one bad, the third one cannot have a third state, so it will be closer to one of those. Afterwards we move the results, but this time the name is a global index, like one, two, three, four, five, whatever, one million and one, and we suffix it with B for bad results and G for good results. This is very important, because we need to distinguish which are good and which are bad, and we need them sorted for the report. The suffix also matters because, as you may have noticed, I don't look only for the good baselines but for anything suffixed with G, which means in the next round I include all the results that were assessed as good. That includes results that did not originally look like good ones, so in the next step, results that would previously have been assessed as bad can end up in the group of good results, gradually shifting the boundary between good and bad.
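The two-out-of-three logic described above is simple enough to sketch (run_once is a hypothetical stand-in for one run-perf execution plus the good/bad assessment):

```python
def two_out_of_three(run_once):
    # run_once() returns "good" or "bad" for one measurement.
    first, second = run_once(), run_once()
    if first == second:
        return first              # two matching runs, no third one needed
    return run_once()             # tie-break: the third run decides

results = iter(["good", "bad", "bad"])
print(two_out_of_three(lambda: next(results)))  # -> bad
```

The point of the early return is exactly what the talk says: a third run can only agree with one of the first two, so when they already match it adds nothing.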
This worked well for us and significantly improved the bisection, unless we had an error at the beginning. If you have an error at the beginning, you can just rerun it. But if you don't, as we go, we keep refining the jitter that is acceptable for us. And afterwards, we exit based on the return code. In the end, I mentioned we are generating reports from the results. Again, it's very simple. You could use git log and try to match which commit belongs to which of your results, and that's kind of tedious, especially with merge commits, where the commits can come in a weird order. Or you can just stop and think: if you start here, then on a good result you jump to the right, into the middle, so you're not jumping over anything; and on a bad result you jump to the left. So what you can do is take all the good results, leave out the bad ones, and keep them sorted as they came, like first, fourth, and fifth. Then you take all the bad results, leave out the good ones, and because you were jumping to the left, you need to reverse their order, so you take them in reverse, which means you have the third bad, then the second bad, and so on. It's as simple as that. And the result looks like this; again, no need to understand everything. The main thing you can see is that we are going from left to right: there are all the commits that were good, and there are all the commits that were bad. The four tests are not important for this presentation; I'm just demonstrating that it works and it really draws the line. So that would be the introduction to git bisect, and some goodies for you if you're already using it. Anyway, where git bisect didn't work for us is mainly downstream, and it's mainly because we don't have a Git to bisect over. We just have a list of nightly build revisions. And it's solvable; I mean, imagine finding out after two weeks that something is wrong.
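The reordering trick described above can be sketched directly (a toy illustration; the names are mine):

```python
def log_order(visits):
    # visits: (name, verdict) pairs in the order the bisection tested them.
    # After a good result bisect jumps right, so good results already come
    # out left-to-right; after a bad result it jumps left, so bad results
    # come out right-to-left and have to be reversed.
    goods = [name for name, verdict in visits if verdict == "good"]
    bads = [name for name, verdict in visits if verdict == "bad"]
    return goods + bads[::-1]

# Toy run over commits 1..7 with the first bad at 3:
# bisect tests 4 (bad), jumps left to 2 (good), then 3 (bad).
print(log_order([("4", "bad"), ("2", "good"), ("3", "bad")]))  # -> ['2', '3', '4']
```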
Sure, you can spawn a job and, you know, run it 14 times to find out that, okay, this nightly build actually caused the regression. If that's enough, good. But we usually see, for example, 20 or 30 packages that were updated. Sometimes you can guess which one caused it, but sometimes there are so many changes that you can't really say who is responsible for the failure. So what we would do is provision the old distro and try installing the packages one by one: okay, it works, it doesn't work. In the end you find out which one is the culprit, so you rerun it again, and then you find out that you need multiple packages together to actually get the failure, like you need a QEMU and a libvirt change together in order to reproduce. It's quite chaotic. For years, even before the perf CI, I was looking for a tool, and if you know of any tool that allows something similar, like git bisect but not on Git, I would love to hear about it, but I failed to find one. So I just said, okay, enough, I have to do something about it, and I did. And while working on it: I mentioned we usually have multiple packages, and they are independent, so I don't just need to bisect whether this one package is important, I also need the combinations. So I said it would be nice if I could bisect over multiple independent axes. And, you know, since I'm reinventing the wheel, let's add that as well. Here you can see the usage is pretty much similar; I mean, why would I reinvent that, right? I just reuse what Git already does. The command line is similar, except you need to specify all the arguments yourself. I have some helpers, but for now let's try it like this. You can see that this time I used the same commit range as in the previous example, but on top I added two more axes. I said, okay, we changed multiple things, not just the Git revision; in our CI we were so brave that we changed the revision and some settings and some test suite, for example.
And we changed it all simultaneously, so we don't know which of those things actually caused the regression. We need to tweak our script a bit, because bisecter doesn't know anything about Git; it's an independent project, it just uses lists of strings. So you need to check out the commit yourself and skip if that doesn't work. You still need to deploy your project, and then you run the test, and maybe you want to pass those extra two axes as two extra arguments, right? So let's see how it works. Again, no need to read everything; you just need to specify all the revisions, that's the only difference. Then you get a summary, which I find pretty nice, because with multiple axes it's not always obvious how things will look in the end. So you get the summary. If you are happy with it, you just run the bisection, and you can see again some yellow and blue outputs changing. We have more axes, so more variants. And in the end it tells you, okay, the first bad commit is this one. Unfortunately, we don't list all the skipped ones, to be fair. And it tells you that we finished in nine steps and the failure was caused only by axis zero. If you remember, we had three axes, so the two added axes were useless; they didn't change anything, because they don't do anything, and only the first axis, the Git revision, mattered in this example. The logs look slightly different; I can compare them here, I just shortened the SHAs. This is the git log, and this is our log. You can probably guess why: first, if you use arbitrary strings, things can be pretty long and you wouldn't actually recognize anything from them. And second, we use multiple axes, which means escaping them so you can still tell which revision it was is hard. So we just use a serial ID, I mean the index of the item on each axis, separated by dashes. So what do we currently have support for? For the axes, a comma-separated list.
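The dash-separated index IDs described above are simple to sketch (the axis values below are made-up examples, not from the talk):

```python
def variant_id(axes, values):
    # Replace each (possibly long, arbitrary) string value by its index
    # on its axis and join the indices with dashes, giving a short,
    # unambiguous log identifier for the variant.
    return "-".join(str(axis.index(value))
                    for axis, value in zip(axes, values))

axes = [["rev-a", "rev-b", "rev-c"], ["quick-suite", "full-suite"]]
print(variant_id(axes, ["rev-c", "quick-suite"]))  # -> 2-0
```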
That's pretty simple. We also have support for a Python-like range. More importantly, we don't use the range, but what we do use is the URL helper, which is a very simple HTTP parser that works well with Koji or Brew, if you're familiar with those. You have a page with all the Brew builds, and you can say, okay, I'm interested in all revisions between this build and this build, and it gives you links to the Brew builds. You can easily consume that in your test suite: install this Brew build and use it. This is what we use 90% of the time, and the remaining 10% is the Beaker helper, which gives you distro revisions. So provided you use Beaker, you say, I have this nightly build and this nightly build and I want all the builds in between, because I'm lazy and I don't want to copy and paste the names. Now for the arguments, I already showed you this part, right? We run the suite and it automatically injects all the axes, keeps changing the values of those arguments, first, second, third, as positional arguments. Alternatively, if you don't want to use run and you want to do it manually, you can use bisecter args to get all the arguments, or give it the index of the axis you are interested in and it will give you the current value; after you check out a different variant, it will give you the new value. Alternatively, if you'd otherwise just write a wrapper, why would you write a wrapper? You can use the templating mechanism and run the command directly instead. Last but not least, we have the multiple-axes thing, which is interesting; I don't think it's the best way, but it's systematic, which means it works for me. Imagine a situation where you have eight kernel changes, eight libvirt changes, eight QEMU changes, and eight libblkio changes. What a coincidence. So you know that on index 0-0-0-0 you have the good versions, right?
You know those are tested, so you don't test them; you believe the user. Then you have a bad revision, which is the last combination they gave you, right, 7-7-7-7. Now, what do you need extra? Unlike in git bisect, you have multiple axes, so you actually need to know whether the current axis is useful at all, I mean, whether you want to bisect it. So to save time, you start with everything bad, then set the first axis to zero and say, okay, let's try whether this axis actually affects something. Here you can see that previously 7-7-7-7 was bad and now 0-7-7-7 is good, so there is something happening on the kernel front. So you check out the third one, the fifth one, the fourth one, and now you know that the fourth one is the first bad, which means 4-7-7-7 is bad. We can build on that. We switch to axis one and do the same. Again, with index zero it's good, which means, yes, libvirt is also a suspect. So we bisect libvirt and find out that, yes, we have the first bad on index five. Then we check out the next axis, so 4-5-0-7, and see that nothing changed, it's still bad, which means this axis is really irrelevant, no matter what you do. I mean, maybe somewhere in the middle it would work, you don't know, but it's likely that if it worked before and works now, nothing changed. So we can pin this one to zero and not investigate it at all; that's a slight speed optimization. Then you check out the next axis, again using the first bads found so far, with the irrelevant axis at zero, and you find that the seventh is the first bad here. So in the output, because you don't have any further axes to investigate, 4-5-0-7 is the first combination that is bad. What does that mean? In Git it's simple: you get a single commit, or multiple commits if there were skips. Here it means that if you use kernel, libvirt, QEMU, and libblkio with 3-5-0-7, it will work well; 4-4-0-7 will work well; 4-5-0-6 will work well.
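The per-axis walk described above can be sketched in simplified form. This is my reconstruction from the talk, not bisecter's actual code, and it assumes all-zeros is good, the last combination is bad, and is_bad is a hypothetical callback that tests one combination:

```python
def multi_axis_first_bad(lengths, is_bad):
    # For each axis in turn: first check whether setting it back to 0
    # (keeping earlier axes at their first-bad indices) makes the result
    # good.  If it stays bad, the axis is irrelevant and is pinned to 0;
    # otherwise bisect it to its first-bad index and keep that value.
    current = [n - 1 for n in lengths]       # start from the known-bad corner
    for axis, n in enumerate(lengths):
        trial = list(current)
        trial[axis] = 0
        if is_bad(trial):                    # still bad with this axis at 0:
            current[axis] = 0                # irrelevant, skip investigating it
            continue
        low, high = 0, current[axis]         # 0 is good here, current is bad
        while high - low > 1:
            mid = (low + high) // 2
            probe = list(current)
            probe[axis] = mid
            if is_bad(probe):
                high = mid
            else:
                low = mid
        current[axis] = high                 # first bad index on this axis
    return current

# Toy example mirroring the talk: axes of length 8; the failure needs
# kernel >= 4, libvirt >= 5 and libblkio >= 7 together; QEMU is irrelevant.
print(multi_axis_first_bad(
    [8, 8, 8, 8],
    lambda v: v[0] >= 4 and v[1] >= 5 and v[3] >= 7))
# -> [4, 5, 0, 7]
```

As the talk admits, pinning an axis to zero when the boundary check stays bad is a heuristic: a value somewhere in the middle could in principle still matter, but it buys a big speed-up.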
But 4-5-0-7 is the first bad combination, which means the kernel, libvirt, and libblkio changes are needed together. Note, when I actually started with bisecter, I used a different approach, and these kinds of failures were pretty hard to find that way; I mean, they didn't look that nice, but finding all kinds of failures worked well. So if your workflow is likely to need all kinds of failures, feel free to contact me and we can bring it back, but it was just prolonging the bisection, and I'm happy with this one, because in our CI this is the only thing we are looking for. So the key takeaways are: git bisect is cool. If you're on Git, don't even think about anything else; I mean, don't use anything else. git bisect is fine, but maybe you can take inspiration from some of the goodies I showed you here. If you're not on Git, or if you want to, for example, bisect multiple submodules, because they tend to break as well, you may check out our project. There is a Git helper, so you don't need to specify all the revisions in between, and it can help you with bisecting over multiple axes, or simply over things that are not Git revisions, like nightly builds or your images; it could probably survive that. Just know that it's in an early beta stage, which is good and bad, because you can join and help me improve the tool. I know that skips are not that well implemented at this point, but in terms of good and bad results it works pretty stably and we are using it in our pipelines. So, any questions? Yes? That's up to you. It's a very simple project; it doesn't do anything you don't ask it to. What bisecter does is just tell you, okay, use this string; in our case that's, for example, the Brew build link to that kernel.
What you do afterwards is up to you, and we store the information in hidden files, so you can freely reboot the machine if you want to. It's on your check script what you want to do, basically. And yeah, the question was whether you can reboot the machine. Okay, was there another question there, or no? No? Okay, so what did you ask? You mentioned that you might not know this part, uh-huh, how do you bring in the history? So the first question was whether we store all the information about what was good and bad. Yes, we do, because if it was, for example, a sparse axis, sometimes you don't need to test a value again because it's still the same. And besides, we need to show the user the Git log, or I keep calling it Git log, it's the bisect log, but it's the same kind of log. So we are storing that information. And the second question was what happens when we are bisecting an improvement followed by a regression. If you don't catch it, which means you are here and you are here and you don't see anything, you just don't know about it, because it's a bisection: you don't test it, right? How would you know that it happened? But if you have a certain level, the same level, and in between there is a spike, then again, it's slightly outside the scope of this presentation, but what we do is start the bisection, and that result will actually be so improbable that it will probably end up as this one. I mean, we were trying to do the skip thing, but we haven't had time to do that yet; it would be best to skip those, obviously, but we are not doing it at this moment. And one more thing: sometimes, if you have really oscillating values and the probability doesn't work, we just use the nearest neighbor, so obviously this result will be closer to that one. But worst case, that's what you have the git log, or bisect log, for: you can see that, okay, from here it looks somehow odd.
So you can rerun with a shorter range, which is probably the best thing to do. Okay, I think the question was about skipping them in advance, because you already know them. So if I understood the question correctly, you have some Git revisions and you know that some range should be skipped, because you know they can't be tested, right? Yeah. Okay, with git bisect you can do that, because before you start the bisection, and even after you start it, you can always use git checkout to check out a certain commit and say, okay, git bisect bad; git bisect skip, sorry. Then you check out another one and do the same. Basically, you create a for loop over your skipped commits, and after you finish, the last git bisect skip will let bisect check out the commit that should be tested next. With bisecter, unfortunately, not at this moment. So with git bisect yes, with bisecter we don't support it at this moment; feel free to contribute it. By the way, it's a Python project, so you can just import it and use it in your project as well if you're interested; you don't need to use the command line, for example in your scripts. Any other questions? Okay, there. Yes, I thought about multiple different use cases. For example, to test it, I just tried, for example, booting the biggest machine with different sizes of RAM. It behaves nicely. So yeah, it could be used for those kinds of tasks as well. But I mean, you name it. Yes, I forgot to repeat the question, I know. So yeah, the question was whether it can be used for something else, and yes, there are certain ways. For example, in tuning, you can search for different arguments, although we're jumping from one axis to another, so it may not be optimal for that case. But again, it's open source on GitHub, so you can contribute; different modes are welcome. Anything else?