Thank you very much. Hello everyone, and thank you for having me here. We are going to talk about Red Hat, Microsoft, Linux, and testing. I come from the open source world and I've been doing software testing at Red Hat for a very, very long time, as you can see. I love to speak at conferences; I don't have any other pictures except ones of me at conferences. I am also the project lead of Kiwi TCMS, an open source test case management system. It's an interesting project with an interesting history, so please check it out, give me feedback, report bugs. If you need something, we are happy to fix bugs for you. With that said, I'm done with the shameless self-promotion. Let's talk about pairwise testing.

Is anyone familiar with what pairwise is? Okay, one, two, three, a couple of people, cool. For the rest who are not familiar, I have an example to demonstrate. Imagine that our job is to test this car. Because we are diligent testers, we look carefully at what's given to us, and we see that we have a few options for batteries and a few options for wheels. We also have a performance mode, dual engine versus single engine, and so on. These are input parameters which may affect the performance of the car, so we have to test them. Here is a list to make things easier to comprehend. Just for the sake of the example I have added some more options so we can have more combinations later, and I've made them independent of one another to keep things simple.

If we want to do exhaustive testing of all these parameters and their possible values, we have 32 different combinations, which means 32 different test environments that we need to work with. Or, if we're really testing the car, that means 32 different cars. That's quite a lot, and pairwise is all about reducing that number. Pairwise says we should not test all possible combinations of all the parameters we have. The claim is that in complex software systems you don't need to control all the input parameters at once to trigger a defect; controlling at most two of them is enough to trigger pretty much all of the defects. Not all of them, but a good amount of them. If you can trigger defects by controlling only two parameters, that means you can design your tests around combinations of two, not combinations of more parameters. That is why pairwise is also called all-pairs testing.

If you look at this matrix in the example, it contains all the pairs: a 60 kWh battery with a single engine, a 60 kWh battery with a dual engine, then 60 kWh with 21-inch wheels and with 19-inch wheels, and so on. It contains all possible pairs of values, but not all possible combinations. The maximum size of this kind of matrix is determined by the two parameters with the largest sets of possible values; in this case we have the battery with four values and the engine with two values. Yes? I said it's all combinations of two, and the question is: what about single engine with no performance mode? Okay, bad example, that one should be here; I created this matrix by hand to look pretty. Good catch, thank you. So yes, that's the eight: in this example, eight is the largest possible size of this matrix.
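To make the car example concrete, here is a minimal sketch of a pairwise generator, written in Python rather than with any of the tools mentioned later in the talk. It greedily picks cases from the full cartesian product until every pair of values is covered, which is fine for a model this small; the specific battery, wheel, and engine values are made up for illustration.

```python
from itertools import combinations, product

def pairwise_cases(parameters):
    """Greedy 2-way (pairwise) covering set.

    parameters: dict mapping parameter name -> list of possible values.
    Returns a list of test cases (dicts) such that every pair of values
    from two different parameters appears together in at least one case.
    Greedy search over the full cartesian product -- fine for small models.
    """
    names = sorted(parameters)
    # All ((name, value), (name, value)) pairs that must appear in some case.
    uncovered = {
        ((n1, v1), (n2, v2))
        for n1, n2 in combinations(names, 2)
        for v1 in parameters[n1]
        for v2 in parameters[n2]
    }
    candidates = [dict(zip(names, values))
                  for values in product(*(parameters[n] for n in names))]

    def pairs_of(case):
        # Pairs are ordered by parameter name, matching `uncovered`.
        return set(combinations(sorted(case.items()), 2))

    chosen = []
    while uncovered:
        # Pick the candidate covering the most not-yet-covered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen

car = {
    "battery": ["60 kWh", "75 kWh", "85 kWh", "100 kWh"],  # made-up values
    "wheels": ["19-inch", "21-inch"],
    "engine": ["single", "dual"],
    "performance mode": ["off", "on"],
}
cases = pairwise_cases(car)
print(len(cases))  # close to the theoretical minimum of 4 * 2 = 8, instead of 32
```

The real tools listed on pairwise.org do the same job for much larger models; this toy version is only meant to show the idea.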
You take the two parameters with the most values and multiply their counts; that gives the size of the matrix. And obviously eight is much smaller than 32.

There is a community website called pairwise.org, sponsored by Microsoft, and to my knowledge they are pretty much the only entity promoting this kind of testing. The website lists a lot of research papers and also many tools for different programming languages. There is also a pairwise tool from Microsoft, PICT, available on their GitHub account. They claim that pairwise testing is effective at finding defects and that it is an effective test strategy. Using these tools, and knowing how pairwise works, you can generate your test cases with particular parameters automatically, so you have tooling that tells you what to do when you perform testing.

And this is what I did: I made an experiment involving installation testing of Red Hat Enterprise Linux, version 6.9 in particular, where I applied these techniques to the entire testing campaign of the product, across all product variants, to see what would happen.

So, installation testing very quickly; this is what my team does and what I'm going to talk about. We have a program called Anaconda, which is the installation program for Fedora, Red Hat Enterprise Linux, CentOS, and similar distributions. Our job is to test that piece of software to make sure Linux can be installed on your computer, and that when you click the finish button, the restart button on the installation screen, you actually reboot into a working system, or at least one where you can log in to a command line and fix things from there. It's mostly written in Python with a graphical user interface; it also has a text-mode user interface, a command line, and a fully automated interface.

We have integrations with different tooling from Linux. For example, this screen has a file system type drop-down listing all the file system types supported in the distribution, and to perform partitioning with those file systems the installation program integrates with the command-line tools for each particular file system. So we have one screen, but behind the scenes it does different things. We also have networking, for example to download packages from the network inside the installation program; that is NetworkManager. We don't manage networking ourselves, we just integrate with NetworkManager. All of these integrations can be a source of problems, and this is usually how our test cases are designed: around these integrations and what could possibly go wrong.

From the point of view of testing, we have nine different product variants, things like Client, Server, and Workstation, and we have different CPU architectures, different hardware that these variants run on: Intel 32-bit and 64-bit, plus IBM Power and IBM mainframe, which we support on the Server variant. In Fedora and newer versions there are other hardware platforms like 64-bit ARM, so this number can grow and become very large. The traditional way the team handles that is that for each product variant, which is some content set on a CPU architecture, we have one person dedicated to only that variant, and they are responsible for testing everything on it.
We call them variant owners or architecture owners. The interesting thing about all these variants is that they are mostly the same thing. The software is the same when it is put onto the installation media; our build servers have no idea what kind of variant they are building for, and the software itself very rarely does any checks to see what platform it's running on. Pretty much the only check is for the underlying CPU architecture, when we need to do something specific to that architecture. So in the majority of cases, all of these variants work in the same way. And as I told you, I've been doing testing for a long time; I've been testing installation for the last 11 years, and I have very rarely seen bugs where something works on one variant and the same thing doesn't work on another. That's why I think it's mostly safe to consider these product variants to be independent and essentially the same thing. You'll see why that matters later.

The test suite that we primarily use for installation testing is split into three groups, called Tier 1, 2, and 3. Tier 1 is a very, very small test suite. It's fully automated, it runs against every single build and every single product variant, and it's not subject to my experiment; we don't want to touch it. Tier 2 and 3 are the subject of the experiment. They are a much, much larger suite: the number of test cases in the Tier 2 and 3 group is almost 20 times that of Tier 1. Traditionally we try to execute Tier 2 and 3 at least once per week and to complete all the testing. Our testing campaign is usually several months long, and during that time we average around 6,000 test case executions. That means 6,000 reboots into installation media, each taking about 20 to 30 minutes to complete the installation, and then a reboot into the installed system to see whether it works as expected. That takes quite a lot of time, and obviously it doesn't scale, so let's see what we can do about it.

I studied my test suite and created a very, very simple experiment. The first observation is that we do have some test cases which are platform dependent, not many of them, mostly on the IBM Power and IBM mainframe platforms, and there isn't much we can do about those: we cannot avoid executing them, we simply have to. So we just take them, transfer them into my experimental test plan, and go on from there.

Then we have a larger group of tests which have parameters. For example, we have a test case called storage/iSCSI, which means: perform an installation, attach a disk over the network, and place the root file system on that disk. For that test case we have an authentication type, and we're interested in only three of them. We also have the networking subsystem that manages the iSCSI connections on Linux, and on RHEL 6 we have two networking subsystems which need to work in pretty much the same way. For testing, that means six combinations for only one test case. Now, I said pairwise has to do with parameters, but if we try to apply pairwise here we don't get any reduction, because we have only two parameters. Looking at the existing test suite, almost all the tests have only two parameters; very rarely do we have three, four, or five parameters in a test case. But across all variants I have to execute these six combinations nine times, which equals 54.
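Jumping ahead a little: as the next part explains, the trick is to treat the product variant itself as a third parameter. Here is a sketch of that arithmetic, reusing the pairwise_cases helper from the earlier sketch; the authentication types, subsystem names, and variant names are placeholders, not the real values from the test plan.

```python
iscsi = {
    "auth type": ["auth-1", "auth-2", "auth-3"],         # 3 placeholder auth types
    "net subsystem": ["subsystem-A", "subsystem-B"],     # 2 placeholder subsystems
    "variant": [f"variant-{i}" for i in range(1, 10)],   # 9 placeholder product variants
}

print(3 * 2 * 9)                   # 54 executions without pairwise
print(len(pairwise_cases(iscsi)))  # close to the minimum of 9 * 3 = 27, roughly half of 54
```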
So I can consider the variant to be a parameter to testing, and this is what I did: I applied pairwise across all parameters, including the product variant as a parameter. Now that we have three parameters we can do the calculation, and we see that we get a 50% reduction immediately, which is great.

The last group of tests in our test suite are ones that have no visible parameters that we care about. For example, we have one called partitioning/swap-on-LVM, which means: install Linux, do the partitioning, and I don't care what you do as long as you place the swap partition on an LVM logical volume and make sure it works. The only thing I can do about those is, instead of executing each one nine times, once for every variant, execute it only once, and for every new build randomize which product variant the cases from this group run on. This is the third part of my experiment, a very simple algorithm (a small sketch of it appears after the results, just below).

I was actually very quick to create my experiment, but I needed some acceptance criteria, and I know I should have defined them beforehand, but I actually defined them after I knew how I would execute the experiment. First, I obviously want fewer test case executions in total, so anything under 6,000 will do, and you can calculate on paper how much saving you are going to get even before you start testing. The second criterion is harder to measure: I don't want to miss existing bugs. I want to test less, but at the same time I don't want to tell my product manager that everything is green when in fact I didn't test and there were bugs I have no idea about. For that reason I compare the bugs I found during my experiment with the bugs the rest of the team found. The last one is even trickier to measure: don't increase product risk. For the sake of the experiment we measure product risk as the number of bugs reported as critical which I was not able to find, and I want to know why I was not able to find them, because I want to make pairwise the main testing strategy.

Now the results. Before we continue, in case anyone isn't clear about this: I ran my experiment in parallel with the rest of the team, so nothing I did had any impact on the release schedule of the product or on the work of the rest of the team. Everybody else was working the way they were used to, and I was doing everything alone, as an experiment, in parallel.

This is pretty much my most impressive metric: 65% fewer test case executions. I didn't execute 6,000, I executed about 2,000, which is cool. I achieved a 76% execution completion rate, meaning that of those roughly 2,000 planned executions I completed 76%, while the team's historical average for RHEL 6 releases is 85%. I remind you that I was working alone; if we work as a team using pairwise, I'm sure we can beat 85%, and probably even reach 100% if we want to. So that's how much of the work I was actually able to complete.

Now about bugs. Here are some findings that I didn't expect, and they were very interesting to me. 30% of all the bugs reported against installation for that release and that testing campaign came out of the Tier 1 test suite, so they were discovered very early in the release cycle.
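Before continuing with the bug findings, here is a minimal sketch of that "very simple algorithm" for the parameter-less group: for each new build, each such test case is assigned one randomly chosen product variant. All the names, including the build identifier, are made up for illustration; the comment about balancing anticipates a lesson mentioned later in the talk, that the tool should have looked at previous executions to spread variants evenly.

```python
import random

VARIANTS = [f"variant-{i}" for i in range(1, 10)]  # 9 placeholder product variants

def plan_parameterless_tests(test_cases, build):
    """For one build, assign each parameter-less test case to a single
    randomly chosen product variant instead of running it on all nine.

    Seeding with the build identifier keeps the choice reproducible per build.
    A smarter version would look at previous executions and prefer the
    variant each test case has been run on least often, so coverage is
    spread evenly across builds.
    """
    rng = random.Random(build)
    return {case: rng.choice(VARIANTS) for case in test_cases}

# Example with made-up test case and build names.
plan = plan_parameterless_tests(
    ["partitioning/swap-on-LVM", "storage/iscsi-bind-to-interface"],
    build="RHEL-6.9-build-42",
)
print(plan)
```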
For me, that 30% means the Tier 1 test suite is doing a very good job. We can probably tweak it a little and increase that number, but maybe we would hit a threshold very soon, and that's a completely different experiment on its own. In any case we don't want to touch the way we do Tier 1; it's our first line of defense and it has proven to be good.

Another 30% of the bugs were found by pairwise. By that I mean either I found a bug and reported it as new, or somebody else reported it and I was able to reproduce it, and the test cases in the pairwise plan are a subset of the test cases in the full test plan. This is cool: we do 65% less work and still find one third of what the whole team finds. To me that looks like the claim that pairwise is effective at finding defects is actually true. I was expecting it not to be true, so this was very surprising to me.

The remaining 30% of the bugs were found by exploratory testing. I went through every single one of them to read what the person did and how they found the bug, and most of the time it's not that somebody was sitting in front of the computer trying to do something particularly crazy. They were actually following another test case with steps to reproduce, but either they did not follow the steps in exactly the same way, or the setup environment was a little bit different. For example, a test case may say you need to have a RAID setup on the system, but the test case itself doesn't say whether it's RAID 1, RAID 0, or RAID 5, or what type of file system should be there. So we come across these edge cases, and I'm not sure whether that's a good thing or a bad thing. One thing is sure: it means we have edge cases that we have not described in enough detail, so when we hit them we want more detail and more automation to make sure these things don't regress in the future. On the other hand, we like to give test engineers a lot of freedom, and the ten people on the team, even though we have automation and tooling, use them in different ways, and so we get these kinds of findings.

Now, the things I was not able to find: four critical bugs, three of which are regressions. I have given the bug numbers, but unfortunately we have a policy that anything found before the product is released to the general public is reported as private, so these are in Bugzilla but you don't have access to them; you'll have to trust me on this.

The first one: after installation with iBFT, the default route is missing. iBFT is booting from a network-attached disk directly from the firmware, like booting from a network disk without having a local disk in the system. If the default route is missing, pretty much everything is messed up; you don't have working networking. That turned out to be firmware dependent, which is good for me because it doesn't affect the results of my experiment: firmware and hardware dependent problems can happen regardless of what type of testing you do. It just happened that this particular system was faulty, and developers and everyone else were not able to reproduce it on other hardware. So, yay for me.

The next one is more serious: the installation program failed to get a kickstart file from NFS on the mainframe. Kickstart files are text files with configuration that drives the installation automatically.
We have functionality in the installation program to bring up networking very early, right after boot, and to try to download the kickstart file from the network if one is specified, so that we can continue unattended from that point on. This is an interesting bug because it's a regression introduced by a fix for another bug. It's also a corner case: it happens only on the mainframe, and only with IPv6, not with IPv4. The reason I was not able to find this bug with pairwise testing is that the test case, install with kickstart from NFS, is in the test plan but is one of those cases I consider independent and try to randomize, and when I looked at the results I had always randomized it onto the Intel architecture. I am nowhere near an expert on the mainframe, so I pretty much have no idea how to work with it. And my randomization tool was very stupid: it didn't look at previous executions to say, you've already tested this on Intel, so let's try some other platform and distribute the executions as evenly as possible.

The next one is again networking related. In the installation program, when you want to attach a network disk, there is a small checkbox to bind to a particular network interface if you have more than one. I tried testing this very early in the testing campaign, decided to do it manually, messed something up in my setup environment, and said to myself: I'll skip this, we have several more builds to test with, I don't have time, I'm doing an experiment, I'm kind of on the clock. Then came the next build, and that particular test case happened to land on the ComputeNode variant, which is a very minimized Linux variant with less software available, intended for high performance computing. Surprise, surprise: the ComputeNode variant doesn't have the iSCSI client tools available in the installation environment, so the installation program just hides the screen and doesn't let you access this functionality. And I did not test it again. I skipped this test case several times, and all the while there was a problem. When I learned about the problem, I went back to the first build where I had skipped my test, tried to reproduce, and this time I did it correctly. So that means I need to be more diligent as a person when I do my job.

The last one is again a regression: some errors during upgrade. When we do an upgrade we don't want these kinds of errors, because customers call us. If something is really horribly broken, we should have caught it before the release; and if it's not that bad, we try to silence these errors as much as possible, because there usually isn't anything we can do about them. I tested this very, very early, in some of the first builds. We have a kind of policy: if you have a lot of work and you've already tested something, continue with the things that are left untouched, and only when you're done go back and retest what you did previously. That's what happened here: I tested it early, it was working, I marked it as passing and moved on to other testing. Then there was a new build, there was a regression that broke it, and I did not test it for several builds. Again, the problem is that I skipped testing.

So, a few lessons that I learned from this experiment; hopefully they are something you can apply in your job, or they inspire you to do something else.
First, I now have a lot of free resources on the team. If we do 65% less actual testing, that means I can have only three or four people working on the product day to day, and the other six or seven people on the team doing automation all the time, working on infrastructure all the time. That is huge: if I have six people doing automation all the time, it's like having a second team. This is great.

The other thing I learned is that we need to do test review more regularly as a team. I'm not talking about going to the bug tracker, seeing that we have new bugs or a regression, finding the test cases related to that area, and improving them. I'm talking about sitting down and reading through the entire test plan and all the test cases together as a team, maybe once a month or once every three months, so that we share knowledge and exchange ideas. I had to do this anyway, because I needed to know what parameters are in these test cases so I could create my experiment, and I found that we have test cases with hidden parameters: variables that affect the way we do testing but are not explicitly described. That is hidden knowledge within the team. I also found duplicate test cases, some pure duplicates and some that overlap and test pretty much the same thing. All of these are sources of optimization which we can apply as a team. I did not change any of this, because I didn't want to affect my results any further: if I saw something that was bad, I left it as it was and simply executed it, because the rest of the team didn't know it was like that. So: perform test review, hopefully regularly, as a team.

I also observed some patterns in the way I was performing my experiment's testing inside the team, things particular to our environment, like how we set up systems and how we trash those systems afterwards. That is also a source of further optimization. My advice is to either take a lot of notes, or have somebody else watch what you do and take notes, then swap so somebody else does the work while you take notes, and then compare notes and try to find patterns. This works really well: if you have a lot of notes you can read through all of them, see what you've done, and notice the things you do repeatedly, which may be good or may not. And in the context of pairwise, all of these things can be considered parameters: if I execute test cases in batches, exactly which test cases are in a batch could itself be put through pairwise, again and again. I've never done this, it's just an idea; it will probably work, I don't know.

There is risk with this testing strategy, because we do not run all the tests all the time: for any single build there are tests we are no longer going to execute. But as a team, and I've presented this to my team and we have ongoing talks about it, we think it's a viable strategy for testing the whole product, and we think we can minimize and mitigate the risk somehow. The most important thing is the human factor, especially when you skip testing: everybody needs to know what is happening underneath. That's the most important thing for me.

And that's it, I'm done talking, and now it's time for you to ask me questions.

Okay, first question there. Yeah, you go ahead. No, I cannot tell you that. The question was whether I can tell you how many bugs we had in the release, and I cannot tell you that. You can look at the Errata records that Red Hat publishes on its website when we ship updates; most of the bugs are listed there publicly, but not all of them.
Okay, go ahead. So the question is whether I have examples from history of bugs which cannot be found by pairwise, bugs that need three, four, or more parameters. Pretty much like your question: tell me the title of one. Which one, the first one? Maybe you can answer that for me, but no, that one is not an example of what he is asking about. I do have access to historical data; we keep track of everything we've done in the last 15 years with respect to installation testing, but it's not very easy to query, you have to go through and read everything manually. So I don't have an answer for you right now; we've never looked at the data in that way. But just from the general feeling of working with the same product for so much time, I believe the things we are not going to find, the things we are going to miss, are not that many.

Okay, go ahead. So the question is whether there is any proof that, if we have independent test parameters and do pairwise, it is not going to reduce test coverage. Well, I don't know if there is research done on that exact topic; maybe it's in those white papers I mentioned on the pairwise.org website, you would have to read all of them. I admit I am more of a practical guy, I never read all of them, I just looked at the technique. Actually, I was told about this technique by a former Microsoft engineer at another conference. But I don't think it's going to reduce coverage, because you still have a large mixture of test case executions, and that adds up. Also, with respect to coverage: in different previous experiments we've compared data from manual and automated testing on different platforms, and coverage of the installation program, in terms of percent of code covered, is pretty much the same all the time. Even if you do a default install without touching anything, it pretty much touches everything in the source code. So we have something like 80% coverage, which is not a really meaningful metric, but okay.

Yeah, you go first, this one. Sorry, the question is whether that bug, the first one, is not covered by pairwise because it doesn't have parameters. Well, that test case as defined doesn't have any parameters: we just want to make sure that this thing called binding to a network interface actually works, because when you select that option it creates configuration files which need to be correct and present on the file system after reboot. Pretty much all of the bugs that I was not able to find came out of the group which doesn't have parameters explicitly listed, and that was my problem: they were randomized in such a way, or I tested them poorly enough, that I missed them.

Yes, go ahead please. The question is whether I have experience with three-wise testing. No, I don't. I've heard about it, but I don't have experience testing in triples or even larger combinations; the research hints that pairs are enough, and that was the rationale. The Microsoft research papers are fascinating. The classical example, I think from one of the Microsoft articles I've read, is where they first came up with the idea: if you open Word, one of the settings dialogs has five or six checkboxes, and you can check any of them in any combination, so that's a lot of combinations. And there are thousands of checkboxes in Word for settings and things like that, so if you want to test everything, that's just huge. So they decided to look at something smaller. There isn't a lot of information on how they came to the conclusion that two is enough; they just said two is enough, let's do this, and it has been proven in practice that it works.
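To put rough numbers on that anecdote, here is an invented six-checkbox dialog run through the pairwise_cases sketch from earlier. The checkbox names are made up and this is not the actual dialog from the Microsoft article; it only illustrates how fast exhaustive combinations grow compared to a pairwise set.

```python
# Six independent boolean checkboxes, names invented for illustration.
dialog = {f"checkbox-{i}": [False, True] for i in range(1, 7)}

print(2 ** 6)                       # 64 exhaustive combinations for one dialog
print(len(pairwise_cases(dialog)))  # typically under a dozen pairwise cases
```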
How much time have we got? Oh, ten minutes left, okay.

Okay, the question was whether I have a tool or whether I was using a script. I mentioned there are many tools for different languages. In my particular case I did not use the official Microsoft tool, because you need to compile that one, and I'm a Python person, I don't like to compile stuff; you feed it a text file with the parameters and it prints the matrix of combinations. I used the Ruby pairwise gem, which can also be used from Ruby scripts and integrates well with the Ruby testing tooling. I fed it lists of my test cases and the parameters for each of them, described declaratively, I have this and this and this, and let Ruby do the work and output the list of cases and parameters I need to execute for every single build. Actually, I noticed the Ruby tool doesn't apply the pairwise algorithm entirely correctly: sometimes it produces duplicate entries. But that isn't that big a problem, it's maybe one or two percent of the time. And he says there are online tools to help with this as well; yes, there are tools for every popular programming language out there.

Okay, a question in the back. Can you repeat that? I don't quite understand what you mean. Okay, so the question is whether there were bugs that were not found the first time, but which I then tested with a different combination of parameters and found the second time. Well, I don't have that information, because if I found something in, let's say, build number 2, I would have to go back to build number 1 and try to reproduce it there, and I don't know whether it was introduced in build 2 or was there before. It is possible to do, but that's a lot more work to get the data and validate the statement. Did I answer you? Okay.

Okay, so the question is about race conditions, bugs that depend on timing, and whether I have experience with them in the context of pairwise. First of all, during installation we pretty much never have these kinds of problems. At one point in time the installation program was, I think, multi-threaded, but currently it's mostly single threaded, because it was very hard to debug these things in the installation environment: you have to burn ISO media, reboot systems, and so on, and it takes a lot of time simply to get to a point where you can debug something and figure out what's going on, and then it may not be reproducible. That said, if you do desktop testing or something else that can be easily tested and reproduced, then I guess you can have problems with race conditions. The comment is that race conditions happen kind of randomly, so you have no idea whether or not pairwise affected whether the race condition occurred. Well, I tend to disagree with that, because if I run my tests all the time but on the same system, with the same load, in a kind of predictable environment, the race condition may never happen.
So the question is whether we have evaluated the associated code changes to govern parameter selection in testing. We do this when we design test cases, for example for new functionality: we have bugs, we try to evaluate them, and when we need to create new test cases we do that kind of analysis. But that was not part of the experiment; I just took whatever existed at that point in time, because I knew the team was going to use the same set of test cases that had already been created before the release, so I just used those. On the other hand, yes, now that we know how pairwise works, we can think more in terms of parameters to testing, because honestly, until now we haven't thought a lot in terms of parameters. If the parameters were obvious, we put them into testing; and if they weren't that obvious, you had to go to the source code and see what's happening, or have a customer tell you that yes, this is a problem in one particular environment, so we can add parameters for that. You also have to realize that the existing test suite is the product of many, many years of development of the installation and upgrade test suite; it's not something we came up with just before the release, so it's based on a lot of historical data.

Okay, if there are no more questions, thank you for being here.