Hello, all. Good day. I hope you're all doing well wherever you are and keeping safe. I'm Braha Dambarshanivasan. I work at IBM as a support engineer for Linux on Power, supporting internal test teams as well as distros and customers. Thank you for this opportunity to talk about quality in Linux. This is a call for action on where we are currently with Linux development and testing efforts, what some of the pain points are, and the need to improve the overall quality of the Linux kernel. I'm going to talk about the current development and testing processes that we already have, and what we can do better to ensure improvements in the quality of the kernel. So let's dive right in.

Linux kernel development is fast, and with the development community being so distributed, quality seems to get compromised sometimes. It can, however, teach great lessons in terms of diversity and inclusion, but that would be a topic for another day. The Linux kernel project has grown over the years and become very big, and that is all possible because of a common set of standard processes across the entire kernel developer community. The kernel has over 27 million lines of code, with over 15,000 developers from across 1,400 companies contributing to it since 2005. That is when we got Git, so we can keep better metrics. Every release has a short development cycle, and thousands of commits and hundreds of thousands of lines of code get into every release. As an example, in 2019 we had 75,000 code commits coming into the kernel, and that was the lowest since 2013. That gives an idea of how vast this project is.

Let's look at the development process in a little more detail. It all starts with a patch, or a series of patches in a patchset. There are thousands of contributors, including people in academia, students, professionals in various industries, and definitely you and me. The process works on a develop-fast-and-release-often policy, and every code change is done through a patch. A patch represents the smallest unit of code change that can fix a problem, introduce an enhancement, support a specification or feature, and so on. The patch is sent by the developer to the mailing list and gets reviewed by others on the mailing list as well as by maintainers. Reviewers may or may not test a patch, but the maintainers definitely have more responsibility, so to ensure that the patches they pick up don't break anything, they try to test the patches they pick up for the next release together, though not individual patches. After the various rounds of reviews and reworks are through, the maintainer may choose to pick the patch up and pull it into their repo. If it is a feature, it first gets into the linux-next repo, where more builds and tests are conducted to ensure that it does not break any existing functionality, and then the subsystem maintainers will choose to pull it in and send it forward, on up to Linus's tree. It goes through the hierarchy of maintainers. If it is a bug fix, it could get into the mainline without going into the linux-next repo. The process is kept seamless by many maintainers, and there are hierarchies of maintainers before the code gets into Linus's tree.
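To make that patch flow concrete, here is a minimal sketch, in Python for illustration, of the checks a developer typically runs before mailing a patch. checkpatch.pl and get_maintainer.pl really do ship in the kernel source's scripts/ directory; everything else, including the outgoing/ directory name, is my own choice.

#!/usr/bin/env python3
# Hedged sketch of the pre-submission steps behind the patch flow above.
# Assumes it runs inside a kernel git tree.
import subprocess
import sys
from pathlib import Path

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main():
    # Turn the last commit into a mail-ready patch file.
    run(["git", "format-patch", "-1", "-o", "outgoing/"])

    # Run the kernel's own style and sanity checker on each patch;
    # it exits non-zero when it finds problems.
    for patch in sorted(Path("outgoing").glob("*.patch")):
        result = subprocess.run(["./scripts/checkpatch.pl", str(patch)],
                                capture_output=True, text=True)
        print(result.stdout)
        if result.returncode != 0:
            sys.exit(f"{patch}: fix checkpatch complaints before sending")

    # Sending is normally done with `git send-email` to the addresses
    # reported by ./scripts/get_maintainer.pl; omitted here.

if __name__ == "__main__":
    main()

In practice most developers drive this with plain git and shell; the point is only the order: format the patch, check it, then send it.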
There are multiple subsystems too; depending on the component your proposed patch touches, you could end up interacting more with the first layer of maintainers of that subsystem than with others in the open source development world. We see that there are different ways in which every maintainer picks up patches. Some might know the code so well that they get intuitive about which patch would work and which wouldn't, while others may stick strictly to rules. So everyone would probably give you a different set of best practices that they follow and a different checklist that works for each of them.

Another amazing thing about this development process is the ability to add new features or enhancements or implement things without anyone really directing the development process. But then the question arises: do bugs that are breaking major functionality get worked on with the same urgency as some enhancement that a developer is really passionate about? I feel that the answer to this is no. Since everyone usually picks up what they want to work on, the sense of urgency felt for a bug would probably not be shared by everybody across the development world. All bug fixes still need to follow the whole cycle of patch acceptance upstream before a distro will agree to pull them in. But with an open source development model, customers could patch their code themselves, if they have the skills to do so, to work around critical problems. Additionally, we could have the Linux distributors provide test fixes in a very short period of time to mitigate the upstream process. Of course, such a fix would not be considered final, but it can get the customer unblocked and up and running again.

So, as you can see, it can take a while to get a patch into the upstream repo. Distros always choose to be at a lower version than upstream to minimize risk. For that same reason, unless a patch is upstream, distros will not pick it up to backport it to the version that they are on, and each distro selects a different version and adapts and commercializes that version. The distros take the patches through their own downstream integration, their own regression testing, system integration testing, and security and performance tests. The end users of Linux are most likely going to pick a distro version or one of the stable Linux versions to run their systems on, and this is when they are going to actually hit a bug which got missed in all of this cycle before reaching them. Such a bug could be either breaking existing functionality or just failing to perform an intended function. These bugs get reported on various different Bugzillas, and many are reported on kernel.org too. Getting these resolved is as important as adding new features or enhancements.

Now, on a different aspect of bug resolution: it would be great to have a prescribed set of logs to be collected as part of a first failure data capture (FFDC) process. For example, as part of process improvements, my team was involved in defining these for internal testing at IBM, and we found approximately a 30% decrease in requests for more information on bugs, which helped us resolve the bugs faster. Could we in the community help come up with a comprehensive list of logs as part of FFDC for every subcomponent? Tools like sosreport or supportconfig exist, but they are all distro specific, leaving upstream with no standard tool to collect information. We need to either provide them as part of the kernel source, with the owners' permission of course, or develop similar tools to ensure that all logs are captured at the time of the first instance of failure.
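As a strawman for that discussion, here is a minimal FFDC collector sketch in Python. The command and file lists below are a generic baseline I picked for illustration, not an agreed community list; the real value would come from the community agreeing on a per-subsystem list.

#!/usr/bin/env python3
# Minimal first-failure-data-capture (FFDC) sketch: snapshot a baseline
# set of logs the moment a failure is seen, then archive them so the bug
# report carries the data with it.
import subprocess
import tarfile
import time
from pathlib import Path

COMMANDS = {
    "dmesg.txt": ["dmesg"],
    "uname.txt": ["uname", "-a"],
    "lsmod.txt": ["lsmod"],
    "journal.txt": ["journalctl", "-b", "--no-pager"],
}

FILES = ["/proc/cmdline", "/proc/meminfo", "/proc/interrupts"]

def capture(outdir: Path):
    outdir.mkdir(parents=True, exist_ok=True)
    for name, cmd in COMMANDS.items():
        try:
            out = subprocess.run(cmd, capture_output=True, text=True).stdout
        except FileNotFoundError:
            out = f"{cmd[0]} not available\n"
        (outdir / name).write_text(out)
    for f in FILES:
        try:
            (outdir / Path(f).name).write_text(Path(f).read_text())
        except OSError:
            pass  # file may be absent or need privileges; skip it

def main():
    outdir = Path("ffdc-" + time.strftime("%Y%m%d-%H%M%S"))
    capture(outdir)
    with tarfile.open(str(outdir) + ".tar.gz", "w:gz") as tar:
        tar.add(outdir)
    print("collected", str(outdir) + ".tar.gz")

if __name__ == "__main__":
    main()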
So then there is the other side of the whole story, and that is the world of testing. That's equally important, or maybe more important. The kernel community claims active long-term support for about six versions of the kernel, if I'm not wrong, currently. But those would probably be the base versions for the distros. The distros themselves support a lot of versions at any given point of time, and would also merge or backport lots of different patches from newer versions of the upstream kernel to support their wide customer base. So there is an ever-increasing customer base adopting Linux, each with varied requirements and configurations, with varied preferences for different architectures. The architectures themselves have their own supported versions, and with the way the kernel or the memory or storage is configured for each of them, along with the different application workloads that run on top of the OS, the matrix of everything that needs to be tested becomes very, very complex. Add to that the fact that more features are constantly coming into the Linux kernel, and it just adds more to that complexity. In my opinion, testing is not happening at the same fast pace at which the development side is providing new code. A tester may pick up the latest kernel version to add test cases on, and by the time the test cases are developed and pushed into a test suite, the upstream kernel could be several commits ahead, with more features and bug fixes added.

Now, ideally, we would like testing to cover the following to ensure every aspect of the kernel is tested. Or so I think, and this probably is not a complete list. We want automated tests to ensure the commits coming in don't break kernel builds. We want to add test coverage for all the thousands of new lines of code coming in every week. We want to test for regressions, that is, to ensure that existing functionality is not breaking; the faster regressions are discovered, the easier it is for the kernel community to fix and resolve them. We want to ensure that every piece of hardware, on every supported Linux kernel version, maybe even with enterprise software application workloads on top of that, is tested; that matrix is extremely complex. And we want the kernel to be tested for performance and security too. And we want to automate everything. Completing this list is like winning the lottery: extremely unlikely, right? We already do pretty well on the build tests. Hardware companies do test Linux on their own products, and the distros also test on most hardware. We can improve on code coverage and on testing for regressions, performance, and security; I will talk about that more in the next slide.

So it's a given that not everything in that matrix we talked about gets covered. There is currently no single test suite that covers it all either. Making sense of the testing picture is not possible when we don't have all the information in one centralized place about what testing is already happening. How can we solve a problem without knowing all the variables in it? With the distributed contributor base, it gets all the more difficult to ensure we are addressing this. So we see that it's an almost impossible matrix to test, and not everything is going to be tested.
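On finding regressions faster: one existing tool worth leaning on is git bisect, which can walk the commit history automatically once you give it a known-good point, a known-bad point, and a reproducer. A minimal driver sketch in Python, with ./repro.sh as a hypothetical reproducer script that exits 0 on a good kernel and non-zero on a bad one:

#!/usr/bin/env python3
# Sketch: letting `git bisect run` hunt a regression automatically.
# GOOD and BAD are illustrative placeholders you have verified by hand.
import subprocess

GOOD = "v5.8"   # illustrative: last point known to work
BAD = "HEAD"    # illustrative: first point known to be broken

def git(*args):
    subprocess.run(["git", *args], check=True)

def main():
    git("bisect", "start")
    git("bisect", "bad", BAD)
    git("bisect", "good", GOOD)
    # git now checks out midpoints and runs the reproducer at each step;
    # exit code 0 marks a commit good, 1-127 bad (125 means skip).
    git("bisect", "run", "./repro.sh")
    git("bisect", "reset")

if __name__ == "__main__":
    main()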
So how can we help this situation? Going back to the patch process: it is expected, or it's an unspoken rule, that some unit testing is done before sending a patch in. How many developers really follow this unspoken rule? There is always going to be a small percentage of developers who send patches without enough testing done, which can cause problems higher up in the process. We come to this conclusion because we do hit build breaks and we do see regressions from time to time, and a lot of effort and time is put into resolving those bugs. We need to put a check on it. We need to ensure that a patch does not get picked up by a maintainer when a list of tests has not been run on it, and we really need the maintainers to try and enforce this. The tests should be things like unit tests, build tests, and sanity tests. Your kernel should not only build with your patch, but you should also boot into it. And some regression tests should be run, at least on the subsystem where the code is being changed, at least on the hardware that the developer is developing on. I think it would be a good idea to develop those unit tests such that they can be added as one of the selftests for that subcomponent. Possibly one other option would be to add a document with all the relevant test cases that could be run for testing that subsystem in a file, possibly a TESTME, like a README file for that subsystem (a sketch of this idea follows below). Similar to how we maintain a list of maintainers for a subsystem, we could maintain a list of tests for that subsystem. Testers could then jump in, add their test suites, and say, hey, I have added something for this. That would gather the information about what testing exists for a particular subsystem all into one place. That is possibly one solution we could think of.

Maintainers do hold developers responsible for fixing any bugs that their patches might cause. So why not hold them responsible for ensuring quality and doing their bit of the testing, avoiding many of those initial bugs and saving the corresponding time and effort that goes into fixing them? We also need to find better ways to hit regressions, to find those regression bugs as early as possible in the development cycle, ideally before the patch is sent out to the mailing list, or at least before it makes it to the mainline. We need to simplify and automate existing test suites so that they can be set up and run easily, so that more people would be open to testing as they develop, and so that when new functionality gets added to the kernel, we also have a way to test it. One suggestion would be, again, to develop the unit test cases such that they can go into the selftests or be added to a regression test suite. Another possibility would be to start discussions on the mailing list to get others to add relevant tests to the subsystem for a feature that you're working on. I'm not saying we're not doing enough testing; I just feel we can leverage more across the community to do better with the testing that we already are doing.

So let's see what is currently out there. We have automated build tests run internally in many companies like Intel, Red Hat, and others. A lot of effort has already gone into the creation of these complex test suites, but it's mostly private to these companies and hence unavailable for the rest of the community to pick up, add more to, or make better. So can we leverage the existing tests to ensure the code builds on all platforms? It would be good to have that.
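Here is the TESTME sketch mentioned above. To be clear, nothing like this exists upstream today; the file name, the one-entry-per-line format, and the example entries are all invented to illustrate how a list of tests could sit next to the code the way MAINTAINERS does:

#!/usr/bin/env python3
# Sketch of the hypothetical per-subsystem TESTME idea.
import subprocess
import sys
from pathlib import Path

EXAMPLE = """\
# TESTME for a file system subsystem (illustrative entries)
selftests: make -C tools/testing/selftests TARGETS=filesystems run_tests
xfstests: ./check -g quick
"""

def parse_testme(path):
    # Each non-comment line is "name: command".
    tests = []
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, cmd = line.partition(":")
        tests.append((name.strip(), cmd.strip()))
    return tests

def main(testme="TESTME"):
    path = Path(testme)
    if not path.exists():
        path.write_text(EXAMPLE)  # demo only: drop in the sample file
    failures = 0
    for name, cmd in parse_testme(path):
        print("== running", name, "->", cmd)
        if subprocess.run(cmd, shell=True).returncode != 0:
            failures += 1
    sys.exit(1 if failures else 0)

if __name__ == "__main__":
    main()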
There are efforts for testing of performance and security aspects too, but again, most of it is probably internal to many companies, and the results are rarely published to the larger community. One starting point towards opening things up would be to ensure that these test results are published; and if we could make those tests open source, more people could contribute towards making this performance and security testing more robust and help add more automation too. It might be possible to add automated testing capabilities to pull in only the relevant tests for the files or subsystem that the patch is touching (one way to do this is sketched below), or we could add automation to run specific kernel unit tests or selftests, check for test coverage of the code, and more. But since a lot of this testing is private, or at least unknown or not so well known in the community, some of this testing becomes redundant and there is duplication of effort. If most of the tests are being done internally, that may mean that most companies are doing similar testing but are unaware of each other's results, so it would be a good thing to create a synergy around such testing efforts.

Then we do have LTP, the Linux Test Project. This is probably the one test suite that sees a lot of contributions, and it can be run for comprehensive kernel testing, but it doesn't test everything; I have seen that this test suite doesn't have tests for all subsystems. It's also complex, and usually developers aren't going to run it to test their code. We see the need for other test suites like xfstests, which is specific to testing file systems, or maybe even the blktests suite, which is specific to block IO related tests, and other similar test repos that are all spread out over the community; to consolidate the test cases for specific subcomponents, we see similar repositories appearing.

We also have fuzzing projects. Fuzzing is a very powerful testing technique whose intention is to find bugs very fast with semi-random input, and it is especially useful in finding memory corruption bugs. But fuzzing also finds a lot of false positives, and it throws out a lot of bugs too, so it gets difficult to determine the valid bugs among all of those found through fuzzing.

There are also projects like the Linux Foundation's KernelCI project that help test building and booting the master tree, the stable trees, and some maintainer subtrees on various architectures. They are also working towards consolidating testing initiatives, and this is one place we could definitely think of collaborating on and using as a central forum to list out all the existing test tools that we already have. We could also have conversations there about what we can do better. But we don't have a comprehensive list of what test suites are out there and what testing is happening, not in one place; we need to get it all in one place to decide how we are doing on the testing side. Not every test effort is known to others looking for something similar; that is why we are here discussing this whole topic. There is the issue of test cases too: how do we collaborate more on common test cases? We do need a way to ensure there are projects that help bring contributions from testers into one place and also help avoid redundant test case development efforts. A more open discussion on what is being tested and what the results are would help clarify a lot of the current haze around Linux testing. So can we start those discussions on a central mailing list where testing can be discussed more openly?
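And here is the promised sketch of pulling in only the relevant tests for a patch. The mapping from source paths to suites is entirely illustrative; a real one could be generated from MAINTAINERS entries or from per-subsystem TESTME files like the ones proposed earlier:

#!/usr/bin/env python3
# Sketch: pick only the test suites relevant to what a patch touches.
import subprocess

SUITE_FOR_PREFIX = {
    "fs/": "xfstests",
    "block/": "blktests",
    "mm/": "ltp (memory management groups)",
    "net/": "ltp (network groups)",
}

def changed_files(rev_range="HEAD~1..HEAD"):
    # Ask git which files the patch (here: the last commit) modifies.
    out = subprocess.run(["git", "diff", "--name-only", rev_range],
                         capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line]

def suites_for(files):
    picked = {suite
              for f in files
              for prefix, suite in SUITE_FOR_PREFIX.items()
              if f.startswith(prefix)}
    return picked or {"build + boot sanity"}  # minimal fallback

if __name__ == "__main__":
    files = changed_files()
    print("changed files:", files)
    print("suites to run:", sorted(suites_for(files)))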
Then there is also the need to ensure that workloads for large enterprise customers run fine on Linux. These customers are large and complex; they run large and very complex business workloads and cannot have downtime, as it would impact critical applications and transactions for a large number of people. So they expect stability and support, and there are different workload vendors that spend a lot of effort to certify on the different hardware platforms. There is also the bigger challenge of machine access for large machines or large configurations. We need the companies that produce hardware, and I am including my own company here, to actively contribute and collaborate, freely providing their systems when the community needs them for software work; they shouldn't expect the community to help provide software with no access to hardware. There has to be a larger collaboration for everyone to contribute and to make Linux better.

So, to summarize the problems: there are gaps in current testing by developers, probably in unit testing and sanity testing, and maybe even some more regression testing that needs to happen. There are gaps in system testing, integrated systems testing, performance testing, and security testing. And there are overlaps in testing, both in test efforts and in test case development efforts; there is duplication of effort all across. These are the problems.

What can we do to address these gaps? We should address the gaps in current testing by developers by enforcing early testing in the development cycle, and the maintainers need to take a call here. We should address the gaps in system testing by simplifying the usage of complex test suites, listing out the tests for subsystems, getting testers to add what tests they have developed to that list, and adding tests for new functionality as close as possible to the development cycle, so that as soon as your code gets into the repo, there are test cases waiting to start testing it. We should address the overlaps in testing by creating synergy and collaboration between developers, testers, distros, and end-user companies, and by consolidating all testing efforts. We should ensure that, like the development cycle, testing also is streamlined to follow a standard procedure or template, and publish testing repositories in a centralized place so everyone knows where to go to find testing tools.

In conclusion, the developers, testers, distributors, companies selling hardware for Linux, and companies using Linux are all in this together. The quality of Linux is the responsibility of everyone involved in it and using it, and we can ensure better quality if we collaborate better. Hopefully this would be the beginning of those important conversations that we should be having. That's all I had for the topic itself; I've thrown in some references here that caught my eye while I was preparing for this talk.