Hello, everyone. I'm Kenichi Omichi from NEC. Today, I'm going to discuss how to solve flaky tests in open source projects. First of all, I'd like to introduce myself. I have contributed to multiple open source projects in my career, like Kubernetes, OpenStack, and the Linux kernel. In open source projects, my role is always to solve bugs and create debugging tools to investigate issues easily. For example, I created a crash dump tool in the Linux kernel community. I was a project lead of the OpenStack QA project, and now I'm an approver of the testing area in the Kubernetes community. Here is today's agenda. Let's get started.

I want to introduce the overall trend of open source development. Recently, open source projects like Kubernetes, OpenStack, and so on tend to be developed with continuous testing in a CI pipeline. The OpenStack community uses Zuul to implement this pipeline; we can see patches in a queue waiting to be merged into the mainline of OpenStack, like this. The Kubernetes community uses Prow for the same purpose. When proposing a single change as a pull request or a patch, the change needs to pass a test set. In general, the test sets consist of a coding style check, a build check, unit tests, integration tests, and e2e (end-to-end) tests. A unit test exercises a single function. An integration test exercises a combination of functions, or integration with some specific service like a database. An e2e test verifies the behavior of the deployed system through API calls. For example, in the Kubernetes CI system, we deploy a Kubernetes cluster with kind (Kubernetes in Docker), and the e2e tests run against this deployed Kubernetes cluster through Kubernetes API calls. The same thing is implemented in the OpenStack CI system: we deploy a small OpenStack cloud with DevStack, which is a deployment tool for development and testing, and then the e2e tests, which are called Tempest, run against this deployed OpenStack cloud.
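To make the three test levels concrete, here is a purely illustrative Python sketch. All function and class names here (parse_replicas, FakeStore) are made up for this talk, not real Kubernetes or OpenStack code; a real e2e test would instead drive a deployed cluster through its public API.

```python
# Hypothetical example: the same feature exercised at different test levels.

def parse_replicas(value: str) -> int:
    """Code under test: parse a replica count from user input."""
    n = int(value)
    if n < 0:
        raise ValueError("replicas must be non-negative")
    return n

# Unit test: one function, no external dependencies.
def test_parse_replicas_unit():
    assert parse_replicas("3") == 3

# Integration test: the function combined with a (fake) storage backend.
class FakeStore:
    def __init__(self):
        self.data = {}

    def save(self, key, value):
        self.data[key] = value

def test_save_replicas_integration():
    store = FakeStore()
    store.save("deployment/web", parse_replicas("5"))
    assert store.data["deployment/web"] == 5

# An e2e test would instead deploy the real system (e.g. with kind or
# DevStack) and verify behavior through its API calls, which needs a
# running cluster and so is out of scope for this sketch.

test_parse_replicas_unit()
test_save_replicas_integration()
print("unit and integration levels passed")
```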
In general, e2e tests are run in multiple combinations: the operating system is Ubuntu or CentOS, IPv4 only or IPv6 supported, and so on. When proposing a single pull request to Kubernetes, many test jobs are run, like this. When proposing a change to an OpenStack project, there are also a lot of test jobs run, like this. We need to pass many tests for each single change before merging it into the mainline, on both open source projects.

Here, I need to explain the small test sets which are related to this session's main topic. In both the Kubernetes and OpenStack communities, we define a small test set out of all the e2e tests. This small test set is called the conformance test on Kubernetes, and it is called the interop test on OpenStack. What is the difference between all the e2e tests and this small test set? These systems have two types of features: core features and extensible features. E2e tests are implemented for the corresponding features, so all the e2e tests together cover both core features and extensible features. The small test set contains only tests for core features. For example, the create-a-Pod API is one of the core features of Kubernetes, and we expect that this feature should be available on any Kubernetes cluster; the e2e test of this core feature is implemented as part of the conformance test. On the other hand, we don't expect IPv6 support to be available on any Kubernetes cluster, so this IPv6 feature is covered by a non-conformance test, like this circle shows. The OpenStack community does the same thing: we have some Tempest tests for core features, and we define those tests as interop tests. We can tell which tests are conformance tests by checking whether the actual e2e test code contains the conformance label, like this. This is actual e2e test code, and we can see this e2e test, "should run through the lifecycle...", is part of the conformance test because of this label. We can tell which tests are interop tests with the interop code repository. This is a Tempest test name.
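The conformance labeling described above can be sketched as a simple filter. The test descriptions below are shortened, made-up examples; only the "[Conformance]" tag convention itself comes from the Kubernetes e2e test code.

```python
# Kubernetes marks conformance e2e tests by including the "[Conformance]"
# tag in the test description; selecting the conformance subset is then
# a matter of filtering on that tag.

tests = [
    "[sig-api-machinery] Pods should run through the lifecycle [Conformance]",
    "[sig-network] should support IPv6 dual-stack",  # extensible feature
]

conformance = [t for t in tests if "[Conformance]" in t]
print(len(conformance))  # 1: only the core-feature test is conformance
```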
If the interop repository contains the Tempest test name, that test is part of the interop test set.

By using this continuous testing pipeline, we can detect changes which break existing features and block merging such backward-incompatible changes into the mainline. That is a huge merit: we can always keep the release readiness of the open source software in the community. Both communities release new versions by a due date; that means a time-based release, so it is mandatory to keep the software ready at any time with this pipeline. Also, reviewers don't need to look at changes which don't pass those tests. That reduces the code review workload on reviewers, which is another big merit of this testing pipeline.

Who can run CI test jobs? That depends on who submits the pull request. The Kubernetes community has a member structure like this: non-member contributor, member, reviewer, approver, sub-project owner. If you are a member or higher and you submit a pull request to Kubernetes, the test jobs will run automatically for your pull request. However, if you are a non-member contributor, test jobs are not run automatically, so you need to get help from an existing member to run test jobs for your pull request. If you want to get more involved in the Kubernetes community, it is good to become a member. On the other hand, in the OpenStack community, CI test jobs run automatically for any contributor, so you don't need help from an existing member to run test jobs for your own changes. There are some differences between the two communities, like this.

Now I'm going to explain flaky tests, the main topic. Flaky tests are tests which produce unstable results. That means a flaky test passes successfully in most test runs, but sometimes it fails. Even if you run the same test on the same code, the results differ for a flaky test.
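A toy sketch of what "same test, same code, different result" can look like. The load model and all names here are made up; the point is only that the outcome depends on how busy the CI machine is, not on the code under test.

```python
# A timeout-based test flakes because its result depends on the
# environment: the same test on the same code passes on an idle
# pipeline and fails on a loaded one.

def operation_seconds(load_factor):
    # Pretend the operation takes longer when the pipeline is busy
    # (made-up model: time scales linearly with load).
    return 1.0 * load_factor

def timeout_test(load_factor, timeout=2.0):
    return operation_seconds(load_factor) <= timeout

print(timeout_test(load_factor=1.0))  # True: passes on an idle pipeline
print(timeout_test(load_factor=3.0))  # False: same test fails under load
```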
For example, suppose some specific test in the conformance test fails 5% of the time, and the Kubernetes community runs five test jobs which each contain the conformance test. Then the failure ratio becomes about 25% for each single pull request, so about 25% of pull requests cannot pass CI successfully. Those pull requests are not reviewed and are never merged into the mainline. Such a situation makes contributors frustrated.

So I'd like to explain some reasons for flaky tests. One common reason is a timeout due to heavy workload on the testing pipeline. The testing pipeline is implemented on virtual machines which are deployed on a public cloud. When a lot of pull requests are proposed at the same time by different developers, a heavy workload lands on the network, disk I/O, and so on. Such a workload makes the test run time longer than usual, and then some test runs out of its expected time. This is a very common root problem just before a code freeze deadline: many contributors try to merge their own code into the mainline before the code freeze of some specific version, and a heavy workload hits the testing pipeline at that time. This situation is really frustrating, because many developers want to merge their own pull requests into the mainline, but the test jobs fail for an unrelated reason and waste their time.

Actually, in Kubernetes 1.20 development, one flaky test happened on a pull request related to a conformance test due to this kind of workload. I'm picking up that flaky test here. In this case, this red mark shows the failed test. This is actually a flaky test, and at the time, the workload of the CI pipeline had a spike like this. This flaky test happened due to the heavy workload, and that made it difficult to decide whether this pull request could be merged into the mainline at the time. In the end, the pull request itself was fine and was merged by the reviewer.

Another common reason for flaky tests is a lack of test isolation.
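The "5% flake becomes roughly 25% per pull request" arithmetic from above can be checked directly. Treating the five jobs as independent, the chance that at least one fails is 1 minus the chance that all five pass.

```python
# Back-of-the-envelope arithmetic: if 5 independent test jobs each
# include a test that flakes 5% of the time, the probability that at
# least one job fails (blocking the pull request) is 1 - 0.95**5.

per_test_flake = 0.05
jobs = 5
pr_failure = 1 - (1 - per_test_flake) ** jobs
print(f"{pr_failure:.1%}")  # 22.6%, close to the ~25% quoted in the talk
```

The naive estimate 5 x 5% = 25% slightly overcounts runs where more than one job flakes at once; the exact independent-job figure is about 22.6%.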
Each test calls APIs of the system and checks whether the returned responses are as expected. Those operations take time, so the tests are run in parallel to reduce the total test run time. However, some tests can affect another test's result when running in parallel. For example, some test checks the resource usage of a specific node by putting a workload on it from within the test. If another test puts its own workload on that node in parallel, the actual resource usage becomes different from what the first test expects, and that test fails. To avoid such situations, we implement test isolation where necessary, but sometimes we forget to do that. This is a bug on the test side.

Another simple reason is that some website is down. The testing pipeline downloads a lot of software, libraries, and container images from external websites. If some website is down, we face test failures in some test jobs. That is also a bad situation, because we cannot solve this kind of issue directly; it is out of our control. Other than those reasons, there are real bugs on the software side: lack of locking, mis-synchronization between components, and so on. We need to investigate those issues deeply and solve them.

As I explained before, if a flaky test happens, part of the test jobs fail. Then the code review is not done for the pull request, and it is never merged into the mainline. To avoid such situations, we can re-run test jobs manually: by writing "recheck" on Gerrit for OpenStack, the test jobs run again. The same thing is implemented on the Kubernetes side: we can re-run test jobs by writing "/retest" on the Kubernetes GitHub. For example, I'm going to pick up this change here. This is a very simple change, a typo fix in some error message. For this pull request, the test jobs should not fail at all, because this is just a typo fix and it should not affect any test results.
But actually this pull request also needed to re-run the test jobs with this "/retest". Then all the test jobs are green now, like this. On the other hand, when some test jobs fail, some developers write "/retest" without checking the details of the test failure. However, sometimes those test failures are due to the pull request itself, without any flakiness; that means the pull request contains a bug. For example, a pull request may contain an invalid coding style, a lack of imports, a careless mistake in a unit test, and so on. In this case, even if we run the test jobs many times for that pull request, they always fail. Unnecessary test jobs waste the cloud operating cost for running the testing pipeline, and also waste developer time, because developers need to wait for the test jobs to finish. To avoid running unnecessary test jobs, we need to check the reason for failed test jobs. In general, the coding style check, build test, and unit test jobs tend to be stable, so if those test jobs fail, we should check the failure reason especially carefully.

Here I'm going to explain how to solve flaky tests. The first thing we need to do is distinguish flaky tests. On Kubernetes, we can see the failure detail via the "Details" button on the failed test job. Then Prow shows which test failed; in this case, the "API priority and fairness should ensure..." test failed. Then we need to check whether our own pull request is related to the failed test or not. If it is related, we need to fix our mistake in the pull request; we should not re-run the test job before fixing it. If the pull request is not related to the failed test, the failure should be due to flakiness, and we can re-run the test job in this case. We can search related issues on the Kubernetes GitHub to see whether someone has already reported the same flaky test. The Kubernetes community also provides a dashboard, Testgrid, to show how frequently each test fails. I'm going to show this Testgrid dashboard here.
This is a list of tests. We can see some red columns here; a red column means a failed test. In this case, the "API priority and fairness should..." test failed sometimes here. If we click this red cell, we can see the details of the failed test, like this. OpenStack also provides this kind of dashboard, OpenStack Health. We can see the failures here. If we are facing some test failure, we can check the current situation of flaky tests on these dashboards.

I said one common cause of flaky tests is heavy workload on the testing pipeline. One of the solutions is running only the necessary test jobs for each change. If a pull request changes unit test code only, we don't need to run integration tests and e2e tests at all. If another pull request changes documents only in the code repository, we don't need to run e2e tests and integration tests. We should customize which test jobs are run for which kinds of changes in pull requests. On OpenStack, we configure the test jobs in each project. For example, this file is from the OpenStack Nova project; each project repository contains a .zuul.yaml file for controlling test jobs. This test job configuration contains "irrelevant-files", which specifies the files whose changes don't need that test job to run. In this case, the test job name is nova-dsvm-multinode-base, and the irrelevant-files list is here: it specifies RST files, which are documentation, and some hacking and unit/functional test paths. So if you submit a unit test change to the OpenStack Nova project, this nova-dsvm-multinode-base job is not run at all. Unit test and integration test changes are merged into the mainline without facing e2e flaky tests at all, so we can merge those changes without flakiness. That is really good for developers.

Another solution is to investigate deeply, again and again. If you decide to solve flaky tests by yourself, that is great for the community.
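The irrelevant-files idea can be sketched as a small path filter: skip a job when every changed file matches a "does not affect this job" pattern. The patterns below mirror the kind of entries discussed for Nova (docs and unit test paths) but are illustrative, not copied from the real .zuul.yaml.

```python
# Sketch of Zuul's irrelevant-files matching: a job runs unless ALL
# changed files in the proposed change match an irrelevant pattern.

import re

IRRELEVANT = [r".*\.rst$", r"^doc/.*", r"^nova/tests/unit/.*"]

def job_should_run(changed_files):
    return not all(
        any(re.match(p, f) for p in IRRELEVANT) for f in changed_files
    )

print(job_should_run(["doc/source/index.rst"]))     # False: skip the job
print(job_should_run(["nova/compute/manager.py"]))  # True: run the job
```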
You will help many developers by solving flaky tests. The first thing we need to do is investigate logs, a lot of logs. We need to check the timestamps between sending the API request and receiving the API response from the target system. Then we gather the relevant logs from multiple components for that time span, and we should try to describe the failure scenario based on those logs. Sometimes the logs contain unrelated lines, because e2e tests run in parallel, so we should filter out those lines to concentrate on the specific API operation. When we face a lack of logs and need to investigate more, we need to submit another pull request to add more logs to the e2e tests and the components. That is also really important for solving issues in production at the end of the day. Anyway, if you find the root problem and a solution, please submit a pull request for it.

As I explained before, it is hard to investigate e2e test failures; it takes much effort and time. Especially on distributed computing platforms like Kubernetes and OpenStack, we need to investigate the logs of multiple components for a single operation, because a single operation is implemented as a combination of multiple components. In addition, we need to investigate the corresponding code of each component to check where the root problem happened in the actual code.

More coverage in unit tests and integration tests can address this drawback of e2e tests. For an e2e test job, we deploy the actual system and then run e2e tests against the deployed system. On the other hand, unit tests and integration tests don't need a deployed system. If a bug is caught by a unit test, we just need to investigate a small piece of code, which reduces our effort and time to find the root problem. As a result, many small bugs can be fixed by unit tests and integration tests, and we can reduce the overall failure ratio of e2e tests. There are two blog posts about this: one is "The Practical Test Pyramid" by Ham Vocke, and the other is "Just Say No to More End-to-End Tests" from the Google Testing Blog.
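The timestamp-based log narrowing described above can be sketched as follows. The log lines, components, and timestamps here are all made up; the technique is just keeping lines that fall between the request being sent and the response being received.

```python
# Narrow multi-component logs to one API call: keep only lines whose
# timestamp falls within the request/response window, filtering out
# noise from other tests running in parallel.

from datetime import datetime

def parse_ts(line):
    return datetime.fromisoformat(line.split(" ", 1)[0])

logs = [
    "2020-12-01T10:00:01 scheduler: unrelated parallel-test event",
    "2020-12-01T10:00:05 apiserver: POST /pods received",
    "2020-12-01T10:00:06 kubelet: starting container",
    "2020-12-01T10:00:09 apiserver: POST /pods -> 201",
    "2020-12-01T10:00:15 scheduler: another unrelated event",
]

request_sent = datetime.fromisoformat("2020-12-01T10:00:05")
response_got = datetime.fromisoformat("2020-12-01T10:00:09")

window = [l for l in logs if request_sent <= parse_ts(l) <= response_got]
print(len(window))  # 3 lines fall within the request/response window
```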
In the "Just Say No to More End-to-End Tests" blog post, tests should be implemented with a pyramid structure: 70% should be unit tests, 20% should be integration tests, and 10% should be e2e tests, as this pyramid shows. That post was written five years ago, but it is really good. I recommend reading it once.

I'd like to summarize today's session. Flaky tests make developers frustrated and waste our time; that is very painful. We need to check the test failure details before re-running a test job. Sometimes the failure is a bug in our own change, not flakiness; if we re-run the test job without fixing that bug, the test job will just fail again. So check the test failure details before re-running a test job. It is good to run only the necessary test jobs in the pipeline for each change; we don't need to run every test job for a document change, right? It is better to run only the necessary test jobs to reduce the workload on the pipeline. It is also better to improve test coverage with unit tests and integration tests instead of e2e tests; that can reduce our investigation workload. Another thing: if you solve a flaky test, you can be a hero. I'd like to see that. Please contribute if you are interested in solving flaky tests. That's all for today. Thank you for attending this session.