Hi, my name is Baoduo Fang. It's my honor to present at the LibreOffice Conference. Today, I'd like to talk about my GSoC project with LibreOffice on how to select unit tests with machine learning.

Let me first introduce the motivation of this project. Originally, whenever we push a patch to Gerrit, a Jenkins job is triggered, where builds are run across all platforms, including Linux, Mac, Windows, and Android. We have observed that a patch typically changes only a small part of the code, so only a small number of unit tests are affected. Therefore, there is no need to run all unit tests for every patch; we can skip the irrelevant or low-risk tests to save computation. Inspired by Mozilla's work, which showed that machine learning can select unit tests for a patch with an accuracy of over 95%, I tried to replicate their approach for our CI chain.

In this project, three models are trained. Test-label-select is responsible for selecting unit tests given a patch. Test-failure predicts the overall probability that a patch will fail any unit test. Because test-failure performs poorly, test-overall is introduced as an improved version of test-failure that takes test-label-select's output as part of its input.

Before I introduce the models, there is one point that needs to be clarified. Although the test-label-select model achieves a high recall, it is still dangerous to run only the unit tests selected by the model. Even if the model achieves a 99% failure recall, 1 in 100 failures may still escape the CI chain and break everything. So in practice, all patches are classified into high-risk and low-risk ones. High-risk patches enter the fast track, while low-risk patches run the normal track to ensure correctness. The computation is saved in terms of how many platforms a patch is built on. For high-risk patches, the fast track first runs the build on Linux.
If the model is accurate, the patch should fail on this build, so all the other platforms can be skipped. If it passes the Linux build, the remaining platforms are run to ensure correctness. For low-risk patches, the normal track is just the original Jenkins job where all platforms are built in parallel.

Now let me introduce my models. The first model I'd like to talk about is called test-label-select. It takes a patch as its input and forms a patch-test pair with every unit test. Each pair is fed into an XGBoost model, which outputs a failure probability between 0 and 1 for that pair. We then use a threshold to decide whether the patch should fail that specific test. During the test phase, the model achieves a 95% failure recall and an 85% pass recall. In other words, among all actual failures, 95% of them are captured by the model, while we can skip 85% of the passing tests.

However, we can never be 100% sure about the model's output. Even with a 95% failure recall, 5% of the failing unit tests are still predicted as passing, and it could be a disaster if we directly merged those patches into our master branch. Therefore, a new model is needed to divide patches into high-risk and low-risk ones and use the fast-track and normal-track mechanism I mentioned before to save computation.

As a result, test-failure is proposed to predict whether a patch will fail any unit test; in other words, this model classifies every patch as high-risk or low-risk. In this model, only the patch features are considered. The model outputs a failure probability for the overall patch and, similar to the test-label-select model, uses a threshold to decide whether the patch will fail any test. If so, it is put in the fast track; otherwise it remains in the normal track. In terms of its performance, we can see that both its fail recall and pass recall are pretty low.
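Going back to test-label-select, its inference step can be sketched as a small function. This is a minimal illustration, not the project's actual code: the real model is a trained XGBoost booster, so `score_pair` below is a hypothetical stand-in for its predicted failure probability, and the 0.5 threshold and the toy test names are illustrative rather than the tuned production values.

```python
# Minimal sketch of test-label-select inference. score_pair stands in
# for the trained XGBoost model's predicted failure probability; the
# 0.5 threshold is illustrative, not the tuned production value.

def select_tests(patch_features, all_tests, score_pair, threshold=0.5):
    """Form a (patch, test) pair for every unit test, score each pair,
    and keep the tests whose failure probability exceeds the threshold."""
    selected = []
    for test in all_tests:
        prob_fail = score_pair(patch_features, test)  # value in [0, 1]
        if prob_fail > threshold:
            selected.append(test)
    return selected

# Hypothetical usage with a toy scoring function:
risky = select_tests(
    {"files_changed": ["sw/source/core/doc.cxx"]},
    ["sw_unit", "vcl_unit", "sal_unit"],
    lambda patch, test: 0.9 if test == "sw_unit" else 0.1,
)
# risky == ["sw_unit"]
```

The per-pair scoring is what lets the same model cover every unit test: the test's identity is part of the feature vector, so one booster scores all patch-test combinations.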
To improve on the test-failure model's performance, the third model, test-overall, is introduced by using predictions from test-label-select. Since the test-label-select model achieves a high accuracy, why not just use its predictions in test-failure? Therefore, I concatenate the test-label-select predictions with the original features of test-failure and treat them as a new input to test-overall, and the performance improves. However, it is still far from satisfactory.

As a result, a smart inference algorithm is proposed to capture more failing unit tests. In this algorithm, we first run test-label-select on all tests. If it predicts that the patch will fail more than 10 tests, then Jenkins runs the fast track. If not, we feed the test-label-select output into the test-overall model together with the original patch features to predict the patch's probability of failing any unit test. If it is higher than 40%, the patch is considered high-risk and the fast track is run. Otherwise, we consider the patch low-risk and run the normal track. This algorithm improves the failure recall to 90% while keeping the pass recall at 57%.

Although the project is now integrated into Jenkins, much future work can be done to improve its performance. First, the model can be improved so that the patches entering the fast track have a higher precision. Because the original goal was to select unit tests rather than to classify patches into two tracks, fail and pass recall were considered more important. But in the current implementation, a low failure precision means that many low-risk patches may enter the fast track and spend twice the time compared to the normal track. Secondly, deep learning models may be used here to improve model performance.
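The smart inference algorithm above can be sketched as follows. The two routing thresholds (more than 10 predicted failing tests, and a 40% overall failure probability) are the values from the talk; the model callbacks and the 0.5 per-test cut-off are hypothetical stand-ins for the trained test-label-select and test-overall models.

```python
# Sketch of the smart inference algorithm. label_select and
# test_overall stand in for the two trained models; the routing
# thresholds (10 tests, 40%) are the values described in the talk.

def choose_track(patch_features, all_tests, label_select, test_overall,
                 max_predicted_failures=10, overall_threshold=0.4):
    """Decide whether a patch runs the fast track or the normal track."""
    # Step 1: run test-label-select on every (patch, test) pair.
    predicted_failures = [
        test for test in all_tests
        if label_select(patch_features, test) > 0.5
    ]
    # Step 2: many predicted failing tests means clearly high risk.
    if len(predicted_failures) > max_predicted_failures:
        return "fast"
    # Step 3: otherwise feed the test-label-select output together with
    # the original patch features into test-overall for a patch-level
    # failure probability.
    prob_fail = test_overall(patch_features, predicted_failures)
    return "fast" if prob_fail > overall_threshold else "normal"
```

So a patch is routed to the fast track either because test-label-select directly predicts many failing tests, or because test-overall still considers it risky; only patches that pass both checks stay on the normal track.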
For deep learning, transformers may be a good option, because the attention mechanism they use is good at relating different features together, and it has achieved great success in NLP and in recommender systems with structured data, which is very similar to our case. However, it may slow down inference and may also require more expensive machines with GPUs. Third, once the test-label-select model achieves a high enough accuracy, a Jenkins job based on real unit-test selection can be implemented, just like the original goal of this project.

Finally, I'd like to thank the people who have been very helpful to my project and my hacking with LibreOffice. I'm looking forward to participating more in the future. Thank you for your attention. If you are interested in the project, you can check the project link on GitHub, and if you have any questions, you can reach me through my email. Again, thank you for listening.