Hi, everybody. Thank you so much for coming, and apologies for the delay due to some technical difficulties. My name is Fiona, and today I will be presenting on kernel qualification and cloud computing at scale. First, let me give you a little background about myself and what I will be talking about in this session. I'm a software engineer working in Google Cloud, specifically in Google Compute Engine. During part of my time at Google, I worked on a team responsible for qualifying and deploying the host software to the host machines running in the cloud, including the host kernel, which is the key piece of software on the host. In this presentation, I'll go over each phase of kernel qualification and what it involves; I'll also dive deeper into qualifying for performance on hosts running in the cloud and how metrics and benchmarks play a vital role. And lastly, I will discuss the different challenges we have faced in qualifying the kernel at scale.

Please bear with me if you are already familiar with these concepts, but I'd like to first go over what cloud computing is, how enterprise customers use it, and what virtualization is. Essentially, cloud computing is the on-demand availability of compute resources, such as infrastructure or storage, as a service over the internet. Businesses and individuals don't have to self-manage physical resources themselves; for example, enterprise companies will run either a portion or all of their services on virtual machines. So what is virtualization? Virtualization is the technology that enables the hardware resources of a single physical machine to be shared among multiple simulated software computers: virtual machines. A virtual machine thinks it's running on top of hardware, with access to resources like CPUs, memory, and storage, when in fact that hardware is partially simulated in software. The hypervisor is what allocates and controls the sharing of the machine's resources. KVM is the hypervisor module in the Linux kernel that provides the virtualization layer by exposing a userspace API. The virtual machine monitor (VMM) is the userspace process that talks to KVM and provides hardware emulation to the VMs.

Now let's go over the qualification process for a kernel. For the most part, Google runs on a forked version of the Linux kernel, plus tens of thousands of additional patches, for various reasons. For example, Google runs at massive scale, and sometimes there are things that aren't necessarily applicable to general usage, so Google makes trade-offs based on what makes sense for our usage. Kernel developers at Google often contribute to the open source upstream repo as well, but there is still a ton of Google-specific patches maintained within Google. Just like in any other development, kernel developers write unit tests when they submit a change, to make sure their portion of the code works as intended. Now, as you can imagine, the Linux kernel code base is enormous, and every release has tens of thousands of patches, on top of the internally managed patches. A developer working on, say, the memory management subsystem doesn't necessarily know all the changes going into the kernel, such as in the networking subsystem. So once the changes are checked in, they are functionally tested together on different types of hardware.
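To make the "userspace API" concrete, here is a minimal sketch in Python of talking to KVM through `/dev/kvm`. The ioctl numbers come from `linux/kvm.h`; this only illustrates the interface shape, not how a production VMM actually uses it.

```python
# Minimal sketch of KVM's userspace API from Python.
# Ioctl numbers from linux/kvm.h: _IO(0xAE, 0x00) and _IO(0xAE, 0x01).
# Requires a Linux host with /dev/kvm and permission to open it.
import fcntl
import os

KVM_GET_API_VERSION = 0xAE00  # _IO(KVMIO, 0x00)
KVM_CREATE_VM = 0xAE01        # _IO(KVMIO, 0x01)

kvm_fd = os.open("/dev/kvm", os.O_RDWR)
try:
    # A real VMM checks this first; it has been 12 for a long time.
    version = fcntl.ioctl(kvm_fd, KVM_GET_API_VERSION)
    print(f"KVM API version: {version}")

    # Ask the kernel to create an empty VM; the returned fd is then
    # used to add memory regions and vCPUs (omitted here).
    vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)
    print(f"Created VM fd: {vm_fd}")
    os.close(vm_fd)
finally:
    os.close(kvm_fd)
```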
Then the source code is built into a binary and tested at the application level. At this point, the kernel binary is officially put into the qualification pipeline. First, we always run a set of simple smoke tests to ensure that the kernel meets the minimum functionality and performance requirements: spinning up a VM, executing a series of commands on a VM, migrating a VM, making sure networking works on the VM, and more. This is so that we catch the trivial issues early, because sometimes they cause the system to crash. And if a host crashes too many times, it gets automatically flagged as possibly having a hardware defect, and the host is turned down. Actual technicians in the data centers then go and check on the host, and this process takes at least a couple of days before the host comes back up again.

Once the kernel passes the minimum criteria, the kernel binary is deployed into the test environment and installed on a set of test machines for a period of time. Now, the test environment is crucial to any qualification. We have various testing infrastructures and environments for different use cases. For kernel qualification, the kernel first goes through a more controlled and isolated test bed, where it is installed on a set of specific hosts, and we have control over what to install on each host to run different tests. As the qualification progresses, the test bed becomes less configurable but more similar to the prod environment. Everybody uses the kernel for different things, and it's really hard to test all of the scenarios, but we work together with the kernel team, the VMM team, the networking team, and other teams who run on top of the kernel to select a set of tests that validates both the kernel and the VM behavior. A lot of the time these tests are provided directly by those teams, actually, so we really rely on the teams working together to qualify the kernel.

In this more controlled test environment, different application-level tests get cross-tested against different hardware. In addition to the host software that I was involved in helping to qualify, other teams are also developing and rolling out software that runs on the host to support the cloud environment. So when we do qualification at this stage, we only want one variable, and everything else should be controlled, which is why we partition the test machines into different sectors: some get the production kernel that is already running in the cloud, and some get the one we are qualifying. That way, other teams can use the kernel that's already in production to qualify their own products and services. This also ensures compatibility between the kernel we're qualifying and all of the other software out in production that runs on the host. In addition, we also test kernel upgrade-downgrade: when we install a kernel on a host, we will also install the older version, to ensure there's always a path to roll back if there's a bug in production. Through the qualification process, the tests become larger and larger in scope, and once enough confidence is gained, the kernel candidate gets deployed to a much larger number of test machines for load tests, stress tests, and multi-machine tests, and then further into more complex workloads and benchmarks.
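The internal smoke-test harness isn't public, but just to make those steps concrete, here is a rough approximation driven through the public gcloud CLI; the instance name, zone, and machine type are made-up examples, and `simulate-maintenance-event` stands in for whatever internal mechanism triggers a migration.

```python
# Sketch of a smoke test via the public gcloud CLI (the real internal
# harness is different; this just makes the steps concrete).
# Assumes gcloud is installed and authenticated; names are examples.
import subprocess

ZONE = "us-central1-a"
NAME = "kernel-qual-smoke-vm"  # example instance name

def gcloud(*args: str) -> None:
    subprocess.run(["gcloud", "compute", *args], check=True)

def smoke_test() -> None:
    # Spin up a VM on a host running the candidate kernel.
    gcloud("instances", "create", NAME, f"--zone={ZONE}",
           "--machine-type=n2-standard-4")
    try:
        # Execute a command on the VM to verify basic liveness.
        gcloud("ssh", NAME, f"--zone={ZONE}", "--command=uname -a")
        # Verify guest networking.
        gcloud("ssh", NAME, f"--zone={ZONE}", "--command=ping -c 3 8.8.8.8")
        # Trigger a live migration via a simulated maintenance event,
        # then confirm the VM still responds.
        gcloud("instances", "simulate-maintenance-event", NAME,
               f"--zone={ZONE}")
        gcloud("ssh", NAME, f"--zone={ZONE}", "--command=true")
    finally:
        gcloud("instances", "delete", NAME, f"--zone={ZONE}", "--quiet")

smoke_test()
```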
So as the kernel progresses through the qualification stages, the test environment, as I mentioned, becomes less configurable, but more variability is introduced, which is closer to the production environment where our customers run their workloads. In this environment, other teams also use these machines to test their applications. A lot of the time we don't control what the VMs on a host are running, because one physical host can be supporting multiple VMs. But this characteristic is also what we want, because it sometimes exposes issues during qualification that we want to catch before they get out to production. In this environment, almost all of the machines run the kernel we're qualifying, so essentially other teams are qualifying their services on top of the kernel that's in qualification. Of course, there is a downside, because we are introducing more than one variable into this test environment, but at the same time the kernel gains additional exposure to the different types of use cases in which other teams run their workloads.

When it comes to qualification, we also have to think about both functionality and performance. We can verify functionality by running the smoke tests, unit tests, and all the other sorts of tests I mentioned, but performance is a lot more dynamic and can be affected by multiple dimensions, which is why we put the kernel qualification through multiple environments with different characteristics. So now let's talk about how we qualify performance for workloads running on the kernel. First, the performance of any workload is unique, but let's consider it simply in terms of a baseline threshold. There are multiple dimensions that can affect the performance of a VM. For example, different hardware and different VM families have different capabilities: general purpose, compute optimized, memory optimized. Each of these VM shapes is specifically tailored to the workloads suitable for its use case, so that it provides better performance for our customers. There are different configurations to support these VM families, and they interact with the kernel differently. So a kernel definitely has to be cross-tested against the full range of host machine hardware, as well as against the VM shapes offered on those hosts. Different use cases can also have their own baseline thresholds for determining whether performance has regressed. Depending on the type of workload the guest is running, it may be stressing a different component, whether it's the CPU, memory, or disk, and how it interacts with the system is very different, which yields different performance. People use VMs to run various things and require that the VM provide consistently stable performance. A kernel may pass the functional tests and a short-term stress test, but then perform very differently when it's out in production running at a larger scale. That is why, to qualify for performance, we incorporate benchmarks and run more complex, stressful, representative cloud workloads in our qualification pipeline, to ensure the components are stressed at the right level and interact in a way similar to how enterprise services would use them. Sometimes when we see variability in performance, there might be a bug, or it could be influenced by the environment.
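As a simple illustration of the baseline-threshold idea, here is a sketch of comparing a measured benchmark result against a per-VM-family baseline. The baseline numbers, metric names, and tolerance are invented for illustration; real baselines come from historical measurements agreed with the performance teams.

```python
# Sketch of a baseline-threshold regression check. All numbers and
# names below are invented examples, not real production baselines.

# Baseline throughput (ops/sec) per VM family, from prior qualified runs.
BASELINES = {
    ("n2-standard-8", "redis_memtier"): 95_000.0,
    ("c2-standard-8", "redis_memtier"): 120_000.0,
}
TOLERANCE = 0.05  # flag anything more than 5% below baseline

def check_regression(vm_family: str, benchmark: str, measured: float) -> bool:
    """Return True if the measured result regressed past the threshold."""
    baseline = BASELINES[(vm_family, benchmark)]
    floor = baseline * (1.0 - TOLERANCE)
    if measured < floor:
        print(f"REGRESSION: {benchmark} on {vm_family}: "
              f"{measured:.0f} < {floor:.0f} (baseline {baseline:.0f})")
        return True
    return False

check_regression("n2-standard-8", "redis_memtier", 88_000.0)  # fires
```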
It could be that the host is overloaded, in which case the VM will get migrated to a different host to alleviate that pressure and bring the VM's performance back. Another issue we commonly see is the noisy neighbor issue, where one VM on the host is running a dominating workload that takes up resources used by the rest of the VMs. By running benchmarks and cloud workloads for qualification, we have insight into both host-level and guest-level performance metrics. During qualification, these test workloads are monitored 24/7 with alerting systems, and the performance metrics are collected and displayed in dashboards for diagnosability if an alert fires.

So let's dive a little deeper into benchmarks and cloud workloads. We run benchmarks and cloud workloads to give us insight into guest VM performance. PerfKit Benchmarker is an open source project that provides a wrapper around a large range of popular benchmark tools. To name a few examples: SPEC CPU measures speed and rate, that is, how fast a computer completes a single task and how many tasks a computer can complete in a given amount of time. Netperf is a common benchmark used to measure networking performance. And memtier_benchmark is a benchmark developed by Redis for load generation and benchmarking key-value databases. Essentially, depending on the configuration supplied, PerfKit Benchmarker will launch a benchmark. For Redis memtier, for example, it can launch a number of client VMs with a certain read/write ratio that talk to the server, putting stress on the server at the right level so that we can collect the guest metrics during qualification. It's crucial to set up a benchmark or a workload in a way that is representative of the prod workloads, by giving it enough stress at the right frequency, which is why we work closely with the performance teams, who are the main experts in this area, to determine how these benchmarks should be configured to replicate something similar to the production environment.

Another aspect is how long to run the benchmark. Depending on what we're testing, the duration of the benchmark should vary. In our case, to qualify the kernel, we typically run the benchmark continuously throughout the qualification period. This is because we want the kernel's performance to stay at the expected level for a sustained period of time without showing any signs of degradation. Sometimes an issue only surfaces when the kernel is under a certain level of stress, or with a one-in-thousands probability, so running the kernel in qualification against workloads for a sustained period definitely gives us additional confidence. I briefly mentioned the host-level and guest-level metrics we collect while running workloads and benchmarks. On the host level, even without any insight into what the guest is doing, we can monitor a couple of things. First, how the host operating system is performing in general: CPU utilization, memory and disk utilization, packet drops, and so on. Of course, we also have metrics around specific features to ensure they are behaving as expected, as well as things at the process level, such as whether a process suddenly crashed or was terminated for some reason.
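For a flavor of how such a run might be kicked off, here is a sketch that shells out to PerfKit Benchmarker. `--benchmarks`, `--cloud`, and `--machine_type` are standard PKB flags, but the memtier-specific flag names are from memory and may vary by PKB version, so treat them as illustrative and check `./pkb.py --helpmatch=memtier`.

```python
# Sketch of launching a PerfKit Benchmarker run from a qualification
# pipeline. The memtier-specific flags are illustrative and may differ
# by PKB version.
import subprocess

def run_redis_memtier(machine_type: str = "n2-standard-8") -> int:
    cmd = [
        "./pkb.py",
        "--cloud=GCP",
        "--benchmarks=redis_memtier",
        f"--machine_type={machine_type}",
        # Illustrative workload shaping, per the talk: client count and
        # read/write ratio chosen to stress the server at the right level.
        "--memtier_clients=50",
        "--memtier_ratio=1:9",
    ]
    return subprocess.run(cmd, check=False).returncode

if __name__ == "__main__":
    raise SystemExit(run_redis_memtier())
```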
But we don't necessarily have a ton of information about how the guest VM is actually performing, or what the experience is like for the clients talking to the services running on those VMs. That is why, with workloads during qualification, we gain insight into performance on the guest VMs created to run the benchmarks or workloads. This gives us the information to qualify and ensure that the VM's performance, and the service running on the VM, is not degrading. Similarly to what we care about in the host operating system, we measure the guest operating system. These metrics are collected through the Cloud Ops Agent, which is also publicly available; it uses the open source OpenTelemetry tooling to collect metrics. It additionally supports collecting third-party application metrics. For example, for a database service running on the VM, it collects metrics on how the server is doing, and for something like Redis memtier it collects the latency experienced on the client side. All of these are very important metrics that we collect and use to qualify a host kernel.

We often run into obstacles during kernel qualification: because the operating system is so complex and there are so many things running on top of the kernel, there are a lot of factors that can influence performance, and we do get false positives during qualification. Alerts will fire and we will triage the issue, but it could be due to a number of factors unrelated to the kernel. As mentioned, multiple variables are introduced at the later qual stages, so a lot of effort and time goes into triaging a bug. Take the noisy neighbor issue, for example. As I mentioned before, we can have multiple VMs running on a single physical host, and when the performance of a VM is affected by its neighbors' activity, that's commonly called a noisy neighbor problem. It is caused by various factors, such as the VMs sharing the same CPU, memory, or networking resources, which can lead to increased latency if the bandwidth is taken up by one of the workloads running on the host. There are mitigations for this. First, it doesn't occur on single-tenant VMs, where a VM runs on a host of its own. Second, when we detect that the performance of a certain VM is degrading for this reason, it gets migrated to a different machine. And third, some resources are partitioned per VM, but unfortunately not all resources can be divided and allocated per VM process.

So we've talked a lot about qualifying the kernel, but what happens at scale? Everything I've described in the kernel qualification pipeline happens for one kernel version that we're qualifying. You might be wondering: why can't we just qualify one kernel and then push it out to all of the platforms? In practice this is very challenging, as we're expanding so rapidly. First, upgrading a host is very expensive. Essentially, all of the VMs running on the host need to be either terminated and restarted or live migrated to a different host machine. GPU VMs get terminated and restarted; other VMs go through a process called a maintenance event, or live migration, during which there is a certain level of degradation experienced by the VM. We try to avoid that as much as possible. But of course, we need to balance getting new features by upgrading the kernel against not disrupting the VMs too frequently.
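As an illustration of the kind of sustained guest-side check described here, the following is a sketch of a rolling-window test over client-side latency samples. The window size and threshold are invented; the real setup alerts through Cloud Monitoring on Ops Agent metrics rather than hand-rolled code like this.

```python
# Sketch of a rolling-window latency check over guest-side samples.
# Window size and threshold are invented examples; production alerting
# runs in Cloud Monitoring, not in a loop like this.
from collections import deque

class LatencyMonitor:
    def __init__(self, window: int = 60, p99_threshold_ms: float = 5.0):
        self.samples = deque(maxlen=window)  # most recent N p99 readings
        self.threshold = p99_threshold_ms

    def record(self, p99_ms: float) -> bool:
        """Record one reading; return True on a sustained breach."""
        self.samples.append(p99_ms)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data to judge a sustained breach
        avg = sum(self.samples) / len(self.samples)
        return avg > self.threshold

monitor = LatencyMonitor()
for reading in [6.0] * 60:  # simulated degraded readings
    if monitor.record(reading):
        print("ALERT: sustained p99 latency above threshold")
```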
So in the happy path, we qualify one kernel for all of the platforms, roll it out, and the machines perform as expected. But a lot of the time, different VMs and hosts have different characteristics; they might be exercising different features in the kernel, and when a bug is discovered, say in production, affecting only a subset of the families, then for one, we need the ability to roll back only a subset of what was rolled out. Otherwise, if we had to roll back everything, the machines not affected by the bug would get disrupted unnecessarily, and the customers' services running on those VMs would not have a great experience, because they'd see multiple disruptions in a short period of time. Or, if we need to roll forward a fix, that VM family now deviates from the rest of the production machines, which are running a different kernel version. Similarly, if an issue is discovered at the end of qualification for a subset of the platforms, we have to determine whether the rest of the platforms need to wait for that fix to get re-qualified, and kernel qualification takes a long time: it has to go through all of the qualification steps I mentioned, to make sure that a fix, even a trivial one, is functionally correct and doesn't degrade performance in any way. So a lot of the time this also makes it hard to keep one kernel version out there for all of the platforms. And lastly, we have a lot of feature-specific development that gets deployed to certain families, and it just doesn't scale if that feature has to wait for everything to finish qualification at the same time.

Secondly, as mentioned previously, there are a lot of factors outside of actual kernel bugs that affect performance, and these create a noisy qualification signal. Sometimes a bug occurs with low probability, but when we catch something in qual, even just once, it could appear thousands of times in production, simply because we have so many machines running there. And any VM could be running a workload that impacts huge numbers of users, which leads to the third challenge: limited resources in qualification. Some host machines are very rare and expensive, with a lot of demand from customers in production. So with the resources we have in qualification, we need to use them efficiently, determining which sets of tests should run on which hardware, instead of running everything on all the platforms; testing has to be tailored to specific VM families.

And lastly, the cloud industry is expanding, and we have all this new hardware, new features, and new VM offerings for specific workloads, which is great. But it's very important to invest in the infrastructure to qualify and roll out these products so that they can provide a stable, quality experience for the services running on those VMs. The infrastructure also has to have automation to reduce manual toil; with a system this complex, and since we all know humans make mistakes, it's very error-prone not to have sufficient automation for qualifying these kernels in parallel and rolling them out to huge numbers of machines. Additionally, noisy signal is something we're continuing to struggle with, but we're investing time and resources to solve it: we have so many products out there, and we have a lot of tests and metrics and information collected, which is great.
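To illustrate the selective-rollback idea, here is a hypothetical sketch of deciding, per VM family, what to roll back; the rollout record and function names are invented for illustration and don't reflect any real internal system.

```python
# Hypothetical sketch of per-family rollback selection. The data model
# is invented; the point is that only families affected by a bug get
# disrupted, while healthy families keep the new kernel.
AFFECTED_FAMILIES = {"gpu-a3"}  # example: families where the bug manifests

rollout = {  # family -> kernel version currently rolled out
    "n2": "kernel-5.15-g42",
    "c2": "kernel-5.15-g42",
    "gpu-a3": "kernel-5.15-g42",
}
previous = {  # family -> last known-good kernel version
    "n2": "kernel-5.15-g41",
    "c2": "kernel-5.15-g41",
    "gpu-a3": "kernel-5.15-g41",
}

def plan_rollback() -> dict:
    """Roll back only affected families; leave healthy ones untouched."""
    return {
        family: previous[family]
        for family in rollout
        if family in AFFECTED_FAMILIES
    }

print(plan_rollback())  # {'gpu-a3': 'kernel-5.15-g41'}
```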
But having a system that takes in all of this information and determines whether something is actually regressing is very important as well. So that's all of my presentation. Thank you so much for joining. We have some time for questions if anybody has any. Yes?

[Question inaudible] So I talked about the different testbeds, right? In the early qual stages, the machines in the testbed can have labels on them or be put into categories, and the machines also have machine names. So in the more configurable stages, we have the ability to say that we want this kernel running on this machine, with this software, to run these tests, and this is all provided by infrastructure developed by another team. In the later stages, although we still have that functionality, we generally like to run a specific set of kernels on a larger pool of machines, say the same hardware platform or VM families, so that we get more variance of experience, especially for qualifying performance. Or machines can be selected by the type of CPU they're running; there are a lot of different ways to select them. Yeah.

[Question inaudible] So there are other teams testing the kernel functionality as well: the specific kernel subteams as well as the networking team. We first run the specific functional tests on the kernel, but we also run an extensive set of end-to-end tests of VMs running on the kernel, and a lot of different checks to ensure it is performing as expected. Of course there are bugs, and we are constantly improving our test suites, but since the qualification process on the kernel is very extensive and very long, those issues are typically caught in the early stages of qualification. Yeah. Yes?

[Question inaudible] So when measuring a regression, as mentioned, we work closely with the performance teams. Over time, we have seen that when a VM is performing at a certain level, it is providing good, stable, consistent performance on the guest VM, and these are monitored 24/7. One tool out there, Cloud Monitoring in Google Cloud, is something that monitors guest VM performance. Since we already have that stable baseline threshold, whenever something changes, that raises an alert, and that's when we start diagnosing and triaging whether there's a bug in the kernel causing the performance deviation. I think there are no more questions. Thank you, everybody, for coming today. I hope you all enjoyed this presentation. Yeah.