Hello, everyone. Welcome to KubeCon Europe 2022. This is the SIG Node intro and deep dive talk. My name is Dawn Chen. I am a software engineer at Google, currently working on GKE and Anthos. I'm one of the founding engineers of Kubernetes and helped start SIG Node back in 2016. Derek? Great to be here with everybody at KubeCon again. My name is Derek Carr. I am an engineer at Red Hat working on our OpenShift product, I have been with Dawn and the SIG since the early days as well, and I'm excited to be here with some of our newer contributors, Elana and Sergey. Do you want to introduce yourselves? Yeah, hi everybody. I'm Elana Hashman. I currently work on the OpenShift engineering node team. I've been working on Kubernetes since about 2018, and you might recognize me as a co-chair of SIG Instrumentation, but I also work in SIG Node and currently help lead the CI subproject with Sergey. Hello, I'm Sergey Kanzhelev. I'm working for Google and very excited to be here. So with that said, back to you, Dawn.

Next. So before we get into today's agenda, I want to briefly mention the previous SIG Node update we gave at KubeCon last year. You can access the related slides and recording by clicking the links here. Next. Here's today's agenda. We are going to first introduce SIG Node's responsibilities. Then we are going to talk about the current activity since the last update and the roadmap from 1.23 to 1.25 and beyond. Then some interesting projects and efforts currently driven by the SIG, for example the 1.24 Docker shim removal, cgroup v2, and the CI projects continuing from the last update. Then we want to share the node contributor ladder with the community. We discussed this within the SIG Node community for a while, and now we have finalized it. Finally, we are going to talk about how to get involved and how to get help from the community.

Next. What is SIG Node, and what are its responsibilities? Let's briefly talk about the node's responsibility in Kubernetes. Kubernetes is a cluster orchestration solution for containerized applications and services. Those containers, scheduled by the Kubernetes control plane, run on the nodes. On each node, there is an agent called the kubelet. The kubelet registers the node with the Kubernetes control plane. The kubelet, together with the container runtime, manages the pod and container lifecycle on the node: setup, teardown, and cleanup. The kubelet also does node-level resource management, such as ensuring applications get the resources they request, detecting node-level resource starvation, and taking action to prevent out-of-resource situations. The kubelet also sends status back to the control plane so it can make follow-up decisions and corrections. Next, please. In summary, SIG Node owns all the components running on the nodes which ensure the node itself and the applications on it run happily. SIG Node is very large and owns many projects. You can click the links here to find them all. This time, I want to specially call out a new subproject we started lately, the special resource operator, which helps users manage the deployment of kernel modules and drivers. Come join our SIG Node weekly meeting to learn about all those projects in detail.
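To make the status reporting described above a bit more concrete: the conditions the kubelet publishes back to the control plane (Ready, MemoryPressure, DiskPressure, PIDPressure, and so on) are visible on every Node object. Here is a minimal, illustrative sketch using client-go, assuming a standard kubeconfig in your home directory; `kubectl describe node` surfaces the same information.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; the path is an assumption for illustration.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Each condition is populated by the kubelet's periodic status updates.
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			fmt.Printf("%s\t%s=%s\n", node.Name, cond.Type, cond.Status)
		}
	}
}
```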
Now, I want to hand over to Elana to talk about our roadmap. Thanks, Dawn. I am going to talk a little bit about all the things that we've been up to in the past year, starting with graduations and deprecations.

So I grouped this section together because there's an overall goal shared between graduations and deprecations: cleaning up tech debt and reducing the maintenance surface. When a feature graduates in Kubernetes, it was previously behind a feature flag, typically defaulted to on, and when we graduate it we get rid of that feature flag. It's no longer conditionally on; it's on for everyone by default. Similarly, deprecations are when we disable a feature, if it was behind a feature gate, and ultimately remove that feature from the code base, again making it a little bit simpler to maintain. So over the past year, probably our most major user-facing removal was that of Docker shim, which was finally removed in the 1.24 release. On the graduation side, we are also graduating cgroup v2, version two of the kernel control groups, which reached feature parity in the 1.22 release. And then a couple more minor deprecations and graduations: we've removed support for dynamic kubelet config, which was previously deprecated and was ultimately removed in 1.24. And we've also graduated pod overhead, which helps keep track of additional resources at the pod level, so no more feature gate for that.

For beta graduations, this is when a feature may have been added at an alpha level, and now we default that feature gate to on. So in the past couple of releases, 1.23 and 1.24, we've graduated a number of features to beta, including ephemeral containers, which had quite a long time between alpha and beta, so we're very excited about that. We also graduated kubelet CRI support; we now support a v1 CRI API, which is very exciting. We have graduated the kubelet credential provider to beta, and we've added support for, and then graduated, gRPC probes to beta, and so on. So if you want to look up any of these particular features, you can do so by enhancement number, which I've included on the slide, and the copy of the slides that we're sharing will also include some links.

Finally, I'll talk about the alpha features that we've added to the SIG over the past couple of releases. These are net new features, so they're introduced behind a feature gate, but those feature gates are disabled by default, and you can turn them on and test them out. We have a couple of these alpha features. The first is cAdvisor-less, CRI-full container and pod stats, allowing the CRI, the container runtime, to provide statistics, typically Prometheus metrics, on behalf of a pod, rather than having cAdvisor do some introspection of the cgroups and provide them instead, which can be a little expensive. This is ultimately along the path of reducing how much we have to rely on cAdvisor in the kubelet. We also added some new CPU manager policies, which is very exciting.

Now, for our future roadmap, 1.25 and beyond, we have a bunch of really cool stuff in the pipeline. We're going to continue enabling cgroup v2, which we're going to talk about a little bit more later in the presentation, and which will unblock a lot of new features in Kubernetes, very exciting. We're adding forensic checkpointing for containers. We're looking at ensuring secret pulled images, and at in-place pod updates for adjusting requested resources at runtime. We're also working on graduating memory swap support, which is currently an alpha feature but not yet graduated to beta. There are, of course, many more features, and if you want to hear more about those, you can join us at a SIG Node meeting.
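If you want to experiment with any of the alpha features above, enabling them means flipping feature gates in the kubelet configuration. Here is a minimal sketch, assuming the k8s.io/kubelet/config/v1beta1 API and using gate names such as MemoryQoS and CPUManagerPolicyOptions purely as examples (check the feature-gate reference for your release); it generates a KubeletConfiguration you could pass to the kubelet with --config.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
	"sigs.k8s.io/yaml"
)

func main() {
	// A kubelet configuration that opts in to two feature gates.
	// The gate names here are illustrative; consult the feature-gate
	// reference for the names and maturity levels in your release.
	cfg := kubeletv1beta1.KubeletConfiguration{
		TypeMeta: metav1.TypeMeta{
			Kind:       "KubeletConfiguration",
			APIVersion: "kubelet.config.k8s.io/v1beta1",
		},
		FeatureGates: map[string]bool{
			"MemoryQoS":               true, // cgroup v2 memory QoS (alpha)
			"CPUManagerPolicyOptions": true, // newer CPU manager policy options
		},
	}

	out, err := yaml.Marshal(&cfg)
	if err != nil {
		panic(err)
	}
	// Write this out to a file and start the kubelet with --config=<file>.
	fmt.Print(string(out))
}
```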
But for now, let's do a little deeper dive into one of our major areas. I believe Dawn will be talking about Docker shim removal.

Thanks, Elana. Before getting into the Docker shim removal, I want to briefly talk about the Kubernetes Container Runtime Interface, the CRI. It is a gRPC interface which defines how the kubelet, the agent running on every node, interacts with a wide variety of container runtimes. We published the first version of the CRI back in May 2016, and the first implementation of the CRI was introduced in Kubernetes 1.5. Next, please. As the first implementation to support the CRI, Docker shim was a built-in module in the kubelet. Why did we choose to do it that way then, so that we have to deprecate and remove it today? In 2016, Docker was the only production-ready container runtime. We needed to validate the interface we created quickly to ensure incremental development. It took us a little more than one year to have the first version of the production-ready interface and implementation. Then why did we bundle it with Kubernetes as a single binary? Back then, Kubernetes iterated very dramatically; the CRI changed all the time. Bundling it with Kubernetes meant we could tolerate breaking API changes for faster iteration. This also allowed us to switch to using the CRI as the default for the in-process Docker integration while the API was still in alpha. Hence, we created Docker shim, and that was the only choice for Kubernetes users until two years later, when we introduced containerd as the second container runtime, and then CRI-O followed. Next, please.

Since we published the first version of the CRI, there have been many efforts to build different container runtimes to serve different purposes, different workloads, different business reasons, and so on. SIG Node supported all of them, but worked very closely with containerd, CRI-O, and frakti, in addition to Docker shim, to make sure the interface we defined covers the majority of use cases. We also introduced the CRI test suite and the crictl tool set to help with development and usability. Next. Since we published the first version of the CRI in 2016, two years later containerd went GA and graduated from CNCF incubation. Later, CRI-O also became production ready and graduated. Both container runtimes have supported the CRI and OCI compatibility since day one. Both container runtimes reached feature parity with Docker shim before GA, and over time we introduced more features to each of them. Both container runtimes met the test coverage requirements defined by SIG Node before going production ready. Next. Why do we want to remove Docker shim? Why not maintain an in-box solution for the users? From the previous slides, one can see that since the beginning of its life we treated it as a temporary solution. We fixed many integration issues in it to support existing users, but decided not to introduce new features, especially after we had two alternatives three years ago. So there is a growing feature parity issue with Docker shim. By the way, the first of those features developed beyond Docker shim is cgroup v2 support.
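To make the CRI described above a bit more concrete, here is a minimal sketch of a client calling a runtime over the CRI v1 gRPC API, assuming the k8s.io/cri-api package and containerd's default socket path (CRI-O listens on a different socket; crictl wraps these same calls for day-to-day debugging).

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Dial the runtime's CRI socket. The path below is containerd's default
	// and is an assumption; adjust it for CRI-O or your own environment.
	conn, err := grpc.Dial(
		"unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Version is the simplest CRI call: it reports the runtime's name and
	// which version of the CRI API it speaks.
	resp, err := client.Version(ctx, &runtimeapi.VersionRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("runtime %s %s, CRI API %s\n",
		resp.RuntimeName, resp.RuntimeVersion, resp.RuntimeApiVersion)
}
```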
Now I want to hand it over to Derek to talk about cgroup v2. Thanks, Dawn. So as discussed earlier, we've made a lot of progress in the community around cgroup v2 enablement.

As we started this project, we wanted to be clear on what our initial goals were with respect to cgroup v2, and that was largely to get parity with existing feature support on cgroup v1. So all of the resource controllers that Kubernetes leverages to restrict the amount of CPU or memory or PIDs or huge pages that a given pod can consume, they are restricted today primarily by cgroup v1 controllers. Over the life of Kubernetes, cgroup v2 has continued to evolve and has reached feature parity in many cases with v1. So today, if you boot a kubelet on a v2-enabled host and your runtime is set up appropriately, our intention is to ensure that you at least have feature parity. We are not deprecating cgroup v1 support by adding v2, but we are making a statement that new resource controllers are intended to be added for v2 only. We'll talk about some of those. So one of the major activities to enable v2 support, besides the code, is just to make sure that you have viable test coverage. A number of distributions over the last few years have started to change their default boot-time configuration to be v2-enabled. That includes Fedora, Ubuntu, COS, and others. And now that that's happened, we obviously get more pressure in the Kubernetes community to support those hosts. One example of a new feature that we are excited you can leverage in a v2 environment, and that is in an alpha state right now, is memory quality of service. So hopefully, as we continue to evolve that feature, we can point to it as our first v2-specific cgroup feature.

So if we go to the next slide: what does a cgroup v2 environment mean for you as a workload deployer? Typically, you shouldn't have to worry about it, right? You should leverage Kubernetes as your container orchestration runtime and assume it's going to handle it for you: I just set my fields on my pods and everything's great. As a Kubernetes infrastructure provider, it does have some meaning, right? You have to know if your operating system is configured for v2. You have to make sure that your container runtime supports a v2 environment. And typically that does mean you now have to change how you deploy the cgroup manager in both the kubelet and your runtime to be systemd-aware. In practice, though, while it's true that most workloads shouldn't be impacted, it's hard for us to immediately know these things without testing and feedback. So whether you get Kubernetes from the community, from a vendor, or from your particular provider of choice, please make sure you give that feedback to that source so that we can get any bugs and issues known to the broader community and fixed up. A good example of where cgroup v2 and things that are not provided in the core of the Kubernetes project often intersect is around security monitoring and resource agents. So while projects like cAdvisor have added v2 support, your own environment might have some unique agents running on it that may not yet be v2 compatible. So just be aware that the intersection of your total set of solutions deployed to a node might need to account for v2 in the future.
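Since knowing whether a host booted into cgroup v2 comes up repeatedly here, this is a minimal sketch of the usual check, assuming the standard /sys/fs/cgroup mount point: on a v2 (unified hierarchy) host that path exposes a cgroup.controllers file; from a shell, `stat -fc %T /sys/fs/cgroup` reporting cgroup2fs tells you the same thing.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// On a cgroup v2 host, /sys/fs/cgroup itself exposes cgroup.controllers;
	// on a v1 (legacy or hybrid) host it does not.
	const controllersFile = "/sys/fs/cgroup/cgroup.controllers"

	if _, err := os.Stat(controllersFile); err == nil {
		data, _ := os.ReadFile(controllersFile)
		fmt.Println("cgroup v2 host; available controllers:",
			strings.TrimSpace(string(data)))
	} else {
		fmt.Println("cgroup v1 (legacy or hybrid) host")
	}
}
```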
So if we go to the next slide, I just wanted to raise some domain-specific or industry-specific challenges that we are aware of as Kubernetes maintainers as we look to evolve towards v2. We hear feedback from a lot of users in telco who are using Kubernetes as a platform to automate their networks. One area that we know is not yet available with cgroup v2 is allowing you to disable CPU load balancing, for example. A lot of people in those industries are pinning their workloads to particular CPUs and are very performance sensitive. We are hoping, in the broader Linux community, that we get these things fixed so that v2 will just work fine in those industries. But that's an example of a domain- or industry-specific challenge that is hard for us to know about without feedback. Other types of challenges are more nuanced. So if you're using a particular language runtime: in Go there's native cgroup v1 support to determine how to properly configure GOMAXPROCS, but right now, my understanding is that with Go you actually have to manually set that on a v2 host. I'm sure over time, as Go evolves, you won't have to care about it anymore, but sometimes these quirks exist. Similarly with Java, if you're using newer JDKs, greater than 15, Java should be v2-aware and not need any additional work. But if your company or your environment is on an older JDK, you might need to update in order to take advantage.

Finally, we're doing a lot of work in SIG Node to try to be more flexible with respect to autoscaling. We talked about in-place resource resizing in our roadmap. Right now, we are working towards landing that in an alpha feature state, but we would love help to ensure that, as that feature lands, it is v2 tolerant. Right now, there are gaps that we need to close. In particular, around autoscaling, a lot of the metrics do change between v1 and v2. And so if you have exotic autoscalers, we really just need more testing against the metric providers that are feeding your autoscaling, and that feedback brought back to the community. So a general call to action, if you're watching this: if your domain, your industry, your language runtime, or your vendor has any known gaps that they want to share with us, the only way we can get them mitigated is to share that feedback. So please do look to join us. Going forward, though, with cgroup v2, there are interesting features that we look to explore that may be of interest to you as a potential user or contributor, and we'd love to learn more. Things like pressure stall information metrics to drive more efficient eviction and node reliability are exciting for us, as well as better leveraging userspace OOM killers like oomd. So hopefully, as we evolve to v2, we can get better, more reliable nodes for the broader community.
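On the Go point raised above: on a cgroup v2 host a program can read its own CPU quota from cpu.max and size GOMAXPROCS itself. Here is a minimal sketch, assuming the container sees its limits at /sys/fs/cgroup/cpu.max; libraries such as go.uber.org/automaxprocs aim to handle this kind of thing automatically.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// readCgroupV2CPUQuota returns the number of CPUs implied by the cgroup v2
// cpu.max file ("<quota> <period>" or "max <period>"), or 0 if unlimited.
func readCgroupV2CPUQuota(path string) (float64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	fields := strings.Fields(string(data))
	if len(fields) != 2 || fields[0] == "max" {
		return 0, nil // no quota set
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, err
	}
	return quota / period, nil
}

func main() {
	// The path is an assumption; adjust it if your cgroup namespace or
	// mount layout differs.
	cpus, err := readCgroupV2CPUQuota("/sys/fs/cgroup/cpu.max")
	if err != nil || cpus == 0 {
		return // not a v2 host, or no quota: leave GOMAXPROCS alone
	}
	n := int(cpus)
	if n < 1 {
		n = 1
	}
	prev := runtime.GOMAXPROCS(n)
	fmt.Printf("GOMAXPROCS: %d -> %d\n", prev, n)
}
```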
Next slide. So talking through reliability: sometimes regressions happen. To speak to this next section, I think, Sergey, you're next? Yeah, I'm next. I'm going to talk about a new regression that we discovered at Google in 1.22. It's a continuation of the story that Elana told last time; we did a similar talk at KubeCon. So if you're interested in the beginning of the story, go back and listen to it first. Or you can just continue here and then go back and see what other issues we caught while we've been doing this. It's not the only regression we found, but it's a pretty interesting one. To start talking about this regression, I want to give a quick refresher on the architecture of the kubelet. The kubelet receives signals from the API server, and it also reports status back to the API server. So there is this connection, and then the kubelet also talks to the container runtime to start pods as they are scheduled and kill them when needed.

Plus, it monitors pod status by periodically querying the container runtime for information. This information about pods is generated as events in the PLEG, the pod lifecycle event generator, and these events are fed back into the kubelet. So it processes all of the signals from all the different components to build up a final picture of what is currently happening with pods and the containers in those pods. In 1.22, we fixed a bug. That bug was related to reporting the terminated status of certain containers. There were some race conditions, and to eliminate these conditions, we put more logic together as a single source of truth for pod termination status. As I said last time, there were many race conditions and other bugs we discovered as a result of this refactoring. It was a good refactoring, it fixed a lot of bugs, and, once the regressions were caught, we shipped quite a stable product. Unfortunately, we found another problem in the kubelet in 1.22. It showed up as an "out of CPU" message when you try to schedule many pods simultaneously on quite busy nodes. An example might be jobs that you execute: each job is quite small, but you need to execute a lot of them, and when a new job tries to be scheduled, it suddenly cannot be scheduled and receives this "out of CPU" message. We discovered this bug relatively early. Unfortunately, due to the nature of this bug, we had seen similar "out of CPU" situations before. It was caused, for instance, by static pods: when a static pod gets scheduled while the API server doesn't know about it, the API server may still push some pods onto the node, and the node will not have the CPU, because the static pod just got scheduled and the API server just wasn't aware of that. Other cases had caused "out of CPU" before. So we investigated; we thought that it may be some scheduler problem, or some rare issue that nobody had caught. And the biggest reason that we didn't catch this and didn't know about this regression is that we don't have many tests that schedule and reschedule many pods often.

So what did we do? The fix was quite simple for this one. Oh, I didn't explain the root cause. The root cause was that when we fixed the pod termination detection and put it in a single place, we made the kubelet know about a pod being terminated earlier than before. And since the kubelet knows about it, it syncs the status with the API server, and the API server then knows that the pod is terminated, so it's time to schedule a new pod in place of this one. The fix was to delay reporting that status to the API server. I highlighted the key code that participates in the fix, but the PR was much bigger than that. So now we delay reporting the terminated status of a pod to the API server until all the resources are cleaned up. And by "all the resources" I don't mean all resources; it's only CPU and memory in this case. We made some observations while we were fixing these bugs, beyond just that distributed systems are hard and you need to find a balance between speed and consistency. We also highlighted again that a terminated pod is not equal to a deleted pod. A terminated pod can report its termination, but it will still be cleaning up some resources. And one resource in particular that is torn down during the termination period is volumes. Volumes may take a lot of time to be torn down, and the kubelet accounts for that. So a new pod can be scheduled that wants to use the same volume, and it will wait until the volume is completely cleaned up and then pick it up.
So this behavior is different from what we did for CPU and memory, and in the future we may consider improving it for CPU and memory as well, so scheduling happens faster and the kubelet just waits a little bit for CPU and memory to finally be freed up. But that may be a future improvement. This all highlights again that we need to have good CI for Kubernetes and make sure that we cover all the bases when we test it. So we added an end-to-end test for this regression that schedules many pods and makes sure that the terminated status isn't synced to the API server too early. In general, the CI subproject meets every week, and what we do is look at all of TestGrid, analyze what's happening, what's going badly and what's going well. We fix issues, adjust the tests, and also create new test coverage. So if you're interested in joining SIG Node and you don't know where to start, the CI subproject may be a very good place to start. You will learn Kubernetes through testing Kubernetes, and it will give you a lot of insider knowledge, like how to test and what kind of situations the kubelet may get into. We recently published an achievements blog post; I linked it here, so you can go and check it out. We have 18 active contributors that we highlighted. We really appreciate all of your work, and we want to highlight it more and more. We don't have time in this talk to name names, but yeah, hi everybody, thank you for your contributions. I also posted some statistics here. They're from some time ago, and they're somewhat of a vanity metric, but what they show is numbers: we do a lot of work. We do reviews, we do bug fixes. So if you're interested, join us and help us in this hard task. This is how we track work in the CI subproject: we have a dedicated project board for the tests, and we receive an early reliability signal by doing bug triage. With that, I want to go to Derek to talk about how people contributing to SIG Node can get more exposure to what we're doing.

Thanks, Sergey. So one of the challenges in an open source project at the scale of Kubernetes, and of SIG Node itself, is just the breadth of things that we need to keep in mind as maintainers, so that we don't break or regress the past and we can continue to build responsibly for the future. SIG Node is the third largest SIG by absolute workload. Lots of PRs get opened; lots of PRs need to get merged, closed, and triaged. And contributions can come from a variety of member companies who work in the community, sometimes enduring and sometimes just in particular feature or function areas. What this means, in general, is that for us as a project to be healthy, we need a constant pool of help to drive that work forward. So if we go to the next slide: one of the things that we're trying to do in the SIG Node community is to ensure that we have a viable set of healthy new contributors and reviewers, and a pathway for approvers going forward. Hopefully by the time you are watching this at KubeCon, we will have a PR merged in SIG Node, I think the community area is where we're looking to put it, where you can look through new guidelines on how we want to encourage a clear pathway to evolving responsibility within the SIG, to set guidelines for reviewers and approvers, and then expectations for those who become approvers with respect to making clear their individual domain knowledge.
One of the concerns we have as a Kubernetes project, and particularly in the kubelet, is that it's kind of an intersection point between many other aspects of the project, such as networking and storage. And our code isn't always as decoupled as we'd like from those two intersection areas. And so, particularly on the pathway to approver, things like security and knowing what to look at are of paramount importance. So if you want to help grow your contributions in SIG Node, the CI subproject is a great place to get started, but just join us at SIG Node and hopefully we can get you moving forward on that path. Next slide. So, Elana, do you want to talk about how folks can get involved?

Yeah, let's do it. So we often talk about our contribution priorities, and this draws back to a few of the things I said about the roadmap. We try to prioritize stability first. We want to ensure that tests are fixed, that bugs are fixed, and that open triage issues are fixed. We want to do that before we start working on new features, because otherwise we might get overwhelmed by the number of bugs and the amount of test breakage that we're working on. It's really important to us that our test infrastructure, monitoring, and health are good and our tests are green, so we have good signal to work with. So this is how we prioritize contributions. And if you really want your new contributions to get looked at, you are more likely to get a positive response if you are fixing something that's broken, versus "I have this great new feature", because we're sure that your feature is great, but we have so many great features that it might not be the best place to start, because there might be something broken blocking it. Optimizations that improve the performance of existing components are also great to include.

Other areas you might be interested in contributing to, and I often try to call this out especially to folks who might be using the kubelet and have opinions on SIG Node but aren't necessarily going to write code: you can still contribute. You can still help us out, either by doing performance testing, or writing documentation, or giving us feedback on logging and metrics and how we can improve them, or just helping us triage and keep on top of the issues. You don't have to write code in order to contribute to the SIG, so we welcome all forms of contribution.

So how do you contribute? Well, first it helps to join our SIG meetings so you get an idea of what we're currently working on. At the main meetings of the SIG, we talk about major features that are being worked on, enhancements and what's going into a given release milestone, high-priority bugs, and more. If that seems kind of intimidating to join when there are 40-plus people in a meeting, you could try coming to our CI and triage sessions, where we have a smaller, more hands-on group and we work directly on specific bugs, specific issues, that kind of thing. As I said previously, you can also participate in code reviews, issues, and documentation; this is all awesome. A big shout-out to all of the folks who've been helping out with the Docker shim removal documentation this release; that was an enormous effort and a great help to SIG Node. And it's also really great if you adopt features: turn on alpha feature gates and let us know how they work. So where can you find us? This slide links to our regular meeting, which is currently scheduled on Tuesdays at 10 a.m. Pacific time.
Our CI triage meeting is every week on Wednesday at the same time, 10 a.m. Pacific. We have a Slack channel, #sig-node, on the Kubernetes Slack. We also have a mailing list, the kubernetes-sig-node mailing list on Google Groups. And our current chairs are Dawn and Derek. Thanks so much for attending this talk.