Hi, welcome to KubeCon, CloudNativeCon Europe 2021 Virtual. Today's talk is focused on Kubernetes labels everywhere: decluttering with node profile discovery. My name is Dave Cremins, I'm a cloud software architect in the Network Platform Group at Intel. Today I'm joined by Conor Nolan, who's a senior engineer on the orchestration team at Intel. For today's agenda, we're going to cover node feature discovery: an overview and background on NFD and the problem statement associated with it. We also have an example of a complex node specification, and we intend to provide a conceptual overview of node profiles. We have put together a very simple demo that highlights the problem and presents a possible resolution to the issue. And finally, we will have some key takeaways for you too.

So before discussing the problem statement at hand, let's focus on what NFD is today. NFD is the node feature discovery component available in Kubernetes; it detects hardware features and advertises those features using node labels. These features can be categorized under numerous specific domains, such as CPU, IOMMU, kernel, memory, network, storage, PCI, system, USB, et cetera. It's an important component in the Kubernetes ecosystem, given that numerous different types of workloads need special attention, let's say, to specific features of a platform. So for instance, if a platform had a particular feature that was required as part of your workload, then we'd want to ensure that the workload lands on a compute node that has the intended feature available. But even though this helped in the placement of workloads, or at least complemented the placement of workloads, it still promoted a tight coupling to individual platform capabilities.
If I drill into that, what I'm really saying is that your workload was very individual-feature aware: each feature that was required or intended to be leveraged by your workload needed to be specified up front. That's what I mean by tight coupling. When we look at this across numerous different types of workloads, and especially from my background, which is really based in telco workloads, there are a lot of features required from the platform in order for a workload to run deterministically and with the right level of performance and throughput. When we look at what's expected of the platform, it can become a configuration nightmare very quickly. It really adds to the complexity, not to mention that it becomes that bit more difficult to schedule when you have so many different requirements specified as part of your workload or your pod specification.

Also, when we look at the discovery mechanisms, there are actually multiple ways to do detection and labeling. NFD is just one; there's also Node Labeller, and there are various other components out there in the ecosystem that are also capable of labeling nodes with their features and making them available. But with each offering and each component that is capable of doing that, we essentially extend the number of features that are, let's say, claimable or schedulable in Kubernetes. We end up creating this laundry list of all these different features, and as I said, this can become unmanageable, adds an extra level of complexity, and presents new scheduling headaches. The number of features continues to grow, and this is the existing pattern; this is what's out there today. Every couple of months there'll be new feature detections added to NFD, for instance, and there'll be new node labels available. And if a workload does need a specific feature capability out of a platform, it's going to be added somewhere so that it can be claimed.
But again, this particular pattern focuses too much on the individual features. When we focus on individual features for a workload, we tend to misalign with the abstraction layers provided by Kubernetes. Kubernetes is a stable API with the right level of abstraction, and when we compare against that, we kind of break the abstraction by tying our workloads to specific individual platform capabilities. So the point of my slide here today really is to try and pivot towards a platform-centric perspective. No longer do I need to be concerned with all of the individual feature capabilities of a particular node; instead, I want to be able to avail of a platform offering. Holistically speaking, what can my nodes offer me so I can land my workload there? When we look at it under that lens, we start to perceive the node as almost an optimized accelerator for your specific workload. An example in this case would be something like a power-efficient packet processing node. When I have a packet processing workload, I don't need to be concerned with all of the individual feature items that I need for that workload. Instead, I can target a specific node because it's been configured for this purpose, and what's available to me from a consumer perspective is power-efficient packet processing. There's a big difference, and it simplifies things considerably.

In my next slide, I wanted to showcase an example of a very complicated specification. This is taken from a real-world application. It is telco oriented, but it gives you an idea of how fast things can become unmanageable and out of control when the list of feature requirements, be it for your node or your workload, continues to grow. So as I say, this is a realistic specification. We put it in video format because we couldn't fit it on the slides, and I think we're almost done.
So, as evident, or at least as showcased by that, I think it should be clear that there are a lot of opportunities for a really large level of complexity within feature capability detection, not just with NFD, but with the feature detection mechanism itself. That was an example of how complicated a specification can become and how hard it can be to manage. Now I'd like to hand you over to my colleague Conor, who will take you through more of the presentation. Over to you, Conor, thanks.

Thanks, Dave. Okay, I'm going to do a quick high-level overview of how node profile discovery works, first by showing the flow that currently exists, and then how NPD can make life a little bit easier for the user and for the scheduler. Let's take a really simple example. Here we've got a cluster of eight nodes, and each node has a different set of features. Keep in mind the example that Dave has just shown of a node with a couple of hundred individual features instead of a handful like we have here, and then picture a cluster with 5,000 of those nodes instead of eight, and you quickly get an idea of how unmanageable the situation can become. So a pod is created, and it's looking for a list of individual feature labels, and it's using the nodeSelector construct to ensure the scheduler gives it a node with all of these features. If I'm the scheduler and I'm going through my filtering process, I'm going to look at this pod and the labels specified and say: okay, this pod wants the AES-NI crypto instructions. Node five doesn't have that, so that gets filtered out. This pod wants the AVX-512 instruction set. Node one has AVX, but doesn't have AVX-512, so that's no good. This pod wants SST-BF, Speed Select Technology Base Frequency. Node seven doesn't have that. And this pod also wants L3 Cache Allocation Technology, and node eight doesn't have that. So those are all filtered out.
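The hardware-feature half of a pod spec like the one walked through above might look something like the following sketch. The label keys follow NFD's `feature.node.kubernetes.io/` prefix convention; the pod name and image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: packet-processing-pod              # hypothetical workload
spec:
  containers:
  - name: app
    image: example.com/packet-processor:latest   # hypothetical image
  nodeSelector:
    # One NFD-advertised label per required hardware feature
    feature.node.kubernetes.io/cpu-cpuid.AESNI: "true"           # AES-NI
    feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"         # AVX-512
    feature.node.kubernetes.io/cpu-power.sst_bf.enabled: "true"  # SST-BF
    feature.node.kubernetes.io/cpu-rdt.RDTL3CA: "true"           # L3 CAT
```

Every additional feature requirement becomes another nodeSelector entry, which is exactly the laundry-list growth being described.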
I see this pod is also looking for a couple of custom feature labels. Let's say the cluster admin has configured some nodes in the cluster to run the single-numa-node topology manager policy, and then they've applied a label to those nodes to signify that. Likewise, they've configured some nodes with isolated CPUs (isolcpus) and applied a label to those nodes to represent that as well. This pod wants a node where it will be guaranteed NUMA alignment, so that rules out nodes two and four. And finally, the pod wants to run on a node configured with isolated CPUs, so node three is no good there. After all that, node six is the only feasible node, and the scheduler moves on to the next step in its filtering process.

So now, instead of polluting our pod spec with a bunch of individual features, let's create a node profile CR containing all the features we want to make up a profile that best suits our workload. Once the CR is created, node profile discovery reacts and applies a new profile label to any nodes that fit that profile. Then, when the pod is created, instead of looking for a plethora of individual features, it's now looking for one label: the label of the profile that has been tailored to optimize this particular workload. Overall a very simple concept, but one that lends itself to a top-down approach to workload scheduling and infrastructure management. And this approach can be utilized to automate and scale the process of cluster slicing, that is, designating nodes optimized for a particular performance-critical workload to run those workloads only. This is a concept we'll elaborate more on in the demo, which I'm going to go through next. Okay, so as a proof of concept, we've built out node profile discovery as a Kubernetes operator. It's deployed into the cluster alongside NFD and then essentially leverages the feature discovery performed by NFD to enable these higher-level profiles. So let's take a look at the cluster setup.
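The profile CR and the resulting simplified pod spec might be sketched as follows. Since NPD is a proof of concept, the apiVersion, kind, and field names here are assumptions of this sketch; the `profile.node` label prefix is the one shown later in the demo:

```yaml
# Illustrative NodeProfile CR (CRD schema is an assumption of this sketch)
apiVersion: npd.example.com/v1
kind: NodeProfile
metadata:
  name: high-performance-packet-processing
spec:
  featureLabels:
    feature.node.kubernetes.io/cpu-cpuid.AESNI: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
    feature.node.kubernetes.io/cpu-power.sst_bf.enabled: "true"
    feature.node.kubernetes.io/cpu-rdt.RDTL3CA: "true"
---
# The pod now selects on a single profile label instead of a list of features
apiVersion: v1
kind: Pod
metadata:
  name: packet-processing-pod
spec:
  containers:
  - name: app
    image: example.com/packet-processor:latest   # hypothetical image
  nodeSelector:
    profile.node/high-performance-packet-processing: "true"
```

The feature-to-profile mapping lives in one place, curated by the cluster admin, rather than being copied into every pod spec.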
I have a four-node cluster: three worker nodes and a control plane node. If I look at the pods that are running, you can see here the NFD master is running, I've got an NFD worker on each of my worker nodes, and down here we can see that the node profile discovery operator is running. This is a single-pod deployment, so there's no need for individual node daemons or for host-level discovery or anything like that; it's just a single point of contact with the Kubernetes API.

Let's take a look at the labels on our worker nodes. We can see up here on kube-worker one a bunch of feature labels applied by NFD. Kube-worker two also has a host of NFD labels. We can also see a couple of custom labels, the kind of labels we touched on earlier, like a specific topology manager policy or a node configured with isolated CPUs. And kube-worker three has much the same mix of NFD labels and a couple of custom labels. That's all pretty standard. We can take a look at a node profile spec now. This is a simple node profile CR. It's been given the name high-performance-packet-processing, and in the spec are a number of desired feature labels that make up this profile. Again, like in the example we had before, we have AES-NI, AVX-512, SST-BF, L3 Cache Allocation Technology, isolated CPUs, and the topology manager policy. I create that profile, and we can see that it exists. I'm just going to leave the node profile spec open on the left so we have it as a reference. Now, over on my other screen, I'm going to check all my worker node labels again now that the profile has been created. What we're looking for here is a new label with the profile.node prefix. If I scroll up to kube-worker one, you can see that there's no change here. But on kube-worker two, now we can see this new label with the profile.node prefix and the name of our high-performance profile. And on kube-worker three, that's unchanged.
At a glance, we can see that this node doesn't have the desired topology manager label, so it doesn't fit the profile. In summary, what this means is that of our three worker nodes, just one fits the criteria for our high-performance profile, has those matching labels, and is given the additional profile label.

So what would happen if a node no longer fits a profile due to some change in its configuration? For example, let's say the sysadmin decides to disable the topology manager on kube-worker two, and as a result they then remove the topology manager label, like so. What we would expect is for node profile discovery to react and update the node so that it no longer matches the profile due to the change in circumstances. And if we check our node labels again, we can see that not only has the topology manager label been removed, but our profile label has also been removed, because this node no longer fits the criteria for this profile. So now none of our three worker nodes fit the profile, and as a result, none of them possess the profile label. Likewise, if we were to make a change to the node profile object itself: for example, if I remove the topology manager feature label from the spec and then reapply that spec (again, I'll just leave it open for reference), and we go back and check our labels once more, we can see that kube-worker two and kube-worker three both fit the profile and are given the high-performance-packet-processing profile label, because both of these nodes now match the criteria listed over here, now that the topology manager policy has been removed. Also, as an optional add-on to the core functionality of basic feature aggregation, we've explored the possibility of introducing a node tainting mechanism via the CRD. So I'm going to add these additional fields into my spec.
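Before moving on to tainting, the label reconciliation behavior just demonstrated can be sketched in a few lines of plain Python. This is an illustration of the logic, not the operator's actual code: a node fits a profile when every feature label in the profile spec is present on the node with a matching value, and the profile label is added or removed accordingly.

```python
# Sketch of NPD's reconciliation logic (illustrative, not the operator code).
PROFILE_PREFIX = "profile.node/"  # label prefix shown in the demo

def reconcile(profile_name: str, feature_labels: dict, node_labels: dict) -> dict:
    """Return the node's labels after reconciling against one profile."""
    labels = dict(node_labels)
    profile_key = PROFILE_PREFIX + profile_name
    # The node fits only if every required label is present with the same value.
    fits = all(labels.get(k) == v for k, v in feature_labels.items())
    if fits:
        labels[profile_key] = "true"
    else:
        # Node no longer matches (or never did): drop the profile label.
        labels.pop(profile_key, None)
    return labels
```

Removing a feature label from a node, or removing a requirement from the profile spec, simply changes the outcome of the `fits` check on the next reconcile, which is the behavior seen in the demo.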
What we're aiming to achieve here with these additional taint behavior parameters is that 50% of nodes which fit the high-performance profile should be tainted with a NoSchedule taint. In our small example, we have two nodes now labeled with the high-performance profile, and so we're looking for one of those nodes to be tainted. But if you picture this on a larger scale, the purpose of this is to designate specific nodes which have been optimized and configured for particular performance-sensitive workloads: essentially treating nodes themselves as accelerators and slicing your cluster accordingly based on these profiles. Then workloads with requirements which match a given profile are scheduled exclusively to these designated nodes, and those resources are not wasted on less critical workloads.

So I'll update the CR once again, and this time I'm going to check the taints of my worker nodes. We can see here there are no values for kube-worker one and kube-worker two, but kube-worker three has been tainted with the high-performance profile and the NoSchedule effect. This is what we were expecting to see: we had two nodes labeled with the correct high-performance profile, we wanted to taint 50% of those, which is one, and that's what happened. I also want to check the labels of the worker nodes. Now we can see that kube-worker two is still the same; it has this profile label with the value true. Kube-worker three still has the profile label, but you'll notice that the value has changed to tainted. This allows workloads with tolerations to target this node specifically, as opposed to non-designated nodes which also match the profile. This information is also reflected in our CR. If we describe our high-performance profile, we can see down here the spec with the feature labels that we specified and the taint behavior that we set, and here in the status you can see that two nodes were labeled, i.e.
two nodes match the profile, we specified we wanted 50% of those tainted, and you can see here that one of those nodes is tainted. And for completeness' sake, let's say you wanted to scale up the number of designated nodes. Let's bump this up now to 100%, which means we want to taint basically all of the nodes which match the profile. I'll reapply that CR and again check the taints. Now we can see that worker nodes two and three are both tainted, and likewise with our labels we should see that the value for our profile label has changed to tainted. And if I check my CR again, we should see that reflected in the status. So again, this is updated here. Like I said, we specified 100%, and that's what we've been given: all nodes which match the profile are now tainted, i.e. designated for a specific workload.

Okay, we've seen what it is and how it works, so now let's cover why you would use node profile discovery and the advantages it can provide. A number of advantages present themselves when we look at the compound effect of multiple features on a platform versus a single concentrated profile. Firstly, the reduced complexity in the workload scheduling process, as we've shown in the example. Then, probably the most obvious advantage, the simplification of the pod spec itself: now you can remove this sprawling laundry list of feature requirements baked into your pod spec and instead reference a single profile which has already been curated for your application's performance needs. Next, a move towards a top-down perspective. What do we mean by this? We want to get away from the bottom-up mindset of "my app needs this feature, this feature, and this feature in order to fulfill a certain quality of service." Instead, let's move towards "my app needs to fulfill a certain quality of service, so what make-up or profile of features is going to achieve that for my application?"
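As an aside on the mechanics, the percentage-based taint selection walked through above (50% of two matching nodes gives one taint; 100% gives both) comes down to simple arithmetic. A plain-Python sketch follows; the round-down behavior and deterministic node ordering are assumptions of this sketch, not confirmed operator behavior.

```python
import math

def count_to_taint(num_matching: int, percentage: int) -> int:
    """How many profile-matching nodes to taint for a given percentage.

    Rounding down is an assumption of this sketch.
    """
    return math.floor(num_matching * percentage / 100)

def select_for_taint(node_names: list, percentage: int) -> list:
    """Pick which matching nodes receive the NoSchedule taint.

    Sorting gives a deterministic choice; the real selection
    strategy is an assumption here.
    """
    k = count_to_taint(len(node_names), percentage)
    return sorted(node_names)[:k]
```

With the demo's two matching nodes, a 50% setting selects one node and a 100% setting selects both, matching what was shown.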
And then we can align these profiles, these make-ups, with the abstractions that are most prevalent in today's deployments. So, cluster slicing: node profile discovery naturally lends itself to a model of cluster slicing, or partitioning of a cluster into groups of nodes with common use cases. The demo showed some functionality for designating nodes for specific performance-critical workloads, providing this mechanism in a lightweight and scalable way that's Kubernetes-native and could potentially remove a lot of overhead for a cluster admin. So with that, I will hand back to Dave for some closing comments.

Thanks for the demo, Conor. So as shown there, we've essentially demonstrated the problem at hand, whereby a laundry list of feature asks can get out of control, and a resolution, whereby we can generate a profile-level label that the individual capabilities roll up into. And when we've done something like this to try and simplify the placement models for workloads, what else can we do? If we do have a proliferation of this type of mechanism, and new patterns emerge based on utilizing something like this, there's always the question of what we can do next. What could we build on that? Is there another avenue of work we could look at? And there is. Because it's easy for profiles themselves to get out of control, and to protect the integrity of profiles, we're looking at a JSON schema to validate them and promote consistency. And if we do start work in that particular domain, then it's easy to tie this into existing automation pipelines and validate the creation of new profiles. I'd say this is something that would be beneficial going forward, assuming that NPD is successful, or that the profile awareness this brings to the community is accepted and utilized. And we've a couple of options here, right?
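As a rough illustration of the profile-validation idea mentioned above, the kind of checks a JSON schema would enforce can be sketched in plain Python. This is a toy stand-in rather than a real JSON Schema document, and the field names (`name`, `featureLabels`) are assumptions of this sketch:

```python
def validate_profile(doc: dict) -> list:
    """Return a list of validation errors for a profile document.

    A minimal stand-in for JSON-schema validation: checks the shape
    a profile would need before being admitted to the cluster.
    """
    errors = []
    if not isinstance(doc.get("name"), str) or not doc.get("name"):
        errors.append("name must be a non-empty string")
    labels = doc.get("featureLabels")
    if not isinstance(labels, dict) or not labels:
        errors.append("featureLabels must be a non-empty mapping")
    else:
        for k, v in labels.items():
            if not isinstance(k, str) or not isinstance(v, str):
                errors.append(f"label {k!r} must map a string key to a string value")
    return errors
```

In an automation pipeline, a check like this would gate profile creation so that malformed or inconsistent profiles never reach the operator.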
What's been demonstrated today is a simple Kubernetes operator that has its own custom resource and its own controller. But we could take a different approach and extend NFD by building the same capabilities directly into it, whereby NFD actually incorporates the profile management and generation. That's another approach we could look at. We've also looked at a potential integration point with NFD whereby we build two separate components: one to focus on profile concerns that can then act as a complementary component to NFD. So the profile concerns and their management could be done in one component, and NFD could manage the actual profile labels themselves. We've also seen some use cases or potential possibilities with policy-based control systems, whereby a policy system could create the profile based on its own policies. This is something that could be leveraged in Kubernetes, given that it is, or can be, policy heavy, depending on your infrastructure and platform. So you could tie it directly in there and generate a profile based on your existing policy controls. And with this, you then have the option of enforcing the profile, so that if a particular policy is not met or honored, then the profile is invalidated based on smarter action from within your policy control. So there are numerous things we can do next with this particular approach.

I hope you've enjoyed today's talk. I wouldn't say it's too much of a push beyond what NFD does, or what feature labels have brought to the community and to the Kubernetes ecosystem, but it's definitely bringing an awareness to your holistic platform capabilities, versus the individual items that are prevalent today. So with that, I'd like to say thank you to the audience for attending the talk today.
We hope you enjoyed it, and we hope to see you again at the next KubeCon. So thank you very much, and we'll see you again, maybe at KubeCon North America. Thanks.