Thanks for coming to our pitch. We're going to talk about node resources. What's a node resource? Well, you've probably heard of Kubernetes and the kubelet and container runtimes, maybe. Raise your hand if you know a little bit about container runtimes — if you understand what those are, that's great. About six years ago — I'm just going to do a little intro, a little history, before handing off to Krisztian — about six years ago we were sitting around a round table working on the Open Containers runtime specification, and Mrunal Patel — I don't know if he's here or not — came up with an idea to do hooks inside the runc runtime, using some text that would go into the OCI specification. Those hooks would be executable programs that would be run at particular points at which a container is executed — for example pre-start, post-start, that kind of thing. But how would people define where to put them?
There would just be a path in the OCI spec to where these hooks — these application programs — would be. They would receive the contents of the OCI spec, maybe some state, and then reply back synchronously, maybe with some modifications to the OCI spec, or they would do an attach or a resource change. And that's where node resources come in. The reason people need to extend these container runtime engines, if you will, underneath the covers, is that resource managers need to be able to modify how containers get access to resources, and how much resource they get. We tried to do some configuration, if you will, in the container runtimes, but we always seemed to get it wrong. And resource managers need to manage resources not just for pods and containers, but also for Docker containers and other kinds of containers that are running on your node in parallel with the kubelet and the worker-node pods that have been assigned. Intel is one of the groups that has always needed its own resource managers, so they're heavily involved. And we did go beyond this whole thing.
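The hook mechanism Mike describes ended up in the OCI runtime specification. As a rough illustration — the paths, program names, and arguments here are invented for the example — a bundle's `config.json` can declare hooks like this, and the runtime executes each program at the matching lifecycle point, passing the container's state as JSON on stdin:

```json
{
  "hooks": {
    "prestart": [
      {
        "path": "/usr/local/bin/inject-resources",
        "args": ["inject-resources", "--verbose"],
        "env": ["LOG_LEVEL=info"],
        "timeout": 5
      }
    ],
    "poststart": [
      { "path": "/usr/local/bin/notify-started" }
    ],
    "poststop": [
      { "path": "/usr/local/bin/cleanup" }
    ]
  }
}
```

Per the spec, `args[0]` is the program name in the usual execv convention, and `timeout` is in seconds; a non-zero exit from a prestart hook aborts container creation.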
Not just the OCI spec: we, the container runtime groups, put together a runtime schema for our hooks — a hook schema — that would sit in a directory, and a container runtime could bring in that schema and look at the JSON, for anybody who wants to do a resource hook inside of the runtimes. But some people didn't really want to just use the hook schemas; they wanted to be able to hook into the container runtimes themselves, so that the behavior of the container runtime could be adjusted as well, before getting all the way down into the bottom end — the runtime engines. So one of the plugins that we're looking at is actually the ability to use this hook schema to load up plugins — executable programs — that will run at boot time, based on systemd being able to modify those configs and set dependencies. I probably shouldn't go into too much detail. But Krisztian has put together a great little plugin facility that will run in container runtimes, so that you can have a process that's loaded by systemd, for example, and then calls into the container runtime over a ttrpc interface. We might even do a gRPC one as well, but we're going to start with ttrpc, which is how we do shims in container runtimes. This ttrpc interface will host the ability for a resource service manager to tell the container runtime that it's interested in a certain type of sandbox — like a pod sandbox — and to make modifications to it. Okay, so I'm going to hand over to Krisztian now — he's a great developer, and we're about to merge his work in.

Thank you Mike, and thanks everyone for joining this session. My name is Krisztian Litkey. I work for Intel as a cloud software engineer, mostly in the orchestration and runtime resource management area. So we are going to talk about NRI, the Node Resource Interface. So what is NRI?
NRI is a common framework for plugging extensions into OCI-compatible runtimes. These extensions — or, as we call them, NRI plugins — implement custom logic for controlling container configuration. So what do we mean by a common framework? Well, it's common because ideally NRI support should be present in all common runtimes, and NRI plugins should work identically across all NRI-enabled runtimes. It's a framework because it comes in multiple pieces which together achieve the goal that we set for NRI — and that goal is to allow altering how the runtime configures containers without actually modifying the runtime itself. Some of the benefits of NRI: for runtimes, it's, I think, obvious — it is another way to share and reuse code. In addition to just using the same packages in the runtime implementations, there is now a common interface to plug in new functionality, which can be implemented once and then plugged into any of the NRI-enabled runtimes. For cluster administrators, NRI provides the necessary plumbing to customize or enforce how containers are configured. This includes control over the initial container configuration during creation, and over any subsequent container configuration updates. Additionally, NRI also allows configuration changes in response to external events which do not actually come from inside the container runtime. So essentially, if you plug your NRI plugin — your extension — into the runtime in your cluster, then you are free to implement your own custom logic for container configuration, but only within the boundaries imposed by NRI. There are certain things which, even with NRI, you are not able — or not allowed — to do. So how does NRI work? To explain how NRI works, we first need to take a look at how the container creation signaling flow looks from a high level in a Kubernetes cluster.
This is going to be a vast oversimplification of what is going on, but it should be sufficient to understand the necessary details in this context. First, a pod spec is created. This typically happens as the end result of creating some higher-level object — for instance a Service, DaemonSet, ReplicaSet, or something similar. Eventually this pod spec ends up in the Kubernetes node agent, the kubelet, on a worker node. The kubelet uses the Container Runtime Interface (CRI) to request the creation of a pod and one or more containers within that pod. The container runtime constructs from this CRI request an OCI specification; this OCI spec describes the container's parameters for the low-level runtime. The OCI spec is then passed on to the low-level runtime — for instance runc or Kata Containers — which then creates the container. With NRI in the picture, things work a little bit differently. Once the initial OCI spec has been created in the runtime, it is passed to NRI, and the NRI plugins can perform customizations on this initial OCI spec according to their own logic. Then this final, updated OCI spec is passed on to the low-level runtime, which creates the container. There are other container lifecycle events where NRI can hook in, and the details of the signaling flow are slightly different for each, but the important thing from the NRI point of view is that a request always comes in via the CRI interface — as long as we are talking about Kubernetes clusters and Kubernetes containers — and NRI acts inside the runtime, making its modifications, before that request is finally processed. If we take a closer look at the NRI bits in this previous picture — if we zoom in and look at how NRI is decomposed — we notice that there are several components which work in tandem, and this is why we earlier called NRI a framework. There is an adaptation client, which is
basically a runtime-agnostic library that helps runtimes integrate with NRI itself and interact with the NRI plugins. Then there is an NRI plugin stub, which takes care of the low-level, boring details of writing a plugin. This should improve code reuse, because instead of each plugin author having to implement all the low-level functionality which is common to all plugins, they can — by using the stub — dive directly in and start focusing on the custom logic that they need to implement. The stub takes care of things like connection establishment to the runtime, communication, plugin registration, and so forth. Finally, there is the NRI protocol itself, which the two previously mentioned components — the adaptation client and the plugin stub — use to communicate and interact with each other. The NRI protocol is defined as a protobuf-based API with ttrpc bindings. The protocol defines an execution model and a data model for NRI. The execution model is basically the set of pod and container lifecycle events which NRI knows about and which plugins can be interested in, and the data model defines which subset of the OCI spec NRI plugins are exposed to, and how this subset can be modified
by NRI plugins. Another way of describing this would be that the execution model defines the events, and the data model defines the inputs and outputs that NRI plugins receive and produce while processing those events. If we look at the events in more detail: the NRI API defines an event subscription mechanism, and plugins only need to subscribe to those pod and container lifecycle events which they are interested in; they only get notified about the events they have subscribed to. Currently pods cannot be modified by NRI plugins, but plugins can still track the lifecycle state of pods by subscribing to the available pod lifecycle events, and they can also act on those events. Because there is no possibility for plugins to change anything in a pod, the NRI pod lifecycle events are exactly the same as for CRI: pod creation (RunPodSandbox), pod stopping (StopPodSandbox), and pod removal (RemovePodSandbox). For container lifecycle events the situation is a bit different, because container configuration can be customized by plugins. The NRI-defined lifecycle events for containers are similar to the CRI ones, but there are a few additional "post" variants, as we call them. The full event set is: create, post-create, start, post-start, update, post-update, stop (which is semantically a post-stop), and remove. Plugins can customize containers during creation, update, and stopping; therefore these events are actually, semantically, requests and not just events, and plugins can respond to them with the set of customization changes that they wish to perform. Next, let's take a look at the data model — plugin inputs and outputs. For pods, again, because pods cannot be modified, the amount of data that we take from CRI and pass over to NRI is slightly smaller than for containers. For pods, the available data includes the Kubernetes namespace, labels, annotations, and the cgroup parent in
the runtime. For containers, the data available to NRI plugins includes the container's labels, annotations, command-line arguments, environment variables, mounts, devices, OCI hooks, and the cgroup parameters which are related to native resources — CPU scheduling parameters, CPU and memory pinning, and memory and huge page limits — and I think that's it. Additionally there are two classes: one for last-level cache allocation, which is called the RDT class, and the other for block I/O scheduling and throttling, which is called the block I/O class. The container event determines what kind of changes a plugin can request in response. Container creation is actually the event where the largest number of modifications can be performed by a plugin, and this directly follows from the underlying container model defined by the OCI specification: once a container has been created, most of its parameters cannot be changed, but there is a subset which can, and this is reflected in the possible responses to the various events. For container creation, plugins can request modifications to annotations, environment variables, mounts, devices, OCI hooks, the resource-related cgroup parameters that I mentioned previously, and the last-level cache (RDT) and block I/O classes. Additionally, for a container creation request, plugins can also request updates to other existing containers in the runtime — not only the one which is being created. For container updates, the requested update can modify the resource-related cgroup parameters, both for the container being updated and for other existing containers. And this is the same for stopping.
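To make the "adjust the new container and update the others" idea concrete, here is a small, self-contained sketch. This is not the NRI API itself — it's just the cpuset arithmetic a CPU-pinning plugin would need: granting CPU 2 exclusively to a new container means removing CPU 2 from the cpuset that every other container keeps using.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseCPUSet parses a Linux cpuset list such as "0-3,6" into CPU ids.
func parseCPUSet(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		a, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		b := a
		if isRange {
			if b, err = strconv.Atoi(hi); err != nil {
				return nil, err
			}
		}
		for c := a; c <= b; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

// formatCPUSet renders CPU ids back into compact list form, e.g. "0-1,3".
func formatCPUSet(cpus []int) string {
	sort.Ints(cpus)
	var parts []string
	for i := 0; i < len(cpus); {
		j := i
		for j+1 < len(cpus) && cpus[j+1] == cpus[j]+1 {
			j++
		}
		if i == j {
			parts = append(parts, strconv.Itoa(cpus[i]))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", cpus[i], cpus[j]))
		}
		i = j + 1
	}
	return strings.Join(parts, ",")
}

// without returns cpus with one CPU removed — i.e. the updated cpuset a
// plugin would request for every *other* running container.
func without(cpus []int, cpu int) []int {
	var out []int
	for _, c := range cpus {
		if c != cpu {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	shared, err := parseCPUSet("0-3") // cpuset the existing containers share
	if err != nil {
		panic(err)
	}
	// Give CPU 2 exclusively to the new container...
	fmt.Println("new container:", formatCPUSet([]int{2}))
	// ...and shrink everyone else's cpuset accordingly.
	fmt.Println("other containers:", formatCPUSet(without(shared, 2)))
}
```

In real NRI terms, the first value would go into the created container's adjustment and the second into `ContainerUpdate` entries for the existing containers, as described above.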
So what else can an NRI plugin do? There are a number of other things. During registration to the runtime, the plugin is offered a chance to synchronize its state with the runtime. During this synchronization it receives a full dump of all existing pods and all existing containers in the runtime, and in response it has the chance to apply customizations to the configuration of any of these containers. These customizations are exactly the same as can be done for a stop or an update event — not the same as for create; basically, resources can be updated, but not much else. Plugins can of course also subscribe and react to pod lifecycle events, but they cannot modify pods. And they can control things which are outside the scope and control of the OCI specification — nothing prevents a plugin from doing that. There are also a number of things that NRI plugins can't, or shouldn't, be doing. Multiple plugins cannot make simultaneous, conflicting changes to the same container; this is checked by the NRI infrastructure, and if it happens during container creation or container update, the transaction is simply rejected with an error. Another thing is that plugins cannot control those parts of the OCI spec which have been intentionally left out of the NRI specification. We tried to be careful to include everything that we think is needed — of course it is possible that we overlooked something — but the rest of the parameters, the ones which cannot be controlled by NRI plugins, are left out intentionally. And although I said that NRI plugins cannot control those parameters, in reality this is only true as long as NRI plugins do not try to bypass the runtime — for instance by direct cgroup manipulation, making changes on their own to these parameters, which they should not do. So, how to write an NRI plugin?
If you are interested in writing an NRI plugin of your own, the easiest way to start is to copy the template plugin that we have and fill in the missing details. The template plugin is basically a very few lines of code wrapping the stub, just enough to create an executable and start it up. Without any modifications, this template plugin simply subscribes to all events and prints out the events as they are received. It does not make any customizations to any container — it always responds with an empty response. So the easiest way to get up and running is to clone this one and start modifying it. How do you need to modify it? Well, first you subscribe to the events that you need, because you probably don't want to receive all the events, and then you implement the actual customization logic for containers. If you remember, there were three events to which a plugin can respond with customizations — those are the control points that you have, so those are usually the ones you want to subscribe to: CreateContainer, UpdateContainer, and StopContainer.
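As a sketch of how these control points fit together, here is a toy model of subscription-based event dispatch. To be clear, this is not the real stub API (that lives in the github.com/containerd/nri repository); the event names, `Adjustment` type, and `dispatch` function are invented for illustration. The shape is the point: a plugin declares the events it subscribes to, and only those handlers ever run.

```go
package main

import "fmt"

// Toy model of NRI's subscription-based event dispatch — for illustration
// only, not the actual NRI plugin stub.

type Event string

const (
	CreateContainer Event = "CreateContainer"
	UpdateContainer Event = "UpdateContainer"
	StopContainer   Event = "StopContainer"
)

// Adjustment stands in for the customizations a plugin may request
// (e.g. cgroup parameters, mounts, environment variables).
type Adjustment map[string]string

// Plugin pairs a set of subscribed events with a handler.
type Plugin struct {
	Name       string
	Subscribed map[Event]bool
	Handle     func(ev Event, container string) Adjustment
}

// dispatch mimics the runtime side: only plugins subscribed to an event
// see it, and their requested adjustments are merged in plugin order.
func dispatch(plugins []Plugin, ev Event, container string) Adjustment {
	merged := Adjustment{}
	for _, p := range plugins {
		if !p.Subscribed[ev] {
			continue
		}
		for k, v := range p.Handle(ev, container) {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	logger := Plugin{
		Name:       "logger",
		Subscribed: map[Event]bool{CreateContainer: true, StopContainer: true},
		Handle: func(ev Event, c string) Adjustment {
			fmt.Printf("logger: %s %s\n", ev, c) // observe only
			return nil                           // request no changes
		},
	}
	pinner := Plugin{
		Name:       "cpu-pinner",
		Subscribed: map[Event]bool{CreateContainer: true},
		Handle: func(ev Event, c string) Adjustment {
			return Adjustment{"cpuset.cpus": "2"} // pretend to pin CPUs
		},
	}
	fmt.Println(dispatch([]Plugin{logger, pinner}, CreateContainer, "ctr0"))
}
```

The logger here behaves like the template plugin Krisztian describes: it subscribes, observes, and requests nothing.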
What you typically want to do is make your initial customization in response to the container creation event, or request. If your custom logic is such that the changes you are making to the container being created have side effects on existing containers, then in response to this event you typically also want to customize the configuration of those existing containers, so that the effect is somehow mitigated or reflected. A good example: if you want to allocate, for instance, an exclusive CPU to a container, then it's only exclusive if you also exclude the same CPU from the cpusets of the other containers. And if you are modifying other, existing containers for a CreateContainer request, then you typically also want to subscribe to the StopContainer event, because then you need to somehow undo, in the other direction, the modifications that you made for the creation event. The same is true for update. So these are all reactive changes, or reactive customizations: NRI is telling you that something is happening in the runtime — a container being created, updated, or stopped — and you react with customizations in your response. But NRI also provides a mechanism for making unsolicited customizations. For instance, one example would be that you are collecting some kind of runtime metrics, and based on those metrics you would like to, let's say, redo how CPU allocation is spread across the containers, so that they could perform better. This you can do with the UpdateContainers call, which, control-flow-wise, goes in the other direction: NRI is not calling you — you are calling into NRI to tell it your customizations. These customizations are exactly the same as for stop or update.
So they are limited to the resource-related cgroup parameters and the block I/O and RDT classes, and NRI will then tell you whether this succeeded or not — and if not, you need to somehow react to that. Those are the short guidelines for how to write a plugin. Of course, one good idea is to take a look at the sample plugins that we have in our repository. We have a few — not too many — but those can be useful as inspiration and guidance. And we have some documentation — a little bit limited, but we are working on that. So, examples and use cases. We have a few plugins already available. Some of those are real-world plugins, because they have been written to mimic existing, similar functionality in some of the runtimes. It's sometimes questionable whether those use cases really make sense nowadays, but nevertheless they exist, and we wrote those in the hope that once NRI is merged, some of this functionality could be removed from the core of the runtime and made available as plugins instead — to keep the core lean, and to keep out this kind of — I don't know what I should call them — use-case-specific additions, because that's a little bit how they look to me. Keep those out of the core, and rather implement them as NRI plugins. So, the real-world ones that we have. One is annotation-based device injection. If I recall correctly, I saw this code in CRI-O and thought: okay, I will implement this as an NRI plugin. It basically reads an annotation with a well-known key, interprets the value as a description of a device — the device parameters which should be injected into the container — and then injects it. Then we have another one for CDI device injection, which is by now obsolete, but it's still available.
This is how we initially implemented CDI device injection. If you don't know what CDI is: CDI is basically the low-level device description format used by the upcoming DRA, or Dynamic Resource Allocation, feature in Kubernetes to inject devices into containers and manage them a little bit better — actually a lot better — than you can with traditional device plugins. This was easy to implement as an NRI plugin, so it was one of the first test cases for ourselves: do we have the functional coverage so that some useful things can be done? But because we did not want to add a dependency between CDI and NRI — we wanted to completely decouple those — we eventually implemented this natively in both CRI-O and containerd. That's why this plugin is now obsolete. Then the third one — I think it's Mike's favorite — is OCI hook injection. In CRI-O you have OCI hook injection support, but you don't have it in containerd, and this has been a long-standing discussion; it has been attempted several times, in one way or another, but the PR never really went in. So we decided that maybe an NRI plugin would be a good idea, because then it's not a configuration option — you can completely leave out the code. If you need to lock this down in some environments, so that it is absolutely not possible, then it is a little bit easier to assess if you simply don't have the plugin present on the system than to go take a look at the configuration. This code has pretty much been taken straight from the CRI-O implementation and turned into an NRI plugin.
It's really just a few lines of code. If you have an NRI-enabled containerd and you load this plugin, then it works exactly as CRI-O handles this schema-based OCI hook injection for a container. There is one more real-world plugin on my to-do list, and this also comes from the CRI-O code base. CRI-O has a built-in feature they call high-performance hooks. This is actually a really good example of functionality where you react to a container lifecycle event but then perform actions which are outside the realm of the OCI specification — this piece of code is not there for fun, it's there because the same thing cannot be done through the OCI specification. Nowadays it does other things as well, but initially, when I took a look at it, one of the things it was doing was this: when you create a Guaranteed-QoS container and that container gets an exclusive CPU allocation, it migrates all the interrupt handlers off of those CPUs which have been assigned to this container, so that your user-space processing does not get preempted by kernel-level IRQ handling. This provides even more isolation from the rest of the containers and processes running on the system than you get just from a normal exclusive CPU. And when that container is stopped, the migration is done the other way around — the IRQ handling state is restored to the original. We don't have a plugin for that yet, but it's on the list, because it is something that we promised the CRI-O folks we would do. Then we have a couple of debugging and development plugins. We have an event logger — that's an easy way to take a look at what is going on, in the NRI sense, in a runtime.
The logger registers itself to the runtime, subscribes to all events, and then does a full dump of every single event it receives; it never requests any customizations to any containers, so in that sense it's a no-op plugin. Then we have a container differ. That is basically a variation of the same idea — it works on the same principle: it registers itself, subscribes to all events, and never does any customizations, but instead of dumping the events, it only dumps the diffs between the chain of customizations that several plugins are requesting for a container. So if you have more than one plugin, you can see what is going on. Then we have the plugin template, which is basically a simplified version of the logger — it's not doing verbose logging, but otherwise it's almost the same. And then we have one more, experimental plugin. That is a modified version of what I would call an experimentation vehicle that we have been using in our team for a couple of years now, to experiment with various resource-assignment policy improvements in the orchestration and runtime space. Originally it was written around the fact that the kubelet and the runtime communicate over something called the CRI, the Container Runtime Interface, and the original implementation of this is a CRI proxy — that's why it's called CRI Resource Manager. The way it works: towards the kubelet it pretends to be the runtime, and towards the runtime it pretends to be the kubelet, and
It presents to be the cubelet and As the prequests are passing by it is just consulting its own policy that hey What should happen and then does the customization by modifying the CRI requests and because this is a some kind of a hack so we decided that since we have Possibility to do to do much Of these same things with NRI so we modified it that we added the NRI plug-in support to it So if you give it a special common line option, then it registers Itself to the runtime where the NRI interface and then does all the same things That it would do over the CRI proxy method, but using NRI customizations Yes, I think that that was all so Do you want to say Mike something about the current status? Where we are we're close to merge in the PR. Okay, so so in continental a We are close to getting it much, but it's not not there yet. Yeah scheduled scheduled for 1.7. Yeah But minor getting the go dependency. Yes. Yes. We'll get that result Okay, so If you want to take a look at NRI so currently you can do it so that you are take you for instance close clone from those Depending PR so you can wait until we get those much and then and you can Get your hands Any questions? I Think we're probably close on time. Do we have any extra buffer between now on the next? Cool Thanks question questions Hi Just a simple question the NRI plug-in. Do you have any kind of mechanism for dependencies between NRI plug-in or they have to be? No, so so we want it's a good question. We want it to start very simple So we recognize that it might be possible that you want to Split up your full processing chain into smaller components because it might be that you want to just in some configurations Do part of it, but not all we have done we have our own test for that But we wanted to start very simple. 
So basically we went with the traditional init-style index, and this is something that we might later change, if it turns out that that's the dominant usage pattern — but at the moment, for us, it has not been.

There's no reason you couldn't just set up systemd dependencies and run your plugins that way — each of those plugins will have their own requirements, right, and then call into the ttrpc API. If you start them via systemd, systemd can do this ordering for you.

Exactly, exactly — otherwise every plugin wants to control the others: everybody wants to be first and everybody wants to be last, right? Yes — so basically, okay, sorry, I forgot to say how this goes: when the plugin registers itself, it gives an index and a name, and the index is used to order the plugins. That's it. The name is used, basically, to pass it some configuration, if you want to manage your configuration that way.

What are the security implications of running something like this?

So, for security: we tried to leave out everything in the OCI spec which is directly security-related. Seccomp filters, system calls — none of that can you touch via NRI. But currently it is so that NRI does not have any kind of access control with various privilege levels. Once you are able to connect to the NRI socket — if you are able to register yourself — then you can do whatever is currently possible. This is something that we discussed with Mike, you know: do we need something like that?
And I think this is something where we need to experiment with this whole thing a little bit more, and then we will understand whether such a thing is needed or not. Currently we think that it's enough that you either lock it fully down or you let somebody in — because we tried to mitigate the security implications by simply leaving out everything that we think is potentially problematic.

Yes — so, we had this CRI Resource Manager, which basically works so that it has several built-in policies. What we were thinking is that once this is merged, we would like to rework it: rip out all the CRI-based code, clean it up, split it up into smaller pieces, and then make it available a little bit like the stub. The idea would be: hey — because you can do other things than actual resource management with this, like the OCI hook injection — but if you want to do resource management, if you want to experiment with that, if you would like to, you know, disable the corresponding components in the kubelet — the CPU manager, memory manager, topology manager — and try doing it here, then the idea is that we will turn that into something you could start from. It has a little bit higher-level abstraction than the plain stub — it has things like allocate and release resources, and that's it — so you implement and program against that interface, which is resource-assignment specific, and then you could use that. But we don't have that cleaned up and reworked at the moment. It's a great question, right? It's going to be default off.
However, the whole point of this, right, is that some of those hooks we talked about — some of the ways that you do resource management — already exist there for root. So we're trying to put together a higher-level process where we can start managing this with policies. For example, "pods immutable" could be a policy that we hand through, and then say: sorry, no hooks.

Thank you. First off, my apologies — I missed your whole talk, I just came here. So here's my question, here's the problem that I have, and I'm just curious if this would help. I'm with a company with a storage solution, and we need to install a kernel driver. Right now there is — there was — SRO, the Special Resource Operator, which is OpenShift-specific and a genealogical dead end; KMM is the replacement, if you're familiar with it. Does this help?

Yes. In your particular case you would want to have a plugin that would do the install before the first pod ran, or before the first container ran. It would just be a plugin that registers interest in making sure that that dependency has been met. That would be one way to do it, right? Or you could just run your own — well, you said yourself you needed your resource to be installed before your application. Yeah, exactly — so it's a dependency you have.

Yes. So did I understand it correctly that basically the problem is that, in connection with some of the lifecycle events, you want to do something on the host side — install something? Yes? Okay, yeah, that should be possible. And it could also be dynamic.

Mike, I think we are out of time — one more question, and then come forward and ask questions afterwards. Thank you very much for delivering a great talk, I really appreciate that.
I have a question related to containerd and CRI-O configuration itself. If you look at how to integrate this, there are a lot of manuals that say go do that with systemd, and blah blah blah. Is there anything on your roadmap to simplify that stuff — maybe some hooks, just like you mentioned? Wherever you're making manual changes, there are greater chances that a failure or something goes wrong, for sure.

For administrators it's not going to be more work than the work they do today, right? You can currently configure your system in CRI-O to do this by putting your hook schemas in a particular location — the administrator just has to set it up, you know, with the defined JSONs for what's going to run at a particular point in time. And this just modifies that, so you no longer have to do it administratively. But again, each resource manager may have its own complexity. We haven't made this simple for users yet — this is more for developers and the owners of the resource managers at this point. As he mentioned earlier, we're going to have to work on policies, and that's going to mean we need more declarative specifications. You're going to set them in your pod spec — make it simple, bring it down, pass it to the container runtime through CRI — and then we can just execute it, with the plugins receiving the information from the policy requirements. It's not going to be easy, but we'll get to it.