Hello everybody. So, let's start. I'm Marcin, a senior software engineer at Huawei. Today I was supposed to present together with Joe, but he couldn't come to Paris, so I'm doing it alone. Today I'd like to bring up a problem that exists in the container ecosystem: we don't fully describe the compatibility of images. I would like to bring more attention to the problem itself and gather your honest feedback.

In October 2023, we established a new working group under the OCI. We are trying to define image compatibility: basically, what it means for users, how it could be used, and so on. We want to initially support Linux, Illumos and Windows, and in the future FreeBSD; I know people there were interested, but they don't have the capacity to work on it right now. The final list of supported fields is to be determined by the working group, so we don't know yet exactly what we want to support there. There are ongoing discussions with many different opinions, but we will get there, and finally we will try to figure out an appropriate solution. One more note from my side: all the opinions in this presentation are my own. This is not an official statement from the working group itself; the presentation is based on the working group's discussions and my observations there.

So, let's talk quickly about container compatibility and portability. "Containers are often thought to be completely portable across time and space, and a lot of the time they do work. That's true until they don't." And: "Containers are just regular Linux processes, with many of the same advantages and disadvantages when it comes to portability and compatibility." Those two sentences are very accurate, because for most users, I mean 90% of them, containers work fine, and there is a mindset in the community that if I run a container, it will just work. However, in telco we discovered otherwise, which is quite obvious in hindsight, because we require specific things from the hardware and the OS configuration. A container cannot always work out of the blue; the host has to be configured, especially for heavy workloads like CNFs, routers and so on.

Once we started this working group, I learned that compatibility means something different to different organizations. I find the definition of compatibility from the Cambridge Business English Dictionary very accurate: the ability of computer programs to work successfully with other machines or programs. "Work successfully" is the key phrase here, because it means something different for different organizations and interest groups. Initially, when we started the working group to determine image compatibility, I thought about it as something binary: zero or one, the container either works or it doesn't, it fails or succeeds on the node. But I was a little bit wrong, because people came up with different approaches, and that led to the idea of compatibility contexts. I will present three of them here. The first one is host compatibility: the container's hard requirements. This is what I said a few seconds ago, whether the container can work on the host or not.
An example: a container may require specific hardware, like a CPU, NIC or GPU: the CPU because of CPU features, the NIC because of performance or features, and the GPU for computation. Additionally, we have the OS side, where you have kernel features, modules, generally speaking the whole kernel configuration, and other aspects of the OS as well. An example could be a virtual machine that runs in a container. Imagine we have container X and inside we want to run a KVM virtual machine. If KVM is available on node A but not on nodes B and C, then you have a problem and your container will not run there. So, this is a hard requirement (I'll show a minimal sketch of such a check after this paragraph). Another example is the NVIDIA CUDA toolkit versus the NVIDIA kernel driver available on the host. I don't want to go too deep into how NVIDIA does backward and forward compatibility; that's not the point here. The point is that if you require specific features of CUDA 12.1, you have to have the appropriate NVIDIA driver present on the node.

So, the first category is the container's hard requirements, and then we have performance metrics. The container can run on the host, but if it doesn't meet performance criteria, it can be considered incompatible. This is very relevant for real-time or near-real-time applications, like telco services with very high SLA requirements. An example from Huawei Cloud: we have a very critical telco service with high SLA requirements on throughput and latency. We have container X, and in theory it could work on all of the nodes. However, benchmarks show that container X doesn't perform well with NIC C, and then we basically consider the container incompatible. So performance metrics are also a criterion for whether a container is compatible with a host or not.

And the last one is the most optimal image. I was a little surprised by this approach, but it's about selecting the most appropriate nodes for the image, the ones that result in a maximally optimized workload on the cluster, and if that's not possible, falling back to nodes that can still run the container. That's also a sort of compatibility. An example: if an application benefits from Advanced Vector Extensions (AVX), you want the container scheduled on a node with AVX support; but if that's not possible and the container can still work without AVX, you want to fall back to nodes B and C. So this is the third compatibility context.

So, how do we do this today? Today we expect users or sysadmins to provide the appropriate infrastructure for containers. However, we could inform them about the workload that runs there by adding some compatibility metadata on the container side, so it's not only the sysadmin trying to figure out how to configure the cluster. This is especially relevant if you are running a cluster and pulling in vendor software: at some point you are responsible for it, for debugging and maintaining it, so you want some information about the compatibility of the container itself to help you debug. Another benefit is grouping applications: you could group specific types of applications into one workload type, which would let you optimize the cluster itself and its configuration, and give you better visibility.
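To make the hard-requirement case concrete, here is a minimal sketch of a pre-flight check for the KVM example above. It assumes the image somehow declares that it needs KVM; the requiresKVM flag and the /dev/kvm probe are purely illustrative, nothing here comes from an OCI spec.

```go
// Minimal sketch of a host-compatibility "hard requirement" check for the
// KVM example. The requiresKVM flag stands in for hypothetical compatibility
// metadata carried with the image; it is not a real OCI field.
package main

import (
	"fmt"
	"os"
)

const requiresKVM = true // would come from the image's compatibility metadata

func main() {
	if requiresKVM {
		// On Linux, KVM is usable only if the device node exists on the host.
		if _, err := os.Stat("/dev/kvm"); err != nil {
			fmt.Println("incompatible: /dev/kvm not available on this node")
			os.Exit(1)
		}
	}
	fmt.Println("compatible: all hard requirements satisfied")
}
```

The point is that a check like this can run before any image is pulled, which is exactly the binary, works-or-doesn't flavor of compatibility.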
You could also validate new hardware and OS configurations before running containers there, so you have an additional check. And the last one, very important, is failing fast: you don't want to pull images and only then learn that something doesn't work, forcing you to reschedule your container or reconfigure the node. It gives you one quick check whether the container is, in theory, compatible with the node or not.

So, how do people do this today? People or vendors provide documentation on how to run their containers, but it's done in so many different ways that it's impossible to automate the checks. The other thing is that people just launch containers on the cluster and then, in case of failure, read the logs to see what's going on and whether they should, for example, reconfigure the node or go somewhere else. I see this second one very often: people just launch and see what happens, whether there is missing kernel configuration, missing features, modules, maybe missing hardware and so on. Hardware is a slightly different story in Kubernetes, because you have ways to expose it and can express it a little differently, but the point still stands, I think. And the third thing is annotations. People already provide information on the container side, but it's not standardized; I learned that, too, in the compatibility working group. An example is the HPC community, which brings its own hardware annotations. You can see the CPU version, the ISA level, the CPU model with its available features; you can see NVIDIA driver versions and CUDA versions; and additionally you have libc implementations, the glibc version, the kernel version, things like that (I'll show an illustrative sketch after this paragraph). It shows that we have specialized organizations that require this information. Maybe it's not 90% of users using it, but the requirement is there, and those organizations are not small: telco and HPC are quite big, and others could benefit from this too.

So, how does this look today? The Open Container Initiative governs a few specs: the image spec, the container runtime spec and the distribution spec. But we are missing something for compatibility. I mean, the image spec itself lets you describe some specific features to a point, like Windows using OS features for its builds, and the platform object lets you describe a little bit of compatibility, but in my opinion it's not enough. That's why the working group was established to change this a little. Because of the existing specs, the container runtime spec, the image spec and so on, other implementations have a kind of guide for how to do things; if we provide something for compatibility as well, whether it's a new spec or an improvement to the image spec, we will get better supportability. That's basically the goal.

So, as I said, that's why the OCI Image Compatibility Working Group was started. We were able to get a good number of stakeholders: Huawei, NVIDIA, Illumos (which is another operating system), OKD, Intel, SUSE, the HPC community and Docker. We also have, I would say, a lot of contributors so far, and we are trying to figure out what kind of solution we should provide to the community.
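Before moving on to the use cases, here is the promised illustration of the HPC-style annotations. The key names and the hpc.example.org prefix are invented for this sketch, not the community's actual conventions; the point is only the shape of the metadata, plain key/value pairs describing CPU, driver and userland requirements.

```go
// Illustrative shape of the custom image annotations the HPC community
// attaches to images today. All key names and values below are made up.
package main

import "fmt"

func main() {
	annotations := map[string]string{
		"hpc.example.org/cpu.model":      "AMD EPYC 7763",
		"hpc.example.org/cpu.isa":        "x86-64-v3",
		"hpc.example.org/nvidia.driver":  ">=535",
		"hpc.example.org/cuda.version":   "12.1",
		"hpc.example.org/libc":           "glibc>=2.34",
		"hpc.example.org/kernel.version": ">=5.15",
	}
	for key, value := range annotations {
		fmt.Printf("%s: %s\n", key, value)
	}
}
```

Because every organization invents its own prefix and semantics like this, tooling cannot check them automatically, which is exactly the standardization gap the working group wants to close.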
As a group, we came up with three high-level use cases. The first is more granular image selection. Today, image selection is very simple, and that's totally fine: it's great that it works that way, because it works for most users. Today the first matching image is returned and you pull it. But sometimes you need something more complex, so we are trying to figure out how to do more granular image selection. Then we want to improve container scheduling, because once you provide this container metadata, you can make better scheduling decisions. And cluster provisioning, which I already mentioned: all the reconfiguration work, where this information would be helpful for sysadmins. If we have some additional metadata on the container side, it will be much easier for everybody to support these specialized cases.

As a group, we have come up with four proposals; three are already merged and one is under review, so let's go through them quickly. Proposal A is about allowing custom annotations on the image index and on the runtime side. Today the image spec already lets you annotate your containers, as the HPC community does, but we are missing something on the runtime side. Imagine this case: you have a container A that requires a GPU, so you annotate it with example.com/gpu: nvidia. On the container runtime side, say in containerd, you could provide the same annotation as on the image, example.com/gpu: nvidia, and then you could improve image selection by matching the two (a sketch of this matching follows this paragraph). It's a very simple solution, because it doesn't require many changes in containerd or other container runtimes, which is kind of nice. However, in my opinion it's too customizable, because we are trying to come up with some kind of standard here.

Then we have a proposal that suggests using the features field on the platform object, which has been reserved for a very long time. We would have to come up with a standard list of supported features there. It covers some aspects of compatibility, but not all of them, so we are still discussing solutions.

And then we have proposal B, which takes a different approach from the others, because it introduces a new artifact type for compatibility. You don't keep compatibility metadata in the image itself; you introduce a new artifact type, keep the compatibility spec there, and have a one-to-one relation between the artifact and the image. It also allows defining graphs in the spec, basically for compatibility relations. Imagine a situation like this: if you want to enable some kernel feature, it can be done in different ways depending on the hardware. If you have an AMD CPU or an Intel CPU, you have different ways of enabling the IOMMU, for instance. That's why there were proposals to let people define graphs at the spec level. However, the community doesn't like that much because of the complexity: if you try to represent a graph in JSON, it looks bad and can be hard to understand, especially as the graph grows. That's understandable, but it's still in proposal B.
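Here is the sketch of the proposal A matching idea: the same custom annotation appears on a manifest in the image index and in the runtime's description of the host, and selection keeps only the manifests whose annotations the host satisfies. The key example.com/gpu comes from the example above; the matching rule (exact key/value equality) is my own illustration, not something the proposal specifies.

```go
// Sketch of proposal A: select image-index manifests whose custom
// annotations are satisfied by annotations configured on the runtime side.
// The exact-match semantics here are an assumption for illustration.
package main

import "fmt"

func main() {
	// Annotations attached to two entries of an image index.
	manifests := []map[string]string{
		{"example.com/gpu": "nvidia"}, // build that needs an NVIDIA GPU
		{},                            // generic build, no requirements
	}
	// Annotations the runtime (e.g. containerd) holds for this host.
	host := map[string]string{"example.com/gpu": "nvidia"}

	for i, required := range manifests {
		compatible := true
		for key, want := range required {
			if host[key] != want {
				compatible = false
				break
			}
		}
		fmt.Printf("manifest %d compatible with host: %v\n", i, compatible)
	}
}
```

The appeal is how little machinery this needs; the downside, as noted above, is that nothing constrains the keys, so every organization can invent its own.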
And the last thing: we were thinking about how to enable compatibility across different organizations, and there was a suggestion that maybe, instead of trying to define everything ourselves, we let people define their own compatibility fields based on their requirements. This approach has one repository that different organizations contribute to. For example, a telco organization could contribute to this repository, add itself as an organization, add the attributes it is interested in, like whether SR-IOV is enabled on the host, and implement plugins that check them. Everything lives in a centralized repository for security and control, because sometimes the checks require root permissions or similar. The idea is that when people bring their own plugins, they also bring AppArmor or SELinux profiles, so that we control what those plugins are doing. And that's proposal B.

Proposal D is a little different. It still introduces a new artifact type for compatibility, a similar approach, but it allows defining the graph at the schema level, per organization. So instead of users coming up with their own relations between compatibility attributes, organizations do that for them. As an organization, I can define the relations between attributes: for example, for an HPC use case I could, maybe not enforce, but guide people on how to use specific attributes, say around storage performance. And finally, it lets users add their own content in a very flexible and open way. It's very different from B, because there is no single centralized repository that people contribute to; they can come up with their own solutions. Maybe, as the OCI Image Compatibility Working Group, we provide a library that others can implement, and let people do things themselves. It's much more open, and it brings some risk in my opinion, but it's a very interesting idea. Proposal B brings some risk as well, because keeping everything centralized is not so good either. There is a trade-off between openness and security. So basically, I like both proposals B and D here.

And integration with Kubernetes: how do we think this could work with Kubernetes today? We could validate nodes before pulling images. I think there is some work we could do together with the NFD (Node Feature Discovery) folks: maybe parts of the NFD project could improve our tooling, or we could reuse NFD's scanners in our own tool to discover host features and do some matching there (a rough sketch follows below). We could also improve the scheduler by considering the container's compatibility metadata. All of this is up for discussion, because we are still in a phase where we are trying to figure out what's best for users and to gather feedback and attention. There are not so many of you in here, though, so maybe I can collect some feedback afterwards. And then there is the last point: this metadata could be a data source for dynamic node reconfiguration. I have seen attempts in the Kubernetes world to reconfigure nodes dynamically on the fly, and this could be potential metadata for such a tool, although only for very specialized cases. If you have a very generic cluster that accepts any workload, you don't want dynamic node reconfiguration there, because tuning a node for one specific type of application can make it unsuitable for others. But if you have a specific workload type, like CNF routers, and only small tweaks are needed on the node, maybe that would be acceptable.
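And here is the rough sketch of the node-validation idea: compare compatibility attributes carried with the image against labels that NFD has already discovered on a node, before any image is pulled. The label keys follow NFD's feature.node.kubernetes.io/ naming, but the specific attributes and the idea of mapping image metadata onto them are assumptions of mine, not an agreed design.

```go
// Rough sketch: pre-pull node validation by matching hypothetical image
// compatibility attributes against NFD-discovered node labels. The mapping
// from compatibility metadata to labels is assumed for illustration.
package main

import "fmt"

func main() {
	// Hypothetical attributes carried with the image.
	required := map[string]string{
		"feature.node.kubernetes.io/cpu-cpuid.AVX2":       "true",
		"feature.node.kubernetes.io/kernel-version.major": "6",
	}
	// Labels NFD has put on a candidate node.
	nodeLabels := map[string]string{
		"feature.node.kubernetes.io/cpu-cpuid.AVX2":       "true",
		"feature.node.kubernetes.io/kernel-version.major": "6",
	}

	for key, want := range required {
		if got := nodeLabels[key]; got != want {
			fmt.Printf("fail fast: node has %s=%q, image wants %q\n", key, got, want)
			return
		}
	}
	fmt.Println("node passes the pre-pull compatibility check")
}
```

Failing this check before the pull is exactly the fail-fast behavior mentioned earlier: no image bytes move until the node looks viable.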
But, yeah, that's how it looks so far, and we are looking for feedback. If you are willing to share your honest opinion with us, reach us on Slack, the Google Group and GitHub. Additionally, we meet every Monday and discuss things there. Generally speaking, we are still very early in development and trying to figure out what's best for users. We don't want to change the mindset people have today, that containers can run anywhere, because that's the mindset most people have. It doesn't work everywhere, but I think it's the correct mindset. So, yeah, basically, that's it. I wonder if you have any thoughts about it.

Hey, thank you so much for that. I've got a question about solution A, where I think you proposed introducing some sort of field similar to an annotation. There we go: custom annotations, or a field similar to annotations. My understanding is that this would be the most flexible solution, right? I've got annotations on nodes and on my containers or pods, whatever it is, and the matching happens. What are the downsides to that, other than the security you talked about with B and D? Is there anything else you'd be concerned about with this one?

Yeah, the biggest downside is that today the HPC community already came up with their own annotations for this. If different organizations again come up with their own annotations to express some kind of compatibility, like kernel modules or something like that, then that's the biggest downside. We want to standardize this so that it's understandable for everybody, but if every organization comes up with its own custom annotations, we have a problem again. Proposal A is about using whatever custom annotations you bring yourself; it's not about defining a standard. Maybe we can somehow mix proposal A with proposal E and come up with something there. But we are still discussing; it's not final.

Are there fields or features that overlap, such that there is duplication? My understanding is that if a particular provider, a particular hardware manufacturer, has some feature, it's unique to them. Do you know of any examples across two providers that have different fields that are actually the same thing?

That's a very good question. Probably, yes. If you think about standard things, like expressing hardware: PCI class ID, device ID, vendor ID, those are pretty standard. The same goes for operating system configuration: kernel modules, configuration, some tweaks there, those kinds of standard things. But we don't know whether there are organizations that have something very, very specific to their needs. For now we are discussing HPC and telco use cases, and Intel is providing some input as well. That's the point of allowing people to come up with their own fields and attributes. Proposal B addresses this with a centralized repository, so in that case you wouldn't have overlap. In proposal D you could have overlap, but it's a more open and flexible way.

Gotcha. That makes sense. Thank you.

Hey, thanks for your presentation.
So I have a couple of thoughts that are not really well organized at this point. The first one is that I like the idea of proposal E, I believe it was, because it ties the metadata to the image. If we're using the image as the source for defining these affinities, then having it packaged within the image itself in a structured way makes sense to me. The limitation is, and I could be wrong about this, I'm not really an expert in the area, that I don't believe the Kubernetes scheduler is actually aware of anything about the image itself. So from a scheduling perspective, there's really no mechanism for a Kubernetes entity to gate a pod from being admitted before it makes it down to the node. The runtime could then say, actually, this image isn't really ideal for this node, but that's far too late in the process. This also makes me think of a very common conversation we're having at this KubeCon: I would urge you to participate in the unconference conversation about DRA (Dynamic Resource Allocation), because that also describes a separate mechanism for a user to define special resources or special hardware configurations they want their workload to run on, but it's defined from the pod perspective rather than the image, which is a little more idiomatic for Kubernetes. All that is to say, this is interesting, but I think there would need to be some pretty significant extensions to the scheduling mechanisms for us to really take advantage of it in a meaningful way. So I would urge you to join the conversations about DRA, because there may be some overlap in the way we conceive of this.

Yeah, thanks for your input. I agree with you about the Kubernetes side: the scheduler itself is not aware of compatibility or any other metadata of the images. It's just a rough idea that scheduling could be improved; how it would be done is not determined yet. We don't know the best way to do it. Maybe it doesn't even make sense to extend the scheduler itself; maybe instead we bring some piece of software that lets you consider compatibility metadata in different ways. Everything is basically still baking, and we are trying to understand the best use cases and scenarios for users. So yeah, it's in progress.

Totally. No, and I can appreciate the use case of having this sort of information baked into the image. This is a conversation happening in the OCI, which is at the layer of the container runtime, and the conversations don't always overlap between the orchestration layer and the runtime layer. But there would need to be a pretty significant overhaul: the scheduler would basically need to be taught how to speak to a registry to be able to do this. And that runs into runtime-specific configuration about which registries are actually being talked to. Both containerd and CRI-O have mechanisms to define mirror registries, which means the runtime is actually lying to the kubelet, saying, sure, I'm pulling this image from here, while actually pulling it from somewhere totally different, which is kind of hacky.
That said, it adds another layer of complexity, because the runtime is not always being honest with the rest of the Kubernetes ecosystem about where it's actually getting the image. Theoretically the SHA should be exactly the same, so it should be the same image; but if we're passing this compatibility artifact around and it's being stored in a separate place, there may be a missing link there. So, yeah, those are some of my thoughts. I'm interested; I might pop into the OCI meetings to voice them and the connections to Kubernetes, but I just wanted to let you know about them.

Sure, sure. Thank you for your input. You're welcome to join every Monday, 10 a.m. PST; every input is welcome. And thanks for your thoughts.

Cool. Thank you.