Thanks for having us. Today we'll be covering the specific details and the odd technical corners we found ourselves in while porting our application to Kubernetes. For context: we are not a SaaS provider in this case. We provide our HPC solution, with all its components, as a deployable artifact to the customer, who rolls it out within their own environment. So we generally don't have low-level access to the cluster to tweak feature gates or that level of optimization, but it's still a hard requirement for us to be as lightweight, maintainable, and observable as possible, so that we can minimize the most important metric of all: the number of support calls we get when they roll it out in production. I am Luca Montekiesi, a senior software engineer at Siemens EDA. I work on integrating our product, which is called Calibre and which Min will talk about, with container and orchestration technologies, and I'm also interested in a bunch of other things. OK, thank you, Luca. My name is Min Zao, and I'm going to take it from here. I'm an engineering director at Siemens EDA. We work on this project together: I got it started, he did all the great work, so I did nothing afterwards. To kick things off, we're going to cover three parts today. First, I'm going to introduce what EDA is and what type of HPC workload we're talking about, because not all HPC is built equal. Then I'll turn it back over to Luca to tell you how we solved the problem, how we actually enabled our software to run on a Kubernetes cluster. So what is EDA? EDA stands for electronic design automation. It's a whole category of software covering the entire flow from IC design all the way to IC manufacturing.
In general, people say EDA is high-performance computing software, which is correct: a lot of the software involved requires a lot of computing resources. Much of it was built before Kubernetes even existed, so these compute-intensive workloads had already solved many of the problems you may be familiar with. But not all EDA software is equal. We usually distinguish between the front end and the back end of the flow. The front end is design creation: you have an idea, you create a circuit, and you map it to a layout. There you see a lot of interactive editing, data analysis, and on-demand simulation; for example, running a Monte Carlo simulation on your circuit. Those cases can mostly be covered with Jobs on Kubernetes, and you can get away with that. But once you move to the back end, which is what we're dealing with in the Calibre product line at Siemens EDA, you are talking about very intensive cluster computing, basically batch-mode jobs: you launch a job into a cluster and leave it running, with many different components tightly coupled together. That's the workload we're dealing with. To give you some simple numbers, just for an idea: for a single batch job, 20,000 cores is not unheard of, and we have even heard of 50,000 cores. A single node can take up to one terabyte of memory, and component-to-component burst communication can reach 10 gigabits per second. That's the scale of one single batch job. So these are batch jobs; now let's look under the hood at what one of these jobs looks like. Ignore the workload manager for a moment.
Everything else is one single batch job. The user submits this into a cluster, all the interconnections between the components get established, and then it's left running. The job usually runs for hours or even days, so it's very long-running, and there are a lot of moving parts. You might think this is almost an orchestration system by itself. You have the head nodes, which we call primary nodes, and a bunch of worker nodes, which we call remote nodes, and you place the processes on the remote nodes. There are compute processes, which do the number crunching. There are scheduling processes, which do dynamic load balancing, trying to move load across all the compute processes. And there are data processes, which facilitate the information exchange between the compute processes and serve as temporary storage. Some of them are fail-safe: if you lose a compute process, you can still crawl along; the job runs a little slower, but it can still finish. For some of them, though, if they fail, the whole job is done. On top of that, our users are very critical of turnaround time: they want this to finish as fast as possible. So there are also dynamic resource adjustments, because computing resources are very scarce; the clusters running our software are usually saturated. If a high-priority job comes in, you may want a running job to give up some cores so the high-priority job can squeeze in, run to completion, and then return the cores. You still keep the original job alive, because it may already have been running for a day: I don't want to kill it, I want to keep it crawling for a few hours so my high-priority job can go through, and then let it pick up speed again. All the requirements I've listed here were solved as software problems before Kubernetes. Now the problem is how to run this on Kubernetes.
How do we marry this scheduler with the Kubernetes scheduler? Plain Kubernetes Jobs are not really going to do the work, so the solution is to build an operator: we create our own CRD and our own controller, and we implement all this logic in there. And of course there is a set of challenges we had to resolve. At this point I'll turn it back to Luca to talk about how we resolved them. Thanks. So the idea is we have this very lightweight piece of software (joking) and we need to move it onto Kubernetes. We faced many types of problems along the way, and we're going to present some of the solutions we came up with, so that maybe you'll be able to reuse some of them. The first problem to solve is that this workload is composed of many different moving parts, and these have to be connected at specific internal states. To backtrack to the high-level overview: I have a master process and worker processes, which we call primary and remotes. The workers form the distributed computing framework that the primary leverages to schedule operations. There are also different types of remotes, and these are accepted by the primary only in specific internal states. This means the application demands stateful orchestration: the operator's reconciliation logic depends on the application state itself. That is not the common scenario you generally see for operators. Operators are generally designed to be stateless and idempotent: they read the status of the resources from the Kubernetes API, build and cache their internal state, and take action based on it. So we want to fall back to that use model. How did we do it?
We thought about finding a way to map the application state onto the Kubernetes API, so that we fall back to the standard use model. How could we do this? The idea is that we can borrow concepts from what the kubelet does. The kubelet solves a somewhat similar problem, among the dozens of other functionalities implemented within it. The kubelet controls container creation and execution by interacting through the container runtime interface, so it knows the state of the containers itself; this is not information that comes from Kubernetes, it comes from the container engine. Whenever the kubelet becomes aware of that state, it takes care of mapping it into Kubernetes, so that the state becomes available to all the controllers and they can trigger reconciles based on it. This is possible because the kubelet is also a Kubernetes client itself: it interacts with the API on its own. So can we do something similar? Sure we can. We can create an application which, in this respect, behaves in a similar way to the kubelet. We call this application the state server. It's just an application that runs as a sidecar of our primary process, scrapes the state, and serves it to the outside world, so that another application, which we call the state mapper, can scrape that state and map it into the Kubernetes API. Then all our reconciliation logic can happen in a much more natural way. How does this map to the final architecture? As you can see here, we have our stateful primary process, and we have our sidecar container which gets and parses the state from the primary.
The sidecar exposes the state so that the job controller, which is the entity abstracting our orchestration of the job, can take action based on it; and the controller reads the state from the Kubernetes API. So the only thing the job controller has to do is create batches of pods depending on the state it reads from the Kubernetes API. I say "only", but I'm underselling it, because the batches we end up creating are very large lists of objects: we're talking about thousands of pods for every batch. So we need to pay some attention to this inner control loop as well, because it can become very much a problem in clusters which are not managed by us. How can we make this part as light and as efficient as possible while still satisfying the functional requirements? The first recommendation I can give is: if your use model fits, just use the batch API, the native Job controller which comes for free with Kubernetes. The batch working group has done an amazing job on it lately. It supports submitting and scaling jobs up to tens of thousands, even 100,000 pods, so it's very efficient and very flexible, and I can see the working group is trying to capture as many of the HPC, AI, and batch types of workload as possible. If it fits, just use it. In our case, we wanted to hurt ourselves: we wanted to implement batch controllers at a lower level, because we wanted to tweak specific features of scaling the pods up and down — for example, during the scale-down of highly dynamic jobs.
These jobs can scale up and down depending on the demand of the primary, which knows how much parallelization it can exploit to optimize the workload. One thing we wanted, for example, was to be able to scale down by removing pods from the same nodes, or from specific nodes: to have control over which nodes we delete pods from, so that we can empty out some nodes earlier and they can be reclaimed by the autoscaler sooner. If we just scale down randomly, we have no control over that and we get fragmentation of pods across the nodes. Another reason: since we roll this out at the customer's site, we wanted to offload this heavy part of the job from the customer's side and keep it within our own domain, our own package, where we can provide observability on it, and it becomes much easier for us to maintain. It also implements custom status information and other requirements dictated by the manager that interacts with this component. So where did we start, and where did we go? We didn't do anything fancy: we started from the standard Kubebuilder scaffolding and made use of the controller-runtime library. Two words on that for people who don't know it yet: it's a batteries-included library containing a lot of very useful functions for creating your own controllers very efficiently and flexibly. You can see the full architecture on the left; the components we're going to talk about and optimize in the next slides are the ones on the right. So, a quick picture of the architecture and how this library interacts with the API server. I have my API server up in the sky.
I have my clients sitting within the controller itself. The first component we're going to look at is the reflector. As the name says, it mirrors the resources that the controller is watching, which are stored in etcd and exposed through the API server. We're going to focus especially on list and watch requests, because those are the heavy ones on the API server side, especially when you go to 10,000 pods. Under the hood, a reflector is based on a simple, standard HTTP client which keeps an HTTP/2 or WebSocket connection open to the API server, so that it can keep this channel open and be notified whenever there is a change to any watched resource. Once something is reflected, where is it reflected to? Into the local cache. The cache can be anything — you can implement your own if you want; it's something that implements the informer interface, and as the name says it caches the changes and the resources themselves. Downstream of the cache I can have predicates: filtering functions you can set up so that you only trigger your reconciliation logic on the events you are actually interested in, and discard the ones you don't care about. And then you have your reconciler, which is where the magic happens: you observe the state, build the desired state, and based on that decide what actions to take on the cluster — for example, creating the pods if they are not there. So, simple functionality, but a fairly complicated architecture. You want to be able to observe what is going on, right? It's quite a sophisticated interaction, and there's not much information around about how to troubleshoot it.
On the API server side, there is information. There's auditing, which you can set up very flexibly: you can discriminate exactly what you want to watch — specific resources like pods, specific verbs. It's very flexible and powerful. There are also a ton of server-side metrics exposed through the API server's metrics endpoint. The ones you generally want to look at are the ones related to API Priority and Fairness: concurrency limit usage, and how much latency your requests are taking. Those are the main ones we look at. The client side is a blurrier area; we couldn't find much, so we thought we'd put together a slide about it. First of all, you want to see exactly which requests you are making to the API server. Obvious thing, but what do you do? You turn up logging. All these libraries, including client-go, are based on global singleton loggers, so you can just set the verbosity to a higher level, and that will dump every request and response your client makes. One note, though: this won't dump the actual change events flowing over the network. Why? Because the watch connection is never closed, so the client's ReadCloser never finalizes the body, and you won't see this traffic in the logs. How do I see it if I want to? The first way: you can hook in after the cache, as we showed, and implement your own dump predicate — a predicate function that just logs all the events — so you can see what is going on after the cache. But if you don't trust anybody and want to look at what is going on before the cache, there is a hacky approach, I know, but it works: you can tap in at the HTTP level, on the HTTP round tripper. You can implement your own round tripper.
Then you can hook in and look at everything flowing through the network. You need JSON encoding for that; if the client is using another wire format such as protobuf, you won't be able to read it. And let me correct myself on encryption: this hooks in after TLS decryption, so you can still see the traffic. All of this is development material; for production, there are metrics. The default Kubebuilder scaffolding comes with a Prometheus endpoint serving a lot of interesting metrics for all the levels of the stack I showed before: workqueue, REST client, and reflector. So you can get really interesting information about what's going on in your controller, even in production. Now I have the tools — what do I do with them? Definitely not what the lady is doing here. We've got to watch the right way. How? First, consider that we have a ton of resources and a ton of events flowing back and forth over the network. So the first thing I want is to protect the controller from the things it's not interested in. I can, for example, set up the controller so that it only reconciles on the events that concern it: I watch only specific types of resources and disregard everything else. On top of that, I may not be interested in all types of events, so I can set up predicates that only trigger a reconcile on, for example, deletion events or creation events, depending on what my specific application demands.
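The predicate idea can be sketched in plain Go without any framework: a predicate is just a function that decides whether an event should reach the reconciler. This is a stdlib-only illustration of the concept; in controller-runtime you would implement its predicate interface instead, and these type names are invented for the example.

```go
package main

import "fmt"

// EventType mirrors the kinds of events a controller sees.
type EventType int

const (
	Created EventType = iota
	Updated
	Deleted
)

type Event struct {
	Type EventType
	Name string
}

// Predicate is a filtering function: return true to let the event
// trigger a reconcile, false to drop it before the work queue.
type Predicate func(Event) bool

// OnlyDeletes reconciles only on deletion events, as in the example
// from the talk.
func OnlyDeletes(e Event) bool { return e.Type == Deleted }

// Filter applies a chain of predicates; every predicate must pass.
func Filter(events []Event, preds ...Predicate) []Event {
	out := []Event{}
	for _, e := range events {
		keep := true
		for _, p := range preds {
			if !p(e) {
				keep = false
				break
			}
		}
		if keep {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	events := []Event{
		{Created, "pod-1"}, {Updated, "pod-1"}, {Deleted, "pod-1"},
	}
	for _, e := range Filter(events, OnlyDeletes) {
		fmt.Println("reconcile:", e.Name) // only the deletion gets through
	}
}
```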
This is useful, again, for protecting the controller from the wilderness, but how do I protect the API server from the controller? Because when the controller is watching and listing 10,000 resources for a thousand jobs, it can very easily overload the API server, and not all managed environments are set up to tolerate that amount of requests. So the first thing to do is to watch, at the lower level — before the cache, at the HTTP client level — exactly the resources my controller is concerned about. I need to set up a labeling system that lets me watch only the labeled resources the controller needs to be aware of. In this example, there is a pod inside the same namespace which has nothing to do with my controller or my resources — it's somebody else's pod — and since it's not labeled, I'm not watching it. Easy, but powerful. This is also powerful if you want to go toward advanced functionality, for example operator sharding. Imagine one instance of my controller watching only a subset of resources while another instance watches another subset, with the sets labeled with shard indexes. You can implement pretty advanced functionality that way, and we're thinking about putting some work into that as well. Even more extreme: I may not be interested in the actual status of my child pods. I may only care that they exist, or, for example, I can get the status of the job from the primary process. In that case I can go even further and watch only the metadata of these pods, which ends up saving quite a lot of bandwidth on the network as well.
My point is that it's extremely flexible and configurable; dig into the source code, because there is a ton of hidden functionality which isn't readily apparent when you read the docs. That said, even assuming you implement all these optimizations, you may still have a problem, because you still have a ton of pods in the cluster. And you may not be the only one watching: there may be CNI controllers, there may be mutating webhooks that get triggered every time you create a pod or another resource, all managed by somebody else. There is always this underlying assumption that somebody else is doing a good job of managing all this stuff. So what could we do to make this lighter and more efficient? A pod, for us, is just an abstraction of a computing unit within the whole job. So can we make this computing unit bigger? Can we pack multiple workers within the same pod and still get the same functionality? Functionally, this clearly works, and it also ends up saving quite a lot of bandwidth. On the right you can see the watch bandwidth consumed during the creation of, I think, 10k workers: we end up saving as much as four times the bandwidth by packing four remotes inside the same container and the same pod. Clearly, you may say this is not without consequences. The pod itself becomes bulkier, and the failure conditions for these pods become harder to set up: what if just one worker fails? Do I need to fail the whole pod? These things become a little more complicated, but still manageable if you want them to be.
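The bandwidth saving follows from simple arithmetic: each pod object generates roughly the same stream of watch events regardless of how many worker processes it hosts, so event volume scales with pod count, not worker count. A back-of-the-envelope sketch (the per-pod event count below is a made-up illustrative number):

```go
package main

import "fmt"

// podsNeeded computes how many pods host a given number of workers
// at a given packing factor (ceiling division).
func podsNeeded(workers, workersPerPod int) int {
	return (workers + workersPerPod - 1) / workersPerPod
}

func main() {
	const workers = 10000
	const eventsPerPod = 20 // hypothetical events over a pod's lifecycle
	for _, perPod := range []int{1, 4} {
		pods := podsNeeded(workers, perPod)
		fmt.Printf("%d worker(s)/pod -> %d pods, ~%d watch events\n",
			perPod, pods, pods*eventsPerPod)
	}
}
```

Packing four workers per pod cuts the pod count, and therefore the watch-event volume, by a factor of four, which matches the roughly 4x bandwidth saving observed in the talk.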
What doesn't get easier is the scheduling itself, because we're talking about pretty bulky processes. If you pack many of these processes into a single pod, you clearly have to raise the requests, which makes scheduling less efficient and a little more complicated. For example, we've seen that beyond three or four workers inside the same container and the same pod, we start to see degradation in scheduling efficiency. This clearly depends on a number of factors, but it's what we've observed. So with these things we've optimized a bit of everything we could, to make this bulky workload a little lighter to pack and move across clusters. What is the experience for the user? How do they submit a job? As expected, they have a job resource. One thing is that since the job is composed of different types of remotes, the spec can become a little too verbose: imagine a bulky YAML spec. It's nice that it's plug and play — you just submit it and it creates everything — but it's really a big thing. So we came up with what turned out to be a very interesting model, where you have a specific section of the YAML which is a common base, and then other subsections, specific to each type of pod, which inherit from and override that section. We played with the YAML a little, and the result is something that is still super expressive but a bit smaller to manage and easier to read as well. At the same time, we kept the possibility of freely injecting sidecar containers, so that we and the customer can extend the functionality, keeping it extensible in general. Performance-wise, we have been running tests across all our products.
Unsurprisingly, we got equivalent performance. One thing you may want to be aware of: when we started to play with security features — for example, the seccompDefault feature of the kubelet, which enables seccomp by default; seccomp is syscall-filtering functionality that happens right at the border of the kernel — when we enabled that, we got substantial performance degradation on specific workloads. The reason, and this was definitely not easy to find out, is that on specific operating systems and specific kernel versions, seccomp is tied to other speculative-execution mitigations (speculative store bypass, and on some kernels STIBP): when those are applied to seccomp-filtered tasks, branch-prediction capabilities between logical cores in SMT are restricted. This means very CPU-intensive workloads can see quite substantial performance degradation. Even worse — don't take my word for it, verify this — this should be enabled by default after Kubernetes 1.27. So if you are running compute-intensive applications and seeing performance degradation, this may be the reason; just have a look at it. The last thing we want to cover, since we've covered much of the lower level of our stack — and this is very much a work in progress, the direction we really want to follow — is: OK, I've managed to run my workload on Kubernetes in a lighter way and so on, but how do I enable multi-tenancy with respect to the resources sitting under my workload? We are working to integrate with a project called Kueue, and Kueue has two very nice features. The first is that it doesn't touch anything concerning scheduling: it's something you install on top, and you manage your quotas and resources on top of what already exists.
The other is that it exposes a very generic Workload custom resource that you can use as an interface to announce your workload type to Kueue, after which you get the full functionality. Your controller just has to create this Workload resource whenever it reconciles a job, and Kueue takes care of the admission decision: whether the workload will be admitted to your queue or not. Functionally, you can label your resources — your nodes — as different resource flavors and map them into a ClusterQueue, and then map your organization, your teams, to specific LocalQueues whose usage is managed against the ClusterQueue and the quota values you decide to set. That pretty much covers the work we have done. There are really a lot of smaller details we don't have time to talk about, but I think this covers what we did. So thanks for your attention, and hopefully you enjoyed the session. I think we have time to take questions; there is a mic in the center. Does this integrate with traditional scheduling systems like Slurm or PBS in any way? So, let's say Kueue is not exactly a replacement; it's something you can use alongside some existing systems and schedulers, OK? But there is definitely some overlap in functionality. We actually came to know about other projects that take Slurm and port it to Kubernetes as well, but those are things that were born outside and brought inside Kubernetes. The nice thing about Kueue is that it was conceived from the start by a working group which is very close to the Kubernetes API itself. And it doesn't touch the scheduling layer at all; it just installs on top.
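For reference, the flavor-to-ClusterQueue-to-LocalQueue mapping described above looks roughly like this in Kueue's API. This is a sketch based on Kueue's documented v1beta1 resources, with made-up names and quota values; verify the exact schema against the Kueue version you deploy.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor          # label nodes and map them to a flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: eda-cluster-queue      # cluster-wide quota pool
spec:
  namespaceSelector: {}        # which namespaces may borrow from it
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 20000    # illustrative quota values
      - name: memory
        nominalQuota: 20Ti
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a                 # a team's entry point to the pool
  namespace: team-a
spec:
  clusterQueue: eda-cluster-queue
```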
So we are specifically working on that, and we definitely have plans to investigate a Slurm-on-Kubernetes use model as well, because some of our customers may demand it. This software has existed for a long time, so a lot of people use the older resource management systems, like LSF and Grid Engine; Slurm is relatively new, but it's more of the same use model. So we're also keeping that in mind: does it make sense to just drop Slurm on top of Kubernetes and live in Slurm? That's a question that actually came up during development. Any other questions? All right. If there are no other questions, you can find us around at the party afterwards, so feel free to reach out. Thanks. Thank you. Thank you.