Okay, good morning. My name is Garas Singh. I am a product manager for OpenShift, and I have Darren with me. Darren, would you introduce yourself? Yes, my name is Darren Dunn. I'm with IBM Research. I'm the computational patterning team lead for Albany NanoTech as well as the Thomas J. Watson Research Center.

So I want to give you a view of what I'm hearing from my customers regarding batch and this type of activity. I've talked to a lot of customers who come to me with batch-type needs. We talk to life-science customers, and they say they want to run genome simulations. We talk to manufacturing customers who want to do computational fluid dynamics. We have a very big presence in FSI, the financial sector, where they want to run their characteristic analyses. And then we have RHODS, Red Hat OpenShift Data Science, which has the use case of running training jobs on top of a batch system. When I listen to all these customers, everything boils down to three characteristics of the batch workloads they're looking for, and these keep coming up. One is asynchronous runs: they have jobs that need to run asynchronously to completion and then hand back a result. Two, these jobs are computationally heavy; they require GPUs, fast networking, and fast storage, and they run at massive scale. And three, since they're running in the cloud, they need some form of elasticity.

Listening to those customers, we are trying to address a few problems, either from the Red Hat and OpenShift side or from the community side. The first is job queueing. I think the font is very small, but I'll talk through it. What customers are asking is: these are my jobs, put them in a queue before they get scheduled. And that queue needs to be dynamic, in the sense that the first job in is not necessarily the first to execute; I should be able to reprioritize things in the queue so that my higher-priority jobs are the ones running. The community is working on a project called Kueue, and internally on the OpenShift side we are working on MCAD, which is basically an app wrapper and dispatcher for queueing. The second is dynamic infrastructure. When a job comes into the queue, what customers want is for a cluster to be spun up for them. This is very expensive infrastructure, think GPU nodes, so they do not want it running beforehand. Whenever a job enters the queue, they want the infrastructure spun up, and when there are no jobs, they want the whole infrastructure deprovisioned so that no cost is incurred. They also need specialized hardware in that infrastructure, for example GPUs, or fast storage and a fast network they want to use, and there has to be a way to enable that within the platform, things like operators. We use operators in OpenShift, so the question is how to enable those operators in the infrastructure based on the job characteristics: this job needs GPUs, so the infrastructure needs GPUs, go install the GPU support and have the infrastructure ready before the job executes. The third is gang scheduling. What customers say is: let's say I have ten jobs and the infrastructure has capacity to run five. Make sure the jobs wait until all the resources they need are available in the infrastructure. So it's all or nothing: run everything a job needs together, or nothing, in a gang fashion.
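To make the queueing and gang-scheduling discussion concrete, here is a minimal sketch of a queued batch submission with the Kubernetes Python client, assuming a Kueue-style setup in which a suspended Job carrying a queue-name label is only un-suspended once its full resource request can be admitted. The queue name, image, namespace, and resource sizes are illustrative, not taken from any specific customer setup.

```python
from kubernetes import client, config

# A minimal sketch of queued, all-or-nothing batch submission, assuming a
# Kueue-style setup where a suspended Job labelled with a queue name is only
# started when the whole resource request can be admitted (gang semantics).
# The queue name, image, and sizes below are made-up examples.

config.load_kube_config()  # or load_incluster_config() inside a cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="genome-sim-001",
        labels={"kueue.x-k8s.io/queue-name": "team-a-queue"},
    ),
    spec=client.V1JobSpec(
        suspend=True,            # the queue controller flips this when admitted
        parallelism=5,
        completions=5,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="worker",
                    image="registry.example.com/genome-sim:latest",
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "8", "memory": "32Gi",
                                  "nvidia.com/gpu": "1"},
                    ),
                )],
            ),
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="batch-jobs", body=job)
```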
These are the trends I'm seeing from talking to all these customers. All of them already use cloud in one form or another for their applications, they see the benefit of it, and now they want the same benefit for their AI/ML and HPC applications. Multi-hardware architecture: to save cost, customers say they'd like to run their management plane on Arm and their compute on x86, so you need multi-architecture support. As accelerators, they want to use GPUs, SmartNICs, fast networking, everything the cloud offers today, and they want to leverage that from their containerized workloads. I'm also seeing a shift where traditional HPC customers, who ran HPC workloads on premises on bare metal, are transitioning, mostly rehosting today and looking forward to re-platforming their HPC workloads into containers. We talk about HPC a lot, but I've seen very good use cases coming from the AI/ML side, where customers are asking to use this batch capability to run their AI/ML jobs; in fact I'm seeing more requests from the AI/ML side of the business than from HPC. And beyond AI/ML and HPC, I've seen demand coming from gaming, networking, and the telco side to run on a batch system as well. Next is Darren, who is going to talk about how we are running EDA workloads on OpenShift and Kubernetes. So I'll welcome Darren.

Okay, so for my part of the talk, I'm going to give you a little background about the two worlds that I bridge. I don't expect that anybody here knows the workload I'm going to talk about, so I'll go into a little detail so everybody understands. I'll also go into a little detail about why we want to do cloud bursting; it's a very similar story to what everybody else wants to do, with some benefits particular to semiconductor research and design. And then I'll go over some scaling results from our optical proximity correction workload using the managed Red Hat OpenShift service in the IBM Cloud. In semiconductor research and development, my role is to span two worlds: process development and research on one side, and hybrid cloud on the other. My sole role is to bring these two worlds together so that we can be more efficient and leverage more opportunities. When you think about doing semiconductor research and design, you can boil it down to three main things. You need to be able to design new circuits and chips. You need to be able to transfer what you've designed to photolithography masks; there's no way you can do anything without being able to transfer a design to the wafer with lithography and other processing steps. And once you've done that, you're not done: you need to develop processes that are capable of actually processing the lithography pattern you have and turning it into a working circuit on the wafer. Today I'm going to focus mostly on the transfer process, and I'm going to talk about a step called retargeting, which takes a design shape and changes it slightly so that we can actually print it.
And then we use a massively parallel, embarrassingly parallel HPC workload called optical proximity correction to actually transfer the shapes to a photolithography mask, assemble it, write it to the mask, and let the process development teams take over.

So, the tool set I'm going to talk about today: we use multiple tools, but today specifically I'm going to talk about a tool called Calibre, which is made by Siemens. These are commercial tools; typically in electronic design automation workflows, most of the primary tool sets are commercial. So one of the challenges you have is taking a commercial tool that was written for a more typical HPC Linux cluster environment, containerizing it, and then finding efficient ways to run it with orchestrated containers on OpenShift or more vanilla Kubernetes services. When we talk about this, what we're really talking about is taking this application, which is meant to manage the transfer of a design, through the photolithography mask, whether with EUV or 193i lithography, onto an actual wafer. When you break this process down, it actually consists of many different tool flows. When we finish with design, we end up with a database file in either GDSII or OASIS format. That file contains a representation of design shapes, and it may contain upwards of 10 billion design shapes in a particular layer that we're going to transfer. As part of figuring out ahead of time how we can take a design and print it on a wafer, we have to go through several steps. One of them is called design for manufacturing: taking the ground rules that were established for the technology, figuring out what actually has to happen, what's printable, what's transferable, and, if we need to change design rules, going ahead and doing that. We then need to use other shapes to fill the gaps in a design without impacting its performance, and that's a tool and a technology all to itself. In many cases we're decomposing a design into multiple colors and printing one color at a time, and the reason for that is that as we get beyond the three-nanometer node, the five-nanometer node, even starting at ten nanometers, we can't print the entire design with a single color or a single exposure. We need to find ways to break it apart so that we can actually pattern what we need to. Then we talked about retargeting, where we take those shapes and change them slightly so that we can actually print them. We insert features, through code, called sub-resolution assist features; those are required to help the lithography print, particularly for intermediate to isolated shapes. And then we finally do optical proximity correction, and we also verify, using modeling, that the shape we've created will actually print on the wafer to within a certain tolerance or expected variance.

So in a nutshell, what is optical proximity correction? You can think about it as an optimization problem. Prior to about the 90-nanometer technology node, you could take a drawn shape, multiply it by a scaling factor, transfer it to a lithography mask, and have a pretty good probability of hitting the target for the design. People would come up with some rules-based changes, but typically it was just a straight shot. Starting at 90 nanometers, that was no longer true.
So if you look at this diagram, without OPC, that's the drawn shape. You can think about the lithography process as a series of transforms: one transform puts it on the mask, another exposes it, and another etches it on the wafer. What typically happened at 90 nanometers and beyond is that the shape we brought in as a design shape was no longer what appeared on the wafer; in fact, in some cases, some shapes wouldn't print at all. So a technique called optical proximity correction was developed in which you bring in the design shape, partition it into vertices and edges, then calculate what the aerial image of that partitioned shape would be, which is a very physical computation, apply an empirical resist model to it to predict what shape will print on the wafer, and iterate until you either hit a fixed number of iterations or a cost function has been met, and you then output the result onto the photolithography mask. You end up with a lot of very complex serif shapes when you do this.

So how is this represented as a compute process? It's actually a two-part computation. You have a primary OPC process that reads in the design layout, chops it up into tiles, and creates a queue or a heap. Once it's done with that, it spins up a bunch of remote workers, on a traditional Linux cluster or, in our case, on an OpenShift or Kubernetes cluster, and then it starts parceling out work tile by tile to all of the workers, in a round-robin fashion, until all of the tiles in the queue or heap have been exhausted. There's no need for the remote worker processes to communicate with one another, so it's a very traditional hub-and-spoke process.

So when we talk about this in a traditional sense and about bursting to the cloud, what problem are we trying to solve? What we're really trying to do is expand our opportunity horizon to work on more and more projects. You can think about this as a triangle. We have three main resources that we're trying to optimize: one is compute, one is commercial licenses for the tools, and the other is people. The most sticky component you have is people; it's not easy to expand a design team or an OPC team or a patterning team in a short period of time. The two things that you can expand are compute and licenses. If you think about the semiconductor industry as a whole, there are more opportunities than there is capacity to take care of them, and time to market is huge, so the faster you can realize an opportunity, the more opportunities you can work on. People spend a lot of time trying to figure out how to maximize their opportunities. There are two ways to do this: one is to increase your compute, the other is to increase the number of licenses, and these two have to grow concurrently. So if we have a smaller opportunity horizon, the one shown in yellow, but we've got a lot of projects that we really want to explore, one of the best ways for us to do this is to burst to the cloud. But in order to do this, as was mentioned earlier, designers and OPC engineers are not cloud engineers, and cloud engineers definitely are not designers or OPC engineers; there's a communication gap and an expectation gap. So if you can come up with a way for people to simply submit jobs to a compute resource, you can expand the compute-licenses-people triangle, maximize your opportunities, and deliver more in the same period of time.
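As a toy illustration of the hub-and-spoke pattern just described (not Calibre itself, and with stand-in data), the sketch below has a primary process chop a layout into tiles, put them on a queue, and let a pool of workers pull tiles and run a small iterate-until-converged correction loop, the same shape of compute that later gets mapped onto worker pods.

```python
# Toy sketch of the hub-and-spoke OPC dispatch described above.
# The tile contents, the "simulate" step, and the tolerance are all stand-ins.
from multiprocessing import Pool

MAX_ITERATIONS = 20
TOLERANCE = 0.01

def simulate_print(edge_bias, tile_id):
    # Placeholder for the aerial-image + resist model: pretend the printed
    # error shrinks as the edges are biased toward a tile-specific target.
    target = 0.05 + (tile_id % 7) * 0.01
    return abs(target - edge_bias)

def correct_tile(tile_id):
    # Iterate edge corrections until the cost function is met or the cap is
    # hit, mirroring the fixed-iterations / cost-function exit criteria.
    edge_bias = 0.0
    for iteration in range(MAX_ITERATIONS):
        error = simulate_print(edge_bias, tile_id)
        if error < TOLERANCE:
            break
        edge_bias += 0.5 * error          # nudge the edges and try again
    return tile_id, iteration, edge_bias

def main():
    tiles = range(1_000)                  # the primary chops the layout into tiles
    with Pool(processes=16) as workers:   # stand-in for the remote worker pods
        for tile_id, iters, bias in workers.imap_unordered(correct_tile, tiles):
            pass                          # the primary collects results and assembles the mask

if __name__ == "__main__":
    main()
```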
So our cloud burst strategy from the beginning has been to use OPC, optical proximity correction, as a proxy workload. We want to create the same infrastructure in the cloud as we use on premises, and we want to target managed Kubernetes services. That lets us minimize the amount of configuration and tweaking we do when we burst to the cloud and helps us realize this vision of the same compute on-prem as in the cloud. We also want to centralize license servers in the cloud, to avoid splitting license pools or having gaps in time when we can't leverage the cloud, and to develop controllers and operators that make public cloud bursting with OpenShift transparent to the engineering teams.

This is a pictorial diagram of the type of infrastructure we're talking about. On the upper left-hand side we have our on-prem compute: interactive compute to set up jobs, review results, and do job management, plus a number of OpenShift clusters that we maintain internally with shared distributed file space. When we burst into the cloud, we target a service called ROKS, Red Hat OpenShift Kubernetes Service in the IBM Cloud; if you're used to AWS, you might think of it as ROSA. In the cloud, we build compute nodes and job management nodes, and we want to run across as many data centers in a cloud burst geography as we can so we can maximize our capacity, leverage autoscalers where we can so we can scale up and down to save cost, and centralize the license managers in a cloud geography that's accessible to all. One of the key things you can do in almost any cloud is connect different regions via networking that exists solely in the cloud itself. So within IBM Cloud we use transit gateways to connect different geographies and maximize our ability to burst anywhere in the IBM Cloud from anywhere within IBM.

So how does our workflow work? Given the description I gave you earlier, how do we set this up in Kubernetes? I'm sure people here could come up with many different ways to do this. Because of the nature of these tools, we chose to mimic as closely as we possibly could the way these jobs run on a native bare-metal Linux cluster. The reason we wanted to do this was twofold. One, it solves the problem of OPC engineers or design engineers understanding how their jobs are running. And two, with very small tweaks to a controller or an operator in Kubernetes, we can build resilience into this workload by leveraging inherent Kubernetes capabilities to keep workers up if they go down, make sure we can checkpoint and restart jobs, and scale jobs down so that we can prioritize others. So what happens is we use Kubernetes Job types. We have a primary job, which starts first and does exactly what I talked about before: it reads in the layout, chops it up into tiles, and creates a queue. Once it's done, it talks to a distributed controller that then spins up a number of workers. And when I talk about workers in this case, we're talking about running at fairly large scale. Table stakes for an OPC run is typically between 8,000 and 16,000 pods; when you start talking to foundries, it's more like a 20,000-pod run to make the most efficient use of time. So what we wanted to do was say: for us to use this internally for research, we need to be able to get to 10,000 pods reliably.
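One plausible way to express that worker fleet as a single Kubernetes object, sketched here with the Kubernetes Python client, is an Indexed Job whose parallelism matches the desired pod count, so the Job controller restarts failed workers for you. The image, environment variable, PVC name, namespace, and pod count below are assumptions for illustration, not the actual IBM job flow.

```python
from kubernetes import client, config

# Sketch only: a 10,000-way Indexed Job standing in for the remote OPC workers.
# All names (image, PVC, service host, namespace) are hypothetical.
config.load_kube_config()

workers = client.V1Job(
    metadata=client.V1ObjectMeta(name="opc-workers"),
    spec=client.V1JobSpec(
        completion_mode="Indexed",
        completions=10000,
        parallelism=10000,
        backoff_limit=10,                 # let the controller restart failed workers
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="OnFailure",
                containers=[client.V1Container(
                    name="opc-worker",
                    image="registry.example.com/calibre-worker:latest",
                    # each worker connects back to the primary (hypothetical env var)
                    env=[client.V1EnvVar(name="PRIMARY_HOST",
                                         value="opc-primary.svc")],
                    volume_mounts=[client.V1VolumeMount(
                        name="shared-fs", mount_path="/work")],
                )],
                volumes=[client.V1Volume(
                    name="shared-fs",
                    persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                        claim_name="opc-shared-fs"),
                )],
            ),
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="opc", body=workers)
```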
Let's go out, figure out how to assemble job flows like this, and demonstrate scaling from 1,000 to 10,000 pods. If we can do that in a fairly linear fashion, then we have confidence that we can do this at even greater scale. One of the other reasons for breaking our job flows apart is that it also lets us address the differences between running, say, four 2,500-pod runs versus one 10,000-pod run on the same cluster, and it lets us address some of the shared file system issues that we have. One of the things to keep in mind is that for many of these tools, and for many HPC workloads, a shared high-performance file system is really key. All of the worker pods need to see the same file system that the primary pod sees, and read, write, and stat it, and they're going to do this with irregular access patterns. So one of the ways we've made our work happen here is to explore both open source high-performance file systems like Ceph and internal file systems like GPFS, or Spectrum Scale. What we tend to do is offer Ceph, Gluster, or GPFS as persistent volume claims that are mounted by all of the pods in the run. We keep those static, and there's another advantage to that: when we have to go back and debug logs, everybody can see the same thing. They can pick up the logs in the same place in whatever cloud environment we're talking about, and with asynchronous file management we can bring them anywhere. So these jobs are heavily compute dependent and heavily I/O dependent, both at the network level and at the file system level.

One of the things I get asked, and I get some skeptical questions up front from new OPC engineers and new designers, is: yeah, this is all great, it looks convenient, but is it going to perform? How do I know that if I just give you a Kubernetes job, I'm going to get it back in the time I need? So over the last two years or so, we've been looking at scaling. OPC is embarrassingly parallel, so it should scale fairly linearly, but there are parts of the job that need to do I/O and some checking of the hierarchy management in the primary pod, so we don't expect it to be purely linear. What I've done in these results is plot a couple of things. I plotted the speedup you would expect if everything were linear, which is the black line you see here. The next thing we've added is the speedup assuming a serial fraction of about 15%; that's more attainable than pure linear given the amount of file and network I/O we need to do. And then I plotted our measured speedup, the blue curve, for OPC runs spanning a thousand to 10,000 pods for a nanosheet node, which you can think of as three nanometers to two nanometers depending on who you talk to, for a back-end wiring level. Back-end wiring levels, and I apologize for using jargon, are the first wiring levels, the ones most important for signals; they're called the thin-wire levels, and there are a couple of them, but the first one is the most important. So what we're showing with the blue curve is that, scaling from 1,000 to 10,000 pods, we're achieving a pretty close match to roughly a 15% I/O or serial fraction, which is actually quite good. We're by no means done with this.
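For reference, the serial-fraction comparison curve can be reproduced with a couple of lines of Amdahl-style arithmetic. The sketch below assumes the serial fraction is measured against the 1,000-pod baseline run; that normalization is my reading of the plot, not a statement of exactly how the published curve was computed.

```python
# Rough sketch of the comparison curves described above. Assuming the serial
# (non-parallelizable) fraction s is measured against the 1,000-pod baseline,
# an Amdahl-style estimate of speedup over that baseline is
#   S(N) = 1 / (s + (1 - s) * 1000 / N).
# The 15% figure comes from the talk; the normalization is an assumption here.
def speedup_vs_baseline(pods, serial_fraction, baseline=1_000):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) * baseline / pods)

for pods in (1_000, 2_500, 5_000, 7_500, 10_000):
    ideal = pods / 1_000
    est = speedup_vs_baseline(pods, 0.15)
    print(f"{pods:>6} pods: linear x{ideal:.1f}, estimated x{est:.2f} with a 15% serial fraction")
```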
We believe that these points you see here, around six to seven thousand pods, we understand how to fix, and we think that with some very simple tweaks we can start to increase the slope of this curve so that we're somewhere between the 15% serial fraction and the hypothetical linear scaling result. So to summarize, we've successfully demonstrated running optical proximity correction at scale using a managed OpenShift service. We target managed OpenShift services because they minimize setup and teardown in public cloud infrastructure. And we're applying what we've learned with optical proximity correction to many other tools in the design tool chain, so we can apply this to the tools we actually use for register transfer logic, synthesis, place and route, timing closure, and performance. We've got a lot of opportunities to build logic and custom controllers into this flow; we view this as the first step. We do use OpenShift and a managed Kubernetes service for all of our mask delivery into Albany NanoTech today. So in a sense, and this is something we always have to do within IBM, our management team insists that we eat our own cooking: we can't go out and say everybody should use Kubernetes or everybody should use OpenShift; we actually have to do it ourselves and make sure it works for us, so that we can stand behind it and say, here's the data, I can scale from 1,000 to 10,000 pods, I get good results, and I can do this reliably. I think this is also going to enable us to make smarter use of public cloud services, the IBM Cloud in particular, but not to limit ourselves to that. We're a hybrid cloud company; we want to be able to enable this in other people's clouds so that we can make the greatest use of cloud resources and also deliver new technology on time. So thank you very much.

Thank you. Nice talk. I was curious: it sounds like you have a lot of workflow management going on with this workload. Do you use any of the CNCF projects, like Argo Workflows or anything like that? Not yet, but we would like to. We started very simply because we had to solve some problems, or rather, we had to optimize our containerization flow, and in order to keep things simple on the compute side, we didn't try to do anything more sophisticated. That's one area where we'd like to go: leveraging open source projects, particularly on the operator side and the scheduling side, to do that better within OpenShift and Kubernetes. I just want to add one more thing. There are also things like CRD changes and custom configuration on the nodes. For that we are looking at Argo CD, so that whenever a job comes in, before the infrastructure is spun up, Argo CD comes in and prepares the infrastructure. That's how we're looking at Argo CD.

Thank you for the talk, very inspiring. A question: you've been at this for two years, which means you probably went through a number of OpenShift versions. I'm sorry, I'm hard of hearing. You've been doing this for two years. Yes. You've probably been using several different OpenShift versions. Yes. What are your conclusions when upgrading? Did it really improve your performance or not? Can you repeat that one more time? I'm sorry. You probably went from OpenShift 3, or maybe OpenShift 4.1 or 4.2, to whatever you are on today, 4.10 or 4.11. Right.
I assume there will be differences when you repeat the same test with newer versions. Can you share something about that? Oh, yes, yes. We do see differences. We've seen improvement, definitely in the 4.x series of releases, in particular with our network performance and our ability to tune file systems. I think we need to be a little more rigorous about version control and how we benchmark different versions; we're not doing that today, but it's on our list. But I have definitely seen improvements from 3 to 4 in networking, in file system performance, and in our ability to tune them. The compute is also better, but it's hard for me to quantify because these workloads don't make very good use of hyperthreading, so when we place workloads, we're looking more for physical cores and trying to drive the workload onto physical cores. I think we have some work to do to optimize that and make better use of what OpenShift and other Kubernetes offerings provide. Just one more thing: OpenShift is 100% Kubernetes, and we are also making improvements upstream. Things like crun and cgroups v2 are coming, which will further enhance performance; we have seen better results in the lab, and we are going to implement this within OpenShift as well. So going forward, I will be working with Darren to see how the graph looks when we implement those.

Your slide listed GPFS and Ceph running with OpenShift. Was that running in cluster, or was that external? So this is a managed service; they are running external. We're building them within the same VPC, but the control plane for the cluster is separate and different. When we do the mounts, we're provisioning virtual machine nodes in the VPC, and the control plane is managing all the connections. So, if I understand your question as asking whether it's external: the file systems are in one of the data centers that the VPC encapsulates. The block storage that we're building them on is usually local, so we'll build them over local NVMe nodes or virtualized bare-metal instances with local storage. So the operators aren't running in cluster, like Rook-Ceph, or using CNSA for the GPFS layer? No, these are static persistent volume claims that mount the file system we built within the VPC. All right. With the differences between Ceph and GPFS, did you see any performance differences in your workload? We did, though it's not really a fair comparison. Ceph performs much better than almost any of the other open source file systems we've tried thus far, and, truth in advertising, we haven't tried Lustre yet, but between Gluster and bare NFS built over block storage, Ceph outperforms them all. GPFS is really optimized for HPC, and it has some features that Ceph doesn't have that allow us to tune metadata and performance much more finely than we can with Ceph. So GPFS will outperform Ceph. That being said, from an open source perspective, we're getting very good performance from Ceph. We still have a lot of work to do, but we're getting very good performance from Ceph.

Thank you, great talk. You mentioned that with Ceph there's something missing for HPC. I'm curious, what's your view of Ceph versus an HPC file system? Is there necessarily something missing from Ceph? We're using the Ceph file system. GPFS is really kind of like a Ferrari.
I mean, there are things it does very, very well, and you can tune it to do certain things very well. So in some sense it's not a true apples-to-apples comparison. It's more: if I'm willing to go out and get Spectrum Scale and take the time to configure it, and there are managed services for GPFS, or Spectrum Scale, today, but like anything else you've got to tune it, then it simply has more capability under the hood than Ceph does out of the box. I think that's the main way to explain it. It's also often used in universities and national data centers, so there's an entire body of work where people have figured out, for this workload, here's how GPFS gets configured. So I expect it to outperform Ceph out of the box. Now, maybe there's somebody who's better at Ceph than I am who could get very close or beat GPFS, but GPFS is really starting with a huge arsenal ahead of time, a huge head start. All right, thank you.
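Coming back to the earlier answer about static persistent volume claims: a minimal sketch of that pattern, using the Kubernetes Python client, is a PVC with an empty storage class that binds to a pre-created PersistentVolume pointing at the external Ceph or GPFS file system. The PV, PVC, and namespace names here are illustrative, not the actual setup.

```python
from kubernetes import client, config

# Sketch of a statically bound PVC: an empty storageClassName disables dynamic
# provisioning, and volume_name pins the claim to a pre-created PV that points
# at the external Ceph or GPFS file system. Names and sizes are illustrative.
config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="opc-shared-fs"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],      # every worker pod mounts the same share
        storage_class_name="",               # static binding, no provisioner involved
        volume_name="opc-shared-fs-pv",      # pre-created PV for the external file system
        resources=client.V1ResourceRequirements(requests={"storage": "10Ti"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="opc", body=pvc)
```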