Thank you all for joining. We're here to talk about mastering LLM delivery in private clouds: a journey to seamless deployments, mostly with Kubernetes and OCI. For those of you who are familiar with Cohere, we do have a relationship with another OCI. We're talking about the Open Container Initiative here, not Oracle Cloud Infrastructure. But hey, our Oracle Cloud friends, we like you too.

Before we get started, let me give you an intro. My name is Autumn Moulder. I'm the Director of Infrastructure and Security at Cohere. We build large language models. Yes, we put LLM in the title of the talk mostly just to bring you all here; I won't spoil what the talk is actually about, but we will talk about LLMs a little. We build foundational models, and we help companies that are looking to deploy and use this tech in their enterprise. And to talk us through some of the challenges we have:

Hi everyone. My name is Marwan. I am a member of technical staff at Cohere. Previously I worked on the Azure Kubernetes Service team at Microsoft, and more recently at the company formerly known as Twitter. I've contributed to several CNCF projects in the past, so it's super exciting for me to see the foundational role CNCF projects are playing in the success of the new kid on the block, large language models. We can't wait to share our journey with you. So without further ado, I'll hand it back to Autumn.

All right, thanks Marwan. I'll just tee this up a little bit before we get started. I like to know who I'm talking to, so by a raise of hands: who's a site reliability engineer, or identifies as such? Okay, we've got a few of you, thank you. One on the stage, thank goodness. Data scientists, ML engineers out there? Good number. All right, I have no idea what the rest of you are here for, but I hope you find something valuable in our conversation. Awesome.

So, really briefly, most of you probably know this because you saw the word LLM in the title of the talk, but what is a large language model? In a nutshell, I asked our large language model and this is what it said: it's a way you can talk to computers with natural language. My favorite definition, to be honest, is this one: it's a pile of linear algebra on disk. For those of you who are infrastructure engineers, that's relevant because we're just dealing with files, files on disk that store the probabilities.

We'll talk briefly about the LLM serving stack, just to anchor this conversation before we get to the technical challenges. This is pretty high level. When we're talking about the system components for an LLM stack, it's very similar to any other system you're running. You have the surrounding pieces: the observability layer, the persistence layer, artifact management. This is what you get when you have to serve any kind of system. In the middle we have the CPU-based services: rate limiting, authentication, endpoints and routing, and batching and queuing, which actually turns out to be a really important problem when you're dealing with large language models. These are all CPU-based workloads that we run on GKE. And then over here on the left, sorry, on the right, we have the models, and these are our GPU-based workloads, which is where a lot of the interesting challenges come into play. And so, yeah, this is it. This is our SaaS system.
This is kind of where we started as a company, because we started small; we're a startup. A lot of the systems we built had some pretty key dependencies on GCP, because that's how you start. But what we found in the market was a strong need to run private LLMs.

Real quick: why do companies need a private LLM? Most of you can probably guess, but it's similar to when, as an industry, we started moving into the cloud. All the same reasons people gave for "I can't move out of my co-located facility into the cloud" are the exact same reasons we're hearing for why companies need to run a private LLM and don't want to use a SaaS system. It's compliance and security controls: you have those built up, you want to use them, and you want them to apply to any LLMs you're building into your system. Data volumes and latency are key problems: if you're sending a lot of data to the LLM, you need it co-located with the rest of your infrastructure to keep latency down, and you don't want to pay large data egress costs. And for the infrastructure engineers, you know it's really nice to have a single pane of glass to monitor availability and reliability, and to bring that all into one place. These were the challenges we were hearing from companies who were saying: we don't want to use a SaaS-provided system, but we also don't have the expertise to run a fully open source stack, and this is an evolving space, so we want help.

So as a leadership team we brought the challenge to our external infrastructure team; we've got the rest of the crew here today, and then Marwan. It was a really great group effort to figure out: how do you take the system running in our SaaS environment and make it really easy to deploy into private environments? To walk through those challenges and solutions: Marwan.

Thanks, Autumn. So one of the first challenges we hit was our dependency on managed services. We were a startup trying to move fast, so we took a lot of dependencies on GCP-specific components in areas such as observability and persistence. When we started looking at our code, we found it effectively made a lot of assumptions that it was running in GCP, and it was super tightly coupled to the GCP SDK as well. What we needed to do was, first of all, make our stack cloud-agnostic. That means figuring out what we want to do with the managed services, and then making sure the stack can run on multiple other targets as well.

The first lesson we learned is to invest in abstractions, and to invest in those abstractions early. For services such as databases and pub/sub, you want a uniform API that is independent of the underlying implementation, and in some cases you might also want to put those interactions behind a separate microservice.

When we started looking at our code, a lot of the functions and objects had configuration options that were specific to our own use case. We had things like analytics code, billing code, and feature flagging libraries, and not all of those things are going to be relevant for every single customer. So we needed to simplify our service configuration API, and that's where the functional options pattern came in handy. It's a creational design pattern that lets you build a complex struct using a constructor that takes zero or more functions, and those functions modify the state that is returned to the consumer. With that pattern we were easily able to define different options for our private cloud deployments versus our SaaS deployments, for example turning off feature flagging for private cloud configurations, and it made it super easy for us to define new options without modifying the existing constructor code, because all you need to do is define a new option function.
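As an illustration of that pattern, here is a minimal Go sketch; the option names (WithFeatureFlags, WithBilling) are hypothetical stand-ins rather than Cohere's actual configuration surface.

```go
package serving

// Config holds service configuration; the zero value is the private-cloud-friendly default.
type Config struct {
	featureFlagsEnabled bool
	billingEndpoint     string
}

// Option mutates a Config; new options can be added without touching the constructor.
type Option func(*Config)

// WithFeatureFlags enables the feature-flagging integration (used in the SaaS deployment).
func WithFeatureFlags() Option {
	return func(c *Config) { c.featureFlagsEnabled = true }
}

// WithBilling points the service at a billing endpoint; private deployments simply omit it.
func WithBilling(endpoint string) Option {
	return func(c *Config) { c.billingEndpoint = endpoint }
}

// Service is the thing being configured.
type Service struct {
	cfg Config
}

// NewService builds a service from zero or more options.
func NewService(opts ...Option) *Service {
	cfg := Config{}
	for _, opt := range opts {
		opt(&cfg)
	}
	return &Service{cfg: cfg}
}
```

A SaaS deployment might call NewService(WithFeatureFlags(), WithBilling(url)), while a private cloud deployment just calls NewService(); adding a new option later doesn't touch the constructor at all.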
The second lesson we learned is not to reinvent the wheel. It can be tempting to define your own abstractions over common operations, but there's no need to do so when solutions already exist. For object storage we relied on the Thanos objstore library, which abstracts a lot of the operations over common storage backends. And instead of having to worry about the specifics of persistent storage and the implementation of storage devices, we required a user-provided ReadWriteMany PVC as part of the application configuration. We mostly use that PVC for storing the large model weights, as an optimization to avoid having to re-download them during autoscaling. For metrics, most of our code was instrumented from day one with Prometheus metrics, which is the de facto tool in the industry, so we didn't really need to do much there. For logging, there are a lot of exporters out there that convert between proprietary and open source formats. And finally, we tried to align on workload identity whenever possible, because it simplifies the integration between Kubernetes service accounts and IAM service accounts.

So that's what we started with; you can see all the GCP-specific blocks there. It turns out all you need is a few small abstractions and you're halfway there. But we haven't really talked about the artifact management layer, and for us the most valuable artifact is our model weights. We needed a solution to deliver our model weights securely into private cloud scenarios. We could roll our own storage service, but then you need to worry about authentication and security, you need to worry about scalability, and finally you need to worry about the cost. So I'm going to take a quick break from the lessons and deep-dive into how we designed our model weights delivery solution.

The first thing we considered was creating signed URLs. It's a decent solution, but signed URLs are quite limited in nature: they have short expiration times, and they can only be created for exact object URIs. Our model weights are quite large; we have model weights that are more than 100 gigabytes. So in order to deliver a solution using that approach, you need to create a unique URL per customer, per file, per model, which isn't really scalable.
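To make the scaling problem concrete, here is a rough sketch of per-object signed URL generation with the GCS Go client; the bucket, customer list, file names, and key handling are hypothetical, not Cohere's actual tooling.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"cloud.google.com/go/storage"
)

func main() {
	key, err := os.ReadFile("signer-key.pem") // hypothetical signing key for the service account
	if err != nil {
		log.Fatal(err)
	}

	customers := []string{"acme", "globex"}
	models := map[string][]string{
		// hypothetical model -> weight shard files; real weights run to 100GB+ across many files
		"command-large": {"shard-000.bin", "shard-001.bin"},
	}

	// One short-lived URL per customer, per model, per file: the URL count (and the
	// re-signing churn every time one expires) grows multiplicatively, which is why
	// this approach didn't scale for weight delivery.
	for _, customer := range customers {
		for model, files := range models {
			for _, f := range files {
				object := fmt.Sprintf("%s/%s/%s", customer, model, f)
				url, err := storage.SignedURL("model-weights-bucket", object, &storage.SignedURLOptions{
					GoogleAccessID: "weights-signer@example.iam.gserviceaccount.com",
					PrivateKey:     key,
					Method:         "GET",
					Expires:        time.Now().Add(15 * time.Minute),
				})
				if err != nil {
					log.Fatal(err)
				}
				fmt.Println(url)
			}
		}
	}
}
```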
The second option we considered was to deliver those artifacts directly into the customer's object storage. We could use workload identity federation to give them access to our own storage backend, and then they could pull the weights directly from our end. The issue is that it requires a lot of manual, complex setup, it's not supported in all cloud providers, and finally, it's obviously not supported on-prem. So that solution wasn't really viable for us.

The third option was to bundle our model weights with the container images. Our application consists mostly of serving images plus the model weights; that's effectively what makes up the LLM stack. So we thought, okay, we can just bundle the model weights into the same container image, and if users can pull the container images, they can pull the model weights. For container images specifically, we relied on a reverse proxy sitting on top of our Google Artifact Registry. The reverse proxy connects to the registry through a service account that gives it authorized access to pull those images, and customers talk to the reverse proxy using their license key to pull the images through. We relied on Replicated as a commercial product for part of that: the licensing API plus the reverse proxy side of things.

The issue with that approach is that it's less flexible, because of the tight coupling it introduces between the application layer and the model weights. Our model weights generally have a quicker release cadence, and having to create a new image every time we want to release a model weight wasn't really flexible. The other issue is that it negates the optimization we did with the NFS cache for quicker auto-scaling of models, so you end up not being able to scale as fast. And finally, because those images bundle the serving images, which are already quite large, plus the model weights, patching critical vulnerabilities takes a while.

So we took a look at that and said: okay, we already have a mechanism to deliver container images; is there a way we can use the same mechanism to deliver the model weights and achieve the desired decoupling? That's where OCI artifacts come into play. Lesson three: OCI artifacts are a very powerful standard, really powerful. For those not familiar with OCI artifacts, they are a way to use OCI-compliant registries to store arbitrary files, and because you're utilizing the same infrastructure as containers, you end up consolidating the security and management efforts into a single solution. The community usage of OCI artifacts is still evolving, but we believe they have strong potential as a generic artifact store; registries can be a very good generic artifact store.

So I'm going to deep-dive into how we used OCI artifacts to build our model weight solution. We use the open source project ORAS; ORAS is the de facto tool for handling and managing OCI artifacts. I like ORAS a lot. I almost named my cat ORAS. The focus in ORAS is on generic artifacts, so it never assumes that you're dealing with a container image.
It always assumes you're dealing with a generic artifact type, and it has a very rich API and library support for building custom registry clients that handle different media types. I'll dive a bit into what some of those terms mean.

When we started working with ORAS, the first thing we thought was: we can just give ORAS the model weight directory, run an oras push, and it will push things into the OCI artifact registry. What that ends up doing is creating a single atomic layer, one entity, as a tarball marked for extraction on download. This is what a manifest looks like with that approach: it looks like a typical container image manifest, but the media type is different, and you can define any media type you like there. You can think of a media type the way you think of an HTTP request where you specify the content type as application/json; the client knows how to handle that particular media type. When you create an artifact with ORAS by passing it a folder, you'll see it has two annotations: one is an unpack annotation set to true, which marks the folder for extraction after download, and the second is effectively just the directory name where that blob will be extracted.

The issue is that because it's a single layer, you aren't gaining any parallelism; the entire blob is serialized, which was much slower than using object storage. So we needed to do better. The second thing we did was: what if we repack the entire model weights into separate smaller folders and then pass those to ORAS to upload? We split the model weights into smaller subfolders, each suffixed with a number. For example, if you have a model weight of size 100 gigabytes, you end up creating 20 folders of roughly five gigabytes each.
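A plain-Go sketch of that repacking step, assuming hypothetical paths and a 5 GB target chunk size (Cohere's actual packaging logic isn't public); each resulting chunk-NN folder later becomes its own layer, so uploads and downloads can run in parallel.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
)

const chunkTarget = 5 << 30 // roughly 5 GB per subfolder (hypothetical target size)

// repack copies every file under srcDir into dstDir/chunk-NN/ subfolders,
// starting a new chunk whenever the current one would exceed the target size.
func repack(srcDir, dstDir string) error {
	var chunk, used int64
	return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		if used > 0 && used+info.Size() > chunkTarget {
			chunk++
			used = 0
		}
		rel, _ := filepath.Rel(srcDir, path)
		dst := filepath.Join(dstDir, fmt.Sprintf("chunk-%02d", chunk), rel)
		if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
			return err
		}
		if err := copyFile(path, dst); err != nil {
			return err
		}
		used += info.Size()
		return nil
	})
}

func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}

func main() {
	if err := repack("./model-weights", "./repacked"); err != nil {
		log.Fatal(err)
	}
}
```

The trade-off, covered next, is that every file now lives one directory deeper than it started, so the original layout has to be rebuilt after download.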
The issue with that, though, is that on extraction your entire directory tree, the folder structure, is different, so you need a way to rebuild the structure post-download. That's where ORAS's rich library comes into play. ORAS has a post-copy callback you can hook into after a blob is downloaded, and you can take an action there. So the process seems easy, right? We just hook into this post-copy callback and say: everything you downloaded here, shift it up one directory, and that restores the model weight structure as it was before. But we still need a way to tell ORAS that it should actually do this. Remember the annotations I showed you before: it turns out you can define any custom annotation you'd like on an artifact layer, and then you can key whatever operations you want off of it. So this is what our manifests look like with that approach. We still have the same media type, and you'll notice at the bottom there's an annotation that says "move up." That annotation is very specific to our implementation of the ORAS client; it says, once I see this annotation, I need to shift everything up one directory. The rest of the manifest is just those layers repeated, where every layer is one portion of the model weights.

That's the typical interaction with the ORAS API: you first create a bunch of file descriptors, you add annotations to those, and then you pass them to ORAS and say, pack these layers with these particular annotations. The download process is simple: on download we just check whether the annotation exists, and then we move the folders up a directory to rebuild the structure.

Some of you might be asking: we've had binary distribution formats and blob storage APIs for years, they're out there and supported, so why do we need to put this on a container registry API? Well, it turns out there are a lot of benefits you gain from using OCI and ORAS. For us, the main benefit is that we were able to ship a solution really quickly, in less than two months, actually, because it turns out that all you really need to stand up a complete environment is a container registry. That's really the only hard dependency you have, more or less. It also unlocked a bunch of other benefits for us, and most of those benefits come from the fact that container registries, registries in general, are content addressable.

The first thing that unlocked for us is a way to ensure the authenticity and integrity of the image contents. Because every layer gets its own digest, its own fingerprint, any modification to a particular layer results in a different hash, so developers can easily validate whether an image has been tampered with. The second thing is that thinking of model weights as image layers unlocked a lot of encryption scenarios for us: instead of having to encrypt the entire model weights and take the hit on download time, we can pick a particular layer from the manifest and decide to encrypt just that one, which avoids having to encrypt the full thing. The third thing is that, because of the content-addressable nature of container registries, a layer that has been stored before is not going to be uploaded again. If you build your model weights in a way that ensures unique layers aren't duplicated, you end up with a bunch of storage cost reductions if you're smart about it, because if a layer already exists in a registry, just like a container layer that's already in the registry, it's not going to be pulled again.
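Here is a compressed sketch of that pack-and-restore flow using the oras-go v2 library. The media type, the move-up annotation key, and the repository reference are hypothetical stand-ins for Cohere's internal client, and the file-store extraction behavior is assumed rather than taken from the talk.

```go
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"

	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
	"oras.land/oras-go/v2"
	"oras.land/oras-go/v2/content/file"
	"oras.land/oras-go/v2/registry/remote"
)

const (
	layerMediaType = "application/vnd.example.model.weights.layer.v1" // hypothetical media type
	moveUpKey      = "com.example.weights.move-up"                    // hypothetical annotation key
	repoRef        = "registry.example.com/models/command"            // hypothetical repository
	tag            = "v1"
)

func must(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

// push packs each chunk folder as its own layer, annotated so the pulling client
// knows to restore the original directory layout after download.
func push(ctx context.Context, chunkDirs []string) {
	store, err := file.New(".")
	must(err)
	defer store.Close()

	var layers []ocispec.Descriptor
	for _, dir := range chunkDirs {
		desc, err := store.Add(ctx, filepath.Base(dir), layerMediaType, dir)
		must(err)
		if desc.Annotations == nil {
			desc.Annotations = map[string]string{}
		}
		desc.Annotations[moveUpKey] = "true"
		layers = append(layers, desc)
	}

	manifest, err := oras.PackManifest(ctx, store, oras.PackManifestVersion1_1,
		"application/vnd.example.model.weights.v1", oras.PackManifestOptions{Layers: layers})
	must(err)
	must(store.Tag(ctx, manifest, tag))

	repo, err := remote.NewRepository(repoRef)
	must(err)
	_, err = oras.Copy(ctx, store, tag, repo, tag, oras.DefaultCopyOptions)
	must(err)
}

// pull downloads the artifact and, via the post-copy hook, shifts each annotated
// chunk's contents up one directory to rebuild the model-weight structure.
func pull(ctx context.Context, destDir string) {
	store, err := file.New(destDir)
	must(err)
	defer store.Close()

	repo, err := remote.NewRepository(repoRef)
	must(err)

	opts := oras.DefaultCopyOptions
	opts.PostCopy = func(ctx context.Context, desc ocispec.Descriptor) error {
		if desc.Annotations[moveUpKey] != "true" {
			return nil
		}
		chunk := filepath.Join(destDir, desc.Annotations[ocispec.AnnotationTitle])
		entries, err := os.ReadDir(chunk)
		if err != nil {
			return err
		}
		for _, e := range entries {
			if err := os.Rename(filepath.Join(chunk, e.Name()), filepath.Join(destDir, e.Name())); err != nil {
				return err
			}
		}
		return os.Remove(chunk)
	}
	_, err = oras.Copy(ctx, repo, tag, store, tag, opts)
	must(err)
}

func main() {
	ctx := context.Background()
	push(ctx, []string{"repacked/chunk-00", "repacked/chunk-01"})
	pull(ctx, "/models/command")
}
```

The PostCopy hook is where the custom annotation gets interpreted: any layer carrying it has its extracted contents shifted up one directory, which is the restore step described above.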
That layer deduplication works through digest fingerprinting, the SHA-based content addressing that container registries already use. ORAS itself has built-in retries for failed layer downloads: if a particular layer fails to download, it's not going to undo all the work done before, it's only going to retry that particular layer. All of that comes for free just by using ORAS; you don't need to build any custom retry logic.

Another thing is that because you can pre-inspect the manifest of an artifact, you're able to do a bunch of smart things. For example, we use Triton Inference Server heavily, and the Triton config is a file that tells the Triton server how to behave. We store the Triton config as a separate layer on the artifact, so you can pull it first and make changes to support running on different hosts, with different batch configs, and so on.

It also unlocked an easier path to air gap for us, because, as I mentioned before, a registry is really the only dependency you need. Air-gapped customers can just replicate both our container images and model weights directly into their registries, and that's all they need to spin up our application, plus a Helm chart, and Helm charts can also be stored as OCI artifacts. So it turns out the whole problem effectively simplified into using a vendor-neutral OCI registry. We got rid of the GCS dependency there, and that's all we needed to simplify our artifact delivery solution.
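A sketch of that manifest pre-inspection idea: fetch only the manifest, find the layer annotated as the inference-server config, and pull just that blob before committing to the multi-gigabyte weight download. The annotation key and repository are hypothetical, and this assumes the oras-go v2 API.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
	"oras.land/oras-go/v2"
	"oras.land/oras-go/v2/content"
	"oras.land/oras-go/v2/registry/remote"
)

// fetchConfigLayer pulls just the layer marked as the inference-server config,
// so it can be inspected or patched before the full weights are downloaded.
func fetchConfigLayer(ctx context.Context, ref, tag string) ([]byte, error) {
	repo, err := remote.NewRepository(ref)
	if err != nil {
		return nil, err
	}

	// Fetch only the manifest (a few KB), not the layers.
	_, manifestBytes, err := oras.FetchBytes(ctx, repo, tag, oras.DefaultFetchBytesOptions)
	if err != nil {
		return nil, err
	}
	var manifest ocispec.Manifest
	if err := json.Unmarshal(manifestBytes, &manifest); err != nil {
		return nil, err
	}

	for _, layer := range manifest.Layers {
		// "com.example.weights.role" is a hypothetical annotation marking the config layer.
		if layer.Annotations["com.example.weights.role"] == "triton-config" {
			return content.FetchAll(ctx, repo, layer)
		}
	}
	return nil, fmt.Errorf("no config layer found in %s:%s", ref, tag)
}

func main() {
	cfg, err := fetchConfigLayer(context.Background(), "registry.example.com/models/command", "v1")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("fetched %d bytes of config\n", len(cfg))
}
```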
And with that, I'll hand it back to Autumn.

This is literally me giving him a chance to take a drink of water. I think I warned you we put LLM in the title, but really this is our love letter to OCI and ORAS. It's a great set of standards, and it was a really fun engineering challenge to work through.

So, once we dealt with the first two sets of challenges that Marwan walked through, the observability side, persistence, and artifact storage, we had to get to the core set of challenges that are specific to models in general and LLMs specifically: dealing with GPUs. What we found is there are a lot of challenges when you're going across clouds and across providers for GPUs, because we're not just talking about the four main clouds. We're also talking about on-prem scenarios, and about specialized providers that are hyperscaling and building out GPU-specific data centers, that kind of thing.

What we found was there were two main areas of challenge. One is provisioning the GPUs in a way where you don't have to build a giant decision tree that says "am I running on this provider? Then use these configurations." So provisioning is a significant challenge that we'll talk through. The other is dealing with small scale versus large scale. As a SaaS provider we're obviously dealing with GPUs at large scale, both on our internal infrastructure team that's dealing with superclusters and on our external infrastructure team that runs the SaaS system; we're dealing with a lot of GPUs, not just a few. In our private deployment scenarios, we have customers who need to spin up a lot of replicas, so they start to approach some of the challenges we hit at large scale, but many of our clients are talking about tens of GPUs. They're not dealing with a lot of GPUs, and the challenges you hit there are different, so learning some of those lessons was interesting. So, Marwan, I'll hand it back.

Hello again. So, the first lesson that relates to GPUs is that provisioning GPUs reliably is tricky. First you need some host-level components, such as the NVIDIA drivers, and then you need some Kubernetes-specific components, like the device plugin and maybe the DCGM exporter if you care at all about GPU metrics. The issue is that there's an implicit ordering dependency between the two: if you try to run the NVIDIA device plugin before your drivers are ready, it's likely going to crash, and if you have alerts configured, that's going to cause a lot of alerting noise for you. And we found that different providers do this process differently. They support it now, but for the longest time GKE had you manage the drivers yourself while it managed the device plugin for you, and AKS takes a different approach where they offer a pre-built VHD with all of those bits installed, or you can also run your own device plugin. There is a solution for this, but it's not quite there yet.
It's the NVIDIA GPU operator, and it's not quite there yet because it's not supported on certain operating systems, specifically COS (Container-Optimized OS) if you're on GCP.

The next lesson is inconsistent identifiers, and by that I specifically mean the device name. The device name returned by NVML can be different in certain scenarios. This is important to us because we use Triton for inference, and we use Triton's dynamic batching feature to optimize for throughput, with different batching configs depending on the GPU instance you're running on. The way those configs are keyed is a combination of the model name and the GPU type you're running on. Well, we found that if your cloud provider is using PCIe versus SXM, that is, PCI Express versus SXM as the physical interconnect, you can get a different device name. For the same A100 80GB node you might get one name if you're running in Azure and a different one if you're running in GCP, and at that point our parsing logic failed, so the batching configs didn't work in different environments. We had to work around that and account for the fact that the device names returned can be different.

The next one is Kubernetes node labels. Node labels are important because you use them to pin your workloads and control their scheduling properties. But it turns out there's really no uniform way of knowing whether you're running on an A100 versus a T4, for example, without hardcoding or without knowing in advance what the VM name or hostname is going to be. Some providers have an accelerator label; GKE does it, Azure does it, but the value isn't consistent, so there really isn't a uniform way to define that today. The GPU feature discovery project tries to solve this problem by adding a bunch of labels that it reads from the host onto the node, but you're bound to the previous problem I mentioned, because the device name it reads can be inconsistent. So we had to make sure our application is flexible enough, with templating, that the customer can provide their own unique set of labels and values.
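To illustrate the device-name problem, here is a small sketch using the go-nvml bindings; the exact strings NVML returns vary by platform and driver, and the normalization rule is a simplified stand-in for whatever parsing an application actually needs.

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// normalize maps the raw NVML device name onto a coarse GPU class, so that
// batching configs keyed on "model name + GPU type" don't break when the same
// GPU reports differently (for example, PCIe versus SXM variants of an 80GB A100).
func normalize(raw string) string {
	name := strings.ToLower(raw)
	switch {
	case strings.Contains(name, "a100") && strings.Contains(name, "80"):
		return "a100-80gb"
	case strings.Contains(name, "a100"):
		return "a100-40gb"
	case strings.Contains(name, "t4"):
		return "t4"
	default:
		return "unknown"
	}
}

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			continue
		}
		name, ret := dev.GetName()
		if ret != nvml.SUCCESS {
			continue
		}
		fmt.Printf("gpu %d: raw=%q class=%s\n", i, name, normalize(name))
	}
}
```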
So, node upgrades. Everyone loves node upgrades; they usually work, most of the time. For GPU node pools there's a unique set of challenges. You can do surge upgrades, but striking the right balance between availability and speed is tricky. Ideally, if you want no downtime, you do a blue-green strategy: you create a new node pool on the new version, move the workloads there, let it soak for a bit, and delete the old pool afterwards. But GPUs are expensive, and you're not guaranteed GPU availability, so a blue-green upgrade, while it may work sometimes, is going to fail most of the time. What we found is that the best approach to upgrades for GPU workloads, specifically ones with a large node count, is to create a new node pool on the new version, taint the old pool to ensure nothing else lands on it, and, if you're using the cluster autoscaler, cap the old pool's maximum size so the autoscaler won't try to scale it up again. Then you let the HPA do its work: over time, which can be a week or more, as you get pod churn, all the new workloads get scheduled on the new pool, and at that point you can safely delete the old one. It's not ideal, but we found it's probably the safest approach we've come across.

Speaking of autoscaling: autoscaling LLMs is challenging. I worked on cluster autoscaling in the past, and there are a bunch of hacky assumptions about how GPUs work. For example, there used to be a hard-coded delay before the autoscaler issues a scale-up request for a GPU-requesting workload; that delay was hard-coded at 30 seconds. Thankfully there's a fix upstream now that lets you configure this value. The main motivation back then was to make sure you're not over-scaling, or over-spending, on GPUs, but in some cases, if you have capacity reservations or you're guaranteed GPU availability, you don't really need that artificial delay at all. The second thing is that backoff delays can lead to long scale-up times. When the autoscaler fails to scale up a node pool, it goes into backoff mode, and the backoff is exponential: it starts at five minutes and can go up to 30 minutes, but it only resets the exponential backoff duration after three hours. During that time you may actually get capacity because other workloads scale down, so you want to be able to configure those backoff parameters. We found it's best to tweak the autoscaler's backoff parameters, and the autoscaler parameters generally, to fit the nature of your workloads.

On the autoscaling point, we've had quite a journey deciding which metric to use to horizontally scale our LLM workloads. The first instinct is to look at latency, request latency. We tried that for a while, but the metric isn't ideal, because with LLMs there's really no way to determine how long the output is going to take: an LLM can just keep generating, it depends on the context length, on what the user specifies for max tokens, and so on, so it's not really a uniform thing to scale on. Obviously different models have different response times, and the metric also doesn't take GPU utilization into account at all.

The second metric was GPU utilization and duty cycle. The issue there is that different models behave differently: embedding models are FLOPs-bound and generative models are memory-bound, so a high or low duty cycle value doesn't give you a useful indication in the general case. We moved past that and looked at the inference server queue time: Triton exposes a bunch of metrics about its queue that we tried to scale on. The issue is that it's a local view; it only tells you that a particular instance of an inference server has a long queue, it doesn't give you a global view of the system. What worked best for us, because we have a batching component in play, was a more distributed, global view of the system: looking at the number of batches currently running as well as the number of batches being queued, and building a heuristic on top of that to scale on. Using that metric actually allowed us to improve GPU utilization; we were able to serve the same traffic using almost half the GPUs, and to tolerate spikes in traffic and latency a lot more gracefully.
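The talk doesn't spell out the heuristic itself, but the shape of the approach, exposing a cluster-wide view of queued and in-flight batches as custom metrics that an HPA (via something like prometheus-adapter) can scale on, might look roughly like this sketch; the metric names, model label, and batching service are hypothetical.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Gauges exported by the (hypothetical) batching/queuing service, which sees all
// traffic for a model and therefore has the global view that per-replica
// inference-server queue metrics lack.
var (
	runningBatches = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "batcher_running_batches",
		Help: "Batches currently executing on the inference fleet, per model.",
	}, []string{"model"})

	queuedBatches = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "batcher_queued_batches",
		Help: "Batches waiting to be dispatched, per model.",
	}, []string{"model"})
)

func main() {
	prometheus.MustRegister(runningBatches, queuedBatches)

	// The batching service would update these as batches are enqueued and dispatched,
	// e.g. queuedBatches.WithLabelValues("command-large").Set(float64(queueDepth)).
	queuedBatches.WithLabelValues("command-large").Set(12)
	runningBatches.WithLabelValues("command-large").Set(8)

	// An HPA can then target something like queued batches per replica through
	// the custom or external metrics API.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```

The point is that these gauges come from the batching layer, which sees all traffic for a model, rather than from any single inference-server replica.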
Oh, we're basically at the end here. So, in summary: we've got this broad system, and by the time we got to the end of figuring out how to take everything into a private deployment, we'd come across a few challenges and lessons. Some of these are just common things; anybody is going to have to deal with them when they're going cross-cloud. We obviously had some unique challenges around how to deliver large artifacts regularly, and that was a really fun engineering problem. And lastly, GPUs are not as well-tested or battle-worn as CPUs, but I think that will continue to get better as we go. So thank you, we appreciate you coming. I think we have five minutes, and we're happy to take any questions if we've got them. Yeah, feel free.

Nice talk. Regarding the challenge of provisioning GPU nodes, I was wondering if you folks have looked at the recently introduced dynamic resource allocation API, a generalization of the dynamic volume provisioning approach. Do you think that could be helpful?

I'm going to ask my engineering team, because they're laughing over there. Do you want to take that? Yeah, it's part of the potential solution, because we've started looking at it, but there are other things that are higher priority for us to look at right now. Okay, thank you.

For the OCI solution, do you have an open source implementation available that I could use with something like Hugging Face Accelerate?

That's not something that's open generally; it's more an optimization we've made as part of delivering our platform, but we wanted to share the thought process. And Marwan's got something to add: in general, ORAS makes it super easy for you to build your own clients on top. The reason ours isn't open source is that it's very specific to how we're packaging things, and there's a dependency between how you're exporting and then downloading. But the library APIs and the docs are really good; I'm happy to connect afterwards and give you some guidance. Thank you.

First of all, thank you for the talk; it was very well presented and very informative. Obviously you're deploying into customer clouds as well as on-prem.
So for the customer cloud scenario, auto-scaling performance seems to be pretty important, right? Because you don't want people paying for GPUs that are just sitting there downloading weights. Do you have numbers around that that you'd be able to share? Did using the OCI registry to download the weights improve things versus using something like blob storage?

Yeah, I think we got the numbers pretty close to our object storage in production. Obviously it depends where our artifact registry is; it's in a US region, so if your customer is downloading from, say, AWS, there are probably extra delays. But the numbers were pretty close to what we see just pulling from object storage in our SaaS platform; maybe you're paying a two-to-three-minute extra delay. I don't have the official numbers off the top of my head, but they're pretty close, and I think the gains made it worth it for us to incur that cost.

And is that after the node is already provisioned in the cluster, or does that include node auto-scaling as well?

It depends on which model. For certain models, like the large generative ones, we have an abstraction, the NFS cache, so you only get the penalty on the first download. For the smaller embedding models you download on each model service start. Sorry, I lost your question, say it again.

Yeah, I was wondering whether those numbers, the three to four minutes, include node acquisition time as well as the model weights.

Just model weights. And we download the model weights because Triton needs them on local disk before loading them into memory. Awesome, thank you so much. Thanks.

Hi, thanks for the great talk. I really liked the idea of putting the model weights in OCI, but it's not very clear to me how you actually download them. Is it something you have to code into your inference server, I mean, use a client library to download them? The question being: how do you actually do the download?

Yeah, the downloader. We have an init container. The init container is effectively just an ORAS client: you pass it inputs and outputs, and it calls the ORAS pull API, which does the pull from the OCI registry. We have that init container as part of the inference server pod; it runs first, and then the inference server starts up once the model weights are downloaded locally. Okay, thanks.

Hey, good talk. I was wondering, when you break the model into layers, do you do the layers literally as the model's layers, each model layer being one layer of the data, or are there better ways of doing that, particularly with fine-tuning?

It depends on the model itself; some of them are multiple layers or a single layer, that kind of thing, but with the larger ones we break it up, not literally every model layer, but roughly, yeah, we utilize layers. Is there any research into optimizations there? I would say we haven't done a lot on that, again because there are other larger-order problems to solve right now, and no numbers we can specifically share. Thanks.

Hi, I'm not sure if this is part of the challenges you faced, but can you comment on how you strike the balance between throughput and latency when you employ batching of the inferences? You mean at runtime?
Yeah, actually, we can chat with you offline. That's a little outside the scope of the challenges around delivering privately, but feel free to come up afterwards.

I have a question about the performance of the OCI and ORAS approach. It's a great idea; how does it compare between ORAS and other storage?

Yeah, we answered that earlier: the performance was very close, and even the little hit we got from switching from object storage to an OCI artifact in those scenarios was worth it, because it's vendor neutral and you can run it anywhere.

Sorry, the main benefit or advantage of OCI is what, exactly?

I think encryption and being vendor neutral were the big things for us, because we don't need to worry about anything specific to object storage, and it also enables air gap. The performance is very similar, and given the benefits we've gained, the little hit we take was worth it. Got you, thanks.

My question is specifically about the model caching for downloads. I'm curious about the NFS setup and any considerations or optimizations you had to work through to make it fast.

I'm actually going to take that and recommend the same thing; we've got the rest of our team over here and we would love to answer that, but it's a little bit outside of the private deployments themselves. Yeah, thanks.

Hey, thanks so much for sharing; everything you shared was really awesome. Curious about the GPU layer for the on-prem part: are you using something like Dell or HP servers with NVIDIA GPUs, or something like DGX, where something like Base Command might help you with some of that?

Are you asking about our SaaS system and what we use for our own servers, or the stuff on-prem?

You showed some of the examples where you were running things in either Amazon or Google, but part of it was going to be run on-prem. I'm just trying to probe some of the NVIDIA marketing around DGX, the claims that Base Command is going to solve the kinds of things you solved for, and to get a sense of what your experience with that was, or whether you were just using Dell or HP servers on-prem and Base Command wasn't relevant.

Yeah, I would say for a lot of the on-prem examples I can't share too much, just because it's proprietary. Sure, I get it. Yeah, it's customer specific, right? So when we're doing that, it's whatever that particular customer is bringing to us, and we're engaging with them to say: okay, how do we make sure the GPUs are registered appropriately with the cluster we're running our system on? Got it. Okay, thank you a lot. Okay, thank you. Awesome, thank you all, we appreciate it.