Thank you. Thank you. Good afternoon. Hopefully this is coming through loud and clear. This is a very exciting topic for me, so hopefully you'll find it exciting too. As a show of hands, how many people are using AI on OpenShift and Kubernetes, or intend to in the next few months? OK, cool, that's a good set of people. And how many people are getting into inferencing and those more advanced topics on OpenShift? A little bit. Well, Tripti is going to talk about that, and she's going to explain to us what inferencing is.

By way of introduction, I'm Tushar Katarki. I'm a product manager on OpenShift, involved in a bunch of AI initiatives, and we'll talk about that a little later towards the end. Tripti Singhal is a product manager at NVIDIA, where she focuses on deep learning and inferencing software, and she has a very exciting lineup for us. So without further ado, I'll hand over the mic to Tripti. All right? Do you have that? Check if this works.

All right, thanks so much. OK, so here's a quick agenda. I'll start with a brief overview of what deep learning is, with a focus on the inference side, and then I'll jump right into the NVIDIA TensorRT Inference Server, which was announced in September, so it's fairly new. I'll go into the features, the internal architecture, and where it fits into the larger inference ecosystem. I have one quick performance slide, then I'll jump into a demo, and then I'll pass it back to Tushar to talk about OpenShift and Kubernetes.

Deep learning, at a high level, is the idea of using large amounts of data to train neural networks, teaching them how to make human-like decisions. It's typically broken down into training and inference, and inference is what I'll be focusing on mostly today. Training is using large amounts of data to teach these neural networks how to make those decisions. Inference is taking that trained model, which has been iterated over with that data several times, deploying it into the real world, and giving it new data to make new decisions and new predictions. So that's deep learning at a high level. Like I said, I'll be focusing more on the inference side and the TensorRT Inference Server.

Focusing on inference and why GPUs are necessary, there's an idea called PLASTER, which stands for programmability, latency, accuracy, size of the network, throughput, energy efficiency, and rate of learning. As you can see, latency and speed are not the only factors here; all of these factors contribute to a successful inference deployment, and GPUs deliver on all of them.

The main problem most people run into when deploying an inference workflow is inefficiency. That includes not being able to run several models at the same time on one GPU. In the example on the far left, users may want to run several different types of deep learning models, whether it's a speech recognition model, a language processing model, or a deep recommender model. If each GPU is dedicated to one model and one of them spikes, as the speech model does in this example, the other GPUs are left underutilized, and that's pretty inefficient. Also, solutions today typically only offer support for one framework, and that really restricts the internal teams developing these AI models to that one framework.
Maybe some teams work in PyTorch and some work in TensorFlow, and there isn't one solution that serves them all. That's the issue today. And then when it comes to custom development, like I said, there are several teams developing different solutions for different tasks, whether it's visual search, recommendations, and so on. If each team builds out its own custom pipeline, that's not very efficient, because they're all really doing the same underlying task, which is inference. Having one solution that handles it all makes managing those pipelines much easier.

So the NVIDIA TensorRT Inference Server, like I said, was announced in September and was also open sourced as of about three weeks ago. This is a high-level overview of where it fits into the larger ecosystem. On the left, you'll see clients sending requests to some sort of cloud application running in the data center. From there, those requests are sent to a load balancer, which directs traffic to the appropriate instance of the TensorRT Inference Server; in this case, there are three instances running. All of the underlying hardware is visible to the inference server, including heterogeneous GPUs, which is why I've listed T4, V100, and P4 there. Like I said, it integrates with orchestration systems such as Kubernetes, and it's open source and available on GitHub.

Here are some of the current features the TensorRT Inference Server has to offer. To single out a few, we can separate them into performance features and usability features. On the performance side, there's a feature called concurrent model execution, which allows you to run multiple models, or multiple instances of the same model, on one GPU at the same time. That's how you really maximize the utilization of your GPU and get the most capacity out of it. And with dynamic batching, you're able to batch up inference requests inside the inference server, based on a user-defined latency SLA, rather than having to build that logic outside of the inference server. On the usability side, the TensorRT Inference Server exposes metrics, such as GPU utilization, request count, and latency, to enable autoscaling, and we support multiple model frameworks: TensorFlow, NVIDIA's TensorRT, and Caffe2 through the ONNX import path.

This slide is a deeper dive into the inference server's internal architecture. The green box on the left represents the inference server itself. Client requests come in at the top through the HTTP or gRPC endpoints, and there's also a Python and C++ client library that enables that interaction between client and server. From there, requests are sent through to request handling; in this case, it's a simple image classification example, identifying between dog and cat. Then they go into the per-model scheduling queues. Based on the model named in the request, the request will go to the scheduler queue for the particular model needed to execute that inference. From there, it's sent to the framework backend that actually does the inference compute; if that image classification model was written in TensorFlow, it will go to the TensorFlow backend. The result is then sent back up through response handling to the client. One thing to note here is that, at the bottom, the metrics are exposed through a separate HTTP endpoint, and the models served by the inference server come from a model repository, which can live on a persistent volume.
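To make those features concrete, here is a minimal sketch of what a per-model configuration file in that repository might look like. The model name, tensor names, and dimensions are hypothetical placeholders; the instance_group block is what enables concurrent execution of two copies of the model on one GPU, and dynamic_batching tells the server to coalesce requests under a queue-delay budget.

    # models/flowers_tf/config.pbtxt (hypothetical model repository layout)
    name: "flowers_tf"
    platform: "tensorflow_graphdef"
    max_batch_size: 32
    input [
      { name: "input", data_type: TYPE_FP32, dims: [ 224, 224, 3 ] }
    ]
    output [
      { name: "probabilities", data_type: TYPE_FP32, dims: [ 1000 ] }
    ]
    # Concurrent model execution: run two instances of this model per GPU.
    instance_group [
      { count: 2, kind: KIND_GPU }
    ]
    # Dynamic batching: prefer batches of 8 or 16, waiting at most 100 microseconds.
    dynamic_batching {
      preferred_batch_size: [ 8, 16 ]
      max_queue_delay_microseconds: 100
    }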
OK, so what we just saw was a zoomed-in view of what's inside the inference server. Now let's take a step back, zoom out, and look at where it fits into the larger inference ecosystem. You'll notice the inference server is at the far right; I can move my mouse here; that's the zoomed-in portion I just showed you. Starting on the left-hand side, as an example workflow, say a user uploads an image. It's sent to some sort of application through a client API, and some pre- and post-processing is done on it. This could be image cropping or scaling, or maybe the model requires a mask over it; that's the pre-processing that happens. After that, the pre-processed data goes through a load balancer, which directs traffic to the proper instance of the inference server, and, like I mentioned, the model repository is mounted into that inference server as well. So now the inference server is running that model, whether it's image classification or whatever it may be, and it's sending responses back as requests come in. At the same time, metrics are exposed to enable autoscaling, to be able to spin up new instances: when metrics such as queue time and GPU utilization go up, it's a good indicator that it's time to spin up a new instance.

You'll also notice a dotted line around that portion. That shows our collaboration with Kubeflow to support the TensorRT Inference Server; there's a detailed blog describing that collaboration, and all the code is available on GitHub.

Okay, the chart on the right shows the performance gains you get when using the TensorRT Inference Server across three separate deployments of ResNet-50, which is an image classification model, the typical one used for performance benchmarking: TensorFlow FP32 on CPU, TensorFlow FP32 on GPU, and, where you'll obviously see the most performance gain, the NVIDIA TensorRT version of ResNet-50. The key takeaways here are that the inference server supports CPU and GPU, as well as native TensorFlow models and the NVIDIA TensorRT plans that I mentioned, and this is all under a 50-millisecond latency SLA across all deployments.
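Before the demo, to tie the client side together: below is a hedged sketch of what sending a request to the server can look like from Python. Note that the server has since been renamed Triton, and this sketch uses the current tritonclient HTTP API rather than the client library as it existed at the time of this talk; the model and tensor names match the hypothetical configuration above. Metrics, for comparison, are served in Prometheus format on a separate port.

    import numpy as np
    # Client library for the inference server (since renamed Triton);
    # the HTTP endpoint listens on port 8000 by default.
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # A dummy pre-processed image batch; a real client would send
    # the cropped/scaled image data produced by the pre-processing step.
    image = np.random.rand(1, 224, 224, 3).astype(np.float32)

    inputs = [httpclient.InferInput("input", list(image.shape), "FP32")]
    inputs[0].set_data_from_numpy(image)
    outputs = [httpclient.InferRequestedOutput("probabilities")]

    # The server queues, dynamically batches, and schedules this request
    # onto one of the model instances configured above.
    result = client.infer(model_name="flowers_tf", inputs=inputs, outputs=outputs)
    print("predicted class:", result.as_numpy("probabilities").argmax())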
Okay, so now I'll jump into a demo. This is actually a video of our TensorRT Inference Server flowers demo, and there's a live version of this running at our booth, so feel free to check it out. Let me describe what you're seeing here. At the top is our flowers client: all those little images are images of flowers being classified, whether as a daisy or a rose and so on, and the flashing bar moving down indicates that classification happening. In the bottom left corner we've put images per second: the demand and what's being delivered. At this point, the demand is 800 and we're meeting it at 800.

At the bottom, we have two dashboards, which give us some insight into what's happening in the data center. On the left is a cluster with a manual deployment of these models, without the TensorRT Inference Server; on the right is a cluster with the TensorRT Inference Server enabled. At the bottom of each, you'll notice it's broken down into per-GPU utilization, and the percentage of each model loaded onto a GPU is indicated by color. Focusing on the left dashboard, each GPU is dedicated to a single model: there's a blue model, a green model, an orange model, and a yellow model. The blue model in this case is the flowers model you see running right now. The dial over here shows average GPU utilization, so it's 20% right now.

Let me skip forward a little bit. What just happened was the demand increased: we raised it to 5,000 images per second, and keep in mind that we're still directing traffic to the manual cluster without the TensorRT Inference Server. You'll notice that the two GPUs running this flowers model are completely maxed out, while the GPUs running the other models, whether it's a deep recommender or anything like that, remain underutilized. You can see the spike happen on the chart there, and the average GPU utilization is around 38%. A typical solution here would be to increase the hardware and just add more GPUs to support this flowers model, but you're still left with underutilized hardware in your data center, which is really inefficient. You'll also notice, in the images per second in the bottom left corner, that we're not meeting the 5,000 images per second demand; we're only getting around 4,800. That's not ideal in a production workflow.

Soon what will happen is we'll stop the traffic going to the cluster without the TensorRT Inference Server and move all that traffic to the enabled one; you'll see it drop on the left and peak on the right. There it goes. At the bottom, all eight GPUs have all four models loaded onto them. When this peak happens, you'll notice that beforehand the GPU utilization was around 17%, I think, and now it's up to 39 or 40%, similar to the manual deployment. But in this case, for the same average GPU utilization, all of your hardware is being utilized, and you can see that we're easily meeting the demand of 5,000 images per second, with plenty of capacity to spin up a new workload or even increase the demand.

Another thing we show, to make it more realistic, is that with the TensorRT Inference Server enabled and your models evenly distributed across your hardware, you can increase and decrease demand like a typical spiky, bursty workload actually would, because any model can spike at any time and your hardware really has to accommodate that. Let me skip ahead a little. Here we've increased the demand to 15,000 images per second, and we'll go back down to 5,000 and back up, to simulate a spiky workload. You'll see the GPU utilization go up and down, with all eight GPUs being utilized, and none of them really maxed out yet. The fact that the models are distributed across all eight GPUs leaves more capacity to handle that spiky workload.
And in a little bit, you'll see that with this configuration we're actually able to get to 18,000 images per second. If you notice the gray box there: 18,000 images per second, and we've completely maxed out our GPU utilization, using the TensorRT Inference Server. There you go, that's 95%, all GPUs are being used, and we're pretty much meeting the 18,000 images per second. There it is. Okay, so I'll go ahead and hand it back to Tushar, who will talk about OpenShift and Kubernetes and the road ahead.

Wow, that's cool. So with that, let me now jump back and see what we can do with OpenShift here. Just a moment. There you go. So what can we do with this beautiful demo and the TensorRT Inference Server, and what's the road ahead for deep learning on OpenShift? That's what I'm going to describe next.

We'll start with that basic TensorRT Inference Server that we saw the demo for, that Tripti described a little earlier. That's actually a bunch of containers, and containers need a container platform, which we've talked about all day today. So we have the OpenShift Container Platform, which is basically Kubernetes. It can run across the data center and cloud, and, by the way, it supports NVIDIA GPUs, so you can run this either in the data center, in the cloud, or a combination of the two.

Okay, but now you have the TensorRT Inference Server and you actually need to use it. As Tripti described, maybe it's part of a recommendation system, or you have a chatbot that uses natural language processing. So you have an app that you want to create and deploy in production: a cloud-native intelligent app running against the TensorRT Inference Server. And, by the way, it needs things such as load balancing, routing, and encryption; for that we have the OpenShift Service Mesh, which was talked about earlier. So this is your production setup: a cloud-native intelligent app that uses the TensorRT Inference Server and the underlying infrastructure provided by OpenShift, deployed and run in production.

Okay, but where did these models come from? Somebody has to actually create them, and you can use OpenShift as a platform to train your models. Your data scientists can bring their own frameworks, or they can use NVIDIA's pre-built framework containers, for TensorFlow and others, from NVIDIA NGC (NVIDIA GPU Cloud), running on top of OpenShift. So your data scientists have access to the best of the best frameworks to create the models that feed into this TensorRT Inference Server.

Okay, so what's next? As you saw earlier, you probably need to do some pre-processing and post-processing of the data, and you need to set up the data pipelines. Your data scientists might want to visualize the incoming data using Jupyter, something they might be used to; they may want to use Kafka as a message bus and Spark for real-time processing. All of those patterns are available on OpenShift. I won't go into the details, but I'll describe some references at the end, and all these different, quote unquote, frameworks can run on top of OpenShift.
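Pulling the infrastructure pieces together, here is a hedged sketch of what a basic deployment of the inference server on OpenShift/Kubernetes might look like. The image tag, claim name, and model path are placeholders; the GPU request relies on the device plugin support mentioned below, and the three ports match the HTTP, gRPC, and metrics endpoints described earlier.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tensorrt-inference-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: trtis
      template:
        metadata:
          labels:
            app: trtis
        spec:
          containers:
          - name: trtis
            # Server image published on NVIDIA NGC; the tag is a placeholder.
            image: nvcr.io/nvidia/tensorrtserver:18.12-py3
            # Launch flags as of the releases around this talk; the server
            # has since been renamed, so check the current docs.
            command: ["trtserver", "--model-store=/models"]
            resources:
              limits:
                nvidia.com/gpu: 1   # scheduled via the Kubernetes device plugin
            ports:
            - containerPort: 8000   # HTTP endpoint
            - containerPort: 8001   # gRPC endpoint
            - containerPort: 8002   # metrics endpoint
            volumeMounts:
            - name: model-repository
              mountPath: /models
          volumes:
          - name: model-repository
            persistentVolumeClaim:
              claimName: model-repository-pvc   # hypothetical PVC holding the models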
So now what do you have? You have your inference server, you have your models, and you have your data pipelines set up on OpenShift. Now you also want to actually write your cloud-native application, and you want your developers to do that. OpenShift is obviously not just a container platform but also a DevOps platform, and that's where you can use existing patterns, such as the Jenkins that we already ship, or, in the future, Knative, with things such as build, eventing, and serverless on top of OpenShift. So that gives you an idea of how OpenShift can be used in this entire ecosystem, end to end.

Some of the things that have happened so far: device manager support, which enables GPUs, has been available for about a couple of releases already. We have also introduced other features with the community, such as priority and preemption, which will become essential if you're trying to do things such as model training using jobs, for example; a sketch of that pattern follows below.
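To illustrate, here is a hedged sketch of how a preemptible training job might use those features. The PriorityClass name, image tag, and training script are hypothetical; the idea is that a low-priority training job can be preempted when higher-priority inference workloads need the GPUs.

    # Hypothetical low-priority class for preemptible training work.
    apiVersion: scheduling.k8s.io/v1beta1   # scheduling.k8s.io/v1 on newer clusters
    kind: PriorityClass
    metadata:
      name: training-low-priority
    value: 1000
    globalDefault: false
    description: "Preemptible batch training jobs"
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: model-training   # hypothetical training job
    spec:
      template:
        spec:
          priorityClassName: training-low-priority
          restartPolicy: OnFailure
          containers:
          - name: trainer
            image: nvcr.io/nvidia/tensorflow:18.12-py3   # NGC framework container; tag is a placeholder
            command: ["python", "train.py"]              # hypothetical training script
            resources:
              limits:
                nvidia.com/gpu: 1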
Recently, back in October, we announced jointly with NVIDIA that Red Hat Enterprise Linux is now certified on DGX-1, which is the GPU appliance from NVIDIA, and on Tesla GPUs, and that support is available now on OpenShift and RHEL. And here we're showing you a preview of the TensorRT Inference Server on OpenShift. We're planning to write a reference architecture, and obviously the road ahead is really to make this a much easier deployment and install experience with operators. We have other exciting stuff that we're going to talk about at KubeCon over the next couple of days with the community, such as GPU sharing and heterogeneous clusters. This picture on the left looks nice, but some of these things, like the heterogeneous cluster, are still work that needs to be done in the upstream community. GPU topologies are another thing we are working on.

If you step back now, how does Red Hat see AI? There are four pillars here, as you can see. Running AI as a workload on top of OpenShift and our ecosystem is obviously the first pillar, and you saw a little bit of that today. On the right-most side is how you build intelligent applications, which I touched on a little as well: if you're going to build an intelligent application that uses AI, how do you build it? Then there are the two in the middle: how do we continue to enhance our core business using open source tools and AI, and how do we enhance our products themselves using AI? That's where Chris Wright was pointing earlier to automation and self-driving products and self-driving components. Oh, and I missed the most important piece: data as the foundation. You'll hear some of this throughout the discussions over the next few days.

So here are some references and KubeCon highlights to round this up. NVIDIA is going to present scaling AI inference workloads, what you saw today, at a much larger breakout session; that's tomorrow, Tuesday, December 11th, at 1:45 p.m., I believe. Then on Thursday, some of the Red Hat engineers are going to talk about that last box I was showing earlier: how to build deep learning through the Knative serverless framework.

Then we have several other blogs here. The first one talks about how to use the TensorRT Inference Server. The second one is about Kubeflow, the collaboration Tripti mentioned, and how to use Kubeflow to deploy the TensorRT Inference Server on Kubernetes. And the last one is on how to enable GPUs on OpenShift. So those are some of the resources you have. There's also a webinar coming up that talks about how to maximize GPU utilization. We have booth setups at both the NVIDIA and Red Hat booths, so come check them out, and as Tripti mentioned, there's a live demo running there.

I also want to mention that the TensorRT Inference Server is open source and available on GitHub; you'll find a link there. Then there's the OpenShift Commons machine learning SIG, where we as a community, from OpenShift, NVIDIA, and others, get together and talk about machine learning. And there's Open Data Hub, a Red Hat CTO office initiative that is trying to build the data hub I described earlier, and the four pillars, in an open way. So there are plenty of resources and lots of excitement, and I hope you can participate in that. And with that, thank you. Thank you, Tripti. Thank you, NVIDIA. And then back to Dan. One of these will work. All right, maybe both of them.