For those of you who don't know me, I'm Tushar Katarki, a product manager for OpenShift. I work closely with a number of partners and customers on AI/ML, I'll just say AI/ML, and OpenShift. One of the exciting things about this particular topic is that, so far, we've been talking about how you can use OpenShift as a platform to build your AI on. This is slightly different, in the sense that here we are using AI within Red Hat, working with partners, which I'll get to in a moment, to actually improve products and services using AIOps. We'll define what that is in just a moment, but that's the exciting thing. I've been at Red Hat for several years now, in OpenShift product management for three or four years, and I've done several of these sessions. It's always great to see you all here. Thanks particularly to our customers and partners for coming and talking; we always appreciate that. So let me introduce Sunny. Sunny, do you want to quickly introduce yourself?

I guess I introduced myself earlier, but I will repeat it for those of you who were sleeping. I'm a co-founder and president of the company ProphetStor. We develop AIOps solutions, focusing on OpenShift workloads. And thanks to Red Hat, we get introduced to a lot of different customers who are trying to understand their workloads and to optimize the cost and resources on the cloud supporting those OpenShift workloads. I'll give you a bit more detail after Tushar gives the introduction.

Just to tee it up as a teaser: earlier, the speakers were talking about how you set CPU and memory, how much an application needs and the limits for that. The feedback we hear from customers is that, although that's a good thing, it's actually hard for customers to guess those values. Not everybody knows what value to set for their particular app or pod for the CPU requests and limits. And the other thing is, even if they know, it changes over time. ProphetStor has a fantastic solution for this, which you'll hear about. That's kind of AIOps. So without further ado, let's get right into it.

So what is AIOps? I'll let you all read that for a moment, but the idea really here is: how do you take IT operations to the next level, so that you can replace a broad range of tasks which are manual, error-prone, time-consuming, and which give IT a bad rep today? How do you change that using AI and machine learning? So AIOps refers to platforms and software systems that combine data and AI functionality to enhance and replace a broad range of IT operations processes and tasks, things such as availability and performance monitoring, event correlation and analysis, and IT service management and automation. Why should we care about this? We intrinsically understand why this is important. We all want to know what's happening to our computers, and if there are alerts, if there are things that are going to fail, we want to know about that. And you can imagine that at a cluster or multi-cluster level, with hundreds of nodes and thousands of projects, this is the classical pets-versus-cattle analogy: you cannot look at each one individually as a pet, but rather look at them as cattle, which means you want to know what's happening in aggregate.
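To make that requests-and-limits point from a moment ago concrete, here is a minimal sketch, using the official Kubernetes Python client, of the static values an operator has to guess up front today. The container name, image, and numbers are hypothetical placeholders, not recommendations.

```python
# A minimal sketch of the guesswork being described: hard-coding CPU and
# memory requests/limits on a container, using the `kubernetes` client.
from kubernetes import client

container = client.V1Container(
    name="my-app",  # hypothetical container name
    image="quay.io/example/my-app:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        # The operator has to guess these numbers up front, and the "right"
        # values drift as the workload changes over time.
        requests={"cpu": "500m", "memory": "256Mi"},
        limits={"cpu": "1", "memory": "512Mi"},
    ),
)
```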
So, by 2022, according to this analyst, I think this is IDC, 50% of IT assets will have the ability to run autonomously using embedded AI, to leverage smart IT and facilities systems. So what is the evolution of AIOps capabilities? What are the steps, the phases? This slide gives you a nice picture on the left. It starts with monitoring: how do you observe that complex system? After that, what insights can you get from the data you have collected, and can you use machine learning to get those insights? And then finally, can you act upon it using some automation? If you look at the evolution on the right side, you'll see that data collection and visualization is the first step, no different from any other machine learning pipeline and workflow. Then you want to understand and discover the patterns in that AIOps workflow, and then you want to do some kind of prediction. For example, you might be running out of CPU at some point, so auto-scaling could be one application of prediction; or you could be predicting network attacks, if you are worried about security, which is another application. And then you do some kind of root cause analysis, you determine what the problem is, and then you do a remediation of that.

So, real quick, what have we done? It's been an evolution for us, from an OpenShift and Red Hat perspective, in how we are approaching this. I'll start with the top left: we first started with introducing and strengthening the observability piece. If you go back in OpenShift 3.x, as far back as 3.7, we introduced Prometheus in Tech Preview, and then in 3.11 we GA'ed it, with an operator for it. Prometheus and the corresponding alerting stack are the start of this observability story for us. Then, with the Istio service mesh, we have OpenTracing, and Metering has GA'ed in the most recent release, which allows you to generate reports based on CPU and memory usage, et cetera. Then we are doing this other thing, and I have a visual on this, called telemetry, in which we are collecting this Prometheus data and not confining it to a cluster, but aggregating it into our service so that we can do some data mining with it; I'll talk a little bit about that. And by the way, the platform that we use to collect that data is the Data Hub that Sherard was talking about in the morning. It collects all that data, and we do all the analytics on it so that we can improve our production services. When we open sourced that reference architecture, it became Open Data Hub.
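As a concrete flavor of that first data-collection step, here is a minimal sketch of pulling a CPU metric over Prometheus's standard HTTP query API. The endpoint URL is a hypothetical placeholder, and the exact metric and labels depend on what your cluster's monitoring stack exposes.

```python
# Minimal sketch: query a Prometheus server for per-pod CPU usage.
import requests

PROMETHEUS = "https://prometheus.example.com"  # hypothetical endpoint

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query",
    params={"query": "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)"},
    timeout=10,
)
resp.raise_for_status()

# Each result carries a label set and a [timestamp, value] pair.
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("pod", "<none>"), result["value"][1])
```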
And then finally we have Insights, which some of you might already know; Insights has been around for some time from Red Hat. Insights basically gives you some insights about the systems you have deployed, from a Red Hat point of view. I won't go into the details, but once we have the data, we collectively analyze it, we develop some insights, and then we bring them to the customer. So that's the first piece, the observability piece.

The second piece is really: once you have observed, what do you do next? If you get a Prometheus alert, for example, what do you do with it? An easy answer is that it can send you an email, or it can integrate with, for example, PagerDuty, so you get paged. Once you've got an alert, you can do automation with it, and you can run automation tasks using Ansible Tower; that's an integration we have. We have other things in our portfolio, such as Red Hat Decision Manager, which is a rules-based system for business process automation. Some customers, I know, are using these more advanced techniques to create rules on how to react to some of these conditions, and then automating the response using business process automation.

And then Connected Customer is really our overall umbrella program, in which customers with OpenShift 4 are sending us this telemetry data. We're analyzing it in real time, along with historical data, and we are proactively fixing bugs for them. In fact, I might have a slide on that, I'll get to it, but OpenShift 4 has been out in the market for about three to four months now, and I'd say about 20 to 25% of the bugs we have fixed are because of this data. These are not things a customer actively said needed to be fixed; just based on the telemetry data, we're able to determine that there are bugs in the system. So that's Connected Customer. We want to do more than that, obviously, and that's where some of the Open Data Hub work is really helping, which is why this is exciting for us.

Then finally, we have the automation piece: now you want to automate all this. We heard about operators and the Operator SDK; we talked about the immutable host, which is our operating system; we have improved the install experience with operators in OpenShift 4; and we have a whole bunch of other things we are doing in this space.

This is the example of what I was talking about: the dashboard we see at Red Hat, depending on the number of connected clusters. You can see that we get some information; there's a Grafana dashboard which shows the number of connected customer clusters, and whether there are any errors in the system. Double-clicking on some of the AIOps examples we are doing: things such as log anomaly detection, where we analyze the logs we collect, determine if there are any outliers, and use that to improve the system, the product itself. The second one is the cluster rollout monitor. One of the important features of OpenShift 4 is over-the-air updates, so we are able to push updates to the connected clusters.
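To illustrate that alert-to-automation hop, here is a minimal sketch of a webhook receiver that takes a Prometheus Alertmanager notification and launches an Ansible Tower job template in response. The Tower URL, job template ID, and token are hypothetical placeholders, and passing extra_vars at launch assumes the template is configured to prompt for them.

```python
# Minimal sketch: Alertmanager webhook -> Ansible Tower job launch.
import requests
from flask import Flask, request

app = Flask(__name__)

TOWER_URL = "https://tower.example.com"  # hypothetical Tower endpoint
JOB_TEMPLATE_ID = 42                     # hypothetical remediation playbook
TOWER_TOKEN = "REDACTED"                 # supply a real OAuth2 token

@app.route("/alertmanager", methods=["POST"])
def handle_alerts():
    payload = request.get_json(force=True)
    # Alertmanager posts a JSON body containing an "alerts" list.
    for alert in payload.get("alerts", []):
        if alert.get("status") == "firing":
            # Launch the remediation job via Tower's REST API.
            requests.post(
                f"{TOWER_URL}/api/v2/job_templates/{JOB_TEMPLATE_ID}/launch/",
                headers={"Authorization": f"Bearer {TOWER_TOKEN}"},
                json={"extra_vars": {"alertname": alert["labels"].get("alertname")}},
                timeout=10,
            )
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```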
Now, what we determined is that if we monitor that upgrade process, and there are any anomalies or problems with it, we are able to detect that and take corrective action. Similarly, we are doing anomaly detection on metrics, which is the Prometheus anomaly detection work, and we are doing workload prediction and resource optimization, which is what Sunny is going to talk about with their technology.

This is our big-picture vision. I won't go into a whole lot of detail, but you can guess what we are doing here: we're collecting the data at the cluster level, aggregating it, and then taking action with AI and ML. The other important word here is ISV, and that's why I'm here with Sunny: obviously, a lot of these things we are not able to do on our own, but with OperatorHub we are encouraging our partners to take advantage of the open ecosystem to bring all these tools and technologies to you. So that's the introduction to what we are doing. I'll hand it over to Sunny to double-click on exactly what's happening with ProphetStor's Federator.ai. Thank you.

Thank you. All right, so in the remaining 10 or 15 minutes, I'm going to give a bit more detail on how our solution works on OpenShift, and the value proposition. I want to show this slide because it's a survey by RightScale, now part of Flexera. They surveyed about 800 enterprises earlier this year; about half of them, around 400, have more than 1,000 employees. And this is the conclusion: cloud cost optimization is the number one priority for the majority of these enterprises. The other conclusion is that it doesn't matter how long you have used public cloud or private cloud; cost continues to be the number one priority. Our solution tries to address this very important issue. And what we learned from Red Hat customers, OpenShift customers, and other partners is that the CIO often gets a shock at the bill when the developers are just using the public cloud freely. Our solution is trying to address that, so let me tell you a bit about how it works.

Specifically, the pain point we address is that if you deploy your applications on the cloud, most users or developers will not know exactly what cloud resources are needed to support those applications; right now it's all guesswork. Adding to this, your application workload is quite dynamic. Containers are very dynamic and sometimes very short-lived, but you deploy many of them on a weekly or daily basis. These cloud resources include CPU, memory, and, if you're running AI/ML workloads, GPU resources, which are very expensive, and you get charged on an hourly basis. And if you are in a major enterprise, you might have many different divisions and projects, each asking the CIO's office for cloud resources. Again, they don't know what kind of resources they will need to support the application, so it's all best guess, and it usually takes a long time. At the last Red Hat Summit, one of the major OpenShift customers from Europe, in the automotive industry, said that in certain cases, some of their divisions use only 10% of what they get allocated.
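As a toy illustration of the metric anomaly detection idea mentioned above (this is not the actual method behind the Prometheus anomaly detection work, just the general shape of it), here is a rolling z-score detector:

```python
# Toy illustration of metric anomaly detection via a rolling z-score.
import numpy as np

def rolling_zscore_anomalies(values, window=60, threshold=3.0):
    """Flag points more than `threshold` standard deviations away
    from the mean of the preceding `window` samples."""
    values = np.asarray(values, dtype=float)
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = history.mean(), history.std()
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Example: a flat-ish CPU series with one injected spike at index 80.
series = np.random.normal(0.5, 0.02, 100)
series[80] = 2.0
print(rolling_zscore_anomalies(series))  # -> [80]
```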
So there's a lot of wastage. This slide shows a very high-level overview of how our solution, Federator.ai, works. It works both on premises and on the public cloud, as long as it's running OpenShift. All the metrics, for example CPU, memory, or GPU utilization, get collected in Prometheus, and we use those to analyze: we run our machine learning on those metrics, the historical workload and the ongoing workload. It also allows the user to give inputs on their SLA spec, or what kind of margin they want to put on the resources to prepare for unexpected workload. The output of our solution is a list of resource recommendations, on a per-application or per-namespace basis. And if the users allow us, we can also automate and execute those recommendations through auto-scaling for them.

The basis of it is using machine learning to do prediction on the workload. From the historical workload we get insights, and our output gives foresight into what the workload is going to be on different timescales. Using this prediction, this understanding of the workload, we then orchestrate and optimize the resources. In some of the benchmarks we have done for our customers, we can show that, compared with native Kubernetes using no mechanism at all, savings can go up to 70% of the cost, so the ROI is quite significant. And as I said earlier, if the customers allow us, we can execute and dynamically auto-scale the cluster for them. Just to summarize, the reason we can do this is that we do it dynamically and continuously, on all the workloads and resources, on different timescales: the next hour, the next 24 hours, the next seven days, or the next month. We also do auto-scaling with AI. As you know, native Kubernetes already has horizontal pod autoscaling (HPA) and vertical pod autoscaling (VPA), but they are done in a very primitive way. We have our own mechanism, using machine learning on the workload, that can do a much better job, which I'll show you a little later. Then, with all this understanding of the workload on a per-application or per-cluster basis, we can determine the best-cost option from, let's say, Amazon, Microsoft, and Google.

Here I'll just show you a screenshot of the actual solution. The upper part is the CPU prediction. The blue curve represents the actual customer workload, and the dotted line represents our prediction. As you can see, they don't exactly overlap, because we're just doing prediction, but it follows the general pattern quite well. The part in the red box is our prediction, and the green line represents our resource recommendation. Customers are actually not that interested in the particular curve of the actual workload; they're interested in what resources, in this case CPU on the upper graph, are needed to support the application workload. The lower graph represents the memory workload. We give a margin, because we want to make sure the application never runs out of memory, so we add a margin of 15 to 20%, and this can be configured by the user.

And we are now a Level 5 certified operator. Thanks so much for the Red Hat support; we got through the certification process in a very short time.
In terms of time, I believe my team told me it took less than a week, a couple of days, just for the certification. I think they all heard that. And it's Level 5, fully certified. This helps a lot when we sell to our customers: once we tell customers it's an operator, they understand it's very simple to install and very simple to upgrade.

Just to summarize: we give policy-based, optimized resource recommendations for any workload. The resources include CPU, memory, and now GPU, which is very important. And then we tell users what the best cost is across the three big providers, because the good thing about Microsoft, Amazon, and Google is that they publish all their prices via APIs. Very large customers, of course, negotiate special rates, but the other 95% of customers will pay the standardized published prices. And as I said earlier, we provide much better workload management than the native Kubernetes mechanisms.

This is another screenshot. Once we learn about the workload, the result of course varies by workload; this is a very small cluster we deployed, with actual customer workload. The customer might just pick the very top Amazon option, which costs them $790 for this particular small workload. We go through the historical workload and give a recommendation, and we find that you can support it with much cheaper instances on Amazon. The highest here is actually Microsoft, and Google is in the middle. And this is not a general conclusion about these three cloud providers; it's just that in this particular example, Amazon turns out to be the cheapest.

I also want to show you some use cases. The first is a leading market research firm based in Boston. They are migrating a number of on-premises, VMware-based workloads to OpenShift on AWS, and they had no idea how to size the AWS cluster to support the containerized workloads. Running our tools, they can optimize the cost of every one of the applications. And it turns out that for certain applications, like NGINX or SQL workloads, we can do a much better job than the native Kubernetes HPA auto-scaling mechanism: in certain benchmarks we can improve latency by up to 70%, which is very significant.

This is a different use case, a GPU cluster at a pretty sizable cloud provider: a government-funded high-performance computing center that makes GPU resources available to enterprises, universities, and research labs in Taiwan. They have over 2,000 GPUs, and I have to say they spent close to $20 million buying those GPU systems from NVIDIA, plus over 9,000 CPUs and over 10 petabytes of storage. In the past, they allocated the GPU resources statically to the users, because the GPU users do not know how much they need for their applications. It turned out they had already allocated over 90% of the GPU resources to these users, even though most of them were only using 3% of them. Very inefficient. But now, using our tools to analyze all the workloads, they can raise utilization from 3% to 80%. That's a very significant increase, and the return on investment in using our tools is almost 10 times.
And in addition to providing this GPU workload visibility and resource prediction, we also give them performance anomaly detection. That's basically what I have to say. If you want to learn more about the solution, go to our website or just send me an email. Again, I want to thank Red Hat, Diane in particular, for inviting me to give this talk, and thanks to Shah. Thank you.

Thank you. And Sunny is from Hong Kong, so he has also traveled a great distance. We've had people come from Australia, from Hong Kong; it's just an amazing group of people who've traveled very far. So Sunny, thank you very much for coming, and thank you for being here. I'm going to bring our next group of panelists up and stick around for the...