I'm going to talk about what lies beyond AIOps. I've been looking into AIOps for a while — now as a manager, having started off as a senior and then principal software engineer in the AI Center of Excellence in the Office of the CTO — working on all things AI and operations and how that evolved over time. It'll be a bit of a journey, because my perspective changed along the way.

Let's start with the ground definition from Gartner: AIOps platforms are software systems that combine big data and AI or machine learning functionality to enhance and partially replace a broad range of IT operations processes and tasks, including availability and performance monitoring, event correlation and analysis, IT service management, and automation. That's a whole lot of words. Some people might read it as "AI is going to replace IT operations" — maybe that's a spin some people out there are taking. But if you take a closer look at those AIOps features, it boils down to pretty "simple", in air quotes, stuff: finding a baseline for your metrics, simulating the future, finding correlations with your incidents, doing anomaly detection — which basically means predicting the future and flagging an anomaly when reality deviates from the prediction (a minimal sketch of that idea follows below). And hopefully it helps you find the root cause of your problems. But it's not as if an AI is running your system or doing the job for you.

It's not even the case that AIOps is a product in a box. I think it's more of a marketing term nowadays — like when "cloud" hit the market and everybody was doing cloud, and you'd talk to different people and everybody had a different opinion of what cloud actually meant. These days you see a lot of products, old and new, in the monitoring and tracing space slapping "AIOps" on the label, and people think: I'll buy this product and boom, I'm cloud native, I'm AIOps, and I can send my ops people off to more interesting work because a product now does all the tedious stuff. In reality it's more like running small experiments across your ops people and then developing the capabilities that leverage AI to bring you to the next level.

I see it more as a cultural shift, similar to what we saw with dev and ops in the DevOps movement: developers using the tools from the ops side of the house. With cloud native, a developer can deploy a whole complex scenario that used to be available only to ops people, because you'd need to spin up so many servers; nowadays you do that with one command copy-pasted from the internet, and boom, you have a multi-node deployment. And in the other direction, ops people use the tooling from the development folks, becoming YAML engineers and codifying their operational landscape. Using the tools from those two domains together brought us DevOps — and SRE folks are an implementation of DevOps. I think it's similar with AIOps: ops people using tooling from the AI and data science side. That might start with just using EDA tools and data analysis — Jupyter notebooks and the like — to detect the signals in the data. So if you want to increase efficiency with AI and walk the path toward the self-driving cluster at the very right end of the spectrum, that's where the exponential scaling is.
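To make the baseline-and-deviation idea concrete, here is a minimal sketch, assuming metric samples land in a pandas Series; a rolling mean and standard deviation stand in for a real forecasting model, and the metric name, window size, and threshold are illustrative assumptions:

```python
# Baseline anomaly detection, sketched: "predict" the future from the
# recent past and flag samples that deviate too far from that baseline.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
cpu = pd.Series(rng.normal(50, 5, 1000))  # pretend: CPU% samples, one per minute
cpu.iloc[700] = 95                        # inject one anomaly

baseline = cpu.rolling(window=60).mean()  # learned "normal" behavior
spread = cpu.rolling(window=60).std()
anomalies = cpu[(cpu - baseline).abs() > 3 * spread]
print(anomalies)  # -> the injected sample at index 700
```

A real AIOps pipeline would swap the rolling statistics for a proper forecasting model, but the decision logic — deviate from the predicted baseline, raise an anomaly — is exactly the "simple stuff" described above.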
You start out with assisted AI, where the AI helps you discover things: it tells you, look, here's an anomaly, here's a correlation between some things — but in the end, you make the decisions. I think that's where we are today. Wrapping that in a box gets you to augmented AI: smaller parts of my operational domain are taken over by autonomous AI agents doing rollbacks or merging patches or something like that (see the toy sketch below). And if you take it even further, with an autonomous AI running your deployments, then we're at truly massive scale, where a small number of people manage large amounts of data center. That's where we want to get to, right? Analysts are talking about a 100x cost reduction for operating infrastructure if we push operational costs down that far.

The idea here is building out competence and encapsulating it in something you create inside your ops team, gathering observations, and feeding those back into building the competence and building out the tooling. Every ops team has to go through this loop — not a vicious circle but a virtuous one, a continuous improvement cycle. And that's kind of sad, right? Because in the end, with every environment, with every deployment, with every customer, you take your data and train the model in your AIOps product all over again. You don't get a pre-trained thing from a vendor; the only thing you get is the tooling to train the models. So you do it again in the second environment, and as a different customer you also start over from scratch.

Now, how can we change that? For this, we have to go back in time a little. Before open source, there was code. Code was the secret: we compiled it into binaries and made money from those binaries. The code was more valuable than the operation of the code; operating it at scale was left to the folks in the basement, more or less an afterthought. Then open source happened. Almost 20 years ago, this red-hatted company created the RHEL operating system: it took open source code and turned it into a product. The value moved from the code to the product, but ops was still an afterthought — ship the code into the product, ship the product to the customer, and let the customer figure out how to operate it. Then around 2006, everything was growing like crazy, and things like the cloud came along. And when scale is everything, the folks in the basement are suddenly valuable, because scaling your business in the age of digital transformation means scaling your operations. That's when SRE folks become more and more valuable. I'd even go so far as to say ops is maybe even more valuable these days, because open source has become the de facto standard: if you have the manpower and the skills, you can run a whole production workload on an open source code base alone, and what you differentiate on is essentially your ops capabilities. So if the value in IT is in ops, and ops is proprietary, then open source has a problem. That's what Matt Asay — a cloud and open source executive at AWS — has been asking: what happens if you open source everything?
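As a rough sketch of what that "augmented" step could look like — explicitly not a real product — here is a toy control loop using the official kubernetes Python client. The anomaly-score source, deployment name, namespace, known-good image, and threshold are all made-up placeholders:

```python
# Toy "augmented AI" agent: watch an anomaly score and roll a Deployment
# back to a known-good image when the score crosses a threshold.
import time

from kubernetes import client, config


def fetch_anomaly_score() -> float:
    """Placeholder: a real agent would query the anomaly-detection model."""
    return 0.2  # pretend everything is healthy


def rollback(apps: client.AppsV1Api) -> None:
    # Patch the container back to the image we trust (hypothetical names).
    patch = {"spec": {"template": {"spec": {"containers": [
        {"name": "my-app", "image": "quay.io/example/my-app:known-good"}]}}}}
    apps.patch_namespaced_deployment(name="my-app", namespace="prod", body=patch)


if __name__ == "__main__":
    config.load_kube_config()  # load_incluster_config() when running in-cluster
    apps = client.AppsV1Api()
    while True:
        if fetch_anomaly_score() > 0.9:  # threshold is an assumption
            rollback(apps)
        time.sleep(30)
```

With assisted AI, a human sits between the score and the rollback; here the loop itself acts, which is what makes it "augmented" — and what makes trusting the model so important.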
That's exactly what Yugabyte did when they dropped the open-core model and instead released all of their code as open source. They saw that open core doesn't work — which is good — and releasing the complete stack as open source is the better example, because essentially the value is in operating the software; they also sell the Yugabyte Platform to run this database for their customers. So are they really open sourcing everything? No: "everything" is the code, but not the ops platform. That's still left to the customer, or you buy it from a service provider. And don't get me wrong, there's nothing bad about that — you can hand those capabilities to people who do it for you. But democratizing software through open source brought so much innovation, and I think the same should be true for operations.

We're trying to do this with the Operate First initiative. Operate First is an initiative to operate software in a production-grade environment, bringing users, developers, and operators closer together. Ideally, Operate First becomes a companion to "upstream first" as a basic tenet of our workflow — Red Hat's workflow being the open source workflow. Upstream first means that if we productize something, every line of code in the product should end up in the upstream project, because that reduces the maintenance burden, shares the maintenance with the community, and lets everybody participate and benefit. What we're actually doing right now, with the Massachusetts Open Cloud and OpenInfra Labs at Red Hat, is launching this initiative to operate upstream projects at scale. We're starting small — we're not at scale yet — but we want to embrace upstream communities and give them a chance to operate their projects in a cloud-native environment. We want to operate Red Hat products in this environment too, so that before we ship those products to customers, we hopefully identify the bugs, shortcomings, and edge cases a customer would run into. And we're spicing it all up with OpenTelemetry, OpenTracing, open ops: sharing all those best practices, tools, and deployments with the community, so we can replicate them into other deployments and learn from each other.

That's a whole lot of words. In the end, what you have to imagine is: think cloud provider, but with full visibility into the operations. And it makes complete sense. If you saw the previous presentation on the NVIDIA operator: an operator these days is mostly seen as a piece of code, not the actual person operating something. In essence, it's codified operational knowledge. So wouldn't it make sense to have ops people and developers working closely together in a transparent cloud, working on the ops piece, picking the obstacles they actually run into — the tedious tasks, the chores — and codifying those in an operator (a toy sketch follows below)? Rather than a developer or product manager guessing, "oh, this is something the customer usually does, let's codify it in the operator", when it isn't really the customer's actual problem. That's where the power of open source comes in. What we want to do is turn users into contributors — that's the contributor funnel. And that can only work if you have read-only access to all the data.
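As a toy illustration of operational knowledge turned into code, here is the skeleton of an operator written with the kopf framework; the CRD group, resource name, and the chore being automated are assumptions made up for this sketch:

```python
# Minimal kopf operator: a manual runbook step ("when a backup is
# requested, schedule it and record the state") encoded as a handler.
import kopf


@kopf.on.create('operate-first.example.com', 'v1', 'backups')
def on_backup_requested(spec, name, namespace, logger, **kwargs):
    # The tedious chore the ops team used to do by hand would go here:
    # take the backup, verify it, open an issue if it fails.
    logger.info(f"scheduling backup for {namespace}/{name}: {spec}")
    return {'state': 'scheduled'}  # stored on the resource's status
```

You would run this with `kopf run` against a cluster. The point is not the example itself but that the runbook step becomes reviewable, shareable code — exactly the knowledge that stays locked away when ops happens behind closed walls.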
Think of it: if I'm trying something out on my laptop and I run into an error message, I put it into Google and usually end up on Stack Overflow or on the project itself, in a GitHub issue, where I see that somebody had the same problem before. I can read through the issue, and maybe that fixes my problem. If it doesn't, I report back — so I've taken one step down the contributor funnel — and eventually, if I'm really involved, I can resolve the problem myself by contributing back to the core, because everything is open in open source software development.

This is not the case in operations. In operations, every deployment is de facto a snowflake, and it sits behind walls — which is completely fair, because you have to deal with privacy and so on. But if you think about AIOps — using AI to train your operational models and the operators that run on auto mode — does it make sense to always train them from scratch? Enter transfer learning. That's an AI technique where you train a model on one set of data and carry the knowledge captured inside the model over, so that you can train a second model on much less data. If we could ship those AIOps tools — ship those operators — with a pre-trained model that has been trained on all the common use cases, so it only needs to adapt to the special cases in your environment, I think that would move us beyond AIOps (a toy sketch of the idea follows below).

Doing AIOps as a community starts with discussing, collaborating, and settling on standards — like Prometheus, which has become the de facto standard for metrics these days. For logs, that hasn't happened yet, but I think we're on a good path there; for model exchange, same. Developing these standards is super crucial, because only once you've standardized on something can you build on the same foundation. So let's grow collectively and codify that operational experience. In the end, operations gets democratized: everybody should be able to operate things at scale. Think of it as installing an ops center and doing a git clone of your operations — don't start from scratch. Then your competitive differentiator is your data and what your product actually sells, not so much your operational excellence: you build on the data you collect from your customers, or that customers give to you, or that you know about your landscape, rather than on operating your cloud.

We're also prototyping this with the Open Data Hub, which you heard about earlier today and will hear about again later. That's only natural, because it has some AI in it: AIOps has roots in this Operate First idea, and the Operate First and Open Data Hub teams share a common history — they were incubated in the same Office of the CTO group. It's also a young project, which is good, because you can still influence it while the operational ideas and capability set are being built out. And obviously you need users and workloads — users, users, users — because without anything happening on your platform, you won't produce any data, you won't produce any issues, and you're left with an idling cluster, which is kind of boring.
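Here is a toy sketch of that transfer-learning idea on synthetic data: pretrain a classifier on a large "common" dataset, then adapt it with a handful of environment-specific samples instead of starting from scratch. Real AIOps models would be far more involved; the data shapes and the use of scikit-learn's partial_fit are illustrative only:

```python
# Transfer learning, sketched: knowledge from a big shared dataset is
# carried into a model that then adapts on a small local dataset.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Large "common" dataset, e.g. labeled incidents from many deployments.
X_common = rng.normal(size=(10_000, 8))
y_common = (X_common[:, 0] + X_common[:, 1] > 0).astype(int)

# Tiny environment-specific dataset with a slightly shifted distribution.
X_local = rng.normal(loc=0.3, size=(50, 8))
y_local = (X_local[:, 0] + X_local[:, 1] > 0.3).astype(int)

model = SGDClassifier(loss="log_loss")
model.partial_fit(X_common, y_common, classes=[0, 1])  # "pre-training"
model.partial_fit(X_local, y_local)                    # adapt on far less data
print(model.score(X_local, y_local))
```

Shipping the pre-trained weights with the operator, so each environment only runs the second step, is the "beyond AIOps" move described above.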
We're doing these things with other workloads as well — cloud-native virtualization, Mesh for Data, OpenShift itself, ACM (Advanced Cluster Management), and other emerging-tech projects being onboarded there. And since it's a community, anybody out there can onboard and run their experiments. We also do a lot with research; there's a telemetry working group and a lot of threads going on. It's slowly, slowly getting started, so it's the perfect time to chime in.

My call to action here is: get your access-all-areas card for the ops center, because it's so easy to be onboarded. Right now, all you essentially need is a Google mail address — maybe we'll change that to something different — but it should really be read-only by default for everybody out there. You deploy your workloads, of course in collaboration with the community, and then we solve issues there together. You onboard via our onboarding process, you get compute, and in return you give back the data your compute produces: your metrics, your logs, and so on.

Click on Operate First and you land on a page bucketed into data science, users, operators, and blueprints. On the data science side, you can follow along with the AI research bits. Most of it — actually, I think all of it — has an ops touch, so you won't see image detection there yet. It's about questions like: how do we inspect the CI/CD data? How do we work with time series in the Prometheus format (see the sketch below for the flavor of that)? But we also want to document the user experience — what it means to be a data scientist on a cloud-native platform.

Moving on to the operators bits, we have documentation for onboarding your workloads. We're managed with Argo CD, so we follow a GitOps approach — all the best practices of a cloud-native deployment model, so to say. Another interesting aspect, or perspective, is that if you want to make a pull request against a service, against a running system, you need to replicate the setup somehow. So we also have tools to replicate the setup on CRC — that's an OpenShift cluster on your laptop — or into other environments. Hopefully we'll have guides to deploy the same setup into AWS, Azure, or onto bare metal, and we'll grow this environment into other data centers over time. I think that's a super crucial part, because as I said earlier, SRE is usually a process that every customer and every project has to set up on its own, writing their best practices on their own. We're documenting them out in the open, so you can actually do a git clone of the decisions we're taking and the processes we're documenting, adjust them to your needs and demands, and build on our best practices instead of starting from scratch.

And as I said, it's a community: every page has a "contribute to this page" link, which takes you to a whole slew of GitHub repos. We're spread a bit across the Operate First, OpenInfra Labs, and AICoE pages, but the Operate First website is where we aggregate all the content. Everything is documented right beside the YAML code, the pipeline code, and all the goodness, so you can really follow that best-practice GitOps workflow. That's it. Thank you. Here's the URL in bold letters.
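For the flavor of working with time series in the Prometheus format, here is a minimal sketch against Prometheus' standard /api/v1/query_range HTTP API; the server URL and the query are placeholders, not the actual Operate First endpoints:

```python
# Pull an hour of Prometheus samples into pandas for analysis.
import time

import pandas as pd
import requests

PROM = "https://prometheus.example.com"  # assumption: your Prometheus URL
resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={
        "query": 'node_cpu_seconds_total{mode="idle"}',  # illustrative query
        "start": time.time() - 3600,
        "end": time.time(),
        "step": "60s",
    },
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    df = pd.DataFrame(series["values"], columns=["ts", "value"])
    df["ts"] = pd.to_datetime(df["ts"].astype(float), unit="s")
    df["value"] = df["value"].astype(float)
    print(series["metric"], df.tail(3))
```

From here it's a short hop to the notebooks mentioned above: the same DataFrame feeds straight into EDA, baselining, or model training.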
If you want to connect with me on LinkedIn or on Twitter, I'm there. Thank you.

Awesome, Marcel, thank you very much. I'm really glad you brought the Operate First cloud to our attention, because I hadn't seen it before today, so I really appreciate that. The data science projects and workflow material on that site is just awesome, and I encourage everybody listening in to take a look. I've learned something new today, and I'm thrilled to see it. So, Audrey, you've joined us — thank you, Audrey, our data scientist who gave the Open Data Hub talk earlier today. Did you have a question for Marcel?

I do. Hey Marcel, how's it going? I'm going to take on the persona of somebody brand new to AIOps. The first question that person might have: just to be clear, the data for AIOps is coming from log files, metrics, monitoring tools, maybe help desk and ticketing systems — sources like that? And the second question: there'd probably be some big data technologies that aggregate and organize all of that system output into a useful form. Is that correct?

It's correct in that "all the data" really means all the data. It starts with metrics, goes on to logs, then traces — the observability folks would argue you can regenerate traces, logs, and metrics from the observability data. It's also tickets and issues, bugs and the like. That's why we're feeding our alerts into GitHub issues (a rough sketch of that idea follows below). We want to make everything open, completely transparent. You might see some passwords in there, or emails, or phone numbers if somebody is on call — obviously we don't want to expose any personally identifying information. But if you're on Facebook, you're giving away personally identifying information too, so maybe that's just the deal: if you're in there and worried about it, use your internet pseudonym and avatar, not your real name. I would really treat it as open as it can be, because access to data is usually the roadblock. Even within our own company we have trouble getting access to internal data — it's all internal, but you have to go through InfoSec and all of that, because it might contain sensitive information. So let's do it in the open from the beginning. We're not there yet, but that's really the aim.

In terms of big data processing: yes, Open Data Hub has Kafka in it, it has Spark in it, so we have the ability to crunch big data. I hope we actually have big data at some point — right now I think we have two issues, so it's not that big, but we're starting to collect. And from the setup perspective, we're connected to the Northeastern side — I think they call it NERC, or something like that — another research domain. So I think we have unlimited storage; at least that's what I'm assuming.
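A rough sketch of that alerts-into-GitHub-issues idea: a tiny Alertmanager webhook receiver that files one issue per alert. The target repository, token handling, and any deduplication logic are assumptions left out for brevity:

```python
# Minimal Alertmanager -> GitHub issues bridge (sketch, not production).
import os

import requests
from flask import Flask, request

app = Flask(__name__)
REPO = "example-org/ops-issues"   # assumption: the repo receiving alerts
TOKEN = os.environ["GITHUB_TOKEN"]


@app.route("/alert", methods=["POST"])
def alert():
    payload = request.get_json()
    for a in payload.get("alerts", []):   # Alertmanager webhook format
        requests.post(
            f"https://api.github.com/repos/{REPO}/issues",
            headers={"Authorization": f"token {TOKEN}"},
            json={
                "title": a["labels"].get("alertname", "alert"),
                "body": a.get("annotations", {}).get("description", ""),
            },
        )
    return "", 204
```

Point Alertmanager's webhook receiver at this endpoint and every alert becomes a public, searchable, commentable artifact — the ops equivalent of the contributor funnel.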
Okay, I'd like to ask one more question — do I have time for one more, Diane? A little one? Okay, this one can probably be answered yes or no. We're going ahead and gathering all this data. Do we have something in place that will reduce what I would call noise? There might be spurious data that comes up, or data where we could spot trends of somebody trying to do something to the system. Is that in the works for AIOps? Because otherwise we'd just have huge amounts of data that I don't think we'd really want to use.

That's an absolutely great question, because that's the challenge for the AI community, I would say. I'm not really an AI researcher, but I know it's the same problem as reducing noise when identifying cats in images: you don't want to label the cat in the background, only the cat that's actually moving, for example. Or like those adversarial attacks where you change only a single pixel. I also don't want my AIOps agent to evacuate a cluster just because somebody deployed something with an emoji in a log message — that would be an adversarial attack on the ops agent. So exactly: collect all that data, capture the scenarios where you have an outage or something similar, and then retrain your model so it works better. We don't have that yet. We have some POCs in that domain, and we're seeing interesting projects from research and also from IBM on extracting log templates from log files, predicting time series, and correlating time series (the log-template idea is sketched below). But the problem most of them run into is: "we need your data" — and no, we don't have much data yet — "oh, then we can't train our model". That's exactly the problem we're trying to solve here: provide a common set of data. I'd like to see an MNIST for ops at some point — a standard dataset of cluster outages, the way ImageNet is the de facto dataset for image training. I want something like that for Kubernetes clusters.

That's a great aspirational goal, and I think it's doable as a community effort. So thanks, Marcel, for doing that. I'm going to queue up our final speaker for the day, Sherard Griffin, and then I'll come back on with some resource links as well. I really want to thank everybody who's persevered with us through the day — it is fluid, so we've run over a little, but we should be able to wrap up in the next 15 minutes. Thank you all for your time, and Marcel, thanks for doing that live.
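As promised above, here is a minimal sketch of the log-template idea: mask the variable parts of each line so structurally identical messages collapse into one countable template. The mask patterns are illustrative; real systems use richer approaches such as the Drain algorithm:

```python
# Regex-based log templating: turn noisy raw lines into countable templates.
import re
from collections import Counter

MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),   # must run before <NUM>
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def template(line: str) -> str:
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

logs = [
    "connection from 10.0.0.12 timed out after 30 ms",
    "connection from 10.0.0.47 timed out after 31 ms",
    "pod restarted 3 times",
]
print(Counter(template(line) for line in logs).most_common())
# [('connection from <IP> timed out after <NUM> ms', 2), ...]
```

Thousands of raw lines collapse into a handful of templates whose frequencies can then be baselined and monitored like any other time series.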