Let's get into the big topic of open source, something that we actually say is right. This is so awesome. We are an open culture. It's exactly what it looks like. It's that process that has made the Kubernetes ecosystem really boom.

Welcome to In the Clouds. I am your host, Stu Miniman, and we're going to be talking today about one of my favorite topics: data. And of course, we know there's been a huge explosion of data over the last few decades. It's one of those exponential things. The only constants in our industry are change and that all of the information will continue to pile up. We create a Library of Congress worth of data probably every few milliseconds at this point, if I have it right. And what has the cloud era taught us? We've really gone beyond what humans can manage alone. So the topic we're going to talk about, the one that ties a number of these things together, is AIOps. And no one better to talk to us about this topic than the AIOps lead at Red Hat. So I'd like to welcome Marcel Hild, who is the AIOps lead, part of the Red Hat Office of the CTO, or OCTO as it's often known. Marcel, thank you so much for joining us.

Hello, Stu. It's a pleasure to be here. To be in the clouds, yes.

Yeah, in the clouds. And we do these events. You're coming to us from your home office in Germany. If you could just give us a little bit about your background: what led you to this focus that you've had for a number of years on the AIOps trend that we're going to talk about?

Sure. My background is in open source. I've been doing open source for like 25 years, since university, and I think I've played most of the roles in this game, from solution architect to freelance developer to somebody operating machines. I came back to Red Hat like six years ago and worked on cloud management software called ManageIQ, or CloudForms as the product. And then this AI center of excellence at Red Hat started.
And I was listening to an internal talk by Ulrich Drepper, the guy who worked on glibc for some time, and I figured, this is a super interesting topic, I want to be part of this. So I came to AIOps because cloud management, and its overlap with AI, is AI operations. That's basically my journey into this. Over the last three years I came into this environment as a noob, and I think now I have a pretty good opinion on where things stand and what the open source angle on this is.

Well, Marcel, of course, being at Red Hat, we all have some angle on open source. It's crazy for me to think, I was a product manager for Linux over two decades ago, just to date myself a little bit. So I had quite a bit of history with open source before I joined Red Hat a little over a year ago. I also have a little bit of history with OCTOs. So I was really surprised that Chris Short, who used to host this, had not yet interviewed Chris Wright. And if I have it right, you're the first OCTO member we've had on the program. Every company does things a little bit differently; at Red Hat, of course, everyone's involved in open source. So can you help us understand a little bit the structure of the OCTO inside Red Hat?

Sure. Apart from being a fancy name, OCTO always reminds me of an octopus, but it's also very descriptive. I think it's a good picture, because octopuses are super, super intelligent. If their parents didn't die when they were born, I think they would maybe even exceed human intelligence. They really learn so many exciting things during their short lifespan. And that's, I think, how OCTO, the Office of the CTO, is different at Red Hat from other companies: we only have one CTO and we only have one Office of the CTO. And that has grown over time.
So when I joined, it was a smaller team, mostly pictured like this: the Office of the CTO stands at a conveyor belt and identifies all the open source projects and communities out there; the good ones stay on the conveyor belt, and then Products and Technologies turns them into products. I think that evolved over time, and now we're also looking more into what our internal culture looks like, our relationships with communities. I mean, that's been there before. But we've been growing our family in the Office of the CTO. Ecosystems engineering, partner systems engineering, is now also part of the Office of the CTO.

All right, let's turn and discuss AIOps. So hopefully most of our audience are familiar with things like DevOps, the movements that pull together some technologies; DevSecOps is something we as a company overall are doing. AI is a little bit of a fuzzy term sometimes. There's artificial intelligence, there's machine learning, and the like. I remember I attended a conference back in 2019 which was all about AIOps, so there had been a real spike in this discussion. If you could just level set our audience on AIOps: what's the high-level message? Where has it been, and where are we today when it comes to the industry as a whole for AIOps?

Yes, sure. So AIOps initially started as a term coined by Gartner. I don't know when, but it was before the big AI hype. It actually started out not as artificial intelligence for operations but as algorithmic IT operations, I'm blanking on the exact term, but it was basically algorithms. So not really AI. And then they changed it to AI, because if you say AI you get more funding for it, rather than doing analytics and just algorithmic stuff. But I think it still boils down to this idea of using data science, algorithms, machine learning, and data engineering, applied to operations: to monitor, to automate, and to maybe self-heal your environment.
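As a toy illustration of that idea, not any specific product or tool discussed in this episode, replacing a hand-set alert threshold with a simple statistical one might look like this:

```python
import statistics

def is_anomalous(history, value, sigmas=3.0):
    """Flag a metric reading that deviates more than `sigmas`
    standard deviations from its recent history."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # history is perfectly flat: any change is an anomaly
        return value != mean
    return abs(value - mean) / stdev > sigmas

# e.g. request latencies in ms from the last monitoring window
history = [102, 98, 101, 99, 103, 100, 97, 100]
print(is_anomalous(history, 101))  # within the normal band -> False
print(is_anomalous(history, 250))  # a spike -> True
```

The point is not the math, which is deliberately trivial here, but that the threshold now adapts to the data instead of being hard-coded by a human.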
Basically, everything that's happening in an operations environment or in a DevOps environment, you apply some machine learning, some data analytics to it, and you have this AIOps thing happening at your company.

Okay, so Marcel, when we looked at DevOps, one of my favorite lines is: DevOps isn't something you can buy, even though there were lots of companies that tried to sell you their service as something DevOps. Are we talking about the same type of thing with AIOps? Is much of this organizational and cultural, versus a product that you buy that will magically, you know, make your data better?

I guess it depends on who you're talking to, right? So if somebody's selling you an AIOps product, which has AIOps in its name, maybe they want to sell you this product which enables AIOps in your team, or which brings AIOps to your setup, to your company. I don't know. And we see a lot of rebranding of old tools with some AIOps stamp slapped on it, and basically it's the same wine in new bottles, I don't know if that's a common idiom, but just rebranding stuff. The DevOps movement, I think, is a really good analogy for what's happening here, because, being even more true to the name, it should have been called AI DevOps. If you take a step back and see what DevOps was all about, it was about using the tooling and the way of thinking of operations in the development team, so that I as a developer could just spin up an environment which is similar enough to the actual production environment, on my laptop or on a smaller cluster. And then, similarly, the operations team applying some of the best practices from development, such as versioning your source code, automating a lot of stuff, revision control, etc., in operations. So really using the best of both worlds to create this new persona. And I think now we've come to that point in our journey to the clouds and to scale where it's just too complicated for humans alone to understand, where our
automations might not suffice anymore, where our best practices may not be enough, where we maybe meet some unknown unknowns or something is super unpredictable, where we just use the tools that are in the tool belt of the data scientist, of the machine learner, and apply them to operations.

Yeah, so you brought up the data scientists, and this is one of the challenges we have. Most companies usually don't have an abundance of data science resources; they're difficult to find, and you want to retain them. But getting the value out of data, at least in my experience, I remember talking to data scientists five or ten years ago: we have all of this data, but we don't necessarily know the right question to ask. So there was a gap between the subject matter experts and the data scientists. It sounds like, if I'm hearing right, maybe we're flipping that: rather than having the data scientists be the center of all this activity, are we allowing the subject matter experts to leverage some of the things that normally I would have had to be a data scientist to do? Are we democratizing the tooling and skill set that we had from data science?

Absolutely, it's about democratizing it, and commoditizing it also. I can do stuff on my laptop these days where, just some years ago, I needed multiple really expensive GPUs and hardware to do the same stuff. Maybe you saw these videos on YouTube where it just takes like five lines of Python code to build a neural net that can actually identify images of cats, right? And it's really just five lines of Python code that I copy and paste into my browser; maybe I need to install some stuff, but then I can actually do this without understanding the real math that is being applied there. That doesn't mean that I shouldn't understand the intuition behind it, so I would encourage people to watch some stuff on YouTube to get the intuition on what data science is all
about. There are those two different types of machine learning, unsupervised and supervised learning; there's deep learning; there's just regression, advanced statistics. So know the tooling that you have to reach for, and then use some of the democratized libraries out there. Go to scikit-learn, where there's even a flow chart where you can just look at the problem domain that you're having, and then it guides you to the best algorithms and the best practices to apply to that problem. Then just try it out and see if it helps you do your job better.

And I think that's the common misunderstanding that is sometimes misarticulated in advertisements and in other YouTube videos: that we need something net new to solve this stuff, therefore we need to learn AIOps, therefore we need new tooling. No, we actually need the same tooling. And if the tooling doesn't work anymore, if you hit that barrier of operations where your normal operations don't work anymore, where your thresholds don't do the job anymore, and if the linear regression that is maybe built into the tooling that you already have also doesn't work anymore, maybe then reach for AI to do the job. Because only then do you understand your problem domain well enough that you can ask questions, or pose problems, that are well suited for AI. Because if you just invite a data scientist to your shop and you tell them, make my life better, make it more efficient, then he will also start asking questions, and you need to smarten up the data scientist on your domain. So I think it's way easier to smarten up your folks that already know your domain really well, namely operations and DevOps, and then invite the data scientists to help them solve problems that you pose. So it's not the other way around.

Yeah, Marcel, that's great. It is one of the huge value propositions of cloud in general: I no longer need to be a nation state or have some, you know,
ridiculous budget. With that democratization, the availability of the tooling, and a lower bar for the skill set that we need to learn. And on that note, we actually got a question in from the audience. Andrew Sullivan, who is the host of Ask an OpenShift Admin, asked: is AIOps different than the preemptive, proactive actions taken by monitoring systems, or other things like more advanced OpenShift Operators? So maybe if you could drill down a little bit and help us differentiate what you're talking about.

Yeah, it's a good question. I think it's essentially the same, and it can be applied to the same ideas. Being preemptive or proactive means that you anticipate something happening in the future, and you can do that with some hard-coded rules, or maybe some more advanced logic that's codified into these Operators, which then take action if they see a pattern emerge or if they see something going wrong. Maybe they evacuate a pod, or maybe they restart your application, because they saw that pattern over and over again. Now, using AIOps in that sense would just mean: exchange that piece of handwritten logic, which you wrote or created based on the experience that your natural intelligence, your brain, gathered previously, and replace it with some heuristics. That might really be deep learning at the far edge, but it might be just linear regression or some advanced statistics in the first step, where you just use some tooling, in the language that you use, that applies statistics to do the same job. So it's really the same patterns, the same tools, but with a data science angle applied to them. But to enable that, you need to have the data in the first place. So the first step is, you need to have operations, right? You get your house clean, your environment really well in shape, so you have access to all the data that you need to do proper root cause analysis, to do proper monitoring. If that's all in place, and you have some automation attached to
it, because you can script stuff, then you reach the point where you can take that next step. Maybe you don't need to take that next step, maybe everything is really super easy and you don't need to apply it. But if you fail with your automation, or you could do better, apply some machine learning to it, apply some statistics to it. Maybe it's learning from the data and the observations that have been made in the past. So you store your data over time, and then you build up that AI knowledge in your team, and then you apply AIOps to it.

Marcel, I wonder if you can help us connect the dots. We talk mostly about the cloud on this show, and most of what you were talking about doesn't require too much: you pull out your laptop, you do it. When you talk about data, though, a lot of the data lakes now are living in the cloud, and moving data isn't necessarily an easy thing; physics still rules. So how much of what you're talking about is a public cloud thing? How does edge play into it? Any guidance you can give as to where people should start looking? There's a question up there referencing reference architectures, the whole flow of activity. We don't have time to whiteboard and do everything else, but where do we start, and where do we go for more resources on that?

Yes, very good question. First and foremost, I don't think that you need to start from scratch. It's pretty much like this image detection, or cat image detection, problem that I mentioned: nobody will train a complete ImageNet model on all the images, they just use the pre-trained models that are out there to identify cats. And I think the same is true for identifying outliers, for anomaly detection. Most of the time it's time-series data that we're dealing with, so you don't have to reinvent the wheel to do something like resource prediction, like predicting the future: when will my storage run out of space, what's the seasonality of my access logs? Right, so in winter I have more access because I'm selling
tires, and I have a spike when the weather changes. Maybe I use that to extract seasonality out of my time series, because I have an influx of requests to my web shop. So I don't write the software from scratch; I use something that's already out there which works with time-series data.

And that's where cloud comes in. If we're talking about cloud, I'm talking about cloud native, a cloud native approach. That means containers, that means Kubernetes or OpenShift orchestration, that means the CNCF projects that contribute to this. So for time-series data, it's Prometheus; that's all over the place, we're doing monitoring with Prometheus these days. If you're not doing it with Prometheus, you can also apply these tools, but it will be harder. There are a lot of examples that my team has created on the internet, but you'll also find a lot of other examples of working with time-series data with Prometheus in the cloud. And that doesn't necessarily mean that you're operating the platform or parts of the platform; it can also mean that you're operating your application, which is again instrumented with some metrics that are emitted via Prometheus. And I think that's the nice thing about the cloud these days, that we have so many defined standards across the whole stack. You will see the same, well, not the same metrics, but the same schema of metrics being emitted throughout the whole stack. So you don't have to reinvent your tooling as you travel up or down the stack; you can just reapply it. I think that's the beauty of it. It's a little bit like LEGO blocks: they all mix and match. Even Duplo, the big ones, mix in too; they're just a different scale.

Marcel, maybe, if we can, can we make this a little bit real? My understanding is that one of the ways we do this internally at Red Hat is what we call remote health monitoring. All of the clusters out there, from a Kubernetes standpoint, data comes back, and of course it's
all anonymized, but being able to understand trends and learn from what the community is doing is something that open source has a long history of, and it's one of those great drivers of data-driven insight. So maybe walk us through that as an example of how AIOps fits into learning things, and maybe changing what we do, with something like remote health monitoring.

I think those are really two questions. One is remote health monitoring itself, which I think is an excellent example where we apply machine learning to something that at first glance doesn't look like operations, but I think it's AIOps in its truest sense. And the second part is how we treat it, what the open source angle there is. Remote health monitoring basically means that all the OpenShift clusters out there that have opted into remote health monitoring, so it's not mandatory, like you said, can send a piece of their operational metrics back to Red Hat. That's things like: what's the overall error rate, are operators failing, what state are the operators in, what's the resource consumption. So really something that doesn't expose any of the workloads that you're running there, but it exposes, let's say, the health of the cluster. And the idea is that we treat the whole fleet of clusters out there basically as connected devices, and can identify problems at certain clusters, actively debug them with the customer if needed, or maybe even identify the root cause of the problem without contacting the customer, and then proactively fix it before it affects a larger set of customers or before it becomes a real problem, so to say. And then also influence the product. So something might start as: I have this flipping state of an operator, it doesn't have any effect, but there's probably something wrong, so let's dig into that and fix the bug before it actually becomes a really severe issue. So I think that's the user experience you usually get from a cloud environment, where the cloud scheduler can look into
all the workloads that are being run in the cloud and then make the overall cloud experience better. That's something that we also need to enable for hybrid deployments, for basically all the products that are deployed out there, to send back some telemetry data so that we can make the products better, and to escape that traditional model of: here's a release, customer, please install the release, and we will fix the other bugs in the next release, which is half a year away. No, instead: over-the-air upgrades, a constant rollout of updates, something that we're used to with these devices here, where we get an update all the time, which are also sending feedback back to the developers. And applying machine learning to that internally is something that I would also call AIOps.

So what are we doing with this telemetry stream, which is again just Prometheus data, a time series from many clusters? We're looking at these streams, and we cluster the data to identify interesting, well, larger clumps, not of outliers, but pools of clusters where some problem might occur, problems that are happening at a larger scale or at a smaller scale, where we can then make data-driven decisions about where to look next. Because we also only have so many developers, but we can identify issues that affect a larger fleet or have a higher severity. So it helps us to guide our attention and focus where we put our efforts next. That's also AIOps: although it's not really us operating a cluster, we apply AI to an operational problem.

You said you wanted to address the open source angle of what's happening. That's probably not really open sourcing, because we're not open sourcing that data, it's customer data. Maybe it's like crowdsourcing. But I think it's going in that important direction. Let's think about it: these ImageNets that identify cat images, or these agents, the smart home assistants from Amazon Alexa or Google or
Siri, they work because we threw a lot of data at those companies. Google has all the images, so they could label all the cat images; we're constantly talking to our phones, so there was enough voice data to train those models. How is that with operational data? Show me one customer that exposes their operational data, meaning those metrics that we just talked about, the log files, the incidents, basically everything that you need to operate your environment, and shares that with the world. Nobody does that, for obvious reasons, right? There might be some personally identifying information in there, or you might expose that you actually suck at operating your environment. So you don't expose it to the world, and nobody is really sharing that data.

As a first step, using our trust as Red Hat with customers to share that data with us as a trusted partner, I think that's a step in the right direction. Customer A gets the benefit from customer B sharing their data with us, because we solved customer A's problem before they actually ran into it, because customer B already ran into that problem. So that takes it to the next level.

But now think about applying open source to it, where we would also operate similar workloads in a community context, where not products but projects, like a MySQL project or a PostgreSQL project or a Kubernetes project, would operate parts of a cloud environment completely in the open, creating true open source data sets for metrics, for failures of systems, for the log files and all the incidents that have been created and then resolved. That creates a large set of data where the data scientists then come into play and say: look, we had this incident happen two years ago, and now we have a data set of five years' worth of data across many, many hybrid cloud environments. Let's try to predict this incident that happened two years ago based on the knowledge that we have and that we have seen. And then we can do something that is really preemptive and really
proactive in resolving issues, because we trained models that are based on really common knowledge, applying democratization to the data itself. I think that's the next step, and then shifting that to the larger industry. And if we're talking about community, it's not like back in the day, where we did something as students and weren't being paid for it; no, it's actually people being paid for this, and they are working in these communities. So that's my call-out to all the hyperscalers and customers out there: if we operate stuff in the open, we can also create data sets that are truly open, and then make AIOps a real thing.

Marcel, I think it's a great step along the trajectory we've seen for a long time. IT kept everything that they did very close to the vest: oh, this is special, we're going to build a temple to every application, we're not going to tell anybody how we do this. Now most customers understand: if I'm going to take advantage of all the innovation happening in the industry, I need to be talking to my peers. Running certain pieces of IT isn't what differentiates me; it's the relationships I have, it's the people I have, it's the business that I run that matters, and IT needs to be close to that. And not only sharing, you know, best practices, but as you're saying, if we can leverage community data, that would take things even further.

So Marcel, you know, we mentioned at the beginning of the conversation that there was a lot of hype for a few years about AIOps. Is it the natural diffusion of technology and the wave of adoption? It feels like we're a little bit past the peak of the hype. Oftentimes there are stumbling blocks, or companies have failures where they move away from things. So what do you and your team see? What are some of those stumbling blocks?
What are some of the common things where people will go down a path and be like, oh, forget it, I give up? You know, how do we help move things forward as an industry as a whole?

As a whole, that's a tough question. I think this Operate First idea is the lever to move things really forward.

Yeah, and I guess maybe from an individual customer standpoint, what are the top things where you'd say, hey, if you don't do A, B, or C, you're going to fail? Or some of the anti-patterns that you need to avoid?

Yes, on that smaller scale, it's usually managing expectations. I've been pulled into so many calls where a customer says, I want to do AIOps, and a lot of the time they are not really operating yet. So it's really managing that expectation: AIOps doesn't solve the mess that you currently have. It doesn't help you get better at operations; it helps you get better if you're already doing operations well. So that's the first thing: really understand that to do AIOps, you have to do ops first. If you are at that level, try to come up with a homogeneous environment. So if your landscape is super, super diverse, and you have a lot of different APIs and a lot of different teams working on stuff, start out with a small subset of your tooling, of your environment. Best would be to start out with a cloud native environment, because I think that's where most of the traction happens these days. And start out with a small team. Because again, AIOps is not a technology, it's a mindset. You can't buy DevOps, and you can't buy AIOps, but you need to enable your staff, your engineers, to step up from being DevOps folks to become AIOps folks. So start with a small team and enable them, give them a lot of freedom to practice, and let them work on something that is not really mission critical, but where they can apply something, and then learn and adopt that culture: be more data-driven.
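To make data-driven concrete, here is a toy sketch of the kind of small, low-risk exercise such a team could start with, forecasting when a disk fills up from daily usage samples with a plain least-squares fit. The numbers and the function are illustrative assumptions, not any particular Red Hat tooling:

```python
def days_until_full(used_gb, capacity_gb):
    """Fit a straight line to daily disk-usage samples and
    project how many days remain until capacity is reached."""
    n = len(used_gb)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(used_gb) / n
    # ordinary least-squares slope and intercept
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, used_gb)) / \
            sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return None  # usage flat or shrinking: no exhaustion in sight
    intercept = y_mean - slope * x_mean
    # days (counted from the last sample) until the trend crosses capacity
    return (capacity_gb - intercept) / slope - (n - 1)

# seven daily samples, growing ~10 GB/day toward a 500 GB disk
samples = [400, 410, 420, 430, 440, 450, 460]
print(round(days_until_full(samples, 500)))  # -> 4 days left
```

Nothing here is mission critical, the math fits in a dozen lines, and it produces a number a team can argue about, which is exactly the kind of habit being described.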
I mean, that sounds idealistic to some extent; even we at Red Hat are not super, super data-driven. We also sometimes just make gut decisions, and I'm trying to apply the data-driven philosophy to my team, to the way that we're doing things. So make really data-driven decisions, and that means you can also apply this data science tooling to your team structure, or to basically any decision that you're making. At the end of the day, a Jupyter notebook, which is the table stakes for doing machine learning or data science in a browser environment, became something equally important to our team as the terminal, the command line, and Git. And I think that's actually a good example. We had a data science team and a dev team, and the data scientists needed to smarten up on using Git, using GitHub, using pull requests, all the tooling that the other folks already knew. And the other folks are now using Jupyter notebooks to convey some prototypes, or to basically use them as a first primitive in their day-to-day work.

Yeah, Marcel, the term you've used a lot is Operate First. If people want to learn more about the principles of Operate First and what your team's doing, where are some of the places they can learn more?

Sure, that's easy: operate-first.cloud is the website. There's a mailing list, and there are a lot of GitHub organizations. And it's really, I think, the first completely, truly open cloud environment, where you can peek into the back of the data center, the back of the operations office, with just a GitHub handle. So you have single sign-on with GitHub, and you can click on the OpenShift console link and you will be in the OpenShift console, and you can see what's running there. You can't deploy stuff, obviously, but you see what's running there.
And then you go into the GitOps repositories, where we have all the applications, all the workloads that are being deployed there, managed in a really forward-looking style, how we would do GitOps in a greenfield project. And you can see that. So you would see an application in the OpenShift console, and then you can go to the GitOps repository and see how that application is being deployed, and you can even go to the Prometheus instance and see what metrics are being produced there. So it's really a living hello-world example of something that you might want to run in your environment.

Awesome. We've shared the website there in the chat; it looks like there are also some community meetups happening and lots more resources. You of course did a session with Chris Wright, our CTO, on Technically Speaking; definitely one people should check out. I'm glad we get to dig a little deeper. So Marcel, the final thing for you: the last two years, most of us were not commuting, and we usually find some hobbies or things to do. I have a feeling that data plays some role in your activities. So maybe tell us, what do you play with in your spare time, and what data connections does it have?

Yeah. I used to have a Doom poster at the back of my office. Now I've replaced that with a car, but it's not a real car, it's a sim car. So I totally got into sim racing over the past two years, which is simulated computer racing with a VR headset. I'm really sitting in that chair and doing some racing, because I like this repetitiveness, where you do something and then you squeeze the last bit of perfection out of it by just churning over the stuff that you're doing. And just recently I figured, I mean, I'm doing so many rounds and rounds and rounds, but I'm still five seconds slower than the real pros.
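Finding where those seconds go is, at its core, the same kind of time-series comparison discussed throughout this episode. As a toy sketch, with hypothetical sector times rather than any real sim's telemetry format, comparing a lap against a reference lap sector by sector shows where the time is lost:

```python
def time_lost_per_sector(my_lap, pro_lap):
    """Compare per-sector lap times (seconds) against a reference lap
    and return (sector, delta) pairs sorted by time lost, worst first."""
    deltas = {sector: mine - ref
              for sector, (mine, ref) in enumerate(zip(my_lap, pro_lap), start=1)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# hypothetical 4-sector lap times in seconds
my_lap  = [31.2, 28.9, 35.4, 30.1]
pro_lap = [30.8, 28.7, 32.9, 29.7]
for sector, delta in time_lost_per_sector(my_lap, pro_lap):
    print(f"sector {sector}: +{delta:.1f}s")
# sector 3 stands out: that's the corner to practice first
```

A real co-driver would work on finer-grained traces, speed, throttle, and brake per track position, but the shape of the analysis is the same.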
And I feel that there's also some opportunity for an AI-guided co-driver that tells me where I'm actually losing those seconds, because it's too hard for me to see from my own perspective. Using some data-driven analytics to improve my way of driving: you get a lot of telemetry sent back from those sim cars, and I actually managed to take that telemetry, feed it into an InfluxDB instance, and use Grafana to display those graphs. Actually, I just took the example from another Red Hat engineer, so it was not based on my work, but I deployed that stuff in this Operate First cloud. So I managed to send that data over there and just apply the tooling that we're using in AIOps to the sim racing stuff, which I find super, super funny and interesting. So even in your hobbies, you can do this cross-domain fun stuff and smarten yourself up, in your spare time, on stuff that you actually need for work.

Yeah, absolutely. In the fitness world, data-driven results mean you can measure everything against yourself, and either the community or the team can help out. What VR headsets are you running with these days?

I started out as a backer for the Oculus kit, so I've really been in it since the first days, and now I've ordered this Varjo Aero thing. I'm always trying to get the best stuff. I'm not allowed to travel anymore, so the family doesn't get to go on holidays.

Yeah, I need to wrestle the Oculus away from my son, because there's definitely opportunity to travel through the VR headset and check out some of the things. Marcel, really appreciate the conversation. Appreciate, you know, sharing with the community, and glad there are lots of ways that people can connect and follow up. So thank you again.

Yes, thanks for having me.

All right, we'll be back. You know, we run bi-weekly. Please reach out to me on Twitter @stu. You can always hit us up on the YouTube channel. Lots of good programming coming to you throughout the year.
So I appreciate everyone watching, the comments in here, and joining us for your journey in the clouds.