Good morning, good afternoon, good evening, wherever you're hailing from. Welcome to another edition of The Level Up Hour. I guess I should say episode these days, but another episode of The Level Up Hour. I am Chris Short, executive producer of OpenShift TV, CNCF Ambassador to the stars, etc., etc. I'm joined by the illustrious Langdon White and two of my fellow Red Hatters today. Langdon, what's going on? Why are there so many people on the call?

Exactly. You could start using "chapter" and just...

Yeah, this chapter of this episode, right.

Right, exactly. Yeah, so this is The Level Up Hour, where we talk about containers and why, you know, they ain't just hype. There's a lot of usefulness to them. We try to talk about why they might be useful to you as a RHEL administrator or developer or whatever, even if you're not using them necessarily in production, even if you're not using them in an orchestrated fashion, just to show you why containers are interesting and what they can let you do. I run a series of points on the show a lot of the time, if you've ever seen the show before. We'll talk about it later, but even for that I use a container to run that, so I don't have to run Google's SDK on my laptop. So that's just a quick example. We've done a ton of different episodes. I think this is episode 38, maybe 39, or chapter 39, as they will start calling it. But today, well, let's do the slides and introduce the folks for the show, and that way we can make things less confusing. Maybe.

All right, so everybody can see my glorious slides, as always. All discredit goes to me for the quality of the slides that are produced, but these nice little art things over here are pretty cool. All right, so you can find us on Twitter. I'm Langdon with a one, and Chris Short is ChrisShort, with a C and an S and another S, H's, some R's, etc.
You get it. And you can also find us on our Discord. Check it out. You can come and hassle us about what you see on the show, what you might like to see on the show, what other shows you might like to see, or any other kind of random questions about containers and orchestration. You can also ask us about dating, but I highly recommend you not do that.

Not at all. We are both married, so there's something... Yeah, I don't... spam is real.

Yeah, exactly. But like I said, we do try to help out, particularly on our Discord. There's a community of people there who might also be able to help with whatever your questions are, theories you might have.

Moving on, today's episode. I got that number completely wrong, clearly. Wow. It is episode 39, not 37. We'll talk about this in a few minutes, but every time we have a guest on the show who's from Red Hat, we talk about how Red Hat renames teams and titles and stuff like that all the time. It gets very confusing internally. Generally speaking, we try not to expose that to the external world, but sometimes it slips out. So today we have a couple of guests, and I'm gonna let them introduce themselves and what exactly their team is called, because even if I knew it last week, I may not know it this week.

Show notes are not live. As I was mentioning to Chris earlier, I've been on vacation for a few days, so I didn't quite get to them. I will tweet them out as soon as they're ready.

And I'll link to the main repo in chat for everybody.

Oh, thank you. Yeah, so they will show up there as soon as I get them done. And actually I will also fix the bug that was reported to me on Twitter, which is fixing the table of contents, which I haven't quite figured out how to automate, so that means it relies on my memory to do it. So good luck.

Yeah, so without further ado, why don't we start with Jan? Can you kind of introduce yourself?
Tell us what your role is and, you know, the team you're working on.

Sure thing. But before I do that, can you send me a link to those cool little graphics that you have?

They're mine, they're all mine.

All right. Yeah, so my name is Jan Zeleny. I'm the manager of Connected Customer Experience, as the internal name of the team goes. Surprisingly enough, that name kind of stuck for the last couple of years, so we might be renaming the team soon.

Right, right. Any day now.

Yeah, exactly. Well, originally we started the team with a broad vision of, hey, we have a bunch of connected clusters here, let's do something useful with it. Let's present some value back to the customers, and let's present some value to our engineers internally. It wasn't exactly clearer than that. Since then we actually have a mission now, and let's see if I memorized it. People have asked me so many times. I think our mission...

Maybe next time.

Yeah. Our mission statement is to improve the operational experience of connected customers and connected clusters, whatever that means. We do that on so many fronts. We work with our support engineers. We work with our product engineers to improve the quality of the product. And obviously we present some value back to the customers, so they are actually motivated, so that they actually get some value for sending us their telemetry data. So that's it in a nutshell, I guess.

Cool. All right. So now this was hard because I've never heard your name out loud. I can do it, but it depends on your nationality, which I'm not sure of, so it could be Ivan or Yvonne or... there are others. Go ahead.

It's Ivan.

Nice. Nice.

Yeah. The surname was right. For the first name, Langdon had it better with Yvonne. Sorry, but you know, I'm already used to all of that, so don't worry about it.
So yeah, I'm a software engineer working on...

To be clear, you know, like, tell me about it, right? I don't know how people look at it and figure out all the different ways to pronounce it, but it's very impressive.

Yeah. So yeah, I'm a software engineer working on the same team as Jan, also Connected Customer Experience. Officially or unofficially, I have the architect name in the role, even though I don't like this particular name, because people have a lot of assumptions about this particular role. Basically, I'm just trying to figure out what would be the most important or interesting thing to do with our data and bring the impact back to our customers eventually. So that's in a nutshell what I'm trying to do, and then trying to get everyone involved in that, if that makes sense.

Cool. All right, so, obvious question: what does connected customer mean? Or let's put it slightly differently. What is external facing? What do the people who are our customers or users or whatever see that your team produces?

So let me answer that with a question. Have you ever heard of Red Hat Insights?

Yes, I have. In fact, we actually had... oh, no, I don't remember. We had somebody on the show talking about it.

It was episode 36 or 37 now.

Yeah, so we had Dorito on talking about subscriptions, which is kind of getting morphed into that, and then we also did something with Insights as well. We've actually done a few different things. So in a nutshell, the way I understand Red Hat Insights, it's gonna keep track for me of all my RHEL instances and their current status and whether they're in good shape or not. That's kind of what it is in a nutshell.

Yeah, that's definitely true on the RHEL side. But lately, let's imagine Insights as a big overarching brand.
It's not just RHEL specific, but also OpenShift and, in the future, perhaps Ansible as well. CCX and what we do on the customer-facing side is the first step for Red Hat into the Insights world on the OpenShift side. What we brought last year was, well, you can call it a tiny application. It's not that tiny actually, but it's the Insights Advisor for OpenShift. It doesn't give you all the information you will see in RHEL, because in RHEL you have Advisor, Vulnerability, Compliance, all that stuff. On the OpenShift side, we have just Advisor, which gives you some recommendations about how to keep your cluster healthy, to put it in one sentence. We look for different types of recommendations related to stability, availability, performance, and security. We have some security guidance as well. So that's what we have on the customer-facing side. If you want to see that, I can actually show what it looks like.

Sure. Yeah.

All right. Okay.

Pictures make it easier to understand things.

Yeah, I think so, too. All right, so I hope you can see my screen now. This is what you will see as a cluster admin: either this web console right here, or you can obviously have command line access. You see that, yeah, I'm logged into my server. So this is the regular stuff, right? And here you can already see the first glimpse of the Insights Advisor for OpenShift. I installed this cluster just yesterday, so you can see that it is perfectly healthy. If you wanted to take a look at it in the OpenShift Cluster Manager, which is another nice app that we have on cloud.redhat.com, it will show you some details about a cluster. I'm logged in already. So this is the overview of my cluster.
You can see multiple tabs. Insights Advisor is one of them. Again, you can see that everything is in good shape. But what would you see if things weren't in good shape? I actually prepared a short little demo where I'm gonna show what Insights does for you as an OpenShift admin.

So you're introducing problems, is what you're saying.

Yes, I'm intentionally gonna introduce a problem here, and then we'll see what recommendations we get and how we can follow the recommendations so that the problems are fixed. So let's start with creating a couple of new projects. I'm gonna create "project" and, say, "project copy". So a couple of projects here. I'm gonna skip a few steps now, because obviously I would now create applications within those projects, put some code in and whatnot. But just for the sake of the demo, I'm gonna jump to almost the end and set some egress IPs for the project. I actually have this somewhere in my history, I hope.

The number of demos I've seen that are purely about the usage of bash history is very impressive, right?

Yeah, no surprise there. Okay, so I'm gonna patch the netnamespace for the project and set the egress IP. Let's keep 100 here. Until now everything is okay. Well, actually, it is gonna introduce one little thing already, but let me introduce another problem. I'm gonna keep the 100 here, and let's say I want to create an egress IP for my copy project, and I kind of forget to change the IP address here, right? So now I have two projects with the same egress IP. That can't be good, but it is possible that I miss that for whatever reason. So now when I take a look at the netnamespaces, just to check everything... Yep.
So, two projects with the same egress IP. Now, Insights has a component collecting the data on OpenShift and sending that data back every couple of hours. That component is called the Insights Operator. Since it collects the data only every couple of hours, you're gonna see me forcing the data upload a couple of times here, and I'm gonna do that by going to the openshift-insights project. Okay. And I'm gonna delete this pod. The Insights Operator is gonna restart the pod, and as part of the restart it's gonna upload the data. So in a little while we'll probably see here that Insights is not available, which means we're on a good track, but the refresh takes a little while. So I'm gonna switch to the OpenShift Cluster Manager and refresh the page to see if the data was already uploaded and processed.

So, OpenShift Cluster Manager. Yep. So that's a single cluster you're looking at?

All right. Yeah. Yeah.

Okay. Yeah, for some reason I hear that and I often think of ACM, like multi-cluster manager. I don't know why, but they get conflated in my head very easily.

It's not like we have a lot of acronyms floating around.

No, no, right. It was funny, somebody made a joke like that yesterday. They rattled off a very long-winded title or whatever, and I was like, oh, you forgot Red Hat has to be in the middle of it somewhere.

Of course.

Yeah, I mean, you can see that you have a bunch of clusters here in OCM, but the functionality that we have with Insights at the moment is just a single cluster view, so that's why we use this view. But you can already see that the data I uploaded was processed. You can see that the last check happened just now, or one minute ago. Let's see Insights on the cluster. This is the web console on the cluster, Insights here.
It's not available here, meaning it's gonna refresh any minute now, and I'm gonna see pretty much the same information here. So let's go back to this one in a little while, and let's take a look at the recommendations. This is the first piece of value that you get from Insights Advisor. You can see that by doing what I did, I actually broke two things, both related to the cluster dropping some traffic. The first one is because two netnamespaces contain the same egress IP. We kind of knew that was gonna happen. But there is another one: none of the egress IPs I set for the projects is assigned to any node. So you can go here and take a look at the details. It'll actually give you specific information about what's going on. You can see the names of the projects, or the netnamespaces, that are problematic. You can see the IPs that were set, and it gives you some basic guidance. And there is usually a link for more information, like this one, which will take you to the OpenShift documentation about setting egress IPs. So let's try fixing this.

The documentation alone, right?

No, this one, the duplicate egress IPs. So now, if I saw that recommendation, what I would probably do is take a look at the netnamespaces like I did the first time, and now I see exactly what the recommendation told me. Okay, so I need to change my IP for this project. So a little of the bash history magic again, and let's use... one. Quickly check what it looks like. Right, this should be good now. And again, I would need to wait a couple of hours for another data upload, but I can force it again.

How often does it run by itself?

Every couple of hours.

Yeah, so that's an important thing to note, right? If you're doing a whole bunch of operational stuff, right?
It's probably a good idea to kind of force a run. It might be worth considering whether there is a way to trigger it without having to delete the operator.

That's a good tip for the future.

Yeah, just because you tend to go in and do a bunch of those bulk things at once, right? And then you need immediate feedback on that, not two hours later when you've already gone off to get a sandwich.

Yeah. That's a good point. We should definitely look into that one. And so now you can see that we're one recommendation down, one remaining. The data on the cluster hasn't updated yet, so you can still see the old "not available". So it's refreshing. That's good. In a minute, I would see the same information as in the OpenShift Cluster Manager, just one recommendation. So let's take a look at this recommendation. Okay, simple explanation. Let's see what the remediation steps are. For the netnamespaces I get a link to the documentation, egress IP for the project, and it includes some information about the host subnet. So that's pretty much what I'm gonna do now. I'm gonna fix this issue by following the recommendation and assigning the egress IPs to some host subnets. So, quickly checking host subnets. You can see here that no egress IPs are set for any of the host subnets. So let's quickly check what the... Okay, I'm gonna pick this worker node. That's this host subnet. Subnet... magic. All right, then I'm gonna set basically two different addresses here. I can either do it this way, set some specific fixed egress IPs on the host subnet, or I could use an address range, whatever I prefer. In this case, I'm gonna use two specific addresses. And check. Yeah, both of the egress IPs are here.
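
The two conditions behind the recommendations in this demo (the same egress IP claimed by two netnamespaces, and an egress IP that no host subnet carries) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Insights rule code: the dictionaries are simplified stand-ins for NetNamespace and HostSubnet objects, the `egressIPs` and `egressCIDRs` keys are assumed to mirror the fields on those objects, and the project names, node name, and addresses are hypothetical.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network


def duplicate_egress_ips(netnamespaces):
    """Return {ip: [namespace names]} for egress IPs claimed by more than one namespace."""
    owners = defaultdict(list)
    for ns in netnamespaces:
        for ip in ns.get("egressIPs", []):
            owners[ip].append(ns["name"])
    return {ip: names for ip, names in owners.items() if len(names) > 1}


def unassigned_egress_ips(netnamespaces, hostsubnets):
    """Return {namespace name: [ips]} for egress IPs that no host subnet can serve."""
    # Fixed addresses set directly on a host subnet.
    fixed = {ip for hs in hostsubnets for ip in hs.get("egressIPs", [])}
    # Address ranges a host subnet is allowed to serve.
    ranges = [ip_network(c) for hs in hostsubnets for c in hs.get("egressCIDRs", [])]
    missing = {}
    for ns in netnamespaces:
        lost = [
            ip for ip in ns.get("egressIPs", [])
            if ip not in fixed and not any(ip_address(ip) in net for net in ranges)
        ]
        if lost:
            missing[ns["name"]] = lost
    return missing


# Hypothetical cluster state mirroring the broken demo setup.
netnamespaces = [
    {"name": "project", "egressIPs": ["192.0.2.100"]},
    {"name": "project-copy", "egressIPs": ["192.0.2.100"]},
]
hostsubnets = [{"name": "worker-0", "egressIPs": []}]

print(duplicate_egress_ips(netnamespaces))
print(unassigned_egress_ips(netnamespaces, hostsubnets))
```

Giving the copy project its own address and listing both addresses on a host subnet makes both functions return an empty result, which corresponds to the cluster passing all recommendations again.
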
So, you know the drill. Restarting it. Let's see how fast we are. Refresh. So you can see we're back to "your cluster passed all recommendations", meaning we don't know about any known issues that would be affecting your cluster. And again, Insights is not available here; it will be reflected in a little while. So that's pretty much it. That's, in a nutshell, how you consume this stuff and how you get the value out of the Insights Advisor.

So I'm gonna ask the hard question: what data do we collect?

Yeah, that's a good question.

I mean, I have some docs that can help, but yeah.

Yeah, obviously we have documentation on that, because the data set changes with every version, and sometimes we even backport some of the data collection code. So I would point you to the OpenShift documentation, if you look for the section on remote health. It gives you an overview of what data we collect. And if you want to look at the code of the Insights Operator, obviously it's open source. If you go to our GitHub repo, in the openshift org the repo is insights-operator, we actually have a sample archive which shows you what the data structure looks like, and that will give you a good idea of what we collect.
So before we move on: someone who we refer to as netherland tackham made a joke about rarely ever getting to zero issues on their cluster. So while we're talking, and, you know, because of the lag, sorry, if he could mention in the chat a couple of things he's seen before that he doesn't feel can get fixed or whatever, that might be a useful thing to talk about.

That's a good point. One thing about our recommendations: there is a difference between Insights recommendations and alerts. You will see alerts in your cluster whenever something is out of the ordinary, but many times these alerts are not exactly actionable. They give you some indication that something's going on, whereas with Insights recommendations, we publish a recommendation only when we know specifically what's going on, we can detect the conditions, and we can also give you a recommendation on what to do about the situation, which alerts typically don't have. So you will see that there are many more alerts than Insights recommendations. It is not uncommon for a cluster admin to see a long list of alerts while Insights says everything is, quote-unquote, okay.

Is there one somewhere where it just says "light it on fire and go get a coffee"? Because I think the temptation would be very, very high for me to put that in there somewhere.

I think at that point Insights just hides and runs away, and you will see "Insights is not available".

So, we go back to the connected part, right? What's the point of that? Why is the cluster connected, and what does that mean?
Yeah, so I might try. The basic idea came with OpenShift 4. In OpenShift 3 and before, it was always a bit of guesswork what problems users actually run into. We have the CI system and testing and the QA department and ticketing, customer cases and so on, but it still doesn't give you a good picture of what the real problems are that customers have. At the same time, especially when comparing to managed services and cloud services, this deficiency of not being able to know what problems the customers have was quite high, and people decided we need to do something about that. That's where the idea came from of actually making the customers and clusters connected, which means the clusters sending their health data to the cloud, to cloud.redhat.com to be more specific, where we have pipelines and processes to handle that data and to be able to make better decisions in terms of prioritization of fixes and things like that. So basically, a connected cluster is a cluster that is sharing its health data with Red Hat for further analysis. There are two components providing this capability.
We already talked about the Insights Operator as being one of those. It would be a mistake not to mention the second one, which is the Prometheus/Telemeter project that is sending a selected set of metrics to Red Hat as well. We have a way to then do aggregations on top of this data to be able to actually help the customers, either at the level of a single customer, and there are use cases for support itself to be able to leverage this data without having to ask for a must-gather every time to understand what the issue might be, or for the more fleet-wide views, where we can know what troubles our customers are hitting, so that we can make their lives better by doing better releases next time. So that's basically another use case besides the direct recommendations to the customers, where the customers still benefit from this data even though they might not see it as a direct recommendation for them. For example, it's quite connected also to how we do the upgrade rollouts for OpenShift 4. When we first do a release, it's not available to every customer. We have the fast and candidate channels. Connected to this, when we release a new version, we are monitoring even more closely the clusters that go through this upgrade to see what issues they are hitting. Because even though we do our best to make sure that the things we could think of don't break, there's still a chance that something specific to a customer's environment might make some issues more obvious than we could ever think.
So this is basically one of the cases where, before we roll out an upgrade to a wider range of customers, we are still watching what's going on with the small set, things like that.

By the way, this is how I explain the difference between Connected Customer Experience and QE. The job of QE is to make sure that no issues go into production, in the ideal world, obviously. That is not our job. Our job is to monitor: hey, this happened once, there is a chance that it's going to happen again, and if it happens frequently enough, we need to work fast to make sure that it doesn't affect a big portion of the fleet.

Well, also, one of the things that you're capturing, right, that QE is never going to capture unless you actually change the development model, is operator-introduced issues as well, right? So it's not just that the cluster is in bad shape because of the upgrade. It may be that in OpenShift 4.6 you had these two things, and they were okay, or weren't causing any problems. But then when you upgrade to 4.8, that backported patch that you were relying on, or sorry, that's the wrong way around, that bug you were relying on, and didn't necessarily realize it, you've now introduced a problem, right? My classic example of this is when, I think it was Windows 3.1, no, it might have been later, might have been Windows 98, but when it came out, Civilization, the video game, wouldn't run, I think. And the reason is that they were relying on a bug in the OS to do something for performance reasons, right?
Which is pretty typical for video game companies. So Microsoft actually reintroduced the bug into Windows to make Civilization work. So it goes both ways. But the point is that, actually, I almost failed a test, like a certification test, because the documentation said that this thing worked one way, and I knew from using it that a couple of those things did not work, because there were bugs, and if you tried to use them things would go badly. And I had never read the documentation, right? I had just done the actual thing. So there are a lot of those, right, where as a developer, an operator, whatever, there's some issue that you have a workaround for, or that you don't even know is a bug. And these are the kinds of things that something like Insights is going to catch but QE never will, right? They just...

Yeah, you don't have the data, you know. I just showed you...

Yeah, exactly. But that one's obvious, right? As in, that's just a straight-up error. But there are also these other weird things that'll happen because, like I said, a bug was patched, or because the way something internal is working moved to a slightly different model. So it could be pretty subtle.

And I actually have an example of that. I prepared one of the demos that showcases how this data is actually being treated. So let me try to share my screen.
So as I mentioned, monitoring the upgrades and looking at how they work is one of the things that we are looking into. For example, at some point in time, and this is not fresh data but something that I had a few months back, when we were looking at how long the upgrades take between different Y-stream versions, we saw that for 4.7 there is more variety in the length of the upgrade. You could see this as the length of upgrades in hours, I think, and we saw that there's something going on in there. So the next thing you are actually interested in is what the common problems are, the common reasons why, all of a sudden, 4.7 upgrades take longer than before. So what we did as one of the views is basically looking at some common causes for cluster upgrades taking longer, where, especially in vSphere, we got to this symptom, which is connected to the vSphere problem detector.
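
The kind of comparison described here, upgrade length per Y-stream version with one version showing more spread, can be sketched with the Python standard library. This is a minimal illustration and not the team's actual tooling: the version keys and the durations in hours are made-up sample values, and the interquartile range is just one assumed way to express "more variety".

```python
from statistics import median, quantiles


def duration_summary(durations_by_version):
    """Summarize upgrade durations (hours) per version as median and interquartile range."""
    summary = {}
    for version, hours in durations_by_version.items():
        q1, _, q3 = quantiles(hours, n=4)  # quartile cut points
        summary[version] = {"median": median(hours), "iqr": q3 - q1}
    return summary


# Made-up sample data, for illustration only.
sample = {
    "4.6": [1.0, 1.1, 1.2, 1.3, 1.4],
    "4.7": [1.0, 1.2, 1.5, 3.0, 6.0],
}

for version, stats in duration_summary(sample).items():
    print(version, stats)
```

A version whose interquartile range stands out against its predecessor is a candidate for the drill-down into common causes that the speaker goes on to describe.
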
I don't have a more drilled-down example in here, but basically the gist of it was that many clusters in 4.6 had a problem in their vSphere configuration. Even though the cluster thought that vSphere is managed and that it has the right credentials to do machine config operators and all these kinds of underlying infrastructure operations on behalf of the administrator, in reality the credentials were not set correctly, because the users were not actually using this functionality. At that point the cluster itself was not doing any checks on whether the credentials are correct or not. When 4.7 went out, this check was introduced, and what it caused is that when upgrading to 4.7, the upgrade would basically stop at some point. Luckily it was at the end, but it was still marked as an unhealthy state, where it was complaining, justifiably, that the credentials are not there, so there's something wrong with the configuration. And I can imagine having this conversation with QE, that this is the expected state: we want to let people know that something is wrong. The problem is that from the user's or customer's perspective, the cluster was fine in 4.6, and they were not actually using this functionality. The only thing they noticed was that the upgrade might eventually get stuck when they don't have this configured. So it took some time to have this discussion with our engineering to find better ways to let people know that something is not right, but still not mark the operator as degraded, because that was just too intrusive to the customers and their workflow.
So that's one example that shows how we are actually monitoring the upgrades and looking at the common causes of upgrade failures. But it also shows the difference between how one can look at this from the perspective of engineering and quality assurance, and the perspective of what the customer is actually seeing and experiencing. So I think that covers one of those examples similar to what you mentioned with the bug in Windows, where we need to work with what people were used to and not be too aggressive about fixing things that might be working for them, let's put it this way.

Yeah. It's a weird fine line. And to be clear, the Windows story, I did not hear it from a Windows engineer who introduced the bug, so for all I know it's apocryphal, but it's a great story. So, let's see, what was I gonna say? Oh, so, and Jan knows where this is going: how do you figure out what the rules are? In other words, are you, your team, other people, whatever, sitting down with the data and looking for issues and then looking for recommendations? Or, like I said, Jan knew this was coming, are you introducing some AI here, some machine-learning-type models, to try to predict what kinds of errors you might be getting?

So I'm gonna start with the simple ones.
I would say it's actually all of those combined. The way we treat rules is that it's encoded knowledge of our support and product engineers. That's the value of OpenShift compared to Kubernetes, same as it is for RHEL compared to CentOS, for example, because we literally encode our knowledge into those rules. So we have nominations coming from support engineers. When they see frequently appearing support cases, they nominate them for automation. They themselves use an internal deployment of Insights Advisor, where they have many more rules than what the customers see. Some of the rules are not as descriptive, but they still help support, right? That's one source. And there are multiple other people who engage with customers and then bring back the knowledge and give us tips, like, hey, we should get this recommendation or that recommendation. So that's the trivial source of nominations, but then...

I wouldn't characterize it as trivial. That's actually very hard data to get. But it is simple, perhaps.

Yeah, yeah. It is simple in the sense of non-technical for us engineers. You don't need to code to get that information.
Sometimes that's harder than coding, but whatever. Right. But yeah, then we have other sources, and I'm gonna let Ivan talk about those other sources, like you said, some analytics and whatnot. Yeah, so we might get more into the AI and ML stuff. One thing is that me personally, and many people with more practical experience with the underlying technologies, we are usually a bit more skeptical about seeing AI and ML solving all the problems just like that. And that actually also confirms our experience so far, where the ML part is the cherry on top of all the work that needs to be done before it can be applied. We still want to leverage the data that is coming in, but there is a long path from the raw data to something that can be plugged into machine learning models and benefit from that. So the first thing that we need to do is actually clean the data and have the right views on it. Even though some of this data might be queryable with PromQL, people start realizing that running PromQL on larger data sets, when you have too many time series coming in, is not as trivial as if one had a single-cluster deployment and Prometheus on top of that single cluster. So one thing that needs to be overcome is basically finding a different way to store this data and allow cleaning it so that you have the right views.
So once you have that, you can also start getting more people involved outside of just the data analytics itself. In our team we have a dedicated group that is more focused on looking at this data and trends and trying to do recommendations for recommendations, for example, finding what would be the best advice to choose and refine, so that more people can help with that. But even better is to get these views to the people that actually understand OpenShift better than we do, which are either engineers in OpenShift core, or the support engineers working on the cases, or our SRE engineers working on OpenShift Dedicated. If we are able to get their eyes looking at the data that we have in some understandable way, that's in my experience more beneficial than trying to apply some machine learning models right away and finding some nice information that, without any interpretation, still doesn't give you any value. So at some point, I think, we are even acting like translators between the engineers working on OpenShift and the data scientists that we have within Red Hat, who are experts on working with the data. For example, they have a different language, or even worse, they use the same words for completely different things. If you're talking about clusters and features with an OpenShift engineer, they see some new buttons to be added to the deployments. When you talk about clusters and features with the data scientists, they see clustering algorithms and selecting the parameters that you feed in. This is a simple example, but there are many of those, where effort needs to be put into understanding both worlds so that the benefits can actually be derived at the end.
So that's something that we try to help with: basically helping these different teams work together and get the benefit out of the data. I have some examples as well of specific things where we are applying the more advanced stuff, you could even call it machine learning, so I might share the screen again. I just wanted to comment that one of the things I am most proud of at Red Hat is introducing the phrase "words are hard" at Red Hat. And by extension there's also "words are hard in English", which is this problem, right, where we have a lot of synonyms, there's a lot of jargon, and when you mix fields together those jargon conflicts really start to show up. But yeah, that's interesting. I hadn't really thought about the word feature, because I've also played on both sides of that. In AI, ML, data science, a feature means one thing. In software engineering, a feature means something completely different, which is kind of entertaining. Sure. Yeah, so what we are looking at is some views of what we call symptoms, because you can have either alerts that the clusters are sharing, or you can have degraded operators, and different reasons for that. You can have the Insights rules, things like we've seen with egress, that the clusters are also reporting. So we unify these different data points, call them symptoms, and assign them an ID that we can work with further. In this particular case, it was the monitoring operator marked as degraded, and what you can see in this graph is some spikes.
So on the x axis we see the different z-stream versions of OpenShift, and on the y axis it's a hit rate, in terms of how many clusters are seeing this particular problem in a particular release. There is a lot of work to be done in order to get this view. If you have the raw PromQL, there is always time on the x axis, so switching it in a way that you are not looking at the time series from the perspective of time but in terms of different versions is some work, and somebody needs to do that work. And then the second thing is either putting this in front of the eyes of the subject matter experts or, if the data allows it, trying to do some auto-detection. At least in this example it seems that it should not be that hard to find this kind of spikes, or signals within the noise, that we should notice early and act on before they start causing problems for many more users. So here is a visualization of how it looks in the real world, where the red line is basically defining a threshold based on the previous history of that particular trend, and if it crosses some boundary, then we mark it as an anomaly and something that people should be notified about, so they start looking in more detail into the root cause of that problem. It might be an alert or a degraded operator; there might be more root causes behind it. So it still takes some time after the algorithms notice it, but being able to notice this quickly is the first step.
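The two steps described here, re-indexing the data by version instead of time and then flagging values that cross a history-based threshold, can be sketched roughly as follows. This is a minimal Python illustration, not the team's actual algorithm: the function names, the tuple layout of the reports, and the `window`, `k` and `min_margin` parameters are all assumptions made for the example, and the threshold rule (mean plus k standard deviations of the preceding versions) is just one simple way to derive a "red line" from previous history.

```python
from statistics import mean, stdev


def hit_rate_by_version(reports):
    """Turn raw reports into a per-version hit rate.

    reports: iterable of (cluster_id, version, has_symptom) tuples
             (a hypothetical layout chosen for this sketch).
    Returns {version: fraction of distinct clusters on that version
             that reported the symptom at least once}.
    """
    seen = {}  # version -> set of clusters observed on it
    hit = {}   # version -> set of clusters that reported the symptom
    for cluster_id, version, has_symptom in reports:
        seen.setdefault(version, set()).add(cluster_id)
        if has_symptom:
            hit.setdefault(version, set()).add(cluster_id)
    return {v: len(hit.get(v, ())) / len(clusters) for v, clusters in seen.items()}


def detect_spikes(versions, rates, window=4, k=3.0, min_margin=0.01):
    """Flag versions whose hit rate exceeds a threshold built from history.

    versions: list of version strings, already in release order.
    rates:    {version: hit rate}.
    The threshold for each version is the mean of the previous `window`
    versions plus max(k * stdev, min_margin); min_margin avoids flagging
    tiny wiggles when the history is almost flat.
    """
    spikes = []
    for i in range(window, len(versions)):
        history = [rates[v] for v in versions[i - window:i]]
        threshold = mean(history) + max(k * stdev(history), min_margin)
        if rates[versions[i]] > threshold:
            spikes.append(versions[i])
    return spikes
```

Note that the caller passes the versions in release order explicitly, because sorting version strings alphabetically would put "4.7.10" before "4.7.9". A flagged version is only a starting point: as said above, someone still has to look into the root cause.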
So even this is not that obvious. It's obvious in this example, but when we actually start applying these algorithms against every single symptom that we observe in OpenShift, you start noticing some things that are triggering the algorithms to send a notification, but in reality it's not a problem. One needs to have some understanding of how OpenShift works and what the behavior of, for example, alerts is when an installation or upgrade is happening before one can actually apply it. So there's still more work on the cleanup side than on the machine learning algorithms themselves. I don't know how much stress I can put on this: cleaning the data and making sure it makes sense is much more important than tweaking the algorithms and applying the latest machine learning and neural networks on top of the data, because usually even simpler regressions might work quite well.
So, for example, in this case of the spike detection, what was causing trouble is this: whenever there's a new release, we tried to notice symptoms that have a higher hit rate than what we were used to in the past. And one of the things that happens is that there are usually two ways to get to a specific version of OpenShift, either via installation or via upgrade, and both of these processes cause some intermittent alerts to be triggered, because nodes are getting restarted. It's just part of the installation or upgrade process that things are not upgraded all at once, atomically; there's still some time where things might not be as expected, and that usually causes some alerts to be triggered. So when looking at the data from this perspective, we always got quite high hit rates on the fresh releases when they got rolled out. In order to clean this up before applying the machine learning, you need to exclude the times when the upgrades were happening, or you can focus on symptoms that are happening for more than one day for one cluster and take the trends from that, which eliminates some of the noise, and then you get a much more reliable source of data to do this kind of analysis on top of. So it's really interesting work, and a lot of detective work as well.
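The cleanup step described here can be illustrated with a short Python sketch. It combines the two ideas mentioned, dropping observations made while a cluster was installing or upgrading and keeping only symptoms that persist for more than one day on a cluster. The function names, the data layout and the way upgrade windows are represented are assumptions made for the example, not how the team's pipeline is actually built.

```python
from datetime import datetime, timedelta


def in_any_window(ts, windows):
    """True if timestamp ts falls inside any (start, end) window."""
    return any(start <= ts <= end for start, end in windows)


def persistent_symptoms(observations, upgrade_windows, min_duration=timedelta(days=1)):
    """Keep only (cluster, symptom) pairs that look like real, lasting problems.

    observations:    iterable of (cluster_id, symptom_id, timestamp) tuples
                     (a hypothetical layout chosen for this sketch).
    upgrade_windows: {cluster_id: [(start, end), ...]} periods in which the
                     cluster was installing or upgrading.
    Observations inside an upgrade window are ignored. A pair is kept when
    the time between its first and last remaining observation is longer
    than min_duration. This measures the span only; it does not check that
    the symptom was present continuously in between.
    """
    first_last = {}
    for cluster_id, symptom_id, ts in observations:
        if in_any_window(ts, upgrade_windows.get(cluster_id, [])):
            continue  # intermittent noise expected during install/upgrade
        key = (cluster_id, symptom_id)
        lo, hi = first_last.get(key, (ts, ts))
        first_last[key] = (min(lo, ts), max(hi, ts))
    return {key for key, (lo, hi) in first_last.items() if hi - lo > min_duration}
```

Hit rates computed only from the pairs this returns should be less inflated on freshly released versions, which is the effect described above.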
So it's not just about looking at the data; many times we need to dig deeper into what the particular problem means and what the dimensions of it are within Bugzilla or customer cases. Once we learn how to do it manually, we try to apply some automated ways, and that's when it's time to introduce some AI and ML stuff and make better use of it. Before that, I would never recommend jumping directly to ML without understanding how the data works and how people work within the particular problem domain. So that's pretty much my experience with the ML stuff. So, I think it's funny, right, I don't think it's a catchphrase yet, but it really should be, which is that 50% of any machine learning project is data cleanup. Maybe it's 80%. It's a classic challenge. One thing I want to do is take a break here for a second, and we'll talk about our sweet, sweet internet points, just a reminder about that. But then I had one more question for you all when we get back, hopefully with a lot of time. So let me share the screen again, and I will explain what we're talking about as well. Oh, that is not what I meant to share at all. All right, so these are our sweet, sweet internet points. We like to give out points for... oh shoot, that's the wrong code. So I will also get the correct code; don't use the one on the screen. But actually, hey Chris, why don't you explain the sweet, sweet internet points? Well, I will happily do that while you get the codes.
So our sweet, sweet internet points are essentially awesomeness and a point thing. There are ways to get points, anywhere from watching the episode and just submitting the code for points, to maybe opening a pull request or issue against the repository that was linked earlier in the show today, which I will drop another link to as soon as I'm done talking. So participation basically equals points, and using these codes, inserting them into the form, will get you your sweet, sweet internet points, which we hope I have corrected, or I'm about to throw in the chat. Yeah, hit refresh or something. The points right now are bragging rights, but they will have an intrinsic value of some sort in the very near future. Extrinsic, sorry. Jesus, I keep getting this wrong: extrinsic value. Even in a hack and caught that it was last week's code. Wow. Good. Yeah. So the idea is build up your points; eventually they'll be worth real stuff, or you can exchange your points for real stuff, I should say. So Langdon, thank you for dropping the codes in there. I will grab the repo; if you want to brush up anything I said, feel free. No, I thought that was pretty good. So there's an activities markdown page on the episodes repo which gives you some ideas for how you can participate to earn your points. As you probably can't tell, some of the leaders on this board have not just submitted episode points, so there are other ways to do it. Please check those out, and that way you can catch up to the leaders. Obviously, if it was just watching all the episodes, that wouldn't necessarily be enough, because you'd only have 3,900 points, right? Well, there's escalators. Oh, there's escalators.
So at periodic points, which I can't remember off the top of my head, when you have a certain number of episodes, you actually get a bonus set of points for having collected that many episodes. So it was just a way to make participating in the show a little bit more fun. And normally Red Hat is very good at producing prizes for things. For some reason this particular show has been cursed. Yeah. We've gone through multiple vendors, we've gone through various Ts and Cs and challenges like that. We believe they're almost out; we're not quite sure why we haven't gotten over that last little hump, but we're so close. As you can see, we haven't even gotten this swag, right, and that should tell you a lot, because normally they're Johnny-on-the-spot with the folks that are doing the thing. Yes. So the last question I wanted to ask you: your examples so far have mostly been about problems. Are you currently doing, or planning to do, things where it's more like improvements? In other words, I'm an average OpenShift admin, I know to set up things the way the docs tell me to, right? But because I have this particular confluence of events, I'm running 32 web servers and six databases and PHP 37, I don't know, because of that combination of things, if I tweak this variable over here, I will get overall better performance, let's say. Are you looking at stuff like that, or are you trying to cover the bread and butter first, in a sense?
Yeah, all at once. I mentioned in the beginning that performance, for example, is one of the four categories that we pursue. In some cases that can be performance problems, in other cases that can be performance optimizations. We're no different from Insights for RHEL in that way. I remember one specifically, and it's really specific, where there was a recommendation, and still is: when you have a certain RHEL setup and you run an Oracle database on top of that RHEL setup, well, these licenses are expensive as hell, right? And in many cases the admins run them in a non-optimal way, and thus they have to pay even more for licenses, because they need more hardware and whatnot. So we give recommendations about optimizing their setup to increase their performance. That's what we do across the board, both on RHEL and on OpenShift. I'll say that coming up with these improvements is much more difficult, so you will not see as many of them as you see with the problems. Because, like I said, one thing about Insights recommendations is that they need to be specific. They need to tell you: this is what you do now. Just pointing out, hey, you have a performance problem here, that doesn't help much, right?
And that's part of the reason why it's more complicated, but it's definitely on the table. Yeah, and another thing I would like to mention is the collection of the data. We don't want our users to feel like we are spying on them, and the more you get into workload-specific things, the more this risk might be felt. So we need to be even more cautious and make sure that we collect just enough data to bring value back to the users. Especially going into the workload-specific stuff, there might be more and more specific things that would need to be collected before we would be able to do this kind of recommendation. One thing that is probably next on the table, when looking beyond the OpenShift core operators, would be getting the health data from the OLM operator space, which would basically allow giving more recommendations that are specific to some operator. Especially when combined with the health of the cluster itself, this can lead to some recommendations that are more workload specific. But right now, the first thing that people should do is keep their clusters updated. That's the best way, or one of the best ways, to get the most out of your OpenShift clusters, I would say: having the latest version, because you have all the fixes and improvements that got in over time. That's why we want to make sure that the upgrade path and the health of the clusters are safe, so that it's a no-brainer for people to be on the latest release, which eventually also has an impact on the workloads running on those clusters. One of the first things people get recommended when talking about an issue in OpenShift is: hey,
Try to upgrade to the latest version. Many times, it can be the thing that fixes the problem. So part of our job is basically to make sure that the upgrades are as safe as possible, and another part is letting people know that when they upgrade, there are multiple improvements that might actually be specific to their particular deployment. So we have a couple of recommendations that tell you specifically: we see you have this problem, we know it's fixed in release X, we recommend upgrading. So that's the main point right now. As I mentioned, yes, the workloads are something that we would like to get to at some point, but it's much more complicated than just deciding that we will do this as the next thing. There are more things to be considered with that. And that's not my problem that it's more complicated. It's like, just get on that, that's what I need. Yeah, so it's on the table. Yeah, yeah. So related to that, another thing that I'd be interested in, and we're starting to run out of time, is: how interesting is it if I, as a single end user, customer, whatever you want to call it, have multiple clusters all reporting into Insights? Is there interesting stuff that happens there? Maybe you still need to get into workloads, whatever, but can you start to make recommendations like, oh, you're running 80% of your utilization on one cluster and you have four of them that are bored, you might want to go fix that?
That's like a simple example. Yep. Yeah, so, well, the first thing that needs to be there: what you saw in my demo was a single cluster view, but what Advisor for RHEL has, and we don't at the moment, is a recommendation view, like, hey, here is a specific recommendation and here is the list of clusters that are affected. That's currently a work in progress. And yeah, that's where you start. I don't know, anything can happen. We have tons of ideas. We have a very creative product manager, I'll say that. Yeah. And Matthew also mentioned that we are not the only team working with this and making sure that the most is leveraged out of the remote health data. So even though it might not be in our realm to solve this particular problem of, say, resource optimization, there are other teams within Red Hat that are probably better to speak to in terms of the cluster-wide views and recommendations on that. I guess even people from Advanced Cluster Manager would have things to say and features on that particular problem. So, as I mentioned, we try to be the catalyst for leveraging the data for the benefit of the users, but it doesn't mean that our team would be providing all the features that the customers would be... By way of explanation, the product manager is radical, right? Yeah, so their product manager used to be my boss, so that's why. Mine too. Yeah, right, exactly. So, anything else we should cover quickly, Chris? Do we have any
pending questions, aside from "is Perl still in active use", which it's still very much? Yeah, no other questions really, except for that AI/ML one, which we kind of handled, right? By the way, what you saw Ivan presenting on the spike detection, just to give you an idea how fresh that is: that's being worked on right now. A lot of team members worked on that very algorithm today, so that's fresh out of the oven. Yeah, as we often disclaim on the show, we often are showing things that may or may not be from the future, because we're trying to show you exactly what we're working on, what we're thinking about, all that kind of stuff. So please take anything you see with a grain of salt. These are the ideas we're going down; that doesn't mean we'll actually ship it, for whatever reason. I would be remiss if I didn't say please like, subscribe, and share wherever you're following, or wherever you're watching us from, I should say. We need our catchphrase one of these days. I know, we need to come up with something, even corporate coffee somehow, I'm not sure. Yeah, I need an infusion of coffee. Yes, as do I; my cup is empty. So yeah, thanks so much, y'all. We really appreciate people coming by and talking about what they're doing. And I think this is kind of the next level after what we've been talking about, right? You start using containers, then you realize you need them orchestrated, and then you start to ask, what can I do to optimize that environment that they're running in? And that's where Insights takes its place. And personally, I think that whole cloud.redhat.com aggregated experience is getting really cool. There's lots and lots of stuff there that is really useful.
So if you haven't checked it out for any reason, you should go and look at that in general. Thank you for inviting us, Langdon. It was fun. Cool. Awesome. Been a blast. Thank you. All right, so coming up later today on the channel, we have the one and only Andrew Sullivan with Ask an OpenShift Admin, coming up here at 11 Eastern, 1500 UTC. We'll be talking about authentication and authorization, so tune in for that. I'm excited for it, believe it or not. Just block everyone. No one has any reasons. None of you needs access to this network. As a former developer: just give me admin and nothing else. All right. Yeah, so as bacon fork mentioned, stay cool out there, folks. I know here in the northern hemisphere it's been just absolutely brutal in some locations. So yeah, stay safe out there, folks, and we'll catch you next time. See you.