Good morning, everybody. I appreciate so much that you're here for the first talk of our last day, so thank you for coming. I'm Evelyn Gomez, from Red Hat, a support engineer for OpenShift Dedicated and OpenShift, and I've been doing this for the last five or six years. As a support engineer on OpenShift I basically eat issues for breakfast every day, so over time you start to see a few trends, common issues that show up in particular situations.

So why "Sherlock", and why that name for this talk? Because the detective's chair is where I need to be when troubleshooting issues, and it's also the mindset I need to help our customers develop themselves when they are the admins of an OpenShift cluster, or for that matter of any Kubernetes cluster. This talk is actually quite agnostic: whatever I talk about here, similar situations happen across the Kubernetes world. I see them a lot in our community, in Telegram chats, on Reddit, wherever there is a Kubernetes cluster running.

The specific situation we're going to cover here is troubleshooting issues, and in particular performance issues. These kinds of issues happen at a particular moment. A Kubernetes cluster in its first moments of life is quite simple: you can see everything, the pods are simple enough, the microservices are still not, let's say, messy, and it's easy to visualize everything. The problems start when it becomes the busy picture: you have a lot of workload, you have network policies, you have automation, which is great, and it's actually the natural evolution of a Kubernetes cluster; you want to get to that point. But problems usually happen in the middle of this transition. So what I hope to bring in this talk is how to troubleshoot, or even prevent, issues while we're going from a little cluster to a very busy production cluster, where you're doing nice things with GitOps and Tekton and pipelines, whatever you want to work with in the long term.

As I was saying, there are a few trendy issues when we go through this transition. One is how to troubleshoot issues or outages on the fly, because when you have a big cluster in the middle of this transition and you hit an issue, it can become a bit more difficult, a bit more complex, to pinpoint where it is or how to trace it. Another is what to keep an eye on: there are a lot of great monitoring tools, but there are a few specific things I'd like to highlight that, if you're administering a Kubernetes cluster, you want to make sure you understand, including how they behave over time. And we'll also talk a little bit about tools that facilitate problem solving in the long term, because when we're running a Kubernetes or OpenShift cluster we don't want to have to recreate it; we want this long-term cluster up and running, with maintenance from time to time of course, and for that we also need a few specific tools.

So, basically, our agenda. We'll talk a little bit about the background.
We'll talk a little bit about the detective mindset, which is specific to my field: when issues arise you need to handle them on the fly, and that's not only me, it's anybody who is the one administering the cluster. You have to have that detective mindset to help resolve the issue as soon as possible. Then we'll talk about our crime scene, which is of course a use-case scenario, a real story. We'll go through a few tools that can be helpful and the relevant information to collect, first from the application side, and then from a cluster perspective. Throughout the talk there will be a lot of links for later reference; I plan to make the slides available on the conference website for everybody. I won't cover every little aspect, but I made sure there are links so you can refer to them later.

So, the detective mindset: what is that? It's nothing but human nature, when we have an outage of an application, or for that matter a cluster outage, to react, go troubleshoot, and find out why, what the culprit is. But I'm here to say: don't panic. That really is helpful, because when you're in the middle of an issue you actually need to take a step back and take notes, and there are three specific things I'd like to highlight as important to write down.

First, the timestamp and the time zone. This matters especially because Kubernetes, and OpenShift in particular, always runs in UTC, while the teams we collaborate with across an incident are in different regions, and it's very easy to get confused by that. So whenever you get a report of an issue, make sure you capture the time zone together with the timestamp. It really helps in the long term when you need to produce a postmortem or an RCA. It's very easily overlooked, but it helps so much when you need to do a proper RCA or cross-reference a lot of data. There was one time I was troubleshooting an issue with autoscaling, and I was able to find a bug in the code just by cross-referencing timestamps in the logs. The problem was that the timestamp given to me at the first minute was in a different time zone, not UTC, so I spent a lot of time looking at the wrong set of data and metrics. Keeping this in the back of your mind when you collaborate with your team is especially important, and it avoids a lot of confusion.

Second, the node name and the pod name. This is also great information to have, if you can grab it on the fly while you're resolving the issue. Again, I'm focusing on the postmortem approach: if I have that information at hand, I can quickly check what happened and correlate where the issue possibly started.
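As a rough sketch of what grabbing those notes can look like in the heat of the moment (the namespace here is just an example):

```
# Current time in UTC, so it lines up with cluster and log timestamps
date -u

# Pod names, node names, restart counts and ages in one shot
kubectl get pods -n my-app -o wide

# Recent events sorted by time, to anchor the incident timeline
kubectl get events -n my-app --sort-by=.lastTimestamp
```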
The third and last part of the detective mindset, which I see as very important, and this is not only me, you can see the same approach in the Google SRE book, is: stop the bleeding first. It's easy to get into the mindset of "we have an issue, we need to find the RCA, we need to resolve it, but what is the culprit?" The thing is, if we don't resolve the issue first, our customers are really losing money while we dig, so sometimes we just need to restart the service. That doesn't make the RCA or the postmortem less important, of course not, that's still the ideal to have. But this approach matters in the long term, and the RCA is much easier to achieve afterwards if you have the proper tools in place in the first place; we'll talk more about that with our examples. In my job I talk a lot with our customers, sometimes even with their managers, and it can be a tricky conversation, because we need to answer to a lot of stakeholders: our customers, the customers of our customers, everybody. But as long as we have the observability tools in place from the start, this is just fine and much easier to handle.

That said, with these three things in mind, let's take a look at two real outages that happened. The first one was in the application layer. In this case the team did stop the bleeding, they resolved the issue first, but they didn't have the RCA at first. So let's take a look at our evidence.

When we were tackling this issue we saw three specific things. One, the application was not serving requests, and this was unpredictable; we couldn't really know how or why. Two, from time to time a pod would crash-loop: it was being OOMKilled. The deployment had three replicas, so the problem was widespread. Those were the only pieces of evidence we had. And how did we solve it? We solved it by increasing the pods from three to six and by implementing autoscaling. But let me go back to this slide, because this was a big outage: the application was not serving the requests our customer was expecting it to, and we didn't know how or why. We simply resolved it fast, by increasing the pods, because we saw, well, there's some memory issue, it's being OOMKilled. But that doesn't actually answer why this was happening. The bad part is that, because there were no metrics in the application, we could not see what was really causing it. So this is a good example of a case where we can resolve the issue, but without the proper tools it gets hard to find the RCA.

Let me talk a little bit about long-term solutions for when there is a big focus on the RCA, especially if your business has that kind of criticality policy. One of them is an aggregated logging system. In OpenShift this comes pretty much out of the box, but there are also fully open-source stacks you can attach to any Kubernetes cluster. The nice part is that you can extend them: you can collect all the pod logs, the infrastructure logs, and the audit logs if you need them. And we want that, because if a pod fails, just like it happened here, we can actually trace back.
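Even before a logging stack is in place, a minimal sketch of that trace-back for an OOMKilled pod could look like this (pod and namespace names are hypothetical, and kubectl top assumes a metrics stack is running):

```
# Why did the pod restart? The last state should show OOMKilled / exit code 137
kubectl describe pod my-app-7c9d8-xk2lp -n my-app | grep -A 8 "Last State"

# Logs from the previous, killed container instance
kubectl logs my-app-7c9d8-xk2lp -n my-app --previous

# Current memory usage, to compare against the configured limits
kubectl top pod -n my-app
```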
With an aggregated logging system in place, I can go and see what happened to that pod in the past, try to correlate the events, and again trace back timestamps and things like that.

Another thing that's super important is application metrics, and this is where OpenTelemetry is really nice. There will be a talk on it, I think later today, and I very much recommend keeping an eye on it. OpenTelemetry is a big open-source project and it helps so much to have application metrics: once you have the data from your application, you can actually prevent, or predict, certain issues, especially if they are load-related.

A third long-term solution is load testing. This is commonly overlooked, and I feel it matters now more than ever, because we're in such a fast-paced world where we need to deliver and update our applications all the time. Load testing gives you better performance in the Kubernetes world because you can predict the behavior and set requests and limits appropriately, and that in turn prevents outages and unavailability issues for the application.

The last one I'd like to highlight here is kube-linter. This is a super cool tool, also open source. It's a static analysis tool that you install and run against your deployment YAML as a developer. With it you get cloud-native recommendations: do you have resource limits set? Are there any tolerations or taints you need to take care of? How many replicas do you have? If you have one, kube-linter will say, okay, you should probably have three. And the most interesting part of this particular tool is that it's highly customizable, so you can apply your own rules to it and have all your developers following the same template. It's a very powerful tool and it helps prevent a lot of issues. For example, if a node has an outage and you had pods running on that node, those pods will be evicted; if you only have one replica, and that's a very common scenario, you're in trouble, when you should really have at least two or three, ideally five. Those are all issues that can be prevented with this set of tools.

This slide is a demonstration from Kibana. If you ever deal with Kibana, I personally like it a lot. It's easy enough to navigate; it shows the hostname and the pod name, of course, like every logging system will, but I particularly like the Kibana interface. And this one is a picture from a Quarkus application, because Quarkus exposes metrics natively. What I did here, and this is actually on OpenShift, was create a ServiceMonitor so Prometheus could scrape it, exposing the Quarkus metrics directly in the OpenShift monitoring system. This is a good example of being able to see whether requests were actually increasing and what was really happening at the application level, because sometimes all we have is resource usage, memory and CPU, and that doesn't tell us the whole story. Having things like this really helps in the long run.

Now, things to keep an eye on in your deployments. The first is pod capabilities. I would call this a hidden issue, because a pod can be granted a whole set of Linux capabilities, some of them really dangerous ones, and these are configurable.
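To make sure none of those capabilities surprise you later, here is a small sketch, assuming jq is available, that lists every container that explicitly adds Linux capabilities (NET_ADMIN, SYS_ADMIN, and so on):

```
# Print namespace/pod [container]: [added capabilities] for every container that adds any
kubectl get pods -A -o json \
  | jq -r '.items[] | . as $p | .spec.containers[]
      | select(.securityContext.capabilities.add != null)
      | "\($p.metadata.namespace)/\($p.metadata.name) [\(.name)]: \(.securityContext.capabilities.add)"'
```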
So one thing I would say is that you need to make sure that, if your pod has one of those capabilities, you know about it. If you have a DaemonSet, or any pod, with that kind of capability, it will likely cause an issue in the cluster some day, so just make sure you know which pods have those capabilities. It's an issue that happens often, especially because those capabilities let the pod talk at the kernel level, and in Kubernetes they are known to cause a few issues depending on how the pod behaves.

The second thing I'd say is important to keep an eye on is the quality of service, the QoS class. What is the QoS class? Sorry, I missed putting a picture of this on the slide, but the QoS class is a field on the pod, derived from the requests and limits you set on its containers, and it tells you whether the pod is Guaranteed, Burstable, or BestEffort. In summary, it tells you what the priority of the pod is if the node gets into an overcommitted situation. This is also very often overlooked, but it's something you want to know about your pods, especially your most critical applications: what is their quality of service? If I hit an issue on the node, would my application be the one guaranteed to stay on the node, avoiding another outage, or at least avoiding the loss of one replica for a while? Those are things I would recommend keeping an eye on in your cluster.

The second crime scene is a scenario that happened with a cluster, and we had a few pieces of evidence for this issue. The first, and I think the most perceptible, the most noticeable, was that the cluster was getting slow: either when you run kubectl commands you get a slow response, or when you open the console it just doesn't work quite well. That was one piece of evidence. Second, no garbage collection: when spinning up a pod with a deployment, the pod would be deployed, but when I deleted the deployment, the pod was not going away. So I actually discovered we were having garbage collection issues, and that is handled by the kube-controller-manager. That was the second piece of evidence on this cluster. Third, the etcd pods were flapping up and down; I could see the availability of etcd going up and down, and that's never a good hint in your cluster. You don't want a Kubernetes cluster having unavailability on etcd. And when checking the etcd messages, I was seeing "request took too long" in the logs. When we're having issues along with etcd, this message usually means etcd is performing badly, but not only that; you can go many ways with this, but the main hint is that something is wrong and behaving poorly. It could be because of a slow disk; that's a very common situation in Kubernetes, because depending on which type of storage disk you put underneath, etcd can behave poorly. But again, going back to our evidence, those are the ones we had for this issue. Another piece of evidence was that the kube-apiserver was going up to 40 GiB of memory usage. That's a lot, and when talking to the customer and really looking at the context of the cluster, it was not expected; it was just too much, taking a lot of memory from the control plane node.
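As a rough sketch of how I would confirm that kind of evidence on an OpenShift cluster, assuming the metrics stack is available; the namespaces are the OpenShift defaults and the etcd pod and container names are just examples, so adjust them to your cluster:

```
# Is the kube-apiserver really using that much memory on the control plane?
kubectl top pods -n openshift-kube-apiserver

# etcd health and database size, from the etcdctl container of one etcd pod
kubectl exec -n openshift-etcd etcd-master-0 -c etcdctl -- etcdctl endpoint status -w table
kubectl exec -n openshift-etcd etcd-master-0 -c etcdctl -- etcdctl endpoint health
```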
So what is happening here? Running a simple grep looking for errors, we actually found over 37,000 error messages about a specific operator, and many of those messages, not all of them, were about an issue with a certificate that could not be parsed, something about a bad PEM block. And because Kubernetes is so great and we have a great community, there was actually a very similar issue already reported, for a different operator, but it was reported, and that's why I love Kubernetes, it's a very vocal community. What happened in that specific situation is that in the CRD of the operator installed there, there was a dummy value, and when the kube-apiserver tried to process it, it couldn't accept it, and the resulting spam was overloading the kube-apiserver, driving it to that high memory usage. So the way we resolved it was basically to remove that dummy value, and that instantaneously made everything work again; the kube-apiserver memory went down. There was one piece of evidence I missed putting on this slide: there were over 400 projects stuck in Terminating, just hanging there, and once we resolved the issue with the kube-apiserver it all went away. The cluster started to behave better, there was nothing hanging, and etcd normalized again; it was not going up and down from one moment to the next. This is something I like to highlight: the kube-apiserver, the core of Kubernetes, is something you really want to make sure is behaving properly, and not only that, you want to really understand the routine of your kube-apiserver.

So the long-term solution for this kind of performance issue is specific: it's cluster metrics. You want to make sure you have them and that you understand what normality looks like on your cluster. Kube-apiserver performance metrics are great metrics to have, and they are available with Grafana and Prometheus, which for any Kubernetes administrator are pretty much the best friends, right? etcd metrics are super cool too, because etcd actually exposes metrics; in OpenShift you have that by default, you get an API performance dashboard and etcd metrics dashboards out of the box.

And I want to add here, very quickly, a little bit about Velero. How many of you know Velero? All right. Velero is a tool for backing up Kubernetes resources, and especially persistent volumes. It's not really related to performance issues, but I wanted to present it here just because it's so powerful and it helps so much in case the worst happens, in a disaster recovery scenario. You want to make sure your applications are properly backed up, and Velero is super easy, super simple, and very powerful. In OpenShift we actually have Velero with OADP, which is another operator that runs Velero behind the scenes.
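Just to illustrate how simple Velero is once it's installed (the namespace is an example, and backing up persistent volume data also needs a snapshot location or file-system backup configured):

```
# One-off backup of an application namespace
velero backup create my-app-backup --include-namespaces my-app

# Check that it completed
velero backup describe my-app-backup

# And keep a daily schedule around, just in case the worst happens
velero schedule create my-app-daily --schedule="0 3 * * *" --include-namespaces my-app
```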
Those are good things to To really monitor and understand how it goes up or how it goes out if there are any spikes or if there is not lot of Closer performance issues when they are growing as I was saying before It mostly related to those components QB API at CD Maybe the control plane is too little Maybe the the nodes are too little and it's some resizing But in order to predict that you can actually like refer to such such metrics a very cool comment that I that I use In case you don't have like at CD Metrics is this one you actually can see the amount of objects in that CD That are stored in that CD database and this is quite cool, you know, especially if you If you're trying to call to build a baseline You actually can if you and you notice again closer performance issues Sometimes you can easily see their response like on this and you will see like I don't know maybe 30,000 events 2000 config maps You know any odd behavior like that It's probably a word to investigate Thank you. That's it that I have to share Any questions maybe Okay, so the question was if one of the Possible solutions was to report a PR To the to the operator to the better operator a patch. Yeah, or submit a patch. Yeah, possibly I would say that we need like at least a little bit more or investigation or at least try to reproduce But yeah, like a long-term solution, of course, we want to report that issue about the operator Can you say that again? All right. Okay Yeah, so the question was if I know It was Mechanism in at CD and what it calls it that right so in that scenario what happened is because The issue actually lives in QBA API server and because QBA API service is the only one that talks about CD It was at QBA service. It was just too overhead And it was not able to communicate with that CD properly So it's like an overhead of an overhead and that's what makes like at CD to start to be a little bit Not stable, you know, but it was like the main corporate is QBA API server because QBA It talks with every component in the cluster, but he's the only one that talks with at CD So like it's like the bottleneck Did I answer your question? All right, I think we can wrap. All right. Thank you so much for coming