Okay, I'm going to start. Hello everybody and welcome to our talk. My name is Roman and I'm a Technical Solutions Engineer at Google, and today I'm going to be talking about Kubernetes support at Google Cloud. I just said I'm a Technical Solutions Engineer, but that's just a fancy name for a support engineer.

So why are we all here? We all know that Kubernetes is a powerful container orchestrator; that's why we're here. Kubernetes is turning ten years old this year, and the first KubeCon was almost nine years ago. After all this time we are long past the demos, long past the POCs; many of you have already adopted Kubernetes and are using it in production. But with all of the benefits Kubernetes brings, there also come challenges. Let's face it, Kubernetes is not a magic wand; it also needs some love and attention. But fear not, that's where people like me and my colleagues come in to save the day.

Let's take a look at some trends. Out of all the Kubernetes-related cases received in 2023, here is a list of cases by feature, in descending order. This is not an exhaustive list; these are just the top seven features. If we look at this list, in first place is networking, and that makes sense: one of the main use cases for Kubernetes is microservices, and all those microservices need to chat with each other, so networking takes first place.

Node-related issues are usually about node stability, resource allocation, or node failures. Even though Kubernetes has auto-healing mechanisms, replaces a failed node, and reschedules the workload, what I've noticed is that people really care about what happened to that node and want to know why it failed.

For control plane issues, we are dealing with the kube-apiserver, the controller manager, and etcd. These issues usually don't affect the workload directly, but they do affect scheduling and orchestration, so they get your attention really quickly. Even if your control plane is managed by a service provider, you can still break it depending on what you're running in your cluster. Those control plane components don't have infinite resources. For example, if you have a controller that only creates resources and never deletes them, you can easily fill up the etcd database and cause issues; a rough way to spot that kind of buildup is something like the sketch below.
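As an illustration only, here is a minimal sketch of counting objects per API resource with read-only kubectl calls. It assumes you have list access to the cluster; it is not an official tool, just one way to notice a controller that keeps creating objects and never cleans them up.

```bash
# Rough object count per API resource; a resource whose count only ever grows
# can point at a controller that creates objects but never deletes them,
# which over time puts pressure on etcd.
for r in $(kubectl api-resources --verbs=list -o name); do
  count=$(kubectl get "$r" --all-namespaces --ignore-not-found -o name 2>/dev/null | wc -l)
  echo "$count $r"
done | sort -rn | head -20
```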
Another way to look at this list is which feature is the most popular and the most complicated at the same time: as the feature set grows, so does the complexity, and where complexity comes in, that's where support comes in.

So let me share with you a real-life case that had me scratching my head. When customers create cases, there are multiple automated jobs that run, analyze the configuration and the logs, and try to help customers even before they open a case. But if they still end up opening a case, the output from all those jobs comes to us support engineers and helps us troubleshoot the issue further. Even with all those automated jobs, they just can't catch them all. So let's take a look at the case. This is a real-life case; all the names were changed, but the issue is still the same.

This is how customers create cases. This is the case number, and this is who created the case: John Doe. What a popular name. So John created the case, and he says "not able to create GKE clusters, it is failing with the below error", and the error is as long as a CVS receipt. Apparently he wanted to put the whole line in the subject and it didn't fit.

John can also set the priority. P1 is the highest priority, which means we need to move fast to help John. Then there's the category: here he can select Technical, then Kubernetes Engine, and after Kubernetes Engine he can also select the feature, one of the features we've seen on the previous slide. He didn't do that, so we'll have to figure that out. Then project ID, cluster name, namespace, node name, workload name. The thing is, all of these details do help us, but when customers are in a hurry they either don't put in the details or they put them in incorrectly, so we just need to read the case description to actually understand what's happening.

So this is what we're doing here: "We create GKE clusters for our application and we do kubectl patching. It was working fine two days ago. Suddenly we see this error for kubectl patch", blah blah blah, "certificate signed by unknown authority". Okay. So I have an error and I know where it's coming from, but I don't know where kubectl is running, and I don't know which cluster this applies to. I need to ask John for these details to be able to help him. So this is what I'm doing here: can you please tell me, does it occur on all of the clusters? Is it happening only on newly created clusters, or on existing clusters as well?

John is apparently in a hurry and just says, "any update on this?" Okay. But a little while later he has apparently seen my question and says, "this is occurring on new GKE clusters. The patching command is working fine from my MacBook, but when it's run from a Google VM the command is not working." This is a very important piece of information: kubectl works fine on his MacBook but fails on the Google VM, so we can narrow this down to the Google VM. And next: "this is a blocker for our production rollout, please update on this." Okay, we really need to hurry here.

So I have a bit more information, but there's still that error, and where can it come from? He's running kubectl, and all the information kubectl has comes from the kubeconfig, so any manual tampering with that kubeconfig can generate this error. So what I'm trying to do is tell John to regenerate the kubeconfig; hopefully he will get a fresh new kubeconfig, everything will be fine, and the problem will be solved. This is what I'm asking John to do: please regenerate the kubeconfig.

John comes back: "we're still facing the same issue", and an IP address. I have this IP address, but I can't reach the VM by the IP address, because as support engineers we have read-only access. We can see the configuration, we can see the VMs, we can see the logs, but I cannot log in to the VM, because there could be private customer information there. So this doesn't really help me.

Then I ask John: did you try to recreate the kubeconfig? Because I didn't really understand that from his previous reply. And I also ask him: can you please run this kubectl command with --insecure-skip-tls-verify, just to make sure the kube-apiserver is reachable from that VM? That way we can narrow the problem down further. And John comes back with the same error, and again an IP address.
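For reference, and only as a sketch with placeholder values, the two things I asked John to try look roughly like this on a GKE cluster; the cluster name, zone, and project below are assumptions, not the customer's real values.

```bash
# Regenerate the kubeconfig entry for the cluster (placeholder name/zone/project)
gcloud container clusters get-credentials example-cluster-60 \
  --zone us-central1-a --project example-project

# Check whether the kube-apiserver is reachable at all, skipping TLS verification;
# this is a diagnostic step only, not something to leave in place
kubectl get pods --insecure-skip-tls-verify
```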
At this point I say: let's just jump on a call and troubleshoot this issue live, as real engineers do. When we jump on the call, I can actually ask John to run some commands, see the output, and try to figure this out.

Okay, so this is what we're doing here. John is connected to this bastion host. Just to see the issue, let's try kubectl get pod. Sure enough, we see the same error: certificate signed by unknown authority. Okay, just to make sure that John did what I asked him to do, let's regenerate the kubeconfig once again. But first we back it up, in case John needs it later, then remove the one that is there and generate a new one. We're generating it for cluster 35... actually, for cluster 60, that's where he reported the problem. So now we have a shiny new kubeconfig and hopefully it should work, but no, we still have the same issue.

At this point I don't understand what's happening. We have just generated a new kubeconfig, everything should be working, I have the proper certificate, it should be fine. What's happening? At this point I need to read some documentation. I know I told you we are experts, but we're basically on a never-ending date with the documentation. Reading here, I can see that client certificate authentication is enabled by passing this flag to the API server. What this means is that there is a certificate authority, and the client certificates as well as the certificate used by the kube-apiserver should be signed by that same certificate authority. If that's true, everything should work fine, and we can actually check it, so let's do that.

The certificate authority used by kubectl can be found in the kubeconfig. So we take kubectl config view with the raw output, so it also contains the certificate data, grab the certificate-authority-data, decode it, and put it into a file. Good, we have the certificate kubectl trusts. Now we can also get the certificate from the kube-apiserver. First we find the cluster API address, which we can also get from the kubeconfig, and with the mighty openssl command we fetch the serving certificate and put it into a file too. Again with openssl, we can check the issuers of both certificates, the one in the kubeconfig and the one presented by the kube-apiserver. Let's do that. This is the issuer for the kube-apiserver, and this is the one from the kubeconfig used by kubectl, and we can see the issuers are different. So this is the problem.

I still don't understand how this can happen, because we have just generated the kubeconfig, but right now I'm focused on solving the problem so that John can use this bastion host to manage his deployments. The problem is that the issuers are different, and that's why he gets this error. So if we take the certificate from the kube-apiserver and put it into the kubeconfig, everything should work. Let's do that: we set the certificate authority for the current context, using the certificate we just got from the kube-apiserver, encoded in base64. And now let's do kubectl get pod. It's working, the error is gone.
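Just as a sketch of what that comparison on the call looked like, assuming the kubeconfig has a single cluster entry and the API server listens on port 443; the file names here are made up for illustration.

```bash
# Extract the certificate authority that kubectl trusts, from the kubeconfig
kubectl config view --raw \
  -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' \
  | base64 -d > kubeconfig-ca.crt

# Find the API server address in the kubeconfig and fetch its serving certificate
SERVER=$(kubectl config view --raw -o jsonpath='{.clusters[0].cluster.server}')
openssl s_client -connect "${SERVER#https://}:443" </dev/null 2>/dev/null \
  | openssl x509 > apiserver.crt

# Compare the issuers of the two certificates; in this case they did not match
openssl x509 -in kubeconfig-ca.crt -noout -issuer
openssl x509 -in apiserver.crt -noout -issuer

# Mitigation from the call: set the API server's certificate as the trusted CA
# for the current cluster entry in the kubeconfig, encoded in base64
CLUSTER=$(kubectl config view -o jsonpath='{.clusters[0].name}')
kubectl config set "clusters.${CLUSTER}.certificate-authority-data" "$(base64 -w0 apiserver.crt)"
```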
This is kind of good, because we mitigated the issue, but we still don't know what happened and why this issue was there in the first place, because, as I've told you multiple times already, we had just generated the kubeconfig. Why is this happening? Just to make sure everything is fine, I ask John to run a few more commands, like kubectl get nodes, and he says: look at this, this is from cluster 35. This is a different cluster. And here I go: cool, now I know what's happening, I know what the problem is. John thinks he's connecting to cluster 60, but he's actually connecting to cluster 35. Maybe the IP address of the kube-apiserver is the same; we can check that as well. Let's check the server IP addresses in the kubeconfig, and indeed we have two kube-apiserver entries with the same IP address.

So just to reiterate: John has this bastion host that he uses to connect to different clusters. There is some routing in between that should send requests to the correct subnet, where the correct kube-apiserver is. In this case John generated the kubeconfig for cluster 60, but he's actually connecting to cluster 35, and that's why he gets this whole error: when the kubeconfig is generated, it carries the certificate for cluster 60. So if we now generate the kubeconfig for cluster 60... oh, sorry, no, cluster 60 is the problem, so if we generate it for cluster 35, everything should be working fine again. And see, no error. Problem solved. Now I can go and get a cup of coffee or something, but there is another case coming in. Support is a 24/7 operation, so I can just hand this next case off to my colleague Sian.

Okay, sorry about that, I guess talks go however they go. So I'm taking over from Roman. I am another GKE firefighter and my name is Sian, and we're going to look through another case. This one started off as a P2 case, so let's take a look at it. The final priority was P1, but it started off as a P2. We have priorities all the way from P1 to P4, and P2 is just below P1; basically it means something is broken, but the customer is not down in production.

Let's go through what was reported to us. We have the title, a message saying that autoscaling is not working. We have the project ID, we have a cluster name, that's good, we have the namespace name, we have the workload name, and then we have a brief description which says that they have scaled up their workload but most of the pods are in the Pending state. Again, not much information for me to go on. They haven't, for example, told me when they first noticed the issue, when they attempted the scale-up, what the business impact is for them, or whether there are any errors or relevant log messages. So from there I just request that information, because, as I say many times, we're not really magicians, we don't know everything, and we have to ask. And because it's a P2, I decide to wait for the customer to come back to me and provide the details before doing anything more.

After a while they come back and say they're seeing messages like this:
"0/3 nodes are available", which basically tells me that the pods couldn't be scheduled: they couldn't find a suitable node. They also share some messages about kubectl returning "couldn't get resource list for metrics", and it's not really clear to me how the two are related. Again, they haven't told me anything about the business impact, and they haven't told me what time the issue started, but at this point it looks like it's still ongoing, so I might as well check. As Roman mentioned, we have read-only access to customer clusters, so we can run kubectl commands, we can view their logs, and we can also check their configuration in the UI. That's what we're going to do now.

Okay, sorry, I need to log in again. We're just going to check and see how many pods are still in the Pending state. So they have a number of pods in the Pending state, which means the issue is still ongoing, and at this point I'm just going to pick one of the pods, describe it, and see if there's anything interesting at all in the events. If we look at the events here, from the top we can see a failed scale-up, so we can see there was an attempt to scale up and what triggered it, and at the bottom we can see FailedScheduling messages, just like the one that was shared on the case.
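As a rough sketch of that first read-only triage (the namespace and pod name here are placeholders, not the customer's real ones), it boils down to a couple of kubectl commands:

```bash
# See how many pods are still stuck in Pending in the affected namespace
kubectl get pods -n example-namespace --field-selector=status.phase=Pending

# Pick one of them and read the Events section at the bottom of the output,
# which is where the failed scale-up and FailedScheduling messages show up
kubectl describe pod example-pending-pod -n example-namespace
```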
Part of working in support is that when you get a case you really have no idea where it's going to go. You have no idea what's relevant to look at and what you need to ignore, and you're just trying to follow what's logical and think about what to check next. At this point I can see that a scale-up is being triggered, but I'm not really sure I want to dig into the autoscaling logs yet, so I think: okay, let me check whether any attempts are even being made to create new nodes. All creation requests for objects go through the API server, and on GKE you can enable control plane logging so you can view the API server logs. But as Roman said, we don't know everything, so first I want to know what a create request for a node looks like. Here I can see the verb, POST, and the path, /api/v1/nodes. So I say, okay, let me build a query using those two strings to search the logs and see what I find. I really don't know at this point what I'm looking for; I'm just trying to build a picture and see what happens next. Because the issue is still ongoing and the customer didn't give me any idea of the time, I'll just search over the past hour; if needed, I can always look back a little further.

So let's run the query. Okay, there are lots of logs. I'll just expand one of them and have a look, and I can see already that there's a stack trace, so I'll go through the stack trace all the way to the end and see if there's anything useful. I can see it ends in an error, and it says something about "failed calling webhook", and the webhook is check-ignore-label.gatekeeper.sh. Okay, now maybe I'm finding something. I'm sure a lot of you know about Gatekeeper, either you're using it or you've heard of it before. Then I read a bit more of the error message, and I see it failed to call the webhook, I see a POST request to the webhook, I see a timeout of three seconds (that will be important later), and then I see a "context deadline exceeded" message.

Context deadline exceeded: I'm sure you've seen it many times in Kubernetes, and it doesn't really tell you much. Just from experience and some googling, I figured out that it's a generic error in Go, and basically what it means is that a request was made and it didn't complete within the timeout that was set. Okay, let's see what else we can find in the API server logs. There's also this trace log, which I'm not so familiar with, and as I said, we really don't know what we'll find, so I'll just expand it and have a look. It seems to be a sort of breakdown of the latency of the request to create a node, and in here I can see all the sub-requests, some of them to webhooks, and again at the bottom I see the internal error and again the webhook called check-ignore-label.gatekeeper.sh. So it looks like the webhook could be involved somehow, and it makes sense for me to check what's happening with the webhook in the API server logs. That's what I do; as I said, I'm just following the breadcrumbs I'm finding and trying to figure out what makes the most sense to look at next. So again we search, this time for the string "webhook", over the past hour. We can see lots of yellow and red; yellow is basically a warning and red is an error, and we're looking to see whether there's anything else useful. We see still more trace logs returned here, and if we look at check-ignore-label.gatekeeper.sh specifically, we again see the helpful "context deadline exceeded" message. So nothing very exciting there.

Okay, back to the case. At this point the customer increases the priority of the case; they need this running in production right now, so it's now a P1 for them, and I offer to hop on a call with them. On the call I explain what I've found so far. At this point I don't have a complete picture of why the webhook is affecting the creation of nodes, but I have a pretty good idea that that's where the problem is. In support we are often prioritizing mitigation over understanding the root cause, so I advise the customer: why don't you back up this webhook, so you can always restore it from the backup, and then just go ahead and delete it for now.
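As a sketch of that mitigation; the webhook configuration name below is the one a default Gatekeeper install typically uses, and is an assumption rather than something taken from the customer's cluster.

```bash
# Back up the webhook configuration so it can be restored later
kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
  -o yaml > gatekeeper-webhook-backup.yaml

# Delete it to unblock node creation; restore later with:
#   kubectl apply -f gatekeeper-webhook-backup.yaml
kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
```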
So let's pretend I'm the customer for now: we'll delete the webhook and then check whether more pods come into the Running state now that the webhook is gone. Okay, it takes a little while, but yes. Webhooks are a classic example of third-party tools that you can install in your cluster that can have very many different effects; they can really affect the stability of your cluster, and that can manifest in many different ways. Now we can see that once we deleted the webhook, the pods are running.

With webhooks, the thing we really encourage is to have a good idea of what failure looks like, which namespaces you're covering, and which objects are affected by the webhook. And in general, for third-party tools, there are very many operators you can install on Kubernetes and very many DaemonSets you can run, like system security DaemonSets, and we encourage you to understand what the DaemonSet is doing, what it looks like when it fails, and whether it will have a significant impact on your cluster and its stability.

So I still haven't found the root cause of the issue, so let's look into that, and we'll take a look at the webhook that I made a backup of before the customer deleted it. There are two sections, but we're interested in the one named check-ignore-label.gatekeeper.sh. We can also confirm here that it has a timeout of three seconds, which we saw in the logs, and we see a failure policy of Fail. Back to the documentation to understand what a failure policy of Fail means. Looking it up: a failure policy of Fail means that if there is an error calling the webhook, the request is rejected. If you have Ignore, the request is allowed to continue, but with a failure policy of Fail the request fails. So in cases where you don't need to strictly enforce a policy with a webhook, it's better to set it to Ignore. The relevant fields are easy to check, as in the sketch below.
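A minimal way to check those fields, assuming the same placeholder webhook configuration name as before:

```bash
# Print the name, timeout, and failure policy of each webhook in the configuration;
# this is where the 3-second timeout and failurePolicy: Fail showed up
kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.timeoutSeconds}{"\t"}{.failurePolicy}{"\n"}{end}'
```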
Together with this, the customer also came back and explained that the backend pods for the Gatekeeper webhook were overloaded at the time of the issue, which is why the responses were not being returned within the timeout. So it's an example of the collaboration between us and the customer.

Now we'll switch gears a bit and go back to the presentation, because we have a few troubleshooting tips that we've collected from working on many different kinds of cases. Although the cases we showed were not the best example of this, we really encourage you to have a well-defined problem and to get all the details: whatever workload is involved, what time the issue started, any relevant log messages. If you have architecture diagrams that show the different pieces and how they are connected, that's even better. Get an idea of whether anything has been changed: were there any new deployments, and when did they happen? Just have a record of this, as it really helps.

The next thing is, using whatever error messages or log messages you got, or just the behavior you're seeing, try to understand: is this a real problem or a misunderstood feature? For example, with things like the cluster autoscaler we often see people complaining that the cluster is not scaling down, but if you actually read the documentation some more, many times it is working as intended. So make the documentation your friend.

The next thing is to narrow down the scope of the troubleshooting. You can't be looking at everything, so you need to think, especially in Kubernetes, in terms of what my colleague mentioned: features or components. In this case I could have looked at either scheduling or autoscaling, but you're basically trying to narrow down what you're focusing on. Then, once you have a list of components you want to focus on, create a list of hypotheses that you can test, going from the simplest hypotheses to the more complex ones; you don't start with anything that's too hard.

And then lastly, if you can see some relationship between the issue you're seeing and something you've observed, for example in the logs, try as much as possible to mitigate the problem. This goes back to the first step: for example, if you know the issue started just after a new rollout, you could try rolling back whatever change was made, and before you make the rollback you can collect all the data you need, so that later on you can perform your analysis. And of course, if all else fails and you're a GKE customer, please feel free to open a case, and if you have all these details it's really helpful to us and it saves us a lot of time, because, as I said, we're not magicians.

Okay, thank you everyone for your time. Please scan the QR code to give feedback, and if you have any questions, we'll be happy to take them. Thank you.