Hello everyone, I'm Shai Almog. Today we'll talk about debugging at scale in production, specifically about Kubernetes debugging. I'm assuming everyone here knows the basics of Kubernetes, so I'll dive right into the basic problem description and then into the three tools I'll show today. But first, a few things about me. I was a consultant for over a decade, I worked at Sun, founded a couple of companies, wrote a couple of books, I wrote a lot of open source code, and currently work as a developer advocate for Lightrun. My email and Twitter accounts are listed here, so feel free to write to me. I have a blog at talktotheduck.dev that covers debugging and production issues. It would be great if you check it out and let me know what you think. I also have a series of videos on my Twitter account called 140 Second Duckling, where I teach complex things in 140 seconds. The current series is about debugging, and it covers a lot of things most developers don't know and would find helpful. I'd really appreciate it if you check it out and let people know about it. I spoke at Observability Fest 2022, and after I was done, they generated this absolutely spectacular mind map of my presentation. This is essentially what I'll be talking about, and I'll show it again when we're done. Containers and orchestration revolutionized development and production, no doubt. But in a way, Kubernetes made debugging production issues harder than it previously was. In the past, we had physical servers we could just work with, or even VPSs. Now we face much greater difficulty due to three big challenges. The massive scale enabled by Kubernetes is a huge boon, but it also makes debugging remarkably difficult; we need new tools to deal with the scale. We now have multiple layers of abstraction in the deployment, where failures can happen in the orchestration, container, or code layers. Each failure requires a different set of skills and solutions.
Tracking the cause to the right layer isn't necessarily trivial. Finally, there's the bare-bones, or lean, deployment problem. This is the first problem I want to focus on; we'll get to the other two soon enough. It's the problem of the bare, naked container: we can connect to a bare-bones container, but there's nothing to do inside it. Nothing is installed. We can inspect logs, but that relies on luck. Furthermore, if your logs are already ingested by a solution like Elastic, you probably don't have anything valuable to do within a bare-bones container. kubectl debug solves these problems and can work even with a crashed container or a bare-bones image. The kubectl debug command adds ephemeral containers to a running pod. An ephemeral container is a temporary element that will vanish once the pod is destroyed. With it, we can inspect everything we need in the pod. Most changes we make in it don't matter; they won't impact the pod after we're gone. It works with bare-bones containers. The way it does this is with a separate image, so we can have an image that includes everything in it. The container spun up from that image is ephemeral and can include a proper distro and the set of tools we need. kubectl debug was introduced in version 1.23, so if you're still on an older version, you'll need to wait for that. If you use hosted Kubernetes, you need to check the version they use. Let's start with a simple demo. As you can see, we have a few pods here. We're experiencing an issue and would like to change the logging level so we can better see what's going on. I can use exec to log in directly to the live pod with a proper bash shell. I'm sure most of you have done that in the past, as it's pretty easy. Here I can just use standard commands like cat and grep to check the logging level. This is all good; we can see the current level is at info. Unfortunately, I don't even have vim if I want to edit this file.
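As a rough sketch of that first step, the exec session looks something like this (the pod name and config file path are illustrative, not from the talk):

```shell
# Open an interactive bash shell inside the running pod
kubectl exec -it my-service-7d9f8b -- /bin/bash

# Inside the container, standard tools work only if the image ships them;
# here we check the current logging level (the path is an assumption)
cat /app/config/application.properties | grep logging.level
```

This is exactly where the bare-bones problem bites: if the image doesn't include bash, cat, or grep, even this much fails.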
Now, when the container has apt or apk and the pod isn't crashed, I could in theory just apt-get install vim. But that's got its own problems and is a painful process. We don't want people in production installing packages left and right. Even if it's cleaned up, the risks are sometimes too high. Pods shouldn't be touched in production once deployed. All the information and state of the pod should be described in its manifest, unless it's a strictly stateful pod like a database, etc. So installing like this is problematic. Let's exit for a moment and try connecting again with kubectl debug. I'll use the busybox image in this case, and we'll see how that works. Notice this is the image referenced in the kubectl debug docs. So I'm connected to the pod again. Or so it seems. Technically, I'm connected to a new ephemeral container and not to the pod directly. This is an important distinction. But as you can see, again, I don't have vim or really any of the tools I would expect, like VisualVM, traceroute, etc. I can fix that. I can create an image that contains everything I need and packages all of the tools I use, then pass that image to kubectl debug and just use those tools. But here's the thing: I'm not unique. We're all pretty much the same. We all expect the same things in our debugging sessions, and the image I would use is probably the same image you would use too. So why not have one generic image? This is where KoolKits comes in. KoolKits was created by Tom Granot of Lightrun. KoolKits is an open source project that includes a set of opinionated, curated, platform-specific toolkits for kubectl debug, so you can have everything you might need at your fingertips while debugging. So what does this mean? When you use kubectl debug to spin up an ephemeral container, it's built using a KoolKits image. Currently, there are four standard images: a Go image that includes tools such as Delve, pprof, go-callvis, and many others.
The JVM KoolKit includes tools such as SDKMAN!, jmxterm, Honest Profiler, VisualVM, and much more. The Node version includes nvm, ndb, 0x, vtop, and again, much more. And finally, the Python version includes pyenv, ipdb, IPython, and much more, as expected. But this is just the tip of the iceberg, as all versions include the many tools you would expect in any proper debugging session, such as vim and htop, and also have lots of networking tools like traceroute and nmap, database clients for Postgres, MySQL, Redis, and again, so much more. So let's continue from where we left off in the demo. We can disconnect from the current session and then spin up a new session with the KoolKits image. Notice we can also use the shorthand kk command for many of the operations, which I don't use here, but you can see the syntax in the KoolKits docs. Specifically, in this case, notice I used the JVM version of KoolKits, which I chose because I'm a Java guy. But if you're using a different environment, you can use what fits there. In KoolKits, pretty much every tool I want is pre-installed as part of the image by default. This means we can just connect and everything is already there. Since we're all very similar in our needs, KoolKits includes the common things most of us need, based on the platform or language. It has sensible defaults and comes with Ubuntu as the distro. This is important: you have a full distribution, like you would on a desktop or a regular server. This is very helpful for debugging. So you get everything you need, even when debugging a bare-bones container. Notice that thanks to kubectl debug, we have full access to the main application container's file system and the pod's process namespace, so we can do everything there while residing in a more convenient environment. It's having our cake and eating it too.
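To make the two demo sessions above concrete, the invocations look roughly like this (pod and container names are illustrative; the KoolKits image name follows the project's Docker Hub repository, but check the KoolKits docs for the current image names and tags):

```shell
# A minimal ephemeral container, using the busybox image from the kubectl docs
kubectl debug -it my-service-7d9f8b --image=busybox:1.28 --target=my-service -- sh

# The same pod, this time with the full KoolKits JVM toolkit image
kubectl debug -it my-service-7d9f8b \
  --image=lightruncom/koolkits:jvm \
  --target=my-service
```

The --target flag is what makes the ephemeral container share the process namespace of the named application container, which is what gives you access to its processes from the debug session.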
So to finish the story from before, I can just use vim to edit the file and change the logging level to error, which I can then confirm using cat and grep. I can also do a lot of other things, such as profile using a profiler, debug with gdb or jdb, or use jmxterm to perform JMX operations, which lets you configure the way the JVM behaves at runtime, and pretty much anything I can do on a local machine. To give you a sense of what KoolKits installs, this is the list of packages for the JVM client, and it's bound to grow, as you can all submit pull requests with your favorite packages. This is just the JVM-specific image; the other images contain similar tools at a similar scale. You get all of that thanks to kubectl debug and KoolKits. But what if what we're tracking seems to be an application bug? This is a common occurrence, for sure. We might not know it at this stage, but that might be the place we want to investigate. We can try using logs, and we probably should start there, but more often than not, the issues we need to solve aren't logged. We can try using various observability tools; they're great, but not for application-level issues. They rule for big-picture analysis and container-level problems, not for application-level problems. We can use one of the debuggers in KoolKits to track it down, but that would only work if we know the server where the issue manifests, which we sometimes do. And it's remarkably risky: connecting a debugger to a production environment can lead to multiple problems, like stopping on a breakpoint accidentally, using conditional statements that grind the system to a halt, or exposing a security vulnerability. JDWP is the Java Debug Wire Protocol. There are several such remote debugging protocols we can use, and this seems like the ideal solution. We'll get to that soon enough. So here's the first problem we run into with the debugger: we can't just start using it. We need to relaunch the app with debugging enabled.
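As a sketch of what "relaunching with debugging enabled" means for a JVM app (the jar name, class, and line number are illustrative; the host:port form of the address option is JDK 9+ syntax):

```shell
# Restart the JVM with the JDWP agent listening on localhost only
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=127.0.0.1:5005 \
     -jar my-app.jar &

# List JVM process IDs, then attach jdb to the debug port
jps -l
jdb -attach 127.0.0.1:5005

# Inside jdb, a breakpoint needs the fully qualified class name and line number:
#   stop at com.example.PrimeMain:30
#   step
#   cont
```

Binding to 127.0.0.1 rather than all interfaces is what the talk means by limiting access to the local server; it reduces, but does not eliminate, the exposure.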
That means killing the existing process and running it again. That might be something we can just do, but it probably isn't. Furthermore, running in this mode is a big security risk. Notice I limit access only to the local server, but it's still a big risk. Leaving a remote debugging port enabled in deployed server code is considered a huge security vulnerability. If hackers can exploit a different vulnerability, they might be in a position to leverage this from the local system. Still, if we do this for a short period of time, this might not be a big deal, right? In a different window, I need to find the process ID of the application I just ran, so I can connect to it. I can now type it into the jdb command, and now I'm connected with a debugger. I can add a breakpoint using the stop at command. Naturally, I need to know the name of the class and the line number, but I can set a breakpoint to stop at. Once I've stopped, I can step over, like I can with a regular debugger. However, this is a pretty problematic notion on multiple fronts. First off, I'm literally stopping the thread accessing this application. That's probably not okay for any container you have in the cloud. There are ways around that, but they aren't trivial. The second problem is different. I'm old, so people automatically assume I love the command line. And I do love it, to some degree. But I love GUIs more. When I started programming, there was no option. We didn't have IDEs on a Sinclair, on an Apple II, or on a PDP-11. But now that we have all of those things, well, I don't want to go back. I've programmed in Java since the first beta, and this was actually the first time I ever used jdb. I'll use command line tools when they give me power, but debugging via the command line? I'll pass. The obvious answer is JDWP. We have a remote debug protocol that's supposed to solve this exact problem, right? But this is a bit problematic: if we open the server to remote access with JDWP,
we might as well hand over the keys to the office to hackers. A better approach is tunneling. During the age of VPSs, we could just use SSH tunneling like this: we'd connect to a remote host and forward the port where the debugger was running locally. Notice that in this sample, I used port 9000 to mislead hackers scanning for port 5005, although it wouldn't really matter, because it's over SSH anyway, so it's supposedly safe. We can do the exact same thing with Kubernetes, using the port-forward command to redirect a remote JDWP connection to localhost. Port forwarding opens a secure tunnel between your machine and the remote machine on the given port. So when I connect to localhost on the forwarded port, it seamlessly and securely connects to the remote machine. Once we do that, I can just open IntelliJ IDEA and add a run configuration for remote debugging, which already exists and is pre-configured with defaults such as port 5005. I just give the new run configuration a name, and we're ready to go with debugging the app. Notice I'm debugging on localhost, even though my pod is remote. That's because I'm port forwarding everything. I make sure the right run configuration is selected, which it is. We can now just press debug to instantly connect to the running process. Once that's done, this feels and acts like any debugger instance launched from within the IDE. I can set a breakpoint, step over, and get this wonderful GUI I'm used to. This is perfect, right? Right? Not exactly. It still has quite a few drawbacks. As we can see, there are many problems in this process. First and foremost is the need to restart the process. We could possibly run with debugging turned on by default, but that's a huge risk. Then there are breakpoints. They break. I heard a story years ago about a guy who was debugging a rail system.
The rail system literally fell into the ocean while he was stopped at the breakpoint, because it never got the stop command. I don't know if that's an urban legend, but it's totally plausible. All remote debugging protocols are insecure, but I can personally attest to JDWP, which I'm familiar with: it's very insecure. In fact, there were already CVEs about products that left it open. It isn't designed with security in mind. But it's also a stability risk. How many times did the debugged process crash on you? Imagine adding a breakpoint condition that's too expensive or incorrect. You might destroy your app in production just by viewing something in the watch panel. Finally, there's the privacy nightmare. Imagine a hacker from within your organization, and 60% of attacks come from within your organization. A disgruntled developer might place a breakpoint where user credentials are collected and then start farming passwords and credentials. Or he can place a breakpoint, change a value, elevate his own privileges, and essentially dominate the system and change whatever he wants. This literally violates laws and regulations in various territories, so leaving regular debugging on in a production server might actually be a legal risk as well as a security risk. I didn't forget about this slide. This is an even more important aspect. Everything we discussed up until now assumes we know the exact container where the problem is happening. In real life, this is rarely the case. We can have thousands of containers in real-world deployments. How do you debug something like that? The debugger can solve the depth issue; it lets us dive into the code despite the problems, but it wasn't built for scale or for production. That's where developer observability comes in. We used to call this continuous observability, but developer observability makes more sense. It's a newer set of tools designed to solve this exact problem.
Observability is defined as the ability to understand how your systems work on the inside without shipping new code. The "without shipping new code" portion is key. But what's developer observability? With developer observability, we don't ship new code either, but we can ask questions about the code. Normal observability works by instrumenting everything and then receiving the information. With developer observability, we flip that: we ask questions and then instrument based on the questions. So, how does that work in practice? In practice, we add an agent to every pod. This lets us debug the source code directly from the IDE, almost like debugging a local project, but without the drawbacks we discussed earlier. So, to start, we need to sign up for a free Lightrun account at lightrun.com/free. For this specific demo, you could use other tools; they're pretty similar to Lightrun, although I have to say not as good looking, not as secure, and not as scalable. The basic concepts are very similar, although functionality might vary, and I'll mention it when I'm aware of something that differs. In this particular demo, I'll show what I know. You can check out the Lightrun docs for more detailed instructions on setting up Lightrun with Docker, Minikube, etc. With Lightrun, we need to install the agent before the problem occurs, so if we do run into a problem, we'll be able to jump right in. This isn't a problem, because the agent doesn't have any serious overhead and you can just leave it running always. It doesn't have the security risks that something like remote debugging has. I'll skip past all the installation, because that's all on the website, and go right into showing you what this means in practice. This is the prime main app in Kotlin. It simply loops over numbers and checks if they're prime. It sleeps for 10 milliseconds so it won't completely demolish the CPU, but other than that, it's a pretty simple application.
It just counts the number of primes it finds along the way and prints the result at the end. We use this code a lot when debugging, since it's CPU-intensive and yet very simple. In this case, we would like to observe the variable i, which is the value we're evaluating here, and print out cnt, which represents the number of primes we've found so far. The simplest tool we have is the ability to inject a log into the application. We can also inject a snapshot or add a metric; I'll discuss all of those soon enough. Selecting log opens the UI to enter a new log. I can write more than just text: in the curly braces, I can include any expression I want, such as the values of the variables included in this expression. I can also invoke methods and do all sorts of things. But here's the thing: if I invoke a method that's too computationally intensive, or if I invoke a method that changes the application state, the log won't be added; I'll get an error. After clicking OK, I see the log appearing above the line in the IDE. Notice that this behavior is specific to IntelliJ and other JetBrains IDEs; in Visual Studio Code, it will show a marker on the side. Once the log is hit, we see the logs appear in batches. Notice I chose to pipe logs into the IDE for convenience, but there's a lot more I can do with them. For now, the thing I want to focus on is the last line. Notice that the log point is paused due to a high call rate. This means additional logs won't show for a short amount of time, since logging exceeded the threshold of CPU usage. This can happen quickly or slowly, depending on what you're observing. Let's move on to a different demo. This is a Node.js project that implements the backend of a microservice architecture. This is the method that gets invoked when we click a movie whose details we want to see. This time, I'll add a snapshot.
Some other developer observability tools call this a capture or a non-breaking breakpoint, which to me sounds weird, but okay, it's the same thing. Once I press OK, the camera button appears on the left, indicating the location of the snapshot, like you would see with a regular IDE breakpoint. Now, I just access the portion of the front end that triggers this code, and we wait a second for the snapshot to hit. So what is a snapshot? It gives us a stack trace and variables, just like a regular breakpoint we all know and love, but it doesn't stop at that point, so your server won't be stuck waiting for a step-over. Now, obviously, you can't step over the code, so you need to work with individual snapshots. But this has huge benefits, especially in production scenarios. And it gets much better. This was a relatively simple demo in terms of observability. Let's up the ante a bit and talk about user-specific problems. Here I have a problem with a request: one specific user is complaining that the list on his machine doesn't match the list his peers see. The problem is that if I put in a snapshot, I'll get a lot of noise, because there are so many users reloading all the time. The solution is to use a conditional snapshot, just like you can with a regular debugger. Notice that you can define a condition for a log and for metrics as well. This is one of the key features of developer observability. I add a new snapshot, and in it I have the option to define quite a lot of things. I won't even discuss the advanced version of this dialog in this session. This is a really trivial condition: we already have a simple security utility class that I can use to query the current user ID, so I just make use of that and compare the response to the ID of the user that's experiencing the problem. Notice I use the fully qualified name of the class. I could have just written Security, and it's very possible it would have worked, but it isn't guaranteed.
Names can clash, and the agent side isn't aware of the things we have in the IDE. As such, it's often good practice to be more specific. And in this sense, I want to be 100% clear: Lightrun doesn't see your source code, so your privacy is maintained. It runs on the server against the binaries and is oblivious to the fact that you have an import statement in the file, because that's a construct of Java, not a construct of the class file that's already compiled. So it has no way of knowing that you imported that specific class. After pressing OK, we see a special version of the snapshot icon with a question mark on it. This indicates that the action has a condition on it. Now it's just a waiting game for the user to hit that snapshot. This is the point where normally you could go make yourself a cup of coffee, or even just go home and check it out the next day. That's the beauty of this sort of instrumentation. In this case, I won't make you wait long. The snapshot gets hit by the right user despite other users coming in; this specific request is from the right user ID. We can now review the stack information and fix a user-specific bug. The next thing I want to talk about is metrics. APMs give us large-scale performance information, but they don't tell us fine-grained details. Here we can count the number of times a line of code was reached using a counter. We can even use a condition to qualify that, so we can do something like count the number of times a specific user reached that line of code. We also have a method duration, which tells us how long a method took to execute. We can even measure the time it takes to perform a code block using TicToc. This lets us narrow down the performance impact of a larger method to a specific problematic segment. In this case, I'll use the method duration. Measurements typically have a name under which we can pipe them or log them, so I'll just give this method duration a clear name.
In this case, I'm just printing it out to the console, but all of these measurements can be piped to StatsD and Prometheus. I'm pretty awful at DevOps, so I really don't want to demo that here, but it does work if you know how to use these tools. As you can see, the duration information is now piped into the logs and gives us some information on the current performance of the method. The last thing I want to talk about brings this all together, and that's tags. We can define tags to group agents together, such as production, green, blue, Ubuntu, etc. Every pod can be part of multiple tags. Every action we discussed today can be applied to a tag, and as such can run on multiple machines simultaneously and asynchronously. This solves the scale problem when debugging. You can literally place a metric or an action on a set of tags, and it will happen on all of them. So, for instance, if there's a specific user and you don't know which pod he will hit in a request, you can still place an action on a tag and catch it regardless of which container gets invoked. Let's go back to the list of drawbacks we had with debuggers and review it again. You can keep an observability agent running all the time, since it's designed for this use case. Here we don't have breakpoints; snapshots don't stop. They're secure by design and read-only. You can't destroy the server, and it will throttle you if you overuse CPU. Some services like Lightrun support PII redaction, which removes private information from your logs, and blocklists, which prevent users from debugging restricted classes or files. So it's very secure, even when an internal user is, say, problematic. So in closing, I'd like to review some of the things we discussed today. kubectl debug made debugging crashed pods possible. It also made it possible to debug a pod based on bare-bones images. KoolKits made kubectl debug easier to use with pre-installed tools.
Debugging an application in an existing container in production is a difficult and risky process. Developer observability makes deep, secure, read-only, real-time debugging at scale easy. Thank you all for watching and bearing with me. I hope you enjoyed the presentation. Please feel free to ask any questions, and also feel free to write to me. Also, please check out talktotheduck.dev, where I talk about debugging in depth, and check out lightrun.com, which I think you'll like a lot. If you have any questions, my email is listed here and I'll be happy to help. Thank you.