Hello, and welcome to our talk today. We're going to be talking about how we built a debuggability paved road at Intuit. My name is Anusha Raghunathan, and with me is Kevin Downey. We're both software engineers working at Intuit on the Kubernetes-based platform infrastructure.

Today we'll start with a little background on Intuit and our infrastructure at a glance, then get into the problem statement: why are we even giving this talk, what are we solving, and why did we build a debuggability paved road? Then we'll jump into how we built the paved road using two technical solutions. The first is interactive debugging using ephemeral containers, and the second is one-click debugging built with Argo Workflows. We'll finish with some takeaways for you.

Intuit is a fintech company built on an AI-driven expert platform. To give you an idea of the scale at which we run our clusters: we run 315-plus (and growing) Kubernetes clusters, comprising 28,000 namespaces that serve about 2,500 production services. And those are just production services; we have a whole bunch of pre-prod services as well. All of this serves a development community of 7,000 engineers across about 1,000 development teams. And to highlight our AI emphasis, we do about 40 million AI ops inferences per day to help our customers with their financial needs.

So what is the problem here? We serve about 7,000 developers at Intuit, and one thing is common across all of their applications: bugs. We've all been there. We write code, it passes CI, we push it to pre-prod, everything works, and then we promote to prod and things break. The failures range from esoteric 5XX errors to transient timeouts and everything in between. We all know debugging distributed systems is a hard problem. Traditionally, developers use metrics, logs, and distributed traces to observe their systems, debug root causes, and fix the issues. But sometimes this is not sufficient. What if you need to do deeper debugging and observability in your service environment? What if you have to run a memory profiler, capture network packets, run some homegrown debugging tool, or take network traces? What do you do in that scenario? Any friction during this deeper debugging experience increases your mean time to detect and resolve issues, which means that if you're running business-critical applications, it's hurting your business and your end customers.

We noticed that there was friction in debugging, and there were three main problems that we encountered. One: debug tooling setup. All of us have a favorite set of debugging tools, a Swiss Army knife of tools, depending on how passionate you are about debugging. If you're a Kubernetes user, you can just exec into the pod's main app container and start debugging the issue. But most of our organizations follow the security best practice of hardening their app container images. So when you actually try to exec into the pod, you realize you have no shell, no package manager, no general-purpose tooling, and no Java, Golang, or other language-specific tooling available. That's the first friction we encountered in our development experience.
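To make that friction concrete, here's a hedged sketch of what it looks like against a hardened, distroless-style image; the pod and container names are made up for illustration, and the exact error text varies by container runtime:

```
# Try to get a shell in the hardened app container of a hypothetical pod:
$ kubectl exec -it payments-5d8f7c9b6-x2kqp -c app -- /bin/sh
error: exec failed: unable to start container process:
  exec: "/bin/sh": no such file or directory

# No shell, no package manager, no profilers -- the image ships only the app.
```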
The second problem we had at Intuit was abstraction versus debugging: the two don't go hand in hand very well unless you build the right toolchain around them. To give you a bit of background, we are building a platform abstraction layer where the application developer just specifies their app intent, and the app intent doesn't have to be anything Kubernetes-related; it's very abstract and just specifies what the service is trying to do. The platform abstraction layer transforms this application intent into Kubernetes manifests behind the scenes. In this case, it puts in the right HPA, ingress, resource quotas, limits, and so on. Now, this is great: the application developer doesn't have to know the complexities of Kubernetes, and the platform manages things like deprecated APIs for them. What could go wrong? Well, when they're debugging, they have no idea how to proceed, because everything is a black box. Before platform abstraction, the service owner was savvy with Kubernetes and had access to their namespace, so they could use all of the standard Kubernetes debugging tools to get in and figure out what was going on. But now, after platform abstraction, our developer doesn't even have access to the namespace and doesn't have the Kubernetes skill set to do it. So they're basically dealing with a big black box, and the abstraction doesn't help much with debugging.
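To make the abstraction itself concrete, here is a purely illustrative sketch of the kind of transformation the platform performs; this is a hypothetical shape, not Intuit's actual intent schema:

```yaml
# Hypothetical app intent (illustrative only, not Intuit's real schema):
service:
  name: payments
  language: java
  scaling:
    min: 3
    max: 10
  expose: public

# Behind the scenes, the platform expands this into the full set of
# Kubernetes manifests: Deployment, HorizontalPodAutoscaler, Ingress,
# ResourceQuota, limits, and so on.
```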
And the third problem we faced was fragmented debugging expertise across our service teams. We noticed that people who were good at debugging had one area of expertise that they would keep honing, and didn't want to branch off into others. For example, a team might have a Java expert, a network ninja, and a Linux Yoda. All of them were very good at debugging their own subsystems, but they didn't really cross-pollinate and share their debugging expertise. The side effect was that who was on call and who attended the incident Zoom call, together with what the issue was, determined whether resolution was shorter or longer. And this was very inconsistent across teams.

So to solve this, we decided to build a debuggability paved road. Typically, developer paved roads in an IDP help developers with best practices and guidelines for building and deployment, and stop there. We wanted to extend that to debuggability as well. For this, we used ephemeral containers, which are perfect for interacting with the application and providing a debug shell, and Argo Workflows for providing a sequence of steps for your debugging workflows. With this, our goal was to provide debugging superpowers to all of our service personas, so they can reduce MTTD and MTTR, which accelerates not just developer velocity but also incident resolution. That was the big, grand goal of building a debuggability paved road.

Having said that, let's jump into interactive debugging using ephemeral containers. Why would someone want to do interactive debugging? From what we've seen of service personas, once they're done with their initial onboarding and day-zero, day-one activities, sometimes they want to go in and check whether they have connectivity from their main application, maybe to their databases or to any cloud resources they need. They might have service-to-service connectivity checks, or maybe just want to do a DNS lookup; it can be any spectrum of activities, simple or pretty complicated, that they might want to launch from an interactive debug shell. And we use ephemeral containers for that.

Ephemeral containers were introduced in upstream Kubernetes as a beta feature in 1.23 and went GA in 1.25. Prior to 1.23, a pod could have an init container, maybe a bunch of sidecar containers, and the app container, and if you had to attach another container, you had to redeploy your pod; the pod spec was immutable in that respect. But in 1.25 the feature became GA, and now, along with init containers, the app container, and other sidecars, a pod can also have an ephemeral container. The ephemeral container shares the process, IPC, and UTS namespaces with the app container. So you can launch an ephemeral container and target a specific container; in our case we always target the app container, and the ephemeral container shares the same Linux namespaces as the main application container. And since all the containers in the pod already share the network namespace, ephemeral containers are an excellent tool for introspection and debugging of your main app container.

Now, when we built the debuggability paved road, here is the UI screenshot of what a service persona goes through to get a debug shell. All they have to do is pick their workspace, pick the region they're interested in and the name of the pod, and press the connect button, and that launches a debug shell.

With that, I'm going to quickly go to a recorded demo. In here, assume this is the pod we just launched from the UI; the entry point is simple, it's just sleeping, and it has a little readiness probe. Notice that the Alpine client is not ready yet. So let's describe the pod and see what's going on: the readiness probe failed because it's missing the file it expects, in this case a temp file. So what I do here is launch the debug shell, specifying the pod name and the region. Notice that when I do a process listing, PID 1 is the target container's PID. Now all I have to do is go into the proc filesystem and touch the expected file in the root filesystem of the main container. Now when I describe the pod, it's healthy, because the health check passed and the readiness probe is fine. And if you look at the events, the debugger container got attached to the main app container, and I was able to add the new file from there. Hopefully that was clear. In short, what we did was access the process namespace as well as the root FS of the main container; we fixed the health-check issue, the readiness probe started working, and the pod is now working fine. It was a very simple demo, but it shows the power of launching an ephemeral container.
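Here's a rough reconstruction of that demo using stock kubectl; our UI wraps the same underlying API. The pod name, debug image, and probe file path are placeholders, and the output is abbreviated:

```
# The readiness probe fails because an expected file is missing:
$ kubectl describe pod alpine-client
  ...
  Readiness probe failed: cat: /tmp/healthy: No such file or directory

# Launch an ephemeral debug container that targets the app container,
# so it joins the app container's PID namespace:
$ kubectl debug -it alpine-client --image=example.com/debug-tools:latest --target=app

# Inside the debug shell, PID 1 is the target container's process:
/ # ps aux
PID   USER   COMMAND
1     root   sleep infinity
...

# /proc/1/root is the app container's root filesystem, mounted read-write
# (assuming the debug container runs as a compatible user), so we can
# create the file the readiness probe expects:
/ # touch /proc/1/root/tmp/healthy
```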
So what do we do behind the scenes? Let me quickly walk through the workflow. The application developer requests a debug shell on a target namespace, region, and pod; that's what you saw in the UI screenshot. The namespace resides on a remote cluster that's running the app container, and we have a multi-cluster orchestrator, our in-house fleet-management software, that receives this request from the user. Once the user request is authenticated, we obtain the kubeconfig for the target namespace and use client-go to launch an ephemeral container in the target pod; there is a Kubernetes API for launching an ephemeral container in a target pod. Once that launch happens, we establish a SPDY connection between the ephemeral container and our orchestrator; there is a bidirectional streaming server that runs as part of the orchestrator. Then we upgrade the connection to the client, so the HTTP client connection gets upgraded to a WebSocket, and a debug shell is presented to the end user. From then on, all of the bytes are streamed back and forth between the debug shell and the ephemeral container through the multi-cluster orchestrator.

So what's in the debug container image? The standard tools you would expect: a shell, a package manager, standard Linux debugging tools, language-specific tools, cloud provider tools, network and storage tools, and anything miscellaneous.

One call-out is about security. We're talking about a debug shell into a production environment, which is super powerful, so be super careful about how you implement this in your own environment. At a very high level, here's what we've done. We terminate the session after 30 minutes of inactivity. There are RBAC controls, so only power users of your service namespace are given access to this. We have OPA policies to make sure these debug containers go through the same rigorous security policies as our main app containers. There is an audit record for every stream the user launches as part of a debug shell. And we're very careful about throttling: we have limits on the number of debug connections per user, per pod, per cluster, and across clusters, so that we don't overwhelm any of the components I showed on the previous slide.
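The launch step in that flow uses the pods "ephemeralcontainers" subresource, which client-go exposes directly. Here's a minimal sketch of that call; the kubeconfig path, namespace, pod, container, and image names are placeholder assumptions, and in our setup the orchestrator layers authentication, auditing, and throttling on top:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the kubeconfig obtained for the target cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	ctx := context.Background()
	pods := clientset.CoreV1().Pods("target-namespace")

	pod, err := pods.Get(ctx, "target-pod", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Append a debug container that targets the app container, so it
	// joins the app container's PID namespace.
	pod.Spec.EphemeralContainers = append(pod.Spec.EphemeralContainers,
		corev1.EphemeralContainer{
			EphemeralContainerCommon: corev1.EphemeralContainerCommon{
				Name:  "debugger",
				Image: "example.com/debug-tools:latest",
				Stdin: true,
				TTY:   true,
			},
			TargetContainerName: "app",
		})

	// Ephemeral containers can only be added through this dedicated
	// subresource; the rest of the pod spec stays immutable.
	if _, err := pods.UpdateEphemeralContainers(ctx, pod.Name, pod,
		metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```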
Having said that, I'd like to hand it over to Kevin to talk about our workflows.

Thank you, Anusha. I'm Kevin; we work on the same team in the platform group. What you just saw was a power-user flow. But what if we know exactly what we want to do, and we want to give our developers the most simplified experience for it? That's what you'll see next. First, let me quickly describe Argo Workflows, if you're not familiar. It's a CNCF graduated project, open source, with a very active community; you can go and read a lot about it. It's a Kubernetes-native workflow engine with DAG support, it has a UI if you want one and a bunch of tools around it, and this is a very simple workflow example. If you want to learn more about workflows, just go ahead and stop by our booth in the showcase; we have a lot of demos there.

So here's the problem. We know developers will need to debug their running code, and I don't want this person to come in and sprinkle the magic dust of print statements. We want to give them a little more context if we can, but we didn't really know their specific techniques, the language they use, or the framework they use. We also want to preserve as much of the context as possible when debugging; by context I mean the structures of the application, maybe the running code references. And we want to be able to give them the right tools or access, which on an abstracted platform can be a challenge.

Looking specifically at what we do at Intuit, you can see that Intuit primarily runs Java Spring services, though we support multiple languages; this is a basic snapshot of what we run in our templates. For debugging tools we currently support just a subset, but we have plans for the future: right now we support the Java framework and Golang, and soon Python, and there is a lot of growing interest in the other languages and frameworks on this list.

I also want to describe what I meant by application context. We have a lot of Java developers using Spring Boot. What do we want to do for these people? We want them to use a specific module, which honestly we already give them in our templates: the Spring Boot Actuator module. It's a sub-project of Spring Boot that gives you a lot of very nice management endpoints, and you can enable or disable them as you choose. In our case, on our debug paved road, we want them to enable these specific endpoints: Prometheus, which is a metrics endpoint, plus the thread dump and heap dump endpoints for Java; those last two are the ones we think they will need. And here's how to dynamically enable this: given that the application's debug YAML is there, you can just take the SPRING_PROFILES_ACTIVE environment variable, append your debug profile, and voila, these endpoints will be exposed. They're actually on a different port than the normal application. This is what we mean by preserving the application context: it gives us access to thread dumps and heap dumps from within their code base and their structures, as sketched below.
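Here's a hedged sketch of what such a debug profile might look like, assuming a profile named "debug" defined in an application-debug.yaml; the port number is illustrative and our actual template may differ:

```yaml
# application-debug.yaml -- activated by appending "debug" to
# SPRING_PROFILES_ACTIVE (e.g. SPRING_PROFILES_ACTIVE=prod,debug).
management:
  server:
    port: 8081                # actuator on a separate port from the app
  endpoints:
    web:
      exposure:
        include: prometheus,threaddump,heapdump
```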
Now here's an example of a Java developer who actually wants to take a thread or heap dump. The question is, why would they even want to? Well, let's say they're noticing some trouble in a pod in one of their environments. It could be any one of these reasons: 100% CPU, a memory leak, garbage collection events, deadlocks, performance bottlenecks. What we really want is for them not to concentrate on being the guru, or finding the guru; we want them to go ahead and just take an action, a thread dump or a heap dump, based on whichever of these reasons applies. Then we also want them to have access to those thread dumps or heap dumps and use their own tooling to analyze them. And this is roughly the workflow: we have a little UI screen that's presented to the developer, similar to what you saw previously, but they choose one action and one host, which is a pod, and trigger it to run.

And with this, I will show you a demo. This is a screen view of picking my QA environment. It is now running; this may take a few seconds. As soon as it's ready, we get a download button, and the user can download this artifact, a thread dump in this case, to their local machine. And in our case, in this developer's case, we have a BU tool that we use quite often called yCrash. We just want them to use this; we don't want them to try to analyze the dump themselves or read tea leaves. We want them to go ahead and upload it to their favorite tool. Find the file, give it a second, and as fast as that, you've uploaded your thread dump. Here it shows that there were 75 total threads, 39 of them waiting, along with timed-waiting and runnable ones. Scroll down and you'll see specific stack traces for each. This is very useful if you know you have deadlocks or CPU issues: you'll see them here, your tool will be able to help you out a lot, and you can use the flame graphs as well to figure out where the process is getting stuck.

So let's walk through what we just did and what we saw. Here's the back-end view of what's happening. The user clicked a button in the UI, and we launched a workflow in our debug-system namespace in the cluster. That workflow launched a pod that took a thread dump by hitting the actuator /threaddump endpoint, took the payload response, sanitized it a bit, and uploaded it to S3. The S3 object then became available as a download; the package is sanitized and uploaded to S3 for download later. And at each step, what you didn't see is that we also do auditing: by using the workflow, we were able to capture audit events of who's doing what, how long each step took, and other metrics as well.

One important thing about thread dumps and heap dumps: we need to secure them. There is certain data that we don't want to escape, and these are also impactful endpoints. So, as I mentioned, we audit at every step. Downloads and uploads are restricted by RBAC and IAM. We use pre-signed URLs when generating the download links, so the expiry time is encapsulated. And finally, we use a network policy; you can see an example of our network policy here, where we only allow the debug namespace, or other specifically provisioned namespaces, to take the debugging actions that we want.
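Reconstructing the spirit of that policy, a minimal sketch might look like the following; the namespace names, labels, and port are assumptions, not our exact policy, and a real policy would also need to allow the application's normal service traffic:

```yaml
# Only pods in the debug-system namespace may reach the actuator port of
# application pods in this namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-debug-actuator
  namespace: app-namespace
spec:
  podSelector: {}            # applies to all pods in app-namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: debug-system
      ports:
        - protocol: TCP
          port: 8081         # the separate actuator port
```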
And given this, we're able to give the user a very clean flow. We don't need the guru to come and do it; every team member has access to this UI and can take a quick thread or heap dump.

And with this, I'll get you to the takeaways. We want to enhance developer velocity and improve MTTR by reducing the friction of taking debugging actions. We want to automate and secure access to the sensitive data that may be generated from these debugging processes. We want to facilitate seamless collaboration among developers through auditable, downloadable artifacts. And, as mentioned before, we want to democratize the tooling as much as possible. We don't want silos of information or experts who each do only one thing; if the whole team can do it, that's great for everybody. And that's it; that's our closing slide there. These are Intuit's open source projects; as mentioned, we are in the showcase, we have a few of these, and Argo as well is one of our projects. Thank you.

Great presentation. You showed the debugging flow for Java; do you also support Golang and Python in the one-click workflow?

Yes. I mentioned Golang: for Golang we have profiling and goroutine dumps. They don't show as nicely as a thread dump, but yes, and we're working on the same for Python; Python will have a sort of core dump or profiling as well.

Hi, Anusha. Can you switch to the multi-cluster orchestrator slide? In this slide, how much of this comes from the community, and which parts did you have to build?

Good question. Of course the client-go package is from the community; we vendor it and build it as part of the orchestrator. And as far as inspiration for the whole bidirectional streaming, we looked at kubectl exec and kubectl attach. Since upstream uses SPDY, we said, okay, let's not go explore something else; we'll just use SPDY as well. So from the point of view of technically leveraging the community, we did just those two things: from a design point of view we looked at kubectl, and then we used client-go.

Okay, and we are trying to do WebSocket next.

Yes, I noticed there's an issue upstream for WebSocket. Thank you; this was a great presentation.

Thanks for the presentation. For Java, did you try remote debugging, where you can see variable values or step line by line?

That's something we want to explore: live code debugging with your favorite IDE, being able to connect to prod. We haven't yet; it's on our roadmap.

Could you expand a little bit on your auditing story? Where in this diagram are you tracking user actions in the debug containers?

It's part of the multi-cluster orchestrator, which is the control plane we have, and we have an auditing API, so we push data to that auditing API. That's mainly what we do: every request and response the multi-cluster orchestrator handles turns into an audit record with very specific details about the request, the user, the pod IP, things like that. And specifically for the debug shell, we audit the entire text stream that goes back and forth. Thank you.

For ephemeral containers, it's effectively as if you're running on the app container, so jps works, jcmd, all the Java tools, as if you were right there and had them?

Yes, with a few things to note. For Java, since Java is our P0 language, we want any Java-specific debugging actions to go through the one-click debugging that Kevin described, as much as possible. Even though we could bundle jmap and jstack and ask developers to exec in, take a heap dump, upload it, and share it, we wanted to automate that entire pipeline. Having said that, the ephemeral debug container that gets launched shares the PID namespace; that's the most important thing. And, as in my demo (I don't know if it was clear), I was actually looking at /proc/1/root, PID 1's root, and thereby able to access the app container's root FS; it's mounted read-write, which is how I was able to touch a new file. You can also grant the debug container extra privileges when you launch it, as sketched below.
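For reference, here's a hedged sketch of what granting such privileges can look like on an ephemeral container spec; the names are placeholders, and as noted below, we don't add these by default:

```yaml
# An ephemeralContainers entry (added via the pod's "ephemeralcontainers"
# subresource) with extra Linux capabilities for deeper debugging.
ephemeralContainers:
  - name: debugger
    image: example.com/debug-tools:latest
    targetContainerName: app
    stdin: true
    tty: true
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]   # e.g. SYS_ADMIN would allow nsenter
```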
So you can introduce things like CAP_SYS_ADMIN, which will let you do things like nsenter into your main app container, or CAP_SYS_PTRACE, which lets you trace processes and do other deeper network and process inspection if you really wanted to. So you can decide how you want to launch your ephemeral debug container. We've started conservatively: we take the defaults and don't add any extra Linux security privileges. As the use cases evolve, we might introduce these more privileged debug containers.

Fantastic, thank you.

You mentioned sanitizing the output, and you touched on it for stack traces. One, why would you have to sanitize a stack trace, and two, do you actually sanitize the heap dump as well?

We do, mostly for the heap dump. Both go through the same sanitize step, but we may pick different tooling for each. For heap dumps, there is an open source tool that we've modified internally to capture a bit more of our PII; that's the sanitization for heap dumps. For thread dumps, I'm not sure exactly what would come out; there's maybe not as much PII, but if there is any sensitive data there, the team can tweak that step and put their own tool in.

On the heap dump, even after you sanitize it, can they still use traditional Java memory analyzers?

Yes, even after it's been sanitized. And I just want to note: all of these downloads and uploads are only good for 24 hours, and we audit them. We don't want that data to escape, but if it does, the impact would be mitigated; there's no way to protect it 100%, I'm sure. By sanitizing, we mean that we mask out any personally identifiable information, PII data, so that we're not leaking any customer information. But it doesn't corrupt the download; it doesn't corrupt the data. Thank you.

How does a debug session in an ephemeral container handle an app with frequent deployments? If there's a deployment, does it kill the debug session you're in the middle of?

Good question. The ephemeral container is ephemeral in nature: if your main app container dies, your ephemeral container dies; any pod restarts or redeployments will kill the session. But the idea is, say you're on an incident call and you've gotten to the point where you know this is the bad pod: you attach a session specifically for a short time window. It's not a long-lasting container that's expected to stay around. Once the pod is gone, you won't have the ephemeral container anymore.

Okay, thanks. Thank you, everyone.