Good afternoon, everyone. Welcome to our session on building an AI-powered paved road. My name is Avni; I'm a product manager at Intuit. And with me I have Todd, who is a principal engineer at Intuit.

Let's look at the agenda for today. First we'll go over the background, then we'll see our infrastructure at a glance and the scale at which we operate. Then we'll look at some of the challenges that come up at such scale. After that we'll get into the solutioning, which covers three important pillars: the app-centric runtime, traffic management, and debugging. And then we'll close with takeaways and Q&A.

So let's look at our background and infrastructure at a glance. Intuit is a global fintech company that builds financial products like QuickBooks and TurboTax, and we are an AI-driven expert platform. To give you a sense of our AI footprint, we make around 40 million plus AI inferences per day. We also have a huge Kubernetes platform layer, with around 2,500 production services, and even more pre-prod services than that. We have around 315 Kubernetes clusters and more than 16,000 namespaces, and it keeps growing as we speak. We also have a big developer community, with around 1,000 teams and 7,000 developers at Intuit working on end-user products.

Let's look at the personas that deal with the platform on a day-to-day basis. First we have the service developer persona. They are focused on building the app logic for the product that is then shipped to the end user. As I mentioned, it's a huge community at Intuit, around 7,000 developers, and their focus is on writing code and shipping it faster. Then we have the platform persona, the platform experts, as we say. Platform engineering's overarching goal is to drive developer autonomy, and their focus is to enable our service developers by providing capabilities through several interfaces. So let's say a developer needs a database: they should be able to get one whether they are a Node.js developer or a database administrator. It should be that frictionless and easy to use those capabilities. Similarly, if they want to deploy or manage an application on Kubernetes, they need not know the nitty-gritty of the platform and the infrastructure; they should be able to do it in a seamless manner.

But what are the challenges today for an app developer, or service developer? There is a steep learning curve. Many times our developers find themselves dealing with a lot of Kubernetes internals and APIs. They also deal with a lot of infra-related configuration on a day-to-day basis, and when something is misconfigured, they don't find enough troubleshooting help. The second friction point is the inner-loop developer experience: local development with dependencies. And the third is that we have a lot of tech refreshes that require migrations, like migrating deprecated APIs when we upgrade Kubernetes, or Fluentd replacing the CloudWatch agent. Even these kinds of migrations require service developer team support.

This is a sample workflow that a service developer goes through on our internal developer portal. The first step is to create an asset on the dev portal; an asset is the atomic unit of deployment on our Kubernetes layer. The developer then has to develop and deploy their app.
After that, they find themselves configuring certain Kubernetes primitives in their deployment repo, like PDBs, HPA configuration, or Argo Rollouts analysis template configuration. They might also onboard to the API gateway to expose their app to the internet, and onboard to the service mesh to configure rate limiting. Only after all that do they get to end-to-end testing. And if they have performance tests to run, they do load testing by configuring their min/max HPA. And of course the last point runs in perpetuity: they are intertwined with platform migrations every quarter to stay up to date.

This can certainly make your service developer go crazy. They now not only have to focus on the business logic; they find themselves dealing a lot with infrastructure as well. Where we actually want to be, and this is our target state, is to translate all these application needs into platform means: developers just focus on developing the code, deploy it in a seamless, frictionless manner without knowing the intricacies of the platform, and then do end-to-end testing. Everything else should be taken care of by the platform. I would now like to pass it on to Todd, who will walk you through the solutioning we did to overcome these challenges.

Yeah, thanks, Avni. Just to set some context, I'll briefly describe Intuit's development platform, which we call Modern SaaS Air. At the top we have our developer portal, which Avni mentioned, and which provides the developer experience for all of our engineers. The complete lifecycle of an application is managed through the dev portal. Underneath that, our platform is based on four pillars: AI-powered app experiences, GenAI-assisted development, the app-centric runtime, which we'll talk more about in this talk, and smart operations. And we have our operational data lake, which provides a rich data set for visibility into how all of our applications are developed, deployed, and run.

The focus of this talk is really on the runtime and traffic management component we call IKS Air. IKS Air is a simplified deployment and management platform for containerized applications running on Kubernetes. It provides everything an engineer needs to build, run, and scale an application. The main components of IKS Air are an abstracted application environment, unified traffic management, and developer-friendly debug tools. If you build and run your own platform on Kubernetes, you likely have many of these same concerns, and hopefully this talk will spark some ideas for your own system. I'll go over the first two areas, and Avni will cover debug tools.

As Avni described earlier, we faced a number of challenges over five years of developing and operating a developer platform on Kubernetes, and our current paved road runtime is the result of those learnings. We provide our developers with an abstraction that allows them to focus on their application and shields them from Kubernetes version upgrades and other platform changes. The abstraction also implements best practices for application resiliency and provides automatic rollbacks. We also provide intelligent autoscaling to efficiently manage the resources needed to run the application. The application specification abstracts and simplifies the Kubernetes details that you see on the left, and provides an app-centric specification like the one on the right. The platform takes responsibility for generating the actual Kubernetes resources that are submitted to the clusters.
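To make that contrast concrete, here is a minimal sketch of what such an app-centric specification could look like. The schema below is illustrative rather than our actual one; it borrows the shape of the Open Application Model, which I'll describe next, and trait names like autoscaler and route are hypothetical for the example.

```yaml
# Illustrative only: an OAM-style app spec of the kind a paved-road
# platform could accept. The platform, not the developer, expands this
# into the Deployment, HPA, PDB, and rollout resources underneath.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: payments-service        # hypothetical service name
spec:
  components:
    - name: web
      type: webservice          # a component: one unit of functionality
      properties:
        image: payments-service:1.4.2
        port: 8080
      traits:
        - type: autoscaler      # a trait: tunes the component's behavior
          properties:
            minReplicas: 3
            maxReplicas: 20
        - type: route           # hypothetical trait exposing an endpoint
          properties:
            path: /payments
```

The developer declares what the application needs; how that maps to Kubernetes objects is owned entirely by the platform.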
Our application spec is heavily influenced by the Open Application Model, or OAM, and describes the application in terms of components, which are then customized or extended using traits. Components represent individual pieces of functionality, for example a web service or a worker, and traits modify or change the behavior of the component they're applied to. Through this system of components and traits, developers can define what their application requires from the platform without needing to completely understand how it's implemented in Kubernetes.

A good example of this complexity is the progressive delivery solution, built on Argo Rollouts and Numaflow, that the platform creates to enable automatic rollback of buggy code. When a new version of an application is rolled out, canary pods with the new version are first created, and some percentage of the traffic is sent to those new pods. The metrics of the canary pods are then analyzed by Numaflow pipelines, which generate an anomaly score. If the anomaly score is low, like the three we see here on the slide, the rollout will continue. But if the score is high, say above seven or eight, Argo Rollouts will stop the deployment and automatically revert to the prior revision. This is an important part of how we let our developers deploy with confidence, without needing to know how to set up this complex solution themselves.

Our platform also automatically recommends scaling solutions for applications. To do this, it needs to determine the application's resource sizing, like memory and CPU, and handle unexpected events like OOMKills or evictions. It also needs to handle horizontal scaling so applications operate properly both at scale and under varying levels of load, and it needs to identify which metrics the application should scale on, as well as what the minimum and maximum number of replicas should be. As you can see, these are all primarily data-driven problems, and we believe AI will have a big impact on capacity planning and autoscaling, allowing us to be more efficient with our computing resources. So we're building an intelligent autoscaling recommendation system that reduces the burden on our developers, helps ensure our workloads have the resources they need, and improves the efficiency of our platform. The details are beyond the scope of this talk, but the basic idea is that we have components running in the cluster handling short-window scaling operations and emitting metrics, which are then analyzed by a group of ML models that make long-window capacity and scaling recommendations. The solutions to the different scaling problems are then applied back to the clusters.

Another big challenge we identified for our developers was the configuration and management of network traffic. While some applications need very specific capabilities of our networking environment, we found that most only need to change a small, common set of configurations. Our solution simplifies endpoint management, unifies the configuration of our API gateway and service mesh, and provides graduated complexity as needed. For example, here's a screenshot showing the traffic configuration of a service on our developer portal. Most applications only need to configure their routes and throttling. But if needed, they can toggle on advanced configs and get access to CORS and OAuth scopes.
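As a hedged illustration of that graduated complexity, the simplified traffic section behind that screen might reduce to something like the following. This schema is invented for the example and is not our production config.

```yaml
# Hypothetical unified traffic config: one place that drives both the
# API gateway and the service mesh settings for a service.
traffic:
  routes:
    - path: /api/v1/payments    # route exposed through the gateway
      backendPort: 8080
  throttling:
    requestsPerSecond: 100      # rate limiting enforced by the mesh
    burst: 200
  advanced:                     # visible only when advanced configs are on
    cors:
      allowedOrigins:
        - https://app.example.com
    oauthScopes:
      - payments.read
      - payments.write
```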
And then for even more complex use cases, they can edit the underlying YAML configuration directly. Now I'll turn it over to Avni to talk about debug tools.

Thanks, Todd. Todd went over the abstracted layer and the intelligent autoscaling of our platform. But we also learned that once the platform is abstracted, another challenge our service developers experience is troubleshooting their services: abstraction and debuggability don't go hand in hand. So it was really critical for us to provide a paved path for their debuggability needs as well. We've built a debugging experience into the developer portal that is very dev-friendly: developers need not know any Kubernetes primitives or have any historical knowledge of the platform. This is also an attempt to democratize debug tooling across teams, where we saw folks juggling a lot of different tools, and it helps us reduce MTTR and the friction in debugging workflows.

The first experience we've provided is an interactive debugging shell using ephemeral containers. Remember, we now have an abstracted platform: the user has no access to the namespace, and everything is a black box to them. So we provide a shell-style interactive debugging experience, which we achieve with a Kubernetes concept known as ephemeral containers. Ephemeral containers became a GA feature in the Kubernetes 1.25 release. They are ideal for running a transient container, custom-built for troubleshooting, alongside the main app container in the service pod, and they are great for introspection and debugging. You can launch a new debug container into an existing pod; here it is outlined in red and labeled as the ephemeral container. This debug container shares the process (PID), IPC, and UTS namespaces of the target container, and given that the containers in a pod already share the network namespace, the debug container is set up perfectly to debug issues in the app container. Ephemeral containers are a valuable solution for interactive troubleshooting when kubectl exec is insufficient, because a container has crashed or because the container image doesn't include debugging utilities, which is a common scenario since many orgs harden their images.

This is a demo of how it looks on our developer portal. We see a shell icon; the user clicks it, selects a host, which is a pod, and hits Connect. At that point an ephemeral container attaches to the target app container, and the session starts. Here we can see that the connection is established and a session has begun. In this way we've hidden the complexity of kubectl exec into a pod, or even of getting a kubeconfig, so a user can use this frictionless experience to debug their service.

Another capability we've provided is one-click debugging, an on-demand debugging experience built on workflows. For this we use Argo Workflows, which are ideal for defining the sequence of steps a debugging workflow needs. Specific debugging techniques are required depending on the language and framework, and we want to preserve as much application context, application structure, and code references as possible while debugging a service.
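To ground the ephemeral-container flow we just walked through: what the portal does behind the scenes is roughly equivalent to a kubectl debug invocation. A minimal sketch, with a hypothetical pod name and a stand-in for the hardened debug image the platform team publishes:

```bash
# Launch an ephemeral debug container into a running pod (GA since
# Kubernetes 1.25). --target shares the app container's process
# namespace so its processes are visible from the debug shell.
# Pod name and image registry are illustrative.
kubectl debug -it payments-service-7d4b9c-x2k8f \
  --image=registry.example.com/platform/debug-tools:latest \
  --target=app \
  -- /bin/sh
```

The portal hides even this: the developer never needs a kubeconfig or cluster access, and the image is one the platform controls, so only approved tools are available in the session.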
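And for the one-click side, here is a rough sketch of how a Java thread-dump sequence might be expressed as an Argo Workflow. Everything here, the names, images, and exec-based capture step, is an assumption for illustration rather than our production workflow, and it presumes the workflow's service account is allowed to exec into the target pod.

```yaml
# Illustrative Argo Workflow for a one-click Java thread dump.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: java-thread-dump-
spec:
  entrypoint: thread-dump
  arguments:
    parameters:
      - name: target-pod            # chosen by the developer in the portal
        value: payments-service-7d4b9c-x2k8f
  templates:
    - name: thread-dump
      inputs:
        parameters:
          - name: target-pod
      container:
        image: registry.example.com/platform/debug-tools:latest
        command: [sh, -c]
        # Exec into the target pod's app container, ask the JVM
        # (assumed to be PID 1) for a thread dump, and save it as a
        # downloadable workflow artifact.
        args:
          - >
            kubectl exec {{inputs.parameters.target-pod}} -c app --
            jcmd 1 Thread.print > /tmp/thread-dump.txt
      outputs:
        artifacts:
          - name: thread-dump
            path: /tmp/thread-dump.txt
```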
At Intuit, we also found that our top two most-used languages are Java and Golang, and these are some of the language-specific debugging tools we might use. Let's look at what this looks like on our developer portal. The user interacts with the dev portal UI to take a thread dump or heap dump on a target pod for a Java service. When they hit Add to Debug List and add that specific pod or host, a workflow is executed in the background, performing the steps of the thread dump or heap dump workflow in sequence. The developer can later download the artifact and use their preferred analysis tool; these downloads are available only for 24 hours.

So that concludes our session, and I'd like to go over the key takeaways of building this paved road. It definitely increases developer release velocity and helps in performing platform migrations with ease, without a lot of friction for the service developers and their teams. It also helps reduce the time it takes to get a service into production: since a lot of it is vended by the platform itself, we're taking that burden off the developer. It helps reduce potential incidents caused by misconfigured infra, and by abstracting the infra network connectivity we're able to provide a better experience. Intelligent autoscaling manages your service availability as well.

That concludes our talk. I'd also like to mention our open source community and our belief in open collaboration: Intuit is a recipient of the end user award in 2019 and 2022, and we maintain and actively contribute to a lot of projects, which you can take a look at via this link. Thank you. We are now open for Q&A.

Yes, hello. Thank you for your presentation. First, a quick question: can developers also interact with these ephemeral containers using a CLI, or do they have to go through the portal?

The way we've designed it, they go through the dev portal, and they don't have to worry about port forwarding or exec-ing into the pod. In fact, they don't need to worry about having debug tools on the image their application runs on, and we actually prohibit that: we harden those images. Through the dev portal they just get a command-line prompt into an ephemeral container, and that ephemeral container is essentially a debug container with a debug image that we, as the platform team, have created ourselves, with all the approved tools on it. So they're able to interact just from that command line in the dev portal.

Right, thank you. And also, do you see any disadvantages to abstracting so much from the developers?

Yeah, the big disadvantage is that there's a wealth of Kubernetes knowledge in the community, and if you're providing an abstraction, you also have to document that abstraction. It's something different from what people are going to be learning about here at KubeCon, for example. But we offer both: we have what's called the custom paved road, where people get low-level access to Kubernetes, and then we have this more abstracted IKS Air paved road. So depending on the team and the use cases, they may choose different systems.

Thank you. Other questions?

Yeah, I'm curious about the AI part of this.
Did you find that when you started, it was enough for the AI to have just generic knowledge of things like Kubernetes and applications? Did you need to train it with domain-specific knowledge from the beginning, or was it effective without that? I would imagine it got more effective once you gave it domain knowledge, but I'm curious whether you could start with something more generic.

Yeah, we actually started with statistical analysis, not necessarily true AI. We started with an opinionated model of how to analyze and predict the correct solution for these autoscaling problems, and then we evolved into the AI. So now our models are trained on domain knowledge, basically the historical metrics and performance of each application, and there's a model that encapsulates all that data which is used to make the predictions.

What about for troubleshooting? Like troubleshooting workloads?

So we have a separate effort where, when we get an incident or an alert, we'd like to give some AI assistance to the people responding to it. That's trained on both internal resources and external sources, Kubernetes documentation and things like that, but also our own runbooks and our own processes. We use that combined knowledge base to give some hints as to what the problem could be and help point people in the right direction.

On troubleshooting, and I don't mean to bogart the microphone: did you find it was effective when you didn't train it with domain knowledge? Was it able to deal with a lot of problems, or did it miss a lot?

We're a big company, so there are a lot of very specific things that happen inside our environment. Early attempts that were based on plain LLMs were close, but they often didn't quite hit the mark, and we definitely had a lot of hallucinations and wrong results. We improved that by combining them with context from within our own system.

Thanks, Todd. Thanks, Avni. At this point we're going to have a coffee break, and then we will meet back here.