Hello, my name is Ankur. I lead the machine learning as a service, or MLaaS for short, team at Capital One. My colleagues David Harrington, Patrick Hannis, Trevor Halleck, Cruz Hall, and Christian Langholm and I are excited to tell you more about our platform and share some of the operational challenges and lessons learned in running it on Kubernetes. I also want to give a shout-out to Suman Gadapalli, our architect, and Jason Stryker, Allen Mann, and Michael Andrews from our Kubernetes SRE team, who have been incredible partners on our Kubernetes journey.

Before proceeding further, let me take a moment to mention Capital One's commitment to the open source community. Capital One made an open source first declaration in 2014, and that's when we made our first contributions to the open source community. We sponsor FINOS, the Python Software Foundation, the Continuous Delivery Foundation, and the Cloud Native Computing Foundation to help keep open source sustainable. Capital One's contributions to the open source community have been significant, and we have released more than 40 of our own software projects. We have invested for years to build the culture and governance required to be open source first in a highly regulated industry. I will now hand it over to my colleague, David Harrington, who will walk you through a high-level architecture of our platform. David, over to you.

Thanks, Ankur. Hi, so for today's agenda, we're going to walk through an example architecture for building a simple software as a service platform. We're going to talk through how we use Kubernetes and how we organize our teams and software in order to accomplish that. Many of the concepts here aren't going to be new for folks who have a basic understanding of the primitive Kubernetes objects; rather, our intent with this presentation is for you to walk away with a better appreciation for how we combine the basic concepts of Kubernetes into a higher-order system.

So, diving in: at the heart of it, our platform runs ML and data processing pipelines at scale on Kubernetes. We seek to make it easy to run out-of-the-box analytics on the desired datasets in a standardized, secure manner. In this talk, we're going to cover our journey in using Kubernetes, cover some of the foundational processes and design considerations, and pepper in some examples of incidents to illustrate why getting these patterns right is crucial. Before we dive deep on any one area of the agenda, it's important we level-set on what our requirements and high-level architecture look like.

As for requirements: one, we need to be able to run batch jobs on demand that connect to end users' data stores. Two, we need to enable non-technical users to configure and launch these jobs via a UI. And three, we need to enable least-privileged, flexible data access. Now, as a large organization, it's important that we adopt a multi-tenant architecture to help ensure least-privileged data access; we wouldn't want people to be able to access data that they shouldn't. And one of the important requirements for this platform is that not everyone is going to be an engineer with direct system access to our cluster. Rather, we need to be able to serve users via a UI in addition to API services. To accomplish the above, provisioning a job is not as simple as having someone run a kubectl apply. Regardless of where the user journey starts, we must ensure the same properties of compute, network, and data isolation.
Now, in this diagram, we have a somewhat stripped-down, basic version of what our system does. As a given for this presentation, we're running on a Kubernetes cluster; more on that later. And like any platform, we have APIs, UIs, and databases, and most importantly, our platform does something of value, hopefully, to the end user; in this case, running some standardized analytics jobs. Now, what the jobs do doesn't matter all that much for this presentation, other than the fact that they might require customized networking; these jobs need to be able to connect to the end user's desired datasets, as we see here on the right.

So, a huge part of this platform is running a reliable Kubernetes cluster. How do you upgrade a cluster in production? What production configuration meets the requirements for your enterprise? What add-ons are necessary for the operation of your platform? These aren't easy questions. We have a central SRE team whose responsibility is to provide platform teams like ourselves the automation tools to provision and manage a production-grade cluster. This is the bedrock our platform is built upon. Without it, we cannot securely or reliably do any of the fun things like running thousands of pipelines. It's important to mention that a central SRE team does not mean we run all Kubernetes workloads at Capital One on one big cluster. Rather, for large organizations where you may have many complex platforms, it is not advisable to share clusters across platforms; needs may differ much lower in the stack, making coordination of releases thorny, to say the least. This hub-and-spoke model provides the best of both worlds: a central team of experts, and right-sized, scope-limited clusters with more predictable behavior.

Now onto the next layer in our architecture. This is a big one: our platform APIs. This is effectively the entry point for all user interactions on the system. An entire talk could be dedicated just to how to build multi-tenant software as a service applications, but that's not the purview of this talk. Rather, what's important about our platform APIs is how they interact with our cluster and the layers below it. When building a multi-tenant system like this, you're always faced with the question of whether you want to provision a copy of your stack per tenant or utilize a shared service model. What does that mean in terms of Kubernetes? If you have a platform API that needs to create jobs on behalf of users, is it better to have a single service that has permissions to deploy jobs for all users, perhaps across many namespaces? Or is it preferable to have a deployment of your service for each tenant, where the permissions of each API instance are specific to that tenant? The blast radius may differ depending on your approach, and so that's a judgment call that cannot be made universally for all platforms, as both options have their own sets of pros and cons.
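To make that trade-off concrete, here's a minimal sketch, in Python with the official kubernetes client, of what "creating a job on behalf of a user" looks like from a platform API. The names, labels, and resource sizing are hypothetical, not our actual implementation; whether the credentials behind load_incluster_config are shared across all tenants or scoped per tenant is exactly the blast-radius judgment call above.

```python
from kubernetes import client, config

def launch_tenant_job(tenant: str, image: str) -> client.V1Job:
    """Create an analytics job inside the tenant's own namespace.

    Whether these credentials can touch every tenant namespace or only
    one is the shared-service vs. per-tenant deployment decision.
    """
    config.load_incluster_config()  # assumes we're running inside the cluster
    batch = client.BatchV1Api()

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(
            name=f"analytics-{tenant}",
            labels={"tenant": tenant},  # used later for log/metric attribution
        ),
        spec=client.V1JobSpec(
            ttl_seconds_after_finished=3600,  # let Kubernetes reap the finished job
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="worker",
                            image=image,
                            resources=client.V1ResourceRequirements(
                                requests={"cpu": "500m", "memory": "512Mi"},
                                limits={"cpu": "1", "memory": "1Gi"},
                            ),
                        )
                    ],
                )
            ),
        ),
    )
    # The namespace, not the service, is the isolation boundary.
    return batch.create_namespaced_job(namespace=f"tenant-{tenant}", body=job)
```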
Regardless of whether you have more of a single-tenant model for your services or a shared multi-tenant model, you still need to manage resources on behalf of your users, which leads us to the next part of our architecture. Next slide, please. The primary function of this layer is to maintain and manage the specifications of resources for your users. In responsibility, it's not much different from an SRE job function: maintaining, say, Helm charts for a deployment. The engineer must maintain and upgrade the deployment when appropriate, deprecate it when it's time, and have a disaster plan in place for when the service or cluster goes down. Only in this instance, instead of an engineer committing code to version control and kicking off builds, all of this management has to be codified and automated so as to be repeatable arbitrarily many times.

Which leads us to this layer: the tenant sandbox. For our example platform, we're running jobs for our users. We want to limit what these jobs are capable of doing, as we laid out in our initial requirements. We want a minimal network surface and a limited set of capabilities, permissions, and available resources, so as to avoid any one tenant causing problems or maybe even snooping where they shouldn't. One of the primary functions of our platform APIs is to automate the provisioning of the sandbox whenever a new tenant signs up for the platform.

And if we manage the environment well, then the next part should be simple enough: we get to run our meaningful workloads for our users. When indirectly opening up what can run on your cluster to a large number of people of varying skills and backgrounds, problems at this layer are bound to arise. In this presentation, we're going to cover techniques to catch problems as they arise and to avoid whole classes of errors. And with that, I'm going to hand it over to Christian.

Thank you, David. All right, so I'll talk to you about how to go about upgrading your cluster. Though they can feel like a more mundane and routine part of your deployment process than, say, your platform deployments, cluster upgrades require equal preparation and attention to detail. Your cluster is the foundation upon which your platform is built, and unexpected changes to that foundation can lead to unintended consequences. It's therefore important to have a plan before, during, and after any upcoming cluster upgrade to identify potential regressions, test and monitor for those regressions, and recover from those regressions if necessary. This plan should be just as robust and just as ready as your platform deployment plan, and it's equally important whether you own your cluster scripts or they come prepackaged for your use, as in our case.

A good cluster upgrade plan includes the following steps. First, listen for upcoming cluster upgrade dates or deadlines, and prepare your team to have a resource or resources on standby for potential failures, rollbacks, or hotfixes. Ideally, these support resources have been set aside already as part of your standard support rotation and have been prepared with this action plan in advance. Second, make sure you review the changelog and any relevant documentation for the upcoming upgrade, and identify known breaking changes and suspected points of failure. Third, prepare for the upgrade. If you have identified definite breaking changes, make a plan to implement the necessary changes before the upgrade deadline; it's important to review upcoming changes well enough in advance to make these preparations. And if you suspect any changes might be problematic or cause failures, ensure your tests, monitors, and alerts cover those potential points of failure so that they can be quickly identified post-upgrade. Fourth, test, monitor, and alert for regressions. During and after the cluster upgrade, ensure that your entire test suite is run, including integration tests, end-to-end tests, and performance tests. These are powerful tools for identifying regressions quickly.
No test suite is a complete picture, however, so it's equally important to monitor your logs, performance metrics, and application health wherever possible, using whatever tools are in your arsenal. Fifth, after the upgrade, confirm that no regressions were introduced into your platform, and only then safely sign off on the cluster upgrade as a success in the current environment. If regressions have been discovered, ensure you communicate your discovery and determine whether you need to roll back or deploy a fix. Only after a successful assessment of your current environment should you elevate your cluster upgrade to your next higher environment. Sixth, perform the elevation. It is crucial that any change to your cluster or your platform that will ultimately end up in production begins in your lowest possible environment; only after explicit approval in a lower environment should any deployment be elevated to the next level. It's therefore critically important to allow sufficient time before your upgrade deadline to fully test these cluster upgrades in a lower environment.

Now, I will explain a situation that happened to us during a previous cluster upgrade, and how we learned the hard way that we needed a plan like the one I laid out on the previous slide. Back in March of 2020, after several smooth cluster upgrades, our lack of proper planning came to light with a regression introduced in our QA environment after a routine cluster deployment. Over the course of a few days after this deployment, some of our users began reporting high error rates in the form of 502s and 499s coming from our API. These elevated error rates were not caught by our alerts, nor did we have sufficient monitoring dashboards in place to help identify the source of the failures.

After triaging and assigning a lead to this issue, as well as informing stakeholders and clients, we began our investigation. Due to our lack of foresight in planning for issues like this, the investigation had to be done manually, and it touched many aspects of our ecosystem that many of us were, unfortunately, only tangentially familiar with. After concluding that a regression had not been introduced within our platform code, our team struggled to identify the source of these errors, let alone the root cause or a solution. All we could conclude, given the nature of the HTTP error codes we were seeing, was that network traffic was being severed somewhere along the way. After some thorough investigation, though, we discovered logs from our ingress controller which pointed to it as our source of failure. Further network testing on the ingress controller confirmed a high rate of connectivity problems between it and our server pods. Further investigation, along with our network administrators, revealed an undocumented regression in our CNI plugin, Cilium, which was blocking inter-node traffic for pods attached to certain outdated network policy definitions. Our ingress controller was attached to one such policy, and after updating the policy, network traffic returned to its former working state.

This trial by fire introduced our team to aspects of our platform ecosystem that were previously unfamiliar and revealed many new potential points of failure worth testing and monitoring. For example, the image on the right summarizes all the hops, skips, and jumps our network traffic takes to get from our client to our server and back again.
As we lacked a consolidated view of our system, we had to start at both ends of this flow and test and inspect each stage for network issues manually. This involved running commands like nslookup to confirm DNS resolution, and at points even running simple curl commands from within our ingress controller pods to test connectivity to our downstream systems. This process was unsustainable, to say the least. Ultimately, we concluded that the network error originated from the connection between our ingress controller and our server, but we determined it's worth testing, monitoring, and alerting on every stage of our network flow.
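To give a flavor of those manual checks, here's a minimal sketch in Python of the kind of stage-by-stage probe we were effectively running by hand: a DNS lookup (the nslookup step) and an HTTP health check (the curl step) per hop. The stage names and URLs are hypothetical stand-ins for the hops in the diagram.

```python
import socket
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical hops from the diagram; substitute your own endpoints.
STAGES = {
    "ingress-controller": "http://ingress.example.internal/healthz",
    "platform-api": "http://api.example.internal/healthz",
    "downstream-service": "http://downstream.example.internal/healthz",
}

for name, url in STAGES.items():
    host = urlparse(url).hostname
    try:
        addr = socket.gethostbyname(host)        # DNS resolution check
        status = urlopen(url, timeout=2).status  # connectivity check
        print(f"{name}: {host} -> {addr}, HTTP {status}")
    except OSError as exc:                       # DNS or connection failure
        print(f"{name}: FAILED ({exc})")
```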
Since then, we've made efforts to consolidate our logs from each stage into a dashboard in Splunk, as well as to monitor each stage for health and performance metrics using tools such as New Relic and CloudWatch. We've also revitalized our knowledge transfer sessions to teach our team about the various supporting actors in our overall platform ecosystem.

So let's analyze our experience with this particular cluster upgrade. What went right? What went wrong? How could we have improved our experience with a proper action plan like the one we described earlier? Starting with the positives: to ensure customer questions and issues were promptly addressed, we had built a multi-layered support system, and this support system worked as intended in the resolution of the issue. We triaged and prioritized the issue, assigned a lead to resolve it, and effectively collaborated using Slack and Zoom to troubleshoot with our team and with our cluster administrators. Simultaneously, we kept the impacted customers and leadership informed of our progress. Additionally, despite our setbacks, we were able to identify the root cause of the failures and iterate quickly to deploy the fix, and before the cluster upgrade hit our production environment, we were able to resolve the issue and continue as normal.

However, despite our quick turnaround on this particular issue, we did expose ourselves as insufficiently prepared for cluster upgrades in general. A few glaring issues can be identified from auditing the upgrade process from the perspective of a platform team. First, while we were reviewing the release notes for this prospective cluster upgrade, we failed to identify the Cilium version update as a potential source of failure. Correctly identifying this risk would have narrowed our investigation significantly, and we have since taken care to call out any version upgrades coming down the pipeline to investigate first in case of detected regressions. Second, our testing, monitoring, and alerting suites proved insufficient to notify us of failures outside the scope of our immediate platform components. We only discovered this particular issue after being notified about degraded performance by our clients, and when investigating, we had to cobble together logs and metrics from disparate sources. A consolidated testing strategy and monitoring suite would have identified this regression quickly, and we've since begun consolidating log streams into centralized dashboards. Third, before this incident, many of our development team members had a tentative grasp at best on some of the systems in our platform ecosystem outside of our immediate platform components. A firm understanding of these systems would have resulted in a more confident and robust debugging process. We've since ensured that our team has several SMEs for our platform ecosystem and beyond, and we also do regular knowledge transfers to elevate the rest of the team. By addressing the above gaps, we've conducted subsequent cluster upgrades with confidence in our game plan. I'll now hand it off to Cruz to discuss observability.

Thanks, Christian. So, turning now to observability: I think it's really helpful to frame any discussion or work around observability in terms of the target outcomes. For us, there are really two outcomes that matter, and all of our observability work drives toward enabling them. The first one is probably familiar to you, and that is minimizing our time to restore, which captures how long it takes to bring the service back up whenever an incident occurs. It's helpful to think of that journey in terms of two separate stages. First is the stage where you're waiting for your on-call engineer to detect that there's an issue; that can be measured as the time to detect. Then, once the on-call engineer knows about the issue, there's another delay as they figure out how to restore the service, or rather, how to fix the issue; that diagnostic process we can capture as the time to repair. Thinking about time to restore in those two categories helps direct our investments to the most valuable work. In the case of alerting, alerting is there to help minimize the time to detect: it helps us learn about issues before our customers come and report them to us, as Christian mentioned earlier. And then we have other levers to pull to minimize the time to repair. Specifically, that's things like having a runbook, so that common remediation steps are easy and apparent to the on-call engineer, and also having very fine-grained application performance monitoring, or APM. That also looks like having distributed traces, and even metrics that can point to very specific parts of your stack, so that you can easily see where the issue likely is. Thinking about the time to restore in terms of these two components really helps us focus on the most valuable instrumentation efforts, and even efforts that don't involve instrumentation, such as writing runbooks.

Another really important outcome is to maximize the legibility of the system. We can think about legibility as a measure of how readily the system can be read and understood by anyone on the team. If you have a situation where only the experts know about one piece of the ecosystem, then when that is where the incident is occurring, you have to get that expert on the phone. But if you've done a good job instrumenting the system and centralizing those signals to be viewed in one or a few places, then suddenly everyone can kind of be an expert, and in that sense, we've really improved the team's ability to respond to issues and to understand what's happening in the system. We can also think about our dashboards as opportunities to answer some known, frequently asked questions, but even that has limitations. Ultimately, we want to be able to ask any question of the system, with logs that are rich with context, and traces, which can show the interdependencies between systems and help us diagnose even very subtle issues at high resolution. And in our system, as mentioned earlier, because we're living in this layered world, we have a couple of unique challenges.
So the first one is our shared responsibility model. Our MLaaS team owns the compute infrastructure and the platform APIs that ultimately orchestrate the workflow jobs our users are running, but the individual users, those tenants, have their own code to attend to: their settings, certain resource specifications, and obviously the logic they're writing that could go wrong. And so oftentimes it's been difficult to parse out the root cause of an issue and know whether it's ours to solve or the user's to solve, or, in some cases, whether it's up to the cluster administrators to resolve. Keeping that at the front of our minds has been really important as we instrument the system. Another unique challenge, specific to the way our platform orchestrates our users' workloads, is that we have very short-lived pods. In some cases, the pods that are most problematic, the ones our users want to debug and dive into further, are the pods that live for only a few seconds, if that long. And it turns out that several tools in the observability ecosystem rely on a pull-based mechanism; we'll see how that becomes problematic in some cases as we outline a few of these tools operating at different layers in the stack.

I think this diagram is useful yet again for illustrating the core observability plays at each level. At the cluster level, we really do rely on our cluster to provide some of those foundational capabilities of collecting metrics, logs, and traces and shipping them to our tools of choice. In our case, we're using the FluentD DaemonSet to collect the logs on individual hosts, and those logs are ultimately forwarded to Splunk using the HEC (HTTP Event Collector) plugin. Even there, we have two different projects: the HEC plugin is a project separate from the FluentD project, and so there have been cases where we had to dive into issues that originated from the HEC plugin, and others that seemed related to the FluentD DaemonSet pod itself. Apart from FluentD and Splunk, we have a New Relic operator that's collecting infrastructure utilization metrics, essentially providing that infrastructure view in New Relic. And in addition to the metrics collected by the New Relic operator, we also have Prometheus collecting metrics, not only from other applications in the cluster, but also from a lot of these plugins we rely on. Those metrics are also consolidated in New Relic so that we can see them side by side and debug issues as needed. I can speak specifically to the ingress controller Christian mentioned earlier: we rely on the NGINX ingress metrics that are exposed and collected by Prometheus to see into that pod, because otherwise there wouldn't really be a good way for us to understand what's happening inside that application; we didn't write it and we didn't instrument it.

The layer we actually have control over is the platform APIs. Within that layer, we're implementing some very fine-grained instrumentation with New Relic APM, which gives us distributed tracing as well as error reporting. That error reporting has been really valuable for detecting very specific errors in our clients' code.
We're always leaning into improving the granularity of the errors we're able to see in New Relic. And this is the layer we point all of our monitors for alerting at. We've learned that alerting on anything our on-call engineer can't actually resolve is really frustrating, and it makes it hard to understand which issues are worth interrupting someone to triage. So we've focused all of our alerting efforts on this platform API layer, and that's really helped reduce toil and churn on the part of the on-call engineer.

Then, moving on to the tenant sandbox, we have several guardrails, including the resource quota construct. Those guardrails help us prevent tenants from exceeding certain resource limits, and that is ultimately the best approach to preventing tenants from having too big an impact on the other systems running alongside them in the cluster. It's a much better approach than simply alerting on high resource utilization and requiring someone to take an action; we leverage the resource quota construct to provide that guardrail by default. And in addition to the logs and metrics we capture from our platform APIs, we're also capturing all the logs from those tenant namespaces, and we're making sure each tenant's logs carry the labels they need to pull up the logs relevant to their workloads. Lastly, within the actual jobs themselves, we have the application logs that our users are writing and understand best. We're labeling those and surfacing them in tools like Splunk, where the users can diagnose issues related to their application code. We don't yet have metrics, traces, and structured logs throughout those batch jobs; we're pursuing those opportunities right now.

As mentioned earlier, there are a couple of interesting considerations when you have short-lived pods. For instance, with Prometheus, you might not actually scrape a pod before the pod dies, so there's a chance your pod metrics wouldn't be presented, or would be presented in an incomplete form. And then for the instrumentation of individual API calls using things like OpenTelemetry, we know there's a non-zero performance impact to that instrumentation. So we're trying to find ways to measure and quantify that impact, and then ultimately give our users an opportunity to opt in, and maybe even control the granularity or the sampling rates, so that they can make those trade-offs themselves. That's sort of the state of OpenTelemetry and the more advanced mechanisms of tracing within our individual batch jobs.
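As a rough illustration of that opt-in idea, here's a minimal sketch using the OpenTelemetry Python SDK. The environment variable is a hypothetical knob, and the console exporter is a stand-in for a real backend; the point is that the sampling ratio, and therefore the overhead, is the tenant's choice.

```python
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Hypothetical knob: each tenant picks how much tracing overhead they accept.
# A ratio of 0.0 records no spans, i.e. the tenant has opted out.
sample_rate = float(os.environ.get("TRACE_SAMPLE_RATE", "0.0"))

provider = TracerProvider(sampler=TraceIdRatioBased(sample_rate))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("batch-job")
with tracer.start_as_current_span("load-dataset"):
    pass  # the tenant's actual work goes here
```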
So now I'll hand it over to Trevor, who will explain a recent issue we had with logging and how we resolved it in our platform.

Thank you, Cruz. I'd like to dive into one entertaining case study I will affectionately dub "my logs are missing." As mentioned earlier, our SRE team maintains an enterprise kOps repository providing our clusters with several conveniences, such as FluentD. This comes with a seemingly handy kube-fluentd-operator, self-described as a FluentD config manager with batteries included: config validation, no need to restart, sensible defaults, and best practices built in. Based on the theme of verifying defaults, as well as this case study's foreboding title, the question becomes: which of the defaults was not so sensible for our environment? I ask because these defaults have been the cause of several logging pains for us, especially the default resource specifications.

To answer that question, we'll look at how a log message is built using this pattern. The log starts its journey in the data scientist's job, hoping to tell the world about a looming issue in prod, just as all little logs do. The log is written to the file system and scraped by a FluentD file source. Next, on a schedule, the log router queries the Kube API for the pod's metadata, ensuring the log's author will be able to find it. Finally, the enriched log is sent to Splunk to be united with its engineer.

By default, the log router queries the Kube API every minute, which is fine for long-running pods such as web servers and the like. However, in an environment with many moving parts, which logs would you suppose people are most interested to find? The ones with errors. And it is well known that errors can dramatically decrease a pod's lifetime, sometimes by so much that the pod is long gone by the time the log router asks for its metadata. This leaves our lonely log without any labels. Now, logs without labels are not impossible to search for, but with so many logs in our Splunk index, we found that logs without labels provide little more value than no logs at all. In fact, some users add a sleep longer than FluentD's refresh interval to their job configurations just to avoid this issue. Of course, another option is to configure FluentD to query the Kube API more frequently, but some jobs have a sub-second lifetime, so there's a limit to how far you can take that. Also, when it comes to logging your logging, it is helpful to enable FluentD's own metrics; in our case, Prometheus sends FluentD metrics, such as error counts and buffer queue length, to New Relic for alerting. This lets you keep tabs on even the logs you missed.

Thus, we're moving toward a structured logging approach to remedy these missing logs. By using the Kubernetes Downward API, the pods creating the logs can themselves retrieve the Kubernetes metadata required to make those logs searchable. As shown here, we can more reliably provide meaningful logs by entirely avoiding the fallible enrichment step the FluentD operator offers. In short, our logs are created batteries included.
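As a minimal sketch of that idea: the platform injects pod metadata into the container via the Downward API, and the job emits logs that already carry their own labels. The field paths are standard Downward API fields; the image name and the helper function are hypothetical.

```python
import json
import logging
import os

from kubernetes import client

# Platform side: inject pod metadata into the container via the Downward API.
container = client.V1Container(
    name="worker",
    image="example/worker:latest",  # hypothetical placeholder
    env=[
        client.V1EnvVar(
            name="POD_NAME",
            value_from=client.V1EnvVarSource(
                field_ref=client.V1ObjectFieldSelector(field_path="metadata.name")
            ),
        ),
        client.V1EnvVar(
            name="POD_NAMESPACE",
            value_from=client.V1EnvVarSource(
                field_ref=client.V1ObjectFieldSelector(field_path="metadata.namespace")
            ),
        ),
    ],
)

# Job side: every log line carries its own labels, so it stays searchable
# even when the pod is gone before the log router asks for its metadata.
def structured_log(message: str, level: int = logging.INFO) -> None:
    logging.log(level, json.dumps({
        "message": message,
        "pod": os.environ.get("POD_NAME", "unknown"),
        "namespace": os.environ.get("POD_NAMESPACE", "unknown"),
    }))
```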
Enough of this use case, though; I'm handing it back to Christian.

Thank you, Trevor. I will now discuss how to isolate your tenants' compute within your cluster and your platform, and how the machine learning as a service platform does so. At its core, the MLaaS platform provides its clients, known as tenants, the means to author, manage, and execute their data workflows. Given the unique and open-ended functional capabilities of a workflow, each running instance of a workflow is considered its own application. Thus, as the trusted host of such applications, the platform needs to ensure compute isolation in order to prevent tenant workflows from gaining access to other tenants' workflows and data. If complete isolation is not achieved, there is a high risk of data exposure or denial of service.

Therefore, the MLaaS platform uses Kubernetes namespaces to implement a multi-tenancy model for compute isolation, as namespaces provide a means to house, or isolate, groups of resources within a cluster. In this use case, namespaces are used to house tenants' running workflows: every tenant has their own namespace, and tenant workflows run in their designated namespace only. Isolating tenants to their own namespaces allows us to administer tenants individually, with their own configurations, limits, and permissions. Primarily, we use namespaces to ensure least-privilege access and to manage resources at a per-tenant level. For example, using network policies and role-based access control, we can configure least-privilege network and resource permissions respectively, allowing us to limit which services our tenants' workflows can communicate with and which resources they can modify. We can also prevent denial of service and resource hogging by setting resource quotas and metering resource consumption, which we'll discuss in more detail later. We apply all of these configurations and permissions at the namespace level, so they cover the tenant's workflows across the board.

Now, namespaces are a powerful tool for administering and organizing the tenants on your platform, but in practice you're going to run into some overhead when it comes to configuring and deploying those namespaces. The primary consideration to tackle when administering namespaces is determining the minimum provisions needed by the fewest entities to meet your namespace administration requirements. As with all cluster-scoped resources, the cluster-wide nature of namespaces necessitates due diligence when assigning permissions to create and modify them, as a misconfiguration could open a security gap exposing all namespaces in a cluster. Therefore, it can be simpler and safer to offload the work of namespace management to a separate service dedicated to that purpose.

A Kubernetes operator can fit this use case. Operators, extensions to Kubernetes that define the deployment and management of custom resources extending the vanilla Kubernetes API, can be used for namespace management. In this case, the custom resource might consist of a single definition file which wraps the definition of a namespace and any additional resources that live within that namespace. The operator would then manage the deployment and maintenance of this custom resource, and therefore the namespace and the objects defined within it. It's a good idea to investigate whether any such Kubernetes operators exist that fit your namespace management needs, as it's possible there is a well-defined operator within your organization, or in open source, which you can use to create and configure your tenant namespaces and resources. By offloading namespace management to such an operator, you're ensuring proper segregation of duties and least-privilege access for your application.

If you choose to follow this pattern, consider the following when choosing a namespace operator. First, as these operators deal with a critical cluster-scoped resource and can therefore have a cluster-wide impact in case of failure, discuss the permissions the operator needs, and its impact on existing namespaces in your cluster, with your cluster administrators and any other applications on your cluster before deploying it. And second, research the level of developer support the operator has. Whether it's open source or in-house, you should see how frequently the operator is contributed to, whether it has any critical outstanding issues in its repository, and whether a developer team exists to offer integration or bug support.
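Whichever operator or service ends up owning namespace management, what gets stamped out per tenant looks roughly like the following. This is a minimal sketch with the kubernetes Python client; the namespace naming, quota values, and default-deny policy are hypothetical stand-ins for an actual per-tenant template.

```python
from kubernetes import client, config

def provision_tenant_sandbox(tenant: str) -> None:
    """Sketch of sandbox provisioning: namespace + quota + default-deny policy."""
    config.load_incluster_config()
    core = client.CoreV1Api()
    net = client.NetworkingV1Api()
    ns = f"tenant-{tenant}"

    core.create_namespace(
        client.V1Namespace(
            metadata=client.V1ObjectMeta(name=ns, labels={"tenant": tenant})
        )
    )

    # Cap what all workflows in the namespace can consume in aggregate.
    core.create_namespaced_resource_quota(
        ns,
        client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(name="tenant-quota"),
            spec=client.V1ResourceQuotaSpec(
                hard={"requests.cpu": "8", "requests.memory": "16Gi", "pods": "50"}
            ),
        ),
    )

    # Default-deny: workflows may only reach endpoints we explicitly allow later.
    net.create_namespaced_network_policy(
        ns,
        client.V1NetworkPolicy(
            metadata=client.V1ObjectMeta(name="default-deny"),
            spec=client.V1NetworkPolicySpec(
                pod_selector=client.V1LabelSelector(),  # all pods in the namespace
                policy_types=["Ingress", "Egress"],
            ),
        ),
    )
```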
We had to consider these items when deciding between two such operators to conduct namespace management for our platform: the Hierarchical Namespace Controller, an open source operator, and Embark, a namespace operator built within our organization.

When we were exploring operators to manage our tenant namespaces and resources, two promising options presented themselves. First, as stated before, the Hierarchical Namespace Controller. This is an open source operator which establishes a hierarchical relationship between namespaces, allowing for the concept of a child namespace, to which you can propagate resources or configuration items from a parent namespace. In our case, this paradigm would allow us to create tenant namespaces as children of our application's main namespace, to which we would automatically propagate standard network policies, roles, role bindings, and other resources. Changes to these resources and configuration items in the application's namespace would then automatically be reflected in the child namespaces.

Our second option, Embark, is an operator developed within our organization which instead defines and operates a custom resource called a super namespace. This super namespace object lets you define a namespace and any resources within it, like network policies, roles, and role bindings, in a single YAML file. Deploying or modifying any of those items can be done by simply modifying the super namespace definition itself and deploying that object; the operator handles the rest. This option would have allowed us to configure all of our tenant namespaces from one standard template and deploy a super namespace per tenant, which would thus deploy their namespace and all of their resources. A promising option, Embark was lacking only one feature: the ability to configure custom additional resources at the individual super namespace level. By default, you could configure objects at the operator level that would apply to all managed namespaces in the cluster, but the ability to do so for an individual super namespace was missing. If we contributed to Embark to fill this feature gap, both operators would have provided a functionally viable solution for our namespace management.

However, after taking into account the considerations mentioned on the previous slide, even with that feature gap, Embark stood out to us as our single viable option for namespace management, for the following reasons. First, the Hierarchical Namespace Controller lacked an official Helm chart to install and maintain it on our cluster, instead requiring installation through a tool called krew. As krew is not a supported tool in our enterprise clusters, we would have had to take the time to create and maintain our own Helm chart for this operator. In the long run, we were considering taking this challenge on and contributing it back to the open source project, but we did hesitate, given our internal deadlines and our relative inexperience at the time with Helm. And given the second concern, which I'll state now, we certainly did not want to make any mistakes when creating a Helm chart for this operator.

That second concern was the main issue we had with the project: the Hierarchical Namespace Controller has a very broad impact on the rest of the cluster, beyond just the tenant namespaces we wanted to create with it. By default, the operator placed webhooks on every single namespace in a given cluster, which would trigger admission controllers upon any namespace modification, whether for tenant namespaces, our platform namespace, or any additional namespace in the cluster serving other purposes. Therefore, if the operator failed or went down, it could have had a cluster-wide impact on namespace creation and modification, potentially blocking those actions altogether, including for non-tenant namespaces necessary for other platform operations. As a best practice, we wanted to minimize the scope and impact of any operator we install in our cluster, and this control over non-tenant namespaces proved too big a risk for us to take on, especially without a standard Helm chart to install. Embark, on the other hand, had no such issues, establishing no admission controllers or webhooks and having no impact on namespaces not managed by the super namespace custom resource. In other words, the only namespaces Embark would touch would be those we were creating for our tenants. For these two reasons, our team ultimately chose Embark to manage our tenant namespaces, and we were confident in that decision, having weighed the impact and level of support of each operator. So if you find yourself in this situation, wanting to manage tenant namespaces and offload that work to an operator, certainly consider the same things we considered.

I will now hand this off to Patrick to discuss rate limiting and resource management.

Thanks, Christian. In order to reduce the impact one user can have on the overall system, implementing per-client rate limiting on our APIs is an important step we can take. The goal is to put limits in place so one client cannot push the service over and cause outages for all other users of the platform, or cause response times to fall beneath what is defined in our SLA. The biggest consideration for implementing rate limiting is at what layer in the application stack to put the limits in place. The API gateway, the ingress controller, and the application layer were all taken into consideration, and their pros and cons were weighed. Not all of our traffic goes through the API gateway, so something at that layer would not catch all traffic, possibly leaving us exposed. Rate limiting at the ingress controller layer was considered next; we already use the NGINX ingress controller in our architecture, so we explored its rate limiting offering, but with its complex setup, the difficulty of monitoring rejected requests, and its inability to limit by anything besides IP, we did not pursue this option. Other ingress controllers, such as Traefik, were explored, but similar pitfalls, and the need to add an additional layer to our architecture, led us to begin looking at limiting at the application layer. Since our API is written in Python, we explored the rate limiting packages that exist and found the offerings to be what we were looking for. An ASGI rate limiting package is what we are currently testing, with its simple implementation, rule-based rate limiting, and custom authorization function. We found a solution that is easy to maintain moving forward and does not require major architectural changes.
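As a toy illustration of the application-layer approach, here's what a per-client token bucket looks like as ASGI middleware. This is a sketch of the idea only, not the API of the package we're testing; a real implementation adds rule-based limits, shared state across workers, and an authorization function that maps requests to tenants rather than IPs.

```python
import time

class TokenBucketRateLimiter:
    """Toy per-client token bucket for ASGI apps (sketch, not production)."""

    def __init__(self, app, rate_per_second: float = 5.0, burst: int = 10):
        self.app = app
        self.rate = rate_per_second
        self.burst = burst
        self.buckets = {}  # client id -> (tokens, last refill timestamp)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)

        # Hypothetical identity: the client IP. A custom authorization
        # function would map the request to a tenant or API key instead.
        client = (scope.get("client") or ("unknown", 0))[0]
        tokens, last = self.buckets.get(client, (float(self.burst), time.monotonic()))
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill

        if tokens < 1.0:
            self.buckets[client] = (tokens, now)
            await send({"type": "http.response.start", "status": 429,
                        "headers": [(b"content-type", b"text/plain")]})
            await send({"type": "http.response.body", "body": b"rate limit exceeded"})
            return

        self.buckets[client] = (tokens - 1.0, now)
        await self.app(scope, receive, send)
```

Wrapping an existing ASGI application (FastAPI, Starlette, and the like) is then just app = TokenBucketRateLimiter(app).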
With the number of jobs running in our cluster, we have seen situations that required manual cleanup of pods in hanging states and other resources that stuck around. Kubernetes offers a TTL mechanism, but it only applies to jobs and pods in a finished state. The Kubernetes descheduler offers a way to configure a maximum runtime on pods, but that is just a hard limit, and it could preemptively terminate workflows that do not actually meet our criteria for cleanup. This led to the design of a cleanup nanny process deployed in our clusters. Using Kubernetes cron jobs, we have a scheduled job that checks multiple criteria, such as the logs of a pod, to determine whether a workflow is still running. The job terminates resources in our cluster and then issues status updates to our database where applicable. To reduce the load on these cleanup processes, we deploy a cleanup cron job in each of our tenant namespaces. This also helps reduce the blast radius if one cleanup job fails, rather than having a single cleanup job handle the entire cluster, and it helps tenants stay within their resource quotas. This nanny process, which handles jobs and pods, has since been generalized to handle other parts of our system as well.

We did face an unintended consequence of this cleanup job, where it was actively hurting our cluster's health. The cleanup job identified failed pods to delete; however, the owning job remained active, and it continued to spin up failing pods. As the cleanup job deleted pods, it was also resetting the failed pod count for that job, never letting the job itself fail. We resolved this by not cleaning up the pods directly, but by letting the job fail and cleaning up the job itself. The delete on the job propagated to its pods, cleaning up everything. With the implementation of this nanny process to clean up resources in our cluster, we saw a saving of over a million dollars in compute spend per year.
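Here's a minimal sketch of that fix with the kubernetes Python client. The "is this job actually dead" criterion is deliberately simplified (the real nanny also inspects pod logs and our workflow database); the point is the propagation: delete the Job and let the delete cascade to its pods, rather than deleting pods out from under a live Job.

```python
from kubernetes import client, config

def cleanup_finished_jobs(namespace: str) -> None:
    """Nanny sketch: remove Jobs judged dead, cascading the delete to
    their pods instead of deleting pods directly (which resets the
    Job's failed-pod accounting and keeps it alive)."""
    config.load_incluster_config()
    batch = client.BatchV1Api()

    for job in batch.list_namespaced_job(namespace).items:
        # Hypothetical, simplified criterion: any succeeded or failed count.
        if job.status.succeeded or job.status.failed:
            batch.delete_namespaced_job(
                name=job.metadata.name,
                namespace=namespace,
                # Explicitly cascade the delete to the Job's pods rather
                # than relying on the default behavior.
                body=client.V1DeleteOptions(propagation_policy="Background"),
            )
```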
Now I'll hand it back to David to wrap up.

Awesome, thanks, Patrick. So, to wrap up: we've laid out the requirements for our sample platform, and we took a bit of a tour of all of the organizational processes and technical considerations we found important in building and maintaining this platform. For those building higher-order systems on top of Kubernetes, we really implore you to spend time up front establishing these types of processes: maintaining a production-grade cluster and its upgrades, having full observability throughout your stack, and ensuring least-privilege access. And finally, when building platforms that pseudo-extend the Kubernetes control plane, be mindful about what your services actually expose to your end users, and put safeguards in place for the eventual non-ideal user, as we'll call them. And with that, thank you for watching, and that's it.