And thanks for sticking through this with us. I know my head's exploding. Who's had at least a few moments of clarity there? Thanks a lot for coming out. We're going to talk today about some of the things you might have already seen in some of the other tracks, so I've pulled some things out, and I'm going to give shout-outs to folks who've already done exemplary jobs in this space, especially in the ML track. We're going to talk about a platform, a Kubernetes platform, that we've built to support decisioning at an FI: ours is Polaris, at Capital One. I'm Keith. This is Bryce. This is Gavin. And we'll carve out different pieces. Our colleague Ravi Dubey unfortunately couldn't be here today; he's a little bit allergic to the cold. We really enjoyed seeing snow in Austin yesterday, so that felt like a bucket-list item. I'm going to talk a little bit about the cluster that we've built. We're the platform engineers and site reliability engineers for the platform, so I'll cover what our tenant workloads look like and the kinds of support we provide. We really come from the software engineering side, building cross-stack capabilities in the decisioning and ML space, but don't hold us to too-deep capabilities in ML; we're just getting there via decisioning at this point. So I'm going to talk about our overall mission. Bryce is going to talk to cluster installation and ops, and some of the nuances of our platform engineering and SRE requirements for working at an FI. And Gavin's going to talk to our encapsulation and provisioning: better multi-tenant customer experiences for what we view as a PaaS for decisioning. We've been in production since the second quarter. (And I'm sorry, some of these slides are cut off on the speaker display here.) Starting on Kubernetes 1.6, we are currently deployed to production in a single region, multi-AZ, with homogeneous node types.
We use m4.10xlarge instances, roughly just under 10 of them, to support the workload types I list below. One interesting thing: every tech team at the firm, platform teams and app teams alike, has to go through a complete rehydration every 60 days, from the AMI on up, for their application. Everyone hates it, but I have to say it's probably instilled quite a lot of discipline for resiliency, patching, et cetera. You get very quickly to a point where you can rehydrate your entire environment, and that supports accelerated upgrades and better resiliency, so a lot of benefits have been paid. If you ever want to beat yourself with that stick, please use the patterns we've outlined here. As much as we hate rehydrations, it's a love-hate thing. As for workloads, we run two types of domain-specific workloads on the cluster. The first is real-time: our OLTP decisioning streams, decisioning on transaction streams for the retail bank. We have one app that handles between 6 and 10 million transactions a day, and it's starting to get noisy now that it's the retail season. And we retain 180 days of data, which, as you'll see given that we're streaming with Kafka, is a little bit of an unusual pattern for Kafka; most people don't retain that much data when they work with Kafka. We've had that requirement, and it's been a little unusual for our installation. We also run batch-based model refits. As I said, we're decisioning on the edge of ML: we run POJO models exported from H2O, and you'll see where those come in. We use Pachyderm for refitting those models, using batches of various versions of data to replay those models, to improve, mature, and compete them. Our other set of workloads is ad hoc analytical queries for data analysts and data scientists.
And our final set of workloads, which probably everybody has, is telemetry stacks for logging and monitoring, cluster services, and housekeeping jobs: the various cron jobs you run to back up state, and so forth. Those break out in the following way. Our T1 and T2 workloads are all stateful and all multi-tenant, so we have multiple customer teams, multiple application teams, running Flink apps, right now in a single Flink cluster. Gavin's going to talk a little bit about how we meet the needs of teams who are really doing Kafka- and Flink-based development for their core decisioning applications, and how we are working, as Kelsey Hightower talked about, to wrap some of the capabilities of Kubernetes, to make Kubernetes invisible, so that they can concentrate on their core decisioning. And by decisioning apps, I mean everything from fraud to clickstream analytics. We were originally raised out of the fraud budget, but we have expanded and now offer the decisioning capability across the bank. Our NiFi workloads manage ETL into and out of the platform. Kafka is our backbone, and it has actually become the cross-service bus for teams who are refactoring their Flink apps. There's all kinds of interesting MVW-style pattern work going on in the Flink space once you get to breaking up your Flink app into a richer ecosystem of sub-applications: determining a decision, then visualizing the decision, querying state against the decision, with interim Kafka topics to support all of those interim states. Pachyderm, as Dan from Pachyderm talked about yesterday, is a tool built on Kubernetes which allows for versioning of data sets and then triggering model refit pipelines and event pipelines using batches and microbatches. The analytical environment is Apache Zeppelin and Apache Drill: Zeppelin provides a notebook capability, just like Jupyter, and Drill provides the connectors to various data sources.
You'll see in our case an S3 lake, Aurora Postgres, Aurora MySQL, and so on: various stateful sources for accessing data, with Drill bringing the data together and Zeppelin presenting it. Then we have the standard telemetry stack I've seen across so many implementations: a shared EFK logging stack across our environment, and a shared metric stack of Heapster, InfluxDB, and Grafana. And because we've recognized that you can't just run streaming services without running some kind of request-response services, and the Istio service mesh has been so helpful in that regard, we began running Istio-based services. Zipkin comes with that almost for free, and we do our security overlay with Dex at present. Now, the orientation of this slide is a little weird on the screen. We have data inbound from various sources into NiFi, and from Lambda directly into our Kafka topics. Ingress and egress Kafka topics establish the boundary of what's sort of a black box: the decisioning platform. But Kafka topics also run horizontally across Flink apps, as I mentioned, as interim topics, a sort of application bus. So we're going from monolithic Flink apps to more microservice-based Flink apps, using Kafka as the state mechanism in between. S3 also gets a copy of all of our data: that big teardrop is the NiFi canvas, or series of NiFi canvases with NiFi processors, which do the ETL for data and send copies of it to S3 for later querying. The Flink decisioning delivers metrics back to Aurora and uses a Redis cache grid. And finally, we have downstream queues back to other enterprise service providers; in cases like fraud, you send the fraud alerts back to the investigators as the entry point to their queues.
So we won't spend too much time on the T2 types if you happened to catch Nick's presentation from Pachyderm yesterday. It was excellent; download the deck. Definitely a shout-out to that community; they're doing such interesting work on event-driven, versioned data sets. Those help in any sort of data analysis: you don't have to be doing machine learning, you don't even have to be doing decisioning, even if you're just doing various kinds of analysis in R. I don't know why I beat up on R, but I like beating up on R. As for our analytical environment: what I'm talking through is a series of patterns where Kubernetes has given us a lot of capability very quickly to launch what we thought were much more niche application stacks, and they actually turned out to be beneficial in many ways, not only across the stack but for many other kinds of business needs. General business needs around decisioning and around analysis are sort of endemic and omnipresent, so these patterns are very useful; I can see them living quite a long time and just being blown out to scale. Obviously, running ad hoc queries alongside, and in the same cluster as, your OLTP and traditional transactional flows has been one of the riskier things we've had to deal with, and Bryce is going to talk a little bit about how we ended up having to hedge our own ambitions. So I definitely encourage you, when you're trying to understand the behavior of each kind of workload you run, to create some guidelines and guardrails that attract each one toward better behavior.
Our telemetry stack right now, as I said, is the traditional stack you've probably seen across the conference, but our future state is to provide separate Grafana stacks to tenants, and to keep Fluentd in the mix while allowing teams to bring their own log aggregators and dashboards, since we see a mix of "I'd like to have Splunk," "I'd like to have something else," "I'd like to send to CloudWatch," et cetera. How am I doing on time? Okay, thank you. Maybe I should have put this slide a little earlier. On state and multi-tenancy: going from 1.6, there wasn't as good support for state; in 1.7 we saw StatefulSets emerge out of the beta-ware of PetSets, and they've been rock solid for us on AWS. That's been some of the core reason to build a Kubernetes cluster: its stateful capabilities. Now, what do I mean by multi-tenancy? Isn't Kubernetes already multi-tenant? Yes, in the sense that you can deploy any kind of workload, but not in the sense of what you have to begin thinking about in terms of isolation if you're running a platform as a service. Services can be designed to be either shared or clustered, and you can be overly ambitious with sharing when you should actually be isolating, which is some of our lesson with Flink, or at least how we'd probably like to move forward with Flink. To that extent, because you have applications that do their own clustering, namespaces don't always solve all the forms of isolation, but they become one of your alternatives. And then, as I mentioned, there are pain points at scale from combining very different kinds of workloads without necessarily having guardrails around them first. Sometimes you're going to bring a new workload on, you may not know how it's going to turn out, so you have to be vigilant about it.
So again, what's the value of a managed service, an internally managed platform? Customers are free from the 60-day compliance rehydration requirement and can just focus on app deployments. It's Kubernetes with benefits: we're giving them the cloud engineering, installation, persistent state, upgrades, patching, streamlined security, resiliency engineering, common telemetry, a common domain, and, as Gavin will demonstrate, we even give them a kind of domain-specific-language CLI to deploy their apps, so Kubernetes becomes encapsulated and they can concentrate on their applications. When you get things wrong, you'll start to hear the encapsulation getting broken: customers come out and say they want their own Kubernetes cluster, their own Flink cluster, access to the Kubernetes dashboard; they'll demand more resources than you really want to provide, the whole idea being that you were bin-packing resources to begin with to make your use of your cloud provider more efficient. And everybody recognizes some need for elasticity, because they might have been coming from auto-scaling groups with pure VMs and EC2 instances, and they want more of that kind of thing in the cluster. That's something we're definitely eager to leverage with Kubernetes and all the great work being done by the autoscaling SIG. One of the things I've talked about is trying to track to the idea of well-behaved workloads. We have an idea of a conformant cluster which provides portable workloads for customers, but for platform owners that means going from having your customers come with their own CI/CD tooling to providing the DSL that Gavin's going to talk about, and to stoppable, restartable, reschedulable, cordon-and-drain-friendly jobs. These are important behaviors we expect from workloads.
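To make "cordon-and-drain-friendly" concrete, one sketch of what a well-behaved tenant workload can ship is a PodDisruptionBudget, so voluntary disruptions like the platform's node drains never dip the app below a replica floor. The names and numbers here are hypothetical, not the platform's actual configuration:

```yaml
# Hypothetical tenant app: survive node cordon/drain cycles (for example the
# 60-day rehydration) by keeping a floor of replicas available. A drain
# stalls rather than evicting pods below this floor.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: decisioning-app-pdb
  namespace: tenant-a
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: decisioning-app
```

The app itself still has to be restartable and reschedulable for this to help; the budget only paces how fast the platform can move its pods.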
Good 12-factor-app-type behaviors, like logging to standard out so you can take advantage of the logging stack, and circuit breakers and connection retries built in, are things a lot of application developers still don't think about, even in a distributed world, and we have to remind them: you don't come to us when you can't connect to a service three times in a day when you're normally connecting six million times. Those are normal types of blips; just retry. And then go beyond liveness and readiness probes to exporting deeper health metrics. So, how many folks are running Kubernetes in production right now? At least half the crowd. And how many are running with state? Almost the same amount; everybody's going for state. And are you running it for your own teams, or are you running it as a service? Okay, I'll come back to that and talk a little bit about where we see this kind of service going, and now I'm going to give it over to Bryce to talk about ops. Thank you, Keith. Hello everyone, my name's Bryce. I want to talk about namespacing and tenant isolation. Namespaces: what are they? Essentially a virtual cluster within Kubernetes. When we first went down this path, our system apps as well as our tenant apps were all deployed in the same namespace, so we had tenants stepping on each other's toes: one tenant would update a Secret or ConfigMap unaware that another tenant was using it. So we decided to break out the deployments, as well as the namespaces, per tenant, and we locked down the user policies via Dex and RBAC, our authentication and authorization solutions.
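A minimal sketch of that per-tenant lockdown, assuming group identities surfaced through Dex from LDAP; the namespace, group, and resource lists are illustrative, not the actual policy:

```yaml
# Hypothetical tenant namespace plus an RBAC binding that scopes the team's
# LDAP group (authenticated via Dex) to resources inside that namespace only.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-a-developer
  namespace: tenant-a
rules:
- apiGroups: ["", "apps", "extensions"]
  resources: ["deployments", "pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-developers
  namespace: tenant-a
subjects:
- kind: Group
  name: tenant-a-team        # hypothetical LDAP group name from Dex claims
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-a-developer
  apiGroup: rbac.authorization.k8s.io
```

Because the Role is namespaced, a tenant editing its own ConfigMaps and Secrets can no longer clobber another team's, which was exactly the failure mode of the shared namespace.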
I want to throw out there that we're not biased towards these; they fit our process and our culture, using LDAP along with the user policies you can get with RBAC. We're also in the infancy stage of migrating from Flannel to Calico, and with that we can get enhanced pod-to-pod communication security. All right, so how do we do stateful applications? That's via the persistent volume subsystem. What is that? Two components: a persistent volume, which is a piece of storage somewhere, and a persistent volume claim, which is a link to that storage. The persistent volume is married to a pod: wherever that pod moves in the cluster, the persistent volume claim moves with it, and that way the application can always access its data. Now, with distributed applications deployed on Kubernetes, say etcd, ZooKeeper, or Kafka, there's generally a unique ID for each node, and that unique ID tells the node what subset of the data it's responsible for. We can leverage StatefulSets to have a unique ID permeated throughout the various layers: the pod, the persistent volume claim, and the persistent volume. I'll also throw this out there: storage classes, used correctly, are a sysadmin's best friend. They can dynamically provision the persistent volumes, as well as the underlying storage, via a cloud provider, say an AWS EBS volume or the equivalent on Google's cloud. Now, we are a bank, and we do have compliance requirements, one of which is the rehydration: as Keith mentioned, we have to refresh our nodes every 60 days to ensure the latest security updates and patches. We drive rehydration via a Kubernetes job, and we can subsequently leverage it for upgrades as well. What it does is scale out and drain each node, ensuring that persistent volume claims are reattached to their persistent volumes; and before we scale in, we ensure that all the pods are healthy.
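The StatefulSet-plus-StorageClass pattern described above can be sketched roughly like this for a three-node ZooKeeper-style ensemble; the image, sizes, and names are placeholders under assumed 1.8-era APIs, not the production configuration:

```yaml
# Hypothetical StorageClass: dynamically provisions encrypted EBS volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-encrypted
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  encrypted: "true"
---
# Hypothetical StatefulSet: each replica gets a stable identity (zk-0, zk-1,
# zk-2) and its own claim (data-zk-0, ...). The claim, and therefore the
# volume, follows that identity wherever the pod is rescheduled.
apiVersion: apps/v1beta2          # StatefulSet API group in the 1.8.x line
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-headless
  replicas: 3
  selector:
    matchLabels:
      app: zk
  template:
    metadata:
      labels:
        app: zk
    spec:
      containers:
      - name: zookeeper
        image: zookeeper:3.4       # placeholder image
        ports:
        - containerPort: 2181
        volumeMounts:
        - name: data
          mountPath: /var/lib/zookeeper
  volumeClaimTemplates:            # one PVC per replica, bound for its lifetime
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp2-encrypted
      resources:
        requests:
          storage: 100Gi
```

The ordinal in the pod name is the stable unique ID: the application can derive its node ID from the hostname, and the matching claim guarantees it always comes back up with its own subset of the data.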
When we did the rehydration with an upgrade months back, it took about eight hours manually. This is a cluster where we run etcd with the masters, a static set of five of those nodes, and the minions were about seven across AZs, very large nodes. With this rehydration script we've automated it: we've executed it twice, upgrading from 1.7 to 1.7.5, and then last week from 1.7.5 to 1.8.3, and it's been very seamless. Now, as Keith mentioned, we run a lot of Apache-based JVM apps, and these JVM apps can get very unruly, especially with memory. We've encountered applications such as Zeppelin where, if multiple instances are scheduled on the same node, sometimes they take too much memory for that node and the node can fall over. So pod anti-affinity ensures that at most one Zeppelin pod is on any one node. Some other good behaviors, or good safety belts: resource limits, limit ranges, and resource quotas. Resource limits are set per app, capping CPU and memory. Limit ranges: if the application developer doesn't set limits, these supply the default CPU and memory for that deployment, replica set, whatever it may be. Resource quotas set the overall cap: you can cap the number of secrets, pods, or services for a given namespace, as well as the total CPU or memory for that namespace. And then there's the kubelet. There's a very small nuance here: a property called system-reserved on the kubelet ensures there are always sufficient CPU and memory resources on that node for the kubelet itself to keep running. If you have an unruly pod, something gone wrong out there, it can be your best friend.
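Sketching those safety belts for a hypothetical tenant namespace (the numbers are illustrative, not production values):

```yaml
# Hypothetical LimitRange: defaults applied when an app developer sets nothing.
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
  namespace: tenant-a
spec:
  limits:
  - type: Container
    default:              # default limits if unspecified
      cpu: "1"
      memory: 2Gi
    defaultRequest:       # default requests if unspecified
      cpu: 250m
      memory: 512Mi
---
# Hypothetical ResourceQuota: hard cap on the namespace as a whole.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    pods: "40"
    secrets: "50"
    services: "20"
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 96Gi
```

And the Zeppelin spreading behavior corresponds to a pod-template excerpt along these lines (shown as a fragment of a pod spec, not a standalone object):

```yaml
# Excerpt of a pod template: required anti-affinity keeps pods labeled
# app=zeppelin on separate nodes, i.e. at most one per hostname.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: zeppelin
      topologyKey: kubernetes.io/hostname
```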
So for the future, we definitely want to leverage the pod autoscalers, horizontal and vertical, as well as the node autoscalers that abstract out the cloud provider underneath. With the node autoscaler, we're hoping to use taints and tolerations as well as custom instance types, especially for our GPU-intensive model refit pipelines. And with that, I'd like to turn it over to Gavin Mead. It's already on. Now it is. All right, can you guys hear me okay? All right. One of the cool things about being at this conference has been seeing where we're at in our journey and our ideas, and where we think we should be going. Kelsey Hightower's keynote, about how you shouldn't give developers kubectl access, has really resonated with us, because we're not a general-purpose Kubernetes platform: Kubernetes really is an implementation detail of our service offerings. We offer Flink and Kafka, and we want application developers to focus on their Flink app. Their interaction model with the platform is: I write a Flink app that consumes from a Kafka topic and writes to a Kafka topic. Another interesting talk was Dianne Marsh's from Netflix, about how culture can influence the tech. That really resonated with us as well, because CI/CD is a really big thing at Capital One that we've been pushing: the ability for these Flink developers to push their apps quickly. They were asking for tools like a CLI that they could use in their Jenkins jobs, and that's what we wanted to give them. So the idea is that we have a CLI they can run from their local machine when they're working in, say, staging or QA, but the same CLI can also be installed on Jenkins. It's a Go app, and we have builds for both Mac and Linux.
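The taints-and-tolerations idea for GPU-intensive refit pipelines can be sketched like this; the node label, taint key, and image are hypothetical, and the taint would be applied out of band (for example `kubectl taint nodes <node> accelerator=gpu:NoSchedule`):

```yaml
# Hypothetical GPU pod: the GPU node pool carries a NoSchedule taint so
# general workloads stay off it; only pods that both tolerate the taint and
# select the labeled nodes, such as model refit jobs, land there.
apiVersion: v1
kind: Pod
metadata:
  name: model-refit
spec:
  nodeSelector:
    accelerator: gpu            # hypothetical node label
  tolerations:
  - key: accelerator
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: refit
    image: example/model-refit:latest   # placeholder image
```

Paired with a cluster autoscaler, the taint also lets an expensive GPU instance type scale to zero when no refit jobs are pending.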
The way that's set up, we have an ALB and a set of microservices that listen for those various CLI commands. Ingress is done via HTTPS, but the communication between the microservices is gRPC. We use Istio as a service mesh for those microservices, and the really great thing is that it worked right out of the box: setting up the sidecar was a breeze using istioctl. We're also using Zipkin for our tracing, so all of our microservices are traced. The one microservice that talks to Flink communicates with the Flink JobManager REST API to do the actual deployment. Is anyone here familiar with Apache Flink? Done any development? Yeah, a couple, okay, cool. What we try to do is simulate the JobManager submission, but with some additional functionality. One of the things we add is that we watch the deployment for a fixed amount of time and capture all of the states the Flink job transitions into. As an example, it could go from starting to running, and if it stays in that state we're in good shape; but if it went from starting to failing to restarting, we'd be able to give that information back to the customer, and they can further troubleshoot to see what's going on. One of the things Keith alluded to is our shared model and some of the learning opportunities that came from it. We run a shared Flink cluster right now, but we've realized it's kind of hard to put guardrails around that in terms of the maximum number of task slots you can use. The other thing is that because we're not using something like YARN, one job can have a disproportionate negative effect on the overall cluster. We talked about possibly using YARN with Flink, but we feel like Kubernetes can provide a lot of that capability, at least with respect to CPU and memory usage. And the other thing is, they were asking for their own clusters.
So where we're at in our journey, and where we're starting to move, is the idea of a more self-service model: let the customer create their clusters however they want. We'd have the appropriate guardrails in place with respect to resource limits, but how they carve it out is up to them. The idea is to use, say, a "flink create cluster" command and the same sort of ingress provisioning, but now we're really looking at CRDs. We've been really inspired by the talks we've seen about operators; there was a great talk earlier about Kafka operators, and the CoreOS folks' Prometheus and etcd operators, both of which we're using, have been really fantastic and are a great way to get started. So our future state is a more self-service model where we would basically provision the customer, say, a Grafana instance with some curated dashboards, set up their own Prometheus server to capture their metrics, and then they decide how they want to alert. One cluster, if they want to share one; but say they have one job that's critical, they could provision just that one Flink job to a particular cluster and do their other operational stuff on another one. We really want to put it in their hands. The other thing is, we're looking at more CNCF tools, so we're evaluating Jaeger right now as a replacement for Zipkin; I went to the Jaeger talk yesterday and it was really awesome. And with that, Keith is going to take us home. Thank you both, that was awesome. People ask me what I do, and with the dynamism in the Kubernetes community, and sometimes in dodging architectural reviews, we're actually doing the work in the cluster while we're proving it out.
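Since the Prometheus operator is already in the mix, that per-tenant monitoring future state can be sketched with its CRDs; the names, labels, and sizes here are hypothetical, not a description of the actual rollout:

```yaml
# Hypothetical per-tenant monitoring via the Prometheus operator:
# a tenant-scoped Prometheus server that only scrapes ServiceMonitors
# carrying that tenant's label.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: tenant-a-prometheus
  namespace: tenant-a
spec:
  replicas: 1
  serviceMonitorSelector:
    matchLabels:
      tenant: tenant-a
  resources:
    requests:
      memory: 400Mi
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flink-metrics
  namespace: tenant-a
  labels:
    tenant: tenant-a
spec:
  selector:
    matchLabels:
      app: flink              # hypothetical label on the tenant's Flink service
  endpoints:
  - port: metrics             # assumed named metrics port on that service
    interval: 30s
```

A provisioned Grafana instance would then point at `tenant-a-prometheus` as its data source, and the tenant owns the alerting rules from there.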
And in a way it's made us pretty fearless, which sometimes gets a little scary, because in a more compliance-based environment there are usually a lot more process gates on design; but Kubernetes has really given us the flexibility to test things out more quickly than we normally could in a standard system delivery workflow, a software delivery workflow. So when people ask what I do, I tell them I feel like an in-flight drone maintenance technician. These are just a few of the things we've gone through: from AUFS to overlay2, from RHEL to Ubuntu. We're starting to move away from NiFi, because it's a little difficult to orchestrate some of the canvases; it's more GUI-driven, it has a bigger GUI footprint, and we kind of like Kafka Connect as a replacement. We've gone from InfluxDB and Heapster to Prometheus, given the awesome work by the Prometheus community; from EFK to Fluent Bit plus BYO log aggregation; from Ignite as a heavyweight cache grid to Redis; from SkyDNS to CoreDNS; from S3 direct to Minio; et cetera. I don't know if your journey has been like that, but we've just been churning through things, evaluating them quickly in implementation, even in production, and where they fail, okay, we move on. In some ways we're encapsulating Kubernetes from our end users, but at the same time our time to market for the platform and platform services is almost entirely due to Kubernetes's capabilities. Conclusions: if you're running Kubernetes, I'd encourage you to think about what you're running it for, what kind of managed service you are. If you take on any number of different kinds of workloads, you can get into trouble. We've carved out the decisioning space for our cluster. Folks have come to us and said, we'd like to run our NiFi jobs there, we'd like to put all our Kafka topics there.
For us, we're not Kafka as a service, we're not NiFi as a service; we're a decisioning platform as a service. So deciding what kind of Kubernetes cluster you're going to run is really important, as is what kind of service you're going to offer if you're a managed service provider. I love the domain-specific-language, CLI-based installation, because you get both the benefit of a DSL and compatibility with CI/CD tools. State is going to creep in; it's almost impossible to avoid, especially for running resilient clusters as we look to go cross-region and multi-cloud. You've got state to manage and state to maintain; that's inevitable. Unchecked workloads, and especially ad hoc workloads: you've really got to look at putting resource limits on them so you don't get the tragedy of the commons. You are likely already multi-tenant; you may not even realize it, even with your own ops jobs. And I think it's Jevons paradox people have been quoting here; we see it all the time, because the slope of the line for the storage we provision for stateful workloads just continues to climb, almost hockey-sticking. On a cluster supporting streaming services: we thought we were just going to be a streaming-service cluster, but you still need request-response services, so think through how you're going to host REST or gRPC customer-facing services; obviously Istio is the go-to there. And given Kubernetes's extensibility, so many great things are works in progress. We see more and more specialized clusters coming: I've seen it in robotics automation, in decisioning, and you've seen some of the machine learning already here, so great stuff is coming. Community shout-out: I want to say thank you to Sam Brown (Sam, raise your hand) for organizing the Nova Kubernetes Meetup. If you're in the Northern Virginia, DC area, please come; I'm really excited to grow that community. And yeah, thank you all.