Good afternoon. Welcome to our panel discussion. My name is Sunil Shah and I'm an engineering manager on the Compute Infra team at Airbnb. We manage all the Kubernetes infrastructure at Airbnb, and we've been running all of our production web services on Kubernetes since 2019. We're excited to discuss on-call best practices for Kubernetes with you today. Each of the panelists here operates Kubernetes at high scale in production at a large technology company. Let's start with a brief introduction from each panelist. Hi, I'm Ashley Cattello. I am the tech lead of Lyft's application runtime organization, which comprises your foundational infrastructure components such as compute, networking, and data stores. I've been at Lyft for five years, and Lyft has been running Kubernetes in production at scale for around three years now. Our workloads range from stateless services and stateful services to ML training on GPUs, and so on. Hi everyone. My name is Fabio Kung. I've been at Netflix for almost six years now. I am a staff engineer on the Compute team at Netflix. Titus is our container platform; it predates me, so it's been around for roughly seven years, and it was originally built on top of Mesos. We migrated the guts of Titus to Kubernetes around three years ago, without people even noticing. That was a big thing for us, moving from Mesos to Kubernetes, and we've been running Kubernetes since then. Titus runs a wide variety of workloads at Netflix, ranging from stateless web services, stateful services, and real-time data infrastructure pipelines to machine learning workloads and encoding as well. Hello. I'm Madhu. I'm a software engineer and tech lead for the orchestration team at Robinhood. My team is responsible for running the compute platform for all of Robinhood. We have been on Kubernetes for a little over three years now, but I have been at Robinhood for only a little over a year. Before Robinhood, I worked at Google for about eight years, of which I spent a little over half working on Kubernetes itself and related technologies. Hi. My name is Ramya. I'm an SRE embedded with the Compute Infra team. I've been here for around five years, and I've been managing Kubernetes for around four years, since its inception within Airbnb. My team is responsible for Kubernetes upgrades, etcd upgrades, managing autoscaling groups, managing the cluster autoscaler, certificate management, CRI-O, and the whole nine yards. It's a pleasure meeting you all. So let's start with a brief overview of how on-call operates at each organization. Specifically, what is the split of responsibilities between teams using Kubernetes and your team operating Kubernetes? And how do you organize your on-call rotation within your team? Let's start with Fabio. We completely abstract Kubernetes away from our users. We don't expose a Kubernetes API internally at Netflix today, and we package Kubernetes as a product. So we offer a job- and task-based API so people can run workloads against that API. We fully own that service and operate it as an internal product. So we have a clear interface, a clear contract between our users and what we offer, and they don't know about Kubernetes; it's an implementation detail for them. We have full control over how we use Kubernetes, and any problems, any issues, are completely in our scope. And our on-call is currently organized in week-long shifts.
We have a primary and a secondary rotation, and the secondary shift precedes the primary one, so each person is on call for roughly two weeks, first as secondary and then as primary. And we have a single pool for everything: all aspects of the container platform, all aspects of Titus. Everyone is on call for everything today; it's two big rotations, primary and secondary. We tried splitting in the past, we've since combined, and we've been back and forth. It's a constant trade-off for us between creating silos inside our team and managing the surface area as we combine, because combined there's a lot more to cover. I wouldn't necessarily say we've got it right yet. We'll be iterating back and forth, and we're likely to continue iterating: trying to break things up a little bit again, trying to combine a little bit more. We'll keep iterating on this. Thank you, Fabio. Madhu, do you want to go next? Our approach is slightly different from this. We have a platform experience team, separate from my team, that provides higher-level abstractions on top of native Kubernetes APIs, and we call those APIs archetypes. Those APIs are essentially just CRDs, so it's very much a native Kubernetes experience. The platform experience team is also responsible for providing a UI, and all services at Robinhood are expected to use those APIs and that UI. Orchestration, which is my team, provides the foundational compute platform for the platform experience team and the layers above. It's the team responsible for managing the lifecycle of Kubernetes clusters and any other underlying AWS infrastructure; we are a full AWS shop right now. All of this is in addition to developing and shipping in-house infrastructure software on top of Kubernetes. In a sense, my team, orchestration, builds and operates an internal distribution of Kubernetes that supports the company's growing needs. During incidents, like many other companies in the industry, service teams are on call for their own services and make a quick determination as to whether the problem is at the application level or at the infrastructure level. If they determine it's an infrastructure problem, they page the appropriate infrastructure team's on-call. They can go into a tool called the Goal Services tool, where they can look up the mapping between infrastructure services and on-call aliases and then page that on-call alias. For example, the platform experience on-call is paged if the issue is in an archetype, and the orchestration on-call is paged if they think it's a Kubernetes-level problem. Unfortunately, it's really hard for application or service teams to determine that, so orchestration is sometimes treated as a catch-all and we get a lot of pages. Orchestration itself is divided into three teams (container orchestration, compute infrastructure management, and cloud infrastructure), and the appropriate team is engaged based on the problem type. Sometimes there is a human router involved if the teams cannot determine which exact team the pages should go to. Orchestration incident response is a 24x7, three-day rotation. It's a single rotation; we have experimented with various durations, but this has worked well for us. There's also a separate, business-hours-only, three-day service desk rotation. Wonderful. Thanks.
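Since Madhu describes the archetypes as ordinary CRDs consumed through the standard Kubernetes API, a minimal sketch of what submitting one might look like follows. The group, kind, and spec fields here are invented purely for illustration; Robinhood's actual archetype definitions are not public.

```python
# Hypothetical sketch: submitting an "archetype" custom resource the same way
# any other Kubernetes object is submitted. The group, kind, and spec fields
# below are invented for illustration and are not Robinhood's real archetypes.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

archetype = {
    "apiVersion": "platform.example.com/v1alpha1",
    "kind": "WebService",
    "metadata": {"name": "orders", "namespace": "team-payments"},
    "spec": {
        "image": "registry.example.com/orders:1.2.3",
        "replicas": 3,
        "port": 8080,
    },
}

# A platform team's operator would watch objects like this and expand them
# into Deployments, Services, autoscalers, and so on.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="platform.example.com",
    version="v1alpha1",
    namespace="team-payments",
    plural="webservices",
    body=archetype,
)
```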
Yeah, I'm seeing a bit of a theme here, which is that each of your companies has abstracted away how users actually access Kubernetes, which is definitely something we've considered at Airbnb too. Ashley, I'd love to hear about your experience. Yeah, so at Lyft, we also go the abstraction route. I think we are what we call "aspirationally serverless," where we're not quite there yet, but we do aim on the infrastructure side to abstract away implementation details of our infrastructure from service owners. And so service owners, who for the most part are running on top of our Kubernetes platform, are responsible for just defining their own business-logic-based SLAs and getting paged based on those. The compute team is then responsible for the overall health of the Kubernetes infrastructure, so that's Kubernetes itself, EC2, and anything that is running at that layer. And then we have another team called control systems that operates the operators that generate objects for service owners; it manages the platform layer that the service owners interact with. There is a bit of a gray area, and I think this is where it becomes difficult and that kind of breakdown doesn't really work anymore, which is operator owners. We do have a few teams that deploy and manage custom operators that are not within infrastructure at Lyft. The way that we've tried to wrangle this is to spell out in advance an explicit contract and escalation policy for these teams: they are the first level of support when issues from their customers come in, and they triage them via the platform. Then, if it can be proven to be an issue with the infrastructure layer below that, they should escalate and involve the compute on-call. Thanks, Ashley. It sounds like there's definitely a theme here of escalation between teams depending on where we think the issue might actually be. And that's definitely something that seems a little bit messy with Kubernetes in general, just because the surface area is so large. Ramya, tell us about your experience at Airbnb. Yeah. We also have a similar theme. We have something called the OneTouch API, an internal tool we've built that helps product teams build services on top of Kubernetes. It abstracts away all the YAML files that need to be applied to build and deploy services into Kubernetes. We maintain this API, and there are still a lot of questions from our product teams about it. So to help them, we have a weekly office hours rotation, a Stack Overflow-style implementation where people ask questions and get answers, and good developer portal documentation that they are expected to refer to before asking questions of everybody else. We also set clear expectations: all nodes are ephemeral, and everything will go away within 14 days irrespective of whether you're stateful or stateless. Teams are expected to handle these rotations gracefully and not have incidents every time we rotate instances. We also have high-touch customers who have CRDs running, and we isolate them onto separate clusters, because one CRD should not affect any of our other customers. So high-touch customers get their own special clusters. Within the team, Kubernetes is a huge infrastructure with a huge surface area, so we have divided it into three parts, and team members shift across them.
We have a scheduling part, a foundation part, and a OneTouch part that builds the APIs our customers use. The on-call is shared across all these parts, and there is a primary on-call and a secondary on-call, where the secondary acts as a fallback for the primary. Thanks, Ramya. Cool. So yeah, we've touched a bit on how your on-call rotations look and how the split of responsibilities works. Once you've got on-call up and running, how do you all facilitate the sharing of state and knowledge between members of the on-call rotation? For example, there may be a production issue that carries over from one week to the next. How do members of the rotation communicate context with each other? Let's start with Madhu. There is an orchestration on-call Slack channel where quick handoff notes are passed from the outgoing on-call to the incoming on-call person. That's mostly how we transfer state. These notes typically cover the context of ongoing incidents and things to watch out for in general. In addition to that, there is a separate weekly on-call review. On-calls are expected to fill out a very short form for each sev they responded to before this review, and then on-calls, managers, TLs, and other interested team members discuss these sevs and the form responses in the review meeting. We look for patterns and come up with short-, medium-, or long-term action items during these reviews. This is aside from self-corrective actions, or SCAs, taken during the sevs themselves, which are very specific to each sev. We also have a company-wide tool called Houston, part of a larger incident response suite, where all incident responders are expected to log their hours and provide brief descriptions of the sevs. This data is used for higher-level analysis and for finding larger patterns across the organization or the company. Very interesting. And I can see a lot of parallels to what we do here. Slack, or some other sort of instant messaging system, seems to be a big part of sharing context, and that's definitely worked pretty well for us. Ashley, how does it work at Lyft? Yeah, we have a lot of similar things to what everyone else has been discussing. We do have the primary and secondary setup at all times, where the primary is usually the one handling issues, and the secondary is not normally expected to have to respond to incidents but is more of a fall-through. To capture issues that may take more than one week to resolve, there are kind of two paths here. There is a weekly handoff meeting that is part of the team's weekly planning cycle, and there is an on-call summary doc, a rolling summary that the on-call fills out and passes on to the next. For things that are more debugging-related, or feature requests and the like that come in from customer teams using the platform, we found that our on-call rotation was not a good fit, because these things usually took more than one week, and the customer did not like being handed off from person to person, or things could get accidentally dropped very easily. And so we actually created a separate role that sits in a team called InfraOps and is responsible for less urgent but probably more long-running issues like debugging requests, feature requests, etc. That enables these items to stay with the same owner until they're resolved. Cool, thank you. Ramya.
Yep, we have a very similar setup as well. We have a primary and a secondary on-call. The primary handles all the tickets, and if there is any fall-through, the secondary handles it. This is a weekly schedule, and at the end of every person's shift there is an on-call hand-off meeting. There is typically a Google Doc where people record the major changes or major incidents that happened in the last week, or smaller incidents that are continuing to happen and need further investigation. We also struggle with which incidents need to be handed off and which ones the person on-call should take to completion, even though he or she is no longer on-call for the system. We also maintain an on-call task list; we are trying Kanban right now. This is a list of low-priority tickets, debugging tasks, and requests that come in during the week. The on-call probably doesn't have bandwidth for these, and there is a separate bug rotation to look at these tickets, work through them, and take them to completion. Yeah, and I will add that the bug rotation is actually a great way for people to onboard onto the on-call rotation, because it allows people to tackle on-call-like tasks without the time pressure of being in an incident. Fabio? I think this will be boring; it's very, very similar to everything that's been said. We also run weekly hand-off meetings, and we use a rolling document that the primary fills out. One thing that's maybe unique, although not super unique since I heard it before from other panelists, is that we merge support, on-call, incident response, monitoring, all of that. It's all part of our on-call rotation, so all the responsibilities are combined, and we make sure we go through all of that in our hand-off meetings. It's very organic at Netflix, I would say, as part of our freedom and responsibility culture. So we lean a lot on the individuals in a lot of cases. For example, between primary and secondary we don't have a lot of rules or structure on how to handle the load; it's up to them to divide and balance the load as needed. The primary is always expected to be responsible for initiating all of that, but the primary also needs to pull in others as necessary. Aspirationally, we want everyone on the on-call to be comfortable with the huge surface area of everything. We know that in reality there's always a bit of specialization, so we try to balance that; the primary is responsible for balancing it and pulling in others as necessary, especially when specialization is needed. Another thing that's maybe a little unique in what we do, which I haven't heard before, is that we spend time in our hand-over talking about the health of the on-call. We use a template, and we always make sure we talk about how much time people are spending handling pages off-hours, how much risk was perceived in the system, and their qualitative assessment of how the shift went. They're just prompts for us to make sure that we talk about those things and they're not brushed off. A healthy on-call is a high priority for us all the time. Yeah, that's a great call-out. Thanks, Fabio. On-call is definitely a very stressful and anxiety-inducing activity, and that's actually a nice segue into the next question, which is this: one of the conversations I have frequently when people join our team, especially if they come from a smaller company with a little bit less responsibility or smaller clusters, is about the uncertainty and anxiety of joining on-call.
Kubernetes is a big surface area, as we've mentioned a couple of times today. People who are new to the technology or the team are often concerned about their lack of experience and comprehensive knowledge, and about not being able to help when there is an incident. So I'd love to hear how you all prepare new people to join the on-call rotation and get them ramped up. Let's start with Fabio. I think we're still learning. What we're trying has been working relatively well, but we're still evolving it. It's not great for people who join; it's a huge surface area, like you're saying, so we're managing the situation, I would say. We do onboarding sessions and workshops, and we've recorded a bunch of them. They range from the day-to-day, such as what tooling you use to troubleshoot things and to gather operational data so you can trace what's going on, to higher-level architecture onboarding sessions on how things plug together, how they work, what the interactions between the systems are, and where the hotspots are. Typically, new members will also shadow other on-calls at least once, often multiple times. They just add themselves to the rotation and pair up on any issues, questions, and support that come up, so that helps build confidence as well. And the opposite: during their first or second shift, when they're starting to get into the on-call rotation, they will ask experienced members to shadow them as well, to have a buddy. We have runbooks too; you may agree or disagree, I don't know, but keeping runbooks up to date is always a challenge. We do have runbooks, and we talk about them during handoffs. And being very supportive as a team is important for us. We know it's a huge surface area; it can typically take three to six months for a member to feel confident joining the on-call rotation. Yeah, that makes sense. We usually wait at least three months after someone joins the team before they start preparing to join the on-call rotation. And I often tell people when they join a team like ours that they should expect to feel like they're still figuring things out for probably the first year. There's just a lot to handle. Let's hear about Lyft. Yeah, so this is also very difficult. Operating Kubernetes involves a massive surface area and also massive risk to the organization, since if something goes wrong at this layer and it's not handled appropriately, it's going to be pretty severe. So this is something we find pretty challenging to do correctly in terms of bringing people on board while minimizing risk to the company, particularly with very junior team members. However, we did recently successfully add a new-grad, entry-level person to our rotation, and it's been going great. She's doing great, so it can be done. We find that we have the most success by starting with shadowing. We'll have new team members shadow a primary on-call, meaning that they receive pages but aren't expected to be responsible for triaging; they're just following along and asking questions. After they're done with one or two shadowing sessions, they will do reverse shadowing, in which they and an experienced on-call are both paged. The new person is now responsible for trying to triage, but there's a backup, someone there who can help them if they get stuck.
And then only after they have completed one or two of those and it's going well do we introduce them into the rotation. Something we've found is that no one is ever going to feel ready enough when they are put into the rotation; it's a bit of a trial by fire. So it's more about ensuring that they know where to look and how to get help in the event that something happens that they don't know how to handle, so that they can learn on the job without compromising our production safety. Absolutely. And I think that's just the nature of on-call: it's never predictable, because if it were predictable, ideally you would have automated that edge case or scenario away. So yeah, definitely challenging. Ramya, you're up next. Yeah, we have the same issues that everyone else has. One thing we do: we have a lot of different code bases, and we typically let the new hire create a cluster, which mostly means creating PRs across multiple repos and learning the parts that change or get created during cluster creation. We also have video recordings of past presentations on different components of Kubernetes. These recordings live in a Google Drive, and the new hire typically goes and watches all of them and then has a session where they ask questions about the various sub-components of Kubernetes. Then they become secondary on-call first and handle all the low-priority tickets and the JIRA tasks that come in as part of that rotation. Finally, they become primary on-call when they are comfortable. Typically, every new hire has a new-hire buddy. The buddy and the new hire join up and collaborate for the first shift, and they get a lot of help from the buddy as well as other team members, which helps them get through the first on-call. We have runbooks, and we hope to keep them up to date; it's a cat-and-mouse game. Every now and then something drifts in a script and has to be updated retroactively. We also expect alerts to have triage instructions. Again, those also change; it's the same cat-and-mouse game. And dashboards are the same: after every incident I find another metric that's super useful and add it to the dashboard. Dashboards are super useful for triaging on-call issues, and we keep a list of dashboards in the documentation that helps the on-call. We also have fire drills, typically every Wednesday, which hopefully help the new hire as well. During large-scale events, the on-call is not expected to be alone. Most of the time, a few experienced team members just jump into the Slack channel that is specific to that particular incident. It's all hands on deck when there's an incident; I don't think the on-call ever feels alone handling large-scale incidents. Thanks, Ramya. And finally, Madhu. Yeah, our experience is very similar to what was just said. In fact, this is one of our biggest people challenges right now. While the huge surface area and familiarity with the technology are certainly challenges, people also come in with varied levels of incident response experience. Previously, after three to four months of starter tasks that included SCAs and starter projects mentored by an onboarding buddy, we would ask new members to do a shadow shift and a reverse shadow shift and then go on call. That proved to be insufficient as we grew and hired people with different levels of experience and familiarity.
We have runbooks, but many incidents are like mystery thrillers. Thriller is the operative word, because many times it requires some heroism from someone who knows something about a specific technology. This is all due to the complexity of the technologies involved, which requires us to deep dive even to mitigate the incidents, not just to root-cause them. So we are working on revamping our onboarding process right now. We are planning to introduce a curriculum for new team members to facilitate structured, self-paced learning, something like what Ramya was saying about recorded videos. We haven't done that yet; we are working on it right now. More recently, we started running team-wide game day exercises and deep dives on different areas from subject matter experts. We even provide one-on-one coaching for people who are less confident, and we give people opportunities to go through multiple rounds of shadows and reverse shadows for both the service desk and incident response rotations if they ask for it. Some people are confident, some are not, so we offer the opportunities depending on their level of confidence. Wonderful. Thanks. Yeah, I've never heard incidents described as a mystery thriller before, but that definitely makes a lot of sense, and there's a certain sense of satisfaction once you figure out what the root cause is. It's often hard to understand what's going on with a Kubernetes cluster when something does misbehave. For example, we've certainly experienced this where a misbehaving client can easily degrade the performance of the control plane. What tools work well to get observability into what's happening with your cluster? And do you use any out-of-band tools to share context as changes are made by engineers at the company, to try to understand what might have caused an issue with the cluster? Let's start with Madhu. Yeah, this is going to be a really interesting and long-winded one. We use Prometheus plus Grafana for metrics and Vector plus Humio for logging, not just for the Kubernetes components, which obviously expose some of these things, but also for the extensions that we build. And we expect all the clients on our system, at least the in-house clients, to have some level of monitoring, logging, and so on. Distributed tracing is slowly being added to core Kubernetes components right now, so because there was no good prior art, we haven't done a lot of tracing for our Kubernetes extensions. Applications do have traces and use them heavily, though. That said, we have black-box monitoring running against our clusters all the time. We call them assertions, and they run frequently, depending on where they're running and what exact set of tests they cover. But despite all of these, we have been in sevs where none of them gave us the full story or the full picture. We had to take tcpdumps, read through iptables, and then deep dive into the code itself just to mitigate. So there are all those challenges, but in addition, we are trying to train the team on all the tools that we think we might use during these incidents, and the tools we have used during past incidents, so that everybody is at least familiar with the tooling and can use it when it's required. In addition to that, we're also building a CD system, which we internally call Platform CD, optimized for infrastructure change management.
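The black-box "assertions" Madhu mentions above can be thought of as continuously running probes. As a rough sketch only (not Robinhood's actual tooling; the namespace, image, and deadline below are arbitrary assumptions), one such probe might create a trivial pod and report failure if it cannot run to completion:

```python
# Illustrative black-box probe, not Robinhood's actual "assertions" tooling:
# verify the cluster can still schedule and run a trivial pod within a deadline.
import time
import uuid

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "cluster-probes"   # assumed namespace dedicated to probes
DEADLINE_SECONDS = 120         # arbitrary deadline for the assertion to pass


def run_assertion() -> bool:
    name = f"assertion-{uuid.uuid4().hex[:8]}"
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"app": "cluster-assertion"}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="probe",
                    image="busybox:1.36",
                    command=["sh", "-c", "echo ok"],
                )
            ],
        ),
    )
    v1.create_namespaced_pod(NAMESPACE, pod)
    try:
        deadline = time.time() + DEADLINE_SECONDS
        while time.time() < deadline:
            phase = v1.read_namespaced_pod(name, NAMESPACE).status.phase
            if phase == "Succeeded":
                return True
            if phase == "Failed":
                return False
            time.sleep(5)
        return False  # timed out: scheduling or startup is degraded, so alert
    finally:
        v1.delete_namespaced_pod(name, NAMESPACE)
```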
We are building visibility features into that CD system to quickly understand the changes that go in, across the board: not just the changes rolled out by my team, but also those from all the other high-touch teams that are building operators against the system. Hopefully that will give us the visibility we need to understand the changes that go into the system. Fascinating. So there's a lot to it; the summary there is that there are lots of different systems and lots of different ways to try to figure out what is going on with the cluster. Let's go to Fabio and talk about Netflix. I personally love the CLI, and a lot of people on my team do as well, so we use kubectl, jq, and kubectx a lot, and we build our own tooling and scripts on top of them. We've also found on my team that having the ability to blast SSH, even though we don't love doing that, is a very good escape hatch to have. So blast SSH, parallel SSH, is also important. Our observability tooling mostly predates Kubernetes; we already had observability before we introduced Kubernetes. So instead of using Kubernetes-"native" observability, we try to plug whatever we have into Kubernetes. We mostly do that through a combination of writing our own exporters and, these days, using OpenTelemetry (OTel) collectors: either we contribute our own custom collectors or we use what's available there. Those collectors and slurpers pull all the operational data, like logs, traces, and metrics, and forward it to our in-house observability tooling, which is built around Atlas, our open-source time-series database. We complement Atlas with indexing; we use ELK (Elasticsearch and Kibana) a lot for indexing various things, ranging from logs to state transitions and events from the clusters. And we also complement that with big-data-style data archival and query capabilities. Both the indexing and the big data are very valuable to us for ad hoc queries as well. We often hit cases where we don't have the metrics we needed to troubleshoot something; we'll add the metrics, but we also have the ability to look back at what happened, and that's valuable to us. We've also found it very helpful to write plugins for kube-scheduler to add more tracing data, especially on scheduling decisions. Our experience is that the information about scheduling failures attached to the pod is good, but you have no history, and it's often very hard to understand why things are not getting scheduled. So we also added a bunch of extra telemetry there. Wonderful. Thank you, Fabio. Let's go to Ramya. So we have metrics: we scrape all the Prometheus endpoints that are available and pump that data up to our metrics provider. For logging we have a similar setup that collects all the logs from all the systems and makes them available through Kibana. We have dashboards, and during every incident I figure out a new metric that is super useful and wasn't previously on the dashboard. A dashboard can only hold maybe 50 or 60 metrics before it becomes unreadable, but after every incident I'm like, oh, this metric is super useful, and I keep adding them. The same goes for the scheduler, the control plane, and the Kubernetes API server. The API server gives out a huge number of metrics, and we pick and choose only the ones that matter.
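Fabio's point about scheduling failures having no history comes up again below when Ramya talks about events. Purely as an illustration, and not the kube-scheduler plugins Netflix actually wrote, one simple way to keep that history is to watch FailedScheduling events and forward them to whatever metrics or indexing backend is already in place:

```python
# Illustrative sketch only (not Netflix's scheduler plugins): keep a history of
# scheduling failures by watching FailedScheduling events and shipping them to
# an external store instead of relying on the pod's transient status.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for item in w.stream(
    v1.list_event_for_all_namespaces,
    field_selector="reason=FailedScheduling",
):
    event = item["object"]
    # In a real setup this would be forwarded to a time-series or indexing
    # backend (Atlas, Elasticsearch, etc.) rather than printed.
    print(
        event.last_timestamp,
        event.involved_object.namespace,
        event.involved_object.name,
        event.message,
    )
```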
Events are super critical for us. Why did a pod get killed? Why did a pod get evicted? Why did scheduling fail? All of that gets pumped up to our metrics as events, and I look through event logs all the time to figure out why scheduling failed, or why a pod got killed, or why we rotated a node, and so on. So events are super useful. For large-scale incidents, we hope to have everything scripted out and have a runbook for things like rotating nodes or rotating all deployments. But I immediately pick up kubectl and my own for/while loops during incidents, because there's always a slight variation: I really want to filter out all the deployments that have some particular characteristic and then do something to that set of deployments. So I immediately go back to the shell and try to change the state of the system to handle the incident. So yeah, that's been our experience handling incidents. Thanks, Ramya. Ashley, tell us about Lyft. Yeah, so we also have an in-house stats pipeline that collects and aggregates in-cluster metrics, and a similar thing for logging, which is collected and shipped to a central Kibana. We have and rely a lot on dashboards and alarms, and as part of our weekly on-call handoff process there's tuning and back-testing that happens to ensure these alarms are really useful and not just noisy. We also rely on an in-house canary service, kind of similar to the assertions thing, that continuously runs end-to-end tests covering functionality essential for operating Lyft services and pages us if it cannot complete them. For operational tooling, Lyft has open-sourced a project called Clutch that exposes critical incident response actions, such as cordoning a node, pulling debug logs from Envoy, or terminating pods, in a web UI. We rely on that to respond to things and also to enable others to self-help: a service owner can cordon a node if they notice an issue before the compute on-call is available, and that way we can improve the time to mitigation. In terms of how we handle misbehaving clients, we do try to limit the usage of operators much earlier, and we also isolate these operators and CRDs onto clusters where their blast radius is limited. One thing that we've built and find very useful is a large collection of really small operational tools, or self-healing assistant tools, for Kubernetes, because a lot of the time we'll notice that Kubernetes gets itself into a terminal state that isn't really what the service owner expects. For example, you might get a failed "create pod sandbox," and that's just the end for Kubernetes; it leaves the pod there and will never retry, which isn't really what the service owner wants. They just want their workload to run somewhere. And so we have a little controller that watches for things like this and just terminates the pod or the node, et cetera, so that the workload will get scheduled somewhere else and succeed, like the service owner expects.
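As a rough sketch of the kind of tiny self-healing helper Ashley describes (not Lyft's actual controller; the event reason string, the phase check, and the polling interval are assumptions), deleting pods stuck behind a sandbox creation failure so their owning Deployment or ReplicaSet reschedules them might look like this:

```python
# Rough sketch, not Lyft's actual controller: find pods stuck behind a sandbox
# creation failure and delete them so their owning controller recreates them
# on another node, which is what the service owner actually wants.
import time

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()


def reap_sandbox_failures():
    # The kubelet emits events with this reason when sandbox creation fails;
    # the exact reason string to key off is an assumption worth verifying.
    events = v1.list_event_for_all_namespaces(
        field_selector="reason=FailedCreatePodSandBox"
    )
    for ev in events.items:
        ref = ev.involved_object
        if ref.kind != "Pod":
            continue
        try:
            pod = v1.read_namespaced_pod(ref.name, ref.namespace)
        except client.exceptions.ApiException:
            continue  # pod is already gone
        # Only touch pods that never got running; a stricter real-world check
        # would also confirm the failure is recent and still unresolved.
        if pod.status.phase != "Pending":
            continue
        v1.delete_namespaced_pod(ref.name, ref.namespace)


if __name__ == "__main__":
    while True:
        reap_sandbox_failures()
        time.sleep(60)  # polling interval is an arbitrary choice
```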
When we think about these little tools and how we build them, one of the principles we follow is simplicity, because these are things meant to prevent incidents or help respond to them. We don't want incidents caused by these tools themselves, and if you build a giant Rube Goldberg machine that's managing and trying to do all sorts of magic on your infrastructure, you're going to get into trouble fairly easily, and if something goes wrong with these things it will be really hard to figure out what went wrong. So we try to make these things as simple as possible: the thing that goes and kills the pod sandbox is a separate tool, and the thing that cordons nodes is a separate tool. They're really basic, simple controllers, but they stay simple and just work together to hopefully make things better in aggregate. All right, wonderful. Well, thank you all for sharing today. It's always really interesting to see what's worked for everyone and what hasn't, so that we can all learn from each other. I've certainly learned a few things today. Thank you all for your time today, and I wish you the quietest of on-call weeks going forward. Thanks everyone.