Hello. Welcome to my session on running low latency workloads on top of Kubernetes. An alternative title for this talk is "how we run SpiceDB without hiccups." This is largely going to document our journey as we've learned to run our latency-sensitive workloads on top of Kubernetes, and hopefully guide you along your own journey as well by showing you the primitives we use in Kubernetes to guarantee the performance we need out of our systems.

So without further ado, we need to go through some introductions. I am Jimmy Zelinski, a co-founder of a company called AuthZed. Previously I worked at Red Hat and CoreOS. I've been in all kinds of roles: product management, engineering, and operations as necessary. So despite my current title of Chief Product Officer, I still carry a pager, and I still build the systems that then get run in production. I've been around the cloud native ecosystem for a pretty long time, basically since the beginning, because I started at CoreOS before Kubernetes or the Cloud Native Computing Foundation was actually created. In that time I worked a lot on the container ecosystem. Think Docker registries: I worked on Quay, which was the first private Docker registry, and on Clair, which was the first container static analysis tool for detecting vulnerabilities. I am currently a maintainer in the Open Container Initiative, the standards body that defines what an application container is. At Red Hat and CoreOS I worked on what ultimately became the Operator Framework, which is now a CNCF project that helps folks build and run operators on clusters. And in general, you've probably seen me on random GitHub issues around the cloud native ecosystem. I've never been a maintainer of Kubernetes or anything that hands-on, but I've been a user for its whole existence and I chime in fairly regularly on issues that I hit in production personally. Prior to all of this, I was actually working in the BitTorrent ecosystem. I'm the author of some of the BitTorrent standards, and I worked on a project called Chihaya, which is a BitTorrent tracker that is actually part of the orchestration layer at Facebook internally, distributing their server-side software from rack to rack. That was where I first began cutting my teeth on high performance, low latency Go systems, and that will become relevant a bit further on. I figured I would add some of my contact information here in case folks want to come back to it. If you want to reach out to me at any point with any questions related to this, or anything else you can think of, email is my best option. You can also find me on the social networks or GitHub, but basically, if you want a guaranteed response, if you want me to see something, email is definitely the best venue for that.

All right. Next up is an introduction to SpiceDB. SpiceDB is an authorization-specific database: a database that stores authorization data. Authorization data is the data you use to determine whether someone has permission to perform a particular operation. The reason you would usually want to isolate all of this is so that you have one centralized place holding both the logic and the data required to determine these permissions, and that empowers you a lot.
It means that if you have multiple applications, or multiple applications implemented in different languages, all of those programs can query this one central system to understand whether or not someone has access before performing an operation. That means you're not writing tons of security-critical logic in your code, duplicating it, and trying to make sure it works correctly everywhere. We basically give you a framework for describing these systems in a safe way, so that when you add a new feature that needs to change the model, we can actually test that change and guarantee to you that you haven't opened up a security flaw in your software. And then we also give you scalability guarantees: as long as you build within this model, we can guarantee performance at very high scale. That not only means you don't have to worry if you're a massive company doing tons of traffic, it also means you can get very, very fine-grained with the permissions you're modeling. You can ask very specific questions like: can this particular API key access this row in this other database? If you think about the order of magnitude for API keys, it's a multiple of the number of users on your system. If you think about the cardinality of rows in your database, it's also astronomically large. So this system can scale to the cross product of those, absolutely massive numbers, in a way where the performance is predictable, consistent, and low latency. Those are the high-level reasons why you'd adopt something like this.

Permission checks are super critical because, as I said earlier, before you can do any work whatsoever, you first have to run a permission check. That means we're in the critical path of absolutely everything. Before our users' applications even make queries to their relational databases or primary data stores, this is a full request that has to happen at the same time or ahead of time. That puts us squarely in the critical path. Thankfully, we are the most mature solution outside of Google that is inspired by the system Google uses at scale. That software is called Zanzibar. It's not available to anyone else, and we have basically taken the idea behind it and made it approachable to folks that are not inside of Google. It's well understood that Google has Google-isms: they have all of their own internal software they can rely on, and the environment working inside of Google is not the same as working outside of Google at another business or an enterprise software company. You're not going to have the same availability of, and standardization around, the tools that they use. So instead, we have designed an open source system that works around all of these things and is very friendly towards folks that live in the real world and don't have access to the same guarantees that software engineers operating inside of Google have. That often means we have to give people really nice developer tools for integrating their products, rather than having a manager dictate top-down that they're forced to use this solution. We have to entice developers with better tooling and better workflows than what they already have, and just prove it all out. We've recently run benchmarks and hit 5 milliseconds at the 95th percentile, running a million QPS with 100 billion relationships stored inside of SpiceDB.
So I think it goes without saying that we have built a low latency system that can check permissions largely at Google scale.

All right. Given my background, and given my co-founders' backgrounds, I think it also goes without saying that our business runs SpiceDB on Kubernetes. It's what we have all of our experience operationalizing; it's the thing we know super in-depth. I've actually given another webinar where I outline, at a high level, how we deploy a database as a service on top of Kubernetes. That includes the workflows and a lot of the product requirements that folks often forget, but it also covers how we lay out clusters, how we subdivide them, and how we manage rolling out different phases of stability to different users. It's very high level, but it also touches on a lot of the core technologies we use. So if you're interested in the overall details of how one of these systems works, I encourage you to check out that talk; you can find the URL right there.

But this journey wasn't easy, and running SpiceDB optimally on top of Kubernetes was non-trivial. I'm going to talk a little bit about why that is. SpiceDB, in order to get its low latency and very quick answers, is massively parallel. SpiceDB is Kubernetes-aware: when you deploy a SpiceDB deployment on top of Kubernetes, it knows how to talk to the API server to discover the other pods in that same deployment, and it will then immediately connect to those pods and start self-clustering. So what is the point of self-clustering? As requests come in, they come in through a load balancer like Envoy, and as soon as they hit any of the existing SpiceDB pods, that pod will break the request down into a bunch of sub-requests that are then parallelized and sharded across the entire cluster. We use consistent hashing to decide where each sub-request should go, and what that helps us do is make it far more likely that the cached answer for that particular sub-request already exists on that particular pod. That gives us higher cache hit rates. At the same time, because we're doing so much of this, recursively breaking requests down into more and more sub-requests that get parallelized, we're trying to get as much throughput as possible and do as much work in parallel as possible. So we're heavy users of Go's M:N threading model: we spawn tons of goroutines and do as much as possible within the process, so the work is parallel within each process, not just spread across servers. All of this will become relevant much later in the conversation, when we start talking about the performance implications once we've tightened down the scheduler inside of Kubernetes.

But we have to start from the very beginning. In the beginning, when you go to deploy your software on top of Kubernetes, there are the defaults. So what do the defaults give you? I have this slide titled "are you even trying?", and that's not aimed at the Kubernetes developers or anything like that; it's about you. Are you actually trying if you've just deployed your software and haven't really done anything else? The guarantees that Kubernetes gives you by default are effectively just best effort.
So the two big points I want to highlight about this best-effort behavior are these. First, by default there is absolutely no protection from a pod consuming all of the memory on a node. Nothing stops an individual pod from allocating more and more memory until the node has none left, at which point you get an out-of-memory error and something has to be killed on that node. Kubernetes will then have to move a pod and reschedule it onto another node, but that disruption still takes place. There's nothing that prevents a background job, a less critical job, from affecting a performance-critical job: just because it has a memory leak, it can totally disrupt the performance of your very sensitive, critical workload. That's probably the number one thing to look out for if you're just using Kubernetes defaults. And second, basically what I just said: you can have non-critical workloads scheduled on the same node as a critical workload. Yes, Kubernetes is going to do its best to spread workloads across nodes; it doesn't want to put two of the same kind of pod next to each other. But like I said, everything is best effort. If it can't find space in the cluster, it's going to co-locate them anyway. It's not going to throw an error, it's not going to tell you that you need to provision more resources; it's not going to tell you anything. It's just going to schedule the two workloads next to each other, and if they're both latency sensitive, they may well have performance implications on each other. At the end of the day, there really isn't anything Kubernetes is doing out of the box to prevent the quote-unquote noisy neighbor problem, where one pod affects another, and you have absolutely no guarantees of which pods land where.

So why don't we see what's available to help the scheduler decide which pods should land where. There are two major primitives we're going to explore for giving the Kubernetes scheduler more detail. The first one does exactly what I just described, and it's called taints and tolerations. The general distributed systems concept behind this is affinity and anti-affinity, so you might see that referenced if you're reading material that isn't about Kubernetes but covers the same subject. At the end of the day, taints and tolerations are two special kinds of labels, I would describe them as: a taint labels a node and says you cannot schedule pods here unless they can tolerate this taint, and a toleration is a label on a pod that says I can actually tolerate that taint on a node. You can use this to achieve things like ensuring only one pod of a particular deployment runs on a given node. And you can extend it one step further: you can create a whole custom node pool that's specifically optimized for your workload. In our case, we got very excited, set a SpiceDB-specific taint and toleration, and spun up a node pool that was the only thing carrying that taint. That hardware could then be optimized for running SpiceDB specifically. Like I said, SpiceDB is very, very parallel, so having more cores does a lot more for SpiceDB than your general purpose instance types might.
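To make that concrete, here is a minimal sketch of the shape this takes. The taint key `dedicated=spicedb`, the node pool label `pool: spicedb`, and the replica count are made-up illustrations rather than anything from the talk; the mechanism, a NoSchedule taint on the nodes plus a matching toleration and node selector on the pods, is the part that matters.

```yaml
# Taint every node in the dedicated pool so nothing without a matching
# toleration can land there (key/value here are hypothetical):
#   kubectl taint nodes <node-name> dedicated=spicedb:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spicedb
spec:
  replicas: 3
  selector:
    matchLabels:
      app: spicedb
  template:
    metadata:
      labels:
        app: spicedb
    spec:
      nodeSelector:
        pool: spicedb             # label we put on the dedicated node pool
      tolerations:
        - key: dedicated          # lets these pods (and only these) onto the tainted pool
          operator: Equal
          value: spicedb
          effect: NoSchedule
      affinity:
        podAntiAffinity:          # prefer spreading replicas across nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: spicedb
                topologyKey: kubernetes.io/hostname
      containers:
        - name: spicedb
          image: authzed/spicedb
          args: ["serve"]
```

On a managed Kubernetes service you would typically set the taint and the label on the node pool itself when you create it, rather than tainting individual nodes by hand.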
So this gives you somewhat better guarantees, but it's still not complete unless you also start using affinity, which basically prevents your pods from being scheduled anywhere else. Then you can start to guarantee that your pods will only run on the very specific nodes that you've provisioned. At this point in time, we did notice performance improvements in SpiceDB, but largely that was because of the optimized hardware we were running it on, and it was kind of a side effect of having nothing else run on these nodes, purely SpiceDB.

The next thing we wanted to try, which is recommended all throughout the Kubernetes documentation, is requests and limits. Requests are basically the ability to specify what resources need to be free on a node for a pod to be scheduled there. You can think of this as another affinity/anti-affinity kind of mechanism, except it's aware of the actual resourcing primitives the node has available, so it's a little more dynamic. Limits then create a hard ceiling for the pod: if the pod starts to grow in CPU or memory, Kubernetes is going to prevent it from ever growing past that point. That gives you guarantees about how these pods fit onto nodes in a way where they're not going to overextend and start consuming resources that another pod was guaranteed. That's fantastic. That's our silver bullet, right? We're totally done. Between these two things, we have tons of guarantees now, right? Well, there are actually a bunch of pros and cons to adopting these two bits of well-recommended Kubernetes configuration.

On the pro side, we effectively eliminated node-level out-of-memory errors. We're not going to see those anymore, because we have restricted the amount of memory a pod can consume, and we've also guaranteed that the scheduler is going to look for machines that have enough memory available to schedule something we know is going to grow. You're going to have to spend some time running your system without these limits first, to understand how it consumes memory and where that threshold should sit to safely run your software, but after that initial configuration phase, once you have a good understanding of how to operationalize your software, you should be totally fine. The limits also make performance very predictable: nothing is going to get spiky and nothing is going to go totally awry.

On the con side, these limits artificially cap your processing. If you have idle cores available, you'd ideally want to use them; if you have to burst, ideally you'd still be able to burst. So that's a limiting factor. Additionally, we generally saw that merely enabling these limits caused a drop in performance. I'm not sure exactly what the overhead is, but tightening these things up made behavior more predictable while also carrying a performance cost. That's a trade-off; you might ultimately decide that it's fine to lose some of that efficiency to gain that predictability. The other super important thing to note about limits specifically is that they're reactive.
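Before getting into why that matters, here is roughly what the resources stanza looks like on a container; the numbers are placeholders for illustration, not a recommendation.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spicedb-example
spec:
  containers:
    - name: spicedb
      image: authzed/spicedb
      args: ["serve"]
      resources:
        requests:               # what the scheduler reserves when placing the pod
          cpu: "2"
          memory: 4Gi
        limits:                 # the ceiling enforced at runtime
          cpu: "2"              # CPU usage over this gets throttled
          memory: 4Gi           # memory usage over this gets the container OOM-killed
```

One detail worth noting for later: when requests equal limits like this, the pod lands in the Guaranteed QoS class, which comes back into play with the static CPU manager policy further on.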
There is actually a kubelet flag where you can control the rate at which limits are polled. What happens is it's going to poll at that specific rate, and when it detects that a pod has exceeded its limit during one of those polls, that's when it reacts and throttles that pod back or kills it. So there is fundamentally a reaction time: it is reactive, not proactive, at preventing these bursts. You can still burst outside of your limit, very much enough to impact the performance of a very low latency system. So there's this battle between the latency your software needs and the latency at which this polling can actually react. And I think you'll find that for specifically low latency systems, it doesn't matter: there's a floor on how low you can set this polling rate, and fundamentally it's probably never going to be low enough. What we actually need is a proactive solution rather than a reactive one. And then we included this last bullet here: there are still context switching costs. The scheduler can still move things around, stop what you're currently doing and switch it out, and because the processor isn't guaranteed, you're going to be running on different cores on the same machine. So other processes, other pods, can still impact the CPU performance of your software.

But what exactly is context switching, in case you don't have a background in this, and why is it important? Context switching is, number one, a really expensive thing that your operating system and schedulers generally do. To understand it, there are two concepts that a lot of people often conflate: concurrency and parallelism. Parallelism is when you have two different processes running completely separate tasks, entirely independently. These workloads cannot affect each other, because you actually have two workers fundamentally running two different jobs. Concurrency is what you'll see in a lot of systems; modern computers actually do both concurrency and parallelism. But if you're running on a single-core machine, for example, you can still run multiple programs. A very long time ago, computers could only run one program at a time, and then they got fast enough that they could do what's called multitasking. Multitasking is concurrency: the machine runs one program for a short amount of time, then swaps to running another program for a short amount of time, and then it swaps back again. Those swaps, where it actually stops the execution of one program, saves the current state of the world, restores the state of the world for another program, and then continues executing that second program, are called context switches. And context switches are super expensive. If we look at a very deep level at your processor, the actual CPU executing this code, it has to do a lot of work: it has to save that context and restore the other one.
But what ends up happening too is that the CPU cache, a lot of the memory inside the CPU that's optimizing its throughput, gets invalidated, because the CPU is now working with completely different memory on a completely different problem. A lot of the predictions that give you so much of your CPU performance, a lot of that information just has to be completely cleared as it swaps to the next thing. And it's not only at the hardware level: at the higher level, the scheduler that manages swapping between these things has to do a bunch of bookkeeping around it too. Oftentimes there are rules around how things should be scheduled to ensure fairness, and that leads into the two different kinds of scheduling: cooperative and preemptive. Cooperative scheduling is typically when the processes are working in tandem with each other. You only yield to run another thing once you've gotten to a stopping point and you say, hey, okay scheduler, I'm actually ready to be paused here, this is a safe place for me to pause, and then some other work can run. By providing that signal, you're cooperating with the overall system. This is a pretty common design, but it has trade-offs: you can have processes that never yield, or are very aggressive about not yielding, and consume as much time as they want. That way you can actually starve out other processes, and this is where you start to see the quote-unquote fairness of the scheduler being invalidated, and performance implications for processes that aren't actually getting the time they need to be scheduled. The other kind, which my head is actually blocking a little bit here on the slide, is preemptive scheduling. Here a context switch can be triggered at any time: at any point the overall scheduler can say, okay, I'm done running this, I'm going to pause it for a second and swap over to another thing. These systems have to be designed to be robust and aware of the fact that they can get preempted at any time, but the scheduler gets a lot more control and can make guarantees around fair scheduling.

So all that aside, the question becomes: how can I actually deal with these context switches? I don't want another process running on my node to affect my low latency system. It turns out there's actually one feature left in Kubernetes that's really critical to solving this exact problem. It's not one that's talked about a lot; it's kind of more obscure, and I think this is probably the thing that will surprise most people, or that most people are unfamiliar with. Or, if they are familiar with it, they may not have arrived at the same conclusions we have about the trade-offs of adopting some of these other solutions. And that is the static CPU manager policy. This was actually made stable in Kubernetes 1.26. What this does is, if you create a CPU request and limit for a pod, and you do it on a node configured with the static CPU manager policy, and you use whole numbers, dedicating whole cores, it will actually give that process exclusive access to those cores. That means it's mapped to a physical core and it's going to stay on that physical core.
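Here's a hedged sketch of the two pieces involved: the kubelet-level setting that turns the policy on, and a pod that qualifies for exclusive cores, meaning Guaranteed QoS with whole-integer CPU requests. The values are illustrative, and on a managed service you'd normally set the kubelet configuration through the node pool or node group settings rather than by hand.

```yaml
# KubeletConfiguration fragment for the nodes in the dedicated pool:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static        # default is "none"
reservedSystemCPUs: "0"         # keep at least one core for the kubelet and system daemons
---
# A pod that gets pinned cores under the static policy:
# Guaranteed QoS (requests == limits) and whole-integer CPUs.
apiVersion: v1
kind: Pod
metadata:
  name: spicedb-pinned
spec:
  containers:
    - name: spicedb
      image: authzed/spicedb
      args: ["serve"]
      resources:
        requests:
          cpu: "4"              # whole cores only; fractional CPUs don't get exclusive access
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi
```

Changing the CPU manager policy on a node that's already running typically means draining it and restarting the kubelet, so in practice it's easiest to bake into the node pool from the start.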
That means no one else gets to use that physical core: the kubelet is not going to context switch another process onto it. So this actually gives you back a lot of those guarantees; you're not going to get nearly as much of the noisy neighbor problem, because you now have guaranteed hardware. There are some caveats to note with this. You can't just allocate all of the cores on a node as these exclusive cores, because there are still processes running on the system outside of Kubernetes, plus the kubelet itself, and those system resources need somewhere to run. That's also provided as a kubelet flag, and you need to reserve at least one CPU for it, though you can reserve more by configuring that flag. And then the major trade-off here is the whole integers: you can't schedule anything smaller than a whole core, because that would mean sharing a particular core, which lands you right back at the whole problem of context switching and other processes running on that exact core. So now you're beholden to sizing your workloads in very specific, whole numbers of cores. But there's one more big caveat, and its name gives it away: it's static, and we need dynamic. Users are going to be provisioning clusters here, and we need to react to that. We'll eventually run out of available cores, and then what happens to our system?

So with that, there is auto scaling functionality in Kubernetes, and this is what actually fills the gap. We are big fans of an AWS project, which is totally open source, a Kubernetes node autoscaler called Karpenter. The super cool thing about it is that it adds just-in-time capacity, which means we can fully leverage the static CPU pinning. When we create a new pod that needs to be scheduled, Karpenter looks at the results from the Kubernetes scheduler, and when the scheduler says it can't find CPUs to place this pod on, Karpenter will actually provision a new virtual machine on our cloud provider, a new node that has those cores available, within the limits we configure for how far the cluster can scale. Then that deployment of SpiceDB can have those cores allocated directly to it. This gives us the flexibility of expanding and contracting based on the workloads we have scheduled, while still fully dedicating those cores to our workloads. And if we run out of cores, we can provision more cores dedicated specifically to this workload.

There are cross-cloud alternatives to this. If you're on Google Cloud, GKE has Autopilot. If you're on Azure, AKS has a cluster autoscaler. The nice thing about Karpenter is that it's actually open source and it has the ability for folks to implement backends. Eventually I expect Karpenter to be fully fleshed out and support both GKE and AKS, and that would mean you have one unified way of doing this auto scaling that's cloud agnostic, rather than working with the specific APIs of each cloud provider. So it's really nice. EKS doesn't actually have a checkbox inside of Amazon for you to just enable auto scaling; Karpenter is their solution for that.
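For reference, here's roughly what a Karpenter NodePool for a dedicated SpiceDB pool can look like. This is a sketch from memory against Karpenter's v1 API, so treat the exact field names and the instance requirements as assumptions to check against the Karpenter docs; the idea is that the pool carries the same hypothetical taint and label as before, asks for compute-optimized instances, and is capped at a total CPU limit.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spicedb
spec:
  template:
    metadata:
      labels:
        pool: spicedb             # matches the nodeSelector from the earlier example
    spec:
      taints:
        - key: dedicated          # same hypothetical taint as before
          value: spicedb
          effect: NoSchedule
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c"]           # compute-optimized instances suit the parallel workload
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "256"                    # cap on how far this pool is allowed to scale
```

When a SpiceDB pod can't be placed because no node has whole cores free, Karpenter brings up a node that satisfies these constraints; when the pods go away, it scales the pool back down.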
And I'm actually really glad that Karpenter is the answer there rather than a proprietary checkbox, because eventually we're going to get this neutral way of doing auto scaling across clouds. So with that, that's been our journey. These are the super critical bits of Kubernetes that we think are vital to running low latency systems. I'd like to give shout-outs to our team at AuthZed, who are the folks that largely discovered and worked through this process: Brad Eisen, Evan Cordell, and Victor. These are the people who provisioned the infrastructure, measured it, ran benchmarks, and dove deep into Kubernetes to understand exactly how all of these primitives work, so that we could understand the trade-offs. There's also a very helpful Medium article that guided us at the very beginning of our journey. It basically argues that folks need to learn the actual implications of these primitives and use them properly, because people often just look at resource requests and limits and assume these things are simple and will just work for them, when actually there are deep implications to adopting them, such as the performance hit we saw just by enabling CPU limits. So that concludes this talk. If you have any questions at all with regards to this topic, with regards to SpiceDB, or Kubernetes in general, feel free to reach out to me via email. We also have a community Discord, and you can see the URL at the bottom of the contact slide as well. This is primarily the community that develops and uses SpiceDB, but we also talk about generic distributed systems concepts, running software on top of Kubernetes, operators, and development of code that integrates deeply with Kubernetes. And obviously, as SpiceDB is a low latency system, we're often working with folks running SpiceDB on their own hardware or cloud configurations, helping them discover the primitives available to them to optimize the experience of running a low latency system. And with that, I'd like to thank you for watching, and feel free to contact me at any point in the future. Thanks. Bye.