Hello everyone, and welcome to our talk on testing Kubernetes clusters and building confidence in your changes. In terms of what we're going to cover today: we'll start with a little about us, then go over the problems we faced that prompted us to go down this investigation route and build our own solution. Then we'll give an overview of the potential solutions we looked at, I'll give you a deep dive into our solution, how it works and the components we brought together to make it work, and finally we'll cover some gotchas, tips, and other options for you.

So, in terms of who we are: I'm Guy, a principal software engineer at Skyscanner. I work on the teams responsible for our shared container platform, so for the most part that's Kubernetes and enabling the rest of the business to build applications on top of it that operate reliably at scale. In addition to that, I'm a co-chair of Kubernetes SIG Autoscaling, and I can be found in most places on the internet under the handle gjtempleton.

Hello everyone, and thank you for joining us. My name is Matteo. I'm a senior software engineer here at Skyscanner, where I work with Guy on keeping our Kubernetes infrastructure up and running in the production platform tribe. You can technically find me on Twitter, but please look for me on GitHub instead.

So we both work for Skyscanner. Who are Skyscanner? We're the travel company that puts you first. We help millions of people in 50 countries, across over 30 languages, find the best travel options for not only flights but also hotels and car hire, and we work with over 1,200 travel partners to enable that. In terms of Kubernetes at Skyscanner, we've been running Kubernetes since the 1.6 release. Over that time we've built up to more than 35 production clusters, spread across four different AWS regions. On those clusters we run more than 475 services, most of them in multiple clusters and multiple regions for improved resilience to a number of different failure scenarios. And in terms of the scale of the clusters underneath them, we currently run around 40,000 CPU cores across the clusters, with about 150 terabytes of RAM at any one time.

So what do we want to talk to you about today? Well, if you run Kubernetes, you're probably aware that it's a complex platform, let's be honest. There are multiple microservices, API servers, a distributed database, and if you're running production workloads on top of Kubernetes, you probably also install a few other components. For example, you have your CNI, maybe a service mesh, maybe an ingress controller. You probably want to do something about metrics and logging, so you have a set of observability components. And while Kubernetes has a good set of unit and integration tests of its own, and I'm sure all the add-ons you're running have their own test suites, the difficult thing is when you put everything together. How do you test your Kubernetes setup, with all the components you care about and all your configuration, inside your own environment, whether that's a cloud environment like AWS, as in our case, or on premises?

Why do you care about making sure everything works as expected? Why do you want to test? Well, the obvious reason is that you want to reduce disruption when you make a change to the clusters. You want to keep your customers and users happy.
You also want to do something for the people responsible for the cluster, maybe your squad. You want to push out features for your users while keeping a decent velocity, so you want to be able to test reliably and make sure that what you're pushing out works and isn't breaking anything. This gives you confidence that users are getting the behaviour they expect. It also gives you a baseline: if you make a change, you can pinpoint when the behaviour changed or when something broke, which helps with troubleshooting.

So, having covered the reasons we wanted to look at this, we started exploring potential solutions to enable us to do this work. The first, which I assume many of you are familiar with, was manual acceptance testing: the idea that we'd stick a manual approval gate in our continuous deployment pipelines, giving us the opportunity as operators to run some manual acceptance tests against the clusters, or against individual nodes, whatever it may be, before manually approving the pipeline to continue. The problem, obviously, is that as we've scaled up to 35 clusters, this solution doesn't scale as the number of clusters increases: every new cluster adds to your toil burden, unlike with an automated solution. It's also easy to make mistakes, and it's inconsistent. Even if you've captured everything in runbooks, you're still relying on engineers running the runbooks correctly, with the correct options, or copying and pasting commands in from them. You've still got a human factor in there, and a risk that what actually ran isn't quite what you thought. So for those reasons we ruled out manual acceptance tests pretty quickly.

The next option we looked at was a custom service, and we actually ran with this for quite a while. In this case we spun up a microservice on the cluster with RBAC permissions to create pods and to create and modify certain objects inside the cluster, and it was constantly looping over these tests. That had a real benefit in that it made it easier to catch degradation. Say we made a change and it worked fine initially, but six hours later something had, say, bogged down the API server to the point that things weren't working the way they should any more: this would actually catch that. It also exposed metrics we could alert on, which was another real benefit of this approach. The downsides, however, were that, at least the way we'd implemented it, it was harder to block our update pipelines on. We couldn't easily express a conditional in our continuous deployment of "if these tests fail the next time the loop evaluates after we've made these changes, then block". The other was that there was no easy way for us to run ad hoc tests when and where we wanted; we were at the mercy of the loop evaluating these things. We could have done work to fix that, but we took the opportunity to look at other options to see if there was a better way.
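To make that earlier approach a bit more concrete, here's a minimal hypothetical sketch of its shape, not our actual code: an in-cluster loop running checks and exposing a Prometheus gauge to alert on. The metric name, port, and empty check registry are all invented for illustration; real checks would use client-go to create pods and so on.

```go
// Hypothetical sketch of a custom in-cluster test service: run checks on a
// loop and expose a gauge so failures can be alerted on.
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var checkUp = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{Name: "cluster_check_up", Help: "1 if the named check last passed"},
	[]string{"check"},
)

// checks maps a name to a test against the cluster, e.g. creating a pod via
// client-go and waiting for it to become Ready (omitted here for brevity).
var checks = map[string]func() error{}

func main() {
	prometheus.MustRegister(checkUp)
	go func() {
		for range time.Tick(1 * time.Minute) {
			for name, check := range checks {
				if err := check(); err != nil {
					checkUp.WithLabelValues(name).Set(0)
				} else {
					checkUp.WithLabelValues(name).Set(1)
				}
			}
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```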
The next option was building on top of existing frameworks. There are existing infrastructure and Kubernetes testing frameworks out there, and this would have minimized the custom work we had to do. These frameworks have a number of different benefits: some are written in Go, some in Python; some are very Kubernetes-focused, some are far more generic and have their origins from before Kubernetes. So there were a number of options we looked at there, and then we looked at the more specific options provided and supported by the core Kubernetes community.

The first of those was kubetest. It lives in the kubernetes/test-infra repo, and it's a framework that allows for the creation and tear-down of test clusters. It supports the addition of custom tests and it's supported by the Kubernetes project, but it does come with extra overhead, because if you're spinning up and tearing down clusters, you have to wait for that creation and tear-down every time you run the tests. The biggest drawback for us, though, was that it's not necessarily representative of our production clusters. Say we've had a production cluster running for six months, with many changes applied over time. Spinning up a fresh cluster and applying all the components you think are in your cluster is almost certainly not going to be representative of a cluster that's accumulated changes like that. So that was the main drawback here for us.

Then we started looking at the conformance suite. You might be familiar with this if you've ever looked at the certification of managed Kubernetes offerings: these are the tests that any provider of a certified Kubernetes offering has to run against their offering and validate that they all pass. They test a huge range of core Kubernetes functionality and make sure it all works within acceptable limits, and so on. That would have made it really easy for us to use the community's existing tests. The downside, because of the sheer range of tests in there, is that it's really time-consuming to run the entire suite. There are some disruptive tests if you run the full suite, although by default the disruptive tests don't run. And you're not necessarily testing the behaviour your users care about: our users don't care about some of the low-level stuff; what they care about is "does my application work".

What the conformance suite did lead us to, though, was the tool used to run it, a tool called Sonobuoy. It's an open-source tool from VMware, with a very active community, a constantly improving project. The killer feature for us was the ability to run custom tests via plugins. That meant we could build our own tests, testing only the functionality we cared about, into a binary and use Sonobuoy to run it.
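To give you a flavour of what that looks like, here's a minimal sketch of a Job-style Sonobuoy plugin definition, following the plugin format from the Sonobuoy docs; the plugin name, image, and command are placeholders rather than our real ones.

```yaml
# smoke-tests.yaml: a hypothetical custom Sonobuoy plugin definition.
sonobuoy-config:
  driver: Job            # run once per cluster (a DaemonSet driver runs per node)
  plugin-name: smoke-tests
  result-format: junit
spec:
  name: plugin
  image: internal-registry.example.com/smoke-tests:latest
  command: ["/run.sh"]   # must write results to the results dir and a "done" file
  volumeMounts:
    - mountPath: /tmp/sonobuoy/results
      name: results
```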
So what did we choose to iterate towards from our custom solution? Before taking a decision, we sat down and looked at what we wanted from our next-generation tool. The first thing was something easy to run. The previous tool was deployed as a DaemonSet across the whole cluster, so it was not very easy to work with, and if you have a cumbersome test tool, you're not going to use it a lot. We wanted something simple that you could run from a laptop while writing a new test, or when you want to quickly test a live cluster, but the same tool also had to be something we could embed in our CD pipeline. And we wanted all the things you'd probably expect from a test suite: running tests in serial or in parallel, and being able to gather, store, and archive test results, so you can look at them over time and spot trends.

Another important thing is alerting on failing tests. You might take a decision like ours: we page one of the cluster operators if the tests aren't passing, because it may mean a change that made it into production, despite all the tests, is actually breaking functionality. You might even want to add an automatic rollback on top of the test suite. And when it came to writing the code, we wanted a solution that allowed us to write as little boilerplate as possible, to lean heavily on the community's established practices, and ideally to have a few examples of well-written tests we could take inspiration from.

So our current solution is based on writing a custom Sonobuoy plugin. As Guy said, Sonobuoy is the tool behind the conformance test suite, but we wrote a plugin, so Sonobuoy is now driving, in our case, only our internal Skyscanner smoke tests. We also had a look at the Kubernetes repository, where there's a huge section around tests, including the end-to-end tests, and there's a whole framework that gives you functions and utilities to reduce the amount of code you have to write. Unfortunately our repository is not open source yet, because there are a lot of Skyscanner-specific tests in there, but we hope to get there in the next few months. In the meantime, I hope to give you enough that you can go away and replicate our setup.

As we said, we run the Sonobuoy image to drive the tests. We made a few small changes to it so we can easily retrieve the failed tests and differentiate the exit code when we hook it into our CD pipeline, letting us bail out if any of the tests are failing. Then we build the actual end-to-end image. We based this on top of the upstream Kubernetes conformance image, because that image already has Ginkgo and everything we need to run our tests. As you can see here, we make a couple of changes, in particular to the run.sh script, and I'll tell you more about that later. The e2e test binary is the binary produced when you compile the test code, and we pass in some YAML configuration, because we mirror all the Docker images internally; here we're changing Sonobuoy's configuration to point at our internal Docker registries.

In terms of how the repository looks, we basically lift and shift the e2e Go files that are the entry point for the suite from upstream Kubernetes into our repo, and then we wrote our smoke tests. We have a networking file, because we have a lot of network tests, so we moved them into a separate file, and a common.go with the functions shared between the smoke tests.

So what happens when we push a change to Kubernetes itself, or to a component? As I said, we changed the Sonobuoy image slightly, and we have a job that is responsible for triggering Sonobuoy itself and then retrieving the results and the exit code. We also have a job that deletes everything first, to make sure we start from scratch every time, because Sonobuoy, at least in the version we run, doesn't like it if the namespace already exists. Inside the sonobuoy namespace there's a pod running the Sonobuoy binary driving the tests, and it spins up our actual e2e image, which creates and runs all the tests. The good thing about this is that every test lives inside its own namespace, so we can run these tests alongside production workloads with confidence that we're not polluting other namespaces or leaving stray resources behind.
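As a rough illustration of that pipeline step, assuming the stock sonobuoy CLI (our real job uses our patched image, and the plugin file name here is made up), it might look something like this:

```sh
# Hypothetical CD gate: clean slate, run our plugin, collect, fail on errors.
sonobuoy delete --wait                         # the namespace must not already exist
sonobuoy run --plugin smoke-tests.yaml --wait  # run only our plugin, block until done
tarball=$(sonobuoy retrieve)                   # download the results archive
sonobuoy results "$tarball"                    # print the pass/fail summary
sonobuoy results "$tarball" | grep -q "Failed: 0" || exit 1  # bail out on failures
```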
So what kind of tests do we write? We try to really think about what our users care about. Networking, of course, is a big thing: you have a pod, you want to talk to other pods, you want to talk to the internet. We use Istio as a service mesh, with some internal custom configuration for a multi-cluster setup, so we want to make sure that setup still works every time we touch Istio. We have a few tests around DNS because, let's be honest, it's always DNS. Internally, that means making sure Kubernetes DNS resolution works; and since we run in AWS, we have a bunch of specific tests against AWS endpoints, plus other tests around external domains. A lot of our users rely on autoscaling, so of course we want the HPA to be up and running, including scaling on custom metrics. And since we're running in AWS, some of you might be familiar with the mechanism it has for granting permissions, IAM roles: we want our users to be able to assign a role that allows their pod to talk to, say, their bucket or their database, and that functionality needs to keep working. Moreover, there are cases where you're not changing anything at the Kubernetes level at all: the add-ons are the same, the Kubernetes version is the same, no flags change, but maybe you're changing the AMI or tuning the kernel. So you want to make sure that all the nodes in your cluster are in a known state.

And this is an example of what a test looks like. The Kubernetes test framework uses Ginkgo, and here you can see the pod-to-pod test: communication from a source pod to a destination pod. We create the source pod in a namespace, and the interesting thing is in the last line: with just one line, the framework asserts that no error happened when we created the namespace and the pod, and that the pod is running. Then we create the destination pod, again using the same function to make sure the destination pod is running and ready to accept connections. The last step is that we create a service that forwards calls to the destination pod. Then we use other framework functionality to iterate, making the same call from the source pod to the destination pod, and we assert on the corresponding 200 status code. If we don't get that code within a certain timeout, we know pod-to-pod communication is broken, and we raise an error.
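Our repo isn't public yet, so here's a condensed sketch of what a test like that can look like with the vendored upstream e2e framework. Helper names and signatures vary between Kubernetes versions, and the retry loop around the final call is omitted; treat this as the shape of the test, not a drop-in file. It also uses the agnhost test image mentioned later in the talk.

```go
// Condensed sketch of a pod-to-pod smoke test on the upstream e2e framework.
package smoke

import (
	"context"
	"fmt"

	"github.com/onsi/ginkgo/v2"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/kubernetes/test/e2e/framework"
	e2epod "k8s.io/kubernetes/test/e2e/framework/pod"
)

var _ = ginkgo.Describe("[sig-network] pod-to-pod communication", func() {
	// NewDefaultFramework gives each run its own namespace and cleans it up
	// afterwards, which is what lets us run next to production workloads.
	f := framework.NewDefaultFramework("smoke-networking")

	ginkgo.It("should reach a service backed by another pod", func(ctx context.Context) {
		// Destination pod: the community agnhost image serving HTTP.
		dst := e2epod.NewAgnhostPod(f.Namespace.Name, "destination", nil, nil, nil,
			"netexec", "--http-port=8080")
		dst.Labels = map[string]string{"app": "destination"}
		// CreateSync is the "one line" assertion: creation succeeded and the
		// pod reached Running.
		dst = e2epod.NewPodClient(f).CreateSync(ctx, dst)

		// Service forwarding port 80 to the destination pod.
		svc, err := f.ClientSet.CoreV1().Services(f.Namespace.Name).Create(ctx, &v1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: "destination"},
			Spec: v1.ServiceSpec{
				Selector: dst.Labels,
				Ports:    []v1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(8080)}},
			},
		}, metav1.CreateOptions{})
		framework.ExpectNoError(err)

		// Source pod calls the service; we assert on the 200 status code.
		src := e2epod.NewAgnhostPod(f.Namespace.Name, "source", nil, nil, nil, "pause")
		src = e2epod.NewPodClient(f).CreateSync(ctx, src)
		cmd := fmt.Sprintf("curl -s -o /dev/null -w '%%{http_code}' http://%s/", svc.Name)
		if out := e2epod.ExecShellInPod(ctx, f, src.Name, cmd); out != "200" {
			framework.Failf("expected HTTP 200 from %s, got %q", svc.Name, out)
		}
	})
})
```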
So, Matteo has taken you through how we implemented our solution. However, it didn't come without a few drawbacks, and we want to take you through the problems we ran into and some tips, as well as other options if you're looking at this and thinking it doesn't seem like the right fit for you, but you still want to test your Kubernetes clusters.

In terms of the gotchas with our approach, the first and main one is that, as Matteo mentioned, we're using code from the main kubernetes/kubernetes repo to avoid having to write a lot of boilerplate. The problem, as you can see from this comment on an issue on the repo, is that consuming it as a module isn't its intended use case, so we need a workaround to get the code into a usable state in our project. Thankfully, in the same thread, someone has written a script that enables us to do exactly that. We've used that script successfully through multiple minor version upgrades of Kubernetes to vendor in the new code. It works for us, it works well, and I wouldn't hesitate to recommend it as a solution if this is what you need to do; it's used in multiple projects across the community for different parts of the kubernetes/kubernetes code base as well.

The other problem with vendoring the e2e code, though, is the flip side of it not being designed for this: there are no API stability guarantees like there are for the public-facing parts of Kubernetes. This code is written with the intent that it's explicitly tied to a given minor release of Kubernetes, so upstream has no hesitation in renaming, removing, or completely replacing methods. That means that when you do a minor version update of Kubernetes, you have an extra cost of potentially needing to update functions. One of the methods Matteo highlighted, the one for creating namespaces and waiting for pods, has previously been completely changed, and we had to do some investigation into what the correct method to use was. That's effectively a one-time cost per minor Kubernetes version update, though; it becomes another item on your upgrade checklist: update the vendored code and figure out which methods we need to use now.

The other thing is that the code provides the test images: there's an effectively static mapping within the code of test images and tags, including the repositories they're pulled from. So if you're running an air-gapped cluster, or one where for security reasons you can only pull images from certain registries, or you simply don't want a third-party dependency, you may need to dive into the code, find out what those images are, mirror them, and then use the custom config file Matteo mentioned to configure your tests to pull them from your registries rather than the defaults. I'll show a sketch of that config in a moment.

There are also occasionally flaky tests. If you have very dynamic clusters that scale up and down a lot, or you're running on, for instance, AWS Spot instances where an instance can be taken away, you can hit this: the version we run isn't the greatest at dealing with tests being interrupted mid-run, so you might end up with flaky tests purely due to the dynamic nature of the clusters you're running in.

And finally, in terms of gotchas, you might need to modify the run.sh script. This is a script we pull in from upstream and change slightly, because we want to make it more configurable for our use case. In our case, we want the ability to allow more unready nodes, which is basically an argument we need to pass into the invocation of Ginkgo, and that means modifying the script. It allows us to run the tests on clusters that are scaling up and down, where some nodes might be unready, which would normally cause the tests to fail very quickly.
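To make that concrete, the final invocation inside the script is roughly the following shape; the variable names are simplified stand-ins, and the exact flags the upstream script passes differ between versions:

```sh
# Roughly the shape of the script's final invocation (variable names are
# simplified stand-ins; the upstream script sets more flags than this).
exec ginkgo --focus="$E2E_FOCUS" --skip="$E2E_SKIP" \
  /usr/local/bin/e2e.test -- \
  --disable-log-dump \
  --report-dir="$RESULTS_DIR" \
  --provider="${E2E_PROVIDER:-skeleton}" \
  --allowed-not-ready-nodes="${ALLOWED_NOT_READY_NODES:-0}"
```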
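And going back to the image-mirroring gotcha from a moment ago: the custom config is a small YAML file mapping registry categories to your own registries. The keys come from test/utils/image in the Kubernetes repo and vary by version, and the registry host here is a placeholder:

```yaml
# Hypothetical registry override file, wired in through the
# KUBE_TEST_REPO_LIST environment variable; the exact keys must match
# your Kubernetes version.
promoterE2eRegistry: internal-registry.example.com/e2e-test-images
gcRegistry: internal-registry.example.com/gc
dockerLibraryRegistry: internal-registry.example.com/library
sigStorageRegistry: internal-registry.example.com/sig-storage
```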
In terms of some tips I'd encourage you to think about even if you don't take this approach: capture the behaviour your users care about. That's the same as writing any end-to-end test, but put yourself in the mindset of your users and figure out what behaviour you're actually trying to verify is working, still working, hasn't degraded, whatever it is you're looking for.

Leverage the community's efforts. It's not just the code, it's also the images that have been built. One of the tests we showed uses the agnhost image, which is built by the Kubernetes test infrastructure. It ensures that no matter what architecture you're running on, the commands you run, whether it's a DNS lookup, a network call, whatever it is, behave consistently and return predictable responses. There's a lot of thought and care put into those images, so make use of them; don't go building your own images that might fail in weird ways on different architectures.

And try to figure out which of your tests can run safely in parallel. If you're running these for every change against your cluster, which I would encourage you to do, you want them to run as fast as possible so you get a fast feedback loop, so that your tests become an enabler rather than a pain point.

If you've listened to all that and you're still thinking this approach won't work for you, whether it's too much effort for the version upgrades or you just don't want to write Ginkgo tests, there are a couple of other options. The first I've already mentioned: kubetest. Since we did our evaluation, kubetest2 has become its successor. It lives in a separate repo, and a number of different providers have implementations; I've linked the AWS implementation here, as we run on AWS, and it enables you to spin up an EKS cluster quickly, run tests against it, and tear it down. If you're wary of running tests on a live cluster carrying production traffic, that might be the better option for you.

Then there are some other community-supported frameworks. The first is Testinfra. This is a bit older and not specifically Kubernetes-focused, but it allows you to test your infrastructure whether it's provisioned by Ansible or Salt or is a Kubernetes cluster, and it's Python-based. The next two are more Kubernetes-focused. The first is kubetest, but a different kubetest, not the one I mentioned earlier: this one enables you to use pytest with Kubernetes. It lets you make use of the Kubernetes Python client within a pytest framework, so if you're much more used to writing pytest tests than Ginkgo tests, maybe that's the approach you'd want to go down.

And finally, a slightly different approach: Kuberhealthy. This is an operator that runs in-cluster with some custom resource definitions, and you interact with it by creating custom resources that define your tests; it takes care of spinning up the pods and exposing metrics about what happened. If you want something less involved, where you don't have to write everything in code and build binaries, that might be the best option for you.
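For a flavour of Kuberhealthy, a check is just a custom resource. This sketch uses one of its bundled checker images; the image tag and env values are illustrative, so check the Kuberhealthy docs for the current format:

```yaml
# A hypothetical KuberhealthyCheck resource running a bundled DNS checker.
apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: dns-resolution
  namespace: kuberhealthy
spec:
  runInterval: 5m        # how often the check pod is spawned
  timeout: 2m            # how long before the run counts as failed
  podSpec:
    containers:
      - name: check
        image: kuberhealthy/dns-resolution-check:v1.5.0
        env:
          - name: HOSTNAME
            value: "kubernetes.default"
```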
That has been our talk. Thank you very much for attending.