Hello everyone. Can you hear me? Yes. So let's get started, and thank you for joining us today to talk about one of your favorite topics, end-to-end testing. Today we are going to talk about how you can use the e2e-framework to build confidence in your Kubernetes controllers and clusters. My name is Matteo Aruina, I'm a senior software engineer at Datadog, a cloud observability company, and I work in the control plane team, where we manage the Kubernetes control plane for our users. Today I'm presenting with Philippe, Philippe Scorsolini, who works at Upbound and is a Crossplane maintainer.

Cool. So the agenda for today is to give you an overview of what the e2e-framework is and why we have it, and then to give you an overview of how we do end-to-end testing at Datadog and at Crossplane.

The first step is to talk about the e2e-framework. It's a Go framework for doing end-to-end tests of components running in Kubernetes clusters. And why do we have it? You could say that if you go inside the Kubernetes repository, there is already an extensive set of examples and code to do end-to-end tests. The problem with that framework is that it's tightly coupled with Kubernetes itself: it is designed to test Kubernetes and to ensure consistent and reliable behavior of the Kubernetes code base. It is not designed to be used outside as an external library. Plus, it is based on Ginkgo. Ginkgo is a testing framework for Go that has its own DSL, something similar to a behavior-driven development framework, so it has its own set of instructions and its own way to declare tests that is quite different from what you're used to when writing Go tests.

And so now we have the e2e-framework, which is an official SIG project with the goal of providing a documented approach to end-to-end testing: an official tool that stays as close as possible to the go test package you use every day, components to help you build your test suite, and helper functions to get you started when interacting with Kubernetes clusters, especially avoiding all the dependencies on the Kubernetes code base itself.

Let's have a look at how the e2e-framework works. At the heart of the framework there is the environment object. It has a config that is used to store your test suite configuration, and you can customize the configuration with your own CLI flags. Then there is a context that you can use to pass signaling and data across each phase of the test. You can use environment functions to customize the different stages of the environment, including environment setup, before each test, after each test, and when tearing down the environment. Then you have your regular Go tests, which are made of one or multiple test features. A feature is a collection of steps that can be executed as a group, where a step is a granular operation that you can combine to perform the actions that you want. The steps that you can apply to your test are setup, assess, and teardown. I know this is a lot.
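What follows is a minimal sketch of what that looks like in code with the e2e-framework; the setup logic and assessment bodies here are just placeholders, not the actual code from the slides:

```go
package e2e

import (
	"context"
	"os"
	"testing"

	"sigs.k8s.io/e2e-framework/pkg/env"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

var testEnv env.Environment

func TestMain(m *testing.M) {
	// Build the suite configuration from CLI flags (e.g. --kubeconfig).
	cfg, err := envconf.NewFromFlags()
	if err != nil {
		panic(err)
	}
	testEnv = env.NewWithConfig(cfg)

	// Environment function: runs once before any test in the suite.
	testEnv.Setup(func(ctx context.Context, c *envconf.Config) (context.Context, error) {
		// Any suite-wide setup goes here (create a cluster, a namespace, ...).
		return ctx, nil
	})

	os.Exit(testEnv.Run(m))
}

func TestHelloWorld(t *testing.T) {
	feat := features.New("hello world").
		WithLabel("type", "example").
		Setup(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
			// Create the objects this feature needs.
			return ctx
		}).
		Assess("it works", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
			// Check the behavior you care about.
			return ctx
		}).
		Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
			// Clean up what the feature created.
			return ctx
		}).
		Feature()

	testEnv.Test(t, feat)
}
```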
So let's have a look at the code example. Here you can see that I'm creating a new environment in TestMain and applying an environment function, in this case a setup: you can add any code that you want to be executed before any of your tests there, and then I run the test suite. Then I have a regular Go test, in this case a hello world, where I create a new feature. You can give the feature a name, you can add labels, and then I add the steps that I want for the test; in this case there is a setup, a single assessment, and a teardown. Of course, you can have multiple setup steps, multiple assessments, and multiple teardowns. And then at the end you execute the test feature.

Now that you've got your 101 on the e2e-framework, let's discuss how we are using it at Datadog. The first question is: why do we even bother doing end-to-end tests? The reason is that at Datadog we have a Kubernetes setup with about a hundred clusters and thousands of nodes, and we run across multiple cloud providers. Also, to have the same layer of abstraction and consistency, we run our own Kubernetes software, and on top of Kubernetes we install all our internal controllers. It's not only my team: there are multiple teams managing all those clusters and making multiple changes per day. So at the end of the day, how do you make sure that everything is working as expected?

And so we are using the e2e-framework to build our own internal conformance test suite. For example, is DNS working as expected? Not only can I resolve internal or external domains, but can I do cross-cluster or cross-data-center DNS resolution? Can I provide capacity to the workloads? We have a bunch of internal controllers that we use to manage the node lifecycle: are they working as expected? Can I decommission nodes without causing disruption?
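To give a flavor of what one of those conformance-style checks could look like with the framework, here is a rough illustration, not Datadog's actual test; the image, names, and the exact check are placeholders:

```go
package e2e

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/e2e-framework/klient/k8s"
	"sigs.k8s.io/e2e-framework/klient/wait"
	"sigs.k8s.io/e2e-framework/klient/wait/conditions"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

func TestDNSResolution(t *testing.T) {
	// A short-lived pod whose only job is to resolve an in-cluster name.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: envconf.RandomName("dns-check", 16), Namespace: "default"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "lookup",
				Image:   "busybox:1.36",
				Command: []string{"nslookup", "kubernetes.default.svc.cluster.local"},
			}},
		},
	}

	feat := features.New("cluster DNS").
		WithLabel("suite", "conformance").
		Setup(func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			if err := cfg.Client().Resources().Create(ctx, pod); err != nil {
				t.Fatal(err)
			}
			return ctx
		}).
		Assess("internal name resolves", func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			// The pod only succeeds if the lookup worked.
			err := wait.For(
				conditions.New(cfg.Client().Resources()).ResourceMatch(pod, func(obj k8s.Object) bool {
					return obj.(*corev1.Pod).Status.Phase == corev1.PodSucceeded
				}),
				wait.WithTimeout(2*time.Minute),
			)
			if err != nil {
				t.Fatal(err)
			}
			return ctx
		}).
		Teardown(func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			_ = cfg.Client().Resources().Delete(ctx, pod)
			return ctx
		}).
		Feature()

	testEnv.Test(t, feat) // testEnv comes from the TestMain sketch above
}
```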
We test the clusters daily: we have a cron job that runs the full test suite in every cluster every day. At the moment we have a little more than 40 tests, and it takes around 30 minutes to run the whole test suite, depending on how busy the cluster is. So we get a signal if we are breaking something.

We also use it when we spin up a new cluster. We provision a cluster and then run a first set of tests to make sure that we have the Kubernetes primitives that we need. Then we install all the Datadog customizations and internal controllers on top of it, and we run the full test suite again. When everything is passing we can onboard the new users, and we tell them: hey, the cluster is ready, now you can deploy.

Another usage of our tests is when we update a single component. Imagine that you have a controller and we have Helm charts: you are bumping the controller version, so you are making a change to the chart — I don't know, your RBAC. When you install it, we also run a small subset of the tests related to that particular controller, to make sure that you're not breaking anything.

What was our journey for testing? Like many, we started by importing the in-tree Kubernetes e2e framework, but like many, we ran into all the problems that I was discussing at the beginning. You know that Kubernetes as a repository is not designed to be easily imported: you need to do some sort of vendoring, and it's very heavy, let's say. So when you need to update it, you need to be careful, and not everyone was used to the Ginkgo way of doing tests. So when we saw the e2e-framework, we did a POC, we wrote a proposal, and everyone was on board with it. We liked that it is an official SIG project designed from the ground up to test on Kubernetes, that it already has a lot of helper functions to interact with clusters, and that it provides enough components to create your own test suite while staying as close as possible to pure Go testing.

So how did we do the migration? At the beginning we wanted to try to use the same Go module. We tried using Go workspaces, but we couldn't figure it out and it didn't work for us. So what we ended up doing was creating a subdirectory for the e2e-framework tests inside the old Ginkgo test directory and slowly migrating one test at a time. Every time we needed to add shared code, we put it in an external library shared between the two frameworks, and we migrated all the tests like that. At build time, in the same Dockerfile, we build both binaries, and when we run our tests we have the two containers: the Ginkgo tests run with their own flags and the e2e-framework tests with their own flags.

When we did this migration, we also asked ourselves if there was anything we wanted to change about the way we structure our end-to-end tests, and so we took the opportunity to reflect on how we write tests. First of all, we now write Kubernetes objects, not YAML. I know we are all YAML engineers, we love YAML, but the problem is that when you have 50 YAML files for all your tests, it's very difficult when you want to make a change. Imagine that you want to ensure label consistency for cost tracking or compliance, or you need to bump a new version of the base container: now that we have a shared library, we have a single place where we can make this change, and we can ensure that our users are importing just that library. Previously a lot of people were also copy-pasting a lot of YAML to get started; now they can just instantiate a new object.

We also created additional helper functions. As I was mentioning, we have a bunch of internal controllers, so we also need additional helpers to make sure that our objects are ready. And we extend the e2e-framework where needed. For example, we have a bunch of CLI flags that we need to pass when we run the environment, and so we wrap the e2e-framework environment inside our own. We do validation: we need to make sure that when we start the test suite, we run the tests for a specific cloud provider only on that very cloud provider and not on the others, and we can do that by passing our own CLI flags.

We also tell our developers: please assume that all your tests are going to run in parallel. And we provide two other guidelines. The first one is only one feature per test: as I was saying at the beginning, you can add multiple features to your test, but if you do that, even if you enable parallel testing, the features are going to run one at a time. The second one is that we run each test in its own namespace. This helps us avoid pollution between tests, and it makes cleanup easy.

If you go on GitHub at that URL, you can find an example repository with the Dockerfile, an example of the library, and a couple of objects to wrap the environment. So that is a possible approach to using the e2e-framework, and it should help you get started and adopt it.
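The example repository has the real code; just to sketch the idea behind the per-test namespace guideline, a helper along these lines could be layered on top of the framework (the helper name and the context key are made up for illustration):

```go
package e2e

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

type nsKey struct{}

// WithOwnNamespace adds a setup step that creates a unique namespace for the
// feature and a teardown step that deletes it, so tests don't pollute each
// other and cleanup stays simple.
func WithOwnNamespace(fb *features.FeatureBuilder) *features.FeatureBuilder {
	return fb.
		Setup(func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: envconf.RandomName("e2e", 16)}}
			if err := cfg.Client().Resources().Create(ctx, ns); err != nil {
				t.Fatal(err)
			}
			// Stash the namespace name so assessments can read it from the context.
			return context.WithValue(ctx, nsKey{}, ns.Name)
		}).
		Teardown(func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			name, _ := ctx.Value(nsKey{}).(string)
			ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}}
			if err := cfg.Client().Resources().Delete(ctx, ns); err != nil {
				t.Log(err)
			}
			return ctx
		})
}
```

A test would then build its feature as usual and pass the builder through a helper like this before calling Feature().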
Of course, I work at Datadog, so it would be awkward if I didn't have any slide about observability. Every time we run a test, we also send metrics about the test result. With those we can build dashboards and alerting for when some tests are failing. We also have our Datadog Agent scraping the pods for logs, so when a test is failing we can easily see why it's not working.

We recently integrated with our CI Visibility product. When you run your Go tests you can also export the test results as a JUnit file, and when you upload it you can see all sorts of nice graphs and statistics about your tests. In this example you see a test that we are actually running: it runs a bunch of times and fails once, and then you can drill down into why it's failing — is it a specific cluster, and which cluster? If you start seeing more and more failures, you get alerted because the test is becoming flaky, and then you can slice it and see: is it tied to an environment or a cloud provider, is it the whole test suite, is it a particular test? And if you're interested in performance, you can also go and see how long it usually takes to run your test, and then you can spot regressions or outliers and go make sure that nothing is broken.
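The exact plumbing isn't shown in the talk, so purely as an illustration, assuming the datadog-go statsd client, per-test metrics could be emitted with something like this (the metric names and tags are made up):

```go
package e2e

import (
	"testing"
	"time"

	"github.com/DataDog/datadog-go/v5/statsd"
)

// reportResult sends a counter and a duration for a finished test, tagged so
// dashboards can slice by test name and outcome.
func reportResult(t *testing.T, client *statsd.Client, start time.Time) {
	result := "pass"
	if t.Failed() {
		result = "fail"
	}
	tags := []string{"test:" + t.Name(), "result:" + result}
	_ = client.Incr("e2e.test.result", tags, 1)
	_ = client.Timing("e2e.test.duration", time.Since(start), tags, 1)
}

func TestNodeLifecycle(t *testing.T) {
	client, err := statsd.New("127.0.0.1:8125") // address of the local Datadog Agent
	if err != nil {
		t.Fatal(err)
	}
	defer client.Close()

	start := time.Now()
	defer reportResult(t, client, start)

	// ... run the feature with testEnv.Test(t, feat) here ...
}
```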
So overall I would say that we are pretty happy with the e2e-framework. The feedback that I get from my fellow developers is pretty positive: it's easy to use, it's very easy to get started with, and it's just like importing another regular package into your project, without any magic. It has a lot of helper functions out of the box, you can easily extend almost all the components, and it feels very close to writing the Go tests you're already used to.

Of course, the project is not perfect yet; it's still quite young. Parallel testing is not there yet: if you remember from the beginning of the talk, there is this global config in the environment, and at the moment, when you enable parallel tests, multiple goroutines try to write to the config, causing a race condition. Hopefully we are going to have a PR to fix it in the next few weeks. There are also additional flags that you can pass to the framework to run tests in parallel: you can use the regular t.Parallel() as with go test, or you can pass the parallel flag to the framework, which is going to spawn goroutines to run your tests. I don't think this provides a very good user experience, so we rely on Go's t.Parallel() to run the tests. Another thing that is annoying, if you submit metrics, is that you can't detect whether a test actually ran or was skipped. So we end up looking at the metrics and seeing a bunch of green, but all the tests are taking zero seconds because they are actually not running. We have an open issue for that, but we don't have a PR yet.

So this is our experience at Datadog, and I'm going to leave the stage to Philippe to talk about how they're doing it at Crossplane.

Yeah, thank you, Matteo. So let's talk about how we run end-to-end tests at Crossplane. As you'll see, we have a pretty different approach in some aspects with respect to what Matteo just showed, mostly due to the different context and the different scope of our end-to-end tests. But first, let's quickly introduce Crossplane. The sales pitch describes it as a framework for building cloud native control planes, but you can see it as three things: an ecosystem of providers to manage external APIs, such as cloud providers, but also whatever has an API — actually, look up provider-pizza if you want to check how to order a pizza using Crossplane; a low-code framework to compose these resources into your custom APIs; and a package manager to handle all of the above. This translates into this diagram, which as you can see is pretty complex, and this is obviously not even the whole thing. So we really need good testing coverage in order to have confidence shipping new features, refactoring old code, or fixing bugs.

We already had, and still have today, some good unit test coverage — more than 70% on core logic — but it heavily relies on mocking, which is fine, but obviously means that even if CI is green, at the end of the day, once you get to an actual deployment, stuff can still be broken. We had a dedicated repo for end-to-end tests with a small number of scenarios that we ran on a schedule against the latest commit on the main branch, to be sure stuff was at least not completely broken. These were plain Go tests using just client-go, but we were lacking the confidence I was talking about previously. So we agreed this was not enough and decided to move the end-to-end tests back into the Crossplane repository and add a few new scenarios, so that we could run them on each PR. But we had to pick something to do that with.

So what do we actually care about when writing end-to-end tests? How are these scenarios actually run — go testing, some custom tool, how much control do we have? We need to spin up and tear down stuff, like a kind cluster or the Crossplane Helm chart, maybe with some custom values, plus any additional components. And how are those scenarios defined — the language you use, the fixtures, does it provide any helpers to deal with eventually consistent Kubernetes resources? As you can see, we considered a few options, from the most classical ones, Ginkgo and Gomega — which we actually already used in Crossplane in the past and removed a while ago — to more exotic ones like Gherkin-flavored tests.
Here I tried to highlight all the aspects we were talking about previously. The engine, so how tests will be run with each option: as you can see, kuttl is the only custom one, the rest still run on go testing in one way or another. The helpers, whether they provide ways to set up the environment, handle Kubernetes resources, and deal with eventual consistency: as you can see, there are different degrees there. And then the front end, how test scenarios will actually be defined: some are more heavyweight frameworks and have their own DSLs, or let you define your own DSL directly, while others can just be used as regular tests. Crossplane, as you might know, is written in Go, so we decided to write our own tests in the same language and, if possible, stick as much as we could to the usual Go testing we already used for the rest of the code base.

So, as you can see, the only option checking all the boxes for us was the e2e-framework. It's maintained by SIG Testing, so it felt like the right choice. As Matteo was saying, there's a rich set of helpers, so you can handle kind clusters directly from your code — no need for any hacky Bash scripts to spin up your kind cluster. You can handle fixtures both through YAML manifests and/or directly as Go structs, and you can interact with Helm charts, installing, upgrading, and uninstalling them as needed. It adds little cognitive overhead on top of the go testing framework — at least not as much as other heavyweight frameworks. You can see our dedicated one-pager with more details if you're interested.

So we defined our guidelines. We decided to have one feature per test, so that we can directly select tests to run using standard go test selectors. All tests must be self-contained, ensuring Crossplane itself is installed as needed, restoring the previous state, and deleting everything they created. Parallelism-wise, we don't run tests in parallel, differently from what Matteo was saying — at least for now — because most Crossplane resources are cluster-scoped, so for us it was really hard to implement parallel tests. And so each test can assume it has the whole cluster available, as long as it's able to clean it up after the test. We should also be able to run these tests against an existing installation, as a sort of conformance test to check that something is behaving as a good Crossplane installation. And we decided, differently from the Datadog case, to use YAML manifests, mostly because we are an open source project.
So having some Automatically tested the manifest that you can reference people to is actually a good thing for us although the cut and paste and handling and all the things about YAML That's actually obviously an issue but for us it was the pros were were more than the cons And we keep them up-to-date by using renovator to bump all the images around all our manifest We we used some we built and we suggest using some dummy cross-plane packages like a function dummy or provider NOP so that we don't have to spin up actually infrastructure out in the cloud using provider AWS or other providers and if needed user and Enrich when when applicable our library of helpers that which Which we'll see in a moment You can see our complete guidelines for more details there So we also extended the environment to add a few custom flags mostly to handle kind and How to set up its the cluster and how to handle failures and what to do on failures Then we also introduced an additional concept of test fits Which we'll talk more about in the next slide and finally from the test main You can see it's just a matter of adding all List of setups apps a list of finish steps and then run the whole thing as I was saying most most cross-plane resources are cluster scoped and Some tests require a specific configuration like some alpha feature enabled Or some additional external component would be running during the test So we introduced the concept of test suites to bundle up those things together. So both the custom setup and the tests that expect that that setup Then in see in CI as you can see where we use Github actions. We run them in parallel based on the test suite So we have a base one and then more or less one for each alpha feature On failures We want the single ones to Stop so that we can collect all the required information to debug them as I was saying everything is cluster scoped So it's hard to run to be sure next Tests are gonna run successfully if a previous test is failed at the moment. It's not an issue because we actually have Some pretty short times on the end to end test But if it's gonna become an issue, we're gonna address that in some other smart, you know, some other way However, we still let over Test suites finish to run so that we can at least understand whether there is an issue with some we broke Whether we broke everything or we just broke some particular configuration To understand what went wrong on failure. We build a graph of all the resources in the cluster and dump and print down the whole All the related the inform all the related resources to the one we are actually testing at the moment with all the events and if that's not enough we also zip and upload the whole The whole cluster logs with full audit logs enabled and then we can we can debug it further Test suites can Can include each other and tests can be part of multiple test suites as you can see here We have a base one that most of our test suites include But we also split the life cycle scenarios like Uninstalling and upgrading cross-plane from the latest stable release to their own test suite so that we can include just that from suites But we know don't actually affect Core functionalities this way when promoting a feature to beta Which means it's going to be enabled by default We are sure we are not going to break any core functionality and we just need to merge all those tests into the base one So let's see an example of an actual test This is a pretty basic one. 
We just want to check that Crossplane's composition engine works as expected, and that we are able to propagate some field down to the composed resources and back to the composite resource. We have our dedicated manifests folder, as you can see there, usually containing a setup subfolder with a bunch of our manifests. Then we define our feature — usually one per test, as we said. We set some useful labels so that we can slice and dice the tests as needed, and we also declare one or more suites the test belongs to. Then we define our setup steps, applying all the setup manifests and checking that everything is running as we need it to. Then we define a few assessment steps; we try to keep them clean and separate the different steps, so that later we can easily see which step failed. In this case we apply our claim, wait for it to be ready, and then check that it has the field we expect properly set. And then we define our teardown steps, cleaning everything up so that the cluster is ready for the next test.

There is definitely room for improvement on our side. For example, right now we lack the statistics about test flakiness that Matteo showed earlier, so we should definitely get some credits from Datadog for that. It's mostly left to us maintainers to know which tests are known to be flaky and their expected success rate, so that we at least know that if we rerun them a few times and everything works fine, it's a known thing that we've already handled and have an issue for. Then usually we push at the end of a release cycle to stabilize the whole thing and possibly fix some bugs, which is obviously not the best approach, so that's definitely something we should improve ourselves. Most of the time these flakiness issues are due, as I was saying, to resources left hanging by other tests. So we could improve on that with some smart workaround, by namespacing stuff — we should probably work on that. This could also potentially allow us to run the whole thing in parallel, so the same solution would actually be a double win.

Then, Go 1.20 introduced coverage for integration tests too, so you can export integration test coverage from your binary — in our case, the e2e test binary running somewhere — and actually see which parts of the code base you are exercising. We should definitely use that to see which parts we are not testing that much, so we can improve on that.

On the e2e-framework side, our main complaint right now is that the setup and teardown steps, although they can be given a name, have those names completely ignored at the moment, and, unlike the assessment steps, they don't run as subtests. I have a long-standing issue created on the e2e-framework repository, but unfortunately I still have to find the time to work on it. Other than that, we are pretty happy and it's been a nice experience. And that's it, so thank you for coming, and if you have any questions...

So the question is whether there is any performance issue with this setup, in our respective cases. For us it was more a matter of readability and of how the whole framework was maintained, rather than performance.
We were leaving our ginkgo So the end-to-end framework that we had was based on Kubernetes There's not the point you need to decide if you update your end-to-end test Independently so every time they do a minor release in Kubernetes you bump it and how if it needs to be in sync with your own Kubernetes version And so for us it was more like people were fine difficult to write interesting ginkgo and The whole managing of the cluster was quite complicated of the of the dependency was quite complicated But I don't think that we see a lot of performance. I think we're gonna see them as soon as we can go back to parallel Because at the moment is Yeah, in our case the performance is not really an issue yet All test suites run in around depending on which one between 10 and 20 minutes So that's fine. Obviously spinning up and down every time stuff is a waste of time And we took that approach right now because it's not a big issue But it's definitely the first item in the to-do list once we add more test cases and everything becomes a little bit smaller So the next question Sometimes a set up section of those tests can be complex and long installing multiple hand charts Maybe even deploying some clusters. Yeah, how do you make sure that actually set up is Is working and it's exactly the environment that you want and it has not failing because the test is It's a wrong test. It's because setup failed In our case we usually add a check an assessment checking that everything we installed in the setup phase is actually Properly configured everything we care. Obviously, you cannot check everything. Yeah, we do the same So imagine that we want to make sure that the deployment I don't know goes on certain nodes in the setup But maybe we we created the deployment and we make sure the deployment gets ready And then we move to the assessment when we do specific assessment of things But if you don't have a deployment of all test is already failing. Thank you Hi, so if I got your Beginning assessment properly you could not use the test from Kubernetes itself. So did you think of Doing the opposite like using the end-to-end framework from the external repo into Kubernetes so that you reduce the duplication at some point So I think that when Those people are cross-plane were assessing which framework there was comments also from people people Maintaining the end-to-end framework in tree and I don't think there is a plan at the moment to replace Ginkgo also I think they went through a major refactor to moving from Ginkgo V1 to Ginkgo V2 in Kubernetes in tree itself So I don't see I don't see I don't see this happening any time soon And but I'm also not involved in any of these projects. So I might be wrong. I don't know if you know more No Thank you. Hi. Thanks. That was great. I have two questions One for Matteo you mentioned that you usually run a pod that runs the test alongside the Service itself in the same namespace maybe all a different one. Do you do that in production continuously as well? 
Yes. All the cron jobs that you saw in the slides run on the real production clusters. This is why sometimes the tests — they're not flaky, but they might fail because the cluster is particularly busy at the time, and also the time that it takes to run the tests can increase, especially when you do autoscaling where you have real workloads running. But it gives us the best signal that the cluster is working. We also spin up and tear down an end-to-end cluster every day with no workload, to make sure that all the controllers and the provisioning tools are still working.

Thank you. And I assume you also run those tests in short-lived environments in CI, like maybe in kind or that sort of cluster — or maybe in the Crossplane case. One issue we struggle with there is that usually a controller depends on another controller, which depends on another controller, which depends on another controller, so it's very hard to test a single controller: we always end up testing five controllers indirectly. Yeah, that's... I mean, that's end-to-end testing. We mostly focus on something high-level and hope that everything down the stack works, and then if something breaks, it's up to the unit tests to catch that particular part of the code base. But yes, debugging failures is the main issue there: going through the whole log of the whole cluster is not a pleasant experience. For us, testing in CI is quite complicated because we do a lot of customization of the clusters to have a Datadog Kubernetes cluster, and that is something that is quite complicated to reproduce at CI time. We have one controller where I'm trying the e2e-framework to test the controller itself, because traditionally, if you go to the Kubebuilder repository, for example, and look for examples, it's always tested with Ginkgo, so the majority of the controllers that we have are still on Ginkgo. But hopefully, now that we have the e2e-framework and a nice package of shared resources to import, we can extend it to controller testing too.

Thank you everyone. Thank you.