I think it's 25 past, and I believe it's the last session of this conference today, so let's get started. You're all probably eager to get to the booth crawl, but first, let's talk about the SIG Testing update of this year's KubeCon.

My name is Patrick Ohly. I work on Kubernetes for Intel on various topics, including testing, end-to-end testing in particular, and I recently became a tech lead for that particular area in SIG Testing. We do have other experts there; SIG Testing is a fairly big SIG that owns a lot of code. But I'm the one who has done most of the work on end-to-end testing, and therefore I got the honor to talk about it today.

Usually in these SIG updates, someone goes through all the things that the SIG has done to inform the community. But with a lot of things landing very recently in end-to-end testing, I figured I'll focus on just that aspect and hopefully make it interesting for you, because it is something that everyone who touches Kubernetes sooner or later needs to deal with, whether it's developing features, debugging something, or writing your own code and trying to figure out how to test it against a real Kubernetes cluster. That is where end-to-end testing becomes useful and is needed.

But SIG Testing, as I said, is a big SIG that is really crucial to Kubernetes. We officially own several tools that keep the lights on in Kubernetes: Prow for testing, Tide for merging code that has approvals. All of these are crucial tools that keep up the high velocity of changes going into Kubernetes, and keeping the project healthy is part of that too. We own some tools that do analysis, and we also help other SIGs develop their tests. A common misconception, though, is that SIG Testing is itself responsible for those tests. We just provide the infrastructure, and then we expect and hope that the other SIGs will use it responsibly, do the right things, and write good tests, because good tests are part of what keeps Kubernetes healthy.

Many tests run on all PRs, and they need to pass reliably. Flaky tests are probably one of the biggest problems we deal with in Kubernetes when merging changes, because a flaky test not only affects the thing that isn't getting tested properly, it also affects every other PR that doesn't go in immediately because some Prow job failed temporarily. That is a big problem that we still struggle with, and some of the tips and guidance in this talk are about the aspects that make tests reliable, useful, and easy to debug.

So, end-to-end testing. I also published a blog post, with a link at the end of the talk, that covers much of the same material. If anything in the slides is too brief, you can read up on it in a much longer text document, share it with colleagues, post it on the wall of your cubicle if it's useful enough, I don't know, but spread the word, because at least for people working on Kubernetes this is really important.

First, the definition: end-to-end testing is about testing with real components in real clusters, perhaps on VMs, perhaps on real hardware. It really does deploy a full cluster; in Kubernetes it's often a kind cluster, but it uses the same artifacts that get published by Kubernetes. It's not some special API server, it is the API server that is getting tested, and the same goes for all the other components, the kubelet and so on.
The component that does the testing is the end-to-end test suite. It acts like a Kubernetes client: it's a single binary that connects to the API server and then, through client-go, deploys some workloads and waits for the right and expected reaction from the cluster, to verify that the cluster works as expected.

We have, and this is where it perhaps gets a little confusing, two end-to-end frameworks under SIG Testing. The one in Kubernetes that is used for testing Kubernetes itself is the older one. It's based on Ginkgo and Gomega, two tools that are not owned by SIG Testing — they are separate GitHub repositories — but we've been using them for a long time. Ginkgo is the test runner that organizes the test suite and decides which tests run at which point. Gomega is an assertion library that you use inside your tests to make statements about expected states and so on. Testify is perhaps an alternative to Gomega, but in our case we use Gomega.

The crucial difference compared to the other end-to-end framework is that there is another effort, essentially a subproject of SIG Testing, that at some point formed around people who wanted to do end-to-end testing and couldn't do it with the in-tree framework, because the in-tree framework is kind of hard to vendor into third-party projects. That was one of the problems at the time, and it still is a problem. So they started from scratch, also being a bit dissatisfied with some things that were not that well done in the in-tree framework for historic reasons. They built something around Go unit tests as the test runner: a completely different API, completely different source code, completely separate. It basically forked the effort to some extent.

My focus today is on the in-tree Kubernetes end-to-end framework, because that's the one I've been working with and the one we have to use in Kubernetes. I don't see a path for somehow merging these two efforts again; they're too different, and there's no viable path that replaces all of the in-tree end-to-end tests with something based on this other framework. So I just mention it here for the sake of completeness, because both are called "end-to-end framework". We probably need a better name — one of them should get renamed, but I don't know which one. That's where we are; just keep that in mind.

At this point I'd really like to give a huge thanks, perhaps a round of applause — I don't know whether he'll ever watch this recording. Onsi Fakhouri is the main author of Ginkgo and Gomega, and he's been extremely helpful, all of it in his spare time, in improving Ginkgo in particular: the major version 2 release made it suitable for Kubernetes, addressing some of the long-standing concerns that we had. We can't pay him — the CNCF perhaps could — but we as fellow contributors can't do much more than really thank him. He's not here, so we can't buy him a beer, but anyway.
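Before getting into the architecture, here's a minimal sketch — not from the talk, just an illustration — of how the two pieces fit together: Ginkgo bootstraps from a regular Go test and runs the specs, while Gomega provides the assertions inside them.

```go
package example

import (
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

// Ginkgo is the runner: one standard Go test hands control to it, and it
// then organizes and executes all registered specs.
func TestE2E(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail) // a failed assertion aborts the current spec
	ginkgo.RunSpecs(t, "Example E2E Suite")
}

// Gomega is the assertion library used inside the specs.
var _ = ginkgo.Describe("my component", func() {
	ginkgo.It("produces the expected output", func() {
		output := "hello world" // stand-in for a real interaction with the cluster
		gomega.Expect(output).To(gomega.ContainSubstring("hello"))
	})
})
```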
So this is the architecture of the in-tree framework as it exists today — we have a much nicer architecture now, as of basically just last week. At the bottom we have the e2e framework. It lives in Kubernetes under test/e2e/framework, and it's now mostly a standalone package that only has client-go as a dependency. That's important, because it used to depend on Kubernetes code elsewhere, packages under kubernetes/pkg, which made it hard to pull the framework into a staging repo. That would have been the way to publish it earlier, except that we had all of these odd dependencies on code in Kubernetes that are not acceptable for a staging repo. Now we are technically at a point where we have clean dependencies.

So the bottom layer is the framework: it manages things like creating a test namespace, connecting to the API server, cleaning up, and handling timeouts. On top of that we have domain-specific helpers: creating a pod lives under test/e2e/framework/pod, waiting for pod states is in that same package, and we have corresponding packages for other Kubernetes constructs, volumes being a big one. Those helpers then get used by the different tests. The tests live alongside the e2e framework, so under test/e2e we have test/e2e/storage, as just one example, and that contains the actual tests that we run against the cluster.

The last part is the test suite, which pulls all of this together and defines the actual binary that we run. For end-to-end testing in Kubernetes that is test/e2e. The directory structure is a bit confusing — it probably shouldn't be like that, but that's the historic part, we can't easily move things around — but at least the dependencies are more sane now. So test/e2e is the main entry point for building the e2e.test binary. Some of you might know conformance testing: that's the same binary that gets published as part of the Kubernetes release, and Sonobuoy then runs that binary to do conformance testing. It's basically a subset of the tests that we use for full testing in Kubernetes.
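To illustrate how these layers get consumed, here is a hedged sketch. The package paths match the tree described above, but treat the exact helper signatures (e2epod.NewPodClient, CreateSync) as approximate, since they have changed between releases; the pod spec is purely illustrative.

```go
package example

import (
	"github.com/onsi/ginkgo/v2"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/test/e2e/framework"
	e2epod "k8s.io/kubernetes/test/e2e/framework/pod"
)

// The framework layer: one Framework per spec provides a fresh test
// namespace, an API client, and cleanup handling.
var f = framework.NewDefaultFramework("example")

var _ = ginkgo.Describe("pod helpers", func() {
	ginkgo.It("runs a pod", func(ctx ginkgo.SpecContext) {
		pod := &v1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: "busybox"},
			Spec: v1.PodSpec{
				Containers: []v1.Container{{
					Name:    "busybox",
					Image:   "busybox",
					Command: []string{"sleep", "3600"},
				}},
			},
		}
		// The domain-specific layer: CreateSync creates the pod in the
		// test namespace and waits until it is running.
		pod = e2epod.NewPodClient(f).CreateSync(ctx, pod)
		framework.Logf("pod %s is running", pod.Name)
	})
})
```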
We have two other test suites. e2e_node, owned by SIG Node, is used to test the kubelet, and it runs a bit differently: it runs alongside the kubelet, not against a real cluster, basically on the node (or on a fake node), but it does have the API server, so it can do things like create a pod and see that the kubelet actually executes that pod. That's why it also uses the same framework. And kubeadm also has an end-to-end test suite using that framework. So those are the in-tree consumers.

I personally have had an interest in this e2e framework because I also wanted to test a CSI driver, so at some point I vendored kubernetes/kubernetes to get access to the framework, and it has worked; it's just a bit more tricky to get the dependencies right. So there are out-of-tree test suites — three that I know of, which I've written myself — that use that same framework. It's possible.

Some of the recent efforts: the Ginkgo v2 migration, which I already mentioned; that major version update gave us a lot of new features. We also had discussions around what the best practices actually are, what we recommend to people. And I need to call out one thing here: we looked at failure messages, and some of them were just very brief, which has been a problem. We had a discussion and basically said: yes, it is okay for the failure message — the first thing we show to people when they look at a failed Prow job for a failed test — to be a long, detailed, informative message, including multiple lines of debug dumps that, for example, describe a pod that hasn't reached a certain state. And it's okay if that output contains content that varies between test runs, because the tools that correlate test failures from different runs can handle differences like different hex values inside the output and still correlate them. So it's okay to make that message long, and it's useful, so let's do it.

Which leads me to the next point: how do we generate those failure messages? Ultimately a test should pass, it should never fail; but if it does fail, these failure messages make the difference between a good test and a bad test. A bad test gives you no useful information; you have to start downloading the test suite, modifying it, adding debug output — and none of that works if a failure occurs only occasionally in your CI. Then you're basically stuck with what you have in your test code and what it prints.

So in the context of Kubernetes we made another decision: we deprecated some of the helper functions that were in the framework, because at some point people started adding things like framework.ExpectNoError and framework.ExpectEqual, which basically created another Kubernetes-specific domain language for writing test code. We concluded that this is not what we want to focus on; using Gomega directly is what we recommend now. We have much more flexibility in upstream Gomega compared to what we had — the framework exposed basically just a small subset — so let's just use Gomega directly. That was one of the design decisions we made. There is one exception: our version of ExpectNoError has special handling of API server errors, so that one is actually still recommended for that particular use case, but it's the only one.
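In practice, the recommendation looks roughly like this sketch, where f is a framework.Framework as in the earlier example and "my-pod" is a placeholder; the extra arguments to ExpectNoError describing what was attempted end up in the failure message.

```go
package example

import (
	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/test/e2e/framework"
)

var f = framework.NewDefaultFramework("assertions")

var _ = ginkgo.It("checks a pod", func(ctx ginkgo.SpecContext) {
	// The one recommended Kubernetes-specific helper: it knows how to
	// render API server errors in detail.
	pod, err := f.ClientSet.CoreV1().Pods(f.Namespace.Name).Get(ctx, "my-pod", metav1.GetOptions{})
	framework.ExpectNoError(err, "get pod %q in namespace %q", "my-pod", f.Namespace.Name)

	// For everything else: plain Gomega, not the deprecated framework wrappers.
	gomega.Expect(pod.Status.Phase).To(gomega.Equal(v1.PodRunning))
})
```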
Another thing that people often get wrong is that they use Gomega but only reach for Go-level checks like strings.Contains. What they do is pass the result of that check to gomega.Expect, and the failure message then is just "expected false to be true" — not helpful. It's much better to let Gomega see the actual string and use a Gomega assertion like gomega.ContainSubstring with the expected substring. Even better is to add some additional information about what it is that you are checking, because then, when it fails, it will tell you: I was checking log output, here's what I got, here's what I expected. All of that will be in your failure message, so when it fails you have almost all of the information you need to start debugging.

There are probably cases where there is no suitable Gomega assertion. You could write one — it's part of Gomega's design that you can write custom matchers — and that may be the right choice, but it's also a bit involved. For one-off things you may just do the check in plain Go, but then it's your responsibility to produce a Failf message that is informative, to really print something useful.

Another problem we've had is that formatting API objects, like a Pod struct, has not been useful with the full introspection of all fields that Gomega normally does. Gomega failure messages use the gomega/format helper package to dump objects, and a dumped Pod usually got truncated, because we have fields like a timestamp which, when you look at every single field, contain lots of things you don't care about, like a time zone. So we ended up with a Pod dump that had lots of information about the time zone of one field and then got cut off — not very useful. What we changed is that gomega/format in our end-to-end test suite gets a hook that intercepts formatting: if it sees something that looks better as YAML, it formats that object as YAML. So you get a YAML dump of your Pod, which omits unset fields, like you would get from kubectl, and fairly readable output when using that infrastructure.

The other topic is recovering from failures: a test might time out, or it might get interrupted if you abort the suite manually. Ginkgo v2 added something called DeferCleanup. It's like defer in a function, but with some additional features (I have that on the next slide): it deals with contexts and timeouts, so the test and the cleanup code get different timeouts, all handled by Ginkgo. The other benefit is that you can call DeferCleanup in a helper function. A plain defer doesn't work there, because you always exit from the helper function before the real test runs; DeferCleanup instead registers callbacks that run at the end of your test. That's the other big advantage.

We have additional tooling in Kubernetes: framework.IgnoreNotFound. The example here creates a pod, does some testing with it, and we want to be sure it gets deleted. This is part of the recommendation: a pod is a namespaced resource, but cleaning it up can often take a long time, so it's useful to do it explicitly, to see where your delays are and to be notified if deleting the pod fails — if you do it in DeferCleanup, a failed delete becomes a test failure. If you rely on the automatic deletion of the namespace, the namespace might never get deleted, and you wouldn't see that, because namespace deletion is asynchronous: the test doesn't wait for the namespace to be deleted, it just triggers the deletion. So it's better to do it explicitly. But then, what happens if you register this cleanup call and your normal test code also deletes the pod? DeferCleanup automatically does error checking; it would get a "not found" error from the pod client's Delete call and fail the test. That's why we have IgnoreNotFound, which you can use in situations where "not found" is okay and shouldn't be treated as an error.
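Put together, the recommended pattern looks roughly like this sketch. makeTestPod is a hypothetical helper returning a pod spec (like the inline one in the earlier sketch), and the IgnoreNotFound wrapping follows the pattern described above — double-check the current signatures in the tree.

```go
package example

import (
	"github.com/onsi/ginkgo/v2"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/test/e2e/framework"
)

var f = framework.NewDefaultFramework("cleanup")

var _ = ginkgo.It("cleans up its pod explicitly", func(ctx ginkgo.SpecContext) {
	podClient := f.ClientSet.CoreV1().Pods(f.Namespace.Name)
	pod, err := podClient.Create(ctx, makeTestPod(), metav1.CreateOptions{}) // makeTestPod: hypothetical helper
	framework.ExpectNoError(err, "create test pod")

	// Register cleanup right after creation, so it runs even if a later
	// step fails. DeferCleanup invokes the callback at spec teardown with
	// its own fresh context and timeout and treats a returned error as a
	// test failure; IgnoreNotFound wraps Delete so that "already gone"
	// (e.g. because the test body deleted the pod itself) is not an error.
	ginkgo.DeferCleanup(framework.IgnoreNotFound(podClient.Delete), pod.Name, metav1.DeleteOptions{})

	// ... test steps that use the pod, possibly deleting it themselves ...
})
```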
I've mentioned interrupting. Another tip is a Ginkgo v2 feature called progress reports, enabled by setting poll-progress-after to some small delay — the default is usually much higher, I'm not sure what it even is. What this does: if a test runs for a long time, then after, say, 30 seconds it starts giving you information about where the test is currently stuck. The approach we had before was that anything that polls needed to dump some log message fairly frequently, and it did that in the CI runs too, so we ended up with long logs of "checking pod", "still checking pod", "still not done". With this, we can control how often we get that output, and it's fairly detailed: it tells you exactly which Ginkgo test is running and the latest ginkgo.By, if you instrument your tests with that, so you know where you are.

You can also debug with Delve if you want to do it interactively. The invocation shown here is all Kubernetes-specific: it assumes you are in the root of the Kubernetes repo, and the make command ensures that you build the right ginkgo CLI — there are version dependencies between the ginkgo CLI and the test suite, so it's better to use the one that gets built together with Kubernetes. And finally, it's now safe to interrupt at any time with Ctrl-C, because we have much better cleanup handling in Kubernetes than we had before. I've been guilty of interrupting the test suite and rebuilding the entire cluster because I wanted to be sure it was in a clean state; that should be less often necessary than it used to be.

How it works under the hood — again a Ginkgo v2 feature: like normal Go code, the callback that I register with ginkgo.It can optionally take a context parameter. It's fine to leave it out, but then your code doesn't know how much time it has left. In Kubernetes we almost always want the context: pass it down into API calls with client-go, and when the test times out or the test suite gets aborted, that context gets canceled, everything returns immediately, and cleanup can start. Otherwise the goroutine would keep running; it wouldn't know that it needs to stop, and there's no feature in Go to kill a goroutine — it has to notice that it should stop.

This is also a bit tricky, because the cleanup code, for example, needs a new context. It can't use the same one from the surrounding It function, because that context will already be canceled by the time the cleanup callback gets invoked. It's a slippery slope getting this all right, and this is where our documentation has lots of guidance about the different steps that can go wrong and did go wrong — all things I found in practice while rewriting much of the testing to use this context. I'm not going to cover more details, but this is material you will probably want to look at if you want to get it right. And I wish we had better linters for this; I have some ideas, but no implementation yet.
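A hedged sketch of both the pattern and the trap just described:

```go
package example

import (
	"context"

	"github.com/onsi/ginkgo/v2"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/test/e2e/framework"
)

var f = framework.NewDefaultFramework("contexts")

var _ = ginkgo.It("reacts to cancellation", func(ctx ginkgo.SpecContext) {
	// Pass the spec context into every client-go call: when the spec
	// times out or the suite is interrupted, ctx gets canceled, the call
	// returns immediately, and cleanup can start. Without it, this
	// goroutine would keep running - Go cannot kill a goroutine from
	// the outside.
	pods, err := f.ClientSet.CoreV1().Pods(f.Namespace.Name).List(ctx, metav1.ListOptions{})
	framework.ExpectNoError(err, "list pods")
	framework.Logf("found %d pods", len(pods.Items))

	ginkgo.DeferCleanup(func(cleanupCtx context.Context) {
		// The trap: do NOT reuse the spec's ctx here - it is already
		// canceled by the time cleanup runs. Declaring a context
		// parameter makes Ginkgo inject a fresh one with its own timeout.
		_, err := f.ClientSet.CoreV1().Pods(f.Namespace.Name).List(cleanupCtx, metav1.ListOptions{})
		framework.ExpectNoError(err, "list pods during cleanup")
	})
})
```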
Other guidance is around timeouts. Lots of tests made up their own timeouts for how long they expect things to take. This is problematic because the performance of a cluster can vary. We're currently using timeouts that fit what we run in the Kubernetes CI, which is unfortunate because we have to make assumptions about the target hardware; we should be more flexible, and perhaps at some point we will have configuration options for all of these timeouts. Right now, what we have is a recommendation to use the predefined timeouts that are part of the context you get for each test. Use those — if you know you have a slow pod that is expected to take longer, perhaps use twice the pod-start timeout — but try to base it on those predefined timeouts, so at least we know why there is a timeout and where the magic value comes from.

The other thing is that oftentimes testing consists of creating something and then waiting for a certain state to be reached, and this has been the source of a lot of headaches. We used to have tests in Kubernetes based on very simple for loops, with wait.Poll being perhaps the most common solution. But wait.Poll in particular was targeted at Kubernetes production code; it wasn't meant to be used in tests. People used it anyway, and we ended up with lots of tests that just failed with one error that said "timed out waiting for the condition", full stop. If that's the failure message, how do you debug it? You really don't know what it was waiting for, and you don't know the actual state. Also, wait.Poll was often used without a context, so it wouldn't stop immediately; it would keep polling even though the test had already failed or timed out.

So we came up with a long laundry list of things that a good polling implementation should support: accept a context for waiting; be informative when used interactively — tell you what it's waiting for and what the current state is, ideally in the same format as the failure message, but not in the CI, because in the CI we don't care about the intermediate state, only the result, so it needs to be configurable; when it fails, report both the last observed state and any other errors that occurred along the way; allow composing the conditions it waits for, so we can reuse code — right now we have a long list of very specialized "wait for a pod to do XYZ" top-level functions that could be more modular; and finally, sometimes the polling code detects that something has gone permanently wrong, so we want to be able to abort polling early.

If that list is too long, just remember one thing: gomega.Eventually, and its counterpart Consistently, do all of this right. Whenever you're in doubt, just use these functions. You can also use them in integration tests, in a plain Go test, and the result will almost certainly be better than the code you would write manually. That is something we need to tell people, because some people are conservative; they are wary of things they don't understand, like having to learn about gomega.Eventually and how to invoke it. But it's definitely better.
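For example, a polling check with Gomega might look like this sketch: Eventually takes the context, retries the function, respects the predefined timeout, and reports the last observed object on failure ("my-pod" stands in for a real pod name).

```go
package example

import (
	"context"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/test/e2e/framework"
)

var f = framework.NewDefaultFramework("polling")

var _ = ginkgo.It("waits for a pod to run", func(ctx ginkgo.SpecContext) {
	// Instead of a hand-written wait.Poll loop that dies with "timed out
	// waiting for the condition", let Gomega poll: it respects ctx,
	// retries the function, and reports the last observed object when it
	// fails, not just "timeout".
	gomega.Eventually(ctx, func(ctx context.Context) (*v1.Pod, error) {
		return f.ClientSet.CoreV1().Pods(f.Namespace.Name).Get(ctx, "my-pod", metav1.GetOptions{})
	}).WithTimeout(f.Timeouts.PodStart). // a predefined timeout, not a magic constant
		Should(gomega.HaveField("Status.Phase", v1.PodRunning))
})
```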
Another framework-specific thing: sometimes we want to do a gomega.Expect in a helper function, but when it fails, it's not the source code location of that helper function that is of interest, it's where the helper was called. We solved that by turning Gomega failures into errors that we pass back up the call chain. We can wrap these errors so we get additional context, and then in the main test we have an ExpectNoError that dumps the error with all the information — that's where framework.ExpectNoError comes in. That's how we handle helper functions, because Gomega itself has some problems here: you can pass an offset, but it's a bit tricky, and this is our alternative solution. A good example, if you're looking for code that does this, is the test/e2e/framework/pod helper package, because I've done a lot of updates to make that particular package use these best practices, and I've used it as a test case to confirm that my changes actually make sense and lead to better code.
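To make that concrete, here is a hedged sketch built on plain Gomega primitives — not necessarily the framework's exact mechanism, just the idea of capturing a Gomega failure as an error and wrapping it on the way back up.

```go
package example

import (
	"errors"
	"fmt"

	"github.com/onsi/gomega"

	v1 "k8s.io/api/core/v1"
)

// checkPodRunning runs a Gomega assertion but returns the failure as an
// error, instead of aborting the spec at this (uninteresting) location.
func checkPodRunning(pod *v1.Pod) error {
	var failure error
	// NewGomega installs a fail handler that records the message instead
	// of calling ginkgo.Fail, so execution continues after a failure.
	g := gomega.NewGomega(func(message string, callerSkip ...int) {
		failure = errors.New(message)
	})
	g.Expect(pod.Status.Phase).To(gomega.Equal(v1.PodRunning))
	if failure != nil {
		// Wrapping adds context while climbing back up the call chain.
		return fmt.Errorf("pod %s: %w", pod.Name, failure)
	}
	return nil
}

// At the call site, the failure then surfaces where it is interesting:
//
//	framework.ExpectNoError(checkPodRunning(pod), "pod readiness check")
```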
With that, I'm almost at the end of my talk — we do have time for questions afterwards — and I want to mention a few more things. I've mentioned the blog post; the QR code here takes you to it. It basically summarizes all of the things that I discovered, expanding on what I've covered in this talk, with a lot more information about the individual aspects, like the proper handling of contexts — that's all spelled out explicitly there. It was also copied into the upstream contributor documentation that we have under the Kubernetes contributor GitHub repo. That document is a bit longer, and it also explains things that haven't changed, so this blog post is perhaps the better starting point.

I do have some plans. I want to enable linting, first in Kubernetes itself — that's something I still need to work out. I had another talk at the contributor summit about linting of pull requests: we don't follow best practices in the existing code, so if we enable additional checks in the linter configuration and they get applied to all code, we will certainly have failures, because we haven't modified all of the existing code. My plan for Kubernetes is to enable stricter linting in pull requests, for the code that gets added or modified at that time. Then we can say: okay, you're doing something we have done in the past but don't do anymore, do something better instead — and we'll get better code over time. The configuration I'm currently suggesting for Kubernetes doesn't do anything with testing yet, but there is a Ginkgo and Gomega linter that has been integrated into golangci-lint, and it checks for a lot of useful things. So once I know that this approach works, I'll probably propose a configuration change that also makes writing tests easier, because it will check for these things automatically.

Something that potential contributors could pick up is converting more of the sub-packages that we have in the framework. Some of these are domain-specific, but the work could easily be farmed out now that we know how to do it. We could continue and basically clean up all of the other helper functions too, which will be useful; it will benefit Kubernetes, because at some point someone will write a new test and swear about "timed out waiting for the condition" because they're calling a helper function that doesn't use gomega.Eventually yet. It would be good for your karma to help out Kubernetes here, paying back some of the technical debt that has accumulated. And karma is one thing, but self-interest is perhaps the best motivator to get some work done: if you are developing a component that needs end-to-end testing, it would of course be nice, as a Kubernetes contributor, to use the same tools outside of Kubernetes that you are familiar with inside it.

And that's where this question comes in: can we move this code under staging? Technically we can; the dependency issue that has prevented it in the past has been solved. Should we do it? I'm not sure at this point. Some of the APIs — well, we preserved all of the old APIs because we didn't want to rewrite all of the tests, so we still have a whole variety of functions in the pod package that expect a pod to reach a certain state, and those probably shouldn't be in staging. Someone would need to look at all of them and decide which ones are useful, which ones should become something like a stable API. In staging we don't guarantee API stability, but it's not nice to change things, because it will break people, so it would be more work to move it under staging. My hope would be that if we do it, and someone is willing to help us with that, we'd get more people helping us maintain this software, and that benefits Kubernetes, that benefits everyone, because right now the set of people maintaining it is fairly small and we could certainly use more help. If you want to get involved, the SIG Testing channel on Slack is the place to reach out; I'm watching it fairly closely, and I'll try to be as responsive as I can.

And with that, I think our time is up. I've talked a lot; I hope we have time for questions.

[Audience question] Hi, I'm just curious about your thoughts on all of the different — I think you called them sub-packages, or whatever they are. Specifically I'm thinking of the e2e node tests versus the standard e2e tests: does it make sense to have those extra sub-packages, or would it be better if they were somehow all merged into a single e2e framework?

I think the sub-packages are useful. One of the things that they do, or could do, is have their own configuration options, for example. That's one aspect that's not currently particularly clean: we have a TestContext in the e2e framework that contains lots of options that are referenced by the sub-packages, but if you build a test suite that doesn't use those sub-packages, and doesn't use the tests that depend on those options, you still end up with a test suite that has a long list of command-line parameters that don't do anything. So more modularity, I think, would be a good thing — or rather, better-implemented modularity. The granularity that we have right now is okay; we just need to make it cleaner and move a few things around, like moving the configuration options out of the framework into the code that actually depends on them. That would be something a contributor could do in Kubernetes as well.

And yes, the stable-API question is the other one: how much of these obvious helper packages is useful in general and worth being in staging? Because they are sub-packages, we can make that decision case by case and gradually move things — create internal aliases that re-export the functions and public symbols to smooth the transition — so we could come up with an iterative approach.

Well, thanks for listening, and enjoy the booth crawl!