Hello everyone. Welcome to the SIG Testing intro and updates. I'm Benjamin Elder. I'm a senior software engineer at Google, and I'm a SIG Testing chair and tech lead. I'm also on Steering, and I participate in other parts of the project. Michelle Shepardson, my colleague, could not make it here today, but we'll have a recording with her portion of the talk. She's also a senior software engineer at Google, a little bit of a trend here, and recently a SIG Testing chair; we'll come back to that. Chao Dai, over here, is a senior software engineer at Google who works on SIG Testing. And Antonio Ojea, senior software engineer: he was at Red Hat when we planned this talk, and he's at Google now. I had planned to introduce him as a tech lead from Red Hat, but, again, he's at Google now.

So this is the biggest thing I've been working on for the past year as one of the most active leads in this space. One of our other leads has been out for a while, and, you know, folks kind of move on, so I've been trying to recruit some new leaders for the space: people who are doing the work, and then elevate them. Typically in SIGs you'll see two to three chairs and tech leads, or maybe just two or three people serving both roles. However, on a close read of the charter, there are no limits. So: pack the courts, get some more people in, figure it out later. Antonio here, and Michelle, who you'll hear from virtually, are new leads in this project; Antonio is a tech lead and Michelle is a chair. Chairs help organize the SIG, run the meetings, things like that, and tech leads help us with things like spinning subprojects up and down and providing technical insight for the community as recognized leaders. Patrick Ohly at Intel is also a tech lead; he couldn't make it to this conference. And thank you all.

So what is SIG Testing? This is what's officially written, if you want to look it up, but in short SIG Testing works on the testing infrastructure and the frameworks: how do we test? SIG Testing doesn't write your tests; SIG Testing helps you write your tests, and helps you run your tests and analyze the results. So we have CI tooling, we have log analysis, we have test frameworks, and we have guidance on best practices for things like how to deflake your tests, and we work with SIG Release a lot to make sure we have a well-tested release.

Something else I personally work on, as does Antonio, is the kind project. That originally stood for "Kubernetes in Docker"; we're sort of retroactively treating it as just a name now and calling it kind, you know, friendly local clusters, because we also support Podman to a limited degree, which we've been working on. It lets you run local clusters for testing; SIG Testing built this so we could test things ourselves. We've been working on some major updates. We've done some pretty big rework, with a lot of help from the community, of how we handle reboots, so that your certs will be re-rolled and things will not be broken if you happen to get a different IP address for your nodes when they come back up. So if you're running on Docker and you're using kind from a current release, your cluster should just come back up; if it doesn't, please file a bug with us so we can follow up on it.
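For anyone who hasn't used kind: a typical test workflow just creates a throwaway cluster, runs against it, and deletes it. A minimal sketch of driving that from a Go test via os/exec might look like the following; it assumes the kind binary is already installed on your PATH, and the cluster name is arbitrary.

```go
package e2e

import (
	"os/exec"
	"testing"
)

// TestWithKindCluster spins up a disposable kind cluster for the duration
// of the test by shelling out to the kind CLI, so kind (plus Docker or
// Podman) must already be installed. "sketch-test" is just an example name.
func TestWithKindCluster(t *testing.T) {
	name := "sketch-test"

	create := exec.Command("kind", "create", "cluster", "--name", name, "--wait", "2m")
	if out, err := create.CombinedOutput(); err != nil {
		t.Fatalf("creating kind cluster: %v\n%s", err, out)
	}
	// Always clean up the cluster, even if the test fails.
	t.Cleanup(func() {
		_ = exec.Command("kind", "delete", "cluster", "--name", name).Run()
	})

	// ... build a client from the exported kubeconfig and run checks here ...
}
```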
We've also done some image loading optimizations with help from the community, so we're smarter about loading images into your cluster from a local build: if we know we've already loaded an image before and you just changed the tag, or things like that, we can now optimize that out; we have a much smarter implementation here.

And cgroups, everything about cgroups. The container community right now is going through a transition from the cgroup v1 API to the v2 API, and things work pretty differently between them. We've done some overhauls in the kind project to make sure we're thoroughly compatible with this, and we're moving in the direction of most of the container space, which, whatever you feel about it, is pretty much centralizing on using systemd to manage cgroups and using cgroups v2. This needs to work in kind even when the underlying host doesn't use systemd, so we have a couple of tricks, which I can talk to folks about later if they're interested, to make this work so that kind just works everywhere but still aligns with the expectations of the folks working on container runtimes and SIG Node for how cgroups should look, how they work, and how we run Kubernetes.

And we're just generally keeping up with Kubernetes: kind is essentially a small distro, and there's a lot to keep up with. Kubernetes does its best not to break users with the APIs, but there are a lot of details about how exactly you get a cluster up and running that you have to keep on top of if you're implementing one of these tools.

And now we should hear from Michelle.

Hey folks, I'm Michelle. I'm a full-time software engineer here at Google, and I work on TestGrid. I work as part of a larger team that works specifically on TestGrid and Prow and maintains the instances for external use. You may also have seen me around the SIG Testing group, specifically on the SIG Testing Slack or elsewhere, and I'm also recently helping host the SIG Testing meetings that happen bi-weekly, as one of the new chairs. More on that if you're interested in the meetings: in the link pinned in the SIG Testing channel on Slack there are the agenda notes, where to actually join the meetings, and the times.

I just wanted to call this out briefly. I'm going to go on and talk a little bit about TestGrid after this, but I wanted to call out that historically I've had kind of a narrow focus within SIG Testing, on TestGrid specifically, but I also want to get more involved and give a little bit more back to the community that has done an awesome job here so far. So I am very interested, and I'm sure other members of SIG Testing are as well, in what you want to see and do in SIG Testing. Is there anything you want to give us feedback on? Do you want more ways to connect with other community members? Is there anything really cool that other SIGs or projects have been doing that you'd like to see replicated in SIG Testing? Do you have feedback on things that are keeping you from being more active or starting to contribute?
I'm super interested in hearing all of that, and I think other members would be as well. But without further ado, let me get into TestGrid a little bit: what it is, where we're going, and more about how it works, in case you're interested in learning more about TestGrid itself or maybe starting to develop on it.

So, TestGrid is your test results in a grid. TestGrid typically maps tests to rows and runs to columns, and this lets you see a historical view of how your tests have been doing over time in a quick and easy visual format, specifically patterns in your test results over time.

TestGrid is also highly customizable, so it's really easy to add things like extra information to the column headers here; for instance, you might want to see the GitHub commit for each of these runs. There's also stuff like adding extra things to the row names, custom statuses, viewing metrics reported from the tests in the cells themselves, or even things like alerting, with an email to anybody who cares about when tests on a dashboard start failing consistently: maybe I want to know whenever a test fails three times in a row, and I can set up a way to get an email for that.

Those of you who are more familiar with TestGrid might be surprised to know that it has been around for quite a while, although for a long time it was internal only. Back in 2013, a Google engineer named Steve started TestGrid for his team, basically in order to get that visualization of test results and their patterns over time. This expanded to other teams who also found TestGrid useful, and eventually, in 2016, this led to creating the external instance, testgrid.k8s.io, to serve the Kubernetes community and other open-source projects as well. At the time, though, the code was not open source; it was just an externally available instance for folks to see and use. It was also called out, I believe, in an earlier SIG Testing intro from several years ago as basically the only part of the Kubernetes project infrastructure like this that was not open source.
So in 2019 we mostly fixed that. It took a little while to get from the initial open sourcing to a point where we were pretty solid in terms of all of the code we have and to finish up a migration, but after that we basically have an open-source backend, and at github.com/GoogleCloudPlatform/testgrid you can see all of the code that we use to run the TestGrid instances, both internal to Google and external for Kubernetes and other communities. This is really the majority of the code that we use; even internally at Google we use this code, with a little bit of extra stuff to make sure it works with Google systems. The only caveats here, which folks who are super familiar might notice if I don't call them out, are that there's a component or two that hasn't been migrated over into open source, which I think we will eventually get to, and that I said "backend" here: the frontend, notably the actual site and server, is not open source either. That is still internal only at the moment.

We're planning out the roadmap for the next several years, so I don't have a lot to say that's concrete on that yet, but I can say we're exploring options and we'll have more to announce on that, I think, later in the year or early next year. Also, if you have anything you want to say on this, please feel free to hit us up in the repo itself: file an issue, or add to one of the existing issues, if this is something you care about.

Speaking of future plans, I do also know there's some stuff we're very excited to do starting in the new year: things like deeper integrations with Prow, getting you to your test results faster and more directly, and getting you the relevant things you need in order to troubleshoot and debug. Also things like UI and UX improvements, making features that are maybe hidden or not easy to surface easier for users to see, and making the UI faster and more responsive. And easier development, both for the core team working on TestGrid itself and for anybody who is looking to join and contribute.

Yeah, so, getting into how TestGrid itself works. I won't go into detail about all of this, but I mainly wanted to show it because it illustrates an important concept of TestGrid, which is that it is modular. TestGrid is split up into a bunch of modules that each do a particular job, and those modules have discrete inputs and outputs. For instance, the config merger takes configuration files in several different formats: that might be YAML, that might be Prow job annotations, that might be a proto (protocol buffer) itself, and it outputs a configuration proto, which is the global configuration for a particular TestGrid instance, e.g. testgrid.k8s.io. The other modules all work in a very similar way: they take some kind of input, they output something, usually a proto, and those feed into the other modules, so they can all act asynchronously and they can all be swapped out as long as the inputs they receive are correct. This is important for basically letting us be open and flexible.

So again, the important tl;dr here: TestGrid is a bunch of modules glued together by protos and sometimes other inputs. Not to get into too much detail, but to re-emphasize: the config merger, for instance, can take configurations from a bunch of different formats, and it's relatively easy to add a new format as needed, or to make it work with anything, as long as
you've registered with the config merger where the thing lives. So you could have, for instance, standard YAML defined in the test-infra repo or a different repo. You could also hand-write your own proto, which is actually what I did for the TestGrid screenshot you saw earlier, making my own configuration for a demo. Or say you have a 10x10 matrix of jobs that you need to run in different configurations and you don't want to hand-manage all of that: maybe you have a script that auto-generates configuration as a proto for your specific tests, and you throw that into the config merger. The config merger understands the format, outputs the global configuration with all of these merged together, and it just works with all the rest of the modules.

This also applies if you want to do something called "TestGrid as a service". Again, to kind of hammer home the point: as long as the data is the stuff TestGrid needs to display your configuration and grids, you are good to go. In this instance, we have something that is running all of the TestGrid components from the open-source code by itself and just outputting the correct format. I want to emphasize that you could also hand-write all of these yourself, again maybe for a demo, or maybe you want to run custom code for things like the updater or the config merger or whatnot. Those are all free to experiment with, and as long as you have valid formats for all of the data, all of the protos put into cloud storage, the shared frontend, testgrid.k8s.io itself, can read them.

If you do that, running all of your own components, or anything else that outputs the valid format, you can get something like the Knative instance: basically, Knative runs its own components and outputs to a particular place in cloud storage, TestGrid reads from that particular place, and it displays only the Knative results. So all of these are scoped to just what Knative cares about from their own test results, without needing to display all of the stuff from every community SIG or other project that is also on the main instance. This is also a great way to do things like local development, or if you want experimental features.

So yeah, that's a very brief overview of how it works, and we're always happy to answer more questions. The last thing I just want to mention is that if you're at all interested, there are a bunch of ways to contribute. There are several projects, anything from stuff that we've marked with "good first issue" to larger projects that would merit some discussion, and we're open to considering feedback on improvements, or improvements that you're interested in doing. That could be stuff like auto-filing bugs instead of just emailing alerts out, or rather auto-filing issues on GitHub any time there is a failure meeting certain criteria within different TestGrid dashboards, maybe routing them to the people who care about them or to the particular teams who are responsible for the tests.

There is also a lot of stuff we could possibly do for result parsing. Currently, most of the result parsing understands JUnit, and typically the results are uploaded by Prow jobs, but maybe you want to handle a different format that you also want to be able to see in TestGrid. Or maybe you want to do something with the existing result parsing, like adding more things that are specific to the Prow jobs themselves and how they run: extra details about pods, or other stuff that is currently within the result format but isn't displayed in TestGrid in quite the way you want it.
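To give a rough idea of the kind of data those parsers consume, here's a minimal sketch that reads a JUnit-style XML report with Go's standard encoding/xml package. The struct covers only a few common fields, the file name is hypothetical, and real reports (including ones Prow jobs upload) often carry more structure, such as a wrapping testsuites element.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// Minimal JUnit-style schema: one suite containing test cases, where a
// test case with a <failure> child is considered failed.
type TestSuite struct {
	XMLName  xml.Name   `xml:"testsuite"`
	Name     string     `xml:"name,attr"`
	Tests    int        `xml:"tests,attr"`
	Failures int        `xml:"failures,attr"`
	Cases    []TestCase `xml:"testcase"`
}

type TestCase struct {
	Name    string   `xml:"name,attr"`
	Time    float64  `xml:"time,attr"`
	Failure *Failure `xml:"failure"`
}

type Failure struct {
	Message string `xml:"message,attr"`
	Body    string `xml:",chardata"`
}

func main() {
	data, err := os.ReadFile("junit_01.xml") // hypothetical report name
	if err != nil {
		panic(err)
	}
	var suite TestSuite
	if err := xml.Unmarshal(data, &suite); err != nil {
		panic(err)
	}
	for _, tc := range suite.Cases {
		status := "PASS"
		if tc.Failure != nil {
			status = "FAIL: " + tc.Failure.Message
		}
		fmt.Printf("%-60s %s\n", tc.Name, status)
	}
}
```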
Similarly, there are a lot of different things that could be done with the summaries. Right now they have pretty basic info on the tab and the tests themselves, but there's a lot that could be added in terms of more powerful summaries: things like additional historical trends, performance of certain metrics over time, or maybe a really easy way to drill down from a very broad summary across many, many dashboards and many SIGs and track down the most important thing to tackle across everything you care about, as somebody who cares about a very large portion of the project.

Aside from code, we're also always happy when people let us know what's going wrong, or about pain points, by filing issues or just giving feedback in general. Maybe you know about some TestGrid pain points; maybe you have things you think TestGrid does well that you want to make sure we keep doing; or maybe you have thoughts on short-term fixes we should get around to, or longer-term ambitions and really cool things you think TestGrid should be doing, or tools or groups we should consider integrating with more deeply. Anything like that. So yeah, again, we're always open to feedback. You can catch us in the SIG Testing channel or in the TestGrid channel itself on Slack. And aside from that, thank you for your time, and I hope to see you all around.

Cool, thank you, Michelle. Michelle is not here, but if you have any questions related to TestGrid we can try our best; Michelle is no doubt the domain expert, so we may have to take some answers offline. One thing I'd like to highlight about TestGrid is that our team had a hackathon week working on it, and we did a really fun project in the open-source code: we successfully implemented tic-tac-toe on TestGrid, and the source code lives on my fork of the TestGrid repo. So if you're interested, we can talk about that.

That aside: hi, my name is Chao. I'm one of the TLs working on Prow, and Prow is one of the important CI tools that SIG Testing maintains. As part of its evolution, we realized that one of the problems Prow had was that the build system itself was throttling the development of Prow and was also a barrier to Prow being easily contributed to by someone who is not very familiar with that system. So today I'm going to talk about something we've already done since the last SIG Testing update, which was the Prow build system overhaul.

Here's a little bit of background on what Prow uses. Prow has a backend that runs as microservices in a Kubernetes cluster, and it also has a UI. The nature of that means the Prow source code contains Go source code, TypeScript, HTML, CSS, all of that stuff, and it also has some Python and bash scripts for gluing things together. When the project was initially established, a build system was needed for transforming all of this source code into container images and deploying them onto a Kubernetes cluster. And guess what the build system was?
It's Bazel. For anyone who's not familiar with it, Bazel is meant to be a multi-language build orchestration system with a very nice caching feature: you describe everything with Bazel rules, and it pulls in all of the backend toolchains to compile everything and put it in the right place. It's pretty good; it works really well, the caching is very good, there were very few flaky issues, and it's always deterministic. It works really well, unless you want to touch it. If there's any new feature, for example a new Kubernetes release, there has to be someone there to update the Bazel rules, but the problem is that our community, the Kubernetes community, doesn't have that expertise. So whenever you wanted to add or update something like this it was a struggle, and there were a couple of features in Prow that, as far as we can tell, were abandoned because their authors were not able to figure out how to update a Bazel rule.

So our new answer is: we don't want Bazel; we just want to use the native toolchains. This is the project that we did, pretty much over the last year; Ben and I collaborated on it. The native toolchain basically means that for Go we just do go build and go test, and for TypeScript we use the native tooling, npm and all of that. For image building we use ko. ko is an interesting tool that's meant to be used by projects written in Go, because if you have Go source code for a binary, you don't need a Dockerfile or anything else: you just say "ko publish", and it builds the Go binary and packages it into a distroless image. It's very nice. Also, by the way, earlier this year they enhanced ko with SBOM support, the software bill of materials; I can see some nodding. So with this transition we got SBOMs for free, which was pretty nice.

It was a long process, and I'm not going to go into too much detail about how we jumped through the hoops of bundling the TypeScript compilation into the ko packaging, but we had pretty good success. On build speed, presubmit test time saw a 60% reduction as a result, and deployment is 800% faster. And guess why? Because in the Bazel era, even if the only thing I wanted to do was kubectl apply, it would download all of the dependencies in the world: the Python dependencies, the Go dependencies, every dependency. It would take about five minutes to download the world, and it might also fail because of network flakiness. Right now we are able to deploy everything in under a minute, and people are happy. Next I'm going to hand over to Antonio.

Okay, thank you. Well, I'm going to talk about something that is the most frustrating thing for people who are starting to contribute to Kubernetes. When you send a PR, you see this list of things: you see a lot of jobs. This is a weird example because everything here is green, but usually there is always one failing, right? What these jobs are doing is testing Kubernetes, and the strategy we follow for Kubernetes is the well-known test pyramid, so we can group the jobs into different categories. First of all is verify: we have a lot of jobs that verify what is happening with the code; if you modify an API, they check that the generated code is okay.
We have linters, we have a lot of that kind of stuff. Then the unit tests: these go through all the files in all the folders and execute every Go test they find. Then we have a nice thing, the integration tests, which basically spin up an API server and an etcd and run tests against that with this framework, mocking out the rest. On top of that we have the famous e2e tests. These are the ones everybody knows, because they are the flakiest and the ones people complain about the most. They basically create a cluster; before kind, it was only a real cluster, and it needed, I don't know, maybe one hour to run all the tests, or something like that. With kind, I think it's now around 25, less than 30, minutes to run the whole thing, which is about 700 tests. And then on top of that there is another framework that is not very well known, not only to new people but even to developers: the scale tests. That is a special framework that lives in a separate repo that SIG Scalability maintains, and it runs tests on 100-node and, I think, 1,000- and also 5,000-node clusters.

During this year we were improving a bit in each of these categories. In verify, you can see there is one person working right now on enabling golangci-lint. There is also another person who is trying to lure you into reviewing his PR for Go workspaces, with 82 commits. I mean, it would be really nice if we can land that, because we have a lot of technical debt and this PR is going to solve those problems. More or less, that's the scope of the work we have in this area for this release, or the next six months.

Regarding unit tests: right now, if you're fixing a bug or something, you'll be asked to add coverage. I took a snapshot, and the tooling looks really nice, but you can see the percentages there are about 30 or 40 percent coverage. This is an area where new contributors, or people who want to start contributing, can help out, because it's easy: you get used to the code, you start making more connections in the community, and everybody will be happy to review the PR. So increase coverage, trust me. And the flakes: we need to keep reducing flakes. I think we're in a good state by now, but there is always work there.

Integration: integration tests are something not many people use, and the framework grew organically, so it was really a mess, in the sense that when you ran the job and it finished, it leaked about 10,000 goroutines. We're talking goroutines here. Thanks to work tracking those down, it's now only leaking about 10 goroutines. This is an example of how there isn't only work on adding new things; there is also work on improving existing things, and that used to be overlooked. Again, this is a good opportunity for people who want to contribute, and right now this is one of the areas going through the most development.

We are also a lucky project: the Ginkgo author, who had a new version ready, stopped by one of the SIG Testing community meetings and asked if we wanted to try it. One person, he's there in the photo, accepted the deal and was able to land it. Thanks to him, and Patrick, and Onsi, and all the people involved, we have Ginkgo v2, which gives us a lot of new features, and right now there is a big effort on using these new features and improving the e2e testing. There is also a great effort on writing the tests so they are easy to read and report failures better. Again, this is an area that needs a lot of help, and everybody is welcome to help here.
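For a sense of what that looks like, here's a generic sketch of a small Ginkgo v2 spec with Gomega assertions; it is not an actual Kubernetes e2e test, and the suite name, component, and assertion are placeholders.

```go
package example_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// TestExample hooks the Ginkgo suite into "go test".
func TestExample(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Example Suite") // suite name is arbitrary
}

var _ = Describe("a hypothetical component", func() {
	It("reports readable failures via Gomega matchers", func() {
		replicas := 3 // stand-in for some observed state
		Expect(replicas).To(Equal(3), "expected the deployment to be fully scaled")
	})
})
```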
And the last thing is about the scale testing, which right now is only used by SIG Scalability; I don't know if other SIGs are using it. In SIG Testing we have the challenge of adding a new test there, because right now, thanks to Dan Winship, we have nice improvements in kube-proxy iptables so that it performs better at scale, and we want to know how much better we are doing.

And that's it. You see, we are not the fanciest SIG; we do things that maybe are not the nicest for a developer, but we are a fun group, so just come join us: join the meetings, the channels.

Okay, I think that's the last slide. Does anybody else want to say something? Thank you all for coming. We intended to have some Q&A time, but I think we might be just about out; if you'd like to ask questions before we run out, or afterwards, we'll be around. We have time if anyone has one.

Hey, SIG Testing, thanks for all of the work that you've done. I have a question about code that is out of tree. There can be changes both in Kubernetes and in the out-of-tree code. For out-of-tree code, what we do is clone Kubernetes into our own tests and run them in our own presubmit tests, and similarly, for changes in the out-of-tree code we want to test against Kubernetes. So when there are changes on both sides, do you have any plans or recommended workflows for that?

Can you be a little bit more specific about what you mean by testing something out of tree against Kubernetes? What sort of thing are you talking about testing?

Okay, so for example, if I change something in the kubelet that affects a storage component, and I want to run the e2e tests of the storage components for a specific CSI driver.

That's the common problem you have when you split out a repo, right? That's why I'm personally against having an e2e test framework out of tree: eventually it's going to go out of sync, and there is no good solution, or you'd have to scale out and create a thousand permutations of presubmit jobs that test things in parallel, but you are not going to be able to deal with the maintenance of that. So my suggestion is to have periodic jobs, so at least you know, not the exact commit, but you have an idea of at what point in time your external dependency broke, right?
So, as an example, one of the things the Kubernetes community runs is 5,000-node scale tests, but we can't run that on every PR. Similarly, we can't test literally every aspect of Kubernetes, and we have kind of a push and pull there. I think for out-of-tree integrations, that's another good approach: run it there, and TestGrid can be configured to expose the commit of Kubernetes you were testing against as well as your tool's commit. So when we run the 5,000-node scale tests and something breaks, you can actually drag between cells in TestGrid and it will give you the GitHub link to compare between the commits, to give you some idea of where to start bisecting to find the change.

What we tell people, between us and SIG Release, is: we'd like you to get some CI running and get it looking stable. If you want this to block Kubernetes releases, you can talk to SIG Release and say, hey, I have a periodic job that's testing these things after they merge, it's stable and provides good signal, and it should get promoted to informing and then to blocking for the release and onto those TestGrid dashboards. Then, when those things break, you have the signal to deal with it. If you see them breaking frequently, not due to the tests or the infrastructure or how you set things up, but because of actual bugs in Kubernetes, that's the signal that we need to move this into Kubernetes presubmit, so you're not constantly chasing down bugs. So something like unit testing Kubernetes we're going to do in presubmit, because we know that if we don't test it there, we're going to keep finding bugs after the code merges. But something like integrating Kubernetes with an out-of-tree provider probably doesn't break very often, we hope, and if it does, then we should push it into presubmit testing in Kubernetes.

All right, thanks.

Hi, do you have any plans or efforts going on for using GitHub Actions for testing things in Kubernetes?

For Kubernetes itself, no. The CI that we provide with the project has a couple of nice properties that would be hard to replicate there. A big one is that there's such a large volume of changes coming into Kubernetes, on some days in particular, that breakage can be hard to track down. So the CI integrates with a merge robot that's part of the CI, and because of that integration, the merge robot is aware of exactly which commits of the branch you're merging into and the branch you're asking to merge were tested, and we will only merge code if it's been tested at the latest of both of those. So if something else merges first, your PR is going to need to get tested again. And then we have some optimizations on top: okay, we have five PRs that are currently looking good for merge, so we're going to batch test them together and see if they all pass; if they don't, we're going to fall back to testing one at a time.
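To make the batching idea concrete, here's a toy sketch of the batch-then-fall-back logic. This is not the merge robot's actual code; the PR numbers and the testPass stand-in are made up purely for illustration.

```go
package main

import "fmt"

// broken simulates which PRs would fail CI; purely illustrative.
var broken = map[int]bool{102: true}

// testPass stands in for running the presubmit jobs against a set of PRs
// applied together on top of the latest target branch.
func testPass(prs []int) bool {
	for _, pr := range prs {
		if broken[pr] {
			return false
		}
	}
	return true
}

// mergeBatch sketches the idea: test a whole batch at once, and only fall
// back to testing PRs one at a time if the combined run fails.
func mergeBatch(prs []int) []int {
	if testPass(prs) {
		return prs // everything in the batch can merge together
	}
	var mergeable []int
	for _, pr := range prs {
		if testPass([]int{pr}) {
			mergeable = append(mergeable, pr)
		}
	}
	return mergeable
}

func main() {
	fmt.Println("mergeable:", mergeBatch([]int{101, 102, 103}))
}
```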
We have a whole system for like leasing Resources to spin up real clusters out in the cloud and that run it for us our CI system is based on Kubernetes So the system that leased resources just runs alongside the tests as an application inside the Kubernetes cluster Where we're running the tests and these things make it kind of hard to migrate However some sub projects like kind are actually using GitHub actions a bit Yeah, sorry, I should have been more clear my question was more like as a sub project maintainer How can I have more better integration from sick testing if I want to use GitHub actions for some of my stuff? That's probably a question for child, but I'll say very briefly that we've had some discussions about this We kind of need someone to step up and work on things like we'd really like to see for example if tests fail The experience in Kubernetes right now is you can comment slash retest as anyone and get it to run again And because people depend heavily on e2e tests to you know, do integration tests between different aspects of the project Tests fail a lot and these things aren't available But we have some like there's some limited integration of the merger robot and things I Think then just explain the merge robot really well one thing I'd like to say is that I would say from my perspective Kubernetes is not tied to proud if you if we can make it a GitHub action work for all of our workflow I would say let's go for GitHub actions But right now there are things that cannot be done like what Ben mentioned the merge automation is right now is impossible to To serve a project such as Kubernetes Kubernetes, there is just no way that can be Right, yes right for some projects it's definitely supported and The only downside is that we cannot ensure that get up actions test against the latest head Which we do have plan to support, but that's long long on our roadmap It's it's really hard to be honest We will need get up action to expose a lot of APIs for us to be able to trigger get up action on certain commits We only have time for one last question In the sick testing channel I saw that there is this new e2e framework That is led by Vladimir and I am not sure what's the difference with the e2e test binary that comes out of Kubernetes so so that framework is a I don't want to say experiment I think we we start as experiment, but I think it's actually reasonably mature at this point It's just not the one that we use to test Kubernetes itself You know no matter how good that framework is there is an enormous lift to migrate Kubernetes something else We have like thousands of tests And it's hard enough to get people to maintain the messes But we since also that framework kind of grow organically we didn't want the We didn't want to move this out of the project and say everyone should reuse this You should test things exactly the way Kubernetes does the e2e test because I mean I'm actually not a huge fan of It ourselves, but it works. So the out-of-tree e2e framework is a project in the stick to try to figure out Like what is something that is reusable? And it's definitely something you should look at for out-of-tree projects All right, thank you Okay We run out of time, but we can we're going to stay here if somebody wants to come and chat. Thanks everyone