Hello everyone. Good afternoon. My name is Peter and I'm a member of the OpenShift Developer Productivity Test Platform team, DPTP. We mostly take care of the CI system for the OpenShift organization and of other related tools that are required to actually build and release OpenShift.

Let me start this talk with a little bit of recursion. This is me a year ago, standing on this stage (the hoodie was different), talking about OpenShift CI. This year's talk about OpenShift CI was given by the wonderful Sally and Urvashi in the morning, so I don't need to do that anymore. A year ago I was specifically talking about a cool new feature that we were working on, called job rehearsals, which was in progress at that time. So I'm very happy to stand here today and talk about how it works and how it helps developers in the OpenShift organization set up their CI jobs.

Not everyone here is a native English speaker, so I'll start with a little definition. Rehearsal is not a technical term. Merriam-Webster says a rehearsal is a practice session preparatory to a public appearance. That's still a bit abstract, so let's say that a rehearsal is a dry run of a theater play or a music gig. You try things before you actually go in front of the audience and do your thing for real, like I spent yesterday evening rehearsing. We borrowed that term for test runs of CI jobs. I'll go into more detail later.

So let's talk about the world before rehearsals existed, a year ago. As some of you may know, the CI system used in OpenShift is called Prow. It is also developed and used upstream in Kubernetes. It's a cloud-native CI/CD system designed to run on Kubernetes. Contrary to some more traditional CI systems like Travis or Jenkins, the definitions of what is supposed to be executed when something happens in the repository are stored separately.
So you don't have a file like .travis.yml or a Jenkinsfile living together with your code in your repository. In OpenShift and Prow, the job definitions for several hundred GitHub repositories in the OpenShift organization are stored centrally in another GitHub repository. The repository is called openshift/release; many of you probably know it. We expect developers to own, change, and develop their jobs inside openshift/release.

Before rehearsals, the story of adding or changing jobs could be quite a sad one. Imagine you wanted to add a new job for your repository. You would go to openshift/release and clone it. You would write a new definition in YAML. You would open a PR. Then you would struggle a bit to make all the checks happy. Then you would need to find someone to review the change in openshift/release. Of course, the best person to review your code is always half a world away and probably still asleep. But eventually that person gets up and reviews your code, the PR merges, and now Prow knows about your new job.

The bad thing is that you have no idea whether the job will actually work, because you just wrote a bunch of YAML and you can't really execute YAML. What you needed to do a year ago was go back to your repository and open a PR called something like "CI test, do not merge". You had plenty of these. Then you waited until your new job finished. Very often it just failed on something you were not aware of, for example Python not being in the image on which you built your job. As a bonus, the PRs on your repository are now broken, because you have a CI job that can never pass. Everyone else is screwed until you fix the job. So what you need to do now is go back to openshift/release, open a PR... Yeah, you get the idea.
You probably hate us right now, or hated us in the past. So that was the problem, and we decided to solve it. And when stuff starts to work, yeah. So let's get back to rehearsals.

I mentioned that for us, rehearsals are test runs of CI jobs. That's still a bit abstract, so let's be really concrete. When you go to the openshift/release repository and change some job, we wanted to execute that job and give you the results as if it were run on your own repository. So if you made a mistake like this in your job, you would immediately get a result saying, okay, this job won't work. No screwing up your repo, no delays, no waiting for anyone to review stuff, and no going back and forth between repositories. Just the native CI experience: I changed this, something is tested, and I get the results back right away.

So we went on to build this. To explain how it actually works, I need you to understand at least a little bit of how Prow works internally. I'm not going to spend more than three minutes on it, and I hope I can do that; it's quite a challenge.

Let's start with an event that is interesting for us: a PR is opened on some repository. What happens? What does Prow do internally? The first thing that happens is that GitHub notifies our Prow that the event happened, and provides all the details: what was opened, where, what the hashes of the commits involved are, and so on. Prow then looks up the jobs that are configured for that repo and that branch, and creates ProwJob custom resources, which are standard Kubernetes custom resources. The thing is that Prow looks for these definitions in a config map, where it stores all the definitions of all the jobs for all those 200 repos that I mentioned before.
As I mentioned before, these job definitions are actually tracked in the openshift/release repository and are synced to the config map asynchronously, in a GitOps way. This will be important for rehearsals later, so let's leave it at that for now and continue.

So we have the ProwJobs. Another part of Prow sees these ProwJob custom resources being created, and for each of them it creates a standard Kubernetes pod, using a specification that is part of the ProwJob. This pod does the actual workload of the test that is supposed to run. There is nothing special about these pods. They get scheduled by Kubernetes or OpenShift, they run, and they eventually finish. Because they are not special, they can use other resources in the cluster. In our case, most of our jobs use some other config maps that are present there: ci-operator configurations and templates. We don't need to care about what these are. The important thing is that the content of these config maps is also tracked in and synced from the openshift/release repository, and it can be changed with PRs. This will be important later.

Eventually the pods finish; either they succeed or they do not. Prow collects the results of the pods and puts them into the status field of the ProwJob custom resource. A different part of Prow takes the status field of the ProwJob and reports it through the GitHub API to the pull request, where you can see it. Prow does a million other things, but this is the part that is relevant to rehearsals.

Now that we know how this works, how can we build rehearsals, so that when I submit a PR to openshift/release that changes job definitions, I get the results as standard CI jobs? Let's start again. What happens when I submit a PR to openshift/release?
Prow would try to find all the jobs that are configured for openshift/release, master branch, and would create ProwJob CRs for them. The first roadblock is that the jobs we want to run, the rehearsals, are not configured for openshift/release. They belong to different repos: part of OpenShift, part of Knative, I don't know what else. We would need to be able to run pretty much any of the jobs that we have in openshift/release.

The second roadblock, and that's the more fun one, is that at this moment Prow doesn't even know about the jobs. As far as Prow is concerned, the jobs that are supposed to run might not even exist, because they are not in the config map. They are not even in the openshift/release repository yet. They are still in the pull request, which is not even merged yet. So Prow itself has no idea what's happening there.

So we cheat a little. The cheat is a little tool we wrote, called pj-rehearse. It runs in a presubmit configured for the openshift/release repo. Let's talk more about this cheat. The tool was designed to run in a presubmit, so it always runs in the context of the repository that holds the job configuration. The tool itself doesn't have any use outside of this context; it is only designed to run like this.

When it runs in a presubmit, pj-rehearse has access to the Git content of the repository. So it can use Git to get the candidate job configs, the new and changed ones, and compare them to the baseline configs, those that are already present and merged in the repository. From this comparison we get the jobs that are new or changed, and these are the jobs that we want to rehearse. So far so good. We need to modify them somehow. For now, let's be satisfied with calling the modifications magic.
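
As a rough illustration of the comparison step just described, here is a minimal sketch. It is not the actual tool's code: the function name, the job names, and the dictionary layout are all assumptions made up for the example, and job configs are modeled as plain dictionaries keyed by job name.

```python
def select_jobs_to_rehearse(baseline, candidate):
    """Return the names of jobs that are new or changed in the candidate config.

    Both arguments map a job name to its definition (hypothetical layout).
    Jobs that exist only in the baseline were deleted, so there is nothing
    to rehearse for them.
    """
    selected = []
    for name, definition in candidate.items():
        if name not in baseline or baseline[name] != definition:
            selected.append(name)
    return sorted(selected)


# Baseline: what is already merged. Candidate: what the pull request contains.
baseline = {
    "unit": {"commands": "make test"},
    "lint": {"commands": "make lint"},
}
candidate = {
    "unit": {"commands": "make test"},    # unchanged, not rehearsed
    "lint": {"commands": "make verify"},  # changed, rehearsed
    "e2e": {"commands": "make e2e"},      # new, rehearsed
}

print(select_jobs_to_rehearse(baseline, candidate))  # ['e2e', 'lint']
```
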
I will get to the magic later. Then we submit the resulting ProwJobs to the cluster. Let me dig more into this. Remember the part where Prow itself was looking into its config map, selecting something, and creating a ProwJob? We can make the pj-rehearse tool a sort of Prow freeloader that creates ProwJobs itself. So ProwJobs making ProwJobs, like this Inception thing. Of course, the service account that runs pj-rehearse needs the rights to do this. Most normal Prow jobs do not have these rights, but we are okay with pj-rehearse doing this, so we gave it the rights to create ProwJobs. Once pj-rehearse creates the ProwJob custom resource in the cluster, the rest of Prow doesn't give a damn about why these ProwJobs were created, by whom, or when. It's just: okay, there's a new ProwJob, let's schedule a pod and do the rest of my thing. And eventually the results show up in your PR.

The process as I described it wouldn't be enough, though. If we just took the jobs and created ProwJobs from them, it wouldn't work, for various stupid technical reasons. That's why we need the magic I mentioned before. We need to fix up these candidate jobs just enough so that they work, but we need to make sure we still execute your job, so that we don't fix it by mistake.

The first and most trivial thing is that we need to fix up the names of the jobs, because when rehearsing, the uncommon situation may occur that we run jobs together that would otherwise never run together, because they are jobs for different repositories and different branches. If we didn't rename them, they would all show up as ci/prow/unit in your PR, they would probably overwrite each other's status checks in GitHub, and you wouldn't know what actually ran and what succeeded.
So we do the simple thing and expand the names to describe them: okay, this ci/prow/unit is a rehearsal of the unit test for org/repo, branch master; a different ci/prow/unit is a rehearsal for org/repo on a different branch, et cetera. That's the simple case.

In the second case, we need to deal with a schizophrenic situation that Prow is not really prepared for, where the job needs to test the target repo, the one for which the job was originally written, but needs to report results to a PR in a different repo, openshift/release. Fortunately, Prow has some low-level means for this, and I won't go into the details here. I'm not sure if we are using the means or abusing them, but they work, so we can achieve that.

The third piece of the simple magic is that the jobs we are rehearsing do not necessarily need to be presubmits, the jobs that run on PRs. They can be periodics, which can be thought of as Prow cron jobs. Periodics are not tied to any PR, so if we just made rehearsals from them and submitted their ProwJobs, they would run, but they wouldn't report results anywhere, because periodics are not supposed to. So we need to do some heavier modifications and convert these periodics into presubmits with the same workload, while making sure we haven't inadvertently fixed something that you broke.

So that's the simple magic. We do more: we do some crazy magic to make rehearsals work. This is where the second "this will be important later" piece comes into play. When we submit a ProwJob, even a rehearsal, and a pod is created for it, the pod can use resources present in the cluster: config maps, secrets, and so on. Some of these resources are, like I said, also tracked in the openshift/release repository. Which means we have the same problem.
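
The renaming and the periodic-to-presubmit conversion from the simple magic can be pictured with a small sketch. Everything here is an assumption for illustration: the function name, the dictionary fields, and the name and context formats are invented, and the real tool works on Prow's own job types rather than plain dictionaries.

```python
import copy


def make_rehearsal(job, target, pr_number):
    """Build a rehearsal from a job definition without touching its workload.

    `job` is a hypothetical dictionary with "name", "type", "spec" and, for
    periodics, "cron". `target` describes what the job was written for,
    for example "org/repo/master".
    """
    rehearsal = copy.deepcopy(job)

    # Expand the name and the status context so that rehearsals of jobs for
    # different repos and branches cannot overwrite each other on the PR.
    rehearsal["name"] = f"rehearse-{pr_number}-{job['name']}"
    rehearsal["context"] = f"ci/rehearse/{target}/{job['name']}"

    # Periodics report nowhere, so turn them into presubmits. Only the
    # trigger changes; the spec (the workload) is left exactly as written.
    if job["type"] == "periodic":
        rehearsal["type"] = "presubmit"
        rehearsal.pop("cron", None)

    return rehearsal


periodic = {
    "name": "nightly-e2e",
    "type": "periodic",
    "cron": "0 3 * * *",
    "spec": {"containers": [{"command": ["make", "e2e"]}]},
}
rehearsal = make_rehearsal(periodic, "org/repo/master", 1234)
print(rehearsal["name"], rehearsal["type"])
```

The deep copy matters in this sketch: the original definition stays untouched, and the rehearsal carries an identical spec, so a broken job stays broken when rehearsed.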
The PR that we are rehearsing for may have changed the content, but the changed content is not yet present in the cluster. If we just ran the pod, it would use a ci-operator config or template without the modification from the PR, which may or may not be a good thing; the modification may be broken, for example. So we need to make sure that the rehearsals we execute see content that is consistent with the PR for which we are rehearsing.

We do this by two means. The simpler one applies to environment variables: we sever the dependency of the pod on the content in the cluster and inline the content from the PR directly into the pod specification. We can't do this everywhere. So the second, more complicated thing we do is create temporary resources just for the purpose of running the rehearsals, so that the pods can use an isolated config map and see the modified content.

Unfortunately, this means that pj-rehearse needs to be a little more complicated than it was, but it also has some advantages. It needs to make sure that the config maps are created. And because multiple PRs can change the same thing in different ways, it needs to make sure that the rehearsals for each PR see their own content and not the rehearsal content of some other PR that changes the config maps in a different way. There is a little trick to make this all work.

The advantage is that when pj-rehearse knows about this additional openshift/release content, we can benefit from that, because now we know that some content, like a template, changed.
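
The simpler of the two means described above, inlining PR content in place of a config map reference, can be sketched like this. The pod specification is modeled as a plain dictionary that follows the Kubernetes layout for environment variables (`valueFrom.configMapKeyRef` with `name` and `key`); the function name, the config map name, and the sample content are assumptions made up for the example.

```python
import copy


def inline_config_map_env(pod_spec, pr_content):
    """Replace config map references in env vars with content from the PR.

    `pr_content` maps (config map name, key) to the file content found in
    the pull request. References with no matching content are left alone.
    """
    result = copy.deepcopy(pod_spec)
    for container in result.get("containers", []):
        for env in container.get("env", []):
            ref = env.get("valueFrom", {}).get("configMapKeyRef")
            if ref is None:
                continue
            lookup = (ref["name"], ref["key"])
            if lookup in pr_content:
                # Sever the dependency on the cluster: drop the reference
                # and put the PR's version of the content in its place.
                del env["valueFrom"]
                env["value"] = pr_content[lookup]
    return result


pod_spec = {
    "containers": [{
        "env": [
            {"name": "CONFIG_SPEC", "valueFrom": {"configMapKeyRef": {
                "name": "ci-operator-configs", "key": "org-repo-master.yaml"}}},
            {"name": "JOB_NAME", "value": "unit"},
        ],
    }],
}
pr_content = {("ci-operator-configs", "org-repo-master.yaml"): "tests: []\n"}

inlined = inline_config_map_env(pod_spec, pr_content)
print(inlined["containers"][0]["env"][0])
```
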
We haven't touched any job, but we know that we changed something on which some job, potentially not even touched, depends. So we can add this job, which in the past wouldn't have been rehearsed, and make sure that we rehearse it, because the author of the change, for example in the template, might be interested in that. In general, if I simplify: whatever you touch in the openshift/release repository, if it can affect how some CI job is executed, we will rehearse that job. I think that's pretty cool and it makes life easier.

I would like to conclude the talk with a few comments on some things that are not that cool. One of them is reruns, which are generally a pain in the butt. The cheat we are using, where we first execute pj-rehearse, which selects which jobs should be rehearsed and creates all the stuff, means that from Prow you can't interact with individual rehearsals. So if 20 jobs are rehearsed, 19 of them pass, and one of them is red, you have no means to rerun just that one. Maybe it flaked, maybe it didn't; you get all or nothing. That's annoying, and hopefully we can eventually fix it.

The second thing, which is very annoying, is that some PRs simply affect too many jobs. If I change all the jobs, shall I rehearse all of them? Several thousand jobs? I don't know, probably not. It's excessive, and probably nobody would even look at the results. We deal with this in two layers. We have things like templates, which are basically pieces of infrastructure that many jobs depend on. If I touch a template, I know that hundreds of jobs are affected. So we limit this by sampling from the set of jobs that we should rehearse: we just select some of them and run them. But even with this optimization, if the PR ends up touching too many jobs and too many jobs would be rehearsed, we just give up. Right now the limit is 35.
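
The two layers just described, sampling and then a hard limit, can be sketched as follows. The limit of 35 comes from the talk; the function name, the sample size, and the way the sample is drawn are assumptions, since the talk does not say how the real tool picks its sample.

```python
import random

MAX_REHEARSALS = 35  # the limit mentioned in the talk


def pick_rehearsals(directly_changed, affected_by_template, sample_size, rng):
    """Select which jobs to rehearse, or nothing if there are too many.

    Jobs changed directly are always kept. Jobs that are only affected
    through shared content such as a template are sampled (first layer).
    If the total still exceeds the limit, give up entirely (second layer).
    """
    pool = sorted(affected_by_template)
    sampled = rng.sample(pool, min(sample_size, len(pool)))
    selected = sorted(set(directly_changed) | set(sampled))
    if len(selected) > MAX_REHEARSALS:
        return []
    return selected


rng = random.Random(0)  # seeded only to make the example repeatable
template_users = [f"job-{i}" for i in range(300)]

# A template change affecting 300 jobs is cut down to a handful.
print(len(pick_rehearsals(["unit", "lint"], template_users, 5, rng)))

# A PR that changes 40 jobs directly exceeds the limit: nothing is rehearsed.
print(pick_rehearsals([f"changed-{i}" for i in range(40)], [], 5, rng))
```
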
The problem with this is, of course, that the large changes that affect everything are the ones that need rehearsals the most. But our tool just says: okay, too heavy, I don't care, I give up, your fault. So there is definitely room for improvement here.

So I attempted to walk you through, in 20 minutes, a feature of OpenShift CI that I find kind of cool. I hope it was at least a bit interesting for you, and I'll be happy to answer any questions you might have. Looks like there are no questions. Still, thank you very much for listening.