So thank you, Dan, for the introduction. First things first: if you came to this room for a talk called "CI for OpenShift" and you are now looking at the title slide wondering whether you are in the right room — yes, you are. I will deal with the OKD versus OpenShift situation in a while.

My name is Petra, as Dan introduced me, and I'm a member of the OpenShift Developer Productivity and Test Platform team. We are, more or less, the CI team for the OpenShift upstream. We are not the only ones responsible for this — it's a huge system consisting of many parts, some of them coming from the Kubernetes upstream, some of them done by other Red Hatters — but we are sort of its stewards.

So, as promised, let me first deal with what OKD actually is. If you know Kubernetes, you can think of OKD as a distribution of Kubernetes, augmented with more features aimed mostly at developers — it helps developers with the application lifecycle and so on. If you know OpenShift, which is a family of Red Hat products, then you can think of OKD as the upstream for OpenShift, a bit like what Fedora is for Red Hat Enterprise Linux. If you have heard of Origin before, then OKD is basically renamed OpenShift Origin. If you know none of these, I'm afraid I can't really help you here — go to the URL shown on the slide. And let's continue.

What OKD practically is, is hundreds of Git repositories developed by even more hundreds of people involved in the OpenShift organization on GitHub. If you build a bunch of artifacts — mostly container images — out of the right revisions of these repositories, and use some of these artifacts in the right way on suitable infrastructure, you get a working OKD cluster. Hopefully. The repositories are, of course, all somewhat different. Some are forks of various upstreams, some are OKD-specific. Some are intensively developed, some are pretty much dead — or rather mostly stable, not dead.

The goal of the CI system for OKD is to continuously have a good OKD version, no matter how the individual components are being developed at the moment. We always want to have a good OKD, something that works. To achieve that, we basically need to enforce that any pull request merged into any of the component repositories leads to an OKD that still works. The reason is that once OKD becomes broken, all of the component teams are basically in the dark: they don't know how their development affects the whole of OKD. And today we are mostly in that state — although the people who are probably at this very moment praying for the OKD end-to-end tests to finally pass on their 20th attempt or so would laugh at me. Bitterly, perhaps.

One big principle we want in the CI system for OKD is that the component teams — the developers of the individual OKD components — should self-service and own their CI config as much as possible. But at the same time, we don't want each of them to go off in a different direction, because then we would have no common baseline and the idea of having a good, working OKD would be lost. So we need at least some base criteria that are enforced throughout the whole organization, on all of the components. This idea of self-service and ownership comes partly out of history, because OKD — and Origin before it — is a major and quite established project with a lot of history. I don't really want to delve into that history.
Suffice it to say that in the past, most of the CI system for Origin, as it was called then, was based on Jenkins and a bunch of custom-built auxiliary tools. Now it's all built on top of a system called Prow — a prow is the front part of a ship or something like that, which brings us to nautical terms; we are in the Kubernetes world, of course. The switch to Prow is actually quite recent. When I prepared this talk, one of my resources was a talk by Michalis Kargakis, who at DevConf last year talked about what the problems with the Jenkinses and the Jenkins-based infrastructure were, why they decided to switch to something else, and how it went in the initial stages. If you are interested in that kind of history, go Google his talk — it's actually pretty interesting.

Now that I've dealt with history, I will continue talking about Prow: what Prow really is. Prow is a Kubernetes-based CI/CD system. That's a lot of buzzwords, so what Prow actually is, is a bunch of stateless and very loosely coupled microservices, each doing its own thing. They are designed to run inside a Kubernetes cluster, and together they implement a continuous integration system. For example, there is a component called hook whose only job is to watch what is happening in the Git repositories and dispatch those events to the other components that care about some of them. There is a totally separate component that only handles merging of pull requests. And there are many more — about 20 or 25, something like that, perhaps. What none of these services needs to do is care about scheduling and resource management and all the things you would probably have to handle when running Jenkins, because they run in Kubernetes, and Kubernetes is actually quite good at scheduling and running workloads from pretty much anywhere, as long as the workloads are containerized. And Prow is not an OKD thing, it's a Kubernetes thing: it originated in the Kubernetes upstream, it's used there, and in OKD we run our own instance, which some of you may know as the api.ci cluster. It's mostly there for running Prow.

Another description of what Prow is comes from its upstream documentation, which says it's basically an "if this, then that" system for GitHub — more developer-oriented, perhaps, but still. By default, the components implement a library of triggers and actions that are useful for building a CI system. Prow is designed to stay out of your way as a developer, so if everything goes well, you only interact with Prow via GitHub. There is no separate system you need to go to, pretty much nothing else. You get information via GitHub comments and labels, sometimes Prow does something with your pull request, and you should only need to leave GitHub if something went terribly wrong and you have to follow a link to see a log or something — Prow can display logs to you.

What Prow does can be separated into three main areas. The first thing many of the components implement is improving development workflows for developers, above what GitHub itself offers, especially for big, distributed communities like Kubernetes or OKD. I'll give a few examples. There is a component that watches PRs and does nothing but label them by how big a change they actually bring.
There's a totally separate component that knows who the people likely to be good reviewers for a pull request are, and assigns them to review it. There's a different thing that records the fact that the code was reviewed, as indicated by the /lgtm slash command, and records that fact as another label — there's a theme here with the labels. There's a different thing that watches for PRs that need to be rebased because the base branch moved in the meantime, and labels them. There's something that tracks approvals, which in the Prow world is a slightly different process than a simple code review: code review says the code is good or the code is bad, while approval is meant to answer questions like "do we want this feature?" and "is this feature in the right area of the whole thing?" — the big ideological questions. And a final example: it automates what humans love to do, which is re-triggering failing tests indefinitely until they actually pass. With Prow, we no longer need to do that ourselves; Prow does it for you. All of these actions are performed by separate little components, and you can notice that they are really, really simple: they watch for some event to happen on a GitHub repository, and they do a very small, very simple action somewhere else — add a label, post a comment, trigger a test, something like that.

The second benefit that Prow offers to users and developers is that it handles merging. In the Prow world, it's not humans who press the merge button to accept PRs; it's robots. There is a component called Tide — like the sea tide rising and falling. Tide knows what the criteria are for a PR to be accepted and merged, and it enforces them. It's better than humans because it does this all the time and immediately, so you don't need to wait until the senior guy who accepts things finally wakes up somewhere in Japan. It does it efficiently — I'll get to that in a while. And it doesn't force-merge because it thinks it's a rockstar developer and it's Friday afternoon so it's probably okay.

The way Tide communicates with you within a pull request is a little GitHub check that says, for example, that this pull request is not mergeable because it needs the lgtm label. So you know you just need to bother someone to put the little comment there, some other Prow component will add the label, and Tide will say: okay, this is now mergeable. If you want to know more, you can follow the details link and get the full criteria. Tide enforces that all the tests are passing — and even better, just before something becomes mergeable and would be merged, Tide checks whether the test results are fresh enough, meaning whether the tests were run on top of the current tip of the target branch. If not — if the target branch moved in the meantime and we only have old test results — Tide re-triggers the tests, so we always merge something that's probably working. Of course, this means that whenever anything is merged into the branch, it invalidates pretty much all existing test results, which would ultimately limit the merge throughput. So Prow — or Tide, actually — is smart enough to do the merging in batches: if there are multiple PRs waiting that would be mergeable, and merging any one of them would invalidate the results of all the others, Tide just takes all of them, attempts to merge them together, and tests the combined result.
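The criteria Tide enforces are expressed as queries in the Prow configuration. As a minimal sketch — the repository name and the exact label set here are illustrative assumptions, not values copied from the real config — such a query could look roughly like this:

```yaml
# Sketch of a Tide section in Prow's config.yaml (illustrative values)
tide:
  queries:
  - repos:
    - openshift/some-component      # hypothetical repository
    labels:                         # labels that must be present before merging
    - lgtm
    - approved
    missingLabels:                  # labels that must NOT be present
    - needs-rebase
    - do-not-merge/hold
```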
If that combined batch passes the tests, Tide merges all of the batched PRs at once. And as you can see, it's not just tests — Tide also tracks which labels are present on the pull request. Remember all those little labels the various components were putting on the pull request? Tide cares about them and is able to say: I'm missing this one, it should be there; and this one shouldn't be there, so get it removed, otherwise I won't merge. And of course it doesn't merge anything that's in conflict — that's not rocket science. So that's merging.

The third part, and for a continuous integration system probably the most important and most complicated one, is Prow jobs. The concept of a Prow job is "if this, then that" on steroids: we know that something is happening in the GitHub repositories, so we want to execute something, and that something is very usefully a test workload. We are in Kubernetes, so the something is basically a Kubernetes pod. As long as I have something containerized, I can just submit it, it will execute, Prow will check whether it executed successfully or not, and report that back.

There are three types of jobs, depending on how and in what cases they are triggered. The easiest ones, which we call periodics, are just time-based triggers: it's 12 a.m., so let's run some tests — every day, or every week, or something like that. Very simple. Somewhat more complicated are postsubmits. Postsubmits run whenever there are new commits on some branch, which in the usual world means some pull request was just merged, so we need to do something. These are not that useful for testing, because the code was just merged, so it's too late for meaningful testing — we want to test before we merge, so we don't get bad code onto the branch. So postsubmits are useful for reports and artifact builds and things like that; we'll get to that a bit later. The third type, mostly used for testing, are presubmits, which run whenever there are new commits in a proposed pull request. Like I said, these are mostly used for tests, because they can run and report their results back to the PR they are testing, so they can be used to gate merging for pull requests to any repository.

So now we know what the jobs are. If I'm a component owner and I have Prow deployed, what do I do to actually create a job? We are in Kubernetes, so we write YAML — a lot of YAML, of course — and track it in Git. Here is a slightly shortened example of how a job might look. This is a presubmit set up for the ci-operator repository in the OpenShift organization. It has a name, which doesn't really matter that much. This part is a standard Kubernetes pod spec, which just says what to run and from which image — that's it. It's a presubmit, so it has a bunch of options that let you fine-tune when and how it should be triggered: you can see there's a regular expression called trigger, which lets you use a slash command inside your GitHub pull request and Prow will trigger this presubmit. And finally, as a presubmit, it has something called a context, which is nothing more than the name under which Prow reports its results to the GitHub pull request. And that's pretty much it. So let's go back to this heap of YAML — this is a single job.
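For a rough idea of what such a presubmit looks like in the job config YAML — the job name, image, command, and trigger regex here are simplified assumptions rather than the exact values from the real config:

```yaml
presubmits:
  openshift/ci-operator:
  - name: pull-ci-openshift-ci-operator-master-unit   # the name doesn't matter much
    context: ci/prow/unit          # name under which results show up on the PR
    rerun_command: /test unit      # slash command that re-triggers the job
    trigger: ((?m)^/test( all| unit),?(\s+|$))
    always_run: true
    spec:                          # a standard Kubernetes pod spec
      containers:
      - image: some-test-image:latest   # placeholder image
        command: ["make", "test"]
```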
If I am a component owner and I have Prow, what I'm supposed to do is write one, two, ten of these jobs in YAML. I would, of course, also need to make sure all my test workloads are containerized, or have an easy way to containerize them. So Prow is nice and lets us run pretty much anything, but it's still a lot of work to get to an actually useful CI system.

So, enter ci-operator. Now, if you have lived in the Kubernetes or OpenShift world for the last year or so, and you attended some of the talks here at DevConf, you probably heard the word "operator" in a certain context. Well, ci-operator is not that kind of operator. I have no idea how the name was born; in ci-operator, the word operator is just part of the name. That's it. So we know what it's not — let's focus on what it actually is. ci-operator is a tool that knows it will probably be run in the context of a Prow job, and it knows how to simplify the usual tasks that all the OKD components might want to perform within their Prow jobs.

It knows that when something runs within a Prow job, Prow exposes information about why this job is running. That is something called the job spec, which the thing running inside a Prow job can see, and from it we know what is being tested. We can see that there is a certain revision, coming from a certain pull request, that is supposed to be merged to the master branch of a repository called "component" in an organization called "openshift". So we know what should be tested, down to the exact Git revision. The only thing remaining is that we don't know how to test it, and ci-operator allows the component owner to define how their component should be tested, how it fits into the OKD distribution, and so on — the usual stuff — inside something we call a component config.

And what is a component config? It's, of course, more YAML. The first usual action is to execute tests from the component repository — or more generally, to run some commands on a checkout of the component repository. Here's an example of a component config. Given a build root image — release:golang-1.10 from the openshift namespace, which is supposed to contain all the dependencies necessary to build this particular component — ci-operator lets me specify, in the binary build commands stanza, the command it can use to build the binaries of the component. It also tells ci-operator that after those are built, it can run the "make test-unit" command inside the component repository, which executes what happens to be the unit tests.

Now, ci-operator does not really do all of this complicated image-building legwork itself. Just like Prow uses Kubernetes for the stuff Kubernetes does well, ci-operator uses OKD, or OpenShift, for the stuff OKD does well: it uses OKD features like builds and image streams to do the actual legwork of building images and moving them from one place to another via image streams. ci-operator only orchestrates this as an OKD client. Containerized tests are the simplest action you can specify for ci-operator to execute, and once we have this kind of component config, running these tests — on the api.ci cluster or some other Prow-enabled cluster — is as simple as running ci-operator with the target "unit", and ci-operator will do everything.
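As a rough sketch, the component config described here could look something like the following; the repository, commands, and file name are assumptions for a hypothetical component, not an exact copy of a real config:

```yaml
# ci-operator config for a hypothetical openshift/component repository
build_root:
  image_stream_tag:            # image containing the build dependencies
    namespace: openshift
    name: release
    tag: golang-1.10
binary_build_commands: make build      # how to build the component binaries
tests:
- as: unit                     # becomes the "unit" target
  commands: make test-unit     # executed on a checkout of the repository
  container:
    from: src                  # run in the image that contains the checked-out source
```

Running the tests is then roughly a matter of invoking ci-operator with that config and the target, something like `ci-operator --config openshift-component-master.yaml --target unit`; within CI, Prow injects the job spec so ci-operator knows which revision to check out.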
The second useful action that all components need is to be able to build their image, so that the image can be included in OKD, in the whole big OKD collection. Again, ci-operator makes this quite simple: you provide a base image, which is mostly shared by all components, and you describe how ci-operator should build the component image from that base image, and ci-operator makes it happen with a similar call as before. And when it can build component images, it can also mock-build something like an OKD build, which is called a release payload and is mostly a collection of specific component image versions. Again, it's as simple as running with a different target.

The last thing: if we have a collection of OKD images known to be good — for some definition of good, meaning they work — and we have this single freshly built image that is a candidate for a single component, we can put them together, install the whole thing, deploy a new cluster, and execute various end-to-end tests to see whether the resulting OKD cluster behaves well or not. I stressed in the previous examples that running ci-operator to achieve something is as simple as calling a target. In this case it's not that easy, because the whole functionality is achieved using OpenShift templates, and with our tooling in its current state it's only really easy to set up when it runs within CI; it's not that easy to run locally on your workstation. We have multiple templates available for component owners to use. The two most used are these: the first one executes a shared OKD test suite, which basically verifies that whatever we just created is a well-behaving OKD cluster; the second one allows a component owner to run more detailed end-to-end tests coming from their component's repository against the freshly installed cluster, which of course allows more thorough testing of the component under test in the context of a freshly installed OKD cluster.

And the last thing ci-operator can do: if we execute all the tests we have available, assemble the release payload, and run all the end-to-end testing on the freshly deployed clusters, ci-operator can promote the image to be included in the next batch of what is now a good OKD. This image is then almost immediately available for the other components that are just starting their testing. So if we have everything set up, it means: we have a PR, it is tested, everything passes, it gets merged, the images are built, they get included in the good OKD, and from that point on, other components — or even my own component — will use this new OKD. So we get this gradual "this component is good, this component is good". Again, the people praying for their end-to-end tests to finally let them merge would perhaps disagree, but we are kind of there.
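Continuing the hypothetical component config from before, the image build and promotion parts could look roughly like this; the image stream names and the Dockerfile path are placeholders, not the real release streams:

```yaml
base_images:
  base:                        # shared base image most components build on
    namespace: openshift
    name: origin-v4.0
    tag: base
images:
- from: base                   # build the component image on top of the base
  to: component                # name of the resulting image
  dockerfile_path: Dockerfile
tag_specification:             # where the known-good OKD component images live
  namespace: openshift
  name: origin-v4.0
promotion:                     # where the image is pushed once everything passes
  namespace: openshift
  name: origin-v4.0
```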
So now we have Prow, which executes something for us, and we have ci-operator, which lets us easily specify that something meaningful for Prow to execute. We are now in a situation where, again, if I'm a component owner, what I need to do to set up CI for my component is write a little YAML for ci-operator saying "this is my component, this is how it's tested", and then write even more YAML for Prow: execute ci-operator with this target, and execute ci-operator with another target when this happens and that happens. That's still hundreds of lines of YAML for each component. It's easier than before — at least I just specify ci-operator invocations instead of preparing pods and handling image builds myself — but we are not there yet. Or rather, we were not there yet; right now we are.

All of the config I described, for both Prow and ci-operator, lives inside a single separate Git repository. And it's not just these configs living there; the general Prow config and some other configuration and other things live there too. So how the whole CI is currently configured lives in this repository, and I mentioned that we want self-service, which means we want everyone to submit their configuration changes as PRs to this repository. But until around August last year, all of the YAML for jobs lived in a single YAML file, which at that time was around 5,000 lines covering all the components. That is of course not very scalable when you expect at least tens of people to own their own jobs. So the obvious thing we did was create a little hierarchy where each component has its own directory, and that made it easier for teams to own their jobs.

After we had a few early adopters on board with Prow together with ci-operator, patterns started to emerge. The successful adopters mostly used ci-operator to define all their tests: they usually defined something that is a unit test, something that is an integration test, perhaps an end-to-end test; they could build an image, or two, or ten; and all of them basically did the matching part on the Prow side. So what happened is that everyone set up presubmits that ran pretty much all the tests and built the images: a presubmit running the unit tests, a presubmit running the integration tests, a presubmit running the end-to-end tests, a presubmit that would attempt to build the images, and a postsubmit that would build the images and promote them. That was the pattern that emerged, and of course, because we asked people to write hundreds of lines of YAML, not all of the adopters wrote that YAML themselves from scratch — they copied it from each other, which reinforced the pattern even more. And copy-paste means mistakes. Nobody really wanted to deal with the Prow jobs anymore; the Prow configuration became boilerplate, and nobody likes boilerplate.

So we decided to get rid of it — get rid of the Prow job boilerplate. The way we did it was to build something called Prowgen, which is nothing more complicated than taking all of the available tests and generating jobs that run them in the pattern I described on the last slide: execute everything before merge, and build images and promote them after merge. That's it. So now nobody really needs to care about the Prow config. Of course, you still can if you want to, and some people do because it gives you more flexibility, but you don't need to. It's not hundreds of lines of YAML anymore; you write something like 40 lines per component, once, and then perhaps adapt it when you need to add a new test type or something like that.
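To give an idea of what a component owner actually touches, the per-component layout in the CI config repository looks roughly like this; the component name is a placeholder, and the job files are the generated ones nobody has to write by hand anymore:

```
ci-operator/
  config/openshift/component/
    openshift-component-master.yaml              # ~40 lines, written by the component owner
  jobs/openshift/component/
    openshift-component-master-presubmits.yaml   # generated by Prowgen
    openshift-component-master-postsubmits.yaml  # generated by Prowgen
```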
Still, even with stuff being generated and thousands of lines of YAML in the repository, there was copy-paste happening pretty much everywhere, and mistakes were still very likely: copy-paste errors, typos, leaving files where they shouldn't be, preparing a ci-operator config file but not running the generator, running an old version of the generator — all the little ways in which YAML can be painful to use. So on openshift/release, we lint, test and cross-check everything with extreme prejudice on incoming PRs, which means that getting a change in might be a little painful unless you check it yourself locally. You submit a change saying "I want a new job", and some part of the config checks says: no, you have a hyphen where an underscore should be, go away. You fix it, and some other check notices: this is badly ordered, we need this strictly ordered. Oh, and you have Python errors, fix all of those. So we are very strict on the configuration, but the result is that when something goes in, it usually works — at least our part of it. If you tell us that your repository has a make target for the unit tests and it doesn't, that's not something we can check — or we could; I will touch on this in a while.

So at some point, two months ago or something like that, it seemed like Prowgen and all the checks were the missing piece. It made adoption easier, we saw pretty much everyone getting onboarded onto Prow, and we were finally in a state where, despite our constantly changing config — new jobs, removed jobs and everything — CI worked, and it worked well. So we were able to focus on improving the experience of the job authors. There are many improvements, and I selected two because they are technically interesting for what we are trying to achieve.

The first one: we had a common pitfall. Let's imagine that we added a job for a certain repository, and this change — adding the job — was merged while there was a pull request in flight on the target repository. So the PR was filed before the job was added, but it was, for example, waiting for a review. The sequence was: there was a pull request on the tested repository, we merged a change that brought a new job, someone reviewed the original pull request — and now it wouldn't merge, because Tide, the component that handles merging, knows there are all these tests that need to pass, and this pull request doesn't pass everything, because nothing ever triggered the newly added job on the PRs that were already in flight. That was a problem: such a PR would be stuck until someone came and re-triggered the tests manually or something like that. Removing, adding and changing jobs caused similar problems, and it was a huge pain, because we got a lot of questions: why doesn't this merge, why doesn't that merge? So we — and by we I mean my team lead Steve — wrote a Prow component that watches for job config changes and immediately applies the little intervention automatically. Whenever we add a new job for a repository, it triggers the new job on all the PRs that are currently open for that repository; if we remove a job, it makes sure the context from the removed job is removed from all the PRs that are in flight; and so on. So these problems no longer happen, and Tide mostly merges flawlessly and happily.

The second thing we are focusing on building right now: we still observe an anti-pattern on our repositories.
When a component owner wants to set up something for their CI, they go to our repository — the openshift/release one — make some change and get it merged. Then they go back to their own target repository and file a new PR, just to test what they actually did in the CI config repository: to see whether the right jobs get triggered and whether they execute what they should. If not, if something went wrong, they go back to our repository, try again, and repeat and repeat and repeat. People are doing this, and not actually complaining about it that much, but we don't like it. So we are working on a feature we call rehearsals, or rehearsed jobs: if you submit a PR to our CI job config, we detect how it affects jobs — whether it brings new ones, changes existing ones, or changes something used indirectly by some of the jobs — and we try to rehearse them and give the job author feedback, right on the openshift/release pull request, about how the job would behave if it were executed on the target repository. We know we won't be able to do rehearsals for all of the jobs, and we are not sure we even want to: for example, if you change one of the base templates I mentioned before, do we really want a rehearsal that tries to schedule 200 jobs, each attempting to bring up a full new cluster? We'll sort that out, but the feature is coming. So that's what we are doing right now.

I promised some view of the future in the title, so I'll get to that. With all of this, the system is now mostly stable, and the biggest pain that probably everyone in the OKD organization has is that investigating failures is not very easy. If a job fails on you, especially one of the template-based jobs that tries to bring up a whole cluster and run all of the OKD conformance tests, it's painful, it really is. Sometimes there are no logs; sometimes there are logs, but there are so many moving parts that it's very easy to miss the artifacts even when they are there. So we would like to focus more on improving the experience of investigating failures. We also want to focus more on bringing in metrics and alerts, mostly on test results and on analysis of test result data — investigating flakes and things like that. There's a reason we currently don't have much of that in OKD, and the reason is that here we are mostly on our own: in the upstream, Google feeds all the artifacts into some internal big tables and into a tool called TestGrid, which we can't really deploy because it's not open source yet. So we have nothing to reuse, and unfortunately all the priorities were different so far. So those are the two main areas we would like to improve in the future.

I'd like to wrap up the whole thing with a few numbers. Right now — I checked yesterday — we have 700 jobs set up for 120 repositories spread across 10 organizations. So it's no longer just the openshift organization; people are setting up jobs for their personal accounts, and I think we have something in Knative and other orgs. I said that in August we had 5,000 lines of YAML; now our jobs are something like 35,000 lines of YAML, and I wasn't sure whether I even wanted to include that number, or whether it's a good thing or a bad thing. The good thing is definitely that most people do not really need to care about this number.
If they need to care about job configuration YAML at all, they deal with small component-specific files of 100 or 200 lines, which I think is quite reasonable. And right now, the whole OKD end-to-end tests are gating pull requests coming into around 70 repositories in the openshift organization. In the last 24 hours — I'm lying a bit because, as I said, I checked the numbers yesterday evening — we executed around 2,600 jobs. I checked again just before I came to this talk and it was around 2,200, about 400 less; that's probably what you get for doing DevConf on a Friday. All those jobs were triggered for 233 different pull requests, and out of the 2,600 jobs, around 600 were at least attempting to deploy a full cluster and test something on it. So with these numbers, I'm wrapping it up, and I'll be happy to answer any questions.

I'm actually not that long in this organization, so I would probably be making things up. I don't know — I can ask, but I've only been back at Red Hat for half a year; that's my time span. More questions? Go ahead. Can you repeat, please? [Audience: How many resources do you need to run all the CI jobs — the cluster in your example, and the other types on AWS?] The question is how many resources we need to execute all of this, and the thing is, I'm not able to answer it, because the cluster on which all the test workloads run is autoscaling, so it just asks for what it needs and we don't really care. Someone pays the bill.

The question, as I understand it, is whether there is some manual testing involved in this process. In the OKD organization, only the manual testing that whoever submits a PR does before they submit it. For OKD, which is a community distribution, I don't think there are any systematic testing efforts, and certainly not manual ones. I'm not able to comment on the OpenShift products, because I don't know.

It should be something around 40 minutes. The question was how long it takes to run all the tests. It's all parallel, so the duration is whatever takes longest, and the end-to-end tests are the ones that run longest — it should be around 40 minutes. Sometimes it's more, especially when we have failing tests: a lot of failures are based on timeouts and such, which makes the whole execution longer. Mike, again.

We don't do that well. There isn't anything smarter than testing the individual component in the context of what is currently assembled as a good OpenShift. There is a step I skipped, between promoting an image and it being available for further testing, handled by something called the release controller: after it assembles the new release for other tests to use, it runs these tests again. So if something passed isolated testing and got included, and something else did the same thing, and there was some kind of timing race and collision between them, it's likely that this thing that tests the whole assembly again would discover it and would not pass it on as the latest known good OKD to the other tests. We would just continue using the old good OKD. More questions? Yeah. I think the question was whether, in the case when we want to build images for components, the user always needs to provide a Dockerfile, or whether there are other means of building the images.
I think right now there are no other means, and that's not decided by us — it's decided by an architect who decides how the images for OKD, and later the OpenShift products, should be built. So right now it's Dockerfiles only. I guess that's it — thank you for all the questions.