My name is Ben Howard, I'm on the CoreOS tools team, and today I'm going to be talking about moving the unmovable. To introduce what I mean by that: in 2019 the Red Hat CoreOS team undertook a massive lift, moving our build pipeline from virtual machines into a Kubernetes OpenShift environment. It was a monumental effort that involved a lot of people across at least four teams that I can think of. By way of background, I've been in the cloud space professionally for the better part of 15 or 16 years, and I've done a lot with Jenkins, Concourse, GoCD, and CI in general, but this is actually the second or third migration of this kind that I've done, and it's the only one I would consider a success. Before I came to Red Hat I worked for a small company in the financial sector whose stack comprised 70 virtual machines, and they decided that Kubernetes was awesome and wanted to do the heavy lift from virtual machines all the way into Kubernetes. I helped them with it: we described the entire infrastructure in Helm charts and got to where we could deploy everything in 15 minutes or less. We had a pre-prod environment and the full prod environment ready to go, but the workload never shifted, because we never got the confidence that it would work, and the cost of making that final leap was too expensive. Only towards the end of this last migration did I realize some of the lessons from those earlier ones. So the context here is moving the CoreOS build pipeline from virtual machines into Kubernetes, but I want to talk in more general terms about what is unmovable. In preparing for this I went and looked at some interesting
statistics, and found that a lot of cloud migration efforts simply fail, run longer than expected, or go way over budget. Part of that, I think, is because the problem domain is poorly understood. Part of our problem when we started moving the pipeline was that we referred to it as "the pipeline", one phrase. That should have been our warning that we were dealing with something that had an identity, and it told us one important thing: we had a pet. A few years ago we started hearing in the Kubernetes world that pet services are out there, and that if you try to move a pet service it's going to become a Kubernetes pet service, and that's bad. My wife was sending me texts of our cat, and I thought this picture was really useful: that is the end of our bed, and that is her blanket. Mia is a delightful 13-year-old cat, and she really hates it when her environment is disturbed. If you move that blanket, you will hear about it in the morning. She wants her blanket warmed; she even has a heating blanket made specifically for pets that slips underneath it, and she will tell us to turn it on. She really likes her environment. Pets don't like being moved, and they don't like their environment being disturbed. But we also discovered that someone lied: there are no pet services, and there are no cattle services for that matter. There are gremlins. If you remember from the 1980s, those quasi-puppet horrors (I don't know if Jim Henson did them): gremlins are all cute and fuzzy until you decide to feed them after midnight, and when you feed them after midnight all sorts of bad things happen. The biggest problem when you're trying to move something big and unmovable is the way you think about the problem. I said earlier that we used the phrase "the pipeline", and thinking about it while preparing this talk, I think we continued to use it even after we had multiple
versions and iterations of the pipeline. Our thinking is that this thing becomes unmovable because we've given it an identity, and that identity often conflates the process of producing the work, the reason it exists in the first place, with the output. One of the things I noticed was that we talked about moving it, which implies you can just pick it up and put it someplace else, as against the Kubernetes idea of deploying a service. That speaks to the environment. We probably spent about six months on this journey before we settled in someplace safe, and I like to refer to it as gremlin training, because you have to wrangle this big unmovable problem and figure out how to land someplace that's workable. We went through phases: we started with all virtual machines, the classic sort of world; then we had a transitional period with Jenkins running on OpenShift, where Jenkins was the process that drove the pipeline but VMs did the work; and then we got religious about OpenShift, wanted everything running on it, put a bunch of really smart people on it, and worked it out. That last part, getting everything onto OpenShift, was probably the hardest. What we learned is that the environment is the single largest obstacle when moving a service. When I talk about the environment, there are certain things that are inherent and implicit in it. In the old days we would say our database server is on rack 3C, it has a name, Thanos, for kicks and giggles, and Thanos has so much RAM; it has these characteristics, this identity. When you built your database server the environment imposed itself, and changing the environment, such as changing
characteristics like memory or CPU, environment variables for the operating system, or the operating system itself, all these things contribute to that. When you move something from one environment to another you are breaking some of those implicit assumptions, and if you don't take the time to understand them, your gremlin will get really unhappy. That's what I mean when I say gremlins really like their home. When you move something over, there are things you used to take for granted that you can't anymore. In our case, we're building an operating system, so performance is kind of important, and we ran into problems with speed, and we ran into memory constraints, because of the ways OpenShift was constraining us: CPU caps, memory limits, device access, network bandwidth, you name it. Some of these things just suddenly start changing, and unless you take the time to understand exactly what the workload is doing before you embark, you're going to experience a level of iterative pain that only gets worse. I'd like to propose that if you're using the term "moving", you're already off to a bad start. Again, do the research: business analysts use the phrase "lift and shift", and they say that a lift and shift to the cloud, going from bare metal to virtual machines in the cloud, almost never works. The reason it almost never works is that you're changing the environment and its characteristics. So you move your gremlin into the cloud, feed it after midnight, and when it blows up, your life is going to be very unfun. We decided we were going to move the pipeline, and we did move it, and over six months we experienced a lot of pain. One of the ways we went off the deep end was wanting the pipeline to work in one environment exactly as it worked in another, and we went through all sorts
of pain because we wanted feature parity. One example of feature parity: we were running Jenkins, and our users were used to interacting with Jenkins, but OpenShift and Kubernetes give us the ability to expose different options through the YAML, through environmental characteristics, which Jenkins then reads in as parameters. One developer and I went back and forth on how this should be handled, trying to figure out the truth of what the default option was going to be, and we ended up in a situation where there were three different ways to arrive at what was a default option: we would get it from OpenShift, which would then tell Jenkins, and finally the pipeline had its own defaults, and only by following all three could you know what was going to happen. I spent way too much time figuring out what was going on. We ended up with the idea of declarative specifications, where we said this is the way it will be, and we went to explicit options, defining the environment explicitly. We also discovered that some of the flexibility we needed when it was a pet we no longer needed, because deployments are cheap. As we started doing deployments we discovered we had the ability to do developer builds, and we went from having one pipeline to, at last count, about 20 different pipelines running with different developer options, because instead of going through and clicking a checkbox you could change some YAML somewhere. For some background, part of what we do is use Jenkins with its pod templates to construct a pod, and we went to completely ephemeral builds: we construct the pod at initialization time, it contains absolutely no state, and we pair our CoreOS Assembler, which is the code that builds CoreOS, with a Jenkins agent that reports back to Jenkins.
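A minimal sketch of what such an ephemeral build pod might look like; the image names and structure here are illustrative, not our exact configuration:

```yaml
# Hypothetical sketch of an ephemeral build pod: a Jenkins agent sidecar
# reports back to Jenkins while the coreos-assembler container does the
# actual work. No volumes, no state: when the pod is gone, it is gone.
apiVersion: v1
kind: Pod
metadata:
  generateName: cosa-build-
spec:
  restartPolicy: Never
  containers:
  - name: jnlp
    image: jenkins/inbound-agent:latest   # reports back to the Jenkins master
  - name: coreos-assembler
    image: quay.io/coreos-assembler/coreos-assembler:latest
    command: ["sleep", "infinity"]        # Jenkins execs build steps into it
```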
Then the pod goes off and does the build. One of the things we also discovered involves Jenkins itself. When I talk to people about Jenkins, there are two types: those who love Jenkins and those who hate Jenkins, and most people who love Jenkins haven't had enough experience with Jenkins to know that they really don't like Jenkins. So what we ended up doing was moving away from Jenkins as a source of truth, because we didn't want it to hold any state. Another problem was that we had four different sets of defaults, because we were using mutually exclusive templating systems. I'm a Makefile geek, and then we also had the YAML templates, we had Jenkins with Groovy pulling things in, and we had OpenShift. All of those came together and we spent a ridiculous amount of time trying to merge them. One of the things that went wrong was figuring out what was true, which seems like it would be very easy to do, but because we had all these different inputs instead of one input, and we hadn't changed our environment to deal with a single set of defaults, we hit problems. OpenShift will send a template parameter over as an environment variable, which means you end up with something that evaluates to true even when it's not. We spent time trying to figure out why our different configs were broken, and it was because we had multiple sources of truth. In one early iteration, we would set something in the Makefile, which would then change something in the template, but the template itself also had a default. So we just burned all of that and went to one single source of truth.
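The truthiness trap described above is easy to demonstrate. This is a minimal shell sketch, not our actual pipeline code; the variable name is made up:

```shell
# FORCE_BUILD arrives from the template as a *string*; "false" is non-empty.
FORCE_BUILD="false"

# Broken: any non-empty string, including "false", passes this check.
if [ -n "$FORCE_BUILD" ]; then broken="yes"; else broken="no"; fi

# Fixed: compare against the literal value instead.
if [ "$FORCE_BUILD" = "true" ]; then fixed="yes"; else fixed="no"; fi

echo "naive check forces build: $broken; strict check forces build: $fixed"
```

The naive check "forces" the build even though the operator asked for false, which is exactly the kind of config breakage we kept chasing.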
Another thing, and this is something I've done some thinking about but we really haven't gotten there: there is a strong temptation, when you're writing your code, to turn your options into parameters. But what we saw time and time again is that all you're doing is creating indirect pain, because you end up defining the same variable in multiple places: you use the variable in the actual code, then you have to set it somewhere, and then you also have to set it in your template. What you want to do instead, when you're changing the environment, is switch over at the same time. If you can describe your service in JSON and YAML, which is how you get it into Kubernetes, you should be able to do the same thing with your environment. So embrace the JSON and the YAML early, use it, and teach your service how to use it. As we were converting our environment we defined what we call the job spec. The job spec gave us a lot of flexibility: before, we had a small set of options, maybe about a dozen ways to do builds; now we can do probably several thousand different types of builds just by changing different parts of the YAML. So the two clear options are: one, which I was just talking about, teach your gremlins to understand YAML; the other is to use ConfigMaps and Secrets, and source them to set the environment per service or per deployment. The reason is that this lets you store your config in a Kubernetes-native way, and for things like secrets you can share them within the namespace, standard Kubernetes practice. I wish we had done it earlier; we started doing it with just secrets, and I think it would have made our life a little easier in the long run. Another thing: make sure you loosely couple your environment. As you unwind the big thing, make it so that you can change your environment as you move along.
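A sketch of that ConfigMap-and-Secret sourcing, assuming the standard Kubernetes `envFrom` mechanism; all names here are made up for illustration:

```yaml
# Hypothetical: per-deployment settings live in a ConfigMap, credentials in
# a Secret, and the pod sources both as environment variables.
apiVersion: v1
kind: ConfigMap
metadata:
  name: pipeline-config
data:
  STREAM: "testing-devel"
  DRY_RUN: "false"
---
apiVersion: v1
kind: Pod
metadata:
  name: pipeline-worker
spec:
  containers:
  - name: worker
    image: registry.example.com/pipeline-worker:latest
    envFrom:
    - configMapRef:
        name: pipeline-config
    - secretRef:
        name: pipeline-secrets   # e.g. upload credentials, shared in-namespace
```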
We ended up separating our CoreOS Assembler code, keeping it apart from our pipeline code, which also held our templates, and we've been wanting for some time to unwind that further and keep our Kubernetes templates outside the code as well. Another thing we did was introduce configuration branches for holding our job specs, so that we could describe all those different variations: we can change one bit of YAML and change the location of everything else. Another problem we ran into was how state would be stored. One of the things that happens when you run Jenkins is that when Jenkins is unhappy, it is horribly unhappy, and the question is how you recover. Going in and trying to recover Jenkins after it has crashed, or just went to lunch, became problematic. And there was the other problem that once you have one Jenkins, you're going to want another, and another, and another. So we tackled that question and moved state outside, storing it in S3 or in Koji. We also learned that Jenkins is not well suited for the cloud. We hit enough problems that for a while one of my favorite commands was "oc delete pod/jenkins". If I was having a bad day I would run it just for kicks and giggles, and if the build was having problems, just run that, because we had gotten to the point where we could blow Jenkins away. Jenkins was no longer a pet; it was simply an execution environment. One of the things I would challenge people with, when you're moving your unmovable task, is to look at what you're actually doing: what is the actual workload? In the case of Jenkins and some of the CI tools, you will notice that Jenkins itself becomes the pet, when what you're really after is the workload, the output. So we put a lot of engineering into making it so that Jenkins
was simply an execution runner. That resulted in us making our own master and agent container images (container images, I should say, since I was paying attention to Dan Walsh earlier), and we made Jenkins completely scripted: it would come up, and on restart it would not remember any prior history. We didn't want Jenkins to be the center of gravity. The purpose of the pipeline was the output of OS disk images; it was not to run Jenkins. So be very careful that you don't conflate what runs your workload with what your workload is. For a while we also really liked running off master, but a certain developer wanted to break stuff a lot, and we got religion about pinning rather early on. With pinning, we use different Git branches for configurations, different ones for the version of the pipeline, and different tags to create specific versions of the build pipeline. As a result, right now we have four production versions running, all based on snapshots of the code, and it also gives us the ability to break stuff without breaking production. Another thing we experienced: with OpenShift, by default you aren't supposed to run as root, but we ran into some privilege problems. If you need root, the first question you should ask is why, and whether you really need it. In our case we did, because we needed KVM access, and I'll get to that, but essentially you don't need root, so use the opportunity of this lift to move away from running as root. KVM is really difficult in Kubernetes, at least, because it all depends on how the cluster is set up. When I prepared this slide I said we had three OpenShift environments; I've since been corrected that we moved this through four OpenShift environments, and each one had a different method of KVM access. In the first one we did virtualized KVM; that was dreadful, hated it. Then we moved to another environment where we used
direct KVM access through a package. In the third we ended up with a service account so we could get our access, but the problem was we ended up in a privileged pod, and that's bad, because now we're running our process as root, which, you'll recall, I said not to do. So what we do now is start up, give ourselves the access we need to KVM, drop privileges, and then proceed as though we never had root at all. The lesson we learned across those three or four environments is that if you aren't very careful, you will trade one tightly coupled environment for another tightly coupled environment. This is as true in the cloud as in OpenShift or self-hosted Kubernetes: as you start using the full capabilities of your environment, you will invariably make things difficult for yourself later if you aren't paying attention to what you're doing. There are some things you can do to protect yourself. One is to use variables wherever you can for things like DNS names and hostnames. We don't do it in our pipeline, but there are some useful Kubernetes ideas here: ExternalName Services, which give in-cluster names to external services, and external IPs, where you use a proxy. Both are great ways to handle things like firewalls and give your workload the access it needs, and by using in-cluster naming the main advantage is that you can further decouple from the external environment. The other problem we had was a fun little lesson on testing and deploying our services: dev versus GitOps, one that we as a team debated for quite some time, and I think we're still settling it. When you have a virtual machine you often do config management, Ansible, Puppet, Chef, whatever, so the way you actually manage your service needs to fundamentally change, and you have to ask the question: how are you
going to do your deployment? We ended up shifting in large part to a combination of what I'd call dev-GitOps, where we do everything based on Git pulls and PRs, and we even use GitLab to run CI on our pipeline. Another mistake was trying to support Jenkins and the Jenkins way of doing things while also trying to do OpenShift things. OpenShift lets you do build triggers based on URL callbacks; so does Jenkins; and the problem is that if you do both, you can end up with unexpected results, based on Jenkins parameters for example. So what we did was back out all the Jenkins stuff and just use the Kubernetes primitives. That worked better for us because we had moved to the declarative method of configuring our pipeline, and we didn't want unexpected results if someone changed a parameter in a URL. We just backed out, use only what comes in through the configuration, and now ignore input parameters other than the fact that a build was requested. If you mix where your inputs are coming from, you're going to end up with a truthiness problem. An interesting side effect was that as our pipeline became more durable and stable, our risk tolerance went down, and we started doing CI on the CI: having GitLab run CI checks against our Jenkins jobs reduced some of our outages, and we've done other things to keep things more tolerant. One of the bigger issues we also explored was what I'd call complicated setup. Once we had a way to declaratively describe our environment and how we wanted to build, we introduced a bunch of knobs, knobs for everything: a knob to force builds, one for a dry run that did everything but upload, and knobs for skipping certain tests or certain parts of the code.
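To give a flavor, here is a hypothetical fragment of what such knobs looked like in a job spec; these key names are invented for illustration, not our real schema:

```yaml
# Hypothetical job-spec knobs: every flag here is another input the
# pipeline code has to honor, and another way for truth to fork.
build:
  force: true          # knob to force a build
  dry-run: false       # do everything except upload artifacts
  skip-tests:
    - upgrade-suite    # skip specific test suites
stages:
  cloud-images: false  # skip whole parts of the pipeline
```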
What ended up happening was that the code got very clever, but it also became dangerous, and we ended up going back and removing it. One of the big lessons learned is that any time we introduced complexity, we introduced brittle code, and that brittle code made things difficult to debug. If you can reduce your code and reduce the complexity, make things simple while you're doing the lift, your life will be much easier. So make sure you're managing your complexity; looking at it before you start your lift will make your life easier. Our biggest pain point was internal CAs. The reason this was difficult is that OpenShift, because of its internal security model, does not let your containers run as root, and we had an internal CA certificate we needed to apply. Because of changes in the underlying container images we were using, about every couple of weeks a build would happen and our CA was not being trusted, and we weren't able to do the regular update model. After several iterations we just ended up with a base container that we use to build all our infrastructure containers. I would say that having a base container holding your most basic configuration, things like internal certificates and common packages, is a best practice; I wish we had done it earlier, it would have saved a lot of pain later. The one thing we did do, across six months and three or four different versions of OpenShift, was plan for frequent deployments, and that was from day one. When we started this journey we decided we wanted to be able to blow the entire pipeline away and start over from scratch, from 3.7 to 3.11. The first time we moved from one version of OpenShift to another it took about two weeks to do the move. The last move we did took 45 minutes, and it was during a fire: there was an issue where
our environment disappeared because of some hardware problems, and we were in full-out panic mode. But because we had prepared, and we could describe our entire environment including the secrets, it took 45 minutes, and the only reason it took 45 minutes was that we had to wait for the container images to do their builds. As a result, there are at least four teams running copies of our pipeline: the ART team, which runs our production builds, a multi-arch team, and we have our own development builds, and today, for those internal to Red Hat, you could take it and run your own copy. I've already hit most of these points, but I'd say the biggest thing I wish we had done during the lift was challenge our base assumptions. If we were to step back and do this again, we would probably not use Jenkins; we would look at our base technologies, look at what was really there, and use that. So quickly, since I'm told I have ten minutes: we had to change our thinking, and if we had changed our thinking earlier, our life would have been easier. So far I'd say we're a little bit happier; the gremlins have all been placated. One of our success points is that developers can now run and test their code in pipelines without having to do so on their desktop, and one of the fastest ways to bootstrap a new developer is to give them a pipeline. In conclusion, I'll leave you with this thought: if we had thought more about the results and less about the how, we would have made different decisions. If you have an unmovable workload, it's probably unmovable because you are thinking about the how of moving it instead of the results that you want. Step back, consider it, and all of a sudden those things that are unmovable will become movable for you. So with that, I guess
we'll open up for questions. The first question was whether I had looked at Tekton yet: I've looked at it but haven't had a chance to play with it. The next question was what gains we expected when we made the move from virtual machines. As I understand it, it was because we were building CoreOS, which is the base of OpenShift: we wanted to dogfood our own technology and see what it would take to get it done. What we ended up gaining was, I would say, a lot more than we had before. Probably the biggest gain was being able to build arbitrary pipelines based on whatever configuration we wanted; as a result we can test different versions of CoreOS Assembler and different packages. Before this, when a person needed to test a different version of Red Hat CoreOS, maybe with just one package added or removed, or maybe a different configuration, it would take us time to do that. Now we can produce an image artifact in less than an hour if we have the configuration, so we can react much faster than we previously could, and now people kind of expect that sort of turnaround time, which is a downside. The next question was what the OCI KVM hook is and how it helped: the OCI KVM hook simply inserts /dev/kvm into containers; it's a Fedora package that you can install. And no, to the follow-up question, you don't need to run the container in privileged mode to run QEMU in it. On the question about Anaconda: we aren't using Anaconda anymore. CoreOS Assembler is a set of tools that builds the rpm-ostree and then we build the disk images ourselves, so we don't use Anaconda at all; in a virtual machine you could. Rephrasing the question, can you do Anaconda-based installs in a container this way: you can, using virt-install, if you have access to a KVM device, and in early iterations we were
doing virt-installs with KVM access. Any other questions? Steve? The question was about my note that we would consider other technologies. One of the lessons came when a team member asked a rather salient question after I put together a three- or four-hundred-line PR that was a bunch of Groovy: why are we doing this in Groovy, why aren't we doing this elsewhere? In retrospect, I think one of the things we were doing was putting so much energy into Jenkins itself, into Jenkins technology, and not into the build output, when we had something else available. I would have used BuildConfigs themselves and used OpenShift to execute the parts of the code that we wanted, and then chained those together so that Jenkins wouldn't even be needed: pure Kubernetes primitives instead of sugar around them. Any last questions? Thank you.