Thank you for being here. I'm going to talk today about the changes in the process that we established at GitLab on our way to continuous delivery. This talk is titled this way only because it kind of fits within the track; I prefer the name "Kubernetes: the prequel".

Let me introduce myself first. I'm the engineering manager for the delivery team. I've been with GitLab since September 2012, so that's seven years. I got hired as a backend engineer, and through my tenure at GitLab I changed positions multiple times: I was responsible for the Omnibus package, installation methods and so on, and then recently I moved into the position of managing the delivery team, whose sole responsibility was figuring out how to migrate GitLab.com and all of our release management processes to continuous delivery.

To give you a bit of an idea of what we are going to be talking about today: I'm going to give you a history overview of how release management evolved at GitLab, how the team that I'm leading right now and the whole process got created, how, while we were doing things, we started changing things just to make it more interesting, and finally the results of all of that. If there is anything I want you to get out of this talk, it is that there is no shame in trying things out, seeing how you can change your processes around and how you can leverage your legacy tools, even though everyone screams "this is the best thing, you should be doing this, you should be doing that."

So, to give you a bit of an idea of how we ended up in the place we started from: from 2013 onwards, actually from 2011, GitLab has had a monthly release on the 22nd of every month. We haven't missed a beat in all these years. From 2013, when we formed the company and had more than one engineer, we had a rotating release manager role. This was an engineer who was responsible that month for writing the blog post, tagging a version, pushing it out to the public, even tweeting. In the first couple of years it was only three of us, because we were at five people at that point, and that meant that a lot of the actions we did were manual, mostly because we didn't have the tools. So I would just log into a machine, build a package manually, upload it to S3 manually, copy the SHA, put it on the blog post, and release it. That was the release process.

As we started hiring more and more backend engineers, they started getting into that role as well. What usually happens when you put an engineer on a problem? They find a way to get out of it. So what they did was automate some of these tasks. But that still meant that a lot of the tasks we had were "follow this documentation, execute this script, do this, do that", all of it manual, because the release manager role was actually a rotation. All of the knowledge that gets built up during one month just gets wiped clean for the next one. The idea was also to make sure that we improve things by having fresh pairs of eyes looking at the process.

Now, we actually had one near miss with our release; we almost missed the 22nd deadline. Back in December 2017 we deployed one day before the release, which was unheard of. I think that was where the alarms started going off that we lose a lot of knowledge by rotating constantly, and I got tasked, mostly because I'm a loudmouth, with investigating what happened
and how we could improve it. As I was collecting data, it became apparent that we either need to spend some effort forming a team that is going to lead this change, making sure that we don't have semi-automated things anymore but fully automated ones, or we are never going to get better at this. So in July 2018 I got two engineers.

At some point someone at the company said: hey, it's great that you're doing all this automation, and releases kind of go well with Kubernetes, and Kubernetes is this awesome thing, so how about you just do Kubernetes as well? Right, like it's easy. They added two more engineers and said: you're going to be the delivery team, so automate the release and migrate GitLab.com to Kubernetes. GitLab.com has millions of users and a lot of traffic, and that migration is a task on its own; it's a huge challenge. So whenever you see someone screaming "Kubernetes", I hope you're going to remember this, like this fake Dilbert comic.

So I accepted the task, the team accepted the task; it was a great challenge. We set out to see what our requirements were for everything we needed to achieve. GitLab.com is a live system, we can't have any downtime, so everything we do, we need to make sure that GitLab.com stays up. We cannot move the timelines, the 22nd remains the same. Engineers need to keep releasing code, because this is our lifeline, so no delays. And you should migrate GitLab.com to Kubernetes in the next three to six months, changing the whole platform, so no time to spare. Great.

But one thing that actually stuck with me the most was the question: is our engineering organization ready for continuous delivery? It's great when you're using all the greatest tools, but how you use them is really, really important. This was the biggest unknown for me; at least with the other three items I knew that I couldn't change them, but could I change the last one?

So what do you actually need to do to prepare your organization for continuous delivery? First of all, your development needs to completely shift left. That means that before things get merged into your main branch, they already need to pass all the testing, all verification, all security checks. Is your testing system solid? Do you have end-to-end integration tests? What do those tests tell you? How are you using data and metrics to inform your deployment decisions? And do you have the capability to react quickly to any sort of change? Unfortunately, for all of these things in 2018 the answer was no. That was my face when I realized this was a humongous challenge, not on the technical side, but on the process side.

So, now that I understood all of my requirements and the challenges I was going to encounter: what is my team spending their time on? I don't like pie charts because they usually don't tell you anything, but this pie chart is created from the data we gathered over the 14-day period where development kind of slows down so that we can prepare for a release. My team spent 60% of their time in that 14-day period babysitting deploys, and then 26% of our time was related to manual or semi-manual tasks that someone had to do: writing the blog post or helping write the blog post, communicating the changes between people, doing various cherry-picks for P1 problems that developers found. By the way, if you trust your developers to understand what P1 means, you are fooling yourself. We also had a manual process where release managers had to do some basic QA, which is kind of silly: the release manager goes in, clicks on the button, oh, the button works, great.
That's a check done. GitLab also had a special thing where Community Edition and Enterprise Edition were built from separate repositories, and we had to merge one into the other, because Enterprise Edition was a superset of Community Edition. That took 10% of our time. So if you take a look at the whole thing: in those 14 days, in two weeks, my team did nothing but sit at the computer and watch, well, paint dry, yes, in this case.

If we changed the 80% of whatever we were doing during this period, we would be able to make sure that we have no release delays, because we would be freed up to make sure that everything happens in time. If we deploy quicker, and smaller chunks get deployed to production, we ensure that there is no downtime, or at least we would be able to control that better. And if we free up all of that time, we would be able to actually start working on the Kubernetes migration that we set out to do. Another thing that I thought was a really great bonus here: while we are making these changes, we would be able to prepare the organization for the incoming change of process. So that is what we set out to do.

If you take a look at cycle time compression, we set out to go there, but we started going down this route instead. One of the items we observed is that everything was tied into this simple process: how developers behave, how the product behaves. We had the 7th of the month as our feature freeze date. At that point we would branch off from the mainline, and we would have a slower-moving branch from which we would do deploys and prepare the release. This reinforced a really "great" behavior where developers would kind of pile up around that 7th, because "I have time, the 7th is in seven days", and then on the 6th at midnight they would panic-merge things. They know that if they miss this deadline, they have to wait for the next month, but if they get in under this deadline, they have a good two weeks to fix any problems that come up. We are creatures of habit, right?

So I thought, what if we, and I didn't think of this by the way, a lot of companies do this, what if we speed this up? Do the same thing, but just more frequently, right? Like, if it hurts, do it more often. So this is the same system, but instead of doing one branch we create three: every week we create a new one (a rough sketch of creating such a branch through the API follows below). Developers get a similar system of: all right, well, I have some time to fix things, but I don't have much time, so I'm going to think twice, do I want to spend time panicking and fixing things quickly, or am I going to make sure that things are actually operational before I merge them? It also gave us a bit more time to make sure that, whatever we deploy this week, we can be certain that by the end of that week the only new thing that is going to bring problems is whatever came in with the new branch.

We also had great help, and that is that we got to use the tool that we built. I mentioned GitLab.com; GitLab.com is one of the biggest instances of GitLab in the world, but we use GitLab to build GitLab, and then we use another GitLab to deploy GitLab. One other advantage we had was access to all of the developers that were working with us, because if we don't get something that we need, they won't be able to deploy their thing. So there was quite a lot of excitement when we came to them and asked: hey, can we improve this feature? How can we get this done better?
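To make that weekly cadence a bit more concrete, here is a minimal sketch of cutting such a weekly branch through the GitLab API. The project path, branch naming scheme, and token handling are hypothetical; this illustrates the idea, not the actual tooling we used.

```python
# Sketch: cut a new weekly release branch from the mainline via the GitLab API.
# PROJECT, the branch naming scheme, and the token env var are hypothetical examples.
import os
from datetime import date
from urllib.parse import quote

import requests

GITLAB_API = "https://gitlab.example.com/api/v4"
PROJECT = quote("my-group/my-app", safe="")   # URL-encoded project path (hypothetical)
TOKEN = os.environ["GITLAB_API_TOKEN"]        # access token with API scope

def cut_weekly_branch(ref: str = "master") -> dict:
    """Create a branch like 'release-2019-09-15' from the given ref."""
    branch = f"release-{date.today().isoformat()}"
    resp = requests.post(
        f"{GITLAB_API}/projects/{PROJECT}/repository/branches",
        headers={"PRIVATE-TOKEN": TOKEN},
        params={"branch": branch, "ref": ref},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(cut_weekly_branch())
```

In the real setup a step like this runs from a scheduled pipeline rather than from anyone's laptop, which is exactly what the next part is about.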
Some of the release tasks that you saw, that 26% of the time, we automated by just taking them into GitLab CI, triggering things through the API and using scheduled pipelines. So if we need to create a branch, it's set up in a scheduled pipeline that triggers every Sunday evening. Any P1 item that comes in, we automatically cherry-pick into the branch that is currently active (there's a sketch of that at the end of this part). We create various issues to track progress through the environments, or the QA tasks that need to be executed. And as I said, GitLab.com gets deployed from a GitLab, so we had to mirror some projects between instances. Another thing worth mentioning here is that GitLab CI was the actual tool, or rather the glue, that makes sure this all pulls in the one direction we wanted. And finally, GitLab ChatOps feels underappreciated, but a lot of the release tasks got automated just because it gave us very easy access to everything we had to do. For example, we connect GitLab ChatOps with Slack; everything is there, it's very convenient, you don't have to change your context.

To explain a tiny bit more how this ended up looking: the happy developer, as you can see on this side, goes through the whole process, review, making sure that their pipelines pass, doing some verification through the review apps that we connected to a Kubernetes cluster, and when they're absolutely sure they want to merge this thing, they merge it. Usually that means it's out of their hands; that's the magnifying glass thing. And the thing that you see scrolling here is our production pipeline. What happened was, we realized that all of the items we had to do were related to moving the semi-automated tasks into CI. A developer's machine is a machine and a CI machine is a machine, so why not just have it there, where it automatically logs things and the release manager doesn't have to sit looking at the screen for six hours while the deploy is running?

Now, one challenge we encountered here was that we had already outgrown the tool we had at that point. So instead of continuing to use that tool, we decided: what are the top two things we need to do to make sure we can deploy safely? First, get the package in. Second, make sure that it's deployed in a certain order. All right, well, that's easy. And we rewrote the tool we had; rather, we didn't rewrite it, we just wrote a new tool using Ansible, and we placed our CI runner on a bastion host that had access to the infrastructure. That was one of the bigger battles, so to speak, that we had to fight, because we had to get a sign-off from security to put what is basically a remote code execution machine into our system. We did get to do that, mostly because we got a lot of insight into how we can actually make this happen.

One of the great things that happened was that we now got to connect all of our environments and do sequential checks as well. So what happens is: when the developers merge whatever they did, we automatically create a new package, that package gets picked up by our CI system and deployed on staging, and we got to put automated QA in CI as well. If the automated QA passes, it progresses to the rest of the environments. That meant the 60% of the time we used to spend is now out there; it just happens, and we don't have to do anything with it anymore.
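As a rough illustration of the automated cherry-picking mentioned above: the GitLab API exposes a cherry-pick endpoint for commits, and a minimal sketch of using it could look like this. The project path, branch name, and token handling are assumptions for the example, not our actual release tooling.

```python
# Sketch: cherry-pick a fix commit into the currently active release branch
# through the GitLab API. PROJECT, ACTIVE_BRANCH, and the token are hypothetical.
import os
from urllib.parse import quote

import requests

GITLAB_API = "https://gitlab.example.com/api/v4"
PROJECT = quote("my-group/my-app", safe="")   # URL-encoded project path (hypothetical)
TOKEN = os.environ["GITLAB_API_TOKEN"]        # access token with API scope

def cherry_pick(sha: str, target_branch: str) -> dict:
    """Cherry-pick the given commit onto target_branch; raises on conflicts or errors."""
    resp = requests.post(
        f"{GITLAB_API}/projects/{PROJECT}/repository/commits/{sha}/cherry_pick",
        headers={"PRIVATE-TOKEN": TOKEN},
        data={"branch": target_branch},
        timeout=30,
    )
    resp.raise_for_status()   # a conflict comes back as an error response
    return resp.json()

if __name__ == "__main__":
    # e.g. the merge commit of a P1 fix that has to land in this week's release branch
    print(cherry_pick(sha="abc123def456", target_branch="release-2019-09-15"))
```

In practice something like this gets triggered automatically when an item is flagged as a P1, rather than a release manager running it by hand.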
So, the finish line. When we enabled the system, over the same 14-day period, we freed up 82% of our time. The CE-to-EE merge also got automated, and the 0.3% you see there is that sometimes the pipeline fails and we need to check it to see why it failed. That's it.

The release tasks still remained relatively high. The biggest chunk of this 17% here is the security releases we need to do. A security release requires a lot of coordination: there are a lot of stakeholders, security teams, development teams, marketing, and a lot of backporting to prior releases. So that remains a big chunk of our time.

In May 2019 we still had the old system, and we had around seven deploys that month, which was standard for GitLab.com. In June we cautiously enabled things and we went to 12, I think. And then, and I'm super proud about this, in August this year we had 35 deploys done on GitLab.com; that means more than one deploy a day. And did I mention that none of this is in Kubernetes? All of this is using our old legacy system. But what happened with this is that we bought ourselves time, so my team has time to actually work on the migration.

One of the biggest changes, though, was in the habits of the engineering organization. People are thinking twice when they click that merge button; they make sure that integration tests are done before they click that merge button. We have all of the developers on call, because once we enabled the system the old habits just got exposed, and we had to call a lot of people to help us out: why is our performance going down? Why are we seeing an uptick in errors? Within two months of us enabling the system, the whole engineering organization, or rather the whole development organization, went on call. Suffice to say, I'm not really super popular there anymore. But where I am popular, or rather where my team is popular, everyone is grateful and everyone is excited that when an issue is found, they can fix it really quickly, deploy it really quickly, and within a couple of hours of finding a problem we can push it out to production, which is a huge, huge change.

Why is this "Kubernetes: the prequel"? Well, because with us freeing up all of that time, my team migrated one of the services that we had to Kubernetes. So if you go to gitlab.com right now and try to do a docker pull or docker push, that is being served from a Kubernetes cluster. We are successfully using our deploy boards, we are successfully using our monitoring, and we are successfully using our web terminals, because obviously we had time to play a bit more than we usually would.

That's it from my side. I wanted to leave you with a bunch of links. If you're interested in how all of this developed, you can check out the design docs that we wrote, you can check out where that pie chart you saw originally came from, the GitLab 10.4 release report, and finally you can follow along with our progress on the Kubernetes migration and what kind of challenges we ran into when we started doing it. Thank you.

Questions? We've got a few minutes for questions, so I'll bring you the mic if you want to ask anything.

So, you're talking about how, sort of, CI/CD enables cloud native, right? And you're talking a lot about having to put the entire organization on call and that kind of thing, right?
How do you get buy-in from executives to make those kinds of changes?

So, one thing that I think executives love is dollar numbers. When they see how much time gets spent on busy work, something that does not contribute to the actual goal of the organization, and you transfer that into a dollar amount and show how you can change that dollar amount into something way less, everyone starts listening. Maybe they don't understand what we actually did here, maybe they do; it totally does not matter. What matters to them is that a developer can fix a problem within a couple of hours instead of two weeks, three weeks, a month.

And then, sort of to follow up on that question: how do you implement CI/CD without having the engineering team be on call all the time? Or is CI/CD just a sinister ploy to extract more productivity out of the engineers?

So, I think the on-call was not made to make developers less productive. The on-call was created to build some sympathy with the people who are actually managing the infrastructure and are at the forefront. And the idea is not to have developers on call; the idea is to teach them how to get themselves out of the on-call rotation. So, think about what kind of problems you need to think about at scale, and how to resolve them properly, without, yeah, just merging randomly. What I think is happening already, within the two-month period that we've had developers on call, is that there are changes in habit, and they're starting to ask the right questions. "How can I get access to the production database, to understand the scale of the problem that I'm trying to resolve?" That is a great question to ask, because now you can provide them the data, they can inform their decisions, or rather they can understand how to fix a problem at that scale, and they're not going to get paged, and their colleague is not going to get paged. The organization is getting better with it, and I really do believe that with time we are going to remove the need for developers to be on call: even if they are on call, they're not going to get paged. And I think that's a great success.

At this point in the story it sounds like you're still releasing once a month. Yes, correct. Are there any plans to increase the number of times you release, eventually?

So, there is a difference between what we do on GitLab.com and what we do for the self-managed release. We already hear from, well, all of you, that having a release once a month is great, but that your organizations can't update that quickly. So we don't necessarily want to change that cadence, but we still use the same tools to release to self-managed as well. What is actually happening right now is that we are getting a tried and tested product earlier, and once we actually say "this is the 11.3 release", what the customer is going to get is something that already ran. I wouldn't say bug-free, because that's not possible, but at least definitely not with as many bugs as you usually have in these types of processes. So the system enabled us to ship faster to GitLab.com with more confidence, and that actually allows us to make sure that we are not going to have more releases for our self-managed customers but actually fewer, because we don't have to create as many patch releases, and we get to focus on building the new features that we want to ship.