So who are we? I'm Ramya, I'm a software engineer at Pivotal. My name is Miriam, I'm also a software engineer at Pivotal.

So what is release engineering in our context? Our product is made up of many different teams, and each of these teams contributes specific parts of the code to us. The release engineering team then takes this code, runs it through pipelines, also called test suites, bakes it into a product, and ships it to the customer. What does this mean in terms of Kubernetes? We're talking about different integrations like Harbor, etcd, Fluentd, et cetera. Teams integrate with these and share code with us, and then the releng team runs it through pipelines or test suites, an example of which is on the slide, to create a product called Pivotal Container Service, and that's the one we ship to customers.

A few things before we proceed. What is Concourse? Concourse is an open source CI/CD tool. It's a way in which you can write tests as tasks, combine them into different jobs, and then chain the jobs into a pipeline. This is an example of a pipeline: you can have different jobs sequenced one after another, and you can have resources, resource types, and inputs and outputs to these jobs. So it's just a way of expressing a test suite; it's analogous to that. It's a pretty cool tool, you should check it out at concourse-ci.org. We'll be referencing it frequently throughout our talk. A few more terms that we'll be using interchangeably: when we say fly, we mean running a pipeline or running a test suite; a pipeline is equivalent to a test suite; a lock means an environment or a test bed; and a pool is a group of environments which are configured the same. We'll show a tiny, made-up example pipeline in a moment to make these terms concrete.

So we're going to tell the story of our release engineering team, and our story starts in April 2018. A new team had just started, they wanted a fresh start, and we were facing two main problems: we were shipping too slowly and we had a lot of bugs in our code. So naturally we asked ourselves, why is this the state that we're in right now? Well, one thing is that all of the scripts and tools we were using had worked at a certain point, but the number of teams integrating with Kubernetes had been growing, and we'd been supporting more and more integrations while the size of the releng team itself had stayed the same. So we no longer had enough people to keep up with all of the work, and we had duct-taped a lot of things together, so we had accrued a lot of technical debt. This was resulting in us releasing slowly, and our customers were unhappy. We broke it down into these particular issues: we had written a lot of scripts that were not being tested; we had a lot of YAML that was duplicated everywhere; the environments being used for tests were in high demand and we didn't have enough of them; our pipelines were taking way too long; we didn't have a well-defined integration process; and we still needed to sort out our open source licensing.

So the first thing we want to talk about is no testing, one of the problems that we had. Why didn't we have any testing in place? When we looked closely, we found that we had a lot of Bash scripts written to do our testing, and Bash is really hard to test-drive, or even just to write tests for after the script is written. It's also intermixed with Linux commands here and there, so it's hard to notice bugs.
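Here is that tiny, made-up Concourse pipeline. It is only a sketch to make the terms above concrete, not our real configuration: the repository URLs, pool name, and task file are invented, and it assumes the pool resource type is available on the Concourse installation.

```yaml
# Hypothetical minimal pipeline: one git resource, one pool of environment
# locks, and two jobs chained together. All names here are illustrative.
resources:
- name: product-source
  type: git
  source:
    uri: https://github.com/example/product.git
    branch: master

- name: environment-lock
  type: pool
  source:
    uri: https://github.com/example/locks.git
    branch: master
    pool: vsphere-nsx          # a pool: identically configured environments

jobs:
- name: claim-environment
  plan:
  - get: product-source
    trigger: true
  - put: environment-lock      # claim a lock, i.e. an environment / test bed
    params: {acquire: true}

- name: run-tests
  plan:
  - get: product-source
    passed: [claim-environment]
    trigger: true
  - get: environment-lock
    passed: [claim-environment]
  - task: conformance-tests
    file: product-source/ci/tasks/conformance.yml
```

You would upload a pipeline like this with the fly CLI (for example, fly set-pipeline), which is where the "fly a pipeline" shorthand comes from.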
It's hard in general to just program in Bash. So we decided to tackle that first, and as a team we decided to move it into a proper programming language. So what could we go for? The two options were Ruby or Golang, and the reason was that the team had prior experience in both: there was a lot of Ruby experience on the team and a lot of Golang experience on the team. Also, Ruby is easy to test-drive, and Golang is useful for concurrency and parallelism, which we could really make use of.

So here is an example of how our scripts looked before. We had an environment metadata file, and let's say we had to connect to this environment. We had all of these properties in YAML, and we had to read through it and find the proxy and the username and password in order to connect to it, and this is how we were writing it in Bash. As you can see, there's yq or awk or sed and other commands being used interchangeably. This, as you can see, is basically untestable, because we would have to mock out so many things.

So then we decided to write Bash-style Ruby, and this was an intermediate step. What we were doing was reading the environment metadata by opening files, reading it into a hash, getting properties out of it, doing whatever processing we wanted to do, and then closing the file, and this was happening pretty much everywhere we referenced the environment in our tests. This is somewhat testable: you can still mock out a few things and get away with it, but it's not completely testable. For example, there's splitting on a colon and then expecting that the first item of the array is going to be present, et cetera. So it's quite error prone.

So we decided to completely change it to Ruby, and what we mean by this is trying to think of the things that you manipulate in your test suites as objects that you pass around. For example, the environment metadata could actually be a lock file class which has properties that you can fill in, and this is super testable in Ruby (there's a small sketch of this idea just after this section). So we tried to think of it in an abstract way and started creating classes for lock files and pipelines, and it was easier to test-drive this way.

All right, so the next problem that our team was facing was having a lot of YAML duplication. A quick Google search showed that everyone else was also struggling with the same problem. So why is duplicate YAML a problem? We were getting a lot of YAML from Kubernetes configuration in general, and our CI/CD tool, Concourse, is completely configured with YAML. So we were dealing with a lot of YAML that looked very similar but was slightly different, and the process of looking at a bunch of files and trying to figure out what was different between them created a lot of cognitive overhead. The second part is that, as part of our process, we were copying and pasting a lot of files and then changing one or two keys, and sometimes our team, being human, forgot to go back and change those keys. So we had mistakes like that. The team was very unhappy: anytime we needed to add pipelines, modify pipelines, do anything, everyone got very sad about it, and it's always important to try to have happy devs. It was taking a lot of time to make these changes, and because we were sometimes copying and pasting and forgetting things, or indenting in different ways, the pipelines were becoming inconsistent.
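Here is the lock-file-as-an-object sketch we mentioned above. It is a hypothetical example in the spirit of what we describe, not the team's actual class: the field names and YAML layout are invented.

```ruby
# Hypothetical "environment metadata as an object" sketch. Field names and
# the YAML layout are made up for illustration.
require 'yaml'

class LockFile
  attr_reader :name, :proxy, :username, :password

  def initialize(name:, proxy:, username:, password:)
    @name     = name
    @proxy    = proxy
    @username = username
    @password = password
  end

  # Build a LockFile from an environment metadata YAML file.
  def self.from_yaml(path)
    data = YAML.safe_load(File.read(path))
    new(
      name:     data.fetch('name'),
      proxy:    data.fetch('proxy'),
      username: data.fetch('username'),
      password: data.fetch('password')
    )
  end
end

# In a spec you can build the object directly -- no temp files, no mocking:
#   lock = LockFile.new(name: 'env-1', proxy: 'proxy.example.com',
#                       username: 'admin', password: 'secret')
#   expect(lock.proxy).to eq('proxy.example.com')
```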
We decided that we really needed clarity. If we go in and change a pipeline, we need to know exactly what makes this pipeline special compared to others, which part we really need to change, and whether we did it correctly. So the first step: we had gotten used to having really long YAML files. Our YAML files were almost a thousand lines, we had a lot of them, and it was really, really overwhelming to constantly be searching through a thousand lines. So we decided to divide them into as many small pieces as we could, and that way we could at least see what we were dealing with.

After this (the slide is cut off, but it says embedded Ruby) we decided to use ERB. ERB is basically a way of using Ruby inside YAML. So this is an example where, for us, we were testing Kubernetes against all these different infrastructures. Instead of just repeating AWS, AWS, AWS, GCP, GCP, GCP, we could now just use these for loops (there's a small sketch of this after this section). That was really helpful and it really dried up our YAML. But there was still a problem with this, and the problem is that when we had Ruby inside of YAML, YAML cares a lot about indentation but Ruby doesn't. So every time we were putting YAML inside of ERB, we needed to make sure that the indentation matched, and we were having a really hard time making sure that the piece we were moving had the same indentation as the place it came from. This was really hard for us, so we decided to try something else.

This time we went with anchors and aliases. If any of you are familiar with YAML, YAML itself has this notion of an anchor, and when you define an anchor, you can just reference it in other places. So we would have one resource type defined in one place, and then we would just use it in a lot of other pipelines (there's a sketch of this below too). This did a really good job of drying things up, but we noticed that it wasn't taking care of the cognitive overhead. We were still spending a lot of time opening up a lot of different files to figure out what we were actually including in the pipeline. And specifically for us, we were seeing that certain parts depended on other parts, but we couldn't even tell that they used each other, because now we had dried up everything so much that we couldn't tell what was failing or why it was failing. The other part of this is that we needed a tool that would actually go in and understand YAML anchors and aliases, so we built a CLI, which is always a bad idea, and that was hard to maintain and everyone was grumpy about it.

We decided for attempt number three that we were just looking at it the wrong way. We were splitting things up based on how they were configured in Concourse, and what we should do instead is look at how our pipelines themselves are split up. We noticed that there are a lot of parts of our pipeline where something we dried up at the top was referenced by something at the bottom, in this case a resource type that's being used by a job. So we looked closely at our pipelines, and sometimes we would have a configuration like this: we had the infrastructure layer, vSphere 6.7 in this case, then the networking layer, which is NSX, then the version of whatever software we're testing, and then the scenario, in this case an install. And we decided to have these templated files, and we're going to have all of these dependencies in one file.
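Here is the kind of ERB loop we were just describing. It is a made-up fragment (the job names, task file, and IaaS list are ours), just to show the shape: the Ruby loop sits between `<% %>` markers, and the YAML it emits has to end up with the right indentation, which is exactly where we kept tripping up.

```yaml
# Hypothetical pipeline fragment in ERB: one install job per infrastructure.
jobs:
<% %w[aws gcp azure].each do |iaas| %>
- name: install-<%= iaas %>
  plan:
  - get: product-source
    trigger: true
  - task: install
    file: product-source/ci/tasks/install.yml
    params:
      IAAS: <%= iaas %>
<% end %>
```

You render it with Ruby's ERB library, for example `ERB.new(File.read('pipeline.yml.erb')).result`, before handing the result to Concourse.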
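And here is a tiny anchors-and-aliases example, again with invented names. The anchor (`&`) defines a chunk of YAML once, the alias (`*`) reuses it, and the merge key (`<<:`) folds it into another mapping; the trade-off we ran into is that you now have to chase the anchor's definition around to understand what a pipeline actually contains.

```yaml
# Plain YAML anchors/aliases. The shared task config is defined once and
# merged into two task definitions.
task-defaults: &task-defaults
  platform: linux
  image_resource:
    type: registry-image
    source: {repository: ubuntu}

run-conformance:
  <<: *task-defaults
  run: {path: ci/run-conformance.sh}

run-upgrade:
  <<: *task-defaults
  run: {path: ci/run-upgrade.sh}
```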
Coming back to those templates: when you actually go to make a new pipeline, all you need is a file like this, and it's completely empty. Just based on the name of that file, you can figure out which templates you need to use (there's a small sketch of this naming trick at the end of this section).

Okay, so the next problem we had was environment contention. To explain what we mean by environment contention: we as a releng team were testing on many different IaaSes. Some were public, like AWS, Azure, and GCP, and then we also had the private ones that we make sure to test on, like vSphere and vSphere with NSX. Nimbus here is an internal tool that VMware uses to spin up NSX environments. So basically we cover all of these in our testing. So why is this a problem? First of all, for public IaaSes, we had many different pipelines, but the number of environments we had was very small, so the ratio was really skewed. We had 50-plus testing configurations to test, but there was no automation around environments. We were basically going to the console, creating them manually, making all these configurations by hand, and forgetting some parts of it. For private IaaSes there were other issues. Nimbus, like I explained before, is based on a quota system: whoever requests an environment first gets that environment, and if the quota for the entire organization is used up, you can't request another one. You just have to wait until someone cleans one up and then you have access to it. So basically we were really sad again. This was causing a lot of human errors and configuration errors and made us sad, like the dog in the picture.

So how did we fix it? We used Terraform, an open source tool which can create environments on the fly. We started having a pipeline for creating an environment and a pipeline for destroying an environment, per IaaS (a sketch of that is below as well). So it was a one-click solution and we just automated it; we never had to bother with creating environments manually again, beyond going to a pipeline and clicking a button. Also, the environments were recycled after 24 hours if they weren't used. For the private IaaSes, that is Nimbus specifically, our infrastructure team from VMware came up with this tool called Shepherd, which pre-made environments and reserved them for the releng team, and we basically started having a shared pool for all the teams that were going to test on vSphere with NSX. So this way we had on-demand environments and less contention, because now we were at least at high priority.

Okay, so another problem that we had is that our pipelines were taking way too long. We were watching a lot of our deadlines pass us by, and we kept asking ourselves why we were always missing them. And also, I totally screwed up the formatting on all of these, it's fine. You didn't notice, you didn't notice. So we looked at the pipelines that were time consuming and saw that they were time consuming for two reasons: the first is that they were really slow, and the second is that a lot of them were really flaky. So we wanted to look at the ones that were slow first.
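Here is a sketch of the file-naming trick from the start of this section. The naming convention, layer names, and template paths are all invented for illustration; the idea is just that an empty pipeline file's name tells you which template fragment to pull in for each layer.

```ruby
# Hypothetical: an empty pipeline definition named after its layers, e.g.
# vsphere-6.7_nsx-t_1.3_install.yml, maps to one template per layer.
LAYERS = %i[infrastructure networking version scenario]

def templates_for(pipeline_filename)
  parts = File.basename(pipeline_filename, '.yml').split('_')
  LAYERS.zip(parts).map { |layer, value| "templates/#{layer}/#{value}.yml.erb" }
end

templates_for('vsphere-6.7_nsx-t_1.3_install.yml')
# => ["templates/infrastructure/vsphere-6.7.yml.erb",
#     "templates/networking/nsx-t.yml.erb",
#     "templates/version/1.3.yml.erb",
#     "templates/scenario/install.yml.erb"]
```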
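And here is roughly what we mean by a create/destroy environment pipeline per IaaS, sketched as Concourse YAML. Everything here is hypothetical: the repos, pool name, and task files are invented, the tasks are assumed to wrap `terraform apply` and `terraform destroy`, and it assumes the pool resource type with its `acquire`/`add`/`remove` params.

```yaml
# Hypothetical one-click environment pipeline for a single IaaS (GCP here).
resources:
- name: pipelines-repo
  type: git
  source: {uri: https://github.com/example/releng-pipelines.git}

- name: env-locks
  type: pool
  source:
    uri: https://github.com/example/locks.git
    branch: master
    pool: gcp

jobs:
- name: create-environment
  plan:
  - get: pipelines-repo
  - task: terraform-apply              # assumed to run `terraform apply` and
    file: pipelines-repo/ci/tasks/terraform-apply-gcp.yml
    # to write a lock file into an output directory called `new-lock`
  - put: env-locks
    params: {add: new-lock}            # add the new environment to the pool

- name: destroy-environment
  plan:
  - get: pipelines-repo
  - put: env-locks
    params: {acquire: true}            # claim an environment so nobody else uses it
  - task: terraform-destroy            # assumed to run `terraform destroy`
    file: pipelines-repo/ci/tasks/terraform-destroy-gcp.yml
  - put: env-locks
    params: {remove: env-locks}        # drop the lock from the pool for good
```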
So, the slow ones first. Our pipelines were doing a lot of tasks serially that could be parallelized, and while most people on our team were comfortable with Ruby, we decided maybe it was time to actually start rewriting these in Go and using goroutines. By being able to leverage concurrency and parallelism, we were able to take a job that ran in every single pipeline from two hours down to 15 minutes. Another thing we wanted to tackle is that in a lot of our test suites we were spinning up a cluster, running tests against that cluster, and then destroying the cluster, and this was also happening in serial. So we decided to have one job create a cluster, pass the information for that cluster to all the jobs that need it, and have those jobs run in parallel (there's a sketch of that layout after this section). This resulted in our pipelines being sped up by four times.

The next thing is that we wanted to look at our pipelines that were flaky. We knew that this was the flow that was causing us pain: a PR would get merged, we would build the product, and then we would try to run it through almost 39 pipelines against different infrastructures, and we would have to trigger this many, many times with the same exact inputs until it went green and we could ship it. What we really wanted was to be able to merge it, run it through all the pipelines once, and see it go green. So one person on our team said, let's just make a pipeline, let's call it run-run-run, and this pipeline is going to do exactly what it sounds like: it's just going to run over and over again with the same commit, and we're going to keep this pipeline around and use it to get a sense of how flaky certain jobs are. We did 112 runs, which took about four weeks, and we realized that our pipelines were succeeding 55% of the time. After some very complicated analysis, which is very small here on the slide, we realized that certain jobs were failing a lot and certain jobs weren't failing as much, and we decided to prioritize these jobs by failure rate (a toy version of that tally is sketched below). This was actually really successful, because we discovered a lot of random errors that we really hadn't been looking into: apparently we were running out of disk, apparently sometimes we were retrying and it was still failing. By being able to look for these, we could go to the specific teams in charge of wherever that integration was failing, and they were able to fix these things, and this made a big difference to our speed overall.

So the next problem we had was no integration process. At this point we had all of these pipelines, all the teams had access to these pipelines, and we had no idea who was triggering what pipeline at what time. This was really early on, before we had any processes in place. We would come to work in the morning one day and find pipelines paused, or pipelines run, or pipelines deleted. Also, teams had no way to test their code. They were pushing changes to master directly and actually debugging after shipping, because they had no way of knowing where their tests were failing, or they had no way to run their tests in the first place. So we could fix some of these issues by restricting our Concourse access: we made it more private, so only the releng team could access it, and we started introducing code reviews and PR processes. But that still doesn't fix the issue of the teams not having a way to test.
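Here is the "create the cluster once and share it" layout, sketched as Concourse YAML. It is hypothetical: the S3 bucket, task files, and job names are invented (and credentials are omitted), but the shape is the point: one job publishes the cluster details as a resource, and the test jobs that consume it can run in parallel with each other.

```yaml
# Hypothetical: one job creates the cluster and publishes its details; the
# downstream test jobs all depend on it and run in parallel.
resources:
- name: product-source
  type: git
  source: {uri: https://github.com/example/product.git}

- name: cluster-info
  type: s3
  source:
    bucket: example-cluster-info
    regexp: cluster-info-(.*).json

jobs:
- name: create-cluster
  plan:
  - get: product-source
    trigger: true
  - task: create-cluster
    file: product-source/ci/tasks/create-cluster.yml
    # assumed to write the cluster details into an output dir `cluster`
  - put: cluster-info
    params: {file: cluster/cluster-info-*.json}

- name: network-tests
  plan:
  - get: cluster-info
    trigger: true
    passed: [create-cluster]
  - get: product-source
    passed: [create-cluster]
  - task: network-tests
    file: product-source/ci/tasks/network-tests.yml

- name: storage-tests          # same shape as network-tests; runs in parallel
  plan:
  - get: cluster-info
    trigger: true
    passed: [create-cluster]
  - get: product-source
    passed: [create-cluster]
  - task: storage-tests
    file: product-source/ci/tasks/storage-tests.yml
```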
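And here is a toy version of the failure-rate tally. It assumes you have already pulled build results out of Concourse somehow (for example with the fly CLI or the API) into a list of job name plus status; the job names and records here are made up.

```ruby
# Hypothetical per-job failure-rate tally over a pile of exported build results.
builds = [
  { job: 'install-vsphere', status: 'failed'    },
  { job: 'install-vsphere', status: 'succeeded' },
  { job: 'install-gcp',     status: 'succeeded' },
  # ... the rest of the 112 runs' worth of records ...
]

rates = builds
  .group_by { |b| b[:job] }
  .map do |job, runs|
    failed = runs.count { |r| r[:status] == 'failed' }
    [job, failed.to_f / runs.size]
  end
  .sort_by { |_job, rate| -rate }     # worst offenders first

rates.each { |job, rate| puts format('%-20s %5.1f%% failing', job, rate * 100) }
```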
So the teams were making pull requests to master, and we were running them in our production Concourse, but every time there was a problem with their code, the only way they would find out was through releng, because master was broken and we were the ones with access to it. So this is where releng as a service comes in. Basically what the releng team did was create an identical Concourse that was available to the teams, and each of them had their own separate buckets and their own separate places to host their code. They had access to our code, so they would run those pipelines in the separate Concourse, and they could also run only what was relevant to them. Let's say they wanted to test only vSphere and AWS: they had access to the YAML files, and all they had to do was run those in this Concourse and test it out. So we just had a script which made this configuration. The teams would change the location of their buckets and their repositories, run the pipeline in the separate RaaS Concourse, and see if it failed. If any job failed, they would fix it and then PR it to us, and then we would have it green in our production Concourse. This way both sides had more confidence: the teams had more confidence because they had tested their code, and we had more confidence because we knew they had a way to test it.

The last part is open source licensing. At this point we had pretty much sped up most of our processes, but there was still one last hiccup: is whatever we are shipping legal? One of the biggest problems of shipping open source software is knowing whether it's legal or not, whether you have an acknowledgement for every piece of open source software that you have included. So who was responsible for this? At this point, it was the releng team. So what does this mean when we're talking about open source licensing, and why is it hard? When we're talking about Kubernetes especially, we're talking about containers, and containers are made up of software packages. They're built on base OSes, and it's still pretty early for tooling here: knowing what is included in these containers, and how to scan them, is hard. Containers can either come from the internet or be created by the teams themselves, and we need to know what's included in each of them. So we started scanning them ourselves. We take each component team's code, and we wrote scripts to scan these containers, find what software packages are included, find their licensing through open source tools, and then produce the OSL, the open source license file (a very rough sketch of the package-listing part is below).

We also became smarter about what to OSL and what not to. For example, we started asking the teams to do their own part of the OSL, because the teams know what containers they include. We also did OSL only for GA builds, that is general availability builds, and not for all the dev builds we were producing. We also started doing this much earlier in the process, as soon as we knew that something was a candidate for a build, and we started providing teams with early access to the builds so they could scan the containers they were including and verify them.

So at this point in time, we had tested code, our YAML had gotten dried up, our process was more defined, we had automated a lot of the environment creation, our pipelines were faster, and we were in good legal standing. So that's always good.
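The container-scanning scripts themselves are internal, so here is only a very rough, hypothetical sketch of the easiest slice of the problem: listing the OS packages inside an image so they can be fed to a license tool. It assumes Docker is available locally and that the image is Debian/Ubuntu-based (so dpkg exists); real images need per-distro handling, and vendored binaries and language-level dependencies still need separate scanning.

```ruby
# Hypothetical: list the OS packages in each container image so the package
# list can be handed off to a license tool. Image names are made up.
images = %w[example/harbor:1.7 example/fluentd:1.3]

images.each do |image|
  puts "== #{image}"
  # Override the entrypoint so the container just prints its package list.
  listing = `docker run --rm --entrypoint dpkg-query #{image} -W -f '${Package} ${Version}\\n'`
  if $?.success?
    puts listing
  else
    puts "could not list packages (not a dpkg-based image?)"
  end
end
```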
At the beginning of this, when the team had started, we were shipping every six months, and now we were shipping a new build every 10 days. Our pipelines went from 12 hours to four hours; a large part of that was parallelization. We were even able to have a turnaround of 48 hours for bugs, CVEs, and patches. Our release time for one of our releases, 1.3.0, was 35 hours from start to finish, from testing to OSL. We got it down to about 10 hours for release 1.4.0, though OSL is not included in that number, it's just the testing part; OSL still takes a lot of time. So, release engineering as a service: it's doing Kubernetes the hard way, but it's worth it. One of the best parts about all of this is that someone made us pie at the end because we made it a lot faster. Thank you so much. Are there any questions?

Very interesting. What scanning tool are you using to scan the licenses? So we developed some scripts internally. We used Bash, but it's actually a really hard process. We did not find any tools on the market that are reliable, so we had to develop it in-house. The basic idea is a little manual at the moment, and mostly written in Bash. For actually scanning the rest of the product we used FOSSology and, I forget the other one, yes, LicenseFinder and FOSSology, but they don't do container scanning. For containers, we did it ourselves. Okay, thank you.

During the infrastructure as code part, you mentioned all the test beds will be decommissioned after 24 hours if they are not used. How do you decide? How do you know if that piece of infrastructure was used or not? So we have a GitHub repo that has the actual lock files in it; the credentials are stored elsewhere. The act of claiming an environment moves it from one folder in the GitHub repo to another folder. So we look at the time that it was moved to that folder, and that's the time that it was claimed. We have a pipeline that runs every hour: it looks at all of the locks in that repo that were claimed more than 24 hours ago and moves them into the destroy pool. So we basically use GitHub and we look at commits, everything is automated through commits. That's how we figure out how long an environment has been in use (there's a small sketch of this check at the very end). Great, thanks.

You just mentioned third party dependency management. I'm not sure, do we have more information about it? Are you asking about OSL or about another aspect? Yeah, I mean, since your project relies on some third party dependencies, either JAR files or RPM files or some other files, and if your project has lots of microservices or projects, all of these services rely on some third party dependencies, then how do all these projects manage their third party dependencies? Is this for OSL, or in general, are you asking? Please go back to your slides. Third party dependencies, I just saw it. Just tell me when to stop. No, not open source. Oh, okay. Oh, next, all dependencies in one file. This one, okay. How do you manage this? That's a good question. So, okay, what we do on our team is every time we make a change to any integration in Kubernetes, we build our product with that change and we have to get an environment that's configured with this configuration.
So what you're looking at here right now, these are all configurations, and for vSphere 6.7, for example, we actually have our own set of servers locally that we're using to test. For NSX, we're actually deploying that networking layer on top. We do this for public infrastructures as well, like GCP, for example, but with GCP we would just do flannel for the networking layer. And the OM, that part is sort of a product that we're also testing against. The scenarios are just: install means we're trying to see that we can get it deployed, get a cluster, and get it to pass the tests; for upgrades, we're trying to make sure that we can go from one version to the version after it. So for all of these, we're basically using a lot of environment management; all of these are just environments. I don't know if that answered your question. Looks like maybe it didn't, I'm sorry.

Oh yeah, okay. So we do have, we have something we brought up, and where did we bring it up? Oh, I was gonna say, just for these environments, the ones on the right-hand side, those are all public infrastructures, so we're using Terraform to be able to deploy everything on those environments. For the left-hand side, we have internal tooling, and we're using that to be able to specify which specific integrations we want in it. If we have, I don't know, Veerli or Harbor or something, then we're specifying it that way. But a lot of these are internal right now; I don't think we've made them available yet, but some of these things, maybe someday when they're ready, we could share them. Did you ask in terms of YAML files or were you asking in terms of environments? Oh, okay, great, awesome. All right, thank you. You can find us after, if you have any more questions. Thank you.
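For the earlier question about recycling environments after 24 hours, here is a rough sketch of the kind of hourly check described: lock claims are git commits that move a file into a claimed folder, so the claim time is the commit time. The pool and folder names are invented, and it assumes the script runs inside a clone of the locks repo.

```ruby
# Hypothetical hourly sweep: move locks claimed more than 24 hours ago into a
# "destroy" pool. Runs from the root of a clone of the locks repository.
MAX_AGE_SECONDS = 24 * 60 * 60

Dir.glob('vsphere-nsx/claimed/*').each do |lock_path|
  # The claim time is the timestamp of the last commit touching this path.
  claimed_at = `git log -1 --format=%ct -- #{lock_path}`.strip.to_i
  next if Time.now.to_i - claimed_at < MAX_AGE_SECONDS

  target = File.join('destroy-vsphere-nsx/unclaimed', File.basename(lock_path))
  system('git', 'mv', lock_path, target) &&
    system('git', 'commit', '-m', "recycle #{File.basename(lock_path)}: claimed > 24h ago")
end
```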