I'm going to go ahead and get started so I don't keep you during lunch. I'm Mira Gary, and I work at Pivotal. This talk was written in collaboration with Josh Hill, who unfortunately can't be here because he has kids and travelling to the States is too far for him. We're in the London office. It's based on our experiences on the platform recovery team, which builds BBR, a tool some of you might be familiar with. Anyway, here's how not to build a pipeline: lessons learned from doing it wrong repeatedly. And the clicker has decided to stop working, and I've lost my mouse.

The talk is split into two parts. Part one is mostly about build anti-patterns, things that can go wrong in an individual job or task. Part two is flow anti-patterns, things you can do wrong when connecting your pipelines together. Obviously, connecting things together is an important part of pipelines.

So, part one: builds. Builds pass. It's great. Everyone loves it when it passes. And builds fail. That's also okay, totally fine. Sometimes your tests don't pass, and it's good that you failed. But then sometimes you get an error. This talk is mostly focused on Concourse, because that's the tool we use to run our pipelines, and if you error and Concourse turns all orange on you, that is not a place you want to be. Errors are wrong. The visual is that it's gone orange. I unfortunately don't have a picture of one, even though the team I'm on right now did, for a while, have a lot of those orange builds. Usually it means you've misconfigured something, and to fix it, you just correct the pipeline definition. Some examples: you've got the wrong credentials, so you tried to log into your repo and it just errored. Or you've got a typo in the task name. That's a fun one. It's just like, nope, there's no such task, you get an error.

Another fun gotcha: if you're using Concourse with the Git resource, GitHub has an API rate limit for unauthenticated requests, and the Git resource polls just frequently enough to hit it, so your builds will also go orange if you don't authenticate. Easy fix: just make a private key for your CI and log in to GitHub with it. Problem solved.

Another fun one. Errors are wrong, but everything carrying on flowing when something has failed, just saying it's okay, is also wrong. We've seen this one before. The error actually happened at the deploy stage, but nothing visibly went wrong until you got to the run stage, and you look at the run task and you're like, what's wrong? Oh, there's just no environment at all to be testing against, so what are we doing? If your task fails, it should actually fail. Tasks are code, so you should treat them like code and test them, and test the unhappy paths too. Make sure that when things go wrong, your task actually fails so your pipeline stops.

Another fun one: bad bash. We write a lot of our tasks in bash and use it as the glue to hold everything together. ShellCheck is your friend. ShellCheck will tell you when you've done something wrong in bash. This example here, rm -rf $VARIABLE/*: ShellCheck will scream at you, because that can expand to rm -rf /* if the variable is unset. That exact line actually ran on every BOSH director. Great. You don't want to run that.
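As a rough sketch of the defensive version (the variable name here is made up, and the set flags are explained in a moment):

    #!/usr/bin/env bash
    # Fail fast: -e exits on error, -u on unset variables, -o pipefail on pipe failures
    set -euo pipefail

    # ${WORKSPACE_DIR:?} aborts with an error if the variable is unset or empty,
    # and the quotes prevent unexpected word splitting, so this can never
    # expand to rm -rf /*
    rm -rf "${WORKSPACE_DIR:?}"/*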
Also, if you're using bash, set -eu and set -o pipefail. These ensure your task fails if something goes wrong: -e exits on error, -u exits on an unset variable, so in the previous example, if the variable were unset you'd just exit instead of rm -rf'ing /, and -o pipefail exits when a command in a pipe fails.

Okay, here's another one. Our thing failed, but what went wrong? Anyone know? I don't know. There's no output. It failed, and I have no idea why. You don't want to have to hijack, or I guess the correct term now is intercept, a container to figure out what went wrong. You should produce debugging output so you can debug your tasks when they fail. The treatment is to log things to standard out and standard error; Concourse will just print them for you. One convenient way to do this is set -x, which displays every command as it runs. But that's also a little unsafe, because if you've got credentials going into your commands, they're suddenly printed in your Concourse output, and Concourse is not the most secure thing in the world. Maybe you don't want to leak your creds, so be careful with that one. But generally, more output is better than less.

Here's a good one: flakes. This is a picture of an actual pipeline, and all those red jobs aren't actual failures. We were having some real fun networking issues, and some things were just dying because the network dropped the connection. That's not good, because if you have flakes, you learn to ignore them: oh, it's fine, I'll just run it again. And that can hide real problems, so you always need to investigate them. A good real-life example of a flake: my girlfriend has a little sensor on her door that tells her if the door is open, but it's not that well attached, so sometimes it just goes off and sends a notification that the door is open, and you're like, it's fine, the door is closed, everything's fine. Well, she was going away on a work trip and had just left, and I got this notification that the door was open, and I'm like, eh, it's fine, who cares? Then she messages me: can you check my door? I was in and out three times because I kept forgetting things, and I'm not actually sure I shut it. So I went over to her house, and yeah, the door was open. Hunt down flakes, because they can be hiding real errors, and if you're ignoring things because, oh, it's just a flake, you might miss the real signal.

Another problem: copy-pasta. You copy-paste things and end up with three different tasks that are almost identical, each taking one slightly different config input. Infrastructure as code is still code; all the normal things we do with code apply to it, and you can refactor it. This task here needs refactoring. We can make one general task and then use input_mapping to map the specific input onto the general one. There's also output_mapping, which does the same thing for outputs. It lets you write general tasks with a very small API and make them specific at each usage. So yeah, refactor your tasks.
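A minimal sketch of that shape, with all names invented for illustration: the task file declares one generic input called env, and each job maps its own resource onto it.

    jobs:
    - name: test-gcp
      plan:
      - get: gcp-env
      - task: run-system-tests
        file: ci/tasks/run-system-tests.yml   # task declares a generic input named "env"
        input_mapping: {env: gcp-env}         # map the specific artifact onto it
    - name: test-azure
      plan:
      - get: azure-env
      - task: run-system-tests
        file: ci/tasks/run-system-tests.yml   # same task file, different mapping
        input_mapping: {env: azure-env}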
Another one, another common copy-pasta. This is actually more of a flow thing, but sometimes you're like, oh, we need to add another step, so I'm just going to copy this piece of YAML, paste it into the other piece of YAML, and now we've got a new pipeline config. You apply it, everything's great, and then you look at it and go, wait, that's not right. I wanted to configure things before I ran the tests, not in parallel with the tests. So when you copy and paste things, actually edit them too, don't just paste. You can look at your pipeline and see that it looks wrong, that things aren't flowing the way you wanted. Not great, but easy to fix: you just go edit your YAML and set the pipeline again.

Ooh, this is an interesting one. You should be able to run tasks again and again and again. Concourse helps you a lot with this because it runs things in fresh container images, so it's very repeatable. But if your tasks have side effects, like pushing to GitHub, maybe it's not so repeatable. Make sure your jobs are idempotent, so you can run them again and again without building up state and cruft. If you do have state that needs to build up, it should live in resources, which we'll talk about more in a little bit; those are the pipes that flow through your pipeline, and that's where your state should live, not in your tasks. As an example, we wrote a task that pushed to Git, because we were using git-crypt and at the time the Git resource didn't support it. It was usually fine, and then sometimes something would go slightly wrong and all of our Git repos would end up completely garbled. You had to go manually ungarble them before you could run the thing again, because the task had pushed encrypted nonsense in a way that overwrote everything. Thankfully we were able to fix that: we made a PR to the Git resource, and now it supports git-crypt. But yeah, if you have state, put it in a resource, not in a task.

Oh, here's another one. This job failed, not because of anything wrong in the job, but because every time we ran the test it created a new S3 bucket and never deleted it, and you can only have 100 S3 buckets on an account. So run number 101, we'd made another change to the code, we ran it, it failed, and we're like, what went wrong? Oh, we never cleaned up those buckets. We added a step to clean them up, and now it works. Other similar things: sometimes you have quotas, like how many VMs you can deploy in an AZ. Anytime you have a quota, make sure you delete things afterwards, because otherwise you can run out. Automate your cleanup.
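A minimal sketch of that kind of cleanup step, assuming the test buckets share a made-up name prefix and the AWS CLI is available in the task image:

    #!/usr/bin/env bash
    set -euo pipefail

    # Delete every leftover test bucket; "bbr-system-test-" is a hypothetical prefix.
    # "|| true" keeps the pipeline from failing when there is nothing to clean up.
    aws s3 ls \
      | awk '{print $3}' \
      | { grep '^bbr-system-test-' || true; } \
      | while read -r bucket; do
          # --force empties the bucket before removing it
          aws s3 rb "s3://${bucket}" --force
        done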
All right, on to flow, the more interesting part, because this is where you glue everything together and your pipelines actually flow. When everything's green, everything flows, you're happy, you ship products, you continuously deliver everything, and we're in the world of dreams. That's great, love it. This is an actual pipeline that builds a CLI: it has some unit tests, some integration tests, a variety of system tests, and at the very end it cuts a release and publishes it. It's wonderful. Pipelines can also fail and block, and nothing flows. And that's great too: if something's wrong, you don't ship it. You don't want to publish a new version of a thing if it's wrong, so if something fails, it blocks until you fix it. The first rule of the pipeline is that everything has to pass at every step in order to reach the end; that's why it's a pipeline. Concourse does exactly what we tell it to do in the definition, but sometimes we tell it nonsense. That often happens because pipelines are opinions: we have opinions about which tests should run when, and sometimes our opinions change.

So, plumbing. Here's what the first section of that pipeline looked like about nine months ago. A small change flowed through, and the job that ran system tests against a BOSH director failed. We looked at it: what went wrong? Well, there was no BOSH director. So we started trying to figure out where the BOSH director went, and then we looked around and thought, oh, that's actually quite clever, we automated deploying the BOSH director. We could just click that job, run it, and bring the director back. We were very thankful that our earlier selves had remembered to automate that, and once we ran it everything worked again. But maybe that job should have run beforehand. This is no pipe: the deploy-BOSH-director job should have happened before we ran tests against the BOSH director. The visual is that the jobs weren't linked, and to fix it, we made a resource and added a passed constraint. Now the BOSH director has to be deployed before we run the tests against it. Perfect.

Okay, so that's what it looked like after we added that passed constraint. Now deploying the BOSH director happens before the system tests, but it's a loose dependency; there's a dashed line, not a solid line, which means it has to have happened already but it doesn't trigger anything. Then we started thinking a little more: hmm, is this right? It turns out the test releases are also a prerequisite for the system tests, but when we made changes to those test releases, nothing triggered, so we could actually cut a release without testing everything. Not great; we were missing builds. So we added a trigger to make sure the job fired automatically on new output from the previous job. Now it's a solid line.
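In Concourse terms, the fix looks roughly like this; the resource and job names here are invented:

    jobs:
    - name: system-tests
      plan:
      - get: bosh-director-state        # hypothetical resource representing the deployed director
        passed: [deploy-bosh-director]  # must have made it through the deploy job first
        trigger: true                   # solid line: a new version triggers this job
      - get: test-releases
        passed: [upload-test-releases]
        trigger: true
      - task: run-system-tests
        file: ci/tasks/run-system-tests.yml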
All right, fans. Sometimes you have lots and lots of tests, and things fan out and fan in and get complicated. Here's another pipeline. I don't know how well you can see it, but there are all these tests that run in parallel, and everything fans out and goes all over the place. It's got a bunch of integration tests between all sorts of different components we had to integrate with. One day we added a new integration: we copied the system tests and the config, all the new tests passed, and everything was fine. But we forgot to fan back in. So one day that job failed, but the pipeline was still green, and we're like, what happened? We could have cut a release that failed those new tests. Thankfully, we also visually inspect the pipeline before we cut a release, and it was red, and we went, nope. So we fixed it, but yeah, you need to fan in. This was an incomplete fan-in, so a version that failed the tests could have been delivered. The lines were just inconsistent: you saw those lines flowing down into the void, going nowhere. The treatment was to add a passed constraint, and now those lines fan in and everything's great.

But here we have another one. Now we've zoomed in more, and you can see there are lines everywhere, flowing all over the place. Everything depends on everything. Actually, not everything depends on everything, but we had wired everything to everything, so anytime anything changed, say we made a change to the Azure integration, we'd also run the tests against GCP, even though those tests had nothing to do with GCP and we hadn't changed any GCP code. We were over-testing, so we needed to remove some of those lines. Really, you should minimize the number of passed constraints: the things you actually depend on should be passed constraints, and the things you don't depend on shouldn't be. The visual is that mess of pipes going everywhere where you can't see anything. Maybe check whether you really need all those lines.

All right, let's talk about the resources flowing through the pipes. This pipeline builds a Docker image, tests it, and promotes it. Everything looks good, right? But let's look at what actually flows through the pipeline. It's the repo with the Dockerfile that flows through. Seems fairly reasonable. But what about the candidate image it builds? We built the candidate and tested the candidate; where did it go? How does the pipeline promote the candidate if it doesn't have it anymore? What it was actually doing was building the image, testing it, and then building it again. And of course that image has unpinned versions in it, so if someone changed a version number in between, we'd have no idea whether the promoted image still worked. We don't think we ever actually cut a Docker image with this problem, but we had to fix it. Now the candidate image itself flows through: we build it, we test it, and then we promote it by just adding a tag to the release candidate to make it the release, so we're not building it twice. I think we just noticed it; we didn't run into any issues. Someone was trying to add another thing to the Dockerfile and noticed the pipeline didn't seem right. This was the wrong resource flowing through. Which resource flows through is tricky and important, and getting it wrong can cause very subtle issues, like shipping a thing you didn't test. Not great. So make sure you get the right thing flowing through your pipelines.
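Sketched with the registry-image resource as I understand its parameters, with all names invented, and assuming the build task produces an OCI tarball at image/image.tar (as the oci-build-task does):

    resources:
    - name: candidate-image
      type: registry-image
      source: {repository: myorg/my-cli, tag: candidate}
    - name: release-image
      type: registry-image
      source: {repository: myorg/my-cli, tag: latest}

    jobs:
    - name: build
      plan:
      - get: repo
        trigger: true
      - task: build-image              # e.g. the oci-build-task, producing image/image.tar
        file: ci/tasks/build-image.yml
      - put: candidate-image
        params: {image: image/image.tar}
    - name: test
      plan:
      - get: candidate-image
        passed: [build]
        trigger: true
      - task: test-image
        file: ci/tasks/test-image.yml
    - name: promote
      plan:
      - get: candidate-image
        passed: [test]
        trigger: true
        params: {format: oci}          # fetch the exact tested image as a tarball
      - put: release-image             # retag the tested candidate; nothing is rebuilt
        params: {image: candidate-image/image.tar}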
Here's another pipeline. It has three rows that are all the same shape: deploy a Cloud Foundry, run the disaster recovery acceptance tests (DRATS) against it, and then delete the Cloud Foundry. There are three different versions. One just runs normal DRATS. One runs DRATS in hard mode, which backs up a Cloud Foundry, changes all the things, restores it, and checks that everything is back to its original state. DRATS-with-destroy deletes the entire Cloud Foundry, brings up a new one, shoves the state into the new one, and checks that that works too. And then there's a PR flow for when people make PRs to the test suite, to test those. So, great: we deploy the Cloud Foundry, we run the tests, and then we delete the Cloud Foundry. And we knew only one of these jobs could run at a time, so each of these trios is in a serial group.

But one day the system tests failed, and the reason was that the jobs triggered in the wrong sequence. Someone pushed two changes, one after the other. Let's replay what happened: it deployed, it tested, it deployed again, it deleted, and then it tried to run the tests. Because serial groups only enforce that one thing runs at a time; they don't enforce the order. So the tests had no chance of passing, because the deployment had already been deleted. This is actually a missing resource. There is state in this system, namely the fact that there's an external environment with a Cloud Foundry deployed on it, and that state just wasn't represented at all. You need to add a resource to represent it. Thankfully, Concourse has a resource for exactly this: the pool resource, which is built in. It lets you represent state that isn't otherwise contained in your system. It's great.

All right, let's go over what we talked about. There were all sorts of pipeline anti-patterns: build anti-patterns and flow anti-patterns. All of these things are important, and fixing them can make your system work better, make your pipelines happier, and ensure you don't ship a product that doesn't meet your standards. That's all. All of the platform recovery pipelines and Concourse itself are open source; you can go look at them on GitHub. And our Concourse is publicly accessible, sort of: you can see it, you can't do anything to it, obviously, but it's all online. So thank you for listening.