Hi everyone. We're just about to get started, but before we do, I need to read the fire exit announcement. Please note the locations of the surrounding emergency exits and locate the nearest lit exit sign to you. In the event of a fire alarm or other emergency, please calmly exit to the public concourse. Emergency exit stairwells leading to the outside of this facility are located along the public concourse. For your safety in an emergency, please follow the direction of the public safety staff.

Great. So hopefully you're all here for this talk: we're going to be talking about transforming a bank with a highly opinionated, automated release pipeline. I was trying to win the prize for the longest title at the conference, so we'll see.

So who am I? I'm Reid, Reid Lebeck. I work at RBC, which is a bank in Canada. I've been there for eight and a half years, and I've been on the cloud team for the last two and a half or three years. In that time I've written several microservice-based applications, and we've started trying to roll that approach out to the rest of the organization and write some tools to help other teams do it.

So what are we going to talk about today? For those of you who haven't worked at a big organization, a government or a bank, that sort of thing, I want to take you through how we've done things in the past, some ways we tried to fix that, and then the problems that led to how we actually solved them.

Hopefully everyone knows what a monolith is: a big, giant thing that's unruly. That's not really the interesting bit, though; the interesting bit is what comes out of having monoliths. We end up with long development cycles, typically waterfall, on the order of three months. If you're lucky, you get a deployment out. And each application deployment is a special snowflake, so you need to work with every other team: the network team, the database team, the five other app teams that also use your database, et cetera, et cetera. That's a mess, and we don't really want to be doing it anymore.

The other thing to note is that we do a lot of ETL. A lot of the bank's applications revolve around the same three steps: get data from somewhere, transform it into the right format, and put it somewhere else. Traditionally that's been done with a database, then some process, then another database. Our idea at the bank was to stop doing that, which I thought was a good one, and to do event-driven architecture instead.

Event-driven is the new hotness. The idea is that you take your data and turn it into events, and then you write event processors instead of batch jobs. Streaming is ultimately better than batch, because you can always express a batch process as a stream: reading a file in is just streaming the file, right? So we chose Kafka as our event bus, and we wanted to put the data into Elasticsearch, because we have a lot of search use cases. Someone calls up the help desk, hey, can you help me here, and we need to look up that user's information; Elasticsearch is perfect for that. And we didn't want to do monoliths anymore, so we decided to do microservices: put a REST API in front, and the data is exposed to the rest of the bank so we don't have to go through this mess all over again.

So, briefly, the application architecture. The reason I want to focus on this is that what we've built is an opinionated framework for this type of architecture. We didn't write a framework for every application at the bank; we wrote a framework for event-driven applications using Kafka and some sort of data store, in this case Elasticsearch. The data comes from somewhere and goes into Kafka. We have a consumer which does some sort of transformation and puts the data into the bespoke format we need for answering the question. That question is answered by the REST API, which ultimately just asks Elasticsearch, and the client connects to the REST API. It's a fairly simple architecture, so we should be able to do something that makes it pretty easy for everyone.
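To make the consumer half of that picture a bit more concrete, here is a minimal sketch of what one of those event processors could look like, assuming the kafka-python and elasticsearch client libraries; the topic, index, and field names are made up for illustration, not the ones actually used at the bank.

```python
# Minimal sketch of an event processor: read events off Kafka, reshape them,
# and index them into Elasticsearch for the REST API to query later.
# Topic, index, and field names are illustrative only.
import json

from kafka import KafkaConsumer          # kafka-python client
from elasticsearch import Elasticsearch

consumer = KafkaConsumer(
    "customer-events",                            # hypothetical source topic
    bootstrap_servers=["kafka:9092"],
    group_id="customer-search-indexer",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
es = Elasticsearch(["http://elasticsearch:9200"])

for message in consumer:
    event = message.value
    # Transform the raw event into the bespoke shape the search API needs.
    doc = {
        "customerId": event["id"],
        "fullName": f"{event['firstName']} {event['lastName']}",
        "lastUpdated": event["timestamp"],
    }
    es.index(index="customers", id=doc["customerId"], body=doc)
```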
So we went from monoliths to microservices, and now we have a whole new set of problems. All the processes at the bank are set up to deploy very infrequently. When you deploy every three months it's really hard, so you want to be really careful, you want to be super diligent, and you have to coordinate everyone. So there are all kinds of processes around change records and getting everyone coordinated. But if you want to release once an hour, once a day, once a week, that becomes quite onerous, and you spend more time doing the process than doing the actual deployment. And the things those processes are trying to ensure aren't necessarily bad things: making sure testing is done, making sure you're not going to break stuff, regression testing, et cetera, et cetera. The idea is that we want to automate those checks rather than leave them as a manual process.

The other thing is that as we get more microservices, things get hairy in a hurry. If all our operations are set up around monitoring one application, we're going to be in for a world of hurt once messages are sitting in Kafka and our API is only eventually consistent. Things get a lot harder. And every team has to do this themselves: every team has to set up monitoring, every team has to set up a pipeline to production, every team has to figure out how to solve all these problems. And that sucks.

So we need a pipeline, something everyone can just use and benefit from. There are too many pieces to do this ad hoc. Before, you could get away with some person typing on a keyboard and it'd be all right, but now, with even five microservices, you're just going to be typing forever. We want something repeatable, and we want some way to automate all those processes away. What do we want to automate? Testing, deployments, rollbacks, and ideally zero-downtime deploys. Those are not necessarily trivial, so it would be nice if everyone didn't have to build them all themselves.

All right, so the first time we tackled this was for one application. We were on the cloud team, I'm part of the cloud team, and we were building the first of these event-driven architectures. We had a very tight deadline, about two or three months, which is fairly tight when you're starting from nothing, and incredibly tight when you work at a bank and everything has a three-day SLA just to get access to anything. So what did we have? We had Cloud Foundry, we had our code in GitHub, and somehow we needed to get it from one to the other. There was nothing in between; there was no pipeline we could leverage. We had a few tools, though.
We had Jenkins, a central Jenkins. We opted for Maven because we knew Maven, no other reason than that. And that Jenkins was already up and running, so we didn't need to set up our own. That was a small win.

We had this other tool called Urban Code Deploy. Quick show of hands: who's heard of that before? Wow, that's way more than I thought. Okay, for those of you who haven't, Urban Code Deploy is an IBM tool that lets you draw pretty pictures, and then those pretty pictures will somehow deploy your application. It had a few nice features. One, it was bank-approved, so we didn't have to go through the process of getting whatever tool we wanted to use approved; we could just go ahead and use this one. For those who've worked at big organizations, you'll know that's a non-trivial task. The other thing is that it let us store secrets: it was a place approved for storing usernames, passwords, et cetera, which is good when you need to connect to databases or Kafka. The bad part is that we couldn't store our config in source control, in our case Git. That's bad because any time you make a change to the code and also need to update the pipeline, and we were still fairly new so that happened a lot, you have to go into this UI and make the change. The pipeline doesn't change along with the code, so you end up in a situation where one branch needs the pipeline changed and master doesn't, and you can only deploy one of the two. That was no good. And it's a UI, and I don't really like using UIs for development; it's just not my thing.

So we decided to use Ansible. Who here has used Ansible before? Okay, who here has used Ansible to deploy to the cloud? Yeah, okay, we'll get to that. The good: we knew Ansible, and with a tight deadline we wanted to leverage things we knew rather than reinvent everything. We could store the config in Git, so it moved along with the branches, and we could have different config per environment. Ansible has a concept of group_vars, so you can have a dev config and a prod config, which is good because you're going to have different servers for Kafka, Elasticsearch, et cetera. The bad: Ansible is really good at deploying to servers. Say you want to install Tomcat on ten servers: you just list the servers and away it goes. That's great. Deploying to Cloud Foundry? Not so great. There were no plugins for it, so we had to write a bunch of that ourselves in scripts, Python scripts I think it was. So that was a bit awkward.
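To give a flavour of what that glue looks like, here's a rough sketch, not the actual script, of the kind of thing you end up writing when your deployment tool has no Cloud Foundry plugin: shell out to the cf CLI from Python. The org, space, app, and artifact names are placeholders.

```python
# Rough sketch of glue code for driving Cloud Foundry when your tooling has
# no CF plugin: wrap the cf CLI in a tiny helper and call it step by step.
import subprocess

def cf(*args):
    """Run a cf CLI command, failing loudly if it returns non-zero."""
    result = subprocess.run(["cf", *args], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"cf {' '.join(args)} failed:\n{result.stderr}")
    return result.stdout

# Placeholder org, space, app, and jar names.
cf("target", "-o", "my-org", "-s", "dev")
cf("push", "my-consumer", "-p", "target/my-consumer.jar", "-m", "1G")
```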
We also wanted to achieve idempotency. The idea behind Ansible is that if, say, Tomcat is already installed on server A, it doesn't do anything, and we wanted the same behaviour for Cloud Foundry. Part of that was driven by the fact that we used mono repos, or rather a mono repo. All our microservices were in one repo, and we can argue about whether that's good or bad; in this case it was both. Because every change was going to rebuild and redeploy everything, we wanted to make sure we only redeployed what had actually changed. And so we did some fun things, and by fun I mean terrible ideas that you should never try at home: we decided to use MD5 sums to figure out whether things had changed.

So we took the MD5 sum of the jar we were deploying, great, stored it in an environment variable in Cloud Foundry, and we could just compare against it the next time, right? The problem, and I don't even know how we noticed this, is that jars have a timestamp or something in their headers that changes every time you build, so if you zip up the same files twice you get a different MD5 sum. Okay, we can work around that. So we expanded the jar into its files, took an MD5 sum of each file, put those into a text file, and then took the MD5 sum of that text file. It got complicated really quickly. But it did serve its purpose: we only redeployed what had actually changed.
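For the curious, that hash-of-hashes trick comes out to something like the sketch below; again, please don't do this at home. The jar path and the stored digest are placeholders.

```python
# Sketch of the "hash of hashes" workaround: a jar's own MD5 changes on every
# build because of entry timestamps, so hash each entry's contents instead,
# write them into an md5sum-style listing, and hash that listing. The result
# is what gets stored in a Cloud Foundry environment variable for next time.
import hashlib
import zipfile

def content_digest(jar_path: str) -> str:
    lines = []
    with zipfile.ZipFile(jar_path) as jar:
        for name in sorted(jar.namelist()):
            entry_md5 = hashlib.md5(jar.read(name)).hexdigest()
            lines.append(f"{entry_md5}  {name}")
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()

new_digest = content_digest("target/my-consumer.jar")
old_digest = "..."  # previously stored as an environment variable on the app
if new_digest == old_digest:
    print("nothing changed, skipping redeploy")
```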
The other tricky bit was services. We had a few user-provided services pointing at Kafka, Elasticsearch, and any other services we needed, and it's very easy to update a user-provided service; that part doesn't really need to be idempotent. The problem is that if you're not deploying an application that the service is bound to, you need to restage that application, so we had to do some finagling to figure out which applications were bound to a service that had changed. It got way too complicated, way too quickly.

We then had a second application, very similar to the first, with a very similar event-driven architecture. We also had a tight deadline, because we did such a good job on the first one that they figured we could do an even better job on the second, seeing as we'd solved all the hard problems. There were still a few things we wanted to fix, though. We were going to reuse the existing pipeline because we didn't have time to rewrite it, but we wanted things to go a bit faster. I think the first application had five or ten microservices, somewhere in that range, and deploying all of that every time gets really slow. So we wanted to deploy them all at the same time; there are no real dependencies between the applications, so we could do five cf pushes in parallel. And all of this, again, was driven by the fact that we had a mono repo.

So we added some parallelism. Our first attempt was to have Urban Code Deploy call Ansible five times with different parameters: deploy application A, application B, application C. But that broke our cardinal rule of putting nothing in Urban Code Deploy, because any time something changed we had to go back into that UI-based system, which wasn't in Git, so that was no good. What we did instead was put bash in front, and you know you're in trouble when you're doing parallelism in bash. We had a bunch of execs and waits, and eventually we achieved some parallelism. So that was great.

The other thing we really wanted was the automated testing bit. The first application was a UI, and we just didn't have time to automate that; automating UI testing is difficult. This one was all back-end stuff, so we could easily send some data in the front and check that it came out the other end in the right shape. The problem with our automated testing, though, was that we ended up running it against a live instance. We would deploy the application, and because everything was so entangled we had to have all the services up, and then we ran the tests on that. That's okay, but if you find a problem it's kind of too late: you have to roll everything back, and that takes time.

So this was definitely a lot better, but we needed to work on a few things. What were our shortcomings? Well, it took too long. The whole thing took about 20 minutes even if you didn't change anything, and more often than not you'd change some central library and have to redeploy everything. We had five to seven developers, and we all wanted to get work done, right? So every time we went to deploy, we'd wait 20 minutes. On top of that, our automated testing was a bit flaky, so we were never sure whether it had just timed out or it was an actual failure, and we'd end up rerunning it. We were lucky if we got four or five pull requests done in a day, so it was really slowing us down. The other bit, if you haven't figured it out by now, is that we'd picked the wrong tools. Ansible wasn't really the right tool for deploying to the cloud, bash is fine but not that robust, and Urban Code Deploy wasn't really buying us much. And on top of that, there were still a ton of manual steps. If we wanted a new Kafka topic, a new Elasticsearch index, or a new Cloud Foundry space, that was all manual. So if I wanted to deploy to my own particular space, I'd have to set up the whole pipeline, all the pieces, myself.

So welcome to the third and ultimate pipeline. This time we had a bit more time, so we decided to rewrite from scratch: we threw away the code, but we kept all the lessons we'd learned. We still had a few things. We had GitHub; that worked well. We had Cloud Foundry; we wanted to keep that. And we wanted to keep Maven; we like Maven, we know Maven, so we kept it.

Now, one of the things we wanted to do, and you'll hear a lot of other people talk about pipelines and focus just on the deployment side, and that's great, it certainly solves one part of the problem, but we wanted to go a bit further. We wanted to, not necessarily lock people in, but give them a head start on building their code as well. These applications are fairly standard: everyone's going to have more or less the same dependencies, Elasticsearch, web, Kafka, et cetera, so we can at least fix the versions. So we have a bunch of defaults on the CI side. We have a default parent POM you can just inherit from, which is great. We have some Jenkins jobs which will scan your repo and grab your Jenkinsfile, and that's all pretty standard, right? But then we wrote some functions which we added into our Jenkins, so you can just call those from your Jenkinsfile. You just say, do the CI, or do the CD, and you don't have to worry about all the steps, stuff, and we'll get to what that is, that isn't necessarily straightforward. So we have Jenkins. We decided to do our scripting in Python: we have some experience in Python, it's pretty good for calling REST APIs, which we were going to use for Cloud Foundry, and we can run stuff on the shell as well. So it's good.
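As an aside on that Python-plus-REST-API point, driving Cloud Foundry that way can be roughly this simple; a hedged sketch that borrows the OAuth token the cf CLI already holds and lists apps through the v3 API. The API URL is a placeholder.

```python
# Sketch of calling the Cloud Foundry REST API from Python instead of going
# through a config-management tool: reuse the CLI's OAuth token and hit the
# v3 endpoints with a plain HTTP client. The API URL below is a placeholder.
import subprocess
import requests

CF_API = "https://api.cf.example.internal"

# `cf oauth-token` prints a "bearer ..." token for whoever is currently
# logged in with the cf CLI.
token = subprocess.run(
    ["cf", "oauth-token"], capture_output=True, text=True, check=True
).stdout.strip()

resp = requests.get(f"{CF_API}/v3/apps", headers={"Authorization": token})
resp.raise_for_status()
for app in resp.json()["resources"]:
    print(app["name"], app["state"])
```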
And the other thing we used is Docker, and this is so key. We ended up running Jenkins on Docker, and the jobs run inside Docker containers on Jenkins. That's super important, because now we don't need to worry about things like: is the CF CLI installed? Do we have the right version of Python? We need to scale out to another node, what's our procedure to set everything up? I know it sounds like a talk we could have given five years ago, but it's pretty revolutionary for RBC to run this stuff on Docker, and it helped us tremendously, because once we had those base Docker images with the prerequisites baked in, we could just reuse them over and over again. It's great.

The other thing we decided to do was go multi-repo. We didn't like the way the mono repo was working out for us. I know some companies have made that work tremendously well, like Google and Facebook, but if you look at how they've done it, they end up using a lot of custom tools. And Concourse, we're at a Cloud Foundry Summit, so we might as well mention Concourse, does this very well: you can point at a subfolder inside your Git repo and say, only build if this subfolder changed. But we didn't really want to go learn Concourse; we had enough to do, so we stuck with Jenkins. We just wanted it to be: you make your change at the top level of the repo, and there you go. So we went multi-repo, and every microservice has its own repo in GitHub.

The other thing we did was automatically provision the Elasticsearch and Kafka and Cloud Foundry stuff, and for that we created a DSL. It's in YAML; that's how we got to talk here, you know, if it were JSON they'd kick me out. And it's just simple stuff: what's your app called, how much memory do you need, what's your topic called. It's not all the options, so it wouldn't be everything you'd put in a manifest, for instance; it's just a subset, because there's a lot of stuff you don't really need to worry about, like where the Kafka servers are or where Elasticsearch is. We take care of that in the pipeline. You just need to say you're going to use Elasticsearch, you want your index to be called this, here's the JSON for it, and such and such. And that file gets checked to make sure it's okay, because you want to know sooner; you don't want to go all the way to the deploy and have it fail because you had a syntax error. We want to know that as soon as possible, so we check it here, in CI.
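The talk doesn't show the DSL itself, so the field names below are invented, but the fail-fast check amounts to something like this: parse the YAML descriptor early in CI and complain about anything missing or unrecognised before a deploy ever starts.

```python
# Hypothetical approximation of the CI-side descriptor check: load the YAML
# early and validate the handful of fields the pipeline cares about, so a
# typo fails the build rather than the deployment. Field names are made up.
import sys
import yaml

REQUIRED = {"app_name", "memory"}
OPTIONAL = {"instances", "kafka_topics", "elasticsearch_indexes"}

def validate(path: str) -> dict:
    with open(path) as f:
        spec = yaml.safe_load(f) or {}   # a YAML syntax error raises right here
    keys = set(spec)
    missing = REQUIRED - keys
    unknown = keys - REQUIRED - OPTIONAL
    if missing or unknown:
        sys.exit(f"{path}: missing {sorted(missing)}, unknown {sorted(unknown)}")
    return spec

spec = validate("pipeline.yml")
print(f"ok: {spec['app_name']} with {spec['memory']} of memory")
```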
Then what we do is put the artifacts in S3. Now, we could argue all day about the best artifact repository; some people like Nexus, other people like Artifactory. We chose S3, partly because we had more control over it and didn't have to rely on other teams. It's nice to use stuff you have control over. The other bit was that part of our goal was for each branch to get its own pipeline all the way through. So if I'm working on my branch, reid, and my colleague is working on his branch, john, then we'll have a reid space, a reid Kafka topic, a reid Elasticsearch index, and so on and so forth. I can deploy everything all the way out, and we won't conflict with each other. The nice thing about using S3 is that we can name the buckets after the branches, or after whatever we're deploying, so we have a prod bucket and so on, which makes cleanup really easy. Whenever we delete the branch, we just delete the S3 bucket, and life is a lot simpler.

And then, grand plans which haven't quite been realized yet, I'll be honest: the idea is to have a chatbot in between the CI and the CD. When you're deploying to dev or maybe staging, it's fine to just kick it off automatically, but especially at a bank you'll want to hold your prod deployment behind some sort of manual step, and that manual step will be someone saying, slash pipeline-bot, deploy. So that's how that will work.

All right, so that was the CI; let's have a look at the CD. What we do is take the environment parameters, things like server locations, API URLs, et cetera, and combine those with the application parameters to get the full picture. So we know it's going to be called Reid's Awesome App, and we know it needs a large amount of memory, so we give it four gigs, and so on and so forth. We end up with something we can actually go off and deploy. Then we provision the Kafka topics if they're not there, again, idempotency, and create any Elasticsearch indexes if needed. And then we do the actual deployment.

The idea is that we do a blue-green deployment, because we want zero downtime as much as possible. I can never remember which colour is which, but we deploy the green one first. And what we did this time is run the automated testing against the non-live route: we have a temp route pointing to the new app, and we do the testing there, which is great, because if the testing fails we just don't switch the route over, and if it does pass, we switch it. The idea is that this automated testing is something the development team provides: they plug something in there, it calls whatever testing needs to happen, and we just check the return code, success or failure. And then we finalize the deployment and make the new app live.
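Stitched together, and with placeholder app, domain, and script names, that blue-green flow is roughly the sequence of cf calls below; this is a sketch of the idea, not the pipeline's actual code.

```python
# Rough sketch of the blue-green sequence: push the new copy without a route,
# test it on a temporary route, and only swap the real route over if the
# team-provided tests pass. App names, domain, and test script are placeholders.
import subprocess

def cf(*args):
    subprocess.run(["cf", *args], check=True)

def run_team_tests(url):
    # The development team plugs their own suite in here; a non-zero exit
    # raises and aborts the deployment before any route is switched.
    subprocess.run(["./run-tests.sh", url], check=True)

cf("push", "myapp-green", "-p", "myapp.jar", "--no-route", "--no-start")
cf("map-route", "myapp-green", "apps.example.internal", "--hostname", "myapp-temp")
cf("start", "myapp-green")

run_team_tests("https://myapp-temp.apps.example.internal")

# Tests passed: make the new app live, retire the temp route and the old app.
cf("map-route", "myapp-green", "apps.example.internal", "--hostname", "myapp")
cf("unmap-route", "myapp-green", "apps.example.internal", "--hostname", "myapp-temp")
cf("delete", "myapp-blue", "-f")
```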
So this is a picture of what the developer experience looks like now that this pipeline exists. All they really have to do is inherit from the parent POM and add a few parameters: here's my app name, here's the memory, here's my Kafka topic. That's it. They commit, and away they go. And this was so good that while we were still developing it, we hadn't finished everything, we'd only got little bits done, and we were working on the pieces that weren't in the pipeline yet, a little room of about four of us, and we kept turning around and saying, oh, why can't we just use the pipeline for this? It's so good. That was the point at which I knew we had something special.

So thank you very much, that's it for the talk. We have time for a few questions. Anyone want to grill me?

Very good. So the question is, what does Cloud Foundry do for us in the pipeline? Cloud Foundry is just our deployment target; we use it as the platform as a service where we run stuff. Does that answer your question? It's the target we deploy to.

That's a good question. So, where we run Docker is still in flux. There's talk of running it there, and there's talk of running it elsewhere.

So did we have to get approvals for the artifact storage? Sure. So the question is, did we have to get approvals to store our data that's hidden in this wonderful green box, and is that an internal S3? It's an S3 API on top of internal storage, so in that case we didn't have to get approvals to put stuff out on public cloud. This whole thing runs internally.

Yeah, good. So the question is, how much work was this, and was it frustrating that it felt like it should already be available somewhere else? The answer is probably too much work, or not enough, I don't know. It felt like a lot of work at the time, but the results are certainly there. I wish there was something we could have used straight out of the box, and there are some efforts; shout out to Spring Cloud Pipelines, I know Marcin is doing a great job with that stuff. It didn't quite fit our use case; we'd gone far enough down the road that we weren't able to use it. But it really depends on how much control you want to give up. If you want to do everything for your developers, or as much as possible, then nothing off the shelf is going to end up creating your Kafka topics and Elasticsearch indexes as well. If you just want to deploy to PCF, then that's okay. The idea is that on top of this we'll add the security checks, we'll add whatever bits are currently manual processes, in here, as automated checks. If we'd used something out of the box, would those features have been available? So it's kind of a trade-off. Anyone else?

Yeah, yeah. We just end up calling different APIs at different points. But the idea is we do a cf push with a lot of the "no" options, dash-dash-no pretty much everything, and then we assign a temp route to it, which is really all we need. Eventually we assign the real route, delete the temp route, and delete what the real route used to point to. And then we clean up any applications left over from failed deploys in the past. So it is, sadly, just a set of steps; the same way we'd pipe together a bunch of Unix commands is really the way we did it.

Yeah. So the question is, how long did it take us to get there? Probably about three months from the time we'd finished the second pipeline to here. A month of that, quite frankly, was spent deciding which direction we wanted to go. Making decisions: I can stand up here and say we're using multi-repo, but getting to the point where we'd decided that was the right thing to do, that we wanted Jenkins, that we wanted the GitHub org plugin in Jenkins, that we wanted Docker, those sorts of questions, and how much control we give to developers, that's really where we spent a lot of our time. Coding it up wasn't terribly difficult; the hardest bit was figuring out what we wanted to do. There you go.

Yeah, so the question is, did we look at Concourse? There are kind of two parts to the cloud team at RBC. There's the platform team, which runs Cloud Foundry and Kafka and Elasticsearch, and they use Concourse for that. On the development side, we didn't have enough experience with it. I played around with Concourse a bit, and one of the problems I found, and I don't know if it's better yet, is that you have to build a lot of the things you get for free out of the box with Jenkins, stuff like having a workspace that carries all the way through across all your Docker containers. You get that for free with Jenkins, and we had a ton of experience; I think between us there was something like 30 years of administering Jenkins. So we had a ton of familiarity, and it felt silly to throw that away when we had so many other questions to answer.
Yeah, go ahead. Yeah, the chatbot. So the chatbot isn't implemented, so this is all theoretical, but I'll talk about how it would work. The idea is a service on PCF that would talk to, in this case, Slack. It would say, hey, your build's done, hey, your deploy's done, et cetera, and then for the higher-level environments you would ask it to deploy the application. That would be your interface: instead of going to the UI in Jenkins, you'd do it through a chatbot. So that's the idea. Anyone else? Awesome. All right, thanks so much for listening. Appreciate it.