Thank you for the nice introduction. So when maintaining CI/CD pipelines, we're always compelled to automate more of the process. And the more we automate, the slower the pipelines usually get, unfortunately. So today I'm hoping to share some insights, tips, and tricks about making your builds faster and your deployments and software delivery more enjoyable. My name is Zan. I'm a developer advocate at CircleCI, based in London, UK, and my greatest joy is enabling and inspiring developers all around the world. If you'd like to get in touch, I'm quite active on Twitter, twitter.com slash zanmarken, or you can email me at zan at circleci.com. So without further ado, let's begin. What we'll cover today: first, some motivation for why we should really be interested in our build times and stay on top of them. Then we'll look at some ways and techniques we can use to track and measure our build times, identify problems and bottlenecks, and decide how to improve them. Then we'll actually look at how to react to those problems, with practical examples. And we'll finish with a summary of what we covered, where to go next, and some Q&A. I hope you enjoy it. I've lived through this problem of builds taking too long, so I know this issue very well, and I'm pretty sure some of you, if not most of you, have experienced it before as well. My story about builds taking too long is about five or six years old. I was on an Android development team, and we managed to create an app that took eight-ish hours to build, essentially a whole day. What this meant was I would get up, check my email, and see that CI had failed. Great. I'd poke around, make some changes, try to get the build passing. I'd go to work, spend most of my day working, and just before starting my journey back home, I would discover, oh, the build has failed again.
So I'd make those changes again and try to get it passing. And this would repeat itself, for several days in a row, until we got on top of it. Our team was quite small, four or five developers, so the blast radius was limited, but it was still a very frustrating time for everyone because we couldn't really deliver, we couldn't ship. So this is my personal motivation for giving this presentation and hopefully solving this problem for some of you as well. Before we go deep into builds and how everything comes together, let me take a step back and cover some of the basics that we'll be repeating throughout this conversation. CI/CD sits between developers committing code to their version control systems and productionizing those applications. CircleCI is the leading CI/CD platform: we take code from your version control system, build it, test it, run a bunch of scripts, and deploy it pretty much anywhere, from mobile apps to Kubernetes applications. I'll show you briefly what the platform looks like and how we actually configure it, so that when we talk about this later, you'll know what it's all about. The platform, app.circleci.com, is essentially a dashboard of your pipelines and projects. Let's look at one. This is one of my demo applications, a Node.js application, and I've configured a bunch of workflows, jobs, and builds. Each time I commit something to the GitHub repository connected to it, that triggers another build. What CircleCI does is take a file called config.yml, a YAML file in the .circleci directory of the corresponding repository.
It reads and analyzes that file, and that's how we describe what we want our CI/CD to do. In our case, we're defining a bunch of jobs: build and test, vulnerability scan, deploy Docker, and a workflow called node-test-and-deploy. We'll see this when we return to the dashboard. We specify what we want our workflow to look like, in what order those jobs should execute, and so on. Anything can go in those jobs, and they can run on various different environments, from Docker containers to macOS virtual machines to Windows machines, in different sizes and on different platforms. Going back to my pipeline, you can see that each commit I've made has triggered it, and it does what I told it to do. In our case, I've configured my pipeline to do some testing, some building, a vulnerability scan, and then wait for me to approve before deploying to Docker Hub as a new Docker image. But whatever you're building, the idea is the same. To recap the terminology: a workflow contains everything that gets triggered from a VCS commit; a job is an individual testing, verification, or building step; and the whole thing is called a pipeline. OK, now that we know how everything comes together, let's go back to build speed. Why do we really want to care about build speed? The way I see it, teams that don't wait for their builds are more productive, can ship more, and are more effective developers. They're happier developers because they're more productive. Faster builds also mean we're able to react and action change faster, and actioning change faster is very important in this day and age. And that means not only faster passing builds, but also faster failing builds.
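To make the structure concrete, here is a minimal sketch of what such a `.circleci/config.yml` could look like. The job names, image tags, and commands are illustrative stand-ins, not the exact demo project's config:

```yaml
version: 2.1

jobs:
  build-and-test:
    docker:
      - image: cimg/node:16.10   # illustrative image tag
    steps:
      - checkout
      - run: npm ci
      - run: npm test
  vulnerability-scan:
    docker:
      - image: cimg/node:16.10
    steps:
      - checkout
      - run: npm audit           # stand-in for a real scanner step
  deploy-docker:
    docker:
      - image: cimg/base:stable
    steps:
      - run: echo "push image to Docker Hub here"

workflows:
  node-test-and-deploy:
    jobs:
      - build-and-test
      - vulnerability-scan
      - hold:
          type: approval         # manual approval gate before deploying
          requires: [build-and-test, vulnerability-scan]
      - deploy-docker:
          requires: [hold]
```

The `workflows` section is where the ordering lives: jobs list what they `requires`, and the `approval` job is what produces the "wait for me to approve" step mentioned above.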
We really want a fast signal telling us, OK, our code is not doing what it's supposed to do, because some tests are failing or something else is failing, and we don't want to wait eight hours for that. We want minutes, or up to an hour, whatever is appropriate for your project. Conversely, slower builds mean you're less capable of change, less capable of adapting, and your business is ultimately less competitive. That's why we really need to care about build speeds. Having said that, let's look at how we might measure and track our build durations in order to get on top of them. First off, how do you know that something has gone wrong, that something is broken with your build? You really want to start there, because you can't operate on a hunch; you need to know exactly what's going on. I've identified a few signals to look for, though they're by no means all of them. You could see that your build times have been increasing: you measure and track each build, and one week a build takes 10 minutes, and the next week you're up to 20 minutes, so something must have happened for your build time to double. Builds could be breaking more often than they were before, which is another indicator that something is off. And ultimately, your team will tell you, in retrospectives, that they have a problem with CI/CD. CI/CD is the last thing you want to come up as broken in retros, because it means you're not able to ship, not able to do so automatically, and you're relying on a lot of error-prone local builds and manual processes to action the change you're trying to deploy.
There are a couple of good questions to start with when you're looking to benchmark your builds. First, what's happening? Are your builds slower? Are they more error-prone, more flaky? You might also be able to identify which parts of your builds are the most problematic: maybe it's not the whole build that's failing more often, maybe just one part of it has suddenly become slower, flakier, and less stable. How stable are your builds? Can you say, yes, today we're at a 90% success rate, and last week we were at 95%, so where did those 5% go? And ultimately, are they deterministic or flaky? Ideally, you want to be as deterministic as possible: you run a build on a single commit and it's always going to either pass or fail, never sometimes passing and sometimes failing. For speed, we're mostly going to focus on what's happening to your builds and on identifying which parts are the most problematic, because what you'll often find is that a certain job or a certain test takes longer than the whole rest of your build. We recently released CircleCI Insights, which tells you just that: how long your builds are running. Mine went from one minute fifty-something seconds to about five minutes; the reason is that I added a manual hold step that waited for me to approve it, but that time still counts toward my build. You can also see this per job: for whichever job you're running, running tests, running a deployment, Insights shows how long it takes, and in my case vulnerability scanning is what takes most of my build. Usually you'll find something like that when you look.
Obviously you can do this with CircleCI, which is super convenient, but any build tool can tell you it ran for 50 seconds or 45 seconds, and if you're keen, you can drop those numbers into a CSV or a spreadsheet and analyze them yourself. But if you're using CircleCI, you have those insights right there as you're developing, which is pretty cool. So now we know what's happening to our builds, which parts are slower and which are faster, so we know where to focus our efforts. Now we can look at some techniques for optimizing our builds. The first technique is quite obvious but very easy to forget: software is built on machines, and the more powerful the machine, the faster it will perform. Fortunately, that's quite easy to do in CI/CD: you switch one line of config and you're on a larger machine with more RAM and more CPU cores, and builds should go faster. If you weren't using a service-based CI/CD and had to rely on your local machine, you would have to replace the processor, the RAM, all that, which takes a bit more time. A common indicator of performance lag that's easily remedied by ramping up the resources you allocate to your builds is the out-of-memory error. I used to work on an Android team; Android uses Gradle as a build tool, with a lot of Java and Kotlin compilation, which is quite memory-intensive. As apps grow, and they do when you're adding dependencies, you consume more and more memory, and back then the default machines had 512 MB of RAM, which was just not enough for most of those jobs. So you really need to start increasing that to avoid out-of-memory errors.
Sometimes you're running builds and tests that can utilize more processor cores, and that's when more cores are a good idea. What I've found for smaller applications, things you can reasonably build locally, is that a useful benchmark is to build locally and see how long that takes on your machine, with, say, 16 GB of RAM and eight or sixteen cores, and then check whether your CI/CD is taking five times as long. If that's the case, it's very easy to just increase the resources, and it's going to work. If you're building something that can't reasonably be built on a consumer-grade developer machine, then obviously you have a bigger problem, but that's an easy benchmark I've found useful with CircleCI. Here's a snippet of the config file I showed you earlier. The way to choose a resource class is to specify resource_class in the job you're defining. You have a bunch of options available for Docker, for machine executors, for all the different environments you might be writing your job for; they're all listed in the docs, and I'll share links to all of these things later so you can find them easily. By default you run on two vCPUs with 4 GB of RAM; if you want something closer to a modern desktop-grade PC, go xlarge or 2xlarge and it's immediately a lot faster. But obviously not everything is as easy to improve as just throwing a bigger machine at it, because bigger virtual machines cost more, and you'll see diminishing returns on performance, especially if you're not utilizing all of the vCPUs and all of the RAM.
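The resource class change really is a one-line edit to the job definition. A sketch, with an illustrative image tag (the exact vCPU/RAM figures per class are the ones documented for the Docker executor and may change, so check the resource class docs):

```yaml
jobs:
  build-and-test:
    docker:
      - image: cimg/node:16.10
    # default is "medium" (2 vCPUs / 4 GB of RAM);
    # bump it for memory- or CPU-hungry builds
    resource_class: xlarge
    steps:
      - checkout
      - run: npm ci
      - run: npm test
```

Machine, macOS, and Windows executors have their own sets of class names, so the valid values depend on which executor the job uses.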
So if you can't optimize just by having a bigger computer, you have to go parallel and be clever about it. One thing you can do is orchestrate your jobs to run in parallel. The workflow I showed you actually had two jobs running in parallel, and this speeds up the entire workflow, because instead of running jobs one after the other, you run a bunch of them together and wait only for the slowest one to finish, as opposed to the slowest plus the next slowest plus the next, all the way down. In my example here, I'm splitting unit tests, static code analysis, and a dependency vulnerability check into parallel jobs which operate independently, then come together with their results when they've all passed, and only then trigger some kind of deploy job. In the workflow definition (all of these code snippets are CircleCI config YAML), you define your unit test job and pass in a requires argument telling it what jobs it depends on. For instance, we have something that requires us to build first, and then we can run the unit tests, static check, and vulnerability scan together before running deploy, which specifies that all three are required. That's a very easy thing to do, and in fact, if you just list all the jobs one after the other without using requires, they will all run in parallel by default. So you can get a lot of performance just from the defaults, though sometimes people explicitly chain jobs one after the other, and that's something you can fix by parallelizing. But it's not only workflows with a list of jobs; you can also parallelize tests.
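The fan-out/fan-in pattern just described can be sketched like this; job names are illustrative:

```yaml
workflows:
  test-and-deploy:
    jobs:
      - build
      # these three share the same "requires", so they run in parallel
      - unit-tests:
          requires: [build]
      - static-analysis:
          requires: [build]
      - vulnerability-scan:
          requires: [build]
      # fan back in: deploy only starts once all three have passed
      - deploy:
          requires: [unit-tests, static-analysis, vulnerability-scan]
```

The total workflow time is then roughly build plus the slowest of the three middle jobs plus deploy, instead of the sum of all five.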
That's something we could have benefited from in my earlier example, but it wasn't available then, and we actually had to figure out manually how to split all of those tests. If you have a single functional test job, in our case one that went through all the application screens and ran a bunch of end-to-end tests like a user would, you can consider splitting those tests across parallel jobs that each work on a small chunk of the whole test suite. If your test suite takes one hour, or eight hours, and you split it across six parallel jobs, you'll likely end up running for 10 or 15 minutes instead, which is a substantial, significant improvement in your build performance, and your sanity as well. How do you do this? CircleCI comes with the CircleCI CLI tool, which is installed on all the Docker images and machines you might want to use, and you use the circleci tests glob command to generate a list of test files, which you then pass to circleci tests split. That lets you split by file names, class names, or timings. I personally like timings the most, because it can work out which combination of tests in each of the parallel jobs will run for approximately the same time, so you don't end up with one job that's twice as long as the others; you end up with a normalized job length, even though you'll be testing a couple of classes here, a couple of classes there, a mix and match of everything. You're likely to end up with a longer test command which passes the list of tests to run to your test runner. In our case, that's a yarn example; yarn is a package manager and build tool for Node.js projects.
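Put together, a timing-based test-splitting job might look like the sketch below. The glob pattern, directory layout, and the exact yarn invocation are assumptions for a hypothetical Node.js project:

```yaml
jobs:
  test:
    docker:
      - image: cimg/node:16.10
    parallelism: 6              # six containers, each runs one slice
    steps:
      - checkout
      - run: npm ci
      - run:
          name: Run a timing-balanced slice of the test suite
          command: |
            # glob all test files, then let the CLI pick this
            # container's slice based on historical timings
            TESTFILES=$(circleci tests glob "test/**/*.test.js" | \
              circleci tests split --split-by=timings)
            yarn test $TESTFILES
```

On each container, `circleci tests split` emits a different subset of the globbed files, so the six slices together cover the whole suite.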
After the tests are done, if you're using timings, you need to store your test results, which sends all that information to CircleCI. Not only does it tell you in a nice way whether your tests are all green, or which tests have failed, it also measures the time each test takes, and that helps CircleCI make an educated estimate of which tests to combine together so that your total run is as short as possible. The last thing to do is set the parallelism value to however many machines or containers you want to run your parallel jobs in. The rest happens magically: CircleCI figures out, OK, the first parallel job runs this first chunk, the second runs the second chunk, and it all comes from that test-splitting command you saw, which figures everything out automatically. I mentioned store test results; let me actually show you this in a workflow. If you go into your test job, you can see here whether your tests are passing or not, and you can analyze that. So that's what we've covered so far: making sure your jobs run in parallel where possible, and splitting your tests so that you cut very long-running test suites into several smaller, shorter-running ones. Next up, when we have so many new, smaller test suites to run, we can use some clever caching techniques to speed up their startup times, for example. If you're using Node or Java or any dependency-heavy project, you'll see that npm install, yarn install, or Gradle downloading all your dependencies can take a really long time. And imagine that this happens every single time you run a parallel test job.
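The two remaining pieces, `parallelism` and `store_test_results`, slot into the test job like this. The JUnit reporter setup is an assumption (any reporter that writes JUnit-style XML works; `jest-junit` and its `JEST_JUNIT_OUTPUT_DIR` variable are one common choice for Jest-based suites):

```yaml
jobs:
  test:
    docker:
      - image: cimg/node:16.10
    parallelism: 6              # how many containers run this job at once
    steps:
      - checkout
      - run: npm ci
      - run:
          name: Run tests, writing JUnit XML for the timing data
          command: yarn test --reporters=default --reporters=jest-junit
          environment:
            JEST_JUNIT_OUTPUT_DIR: ./reports
      # uploads pass/fail status and per-test timings; the timings
      # are what --split-by=timings uses on the next run
      - store_test_results:
          path: ./reports
```

The first run has no timing data yet, so the split falls back to an even division; subsequent runs get the balanced slices.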
You're basically adding 10, 20, 50 seconds, even a minute, to each of those builds, and even though they run in parallel, you're still adding that time. Secondly, you can skip some compilation time by compiling once, storing the results, and reusing those outputs in jobs that run subsequently. Let's look at caching dependencies first; it's the easiest one to achieve. If each time you commit you need to do an npm install and download everything, you can use the cache to make sure that when your dependencies don't change, you just reuse the same local cache storage, which really makes things faster. The way to do this is to add steps to the job that does the dependency installation. After we've checked out the code, we call restore_cache, passing in a list of keys it's going to look for. We key on the branch name and the checksum of the package lock file, because by hashing that file, it's easy to figure out whether the dependencies have changed. Then we run npm install, which should be very quick if we already have the cache, and afterwards we save the cache. So we've introduced two commands, restore_cache and save_cache. As a bonus, if you're using CircleCI orbs, which are reusable configuration packages, the Node.js orb's test job comes with all of these cache steps written for you, so you really don't need to worry about it; it happens automagically, which is pretty cool. Next up, we have caching between jobs; the first one was caching between workflow runs.
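A sketch of the dependency-caching steps just described, using CircleCI's cache key templates; the `deps-` key prefix is an arbitrary choice:

```yaml
jobs:
  build-and-test:
    docker:
      - image: cimg/node:16.10
    steps:
      - checkout
      - restore_cache:
          keys:
            # exact match on branch + lockfile checksum first,
            # then fall back to the newest cache for this branch
            - deps-{{ .Branch }}-{{ checksum "package-lock.json" }}
            - deps-{{ .Branch }}-
      - run: npm install        # near-instant when the cache was restored
      - save_cache:
          key: deps-{{ .Branch }}-{{ checksum "package-lock.json" }}
          paths:
            - ./node_modules
```

Because the key embeds the lockfile checksum, any change to dependencies produces a new key, so `save_cache` writes a fresh cache and stale `node_modules` never get reused.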
So the first was between commits, and now it's between jobs. If you have jobs that run sequentially, you can utilize a workspace that passes files across them as well. For example, if you have a heavy compilation step that produces a lot of built artifacts, and your functional testing job essentially uses those artifacts and runs tests on them, then you can build once and skip that step altogether, running the rest very easily. The workspace essentially mounts as a local file system that you can copy files into from one job and read from in another, which is pretty cool. What we have here is two jobs, one called flow and the other called downstream, and our flow job is the first one to run. It calls persist_to_workspace; you pass in where you want this to go, and it copies whatever you told it to copy. In our case, that's a workspace echo-output file, which is just a hello world. Then our downstream job, which comes after the flow job, first attaches the workspace, and you tell it where, a temporary workspace directory, and then it can use the files just as if it had produced them itself. So if you have a compilation-heavy step, you can take those binaries and reuse them as if you'd already built them. Pretty clever. You can see we've defined the first job, and then the second one, downstream, requires the flow job before it proceeds. The last thing I want to show you about caching is caching Docker layers. This is useful when your output is Docker images. Docker layers are those individual commands, WORKDIR, COPY, RUN npm install and so on; each of these is a layer in a Dockerfile.
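The flow/downstream example described above can be sketched like this (this mirrors the shape of CircleCI's workspace example; paths are illustrative):

```yaml
jobs:
  flow:
    docker:
      - image: cimg/base:stable
    steps:
      - run: mkdir -p workspace && echo "Hello, world!" > workspace/echo-output
      - persist_to_workspace:
          root: workspace        # directory whose contents get persisted
          paths:
            - echo-output        # file(s) within root to carry forward
  downstream:
    docker:
      - image: cimg/base:stable
    steps:
      - attach_workspace:
          at: /tmp/workspace     # where the persisted files get mounted
      - run: cat /tmp/workspace/echo-output

workflows:
  flow-then-downstream:
    jobs:
      - flow
      - downstream:
          requires: [flow]
```

For a real project you'd persist build output (compiled binaries, a `dist/` directory) instead of an echoed file, and the downstream test or deploy job would attach it rather than rebuilding.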
CircleCI will let you cache each of these layers so that whenever you're building the same image again, say we've changed something on line 12, all the layers from lines 1 to 11 are immediately available from the cache, and we only need to execute line 12 onward of our Dockerfile to build our image. If nothing has changed, the image is built almost instantaneously, which is pretty cool. To do this, you need to use either a machine image or the setup_remote_docker environment. setup_remote_docker essentially gives you a bunch of Docker-related tools to build images and interact with Docker; just pass docker_layer_caching set to true as an argument to the setup_remote_docker instruction, and it does everything for you. There are some limits here: I think around 50 Docker cache volumes can be created, and they cannot be reused across different jobs; each job running at one time can only use a single volume, and you can have 50 of those. So if you want higher parallelism than that, you need to think about it a bit differently. And lastly, Docker layer caching does not apply when you're just running jobs in Docker containers, as opposed to building images, so the Node jobs we showed earlier don't really need it. Next up is choosing what you want to run, and when. Maybe your CI/CD is set up with an extensive functional test suite, coverage tests, and a bunch of different moving parts, but you don't really care about all of them at all times. For example, if you're just reviewing a pull request, or just pushing a commit to your source code repository, you may only care about unit tests and maybe a few integration tests, and that's it.
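A minimal sketch of a job using Docker layer caching; the image name is illustrative:

```yaml
jobs:
  deploy-docker:
    docker:
      - image: cimg/base:stable
    steps:
      - checkout
      # spins up a remote Docker engine; with layer caching enabled,
      # unchanged Dockerfile layers are reused on subsequent builds
      - setup_remote_docker:
          docker_layer_caching: true
      - run: docker build -t myorg/myapp:latest .
```

Note that the layer cache benefits the `docker build` running against the remote engine; the job's own executor container is unaffected.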
So you can choose what gets run when: on a commit, on a PR, on a tag, or even on a cron job, and filter on branches, tags, that kind of thing, to decide which part of a workflow, or which workflow, actually gets triggered. There's a filters argument that goes into jobs in workflows, where you basically specify when you want a job to run. For example, this one runs the full functional test suite only when we're touching the main branch; when a pull request goes into main, you want to run all the tests, but if you're just working on a feature branch, you don't have to worry about it. Next is triggering on a schedule: maybe you want to set up a nightly build that deploys to some kind of nightly environment, and that's when you can use the schedule trigger, which takes a cron expression and works on whatever branch you specify, main usually being the most common one. That can really speed up your flow. The last thing I want to mention here is that not all bottlenecks are technical in nature. Sometimes there are human factors. We are all human, and often there's some lack of trust involved that leads to lengthy approval processes, with a lot of people needing to add their stamp before we can publish, before we can release. You saw how my build earlier went from one minute to five minutes just because I added a hold step. CI/CD is a super useful tool, and it should help you win that trust from your stakeholders and team members by showing that change really can be actioned and managed in an effective manner. If you have a lengthy approval process, maybe set up some smart notifications so that the person responsible gets pinged on Slack or similar. So think about this: it's not always technical factors. Hi, Zan.
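Both patterns, branch filters and a scheduled nightly workflow, can be sketched like this; workflow and job names, and the cron time, are illustrative:

```yaml
workflows:
  functional-tests:
    jobs:
      - full-functional-tests:
          filters:
            branches:
              only: main        # skip the heavy suite on feature branches

  nightly:
    triggers:
      - schedule:
          cron: "0 2 * * *"     # every night at 02:00 UTC
          filters:
            branches:
              only: main
    jobs:
      - nightly-deploy
```

With this in place, feature-branch commits skip the expensive functional suite entirely, and the nightly deploy runs without anyone pushing a commit.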
Right on time. Sorry to chime in here; we've got about eight minutes left. OK, I think I've got about one more minute, so I'm pretty close to finishing. Sounds good. So, what's the right build time for your team? It honestly really depends on your team. I personally like the ambulance analogy for CI/CD builds: they need to be as fast as possible, but again, not too fast, because important things can really break if you're skipping tests and not staying on top of all the things you should really be testing and caring about. And when the lights go flashing, you need to act on those failing builds immediately and make sure they get unblocked. On this topic, we released a report called the State of Software Delivery, back in late last year. We looked at a bunch of stats from teams across the world and assembled some benchmarks so you can see what high-performing teams look like and what kind of performance they're seeing. It's pretty cool; I'd recommend you check it out. And lastly, tomorrow we're running an online meetup on the continuous delivery topic. We have Nick Jackson from HashiCorp talking about canary deployments and Angel Rivera from CircleCI doing a practical introduction to our new feature, runners. It's going to be pretty cool; you should check it out. The URL is circle.ci slash continuous delivery dash evening. And that's all from me. Thank you for your attention; I'll take any questions you might have. It looks like we might already have one in the Q&A box here. I'm not sure if you're able to see that, or I can read it for you. OK, yeah, I can see it. Does Docker layer caching work with OCI-compliant images? I don't know, but if you email me, so I know who you are, anonymous attendee, I'll be able to find you that answer. I don't know off the top of my head, unfortunately.
There's another question: in your experience, what does a development team go through, forming, storming, norming, performing, to move from problematic builds to fast builds in the development cycle? I would say that in our most drastic example, we were actually performing, or at least we thought we were, when our builds became very slow, just because we were so eager to add tests and make sure every feature was tested end to end that we forgot about the optimization aspect and, in quotes, woke up one day and discovered, oh snap, our builds are way too slow. It really depends on the team when they go from problematic builds to fast builds, but where human factors are involved, storming is probably when no one trusts anyone, and that's also when you're likely to see people act as gatekeepers to deployments, that kind of thing. Next: do you have a CI/CD solution for the telco vertical? I don't know the exact specifics or needs of telcos. CircleCI really works with any programming language and platform out there, so I would imagine you might find it works well for that vertical too. I'm not sure whether there are requirements beyond running on your infrastructure or cloud infrastructure, or special machinery required to build and deploy, or particular certifications; I know we have several accreditations from various organizations, but I don't know about the specifics for telcos. How do you manage a large number of teams, each having their own CircleCI? If you're using a single organization, let's say on GitHub, then every member of that organization can have access to CircleCI under your plan. Obviously, whatever someone can see in your VCS is also something they can access in CircleCI.
What kind of issues are you running into? Any particular issues regarding managing a large number of teams, or something else? We definitely have very large teams and organizations that we cover, but I'm not sure about the specifics of your question. And we are right at time here, so we maybe have time for one more question if one comes in. But if anyone wants to ask any other questions, please tweet me or email me: smarkin on Twitter or zan at circleci.com. I imagine I can also share my slides with you and you can then post them. Yes, absolutely. Awesome, so I'll add some more resources and links so that folks can actually find what they might be looking for. Well, thank you so much, Zan, and thank you, CircleCI, and thank you to all the participants who joined us today. As we mentioned at the beginning, this recording will be available on the Linux Foundation YouTube page shortly. Thanks again and have a great day.