Good afternoon. I want to talk about some of the decisions you have to make along the way of your GitOps journey, and some of the consequences of those. I'm not going to talk about any particular products or any particular platforms.

So the first thing I want to do is talk about what all these terms mean, and level set on what I'm actually going to talk about. There are two different things CD can mean: continuous delivery and continuous deployment. There's CI. There's GitOps. They're all a little different. For the purposes of this talk, when I say CI, I'm talking about building your code, getting it into an artifact repo, unit tests, test automation, et cetera. When I say continuous deployment: hey, this requires continuous integration, but it's not the same thing. This is all about how we get it the rest of the way into production. How do we actually get it onto the right clusters and make sure everything deploys safely, in the expected way? And when I talk about GitOps, I'm talking about how we make sure people aren't doing this by hand. How do we make sure all of this lives in change control, so we can audit who did what and easily roll back?

If you think about it, you can't do continuous deployment without doing continuous integration, but you can technically do GitOps without doing either of them. The very first time I saw a team doing something I would classify as GitOps, except for the fact that it predated Git, they were maintaining all of the deployments in source control. But they had customer-specific deployments; it was an on-prem application, and someone went in and modified a file in version control when a customer needed an upgrade. Was it continuous deployment, where customers upgrade and get features immediately?
No, but it was GitOps. So these things are all related, but they're not the same. Personally, I think there is a happy path a lot of companies want to get to, where they're doing both continuous deployment and GitOps.

Some examples of where you might be doing one but not the other: GitOps has the word Git in its name. If you're not using Git, you're not doing GitOps; so if you're on Mercurial, you're not doing GitOps. Also, a lot of people consider GitOps to require that your configuration is declarative, and this is something I happen to be a big believer in. If you're checking in a script, and when you modify that script everything breaks, you're probably not really doing GitOps the way you want to be doing it. Your configuration of what you are deploying should be declarative. Similarly, you might be doing GitOps and not continuous deployment if you have a bunch of manual processes in the middle, different repos, and three months between the original feature being committed and it actually being deployed to some environment.

In the happy middle ground, all your configuration and source code is in Git, and I mean everything: your infrastructure is managed as code, your deployments are managed as code, and it's all declarative. If you're in that middle ground, when you check in a feature, it gets everywhere. It doesn't just update your dev environment; it updates staging, runs your integration tests, updates production, updates all of your regions. That's the middle ground that, from what I've seen, a lot of people want to be trying to reach. So we're going to talk about some of the problems I see people encountering along the way when they're trying to get there.

The first thing I want to talk about is repository structure. The reason I want to talk about this one is that I can't tell you how many people I have talked to who are like, I
just don't like XYZ tool. And you start talking about why, and it's not actually a problem with the tool; it's how they configured the tool.

Back before GitOps was a thing, back before DevOps was a thing, we came up with a branch management pattern called git flow. It was originally developed for long-running releases you need to maintain. So if you have five versions you need to maintain for two years in parallel, it's a phenomenal way to manage your branches. And when DevOps came around, everyone was familiar with this way of managing branches, so they started going: well, I can do a branch per environment. It looks really, really good on paper. If you're a small team, everything's simple, it works okay. But the bigger you get, the more people, the more cooks you've got in the kitchen, the more likely it breaks.

Some examples of how it breaks: imagine that you're using branches, and you want to run a different number of replicas in the staging environment versus production. Someone changes the image version and merges to staging. When you go and merge it to production, it also picks up the replica count change another person made, and you didn't even realize it. So when you start doing branches per environment, it can cause a lot of pain. It causes so much pain that the creator of git flow actually went back to the original blog post where they published it and wrote a disclaimer saying: don't use this pattern for this, it's not what it was intended for.

Personally, I'm a big fan of trunk-based development. There are a few other alternatives out there as well, but: check in and build on feature branches, merge them into main, and when you get to main, go ahead and ship that off to all the environments.

So, hey, what's the alternative to a branch per environment? My personal recommendation is to make a folder per environment in your repo. I say a folder, not a file, for a very specific reason.
You don't know what the future holds. If you're using Helm, you might be like: hey, I can just have a values file per environment; I can have one named prod and one named staging. Two years from now, if you suddenly need another file, you're going to have two files for staging. You don't know how it's going to grow. A folder is easy, and it grows better.

The other thing I'd recommend: there are a whole bunch of really good tools out there for templating, for YAML templating, for JSON templating, for XML. The first two are really common in Kubernetes, and if you're using something else, I guarantee there is a template generator for it. Tools like Helm and Kustomize typically allow you to have all of the definition of the application be shared, but have parameters per environment. So you can have a file in those folders for each of those environments, and suddenly everything is going to be a lot easier to manage.

The other decision I see a lot of people struggle with, and sometimes wind up regretting, doesn't have nearly as cut-and-dried an A-versus-B trade-off in my opinion, and that's how you structure your Git repositories. Some people put their applications and their infrastructure code all in one repository; some do a separate repo per app, or an app repo and an infra repo. My best piece of advice here is to ask yourself how your company works. If you're a small team, you expect to stay a small team, and the same people are doing infrastructure who are doing the app dev, one repo actually works really, really well; you don't have as much cognitive overload. If you're a 2,000-person company and you have controls on who's allowed to touch the AWS account configuration, doing multiple repos helps you enforce the separation of duties.
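To make the folder-per-environment idea concrete, here's a minimal sketch of what it can look like with Kustomize, one of the templating tools mentioned above (the app name, paths, and replica counts here are hypothetical):

```yaml
# environments/prod/kustomization.yaml
# One folder per environment; each overlay reuses the shared base
# and only overrides the parameters that differ per environment.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # shared Deployment, Service, etc.
replicas:
  - name: myapp         # hypothetical Deployment name
    count: 10           # prod runs 10; staging's overlay might set 2
images:
  - name: example.com/myapp
    newTag: "1.4.2"     # the image version being promoted
```

Because the image tag and the replica count live in separate fields of the environment's own file, promoting an image bump can't silently drag along someone else's replica change the way a branch-per-environment merge can.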
You can more easily set up different permissions per repo. So ask yourself where you are now, but then also ask yourself where you think you're going to be in three to five years, because restructuring repos kind of sucks and you don't want to have to do it. You know your company's growth trajectory. If you're building a consumer cell phone app, you're probably not going to wind up with a bunch of heavy infosec requirements. If you're a ten-person fintech startup, you probably are going to wind up needing that separation of concerns in the not-too-distant future.

The next pattern I want to talk about a little bit is push versus pull. The difference here is: when I check that configuration change into Git, how does it get applied? Is there an external system somewhere that's monitoring my Git repo, or is my CI system kicking off a deployment job? They are two fundamentally different patterns, and some of you may not be familiar with both. If you use Kubernetes, there are a bunch of Kubernetes tools that use pull, but there's very little that uses the pull model outside of Kubernetes today; it's kind of the newer one.

So, some of the advantages of
It's a little bit simpler more people understand it and One of the things I really like about it is it leaves you with the full full power of your CI systems so you know if you are deploying something and you need to run like Why Q to like change one thing as part of your build script you can and it's it's really easy When you're doing a pull model oftentimes going and doing scripted changes to config can become more complicated The other thing I like that is it's cross-target, so if you're only deploying to kubernetes Hey pulls not a problem if you're deploying to five or six different things and One of your tools is pulled and everything else is push that suddenly means your app devs every time they have to touch one of these tools is having to completely change their rental model and You know the more Things work in the same way the easier it is to get your job done I suspect a lot of people in this room are you know on an SRE teams think of the app devs as your customers and you know The more familiar things are to them the more productive they're going to be the happier your customer will be There are some disadvantages of push so there are push systems out there that require things like You load external secrets into github. They require internet access to like a kubernetes cluster. That's a problem If you need external secrets use the secrets mantra, don't check it into your get repo. 
I think everyone should know that. But also, take a look at how the tool connects to your cluster, because when companies start, when they're ten people, maybe you can get away with having your Kubernetes cluster's API endpoint internet-accessible, but it's not going to last, and you don't want to be poking holes in your firewalls for it. So how you actually connect the APIs you're deploying to back to your Git repo is a meaningful problem, and whatever tool you're using, make sure it has an answer for that. On the push model, there are tools that run in the cluster, connect back out, and give you frameworks for triggering them. On the pull model, hey, they're connecting to your Git repo and pulling from it, so as long as your Git repo is internet-accessible, you're usually fine.

Some of the other disadvantages of the pull model: it tends to put a large amount of business logic on the system that is applying your configuration. One of the interactions I had with a customer once kind of opened my eyes to this problem. It was a large bank, and they had been trying to upgrade a pull-model-based deployment system for four months. That sounds conceptually easy, but the problem was they had 400 clusters, they had four different versions of this thing running, and all of the versions treated the configuration slightly differently. So anytime they tried upgrading any cluster,
it broke someone else's config, and they couldn't get all the different teams onto a single version that actually worked the same across everything. So make sure you think about that.

The next thing I want to talk about is automated validation. We talked a little bit about GitOps versus continuous deployment; a lot of the tools in the space support both. My advice: if you want to get to full continuous deployment, everything fully automated, and not just GitOps, make sure the toolchain you are using has an easy answer for how it orchestrates deployments and testing across multiple environments. Very few people deploy to production immediately when someone checks in. They tend to deploy to pre-production environments, they tend to run integration tests, they run security scanners, and they only go to production if those pass.

There are a lot of tools where it's hard to know when a deployment finished updating. If you're on that pull-based model and it went and updated staging, how do you know it's time to go run the integration tests that need that environment to be deployed? Some of the tools have answers for that; some of them don't. If your tool doesn't have an answer, you're going to have to build it yourself, so make sure you think that through. With some of the push-based systems, you're having to monitor and wait for that deployment to finish, polling constantly. GitHub Actions, if anyone's using it, charges you by the minute of execution time.
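One answer some pull-model tools give to the "how do I know it finished?" question is sync hooks. In Argo CD, for example, a Job annotated as a PostSync hook only runs after the apply has succeeded and the resources are healthy, which makes it a natural place to kick off integration tests without a polling loop (the image and command here are hypothetical):

```yaml
# Runs automatically after a successful sync; no polling required.
apiVersion: batch/v1
kind: Job
metadata:
  name: integration-tests
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded  # clean up finished Jobs
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: tests
          image: example.com/myapp-tests:latest   # hypothetical test-runner image
          command: ["./run-integration-tests.sh"] # hypothetical entrypoint
```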
You don't really want to be polling and waiting. So make sure that not only can the tool tell when a deployment finished, but that it can tell in a way that's reasonable and isn't going to cost you an arm and a leg.

Some examples of the classes of tests you want to run. First, security scanners. I assume everyone uses things like Docker image scanners and static code analyzers; that's not what I'm talking about when I say security scanners. If you look at the OWASP scanners, a lot of them need a deployed environment to run, and basically scan the full web server. Those are the ones you tend to need to mix into your deployment pipeline. There are also end-to-end tests and smoke tests, Selenium and all of that; that stuff tends to need everything deployed and talking together.

Another thing that is, in my opinion, extremely important if you want to get all the way to CD, at least if stability in production is truly important to you, and it should be, it really should be important to everyone who's here: when you are deploying to production, how do you make sure that if there's a bug you haven't found yet,
you can detect it and automatically roll back? A decent percentage of tools, but not all of them, have strategies beyond just, hey, shut down all of production, install the new version, turn it on: an all-at-once update. A decent percentage also have things beyond a rolling update, which is the default strategy for Kubernetes; a rolling update basically updates one pod at a time. There are strategies beyond that, and I'm going to talk about two of them: canary deployments and blue-green deployments. These are strategies that help you, in the case of canary deployments, decrease the blast radius of how many users are affected by your deployment, and in the case of blue-green deployments, actually validate that the specific copy of the application in production is working as expected before you send it traffic, while giving you the benefit of instant rollback. This does have one downside, which is that you're running multiple copies of the app, so it costs slightly more. But if you're only running that way for 20 minutes, it's usually not a huge deal.

So, blue-green deployment. The core concept here is: stand up two versions in parallel. In addition to the old version, you stand up the new version, but you don't send it any traffic. You validate it. Depending on where you are in your GitOps and CD journey, if you're really early, you might be validating it by hand. There are a lot of people who go: I can't do CD,
I can't do GitOps, everything's manual. Well, you know what? If it's manual, go put together the building blocks. Have a manual approval; it gets you moving in the right direction, and then automate the manual approvals later.

So: stand up the new version, run some tests against it, look at it, make sure everything looks good, then send it traffic. Then continue monitoring; check your dashboards and make sure it doesn't fall over. And keep that old version around for a while. Once you shut the old version down, how quickly you can roll back is limited by how long it takes your application to start. If your application is cloud native, hopefully it starts in under 30 seconds. If your application is some legacy thing that was built on VMs, it might take four or five hours, and that's a problem. So run them in parallel for a while and make sure everything's healthy. That's the core concept.

There are some applications this is not suitable for. It's not usually the cloud-native ones, but there are applications out there that you cannot run multiple copies of; they usually have something like an exclusive database lock or an exclusive file lock. If you have an application like that, you might not be able to do this. But I would highly recommend making sure any CD tooling and any GitOps tooling you are choosing has a reasonable way to get here when you're ready for it, because it will help you ship faster.

The other advanced deployment strategy I want to talk about is canary deployments. To give you an idea of why this one matters: there's a company I know of who uses these who's a payment processor.
So they're like a Square: they run credit card transactions. If they ship broken code, their payments stop running, and that suddenly means literally every single payment that doesn't run is costing them money, frustrating their customers, and losing them customers. If that's your business, you don't want a new version of code to go to everyone immediately. So they send one percent of their transactions through a new version of the code for a while, I want to say something like two or three minutes, then they increase it to two percent, and go all the way up to a hundred. What that does is decrease the risk of shipping. The whole reason we're trying to ship continuously is that smaller changes are lower risk; these strategies also help you lower risk.

There are various ways of implementing a canary strategy, but the basic idea is to incrementally increase how much traffic is reaching the new version. Some people implement this at a load balancer; some people implement this by dynamically scaling how many servers are running each version of the application. There are a lot of options. When you're choosing a toolset, make sure it has a sane option. You can build it yourself if you have to, you can start combining four or five different tools, but the more of that you're doing, the harder your final solution is going to be to maintain. A lot of companies build really, really great in-house CD systems; then the person who built them leaves, no one else knows how it works, something changes, and it breaks a little. Okay, it's fine for another few months; someone changes it again, it breaks a little more. So when you're looking at this stuff, ask: what's it going to be like when you leave? How do you make sure it's going to survive you, and how is your company going to grow in that time?

Anyone have any questions they want to ask?
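Before taking questions: assuming a Kubernetes environment with Argo Rollouts, which is one possible implementation (names and timings here are hypothetical, and the selector and pod template are omitted for brevity), the two strategies above can be sketched like this:

```yaml
# Blue-green: stand up the new version next to the old one, validate it
# through the preview Service, and only cut traffic over once it looks
# healthy. Rollback before promotion is effectively instant.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp-bluegreen
spec:
  strategy:
    blueGreen:
      activeService: myapp-active     # Service receiving live traffic
      previewService: myapp-preview   # Service for validating the new version
      autoPromotionEnabled: false     # start with a manual approval, automate later
      scaleDownDelaySeconds: 1200     # keep the old version around ~20 minutes
---
# Canary: shift a small slice of traffic to the new version and ramp up.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp-canary
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1                # 1% of traffic, like the payment processor
        - pause: {duration: 3m}
        - setWeight: 10
        - pause: {duration: 3m}
        - setWeight: 50
        - pause: {duration: 3m}       # then 100% once the steps complete
```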
Questions about GitOps, CD, anything like that? Yes, you in the back. Sorry, I thought you raised your hand. Guess not.

When I have a lot to reconcile, how do I make sure the time between the first deployment getting the change and the last one getting that change stays small? How do I make that time small?

Yes. So, the best secret I have for making that time small is to automate every step, which tends to be hard, and to run as many steps as you can in parallel. A lot of companies have multiple different pre-production environments; for example, integration tests and security scanners are often not run in the same environment. I've worked for multiple companies where it's like: the security scanners must run in this environment, and as an app team you can't even have full control over it; security owns it, you can't touch it. But if you can run your integration tests and your security tests in parallel, hey, if either of them fails you don't go on to prod, but suddenly you're not waiting. If you're doing one and then the other, it's going to be slower.

When you hit production, there are a lot of people who deploy multiple regions in parallel.
I have mixed feelings on this; it really depends on your application. The reason I have mixed feelings is that making changes is never zero risk, and when your application is multi-region, there tend to be periods of time when different regions have lower usage. So there are also companies who do maintenance windows and will update the EU servers at midnight Europe time and the US servers at midnight US time. If you want to get all the way to production as fast as possible, you do it in parallel; you can decrease risk further by saying, hey, the production environments update basically once a day during a maintenance window. And since the first thing I said was fully automate everything: if you're not there yet, get the entire end-to-end flow working even if it's got manual approvals, look at how long each manual approval step takes, and then prioritize automating those steps based on how long they take. Does that answer your question?

Hi, I have a question regarding the approval of a commit in the GitOps way. The friction I have is between: do we do approvals to go to production at the PR level and let the branch do whatever it wants to, or do we let the commit go through into the main branch and then have whatever CD pipeline we use, Argo or whatever, have a button that says approve, go ahead? What's your view on that?

So my opinion is that your PR review process should be having the code reviewers go: I believe this can go fully to production. When I review a PR, if I see anything that I think would keep it from running, I won't approve that PR.
There may be automated tests that should always pass that haven't been run yet. But anything you're doing at the code review level: when that PR merges, you believe it's ready to go to production. I'm a firm believer that you don't ask people to review your code until you believe your code will work in production. If there are tests you need to run that need a deployed environment, there are a lot of companies who use some of the GitOps and CD tools to stand up, say, a namespace in a Kubernetes cluster for a particular branch. Spin it up, and hey, you can run your test suite there before it ever merges. So a PR that merges should be ready to go to production. Sometimes your scanners find something you didn't expect, so you may have those things running after the merge, on the way to production, and if they detect a problem, they'll fail the deployment, and suddenly you need to go and fix that problem.

I've got two questions. First question is: are there any implications to doing both the push and the pull model?

I haven't seen a lot of setups that do both; the vast majority of the time, people are doing one or the other. So, my product does the push model, and I've talked to multiple people about, hey, we can write back to Git so that you can have something else using the pull model at the same time. I haven't found a huge amount of interest there. There are a couple of tools that are beginning to do that, where they're doing an orchestration layer that writes back to Git, but it isn't widely adopted yet, and I haven't seen enough of it in the wild that I think we know what its problems are yet, if that makes any sense.

Thank you. My second question would be on maintenance windows. Is it industry standard to run the maintenance windows just off work hours, or, I guess, off when most customers are online on the application?

So, I'm not sure it's fair to call it an industry standard.
It really depends on which industry you're in. Some industries do maintenance windows, some don't. A lot of enterprise software, where your customers are other software companies, or where you're selling a SaaS solution, does maintenance windows; I've talked to a decent amount of consumer tech companies that don't.

When you are doing maintenance windows, there are two ways people pick them, and I happen to have a very strong opinion about which way is the right way. The first way I see is: we do it at a time that works well for our internal engineering team, that is, 5 p.m. on a Friday, because, hey, this is a time that fits our schedule, we can block it off easily, and it's in our normal workday. The other one I see is: this is a time when most of our customers are not using the application, so it's lower risk. Personally, I believe the latter is what you should do. It's far and away what I see most people doing, but I do occasionally see people who try to schedule maintenance windows not around what's right for their customers but around what's right for them internally, and I don't think you should do that. Does that answer your question? Yeah, thank you. Yeah, no problem.

In many traditional systems we have a Git branching strategy, such as git flow. How do you manage your Git branches with CD?

So, you might have walked in after I talked about it, but I am very adamantly against using git flow for environments. I am of the opinion that if you are doing a single repo, you should basically have feature branches; they merge to trunk, and then once it's on trunk, the goal of your trunk branch, or your main branch, depending on which repo you're in, sorry, I'm old,
I'm old Is get it out to all environments When I see people doing a branch per environment That's like the single biggest thing that I see people do where half the time The tool that they're going this tool doesn't work for me It's forcing me to do these branches this way and it's bad like Half the time they're saying that about tool and the tool doesn't actually force it on them. It's how someone configured it years ago So get flow is wonderful for managing long-running versions. I would strongly advise not using it for Like a branch per environment. I would use mainline for this is staging and production Does that answer your question? Yeah, but it reminds me of another questions So generally in in our company we have dedicated the QA teams and Generally, it takes more than a week or months to Verify a certain version so the developer team should They should add a new functionality or new features while the previous version is being verified So in this case, this is where we require the branch strategy So in case of a CC CD, how do you manage such cases? 
So, the best pattern I've seen for long-running QA: a former employer of mine had a monolithic code base that we ultimately wound up migrating to Kubernetes, but to give you an idea of just how customizable it was, it had four different proprietary programming languages in it. The way we found bugs was literally to stand up every customer's configuration, calculate it, and make sure all the numbers came out the same, and we had single customers who took two days to calculate.

So what we wound up instituting was basically two mains. There was the main that is currently in the process of trying to ship, and then there was another main, I forget what we called it, that was the next one. Everyone merged their features into the next one, and the ultimate main branch was updated on a schedule from that next main branch. If you missed the day when that cut happened, your feature didn't make the next release. And at that point, in order to ship faster, the question was: okay, how much hardware are we willing to throw at the problem to run customers in parallel through this really, really slow testing process?

So if your testing has to run that slow, I would say have kind of a branch before main that you merge to, cut main from it when you're ready to start the large testing process, and manage it on a schedule. Make sure everyone knows the date: this is when it moves, and if you miss the date, you miss the release, and then you have your hardening period. But I would also strongly recommend doing everything you possibly can to make that hardening period shorter. Thank you. Yeah, no problem. Anyone else have anything?

Hi, just a quick question. What is your recommendation in terms of the triggers for promotion of deployments when you move from dev, stage, UAT, to production?
Like, the teams I work with, they love to do a huge GHA pipeline that automatically promotes, has a pause step, and so on. What's your recommendation for that trigger to promote the artifact?

So, in an ideal situation, your SDLC should have a set of things which need to happen in each environment before the code can move on. We need these tests to pass, as an example; we need these people to sign off. You should be able to define that list of constraints, of what needs to happen, in a declarative way, and once that list of constraints has been satisfied, the code can move on. If you think about things like maintenance windows, conceptually a maintenance window is just a constraint that says: hey, this environment can only update at this time of day. So my personal opinion is, if you haven't solved this yet, go: okay, what's the goal of this environment? What needs to happen before the code can move on? Figure out your environment order: this environment can't start until that environment's done. And after you've defined that list of constraints, you've basically defined everything you need to do to deploy. Personally, I like the constraints model, where you're saying this list of things needs to happen, significantly more than scripting, because it's more change-resilient. Some of the tools out there let you actually express deployments as a list of things that need to happen, as opposed to writing a shell script, and in my experience they are a lot more change-resilient.

Any other questions?

In your experience, are you seeing people, when they're doing something like a canary deployment, or even blue-green, but just in general, when something goes wrong, rolling forward versus rolling back? In a Kubernetes-native environment, the natural thing would be to just point back to the older version of the image. Or are you seeing more:
okay, let's fix the problem, rebuild it, and redeploy? Which one are you seeing more of?

So, I see both. When automated deployment tooling detects the problem, you usually roll back, and if you're using any good GitOps tooling, any good CD tooling, it should have a one-click, automated way to roll back. Where I tend to see the roll forward is: okay, cool, you rolled back prod, now how do you go about fixing it? We've mitigated the risk enough that it's not an emergency anymore, and we're actually going and looking at the bug. Let's go ahead and fix the bug. It's drop everything; we're not abandoning that branch, we're not rolling back main, it's all hands on deck, let's fix main, and then you roll forward that change. So it tends to actually be a hybrid most of the time: the current environment tends to be a rollback, but the Git side of it tends to be managed as a roll forward. Did that answer your question?

Anyone else have anything? Awesome. Well, thank you everyone for coming. If you have any questions or want to talk more later, I'll be out at the Armory booth. Happy to talk more.