I liked what Andrew had to say about DevOps being a mix of technology and people, right? That intersection between the two, and how everything is related. They're always connected to each other. You can't use one without the other. So I'm here today to talk a little bit about change management. I think this is one of those fascinating places where technology and people come together, and the process that you define has to work for both of them. That can be very tricky to figure out.

I'm a manager of product operations at SendGrid. If you don't know who we are, we send a lot of email. If you took an Uber to get here today, that receipt came through our platform. It's a pretty high-scale environment and it keeps us on our toes. As Andrew said, we've all been working in this space for a while, and there are a number of problems, right? You have to solve those in a way that makes sense for your scale and your organization. I've spent three years at SendGrid. That's been a blast, and it has involved a lot of challenges and a lot of learning for me. Before that, I was a consultant for most of my career, and I worked with a number of organizations, from very small mom-and-pop shops that really didn't have their own IT at all, through hospitals, government organizations, scientific research centers, and pretty much everything in between. So I got exposed to a lot of different types and sizes of organization. That was a really interesting experience that helped me learn a lot about people, a lot about process, and how all of those things come together. Throughout all of that, I've been working in the Linux space, the security space, and DevOps in the sense of figuring out how to automate things, how to get by with fewer people, how to have better processes, and of course solving hard problems all the way through.
I don't think you can get away from solving hard problems if you're anywhere related to the IT industry.

So why change management? I think this is a particularly good question at a DevOps conference, right? Some people think DevOps does away with change management, does away with process: we can rely on tools, we can rely on automation. How many of you are currently using some kind of change management process in your organization today? A lot. How many of you have no change management process? A few.

Why might you want change control? Sometimes you don't have a choice, right? Sometimes the business that you're in has regulations that you have to comply with. Sometimes you want documentation of what's happening. You want an audit trail of everything that was done in your environment. Sometimes you're just looking to have fewer mistakes. You don't want people fat-fingering commands and taking down S3, for example. Sometimes it's about speed and agility. I know this sounds counterintuitive. A lot of people think about change management as one extra roadblock, more red tape, one more process you have to comply with. It's about filling out forms. It's about getting approval. That's not what I'm here to talk about today. I think if you're approaching it from that perspective, you're doing it wrong. It should be about making yourselves faster.

So there are a lot of different regulatory environments that people might be living in, right? SOC 2, various ISO standards, Sarbanes-Oxley, PCI. I was a PCI QSA. That's a nightmare. I advise you to stay away from credit cards if you can. HIPAA is another big one. So again, quick survey: raise your hands if you have to deal with one of these in your environment. Each one of these probably requires some form of change management. And this is a small list. If you counted up all the regulatory environments in the US alone, they wouldn't fit on a PowerPoint slide, let me tell you.
So one of the interesting things to me is that we talk a lot about automation in the DevOps world. We talk about infrastructure as code. We want to maximize the systems. We don't want people manually doing things. But there are always going to be gaps, right? A lot of this depends on your DevOps journey, where you are as an organization. Sometimes it's legacy systems that you haven't converted to the new paved road yet. Sometimes you have snowflakes that are old or just not included in configuration management. How many of you, another show of hands, have systems that are not in configuration management? Snowflakes, legacy things? Yeah, most people have something like that. Sometimes your organization has done a great job with that stuff, but you just haven't gotten around to the network yet, and so maybe your network devices are not in some kind of automated configuration management. There are gaps all over the place. There are always going to be things in your environment that involve making manual changes.

Sometimes it's just about changes that are complex enough, and involve enough people, that you really want to spend more time thinking about them, and you have to perform a series of steps in a specific order to get that change rolled out. You may be making a schema change to a database, and when you do that, you have to make a simultaneous change to the application that uses that database, right? So if we're talking about checking everything into version control, where you want your whole infrastructure, all of your configs, everything versioned, and everything has an audit trail, why not those manual changes as well?

Fewer mistakes: this seems like a no-brainer, but writing down your plan makes you think through it in more detail. A lot of the time, just that process of writing it out makes you realize you had some faulty assumption, or it triggers a memory. Oh yeah, we changed that six months ago. It doesn't work that way anymore.
Sometimes it's about getting a second pair of eyes on something, or even just rubber ducking, right, where you don't have another human being and you're just talking to a rubber duck on your desk. Saying it out loud, imagining somebody else hearing it, or having them actually look at your plan, that'll catch things. It catches those faulty assumptions. Are you sure that that backup exists? Are you sure that it was made last week? Are you sure you can restore it? So just this process of writing it down and getting a peer review eliminates a lot of mistakes.

And then speed and agility. I think if you have a well-defined change management process, it can help you move faster because it gives you more confidence in what you're doing, right? You don't have layer upon layer of team lead, manager, director, VP, all the way up perhaps to the CEO, for some big scary software deploy, rolling out a new product or something. All of those people want to know that this change is going to happen smoothly and efficiently, that it's not going to take anything down, that it's not going to cause an outage. If you have a well-defined process that people follow and are bought into, that can help with this. You can avoid all of those unnecessary levels of approval. You can avoid having to run your plan past the CEO, because everybody knows this process works and everybody's bought into it.

The other thing on the speed and agility front: say one of your engineers has to make a scary change, something really big, something that could cause an outage. If you don't have a process, they may not even know who to go to to get approval. How do you decide whether this is safe to do? How do you know how to proceed? If you do have a process, that's all very clear, and that can save you sometimes weeks, sometimes months of figuring out who to talk to and who to get approvals from.
So, things to avoid when you're designing change management. Too many layers of approval, right? You want to keep it down to the minimum. Something I really hate, this gets me in a big way and I've seen it at a lot of enterprise organizations, places like hospitals and government: recurring change board meetings. This is a board of people that meets once a month, or once every six months, and everybody who wants to make a change has to show up at that meeting with forms printed out in triplicate, right? They all get up one by one and make their case to this group of people, who may have no idea what they're talking about in terms of the technology or the specific change. So if you want to make a change like that, you may have to wait two months until the board meets again. You may have to wait six months. That slows you down in a big, big way.

The other thing to avoid is making your process too prescriptive, too inflexible. You have to remember that people are making all kinds of changes all of the time. You don't want a process that is really designed for network changes, or configuration changes, or security updates, or something in particular, with a whole bunch of required fields that people have to fill out that don't apply to 30 percent or 60 percent of the cases. You want to design a system that's fairly flexible, and you want to leave it to people's judgment to adapt that system to what they're doing on a particular day.

So one thing I'll say before we really get into the details of the change management system that we designed at SendGrid: there is a whole class of stuff where we said, you know what? We don't care. We have other processes in place for that. And that includes basically any change that you can roll out through configuration management or one of our other automated processes.
If we already have the code in version control, and we already have an automated CI/CD process for deploying code changes to production, then we don't need a manual change management approval and a ticket for it, because we know exactly how that software gets deployed to production and we know exactly how we would roll it back. We have automated tests in place to detect if there's a problem. That's not something that you have to document, right? You just follow that normal process, and it all works, and it's automated, and you don't have to worry about it. Same thing with configuration management: we use Chef, and if you're making changes to cookbooks, we've got all of that in version control. We have a very well-defined process for how that stuff rolls out through dev and staging and prod, and we know how to test it and how to roll it back. That's all very standard, so if you're doing any of that stuff, none of this applies. Doesn't matter, right? And it's important to make that distinction.

So what do we do at SendGrid? Really all we did was create a new board in Jira, a new project, and it's a Kanban board. It's pretty much that simple. We have a wiki page that describes the process, gives a high-level overview, and gives some guidelines. Like I just said: these things are handled by automated processes, so you don't need a change ticket for that. It gives some examples of changes that might need a change ticket. But really it all centers around this Jira board and this new issue type that we created called a CMB, Change Management Board. It's got a few simple fields in Jira. A description, talking about what you're going to do. A list of stakeholders, people that might be affected by the change or might be interested in it. Affected systems: this could be a list of hosts, it could be a product, it could be network devices, anything along those lines. Then a detailed change plan.
This is a step-by-step plan of what you're going to change. It should be detailed enough that if you get the flu or get hit by a bus or something else, somebody else on your team could pick this up and run with it without needing a whole bunch of context around what you're trying to achieve. It should be very low level and detailed. Then a detailed test plan. How are you going to validate that your change was successful? This is, number one, making sure that the change did what you intended it to do and, number two, making sure it didn't do anything that you didn't intend it to do. You should be testing for both of those things. The last piece is a detailed rollback plan. This is what happens if something goes wrong. What happens if that change didn't do what I intended it to do, or it did something else and broke something that I didn't intend to break? How am I going to roll back? Again, this should be very detailed. It should be low level and specific. It should say: log into this host, perform this action, do this thing.

Often, if I'm writing a change plan, I make it copy-and-pasteable. How many of you have to make changes at 2 a.m., 3 a.m., 6 a.m. on a Sunday? Right? I'm not in the best frame of mind at that time. That's not when I'm on my game and really focused and attentive to detail. So I like to be able to copy and paste from the change plan that I spent days or a week creating, when I was thinking carefully. That works a lot better for me.

Finally, we have a section for risks. This is just to get people thinking: what are the risks here? What could go wrong? What are the things I'm worried about? What are the things that are keeping me up the night before I make this change?

In terms of process, our workflow is pretty simple. Step one, whoever's making the change writes their plan down. They create this ticket in Jira. Those fields that I listed, by the way, are almost all optional.
Out of all of these, the detailed change plan, test plan, and rollback plan are the only required fields on this ticket. Everything else is optional. So you create this ticket on the Jira board, you write it all down, you think through it on your own, and you try to capture all of the detail that you can. Think through all of your assumptions and question them.

Step two, when you think your plan is in good shape, you've found all of the mistakes, and you've thought through it and got everything working, you run it past somebody else. It should be a peer review. It should be somebody who has enough knowledge and context of the systems you're changing that they understand what you're doing and can catch your faulty assumptions. They can catch that typo in the command, right? They can say, you forgot that argument, and if you don't have that argument, it's not going to do what you want it to do. Once the peer has reviewed the plan, the peer makes a comment in the ticket saying, looks good to me, I approve this. And at that point, the engineer who wrote the plan is free to go execute it.

Now, this could all happen in a period of 15 minutes if it's a small change, if it doesn't involve anything tremendous or scary, and if there's no need for a whole lot of notification and communication with those stakeholders. If it doesn't matter whether this change is made at 3:03 p.m. or at 6 a.m. or whatever. Sometimes changes are really complex. Sometimes they involve multiple teams. So sometimes this four-step process can take a week, sometimes a couple of weeks. We've even had cases where people have been iterating on change plans and discussing complex changes across teams for months before they decided, you know what? We're not going to do this. This doesn't make sense for us as an organization. Maybe we need to go build a new feature or something else, right?
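
As a rough illustration of the ticket and workflow just described, here is a minimal Python sketch. It is not SendGrid's Jira configuration: the class, field, and method names are hypothetical, and the hard checks are an assumption made for illustration (in the talk, engineers move tickets themselves and nothing is enforced). It models only what the talk states: three required fields, a peer approval before execution, and the four board states mentioned in the Q&A.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class State(Enum):
    """The four board columns mentioned in the Q&A (names are illustrative)."""
    IN_DEVELOPMENT = "in development"
    APPROVED = "approved"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"


# Per the talk, only these three fields are required; the rest are optional.
REQUIRED_FIELDS = ("change_plan", "test_plan", "rollback_plan")


@dataclass
class ChangeTicket:
    author: str
    change_plan: str = ""
    test_plan: str = ""
    rollback_plan: str = ""
    description: str = ""
    stakeholders: List[str] = field(default_factory=list)
    affected_systems: List[str] = field(default_factory=list)
    risks: str = ""
    state: State = State.IN_DEVELOPMENT
    approved_by: Optional[str] = None

    def missing_required(self) -> List[str]:
        """Return the names of required fields that are still empty."""
        return [name for name in REQUIRED_FIELDS if not getattr(self, name).strip()]

    def approve(self, reviewer: str) -> None:
        """Record a peer's sign-off and move the ticket to APPROVED."""
        if self.state is not State.IN_DEVELOPMENT:
            raise ValueError(f"cannot approve a ticket that is {self.state.value}")
        if reviewer == self.author:
            raise ValueError("peer review needs someone other than the author")
        missing = self.missing_required()
        if missing:
            raise ValueError(f"missing required fields: {', '.join(missing)}")
        self.approved_by = reviewer
        self.state = State.APPROVED

    def start(self) -> None:
        """Begin executing the change; only allowed after approval."""
        if self.state is not State.APPROVED:
            raise ValueError("the change must be approved before it starts")
        self.state = State.IN_PROGRESS

    def complete(self) -> None:
        """Mark the change as done once it has been executed and tested."""
        if self.state is not State.IN_PROGRESS:
            raise ValueError("only an in-progress change can be completed")
        self.state = State.COMPLETED


if __name__ == "__main__":
    ticket = ChangeTicket(
        author="alice",
        change_plan="1. Log into the switch. 2. Move the port to the new VLAN.",
        test_plan="Confirm the new hosts respond and existing hosts are unaffected.",
        rollback_plan="Restore the saved port configuration.",
    )
    ticket.approve("bob")
    ticket.start()
    ticket.complete()
    print(ticket.state.value, "- approved by", ticket.approved_by)
```

The optional fields default to empty so that a small change can be written up quickly, in keeping with the talk's advice to avoid required fields that don't apply to most cases.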
Sometimes this process of writing things down and talking through everything required to do something encourages people to think outside the box and come up with new approaches to the problem, and you throw away the change plan and go do something else instead. That can be valuable as well.

So how is it working for us? Everything's puppies, rainbows, and unicorns, of course, right? Everybody gets it right on the first try. No, I mean, in reality, process is hard. Technology problems are easy. People are hard, and finding a process that works for everybody can be really tricky. As Andrew pointed out in the last presentation, things that work at one scale often don't work at a different scale. So as your organization grows and changes, and as SendGrid grows and changes, we have to adapt constantly. We're constantly tuning knobs and tweaking things, and change management is one of those things. We've already adapted it a few times since we introduced it, and I'm sure that we will adapt it again. You have to be ready to do that.

Some of the challenges that we've faced are cross-team dependencies, right? This is always hard in an organization that's large, that's scaling. As you add more and more engineers, more and more teams, oftentimes your problems start to span teams, and one team can't do something without affecting two or three other teams. Those kinds of cross-team dependencies are really hard. Dealing with that through a technology approach is usually impossible. The way to solve this problem is usually with communication. You get people to talk to each other more. Face-to-face, HipChat, Slack, whatever works, right?

Another challenge, and a place where this process doesn't really work well for us, is with really repetitive changes, things that we have to do on a regular basis. For us, a big one is that we have our own infrastructure that we manage ourselves.
And so we're managing servers, and we're managing switches and top-of-rack switches. When we bring in new hardware, when we get new racks of servers installed, we often have to make changes to switch port configs, right? It's changing a VLAN, setting up a trunk, something like that. Those kinds of repetitive tasks don't really fit well in a heavy process. So we're looking at that. We may start using a checklist for that, or something else.

And then the fat-finger challenge is an interesting one. This is a case where, if you do the right thing, if you make the change that you're intending to make, it has no potential impact on anything. But what if you accidentally configure the wrong switch port, for instance? Maybe you have a new rack of servers and you want to change a port for that rack, but you log into the wrong switch. Or you think you're executing a command on the database backup host or slave, and you're actually executing it on a master. Those sorts of things are very hard to protect against using these kinds of systems. So that's a challenge as well, trying to make sure that we cover all these bases and have a process that's flexible, lean, and fast. Your organization may have these challenges, or it may have other challenges, but again, it's all about adapting whatever process you create. It's about learning, iterating, always being adaptable.

So again, every organization is different. Organizations are at different scales, right? If you've got five engineers, that looks very different than if you have 300, or a thousand, or 10,000 for Google or somebody like that. Culture is different, and how people communicate and collaborate is wildly different across different organizations. That can make a big difference to the kind of process that you define for your organization.
DevOps is a journey, and people are at different places on that journey. You have different levels of automation, different levels of testing, monitoring, all of these things, right? If you don't have good monitoring, and you don't have good automation, and you don't know how to test or you don't have automated tests, you're necessarily going to be slower. You're going to have to take more time and care with your change control process. Hopefully that makes sense to everybody. SLAs, right? Sometimes you have an SLA and a contractual obligation to meet it, and there are financial penalties if you don't. For other organizations, there aren't any SLAs and it doesn't matter. You've got SaaS companies. You've got companies that are deploying software on-prem. Those models are wildly different, and you're going to need to design a different process to meet your needs. Different outage impacts. Some companies lose hundreds of thousands or millions of dollars per minute that their site is down. For some companies, it really doesn't matter. That matters to the change process that you design. Different tools. That makes a huge difference. Process has to work with the tools that you have. They have to be designed to work together.

So what's really important? If everything you're going to design is different for every organization, what are the key takeaways? What do you need to pay attention to? Number one, write down your plan. This is the biggest thing. This is going to catch the most mistakes. It's going to give you the most confidence. Just write down your plan. Number two, figure out how you would roll back. Again, write this down. It's amazing how many times in my career I've been planning to do something, and just in the act of writing it down and talking about how I'm going to roll this back, I realize, you know what? I haven't looked at the backup system in the last few months.
I should really go make sure that those backups are working and that they're there. And I go and look at it and, hey, 90% of the time they are, and everything's fine. It would work fine. But there's that 10% of the time where somebody changed something a few months back, and we didn't know and we didn't detect it, and you know what? Backups have not been working. If I had executed that plan and needed to roll back, I wouldn't have been able to. It's really important, right? And the act of writing it down brings that stuff to mind. It helps you think about it.

Talk it over with somebody. Again, best-case scenario, you're doing this with somebody on your team who knows all of the systems. They have all the context. They're familiar with it. And you can have a face-to-face conversation. Even if you can't do that, talk it over with somebody else who's at least familiar with the general area you're talking about. Even if they don't fully understand, having to explain it to them and walk through it step by step with them can still help catch mistakes. They can still question your assumptions. If you can't have that, rubber duck it. Literally put a rubber duck on your desk, put something there, and talk to it. Saying it out loud helps. But definitely run it past somebody. Talk it over with somebody. Try to get as many eyeballs on it as you can, right? For simple changes, one pair of eyeballs is probably fine. For more complex changes, we encourage people to talk to as many people as they need to in order to feel comfortable with what they're doing. This is about giving yourself confidence. If I'm making a really scary change, I may want to run that past five or six people. Sometimes you get better ideas that way, right? Better ways of doing it, more efficient.

And the last thing is to document it somewhere. For me, this is really important. We've got version control, and we've got all kinds of systems in place for auditing and logging and other things like that.
We have a pretty good audit trail for our systems. But for these manual changes that fall through the cracks, where we don't have automation, this is a changelog for the environment. I can go back and look at what change somebody made on April 3rd of 2016, and I can see it in detail, step by step. This is what I did, step by step. This is what I tested. That's really important. You can go back six months or a year later and ask, how did you test that that was successful? Ah, I see it right here. Okay. Well, that means it was working at this point in time. It must have broken some time after that, right? Or something like that. So I really like the documenting, the audit trail, the changelog for my environment. Step five, profit, right?

So, general advice for change management: keep it simple. Process is, in many ways, the bane of all organizations. It is so tempting, it is so easy, to throw process at people. If you're having problems with technology, if you're having problems with people, with communication, with outages, all of these things, the natural inclination is going to be to throw process at it, throw red tape at it, make people do more work, make people write things down more. You have to be wary of that instinct. Keep it simple. If you make it too complex, if you make it too inflexible, if you have too many required fields, people won't use that process. They will find ways to work around it. You have to design a lean and simple process that works for all the use cases, one that's so easy and so wonderful that everybody wants to use it. That should be your goal.

Another important takeaway is to use the existing tools you have. It's much easier to get people to adopt a new process if it doesn't involve adopting an entirely new tool set, a whole new technology, a whole new ticketing system, or whatever. Reuse what you have. It's far easier, far more efficient, and it usually makes people adopt it much faster. And then remember your North Star, right?
What is the goal? The goal is to empower people to move faster. It's to empower them to move with more confidence. If you keep these things in mind as you're designing your process, you'll end up with a better process. Keep them in mind as you watch how that process works over time. Keep adapting it. Circle back around every 6 months, every 12 months. Is this still working for us? Talk to the people who are doing it. What do you think? Does this work well for you or not? What would you change if you could? Keep these things in mind and be ready to tune it and adapt it. That's all I've got. Thank you.

Yeah. So the question is, did we notice a difference when we implemented this? Were there fewer outages? Were there fewer self-inflicted outages? And the answer is yes. We've been doing this at SendGrid for about two years. The way that we rolled this out, by the way, is that we rolled it out first to the ops group and vetted it there for a few months, and then we rolled it out to the rest of engineering. It was kind of a gradual process. Cultural change is hard. Introducing new process is hard. So it took a few months to really roll it out and get people bought into it. But yeah, we've seen a... let's see. I think over the course of 2016 we measured something like a 40% decrease in self-inflicted outages. It's hard to say if that was entirely due to this. There are a lot of factors that go into that. But it was measurable.

Oh, that's a great question. So we were using a Kanban board, and the question is, do we limit the number of tickets in each of the states on the Kanban board? And the answer is no. But I'm not sure that's the right decision. For the most part, we let engineers manage the board and the state of their tickets on their own. And what that means is the capacity... so, as an individual engineer, you can see the state of your ticket, and you can manage that super easily.
But for me, when I go look at the board, there are a bunch of tickets in various states, and sometimes that's hard. We probably need to improve that a little bit. If I made a recommendation to you, I would say it's worth assigning somebody to own the board, as a holistic owner of it, to make sure that everything is in the proper state. Whether that involves limiting the number of tickets in progress or whatever else probably depends on your organization. It might depend on how many engineers you have and how many changes like this you tend to make. But, yeah, I would say that's something we have not found the right answer to yet at SendGrid.

And I didn't specify it, but we really only have four states on that Kanban board. The first column is just tickets being created. They haven't been started yet. They're change tickets that are in progress, or in development, I guess I should say. Then we've got a column for changes that have been approved, a column for in-progress changes actively being made right now, and then completed. So it's pretty simple, and the engineers are responsible for managing all of those states themselves. We do audit it periodically, but we don't have enforcement mechanisms. We don't have, like, a couple of specific individuals who are allowed to move it from one state to another. Everybody does that themselves.

Okay, one more time. Give it up for Chris McDermott.