Hi, I'm Holly from Slack, and I'm going to talk to you about service ownership and our bumpy journey there.

I've been in software for 19 years, in development and then leadership roles, and one of the great things about my career is that I've been able to take this technical skill set to a huge number of domains: biotech, publishing, government, entertainment, and now Slack. So I've done a lot of different things with computers, and I've learned a lot about different kinds of businesses and organizations.

But I actually started my career as a mechanical engineer. I absolutely loved studying mechanical engineering, and after college my first job was on a team that was building a new kind of engine. We were in the design phase, and I drew an awful lot of mechanical drawings like this one, sent them out for fabrication, got the parts back, ran my tests, did all the calculations, redesigned the parts, and repeated. I worked there for two and a half years, but the entire time was this really slow slog of designing, testing, and learning, with each cycle months long. And at the end we didn't even have a product, or even a field-testable prototype. Just a series of test units like this one, and we were still in the lab, not in the field.

At the same time, I was writing software to interface with the sensors for testing, and for the fuel and air flow control systems. That software grew and matured a lot faster than the engine did. I found it really satisfying to write that software every day, and for me it turned out to be a lot more satisfying than trying to build the engine for years. So I made the switch to software engineering full time and haven't looked back.

It isn't because I don't love making things. I absolutely do. And it wasn't because I wanted to throw away four years of expensive, specialized education, because I definitely didn't. But for me, mechanical engineering work was just way too slow. Any given test might give you ten more ideas, but it would take you forever to work through any of them and find a great solution. Meanwhile, the product I was on wouldn't even know if we had product-market fit; we just had test units. Whereas with software development, everything was super fast. You could write some code, test it immediately, and then fix it right away if it wasn't working. And if you had really good tests and good user testing, then you had confidence you were building the right thing every day.

I didn't have the right word for it then, but it was the difference between a fast and a slow cycle time around this loop: design a part, or some code, or a business process; try it out; measure results; learn from those results; and try again. The faster you go around this loop, the better.

I had actually been exposed to all of these principles in college, when I studied the history of manufacturing. The Toyota Production System, developed after World War II, revolutionized car manufacturing at the time, and it was the precursor to what we now call lean manufacturing. Traditional car manufacturing relies on big stockpiles of parts and finished cars to fulfill orders, which meant you had to predict, like a year in advance, how much you would need, and then pay for making and storing all of those parts and cars. But the Toyota Production System designed all of that out. An order from a dealer triggers the production of a new car at the plant.
So the dealer is pulling a car from the plant to fill the empty spot on their lot, because they just sold a car. The plant starts making a car, and as they use up parts, they pull parts from their suppliers. They track parts at the factory using physical cards called kanban, and the kanban cards stay with the parts as they travel through the plant. When all the parts in a bin are used up, you've got a card with no parts, and you place it on a kanban board. A card without parts represents basically an empty slot in the system that needs to get filled. The cards start on the left side of the board and move right (this might look familiar), and as fabrication is completed, they exit the board and get attached to new parts. You can control how much inventory you have in the system by limiting the number of cards.

A decade after I visited a factory and saw cards attached to parts, I started using kanban to develop software on an agile team, and I thought, whoa, these are the exact same thing. I had no idea. So everything about lean manufacturing that I had learned got combined with how to make software. The cards still represent work to do, the cards still flow from left to right as they're completed, and eliminating waste and reducing cycle time are still the primary goals.

Besides reducing waste, one of the most important concepts in the Toyota Production System is kaizen, which is sometimes translated as continuous improvement. For example, line workers in a factory are empowered to change their workstations to improve their work, and work teams can actually reconfigure the factory to make things more efficient, often without involving management. Everyone at the factory is empowered to design and measure their own work. Empowerment is really key to kaizen; otherwise improvements go through layers of approval and change slows down.

So the Toyota Production System, kaizen, and lean manufacturing all look a lot like the lean thinking we do in software. Like most of you, I've been through my share of agile transformations. I've taken the trainings, I've become a scrum master, I've probably attended thousands of daily stand-ups at this point, groomed giant backlogs, and I've been in at least two or three really lively debates about the best way to size user stories and whether you include bugs in sprint planning. But in my experience, a lot of teams practicing agile are putting in a lot of effort to do it right and not really living the benefits. Organizing the work feels really great, and staying in sync with my co-workers through daily stand-ups feels awesome. Writing code every day is still great. Knowing that my code was compiling and passing tests and giving me the output I expected was still exhilarating. But that's not enough, right? You need to ship to the user.
You need to learn in production. You need to test in production, like we learned yesterday. And all that execution has to actually add up to something. Too many agile teams I've been on feel like this: a really fast dev team on a treadmill. You tell yourself you're doing well, your agile metrics show short lead times or accurate estimates, but you're not actually shipping. And I'm not talking about the differences between scrum and kanban and lean. I've used all of those, and each of them can fail in this way.

Personally, I've observed two things that differentiate teams that are delivering the right things fast from teams that aren't. The first is executive dedication to learning. If your highest leaders are not committed to creating a learning and adapting organization, one that is fearless in the face of change, then no team in that org is going to succeed on those terms either. The second is high-trust teams. High-trust teams can really dig into what's not going well, suggest radical changes to make it better, and push themselves to do that over and over again. A high-trust team can execute the design-measure-learn cycle and make progress incredibly quickly. Too often, teams aren't willing to really measure and learn and try new things. It is a lot more comfortable to avoid conflict and not talk about the bigger issues. It is a lot more comfortable to avoid the pain of change, especially when that change might fail. So a lot of teams languish, not asking the hard questions, not really learning. But if you're not learning and changing, you're going nowhere.

All right, what does that have to do with Slack? Slack launched in February 2014, and it grew really quickly. Within five years we grew to 10 million daily active users. We went from supporting really small teams to supporting companies that have hundreds of thousands of users each. And because Slack is a communication tool, people keep it open for nine hours a day, actively using it. We went from about a hundred servers to over 15,000 servers in AWS, across 25 data centers. And of course we grew from eight to 1,600 employees in ten international offices. So that is a ton of growth.

Slack really lives and breathes this lean thinking, and executive dedication to learning is super high. Part of why that is is that Slack itself is a massive pivot. It started as a gaming company, and the game basically failed to make money fast enough. They were looking at the end of the company, winding it down, and they thought, well, we've got this internal chat program that we wrote for ourselves to make it easier to build the game; maybe that would do okay in the marketplace. And that kind of worked out.

What's great is that from the very beginning, shipping code changes fast to users was a priority. They set up continuous deployment systems where any developer can push code to production in minutes, built in experiment frameworks to test features and interface changes with slices of the user base, and we were always releasing major features and testing them with users along the way. And we're lucky enough to have a design and user research department that measures how our users are experiencing Slack.
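As a rough illustration of what testing with slices of your user base can look like, here is a minimal sketch of a deterministic percentage rollout check. The function name, the hashing scheme, and the parameters are my own assumptions for the example; this is not Slack's actual experiment framework.

```python
# A minimal sketch of a percentage-based experiment check: each user is
# deterministically bucketed, so the same user always sees the same variant
# while only a slice of the user base gets the new experience.
# (Illustrative only; not Slack's actual experiment framework.)

import hashlib

def in_experiment(experiment_name: str, user_id: str, rollout_percent: float) -> bool:
    """Return True if this user falls inside the rollout slice."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100   # stable bucket in [0, 100)
    return bucket < rollout_percent

# Usage: ship a new UI to 5% of users, measure, then widen the slice.
if in_experiment("new_sidebar", user_id="U12345", rollout_percent=5):
    pass  # render the new experience
else:
    pass  # render the existing experience
```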
So great. But there was something along these five years that didn't really scale, and that was the centralized operations team.

Who's responsible for the management, monitoring, and operation of a production application? There is no one right answer to this question, but a centralized operations team was Slack's answer for a lot of years. One team to manage your cloud instances, write all the Chef and Terraform, take all the pages, manage all the incidents. You divide your labor into specialized areas, right? The product developers focus on features and scale and architecture. This model works really well for a lot of companies, and to be honest, it worked for Slack for a long time. In the early days, most Slack developers knew the whole code base, and as Slack grew, ops engineers generally knew who to contact to get help. So ops was getting all the pages in the early days, the devs showed up when there was a problem, and things were basically working out.

But as time went on, growth really meant that product development scaled faster than ops, and at one point we were at about 20 to 1, product devs to ops engineers. So how can operations reliably reach a developer when there's a problem? Gradually the developers started to go on call. Just the most ultra-senior developers at first: there was a group of maybe 8 to 12 engineers, they had a rotation, and they could be escalated to if something was beyond ops's power to fix.

There were different thoughts and feelings about this. Ops was happy that developers were going to be available via PagerDuty escalation, in some sort of organized fashion, to help. But some of the devs had never been on call in their whole lives, so they were scared and not confident it was going to work. So now ops is getting the first pages, but the ultra-senior devs are on call, and that worked for a little while. Even if the dev on call didn't know how to fix something, you still knew who in the org did know, right? Call that one person who knows how this stuff works; she'll probably answer. Again, Slack is a high-trust organization.

This goes on for a few years. More engineers, more features, more systems, and more and more often the on-call dev doesn't know how to fix the problem. So we ask the question again: how can operations reliably reach a developer when there's a problem, and reach the right developer?

So in fall 2017, most of the product developers went on call. Seven new pager rotations were created basically overnight, covering specific parts of Slack's infrastructure and product. But the change management for this change was pretty bad. Ideally, people are involved and included in changes to their work (empowered continuous improvement, right?), but this change came from the top, and it really disempowered people. Ops was feeling pretty happy, because we had these new rotations, like search or front end or back end, so you would actually be able to reach somebody who might be able to help with the specific problem you were seeing. But the devs were initially pretty surprised: oh my gosh, I'm on call now?
What happened? But that fear was tempered by one of the bad aspects of these on-calls, which is that some of them were really big, like all the front-end engineers or all the back-end engineers. So some of these folks were only on call two, three, four times a year. And being on call is like anything else: you learn by doing it. If you're only on call a couple of times a year, you're probably scared each time, because you never get used to the sensation of being on call and the life patterns you have to set up for it. And if you get paged, you don't really know how to handle an incident either, because again, you're only in one a few times a year.

So at this point, fall 2017: ops is still getting the first pages, now all the senior devs are on call, and we've got seven more targeted pager rotations. So we're evolving. At this point we've also got dozens of production deployments every day, because we've got that really great continuous deployment system that empowers developers to push to production within minutes. Which means that ops has to keep a really detailed understanding of the whole system in their heads, and they have to know which of those pager rotations to page given whatever symptoms they're seeing. So the ops engineers are basically human routers, either finding the pager rotation or the specific people who need to help in any given incident.

Slack keeps growing: more people, more systems, more code. Even with seven rotations, over time there was a good chance that the dev who was paged didn't know about the subsystem having problems. Which left the devs feeling like failures, ops still feeling really overburdened, and us still calling people who weren't on call, the ones who actually knew how something worked.

In a learning organization, the post-incident meeting, or post-mortem as most of us call it, is a chance to learn about the unexpected complexities of the system and the nuance of how things fail, and to really extract that learning from the people who know it best. But the problem at this point was that at Slack, post-mortems were not a great place for learning. They were being run by the ops engineers, who again were tired and overworked. They didn't have the time or the context to prepare in the way you need to to make a really good post-mortem. So the post-mortems really weren't about learning; they were about creating lists of action items. And usually people didn't think attending would be a great use of their time, so only the group that felt they needed to attend, because they were directly involved in the incident, showed up, and everyone else stayed away.

All right, so right around that same time, in fall 2017, operations got new leadership, and we had a reorg and a mission change. And like any good reorg, you get a new name. So now operations is called service engineering. We asked ourselves a new question: how do we ensure that the developers know that there's a problem?
So we decided that centralized operations was no longer the answer. Service ownership was basically our DevOps transformation: the idea that the dev teams that write the code own the operation of that code, right down to getting the pages and running the incident response. Obviously this was a radical departure from Slack's past, and that level of change can be pretty uncomfortable. But we really leaned on the fact that Slack is a high-trust learning organization. We dug into that trust and got to work.

The deal was that service engineering would focus on providing tools and guidance and on producing platforms, like a cloud platform and a storage platform, and would slowly push operational responsibility toward the dev teams. So what about those really high-stakes teams, the ones that really do need that support? We decided to create an SRE team. SRE means a lot of different things at different places; at Slack, SREs are DevOps generalists who have high emotional intelligence and a mentoring capability, because they are skilled practitioners of DevOps and ambassadors for this new way of working. They're basically selling this way of working to the dev teams. So we embedded an SRE into a few select teams to increase operational maturity and improve the reliability of services. And this was really a grassroots effort from the SREs themselves; management's role was to empower them, remove roadblocks, and get out of the way.

So we're really excited, everyone's super on board, we've got success metrics, we've got the team pairings. But that operations work didn't actually go away. Now it's the SREs who are getting those first pages, because we didn't change our alerting strategy first. So SREs are getting the first pages, dozens a week, and there are still dozens of production deployments going out. So how do we lower the operational burden on the SREs?

The SREs made a plan: categorize and reroute all these existing paging alerts to the right teams, so those teams can act on them. Then this now-non-existent centralized operations team won't be getting those pages anymore, and the SREs will have the time and energy for their embedded teams.

So we started talking about what that looks like. Teams should know what they own, right? Like, what team owns what? Surprisingly, this was a difficult question to answer. It sort of had to start with step negative one, which is: who owns this stuff? We had to find formal owners for all of these features and pieces of software that just had the names of the people who had worked on them last. Then we defined a whole set of criteria for what service ownership meant, and it means a lot of things, but it includes, hey, you've got to have at least one alerting health metric: latency, throughput, whatever's important for your feature (I'll show a tiny example of what I mean in a moment). We started getting teams on-call ready, right-sizing them (we've all heard about two-pizza teams), and starting to think about moving away from all the front-end engineers being on one rotation and all the back-end engineers being on another. So again, the devs are like, okay, this sounds kind of scary, but we're pretty much on board. We're going to need some training, some documentation, maybe some guardrails in the systems so we don't mess stuff up.
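Here's that example of an alerting health metric: a minimal sketch of a latency check that pages the owning team's rotation when p99 latency crosses a threshold. The threshold, the rotation name, and the notify helper are all hypothetical, just to show the shape of a per-feature health metric, not our actual alerting code.

```python
# A minimal sketch of a per-feature alerting health metric: page the owning
# team when p99 latency for their feature crosses a threshold.
# Thresholds, names, and the paging helper are hypothetical examples.

import statistics

P99_THRESHOLD_MS = 800           # what "healthy" means is up to the owning team
OWNING_ROTATION = "search-team"  # each feature routes to its own rotation

def p99(latencies_ms):
    return statistics.quantiles(latencies_ms, n=100)[98]

def check_latency(latencies_ms, notify):
    """Evaluate the health metric and page the owning rotation if it's breached."""
    observed = p99(latencies_ms)
    if observed > P99_THRESHOLD_MS:
        notify(rotation=OWNING_ROTATION,
               message=f"p99 latency {observed:.0f}ms > {P99_THRESHOLD_MS}ms")

# Usage with a stand-in notifier; in real life this would be a pager integration.
check_latency([120, 250, 400, 950, 1200] * 20,
              notify=lambda rotation, message: print(f"PAGE {rotation}: {message}"))
```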
And so the SREs start planning all of those things. And nothing changed. Because making progress on training and guardrails is really slow when you've got the site to keep up all day. The dev teams are starting to work on their health checks, tuning and tweaking the alerts going into their channels to make sure everything is just right. The SREs are working on their training and their guardrails and some automation. Everyone's working really hard. But again, we're going nowhere, because we were aiming for perfection.

So the SREs are looking every week at all these alerts, and the vast majority of them were host-level alerts: low CPU, out of memory, out of disk. Those had been paging operations for years, and they were a huge component of the alerting strategy, so there was a lot of uncertainty about turning them off. Finally we said, okay, test with your users: go to those dev teams and say, here are the alerts we think your team would get, what do you think? And those dev teams said, what are you talking about? Those alerts are totally useless; they mean nothing to us. And honestly, it was like coming out of a fog. Oh my gosh, that's right. We shouldn't be alerting on this stuff in the first place. We shouldn't even be getting these alerts. We should just be throwing away those hosts and focusing on automatically re-provisioning hosts and making sure the services can actually handle that (I'll show a small sketch of that idea in a minute).

Great, so we start working on that. And guess what that looks like? We wanted to do all this automation first, right? We wanted to do it right. We knew it was possible, we could envision it in our minds, but we were still moving way too slowly. And then the breakthrough came last fall. There are these moments of organizational clarity that happen, and for us it was a push from senior leadership that reliability and fast incident response were literally the most important things we could do in engineering. In that moment of intense clarity, we decided we were not going to waste this crisis. We were going to swallow our fears and take the plunge. So one afternoon, we just turned off every single one of those host-level alerts. I walked over to the desks of those dev teams and said, okay, today is the day. You're going on call, and you're turning on those alerts that you've been carefully crafting for months now.

So on that day, with no pre-planning, the devs went on call for their own alerts. And everything worked. Everything was fine. There was really bad change management once again: no comms plan, and we had skipped way ahead on the timeline. But all the people affected had been working toward this for months. They knew it was coming. And they were empowered to keep changing those alerts and their own on-call strategy after that day, because they more fully owned these services now. Ever since then, those SREs have been able to fully dedicate themselves to the teams they're embedded in.

So now, today, there are dozens of development rotations, one for each team. We have a growing number of paging alerts to tell us when features and services are failing. And I've seen all these teams continue to improve: monitoring, alerting, provisioning automation, decoupling services to reduce failures.
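And here's the small sketch I promised of the "throw away the host instead of paging a human" idea, assuming the hosts live in AWS auto scaling groups. We ran on AWS, but the mechanics below are my own illustration of the concept, not our actual remediation tooling.

```python
# A minimal sketch of "re-provision the host instead of paging a human",
# assuming the hosts live in an AWS auto scaling group.
# (Illustrative only; not Slack's actual remediation tooling.)

import boto3

def replace_if_unhealthy(instance_id: str, disk_free_pct: float, mem_free_pct: float):
    """If a host-level check fails, mark the instance unhealthy so the
    auto scaling group terminates it and provisions a replacement."""
    if disk_free_pct < 5 or mem_free_pct < 5:
        autoscaling = boto3.client("autoscaling")
        autoscaling.set_instance_health(
            InstanceId=instance_id,
            HealthStatus="Unhealthy",   # the ASG will cycle this instance out
            ShouldRespectGracePeriod=False,
        )
        # No page here: the service is expected to tolerate losing one host.

# Usage: call this from whatever collects host metrics.
replace_if_unhealthy("i-0123456789abcdef0", disk_free_pct=3.2, mem_free_pct=40.0)
```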
So in short, teams continue to dramatically improve the resilience of their services. I'm not going to tell you everything that's happened in our journey to service ownership, but we continue to make improvements. Dev teams more fully own their systems, and it's totally expected that you're on call for your systems.

But here are a couple of things we're still working on. We're still asking ourselves, and challenging ourselves, to improve our post-mortems. We've hired some experts, we've conducted a lot of training on how to do post-incident analysis and investigation, and we're continuing to invest heavily in this area. We also really want to have trained incident commanders for every incident, but that skill set is still too centralized within service engineering. We've put a lot of trainings together, we've gotten a lot of people trained up, and we've added it to the engineering career ladder to incentivize people to participate, but it still falls mostly to service engineering. We've got some ideas; it's a major focus for us right now. And like most places, we're asking ourselves how we can make it easier to operate a service without any specialized training in Chef and Terraform and things like that. So like most of you, we're building a Kubernetes platform. Early signs are good, and we're continuing to invest there as well.

In short, we've made a lot of progress, and there's still a lot of work to do. Slack continues to be a high-trust organization, so you can ask about what's not working and suggest really radical changes, and I know we're going to continue to make progress.

I want to leave you with one last thought. The Toyota Production System designs out overburden and inconsistency and eliminates waste, but Toyota's entire management philosophy as a company is designed around that way of working. Low inventory levels, which save money, are just one really visible outcome. Some other businesses, when they saw how successful Toyota was, tried to just lower inventory levels in isolation, without understanding the philosophy or creating empowered working environments, and those projects failed. Imitating another company or process without understanding the underlying concepts doesn't work. What works for Toyota or Slack or me won't necessarily work for you, just like following the scrum process perfectly won't lead you to amazing results.

The most important thing is to know what you're trying to accomplish and to be willing to learn and try again and again. I'm not saying that mechanical engineering is a bad profession; it just didn't work for me. And I'm not saying that centralized operations can't work; it just doesn't work for Slack anymore. Change is possible, no matter how hard or impossible it feels. So ask yourself what feels wrong about your work, imagine a different future, and don't be paralyzed by doubt or perfection or uncertainty. You don't have to be ready to make a change. If you have the support of leadership, if you're in a high-trust environment and you're empowered to make change, and if you commit yourself to continuous improvement, then progress is inevitable. Success is the speed and skill with which you go around this loop. So design thoughtfully, measure ruthlessly, and learn faster. Thanks.