I assume people can hear me. Excellent. Okay, hi. I'm Roy Rapoport. A few words about me; hopefully this is actually contextual rather than me just talking about myself. I've been in technology for about 20 years. I've done a whole bunch of stuff, which basically tells you that I'm not particularly good at any one thing. I've been at Netflix for a while. Some people who know me at Netflix know that I typically count these things in days. That doesn't mean I haven't liked it; I'm just that kind of guy. I work in monitoring engineering, by the way, where we build the telemetry and alerting platform for Netflix, so it's a good place for somebody who looks at telemetry as a way to sort of distinguish his experience at Netflix. Before I was on the monitoring engineering side, I was in IT ops, as Gene mentioned, one of I think maybe eight people or so who moved over from IT ops to the engineering side in the last few years as part of the migration to the cloud. I was a troubleshooter. I built a bunch of Python things. I was the first person who operated a bunch of Python stuff in our cloud, and I wrote a bunch of libraries to do that, which everyone used because it was either that or learn Java. I've got nothing against Java.

So here's one more thing about me, actually. This is my commute to work. It's about 15 minutes on a good day. It's about 45 miles each way. This is actually relevant, because this is the reason I chose this car. This car is much cleaner than mine, but it's the same model: a Hyundai Elantra 2011. It gets about 32 miles per gallon, which is pretty decent and kind of important if you're driving 45 miles each way. This is not the Hyundai Elantra 2011. This is a Maserati Quattroporte 2009. It's significantly more powerful, and it has horrific gas mileage. Now, the reason I bring this up is that you don't get one of these cars for the purpose the other car is good at. I actually have an acquaintance who owns a Maserati Quattroporte 2009, and the reason I was thinking of that particular car is that I recently heard him complain to me about the gas mileage he got on his way to LA. I get to complain about gas mileage. If you drive a Maserati, you don't.

So the whole point is that if you look at something and try to figure out whether or not it's a good thing (and I'm being very vague here), you've got to look at what you're actually designing it for. So when we talk about cloud operations and Netflix, we have to start with what's actually important to Netflix. And the number one thing that's important to Netflix is speed of innovation. Above anything else, above, in fact, the number two thing, which is availability, at least on the operations side, speed of innovation is the key, because we're pretty much never gonna be available enough to beat our competitors. Nobody's ever gonna go, "Well, that other company has much better personalization and they really helped me find the stuff that I want, but they have like 99.995% availability and Netflix has 100% availability, so I'm just gonna give Netflix my $7.99 a month." So speed of innovation is really the key for us. Availability is second, and frankly, cost is third. It's a nice place to be if you're an engineer and you're interested in getting stuff done. And if you look at how we've designed the company and the culture to actually support our priorities, what you find is that the first thing we talk about, and Adrian talked about this, is freedom and responsibility.
Freedom and responsibility has a whole bunch of elements to it, but basically what it comes down to is that we hire smart people, and we hire smart, experienced people. That distinction matters because there's really not a lot of supervision at Netflix. I'm a manager at Netflix, and managing at Netflix is in some respects the easiest job in the world, because you've got a bunch of really smart people who just know what they need to be doing. In order to do that, they need to have some experience to figure out what it is they need to be doing. So we hire those smart people, we basically set them loose, and then we watch magic happen.

One way to look at that: the keynote talked about how we need better requirements, we need better product managers. For my team, again, we build monitoring and alerting platforms. That means that our engineers are product managers, and that means that, for example, recently when we were completely overhauling our alerting platform, I had my engineer who was gonna be responsible for that, and I basically said to her, "Listen, I'm hoping that we're gonna be somewhat backwards compatible, to whatever degree you think is reasonable. Have a nice day." And she went and talked to a bunch of customers and recruited some UI people and some UX people, and a month and a half later came up with something that, frankly, if I had come up with requirements, would not have in any way been what I required. It was just so much better, so far beyond what I would have imagined, and this is the kind of surprise that I think managers at Netflix get a lot. When you unleash engineers to do something, when you don't constrain what it is they're gonna do, it's incredible.

So, getting back to operations. I wanna talk about what operations does and, frankly, what I think it should do. As a former operations guy, this is something I feel passionate about. But almost as passionate as I am about what operations should do and how it should do it, I'm also passionate, and maybe more ranty to be honest, about what operations shouldn't do.

So let's start with this. How many people here have ever worked in organizations where operations deploys code that engineering gives them? Okay, and how many of you were told that the reason we do this is because of separation of duties? Okay, so here's the thing. There's no such thing. Separation of duties was made up. It's not in SOX, it's not in ISO. In fact, there's a presentation that James DeLuccia from Ernst & Young gives later on in the day about compliance. Go talk to James, because I had dinner with James last night and I was talking to him, and I was like, "separation of duties," and he's like, "Yeah, that's not a thing."

So the other thing that I don't think operations should do is manage by runbook, and there's a bunch of reasons why managing by runbook is just terrible. For one thing, it externalizes the cost of operations. Managing by runbook basically means that as developers, you get to deploy pretty bad code that, for example, might require the server to be rebooted every few hours. I've been there, I've been on the far side of that runbook, and at least one developer from Netflix is laughing right now. Basically you can deploy bad code that requires server reboots and then add to the runbook: when the server hits this percentage memory utilization, go ahead and reboot the server. It's terrible.
And the other thing that it does is it frankly makes your operations people stupid. This is something that I realized only recently. We used to have a NOC, and the NOC at Netflix was kind of stupid, and I was really interested in making our NOC smarter. So I recruited a bunch of pretty senior people that I knew from previous jobs to join our NOC, because our hope was that by putting smart people into a stupid organization, you would make a smart organization. It actually turns out that what you get when you put smart people into a stupid organization is a bunch of stupid people. And I don't mean this as a derogatory term about these people, because I know that as soon as they left that organization, they became smart again. But if you look at the last keynote of the day, I wanna say Linda is talking about myths of organizational change: you can't just put smart people in a dumb organization and expect the organization to get smarter.

The other thing that you really should try not to do is stop people from making mistakes, which may seem a little counterintuitive. There are a few reasons for that. One is that fundamentally you're creating conflict, because nobody actually thinks they're making a mistake when they make a mistake. I've never seen a deployment go bad at Netflix where, when we did a post-mortem (and I'll talk about post-mortems later), the engineer in question was like, "You know, I knew that was the wrong thing to do. I knew that was a mistake, but I did it anyway." So you're creating unnecessary conflict. The other thing is that you're working around the real problem, which is frankly judgment; you're working around an endemic problem that you should actually deal with head-on. And the last thing is that you're actually stopping people from making the right call. For one thing, if you're trying to make people not make mistakes, you're assuming that your judgment is the right one. For another, you can't presume that you're always gonna know what the right call is.

We recently had a problem with a vendor where somebody did something that completely violated their policies and caused us some downtime. And I'm really not gonna mention the vendor. So we had a call with them. It was one of those calls where you have a bunch of senior people from their side and they're feeling really apologetic, and they're like, "Okay, listen. Here's what we're gonna do to make sure this doesn't happen again. We're gonna make sure that our tools do not allow people to violate policy." Okay, so that's great, as long as your tools never get in the way of making the right decision when some condition actually comes up where the right decision is to violate policy. That would never fly with us, and I think it frankly should not fly anywhere else. And lastly, it also means that, frankly, you're lowering the bar for what successful people in your organization need to display in terms of judgment.

So that's a whole bunch of operations don'ts. Let's talk a little about what I think you should be doing. I think fundamentally what you're looking at is you've gotta help your developers make better decisions. That's the education side; it's not enforcement, it's more evangelism. We'll talk about that in a second. The other thing is you've gotta look at basically better and faster recovery; to some degree it's preventing downtime.
But when you try to prevent downtime, it's too easy for preventing downtime to turn into actually slowing your ability to change the environment. So I would actually focus on maximizing your rapid response: being able to respond to downtime incidents faster and minimize them, because that actually gives you the comfort to put more stuff into production, put stuff into production faster, break more often. That ends up basically looking like trying to minimize time to detect (TTD) and time to recover (TTR).

So what does that actually look like for us and for our Cloud Operations and Reliability Engineering group? Well, for one thing, it means that our CORE group (again, Cloud Operations and Reliability Engineering) does a lot of... whoops, sorry, animation, you gotta love it... a lot of best-practice and evangelism work and talks about availability patterns with teams. That means that, for example, when we started moving toward active-active, which was two, three months ago (I wasn't even sure I was gonna be able to talk about active-active, but then Adrian mentioned it), our cloud ops people actually drove the process to figure out how, as an organization, we would switch between regions in case of an outage, and came up with the processes that would support that, both in terms of a better understanding of what the decision criteria would be and in terms of some of the tooling around that. It means that our cloud ops people work with teams to adopt things like Latency Monkey and Chaos Monkey to make services more resilient.

The other thing that they do is they build alerts. We have an incredibly powerful alerting platform at Netflix, and I know because my team is responsible for it; we built it from scratch. It is... well, here, here's the definition of an alert. There you go. This is what happens when you have PhDs building tools, by the way. So it is an incredibly powerful system, and we've made it easier for people to do simpler things with it. But I would definitely say that our cloud ops group is probably in the top 5% of users at Netflix who know how to extract incredible value out of our alerting platform.

The other thing that ops does is build tools focused on improving TTD and TTR. So for example, Kronos is a tool that ops built; Adrian mentioned it briefly but not by name. It basically takes a feed of changes in production, because it turns out that... man, I don't know about you guys, but whenever I've had to deal with a production issue, the first question I always ask is: what's changed in production? Anybody else ever have that question? Yeah, okay. And we figured that change control sucks for a bunch of reasons, and we don't want people to have to report what they're doing. So we actually built a system that lets every system that creates a change in our environment feed into it a log of what it did. That means that by using Kronos we can figure out very quickly what changes happened: for example, what code was deployed or what fast property was changed. It's massively reduced our time to react, our time to analyze, and our time to recover.
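To make that idea concrete, here is a minimal Python sketch of the change-feed pattern as described above: every tool that touches production appends an event, and an incident responder asks what changed in the last N minutes. The class and method names here are hypothetical stand-ins, not Kronos's actual API.

```python
# Hypothetical sketch of a Kronos-style change feed; names are illustrative,
# not Netflix's actual API.
import time
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    source: str        # the system reporting the change, e.g. a deploy tool
    target: str        # what changed, e.g. an application or a fast property
    description: str   # human-readable summary of the change
    timestamp: float = field(default_factory=time.time)


class ChangeFeed:
    """Every system that changes production appends events here automatically."""

    def __init__(self) -> None:
        self._events: list[ChangeEvent] = []

    def record(self, event: ChangeEvent) -> None:
        self._events.append(event)

    def recent(self, minutes: int = 30) -> list[ChangeEvent]:
        """Answer the first incident question: what changed recently?"""
        cutoff = time.time() - minutes * 60
        return [e for e in self._events if e.timestamp >= cutoff]


# Usage: a deploy tool reports a change, and a responder asks what changed.
feed = ChangeFeed()
feed.record(ChangeEvent("deploy-tool", "api-server", "deployed build 1234"))
for event in feed.recent(minutes=30):
    print(event.source, event.target, event.description)
```

The point of the design, as described in the talk, is that the reporting is done by the tools themselves rather than by people filling out change tickets.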
The other thing that they built was the central alert gateway. This is what happens when you try to find pictures that connote a concept that is kind of hard to find with reasonable CC attribution, so this is a random picture of the Hoover Dam.

The reason we created the central alert gateway goes back to when we started using PagerDuty. Who here uses PagerDuty, by the way? So PagerDuty is fantastic, right? One of the things that's great about them is that when you send an incident with a given incident key, and then you send a bunch more incidents with that incident key while that incident is open, you don't get more calls, which is great. The thing is, when we were looking at how we wanted to do alerting, we also wanted what we called minor incidents, which would just go out via email, which PagerDuty didn't support. So the central alert gateway basically lets you send an event into it and say whether it's minor or major. If it's major, it goes to PagerDuty; if it's minor, it sends an email to the person who's actually on call, based on PagerDuty schedules, but it dedupes that and only sends about one email an hour. It's been incredibly helpful. This was actually sort of a prototype that CORE came up with about two years ago, and for the last two years it's handled something on the order of 100,000 to 200,000 alerts per day. It's been really wonderful.
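A minimal sketch of that routing behavior might look something like the following. The helper functions are stand-ins for the real PagerDuty and email integrations, and all names and thresholds here are assumptions for illustration rather than the gateway's actual code.

```python
# Hypothetical sketch of the central-alert-gateway routing described above.
import time

EMAIL_SUPPRESS_SECONDS = 3600   # roughly one email per hour per alert key
_last_emailed: dict[str, float] = {}


def page_pagerduty(incident_key: str, description: str) -> None:
    # Stand-in: a real gateway would call the PagerDuty API here. PagerDuty
    # itself suppresses repeat events that share an open incident key.
    print(f"PAGE  {incident_key}: {description}")


def email_oncall(subject: str, body: str) -> None:
    # Stand-in: a real gateway would look up the on-call person from the
    # PagerDuty schedule and send them email.
    print(f"EMAIL {subject}: {body}")


def handle_alert(key: str, severity: str, message: str) -> None:
    """Route one alert: 'major' pages, 'minor' emails the on-call, deduped."""
    if severity == "major":
        page_pagerduty(incident_key=key, description=message)
        return
    now = time.time()
    if now - _last_emailed.get(key, 0.0) < EMAIL_SUPPRESS_SECONDS:
        return  # already emailed about this alert key within the last hour
    _last_emailed[key] = now
    email_oncall(subject=f"[minor] {key}", body=message)


# Usage: repeated minor alerts collapse to one email; a major alert pages.
handle_alert("api-error-rate", "minor", "error rate above threshold")
handle_alert("api-error-rate", "minor", "error rate above threshold")  # suppressed
handle_alert("api-latency", "major", "p99 latency critical")
```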
More closely to what people know that we've built, we've got Chaos Gorilla, which is the system we built to simulate losing an availability zone. How many people know about Chaos Monkey? Okay, so most people. So you guys know Chaos Monkey basically just terminates a given instance in a given application. Chaos Gorilla actually lets us simulate losing a whole availability zone within a given region. We've used that several times. We try to have manual exercises using Chaos Gorilla at least once a quarter or thereabouts, to make sure that when we do lose AZs, which happens less often than we fear and more often than we'd like, we can actually handle that reasonably well. And most lately we've got Chaos Kong. Now that we have two regions, we wanna make sure that we can actually address losing a region pretty well. So whereas Chaos Monkey loses an instance and Chaos Gorilla loses an availability zone, Chaos Kong will actually simulate losing an entire region. I don't know what goes beyond Chaos Kong; I think we'll have to leave the simian theme, potentially.

The last thing that operations ends up doing is, frankly, firefighting. Stuff goes bump in the night, right? And firefighting actually looks like three kinds of activities. On one hand, you've got the low-level coordination. So there's a relatively low-level operations person who will start conference calls, who will page people, who will start looking at a dashboard when things go wrong. By the way, this is very important: if your ops people's job is to look at dashboards, I would argue you're doing it wrong, because it doesn't matter how good your ops people are; you watch dashboards for enough hours in a day and your brain will rot. But these people will look at dashboards as a way to figure out what's going on once we know that there's a problem. So that's the low-level coordination side. On a higher level, we actually have a role called the Engineering Crisis Manager. Now, for an interesting historical perspective, the Engineering Crisis Manager, or ECM, used to be a role that rotated among all managers and directors in engineering. And then it turned out that if you had a manager in engineering who was the crisis manager only once every three months or so, they weren't very good at it. So we now actually have site reliability engineers rotating that responsibility.

ECMs are important because, frankly, what we find is that when you get a bunch of really smart engineers on a call trying to work something out, sometimes they benefit from a little bit of guidance. Everybody wants to go in the right direction, and everybody's pretty sure that they know which direction that is. Some of the things we've found ECMs are really helpful for: not pursuing root cause analysis, because frankly, when we have an outage, what we care about is not RCA but bringing availability back, bringing the service back. The other one is coordinating both efforts and plan Bs: we're gonna do this, and in 10 minutes, if we don't see progress, let's do that. It's been incredibly helpful. Lastly, for those who can't read from afar, this is the San Francisco Medical Examiner Mobile Command Unit, which is a bit morbid.

So, incident reviews. Incident reviews at Netflix are actually pretty easy, because it turns out that in an environment that values judgment, that values basically failing more often, there's really not a lot of blame and there's not a lot of defensiveness. I remember an incident review for which I was the ECM where we walked in and said, "Okay, so let's talk a little about the timeline," and we had the engineer stand up. And then for the next 25 minutes, he walked us through every step that he made and included a lot of commentary like, "And this is where I made the wrong choice. This is where I should have included somebody to check over my work. This is where I should have clicked there instead of this." And he basically concluded the whole thing with, "And this is how I screwed up. It's all my fault. I'm really sorry, guys." That's a terrible root cause. I'm a big fan of John Allspaw and his talk about incident post-mortems, and I also happen to believe, much like John, that human error is not a root cause, because you can't tell humans not to make mistakes. So for a lot of reasons like that, what we find is that our ops group has to look at these incident reviews and figure out: what are we actually going to do? How are we going to improve our tooling and our systems to prevent these sorts of outages, and not simply take somebody falling on their sword as the answer to what happened and what we should not do again?

So overall, what does that mean? What does it take? Well, several things. Obviously, grace under pressure. I mean, this is not a Netflix thing, right? Anybody here who's in ops, anybody here who's been in ops, knows what grace under pressure looks like and has probably worked with at least one ops person who did not have it. It also means that you've got to have fantastic technical skills, and in our environment that looks like two distinct domains. One is an understanding of how to debug very complex, large distributed systems. You know, in our service-oriented architecture, we have something like 500, 600 different applications; inter-application relationships can become a little complex. The other part is actually building tools. If you look at Kronos, if you look at the central alert gateway, if you look at Chaos Kong, we're looking for people who are actually pretty strong at building these sorts of tools. It means a passion for making things better. I've got to tell you, I've been at Netflix for a while now, and one of the things that is incredibly rewarding at Netflix is the ability to walk into work every day and make something better. And in fact, that's what we get judged on.
And lastly, because it is a freedom and responsibility culture... I was actually previewing this with one of our SREs yesterday, and he said, "You know, you should have something there about persuasion," because unlike a lot of ops groups elsewhere, we can't actually tell people what to do. And that means that, even though you're working with a bunch of really smart people, sometimes what it looks like is they're doing something, you think they should be doing something different, and then the job is to actually be persuasive, to evangelize successfully, because you're never gonna stand in their way. And that's an interesting challenge for me as somebody who loves persuasion. It's a wonderful opportunity. It's not an opportunity for everybody.

So I wanna leave some time for people to ask questions, so I just have a TL;DR here. Stay out of the way. Help other people stay out of their own way (originally I had something about helping other people not shoot their own toes off). Improve visibility and recover faster. That's it. And that's actually all I've got. So, what can I tell you that you don't already know?

Hi, earlier you were mentioning a case where you had someone go off and do a task. And in one of the previous keynotes, someone said that "requirements" means "shut up." So there's this kind of thinking of giving people freedom to go out and do things. I'm just wondering how you go through the process of defining a feature. Do you have user stories? What's the process of trying to bound what you're trying to do, in some sense, and give direction to engineers to go build things?

I try very hard not to do that. So in this case specifically, we were looking at rebuilding the central alert gateway, CAG, because we were moving it from CORE to my group, and we were rewriting it from Python to Scala. Thanks for the plug for Python and Scala, by the way, Adrian. So I gave it to my engineer. I said, ideally we should try to minimize user pain as part of this process, and ideally we should end this by the end of this quarter. That's it. So from my perspective, that's all that was given in terms of context. Now, what she then did is she went and talked to a bunch of our users and asked them where their pain points were. And really, I mean, the standard stuff, right? The point is that she did that herself. She was the product owner for this system, and she frankly continues to be the product owner for this system. So she released a 1.0 that was very successful and is now looking at 1.1 features. And by the way, in at least one of those cases, she's going to release a feature in this product that I don't think she should, but she owns the product, so she's doing it. Did that answer your question?

So Roy, you talked about these smart people. You give them like two or three kinds of signposts, right? Then you get out of the way. So what do you do with most of your time? What do you spend it on? I mean, I don't mean that in a bad way or anything, but there's stuff that you're doing that's really cool; what is that?

You mean other than Slashdot? So I look at the strategic direction for monitoring engineering. My team is actually really good at figuring out where we're gonna go over the next quarter, and in fact, if you look at our quarterly goals, I don't really set our quarterly goals. The last time we had a quarterly goals meeting, a few weeks ago, I stood in front of a whiteboard and I took notes. That's fantastic. But I talk to our customers.
So for example, Sangeeta there (hi!) is one of the engineering managers for our API team, which is one of our leading-edge customers. So I have a lot of conversations with API about where they think monitoring needs to be in a year, so we can continue to align our product with their direction. Yeah, yeah, something like that. I also deal with occasional problems, like my team being frustrated with something. Primarily my job is to get obstacles out of the way of my team and give my team context about what I see from customers: not specific requirements, but how our customers think about monitoring, telemetry, and alerting 12 to 18 months from now. Did that answer your question?

How does Netflix prioritize across the many different requests?

Well, Netflix doesn't. Each team figures out how to prioritize across its many different requests. I can tell you what the answer is for monitoring engineering, that's my group. Would that be useful? So I've only been managing monitoring engineering for about 680 something... no, no, no, sorry, not that long. The way we figured this out (actually this was sort of a big aha moment I had about two, three months ago) is that we have two kinds of customers at Netflix. And by the way, our customers are all internal development and BI groups; they all consume information from us. We have leading-edge customers and trailing-edge customers. Our leading-edge customers always want more from us and are willing to pay more for that. Our trailing-edge customers mostly want us to not change anything. And what I discovered was that we can't actually serve those two customer groups equally well, because that ends up meaning that you serve both of them pretty badly. So what I've actually come up with officially (and I've had conversations with our leading-edge customers and with our trailing-edge customers about this, so I've been totally transparent) is that we're gonna prioritize future development for our leading-edge customers, and to whatever degree is possible, we're gonna minimize changes that affect our trailing-edge customers. But our focus in development is what our leading-edge customers need and what our leading-edge customers believe they're gonna need over the next 12 to 18 months. That's it. Now, that's a very high-level, broad, ambiguous statement, and that's exactly right, because what it actually means on the ground is whatever the engineers in my organization think it means on the ground. That's for them to implement. Does that make sense? It seems like a really unsatisfying answer. Any time for one more?

Hey, so my question is, a lot of what you're talking about is very systematic, right? Sort of end-to-end thinking. When you're hiring people, what do you look for? Because you're talking about, hey, do the right thing, and a lot of people work in organizations where there may be different values in place and maybe they don't think as systematically.

So Netflix talks a lot about its culture. I would say that interviewing is really important to us, because fundamentally, when people are the primary asset you've got, you've gotta be very careful about your selection. Obviously we look for smart people. Okay, who here works for an organization that says it looks for dumb people? So obviously we're looking for smart people. We're looking for people with good judgment.
We're looking for people with insight, which means that one of the things I ask when I talk to people is: tell me about mistakes you've made, tell me about failures you've had. We're looking for people who are not too arrogant. I've done a bunch of interviews at Netflix in my time, and one of the probably three smartest people I ever interviewed, who we hired, lasted 76 days at Netflix because he was a jerk. Netflix has a no-brilliant-jerks policy, and we take that policy really, really seriously. So that's a case where we missed something in the interview. We try not to miss that. Did that answer your question? Good. Awesome, thank you so much, guys.

And now I have to tell you, I have loved every interaction I've ever had with Roy, and anytime I get an email or a tweet, I usually have to spend days studying it. So thank you so much, Roy.