Good morning. Four years ago, I was working at a tech startup in San Francisco, and I was releasing a new piece of code to production. It was a large Hadoop cluster, and we had these task workers on the cluster. I put a little code in my RPM package to make sure that all of the old workers were killed when the new RPM was installed, just a little killall -9 of everything owned by the current user, neglecting to think that when I ran the install, that was going to kill everything owned by root, because root runs the install script. So this took out sshd, syslogd, crond, and a fourth daemon running on the system, but it left the servers up; it didn't kill init. The users of the cluster were still getting work done, and we didn't want to just reboot all the machines, because they were all running HDFS and that would have been nightmarish. So eight of us spent the greater part of a day logging in over the lights-out console, one server at a time, restarting the four daemons with a Diceware password that took about a minute to type. Eventually we got it all done, but it was a massive pain in the butt.

Then there's my first day on the job, my first job out of college. I'm sitting in the NOC, done with all the paperwork, and they sat me down in front of a console. A senior sysadmin walks through and says, yeah, just go log into this machine and kill the foobar process; it's under daemontools or some such, so it'll restart itself. So I log in, not really paying attention. The login banner says Solaris 2.5, and I type killall foobar. You know what happened? On Solaris, killall does exactly what the name says: it kills everything. So streaming over my console are all of these shutdown messages, and then the phone rings. I pick it up and it's a trader, and you know it's a trader because the first thing you hear is, what the, what's going on down there? Because I had just killed trading for the whole company.

We tell you these stories because what's really important about how a sysadmin grows into being an SRE is the experience we all accumulate making mistakes, doing things in production that cause outages, and seeing things like single points of failure and the pain they cause the business and ourselves. So as we look at the evolution from sysadmin to SRE over the past 30 years: once upon a time, we all managed individual servers. We installed them, possibly from a CD, one at a time through a data center. We knew the name of every server, and we'd know, hey, server Zavix is not working well today because I screwed up one of its libraries last night while I was doing some maintenance. Over time that didn't scale, and we had to learn how to manage our servers using code. I think the real difference between the old world of systems administration and the new world of reliability engineering is that it's no longer about managing individual servers; it's about managing fleets of servers, and it's about reliability engineering. It's not "I want to keep one server up." It's "I need to keep the service up, the whole fleet, the application running."

So, my name is Jonah Horowitz. I work at Netflix, on the SRE team. I got here through a journey that I think a lot of us share: I started playing with computers in high school, where I had a little BBS that my friends would log into and play Trade Wars with me.
I worked on a help desk, and then at walmart.com, where I helped launch the site in 2000. I meandered through a bunch of other small startups around the Bay Area and landed at Netflix about 10 months ago.

My name is Al Tobi. I started with computers a little later than probably a lot of people; I was a music major in college, dropped out, and got that first job. I worked at Limelight Networks, Sony Online Entertainment, Cisco, a place called Ooyala, and now I'm at Netflix as an SRE.

The SRE team at Netflix is called CORE. There's some background on the slide that I'm not going to read to you, but our primary job at Netflix is to be the last line of defense. Most of the engineering teams at Netflix deploy and manage their own services, so we're not doing their configuration for them, we're not spinning up their servers, we're not gatekeepers. The one thing the whole business really relies on our team for is when something falls through the cracks or some major failure occurs: we're the ones who respond, get online immediately, and make sure the right things are happening to bring the service back online as soon as possible.

We also fill in the cracks in other ways. We do a little bit of research-and-development work, mostly around ops. For example, we might discover that while we have a really awesome alerting and monitoring system doing a lot of neat things, our alert volume is a little higher than we'd like, and certain alerts fire, say, when we fail over across regions, even when the failover is intentional. So we want to do things like Pearson correlations to say: the traffic is missing in this region, but it's showing up over here, so we can automatically squelch those alerts with just a little bit of Python code. That's the kind of thing our team does, and then we show it to the other teams, and hopefully they absorb it into their products.
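To make that squelching idea concrete, here is a minimal sketch, assuming you can hand it recent starts-per-second series for each region. The helper names and the 0.9 threshold are invented for illustration; this is not Netflix's actual code.

```python
import numpy as np

def should_squelch(drop_series, other_regions):
    """drop_series: recent starts-per-second for the alerting region.
    other_regions: {region_name: starts-per-second series} for the rest."""
    drop = -np.diff(drop_series)  # positive where traffic is falling
    for region, series in other_regions.items():
        rise = np.diff(series)    # positive where traffic is climbing
        # Pearson correlation between the drop here and the rise there
        r = np.corrcoef(drop, rise)[0, 1]
        if r > 0.9:               # the "missing" traffic shows up elsewhere:
            return True, region   # intentional failover, squelch the alert
    return False, None

# e.g. traffic draining out of us-east-1 while us-west-2 picks it up:
east = [100, 90, 70, 40, 20, 10]
west = [50, 60, 80, 110, 130, 140]
print(should_squelch(east, {"us-west-2": west}))  # (True, 'us-west-2')
```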
It's important to note one really key aspect of how Netflix operates as a service: every engineering team is responsible for writing their code, deploying their code, and running their code in production. We don't manage any of the services in production. But when you hire a lot of engineers from other companies, their experience running code in production may not be very deep, so we consult with those teams to give them the operational skills they need to be successful running their code in production. We talk with all the engineering teams about making sure they have good dashboards, good alerting, and properly tuned Java garbage collection. These are all places where we take the skill set we have and make other teams' services at Netflix more reliable. We have a couple of other, smaller tasks too. One is that we're a liaison to Amazon: we run everything in AWS, so knowing what's going on at AWS is extremely important, and we have a lot of communication with them. And finally, the last thing we do is incident response.

During a large-scale incident, we're responsible for making sure communication keeps happening. We engage the engineering teams responsible for the impacted services, bring other engineering teams into the call if they're needed, and take the current status and communicate it outward to any team that's going to be affected, up to and including public relations and upper management. And after an incident, we document what happened. We work with the teams on an incident retrospective, and we create action items for follow-up and track them to make sure they get completed.

Part of the way Netflix operates is the idea of context, not control. We never tell a team exactly how they should implement something. Instead, we inform them about why, or about what some of the other teams do, and we show them that by adopting some best practices, their lives are going to be easier. So instead of saying, hey, you should have an alert on your CPU if it goes above 80 percent, we show them that when their CPU goes over 80 percent their service starts degrading, and we let them decide that they want an alert at that level.

The other thing we do spans the entire organization, and for those of you who are fans of DevOps, it's one of the things Netflix does that I think is fairly unique: we hire the best person for the position, across the whole company. That's a big part of why we can do things the way we do. We hire like a professional sports team, or, as I prefer, a professional orchestra. We hire the very best people, so that when I'm playing my part, I know the person next to me is also very good. I don't have to talk to them about tuning their instrument or making sure they use the right bowings; they know their job and they can do it, and it's not my business to tell them how, so I don't have to worry about it. And the other part: if you talk to a lot of startups, especially Bay Area startups, they give you that line about, well, we're kind of like a big family. The thing is, Uncle Bob never does the dishes after Thanksgiving, not once, and you don't want Uncle Bob on your team, because he's not going to help out. With that family mindset, you get an environment where there are people you have to work around. We've all run into that in our careers, and some of us deal with it daily, but because of the way we hire and staff at Netflix, it's not a concern we have. It's not a family; it's a professional sports team. People come and go, and that's just part of life, with or without that kind of policy being written down.

Another aspect of Netflix is freedom and responsibility. About a year and a half ago, one of the front-end engineering teams said, we would like to switch our production stack from Java to Node. They decided that because they were already writing JavaScript for the front-end clients, and they wanted to run JavaScript on their server stack too. Now, we didn't know anything about running JavaScript in production, but we trusted that they were smart people who could figure out how to do it, and that we could support them and help them be reliable as a service.
Now, it didn't go perfectly smoothly; there were definitely some outages and problems caused by that transition. But that team is now running JavaScript in production and kicking butt. They've got their operations under control, and we're helping them fine-tune some of the smaller parts of their instances, not managing them day to day. We're not doing their code pushes for them; they figured that out on their own. So we gave them the freedom to decide they wanted to make that switch, and they also took responsibility for the outcome.

How many of you have heard the line, when you're interviewing somewhere, that, well, this team is kind of like a little startup inside of a big, huge company? Has anybody heard that one before? Anybody think it's nonsense? When companies attempt this, what they're trying to get is the pace of innovation that comes from a startup, without the bureaucracy and without all the controls that slow things down. And when you break that down and get to the bottom of it, what you really want is a learning organization that can move at high velocity. That's what we do at Netflix: we make mistakes, we make them in production quite often, but we've optimized for MTTR, mean time to recovery. We try to get things fixed as quickly as possible so that we can let our engineers make mistakes, and we accept that this happens on a regular basis.

One example where Netflix very publicly made a mistake and corrected it: does anybody remember Qwikster? If you look, there's still an apology from our CEO on our website, and that's a great example that I loved when I came to Netflix. When I saw it, I thought, wow, that is really different from other organizations I've seen, which would either try to bury it in history or pretend it never happened. That's an important part of our incident review and blameless postmortem culture too. Our focus isn't to say, hey, Bob screwed up and deployed to production and we should fire Bob. It's: what was missing in the automation, or in the interface for deploying code or settings to production, that made it possible for him to make that mistake, and can we remediate it?

I talked about this a little earlier, but it comes back to giving teams the tools they need to run things successfully in production, driving reliability without taking responsibility for it ourselves and running everything for them. Every team at Netflix needs tooling and best practices for driving reliability, and we don't have a central team doing pushes or orchestrating any of that.

So now we're getting further into how Netflix does things, but first: how many of you are sysadmins right now, with systems administration in your title? And how many have SRE or production engineer, that kind of title? Okay, good, a good mix, excellent. How about DevOps? No booing. That's cool. I actually like that as a title. I know it's unpopular in some circles, but because of the way DevOps leads us to reach out into the organization, I think it's a really cool way that titles have been evolving.
One thing all of these roles have in common is that, like a systems administrator, even a junior admin, you're a generalist, right? You do a million different things, and you're maybe not the best at all of them. I'm not the best programmer in the world, but I do a lot of programming for automation and for building tools to help with my job. I also have to do systems administration things, dealing with Linux distributions and packaging and all of that, and I've dealt with mail servers and a bajillion different applications. We're generalists, and I think that carries through the entirety of our career tracks.

The other thing we do, especially as Netflix SREs, and it's one of the big differences, is that we leave behind a lot of the fiddly bits. I'm not managing LDAP servers, I'm not deploying Sendmail, I'm not messing with BIND, the things that have been constants in our careers as sysadmins. Our goal is to take the knowledge and experience we've gained over years of deploying all these applications, and dealing with all the bull crap they bring into our lives, and look from a much higher-level view, like Google Earth zoomed out. We look at the Netflix stack as a whole, understand how things fit together and how failures might propagate through the system, and communicate that to engineers inside the organization.

But Netflix is so large that we can't actually know the whole system. There are something like 3,000 microservices interacting with each other just so that we can stream movies. There is no sysadmin, no engineer at Netflix, who actually knows how the whole stack works. We have a lot of tools to visualize it, and when we're having a problem with a node in that graph, we can see how that node is connected to other nodes, but as an SRE you can't know everything about the system anymore. It's too big, it's too complex. It's more about giving every team that deploys code to production the tools they need to drive reliability. As SREs at Netflix, we help those teams drive their reliability, help them get the tools they need, and when we come across a best practice one team is using, we make sure it's socialized throughout the whole organization so that everyone is more reliable.

So, how many people are using Puppet, Chef, CFEngine, something like that? Back in the day, we all installed things by hand, or maybe net-booted our servers, and then we started using these great configuration management tools. But they have a lot of challenges, and one of them is that if every development team is pushing its own code to production, you end up with choke points around who is responsible for managing the Puppet config. You don't want to give every developer commit access to the Puppet repo, because then they can change the /etc/passwd file and take out logins for the entire production operation. It's very hard to compartmentalize some of these configuration management tools, and it's very hard to train all the engineers on how to use them.

So Netflix came up with a different way of doing it: we produce baked AMIs. Instead of pushing configuration management changes to servers after they've been deployed, we deploy entirely new images. Every time we push to production, we push a complete virtual machine image.
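As a rough sketch of what "pushing a complete machine image" means mechanically, here is the shape of a bake-and-launch step written against boto3. To be clear, this is a toy with invented names and parameters; the real flow, described next, runs through Jenkins and Spinnaker (Netflix open-sourced a bake tool called Aminator), not a hand-rolled script like this.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
asg = boto3.client("autoscaling", region_name="us-east-1")

def bake_and_launch(build_instance_id, app, version):
    # "Bake": snapshot a fully configured build instance into an AMI
    ami = ec2.create_image(
        InstanceId=build_instance_id,
        Name=f"{app}-{version}",
    )["ImageId"]

    # Launch a brand-new cluster from that image. The old cluster keeps
    # serving until traffic is shifted, so rollback is just shifting back;
    # nobody ever logs into these machines to change them afterward.
    asg.create_launch_configuration(
        LaunchConfigurationName=f"{app}-{version}",
        ImageId=ami,
        InstanceType="m4.large",
    )
    asg.create_auto_scaling_group(
        AutoScalingGroupName=f"{app}-{version}",
        LaunchConfigurationName=f"{app}-{version}",
        MinSize=3, MaxSize=10, DesiredCapacity=3,
        AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
    )
    return ami
```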
So when a new version of a piece of software gets released, it goes through our Jenkins pipeline, gets built into a virtual machine image, gets uploaded to AWS, and launches a whole new cluster, and we just fail over from one cluster to the other. If there's a problem, we can quickly revert to the old cluster running the old software stack, and we never have to log into machines after they've been launched. This provides a lot of advantages.

One of the things we're working on at Netflix is how we're going to use containers across our environment, and a question that comes up regularly in those discussions is: what's the point? Because we already have a system in place that gives us what I think is the number one value of containers, and I definitely love using containers in other contexts. That value is the ability to package up software in such a way that I can test it in one environment, move it to a different environment, maybe canary it, and then, when it goes to production and starts receiving production traffic, know bit for bit that at each of those steps the software was identical, all the way down to the kernel interface. That gives you a lot more confidence. You don't get things like libxml2 going from one point release to the next and changing some ABI symbol, and then the whole system comes crashing down. I've had that happen before; just an apt-get update destroys the system.

Sorry, what's the question, please? So, we store state in a separate set of clusters; we have a persistence layer that's responsible for storing state. When we're rolling what we call a red-black push, we're not failing over that stateful layer, we're just switching the higher upstream services. The deployment process for new stateful services is a little different. It's more like a rolling release: we bring up one node with the new setup, then destroy one of the old nodes, and do a sort of cascading release across the entire persistence tier. So that's managed slightly differently, but we still release those persistence tiers as fully baked machine images with a different set of software on them; they're just set up so they can do rolling releases. And that's a critical aspect of microservices that isn't the subject of this talk but is one to take home with you: people talk about stateless services, and stateless services don't exist. What you want to do is hold onto state for as short a time as possible, which reduces the probability of losing that state during a failover or a complete machine loss.

And then, as Jonah said, we bleed traffic from one version to the next. Where was I? So yes, we get this bit-for-bit identical deployment into the different environments, and it's all managed by a tool called Spinnaker, which is open source. Maybe you've heard of Asgard; this is the new version, and we'll talk about it more as we go. The other thing built into these tools is automatic canarying, so the tool can automatically deploy a canary with some percentage of real-world traffic and test a release in the real world.
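Going back to the stateful tier for a second: the cascading, one-in-one-out release described above might look roughly like this as a toy loop. bring_up_node, is_healthy, and retire_node are invented helper names standing in for real orchestration, so this is a sketch of the pattern, not actual tooling.

```python
import time

def rolling_replace(old_nodes, new_version):
    """Replace a stateful cluster one node at a time, additions first."""
    for old in old_nodes:
        new = bring_up_node(new_version)   # hypothetical: launch replacement first
        while not is_healthy(new):         # wait for it to join the ring and
            time.sleep(30)                 # finish streaming its share of data
        retire_node(old)                   # only then remove the old node
```

The point of the ordering is that capacity and replica count never dip below normal during the release, which is what makes it safe to run against a persistence tier.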
And as we were just talking about, chaos is something that just exists, right? We can't prevent it. Most of us who have been doing this for a while have done all the exercises to try to prevent every single thing that can fail. You've got dual network paths, you've got SANs with dual paths, you've got RAID, dual power supplies, sometimes these funky machines with mirrored memory, things like HP NonStop. What all of these have in common is that they cost an enormous amount of money, and it's not like RAID just makes storage twice as expensive, because you also need a RAID card, which makes it more than twice as expensive. The cost of safety climbs quickly compared to building failure tolerance into the system instead.

So Netflix switched to the idea of deliberate chaos. One of our more famous tools is Chaos Monkey. It's actually not very much Python code. It runs around in our production infrastructure every day and just kills some of our instances. This forces our developers to code in a way that can't depend on special instances. There are no magic instances, no golden instances, because any instance in a production cluster can get killed off at any time by Chaos Monkey. When this was first rolled out, it was a little terrifying, and we let teams opt in when they thought they were ready to take that sort of risk with their systems. More recently, we've actually made it opt-out: only teams that can justify not running Chaos Monkey for some specific reason are allowed to opt out, and we have almost complete coverage of our production systems when it comes to chaos going around killing instances.

We even have chaos running on our persistent data systems. We run a very large Cassandra deployment, and even the Cassandra nodes get killed off by Chaos Monkey on a daily basis; almost every cluster loses a machine every day to it. And this is great, because if you run in Amazon at any scale, you'll get, like I do every morning, 25 or 30 emails saying that node XYZ is going to have maintenance and be turned off tomorrow or next week. When you're running with chaos, you're like, oh, so Amazon's going to kill that node. It might already be sick; it doesn't matter, it'll be fine. There's just no way around having this sort of system if you're at scale in a cloud environment. And it's getting more important over time: when Amazon was fairly new, this happened all the time, it was constant chaos when you deployed there, but they've been getting better and their servers are getting more reliable, so we keep Chaos Monkey running to keep people on their toes.
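The actual Chaos Monkey is open source on github.com/Netflix, with scheduling, opt-out configuration, and safety checks; the core idea, though, really is this small. Here is a hypothetical minimal version written against the AWS APIs, purely to show the mechanism:

```python
import random
import boto3

def unleash(asg_name, region="us-east-1"):
    """Pick one random instance out of an auto scaling group and kill it."""
    asg = boto3.client("autoscaling", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)

    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    instances = groups["AutoScalingGroups"][0]["Instances"]
    victim = random.choice(instances)["InstanceId"]

    # The ASG notices the loss and launches a replacement automatically,
    # which is exactly the recovery path this exercise is meant to prove out.
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```

Running something like this daily is what makes "no golden instances" a property you can verify rather than a policy you hope people follow.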
So, I talked about optimizing for innovation over safety. One of the things that happens a lot in larger organizations, especially those that grew large very quickly, is that something goes wrong and the first thing a lot of people say is, aha, I should put in a process to make sure this never, ever happens again. And the more "never ever" it gets, the heavier the process gets, and the slower the innovation, right? If you've been at a small company long enough for it to grow into a large company, you've been through that pain, where you used to be able to just fix stuff.

When I started at a healthcare company I worked at for a long time, I could just SSH in as root and fix things when they broke. Over time it became: well, you can't just log into the server and fix things, you need to file a change request first. So I'd file a change request, and then I'd go fix it. Then it was: no, you can't just file a change request and go fix it; the change review board has to look at it first, a board usually made up of people who have no idea what I just said in the change request, and then they approve it, and then I can go fix it. And it just keeps getting worse over time. Then the security policies come in, and it all keeps crufting up, like barnacles that keep building up and never really get scraped off or refactored down.

So we looked at that, and it turns out there is a different way to do it. Instead of having to approve every individual change before it goes out to production, and this is a recurring theme, we let our engineers take responsibility for releasing their code and making the changes they need to make in production, and we log those changes in a system we call Chronos. At any given time, there's a literal stream of events going through Chronos: every automated change and every intentional change being pushed to production gets logged. So when we see an impact, we can look at the start time of the impact, go into Chronos, quickly narrow down the potential changes that could have caused it, find the team responsible, and work with them to mitigate or resolve the problem. We let them roll forward, we let them roll back, it's really up to them. But the ability to quickly find out what changed, how it changed, and when it changed lets us get rid of the huge overhead of a heavyweight change control process.
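As an illustration of why a single change stream makes that triage fast, here is a hedged sketch of the query you end up running. The event shape and field names are invented; we're not describing what Chronos actually exposes.

```python
from datetime import datetime, timedelta

def suspects(events, impact_start, lookback_minutes=30):
    """events: list of {'timestamp': datetime, 'team': str, 'change': str}."""
    window_start = impact_start - timedelta(minutes=lookback_minutes)
    in_window = [e for e in events
                 if window_start <= e["timestamp"] <= impact_start]
    # Most recent first: the change closest to the impact start is
    # usually the best place to begin asking questions.
    return sorted(in_window, key=lambda e: e["timestamp"], reverse=True)

impact = datetime(2016, 1, 22, 14, 5)
events = [
    {"timestamp": datetime(2016, 1, 22, 13, 40), "team": "api",      "change": "config push"},
    {"timestamp": datetime(2016, 1, 22, 14, 1),  "team": "playback", "change": "red-black deploy"},
    {"timestamp": datetime(2016, 1, 22, 11, 0),  "team": "billing",  "change": "scaling policy"},
]
print(suspects(events, impact))  # playback deploy first; billing falls outside the window
```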
Another thing we do a little differently is how we handle our tooling for systems administration, as I mentioned earlier. How many of you are using Nagios today in production? That's pretty good. How many are using one of these other tools listed up here? Cool. We don't really use any of these, because as Netflix grew, these tools didn't follow the scale. A lot of them have since grown up and improved for high-scale usage, but at the time they didn't work for us, so we built a different tool. And the thing about these tools: I did all of this too, right? I also had a Nagios installation, and I'd update the config every couple of days, check it into Git, and have a cron job to pull it down, and that was fine. But the installation and the machine would start to cruft up with things like all the dependencies for the plug-ins, and it would sometimes be two or three years out of date, because nobody wanted to take on the job of refreshing the Nagios box, with all the little things that were going to bust and ruin your evening. So this is different again from a traditional large organization, or even a small startup: we don't take responsibility for owning these things.

Netflix has a team called Insight Engineering, and they're great; they run the production instances of our monitoring systems. So we don't have an SRE ops team that's responsible for writing code, deploying code, managing stuff in production, and, oh yeah, also being responsible for every last piece of monitoring running in production and whatever other tasks. We've pushed the authority and the ability to run those tools to the individual teams, and we all rely on Insight Engineering, who wrote an open source tool called Atlas. It's on GitHub, and it allows us to collect metrics at the scale we collect at Netflix: billions of metrics a minute, multi-dimensional, so we can group them and look at them in all sorts of different, unique, and interesting ways. We have some screenshots of what that looks like in a minute.

Yeah. So, to repeat the question: what would you say you do here? That's a good question. Like I said, it's a higher-level view. We're not the maintenance crew working on the plane; we're more like the pilots, or really more like ground control, since ground control can see everything in the air at once. So it is kind of like a NOC, but we don't actually have a NOC. We're not watching dashboards all day; in fact, we don't have any dashboards in our seating area, no information radiators. We banned them when we moved to our new space. Instead, we pay attention to alerts, so there's an on-call component that will be very familiar to most of you. We engage with the engineering teams, which is a big consultative role where we spend a lot of our time. We comb over the minor alerts that roll through, looking for instability in the system that's starting to jitter, hoping to catch things before they bubble up. Mostly, though, when we find things like that, it means we need to go engage with an engineering team and make sure their alerting is improved. That really is the majority of the time. And then there's one other component: because we have all this experience with other monitoring and time-series systems, we play a small, very unofficial role as a kind of product manager for those tools. We work closely with those teams to say, hey, wouldn't it be nice if we had this? Can you build this kind of feature? They have a lot of other demands from the engineering teams, but we're one of the primary drivers.

We do both; and yes, I forgot incident management. That's the other service we provide to the whole company: when there are incidents, we get involved during the incident and track everything that happened. Today that's in JIRA; hopefully we'll have better tools soon. We make sure we log that this happened at this time and that happened at that time, so that when we do the incident review, we can go through the series of events, figure out what happened, and then work with the engineering team to remediate.
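Before the screenshots, it's worth making "multi-dimensional" concrete. Here is a toy model of tagged metrics, not the real Atlas API: each datapoint carries dimensions, and a query can aggregate along any of them, which is what lets you pivot from "starts by region" to "starts by device" instantly.

```python
from collections import defaultdict

# Each datapoint is one metric name plus arbitrary dimensions (tags).
datapoints = [
    {"name": "stream.starts", "region": "us-east-1", "device": "ios",     "value": 812.0},
    {"name": "stream.starts", "region": "us-east-1", "device": "android", "value": 907.0},
    {"name": "stream.starts", "region": "eu-west-1", "device": "ios",     "value": 233.0},
]

def group_sum(points, dimension):
    """Aggregate the same datapoints along any chosen dimension."""
    out = defaultdict(float)
    for p in points:
        out[p[dimension]] += p["value"]
    return dict(out)

print(group_sum(datapoints, "device"))  # {'ios': 1045.0, 'android': 907.0}
print(group_sum(datapoints, "region"))  # {'us-east-1': 1719.0, 'eu-west-1': 233.0}
```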
So, talking about Atlas: this is an Atlas dashboard our team built that shows what's going on across all of the regions, how many errors there are, and how many streams per second are getting started. The top graphs are roughly how many people in each of the three Amazon regions we run in have hit play successfully recently, and the bottom shows the number of client errors reported. You can see they're somewhat correlated, but we took the numbers off the axes, so you can't see that the client errors are actually very few; it's not as if the client errors match the starts one for one. We use all these great tools written by the Atlas team to provide insight and troubleshooting. The system lets us quickly isolate: are Android users having a problem? Are iOS users? Are people on Apple TVs? We can select which group of people we want to look at and quickly deep-dive into a problem.

This is a tool called SELP. As Jonah mentioned earlier, nobody at Netflix really understands the whole environment all at once, so we need tools to assist with that. SELP uses data from various systems inside Netflix to draw the graph of microservices, so we can see how they're all connected and interacting. Engineers can push new services at any time, and very often we don't even know what's going on in terms of new microservices being launched, replaced, or deprecated; they come and go, and we don't really pay attention most of the time until there's a problem. When there is, this is one of the places we can look to ask: what is the chain of dependencies from a user hitting play in the Netflix player all the way down to a Cassandra database or an EVCache cluster?

This is Spinnaker, which we talked about briefly earlier. It's the tool we use to release code to production. Each of those green boxes represents an instance running in AWS as part of this API proxy cluster. We use Jenkins for the automated build pipeline, and Spinnaker takes that build to an AMI and manages the launch of those servers in production. You can define pipelines; this is all open source, released a few months ago. For instance: I want to launch a canary cluster with five instances, and once those five instances are up and taking traffic, I want to compare the following list of metrics between the running canary instances and the current production cluster, and make sure I didn't see an increase in errors or a decrease in throughput. You score that, and once the scoring is complete, it can send you an email saying, hey, we've run the canary and it looks good, and you can say yes, go to the next step. You can make that manual, or you can make it automated; we have teams that release to production every single day on an automated schedule. At 7 a.m. they push to production, it goes through an automated pipeline that runs the canaries, and if the canaries pass, it just pushes the new code to production.
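Here is a hedged sketch of that canary scoring step, with invented metric names and thresholds. Spinnaker's real automated canary analysis does statistical comparison across many metrics, but the shape of the decision is roughly this:

```python
def canary_score(canary, baseline):
    """Compare canary metrics to the current production cluster's."""
    checks = {
        # True means "the canary is not meaningfully worse on this metric"
        "error_rate":  canary["error_rate"]  <= baseline["error_rate"]  * 1.05,
        "latency_p99": canary["latency_p99"] <= baseline["latency_p99"] * 1.10,
        "throughput":  canary["throughput"]  >= baseline["throughput"]  * 0.95,
    }
    return sum(checks.values()) / len(checks), checks

score, detail = canary_score(
    canary={"error_rate": 0.0104, "latency_p99": 240, "throughput": 980},
    baseline={"error_rate": 0.0100, "latency_p99": 230, "throughput": 1000},
)
# score == 1.0 here, so the pipeline would promote the canary; below some
# threshold it halts instead, and the team gets the feedback.
```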
Spinnaker also supports staggered releases. We always like to release in one region first, then the next region, then the next, so we don't see a giant red-black push across our entire infrastructure take out everything all at once; we'd rather take out a third of production at a time. So Spinnaker supports staggering the pipelines across different regions, and we're working on a new feature that's going to allow automated squeeze testing: when a new instance launches and goes into production, we can actually calculate the maximum throughput of the new release of the software and then adjust our scaling rules for that cluster to match. The other thing you can see here is the gray part: when Spinnaker deploys a new ASG, it very often leaves the old ASG running, so we have a really fast path to roll back if something goes really sideways.

And go ahead. Now, this is a view of the pipelines; it's like a log showing each pipeline that's run recently. You can see the last three pipelines were completely successful; the third one down had six stages or more, and the others had four. Below that, you can see some pipelines with failed stages, which crashed out before they finished. The developers get a bunch of feedback as to why that happened, and they can adjust their code or fix things before the problematic software gets into production.

And the last screenshot, well, second to last: this is a tool called Vector, which is also open source, built by our performance engineering team. Vector provides the micro view. Atlas has the macro view: one-minute granularity for stats that you can see in graphs. But the problem with rollups is that you lose resolution. You can't see things in a one-minute aggregate. If you have, say, a 500-millisecond GC pause, it just gets absorbed by the averages, because averages are awful. So we have this tool, which we can turn on for any instance in our fleet, and developers can go straight to this web interface and see the low-level, high-resolution stats on the system graphed there. And it's not shown here, but down at the bottom there's a little link where they can click, and it'll go off and build them a flame graph of, say, their Java application's full stack of call traces, so they can see that performance information, all automated and self-service.

So with that, that's it. I'm Jonah Horowitz, like I said, and this is Al Tobi. All of our public code is available on netflix.github.io, so if you want to take a look at Vector, Atlas, Spinnaker, and all of the other Netflix open source projects, they're there. Coming up after lunch, Brendan is giving a talk on broken Linux performance tools; I strongly recommend it, it should be great. We hope what you got out of this is a view of how Netflix does things, and that it was interesting. What I really hope is that as we talk about these ideas and you take them back to your companies, you can start to push on the boundaries outside of your bubble, and maybe we can see some of these practices become more common outside of Netflix, because even in the worst case it's a lot less unpleasant than some of the worst environments, and in a lot of cases it's really awesome to work in this environment. Thank you. We have some time for questions if you want.

Yeah. So the question was: besides the Qwikster mistake, what were some other production issues that we had big learnings from? One of them is that DNS ruins everything.
We were using a DNS provider that had some issues and went offline for a few hours, and basically all we could do was sit on our hands, right? Watching the Twitter feeds. And if you've had outages with a very public service like Netflix or Facebook, you know users get really upset. So we're sitting on our hands, working with this vendor to get it fixed, but in the meantime, before the issue is even resolved, we're already looking at other plans and asking: what can we do to give ourselves more control in this situation if it happens again, so that we can move traffic and make sure our customers stay online? That was one where we learned a lot and made some changes that were invisible to our users, and if it happens again, it will remain invisible to our users.

Let's go here. So the question is: how did the launch in all the new countries a few weeks ago impact our day-to-day life? We'd been working on that, it turns out, for a while. One of the things we did to help with the launch: we have automated tools that generate alerts by looking for deviations from the trend line. When we launched in all the other countries, it was hard to know whether a country's traffic was being impacted by something, because we had no historical data on how much streaming activity to expect from that country. So we expanded our automated tooling to generate anomaly detection across all of the countries. Instead of having anomaly detection only in the 60 countries we were already in, we now have anomaly detection across all 190 countries, and we can see if traffic is getting impacted. There's also the fact that in the 60 countries we were in, we had developed the skill set to work with all of the ISPs in those countries, maybe not easily, but we had it; now we have to work with ISPs in 190 countries. We're still learning some of the lessons there, and that will continue, I'm sure. And that was a big part of how we did the launch: we said, there's no learning like jumping in the deep end, right? So into the deep end we went.

I think back here, in the green. So, do we have plans to use Lambda and API Gateway? Yes, but we're still waiting on some features to be added so that we can integrate it with our VPC.

Yeah, so that's actually what I do. We have a project called Production Ready, which is really about: these are the things you need to know, or the things you need to do, to be successful in production. It has a list, actually not a very long one, about 15 items, that we go through with each team. How we engage with the different teams depends on how those teams operate as an organization. Teams that use Trello as their tracking system, we'll create Trello cards for. If they use Jira, we'll create Jira tickets. If they just want to track it on a spreadsheet, we'll track it on a spreadsheet. We work with them, and sometimes that's as simple as submitting a bug fix to their service, because we can just check out their code and see that if they changed this one thing, it would be done, so we fix it for them; or we'll clear hurdles and provide the knowledge transfer they need to check off all those boxes.

That's our boss.
So, to repeat that: Coburn, our boss. The question was how many people work on internal tooling. It's about 150 people outside of the product teams who work on just internal tooling and internal services, including things like internal service discovery; most of that is actually open source, and you can see it all on GitHub. And Cassandra? That team is about 10.

Thanks. Any other questions? Oh, okay, I think we have time, yep.

So that's a service called Eureka, which is also open source. We have our own discovery system, and Eureka has in the back end, I think, some Redis and some Cassandra that it uses for persistence. Every time a service comes up, its client library automatically contacts the discovery service and figures out where all of its dependencies are, much like you would do with an etcd or a ZooKeeper or something like that.

Correct. So the question was: do we ever update AMIs once they've been baked? And the answer is no, we do not. Once an AMI is baked, it is only ever destroyed and replaced with a new one. It's as close to immutable infrastructure as you can get. The immutability moves down into the Amazon layer, right?

So the question was: how quickly can we react to a challenge in production and get an AMI completely baked and out into production? I don't know that we want to give you exact numbers on that, but: very. My experience with configuration management systems is that it often takes about an hour to get consistency across all of your configuration without running some sort of forced update across the cluster, and I will say that we are at least as fast as that. And it is something we're working on. There's a certain amount of physics in the way of making it faster; the Amazon process for updating an AMI is not as fast as it could be. That's part of where the container work is being done, trying to bring down the latency of deploying code. But it's about as fast as it can be, if you do the math on how fast you can take an Ubuntu LTS image, roll in a JVM and code and all of that, and get it deployed out to the different regions.

Any more, or shall we call it done? One more, and then we'll call it done. So the question is: do we have any part of our infrastructure that's really solid and boring and critical? Yes, and Kafka is part of that. Our data pipeline, all of our streaming logs and such, now runs on top of Kafka, and Kafka is one of those tools where once you get it deployed and working, you can forget about it; I love Kafka for that reason. Most of our fleet relies very heavily on Cassandra, specifically because Cassandra is very good at multi-region replication, and that's what gives us the power, like in this video I was showing, I'll show it again, to move across regions at will. At any point in time we can say, you know what, things feel a little off in us-east-1, we're out, and we just do it. Part of the reason we can do that is that multi-region replication means the data is just there. We don't have to think about, oh well, now we have to fail the MySQL master over to Europe and check whether all of the replication events have landed; that's a painful process, and with Cassandra it's already done. Maybe there's some Eureka in there too.
I mean, obviously Atlas is a cornerstone of everything that happens at Netflix. Even our content teams sometimes use Atlas to look at what's going on with a particular movie on Netflix, and with the big blizzard on the East Coast we had some graphs going around internally showing Netflix viewership climbing on the East Coast because people were staying inside. So Atlas really is a big pillar of our infrastructure.

All right. With that, thanks a lot, guys. Thanks.