Thank you for having me here today.

Every day, Netflix offers highly curated and highly produced content to over 100 million customers around the world. Stranger Things is one example. I haven't seen the second season, but I highly recommend it; I've heard very good things about it.

My job at Netflix is to make sure that, as you said, when customers come to our service, they just get what they want. All they want to do is search for or find the content they want to see, and watch it. So our goal is to have over 100 million happy customers. But the truth is that we run a relatively complex distributed system, and sometimes something goes wrong, and then our customers are not as happy as we may have hoped.

When you have over a hundred million customers and something goes wrong in your service, usually it's not only one customer who is affected, or even 25, or even 100. Usually many, many more are affected. This is something that we think about a lot. One of the things we think about is that when we have an incident, the customers who are affected very quickly become faceless. There are only 900 pictures on this slide, but when we have an incident it can, not always, but it can, affect even millions of people.

So let me give you a little more context on my work. As we said, we have more than 100 million customers around the world. Together they watch more than 125 million hours of content every day. If you put those two numbers together, you will see that if one of our services fails, it can affect a lot of people relatively quickly. So time is of the essence, as we'll see as we go through this presentation.
It will become clear how we think about this.

Today we have about 380 microservices in production. I say "about" because that number changes sometimes. It is also the case that some of the services we run really don't deserve a name that has "micro" in it, because they're really gigantic. But that's about the scale we're looking at. Netflix also runs on over 1,000 device types. That ranges from your mobile phones, even older kinds of devices, to set-top boxes and to your smart TV at home. All of these we support. And we do all of this with fewer than 10 core SREs.

So you may think that we're a little bit crazy, and that may well be true. But let me tell you a little bit about how we accomplish this. One of the ways is by having a very strong DevOps culture, which is not the topic of this talk, but I'm happy to tell you all about it offline. Essentially, every team is responsible for the services that it runs, and the central core SRE team is responsible for the overall reliability of the service.

This picture gives you a better visualization of our distributed service. These are the roughly 380 services that I mentioned earlier. Note that these are only the services on what we call the streaming path, meaning everything that you need, when you come to Netflix, in order to actually be able to watch a movie. Normally you don't need all of them in order to watch your movie, but every circle here represents a service that is responsible for something that you may do on the service: search, recommendations, localized descriptions, and so forth.

What you see here is traffic coming in from the internet, from the left, and going through our various gateways. The gateway routes traffic to our various front ends, the biggest of which is the dot here in the middle, which is
our API. To give you a sense, the API talks directly to about 70 services downstream from it, and then of course indirectly to several hundred, as we've discussed.

So if you think about the system, and our central team being responsible for it, what I see is that insights are everything. You need to automate the heck out of this thing. You need to make sure that whatever you can get automatically, you get. It is not feasible for one person to understand everything that goes on inside of the system.

So how do we get insights? We have various tools in our toolbox, but one of our most critical ones is Mantis. Mantis is a cloud-native stream processing service built on top of Mesos. The way Mantis works is that we instrument a lot of the microservices that you saw on the previous slide, not all of them, but many of the very critical ones, and we give them the ability to emit metrics to Mantis. We also have the ability to gather metrics from various other data sources, such as Kafka, Amazon S3, and others. The data then flows through Mantis, and internal users, that is, Netflix developers, can write Mantis jobs. Mantis, like I said, is built on top of Mesos.
It's a Mesos framework that we've developed, and our Netflix developers create Mantis jobs that can aggregate data, filter data, perform operations on that data, and do various other things that eventually lead to such things as alerts, operational dashboards, and anomaly detection.

So the way to think about this is: our microservices and these other data sources emit operational metrics that give you insights about service health. Those metrics are processed automatically by Mantis jobs and then lead to alerts, operational dashboards, and so forth, all in real time. Again, that is very critical, because we can have a lot of users affected very quickly when something goes wrong.

Let's take a little bit of a look under the hood. As I've mentioned, Mantis is a Mesos framework. In order for it to satisfy our own Netflix needs, we developed our own scheduler called Fenzo, which is open source. Fenzo is optimized for the cloud. That is very important for us because our entire streaming infrastructure, all of our Mantis jobs, and everything else that we run runs on top of AWS, in this case EC2. So what does it mean for Fenzo to be optimized for cloud?
One of the things it allows us to do is scale the underlying agent cluster. Autoscaling is very important to us, and we'll see that a little bit later on. The amount of data that gets pushed through Mantis jobs varies greatly over the course of the day and the course of the week, so as our resource needs change, we also need to scale the underlying clusters. Fenzo is also designed to allow us to satisfy various other constraints that we have, many of which pertain to the cloud and some of which don't. For instance, we can do bin packing and task affinity. One of the other things, which has to do with the way we have our cloud set up, is spreading tasks across EC2 availability zones so that you can accomplish high availability, which obviously is quite important for us.

Let's talk a little bit about how we use Mantis, and what we use it for, internally at Netflix. One of the things that we do with Mantis is something called real-time SPS. SPS stands for stream starts per second. Essentially, this is a metric of how often people are able to come to Netflix, click play, and are actually able to successfully watch. This is one of our most important top-level metrics that we use to determine system health. If SPS hits a wall and drops, we know something is going wrong. Real-time SPS, which is powered by Mantis, allows us to get insights in real time into whether SPS is in line with our expectations or not.

What you see here generally are two lines. The black line is essentially the expectation that we have of what will happen, which is based on historical data and various algorithms that we have, and the blue line is the actual observed SPS. Sometimes they're a little bit off. I didn't show the numbers here because I cannot, and the scale obviously matters, but it gives you an idea of how closely aligned we are with expectations. Real-time SPS lets us know that something is wrong if something
is wrong, within seconds, not minutes. We have various other ways in which we can determine whether something is wrong with SPS or our overall system health, but real-time SPS really lets us know within seconds sometimes, and that can make the difference between 900 and 9 million users being affected. So this is something that's pretty critical to us.

Real-time SPS, and this is the case for many of our metrics, we are able to break down by region. This is important because many of the incidents that we have these days are isolated to one region. We run in multiple regions on top of AWS, and the way we generally roll out our software ensures that when incidents do happen, they generally happen in only one region. That is great, because then we have several mitigation techniques that we can apply, for instance shifting traffic from one region to another, which we do frequently. We just did it last night, and hopefully nobody noticed.

So breakdown by region is important, and breakdown by device type is also important. Real-time SPS and our other metrics are generally broken down by device type, which allows us to narrow in very quickly on which device or devices are affected. That is important because, as I mentioned, we have a very strong DevOps culture. We have subject matter experts who are in the know and responsible for specific parts of our system, and that also applies to specific device types.
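As an editorial aside, the idea of comparing observed SPS against expected SPS per region and device type can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Mantis or real-time SPS is implemented: the function name, the 20% threshold, the region and device labels, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# A slice is one (region, device_type) combination, e.g. ("us-east-1", "tv").
Slice = Tuple[str, str]


def find_affected_slices(
    observed: Dict[Slice, float],
    expected: Dict[Slice, float],
    max_drop: float = 0.2,  # hypothetical threshold: flag drops larger than 20%
) -> List[Tuple[Slice, float]]:
    """Return slices whose observed SPS fell more than max_drop below expectation.

    The result is sorted with the largest relative drop first.
    """
    affected = []
    for key, exp in expected.items():
        if exp <= 0:
            continue  # no meaningful expectation for this slice
        obs = observed.get(key, 0.0)  # a missing slice counts as zero stream starts
        drop = (exp - obs) / exp
        if drop > max_drop:
            affected.append((key, round(drop, 3)))
    return sorted(affected, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical values; the talk deliberately does not show real SPS numbers.
    expected = {("us-east-1", "tv"): 100.0, ("us-east-1", "mobile"): 80.0,
                ("eu-west-1", "tv"): 60.0, ("eu-west-1", "mobile"): 50.0}
    observed = {("us-east-1", "tv"): 40.0, ("us-east-1", "mobile"): 78.0,
                ("eu-west-1", "tv"): 59.0, ("eu-west-1", "mobile"): 51.0}
    for (region, device), drop in find_affected_slices(observed, expected):
        print(f"{region}/{device}: SPS is {drop:.0%} below expectation")
```

Keying the metric by region and device type is what lets a single drop point both at a mitigation (shift traffic away from the region) and at the team that owns the affected device.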
So if we have an issue that, for instance, only affects TVs, then we know whom to call for help.

We don't only use Mantis for SPS. As I've alluded to, we use it for a variety of different metrics. Again, the idea is the same: teams can set up individual dashboards, alerts, and other kinds of analyses for all kinds of different metrics that we gather. For instance, what you see here (I had to black out the specific paths) is a dashboard that the API team set up, which shows elevated 500s for specific paths. The specific paths, or endpoints, are specific to particular devices. So again, if you have elevated 500 error codes coming from a specific endpoint, that gives you clues as to what might have happened when something does go wrong.

As I mentioned before, autoscaling is really important to us, and that's one of the reasons we developed Fenzo. We run on top of EC2, and we scale the cluster day in and day out, all the time. That goes for our services that are on the streaming path as well as for the Mantis jobs that we run for analysis and insights. As you can see here, roughly 18 million messages flow through Mantis jobs at peak, whereas only about six million flow through Mantis during trough. And if you think about it, why is that?
This is because usage of our actual service varies greatly over the course of the day. People don't tend to watch a lot of Netflix while they're sleeping, and not so much while they're at work, but they do tend to watch Netflix while they're at home in the evening. So you see the cyclical nature of the usage of our service, and the number of messages that flow through Mantis mirrors that over the course of the day. So as we go, we have to scale the cluster so we can make better use of our resources.

But we don't stop there. We do something that I call streaming on demand. We allow users, again, that's internal Netflix developers, to set up ad hoc queries that allow them to stream data only when it's needed for operational insights. The way this works is that you, as the developer, can set up a query, say one that pertains to a specific status code or path, and only then is that data streamed from, say, our API service or one of the other microservices that we run on the streaming path. That data is then streamed to and analyzed by Mantis, and you can get real-time insights and real-time information on that service only when you need it. Again, that is important because we have so much data that pushing all of it through the system is just not really feasible. With this, we have greatly reduced the amount of data that needs to get streamed through our system, while still allowing for the same kind of flexibility and the ability to have those insights.

So when you define one of these ad hoc queries, a Mantis job gets created under the hood and started on one of our agent hosts. Again, autoscaling comes into the picture here.
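As an editorial aside, the streaming-on-demand idea described above can be sketched roughly as follows: a service forwards an event only while at least one active query matches it, and drops everything else locally. This is a minimal sketch under assumptions, not the actual Mantis API; the class name `OnDemandSource`, its methods, and the example path are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, object]
Predicate = Callable[[Event], bool]


@dataclass
class OnDemandSource:
    """Hypothetical event source living inside a service on the streaming path."""

    queries: Dict[str, Predicate] = field(default_factory=dict)
    sent: List[Event] = field(default_factory=list)  # stands in for the network stream

    def register(self, query_id: str, predicate: Predicate) -> None:
        """Activate an ad hoc query; matching events start flowing."""
        self.queries[query_id] = predicate

    def unregister(self, query_id: str) -> None:
        """Deactivate a query; its data stops flowing."""
        self.queries.pop(query_id, None)

    def emit(self, event: Event) -> bool:
        """Forward the event only if some active query wants it.

        Returns True when the event was forwarded, False when it was dropped.
        """
        if any(predicate(event) for predicate in self.queries.values()):
            self.sent.append(event)
            return True
        return False


if __name__ == "__main__":
    source = OnDemandSource()
    source.emit({"status": 500, "path": "/play"})  # dropped: nobody is asking
    source.register("errors-on-play",
                    lambda e: e["status"] >= 500 and e["path"] == "/play")
    source.emit({"status": 500, "path": "/play"})  # forwarded
    source.emit({"status": 200, "path": "/play"})  # dropped: does not match
    print(len(source.sent), "event(s) forwarded")
```

Evaluating the query at the source, rather than after the data has been shipped, is what keeps the volume down: with no active queries, nothing is streamed at all.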
So this doesn't follow exactly the same cyclic nature as we saw for the number of messages, but what you see here is essentially how many jobs get set up on top of Mantis over the course of a day. Essentially, when people are at work and they're debugging and trying different things, or different smaller things come up that they want to chase down, they set up Mantis jobs to dig into the performance and behavior of our service.

What does that all mean? Let's bring it back. I think it's important that we sometimes take a step back from the infrastructure work that we do, or the work that feels like it's very deep down in the stack, and think about how it actually affects real users. In my case, what Mantis means for us is faster detection of issues. We know of issues in seconds rather than minutes, which may really affect somebody sitting in, I don't know, India, in their living room, trying to watch Netflix.

We have faster insights into the causes of those incidents. Not only do we understand that something is wrong because of Mantis, we can also get faster insights into specifically what happened and what went wrong. And again, we have a faster path to mitigation. If we know what's going on, or if we know, for instance, which region is affected, we can very quickly take steps to mitigate the issue. If only one region is affected, we can just shift traffic away from that region, and hopefully no users will ever know that anything bad happened in the first place.

This is what we really care about at the end of the day, all of us, right? We care about making sure that our customers are happy and that they get the service that they actually signed up for.

If you want to contact me, here's my contact information. I'll be around for questions if any of you have any. Thank you.