I normally wear flip-flops as well, but I was always told that when you present, you wear shoes. So we're going to get started. There's plenty of room in the front row, and free seats way in the back. Thanks to all the good folks at Constant Contact for inviting me here. It's awesome to be in Boston with you all today. The keynote, I thought, was spectacular. I had never heard Michael speak before, but I would welcome the opportunity to hear him speak anytime. In fact, I'd love to bring him to Netflix and have him present there. And speaking of Netflix, are there any Netflix customers in the room? Thank you very much. We love you. We love you a lot.

As you've probably already figured out, I work at Netflix. I run delivery engineering: we build the infrastructure that all teams at Netflix leverage to do continuous delivery of software into production. What I want to talk about today is how we do continuous delivery, what we've learned about it, the challenges and the results, and then where we're going with it.

As you're probably aware, Netflix is a global company. More than four years ago we were a DVD company, if you remember that. We made the decision to offer streaming video, and that enabled us to go global. Right now we're in 41 countries and counting, and that number will increase very shortly. We recently announced that we had exceeded 50 million customers, and that number is still climbing. One thing about those 50 million customers: that's accounts. As many people in the room are probably aware, multiple people can share an account, so we estimate we have about 200 million people watching.

Since we're all engineers, I'll talk a little about our architecture. Netflix is a giant SOA. The new term these days is microservices, so you could say that in many ways we spearheaded the microservices ecosystem and the best practices associated with it. Most people are aware that we leverage AWS, Amazon Web Services. I said we're a global company: we're spread across multiple AWS regions, and at any given moment we have thousands upon thousands of AWS instances running.

So why continuous delivery? Streaming video isn't terribly difficult; in fact, Constant Contact could be streaming this talk right now if they wanted to. Any one of us can stream video, so there's a very low barrier to entry in this market. What differentiates Netflix is our data: what we collect about what you're watching, how often you watch it, what choices you make after you watch it. We use that data to figure out what content we should buy, what content we should potentially produce ourselves, how we make recommendations, and what shows up in search. So really, we're a data company; streaming video is, as we all know, pretty easy. And because we're a data company, we need to be able to move very quickly, and we need to foster an environment that facilitates that rapid innovation across the globe. I thought Michael's keynote was fabulous when he talked about volatility, stability, and speed, and that's a huge concern for Netflix: we want to move fast, because moving fast is a huge competitive advantage.
So again, there's a low barrier to entry for anyone to stream video, and as we've already seen, you've got Hulu, you've got Amazon, you've got HBO; we have a number of competitors. What we see is that our ability to make decisions based on data, and then change the viewing experience for you, our customers, is what keeps us ahead. If we can do that faster than the Amazons and the HBOs and the Hulus of the world, we'll stay ahead of them. And I don't know if people are aware of this, but at peak Internet usage times, Netflix accounts for about 33% of Internet traffic, which is just awesome. I believe Hulu, Amazon, and HBO added together are less than 1%. So we're doing a pretty good job. But again, as the keynoter pointed out, we can't sit still, because if we rest on our laurels we'll eventually be beaten. Competitive advantage is all about speed, and the reason we want speed is so that we can make the viewing experience better for all of you. It changes every day. In fact, we see continuous delivery as the rails, so to speak, that facilitate all of that speed and all of that innovation.

There are teams at Netflix that are deploying multiple times a day. You may not see the changes on your home screen, but the algorithms, search, everything is being changed at any given moment. In fact, I'm willing to bet there's probably a deployment going on. Well, it's pretty early in the Bay Area right now, so probably not this minute. But by the end of this talk, I'm sure at least one deployment will have been executed. So continuous delivery is extremely important to Netflix; it facilitates our competitive advantage to move fast.

I'll get to the details of continuous delivery at Netflix, but in general, if you want to do continuous delivery, you need three primary facets. First, you need a process that's repeatable. Moving from check-in to some environment requires automation. We're all engineers, we all know this by now: manual steps are error-prone, and that's just a non-starter. Netflix is primarily a Java shop, and our automation is largely driven by Gradle, the next-generation build system: you've got Ant, you've got Maven, and then you've got this thing called Gradle. So a lot of our automation is spearheaded through Gradle, and then through the tools and platforms that my team writes.

The next facet of a successful continuous delivery process is that it's got to be reliable. What I mean is that implicit in all of this is a lot of testing. If you want to move some piece of code a developer wrote into production as quickly as possible, you need confidence in it, and if there are no tests anywhere, it's going to break, it's going to blow up. So when I talk about reliability, I mean testing at all levels, whether it's unit testing, integration testing, or functional testing, and I'll show you how we do that. If you don't have a testing mantra or philosophy, I would start there before you even try to push things quickly into production.

And finally, it's got to be rapid. I said quickly, right? Processes that take days and days to get something into production are a non-starter. Again, we're all engineers; we're very patient types, aren't we? If I have to wait many, many hours to get something into production, I won't do it. Or I'll short-circuit it and do something else.
I'll just go straight to the production machines and make the change there, or something like that. So your continuous delivery process needs to facilitate rapid delivery. And rapid, obviously, is a loose term. There are some pipelines at Netflix that do take hours, and I'll get to why they take hours and you'll see the benefits. You could short-circuit those particular gates and make things go quickly, but you'll see there's real power, or at least increased confidence, in going through them.

Two more things, and these are Netflix-specific, but I hope you can learn from them. We make two assumptions going forward with respect to continuous delivery. The three facets I just covered are a given; these two are additional, one cultural and one technical. The first is trust. At Netflix, our firm belief, our mantra, is this notion of freedom and responsibility. Every team at Netflix is free to do whatever they want. And this is, again, very complementary to the keynote: Netflix is a very anti-process company. Every team is essentially free to define what continuous delivery means to them. Remember, I introduced myself as running the team that builds the platform to facilitate all this. Teams don't have to use anything my team writes. But I assure you, every team at Netflix does use what we write. It's their choice. I said we're largely a Java shop, and here's the little asterisk: any team at Netflix can choose to write things in Ruby, Perl, Python, Go, you name it. But the majority of teams choose Java because of the infrastructure that exists to support JVM-based languages, and the tooling there to support continuous delivery. At the end of the day, though, every team at Netflix is free to define what continuous delivery means to them and how they use it.

One other thing that's not just implicit here, it's explicit: there are no operations teams at Netflix. If you're a developer and you push some code into a service and you push that service into production, you are on call. You are the one who gets the call at 3 AM when people in Europe can't watch Breaking Bad, if we're able to determine it was your service that broke down. There is a reliability team; it's very small and lean, about 10 people. They constantly watch production, looking for errors. We have a set of KPIs, key performance indicators, that this team monitors, and when one dips they quickly assess what's going on; we have extensive monitoring, as you can well imagine in an environment like that. So we know at any given point if something's breaking, but that team doesn't fix it. That team picks up the phone and calls Andy or Stu, to borrow names from the keynote, and says: your stuff's broken, you've got to fix it right now. That level of responsibility has a cultural benefit tied to the reliability aspect that's implicit in continuous delivery: software engineers at Netflix are extremely motivated to write a heck of a lot of tests and to take that reliability aspect very seriously. Because if you're the one on call at 3 AM and you don't want that call, you will make sure your stuff gets out there in a reliable manner, and you'll take advantage of the infrastructure that Netflix provides.
So trust is a huge cultural component of continuous delivery. Let engineers make the decisions, treat people like adults, and they'll do the right thing for the company. The second assumption is this notion of judgment, judgment via insight. As you can imagine, continuous delivery is a series of stages: as code is checked in, it goes through different processes, or stages, before it gets to some environment. Each stage is a quality gate that can be automated or manual, as you choose, and there's a decision: do I go forward or do I roll back? You have to have intensive insight into that pipeline so that either automation or humans can make the appropriate decision about whether this piece of code should continue out to our customers or be rolled back. So continuous delivery takes a lot of automation, but also a lot of checks and balances and operational insight into how something is moving through the pipeline and whether it should keep moving forward. I'll share some of the insight tooling we've built to do that.

I'm going to generalize how we do continuous delivery, but you can largely summarize it in four steps. There's a build step, and I'll go into detail on each of these steps and the tools we use, and I'll leave time at the end so you can ask questions. In the build step, things are assembled. Because we leverage AWS, the asset that we move forward into production is a machine image. How many people here are familiar with AWS? Okay, pretty much the whole room, awesome. So the end asset from the build process is an Amazon Machine Image. Amazon calls it an AMI; that's what I called it before joining Netflix, so I'll call it an AMI. That AMI is then pushed forward, so implicit in the latter three steps is a deployment.

The key thing about building the AMI is that, going forward, that asset, that service, whatever it is that's going to spin up in these various environments, has no dependencies. When it fires up, it doesn't need to make a call to download some jar from Maven Central or anything like that. It's all done here. Once it goes into the pipeline, it is a self-contained, single entity that can be fired up in any region across the world, with no dependency on anything in the outside world that could be down at that time, and that's a key thing for reliability. Once we're done here, we can spin up an AMI even if GitHub or Maven Central or some YUM repository, you name it, isn't available at that moment, and something will be unavailable at some point. It doesn't matter to us; we're good to go.

The verify step is, again, about that reliability aspect: all the testing. Unit testing is a given, but once you get beyond unit testing you've got to do a whole lot more, integration-style testing. Then there's canary analysis. I'll go into detail about this, but it's something I think Netflix spearheaded. It freaked me out when I first got to Netflix, because having been the CTO at a different company, we wouldn't have dreamed of doing something like this. Canary testing is spinning up an instance of the new AMI in production and allowing some traffic to go to it, trickle traffic, maybe 1% of all streaming requests. And we're analyzing it, we're watching it.
And it's not so much analyzing the code as analyzing the machine and comparing it against a baseline: is the machine behaving the way it behaved in the last release, in terms of memory consumption and so on? I'll get to more details there. That's called automated canary analysis, and it's a huge part of how we understand reliability. And then finally it goes live, and again that notion of live could be anywhere on the globe, across various Amazon regions.

So what is our build process? Pretty simple: a developer checks in code, the code is built, and it is tested. Here we're leveraging Git, and we use Jenkins. Jenkins is pretty much the standard continuous integration tool across the world, although we are always looking at other tools and at what's next. A big part of this phase is testing, obviously unit testing, and a lot of teams will also do static analysis like PMD or FindBugs. Then, assuming all your tests pass, we assemble the service into a Debian package. That Debian package is installed, via something we call the bakery, onto a base Amazon instance, which is then baked, snapshotted, into an AMI. And that AMI is our asset going forward. Remember: no dependencies on anything else in the globe. This thing is good to go. We can put it in any region in the world, fire it up, and we know it's going to fire up. Well, it'll fire up; whether or not it'll be fine, we'll find out in a little while. So far so good. I should mention that because we're making Debian packages, we are running on Ubuntu. Ubuntu is the standard Netflix OS going forward; we used to use CentOS, so before Debian packages we were doing RPMs.

All right, verification. How many people are familiar with the Netflix open source stack? We've open sourced a whole lot of software. Not many? Awesome, because guess what: Netflix open sourced a whole lot of software to facilitate all this, and I highly recommend you look at it. Go look at Netflix OSS on GitHub, github.com/Netflix. One particular tool we open sourced, and I'll talk more about it when I get to where we're going, is called Asgard. What Asgard does is take that AMI and spin it up in any of our three regions across the world.

Since everyone here is familiar with AWS, some details. The highest-level grouping construct in AWS is an auto scaling group: you have instances, and you can group those instances into auto scaling groups. That's fine and well, but for Netflix it's kind of limiting, so we've added a model on top of it. We have this notion of clusters. Clusters are a group of auto scaling groups. Yes, a group of auto scaling groups, a group of ASGs. And then a group of clusters is grouped together into what we call an application. We wrote Asgard to help manage all that. So if you have a service out there, it's largely an application; that application has various clusters that can be spread across the globe; those clusters have various ASGs; and those ASGs have various instances. I mentioned earlier that at any given point there are thousands upon thousands of instances running across the globe. This is how we basically manage all that.
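To make that hierarchy concrete, here's a minimal sketch in Java of the model as I've just described it. The class and field names are mine, purely illustrative, not Asgard's actual domain model:

```java
import java.util.List;

// Hypothetical names: a sketch of the hierarchy, not Asgard's real domain model.
public class ClusterModel {
    record Instance(String id) {}
    record AutoScalingGroup(String name, List<Instance> instances) {}
    record Cluster(String name, String region, List<AutoScalingGroup> asgs) {}
    record Application(String name, List<Cluster> clusters) {}

    public static void main(String[] args) {
        // One application ("foo"), running as one cluster per region,
        // each cluster holding one or more ASGs full of instances.
        Application foo = new Application("foo", List.of(
            new Cluster("foo-prod", "us-east-1", List.of(
                new AutoScalingGroup("foo-prod-v042",
                    List.of(new Instance("i-abc123"), new Instance("i-def456"))))),
            new Cluster("foo-prod", "eu-west-1", List.of(
                new AutoScalingGroup("foo-prod-v042",
                    List.of(new Instance("i-789xyz")))))));

        // Count instances across every region, the way you'd reason about
        // the whole application rather than individual machines.
        long total = foo.clusters().stream()
            .flatMap(c -> c.asgs().stream())
            .mapToLong(a -> a.instances().size())
            .sum();
        System.out.println(foo.name() + " is running " + total + " instances globally");
    }
}
```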
So rather than dealing with individual instances, you're looking at auto scaling groups, or clusters, or the app itself. Again, we wrote a tool called Asgard that manages all that for us; however, we're in the process of replacing it with another one. Once you have copied this AMI into the various regions of the globe, you spin it up and then run a series of integration tests against it. So once more you have this reliability aspect, where we're verifying: does the code work the way we think it should once it's running in an environment? These are tests written in Selenium, JUnit, Spock, you name it, all run as higher-level tests against whatever the service is, again to verify everything is working. We also have multiple environments, so you can run these tests in a simulated prod environment, or a test environment, or staging. I think everyone here is familiar with different environments; we have the same concepts.

All right, this is where it gets really interesting. Assuming that step has passed and everyone's happy, we've run all our functional tests, we'll deploy that AMI into a new cluster, a canary cluster, and do what we call automated canary analysis. There is an entire team at Netflix who has built a series of tools that can monitor your app in a production environment, and monitor the machine around it. We save all that information, so we have baselines. Let's say we have a service called foo running in production. We're constantly monitoring it and learning its average memory usage, disk usage, CPU load, log growth, you name it; that's the baseline. You can use this tooling to take the new AMI, throw it into the canary cluster, and then, via the load balancers, say: send 1% of all traffic specifically to this cluster. It could have multiple instances. And we set this machine on it, well, it's not really a machine, it's more like an entity, a being, Skynet. And it watches this thing behave. In some cases you may choose to do this for a number of hours; it may take a couple of hours to get that baseline. Earlier I alluded to the fact that it's got to be a rapid process, but for some teams this may take eight-plus hours. What they're doing is shuffling a little traffic, so at any given point, when you fire up Netflix, your traffic may be going to some new instance without your knowing it. This entity is constantly watching, gathering all the metrics, and there's a nice big dashboard with a threshold that tells you: things look good, go ahead and spin this up live; or: something's weird here, the memory, whatever it is, some operational aspect of this app is behaving differently than it did before. This is all automated, and it'll just roll back. You set that threshold, maybe it's 90%; at the end of the day it's a score. If it comes in below the score, obviously you roll back. If it's above the score, you go forward.
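As a rough illustration of that scoring idea, not Netflix's actual algorithm, just a toy sketch with made-up metric names and a made-up formula, comparing a canary's metrics against the baseline and gating on a threshold might look something like this:

```java
import java.util.Map;

// Toy sketch of automated canary analysis: compare the canary's metrics against
// the baseline, turn the comparison into a single score, gate on a threshold.
public class CanaryCheck {

    // Score = percentage of metrics whose canary value stays within an allowed
    // deviation of the baseline value. (Invented formula, for illustration only.)
    static double score(Map<String, Double> baseline, Map<String, Double> canary,
                        double allowedDeviation) {
        long healthy = baseline.entrySet().stream()
            .filter(e -> {
                double base = e.getValue();
                double observed = canary.getOrDefault(e.getKey(), Double.MAX_VALUE);
                return Math.abs(observed - base) <= base * allowedDeviation;
            })
            .count();
        return 100.0 * healthy / baseline.size();
    }

    public static void main(String[] args) {
        Map<String, Double> baseline = Map.of(
            "memoryUsedMb", 2048.0, "cpuLoad", 0.55, "errorsPerSec", 0.2);
        Map<String, Double> canary = Map.of(
            "memoryUsedMb", 2130.0, "cpuLoad", 0.58, "errorsPerSec", 1.9);

        double s = score(baseline, canary, 0.10);   // allow 10% drift per metric
        double threshold = 90.0;                    // e.g. the "90%" mentioned above
        System.out.println("canary score = " + s);
        System.out.println(s >= threshold ? "promote: keep rolling forward"
                                          : "fail: roll back automatically");
    }
}
```

The real system is far more sophisticated, but the decision at the end is the same shape: a score compared to a threshold, forward or back.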
And if you do go forward, we have a very interesting way of going 100% live. We call it a red-black push. I believe the rest of the world calls it blue-green; I was going to say black-green or something, but yes, blue-green. Netflix calls it red-black because, until a couple of months ago, our logo was red and black. Now it's not, so maybe we'll call it red and white, I don't know. It's a red-black push, and it's a four-step, choreographed deployment; the next slides show how it works. Implicit here is that it takes advantage of the cloud. This is the beauty of the elasticity of the cloud, whether it's AWS or Google or whatever. Most companies do what's called a rolling push, and a rolling push is cheaper, because it uses fewer resources; a red-black push essentially doubles your capacity. In fact, let me show it to you. This is why you can pretty much only get away with it in the cloud; it's hard to do in a data center. Although there is a gentleman, Brian, talking later about Docker, so if you're in a data center and you haven't looked at Docker, it can facilitate this kind of thing, potentially quite nicely.

A red-black push works like this. Here's this AMI that we baked, and it's live. It's version one, it's running out there. I'm going to talk about one AMI, but just imagine this is a thousand of them, running in an ASG or a cluster. What we do is spin up version two right next to it. Again, this could be one instance or a whole other cluster of instances, and they're running side by side. Then we let a little traffic trickle to the new one, and we're still doing the automated canary analysis; that's going on at all times. Then we essentially turn off the old one. Not turn it off, I shouldn't say that: we turn off traffic to the old one. So the traffic split starts at 100 and 0, then it's 90 and 10, then 80 and 20, and it slowly shifts until at some point it's 0 and 100. At that point the new one has taken over, and the old one can either be left out there, basically disabled, or completely destroyed. The beauty here is that you can roll back instantly. The anti-pattern that the industry has solved with blue-green, as opposed to rolling pushes, is exactly this: you can always flip back. In this case, if the new one all of a sudden just dies, you can quickly flip the load balancer back to the old one and you're good to go. This is all facilitated via that tool I mentioned earlier, Asgard; we had to write a whole lot of tooling to do all this. And this is how we deploy. This is the end of the line, so to speak, the end of the continuous delivery pipeline. So far, so good? All right, it's pretty simple stuff.
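To show the shape of that choreography, here's a minimal sketch with invented class names, not the actual Asgard implementation: shift load balancer weights over in steps while keeping the old cluster around so a rollback is just flipping the weights back.

```java
// Sketch of the red-black (blue-green) idea. Hypothetical classes, not real tooling.
public class RedBlackPush {

    static class LoadBalancer {
        void setWeights(int v1Percent, int v2Percent) {
            // In a real system this would reconfigure the load balancer.
            System.out.printf("traffic split: v1=%d%% v2=%d%%%n", v1Percent, v2Percent);
        }
    }

    // Stand-in for the automated canary analysis described earlier.
    static boolean canaryLooksHealthy() { return true; }

    public static void main(String[] args) {
        LoadBalancer lb = new LoadBalancer();

        // v2 is already baked and spun up alongside v1; now shift traffic gradually.
        for (int v2 = 10; v2 <= 100; v2 += 10) {
            lb.setWeights(100 - v2, v2);
            if (!canaryLooksHealthy()) {
                // Instant rollback: the old cluster never went away, just flip back.
                lb.setWeights(100, 0);
                System.out.println("rolled back to v1");
                return;
            }
        }
        System.out.println("v2 is live; v1 disabled but retained for instant rollback");
    }
}
```

The design point is the one from the slide: rollback is cheap precisely because the old cluster is still sitting there, disabled, during and after the push.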
Okay, challenges: what have we learned? It's not the easiest thing in the world to get this far. First and foremost, I'm here to tell you we had to build it all, and we open sourced everything, so I highly recommend you go look at our open source suite. When Netflix decided four years ago to go streaming, we were in a data center, like many of you, and we knew that if we seriously wanted to go global we could no longer be constrained to a data center, so we elected to go to the cloud. Going to the cloud at that point was pretty darn early, and doing continuous delivery into the cloud was unheard of. So we had to build a ton of stuff, tooling, frameworks, libraries, and we decided, you know what, let's just open source it all. Because, again, our competitive advantage is moving fast, but what differentiates us is the data. So all the stuff we wrote in order to move fast, we give away. I highly recommend you take a look at it if you're looking at going to the cloud and want to do continuous delivery. Pretty much everything I've talked about today is open sourced or will be open sourced shortly.

Another reason for open sourcing it all, of course, was attracting talent. I'm actually fairly new to Netflix; I've been there almost a year, and I was well aware of what Netflix was up to because I'd been following them in the open source world. So it certainly helps with recruiting. If you have something that isn't business critical, that isn't your special sauce, I highly recommend open sourcing it. It attracts amazing talent.

The second challenge and lesson learned is that not all tests are created equal. I can't stress testing at all levels enough. Unit tests are great, I couldn't imagine writing code without them, but they fall short of simulating what life is like in the real world. This shouldn't be news to anybody, but testing at all levels is very, very important, especially in a continuous delivery environment where you don't have the luxury of downtime or of telling customers to come back in three hours while you do a deployment. You need a heck of a lot of functional-style testing, and automated canary analysis is in many ways a level of testing above functional. And it requires a significant investment. Tests are not cheap: they have to be written, they have to be maintained, and when they break you have to fix them. So don't overlook the investment in testing; a lot of people unfortunately do.

All right, results: what have we learned, and what can you learn from us? First and foremost, continuous delivery works. It is the cornerstone of all business initiatives at Netflix. It is an unquestioned, sacred thing there. A team at Netflix electing not to do continuous delivery would simply never happen; it's assumed by the business. I've worked at companies where the business would come and say, we want to get this feature out there, and they'd work with me and my team, and we'd say, all right, it might be two weeks, we'll get it out in the next release. At Netflix, the business says, hey, we want to make this change, and teams at Netflix say, all right, do you want it today? It may take a developer a couple of hours, and it'll be in production later this afternoon, you're cool with that? That's a phenomenal differentiator: for IT to be able to go back to the business and say, yeah, no problem, whatever you want, you'll get it immediately.

To put it into perspective, we've been doing continuous delivery for roughly four years. I'm new to the company; I was brought in, like I said, about a year ago. The name of my team is Delivery Engineering. I've built a team of seven individuals, so there are eight of us building the next-generation continuous delivery platform at Netflix. That's how seriously the company takes it: they've gone out and hired eight of us and said, we love continuous delivery.
Continuous delivery is working really well for us, but we want it to work even better, so please go off and make it work better. I think that's a testament to how the business views the benefits of continuous delivery. Again, moving fast is a competitive advantage, and don't overlook that; it's not just a startup concern, either. There's a phenomenal quote from the CIO of Walmart, the largest company on the planet, Fortune 500 number one this year and last year. He's quoted as saying that the only differentiation for a company is speed. I think it's fascinating that Walmart would say that. We tend to think of speed as a startup thing, Twitter and the like, and no, Twitter's not a startup anymore, but you know what I mean. Regardless of the size of your company, moving fast is a competitive advantage, and continuous delivery enables you to do that.

Continuous delivery doesn't necessarily mean every change goes into production. What it means is that any change can go to production. And largely it's the ethos of our culture, again piggybacking on Michael's keynote: in order for us to survive and continue to be the number one streaming video provider on the internet, we have to continually innovate. If we ever sit back and rest on our laurels, someone will come along and crush us, as has happened countless times in the tech world. So moving fast is just ingrained in the culture. Another thing that's really special about Netflix, beyond freedom and responsibility, is that you are free to fail. There's no finger pointing, there are no firings. If you push something out fast and it breaks, well, we've put a lot in place to make sure it doesn't break, but it's okay if it does. The impetus is: if it breaks, fix it really fast and learn from it. So moving fast, continuous delivery, constant innovation is part of our culture, and I think it needs to be part of any culture that wants to embrace this style of software delivery.

Okay. Now I want to share three things we've learned, and we've definitely made mistakes; you're going to need these three things if you want to go down this path. First: global deployments require detailed insight into metrics, in our case core metrics. If you don't have detailed monitoring in your environment, start there. Then figure out which metrics actually matter to you and to the business, because you and the business should be one and the same. In Netflix's case, our core metric is streaming starts per second, SPS. We watch it like a hawk, and at any given point, if SPS dips in any region at any time of day, people are alerted immediately. That is our core metric, because we know how it behaves over time. It's 11 o'clock here, so Europe right now is watching a lot of Netflix and SPS is on the rise there; if we see a dip, we know instantly that something's wrong in the European region. Same on the East Coast and the West Coast, which we also watch like hawks. And continuous delivery feeds into that: if Europe is in the middle of watching Breaking Bad right now, a team can elect not to do a deployment there. So continuous delivery requires, first, that you have core metrics, and then the ability to schedule things based on those metrics.
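As a back-of-the-envelope illustration of that scheduling idea, a per-region deployment-window check might look like the sketch below. The peak window, the threshold, and the method names are all invented for illustration, not Netflix's actual rules.

```java
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hedged sketch: watch the core metric (streaming starts per second) per region
// and, unless the change is a critical fix, hold a deployment while that region
// is at or near peak viewing time.
public class DeploymentWindow {

    static boolean isPeakViewingTime(ZoneId regionZone) {
        LocalTime now = ZonedDateTime.now(regionZone).toLocalTime();
        // Assume evening prime time, say 18:00-23:00 local, is peak (made up).
        return !now.isBefore(LocalTime.of(18, 0)) && now.isBefore(LocalTime.of(23, 0));
    }

    static boolean okToDeploy(ZoneId regionZone, double currentSps, double typicalPeakSps,
                              boolean criticalFix) {
        if (criticalFix) return true;                      // critical fixes go out regardless
        if (isPeakViewingTime(regionZone)) return false;   // don't touch a region at prime time
        return currentSps < 0.8 * typicalPeakSps;          // and SPS should be well off its peak
    }

    public static void main(String[] args) {
        ZoneId europe = ZoneId.of("Europe/London");
        ZoneId westCoast = ZoneId.of("America/Los_Angeles");

        System.out.println("deploy to eu region now?      "
                + okToDeploy(europe, 9200, 10000, false));
        System.out.println("deploy to us-west region now? "
                + okToDeploy(westCoast, 1500, 10000, false));
    }
}
```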
Initially at Netflix, when we made a change, we pushed it across the globe instantly. For those of us on the West Coast, this is a great time to deploy: no one's doing anything on the West Coast, we're basically all still asleep. And wearing flip-flops. But like I said, Europe is probably starting to watch a lot of TV right now, because they're getting home from work, and that's the moment of relaxation where we really want the experience to be phenomenal. If you go to Netflix and you're really happy, you'll probably stay signed up next month and tell your friends about it. But if you get home from work after a hard day, you try to watch Breaking Bad, and you get that buffering error the whole time, you're going to say, screw this, I'm going to try something else. So our continuous delivery processes enable scheduling based on these core metrics. At peak viewing times, unless it's a critical bug fix, many teams elect not to deploy into Europe right then; we'll do it on the West Coast, we'll do it on the East Coast, and then when peak time starts dipping in Europe, we'll do the deployment in Europe. Not all teams do this, but some of the critical ones choose to. So: understand your core metrics, watch them like a hawk, and leverage them for continuous delivery.

Next: don't forget about the cloud. Use the cloud for what it's made for, and that's elasticity. The cloud by and large has infinite resources, although I'm here to tell you that Netflix regularly finds out where those infinite resources stop. For most businesses, AWS is truly elastic. Don't forget that. One thing Netflix has done, and this will be interesting because Brian is talking about Docker later and Docker is something Netflix keeps looking at very closely, is that we view an instance in the cloud as an ephemeral thing: it spins up, it's got our code on it, and it could die at any moment. We don't care. In fact, we regularly shoot instances in the head just to make sure we're prepared to survive an outage. That, again, is taking advantage of the cloud. At the company I was at before, where we had a heck of a lot of load, we would try to squeeze everything out of an instance. We'd put multiple apps on one instance; we'd fire up an m3.xlarge, put six apps on it, and say, yes, we're saving money because we have one machine with six things on it. Which is fine, except that when the third app on that machine dies, you actually have to go to that thing, SSH into it, figure out why it died, and carefully kick it back up, because you don't want to mess with the other five apps. Netflix's mantra is one app per instance, because again, with elasticity there's an effectively infinite number of instances you can spin up. If that thing falls over, who cares? Shoot it in the head and spin up a new one. In fact, SSH is largely never used in production; no one really SSHes to a production machine, you just shoot it in the head and fire up a new one. Now, if it keeps dying, then maybe someone will SSH in and figure out what's going on. But take advantage of the cloud: the true elasticity of the cloud is phenomenal. Spin up resources, then tear them down as needed.
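Here's a minimal sketch of that replace-don't-repair habit. The CloudApi interface is a stand-in I've made up rather than a real AWS SDK call; the shape of the logic is the point.

```java
import java.util.List;

// Sketch of "instances are ephemeral": one app per instance, and an unhealthy
// instance is terminated and replaced rather than SSH'd into and repaired.
public class ReplaceDontRepair {

    interface CloudApi {
        boolean isHealthy(String instanceId);
        void terminate(String instanceId);   // the ASG launches a replacement on its own
    }

    static void sweep(CloudApi cloud, List<String> instanceIds) {
        for (String id : instanceIds) {
            if (!cloud.isHealthy(id)) {
                // No SSH, no forensics on the box: shoot it in the head and move on.
                cloud.terminate(id);
                System.out.println("terminated " + id + "; autoscaling will replace it");
            }
        }
    }

    public static void main(String[] args) {
        CloudApi fake = new CloudApi() {
            public boolean isHealthy(String id) { return !id.endsWith("bad"); }
            public void terminate(String id) { /* would call the cloud provider here */ }
        };
        sweep(fake, List.of("i-001", "i-002bad", "i-003"));
    }
}
```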
Single tenancy works really, really well in this model. Obviously, in a data center you're constrained; you don't really have true elasticity, so something like Docker may help. I think the Docker equation in the cloud gets really interesting and kind of muddy, but I will say that right now, in the Netflix view of the world, single tenancy works phenomenally well. Also note, since everyone here raised their hand for AWS: pricing at AWS is linear. If you want to squeeze six apps onto, say, a six-gig machine, you could instead get six machines. Or, to make the math easier: two apps on one six-gig machine costs roughly the same as two three-gig machines. Now, that's not always exactly true once you factor in CPUs and whatnot at Amazon, but the pricing model is linear, so keep that in mind.

Finally, and I forgot when we end. 1:15? Oh, sweet. This is where we're going, and it's a lesson we've learned: version all the things. What do I mean by that? I told you that from an automation standpoint we have Gradle, or just think of a build file, regardless of the technology you use: a build file that delineates how your project is compiled, tested, and packaged. Then we have these other tools, like Asgard and automated canary analysis, all these other tools in the pipeline that also hold information about your app and what it should do. If you think about that, the source of truth is spread across multiple systems. In that example you have three sources of truth: your build file, Asgard, and this other thing called automated canary analysis. The problem is that anyone can change one of those sources of truth, and nobody else necessarily knows about it. You could go into Asgard and say, my app has 200 instances, I'm going to bump it up to 300. What we've discovered is that that's fine and well; people can do that, and people do it all the time. The problem is that we lose the history. We have an event recording that it went to 300, but we don't know why. Think about it: if you were to go into your build file or your code and change 200 to 300, you'd commit it to Git, or Perforce, or CVS, whatever, and you'd probably leave a commit message saying you needed to go to 300 because you noticed the thing falls over at 9 p.m. So what we're doing is largely collapsing that model, what it means to be deployed across the cloud, what it means to be analyzed in a production environment, and folding it into, essentially, our build file. That becomes a single source of truth, a single entity that defines how this app is realized in production. It may sound kind of weird, but it's no different from your build file: the build file already says how your code should be compiled, so why not add all the other information there, what it should look like in production, how many instances it should have, what kind of load balancer sits in front of it, what type of instance it runs on, and what thresholds apply when you do automated canary analysis. All of that is stored in one place that engineers can go to and version, and when you make changes to it, everyone knows why. The threshold for automated canary analysis used to be, let's say, 700; we've now lowered it to 650.
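As a purely illustrative sketch of that single, versioned source of truth, none of these field names come from Netflix's actual tooling, you can imagine a small descriptor that lives in source control right next to the build file:

```java
// Hypothetical sketch of "version all the things": the deployment shape of the
// app (instance counts, instance type, load balancer, canary thresholds) lives
// in one descriptor, committed alongside the build file, so every change is a
// commit with an author, a date, and a reason.
public class DeploymentDescriptor {
    String   application     = "foo";
    String   instanceType    = "m3.xlarge";
    int      instancesPerAsg = 300;       // was 200: bumped because we fall over at 9 p.m.
    String   loadBalancer    = "foo-frontend-elb";
    int      canaryScoreMin  = 650;       // was 700: relaxed after reviewing false alarms
    String[] regions         = { "us-east-1", "us-west-2", "eu-west-1" };

    public static void main(String[] args) {
        DeploymentDescriptor d = new DeploymentDescriptor();
        System.out.println(d.application + " wants " + d.instancesPerAsg + " x "
                + d.instanceType + " in " + d.regions.length + " regions,"
                + " canary threshold " + d.canaryScoreMin);
    }
}
```

Because this file is committed like any other code, the upstream systems can read it, and the history of every change to it comes for free.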
We have a record of who did it, when, and why, and that's very important. From a continuous delivery standpoint, you're going to end up stitching together various systems, and each system will have its own source of truth. That becomes somewhat of a maintenance nightmare as you grow, as new people come on and people leave; people lose track of why a decision was made in some upstream system. So we're moving that source of truth down to one place, and those upstream systems will consult that place to determine, okay, what am I supposed to do here? Immutable environments is a hot DevOps term these days, and this is largely in line with that notion. So: version all the things.

Last but not least, I did try to leave a couple of minutes for questions. That's my contact information. Thank you very much for your time. I hope I imparted some nugget of truth, or not truth. Well, definitely truth. Lots of nuggets of truth. Hopefully something you can take back to your company, or companies. There have got to be some questions. Come on, I'll help you.

For the verify stage, and on the topic of versions: where do you get the tests for that particular version of that AMI? So the tests are part of the code base, and they're in Git somewhere? Yes, they're part of the code base, and then Jenkins and other tools, things like Mesos, run those tests against the AMI in that cluster. Are they versioned together or separately? Oh yeah, they're versioned together, definitely.

How do you manage compatibility between components across that deployment boundary? Components like services? Service interfaces, data storage, the whole thing. Yeah, so, data storage: at the end of the day Netflix leverages Cassandra, and you go through a series of services to actually get data in and out. The larger issue is service contracts, because as I said, Netflix is a SOA. How we've solved it up to now is that every team provides a client jar, essentially a binary. So I have the service foo, and you need to make use of foo; you come to me and say, let me have your client library so I can talk to you. Every team is binary-independent, they can run at separate paces, and the contract is through that client library. That works, but it's being reevaluated. Without client libraries you're stuck with the URL, the RESTful interface, and people try to version those, but then the payload can change: I'll change this JSON field, and so on, and that's not where we're going. What we've found is that those client jars tend to introduce a whole lot of other dependencies that can then break. If you're foo and I'm bar, and I give you my client jar that has a different version of some library you depend on, it can mess you up. So we're building some tools to understand the dependency analysis between services. That's how we do it. I'm not saying it's foolproof, but it largely works.

So where does QA fit into all of this? With this huge, robust testing suite, where does a human QA team actually fit in? They don't. Let me explain. The client-side teams, and I'm pointing over here as if there's a Netflix box there, sorry, the teams working on the Netflix interface on devices, your iPad or your computer or your Xbox, those have QA teams. And that's probably the best job in the world: they watch Netflix all day.
And they're watching Netflix in those various environments. But those are the only teams that have QA. There is no formal QA on any of the engineering, basically the server side, all the services that are published. There are roughly 600-plus services running in production, and not one of them has a QA team that QAs it. It's all engineers. You, again, are responsible for what goes into production. The belief is that if I'm an engineer and I throw my work over the fence to QA, well, remember what the keynote said about people being motivated? I forget his exact quote, but essentially you make changes according to how you're motivated, and that's the belief here. If I know that what I write today will be in production in a couple of hours, I'm going to make really sure that what I wrote works well; I'm going to test the bejesus out of it, because I don't want that call at 3 a.m. So there are really no QA teams at Netflix per se. There is a client-side QA team, because you can't automate a lot of that stuff: someone is going to sit down, do a search, and make sure Breaking Bad plays from start to end. Excuse me, yes, I really love Breaking Bad; I should probably say House of Cards or something, but yeah.

So I was wondering: if Netflix is going to put out a new service, say, hey, let's show you a montage of all the videos you've watched over the year, how do you establish a baseline for something that doesn't exist? How does the canary testing work then? That's a great question. We basically wing it. A team will wing it, say, all right, we think this is what the baseline is, put it out there in production, and find out real quick. So you're right, you largely guess what the baseline is, especially with automated canary analysis. Now, there's a lot of institutional knowledge at Netflix about the type of app you're building and how much load it's going to get. Again, we watch our metrics like hawks, so we know that if you spin up a service at this time in this region, you should expect about this much load. So you can largely guess it, then watch it like a hawk going forward and refine it. There's also a performance team at Netflix that acts as consultants and will sit down with teams. Because at the end of the day, engineers at Netflix are super awesome at building a service that fires up My Little Pony quickly on your iPad, but they may not understand the performance implications on the iPad or on the servers it's coming from. That's where the performance team can help on a consultative basis, and they'll largely help you define those initial thresholds.

One more question. Is your CD system in the cloud too? The DVD, oh, CD system, I'm sorry, I thought you meant CD as in DVD. Continuous delivery, yes, of course. Our continuous delivery platform is in the cloud. Like I said, right now it's a series of tools; one of them, Asgard, is open source. We are rebuilding that, and the new platform is called Spinnaker. It is completely in the cloud, and it is completely dogfooded: we use Spinnaker to continuously deploy Spinnaker, and all the teams at Netflix are using Spinnaker to continuously deploy. So yeah, it definitely leverages the cloud. Netflix is essentially 100% in the cloud.
Yes, he says I'm done, that's time. Thank you very much.