So, thanks for inviting me back. I'm going to talk about cloud native and what that means; that's really the "why you should be doing this" part. Then, as I was asked, I'll briefly cover the open source tools that Netflix has put out there, and something about the scale and complexity of what we're doing, and some benchmarks that we've published over the last few years.

So, cloud native: what is it and why does it matter? Well, we're always striving for perfection. We want to have perfect code, we want it to run on perfect hardware, and we don't want to make any mistakes when we're operating it. That's the engineering ideal, the utopia we're trying to get to. But that utopia takes too long to get to, so you always ship code with bugs in it, and you always push the wrong button and break it when you're trying to operate it, all those kinds of problems. There's always this compromise where you're trading off time to market versus quality, and those utopias are permanently out of reach. It's one of those frustrating things about engineering: just give me another six months and I can make it better.

But there's a bunch of markets where time to market is the most important thing, if you're making a land grab, and Netflix is making several land grabs. We're doing a global land grab: we're launching in Holland this fall. We're doing a content land grab: we just announced a deal where we're taking The Weinstein Company's movie output in a few years' time. So we're accumulating all of these large-scale content deals, and we're disrupting our competitors by getting inside their OODA loop. The OODA loop is observe, orient, decide, act; I'll talk a bit more about that later. But if you can figure out how to react to something faster than your competitors, you disrupt the market. And basically anything delivered as web services you can just speed up and go faster and faster. So if you're competing with a service that's delivered on a disk, or delivered by some slow mechanism that takes six months or a year between updates, you can disrupt them very easily.

So this is the OODA loop: observe, orient, decide, act. It came out of dogfighting in the Korean War, and Colonel Boyd basically taught his pilots to figure out what the people they were dogfighting were doing, and to confuse them by being able to react more quickly. When you apply this to business, what you really want to do is observe a land grab opportunity, something that somebody else hasn't seen yet; or a competitor makes a move and you want to react to that competitor; or you find a customer complaining about something and you want to address that pain point. There are other things, but those are several of the different kinds of opportunities we're looking for. The next thing you want to do is some analysis on it: research alternatives, figure out what to do about it. Then you want to plan a response, get buy-in from the rest of your company, and commit resources to go do something about it. And finally you implement whatever it is: deliver it, engage customers, marketing, sales, whatever it takes to get it to the customers. Then there's one more step, which loops back into observe: you're measuring the customers, whether they actually used this thing and whether it makes sense, and then you carry on around the loop, looking at how the customers reacted to this new thing you did. So the faster you can go around this loop, the better you can get your products, and the
faster you can address a market and beat your competitors.

There are big industry buzzwords that really tie into this. One of them is innovation: a lot of people have innovation strategies or innovation groups, or there are big company memos saying we're not being innovative enough, and what they're really talking about is their inability to see and react to these opportunities. The next one is big data, and big data is about the analysis: pulling in the data, the unstructured data, and being able to ask questions that you couldn't previously ask and get answers quickly, so that you can carry on to the next step. The decision part is largely about corporate culture: if every decision goes all the way to the CEO through layers and layers and layers of management, by the time it gets there it's unrecognizable, and it just takes too long. You want decision making pushed low in the organization, so people can see an opportunity, do their analysis, react to it, and get stuff done really quickly. And cloud is really in this fourth box: it's the ability to implement stuff quickly, deliver it quickly, and engage customers quickly, with cloud-based services like Salesforce or real-time bidding, all those kinds of marketing and sales operations things where you're very agile.

So if we're trying to get around this loop, how fast can we go? What I'm talking about here is the ability to do code features in days instead of months. We have an idea on Monday, we've implemented it by Friday, the customers use it over the weekend, and we're looking at the data to see what happened the following Monday. We had a whole series of projects where we cycled 20 or 30 personalization ideas through Netflix's production systems on that one-week cycle, so it can be done. We typically take a little bit longer for bigger projects; like this morning we launched an update to what you probably knew as the Instant Queue, and that was a long project with lots of steps in it. But we can get small changes and incremental changes made in a few days.

When we want hardware, it arrives in minutes. I'll talk a bit later about some benchmarking we did where we were talking about minutes to create really huge NoSQL environments that we could benchmark, instead of taking weeks having meetings with IT, meetings with finance, meetings with management, trying to get approval. By the time you've done that, if you look at the total salary cost of the people in the meetings, you've probably spent more than the actual benchmark would have cost if you'd just run the machine, or the set of machines, for a day.

Then incident response: if you're pushing out code very quickly,
it's going to be broken sometimes, and you want to be able to tell that something's broken in seconds, and you want to be able to respond in seconds, either automatically or by building visualizations that show you what's going wrong so that you can react in seconds as well. If you're looking at one-hour updates or yesterday's data, it's already way too late.

So this is the new engineering challenge: we're trying to construct a highly agile and highly available service from ephemeral and assumed-broken components. Our default assumption is that everything we're talking to, and all of our dependencies, are broken. We don't assume that everything's perfect, and that takes a little bit of getting used to, but switching that assumption is the key thing here.

So how do you get to cloud native? The first thing is that you have to give freedom and responsibility to the developers. That means they can actually observe the things that need doing; there are probably product managers and so on in the innovation part, but quite often the developers are the ones coming up with ideas too. You want to decentralize and automate the ops activities: if you have to go have a meeting with IT to discuss the code you'd like to push to production, every time, then you can only turn that code over so fast. We're turning code over in a couple of hours, from pushing code all the way through to production, and it's done. We can do that because the developers are managing that code into production: they manage it when it's broken, and they understand the exact state it's in at any point in time. There isn't time to explain to the ops guys exactly what the state of the system is so that they could react to it. So what we've basically done is take this DevOps organization and integrate it into the business organization. We don't have a separate business group, a separate engineering group, and a separate operations group; it's really one group, which I call BizDevOps or whatever you want to call it. So how do you get there?
Unfortunately, for most companies it means a reorg, and that's kind of the bad news, unless you're already organized this way, or you can figure out how to organize groups and give them the autonomy to operate in this way, maybe within a product line or something like that. But it's very difficult for organizations to actually respond to this, which is why new organizations come along and disrupt them, and engineers get frustrated in the old organizations and move to organizations where they really do have this freedom and responsibility. So a lot of the speed comes from corporate culture as much as from any of the technologies.

These are the four transitions. The first is integrating all of your roles into a single organization, so that you can iterate around this OODA loop in days, rather than having all these handoffs and meetings and taking weeks or months to go around it.

The developer transition is actually the hardest one we made. The move to cloud was relatively simple; the move from a traditional Oracle or MySQL relational, schema-based environment to a NoSQL environment was the hardest thing to get the developers' heads around. One of the things here is that you have to completely denormalize your data model, so that what used to be a materialized view or a table that some query talked to becomes a completely distinct database. So we have totally separate clusters; we blew up our schema to the point where the back ends are not even in the same cluster. We don't have one huge Cassandra cluster with all of the different tables in it, we have 50 different Cassandra clusters. Each of those, if you looked back at the way it would have been implemented in Oracle, would have been a materialized view or a table or something, all part of one big schema, and if you wanted to touch any part of that, the effects would ripple through everything else, and you'd have to go and bounce the database to do the ALTER TABLEs and all those kinds of things. Now we are continuously altering all of our tables, because the data is somewhat unstructured and also because it's partitioned and denormalized. People are uncomfortable with denormalization, and you have to get over it, because of the speed it gives you. It's one of those things where, yes, it's a little bit broken, but you can build data checkers to clean things up, and you can go so much faster. It also makes it possible to be polyglot. We have largely Cassandra in our back end, but we do have some MySQL in there, we have occasionally had bits of Mongo in there, and there are other databases you can put in there; you can experiment and move things back and forth. It gives you some migration abilities, because once you've broken it up and denormalized it, it doesn't all have to be one big thing, since you're never going to try to do a join across it anyway.

And then, as I said, we moved responsibility from ops to dev for continuous delivery. That means the developer pushes code when they're ready for it, and they understand their dependencies and who depends on them, and most changes are hidden behind layers of abstraction so that you can actually iterate faster.
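A toy sketch of the denormalization idea described above, in case it helps make it concrete: the same fact is written into two views, each keyed for one query path, so reads never need a join. The class and key names are invented for this sketch; in the setup described in the talk, each of these views would be its own single-function Cassandra cluster, not an in-memory map.

```java
// Toy illustration only: "denormalize and never join" means writing the same
// fact into multiple views, each keyed for one query path.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DenormalizedViewingHistory {
    // View 1: answers "what has this member watched?"
    private final Map<String, List<String>> titlesByMember = new HashMap<>();
    // View 2: answers "which members watched this title?"
    private final Map<String, List<String>> membersByTitle = new HashMap<>();

    // Every write goes to both views up front, so reads never need a join.
    public void recordView(String memberId, String titleId) {
        titlesByMember.computeIfAbsent(memberId, k -> new ArrayList<>()).add(titleId);
        membersByTitle.computeIfAbsent(titleId, k -> new ArrayList<>()).add(memberId);
    }

    public List<String> titlesWatchedBy(String memberId) {
        return titlesByMember.getOrDefault(memberId, new ArrayList<>());
    }

    public List<String> membersWhoWatched(String titleId) {
        return membersByTitle.getOrDefault(titleId, new ArrayList<>());
    }
}
```

The cost is duplicated data, which is why the talk also mentions building data checkers to clean things up.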
You can make changes that don't change an interface quite quickly. So we're talking about decentralized, small, daily production updates by each team; some teams might update once a week, others may update multiple times a day, but it's very decentralized.

Then there's this agile infrastructure, the ops-to-dev piece. You have to get developers used to pushing their own code and being on call when it breaks. This is the "run what you wrote" idea, as some companies call it. If you make a developer be on call, they get very good at not getting called, by writing code that doesn't break and by building automation so that when something downstream breaks, their code points the finger at it to say "that broke" rather than just throwing errors itself. So we get very good habits by taking away the crutch that developers typically lean on, the operations crutch, "oh, they'll deal with it." Now, if you have to deal with it, then you make it better.

For inspiration, I've got a bunch of books here. There was actually a blog post today on the Black Duck Software site, based off a previous talk I gave, that goes through some of these books. Who here knows this book, Release It! by Michael Nygard? OK, too many people haven't put their hands up. This is one of the classic books on how to get code out there in production. We learned patterns from it, like the bulkhead pattern and the circuit breaker pattern, that we baked into our code. Bulkheads prevent failures from spreading. Circuit breakers basically say "this back end is currently bad, stop trying to call it, let it recover, and we'll check it occasionally," and then the circuit breaker flips back in. It's basically the same way electrical circuit breakers work.

Thinking in Systems isn't really about computing at all; it's mostly about economics and about how to build large-scale complex adaptive systems that have the right emergent behaviors. It's a really fundamental book in this space: if you're trying to build what looks like a very chaotic system, but you want it to behave in a very stable way, it's important to understand how to build feedback loops and things like that. Antifragile is a book that came out last year that talks about how, when you hurt things a little bit, they get stronger, and this is a principle we've been using for a while. The best analogy is working out: if you do a really good workout, you probably really hurt the next day, right? So that's kind of bad; why would you do something that makes you hurt? Well, it turns out you get a little bit stronger each time you do it, and the reason you're working out is to get stronger. What we do is work out our systems, the Netflix infrastructure, by inflicting a little bit of pain on it. That actually finds the weaknesses and makes it stronger over time. Drift into Failure: the best way I can describe this book is, don't read it if you're about to get on an aircraft or about to go into hospital.
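To make the circuit breaker pattern mentioned above a bit more concrete, here is a minimal sketch of the idea, assuming a simple consecutive-failure threshold. It is not Netflix's production implementation (they later open sourced theirs as Hystrix), and all names here are illustrative.

```java
// Minimal sketch of a circuit breaker: stop calling a back end that looks bad,
// return a degraded fallback, and probe it again after a cool-off period.
import java.util.function.Supplier;

public class CircuitBreaker<T> {
    private enum State { CLOSED, OPEN }

    private final Supplier<T> protectedCall;  // the downstream dependency
    private final Supplier<T> fallback;       // degraded response when open
    private final int failureThreshold;       // consecutive failures before opening
    private final long retryAfterMillis;      // how long to wait before probing again

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(Supplier<T> protectedCall, Supplier<T> fallback,
                          int failureThreshold, long retryAfterMillis) {
        this.protectedCall = protectedCall;
        this.fallback = fallback;
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    public synchronized T call() {
        if (state == State.OPEN
                && System.currentTimeMillis() - openedAt < retryAfterMillis) {
            return fallback.get();             // fail fast while the back end recovers
        }
        try {
            T result = protectedCall.get();
            state = State.CLOSED;              // success: flip the breaker back in
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                state = State.OPEN;            // back end looks bad: stop calling it
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();             // degrade instead of propagating the error
        }
    }
}
```

The bulkhead pattern mentioned alongside it is the complementary idea: give each dependency its own bounded pool of threads or connections, so that one bad back end can't drag down everything that shares resources with it.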
Drift into Failure is full of examples of everybody making the right decision based on all the information they had available to them, and you end up with this tragedy-of-the-commons kind of thing where, when you connect together all of these perfectly good decisions, the end result is a massive failure. We keep pushing out the maintenance intervals because the plane's never broken, so we don't need to maintain it very often, and eventually the plane falls out of the sky, because they finally went a little bit too far and it had never done that before, so everyone's surprised. But there's no actual fault; nobody is at fault in any part of this, everyone was optimizing for the right thing. The problem this highlights is that if you build a really, really highly available system, and Netflix has run into these kinds of problems, then when it does fail, nobody's used to dealing with it failing. So things creep up on you, the technical debt creeps up on you, and then when it fails, the team that's trying to fix it hasn't been on a call, because the thing hasn't failed for a long time, and you end up with people thrashing around and not knowing what to do. So part of this is to have fire-drill, game-day type activities where we practice breaking the system in a way which involves the developers actually getting called and having to deal with it. As we get more and more automation in the system, you have to understand what happens when the automation breaks down, basically. And part of that, as you get into the usability side, is that whenever you have the incident review, well, it's obvious why it broke, right? It's always obvious after the event, and the point of this book is how to make it obvious before the event, so that when things break, it's obvious what to do about it.

Our system is built out of hundreds of distinct services, and each of them has a REST API. We let our engineers figure out what API they want, and it turns out that making up how to do APIs on your own, from first principles, isn't really a good idea; there are lots of bad patterns out there. So The REST API Design Handbook by George Reese is a very good background book; it lists all the bad kinds of APIs that are out there. And if you're doing cloud, he's been interfacing to every cloud API from every vendor that's ever done cloud for quite a while, and that's where a lot of this background comes from. Continuous Delivery, I've mentioned this a few times; Jez Humble in particular is very well known in this space. There's a conference coming up, I think it's November the 1st, called FlowCon. It's the first time it's being held, and it's in Santa Clara I think, or it might be here, somewhere in the Bay Area. It's a one-day conference, and I'm doing the opening keynote, so I'm just trying to figure out how to write that up. But it's very interesting, about how to go faster and faster at getting value delivered to customers, basically. And finally, if you're trying to do TCO analysis and things like that, Joe Weinman has exhaustive coverage of all the algorithms and all of the formulas you need to figure out what the real way is to do cost analysis and trade-offs between different kinds of environments. So, there's this nice quote that says genius is 1% inspiration
and 99% perspiration. That was the inspiration; we had quite a bit of perspiration as well as reading a few books, and we've open sourced our perspiration, if you want to think of it that way, in this open source platform. If you go to netflix.github.com you'll find, I think, currently 34 projects out there.

Why are we doing this? The goals are to establish these solutions as best practices. We put code out there, and everyone says "that's stupid, I prefer doing it this way," and that's really useful information; you can go and look at the alternatives. So we're testing our ideas in public to make sure we've really got the best practices in there. We also use it to hire, retain, and engage really top engineers. Some of you may have heard of the jclouds library, which is becoming an Apache project; late last year we were able to hire the guy who has spent the last three years building jclouds as an engineer. He came to us particularly because of our strong support for open source. That's the other Adrian, Adrian Cole. It also builds up the Netflix technology brand, so I get invited to do keynotes at conferences, which is fun. And then we're benefiting from a shared ecosystem as other people work on our code. We've been engaged recently with IBM, who have done a demo port of the Netflix example application, and they've been working on actually figuring out how to port it to OpenStack and SoftLayer and all their environments. So we've got engagement from large vendors like that, and also from end users: PayPal took some code that we built and repurposed it to work on their internal cloud, things like that. And we're getting benefit from that: we're learning more about the weak spots in our code and the missing things that we want to add to it. So that's our perspiration.

We've also got a Cloud Prize; there's about one month left to run on this, so the deadline is September the 15th. This is to boost the ecosystem; again, there's information about it on GitHub. We have 10 prizes of $10,000 for the best contributions to the Netflix open source platform, with $5,000 of AWS credit included with each of those, and you get a very geeky trophy that beeps and flashes lights and things like that. The prizes will be announced at the AWS re:Invent conference in Las Vegas. We'll announce the nominations, and then the prize winners will be secretly told and given tickets and the flight to Vegas, so they get announced there.

OK, so one of the things that having our code out there gave rise to was some demand from people who wanted to use it in ways that we hadn't been intending to use it ourselves. In particular, we've got some vendor-driven portability here. There was interest in using some pieces of our code in these enterprise private clouds. Eucalyptus, with the last product release they did, basically said it's done when it runs the Netflix code; they used us as an extended test suite, added some features, and started shipping that in June. CloudStack put up a $10k prize for the best integration, inspired by the fact that we were giving away prizes.
So there's some work going on there, and with OpenStack there's some vendor and end-user interest, and PayPal built a console based on our console. So that's kind of interesting: you put code out there, people start leveraging it and figuring out other ways of using it, and we're trying to drive that, but it's largely an experiment at this point. We weren't quite sure what would happen when we open sourced everything and had the prize, and it's been quite interesting to see what the reaction has been.

So now I'm going to talk a bit about Netflix streaming and the challenges we have there. This is a cloud native application, and it's based on this open source platform. We don't really have any enterprise software in our stack; we can scale it as much as we want, we never have to go talk to vendors, and we get all our code from GitHub or we write it ourselves. That's a very different approach from the way most people build large-scale applications.

The website home page looks something like this. The way this works is you have your own device, which is in blue: your web browser or your TV-connected device. That's talking to all these yellow boxes, which are all running on AWS, and the red one, which is our content delivery network, which is our own box. The first thing that happens is you visit the website: there's a discovery API, there's all kinds of personalization, all sorts of information needed to figure out what movies to show you. The next thing is you choose a movie and click play on it. Then the streaming API fires up, handles the DRM, points you at the right CDN to get the content, and starts logging information about whether you're having a good time watching it: what bit rate you're using, whether you're rebuffering. And it remembers where you are, so that if you stop, or you switch to another device, we can continue from the place you were at. So there's quite a lot of logging and a lot of upstream traffic going back into the system. And then there's the CDN. We get all our content from studios and studio partners, and we have to encode it into the right format, so that happens on AWS. Then there's a bunch of CDN management and steering back ends for controlling and tracking where all this stuff is, and we have a very large number of these boxes scattered around the world providing content. They're like big static web servers; most of them are a hundred terabytes in a 4U box, basically, and we ship them out to ISPs and peering points. We do many terabits of traffic through these, and we basically outgrew the ability of the commercial CDNs to provide the bulk traffic capacity that we need, in the way we needed it.

Back in November there was a report on how much bandwidth is being used; this is fixed-line traffic to North America at peak time. And this is not just media: this is the total internet bandwidth delivered to people's houses over DSL lines and cable modems and things like that, and Netflix is a third of that traffic. So we kind of like that. Most of our competitors are a long way down the list, so that was good too. Then six months later they came out with another report. The top-line number is up 39%: the total amount of bandwidth being delivered that they measured, averaged across all the houses, was 39% more. So this is a very fast-growing area. Everyone's got more machines connected, they're watching more,
they're getting more bandwidth, all those kinds of things. And Netflix is still about a third, and our competitors are still mostly small. YouTube is getting big pretty quickly as well; that's the main other platform. Netflix and YouTube together are now more than 50% of all delivered bandwidth, which is quite an interesting way of thinking about it.

If you look at that in more detail, this is what our web server really looks like. If you take all those big orange boxes I showed and look inside, there are about 20 services behind the web server, and it goes several services deep. So in order to generate the home page, this is roughly what the fan-out of requests looks like. Some of these nodes are Cassandra, some of them are memcached, S3 buckets, random things like that. Now, if we lose one of these services because it totally broke for some reason, maybe one of the rows on the website stops appearing; maybe it's the similars row or the Facebook row or something like that that stops appearing, but the rest of the site still works, because we have all of these other services.

So let's look at how that works from a NoSQL storage point of view. What we've built is this highly scalable, available, and durable deployment pattern using Cassandra, and the way it works is that we have this single-function microservice pattern. Cassandra isn't the thing in the middle of this diagram, it's the thing in the top right: that's a single-function Cassandra cluster. It's probably got one keyspace on it, maybe a couple of column families; think of it logically as one indexed table, maybe a little more complex than that depending on exactly how we're trying to structure the data in it. Our smallest clusters are deployed with six nodes; the biggest ones right now are 144, and we have quite a few in the 24 to 48 node kind of scale. In front of that you put a single-function REST data access layer service, and everyone goes through that. So all of these things on the left are the clients that want access to this information; they all make HTTP calls to our auto-scaled layer in the middle, and that totally hides the fact that it's even Cassandra. That's not visible to the clients; they're just making a REST call. We have over 50 clusters following this pattern, over a thousand nodes in Cassandra, the compressed backups are over 30 terabytes, and the biggest clusters are doing several million writes a second right now.

You can also see this optional data center update flow. The point about this is that it gives you a migration mechanism. If you've got an existing system, a big SQL database or whatever, you put a data access layer in front of it, and you convert your applications to get everything through that layer. In fact, you put several in front of it: you take each of these tables and materialized views or whatever you've got, and you put a data access layer in front of each of them. Behind that it's still your big relational database, but now you can put caching in, and now you're building against this new world of a fine-grained distributed system, and now you can start making copies of this data in the cloud. And that's what we did. That's how we got from where we were: the DVD business still runs on a very large IBM machine with a very large Oracle license and a big SAN and all that. That was the code base.
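A minimal sketch of that single-function REST data access layer idea, just to make the shape of it concrete. This is not Netflix's actual service; the path, port, and class names are invented, the in-memory map stands in for the single-function Cassandra cluster, and it only uses the JDK's built-in HTTP server.

```java
// Sketch: one service, one kind of data, one REST interface. Callers never see
// what storage sits behind it.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ViewingHistoryService {
    // Stand-in for the single-function Cassandra cluster behind this service.
    private static final Map<String, String> store = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(7001), 0);
        // GET /viewinghistory/{memberId} is the one kind of read this service answers.
        server.createContext("/viewinghistory/", exchange -> {
            String memberId = exchange.getRequestURI().getPath()
                    .substring("/viewinghistory/".length());
            byte[] body = store.getOrDefault(memberId, "[]")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```

The payoff the talk describes is exactly this indirection: callers only ever see the REST interface, so the storage behind it can move, from a data center database to SimpleDB to Cassandra, without the callers changing.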
We were on that code base four or five years ago. We split it off one piece of functionality at a time into distinct Cassandra clusters in the cloud, in fact originally SimpleDB, but we switched to Cassandra. So we made a copy of the data in the cloud, we kept everything up to date, we wrote checking code that would copy things back and forth, and eventually we turned off the data center access and made the master copy the one that's in the cloud. We now have new functionality that's only in the cloud, and the data center just has the remains of the DVD business in it. We have approximately 37 million streaming customers now, 36 or 37 million, and something like eight or nine million DVD customers; I forget, they vary a little bit, but that's basically where we are.

Now, when you're building these client systems, these single-function REST clients are all trying to use Cassandra to do something, or your data access layers are trying to do something, and you end up reinventing the same recipes over and over again. So we started collecting these recipes and we published them; this is part of our open source project. For Cassandra we have a Java client library called Astyanax, and these are some of the recipes we have for it. You can see they're fairly high-level operations, and some of them are quite powerful, like large file storage. If you want to store a gigabyte-sized object, a really really big chunk, in a NoSQL database, and you store it as a single blob of data, it tends to blow up the database: there isn't enough memory and caching, it will land on one node, which will probably time out, and you'll retry and eventually just fail. What this recipe does is split the data into lots of chunks, spread those chunks over the entire cluster, and make sure the chunks are small enough that if you fail, you just retry a small chunk. S3 has a similar multi-part write mode, and this is modeled the same way. It's very powerful, and it works for reading and writing the same way. And from the Cloud Prize, somebody actually contributed a high-cardinality reverse index recipe for us.

So this is what it really looks like: we have all these microservices, and we're testing them. I was talking about antifragility: we test them with Chaos Monkey, which kills an individual service instance. We use Latency Monkey, which doesn't kill a service; it makes it slow and it injects errors into it. So you can get a machine and say, I want you to just return 500s for a while and see what happens, or I want you to respond in 10 seconds instead of 100 milliseconds or whatever the normal time would be, just to see how the errors ripple out, and whether we have bulkheading properly set up so that the consumers of that service contain the fact that it's now misbehaving. We have Conformity Monkey, which goes around looking at all these services and making sure they're set up correctly; we have a bunch of rules that we write that are basically the architectural patterns for how things should be set up. And we have a whole load of other monkeys too. Then there's Chaos Gorilla: we take that pattern and replicate it three times, using availability zones on AWS.
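Going back to that large-object recipe for a moment, here is a rough sketch of the chunking idea: split a big blob into small pieces and store each piece under its own key, so the pieces spread across the cluster and a failed write only retries one small chunk. This is not Astyanax's actual API; the KeyValueStore interface and the key layout are invented for the sketch.

```java
// Sketch of chunked large-object storage over a generic key-value store.
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ChunkedObjectStore {
    interface KeyValueStore {
        void put(String key, byte[] value);
        byte[] get(String key);
    }

    private static final int CHUNK_SIZE = 64 * 1024;  // small enough to retry cheaply

    static void write(KeyValueStore store, String objectId, byte[] data) {
        int chunkCount = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        for (int i = 0; i < chunkCount; i++) {
            int from = i * CHUNK_SIZE;
            int to = Math.min(from + CHUNK_SIZE, data.length);
            // Each chunk gets its own key, so chunks land on different nodes
            // and a timeout only means retrying this one small piece.
            store.put(objectId + ":chunk:" + i, Arrays.copyOfRange(data, from, to));
        }
        store.put(objectId + ":chunkcount",
                Integer.toString(chunkCount).getBytes(StandardCharsets.UTF_8));
    }

    static byte[] read(KeyValueStore store, String objectId) {
        int chunkCount = Integer.parseInt(
                new String(store.get(objectId + ":chunkcount"), StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < chunkCount; i++) {
            byte[] chunk = store.get(objectId + ":chunk:" + i);  // reads also retry per chunk
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }
}
```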
So we have three zones, and we have three copies of all our data. We use a load balancer to randomly pick a zone and make a request to a machine in that zone, and then we use Cassandra at the bottom to move all the data back and forth and keep everything in sync. And we have EVCache, which is a memcached-based replication library that we built, that makes sure we have three copies of your memcached data, one in each zone. So now we also have Chaos Gorilla, and we actually run this in production once a quarter. I was at the Cloud Connect conference in March, and we actually did it during the conference; it was the Tuesday of the conference. We ran this in production, took out a third of our infrastructure, almost 3,000 machines shut down as fast as we possibly could, and there was no Netflix outage. There were a few errors, and we found a few things that weren't quite right, but it didn't cause a customer-visible outage. This is basically the antifragile stuff in action: we actually are stronger because we run these tests. We run them once a quarter, just in case we do get a real outage; we don't want to hit the quarterly availability number too hard. But if you do lose an entire building, because a hurricane goes over Virginia or something, or there's a power outage, what you really want to do is just turn off traffic to it: you just tell the load balancer to stop sending traffic, and we can continue to run on two out of three. It's a quorum-based voting system, so as long as I've got two out of three, my queries will continue to run. That gives us a very powerful mechanism to aggregate any kind of failure up to a zone failure. If anything at all goes wrong, I don't really care what went wrong, as long as the only place it goes wrong is that one zone at a time, because I can just hide it. And AWS has got better and better at zone isolation; some of their outages from two or three years ago were caused by things leaking across zones, an outage in one zone trickling across to the others, and they've got better and better at isolating those.

What we're doing now is also isolating regions. We use Europe, EU West, and the US, and we decided we wanted to isolate those two regions so that they're running very independently, but keep the customer data global: the fact that you're a Netflix customer is global, so if you travel to Europe, your account still works. You don't see the same set of movies, because you see whatever movies we have in the UK, but your account works. Now, if we lose the connection between the two, because of the way Cassandra works, it's an AP system: it's available under partition. If there's a partition, both sides are available for writes, so all reads and writes continue, and if you happen to update both sides at once, then whichever update was latest wins when the partition gets fixed; it comes back and resynchronizes automatically. This is something Cassandra is actually really good at, and we've leveraged it a lot. If you lose an entire region, the other region keeps running; that's the key thing.

So these are the failure modes we're looking at. Application failure: we expect this to happen all the time, because we're shipping code, and if it's not happening occasionally then we're probably not shipping code fast enough, right? So we're expecting that to happen,
and we want an automatic degraded response. Region failure is relatively infrequent, but it's happened often enough that we decided we wanted to be able to switch traffic between regions, and we're working on that now; I'll talk a bit more about that later. Zone failure: it's part of the normal operation of cloud that you should expect a zone failure occasionally, so we want to continue to run on two out of three zones with no impact. Data center failure: we've now moved all our dependencies into the cloud, so we can continue to run with our data center completely down, and we've had that happen a few times. In fact, many of the Netflix outages over the last two or three years were actually data center outages that caused the cloud side to fail. One of the last things we moved was hardware security modules; AWS now has HSM options, these SafeNet Luna boxes that hold all your crypto keys. We were calling back into the data center to get at those keys for a long time, and AWS has that in the cloud now, so we don't have to. If you get a data store failure, so Cassandra corrupts something or application code scribbles over data and you need to go back to a backup, we have S3-based backups; we're continuously storing stuff in S3. And if S3 fails totally, we want a copy of our data that wasn't on S3, so we keep one extra copy in a remote archive that is not on S3. We do that as a daily update, so we still have all of that Netflix data somewhere else.

Now let me talk a bit about Cassandra at scale. What we've really been doing here is benchmarking to try and retire risk as we pushed Cassandra harder and harder; some of these benchmarks were us asking, how hard can you push it, and is this going to work? The first one we did, in 2011, two years ago: at the time we were running on 24-node clusters and had maybe just started using 48-node clusters, and the question was, how far can you scale? We scaled up to 288 nodes and it was just linear, so we went, OK, I give up, it's probably going to continue to be linear, that's far enough. And what it really came down to, as I said, was: can we get a million writes a second? I don't care how, with lots of machines. The DataStax guys started using it in their marketing, so there was actually a "million writes a second" pop-up advert going around; if you clicked on it, it eventually took you to the blog post I wrote on how we did it. This was using 288 really quite small machines, only four CPUs, eight ECUs, which is the Amazon compute measure; the biggest Amazon instances are 88 ECUs, so this is a much smaller system than you could use.

A year later, we've got solid-state disks. So now we ran a benchmark which compared non-SSD versus SSD, and we were able to do away with memcached and get basically the same throughput and lower latency, at about half the cost, because we could run on such a small cluster of machines compared to how many we needed to keep Cassandra running with regular disks. So that's what we now run; I think well over half of our Cassandra nodes are SSD-based, several hundred of them.

This year we've got some cross-region use cases. We've been using it for geographic isolation, US to Europe, as I mentioned, but we've got a new one: we want to do redundancy for failover between the East and West Coasts. And there were some people going, well, I'm not sure how well that's going to work. Can we trust it? How reliable is it? If I write in California or Oregon,
how soon is it going to get to Virginia? All those kinds of questions. So we decided to run a benchmark on this, and this is what the benchmark looked like. We took our most write-intensive cluster, the one we hit hardest, because previously we'd only been doing multi-region for read-intensive workloads, since that seemed like an easier thing to go do. So, what's the most write-intensive one we could use? It turned out that AWS had enough spare SSD instances at that point in time, a few months ago, that we actually grabbed 96 of them and stood up a 96-node cluster, with 48 in Oregon and 48 in Virginia, which took about 20 minutes. And this came off a hallway conversation; there was no going and arguing with anybody about how much this was going to cost, because we only needed it for a few days anyway. So this is 192 terabytes of SSD, fired up and all running in 20 minutes. Who here can do that? I mean, who here thinks they can fire up a 192-terabyte SSD Cassandra cluster in 20 minutes? OK, all of you can: all this code's open source, it's public cloud, you just download the code, push the buttons, and 20 minutes later it's running. Really, this is the level of automation we've got, and all of this is open source code; there's nothing special here. It might take a bit longer to get it set up the first time, but that's the power of having this stuff out there. So I challenge you to go away and build a cluster.

So we've got these two clusters all hooked together. The first thing we did was pull 18 terabytes of backup data into one of the clusters; that took an hour or two to suck out of S3, with 48 nodes in parallel just pulling data in. Each of these nodes has two terabytes of SSD and a 10 gigabit network port, so my bisectional bandwidth across the country is 480 gigabits per second, which is quite a lot, really, for the internet. Then we said, well, let's push that 18 terabytes to Oregon and see how quickly it gets there. We were a little worried about how fast it was going to go; it might get there a bit too quickly and break something. But actually it only ran at 9 gigabits, so it wasn't too bad; that's actually not very much, since it's 48 single threads of data being copied over. In doing that we used the Boundary network monitoring tool to look at all the traffic flows, and it was measuring a TCP round-trip latency of 83 milliseconds, very stably, which was quite nice. I've got a bunch of Boundary graphs which I didn't have time to include today. We then put test load on both sides, and we also put on a validation load, where we wrote a million writes to one side and read them all back from the other side half a second later.
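A rough sketch of that kind of write-then-read-back validation, assuming a simple key-value client per region; the RegionClient interface and key format are invented for this example, and the real benchmark drove far more load than this loop would.

```java
// Sketch: write keys into one region, wait about half a second, then read them
// back from the other region and count anything that has not replicated yet.
import java.util.ArrayList;
import java.util.List;

public class CrossRegionReplicationCheck {
    interface RegionClient {
        void write(String key, String value);
        String read(String key);
    }

    static long countMissing(RegionClient writeSide, RegionClient readSide, int keyCount)
            throws InterruptedException {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < keyCount; i++) {
            String key = "replcheck-" + i;
            writeSide.write(key, "payload-" + i);      // write into the local region
            keys.add(key);
        }
        Thread.sleep(500);                              // give replication ~half a second
        long missing = 0;
        for (int i = 0; i < keys.size(); i++) {
            String value = readSide.read(keys.get(i));  // read back from the remote region
            if (!("payload-" + i).equals(value)) {
                missing++;
            }
        }
        return missing;                                 // zero means everything made it across
    }
}
```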
So you write, and then you read back, just to make sure the data got there, and it all got there. So the engineering people within Netflix who were a bit dubious about it mostly went, yeah, I guess that's going to work. We now have this running in test and we're bringing it up in production, so over the next few months we'll actually be running live in this mode, before the end of this year.

A couple more things. We've now got customers that we need to split between these two coasts, so we need to use DNS to do that. But every DNS vendor has a different API and different features, so we built a library called Denominator, which is the highest common denominator of DNS features, a very powerful library. It's also got a command-line tool; if you're doing any DNS management with any vendor, just go and get this tool, it's a really powerful way of doing it. This is what Adrian Cole has been working on for the last six months, if you know him; he's been having a lot of fun with it. So now, if we lose a zone, we don't care. If we lose a region, we stop talking to it and send all the traffic to the other region. If we lose a DNS vendor, we switch to a different DNS vendor; we have all of the DNS data abstracted outside them in a globally available, Cassandra-backed data store.

What we're trying to do here is get the problems to go away. The biggest problem you can have is a PR-level incident: a few times a year there's an outage that's big enough that it makes it into the press and people write articles about it, and that has a much bigger impact on our customers than the actual outage itself, so you want to avoid anything that has a public relations impact. Below that there are probably ten times as many incidents that actually cause people to call customer service. We don't want those, because customers that call CS are more likely to quit Netflix, so we want to keep people happy; you never want to push them to the threshold where stuff doesn't work. And below that there are the ones that just affect the quality of what you see in the test results. So, by active-active and game day practicing, we're trying to push down incidents that would otherwise have been PR incidents to be just CS incidents; then, by better tools and practices, take the things that would have been CS incidents and move them down to be just features that got disabled; and then, by better data tagging, we can take all the feature disables and have clean data on who actually saw which feature when we're doing all of our A/B testing analytics. That's kind of the way to think about it.

OK, I've just got a couple more slides. Cloud security is always a big question. There's a whole presentation on this that I've linked to here, from Jason Chan; if you go to slideshare.net/Netflix you'll find lots of these. But one of the things we do is automate our attack surface monitoring. Whenever you create a new S3 bucket, there's a thing that will go and find that you've created it, check the permissions, and make sure they're correct. Whenever you push new code, we automatically run a penetration test against it to make sure it's good code, and we'll send you a little note saying, by the way, we're testing it, your logs might have filled up. I've talked about CloudHSM; key management is really critical. And then the other thing is that AWS is at such a large scale that a lot of the concerns you have about DoS attacks and
things like that are really mitigated by the Amazon layer.

So, managing scale and complexity: these impossible deployments, things like "hey, I need 200 terabytes of SSD in 20 minutes, and I just thought that up," are not really possible any other way. We're jointly building code with partners in public: our Cassandra work was done jointly with people from lots of other companies, and for the Denominator DNS management we've been working very closely with UltraDNS and Dyn and Rackspace and HP, with support for a bunch of other DNS back ends that we don't use ourselves. This gives us a highly available and secure system, despite the fact that we're running at big scale and we're running really fast.

Here are some links to various things we've got: some blog posts, and the meetups that we've done where we had contributors and lightning talks about the individual projects. You can't read the URLs, but the slides will be out and about. If you're particularly interested in cost-aware monitoring and how to optimize costs on AWS, I did a joint talk with Jinesh Varia, which is actually on the AWS SlideShare site rather than the Netflix one.

OK, so, final slide, which is basically: we're using cloud native to manage scale and complexity at speed, and open sourcing it to make it easier for everyone else to become cloud native. That's your takeaway. Happy to connect with anyone on LinkedIn; I'll be around for the next few hours. I'm afraid I have to wander off before the evening, but thanks very much.