Okay, we're back live here on siliconangle.tv. We are here inside theCUBE with Adrian Cockcroft, who is the director of architecture at Netflix. Adrian, welcome to theCUBE. Thank you. Okay, we're excited to have you here. Netflix is obviously a great brand. You guys are doing some cutting edge work in cloud. Obviously, the company is a very disruptive company in disruptive industries. It's very much like what Apple iTunes did to the music business. Netflix has done and continues to do that to the media business, and it's fantastic to watch. Congratulations to the company. A bit more under the covers, there's a lot of tech involved. You guys are, I guess, a pure play Amazon cloud solution, for lack of a better description. So your supplier is Amazon, which made their bones really inventing this notion of public cloud for developers. So take us through quickly the state of the infrastructure at Netflix, and then let's talk about what happened here at the Cassandra Summit, where you gave kind of a risky demo, but it worked. You were a little nervous on Twitter. You tweeted to me last night and this morning, a little nervous. I said, stay positive, so good karma came your way. We're going to walk through that. But first, take us through what's happening with the infrastructure at Netflix. Well, about three or four years ago, we made the decision to try and build out on cloud, and we did it piece by piece. So we did some trial projects and things like that, and it basically has been working. Around about two years ago, we made a similar decision that we needed to look at a NoSQL solution, and that was where we got into Cassandra. And for about the last year and a half, we've been working through this transition of getting basically everything we've got running on Cassandra on Amazon. And it's now the master data store for most of our data. So it's not like it's a copy of the data that's somewhere else in the data center. It's the master.
And so we have backups and archives and all kinds of things like that. So we've had to turn it around. And as we've done that, as we've moved to the cloud, we've got more and more reliable, and availability's got better. Are you guys just relying on Amazon right now as the only cloud provider? The streaming service runs completely on Amazon. Yeah, we only use one provider. Okay, great. Let's talk about NoSQL, obviously. We're going to get into some of the more religious discussions later around NoSQL and relational databases. I'm on record as saying, you know, they're all great and it's not a mutually exclusive situation. Obviously, schema is important. Schema-less or schema-light is relatively good for having flexibility with data. I will come back to that. But I want to talk about the core thrust of what's going on in the developer community, which is: I have ideas, I want to get them into production as fast as possible. And I want to accelerate the step from prototype to production. And that is, it's been well documented, there's no need to rehash it here, we're in a beautiful time right now. It doesn't cost a lot to get something prototyped, but getting into production is a whole other ballgame. So talk about what you are doing here at the Cassandra Summit and your demo in particular. So we had four goals when we did the cloud migration for Netflix. One was lower latency for customers, better performance. Another one was to be scalable, or scale independent, so we could scale to any size we needed to be. Another one was to be as reliable as possible, because your TV set should just work when you talk to it. It shouldn't care about whether anything else is working. And the final one was to be very productive for developers. And we put a lot of effort into getting everything out of developers' way. And the first slide I had was things we don't do.
Mostly wait, wait, wait: talk to IT, wait three months for a machine, all that kind of stuff. That all goes away. I didn't ask permission to run this demo. I just went, oh, that sounds like a good idea, and we were just deploying machines. So I deployed 48 machines live in production. Well, actually in our test account, but in our Amazon account. I could easily have done this in production with like one extra click. And we did it using a Jenkins build job, and that's our continuous integration system. So basically we worked out this build job. We just click a button and it creates a cluster from scratch, runs a benchmark against it, and shuts it down, all in less than an hour. What was your purpose of showing the demo? Obviously we all know Amazon's great for provisioning. It's very fast. But what was the point of the demo? The point was obviously the ease of deployment. Two points. One is that the Cassandra community is sort of split between people in the cloud and people in the data center. We call the data center guys server huggers sometimes. Because they like servers. That's server huggers. So it was trying to blow the minds of as many of the server huggers in the audience as possible. Not tree huggers. Not to be confused with tree huggers, who are totally green. For the people in the data center, it's not about being green. It's about wanting to have control and not wanting to let go of control of the machines in your data center. And so the fact that I can create an arbitrarily large cluster in real time, without having to go and deal with anyone else, with all automation. So that was one piece. The other was that I actually ran two clusters side by side, one of them running on conventional disks and one running on solid state disks. The two camps are the data center, the server huggers, and cloud. Okay, got it. Yeah, cloud and on-prem. Whatever you want to call it. Okay, cool.
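The one-click build job Adrian describes, creating a cluster from scratch, benchmarking it, and shutting it down, can be sketched as a simple pipeline. Everything here is a hypothetical stand-in: the function names, sizes, and instance ids are invented, and the real job drives the Amazon APIs from Jenkins rather than these stubs.

```python
def launch_instances(count):
    # Stand-in for requesting EC2 instances; returns fake instance ids
    return [f"i-{n:04d}" for n in range(count)]

def wait_until_healthy(nodes):
    # The real job would poll instance and Cassandra ring health here
    return True

def run_benchmark(nodes):
    # Stand-in for driving load against the cluster
    return {"nodes": len(nodes), "status": "ok"}

def terminate(nodes):
    # Stand-in for tearing the cluster back down
    return len(nodes)

def build_job(cluster_size=12):
    """Create a cluster from scratch, benchmark it, shut it down."""
    nodes = launch_instances(cluster_size)
    assert wait_until_healthy(nodes)
    result = run_benchmark(nodes)
    terminate(nodes)
    return result

print(build_job(12))  # {'nodes': 12, 'status': 'ok'}
```

The point of the pattern is that the whole lifecycle is one callable unit, so a cluster exists only for the duration of the test and costs nothing before or after.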
So what I wanted to do also was show what happens when you add solid state disks, which Amazon did recently. And we were quite involved in the whole process of getting Amazon to do the solid state disks. And we've had early access to it. And we have a pool of them that we're currently testing against. And I managed to get 24 of them out of that pool. And they're very hard to get right now. Most people in the audience are just trying to get access to these at all. So it was partly just kind of, hey, if you want to play with solid state disks, come work for us. That was the other reason we did this. But I ran side by side the same tests. The only thing we changed was the instance type. They have roughly the same memory, similar CPU. One of them has two SSDs. One of them has two regular disks. And they kind of come up and they're sort of okay. And then halfway through, we scaled them. So we start off with two 12-node Cassandra clusters. It takes a few minutes to create them, and within five, ten minutes they're loading. We're running data against it. That's from nothing. There is nothing. I just have a build job in Perforce that goes and runs using Jenkins. The point to the audience is, look, you can go from zero to 24 nodes, from zero to a system that's having data written into it. Real production, moving an app and workload into these. Production quality, ready to roll, reliably works. Got it. Somewhere along the way, if one of the instances we're getting from Amazon doesn't work, and I think that actually happened, one of the instances didn't come up right, Amazon automatically killed it off and replaced it. It just kept working. It delayed one of the instances by five minutes, but it came up. The other cluster comes up with the solid state disks, and it has much better performance. And one of the things about Cassandra is you can increase or decrease the size of a Cassandra cluster, but it's a very intensive operation.
So what I wanted to show is that you can use the solid state disks to make that something that you can just go do. So as a Cassandra cluster, as you put more and more load on it, you want to just be able to say, make it twice as big. Add twice as many nodes to the cluster. That gives you twice the disk space, twice the CPU, all that. But normally if you try and do it, the machine tanks. The whole cluster runs out of performance while it's trying to grow, because it's extra work for it. And what I showed was that you can do that and maintain decent performance with the SSDs. Congratulations on the demo, it worked. Part of me was hoping it would fail a little bit so we could have something to talk about and kind of get better on. Now I'm only kidding. I was hoping. No, the original idea for the demo was way more ambitious than this. This is it scaled backwards. It's kind of like the Olympics. You didn't have to do the triple axel, but you nailed the landing. And you settled at 24 nodes. You were trying to do 48, was that it? I actually put up a picture of the Mars lander, the MSL, with the sky crane dropping the thing down. That was my slide saying, okay, I'm going to do a live demo now. Well, we had a great demo. We covered Google I/O, where MapR showed Google's compute cloud, a competitor now trying to be competitive with Amazon. Similar demo, massive amounts of scale with HDFS and Hadoop. Again, the comparisons are, I mean, this is not something new. Minutes to provision and scale into production, versus the alternative, the old way, which was first get approval. That's three weeks. And then get a data center, if you're lucky. Just the costs. How much did it cost you? I mean, it's usually months. The entire thing, the list price was like a hundred bucks for one cluster and fifty-something for the other. But with reservations, it's a fraction of that. And I didn't need it before. I didn't need it after I'm done. Great.
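A rough illustration of why doubling is the natural growth step for a classic, pre-vnodes Cassandra ring: with the RandomPartitioner, evenly spaced tokens can be bisected so each new node takes exactly half of one existing node's range. This is a sketch of the token arithmetic only, not Netflix's actual tooling.

```python
RING = 2 ** 127  # token space of Cassandra's classic RandomPartitioner

def initial_tokens(n):
    """Evenly spaced tokens for an n-node cluster."""
    return [i * RING // n for i in range(n)]

def doubled_tokens(tokens):
    """Tokens for new nodes that bisect each existing range,
    so doubling the cluster halves every node's share of the data."""
    n = len(tokens)
    out = []
    for i, t in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < n else RING  # wrap at ring end
        out.append((t + nxt) // 2)
    return out

old = initial_tokens(12)
new = doubled_tokens(old)
print(len(sorted(old + new)))  # 24 tokens: a 24-node ring
```

Each new token lands midway between two old ones, which is why growing by exactly 2x avoids rebalancing every node; the SSD point in the transcript is that the streaming work this still requires no longer tanks the cluster.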
So let's talk about Cassandra, because we had Jonathan Ellis talking about horses for courses, Dave Vellante's favorite expression. My co-host is not here this week. But you got Mongo, you got HBase, you got Cassandra, you got Couch. Size them up for me. From a personal perspective, how would you rate those horses and the tracks they should be running on? We've got experience with quite a few of them. So we've used MongoDB for some internal applications. We drove it to the point where it basically stopped scaling. And they say that the next version's going to scale. So I characterize that as eventually scalable. Yeah. So MongoDB is eventually scalable. The next version will scale a bit better, the version after that will scale a bit better, but Cassandra doesn't really have a scaling limit. You can just throw more at it. So there are places where we were doing a lot of JSON integration from multiple data sources, and Mongo is a very good application for doing that. We got going very quickly. We also used SimpleDB for a long time, and we still use it for some small things. Are you using Amazon's database? We're not using DynamoDB. It sort of came along a little bit too late. We'd already basically used Cassandra to meet that need. And MySQL, we use that for things which aren't customer facing. So there are places where, say, you're trying to edit the definition of a movie. A new piece of content's come along, and we have to enter the information about it: this is who's in it, all those kinds of things. It's a very relational, constrained data model. We actually use MySQL for that. And there's a few hundred users maximum, so it's not going to blow up. Yeah, I mean I think that's right. MySQL's not going away, and that whole debate about NoSQL's ridiculous. I mean, you need schema.
The question I want to ask you really pertains to your unique qualifications and the position that you're in. And quite frankly, your age. You've got a lot of experience. You've been through battles. You've seen the history of the business and industry. But you're in a position now where you get to play with some new stuff at Netflix. You mentioned SSDs. But also you've got a really stringent ops environment. I mean, you're ultimately doing a lot of DevOps. So you kind of do both, right? You want to do R&D, kick the tires, do new things that explore the future. At the same time, you've got to run the business, right? So it's challenging. You've got a unique job. So my question is this: what legacy things do you look at and say, wow, we're really watching that, that's really important, that's changing right now? And specifically data protection, data backup, and high availability. Because it was in the press, Amazon went down. You guys were somewhat affected by that. I don't want to rehash that whole Amazon fail on the East Coast, with that storm and one data center. But high availability has come up with Hadoop, HDFS and the NameNode, and data protection with SSDs. As you mentioned, SSDs have been really disruptive in terms of performance and economics, not only in memory and reducing spinning disk latency, but also in changing the future of data protection. What can you share with the folks, your commentary on those two areas, high availability and data protection? So in the high availability area, there were actually two power outages at Amazon, about a month apart. The first one was a little smaller, and the second one took out most of our zone. Both times we lost some Cassandra nodes, but we didn't lose Cassandra service. The service kept going. So the backend systems were affected, I mean, we had to replace the machines that had died, but it kept working.
So Cassandra has proved to be pretty solid, taking large chunks of capacity being taken offline. The difference between the two outages was basically one line of code: we had a bug. We should have survived the second outage. We just had a bug that caused it to trigger something that we hadn't run into before. And we have a test program. We have this thing called the Chaos Monkey that kills individual instances. We have a thing called the Chaos Gorilla that kills entire zones. We were testing that; we hadn't got to that piece of testing, so we hadn't found that bug. So it was just unfortunate that we hadn't actually found this bug in the past. But wasn't that a sizable outage, though? We're designed to take the loss of a complete zone. There are three zones; we store all our data in three separate buildings. I can lose a whole building. Let's come back to the monkey and gorilla. I want to have a specific drill down on that, because that's notable and interesting and very fun to talk about, and relevant. How about data protection and backup? An old, boring vertical, data backup. Very important. The master copy of our data is in the cloud now. So I can't just say, oh, sorry, we lost it, we have no subscriber records anymore, or whatever. So what we do is we continuously archive the data out of Cassandra. We put it in S3, and then we make another copy of it on the other coast. So most of our machines are running on the east coast. We make a copy on the west coast, and we make more copies of that data so we can't lose them. So we can then go back to any point in time. So if a developer writes some bad code and starts corrupting the database, or we have a bad version of the code and something goes wrong and we start corrupting things, we can wind back to just before that. We haven't actually had to deal with that yet. And we can do restores from our archives. You mentioned that your demo here was at the Cassandra Summit.
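The point-in-time rollback Adrian describes boils down to picking the most recent archive that predates the bad write. A minimal sketch with made-up timestamps; the real pipeline archives Cassandra data to S3 and copies it to the other coast, which this doesn't attempt to model.

```python
from datetime import datetime, timedelta

def restore_point(archives, bad_time):
    """Pick the most recent archive taken strictly before the
    corruption, so a rollback lands just ahead of the bad write."""
    candidates = [a for a in archives if a < bad_time]
    if not candidates:
        raise ValueError("no archive predates the corruption")
    return max(candidates)

# Hypothetical archives every 4 hours; corruption starts at hour 9
base = datetime(2012, 8, 1)
archives = [base + timedelta(hours=h) for h in range(0, 24, 4)]
print(restore_point(archives, base + timedelta(hours=9)))
# 2012-08-01 08:00:00
```

The continuous-archive discipline matters because the cloud copy is the master: without a predating snapshot to select, there is nothing to wind back to.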
The killer demo you did, standing up the 24-node cluster in minutes at very low cost, was designed for the cloud guys and the data center guys, the server huggers, as you say. Really, that brings up a point about this whole migration to the cloud. The cloud was the revolution: oh yeah, everything's going to go off-premise, and it was very plausible. Three years ago, I was on board with the cloud. I love it. It's the future. Of course we want, in a metaphorical sense, substations in the cloud. But what's happening with the legacy data center guys is that that path is not happening, and hybrid seems to be what's being adopted. Not that the cloud's not happening; it's just that the revolution's not happening the way we thought it would. Mainly because, from what we're finding, it's SSD. Solid state has changed some dynamics in the data center, where economics and performance are good enough now that it's not forcing people into the cloud. Yet cloud is viable for new use cases. Can you talk about that from your perspective? Do you agree with that statement? Do you see that SSDs have changed some of the storage and server paradigms in a way that allows much more dynamic on-premise solutions? And how does that change the notion of cloud? Specifically because cloud is not going away. Pushing stuff to the cloud makes a lot of sense, whether it's new applications or big data. But it's not that land grab to the cloud. So what's your take on that? This is a really interesting conversation. So I think what's really happening is that as large enterprises move to cloud, they are doing it both on premise and on public cloud, and both are happening. The IT organizations are finally getting to figure out how to build something they can call cloud, and they're trying to go build an in-house cloud. And those are starting to roll out, with some having more success than others. What's happening, though, is the developer organizations are going straight to public cloud and running around them.
And the reason they're doing it is, it's becoming shadow IT. Yeah, because they can actually do it faster without approval. Yeah, and that battle is still playing out. There is a lot more shadow IT than anyone knows or admits. A lot of organizations are actually doing more and more in public cloud. Big traditional organizations are getting in there. So that hasn't slowed down. Yeah, yeah. But what's happening now is it's being legitimized by the fact that you're now trying to do cloud in the data center as well. It's one of these things where when you stand up a competitor, you legitimize the thing you're competing against, and it actually increases adoption overall. Yeah, I mean. So I think on both sides, I mean, we're hybrid. We started off in the data center, then we moved things piece by piece to the cloud. So we're a hybrid app. But what we're getting to is the point where our data center can totally go down and our cloud stuff still runs. And it's newer code. It's better architected for availability in our cloud-based application. We have a lot of legacy stuff in the data center which we're cutting the ties from, so that it doesn't actually cause problems for us. I actually think shadow IT is a good dynamic. Now, I would get slapped around by the compliance guys; for public companies, when you look at the compliance issues involved, right? You know. But what we're moving to is a model where some of the corporate IT stuff is still in the data center, but the product itself is running in the cloud, because it's so much faster to provision and we can build more available structures there. Hey, I'm a big believer in competition. And like I say, competition's a good thing. When you have developers in a playground like public cloud with all the benefits that you showed in your demo, and Cassandra, you're rocking and rolling. So that's cool.
Well, let's get back to Amazon and some of the things that you talked about, the gorilla and the monkey. So the Chaos Monkey is a concept that you guys have been playing with, and also the Chaos Gorilla. Explain to the folks what it is. It's a methodology that you guys put together for really bulletproofing your own infrastructure: failure injection for your QA, for your stability. Take us through that. So we have a service that runs all the time. We open sourced it. We have a collection of things on GitHub. So if you go to github.com slash Netflix, you can download the code for it. It basically picks a particular application and kills one of its instances at a random rate. Picks one at random. And we build applications that are designed to survive that. And that's basically it. So you put that in on day one, and that's for the developers. It's the Kobayashi Maru of the data center. That's the Star Trek test, that one program that Captain Kirk has to pass, and he fixes the code to beat it. Possibly. You'd have to see it. I'd recognize that one. Okay. But basically it's a... If you wrote the code, then you could actually get around the actual... Well, say you're a developer and you think, I'll just write this data to this file on this machine and it'll probably still be there tomorrow. But if you know that there's a Chaos Monkey out there that's just going to delete machines continuously, you know that really I have to write it to Cassandra. So the only persistent place I can store things is in Cassandra. Just don't do a ranking mechanism on it, because then the hackers will actually look at the code and code around the monkey. You can opt out of the monkey if you want to. Because we have a whole army of these things. There's one that injects latency. We have ones that look for security problems. You mentioned you got a lot of hits on that blog post. So it's popular. Yeah.
So the Chaos Monkey has been our most popular blog post, and it's the number one download we have on GitHub right now. So what's the Chaos Gorilla? Chaos Gorilla, the idea there was that we wanted to be able to migrate from one zone to another zone. If a zone was having problems, we wanted to be able to shut down everything in that zone and bring it up in a different zone. So that's what we've implemented with the Chaos Gorilla, and we can also use it to simulate zone-level outages. So we should be able to survive the loss of a third of all the machines that we have as a unit, and then just have everything continue to run on the other two thirds, without any customer-visible outages. Adrian, final question. I want you to share with the audience just your personal experiences of the past. Just go back five years. Actually, five years ago was about Web 2.0; we were in the thrust of the whole Web 2.0 era. Netflix was a known brand, really kind of hitting its stride on the web, pre-streaming, and you were booting up your operations. Take us through your own personal experience, and share with the folks out there what you've learned and what you would tell folks to avoid. And then going forward, what are you expecting? What emerging trends are you excited about? What are you watching closely? What are you worried about? Well, going back five years, I've been at Netflix a little over five years. I joined managing one of the personalization teams, which was totally data center based, but we did have the streaming service that had just launched. This was the beginning of '07. And we basically had a handful of machines. We ran on 20 web servers, I think, and we had another handful. So there were maybe 50 or 100 machines total that ran everything Netflix had, and they were all sitting in the data center. And it was very painful. They were all slightly different configurations, and they'd break in odd ways and have different bugs, and it was an awkward thing to deal with.
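The Monkey and Gorilla ideas can be illustrated with a toy fleet. This is a simulation only, with invented instance names and zones; the real Chaos Monkey and Chaos Gorilla, open sourced at github.com/Netflix, terminate actual AWS instances and simulate actual zone outages.

```python
import random

# Toy fleet spread across three zones, mirroring the three-building layout
fleet = {f"i-{z}{n}": f"zone-{z}" for z in "abc" for n in range(4)}

def chaos_monkey(fleet, rng):
    """Kill one instance at random; the app must survive losing it."""
    victim = rng.choice(sorted(fleet))
    del fleet[victim]
    return victim

def chaos_gorilla(fleet, zone):
    """Kill every instance in one zone, simulating a zone-level outage."""
    victims = [i for i, z in fleet.items() if z == zone]
    for i in victims:
        del fleet[i]
    return victims

rng = random.Random(42)
chaos_monkey(fleet, rng)
chaos_gorilla(fleet, "zone-a")
# Roughly two thirds of capacity survives; service should keep running
print(len(fleet))
```

The design point is that failure injection runs continuously in production, so any code that assumes a particular machine will still exist tomorrow gets caught early rather than during a real outage.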
And as we built more automation around it, we decided that it was easier to automate cloud-based things than to try to get automation into our data center. So that was part of the transition to cloud: the ability to automate everything, the ability to click a button and have an entire benchmark run like I just did. So that's really what's changed over the last few years. The automation and the maturity of things has really moved on. So now we're at the level where we have 10,000 machines instead of 100 to run the website. A lot of folks who are doing things with HBase and Cassandra, their first reaction is, I want to bring the bare metal on premise. Easy to handle, might be cheaper. Is that a true statement or not? What's your advice? Say I want to start taking in all the Twitter data, I want to start playing with blog data. Should I use the cloud? Should I bring it on site? Generally, if you have a constant workload that isn't changing, that you're going to run for years, you can do it cheaper in-house. If you've got any dynamics in your workload, like you're growing, or you're just very agile and things are changing a lot, then the way it looks in six months doesn't look like what it looks like now. So you've bought the wrong thing. One of the reasons we went to the cloud was we couldn't plan what capacity to build where with any reliability, and we were worried we'd run out of capacity. So the best example: we launched in Europe at the beginning of this year, and we just fired up some machines in Amazon Europe. We have no employees out there, no data center manager, there's nobody out there. We didn't have to buy any hardware, we didn't have to pay local taxes or whatever it would be. We just went, okay, I need a thousand machines in Europe. Okay, there's a thousand machines in Europe. It took a few days to fire them up, right? And then we had just the right number of machines.
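Adrian's constant-versus-dynamic rule of thumb can be sketched as a toy break-even model: owned hardware must be sized for peak, while cloud capacity is paid by the hour for what actually runs. All the prices and numbers below are invented placeholders, not Netflix or Amazon figures.

```python
def cheaper_option(peak, avg, owned_cost_per_node,
                   cloud_cost_per_node_hr, hours=24 * 365):
    """Compare buying owned hardware sized for peak demand against
    renting cloud nodes by the hour for the average demand."""
    owned = peak * owned_cost_per_node            # must pre-buy for peak
    cloud = avg * cloud_cost_per_node_hr * hours  # pay only for usage
    return ("in-house", owned) if owned < cloud else ("cloud", cloud)

# Steady workload: average equals peak, so owned capacity wins
print(cheaper_option(peak=100, avg=100, owned_cost_per_node=3000,
                     cloud_cost_per_node_hr=0.50))
# ('in-house', 300000)

# Spiky or fast-growing workload: average far below peak, cloud wins
print(cheaper_option(peak=100, avg=20, owned_cost_per_node=3000,
                     cloud_cost_per_node_hr=0.50))
# ('cloud', 87600.0)
```

With placeholder numbers the crossover is visible: the flatter and more predictable the load, the better pre-bought hardware looks; the more the six-month picture differs from today's, the more the hourly model wins, which is the Europe-launch argument in miniature.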
If you're going to go into a new continent, you don't just buy a data center that could fit a thousand machines, because in a year you're going to maybe need 2,000. You had regulations and all kinds of stuff. No, but you have to oversize on day one. So you end up paying an enormous amount of money for this big facility, and then you change your mind and decide you're not going into Europe just before you launch, right? Or something else, right? So we were able to size it to be just big enough at launch and to grow it as we need to, and we can play around with the configuration. So if you've got a ramp from a low level in your business plan, like launching in a new market or a startup, the only way you can affordably do that... I mean, it is so much cheaper and more effective to do that in the cloud, because otherwise you have to pre-buy at such a high level. But for the incremental capacity to run one constant thing? Sure, yeah, you can do that in-house. And that's a great example of why cloud's so awesome. You basically, if you've got a foundation, you scale up when you need it, and you kind of buy as you grow versus overbuilding. Fantastic DevOps conversation. Just a couple more questions. That wasn't my last question, because you made me think up two more. You know, everyone talks about like, oh, I'm a Python guy, I'm a Rails guy, I'm a Spring guy, I use VMware. So, you know, as you move up the stack, that's where the differentiation's happening, and Amazon's working to build more features in, and you see all the cloud guys doing platform as a service, obviously realizing that if they don't differentiate with software, it's a race to zero. Obviously SSD is a specific feature they've got to have, but they'll all have SSDs; if they don't, they'll be out of business, no doubt. But, you know, there's more features, more tooling.
So as you move up the stack, is there a programming language that you see that you like, or what's your philosophy on all these different frameworks? Is there one that's better, more DevOps-y than others? Chef, Puppet, there's all these choices out there. I mean, how do you sort all that out if someone asks you at a cocktail party? That's a pretty long final question, but we're a Java shop, and basically anything that will run on a JVM is fine. So if you've got one of the language variants, like we use a lot of Groovy, Groovy/Grails code for our user interfaces, it runs on the JVM and it can call out. So we have a platform infrastructure that's based on Java. It's hard to run a multi-language platform, because the platform itself is evolving very fast, and it's hard to keep all the languages in sync. So we also support Python, but it's a struggle to keep that in sync. So that's basically our two supported toolkits right now: Python and anything that runs on a JVM. On the DevOps side, we're a developer organization. We don't have a separate ops organization for the cloud. We have some ops guys working in the developer organization. So we do it internally. You're pure DevOps. We're sort of, yeah, we're a little bit, we're actually more of a continuous integration kind of thing. So we're fairly advanced on the sort of continuum of DevOps, and there are a number of things we're doing that are pretty bleeding edge. All right, we're running out of time. We've got the head honcho from DataStax, Billy Bosworth, coming on board next, but I want to just give you an opportunity to share with the folks out there. You know, you're hiring, obviously always looking for new people. What's the personality of Netflix? Every company has a personality. Intel's is Moore's Law, you know, doubling every couple of years. Some ship early and often. I mean, every company's got that one thing that they just kick ass on. What is Netflix?
So the personality: we have a freedom and responsibility culture, basically a very adult culture. So we don't have any interns. There are very few contractors and external groups. So it's a small number of very senior engineers working very efficiently, in a relatively low stress environment. So it's not crazy. There aren't lots of kids running around. You don't have to clear up messes. Lots of extremely good people building an amazing amount of things in a very short time. There's literally adult supervision of adults. Well, it's just adults. You don't have to deal with the adult supervision if you just have the adults. Yeah, if someone's going on vacation, they know the responsibility. You remember Brooks' Mythical Man-Month, remember that? The joke at the end there was that the way to get done quicker was to get rid of all the engineers and have the managers write the code. That's basically where we are. Okay, awesome. And you guys are hiring. So you do have to hire some young guys. We're hiring, but we don't hire very many young guys. We look for people with five to ten years of experience. We really don't take people out of college. All right, so if you have five to ten years of experience, that's the minimum hurdle. Don't even call Netflix otherwise. But we are looking for some Cassandra operations help. I mean, our entire fleet of several hundred Cassandra nodes is currently operated by three people doing all the upgrades and automation. And they're definitely looking for some help. All right, final, final, final question: for the folks out there who are trying to clear the mud around Cassandra, what is the biggest thing that you could share that they might not know about Cassandra, or should know about Cassandra? The benefits of Cassandra, the use cases. It's the perfect system to run on SSDs. That's it. It works so much better than any other data store with SSDs. SSDs, we've been covering it.
We love converged infrastructure. siliconangle.com and wikibon.org are the sites. Check out David Floyer's research, it's all free. We're open source content at SiliconANGLE and Wikibon. So we have some of the most cutting edge reports over there. We've been covering Fusion-io, Violin, all the converged guys. We love SSDs, it's changing the game. Adrian, thanks so much for coming on theCUBE and sharing this in-depth conversation. We'll be right back with the CEO of DataStax right after this break. Thanks.