Okay, we're back here live at Hadoop Summit 2012. I'm John Furrier, founder of SiliconANGLE.com. We're on day two of wall-to-wall continuous coverage of Hadoop, and I'm here with Jeff Kelly, my co-host. Our next guest is Mohamed Sabah, who's with Netflix, a company we all know, known for running in the cloud on Amazon and for a groundbreaking business model. And in this era of Hadoop and big data, business models are always the big thing, and technology drives it. So, Mohamed, welcome to theCUBE. Netflix is a pretty aggressive company. Everything's cutting edge. I'm on the Bay Area geek list, so I hear from a lot of Netflix employees who run the infrastructure. Even though they don't reveal any secrets, you get a feel for a pretty dynamic organization. So, share with us some of the things you're doing. Obviously, everyone knows that with Hadoop you can do all kinds of recommendation engines, et cetera. But you're a data scientist there. So, how about your role and what you guys are doing? Sure. So, I joined Netflix a year ago, and I work primarily on, well, I do a bunch of things in data science at Netflix, which involves finding out things like: what do the customers like? What do you want to show to the customers? What do you recommend to the customers? When you're searching for something, in what order should the results show up? Things like that. You even want to predict what the demand for a movie or title is going to be. So, to cut it short, I think many of the people here are familiar with the Netflix Prize. We have moved a long way since the Netflix Prize ended in 2009. Since then, we made this move to the cloud two years ago, and it was a conscious decision on our side because of the scale of the data that we're dealing with now.
So, remember, in the streaming world everything the user is doing gets logged. You pause the movie, it's logged. You start a movie, it's logged. You fast forward, you rewind, and all that. So we have an event stream coming in every day from the users, and the users number north of 25 million and growing worldwide. We have to tackle that scale, and many of the data sources we're dealing with are unstructured or semi-structured, right? We have data from Wikipedia. We have data from Rentrak and so on. Facebook and Twitter are other examples we use. So it's a big challenge, but it's also a very exciting time, and that's where Hadoop comes in. We use Hadoop to do a lot of large-scale machine learning and lots of predictive mining algorithms, things that would have taken days or months to finish. On Hadoop, it actually takes hours. So there's a lot of efficiency to be gained by using Hadoop. So, yeah, I mean, I'm a Netflix user, I use it on my iPad. So, talk about how you guys come to understand what the user wants almost before the user does. How important is it, when I log on and it recommends some things for me, versus people just natively searching around and finding what they want? How important is it to your business to get that optimized as much as possible? So, one interesting tidbit that I want to throw out here: in Q4 of 2011, we streamed two billion hours, right? On Netflix? How many hours? Two billion. And that is public knowledge. And 75% of those hours came from recommendations. So you can imagine how important recommendations are; that answers part of the question. The other part is, how do we decide what recommendations to show? We have limited real estate on the page, right?
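The playback event stream described here can be sketched as simple records aggregated per user. A minimal sketch in Python, assuming a made-up log format — the event types and fields are illustrative, not Netflix's actual schema:

```python
from collections import defaultdict

# Hypothetical playback events: (user_id, event_type, timestamp_seconds).
# The event types and field layout are illustrative assumptions.
events = [
    ("u1", "play",  0),
    ("u1", "pause", 600),     # watched 10 minutes, then paused
    ("u1", "play",  600),
    ("u1", "stop",  4200),    # watched another hour
    ("u2", "play",  0),
    ("u2", "stop",  1800),    # watched 30 minutes
]

def hours_per_user(events):
    """Aggregate total viewing hours per user from a play/pause/stop stream."""
    started = {}                      # user -> timestamp of last "play"
    seconds = defaultdict(float)      # user -> accumulated viewing seconds
    for user, kind, ts in events:
        if kind == "play":
            started[user] = ts
        elif kind in ("pause", "stop") and user in started:
            seconds[user] += ts - started.pop(user)
    return {u: s / 3600.0 for u, s in seconds.items()}

print(hours_per_user(events))  # u1: ~1.17 hours, u2: 0.5 hours
```

In production this aggregation would run as a distributed job over the full event stream; the point here is only the shape of the data.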
So, we have two problems, right? We have to choose what rows to show. By a row, I mean, when you log into Netflix, you see a row of movies, and let's say it's called Top 10, or it could be based on a genre and movies you have seen. So we want to show you rows that have the highest propensity to click and watch. If you're coming from the ad industry, you would use click-through rate as a metric. Here we have something even richer and deeper: from the length of the content you played and how many titles of that type you played, we can actually learn a lot about your interests. We can build your interest profile. As an example, let's say you started Mad Men, series one, sorry, season one, episode one. You didn't like it, you just stopped after 10 minutes and moved on to a different series. That gives a very good indication for us, right? What that tells you is, this user probably did not like this show. Whereas if you watched a movie or a series to completion, that means you really loved it. As opposed to the DVD world, where we were a few years ago: you ship out the DVD to the person, he returns it after a few days, and you don't know whether he liked it or not. You have to wait for ratings. Now, we still ask users to explicitly rate titles they have seen and explicitly give us taste ratings, let's say the genre types that you like and don't like. But what we have found is that the implicit signal is stronger. And by implicit signals, I mean things like: what time of day is he watching? How much of the content does he see a movie through? What are the types of movies he watches? Does he like documentaries versus sports movies versus romantic comedies and so on? So, you get the interest graph. Yes.
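The implicit signal from play length — stopping Mad Men after 10 minutes versus finishing a title — can be illustrated with a toy scoring function. This is a minimal sketch with made-up thresholds, not Netflix's actual model:

```python
from collections import defaultdict

def implicit_score(watched_minutes, runtime_minutes):
    """Map completion ratio to a rough implicit preference signal.
    The thresholds here are illustrative assumptions."""
    ratio = watched_minutes / runtime_minutes
    if ratio < 0.2:
        return -1.0          # abandoned early: likely a negative signal
    if ratio > 0.9:
        return 1.0           # watched to completion: strong positive signal
    return 2 * ratio - 1     # interpolate in between

def interest_profile(plays):
    """plays: list of (genre, watched_minutes, runtime_minutes).
    Returns the average implicit score per genre -- a crude interest profile."""
    totals, counts = defaultdict(float), defaultdict(int)
    for genre, watched, runtime in plays:
        totals[genre] += implicit_score(watched, runtime)
        counts[genre] += 1
    return {g: totals[g] / counts[g] for g in totals}

plays = [
    ("drama", 10, 60),        # e.g. stopped an episode after 10 minutes
    ("documentary", 88, 90),  # watched nearly to completion
    ("documentary", 55, 60),
]
print(interest_profile(plays))  # drama scores negative, documentary positive
```

The real system described in the interview folds many more implicit features (time of day, title types, and so on) into the profile.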
You get interest graph data, and you can look at all kinds of different data. Absolutely. So, unpack a little bit of what you're instrumenting, because this is really the value, right? I mean, you can measure everything. So talk about some of the things that you measure that people might not know about. Sure. So, at the end of the day, every algorithm that we roll out goes through a bunch of phases. Before we roll it out, it has to pass some offline metrics. Then it goes into A/B testing, or bucket testing. And there we measure principally two metrics. One is retention: we want retention to go up, not down. Second is engagement. Engagement is measured by a number of metrics. For example, you may look at the number of hours the viewers are spending. You can look at hours per sub, hours per subscriber. You can look at the distribution by discovery source: is it coming from recommendations? Is it coming from search? Is it coming from a different source? So basically, at the end of the day, you want to drive those two numbers up. And the third number that you want to measure is acquisition. That's where we are leveraging the social aspect, Facebook and so on. You want to acquire users at the same time as you retain your existing users and drive engagement up. So these are the three principal areas where we have metrics in A/B testing. So, talk about, you know, we're here at Hadoop Summit, everybody's talking about big data all week. At Netflix, you're using Hadoop. You mentioned everything is pretty much tracked and creates a lot of files, et cetera. So Hadoop allows you to analyze all of that data rather than just sampling it. Talk about, as a data scientist, how important that is to you.
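The three A/B-test metrics described — retention, engagement as hours per subscriber, and the discovery-source distribution — could be computed per test bucket roughly as below. The record fields and bucket data are illustrative assumptions:

```python
from collections import Counter

def bucket_metrics(users):
    """users: list of dicts with 'retained' (bool), 'hours' (float), and
    'source' (dominant discovery source). Field names are illustrative."""
    n = len(users)
    retention = sum(u["retained"] for u in users) / n
    hours_per_sub = sum(u["hours"] for u in users) / n
    sources = Counter(u["source"] for u in users)
    source_share = {s: c / n for s, c in sources.items()}
    return {"retention": retention,
            "hours_per_sub": hours_per_sub,
            "source_share": source_share}

# One hypothetical test bucket; a real test would compare this against a
# control bucket before rolling the algorithm out.
bucket_a = [
    {"retained": True,  "hours": 12.0, "source": "recommendation"},
    {"retained": True,  "hours": 8.0,  "source": "recommendation"},
    {"retained": False, "hours": 1.0,  "source": "search"},
    {"retained": True,  "hours": 5.0,  "source": "recommendation"},
]
print(bucket_metrics(bucket_a))  # retention 0.75, 6.5 hours/sub
```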
That is extremely, extremely important. I cannot overemphasize it. Sampling introduces a bias no matter how you do it. But that is not all. You also have to run a lot of computations and experiments which are computationally intensive and memory intensive, and those are things you can do very easily when you switch to a MapReduce paradigm. I mean, you'd be surprised how many algorithms, once you move them to the cloud, scale much better, both in terms of the number of data points they can handle and the number of features they can handle. I won't go into the details of any of the algorithms; in the talk I mentioned a few of them, like matrix factorization, Markov chains, and collaborative filtering. Now, I had some baseline numbers. For example, for collaborative filtering, if I don't do it on the cloud, if I don't use some MapReduce paradigm, it's going to take me days or weeks to process the same data. And I don't have that much time. I want to get things done quickly, because you also want to figure out what those parameters are: you want to do a lot of cross-validation and so on. Hadoop gives you a great medium for fast experimentation, playing with the parameters and fixing them before you launch something. So that is one area. The other thing is, we actually do a lot of pre-processing and post-processing of the data, even before we get to the real modeling piece. And if you ask any data scientist, modeling actually takes much less of the time than you would think. Most of the time is spent on feature engineering and feature selection and pre- and post-processing the data. And that is where Hadoop adds a lot of value. Okay.
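The collaborative filtering mentioned here can be sketched in miniature as item-item similarity over play data. This is a toy, in-memory version under made-up data, not the distributed implementation the interview refers to:

```python
import math

# Tiny user x title play sets (a title appears if the user watched it);
# the data is illustrative.
plays = {
    "u1": {"House of Cards", "Mad Men"},
    "u2": {"House of Cards", "Mad Men", "Breaking Bad"},
    "u3": {"Breaking Bad", "Planet Earth"},
}

def item_similarity(plays):
    """Cosine similarity between titles based on which users played them."""
    watchers = {}
    for user, titles in plays.items():
        for t in titles:
            watchers.setdefault(t, set()).add(user)
    sims, items = {}, list(watchers)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            overlap = len(watchers[a] & watchers[b])
            denom = math.sqrt(len(watchers[a]) * len(watchers[b]))
            sims[frozenset((a, b))] = overlap / denom
    return sims

def recommend(user, plays, sims):
    """Score unseen titles by summed similarity to the user's watched titles."""
    seen, scores = plays[user], {}
    for pair, s in sims.items():
        a, b = tuple(pair)
        for watched, candidate in ((a, b), (b, a)):
            if watched in seen and candidate not in seen:
                scores[candidate] = scores.get(candidate, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)

sims = item_similarity(plays)
print(recommend("u1", plays, sims))  # "Breaking Bad" ranks first
```

The MapReduce win comes from computing the co-occurrence counts in parallel across users; the similarity math itself is the same.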
And yeah, I mean, talk about, when you're doing sampling, you miss a lot of the outliers. How important are those? They may seem like anomalies, but they might come back to actually mean something pretty important. Do you have any examples you could share with us? That's a great point. If you were to do sampling, what would happen is you would probably just capture the more popular or the more typical stuff. What you would miss is the stuff at the tail. And what we have seen at Netflix is there are niche audiences. For example, there may be a cult movie out there with only a specific audience. If you do sampling, you may not get enough signal. That is avoided by using Hadoop, because you don't worry about the scale of the data. Very interesting. So, talk about the tools you're using on top of Hadoop, and the importance of integrating them. You know, SAS has been around for a while, R, I'm not sure what you guys are using, but kind of taking those tools and bringing them into a big data landscape, and how you guys have gone about doing that. Absolutely. So we still use R, but for small prototypes. If you want something quick and dirty on a small data set or sample, just to see if an approach is workable, then we use R. Otherwise, for most of the stuff, we use Pig and Hive and Java. Hive is mostly a SQL interface, which we use for pulling data and doing a lot of joins. But for most of the heavy lifting, we use Pig and Java. That allows us to write libraries of modeling algorithms. We also use some Python with the Hadoop streaming API as well. But most of the stuff is written in Java and Pig.
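The Hadoop streaming API mentioned here pairs a mapper and a reducer that read lines from standard input. A local simulation of that pattern in Python — the log format is an illustrative assumption, and in real Hadoop streaming the two functions would be separate scripts:

```python
import io
from itertools import groupby

def mapper(lines):
    """Each input line: 'user_id<TAB>title'. Emit (title, 1) pairs."""
    for line in lines:
        user, title = line.rstrip("\n").split("\t")
        yield title, 1

def reducer(pairs):
    """Pairs arrive sorted by key (Hadoop's shuffle); sum counts per title."""
    for title, group in groupby(pairs, key=lambda kv: kv[0]):
        yield title, sum(count for _, count in group)

# Simulate stdin and Hadoop's sort/shuffle between the two phases.
log = io.StringIO("u1\tMad Men\nu2\tMad Men\nu3\tTop Gear\n")
shuffled = sorted(mapper(log))
print(dict(reducer(shuffled)))  # {'Mad Men': 2, 'Top Gear': 1}
```

On a cluster, the same mapper and reducer would run unchanged over terabytes of logs; only the plumbing differs.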
And we also use some libraries. For example, Mahout has some components, and where we find them dependable, we use them. For example, we use K-Means clustering from Mahout. Mohamed, talk about the data science role. We've obviously been covering it. It's a real growth area; people talk about the skills gap. It's a real practitioner role. Right. Talk about your experiences and share with the folks out there: if people think they're interested in being a data scientist, tell them what it's like and what the opportunities are. Sure. It's a very challenging field, and the more you learn, the more you realize there is out there to learn, you know? So it's a very challenging area. You can think of it as applied machine learning, but these are the components that you need. You need to be an expert on big data. You need to be an expert on the infrastructure side as well; you need to understand how Hadoop works. And you also want to be good at programming. You have to be comfortable with a high-level language like Java, or Pig, or Python, because at the end of the day, what you want to do is learn from this data. The data is sitting there. It's good that you can access that data and so on, but at the end of it, you want to have some predictive mining algorithms. So you have to be well-grounded in statistics as well. That's why I said it's a very challenging area. If I were to point out three dimensions that you want to have for data science, I would say one is infrastructure and knowledge of Hadoop and MapReduce. One is the machine learning or statistics background. And the third is programming.
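The K-Means clustering mentioned above (used via Mahout at Netflix, per the interview) can be sketched in a few lines of plain Python; this toy version on made-up 2-D points stands in for Mahout's distributed implementation:

```python
def kmeans(points, k, iters=20):
    """Minimal K-Means on 2-D points: alternate between assigning each point
    to its nearest center and moving each center to its cluster's mean."""
    centers = list(points[:k])             # simple deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                   # assignment step
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):  # update step
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two well-separated blobs of (say) per-user viewing features; data is made up.
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = sorted(kmeans(points, k=2))
print(centers)  # one center near (0.1, 0.1), the other near (5.0, 5.0)
```

Mahout's version distributes the assignment step across mappers and the center updates across reducers, which is what makes it viable at the scale discussed here.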
Because you want to produce algorithms that work, that can scale up, and that can be deployed pretty quickly. And what would you say to the engineers or people in college? The world's getting competitive and it's changing; it's a big opportunity, kind of a new discipline. So whether you're in college or high school, talk to the computer science student and also the non-computer-science person, just the math guy, or an analyst type. What would you say to them? That's an awesome question. So data science is actually at the intersection of statistics and math and computer science. You need both to be an effective data scientist. If you're a computer scientist, I would stress the importance of statistics and linear algebra. And if you're a math guy, I would stress the importance of computer science and large-scale big data infrastructure. You actually need a combination of both; you cannot live with just one. Okay, Mohamed, thanks for coming on theCUBE. Netflix, love the shirt, love to get one for my repertoire. Love the red, it's got Netflix on it. We're a big fan: great brand, great Silicon Valley success story. Again, pushing big data, creating a good user experience, good methodologies, a great success story. Love to follow up and do more work with you. This is theCUBE; we'll be right back with our next guest after this short break. Thank you.