TheCube at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks. We do Hadoop. And headline sponsor WANdisco. We make Hadoop invincible.

Welcome back, everybody. You're watching TheCube. We're live here at Hadoop Summit 2014. I'm Jeff Kelly with Wikibon. We've got a really interesting guest coming up. As everybody knows, Hadoop really kind of was born in some of the web companies, one of the most prominent being Yahoo. We're really honored to have Jay Rossiter, SVP of platforms and personalization products at Yahoo, join us here for this segment. Thanks for coming on.

Oh, my pleasure.

So we were talking a little bit before we got started, and you've kind of been involved in this journey with Hadoop and Yahoo and Hortonworks from the very beginning.

Oh yeah, yeah. No, it's been great. So you know that we essentially gave birth to Hadoop, you know, since roughly 2005, 2006. And in 2011 we joined up with Benchmark Capital and spun out Hortonworks with a lot of the core developers, some of the top developers that actually created Hadoop, when we started the company. And so I've been involved from before Hortonworks and all the way through till now.

Well, if you don't mind, I'd love to kind of just take a trip down memory lane and talk a little bit about those early days, 2005, 2006. What was the environment like?

I came a little later.

Oh, a little later. Well, tell us about when you did join.

Sure, sure, sure. So what happened was the Hadoop project started really with search at Yahoo, where we were trying to harness big data to make search work. And then over time it started really evolving and taking on a more prominent role in the company.
And what we realized was that this was revolutionary technology, and we wanted to keep that technology rolling, and we felt that open source was really the way to do it, so that there would be a community of people working on the technology and making it real, and then that would allow us to keep leveraging it and use it to power our business. So I understood that Yahoo obviously isn't an enterprise software company, so we started an enterprise software company to do that, but the whole idea was to do it through the Apache Foundation so that it was open source. And it's been great. So if you kind of fast forward and you look at this conference, which is incredible. I don't know if you saw today that half the people at the conference hadn't been here before, and there are over a thousand different companies represented. It's incredible. It's been wonderful. And what you have now is exactly what we were hoping for. You have this juggernaut which is Hadoop, the whole family of products, which we also develop on. You have Hortonworks sort of at the center, the nexus of this, working with the community to make it happen. And then we have technology that is not only revolutionizing the world in many ways, but it's something that we use at Yahoo to power our entire business. So it's actually working incredibly well.

Well, so it's interesting. It makes a lot of sense to me why, from Yahoo's perspective, you'd want to keep this open source, because you just enable so much more innovation when you involve the community. When you and your team were discussing spinning out Hortonworks, I imagine you talked about different business models. And like I said, I totally understand why it's great for Yahoo for Hortonworks to remain open source. Why is it equally important for Hortonworks to be open source? And how did that decision come about?

No, no, no, it's a very good question. So you can understand for us, right?
We just wanted as many people powering this as possible, and we want to use it, and we use it at such a scale that having an open source solution made sense for Yahoo. For Hortonworks, the key was that the Hortonworks team knew, understood, and understand today, if you actually talk to them, that this type of disruption and this type of technology, to really have it grow and take off and change the way basically data works throughout the world, is something that's beyond what one company can do. And so the primary model was that while Hortonworks certainly is a commercial entity and wants to profit by it, right? But the idea is that if they had gone alone and done this, it would not create the revolution and wouldn't have the level of innovation that it actually has now, and you could see it. So if you think about what Hadoop is, a lot of people think of it as just a core of, like, HDFS and MapReduce, but there's a whole family of things around it, right? There's the evolution of where the platform's moved to in the next generation, which has been, like, with YARN and the Tez project and stuff like that, but also there are many products that have come out of other companies, like HBase or Storm. So there's a variety of technologies that people have contributed to and have helped really develop and foster that innovation. So it makes sense. If we had created Hortonworks and they had gone on as a private entity, Hadoop today wouldn't be what it was. What it is, right? Instead you would have splintering and different products that people would build, and none of them would be as good as what's happened through the whole community effort.

But is there a trade-off for Hortonworks from a business model perspective in terms of ramping up and getting that growth factor, with the model being basically support, maintenance, and services versus having your private IP that you're selling licenses to and that kind of thing? Is there a longer time frame? Were there any trade-offs that you thought about?
Did you ever consider doing something more like the open core model and keeping some parts open and some proprietary? I mean, what were the debates like internally, or were there any?

So the question you're asking is, is this the best model to monetize, is basically what you're asking, right? And the answer clearly for Hortonworks, and you should talk to Rob Bearden about it, is yes. Because Hortonworks succeeds as the ecosystem grows and as the world uses it. So if you have a pie that's this big and you take half of it, that's interesting. If you have a pie that's that big and you take a third of it, you're doing better, right? So overall, the fact that they've embraced the open source model has allowed it to grow to a point where everybody's talking about big data. I mean, it's become a meme. My mother will talk about it. She won't know what it is, but she'll talk about big data, right? And that happened to a large degree because of the open source play here. So I think it was, to me, the right model.

So let's talk about Yahoo and the way you are contributing still today to Hadoop. So do you have a team dedicated to contributing back to the open source code? I know you work with Hortonworks to test and develop the different projects associated with Hortonworks. So just kind of for our audience who isn't really familiar with Yahoo's role, tell us a little bit about what you're doing.

So as I mentioned, early on, we really incubated the whole thing and really got the industry rolling. The story we haven't told has been our role both with Hadoop in general and with Hortonworks since the spin-out, which was in 2011. And what's happened is we've remained major contributors to Hadoop, and we have many, many committers and PMC members and that kind of thing. But the interesting thing has been the relationship that we have with Hortonworks, right?
And the way the relationship works is that we have almost a virtual team in how we work together on an engineering level. So we pick projects together, we set roadmaps together, and we found that probably about 80% of what we do and what Hortonworks does overlaps in terms of what we're trying to actually accomplish. Hortonworks obviously does many things that are important for an enterprise market that may not be important to Yahoo as an entity that runs it, and then we have some things that may not be, but overall there's probably about 80% overlap. So what happens is we set these roadmaps and we work on these projects, advancing them, and then we harden them, because the way we use Hadoop, at the scale we use it, and the types of use cases that we have crush the technology, in a sense. And that beating of the technology, really driving it to quality, is what makes the product overall excellent and really work. So a great example would be YARN, right? YARN was an idea we had before we even did the spin-out, and we worked together with Hortonworks on it. YARN came out in October of last year in HDP, right? And to get YARN, which is really a complete remake of how Hadoop works and is just changing the game in terms of how you can get at data and different kinds of data, it was a journey. It took a long time to really make it battle ready. And that's something we did at Yahoo, driving it through our 32,000 servers, our 26 million jobs, you can imagine. And that type of partnership, both on the development but also on just making it real, then allows us to give it back to the community, where there's a real product that works and people can rely on. And we do that across many different things.

So it's been put through its paces, and if Yahoo can't break it, then you can be pretty sure, if you're a more traditional enterprise, that it's going to hold up.

No, it's absolutely true.

So I did want to pick on something a little bit more.
So you mentioned that there's kind of that 80% overlap, and there are critics, maybe in the more conservative circles, who would say, well, you know, big data, you have the big web companies, it makes sense for them, but why does it make sense for me as a more traditional company? Maybe it doesn't. But you're saying there is a significant overlap between what you do and what Hortonworks is trying to do. So maybe rebut some of those.

Yeah, yeah, yeah, no, no, no, no, no. So look at this. I think in general most companies now understand that harnessing data, and harnessing unstructured data, or harnessing data quickly, is a business differentiator. That's what makes this whole market happen, right? So I don't think there's any question about the power of data, whatever it is, whether you're in banking or you're in sensors or you're in machines, Internet of Things, anything, right? It doesn't really matter. There's value. If you're an analyst, you want to have analytics, you know, whether you're the NSA, unfortunately. But you realize that being able to manipulate data and analyze and get value out of that data quickly is important for everybody, right? What we do is we're a little ahead of the market in some cases, in that we need to drive many diverse use cases through data, because the heart of an internet company really is data and what you do with it: personalizing, advertising, and that kind of thing, right? But a lot of the core technology that's used is exactly the same, right? And so you can imagine that you take that technology and you're leading in many ways what the technology needs to deliver, and then it gets packaged up and platformized in many ways, or the companies that build, like Hortonworks, see where it's going and they actually create those capabilities. So just some simple examples, right? If you just look at what's happened with the Stinger initiative and Tez and Hive, you know, who would benefit from being able to have fast analytics and be able to answer business questions quickly?
We do, right? We run our business on it, but everybody needs to be able to do that, very clearly. Or if you take the ability to do iterative processing or stream processing, where you're looking at things that are just happening and you're analyzing them, it makes sense in many, many industries, including ours, where we'll take a technology like Storm or we'll take, like, an HBase technology and we'll put it into a bundle to create a solution. So it's the same kinds of things. The only difference is that we do it at such scale, and we're also pushing the envelope on some of the uses.

Right.

Two quick examples.

Sure.

One would be, you know, latency. Very, very focused on taking latency out of stuff. So how fast can they get an answer out of something? Really important when we're trying to personalize an experience for you and we want to take into account things you're doing now and we don't want to wait, right? Or we want to see if there's any abuse happening in our systems and we want to detect it now, right? So there's a lot involved in taking the latency out of these systems, and that's really important. Just, you know, kind of a very simple example. Another one is machine learning, right? Deep insights. And you can read a lot of people talking about how you get insights into data, facial recognition, that kind of thing. And deep insights and machine learning are key to what we do. So as an example, you know, we have Flickr. And with Flickr, we have, you know, a corpus of something like 10 billion photos, and we have processed all those photos, looking through and tagging animals and scenes and stuff like that, where we could tag it all. But now what do we want to do? On an ongoing basis, when somebody uploads a photo, we want to instantly tag it, which requires taking all the latency out and being able to take the models that we developed and perfected as we went through the full corpus and applying them to the next one.
That's true of what everybody needs in many, many businesses. You want to learn from the data you have, and then you want to train a model so that you can do something more intelligent with the new data that comes in. So it's just a matter of being a little bit ahead of the curve, and what happens is, with the partnership with Hortonworks, we're living it together, you know what I mean? So they're seeing where we are ahead of the curve to some degree, we're also hardening things, and then they're working obviously with other partners also, but it lets them also be a little ahead of the curve and then have product that works in a more robust manner overall.

Well, so you touched on a couple of things that you're doing at Yahoo. I mean, I would love to hear a little bit more, maybe a couple more examples of some of the things that kind of get you excited that you're using Hadoop for. Maybe if you can give an example of something that's customer facing, something that I might see as a user, or maybe something behind the scenes, whether it's to detect fraud or whatever the case is.

There's many, many, many examples.

Okay, give me a good example.

Sure. So one of the things we're doing is we have a large investment in native advertising. So native ads, you probably see a lot of them on your Twitter stream or something like that, and we have native ads. If you go to yahoo.com or many of our media properties, you'll see this stream, and in that stream there are native ads, so ads that look like the content, and we do two things.
One is we actually will personalize the content you see, so the news you see is different than mine. We'll also personalize the ads that you see, and the ads you'll see are different than mine. And what we need to do to make that happen is we have to do massive extraction of information from the content and from the ads to actually understand what they are, and then we have to understand a lot about you and your behaviors and things you've chosen and that kind of thing, and feed that into a model so then we can match the content and match the ads that are most likely to interest you, and then you click on the ads and we can monetize the ads. So it's a way of just basically taking content, understanding it, choosing what you should see, and then understanding what ads actually would be of more interest to you, so you actually would be more likely to engage with those ads, which is how we actually monetize those ads.

Right, so that's an analytic problem, number one, but you also have to operationalize that insight.

Yeah and there's lots of examples, give me another example.

So if you just stick on advertising, if you're an advertiser or an agency and you want to run an ad campaign, you want to be able to hit a button, basically put your bid in, set up your campaign, and you want to see it happening now. Which means lots of calculations about how that will play, what the bid landscape looks like, how the other players are, what's happening in the marketplace, what your budget is, instantaneously computed, determining what ads then to serve by understanding you and that kind of thing. And then we want to give you the ability to, on demand, get a report and understand how's it going, how's your campaign going, what demographic you're hitting, how much budget have you gone through, what your average spend is, instantaneously, to allow you then to adjust your campaign if you want to really move it to where it needs to go.
So all that's about crunching data, right? At the start, crunching data to understand the content, crunching data to understand the ads, crunching data to choose what to show in a bidded marketplace, which is like an exchange, like a stock exchange, and then from a reporting point of view. And all that is utilizing Hadoop in different ways and different components to create a solution, end to end.

So it's fascinating, I love it. We could talk all day. Unfortunately, we've got to wrap up soon, but I wanted to ask one last question.

Sure.

I mean, if we're looking out, if you can put your prognosticator's hat on, where's Apache Hadoop going? When you take into account, you know, there's the vendor ecosystem and what's happening there, you've got the community, you've got practitioners like yourselves, where do you see Hadoop going in the next one, two, even maybe five years?

So the way I look at Hadoop, and this is interesting because, you know, people have different ideas of what it is, Hadoop to me is a family of technologies that work together and play together to turn data into value, right? And one of the things that's interesting about Hadoop is the level of innovation that's been happening, the level of change, the speed of change. Like, you know, the MapReduce paradigm is kind of moving a little bit into its sunset since Tez is there. YARN came out, and that fundamentally changed the game, because now you can get at data and you can use it in many different ways, right? And the way I see it is the demand for data will only increase. If you think about the Internet of Things, which I know is a new buzzword, but it's everything, for real, right? And you think about deep learning, which is understanding dozens, hundreds, thousands of characteristics of data, which requires much greater processing. You know that it has to evolve, and you know that some parts will get antiquated and other parts will actually join the Hadoop family, you know?
And add to it, like now we have streaming, or we have incremental iterative processing and that kind of thing. And I expect that evolution will continue, because it's a living, breathing object. And so the key is to keep the community intact so everybody's playing in it, have vendors who are looking out and understand what's happening, and then engage with vendors like us who are dealing with a scale that frankly is above the market but also driving a lot of those requirements into the market. So to me it's more of an evolving beast that will just get better and better and better as we keep working on it.

Absolutely, well, great stuff. Jay, we could talk for hours, but unfortunately we gotta wrap up. Thanks so much for taking the time, I appreciate it. Jay Rossiter from Yahoo. We'll be right back with our next guest, live here at Hadoop Summit.