theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks, "We do Hadoop," and headline sponsor WANdisco, "We make Hadoop invincible."

Hey, welcome back everyone, live in Silicon Valley, in San Jose, for Hadoop Summit. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. This is day two of three days of live coverage. Exciting. The big data space is exploding. The reports from the field here are fantastic. People are having success, there's harmony, there's some collaboration. More importantly, there's the entrance of the big whales, a lot of people are getting serious seed funding and growing like crazy, and the startups are still flowering up. It's super exciting. Our next guest is John Akred, founder and CTO of the Silicon Valley Data Science group, which is a great group of people putting together all the business practices, doing the analysis, and really getting into the trenches. So welcome back to theCUBE, good to see you again.

Thanks so much, John.

So you've got a challenging job. You've got to be the CTO, and you've got to be a handicapper: you handicap the tech, but you're also advising clients. Give us an update on what's going on with data science and your group. We know Edd Dumbill over there as well, he's been on theCUBE many times. You guys are doing some good work. So give us the update.

Sure. So, you know, I was at the session by the Spark guys this morning, and I think it's a signal of a larger trend of people doing more and more advanced analytics with these kinds of architectures. Having founded a company that exists to help customers solve hard problems with data and use advanced analytics techniques on top of these kinds of architectures, of course, that's a really good thing for us. So we're seeing a lot of interest in our services. We are a drop in the bucket of what is needed in the world in terms of people able to do this stuff.
So it's really encouraging to see the interest in the marketplace. And you know, you go around all the sessions, and you can see lots of use cases of folks doing more and more interesting things, going from summary statistics and understanding things at a basic level to much more sophisticated notions of what their customers are doing.

Interesting year we've had so far in the big data space. You go back to Hadoop Summit last year, and you have the bifurcation of the two shows. Hadoop World is run by O'Reilly Media; Cloudera hasn't been running it for a couple of years now. You have Hadoop Summit, which is run by Hortonworks, but they've let that go to the community. They haven't really been bogarting it so much, to use the expression. They're letting the community kind of have its way with it, and the community is flourishing. Meanwhile, Cloudera has certainly had a great exit. I mean, the liquidity event that they've had was fantastic. What Cloudera has done has been a spectacular rise. All the early founders cashed out, investors cashed out, and they have a ton of working capital, essentially a quasi-IPO, and now they're building a really great big business. So, you know, Cloudera is off to the races, kind of in that pack of big whales, and the community's grown like crazy. So I've got to ask you, what's next? Is it the platform, or are the apps coming? I mean, we're seeing YARN and some of the stuff going on on the enabling side, with the data platform at Hortonworks kicking butt.

Yeah, I think YARN is a big story of both this conference and the technology space in general. And, you know, I saw the presentation by LinkedIn and what they're starting to do on top of it. So its promise as a layer upon which people can build a lot of really interesting things that make all of this more flexible and useful in the enterprise is certainly starting to bear fruit.
You know, that said, it's early days, and so you see Spark out there getting a lot of attention. You see Impala still getting a lot of attention. That was the big announcement from last year's Hadoop World, if I remember. And you see all kinds of other people trying to get interactive access into these data stores, taking them from the batch query processing that they started with and moving out into much more interactive and real-time workloads, which is sort of one trend. And I think having a diverse community of vendors who have large war chests driving out a couple of different solutions for these kinds of problems is a great thing for the rest of us in the community, who don't so much care about the distinction between the shows and things like that. We just want great technologies to work with and to solve hard problems. So I think you're seeing a lot of great innovation coming out of these war chests, and hopefully it keeps happening.

And also, you mentioned the sessions are packed. So the commentary there is that obviously it's not a suit-driven culture. There are still a lot of developers, and certainly the big announcements around security and innovations in the platform across the board. But there's the presence of Cisco, Oracle, IBM. I mean, these are the big whales. And Abhi Mehta from Tresata said, "I'll take a pack of piranhas over whales any day," kind of indicating he's a startup guy. So he's got a startup, he's controversial, but it's an interesting perspective, right? I mean, the whales are in here shopping, figuring out their moves. What's your take on that?

So, you know, I think we've seen the start of the consolidation, right? You see these security startups getting gobbled up by platform vendors. And I think the America's Cup is going to be an interesting race in a few years. Oracle will still be around, and Larry will probably still be able to fund his pursuit of that sailboat race.
But, you know, on a larger level, as these startups are maturing and getting more traction, the big enterprise vendors with deep customer relationships would love to bring them in. That said, when you've got funding events like that, those become pretty big companies to swallow, and they become whales of their own, right? So it's interesting to see how that evolves. On the one hand, Cloudera is going out there. On the other hand, you know, we'll see what Cloudera...

Well, Cloudera has made their move, right? So their move's made, and they're growing like a big company. They have sales presence, they're growing, they're doing webinars ten ways from Sunday. So, a ton of field sales force. That being said, can they sustain that with Intel, while IBM, HP, and Oracle have pretty sizable field sales forces? So they need to really figure that out. I think they're vulnerable, personally. I think Cloudera is very vulnerable on that front and needs to really figure that out to protect themselves.

You know, as you said, we have to handicap technologies and make bets. As a company, I have to decide both what I'm going to train my people up on and what I'm going to build services around. And one of the things we look at a lot is what's going on over on GitHub. What are people using? How are those projects progressing? What is the diversity of committers? And so an enterprise sales force is definitely going to be an important thing in the market. But I think you hit it on the head: you still have to get traction with developers. And when I'm trying to understand and make recommendations to my customers, I take a very business-driven approach, which is that you've got to solve a business problem. You know, we're technologists. We love shiny objects. It's natural, but when you point those objects at a real business problem, that's when you have success.
So, you know, you have to engage both the C-suite and the developer community. And those that are successful, I think, will be the folks that are able to do both. But we keep, you know, one eye on that developer community at all times, because it's very, very important.

Yeah, and I was asked last night about the Cloudera strategy, and I said, well, obviously they don't brief me on their strategy, and I'm not working in their office anymore, but I look at what they're doing now. To me, they're on a two-front vector, right? They are investing heavily in field marketing, services, and sales presence with seasoned sales reps, heavily funded. At the same time, the relationship with Intel is very telling. You have the management selling their stock, and the VCs and early investors cashing out. Okay, those are interesting signs of liquidity. That's just data. But Intel could be their biggest developer. Intel as a developer is bigger than all developers combined. So with system on a chip and the software-defined data center, you've got to ask yourself: that's a nice hedge. If I'm an investor, I like those two fronts. If they win on the field sales force, Intel can move from an OEM to a direct potential. If that fails, they have all the software for all the embedded systems. So if you kind of think about it, what do you think? Am I crazy? What's your take on that? Is that viable thinking, or am I out to lunch?

I don't think you're crazy. I was at a conference last week in St. Louis called StampedeCon, and someone was presenting on some fantastic big, big, big metal hardware. And I was sort of thinking to myself in the audience, how long is it going to be that any of us care about hardware? And in a sense, as developers, we'd prefer not to.
But what we do see is that, with Intel making this investment, now you're starting to have the opportunity for real, scaled hardware development to back some of these architectures. As somebody who builds the data systems right on top, I only want to care about that so much. But I think, as that hardware market is changing, there are a lot of feedback loops between what's going on at that level and how these architectures operate. And perhaps that's a great advantage for Intel to have.

So I've got to ask you the technology question. What are the things that you're getting excited about right now? And again, you have to make these bets on building a practice out. So you don't want to bet on a short fad or anything that's not going to have longevity, right? So what horses are you saddling up and riding?

So I think I've said it a couple of times, so at the risk of sounding like a broken record, Spark is definitely one of those technologies, right? The release of 1.0 just the other day, the energy in the room at their presentations, and the vibrancy of that community are huge. But I think one of the reasons for that, and this translates to a couple of other products as well, is that they are enabling new classes of use case to sit on that same data architecture. So in a sense, they're making that data hub or data lake strategy a bit more viable, because now you can service a broader range of workloads. And I don't think we'll see a clear winner between what Hortonworks offers, Impala, and Spark, but those sorts of interactive query engines paired with machine learning libraries like MLlib are something that we're working with quite a bit. We also work with MADlib, which Joe Hellerstein originally worked on on top of Greenplum, but it's been ported to Impala as well. And so I've got folks training models in both because, as a good investor, I play a portfolio of technologies.
But that portfolio is narrowing, and it's narrowing around what, you know, Python are the big things we're using in data science, and often that has us working with those kinds of architectures, like Impala and Spark, and doing that kind of work. So I think that's a really big part of it. Another area where I've seen some really, really interesting work going on is around metadata and discovery, right? If Hadoop is a lake or a reservoir, whatever we're going to call it, you can put a lot of stuff in there, and the more stuff you put in it, the harder it is to find, and the more important it is to have good metadata to be able to locate and work with that data. So we see a lot of interesting startups doing things in the data discovery space and in attaching metadata to that data, so that a tool like Platfora, for instance, which is a very powerful capability once you've got your data organized in the cluster, becomes a lot more effective when you perhaps also have Trifacta in your house and have a lot easier job.

Well, is it Trifacta? We had Joe Hellerstein on, and I've got to ask this question because I realize I should have asked him when I had him, but I didn't think of it, and it came up today. The whole idea of big data is to actually get good data, not to have dirty data. So if Trifacta is all about setting up the data, does that change the game, or is that just a compatibility mode to pander to the weak, to the old school, to people who have bad data? I mean, that's nirvana. Getting good data is like, hey, I want only good data. It's ultimately almost the holy grail that's never attained.

Sure. So, you know, I have perhaps a bit of a contrarian view on data quality, in that I think there's a corollary to one of the great powers of Hadoop and the like, which is schema on read, right? And what Trifacta enables is structure and quality on read.
And so one of the examples I like to use is that perhaps you are an organization that uses sensors in your product, right? Maybe you have temperature sensors in your product. You have a car, and you have a thing that tells you how hot the oil is. On the one hand, it's very important to understand how hot the oil really is. So if you're doing something analytically with that data, you want to have a real temperature when you feed that model. On the other hand, when you want to decide which sensor vendor you like the best, you actually want to see that raw, incorrect data, so that you can say, hey, look, I've been watching this, and I'm starting to have to correct for the fact that these sensors go bad more quickly. And if you solved your sensor data quality problem when you wrote the data, you wouldn't have the data available to do that kind of analysis. So I think data quality on read is a very important corollary to schema on read, because data quality is really in the eye of the purpose, or the beholder of the purpose.

I agree. You just mangled that analogy terribly, but yeah, no, I totally see what you're saying. I mean, that's the extraction point. You can have a lot of bad data, but then you have to have this... I mean, that's what SQL has been. If you look at it, there's never been a truly unstructured environment. There's always been some structure. I mean, we use a lot of social data with the Twitter firehose, and they have some data, like the guy who tweeted this tweet, and they call that unstructured, but I guess it's structured. I don't have to build a database schema to deal with the data, but I do when I want to roll it up. So I've got to ask you the question. Tony Baer, he's an analyst, a great analyst on strategies, wrote a post this morning called "Is SQL the gateway drug for Hadoop?" To the enterprise. So it's an interesting point, because SQL is known.
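
As a rough illustration of the "data quality on read" idea from the oil-temperature example above, the sketch below keeps the raw readings untouched and applies a different quality rule depending on who is reading. Everything in it is assumed for illustration: the sample readings, the vendor names, the plausible temperature range, and the function names do not come from the interview.

```python
from collections import defaultdict

# Hypothetical raw readings, stored exactly as the sensors reported them.
RAW_READINGS = [
    {"sensor_id": "a1", "vendor": "VendorA", "oil_temp_c": 92.5},
    {"sensor_id": "a1", "vendor": "VendorA", "oil_temp_c": 94.0},
    {"sensor_id": "b7", "vendor": "VendorB", "oil_temp_c": -40.0},  # faulty reading
    {"sensor_id": "b7", "vendor": "VendorB", "oil_temp_c": 95.5},
    {"sensor_id": "b9", "vendor": "VendorB", "oil_temp_c": 999.0},  # faulty reading
]

# Assumed plausibility window for engine oil, in degrees Celsius.
PLAUSIBLE_RANGE_C = (0.0, 200.0)


def is_plausible(reading):
    """Return True when the temperature falls inside the assumed window."""
    low, high = PLAUSIBLE_RANGE_C
    return low <= reading["oil_temp_c"] <= high


def read_for_modeling(readings):
    """Reader 1: a model wants real temperatures, so filter at read time."""
    return [r["oil_temp_c"] for r in readings if is_plausible(r)]


def read_for_vendor_review(readings):
    """Reader 2: vendor comparison needs the bad readings, so count them."""
    total = defaultdict(int)
    bad = defaultdict(int)
    for r in readings:
        total[r["vendor"]] += 1
        if not is_plausible(r):
            bad[r["vendor"]] += 1
    return {vendor: bad[vendor] / total[vendor] for vendor in total}


if __name__ == "__main__":
    print(read_for_modeling(RAW_READINGS))
    print(read_for_vendor_review(RAW_READINGS))
```

Had the implausible readings been dropped when the data was written, the second reader would have nothing to measure, which is the point being made about deferring quality decisions to read time.
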
MySQL is very popular.

Yeah. So I think a class of use cases we saw early in Hadoop's rise, that I kind of think of as the gateway drug, is offloading ETL processes and staging tables from expensive data warehousing architectures onto Hadoop, right? And so it's not SQL on the way in, but it's coming out that way, because it's going into a data warehouse. So absolutely, I think use cases like that started a lot of enterprises down the path of Hadoop, and they now look at it more broadly. I spent some time at a large systems integrator, as you know, which has sold many a data warehouse, and the investments in tools, technologies, and business processes around those are phenomenal. So if it was only SQL on these architectures, it's kind of like you've got a Ferrari and you're still just driving to the grocery store once a week. You've got a fundamentally new capability, and you're using it for the old thing. That said, being able to address that broader range of use cases, in which you've already got tremendous investments, I think is really important.

But what if you could have an elastic vehicle that's a Ferrari when you need it on the sparse roads and a bus when you need it? To me, it comes down to the classic agile, elastic environment where you have data and you can shape the data based upon what the use cases are. I think that's what people aren't talking about, and the reason why I like SQL as the enterprise gateway drug is because it's like kindergarten. You're in big boy diapers now. Here you go, you're used to it, but the end game is not even close to that. I mean, isn't the purpose to automate all this? I mean, connectors: Salesforce connectors, Oracle connectors. I've got to line up schemas to do that, right? To your point about schema on read.
I like to tell a story about how the technologies exist so that, for an enterprise that is engaged in some kind of production and monitors that production, the world is not too far away where they could very easily introduce a new sensor into their production line, and that sensor can say to a data bus, "Hi, I'm here," and start sending its data. The repository that makes the most sense for that data goes, "Oh, hey, I store that kind of stuff," and it starts storing that data. Then along comes a condition monitoring routine that discovers, "Hey, there's a new sensor field there that I didn't know was there before. I'm going to train a new model and see if that helps." And lo and behold, maybe it does, and maybe at this point we want to send some kind of note to a human data scientist to make sure we didn't just overfit or do something analytically crazy, but that's about the only intervention you need to actually incorporate a sensor into your larger condition monitoring data architecture. Now, nobody that I know has put that system together yet, but you can sort of see that the seeds of each of the technologies you would need to do that are getting a lot of maturity these days. So yeah, I think there is an automated world in our future where a lot of that stuff doesn't take as much human intervention as it does now.

So we were talking to TrueCar, who was on earlier, and he used the term "open correlation hunting." And data science jockeys know what that means: essentially, create a hunting ground to wander and look for new data points of correlation, which assumes that you have a data-rich environment, or you're full of data, which means you want to have access to data and then have the tools to go wander and make new connections. So with that in mind, what's your take on that phenomenon? I mean, that ultimately is what we want systems to do. You talk about Spark and machine learning libraries.
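
The sensor-onboarding story told earlier in this exchange can be sketched as a toy, in-memory flow. This is only an illustration under loud assumptions: `DataBus`, `Repository`, and `ConditionMonitor` are hypothetical names, no real message bus or storage system is involved, and the retraining step is a placeholder that just queues a note for a human reviewer.

```python
class Repository:
    """Hypothetical store that declares which kinds of data it accepts."""

    def __init__(self, kinds):
        self.kinds = set(kinds)
        self.streams = {}  # sensor name -> list of stored values


class DataBus:
    """Hypothetical in-memory stand-in for a message bus."""

    def __init__(self, repositories):
        self.repositories = list(repositories)

    def announce(self, sensor_name, kind):
        # A new sensor says "hi, I'm here"; the first repository that
        # handles this kind of data claims it and starts a stream.
        for repo in self.repositories:
            if kind in repo.kinds:
                repo.streams[sensor_name] = []
                return repo
        raise LookupError(f"no repository stores {kind!r} data")

    def publish(self, repo, sensor_name, value):
        repo.streams[sensor_name].append(value)


class ConditionMonitor:
    """Notices fields it has not seen before and asks for human review."""

    def __init__(self):
        self.known_fields = set()
        self.review_queue = []

    def scan(self, repo):
        new_fields = sorted(set(repo.streams) - self.known_fields)
        for field in new_fields:
            self.known_fields.add(field)
            # Placeholder for "train a new model and see if that helps".
            self.review_queue.append(
                f"model retrained with {field!r}; please check for overfitting"
            )
        return new_fields


if __name__ == "__main__":
    repo = Repository(kinds={"time_series"})
    bus = DataBus([repo])
    monitor = ConditionMonitor()

    claimed = bus.announce("vibration_7", "time_series")
    bus.publish(claimed, "vibration_7", 0.13)

    print(monitor.scan(repo))
    print(monitor.review_queue)
```

The only human touchpoint in the sketch is the review queue, which mirrors the single intervention described in the story: a data scientist checking that the retrained model did not overfit.
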
You can almost say cognitive computing is coming down the pike. So if you connect those dots, it's very AI-like.

It is.

And so is that going to be encapsulated? When do we get that functionality?

Well, as a data scientist, I'm not worried about my job security anytime soon, I'll say that. But, you know, we work with a lot of customers, talking through a framework where we look at their data value chain. And what you've just described for TrueCar, I think, is an instance of a process you go through where you discover some data, test correlations around it, discover potentially useful insights, and then figure out some way to deploy that insight into a product or service that is valuable for your organization or your customers. And in today's world, there's a ton of friction in it. In my last example, I talked through that completely automated chain of discovering a new sensor, incorporating it into an analytical model, and then incorporating that into the monitoring system that a human user is using to, I don't know, manage a fleet of offshore oil platforms, say. Right now, there's a lot of friction in each step of that process. So companies like TrueCar, and we've done some work with Edmunds.com, are taking that and going, okay, how do I reduce that friction? Because my ability to generate that insight and deploy it into my products and services is how I'm going to win in the marketplace. And so we work with them to say, okay, let's find out where your worst problems are and address them with technology, address them with skills, and work through that. And I think it ain't in the next five years, but in 15 years we could see a lot of this extremely well automated.

Silicon Valley Data Science CTO and co-founder, welcome to theCUBE. Great commentary, great to have you back on again. You guys are doing great work. And again, it's early days, and your job is totally secure.
And I guess theCUBE is secure too, if people still want it and there's still good action going on. We'll still be on the ground getting it. We like you guys a lot, and congratulations on the success. We'll be right back with more from theCUBE here at Hadoop Summit right after this short break. Thanks.