Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Now your hosts, John Furrier and George Gilbert.

Okay, welcome back everyone. You're watching SiliconANGLE's theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE, joined by my co-host George Gilbert. We are live in New York City, 100 yards away from the Javits Center where Strata Hadoop is going on. In conjunction with Big Data NYC, our event, we're here for three days of wall-to-wall coverage, getting all the data and sharing that data, putting a data pipeline of content together for you, covering all that's happening here in New York City. Our next guest is Rishi Yadav, the CEO of InfoObjects, CUBE alum, and now published author of his new famous book. It's sold out on the New York Times bestseller list. Welcome to theCUBE. I'm only kidding about the New York Times bestseller list, but let's chat. So congratulations on publishing the book. Last time we spoke, we had the cover, and it was the cover on another book, but it was a good prop for the camera. Today we don't have the book, because they're all at the booth getting signed. Give us the update on the book title, how to get it, URL, and comment.

Yeah, so the book is Spark Cookbook. It got published five, six weeks back. It's getting amazing traction. Yeah, you can go to Amazon or Barnes & Noble, so any usual place and you'll find it.

So what was the big thing? So let's just revisit our story, because we've been well-documented on theCUBE, the interviews you can go search at youtube.com/siliconangle, look for InfoObjects, look at the videos. Going back three years ago when we chatted, it was not as obvious to everyone in the industry as it is today. I mean, they should rename this event from Strata Hadoop to Spark World, because all the talk is Spark, Spark, Spark.
You know, some ancillary announcements, but you were on the front end of that curve. Tell the story: are you happy you made the bet, and what are some of the outcomes, what you guys are doing today from a product standpoint, and then customer benefits?

Yeah, so we took a big bet on Spark and that's been a blessing for the company. I mean, for a consulting company, choosing one path and then getting all the inbound client leads, that was the biggest proof for us that it's working. Before Spark, we focused on Hadoop in general, and there we had to struggle to get clients, and the moment we focused on Spark, the business has been really, really good.

So I saw Merv Adrian from Gartner had a tweet just earlier today saying, finally everyone at Hadoop is talking about data in motion. That really is kind of categorically discussing real time. This is some of the real-time analytics pieces that you guys are focused on. How far along has the industry come with respect to Spark? Certainly it's on the radar of everybody. We've got some successful companies being developed. What are you guys seeing in terms of Spark vis-à-vis Hadoop and how it all fits with moving the ball down the field, faster innovation?

So as far as Hadoop is concerned, the Hadoop store is HDFS, right? I mean, that's become a de facto standard in the industry. In fact, HDFS, I would say, is also eating the lunch of the HBases and Cassandras of the world. Because what's happening is, with the new file formats like Parquet, what HDFS did not have was the schema, right? And now with a file format like Parquet, it's as good as having a database, right? So you have all the distributed power of the database, right?

For scans, in other words, like MPP-type scan and aggregate. But HBase and Cassandra were more oriented towards point lookups and simple transactions, no? You know, where HDFS was kind of at a different end of the spectrum.

Yeah, so HDFS, the problem with HDFS was that you could put anything and everything there, right? Which was a good thing as far as putting the stuff in was concerned, as far as data ingress was concerned. But when it comes to taking the data out, when you're trying to read the data, at that time you have to figure out what type of format you want to put the data in, and that had its own challenges, right? So schema being separate and data being separate. What has happened with the Parquet and ORC formats, and a couple of new formats coming in, is that now one file will have its data and the file will also have a schema, right? And in Spark, at the same time, they moved from the RDD to the DataFrame, right? So the DataFrame also has data in it and it also has schema in it, right? So now in one simple call, you can read the data and load the data the way everybody's used to loading data, that is, in the case of databases, right? But a database is a very, very small subset. Here, you're talking about petabytes of data, right? So now you can access petabytes of data the way you are used to accessing it in databases, and at the same low latency as you are used to.

So the paradigm that we're seeing, I want to get your thoughts on this. George, I know you've got some more questions, hold on for a second. I want to get the perspective here from the expert. The old model was put everything in Hadoop. You saw that, made some money, did some good business with customers, and then realized probably what we're seeing even today. Good call, by the way, because Gartner's also saying that POCs are up but productions aren't as heavy, so you made a good decision to go with Spark. But you move the data into Hadoop and you move compute to the data. That's where storage has got value, hence the Cloudera announcement.
But now the big trend is pipelining, getting into the flow and using real-time machine learning, having access to that, tapping into data and learning about the data, the machine data. Machine speed is faster than human speed, so there's a new streaming, and we hear all these things. Talk about what is going on. Can you share your thoughts and color around the trend of, not the Hadoop putting compute to the data, but happening in line, in memory, in the data, in the flow?

Yeah, so it would not be right to talk about the streaming part without talking about the IoT piece, right? So the IoT piece. One thing in IoT is what you see, all the sensors and bulbs and all, right? But the biggest IoT piece is the industrial IoT, right? I mean, every industry has got tons of sensors.

Industrial IoT. Machine data.

Machine data, right? Yeah, now with the power of Spark and the streaming, all this machine data you can get in real time and then do online machine learning on that, right? So you're right, John, on one side we were just storing data using Hadoop and doing some analytics using Spark here and there. But the biggest power that has now come is that in real time, you can stream all the data from various locations, thousands of locations, hundreds of thousands of locations, and do online machine learning on that.

What is this doing to the landscape for application developers? Meaning more catalog driven, is it much more SOA? What is the impact on developers of this new trend?

So application development actually has got simpler. Two, three years back, if you were developing something in Hadoop, you had to be a Java developer, right? Then in Spark you had to be a Scala developer, and now all that has become what it is in the database industry, that you just need to know SQL, right? So most of that has become just putting SQL jobs.
So the whole application development landscape in big data, actually that has become easier, simpler and narrower.

I've got to ask you a question. So we were having a CrowdChat earlier, crowdchat.net/stratahadoop, and one of the most voted-up threads was a comment from Renee Yao. She commented, "Gartner mentioned 41% of organizations don't know if big data ROI will be positive or negative. Thoughts?" She put a chart up there. Essentially, for the vendor community, it's all about ROI to the customers. What are your thoughts on that? Are customers scratching their heads with ROI? Are they still in the low-hanging fruit stage? And if so, what is that low-hanging fruit? Talk about this dynamic, where almost half of the organizations don't know if big data ROI is positive or negative. What is that? Do you believe in that, or what are your thoughts on that statement?

So what I see on the street is completely different. What I see is no client has any doubt about the ROI of big data. The challenge they have is, how do we get that ROI using Hadoop and Spark and other technologies? And what they're doing is, on one side, as I talked about, the IoT piece; the other big piece is the enterprise data piece. So Hadoop being unlimited storage, the enterprise data which used to go on tapes now is going into Hadoop. That's number one. But at the same time, a lot of compute also is moving from the enterprise data systems to the Hadoop systems, the Hadoop and Spark-based big data systems.

All right, so I want to ask you a question about Spark. What is your biggest learning since we last talked, in the Spark community, Spark technology, customer adoption, all of the above? What is the big realization that you've come to at InfoObjects and also you as an individual?

I think what we talked about last time was the DataFrame, which, I mean, it's just a technical piece, but what it has done is that it has made a Spark object like any other table, right? And that made it very easy.
So now you can have the same data in the enterprise data store and the same data in HDFS, using Spark on one side and, let's say, SAP HANA on the other side, right?

To be clear, the DataFrame is like the equivalent of a table, so you have a table data type that works across all the Spark APIs and it works across all your familiar tools outside the Spark ecosystem.

Exactly.

So I've got to ask you, you have a blog post on your site that's very impressive: "We oxygenate the ecosystem." Explain what that is and the motivation behind that post, because for the past decade you've been involved in the industry here. What does that mean and what does that mean for the future?

Right, so I mean companies like InfoObjects, we have a unique role to play here. On one side you see Hadoop vendors like Cloudera, MapR, and Hortonworks, right? On the other side you see big consulting companies like Accenture and all, right? I mean, do-it-all kind of companies, right? And we go as a trusted advisor to the client, and we tell them in a vendor-neutral way that for you, I think adopting this big data strategy would get you the most results, right? So we are kind of filling this gap here, in which on one side the client is looking at their ROI, and the Hadoop vendors are looking more from their stack perspective, at what their technology stack has to offer.

Just to clarify, when you go into a customer, do they bring you in to advise on a big data strategy, or do they bring you in to say, look, we've got this application that we really want to start making use of near-real-time data generated from machines, and we want to be able to create a pipeline that has a really fast feedback loop for improving predictions or recommendations? Where in the spectrum do you come in?

So when we started in Spark, most of the clients would come to us for troubleshooting: I already have a cluster and somehow it's not working.
How can you help us optimize it? What I see more and more is, as companies are getting deeper and deeper into Spark, they are involving consulting companies like us early on, right? So now companies are involving us from the advisory phase itself: okay, I have to have a big data cluster here, so help us architect that cluster, right? And that helps us a lot, because then all the learnings which we have from implementing it for other clients we can apply early on. So they don't have to repeat the same mistakes which other clients made.

You bring up something that's interesting. You talked about also in your blog post, throwing bodies at a problem; these big consulting shops, whether it's outsourcing or whatnot, they weren't really working the solution. They weren't really being a trusted partner, even though they have a relationship. And Jim McHugh from Cisco said on theCUBE earlier today that a lot of customers have been vendor hopping. It's kind of like party hopping. You go to one house, then the next one, mainly to test, oh, they got this version of Hadoop, during that early exploration stage. So that party hopping, kind of vendor hopping, we're seeing people out of that phase, at least for the most part. There's still that going on, but for the most part, they're kind of tasting around. What does that mean for the customer who's looking at it? Hey, I've tried a little bit of Hortonworks, I've tried some Cloudera, I got some EMC, I got the Pivotal. They're like, okay, they need to step back. What do you guys say to that customer? What is the main message?

Yeah, so it becomes a challenge, and customers, when they do vendor hopping, they are not happy about it, right? Because they say that this vendor hopping is happening on our time, at our cost, right? If we already knew what the right strategy was for our needs, we would not even have to get involved in that part, right?
And that's where InfoObjects comes in as a trusted advisor. As I said, now more and more clients are involving us early on, right? And at that time, we always look at what the client wants, right? We do not even care about what stack a specific vendor has. Sometimes we can even combine the stacks to meet the client's needs, right? So that's where we come in, and we are completely vendor neutral, I mean, we do not represent or resell any vendor solutions, right? So that also helps clients a lot. In fact what's happening is, say a client may be an expert in their own industry, and now because they are partnering with us, they've also become a big data expert, right? Now, loaded with this expertise, when they attack the big data challenge, their strategy is much more successful.

So what is the big takeaway for you this year at Strata Hadoop? When you look at the ecosystem, the floor, the vibe, what's your take on what's happening?

So it's too early, I mean, we are just setting up booths today. So maybe in a day or two I would have a better perspective about it, but the floor looks much more vibrant right here. I mean, I see a lot more shops here than I saw at Hadoop for a couple of months back. So I look forward to seeing what they have got.

How about the vibe? What's the vibe of the show? I mean, is it sold out, crowded? What's the vibe?

I found it fairly crowded today, so.

Let me ask, I'm curious, we talked about how when you first started coming in to cluster engagements early on in Spark, it was troubleshooting, and now it's more and more advisory. Sort of, how have the problems changed? How have the customer use cases changed and what are they talking about? Not just what are they doing now, but when they talk about their plans two, three years down the road, what do those look like?

So the big data lake, I know this term has been overused, but now actually, in fact, I wrote in one of my blogs that