Live from the San Jose Convention Center, extracting the signal from the noise, it's theCUBE, covering Hadoop Summit 2015. Brought to you by headline sponsor Hortonworks, and by EMC, Pivotal, IBM, Pentaho, Teradata, Syncsort, Attunity, and Cisco. Now your hosts, John Furrier and Jeff Frick.

Welcome back, everybody. Jeff Frick here. You're watching theCUBE. We're at day two of Hadoop Summit 2015. We had wall-to-wall coverage yesterday, we're here today, and we'll be back tomorrow for day three. We go out to the events, extract the signal from the noise, find the smartest people in the room, and have them share their insight with you. For this next segment we're excited to be joined by Rishi Yadav from InfoObjects. Welcome back.

Glad to be back. Good to be back here.

So let's talk a little bit. We were talking off camera: this new technology, Hadoop at giant scale, is great, but people don't usually come to the party with a green field, right? Usually there's some stuff they're dragging along. What are you seeing in the field? How are clients rationalizing what they already have while still trying to incorporate Hadoop and all the promise of Hadoop?

Yeah, so a year or two back, the typical case was that somebody would build a Hadoop cluster, maybe 15-odd or 100-odd nodes, and the only thing they would do was run Hadoop as a silo in that cluster, with nothing to do with what the rest of the enterprise was doing, right? Now, in all the cases I see, clients come to us and say: we understand there's a need to have this one enterprise data hub, or data lake, but we want data to come into it from all our sources. So that's what they want to talk about now. Even in the POCs. Two years back, POCs were the actual POCs; they were mostly testing whether Hadoop works or not.

Right, right, right.
So it was a POC about Hadoop, and then a POC about the business case. But now, in the POC itself, they want to see: okay, we have data coming from these many databases, we have data coming from this OLAP store, and we already have so many reports being generated, on MicroStrategy or on Tableau, and we want to make sure all of that still works. We have a lot of batch data being processed, and all those batch jobs should run just as they are. On top of that, yes, we like Spark, we like real time, we like sub-second latency, and we want that to happen as well. But there should not be any cost to pay for it, right? So that's what is guiding all the legacy work. And that's a good thing, because it means clients are actually getting serious about big data, incorporating big data into their story, or rather making it the central part of the story.

Right, right. As Bill Schmarzo, the Dean of Big Data from EMC, says: Yahoo uses it, Google uses it, it works. Let's get past whether Hadoop works or not. And I think, as Merv Adrian of Gartner said yesterday here on theCUBE, if you aren't getting on this train and doing something, you're just getting left further and further behind, because the train is already down the tracks; you'd better get on now. So that said, we're past the "does it work" kind of POC, and you just described a really complex ecosystem of speeds and feeds. Where are you recommending your clients get started? Which specific workloads and applications have the highest probability of success, to really foster a continued, kind of land-and-expand strategy?

So one thing I hear from a lot of clients is that their first pain point is SQL, and the old-school joins and all. I have 15 joins, I have 20 joins, and they should work just like that.
And the good part is that over the last two to three years, all the SQL-on-Hadoop engines, whether it's Spark SQL or Apache Drill or Cloudera Impala, have been focusing on exactly that.

Right, right.

So I think the big data community realized early on the importance of these workflows, and that you would need a lot of SQL support. I mean, Hive came with SQL support, but that was the bare minimum it could get away with. Now it's not about that; it's about full SQL support. And it's much more than you would think, because in the big data world you have unstructured data, you have semi-structured data, so why do you need that much SQL? Well, SQL is going to stay. The relational databases may go, but the after-the-fact data you used to get, and still get, from a lot of OLAP stores will keep coming, and SQL as a primary interface will stay.

Right, right. And it's not only the technology, it's the people, right? People are used to working in a SQL environment. You've got a huge installed base of people who know how to use those tools, so you really want to make sure you're catering to those folks as well.

I think the people case was much more important a couple of years back. Now I think it's more about the business case: the kind of work they want to do can only be done in SQL. I mean, in Spark, DSLs have come along, and in the latest version they're also moving closer to SQL. So SQL is there to stay.

Okay. And then on the people side, to stay on that track: I remember a couple of years ago, trying to find people to help you manage your Hadoop was difficult. There weren't a lot of folks out there who were trained with the skills you needed. Has that logjam opened up, or is there still a real shortage of folks who can work on this stuff?

The shortage is still a big problem.
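The report-style workload described above, many joins that "should work just like that", is the kind of query SQL-on-Hadoop engines aim to run unchanged. As a minimal sketch, and since a Spark or Impala cluster may not be at hand, here is the same shape of query against an in-memory SQLite database; the table and column names are invented for illustration:

```python
import sqlite3

# Toy schema standing in for the "15 joins" style of reporting query
# that clients want SQL-on-Hadoop engines to keep supporting as-is.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                            customer_id INTEGER, product_id INTEGER, qty INTEGER);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO products  VALUES (10, 'widget', 2.5), (11, 'gadget', 4.0);
    INSERT INTO orders    VALUES (100, 1, 10, 4), (101, 2, 11, 2), (102, 1, 11, 1);
""")

# A report query with two joins and an aggregate, the shape of workload
# engines like Spark SQL, Drill, and Impala all target.
rows = cur.execute("""
    SELECT c.name, SUM(o.qty * p.price) AS total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN products  p ON p.id = o.product_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 14.0), ('Globex', 8.0)]
```

The point of "full SQL support" is that queries like this, lifted from an existing MicroStrategy or Tableau report, run on the big data stack without being rewritten.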
And obviously we have our big training program, so besides serving clients we are also in the business of creating big data folks. But even then we are not creating them fast enough. The need in the market is much, much bigger, and it's much more diverse too; that's another thing. What happens is that every client's needs, because this is big data and data is coming from all the sources, are very, very diverse. And getting big data talent to that level in the shortest amount of time is a challenge. So I think it's chicken and egg: there is the need, and there is the talent, and both are catching up with each other.

Yeah, I was talking to some people here at the show about aggressively training their own people. And the comment was made: well, what if they leave to go take another job? Well, that's the risk you've got to take, right? You need trained people; train them up yourself, or find them out there. Let's shift gears again a little bit. You're on the front lines. There's a lot of talk about workloads, a lot of talk about starting small and having success. I wonder if you can share some specific examples, not necessarily specific customers, of where people were able to find some of that early success beyond the simple "does Hadoop work" POC. What are some of the workloads you're seeing that are really great places to start?

So the standard one I have seen is the old ETL, or what we call ELT workloads these days. Most customers start from there. It's not easy to find a customer who comes and says, I want to run a machine learning program, or I want to do some deep learning on big data. Typically they start with: I already have a workflow, and I know that if I use big data it is going to become 10 times or 20 times faster, right?
So that's the lowest-hanging fruit, right? So they say, okay, let's start with that, so that for all the money we are investing, we can see the value right then and there. And then we will talk about the other workflows.

What's the scale of those early projects in terms of duration? Weeks? Months?

The good part is this: when we got into the big data space, being a consulting company, we were fearing it would be like a Salesforce kind of thing, where the project duration is a few weeks. But big data projects are multi-year projects, because once you start, the work keeps coming: you move more and more old stuff to Hadoop, and you can draw more and more new insights from it. So the projects are long-term, the projects are perpetual, which is music to the ears of a consulting company.

Right, right, but they're long-term because you're iterating on new deliverables; they're discovering new things they want to do, and you're really expanding the base. But what are your sprint durations, from a consulting point of view, not a development point of view, to get to a deliverable even as you continue to grow?

Yes, so I'm not sure a big data project can be run like a Java project, where you have a two-to-three-week sprint with deliverables at the end. It still goes more the way analytics projects go, where most of the deliverables I've seen take two to three months, because if you're migrating an Informatica workflow, those are very large, very complex workflows; it takes some time. But as the practice matures and more and more use cases come in, I think it would become like Java: okay, we already know what to do, there are no uncertainties, no unknowns, so let's have three-week or two-week sprints and get it done.
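The ETL/ELT migration pattern described above, re-expressing an existing batch workflow on the new stack, can be sketched as a tiny extract-transform-load pipeline in plain Python. The input data, field names, and rules are invented for illustration; a real engagement would target Hadoop or Spark jobs rather than in-process functions:

```python
import csv
import io

# Stands in for a raw input file landing in the data lake.
RAW = "user,amount\nalice,10\nbob,-3\nalice,5\n"

def extract(text):
    """Extract: parse the raw source into records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: filter bad rows and aggregate per user."""
    totals = {}
    for r in records:
        amount = int(r["amount"])
        if amount < 0:          # drop invalid negative amounts
            continue
        totals[r["user"]] = totals.get(r["user"], 0) + amount
    return totals

def load(totals, sink):
    """Load: write the aggregate to a target store (here, just a dict)."""
    sink.update(totals)
    return sink

warehouse = {}
load(transform(extract(RAW)), warehouse)
print(warehouse)  # {'alice': 15}
```

The appeal as a first project is exactly what the transcript says: the workflow already exists and is well understood, so the win is measured simply, the same output produced 10 or 20 times faster on the new platform.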
To get something out. So let's shift gears a little bit and talk about Spark. Spark Summit is next week, and Spark seems to be the darling of the ball right now. Everybody's talking about Spark, everybody's excited about Spark. Why are people so excited? How is Spark a potential game changer?

So Spark is a potential game changer, and the biggest thing is obviously the latency, right? SAP figured this out a couple of years back, far ahead of everyone, and came up with SAP HANA, but that is the rich guy's game, as I say. What Spark has done is bring all those benefits to commodity servers, because every commodity machine has some memory in it. Most of that memory was being used only for compute; Spark uses that memory for compute as well as storage, and storage is the main part. And then they have taken it to the next level: the Tachyon project, which got funded I think three months back, creates its own storage layer, an in-memory, off-heap storage layer. What that does is, number one, reduce latency big time, and it also increases availability. So all those good things come with such a simple architecture.

Right. So where do you see the biggest impact from a workload point of view? Where is it going to get adopted quickest, with the fastest ROI, or is it just going to be a general-purpose performance improvement?

I think it is general purpose. A lot of folks ask us whether it is going to affect any vertical play, and I say it's going to remain a horizontal play, because memory itself is a very horizontal play, if you think about it. And that's the reason the folks at Tachyon are thinking: why limit it to Spark? Tachyon is so general purpose; that off-heap memory storage is useful everywhere.
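The "memory for compute as well as storage" idea above is, at its core, compute-once-then-serve-from-memory: a dataset knows how to build itself, materializes on first access, and later accesses skip the recomputation, which is where the latency win comes from. A toy illustration of that caching behavior in plain Python (the class and names are invented; this is not Spark's actual API):

```python
# Toy sketch of the compute-plus-storage idea behind Spark's in-memory
# model: the first access pays the computation cost, later accesses are
# served from memory without recomputing.
class InMemoryDataset:
    def __init__(self, compute_fn):
        self._compute = compute_fn   # how to (re)build the data, akin to lineage
        self._cache = None           # in-memory materialized result

    def collect(self):
        if self._cache is None:          # cache miss: run the computation once
            self._cache = self._compute()
        return self._cache               # cache hit: no recompute, low latency

calls = {"n": 0}
def expensive_job():
    calls["n"] += 1                      # count how often we actually compute
    return [x * x for x in range(5)]

ds = InMemoryDataset(expensive_job)
first = ds.collect()    # triggers the computation
second = ds.collect()   # served from memory
print(first, calls["n"])  # [0, 1, 4, 9, 16] 1
```

A system like Tachyon pushes the cached layer out of the JVM heap and into a shared off-heap store, so the materialized data survives process restarts and can be shared across frameworks, which is the availability point made above.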
So it has universal use; it can be used everywhere. I think Spark, Tachyon, and HDFS, which has already been proven, are all very general purpose.

Yeah. So let's shift gears a little and talk about the show here. We've been coming to Hadoop Summit with theCUBE for four years, five years, since at least 2012, I know for sure. What's the vibe of the show? You've been coming to these things for a long time as well; you've come to all the big data shows that we put on. How has it changed since last year?

This is actually our first time at Hadoop Summit. We have been going to Strata for the last three years, and a lot of other shows. It's a slightly different show, because, as the name itself suggests, it's about Hadoop, while Strata, for example, is mostly about big data in general, so I think that attracts more people. It's been an interesting show.

What are some of the conversations you're having? Has anything surprised you?

Nothing has surprised me much here, actually. I was expecting a greater diversity of conversations to be happening here. I don't want to use the word smaller, but it's a more focused show, about how you can use Hadoop in one way or another; that's where most of the conversation centers.

So we've got to get more conversations for you. Everybody, stop by the InfoObjects booth; it's over here by theCUBE, just a little ways away. All right, I'm going to give you the last word. We're getting the hook, we're the last people in, everybody's headed over to the party. Last word here from Hadoop Summit 2015: what are you looking forward to between now and when we see you a year from now, at 2016?

Well, now the biggest thing is that customers have understood that Hadoop is needed, and when I use the word Hadoop, I mean the whole Hadoop ecosystem, right?
Whether it's Hadoop plus Hive plus Spark and all the other technologies. Every customer now knows that they need to have HDFS storage, right? So now it's more about figuring out the budget, about getting started ASAP, about finishing the POC fast so that they can show the business case to management and get ahead with it. So the next year is going to be very, very interesting, because it's not going to be about whether Hadoop or Spark can work or not, but about how fast we can adopt it. The latency there is going to be how fast a company can move through all its bureaucracy and other things; how fast they can adopt Hadoop, as opposed to whether to adopt Hadoop at all.

And then, as you said, the projects go on and on, because they keep fueling themselves as there's more discovery, more opportunity: wow, I could do this, and maybe I could do that now. Well, exciting times. Rishi, thanks for stopping by theCUBE again; always good to see you. I'm Jeff Frick here at Hadoop Summit 2015, wrapping up day two. We'll be back with our next guest after this short break. Thanks for watching theCUBE.