Big Data SV 2014 is brought to you by headline sponsors WANdisco, "We make Hadoop invincible," and Actian, "Accelerating Big Data 2.0." Okay, welcome back everyone. We are here live at the Hilton in Santa Clara. This is SiliconANGLE and Wikibon's theCUBE. This is Big Data Silicon Valley, or hashtag Big Data SV, a continuation of our geographical coverage of big data. First, we were at Big Data NYC a few months ago around the Strata Conference when it was in New York. We were right across the street there. Here it's the same thing. We're right across the street from the Strata Conference going on right behind us. A lot of news, a lot of developer action. theCUBE is where the action is, where the entrepreneurs come, where the tech athletes come, where we extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE. I'm joined by my co-host, Dave Vellante, co-founder of Wikibon.org. And our next guest is a CUBE alumnus who was on at last year's Strata Conference 2013, Rishi Yadav, CEO of InfoObjects. Thank you for coming on again. Appreciate it. Last year, your video was pretty popular. As we were just talking about before we went on, it had thousands and thousands of views. Why are people so interested in what you have to say? Yeah, thanks John and Dave for having me here again. I think what I give is the honest opinion, which is very tough to find. I mean, even if I take any technology and I go and search on the internet, it's very tough for me to find the real stuff. Some call us the live Quora. Except there's no Q&A, we just pepper you with questions. So, by the way, we love the straight scoop. That's our job. A lot of noise out there. This year, if you walk through Strata, as we did yesterday, and you go to the exhibit hall, there are a lot of companies I've never heard of. More and more companies are popping out of the woodwork, kind of like mushrooms growing in the mushroom patch. Some are new, some won't be around. So a lot of noise.
So the first question I want to ask you is this: we're kind of in the multi-generation of big data now. A couple of years ago, when it really got going, there was only a handful of players and you knew who they were. Now it's busting out, with big money involved and real solutions being talked about. What's the difference between the pretenders and the winners, in your opinion? So let's first start from the technology perspective in Hadoop. What has happened is that HDFS has really established itself. The biggest thing about HDFS in Hadoop is that you can store a huge amount of data, petabytes of data, exabytes of data, in a very cost-effective way. So that is not going away anywhere. I would say for a lifetime, it's not going away anywhere. What was happening was that MapReduce was playing a small but very important role. And from there, now it's evolving. Now it's evolving into all the real-time applications, the graph applications, and the whole plethora of applications which you can now do, because now you have all the data in one place, what the Pivotal guys are calling a Data Lake, which is a nice terminology. So the first step was creating the Data Lake. Some folks call it a Data Landfill, the people who don't like Hadoop that much. Yeah, so depending on where you're coming from, whether it's a Data Lake or a Data Landfill, that is created now. So after that comes what you do with it. And because you have all the data there, you want to run all of your applications on that data. What companies were able to sell a couple of years back was this: the data is in Hadoop, now you move the data out of Hadoop to MicroStrategy and run your analytics there. Down the line, CIOs will say that's not making sense to us, because now we have spent so much money putting all the data in this Data Lake. And now we want to derive all the intelligence, all the insights, from there itself.
And that's where the market is going to evolve. YARN has been a big factor there. YARN has completely democratized the compute part of Hadoop. Now you can run any type of application on Hadoop. That's broadened the use cases, YARN, right? I mean, explain that a little bit. Yeah, it has made it unlimited. So earlier MapReduce, as I said, was a big use case of Hadoop and that made Hadoop very popular, but it was limited. Not every problem is MapReducible, right? Graph processing is one case, but there are tons of applications which you want to run on Hadoop, where your data is. I mean, wherever you can put your data in a cost-effective way, say email applications, right? So now with YARN, what's happening is, number one, you can run any type of application in Hadoop besides the usual MapReduce, which has been ported very well, thanks to all the contributors and committers. I know most of them come from Cloudera and Hortonworks, and they have done an awesome job there, right? So yeah, that was the main thing. Now any type of application, you can move to YARN. Right, okay, so now we've gone from MapReduce, we've broadened the applications. Where do you see this all going? I mean, everybody's saying that there are more suits this year at Strata than there are hoodies. We've crossed that point. Where do you see this going now? Are you seeing real business applications? Are you seeing real dollars spent? What are your customers asking you to do and where do they want you to take them? So as I said last year, the customers are sitting on the fence. They're still sitting on the fence, but now fewer customers are sitting on the fence than were sitting on the fence last year.
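For readers less familiar with the MapReduce model discussed above, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. It shows the shape a problem needs in order to be "MapReducible": an independent per-record map step, a grouping by key, and a per-key reduce step.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data", "Big lake"])` counts each word across both lines. Iterative workloads such as graph traversal fit this shape poorly, because each pass would be a separate full job, which is one way to read the remark that not every problem is MapReducible.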
And YARN only came in October, so it's just been a couple of months with it, but you already see a lot of applications, Storm and Spark and Giraph, and tons of applications are already moving to YARN, right? So all these applications were popular independently, right? Twitter Storm, for example, was popular independently, and now it's been ported to Hadoop. Earlier they were also ported to Hadoop, but as MapReduce applications, which had a huge latency issue. MapReduce obviously being a batch processing system, you cannot make it run much faster beyond a certain limit, right? But now you can work on it natively. In fact, you see all these new real-time applications like Apache Drill and all. What they are doing is, they're actually going to the machine, the slave machine which has the data node, and they're actually drilling a hole there, as the name Drill goes, right? And they're actually pulling the data from there, and then they are running their query engine there. Okay, so talk a little bit about what you guys are doing at InfoObjects. Go to your website, it's a great resource, first of all. You have a lot of tutorials, sort of what this is all about, so what is InfoObjects all about? Yeah, so we are a consulting company, and we are proud to say that we don't hold any IP, right? Every company you see in the big data market, they have some IP. So they say open source, but they are either open API like MapR, or they are open core. In our case, we don't hold any IP, right? We are in the business of building our clients' IP. So our business is that we take the open source Hadoop and its ecosystem partners, and we use them to deliver value to the clients, to build custom applications for them.
So that's our business, and we think that open source Hadoop itself is good enough to solve all of your big data problems, as opposed to finding proprietary solutions here and there and mixing and matching them. Because you see the NoSQL space, there are maybe 100, 200, I don't even remember how many NoSQL databases there are now. Now, if you buy one of them, if you commit to one of them, how do you know whether it will still exist after two years? There are tons of those companies, every day a new NoSQL company pops up. So what are some of the more interesting applications that you're helping clients with, that you've worked on, some of your favorites? Yeah, so one part is that they are my favorite, and the second thing is that this is where the market is evolving. Analytics is one thing, but the next generation of analytics is going to be visualization, right? Visualization has existed since the dawn of civilization, but with big data, I think visualization will evolve into something really big. At present, there are a lot of applications which do a lot of visualization for traditional BI, and now they are going to move to work on the raw data in Hadoop. Datameer has done a good job there, but I'm hoping, and I'm pretty sure, that within a year's time you will see some really good open source applications coming for that. We are doing a lot of interesting work with our clients on D3.js. So QlikView and Datameer and all these companies, they are using D3.js, which is a JavaScript framework for visualization, and what we do is help our clients develop custom visualization apps using D3.js. I just want to ask you about, it makes me think of Node.js, which has been very popular in the DevOps world. You've got Node, you've got these real-time trends going on, real-time data pipelining. Can you talk about that trend in particular? Not necessarily Node.js, but real-time data.
You're seeing Spark becoming really popular, Storm, these technologies. How does that all fit into the MapReduce and YARN layers? Are they separate or are they integrated? How does someone understand that trend? That's a good question. So as I said, one part is going to remain there, and that is that you have this data which is stored in a distributed way in a very cost-effective way. That is not going away anywhere. And that's good, by the way. And that's an awesome thing. I mean, if you see the cost of storage in Hadoop, it's many times less than the traditional storage systems. So now comes the question, how do you want to use it? Number one was MapReduce, which as I said was the traditional way of doing it. It has very high latency, so it has its own challenges. And then the companies came and they wanted to work on the actual storage layer. Then Spark came and they said, you know what, let's do it in memory. Because storage also has its own latency when you're doing the disk I/O. So with Spark, they are doing it in memory. Then they developed Shark. They said, you know, Hive works with MapReduce, so let's replace MapReduce with Spark. Everything else will remain the same, right? You are still going to access Hive the same way as you were accessing it. It's just that behind the scenes, MapReduce has been replaced by Spark. So in-memory is the trend which is happening. And the interesting part is, if you see a typical Hadoop slave node, it has anywhere between 32 and 48 GB of RAM, and that's a lot of RAM distributed across the cluster. You know, Dave, it just hit me. Rishi's the professor. We have the Dean of Big Data with Bill Schmarzo from EMC, a good friend of ours. We call him the Dean of Big Data. But you really have a good handle on some of the technical things. I really appreciate it. We'll call you the professor. The professor of Big Data. I only have a counter of those. I'm just an upgelber.
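The point made earlier in this exchange, that the query interface can stay the same while the execution engine behind it is swapped, can be illustrated with a small sketch. This is plain Python, not Hive, Shark, or Spark code; `disk_engine`, `memory_engine`, and `run_query` are hypothetical names, and the temp-file round trip is only a stand-in, by assumption, for the disk I/O a batch engine does between stages.

```python
import json
import os
import tempfile


def disk_engine(rows, stages):
    """Run each stage, writing the intermediate result to disk and reading it back."""
    for stage in stages:
        rows = stage(rows)
        with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
            json.dump(rows, handle)
            path = handle.name
        with open(path) as handle:
            rows = json.load(handle)
        os.remove(path)
    return rows


def memory_engine(rows, stages):
    """Run each stage, keeping the intermediate result in memory."""
    for stage in stages:
        rows = stage(rows)
    return rows


def run_query(rows, engine):
    """Hypothetical front end: the query is identical whichever engine executes it."""
    stages = [
        lambda rs: [r for r in rs if r["clicks"] > 0],                   # filter
        lambda rs: sorted(rs, key=lambda r: r["clicks"], reverse=True),  # order
    ]
    return engine(rows, stages)
```

Both `run_query(rows, disk_engine)` and `run_query(rows, memory_engine)` return the same rows for the same input; only the cost of the intermediate steps differs, which is the trade-off described in the conversation.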
No, no, you're good, you're good. So let's go to the next level. Let's talk now about application developers. Because now you've got all the scale, you've got all the storage, you now have some real-time integration. Where is the development market going? Because everyone right now is kind of saying, oh, I want some Hadoop developers. What does that really mean? I'm a Hadoop developer. I'm a big data developer. We're trying to put a frame around that. What does it mean to be a big data developer? And what are the things that people need to know about what that means? Or is it being defined now? Yeah, so we have an interesting perspective as a consulting company, because we see the actual consulting needs in the market. I don't see a lot of need for MapReduce programmers, to be honest. I love MapReduce, but I don't see a lot of customers asking for MapReduce programmers as such. What they ask for is somebody good in Java, because Hadoop is still based on Java, and then they ask for skills like Hive or Pig. Pig, I see a lot of traction for Pig. And down the line, what's going to happen is UI skills. I already see a pattern, but down the line it is going to become more and more apparent that the UI skills, D3.js and other UI skills, are going to be more and more popular. Because once you have data, the customers want to visualize that data. They want to get a story out of the data. And that can only happen when you've got very good visualization tools. Is there a programming language that you think is more relevant or has more traction for big data, data science and developers? Obviously you have C, C++ and you have Objective-C. Are there other languages you're seeing emerge as tools of choice for developers? So Java is going to be the de facto. I have done Java all my life, so I'm biased here. Older guys like us did Java. A lot of the young guns might say, hey, Rishi, you're old.
We'd like to, you know, I mean, I'm quite a genius. Yeah, so Python is a cool language here that works very well with Java. But repeating that, I think these new JavaScript frameworks are going to be a really big hit down the line. Because everything else will become pretty much standardized. Where most of the value will come from is how you can view data in different ways. And that's where visualization and infographics, as they call it, come in. So that's where I think the market is going to be. Do you think there's too much overhead in Java, or is that just overhyped? Overhead in terms of the programming language? Well, Java may have some overhead, but it's not as much as it's being criticized for. Right? So what's happening? For example, Cloudera Impala, that's written in C++. But I think the biggest power of Impala is that it's bypassing YARN. It's going directly to the slave node where the data node is running. And that's where it's getting data from. So yeah, C++ may be adding some value to it. And I have all the respect for C++ developers. But I think the biggest thing is the new architecture which they are using, where they're directly accessing the data rather than going through the whole Hadoop compute layer. What are some of the other things you might be tracking? You know, John was talking about the big four. Now, John's been talking about them for a long time: cloud, mobile, social, big data. Everybody talks about those. But when you look at where the roots of those developments were, cloud was Amazon, big data you could say was Google and what Yahoo has done, mobile, I guess Google and Apple, Apple in particular, social obviously Facebook. We watch what what we sometimes call the hyperscale crowd, the internet guys, the big giants are doing, and those things tend to go mainstream. Certainly you saw that with MapReduce, things like Bigtable.
Are you watching anything in particular that you see coming out of those innovators? Or is this big data theme, Hadoop, kind of a once in two decades type of trend? Is there anything else you're watching that you're excited about that's coming down the pipe? We are way too focused on big data to focus on other things which are happening. I think there's a lot of new work happening in the networking field. So software-defined networks, SDN and all, I think that's what's going to come. But big data itself I think is a once in a decade phenomenon. What's happening with mobile and social, that's feeding data into big data. So a couple of things have happened which made the whole big data perfect storm. Number one was mobile and social and a lot of other sources which have come up in the last five years. And then the cost of disk, of SATA, has gone down like anything. It has dropped like a rock. Now you can go to any store and buy as much disk as you want for almost nothing. And the third thing is the open source movement. So Hadoop evolved, and a lot of other technologies came in open source. It was interesting that a problem evolved but the solution also evolved at the same time. So that way I think big data is going to be a huge overarching phenomenon which is going to exist, and mobile and social will also become part of it. That's my take on it. So essentially you're competing. I mean, Jeff Kelly just quantified the big data market, and the biggest sector broadly is services. If you break it down into hardware, software and services, services is the biggest, and that doesn't look like it's changing. It's been that way for the last several years and will likely stay that way. And you see some big names, you know, the IBMs and the Accentures getting into it, the Deloittes, Ernst & Youngs, Capgeminis. These are guys you presumably compete with, right? So how do you compete with those big whales?
What's InfoObjects' differentiator? So in our case, our biggest differentiator is the faster turnaround time. We are much more agile than the bigger companies. Bigger companies get into the bigger clients also. So that's where it changes: a bigger enterprise like Bank of America may prefer to go with Accenture. But SMBs, they want more agility. They want faster turnaround. They want someone they can call, right? So most of our clients are SMBs. That's where it's easier for us to get in. Really, so a lot of the SMBs are jumping on big data. Can you maybe describe some examples of what you're seeing from some of the SMBs? What are they doing with big data? So what's happening with big data is that SMBs, and even the very small startups, when they are designing their data model, are now thinking, okay, this should go to MySQL, or Postgres in some cases, and this should go to Hadoop. So from the very start, they are separating their OLTP and OLAP load. What that means is that the need for Hadoop is there from the very start, when the company is starting. And then in SMBs, what's happening is that there's a lot of data which they were throwing away, because they did not have anywhere to store it and they did not even know that they could use that data. And what's happening with Hadoop is that now you can store that data, like clickstream data. Who would have cared about clickstream data 10 years back, right? But now the clickstream data is there, and all the web logs are there, and the sensor data is there, and all that kind of data. So they're storing all these various kinds of data, and now they're trying to figure out what to do with it. And that's the reason another thing which you're going to hear a lot about in the next year is ROI. The skeptics will say, well, okay, there's all this Hadoop and cool big data technology, and I do have data. So what? Where is the money?
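The OLTP/OLAP split described above, with transactional records going to a relational database and raw events such as clickstream going to Hadoop, can be sketched as a simple router. This is a toy illustration in plain Python with in-memory stand-ins; `TransactionalStore`, `AppendOnlyLog`, `route`, and `page_views`, along with the record fields `kind`, `id`, and `page`, are all hypothetical names assumed for the example, not any real database or Hadoop API.

```python
from collections import Counter


class TransactionalStore:
    """Stand-in for MySQL or Postgres: small keyed records, updated in place."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, record):
        self.rows[key] = record


class AppendOnlyLog:
    """Stand-in for files landed in Hadoop: events are only ever appended."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)


def route(record, oltp, olap):
    """Send orders to the transactional store and everything else to the log."""
    if record["kind"] == "order":
        oltp.upsert(record["id"], record)
    else:
        olap.append(record)


def page_views(log):
    """A simple analytic pass over the raw events: count clicks per page."""
    return Counter(e["page"] for e in log.events if e["kind"] == "click")
```

Routing a mix of order and click records through `route` leaves the latest state of each order in the store and every click in the log, so a question like "which pages get the most views" can be answered later from data that would otherwise have been thrown away.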
So that's what's going to happen down the line, at least for a year. And for any new technology, that confusion remains for some time. For Hadoop, this is the time. So in 2014, you will see a lot of people questioning it. They will say, yes, Hadoop is a great technology. I'm able to store all this data, but now what should I do with this data? Where is the money? How should I drive my ROI? How should I drive my KPIs? So that's why I'm saying that all these analytics tools and all the visualization tools and all the infographic tools, that's where I see a lot of momentum building this year. Appreciate it. I want to ask you about a trend we are watching. We just did a demo before we came on with the CrowdChat application, which is a crowdsourced group conversation application, crowdchat.net, that we launched in beta preview. But people as data: now people are information themselves. So how does that factor into some of the conversations that you've seen on the business side when you talk to your customers? People are connected with their cell phones. Actually, they're in databases as well. You're seeing the role of people as data. It's a trend Dave and I are trying to get our arms around: what does that mean, people as data? So people directly, because we work with enterprise customers, we don't encounter. But what's happening is that the customers are interested in what people are saying about them on Twitter and Facebook and other places. So that's where we see a lot of interest: they want us to capture that data and then analyze that data and see where the trends are going. So that's on these social platforms. Directly on crowdsourcing, I don't have a perspective. We never focused a lot on that. But I think there are a lot of companies which are doing a lot of interesting development there, including crowdfunding and all. But I don't have a lot of perspective. Okay, so what do you see?
Let's talk about the Hadoop marketplace. As it starts to grow up, you're starting to see trends here. We've had other people on theCUBE earlier saying, hey, this is the first show where we've seen people talking about POs and budgeting, not just POCs, like real planning. What's next for the market in this space, for the ones that are expanding? What do you see as that next chapter where business meets the technology enhancement? What's the intersection of the business side with the technology? As I said, analytics and visualization, I think that's going to be a big thing for the next two to three years. That's where it will head. Coming back to the ROI part, people want to see where the money is. And for a services company that matters, because the marketing departments and so on have much more money than the IT department. So with Hadoop infrastructure, I don't see a lot of money coming to the services business from there. Most of the companies already have their data centers set up. Yeah, so we have to work with the IT departments and all, but do we add a lot of value there? No, because with the Hadoop infrastructure being there, the Hadoop cluster being there, for them it's just another cluster. They already have tons of clusters in their data centers and it's just one more of them. My final question for you is, put a bumper sticker on it. Summarize this Big Data SV event here, the Strata Conference. What's happening this year in Silicon Valley that's most notable for the folks out there who aren't here? So as compared to last year, I see fewer NoSQL companies, at least on the floor, right? So the database companies have reduced in number, which had to happen. Down to 200 now. I haven't counted, but I'm pretty sure. Yeah, 300, now it's 200. I see few services companies. I thought there would be many more pure services companies, but I see fewer than five pure services companies.
One reason for that could be that all the companies, like Cloudera, are also providing services. So one pattern which is obvious there is that everybody is into services here. So yeah, even if you create a new product and you open source it, once you open source it, all the IP there is gone. Then you have to focus on services again. So the good trend in this industry is that everybody, to survive, has to focus on good quality customer support, customer turnaround and all. You cannot say that you have the best and greatest product in the world and let the customers figure it out themselves. So customer support will play a key role in generating revenue and in driving these companies. Well, certainly you guys are doing great work. We're calling you the professor of big data because you're so knowledgeable, but also, more importantly, you're out in the trenches, you're helping folks take that journey, you're crossing the chasm. I claim the chasm is being crossed, but you can't have a keynote speech saying the chasm has been crossed, otherwise you wouldn't be speaking. So we're still crossing the chasm, so to speak, and that's the keynote at Strata. Rishi, thanks for coming on theCUBE again. The market's exploding, a lot of great valuations, a lot of activity on the customer front, real proof points hitting the market. It's exciting. Big Data SV, this is SiliconANGLE's theCUBE. We'll be right back after this short break.