Live from midtown Manhattan, it's theCUBE's live coverage of Big Data NYC, a SiliconANGLE and Wikibon production made possible by Hortonworks: we do Hadoop. And now your co-hosts, John Furrier and Dave Vellante.

Hello everyone, we're here live in New York City. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise, and in some cases create our own events, in this case the Big Data NYC hashtag. I'm John Furrier, co-founder of SiliconANGLE, joined by my co-host Dave Vellante, and we're here with guest Abhi Mehta, a Cube alum who has been on theCUBE many times and coined the term "data factories." He's been with us since our first Hadoop World, and this is our fourth year; now it's Strata. Abhi, welcome back.

Well, thank you so much, and congratulations on the setup. I can see CNBC, ESPN, and theCUBE on Times Square. This is awesome, guys.

So Dave and I were talking about our four years here, and now we're outside the event, surrounding it, with Fox News on one side and theCUBE on the other. Obviously our orientation is open source, but I want to get your take on the vibe here. Four years ago Hadoop was barely heard of in the mainstream; a lot of the alpha geeks like yourself saw the future early. Now it's maturing, and you're seeing people settle into the positions they want to compete from. So talk about what's changed from your perspective. Your vision of a modern-day industrial revolution, data factories, has certainly stayed relevant; you called that vector four years ago, right on the spot. Take us through what's happening today. Where are things settling in, what key positions are being developed, and what does the trajectory look like going forward?
Well, first of all, I think together we have done some phenomenal work in this space. Some of the predictions we made over the years have been very interesting, and I think the biggest trend has been their validation, John. You can see the trajectory clearly, starting with our first conversation, where, I still remember, Dave asked me to explain moving the code to the data versus vice versa. Today you hear everybody, including customers, talking about moving compute to the data rather than the other way around. And we should not forget, I think it's a critical point, that Hadoop is a computational platform, not a storage platform, and that's what makes it powerful. The fundamental concept that you can perform computations at massive scale on data, to solve business problems that were hitherto impossible to solve, is by far the biggest and most liberating concept in big data. That hasn't changed; if anything there's more validation. The buzz is very interesting. I think this is a critical moment for the ecosystem, where Hadoop World, Strata New York, is going to grow its presence and its buzz compared to Hadoop Summit, Strata California, because this is where the business sits. We've all seen what big data can do on the West Coast; big data will not realize its economic potential unless it works equally well on the East Coast. After four years I finally see that buzz shifting to the East Coast, around commercial concepts, around making big data fundamentally transform business models in every industry vertical. That is a very refreshing change, and a sign of the maturity of the ecosystem.
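The "move compute to the data" idea Abhi keeps returning to can be illustrated with a toy sketch. This is hypothetical code, not Hadoop's actual implementation: each partition stands in for data resident on one share-nothing node, and a small aggregation function travels to every partition instead of the data being shipped out.

```python
from functools import reduce

# Hypothetical illustration of moving compute to the data: the data stays
# partitioned across nodes; only a small function and small summaries move.
partitions = [
    [("alice", 120.0), ("bob", 35.5)],
    [("alice", 10.5), ("carol", 250.0)],
    [("bob", 4.25), ("carol", 1.0)],
]

def map_partition(records):
    """Runs where the data lives: a local, per-node aggregation."""
    local = {}
    for user, amount in records:
        local[user] = local.get(user, 0.0) + amount
    return local

def merge(a, b):
    """Reduce step: combine the small per-node summaries."""
    out = dict(a)
    for user, total in b.items():
        out[user] = out.get(user, 0.0) + total
    return out

totals = reduce(merge, map(map_partition, partitions))
print(totals)  # {'alice': 130.5, 'bob': 39.75, 'carol': 251.0}
```

Only the per-node summaries cross the network in this sketch, which is why the pattern scales as data grows: adding a partition adds compute next to the new data rather than more traffic to a central store.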
The last part has been, I wouldn't say tricky for the market, or I guess tricky for the market but not for the three of us: we've always said that the value in the ecosystem sits very high up in the stack, in what we've coined advanced analytics applications, and the market has taken time to get there. The distribution infrastructure is a race to zero. The battles are being fought, some claim to have won, but the war is far from over. With this massive race to zero in large parts of the historical data analytics stack, where billions of dollars are currently being made, the value is going to transition to the highest point of the stack: analytics applications, predictive analytics software that works in Hadoop. Not on Hadoop, not against Hadoop, in Hadoop. That is the trend to watch, John, for the next four years. When we come back and theCUBE is on Times Square, we will all sit and talk about that. It's a critical concept to focus on.

So you've said Hadoop is not a storage platform, it's a compute platform. Drill down on that; what do you mean? Because everybody talks about Hadoop as a storage platform. You think differently.

Yeah, I think I said this on this show a year ago with you, Dave: we as a community will do a disservice, not just to big data's ecosystem but to the larger industrial and social economy, if Hadoop simply becomes a storage platform. Hadoop was architected to be a computational engine. Google runs its algorithms at scale across hundreds of petabytes of data in a Hadoop-like framework; they were the birth of it, as does Yahoo, as does Facebook. In fact, these companies have built successful economic models and companies around running computation at scale. Running computation at scale requires three key parameters.
You need a share-nothing architecture. You need the ability to store, process, and analyze all kinds of data, not just structured or unstructured but structured and unstructured together; I actually don't like those terms anymore, but together. And then you take analytics processes, algorithms, process flows, intelligence, and run them against all the data at the same time. Not in chunks, not by moving data back and forth, because that doesn't add up. The fundamental architectural components I just laid out are how Hadoop was built. I coined this term, and I'll use it now: the BC and AD eras of analytics technologies, BC being Before Cutting and AD being After Doug. Any technology or company created before Cutting doesn't work in this paradigm anymore. After Doug open-sourced Hadoop, when you actually apply those computational parameters to data at scale, it's a game changer. And we have to remember that Hadoop isn't a storage platform; it's a computational platform.

And you essentially alluded to that four years ago; you made that comment about storing, processing, and analyzing. Now, two years ago, Mike Olson stood up at Hadoop World and basically said, I predict this is going to be the year of the apps. Then, next year is going to be the year of the app. Where are all the big data apps?

I think it was an early prediction. Let me answer the question in two parts. We all realized, at least the three of us realized, very early on that the value was going to sit at the top of the stack, and I think the market has finally caught up. Cloudera announced today that they are rebranding themselves as the enterprise data hub, in a way seemingly conceding the distribution race to Hortonworks. There's Pivotal's massive launch yesterday at NICI, and what IBM is doing in the space.
Everybody is essentially admitting that in order to maintain the billions of dollars of economic value technology companies realize from data analytics, you have to move up the stack. That was the prediction, right? That was Mike saying, without that, this thing is not going to work. The reality is, and we've always been on the spot here: Tresata is an advanced, predictive analytics application built around three core areas: risk, marketing, and, we're going to make a massive announcement soon with you guys, I hope, fraud. We are now on a mission to eliminate fraud from the world, to eliminate transactional payments fraud from the world. We have focused on risk and marketing for now, but everybody talks about fraud being the perfect big data app, and no one has anything around it. The challenge, as Dave and you and I have discussed, is that it's very tricky to build applications. We have a recipe for it. It's quite simple, but difficult to pull off: you need technology expertise, you need domain expertise, and you need scientific expertise. Unless you can pull those three things together, you cannot build predictive applications that work in Hadoop. While we have focused at Tresata on pulling that talent together, that team together, to architect it, after two and a half years in the market we have had tremendous success doing it. I would ask you: how many companies do you know who have pulled those three things together? People are doing science, people are doing tech, people are doing domain. No one is doing all three together. That is the trick. And I think it will come; it's just very hard to do.

You mentioned risk, marketing, and fraud. All three of those disciplines used to rely heavily on sampling. When we first met, you said, sampling, forget it, you can't do this with sampling. So how are you doing it? What's the science behind it? What's the technology behind it, beyond the obvious domain expertise? Talk about that a little bit.

Absolutely.
Let me take you through all three; I'll give you a short example of each. On risk, we now have a customer, it's public, a company called L2C; we made a press release announcement about it. Over the last two years they have completely moved their risk measurement, storage, performance, analytics, and scoring engine onto our technology. They score 225 million people using our software every day, looking at risk factors that are not just purely financial but non-financial as well: things like rental histories and magazine subscriptions. And you can only do that, you can only enable scoring 200-plus million people a day, on technology that scales, because populations grow. What if it becomes 400 million? That's easy: add more servers, and we can solve the problem for you. They transitioned from traditional technology where they would wait months to score their population, and as we all know, consumer behavior changes much faster than that. Yesterday you were happy, man, because the Boston Red Sox won and you were buying everybody in the bar drinks. But you were not doing that two days ago, when they were losing. Consumer behavior changes every day.

Let me ask you a question; this is something we talked about yesterday. Data science has always been something we've talked about, along with seamless integration into applications, as you mentioned, as these new advanced analytics platforms emerge. Orchestration becomes important, but everyone's talking about the data scientist, and the stat we had yesterday is that there are roughly 200,000 data scientists in the world today. That number will grow, but there are over two million business analysts out there. So people are talking about a shift: data science is going to come from normal, smart people in their jobs, and that's where innovation could come from. It could be an analyst saying, hey, I can improve my airline, if you're United or someone else, or your GE industrial Internet.
One change of one percent in a workflow at GE's customer base, at one of their oil and gas customers, creates billions of dollars of real value. So the shift to data science is not just geeks writing Python. Could you elaborate on your vision there? One, do you agree, and two, what do you see happening with putting data science in the hands of people?

So first, to answer your question simply: I agree. I've always said that the one aspect of data science that does not get spoken about is business knowledge. You cannot be a good data scientist without the business knowledge to use all these massive amounts of data. If I brought together social, mobile, geo, and transactional data, what problems should I solve? You can't answer that question without domain knowledge. That's the first part of this conversation about whether you can enable business analysts to become smart data scientists. The answer is yes. However, there's a small trick to it, and I think the trick is this concept of machine learning. Here is my simple explanation of machine learning: machines are designed to mimic what humans do, and they will never completely replicate what humans do, because our brains are fundamentally wired in a way no machine can be. So what is the goal of data science and machine learning in enabling smart business people to make better decisions? This is where I bring in the concept of what I used to call the last mile, and will now call the last one-tenth mile. I think machine learning will completely automate nine tenths of the data storage, processing, and analysis cycle, and then nine tenths of that last mile. Everything will be automated. So we have these platforms: we have a data fusion engine that takes data from every single platform and fuses it at the unique entity level.
We've been doing it for two years, incredibly successfully, across transactional data, financial data, and social data: Facebook, Foursquare, Twitter, plus voice data and email data. So you can use machine learning to bring it all together, run the algorithm, and provide the insight to business analysts, who can then take the insight and make the best possible decision. You should never let the machine make the decision for you. You have machines take petabytes of data, make sense of it rapidly, and present it to smart people like you and Dave, so you can make the best call possible for your company. That's the future.

Let's talk about how we get to that future, the landscape right now. We talked about the marketplace, and I like this notion of in Hadoop, not just on Hadoop; Hadoop is a big part of this new enablement, this new landscape. So what ground do you see developing out there? Data warehousing as an initiative has been built around business intelligence, and it seems that business intelligence is taking hold here this year. Is that a telltale sign for the analytics piece, the advanced analytics you mentioned, or does it minimize the data warehouse component? Cloudera is talking about the data hub, Hortonworks is showing use cases with Spotify, so there are a lot of platform transition conversations happening. At the end of the day, what's your view on which key ground is stabilizing first and what's taking hold?

A fantastic question. I think you only need three key pieces to enable the next-gen data platform. The first piece is infrastructure, and I think that battle has clearly been won. Hadoop is the infrastructure. There is no doubt in anybody's mind that Hadoop and the ecosystem around it will become the data operating system of the future, a term we coined on theCUBE together. That part is clear; it is already well established and settled, John.
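The entity-level fusion Abhi describes can be sketched in a few lines. This is a hypothetical toy, not Tresata's engine: records from several sources are keyed to one entity (here a normalized email address, a simplification of real entity resolution) and folded into a single profile.

```python
# Hypothetical data-fusion sketch: align records from multiple sources at
# the unique entity level. The sources, fields, and join key are invented.
transactions = [{"email": "Dave@Example.com", "spend": 42.0}]
social = [{"email": "dave@example.com", "handle": "@dvellante"}]
crm = [{"email": "dave@example.com ", "name": "Dave"}]

def entity_key(record):
    # Normalize the identifier so the same person matches across sources.
    return record["email"].strip().lower()

def fuse(*sources):
    """Fold every source's records into one profile per entity."""
    entities = {}
    for source in sources:
        for record in source:
            profile = entities.setdefault(entity_key(record), {})
            profile.update({k: v for k, v in record.items() if k != "email"})
    return entities

profiles = fuse(transactions, social, crm)
print(profiles["dave@example.com"])
# {'spend': 42.0, 'handle': '@dvellante', 'name': 'Dave'}
```

The point of the sketch is the shape of the problem, not the matching rule: transactional, social, and CRM views of the same person become one profile, which is the precondition for the segment-of-one analytics discussed below.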
The second part is this conversation around BI, and I don't think BI is the right path forward. This focus on BI is as misguided as the focus on calling Hadoop a storage platform. I'd say it this way: there will be lots and lots of mechanisms to access both the data and the insight being generated in Hadoop, but that is, just like infrastructure, a commoditized skill.

Ben Werther has said it: BI is BS. He said it today.

Yeah, exactly. So I would call it a misguided focus. I actually don't like that we are applying words from the last generation of data analytics, built decades ago, to this next generation. Enterprise data hub? Come on, guys, it used to be called the enterprise data warehouse; we all lived it. I don't like it. These are the wrong terms to use around a technology that is fundamentally changing that stack. So: data operating system, infrastructure, solved. BI is a horizontal, commoditized layer that multiple technologies and tools will address; can you make it proprietary, I don't know. The last aspect is actually the most interesting: the combination of incredibly smart domain knowledge with incredibly smart data science, or algorithmic knowledge, combined on a platform to deliver insights at the segment of one. We spoke about it; you mentioned me saying four years ago that sampling is dead. What has not been built as a utility, and needs to be, and this is what we have spent a lot of time on, not because we wanted to but because we had to, is this. You have to help me here: do we buy the logic that every single piece of data in the world will sit in Hadoop?

Yeah, the vast majority sits in Hadoop.

Then the second question becomes: with all that data sitting in it, how do you actually commercialize and monetize it?
You can't monetize it at segmentation levels, at sampling levels; you have to do it at the segment of one. So you need to build a data fusion engine that takes data from all of those sources and aligns it at the entity level. We have an engine for that called Tree. It works incredibly well across financial data, at scale, and at putting financial data together with social data. Then graph engines have become the new killer engines in big data. We have one that works in Hadoop; it's called Orion. So you find who Dave is across all data, you find the relationships Dave has across all data, and then you predict what Dave will do. If you only do it with transactional data, John, there's a flaw. The flaw with transactional data analytics is that it's backward looking. Just because we know Dave was watching the Boston Red Sox game last night doesn't mean he'll watch again tonight. There is no game tonight anyway, right? He will watch again two days from now. So you have to combine transactional data, which tells you what you have done, with social interaction data, which gives you an indication of what you will do, at the individual segment level, because even though the two of you are practically twins, your behaviors are different. No one is talking about that: what is the layer on top of the data infrastructure that will enable analytics at the segment of one? We've already built that.

So you're saying you're building that into the Tresata app?

Correct.

Okay, but there are a lot of companies out there building those horizontal capabilities, essentially trying to say to you, build your app on top of our platform. What do you say to them? I don't need your platform?

No, exactly. The problem is they're building things, Dave, that we don't need. We don't need a database. No analytics company does. This is my advice, you know, free advice: you don't need it, guys. You're small entrepreneurs.
To anybody who wants to build a predictive analytics software company on big data: you don't need a database on Hadoop. You don't need fast SQL on Hadoop as a company; it's all free. You don't need someone to come in and say, I'll enable caching at scale on Hadoop; I'm not going to pay for it, it's all available for free. What you need is this: if Salesforce, if EMC, if Pivotal, and IBM, and Cloudera, and Hortonworks want to enable companies to build applications at the segment of one, you know what you need? You need a data fusion engine. We've already solved that problem; it works beautifully. You need a relationship discovery engine. And then you need to combine what we call a data index with intent-assignment algorithms, which we've spoken about before, to build an app. That's what's missing. What the Hadoop world needs built isn't a faster database; it's a data fusion engine. We already have it.

We have a question from the crowd, on the three pieces of your data platform: one, infrastructure, Hadoop as the data OS; two, business intelligence in a new way, data insight and access; and three, the combination of smart domain knowledge and data science on a platform. The question from the folks is: of those three, where's the biggest gap at this point? I'd argue number two, since you have to grow to the strength of either one or three.

No, I think the biggest gap is number three. So, respectfully to your user on CrowdChat, I would say the biggest gap is building a layer, and I don't want to use the term middleware, but an analytics application layer. It's not even an API. It's a fundamental layer on top of the data infrastructure that allows you to build applications at a segment of one, and for that you need a data fusion engine and a relationship discovery engine. Without those, no app can be architected at a segment of one.
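The relationship-discovery step Abhi describes (Orion is Tresata's actual engine; this is only a hypothetical sketch) can be illustrated by building a small co-occurrence graph: people who appear in the same event get an edge, and edges are weighted by how many shared events back them.

```python
from collections import defaultdict

# Hypothetical relationship-discovery sketch: the events and names are
# invented, and real graph engines operate at a vastly larger scale.
events = [
    {"dave", "john"},          # appeared in the same broadcast
    {"dave", "abhi"},          # appeared on the same panel
    {"john", "abhi", "dave"},  # appeared on the same show
]

# Build the relationship graph: an edge for every pair sharing an event.
graph = defaultdict(set)
for event in events:
    for person in event:
        graph[person] |= event - {person}

# Weight Dave's relationships by how many shared events support each edge.
weights = defaultdict(int)
for event in events:
    if "dave" in event:
        for other in event - {"dave"}:
            weights[other] += 1

print(graph["dave"])    # {'john', 'abhi'}
print(dict(weights))    # {'john': 2, 'abhi': 2}
```

Discovering who relates to whom, and how strongly, is the bridge between fused entity profiles and the segment-of-one predictions discussed in the interview: the heavier the edge, the more signal it carries about likely future behavior.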
That's what the biggest gap is, so when a client buys our applications, we bundle those two things in. They come with it, because all of our applications monetize big data at a segment of one. So I would say the biggest gap today is number three.

Okay, Abhi, thank you for coming on theCUBE. We really appreciate it. You've been a great guest. Congratulations on all your success and your continued involvement in the community. You've been a great steward of the vision, but also on the ground, in the field, building a business. Congratulations. This is theCUBE on the ground at Big Data NYC. We'll be back right after this short break with our next...