Live from New York, extracting the signal from the noise. It's theCUBE, covering RapidMiner Wisdom 2016, brought to you by RapidMiner. Now, your hosts, Dave Vellante and Jeff Frick.

Welcome back to RapidMiner Wisdom. We're here in downtown New York City, the Big Apple. Vamsi Chemitiganti is here. He's the general manager of financial services at Hortonworks, a company we know well. Vamsi, welcome to theCUBE, it's good to see you.

Great, thanks, Dave and Jeff, glad to be here.

So Hortonworks has been riding the big data wave since the early days. Everybody knows the story, spun out of Yahoo. We've had guys like Rob Bearden and Shaun Connolly on many times, and many other practitioners within the Hortonworks community. But give us the update, specifically on financial services. We're here in the heart of the financial services world. It's your wheelhouse?

Yeah, absolutely. It's an interesting industry to be in, Dave, given the Hadoop and big data movement, with banking at the forefront of a lot of the predictive analytics and real-time decisions, as well as the historical space with risk data aggregation and anti-money laundering. Really exciting times. Hortonworks, as you know, is the leading pure open source provider of Hadoop solutions in the Gartner Magic Quadrant and all those good places, and banking forms our number one or number two vertical. So we don't just work with banks in a classical vendor setting; we strive to be partners. They're helping us improve the Hadoop platform and the ecosystem, driving it to be a true application-level ecosystem in addition to having strong data management, aggregation, and governance capabilities. And that's where I think the story gets really interesting with predictive analytics.

So is that really kind of what's happened?
In the early days of Hadoop, the banks were basically building out data pipelines and eliminating sampling, you used to hear that all the time. So is the emphasis now on hardening that capability, that solution, or are we also seeing new use cases?

No, it's a great question. The way I like to answer that is with an analogy to Web 1.0, where you had the portals on the internet about 10, 15 years ago, and now you're at Web 2.0, with really interactive services being offered over mobile channels, over your Androids, your iOSes, et cetera. You could call it big data 1.0, where banks did exactly what you described a couple of years ago, in terms of being able to understand big data, understand how Hadoop was different and how disruptive Hadoop could be to the data ecosystem. Banks spent a lot of time ingesting the data, landing it in Hadoop, transforming the data, and basically using Hadoop to supplant existing data warehouse architectures or relational databases. Now I think we're at the cusp of big data 2.0, where the simple problems, a lot of the plumbing challenges, have been solved, and banks want to use big data in what I like to call the defensive dimension: being able to understand the risk being run across all the different products they offer, from a market risk perspective, now with all the tumult in the markets, and from a liquidity risk perspective, in terms of the web of connected financial institutions. And also the offensive dimension, which is to use the data and real-time interactions to drive a more responsive experience for consumers, much like what you get with an Amazon or a Yahoo or a Google, and really to use that to transform their business, what's known as digital transformation, essentially.

How much effort is being placed on making the predictive models better, and how are organizations doing that? Is it more data? Is it tuning the model, injecting machine learning?
Talk about that a little bit.

Great question. To be realistic about the power of predictive analytics, we're probably at 20 or 30% of the journey to nirvana, which would be essentially being able to plug in predictive models, models that do clustering, segmentation, classification, and maybe deep learning, and do all of that at scale. The challenge being, and I still see quite a bit of this, that typically a large portion of a project is spent in what I like to call data janitorial work by the data scientists: bringing in the data, munging the data, transforming the data. Do I have all the data I need, or am I overfitting my models, or what have you? And then I would say 30 or 40% of the work is productive work, spent really creating models and deploying them to do credit card fraud detection, anti-money laundering, or what have you, projects that result in real business value. But kudos to a lot of the work done by RapidMiner and some of the other leaders in the Gartner analytics Magic Quadrant, you'll see that the percentage of the work that is janitorial is going to start decreasing, and more of the time will be spent being productive, helping the data scientist leverage his or her domain expertise to create actual business outcomes. And Hortonworks has been a part of this. We're not just a Hadoop company. We also acquired the technology known in the community as Apache NiFi, which we call Hortonworks DataFlow, because we recognize that a big bottleneck in realizing value is being able to ingest data at scale, using file-based ingest, databases, message queues, what have you, and being able to help the data science process from an overall onboarding-the-data, cleansing-the-data perspective.

Yeah, so what you described before is that people spend all their time cleansing the data. That's what everybody complains about.
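As an aside, the "janitorial" step being discussed might look something like this minimal sketch: deduplicating, dropping incomplete rows, and normalizing types on a small, hypothetical batch of wire-transfer records (the field names and values here are invented for illustration).

```python
from datetime import datetime

# Hypothetical raw records, as they might land from a file-based ingest:
# duplicates, missing amounts, inconsistent casing, dates as strings.
raw = [
    {"id": "T1", "amount": "1200.50", "currency": "usd", "date": "2016-03-01"},
    {"id": "T1", "amount": "1200.50", "currency": "usd", "date": "2016-03-01"},  # duplicate
    {"id": "T2", "amount": "",        "currency": "EUR", "date": "2016-03-02"},  # missing amount
    {"id": "T3", "amount": "845.00",  "currency": "USD", "date": "2016-03-02"},
]

def clean(records):
    """Deduplicate by id, drop rows without an amount, normalize types."""
    seen, out = set(), []
    for r in records:
        if r["id"] in seen or not r["amount"]:
            continue
        seen.add(r["id"])
        out.append({
            "id": r["id"],
            "amount": float(r["amount"]),
            "currency": r["currency"].upper(),
            "date": datetime.strptime(r["date"], "%Y-%m-%d").date(),
        })
    return out

cleaned = clean(raw)
```

In practice this kind of logic runs at scale inside ingest and transformation pipelines rather than over in-memory lists; the sketch only shows why the step eats so much of a data scientist's time.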
But you're saying that increasingly we're able to shape the algorithms to the data, as opposed to spending all the time on the data. Is that a correct interpretation?

Exactly. With tools like RapidMiner, which open up the data science field beyond just the core techies in an organization, we're getting to the point where data science is not a dark art anymore; it's more accessible to data analysts and to software developers, who can use some of the models being created by a team of data scientists and collaborate in a true team environment.

Given that there's so much customization involved, at least in the early days of analytics solutions, particularly in the financial services world, there has to be a lot of paranoia about IP. How is the industry dealing with that? Are they trying to minimize outside services? Are there contractual protections, which kind of might not work? How are they dealing with it?

A great question, Dave. So we see a couple of different things out there. In banking specifically, there are things every bank must do to stay compliant with regulation, and you could go to any of the big five research houses and look at the fact that 60% of all IT dollars are going to risk data aggregation. What we're trying to push the industry toward, and the industry largely recognizes this, and I think you'll see more of it as blockchain technology gets into prime time, is that banking customers should consider building utilities where they can offload a lot of the common plumbing-type tasks and not have them replicated from bank to bank to bank. So in areas like KYC, know your customer, we've already seen a couple of utilities being built, and I was part of one while working at Red Hat. But what we're trying to push with the industry is a couple of things, right?
Where possible, the industry partners with players like Hortonworks or RapidMiner to create these baseline models, but designs them in such a way that the banks can create their own IP on top. Whether they wish to contribute that back to the open source community or not is a question left to them. But with Google open sourcing their neural net deep learning framework, and Yahoo releasing a lot of data sets to the public to help improve models, I think you're not too far off from banking realizing, hey, what could be for the benefit of everybody we should probably release out there and be seen as thought leaders, and what's IP, really core to our business, the secret sauce, we keep in house.

Well, it's interesting, there are a lot of things you mentioned: the blockchain, Google open sourcing, essentially its AI. We hear a lot about IBM Watson, and one of our colleagues, Paul Gillin, actually asked Bob Picciano at the MIT conference, will you open source Watson? No. And now, in data science circles, people say, well, Facebook and Google have the killer AI and cognitive, so it's going to be really interesting to see. But I wanted to ask you about blockchain. Two years ago, three years ago, it was like, Bitcoin, what the hell? And now there's this awakening to the potential applications of blockchain, and you're seeing open source projects and security initiatives. So what's your take on that? I mean, it's likely there won't be one blockchain.

No, great question, Dave. I think blockchain is going to be the number one disruptive technology in many industry verticals, banking just being one of them. Blockchain really got its start with the Bitcoin technology, the Bitcoin currency, and obviously the whole financial industry is built around intermediaries, right?
So when you pay with a credit card, there's a Visa or a MasterCard taking a few points off every transaction, and if you talk to a retailer, there's a lot of heartburn about the fees, or what have you. Even though blockchain and Bitcoin might have their origins in the libertarian way of thinking, and leaving aside that whole side of the argument, which kind of economics is right, the Keynesian versus the libertarian school, I think what it really enables is a couple of different things. Blockchain 1.0, as we like to call it, got its start with digital currency and freeing the world from this intermediary problem. But you look at certain markets like Argentina, or the developing countries where 40% of the world's population does not have access to banking services: just being able to bank using a simple cell phone is a tremendous win, and banks would want to be part of that. But I think with blockchain 2.0, with smart contracts, being able to write programming into contracts and to have a global ledger, the potential is going to be disruptive across healthcare, financial services, IoT, retail, et cetera. One last point, to what you mentioned: we're not talking about one blockchain to conquer them all. There are going to be vertical blockchains that serve a certain industry or constituency of interest. So I could see a KYC data utility moved to a private blockchain, with a bunch of banks collaborating on a set of access permissions and tighter controls as they want to see it, and not have any of the data leave the data center or be publicly available, even though you have the strongest cryptography backing it up. But you're also going to see inter-organizational blockchains as well, or whatever you like to call them, or variants.
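The "global ledger" idea underneath all of these variants can be illustrated with a toy sketch: each block commits to the hash of its predecessor, so tampering with any earlier entry is detectable. This is only the chaining idea, not a real blockchain (no consensus, signatures, or proof-of-work), and all names and values are invented.

```python
import hashlib
import json

def block_hash(block):
    """Deterministic SHA-256 digest of a block's contents."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, transactions):
    """Append a block that commits to the current tip's hash."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})
    return chain

def verify(chain):
    """True only if every block still points at the hash of its predecessor."""
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != block_hash(chain[i - 1]):
            return False
    return True

chain = []
append_block(chain, [{"from": "bank_a", "to": "bank_b", "amount": 100}])
append_block(chain, [{"from": "bank_b", "to": "bank_c", "amount": 40}])
```

In a private, permissioned deployment like the KYC utility described above, the same tamper-evidence property holds while access to the chain itself stays restricted to the member banks.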
And then there's always going to be the public blockchain, which I think is a very robust technology to begin with. So in short, I think it's going to be disruptive to the way we view the world within at least a few years.

But if I heard you correctly, you don't necessarily see a scenario where the trusted third party disappears. You're saying that the role of that trusted third party transforms and adds value in different ways.

No, absolutely. This is something I pointed out in my blog as well. I don't think enterprise practitioners, bank CTOs, or healthcare CIOs should view the blockchain as replacing the existing structure in their industry. I see blockchain forming a complementary paradigm to start with. Over a period of time, in certain industries, take mortgages as an example, the ability for a consumer to request a mortgage, have that mortgage granted using a smart contract, and have the credit checking all happen online, I think, is going to be disruptive to certain industry segments. So it depends on the segment you're talking about, but it'll start as complementary and over time subsume some of the third-party nature that's omnipresent across banking, healthcare, manufacturing, what have you.

Right, so let's bring it back to big data and Hadoop. You've given a couple of good 1.0, 2.0 analogies. Everybody's talking about real time, Spark's all the buzz. What's happening in the Hadoop ecosystem? It's increasingly complex and burgeoning, which I guess is a good sign of growth. The markets are a little shaky these days; is the funding drying up? A lot of people are asking that question. What's your take on Hadoop, the evolution of Hadoop, the impact of things like Spark, you guys embracing it, coming up with your own versions? Talk about that a little bit.

So a couple of things, right. Let me first address the business perspective.
Obviously, Hortonworks is probably the largest public Hadoop player. We're seeing a lot of growth in our business, right? There are 890-plus customers signed up, and we're signing new logos every quarter. A lot of banks, a lot of healthcare organizations, a lot of IoT-type shops are moving the technology out of the lab to having it produce actionable business insights. So compared to two years ago, if Hadoop was this small box and you had this whole other big box, the EDWs, the RDBMSs, now we're seeing Hadoop become bigger and bigger as time goes by. So you have that whole motion on the data side itself, where Hadoop is becoming a bigger chunk of the operation, and we see that trend continuing exponentially. On the technology side, a lot of fantastic work is being done by Hortonworks and even our competitors. Look at the YARN project: Arun Murthy, a Hortonworks founder, conceived of YARN, and what YARN does is form a spine on top of the Hadoop Distributed File System and explode the use cases you can put on Hadoop. So Hadoop moves from being a data technology to being an application technology. You mentioned real time. In Hadoop 1.0 the focus was on batch-oriented processing; with 2.0, with the Hortonworks 2.x releases, you can get batch, you can get real time, you can get streaming, you can get time series, and your ability to extend Hadoop is limited only by imagination. So, to talk about financial services, we see a lot of clients not just putting in a batch-oriented risk application, where a lot of book-of-record transaction data, wire data, core banking data, and payment data shows up and you calculate a bunch of analytics that provide risk exposures in a day.
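A scaled-down illustration of that batch risk rollup: group the day's transactions by counterparty and sum gross exposure. The records and field names here are invented; a real job would run over billions of rows landing in HDFS.

```python
from collections import defaultdict

# Toy stand-in for a day's book-of-record transaction data.
transactions = [
    {"counterparty": "fund_x", "notional": 5_000_000},
    {"counterparty": "fund_y", "notional": 2_500_000},
    {"counterparty": "fund_x", "notional": 1_000_000},
]

def exposure_by_counterparty(txns):
    """Aggregate gross exposure per counterparty, the simplest risk rollup."""
    totals = defaultdict(float)
    for t in txns:
        totals[t["counterparty"]] += t["notional"]
    return dict(totals)

exposures = exposure_by_counterparty(transactions)
```

The same group-and-sum shape is what a distributed batch framework parallelizes across a cluster; the sketch just makes the aggregation explicit.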
We're actually seeing real-time transactions being streamed in: hey, if a wire transfer of money happened from one entity to another, could this indicate money laundering?

So the New York Times had an article last week about shell companies buying up real estate in Manhattan. That's the classic money laundering pattern you're talking about.

Using Hortonworks Hadoop and RapidMiner predictive analytics, you can intercept that in real time, and you can marry it with historical data to determine your confidence level in this transaction being fraudulent or not. So I think that opens Hadoop up to a whole bunch of use cases, and we'll see more growth of that in years to come.

And that's a good example of, we agree, one of our research themes: building out systems of intelligence, essentially bringing transaction and analytic systems together and affecting business outcomes in near real time. And we're starting to see that come to reality. So we have to leave it there. But thanks very much, Vamsi, for coming on theCUBE. It was great to see you again.

Great. Thanks, Dave, and thanks, Jeff.

Keep right there. We'll be back from New York City, RapidMiner Wisdom 2016. Right back.