Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2017, brought to you by Hortonworks.

Good morning, welcome to theCUBE. We are live at day two of the DataWorks Summit, and we've had a great day so far, yesterday and today. I'm Lisa Martin with my co-host, George Gilbert. George and I are very excited to be joined by a multiple-time CUBE alumnus, the co-founder and VP of Engineering at Hortonworks, Arun Murthy. Hey, Arun.

Thanks for having me.

It's great to have you back. So, yesterday there was great energy at the event; you can see and hear it behind us, and there's great energy this morning. One of the things that was really interesting yesterday, besides the IBM announcement, and we'll dig into that, was that we had your CEO on, as well as Rob Thomas from IBM. And Rob said one of the interesting things over the last five years is that only 10 companies have outperformed the S&P 500 in each of those five years, and those companies have all made big bets on data science and machine learning. And as we heard yesterday, there are four mega-trends: IoT, cloud, streaming analytics, and now the fourth big leg, data science. Talk to us about what Hortonworks is doing. You've been here from the beginning as a co-founder, as I mentioned; you've been with Hadoop since it was a little baby. How is Hortonworks evolving to become one of those companies making big bets on helping your customers, and yourselves, leverage machine learning to really drive the business forward?

Absolutely, great question. So if you look at the history of Hadoop, it started off with this notion of a data lake, and I'm talking about the enterprise side of Hadoop. I've been involved in Hadoop for about 12 years now, and the last six of it has been as a vendor selling Hadoop to enterprises. We started off with this notion of a data lake.
And as people have adopted that vision of a data lake, where you bring all the data in and you're starting to get governance and security and all that, obviously one of the best ways to get value out of the data is the notion of: can you predict what is going to happen in your world, with your customers, with whatever it is, using the data you already have? So Rob, our CEO, talks about how we're trying to move from a post-transactional world to a pre-transactional world, and doing the analytics and data science is obviously the thing. There are so many applications of it. Something as simple as, we did a demo last year of how we're working with a freight company, and we're starting to show them how to predict which drivers and which routes are going to have issues as they're trying to move freight. Four years ago we did the same demo, and we would show that this driver had an issue on this route; but now we're in a world where we can actually predict it and let you take preventive measures up front. Similarly, internally, take things like machine learning on log analytics. We have an internal problem where we have to test different versions of HDP itself, and as you can imagine, it's a really, really hard problem. We support 10 operating systems and seven databases; if you multiply out that matrix, it's tens of thousands of options. So to do all that testing, we now use machine learning internally to look through the logs, predict where the failures were, and help our own software engineers understand where the problems were.
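The log-mining idea Arun describes could be sketched roughly like this (purely illustrative, not Hortonworks' actual pipeline; the sample log lines and the token-scoring scheme are made up): score a new test-run log line by how often its tokens appeared in previously failing versus passing runs.

```python
from collections import Counter

# Hypothetical labeled training logs: (line, did_the_run_fail) pairs.
TRAIN = [
    ("Connection refused by datanode", True),
    ("OutOfMemoryError in container launch", True),
    ("Job completed successfully in 42s", False),
    ("All 120 tests passed", False),
]

def token_weights(samples):
    """Count how often each token appears in failing vs. passing logs."""
    fail, ok = Counter(), Counter()
    for line, failed in samples:
        (fail if failed else ok).update(line.lower().split())
    return fail, ok

def failure_score(line, fail, ok):
    """Net evidence that a new log line indicates a failure."""
    return sum(fail[t] - ok[t] for t in line.lower().split())

fail, ok = token_weights(TRAIN)
# A positive score suggests the line resembles past failures.
print(failure_score("Container killed: OutOfMemoryError", fail, ok))
```

A production system would use a real classifier over far richer features, but the shape of the problem, learn from labeled historical runs and triage new logs automatically, is the same.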
An extension of that has been the work we've done in SmartSense, which is a service we offer our enterprise customers: we collect logs from their Hadoop clusters, and then we can actually help them understand where they can either tune their applications or even tune their hardware. There's an example I really like: we had a really large enterprise financial services client with literally hundreds, thousands, of machines on HDP, and using SmartSense we actually found that there were 25 machines with a bad NIC configuration, and we proved to them that by fixing those, they got 30% throughput back on their cluster. At that scale, that's a lot of money; it's a lot of capex and a lot of opex. So as a company, we try to apply this to ourselves as much as we try to help our customers adopt it. Does that make sense?

Yeah, let's drill down on that even a little more, because it's pretty easy to understand the standard telemetry you would want out of hardware, but as you move up the stack, the metrics, I guess, become more custom. So how do you learn, not just from one customer but from many customers, especially when you can't standardize what you're supposed to pull out?

Yeah, so we're really big believers in dogfooding our own stuff. So we talk about the notion of a data lake; we actually run a SmartSense data lake, where we aggregate data across the hundreds of our customers, and we can actually do predictive machine learning on that data in our own data lake. And to your point about how we go up the stack, this is where we feel we have a natural advantage, because we work on all the layers, whether it's the SQL engine or the storage engine, above and beyond the hardware. So as we build these models, we understand when we need more or different telemetry.
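The NIC-misconfiguration story is essentially outlier detection on per-host telemetry. A minimal sketch, with entirely hypothetical host names and throughput numbers (SmartSense's real analysis is far richer): flag hosts running well below the cluster's median network throughput.

```python
import statistics

# Hypothetical per-host network throughput samples (MB/s) from cluster telemetry.
throughput = {
    "node01": 940, "node02": 955,
    "node03": 110,   # suspiciously slow: misconfigured NIC?
    "node04": 948,
    "node05": 102,   # also slow
}

def flag_slow_hosts(samples, threshold=0.5):
    """Flag hosts running below `threshold` x the cluster median throughput."""
    median = statistics.median(samples.values())
    return sorted(h for h, v in samples.items() if v < threshold * median)

print(flag_slow_hosts(throughput))  # the two outlier hosts
```

Using the median rather than the mean keeps a handful of broken hosts from dragging the baseline down with them.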
And we put that back into the product, so the next version of HDP will have the metrics that we wanted. And now we've been doing this for a couple of years, which means we've done three, four, five turns of the crank, and it's obviously something we always get better at; but compared to where we were a couple of years ago when SmartSense first came out, it's actually matured quite a lot.

So there are a couple of different paths you can add to this. Customers might want, as part of their big data workloads, some non-Hortonworks services or software when it's on-prem. And then, can you also extend this management to the cloud, if they want a hybrid setup, where in the not-too-distant future the cloud vendor will also be a provider of this type of management?

Absolutely; in fact, it's true today. Microsoft's a great partner of ours. We work with them to enable SmartSense on HDI, which means we can actually get the same telemetry back whether you're running on-prem HDP or you're running on HDI. Similarly, we ship a version of our cloud product, called Hortonworks Data Cloud, on Amazon, and again, SmartSense is pre-plumbed there. So whether you're running on Amazon or Microsoft or on-prem, we get the same telemetry, we get the same data back; if you're a customer using many of these products, we can actually give you that telemetry back. Similarly, as you guys probably noticed, you were probably there at the analyst day, we announced a Flex Support subscription, which means you can now take the support subscription you get from Hortonworks and use it on-prem or in the cloud.

So in terms of transforming HDP, for example, I just want to make sure I'm understanding this: are you pulling in data from customers to help evolve the product? And that data can be on-prem, it can be in Microsoft Azure, it can be in AWS?

Exactly. HDP can be running in any of these.
We will actually pull all of that into our own data lake, where we do the analytics, and then we present it back to the customers. So in our support subscription, the way this works is: we do the analytics in our lake and push it back; in fact, it opens support tickets in our Salesforce and all the support mechanisms, and the customer gets a set of recommendations saying, hey, we know this is the workload you're running, and these are the opportunities for you to do better, whether it's tuning the hardware, tuning an application, or tuning the software. We send the recommendations back, and the customer can say, oh, that makes sense, I accept that, and we'll apply that recommendation automatically. Or the customer can say, maybe I don't want to change my kernel parameters, let's have a conversation; and if the customer is comfortable after that, they can go change it on their own. So we do that sort of back and forth with the customer.

One thing that just pops into my mind: we talked a lot yesterday about data governance. Are there particular, and also yesterday on stage, you were...

On stage, with IBM, yeah.

Yes, exactly. And when we think of really data-intensive industries, retail, financial services, insurance, healthcare, manufacturing, are there particular industries where you're really leveraging this kind of bi-directional flow because there are no governance restrictions, or maybe I shouldn't say none? Give us a sense of which particular industries are really helping to fuel the evolution of the Hortonworks data lake.

So I think healthcare is a great example. When we started off this open source project called Atlas a couple of years ago, we got a lot of traction in the healthcare and insurance industries. Folks like Aetna were actually founding members of that consortium, and we're starting to see them get a lot of leverage out of this.
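The accept-or-discuss loop Arun describes could be modeled as simply as this (a hypothetical sketch; the record fields, targets, and example recommendations are invented for illustration): recommendations flow back to the customer, and only the ones they explicitly accept get applied automatically.

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    """One tuning suggestion pushed back from the (hypothetical) analytics lake."""
    target: str      # e.g. "kernel", "application", "hardware"
    change: str
    accepted: bool = False

def to_auto_apply(recs):
    """Only customer-accepted recommendations are applied automatically;
    the rest stay open for a conversation."""
    return [r for r in recs if r.accepted]

recs = [
    Recommendation("kernel", "raise vm.swappiness tuning", accepted=False),
    Recommendation("application", "increase mapper heap to 2 GB", accepted=True),
]
print([r.change for r in to_auto_apply(recs)])
```

The key design point from the interview is that the customer stays in the loop: nothing, especially kernel-level changes, is applied without an explicit accept.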
Similarly now, as we go into Europe and expand in EMEA, things like GDPR are really, really important. And you guys know what GDPR is; it's a really big deal. If you're not compliant by, I think it's March of next year, you pay a portion of your revenue as fines. That's big money for everybody. So that's why we're really excited about the partnership with IBM, because we feel the two of us can help a lot of customers, especially in countries that are significantly more highly regulated than the United States, to actually get leverage out of our joint portfolio of products. And IBM's been a great contributor to Atlas; they've adopted it wholesale, as you saw in the announcement yesterday.

So you're doing the keynote tomorrow, on Data Lakes 3.0. Give us maybe the top three things. Walk us through the evolution of data lakes, 1.0, 2.0, 3.0, where you are now, and what folks can expect to hear and see in your keynote.

Absolutely. So as we continue to work with customers, we see a maturity model: initially people would stand up their lake, then they'd want basic security on it, Kerberos and so on, and now they want governance. And as we go on that journey, clearly our customers are pushing us to help them get more value from the data. It's not just about standing up the data lake and managing data with governance; it's also, can you help us with machine learning, can you help us build other apps, and so on. So as we looked at this, a fundamental evolution the Hadoop ecosystem had to go through was, with the advent of technologies like Docker, it's really important to help customers bring in more than just the workloads which are native to Hadoop. Take the Hadoop stack: MapReduce, obviously Spark's been great, and now we're starting to see technologies like Flink coming.
But increasingly, you want to do data science, mass-market data science. Obviously people want to use Spark, but the mass market is still Python and R and so on.

Non-native, okay.

Right, and these predate Hadoop by a long way. So now, as we bring these applications in, having technologies like Docker is really important, because now we can actually containerize these apps. It's not just about running Spark, or running Spark with R, or running Spark with Python, which you can do today. The problem is, in a true multi-tenant governed system, you want not just R but specific libraries for R, and the libraries George wants might be completely different from what I want, and you can't do a multi-tenant system where you install both of them simultaneously. So Docker is a really elegant solution to problems like those. Now we can actually bring those technologies into Docker containers, so George's Docker containers will not conflict with mine, and you can be off to the races doing data science. Which is really key for technologies like DSX, because if you look at DSX, it obviously supports Spark with technologies like Zeppelin as the front end, but it also has Jupyter, which is what the mass market uses for Python and R. So we want to make sure there's no friction, whether it's the folks using Spark or the folks using R. And equally importantly, DSX in its roadmap has support for things like the classic IBM portfolio, SPSS and so on. So bringing all of those things in together, and making sure they run with the data in the data lake, and the compute in the data lake, is really big for us.

Wow, so it sounds like your keynote is going to be very educational for the folks attending tomorrow. So, last question for you: one of the themes that recurred in the keynote this morning was sharing a fun fact about the speakers.
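The library-conflict problem Arun is describing can be made concrete with a small sketch that builds per-user `docker run` command strings, each pinning its own library versions (the image name, container naming scheme, and version pins are all hypothetical; this only constructs the command strings, it does not invoke Docker):

```python
def container_cmd(user, image, libs):
    """Build a docker run command giving each user their own containerized
    environment with pinned library versions (illustrative naming scheme)."""
    pins = " ".join(f"{lib}=={ver}" for lib, ver in sorted(libs.items()))
    return f"docker run --name dsx-{user} {image} pip install {pins}"

# George and I can pin conflicting versions of the same library without
# clashing, because each environment lives in its own container.
print(container_cmd("george", "python:3.10-slim", {"pandas": "1.5.3"}))
print(container_cmd("arun",   "python:3.10-slim", {"pandas": "2.2.2"}))
```

On a shared bare-metal host, only one version of the package could be installed system-wide; containerizing each user's environment is exactly the isolation the multi-tenant data science platform needs.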
What's a fun fact about you, Arun?

Great question. I guess people have been looking for folks with 10 years of experience on Hadoop; here I am, finally, right? There are not a lot of those people, but it's fun to be one of the people who've worked on this for about 10 years. Obviously I look forward to working on it for another 10 or 15 more, but it's been an amazing journey.

Excellent. Well, we thank you again for sharing time with us on theCUBE. You've been watching theCUBE live on day two of the DataWorks Summit, #DWS17. For my co-host, George Gilbert, I'm Lisa Martin. Stick around, we've got great content coming your way.