Live from San Jose in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2017, brought to you by Hortonworks.

Welcome back to theCUBE. We are live at DataWorks Summit 2017. I'm Lisa Martin with my co-host, George Gilbert. We've just come from this energetic, laser-light-show-infused keynote, and we're very excited to be joined by one of today's keynote speakers, the CTO of Hortonworks, Scott Gnau. Scott, welcome back to theCUBE.

Great to be here, thanks for having me.

Great to have you back here. One of the things that you talked about in your keynote today was collaboration. You talked about the modern data architecture, and one of the things that I thought was really interesting is where Hortonworks is now: you are empowering cross-functional teams, operations managers, business analysts, data scientists, really helping enterprises drive the next generation of value creation. Tell us a little bit more about that.

Yeah, great, so thanks for noticing, by the way. I think the next important thing is kind of a natural evolution for us as a company and as a community, and I've seen this time and again in the tech industry: you kind of move from really cool breakthrough tech into more of a solution space. And so this whole notion is really about how we're making that natural transition. When you think about having all the cool technology and all the breakthrough algorithms, that's really great. But how do we then take that and turn it into value very quickly, and in a repeatable fashion? So the notion that I launched today is that we can make these three personas successful if we focus on combining all of that technology, usability, and even some services around it to make each of those folks more successful in their job. And so I broke it down really into three categories, right? We know the traditional business analysts, right?
They've been doing SQL and they've been doing predictive modeling on structured data for a very long time, and there's a lot of value generated from that. Making the business analyst successful in a Hadoop-inspired world is extremely valuable. And why is that? Well, it's because Hadoop now brings a lot more breadth of data, and frankly a lot more depth of data, than they've ever had access to before. Being able to communicate with that business analyst in a language they understand, SQL, and being able to make all those tools work seamlessly, is kind of the next extension of success for the business analyst.

We spent a lot of time this morning talking about data science, and the data scientist is kind of the next great frontier, where you bring together lots and lots and lots of data. We basically bring big math and heavy compute together with the data scientist and really enable them to go build out that next generation of high-definition analytics, right? We're all, certainly I am, captivated by the notion of self-driving cars, and when you think about it, the success of a self-driving car is based purely on successful data science: on those cameras and those machines being able to infer images more accurately than a human being, and then make decisions about what those images mean. That's all data science. And it's all about raw processing power and lots and lots of data to make those models trained and more accurate than what would otherwise happen. So enabling the data scientist to be successful, obviously that's a use case. Certainly voice-activated, voice-response kinds of systems for better customer service; better fraud detection. The cost of a false positive is like 100 times the cost of missing a fraudulent behavior, right? That's because you've irritated a really good customer with that false alarm. So being able to really train those models in high definition is extremely valuable.
So bringing together the data, but also the tool set, so that data scientists can actually act as a team and collaborate, and spend less of their time finding the data and more of their time refining the models.

And, as I said this morning, last but not least: the operations manager. This is really, really, really important. A lot of times, especially to geeks like myself, the operations guys seem like just a pain in the neck. But they're really, really important. We've got data that we've never had before, and we have to make sure that it's secured properly, make sure that we're managing within the regulations of privacy requirements, make sure that we're governing it and understanding how that data is used alongside our corporate mission. So we're creating that tool set so that the operations manager can be confident in turning over these massive pools of data to the business analysts and to the data scientists, confident that the company's mission and the regulations of the jurisdictions they're working within are all in compliance. That's what we're building out, and that stack of course is built on open source Apache Atlas and open source Apache Ranger, and it really makes for an enterprise-grade experience.

A couple of things to follow on to that. We've heard this notion for years that there is a shortage of data scientists, and now data science is such a core strategic enabler of business transformation. Is this collaboration, this kind of team sport that was talked about earlier, helping to spread data science across these personas, to enable more of them to be data scientists?

Yeah, I think there are two aspects to it, right? One is, certainly, really great data scientists are hard to find. They're scarce; they're unique creatures. And to the extent that we're able to combine a tool set to make the data scientists that we have more productive, I think the numbers are astronomical, right?
You could argue that with the wrong tool set, a data scientist might spend 80 or 90% of his or her time just finding the data and only 10% working on the problem. If we can flip that around and make it 10% finding the data and 90% working on the problem, that's like an order of magnitude more breadth of data science coverage that we get from the same pool of data scientists. So I think from an efficiency perspective, that's really huge.

The second thing, though, is that by looking at these personas and the tools that we're rolling out, can we start to package up the things that the data scientists are learning and move those models onto the business analyst's desktop? So now not only is there more breadth and depth of data, but frankly there's more breadth and depth of models that can be run, infused into traditional business process, which means turning that into better decision making, turning that into better value for the business, just kind of happens automatically. So you're actually leveraging the value of data science.

Let me follow that up, Scott. Right now the biggest sort of time sink for the data scientists or data engineers is data cleansing and transformation. Where do the cloud vendors fit, in terms of having trained some very broad horizontal models for vision, natural language understanding, text-to-speech? Where they had accumulated a lot of data assets and then created models that were trained and could be customized, do you see a role not just for next-gen, UI-related models coming from the cloud vendors, but for other vendors who have data assets to provide more fully baked models, so that you don't have to start from scratch?

Absolutely. So one of the things that I also talked about this morning is this notion of, kind of, open squared: open community, open source, and open ecosystem. I think it's now open to the third power, right?
And that's adding open models and algorithms, and I think all of those things are really creating a tremendous opportunity, the likes of which we've not seen before, and I think it's really driving the velocity in the market, right? Because we're collaborating in the open, things just get done faster and more efficiently, whether it be kind of in the core open source stuff, whether it be in the open ecosystem, being able to pull tools in, and of course the announcement earlier today of IBM's Data Science Experience software as a framework for the data scientists to work as a team. That thing in and of itself is also very open: you can plug in Python, you can plug in open source models and libraries, some of which were developed in the cloud and published externally. So it's all about continued availability of open collaboration, and that is the hallmark of this wave of technology.

Okay, so we have this issue of how much we can improve productivity with better tools or with some amount of data. But the part that everyone's also pointing out, besides the collaborative experience, is the ability to operationalize the models and get them into production, either in bespoke apps or packaged apps. How is that going to play out over time?

Well, I think two things you'll see. One, certainly in the near term, again with our collaboration with IBM and the Data Science Experience, one of the key things there is not just making the data scientists more collaborative, but also easing the way they can publish their models out into the wild, and so closing that loop to action is really important. I think longer term, what you're going to see, and I gave a hint of this a little bit in my keynote this morning, is that I believe in five years we'll be talking about scalability, but scalability won't be the way we think of it today: oh, I have this many petabytes under management.
That's a piece of it. But truly, scalability is going to be how many connected devices you have interacting and how many analytics you can actually push, from a model perspective, out to the sensor or out to the device to run locally. Why is that important? Think about it as a consumer with a mobile device: the time of interaction, your attention span, do you get an offer at the right time, and is that offer relevant? It can't be rules-based; it has to be model-based. There's no time for the electrons to move from your device, across the tower, in to run an analytic, and come back. It's going to happen locally. So scalability, I believe, is going to be determined in terms of the CPU cycles and the total interconnected IoT network that you're working in.

What does that mean for your original question? It means applications have to be portable and models have to be portable, so that they can execute out at the edge where required. That's obviously part of the key technology that we're working on with Hortonworks DataFlow and the combination of Apache NiFi, Apache Kafka, and Storm: how do I manage not only data in motion, but ultimately, how do I move applications and analytics to the data, and not be required to move the data to the analytics?

A question for you: you talked about real-time offers, for example. We talk a lot about predictive analytics, advanced analytics, data wrangling. What are your thoughts on preemptive analytics?

Well, I think that, wow, that sounds a little bit spooky, because we're kind of mind-reading. But I think those things can start to exist, right? Certainly, because we now have access to all of the data, and we have very sophisticated data science models that allow us to understand and predict behavior, the timing of real-time analytics or real-time offer delivery could actually, from a human being's perception, arrive before I'd even thought about it.
And isn't that really cool, in a way? Gee, I'm thinking I need to go do X, Y, Z; oh, here's a relevant offer, boom. So it's no longer, I click here, I click here, I click here, and in five seconds I get a relevant offer; before I even thought to click, I got a relevant offer. And again, to the extent that it's relevant, it's not spooky.

Right. If it's irrelevant, then you deal with all of the other kind of downstream impact.

So that, again, points to more and more data, and more and more accurate and sophisticated models, to make sure that that relevance exists.

Exactly. Well, Scott Gnau, CTO of Hortonworks, thank you so much for stopping by theCUBE once again. We appreciate your conversation and insights. For George Gilbert, I'm Lisa Martin. You're watching theCUBE live from day one of the DataWorks Summit in the heart of Silicon Valley. Stick around; we'll be right back.