Live from New York City, it's theCUBE. Covering IBM Data Science for All, brought to you by IBM.

Welcome to the Big Apple. John Walls and Dave Vellante here on theCUBE. We are live at IBM's Data Science for All, and we're going to be here throughout the day, with a big panel discussion wrapping up our day. So be sure to stick around all day long on theCUBE for that. Dave, always good to be here in New York, is it not?

Well, it's been kind of the data science weeks, months. Last week we were in Boston at an event with the Chief Data Officer Conference. All the Boston data crowd was there. Now we bring it all down to New York City, getting hardcore with data science. So it's from chief data officers to hardcore data scientists.

The CDO, a hot term right now. Daniel Hernandez now joins us, our first guest here at Data Science for All, who's a VP of IBM Analytics. Good to see you. Daniel, thanks for being with us.

Pleasure.

First off, your take. Let's just step back to a high level here. Data science has certainly been evolving for decades, if you will. First off, how do you define it today? And then, just from the IBM side of the fence, how do you see it in terms of how businesses should be integrating this into their mindset?

So the way that I describe data science simply to my clients is: it's using the scientific method to answer questions or deliver insights. It's kind of that simple. Or answering questions quantitatively. So it's a methodology, it's a discipline, it's not necessarily tools. That's the way I approach describing what it is.

Okay, and then from the IBM side of the fence, in terms of how wide of a net you're casting these days, I assume it's as big as you can get your arms out.

So if you think about any particular problem that's a data science problem, you need certain capabilities. We happen to deliver those capabilities. You need the ability to collect, store, and manage any and all data.
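That "answering questions quantitatively" framing can be made concrete with a toy example. This is purely an illustration, not IBM tooling; the scenario, names, and numbers are hypothetical: pose a question (does variant B convert better than A?) and answer it with a measured quantity.

```python
# Toy illustration of answering a business question quantitatively.
# All names and numbers here are hypothetical.

def conversion_rate(conversions, visitors):
    """Fraction of visitors who converted."""
    return conversions / visitors

def lift(rate_a, rate_b):
    """Relative improvement of B over A."""
    return (rate_b - rate_a) / rate_a

rate_a = conversion_rate(120, 2400)   # variant A: 5.0%
rate_b = conversion_rate(156, 2400)   # variant B: 6.5%
print(f"A={rate_a:.3f} B={rate_b:.3f} lift={lift(rate_a, rate_b):.1%}")
```

The point is the method, not the arithmetic: state the question, measure, and let the numbers answer it.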
You need the ability to organize that data so you can discover it and protect it. You've got to be able to analyze it: automate the mundane, explain the past, predict the future. Those are the capabilities you need to do data science. We deliver a portfolio of it, including, on the analyze part of our portfolio, the data science tools that we would declare as such.

So Data Science for All is very aspirational. And when you guys made the announcement of the Watson Data Platform last fall, one of the things you focused on was collaboration between data scientists, data engineers, quality engineers, application development, that whole chain. And you made the point that most of the time data scientists spend is on wrangling data. You're trying to attack that problem, and you're trying to break down the stovepipes between those roles I just mentioned. All of that has to happen before you can actually have data science for all, not just data science for hardcore data people. Where are we in terms of the progress that your clients have made in that regard?

So I would say there are two major vectors of progress we've made. If you want data science for all, you need to be able to address people that know how to code and people that don't know how to code. If you consider the history of IBM in the data science space, especially with SPSS, which has been around for decades, we were mastering and solving data science problems for non-coders. The Data Science Experience really started with embracing coders: developers that grew up in open source, that lived and learned Jupyter or Python and were more comfortable there. And integration of these is our focus. So that's one aspect: serving the needs of people that know how to code and those that don't in the data science role.
And then "for all" means supporting the entire analytics lifecycle: from collecting the data you need in order to answer the question you're trying to answer, to organizing the information once you've collected it so you can discover it inside of tools like our own Data Science Experience and SPSS, and then of course the set of tools around exploratory analytics, all integrated so that you can do that end-to-end lifecycle.

So where clients are, I think they're getting certainly much more sophisticated in understanding that. Most people have approached data science as a tool problem, or as a data prep problem. It's a lifecycle problem. And that's how we're thinking about it. We're thinking about it in terms of: all right, if our job is answering questions and delivering insights through scientific methods, how do we decompose that problem into a set of things people need to get the job done, serving the individuals that have to work together?

And when you think about it, go back to the days when the data warehouse was king. That's something we talked about in Boston last week. It used to be the data warehouse was king. Now the process is much more important, but back then very few people had access to that data. You had the elapsed time of getting answers and the inflexibility of systems. Has that changed? And to what degree has it changed?

I think if you were to go and ask anybody in business whether or not they have all the data they need to do their job, they would say no. Why? We've invested in EDWs, we've invested in Hadoop. Sometimes the problem might be, I just don't have the data. Most of the time it is: I have the data, I just don't know where it is. So there's a pretty significant issue around data discoverability. I might have data in my operational systems. I might have data inside my EDW, but I don't have everything inside of my EDW. I've stood up one or more data lakes.
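The collect, organize, analyze lifecycle Hernandez describes can be sketched in a few lines. This is a minimal toy sketch, not IBM's implementation; the dataset, asset name, and tags are hypothetical, but the shape — raw data in, a discoverable catalog entry in the middle, a quantitative answer out — mirrors the lifecycle he lays out.

```python
# A minimal sketch of the collect -> organize -> analyze lifecycle.
# The data, asset name, and tags are hypothetical.

import csv, io, statistics

RAW = """customer_id,region,spend
1,NE,120.0
2,NE,80.0
3,SW,200.0
"""

def collect(text):
    """Collect: parse raw data into records."""
    return list(csv.DictReader(io.StringIO(text)))

def organize(records):
    """Organize: wrap the records in a discoverable 'catalog' entry."""
    return {"name": "customer_spend", "tags": ["customer", "spend"],
            "rows": records}

def analyze(asset):
    """Analyze: answer a question quantitatively -- average spend."""
    return statistics.mean(float(r["spend"]) for r in asset["rows"])

asset = organize(collect(RAW))
print(analyze(asset))   # average spend across the three customers
```

In a real stack each stage is a separate governed system; the point here is only that the stages hand off to one another, which is why tooling each one in isolation misses the lifecycle problem.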
And to solve my problem, like customer segmentation, I have data everywhere. How do I find it and bring it in?

That seems like it should be a fundamental consideration, right? You're going to gather this much more information and make it accessible to people. And if you don't, it's a big flaw. It's a big gap, isn't it?

Yeah, so, yes. And I think part of the reason why is because governance professionals, and I am one, you know, I've spent quite a bit of time trying to solve governance-related problems. We've been focusing pretty maniacally on the compliance, regulatory, and security-related issues. Like, how do we keep people from going to jail? How do we ensure regulatory compliance for things like e-discovery and records, for instance? And it just so happens that the same disciplines you use there, even though in some cases in lighter-weight implementations, are what you need in order to solve this data discovery problem. So the discourse around governance has historically been about compliance, about regulations, about cost takeout, not analytics. And so a lot of our time, certainly in R&D, is spent trying to solve that data discovery problem, which is: how do I discover data using the semantics that I have, which, as a regular user, are not physical understandings of my data? And once I find it, how am I assured that what I get is what I should get, so that I'm not subject to compliance-related issues, and not making the company more vulnerable to data breach?

Well, so presumably part of that, anyway, involves automating classification at the point of creation or use, which actually was a technical challenge for a number of years. Has that challenge been solved, in your view?

I think machine learning is solving it, and in fact, later on today I will be doing some demonstrations of technology which will show how we're making the application of machine learning easy.
Inside of everything we're doing, we're applying machine learning techniques, including to classification problems that help us solve that discovery problem. So it could be, you know, we're automatically harvesting technical metadata. Are there business terms that could be automatically extracted that don't require some data steward to have to know and assert them? Or can we automatically suggest terms and still have the steward confirm, for a case where I need a canonical data model, so I don't want the machine to tell me everything, but I want the machine to assist the data curation process? We are not just exploring the application of machine learning to solve that data classification problem, which historically was a manual one. We're embedding it into most of the stuff that we're doing; often you won't even know that we're doing it behind the scenes.

So that means that oftentimes the machines, ideally, are making the decisions as to who gets access to what, and that's helping at least automate that governance. But there's a natural friction that occurs, and I wonder if you could talk about the balance sheet, if you will, between information as an asset and information as a liability. The more restrictions you put on that information, the more it constricts a business user's ability. So how do you see that shaping up?

I think it's often a people and process problem, not necessarily a technology problem. I don't think, as an industry, we've figured it out. Certainly a lot of our clients haven't figured out that balance. I mean, there are plenty of conversations I'll go into where I'll talk to a data science team in the same line of business as a governance team. And what the data science team will tell us is: I'm building my own data catalog, because the stuff that the governance guys are doing doesn't help me.
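The "suggest, then let the steward confirm" idea can be sketched very simply. This is a toy sketch, not IBM's implementation: real systems would use trained classifiers over harvested metadata, where this one just scores glossary terms by token overlap with a column name. The glossary, column names, and threshold are all hypothetical.

```python
# Toy sketch of machine-assisted curation: rank business-glossary
# terms for a column so a data steward confirms suggestions instead
# of asserting everything by hand. Glossary and names are hypothetical.

GLOSSARY = {
    "Customer Identifier": {"cust", "customer", "id"},
    "Email Address": {"email", "mail"},
    "Annual Revenue": {"revenue", "annual", "sales"},
}

def suggest_terms(column_name, glossary=GLOSSARY, threshold=0.5):
    """Rank glossary terms by token overlap with the column name."""
    tokens = set(column_name.lower().replace("-", "_").split("_"))
    scored = []
    for term, keywords in glossary.items():
        score = len(tokens & keywords) / len(tokens)
        if score >= threshold:
            scored.append((term, score))
    return sorted(scored, key=lambda ts: -ts[1])

print(suggest_terms("customer_id"))   # suggests "Customer Identifier"
```

The design point matches the conversation: the machine proposes ranked candidates, and a human steward stays in the loop for cases like a canonical data model, where the machine should assist rather than decide.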
And the reason it doesn't help me is because they're going through this top-down data curation methodology, and I've got a question, I need to go find the data that's relevant, and I might not know what that is straight away. So the CDO function in a lot of organizations is helping bridge that. You'll see governance responsibilities line up with the CDO, with analytics. And I think that's gone a long way to bridge those gaps, but the conversation I was just mentioning is not unique to one or two customers. A lot of customers are still having it, often customers that either haven't started a CDO practice or are in the early days of one.

So, about that, because this is being introduced into the workplace as a fairly new concept, right? A CDO, as opposed to a CIO or CTO or these other roles. How do you talk to your clients about trying to broaden their perspective on that? And, I guess, emphasizing the need for them to consider putting somebody with sole responsibility, primary responsibility, for their data, instead of just lumping it in somewhere else?

So we happen to have one of the best CDOs inside of our group, which is like a handy tool for me. So if I go into a client and, you know, it's purporting to be a data science problem, and it turns out they have a data management issue around data discovery, and they haven't yet figured out how to install the process and people design to solve that particular issue, one of the key things I'll do is bring in our CDO and his delegates to have a conversation with them around what we're doing inside of IBM and what we're seeing in other customers, to help institute that practice inside of their own organization.
We have forums like the CDO event in Boston last week, which are designed, you know, not to be "here's what IBM can do in technology," but to say: here's how the discipline impacts your business, and here are some best practices you should apply. So if I enter into those conversations and find that there's a need, I'm typically like: all right, tools are part of the problem, but not the only issue. Let me bring someone in who can describe the people and process related issues, which you've got to get right in order for, in some cases, the tools that I deliver to even matter.

We had Seth Dobrin last week in Boston, and Inderpal Bhandari as well, and he sort of put forth this enterprise data blueprint, if you will, for CDOs.

We're using that in IBM, by the way.

Well, this is the thing. It's a really well-thought-out structure that seems to be trickling down to the divisions, and so it's interesting to hear how you're applying Seth's expertise. I want to ask about the Hortonworks relationship. You guys made a big deal about that this summer. To me, it was a no-brainer. Really, what was the point of IBM having a Hadoop distro? And Hortonworks gets this awesome distribution channel. IBM has always had an affinity for open source, so that made sense there. What's behind that relationship, and how's it going?

It's going awesome. So perhaps what we didn't say, and we probably should have focused on, is the why-customers-care aspect. There are three main buying occasions, use cases, that customers are implementing where, even before the relationship, they were already asking IBM and Hortonworks to work together. And so we were coming to the table working together as partners before the deeper collaboration we started in June. The first one was bringing data science to Hadoop: running data science models, doing data exploration where the data is.
And if you were to rewind the clock on the IBM side and consider what we did with Hortonworks in full consideration of what we did prior: we brought the Data Science Experience and machine learning to Z in February. The highest-value transactional data was there. The next step was to bring data science to what, for a lot of clients, is the second most valuable set of data, which is Hadoop. So that was part one. And then we've continued that by bringing the Data Science Experience to private cloud.

So that's one use case: I've got a lot of data, I need to do data science, and I want to do it in residence. I want to take advantage of the compute grid I've already laid down, and I want to take advantage of the performance benefits and the integrated security and governance benefits of having these things co-located. That's play one. So we're bringing the Data Science Experience and HDP and HDF, which are the Hortonworks distributions, way closer together and optimized for each other. Another component of that is that not all data is going to be in Hadoop, as we were describing. Some of it's in an EDW, and that data science job is going to require data outside of Hadoop. We brought in Big SQL, which was already supporting Hortonworks, and we optimized the stack. So the combination of the Data Science Experience and Big SQL allows you to do data science against a broader surface area of data. That's play one.

Play two is: I've got an EDW, and for cost or agility reasons I want to augment it, or in some cases I might want to offload some data from it to Hadoop. The combination of Hortonworks plus Big SQL and our data integration technologies is a perfect combination there, and we have plenty of clients using that for analytics offloading from the EDW.

Then the third piece, which we're doing quite a bit of engineering and go-to-market work around, is governed data lakes. I want to enable self-service analytics throughout my enterprise.
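The federation idea in play one — a single query spanning Hadoop and the EDW — can be sketched with Python's built-in sqlite3 standing in for Big SQL. This is only a conceptual sketch: two separate in-memory databases play the roles of the Hadoop lake and the EDW, and the table names and data are hypothetical.

```python
# Sketch of query federation, with sqlite3 standing in for Big SQL:
# one SQL statement joins data living in two separate stores.
# Table names and data are hypothetical.

import sqlite3

lake = sqlite3.connect(":memory:")           # stand-in for the Hadoop lake
lake.execute("CREATE TABLE clicks (cust_id INT, clicks INT)")
lake.execute("INSERT INTO clicks VALUES (1, 40), (2, 7)")

lake.execute("ATTACH DATABASE ':memory:' AS edw")  # stand-in for the EDW
lake.execute("CREATE TABLE edw.customers (cust_id INT, name TEXT)")
lake.execute("INSERT INTO edw.customers VALUES (1, 'Acme'), (2, 'Zenith')")

# A single query spans both stores -- the essence of federation.
rows = lake.execute("""
    SELECT c.name, k.clicks
    FROM edw.customers AS c
    JOIN clicks AS k ON k.cust_id = c.cust_id
    ORDER BY k.clicks DESC
""").fetchall()
print(rows)  # [('Acme', 40), ('Zenith', 7)]
```

The value proposition Hernandez describes is exactly this shape at enterprise scale: the data science job writes one query, and the engine worries about where each table physically lives.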
I want to give self-service analytics tools to everyone that should have access to them. I want to make data available to them, but I want that data to be governed, so that they can discover what's in the lake, and whatever I give them is what they should have access to. So those are the three tracks we're working with Hortonworks on, and all of them are producing stunning results inside of clients.

And so that involves actually some serious engineering as well. It's not just a Barney deal or a pure go-to-market.

It's certainly more than that; we're architecting it so it just works.

Big picture, down the road then: what are the challenges that you see in your side of the business for the next 12 months? I mean, what are you going to tackle? What's that monster out there where you think, okay, this is our next hurdle to get by?

I forget if Rob said this before, but you'll hear him say it often, and it's statistically proven: the majority of the data that's available is not available to be Googled. It's behind the firewall. And so we started last year with the Watson Data Platform, creating an integrated data and analytics system. What if customers have data that's on-prem that they want to take advantage of? What if they're not ready for the public cloud? How do we deliver public cloud benefits to them when they want to run that workload behind the firewall? So we're doing a significant amount of engineering, really starting with the work we did on the Data Science Experience, bringing it behind the firewall but still delivering the benefits you would expect if we were delivering it in the public cloud. A major advancement that IBM made is around IBM Cloud Private. I don't know if you guys are familiar with that announcement we made; I think it was already two weeks ago.

Yeah, great.

Right, so it's a Kubernetes foundation, on top of which we have several microservices, on top of which our stack is going to be made available.
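The governed-data-lake principle — "whatever I give them is what they should have access to" — can be sketched as a policy check applied at access time. This is a toy sketch, not IBM's mechanism; the roles, policy table, and data are hypothetical, and real governance engines enforce far richer rules (lineage, purpose, consent) than column masking.

```python
# Toy sketch of governed self-service access: the catalog returns only
# the columns a user's role permits, masking the rest.
# Roles, policy, and data are all hypothetical.

POLICY = {                       # role -> columns visible in the clear
    "data_scientist": {"cust_id", "region", "spend"},
    "analyst": {"region", "spend"},
}

def governed_view(rows, role, policy=POLICY):
    """Return rows with unpermitted columns masked for this role."""
    allowed = policy.get(role, set())
    return [{col: (val if col in allowed else "***")
             for col, val in row.items()} for row in rows]

ROWS = [{"cust_id": 1, "region": "NE", "spend": 120.0}]
print(governed_view(ROWS, "analyst"))
# [{'cust_id': '***', 'region': 'NE', 'spend': 120.0}]
```

The design choice this illustrates is that discovery and enforcement share one policy: the same metadata that lets a user find data in the lake decides what of it they actually receive.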
So when I think of where the future is: our customers, we believe, ultimately want to run data and analytics workloads in the public cloud. How do we get them there, considering they're not there now, in a step-wise fashion that is sensible economically, project-management-wise, and culturally, without them having to wait? That's the big-picture problem space where we're spending considerable time.

And we've been talking a lot about this on theCUBE in the last several months, even years: people realize they can't just take their business and stuff it into the cloud. They have to bring the cloud model to their data, wherever that data exists; if it's in the cloud, great. And the key there is you've got to have a capability and a solution that substantially mimics that public cloud experience. That's what you guys are focused on.

What I tell clients is: if you're ready for certain workloads, especially greenfield workloads, and the capability exists in the public cloud, you should go there now, because you're going to want to go there eventually anyway. And if not, then a vendor like IBM helps you take advantage of that behind the firewall, often in form factors that are ready to go. The Integrated Analytics System, I don't know if you're familiar with that, includes our super-advanced data warehouse, the Data Science Experience, and our query federation technology powered by Big SQL, all in a form factor that's ready to go. You get started there for data and data science workloads, and that's a major step in the direction of the public cloud.

Well, Daniel, thank you for the time. We appreciate it. Didn't get to touch at all on baseball, but next time, right? Go Cubbies.

Yeah, right. Sore spot with me, but it's all right, go Cubbies.

All right, Daniel Hernandez from IBM. Back with more here from Data Science for All, IBM's event here in Manhattan. Back with more on theCUBE in just a bit.

All right, thanks a lot.
Thank you.