Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. Welcome back to Boston, everybody. Seth Dobrin is here, he's the Vice President and Chief Data Officer of the IBM Analytics Organization. Great to see you, Seth, thanks for coming on. Great to be back, thanks for having me again. Oh, you're welcome. So, yeah, Chief Data Officer is the hot title. It was predicted to be the hot title, now it really is. Many more of you around the world, and IBM's got an interesting sort of structure of Chief Data Officers, can you explain that? Yeah, so there's a global Chief Data Officer, that's Inderpal Bhandari, and he's been on this podcast or video cast a few times. Right. And then he's set up structures within each of the business units in IBM where each of the major business units has a Chief Data Officer also. And so I'm the Chief Data Officer for the Analytics Business Unit. Yeah, so one of Inderpal's things when I've interviewed him is culture, the data culture, you've got to drive that in. And he talks about the five things that the Chief Data Officers really need to do to be successful. Maybe you could give us your perspective on how that flows down through the organization, and what are the key critical success factors for you and how are you implementing them? Yeah, so I agree, there's five key things, and maybe I frame them a little differently than Inderpal does, but there's this whole cloud migration. So every Chief Data Officer needs to understand what their cloud migration strategy is. Every Chief Data Officer needs to have a good understanding of what their data science strategy is. So how are they going to build deployable data science assets, and not data science assets that are delivered through spreadsheets? Every Chief Data Officer needs to understand what their approach to unified governance is.
So how do I govern all of my platforms in a way that enables that last point about data science? Then there's a piece around people. How do I build a pipeline for me, today and in the future? So the people piece is both skills and, presumably, a relationship with the line of business as well. There's sort of two vectors there, right? Yeah, the people piece, when I think of it, is really about skills. There's a whole cultural component that goes across all of those five pieces that I laid out. So finding the right people with the right skill set, where you need them, is hard. Can you talk about cloud migration? Why that's so critical and so hard? Yeah, so I think if you look at kind of where the industry's been, the IT industry, it's really been this race to the public cloud. And I think it's been a little misguided all along. I think if you look at how business is run, right? Today, enterprises that are not internet born make their money from what's running their businesses today, right? So these business critical assets. And just thinking that you can pick those up and move them to the cloud and take advantage of cloud is not realistic, right? So the race really is to a hybrid cloud. Our future really lies in how do I connect these business critical assets to the cloud? And how do I migrate those things to the cloud? So the CIO might say to you, okay, let's go there for a minute. I kind of agree with what you're saying. I can't just shift everything into the cloud, but what can I do in a hybrid cloud that I can't do in a public cloud? Well, so there's some drivers for that. So I think one driver for hybrid cloud is what I just said, right? You can't just pick everything up and move it overnight, it's a journey. And it's not a six-month journey, it's probably not a one-year journey, it's probably a multi-year journey. So you can actually keep running your business. So you can actually keep running your business, right?
And then the other piece is there's new regulations that are coming up, right? And of these regulations, EU GDPR is the biggest example right now. There are very stiff fines for violations of those policies. And the party that's responsible for paying those fines is the party that the consumer engaged with, right? It's you, it's whoever owns the business. And as a business leader, I don't know that I would very willingly trust a third party to manage that, just any third party to manage that for me, right? And so there's certain types of data that some enterprises may never want to move to the cloud, because they're not going to trust a third party to manage that risk for them. So it's more transparent from a governance standpoint, it's not opaque, you feel like you're in control. Yeah, you feel like you're in control, and if something goes wrong, it's my fault, right? It's not something that I got penalized for because someone else did something wrong, right? So at the data layer, help us sort of abstract one layer up, to the applications. How would you partition the applications? You know, the ones that are managing that critical data that has to stay on-premises, and then what would you build up potentially to complement it in the public cloud? Yeah, I don't think you need to partition applications, right? The way you build modern applications today, it's all API driven. You can reduce some of the cost of latency through design, right? And so you don't really need to partition the applications per se. I'm thinking more along the lines that the systems of record are not going to be torn out, you know, and those are probably the last ones, if ever, to go to the public cloud, but other applications leverage them. So I guess if that's not the right way of looking at it, you know, where do you add value in the public cloud versus what stays on-premises?
Yeah, so some of the system of record data, there's no reason you can't replicate some of it to the cloud, right? So if it's not personal information or highly regulated information, there's no reason that you can't replicate some of that to the cloud. And I think, you know, we get caught up in "we can't replicate data, we can't replicate data." I don't think that's the right answer. I think the right answer is to replicate the data if you need to, or if maybe the data in the system of record is not in the right structure for what I need to do, then let's put the data in the right structure. Now, let's not have the conversation about "I can't replicate data." Let's have the conversation about where's the right place for the data? Where does it make most sense and what's the right structure for it? And if that means you've got 10 copies of a certain type of data, then you've got 10 copies of a certain type of data. What you'd build on that data, would it typically be other parts of the systems of record that you might have in the public cloud, or would they be new apps, sort of greenfield apps? Yes. Okay. I think both. Okay. That's interesting. And in my mind, that question you just asked right there is one of the things that guides how you build your cloud migration strategy, right? So we said you can't just pick everything up and move it. So how do you prioritize, right? Well, you look at the new things you need to build to run your business differently, and you start there, and you start thinking about how do I migrate information to the cloud to support those? And maybe you start by building a local private cloud so that everything's close together until you kind of master it. And then once you get enough critical mass of data and applications around it, then you start moving stuff to the cloud. We talked earlier, off camera, about reframing governance. Yeah.
I remember I used to head a CIO consultancy, and we worked with a number of CIOs that were within legal IT, for example. And they were worried about compliance and governance and things of that nature. And their ROI was always "scare the board." Okay. And the holy grail was, can we turn governance into something of value for the organization? Can we? You know, I think in the world we live in today, with ever-increasing regulations, right? And with a need to be agile, and with everyone needing to and wanting to apply data science at scale, you need to reframe governance, right? Governance needs to be reframed from something that is seen as a roadblock, right? To something that is truly an enabler, right? And not just giving it lip service. And what do I mean by that, right? So for governance to be an enabler, you've really got to think about, all right, how do I classify my data upfront, so that all data in my organization is bucketed into, you know, some version of public, proprietary and confidential, right? Different enterprises may have 30 scales and some may only have two, right? Or some may have one. And so you do that upfront. And so you know what can be done with data, when it can be done and who it can be done with, right? You need to capture intent, right? So both what are the allowed intended uses of data and, as a data scientist, what am I intending to do with this data, so that you can then mesh those two things together? Because what's important in these new regulations I talked about is that people give you access to data, their personal data, for an intended purpose. And then you need to be able to apply these governance policies actively, right? So it's not passive, after the fact, where you've got to stop and you've got to wait. It's leveraging services, leveraging APIs, and building a composable system of policies that are delivered through APIs. So if I want to create a sandbox to run some analytics in, I'm going to call an API to get that data.
That API is going to call a policy API that's going to say, okay, does Seth have permission to see this data, right? Can Seth use this data for this intended purpose? If yes, the sandbox is created, right? If not, there's a conversation about really why does Seth need access to this data? And so it's really moving governance to actively enable me to do things. And it changes the conversation from, hey, it's your data, can I have it? To, are there really solid reasons why I can have the data? And then some potential automation around a sandbox that creates value. Absolutely. For sure, but I think the example you gave, public, proprietary or confidential, is still very governance-like. Where I was hoping you were going with the data classification is, and I think you referenced this, can I extend that schema, that nomenclature, to include other attributes of value, and can I automate it at the point of creation or use and scale it? Absolutely, that is exactly what I mean. I just used those three because those three are easy to understand. So I can give you, as a business owner, some areas where I would like to see a classification schema, and then you could automate that for me at scale, in theory? In theory, right? That's where we're hoping to go, is to be able to automate, and it's going to be different based on what business vertical, what industry vertical you're in, what risk profile your business is willing to take. So that classification scheme is going to look very different for a bank than it will for a pharmaceutical company or for a research organization. Well, and if I can then defensibly delete data, that's of real value to an organization. Well, and actually with new regulations, you need to be able to delete data, and you need to be able to know where all of your data is so that you can delete it. Today, most organizations don't know where all their data is.
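Editor's sketch: the active-governance flow described above, where a sandbox request calls a policy API that checks both data classification and intended use, might look roughly like the following in Python. Every name here (`Classification`, `Dataset`, `User`, `check_policy`, `request_sandbox`) is a hypothetical illustration chosen for this sketch, not an IBM product API, and the three-level classification simply mirrors the public, proprietary and confidential example from the conversation.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    # Assumed three-level scheme; real enterprises may use more or fewer levels.
    PUBLIC = 1
    PROPRIETARY = 2
    CONFIDENTIAL = 3


@dataclass(frozen=True)
class Dataset:
    name: str
    classification: Classification
    allowed_uses: frozenset  # intended uses the data owner has approved


@dataclass(frozen=True)
class User:
    name: str
    clearance: Classification  # highest classification this user may see


def check_policy(user: User, dataset: Dataset, intended_use: str):
    """Hypothetical policy API: may this user see this data for this purpose?"""
    if user.clearance.value < dataset.classification.value:
        return False, f"{user.name} is not cleared for {dataset.classification.name} data"
    if intended_use not in dataset.allowed_uses:
        return False, f"'{intended_use}' is not an allowed use of {dataset.name}"
    return True, "approved"


def request_sandbox(user: User, dataset: Dataset, intended_use: str) -> dict:
    """Hypothetical data API: consult the policy check before creating a sandbox."""
    allowed, reason = check_policy(user, dataset, intended_use)
    if not allowed:
        # Denial carries a reason, which starts the "why do you need this?" conversation.
        return {"created": False, "reason": reason}
    return {
        "created": True,
        "sandbox": f"sandbox-{user.name}-{dataset.name}",
        "reason": reason,
    }


if __name__ == "__main__":
    seth = User("seth", Classification.PROPRIETARY)
    sales = Dataset("sales", Classification.PROPRIETARY, frozenset({"churn_model"}))
    print(request_sandbox(seth, sales, "churn_model"))
    print(request_sandbox(seth, sales, "ad_targeting"))
```

The point of the design is that the decision is made at request time, from the classification and the declared intent, rather than by a manual review after the fact.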
And that problem is solved with math and data science or? I think that problem is solved with a combination of governance. Sure. And technology, right? Yeah, technology kind of got us into this problem. Yeah. Technology can get us out. Technology will get us out, yeah. On that technology subject, it seems like, with the explosion of data, not just volume but also many copies of the truth, you would need some sort of curation and catalog system that goes beyond what you had in a data warehouse. How do you address that challenge? Yeah, and that gets into what I said when you guys asked me about CDOs, right? What do they care about? One of the things is unified governance, right? And so the first piece of unified governance is having a catalog of your data that is all of your data, right? And it's a single catalog for your data, whether it's in one of your business critical systems that's running your business today, whether it's in a public cloud, whether it's in a private cloud, or in some combination of both. You need to know where all your data is. You also need to have a policy catalog that's single for both of those, because if you have more than one, policy catalogs like this fall apart by entropy. And the more you have, the more likely they are to fall apart. And so if you have one, and you have a lot of automation around it to do a lot of these things, so you have automation that allows you to go through your data and discover what data is where, and keep track of lineage in an automated fashion, keep track of provenance in an automated fashion, then we start getting into a system of truly unified governance that's active, like I said before. You know, there's a lot of talk about, you know, digital transformations. Of course, digital equals data. If it ain't data, it ain't digital.
So one of the things that, in the early days of the whole big data theme, you'd hear people say is, oh, you have to figure out how to monetize the data. And that seems to have changed and morphed into, you have to understand how your organization gets value from data. And if you're, you know, a for-profit company, it's monetizing something and seeing how data contributes to that monetization. If you're a healthcare organization, maybe it's different. I wonder if you could talk about that in terms of the importance, to the CDO specifically, of understanding how an organization makes money. Yeah, so I think you bring up a good point. Monetization of data and analytics is often interpreted differently, right? If you're a CFO, you're going to say, boy, you're going to create new value for me, right? I'm going to start getting new revenue streams. And that may or may not be what you mean. Just sell the data. It's not always so easy. I mean, it's not always so easy. It's hard to demonstrate value for data, you know, to sell it. There's certain types, like, you know, IBM owns The Weather Company. Clearly, people want to buy weather data, right? It's important. But if you're talking about, you know, how do you transform a business unit, that's not necessarily about creating new revenue streams. It's how do I leverage data and analytics to run my business differently? And maybe even what are new business models that I could never do before I had data and data science? Would it be fair to say that, as Dave was saying, there's the data side, and people were talking about monetizing that. But when you talk about analytics increasingly, machine learning specifically, it's a fusion of the data and the model and a feedback loop. Is that something where that becomes a critical asset? You know, I would actually say that you really can't generate a tremendous amount of value from just data.
You need to apply something like machine learning to it, right? And machine learning has no value without good data. And so you need to be able to apply machine learning at scale, right? You need to build deployable data science assets that run your business differently. So, for example, right, I could run a report that shows me how my business did last quarter, how my sales team did last quarter, or how my marketing team did last quarter, right? That's not really creating value. That's giving me a retrospective look at how I did, right? Where you can create value is, how do I run my marketing team differently? So what data do I have, and what types of learning can I get from that data that will tell my marketing team what they should be doing? And the ongoing process. And the ongoing process, right? And part of actually doing this catalog of your data and understanding your data is that you find data quality issues, right? And data quality issues are not necessarily an issue with the data itself or the people; they're usually process issues. And by discovering those data quality issues, you may discover processes that need to be changed, and in changing those processes, you can create efficiencies. So it sounds like you guys have got a pretty good framework, having talked to Inderpal a couple of times, and what you're saying makes sense. Do you have nightmares about IoT? Do I have nightmares about IoT? I don't think I have nightmares about IoT. I think, you know, IoT is really just a series of connected devices, is really what it is. In my talk tomorrow, I'm going to talk about, you know, hybrid cloud, and connected car is actually one of the things I'm going to talk about. And really, with a connected car, you just have a bunch of devices connected to a private cloud that's on wheels, right? And I think I'm less concerned about IoT than I am about, you know, people manually changing data, right?
IoT, you get data, you can track it, if something goes wrong, you know what happened, right? So I would say no, I don't have nightmares about IoT. If you do security wrong, well, that's a whole other conversation, right? So it sounds like you're doing security right, sounds like you've got a good handle on governance. I guess obviously scale is a key part of that. You could break the whole thing if you can't scale. But you're comfortable with the state of technology being able to support that, at least within IBM. I think, at least within IBM, I think I am. You know, like I said, a connected car, which is basically a bunch of IoT devices, a private cloud, how do we connect that private cloud to other private clouds or to a public cloud? You know, there's tons of technologies out there to do that, right? Spark, Kafka, those two things together allow you to do things that we could never do before, right? Can you elaborate, like in a connected car environment or some other scenario where it's a, other people call it the data center on wheels, you know, or in this case, you know, think of it as a private cloud, that's a wonderful analogy. How do Spark and Kafka on that very, very smart device cooperate with something on the edge, like the city, like buildings, versus in the cloud? Yeah, so, you know, if you're a connected car and you're this private cloud on wheels, right? You can't drive the car just on that information, right? I mean, you can't drive it just on the LiDAR, just on knowing how well the wheels are in contact. You need weather information, right? You need information about other cars around you. You need information about pedestrians. You need information about traffic, right? All of this information you get from that connection, right? And the way you do that is leveraging Spark and Kafka, right? So you could leverage Kafka as a messaging system. You could leverage Kafka to send the car messages or send the pedestrian messages.
Hey, you know, this car is coming, you shouldn't cross, right? Or, you know, vice versa, right? Get a car to stop because there's a pedestrian in the way, before even the systems on the car can see it, right? So if you can get that kind of messaging system in near real time, right? If I'm a pedestrian and I'm 300 feet away, the half a second that it would take for that to go through isn't that big of a deal, because you'll be stopped before you get there. What about, again, the intelligence, not just the data but the advanced analytics, where some of that would live in the car and some in the cloud? Is it just that you're making real-time decisions in the car and you're retraining the models in the cloud, or how does that work? No, I think some of those decisions would be done through Spark, right, in transit, right? And so one of the nice things about Spark is, you know, you can do machine learning transformations on data, think ETL, right? But think ETL where you can apply machine learning as part of that ETL. So I'm transferring all this weather data, positioning data, and I'm applying a machine learning algorithm for a given purpose in that car, right? So the purpose is navigation or, you know, making sure I'm not running into a building. And so that's happening in real time as it's streaming to the car. That's the prediction aspect that's happening in real time. Yes. But at the same time, you want to be learning from all the cars in your fleet and the ecosystem. And that happens. That would happen up in the cloud, right? I mean, I don't think that needs to happen on the edge. Maybe it does, but I don't think it needs to happen on the edge. And today, you know, while I said a car is a data center, you know, a private cloud on wheels, there's costs to the computation you can have on that car.
And I don't think the cost is quite low enough yet where it makes sense to do all that computation on the edge, right? So some of it you would want to do in the cloud. And plus, you would want to have all the information from as many cars in the area as possible. We're out of time, but just some closing thoughts. You know, they say, may you live in interesting times. Well, just to sum up some of the changes that are going on in the business: Dell buys EMC, IBM buys The Weather Company. And that gave you a huge injection of data scientists, and talk about data culture. Just last thoughts on that in terms of the acquisition and how that's affected your role. So I've only been at IBM since November. So all that happened before my role. Okay, so you inherited. So from my perspective, it's a great thing because, you know, before I got there, the culture was starting to change. And, you know, like we talked about before we went on air, the hardest part about any kind of data science transformation is the cultural aspects. All right, well, Seth, thanks very much for coming back on theCUBE. Good to have you again. Thanks for having me again. You're welcome. All right, keep it right there, everybody, we'll be back with our next guest. This is theCUBE, we're live from Spark Summit in Boston. We'll be right back.