Live from Cambridge, Massachusetts, it's theCUBE at the MIT Chief Data Officer and Information Quality Symposium with hosts Dave Vellante, Jeff Kelly and Paul Gillin. Hi everybody, welcome to Cambridge, Massachusetts. This is Dave Vellante with Paul Gillin and Jeff Kelly and we're here at the MIT Information Quality Symposium. This is day two for us. This is the second year we've been at this symposium. It's a chief data officer event and probably the premier event in the industry, bringing together thought leaders and chief data officers and data governance gurus. The emergence of the chief data officer has clearly been occurring over the last several years, particularly within highly regulated businesses like financial services and healthcare and government. The percentage of organizations that had CDOs three years ago was in the single digits. Last year it was in the low teens. This year, survey data suggests it's up over 20%. And so the role is, as I say, very clearly emerging, particularly within those industries. It's somewhat bleeding into more general commercial industries. The other aspect here, the other dimension, is that the CDO generally does not report into the CIO. We had a CIO on yesterday from Partners Healthcare who said that he sees the CIO role actually morphing into two roles, one a COO, an operational role, and the other a CTO, the technical role, with the CDO emerging and reporting to the COO. So that's quite interesting to hear from a CIO. Paul Gillin, we were talking before this event, and we were sort of questioning, will the CDO role be around in 10 years? So that was quite an interesting comment yesterday. Well, I think it's interesting. I mean, what is information but data, right? So we've got a chief information officer, a chief data officer, data equals information. Why do we need two people to handle those roles? I think what it really means is that the CIO role, as we've known it, is basically an infrastructure role. 
It's a technology and software role and the data officer is about data. In fact, the CDO may be what the CIO role ultimately needs to evolve into. Well, and we've been talking this morning off camera about the whole notion of data governance, and Jeff Kelly, this is coming up a lot. These projects that are being run, these big data projects, there's really not a lot of oversight as it relates to data governance. You know, it's okay, let's spin up a Hadoop cluster, let's see if we can get people to click on our ads, or let's do some experimentation, let's do some sandboxing. What are you seeing in terms of how these projects relate to the overall corporate governance structure? Unfortunately, often they don't. I think what we've seen over the years, not necessarily around big data, but just data generally, is that data and analytics projects have often been siloed in different departments or sometimes even work groups within departments. And there's a lot of experimentation going on and not a lot of oversight at a corporate level, at an enterprise level. So tying corporate governance policies to specific projects is difficult in that sense. And then, of course, you've got kind of the shifting landscape around compliance regulations, things like that that are changing all the time, and things that are not yet settled. I mean, the public at large is just starting to understand the implications of all of these free social services, Facebook, Twitter, whatever, and the implications of using services like that and the data that's going out, the data that they're essentially giving away with their permission with these terms of service, which are pages long, but it's in there. And so how do you use all that information ethically and legally? I don't think there's a lot of oversight right now in most enterprises around that. 
And most organizations are still treating these as kind of siloed projects and have yet to take a kind of larger corporate-wide view of some of these governance policies. Something I want to add to what Jeff said, because that's important: there's been a lot of talk at this conference about unstructured data. That seems to be the big theme I'm hearing. And it relates to the social networks and who owns that data and how do you capture that data and how do you use it? And we tend to think of data as being structured. It's something that's captured in databases. But in fact, 80 to 90% of the data most organizations have is not structured. And so the big governance problem I think is going to be how do we make sense of that unstructured data, how do we define policies for capturing it, making sure that it's valid, archiving it, and using it. Yeah, and we had a conversation with Joe McGuire yesterday and he said that essentially, he really didn't use the term best practice, but he said a technique that companies are using with that unstructured data is putting a layer of metadata on top of it, which is, or can be, structured. But the real holy grail, he was saying, is the data itself, the information inside the tweet. And that's where the industry is sort of struggling. Now, we had a conversation about whether it was classification technologies or using search as a blunt instrument, but this seems to be a major challenge for the industry. Jeff, I want to come back to you. You just conducted a major survey of big data practitioners, your biannual survey. Two times a year you do this survey. Anything in there that is relevant to this discussion or the CDO discussion that you can share with us? Absolutely. Part of this particular survey was around what are some of the biggest barriers or challenges associated with moving big data analytics projects from POC into production. 
And one of the biggest barriers from a non-technical perspective is confusion around privacy and compliance issues. People just don't quite understand what their responsibilities are, necessarily. And they don't want to get into legal trouble, and even if it's not legal trouble, they don't want to do something unethical that's going to get them bad PR. So it's one thing to build an application in kind of a proof-of-concept environment where you're taking data from multiple sources, could be customer data, could be sensitive data, coming up with an application that maybe targets particular users for an ad or for an offer. Then you want to move that to production, and you're getting other people in the organization involved, and you're getting the stop sign saying, wait a minute, we're not sure if this is something that we can do legally or ethically. I think it's not so much that a lot of these applications are necessarily going to cross legal or ethical lines, but it's just a lack of understanding about what those lines are. There's still a lot of confusion and I think that's really holding up a lot of these projects. Well, and it speaks to the lack of some kind of data czar. Whose responsibility is it to provide that direction? 
If they took that into account before they started some of the POCs, you can build some of those frameworks, some of those safeguards against crossing some of these lines, into the POC environment so that when the time comes to go into production and scale this thing out, you don't necessarily have as many of those questions to answer. But you've got a lot of these experiments taking place in the corners of the organization, and then they bring them to the wider organization to go into production, like, look at this great application we could use to drive business, and then you're getting people in maybe the finance department, maybe the risk department, saying, wait a minute, have you thought about the implications of this? Well, but I mean these initiatives, these POCs and big data, they were being driven by the cowboys. They're P&L managers, they've got budget, they're ahead of plan, they don't care, they're trying to reach new revenue opportunities, and maybe the whole compliance and security piece is a bolt-on at the end when it goes into production. The question to you guys is, can it be a bolt-on? Is that a viable approach? No, and I think that the CDO's job fundamentally has got to be not about owning data but about owning governance. CDO is a process job. It's a job about defining how data is handled and making sure that those rules are followed by everyone who needs to be involved. So the important thing is not whether compliance is bolted on the back end, it's for people to understand what compliance means and how the data that they work with relates to compliance issues. So I want to shift gears here, talk a little bit about the morning. You guys were in the keynote. Paul in particular, you had a conversation with Bill Inmon, going back probably 30 years now. Bill Inmon, who it turns out I gave his first article in Computerworld 30 years ago, and he says I launched his career. That was an amazing thing to hear. He remembered it well. 
He's the father of data warehousing, author of 52 books, and we'll be delighted to have him on this afternoon. So now I don't know if he heard the disclaimer yesterday, which was, hey guys, and I don't know if Rich repeated it this morning, but if you don't want it tweeted, don't say it. I heard there were some pretty controversial things in his keynote. He was talking about working with BP. He worked with BP before the Gulf oil spill and after, and he said these guys didn't care about safety before Deepwater, they don't care about safety today. And he said that to an audience of 300 people. So- He said they still don't care about safety, it's interesting. They still don't care at all about safety. So I assume he's not gonna get a lot more consulting work from BP right now, but it was a pretty powerful thing for him to say about such a big company. Oh, that's- Very, very definitive. What else did he put forth? I mean, big question, Jeff Kelly, is the data warehouse as we know it a dinosaur? Right, well, it'll be great to get his perspective, being the father of the data warehouse. What do you think? What I think we can agree on is that the approach taken for the last 10, 20 years has largely failed to live up to expectations. I think we can agree on that. That kind of centralized, pristine enterprise data warehouse, where you're going to bring all your data from operational systems into this one environment with a rather inflexible data model. And basically, if you want to make any changes to it, it takes time. You want to add new data sources, you want to ask new questions, it's just not really something you can do in an iterative way. So I think we can agree that it hasn't lived up to expectations. Whether it's a complete do-over, that's a great question to ask him. I'd love to get his opinion. 
In terms of what he talked about in his keynote, he's talking about some of the more advanced things he's doing around text analytics, really interesting stuff around taking really dense, text-based information and making sense of it. And he talked about the difference between just understanding the content and understanding the context. He said that understanding the content of a textual piece of information is pretty easy and anybody can kind of do that: identifying some of the subjects, some of the people, some of the actions, what's taking place in terms of the content. What that actually means, the context, that's the challenging part. And I'll be really interested to get him on theCUBE and talk a little bit about how he's attacking that problem. An anecdote: after his talk, he was speaking with some people who came up to visit with him, and someone asked, what's a project that you were unable to do? And he said he was approached by a company that does training and they asked him to come up with a way to grade student papers using his text analytics and context, as Jeff mentioned. He said, we couldn't do it, we just couldn't figure out a way to do it. So it all goes back to this unstructured data problem. The big hairy issue that we still have to solve is how to make sense of unstructured data. Yeah, so it's not an easy problem to solve. I know we've dabbled with it with our Twitter products. I want to stay on the theme of the data warehouse. So Jeff, you said it failed to live up to expectations and I, of course, have been pretty vocal and somewhat critical of the data warehouse not living up to those expectations in its vision, the 360 degree view of the business and real time decision making, et cetera. I think much of that criticism is pointed toward vendor marketing. 
Essentially, the data warehouse after the Enron debacle became this sort of compliance engine, this reporting system that CFOs had to use as part of their compliance edict, and it sort of saved the day in a way. But the marketing was always around a lot of the things you're hearing about Hadoop now with real time. My question is, if you look at your survey data, as I recall, a huge percentage of customers said that they have already shifted resources from their traditional enterprise data warehouse into their Hadoop projects, and another huge percentage said they would do so by the end of the year. I mean, I think it was well over 90% of the sample that said that by the end of this year they will have shifted some resources, significant or not, I don't know, from EDW to Hadoop. True? And what do you make of that? Well, right, there's some nuance in that. So, about 60%, or a little more than 60%, have shifted one workload or another from a data warehouse or existing mainframe into Hadoop. And the majority of those are more around the data transformations, the T in the ETL process or ELT process. They're using the power of Hadoop, and the fact that it's much less expensive than something like an enterprise data warehouse from a Teradata or an Oracle, to do some of that grunt work, some of that really large scale transformation work. So that's not necessarily the really high value workloads, but it makes sense, as an initial project, to move some of those workloads to something like Hadoop. Now the second most frequently moved workload from data warehouses to Hadoop has been around BI reporting. Now it's significantly far behind transformations, but nevertheless that tells me that's kind of the direction we're moving. Organizations are starting to experiment a little bit with moving some of those core reporting functions to a much less expensive platform such as Hadoop. 
So what does that mean for the data warehouse business? I think that's an illustration of the frustration that we were just talking about, that the data warehouse has failed to live up to expectations. So you're seeing it when there's an opportunity to move some of these workloads to a much less expensive platform, and to one that has a lot more long-term potential: once you get that data in there, maybe you're doing some more traditional reporting against it, but wow, what are the other things you could potentially do with it now that you have it in a more flexible, scalable platform? I want to give you my perspective on this. I don't know if you guys remember, Paul, you may, Jeff, you may or may not, but when the whole Walmart beer next to the diapers story came out, everybody was sort of salivating over the enterprise data warehouse. So very clearly, there was value to be extracted from investments in data warehouses. And I think the big question I have is, were those competitive advantages sustainable? Or did the data warehouse business become essentially table stakes that you had to invest in to maintain some degree of parity with your competitors? Because when you talk to data warehouse practitioners, you hear the same story. It's like a snake swallowing a basketball with data. It takes too long, it's too complicated to build cubes. We're chasing chips. Every time Intel comes out with a new microprocessor, we have to go buy it because it'll speed up our processes by a little bit. But the time it takes for us to actually get insights out of the data warehouse is too long, it's too complicated, it's too expensive, yet we still do it. So I question whether or not that competitive advantage that occurred with the enterprise data warehouse was sustainable, and then the obvious question there is, will it be with Hadoop? Well, I certainly think there was advantage that came out of data warehouses. 
And if you look at something like Target stores and their innovations in targeting customers, in customer segmentation, and many other retailers do this, Best Buy, they have really used warehousing to its greatest effect in marketing. I think that's where we've probably seen some of the greatest impact of data warehouses, in marketing and customer segmentation. Amazon, look at what they're doing. They're not calling it a data warehouse, but it's developing one-to-one profiles, individualized profiles of their customers, and knowing their buying history and their patterns, being able to target promotions and cross-selling opportunities at them. I think that in retail and in consumer applications, we have seen a lot of payoff from data warehousing. Where I question it is in banking, in healthcare, in manufacturing, big industries. I haven't seen examples of data warehousing having a big payoff. Well, certainly take, I think, financial services in terms of data, certainly with marketing, risk, fraud detection, you're seeing that today. I mean, those are the big three, right, Jeff? That's what you're seeing with so-called big data and Hadoop, what Abhi Mehta calls the building of the data pipeline or the data factory. You're seeing impacts there. So my earlier question is, will we see sustainable competitive advantage in Hadoop and big data? Or is it going to be sort of like the data warehouse all over again? Well, it's sustainable as long as you, as an organization, are continuing to innovate. And I think it goes back to something Dr. Watson was talking about yesterday about digitizing your assets. And continuing to do that as new types of assets are built, whether that's physical assets, putting sensors on all your products or your equipment, or as he discussed, parking spaces. Who would have thought to digitize a parking space? 
So as long as you continue to innovate. I think when you talk about Hadoop and big data, that's just an enabler. That's a platform, that's a technology, it's an enabler. It's what you do with it. And if you continue to innovate and you build a culture of always looking for new ways to exploit the value inherent in your data, then yes, you can sustain an advantage over those organizations, like the ones we talked about yesterday a little bit in Major League Baseball, the teams that kind of just throw technology at the problem. If you can build that culture of always looking to exploit your data to its fullest, then yes, these technologies can enable you to maintain a sustainable advantage. So with Moneyball, it looks like the Oakland A's have sustained their competitive advantage, at least for now. So yeah, it's the process and the people more so than the technology, is what I'm hearing. So we're going to unpack these and other issues today with real practitioners. We're going to move the pundits aside and we're going to drill into the heads of the folks on the front lines that are actually doing this stuff in the field within financial services, within healthcare, within government. So stay tuned, Paul and Jeff and I will be back. This is theCUBE, we're live from the MIT Information Quality Symposium. This is SiliconANGLE's theCUBE, we'll be right back.