Hi everybody, we're back. This is Dave Vellante, I'm with Wikibon.org. And I'm here with my co-host, Jeff Kelly, who is the lead big data, and now data quality, analyst at Wikibon. Jeff, of course, has done a great job. You've seen his manifesto. You've seen his market studies. And we've been covering this space, theCUBE, SiliconANGLE, Wikibon. We've been covering this space for a number of years now. But what we haven't done, and the reason why we're so excited to be here at MIT, is really focus on some of the challenging practitioner issues around data quality. And in particular, the role of the chief data officer, which is emerging as an increasingly important role, especially in data-driven organizations. Stuart Madnick is here. He's a professor of engineering systems at MIT's Sloan School of Management. Stuart, welcome to theCUBE. Thank you, pleasure to be here, meet you. Yeah, great to see you. Yesterday you gave a talk on the state of the CDO. I'm very familiar, of course, with the Sloan School. For years, I've been very much involved in a number of your activities, some of the CFO conferences that you guys held. So you guys do a great job. I've been a big fan of the Sloan School for a number of years. So keep up the great work, I really appreciate it. Talk about the discussion that you gave yesterday, the keynote. Very good. Well, what I tried to do in the keynote we gave yesterday was talk first a little bit about the whole issue of big data, because that's been an important driving force in the emergence of the chief data officer. And actually, I go back a little bit in history. As you mentioned, I've been around for a long time. So I go back several hundred years, and I use as an example, to kind of motivate it, the discovery of the microscope, which was actually discovered by accident. Turns out the inventor was a Dutch gentleman, and he was trying to build a way to measure the thread count in tightly packed woven thread.
And one day he aimed the microscope at a drop of water, and he observed, gee, there are animals in that water. Now the reason I use that anecdote to motivate it, it's not that those animals had not been there for hundreds or thousands of millions of years. It's just we were never able to see them before. And one of the amazing things about big data in general is that it's giving us the ability to see things that we never were able to see or understand before. And so that's the power of big data. It's obviously gotten a lot of excitement and hype, but also a fair amount of traction in industry. And I also like to, pardon me, my voice got worked out yesterday quite a bit. I like to give some examples of what I call insights. One of the insights I call the "we can see what you're thinking" insight. And this is some work done by my colleague, Erik Brynjolfsson. He and one of his PhD students analyzed Google search data, and by using that were able to detect increases or decreases in housing prices in major cities in the United States. Because if you think about it, imagine you're a graduate from MIT, and you're gonna be moving to San Diego. Well, several months before you move, assuming you wanna have a house, for example, you start shopping around, looking around. So by monitoring these results, we're able to kind of say, what are people thinking about doing in the future? And we're able to find all kinds of ways to anticipate what people are gonna do by observing things they're doing now that indicate what they're thinking about. So those are the kinds of examples of things that there was really no practical way we could have done in the past. So you guys at Sloan School, you're very data driven. I've seen Erik's work in the past. You guys love to mine data. You take a bath in data, I say. And so, I like your attitude.
A lot of times, we've heard this at this conference, people, of course we've heard big data, big deal, which Dat was on this morning talking about that. I love that line. And you guys have been doing data for a while, but there is something different about this so-called big data theme. A lot of people don't like that word, but with the web scale, it certainly feels new. And I think it is new. The examples that you're giving bring a new feel to it. But one of the things that we haven't seen is a hard connection between big data and data quality. And that's something that somebody out of the data quality community is going to naturally think about. So do you think about that, and how do you think about that? Well, I think it's a very important issue. I often simplify by saying having a lot of data doesn't mean anything if it isn't very good. And so quality of data is very important. One of the misunderstandings I think some people have is that if you have enough data, then the fact that some of it is bad kind of washes out. And I'm sure there are cases where that does apply. But if you don't really understand your data, there can be systematic misunderstandings that can really distort it. And so we've been talking about this issue of data quality and understanding data for a long time. Let me give you a simple example if you don't mind. As you know, we're now working our way out of another housing crisis. We seem to go into a housing crisis in this country every decade or two. Well, there's a story that goes back to the previous housing crisis. And there was a headline in the Boston Globe, the second most authoritative source of news besides you guys, that basically said, oh, good news, housing sales have dramatically increased. Well, how did they do that? This goes back a decade or so. So it wasn't really big, big data. But what they did, they went to the registry of deeds. And as you know, when a house is sold, a new deed is filed.
And they counted how many deeds had been filed that month. And it was significantly up over previous months and previous years. So the good news was more houses being sold. The trouble was, remember, when a bank repossesses your house, they take ownership and there's a new deed filed. What was really going on was the number of foreclosures was hitting an all-time high, not that housing sales had increased. Now, the reason I mention that here is it's a simple example of how you can have the data, and the mathematics they did was perfectly correct. They added the numbers up correctly. But the way they interpreted what the data meant was totally wrong. And so that's why we believe that data quality is so fundamental. Understanding the data, understanding what it means, understanding the consequences is so critical. And unfortunately, it's vastly misunderstood. Well, so much of that is antithetical to what we hear in the big data world. Jeff and I and our colleagues, we travel around and we hear from these big data practitioners, these famous data scientists. And a lot of what they do is inference, it's sentiment, and it's very fuzzy. I feel like you and your colleagues are trying to really bring discipline to that fuzziness. The microscope is going back to that. So do you feel hopeful? What gives you hope that this new big data theme is not just going to overwhelm us with bad data? Well, that gives me a good opportunity to go into another recent project we have. One of the things we've been pushing in our own research group here is a notion of looking at big data both bottom up and top down. I think a lot of what you're describing is what I call the bottom-up approach to looking at it. They just grind it and grind it and grind it and see what kind of bubbles up out of it. So one project we are just finishing now was looking at the London riots that took place about a year or so ago.
And on the top-down approach, we had a theory that we were exploring, and the theory goes as follows. I'll try to say it quickly: in our heads, we have narratives, thoughts. A narrative might be the police are here to defend us. Another thought might be the police are here to oppress us. And in fact, we probably have all these thoughts kind of in our minds at the same time. So we did this project, we identified a number of important narratives related to the London riots, and then we tied that to data feeds like Twitter, et cetera, to see which of these narratives were being reinforced by the outside events. And we were able to take this modeling approach of looking at these narratives, and how they ebb and flow and relate to, reinforce, or dissipate each other, and the data, and were able to basically forecast in retrospect, that is, forecast the evolution of the London riots. So our belief is there are some things you probably can get out of just grinding the data. But we believe this combined top-down, bottom-up approach, where we have a theory of what's going on and then use the data to refine that theory, is very important. And I think that's going to be the way of the future. NSA PRISM would be another great example. Any polarizing effect would be a great example. Well, that's interesting because I think that's related to the question of correlation versus causation. And in the big data world, what we're hearing a lot of, I guess, big data evangelists say is big data doesn't really help you answer so much the why, but what's happening. And in big data, I'm sorry, I forget his name, but one of the correspondents for The Economist, his book really focused on how it's the correlation that big data is focused on. It's the what, rather than having a hypothesis and testing it. So you're suggesting that you need kind of, as you said, an approach from both sides.
And how do you practically make that happen when you've got such large data sets, myriad technologies you're trying to use, you've got different stakeholders? That could be a complex process, I would imagine. How do you practically try to make that happen? Well, first I want to just piggyback on a comment you mentioned that I think is a very key one, and that's the issue of correlation versus causality. I'm probably gonna get the joke wrong, but there's a whole story about if you look at the size of a fire and the number of firemen at the fire, you notice that the more firemen you have, the bigger the fire is. And so the obvious conclusion is that firemen set fires, and the more firemen you have, the bigger the fire you have. I apologize to whoever first came up with that joke, I'm sure I butchered it a little bit, but that's the challenge you have. A lot of times if you just grind the numbers, the numbers are correct. Just like the numbers I mentioned regarding how many deeds were filed in Boston. But understanding what that really means, what is causing what to happen, does not pop out, and in fact almost always gets hidden in those numbers. So that's why, and once again I don't wanna be dogmatic. I'm sure there are situations where the numbers will be self-revealing. So I'm not saying it doesn't happen, but all too often you run into these problems of either causality not being well understood, or the numbers give you a distorted result that doesn't make sense, or conversely may make sense but incorrectly making sense. And that's why we think, wherever possible, if there is a fundamental theory that is much more solid, so to speak, that's driving it, you can use the data to either A, validate that theory, or B, in some sense parameterize that theory. Another project we worked on just a few years back that's kind of related to this was that we were looking at using data and models combined to predict the stability and instability of countries. An interesting, challenging issue.
So there's a whole bunch of things that I think are hard to do with just the raw data alone, but if you combine the raw data with some fundamental theories of how people, organizations or countries behave, you can do a lot more. And that speaks to, I think, the required skill set in the data scientist, or even to some extent the chief data officer and people who work with data. They've got to have the ability to crunch the numbers, if you will, but in order to build those hypotheses you've got to have some domain knowledge and domain expertise. And once again, I don't want to be dogmatic about it, because I'm sure there are many roads to success and probably equally many roads to failure, but I do want to kind of push exactly what you just said, that view, because that's one that we think has been underappreciated and underexplored: combining the domain knowledge with the data analysis to get a lot more out of it. If you think about it, you know, I used the analogy a minute ago of the microscope. If you think about it, much of what happened subsequently in medicine and biology came out of that. Those are fundamental theories over time. Now maybe the initial observations were just reading the thoughts or reading the animals in the water, but there often are theories that lie behind it, and one of the things we try to do with the data analysis is try to understand what are the underlying principles, the underlying theories, that are driving these things. What about the role of the CDO? Talk about that a little bit. You've got some perspective there. You talked about that yesterday at the CDO event. Talk about what is a CDO? What are the characteristics and properties of a CDO? What's the history of that role? Sure. Well, a couple of things regarding the history. I forget the precise year. I think around 2003 we've actually found the first recorded CDO. I think it was at the Capital One organization.
And then someone did a recent data mining experiment using LinkedIn. I forgot the actual number, but the number of CDOs this year compared to last year is up by a factor of five. So obviously there's something going on out there. But one of the things we wanted to do in our research here was try to understand kind of how to characterize CDOs. It's the same phenomenon you have with CIOs to some extent. There are lots of people in many organizations with that job title or that label on their business card, but they often do radically different things. And so we did a series of things. First we conducted a series of in-depth interviews. We had about 40 different either CDOs or people who we believed were doing CDO-like activities, even though they may not currently have that job title, and tried to understand what they were doing. And as the outcome of that research, we came up with, I often use a joke, if you're at Harvard you use a two by two matrix. We're at MIT, we use three dimensions. I mean it, we came up with three dimensions. And so the three dimensions we came up with for these CDOs were as follows. The first dimension we call directionality: are they primarily looking inward into their organization, or outward? By inward I mean, are they primarily looking for ways to improve productivity, to reduce inventory, looking more at ways to improve the operational efficiency within the company, or are they looking more outwardly in terms of marketing, how to expand the markets, how to build new products, kind of looking at the outside world? So one issue is the perspective, or the directionality, looking inward or looking outward. That was one dimension. The second dimension may seem a little funny, but you'll see why it's important: it's the kind of data they're looking at. Are they primarily looking at what I refer to as traditional types of data, you know, maybe sales data, maybe inventory data and such? Or are they primarily looking at what we call kind of nouveau data?
It could be things like Twitter feeds or social media or cell phone traffic or location-based traffic and so on. The reason I say that is we might think of traditional data as being kind of traditional. I would argue most organizations have been able to get value out of a teeny portion of the information they already have. And so there's nothing wrong with trying to harvest a lot more value from stuff that you have but just haven't figured out how to make good use of. And we have lots of examples of that as well. But then we also have the nouveau data. So the question is whether your perspective is primarily on the traditional data or primarily on the so-called big data, or nouveau data. The third dimension in many ways I think may be the most critical one for where it's gonna go in the future. And that is whether the CDO sees their job primarily as a service activity or a strategic activity. What I mean by a service activity is someone says, I need to know X. Could you please get me that answer? Can your organization get me that answer? So you're acting kind of responsively to the needs of your organization, which is a perfectly good thing to do. A strategic activity is being able to help your organization, much like my example of the microscope and the water, to see things and see needs that they are not currently aware of, to help set strategic directions and new insights for your organization. So if you think of this as a three-dimensional cube, very appropriate, I hope you appreciate the connection. If you think about this as a three-dimensional cube, what we did is we identified each of the coordinates at the corners as a kind of role and described what that role looks like, realizing that probably no CDO is solely in one role. They're usually in a cluster of them. But we asked them: of these eight roles, which is the one you see as being the dominant role, and which do you see as your secondary and third ones?
So that's what we tried to do, because right now there is no blueprint, if you will, for a CDO, or even a way for an organization to think about what kind of CDO they might want to have. Well, what I like about that model is it's organic, because the drivers are going to change. The sales and marketing organization, you hear that CMOs are going to be spending more on technology than CIOs. Well, that's going to affect the affinity with which the CDO approaches that cube. So while we have you here, I want to ask you, because you have deep expertise. You don't mind me interrupting you. I'm sorry, you missed my thought. Because you had a brilliant thought. Thank you. I should pause and breathe, allow some filler every now and then. One of the things we also showed in our chart: when we interviewed these CDOs we asked them a little bit about their career, and often we'll see that they start off largely in this corner, then they end up moving to this corner, then this corner, and their current plans are often to move to this corner. So you're right, it is an organic, evolving role. Excellent, no, thank you for that. Appreciate it. So I wanted to pick your brain because you're very accomplished. You've written numerous, I think I told you, in college I'm quite certain that one of your textbooks was required reading. And so I'm interested in your current work. You're working on things like connectivity among disparate distributed information systems, database technology, software project management. Let me actually start with database, because database was kind of boring a decade ago. You had a couple of companies out there and you had some platforms, and now all of a sudden database is hot again. What's happening? What do you see in terms of database technology? How is that evolving? What's changing? Well, that's a subject for several hours, so I'll try to keep it somewhat focused, and I'll address it in maybe two tiers if you don't mind.
One tier is one that probably most people already pretty much associate with the big data movement. And that is the fact that big data typically involves large volumes of data and large amounts of processing, which involves looking for new kinds of architectures, both hardware and software architectures, that scale, Hadoop and so on. So there's a lot of activity going on there, and a lot of new companies coming out to provide these services. One of my students is working on a thesis this semester on the idea of big data as a service. What would that look like, and so on. So I think there's a lot going on kind of at the architectural software and hardware level. The thing that I've been most interested in though, if I can go to that level, is something that has been around for a long time, but it's only now beginning, I think, to be more fully appreciated. And that's what I call the issues involving both integrating and contextually understanding data. That is, when you're dealing with data singularly, with an individual slice of data, there's lots involved in processing it. But when you think about, let me give you an example I use as a challenge question for my students. I say, let's say we wanted to be able to do things like predict whether there'll be food riots in Buenos Aires in the next month or not. What are things you might want to know? What you might want to know is, how is inflation going in Argentina? What are the unemployment levels looking like in Argentina? How are food prices going up and down in Argentina? There's a whole, and what's the Twitter traffic saying? What's the mood of people? It's the sentiment, right? The sentiment. Any one of those things may be up or down, and the others may compensate for it. So unless you kind of pull these pieces together, you may not get a picture sufficient to do that. We have not solved that problem yet, but this is kind of an example of a challenge question.
The problem is when you try to pull information together from a diverse set of sources, you find they don't all mean the same things, and getting the data to reconcile is hard. I think we've heard that from some of the keynote speakers today. It's a problem in the government sector and the private sector. It's been there for years and years, not a new problem. It just gets worse and worse and worse. I often tell my class that it goes back a long way. It goes back to Genesis chapter 11, the Tower of Babel. So it's a big, big problem. We've been doing research here at MIT under the name we call Context Interchange to understand the meaning of data in its different contexts, and then ways to automatically reconcile it to make it fit together. So that's the kind of research that I think relates to that. Okay, and so presumably the underlying architecture of these systems is evolving to be able to incorporate both that traditional data and, how did you describe it before, the nouveau data. And so there are a lot of initiatives in the industry trying to blend those two models. And some purists are saying no, they're meant to be separate for a reason. Others are saying that's crazy. You actually need them to be together and you need to increase the size of the databases. Do you have any thoughts on that? Well, first a quick comment regarding activities going on at MIT more broadly. As you know, we're one of the focal points of the World Wide Web Consortium, and Tim Berners-Lee, who heads it up, has been promoting for a number of years now the notion of the semantic web, which really is not limited to the web. It's more the semantics of data, the meaning of data. And we're kind of big fans of that idea. So I think the idea is being able to understand data and be able to share it and exchange it. I'm not gonna give his pitch, but his pitch, to a large extent, is that there is tremendous value we have.
We have got huge amounts of information out there, most of which is either untapped or untappable, because we don't know how to pull it together. And I think keeping things in separate bins, although there are some issues regarding security and privacy, seriously diminishes our ability to learn things and make new knowledge. I'll give you one example. One of my other colleagues you may know is Professor Pentland, Sandy Pentland at the Media Lab. And he does a lot of work with mobile technologies. One of the things he shared with me recently was an interesting project. Your smartphone does lots of things, of course. It knows who you call, along with the NSA. It also, of course, knows your location. If it has Bluetooth on, it actually knows who I'm sitting next to, didn't think of that before. But also, with the accelerometer, it knows how I'm moving. And what they were able to do is instrument it and, watching people's movement, they could detect the early onset of certain types of diseases, particularly mental illnesses of various kinds. You go into your doctor's office for 50 minutes, he gives you his tests, and he may not clearly notice it. But by monitoring millisecond by millisecond that your behavior, your movement behavior, today is more jerky, more erratic than it was a day before, you can detect something that neither you nor your doctor knows is going on inside yourself. So that ability to really harness these kinds of almost micro-level details is an enormous breakthrough that I think we have. It's really exciting, the whole internet of things, Jeff's written about the industrial internet, the Google Glass, the Apple Watch, I mean, this whole wearable computing thing is really the next wave. Stuart, I'm getting the high sign here. We could go on forever, and if my colleague John Furrier was here, he'd never let you off.
But really, it's been a pleasure meeting you, and thanks so much for coming to theCUBE. Well, thank you for sharing the time with me. Pleasure, thank you for having a great time here. Thank you. Have some nice weather for you. Appreciate the invite, yes, beautiful. It's about 98 degrees here. All right, keep it right there, everybody, we'll be right back after this.