Hi everybody, we're back. This is Dave Vellante with Jeff Kelly. This is theCUBE, SiliconANGLE's wall-to-wall coverage. We're here at the MIT Information Quality Symposium. Michael Rappa is here. He's a PhD and the Executive Director of the Institute for Advanced Analytics at North Carolina State University, a distinguished university. Michael, welcome to theCUBE. It's really great to have you here. It's my pleasure. Thank you, Dave. So, we're here at the Information Quality Symposium. We've been saying it's kind of in the boring but important category. We cover a lot of stuff around big data, and it's a hot topic and everybody likes to geek out on all the technologies, but this notion of information quality is really starting to, we've talked about this earlier, Jeff Kelly and I, really starting to bubble up. What's your take on all this? You've got a background in analytics. You're doing a keynote this afternoon on the new data scientist. Hal Varian made that a very popular term, and folks like Jeff Hammerbacher. So, where do you come from in all this, and what's your perspective? What's your role here? Well, as director of an institute, I run a fairly large educational program to produce data scientists, and we run 80 students through a 10-month format with some very clear objectives. But I think this is part of a very interesting decades-long progression we have going on, from really the electronic data processing, doing things, not just recording it electronically but storing it electronically, in the 40s and 50s. By the 70s and 80s we were amassing fairly significant amounts of data at that time, but then the web came along, tied together all those databases and gave us even more data on top of it, and now we're getting to the point where people understand that there's value in data.
We have to do things to understand the data that we have, draw insights from it that we can make decisions around, and it's a very value-added occupation in terms of the ability to actually turn the information you have into an asset. I like to say that if it's not an asset, it's a liability, so if you're just holding data, not doing anything with it beyond processing it, it doesn't help you. I think that the future is being transformed by data now in an incredibly rapid way. I mean, it's just astounding. I've been in this now almost three decades. Right, it is astounding. I mean, ever since we graduated from our alma mater. We went to college together. We've been hearing about how do we get value out of data? Where is the value? And all of a sudden... It starts with quality, so this is a big, important issue, and that's why I'm here, because data quality is at the core, right? So we collect all sorts of data. Sometimes we can run algorithms where we can deal with a certain amount of ambiguity in data and still draw useful insights from it, but the great majority of data analysis, at least up until this point, has relied on data quality and knowing that the information that we have in the systems is real. I think even today, we work with a lot of organizations, and what's amazing is what happens as you get into the realms of really large amounts of data. When we were in college, you could flip through a few thousand lines on a spreadsheet. Even later, you could flip through a few tens of thousands of lines. You can't flip through a billion lines on a spreadsheet, and so you don't get to see your data anymore. You have to understand it in ways in which you can uncover the potential problems, and it is a really common thing to see organizations not always understanding the problems they have with their data. And that's what we're trying to do: create people who can really help organizations better understand their data, and it starts with the quality, making sure the quality is there.
So we've spent a lot of time at various big data conferences, and we were just at the Hadoop Summit a while ago, and you hear stories that make you feel as though that emphasis on data quality that we've had for decades is giving way to, as you pointed out, algorithms and inferences and good enough. So, as a data quality practitioner, what are your thoughts about that? Is that mindset in for a rude awakening, or is there going to be a hybrid approach, a blending of those two worlds? What's your take? Well, I think, first of all, data cuts across the whole economy. So we see it in virtually every industry sector, whether it's private or public sector, and it really starts with what you do with data. It's the problem that you're focusing on, and certain kinds of data problems demand a ruthless type of data quality. I mean, you want to wake up in the morning, go on to your online banking system and know that it's right. Unless they make a good mistake, I guess. But even that's not a good mistake. It'll come back to you. Yeah, so don't spend that money, please. But in other things, if you're dealing with the West Coast, I tend to think there's an East Coast, West Coast reality around big data. I've written a little bit about this. The West Coast is a little bit more dominated by the big web plays. And so when you're dealing with log files, when you're dealing with tweets and other sorts of things, then all of a sudden the relative precision becomes different in terms of being able to draw insights about a user or to draw some sense of sentiment from tweets. You can start dealing with the probabilities and be a little less driven by the fact, or you have to, you're driven to, because the tweets are human language and you can't process this. And false positives aren't life-threatening. Yeah, so exactly, absolutely the point. So I think what's interesting about this, and I don't actually like the term big data, I do believe data is big.
But the reality is that it's here, it's here to stay. It's accumulating faster than we even realize. People say, oh, data storage is so cheap, we can just keep storing it. The reality is we're consuming it faster than you'd ever imagine. And unless you want your organization to look like the Utah desert with $2.5 billion data centers, you're really gonna have to come to grips with the data that you're collecting. How do you do data reduction, data quality, making decisions? I think one of the most important things going into the future is going to be the decisions you make about data retention. Data reduction: do you really need this stuff? If you don't use it, if it's not an asset, it's a liability. What are the governance rules over it? And beginning to make decisions about what is used on a daily basis, what goes into deep storage, cold storage, right? You shouldn't be cooling petabytes of data 24 by seven if you're not using it. It doesn't make a lot of sense. Indeed. So Michael, you're in the business, really, of educating, I guess you might say, the next generation of data scientists. And I think when people hear the term data scientist, they think of people like Jeff Hammerbacher and DJ Patil doing some of the really interesting analytics on data. But what is the role of the data scientist in terms of data quality and data governance? Because we also hear at the same time that 80% of the time spent by data scientists is getting the data into a format in which it can be analyzed. So what role does the data scientist play in data quality and data governance? Maybe the less sexy aspects of being a data scientist, but critically important. Sure, I'm still waiting to come across the sexy data scientists. So I'm not sure how that's gonna turn out. But I'll tell you that if we called them data janitors, I know we'd have no students anymore. They wouldn't come into this game. The term data scientist is a bit of an awkward one.
And I think in my talk here, I'll discuss a little bit more about this. The fact is that there are, Jeff is a great example, there are others. And they have this unique combination of skills. They have the deep programming skill along with really good, solid math and statistics understanding. And they can meld those together. I call them, they're sort of like Bigfoot, really rare in the wild. You don't see them very often. And honestly, it's not scalable. There's no way to scale enough PhDs in math and computer science or statistics and computer science. It's not gonna happen in our lifetime at all. And even right here, we're at MIT. A few weeks ago, they graduated their class, and out of about 1,000 undergraduates, there were fewer than two dozen graduates who majored in 18C, which is math and computer science. And so- And that's at MIT. And that's at MIT. And so this is not scalable. So how you look at the data scientist, I think, as you point out, and rightly so, is that it's really a spectrum of activity. The modeling part is just one piece of it. And so the, and this is the most important thing, understanding the business problem. So the data scientist shouldn't be a white jacket person who's just running algorithms and cranking through data. It starts with a business problem. They're not driving value unless they're addressing a business problem and understanding the nature of that problem before they even get into the very difficult reality of cleaning, integrating, really understanding the nature of the data that you have, the distributions, all before the point that you get to model anything. And in fact, there's so much great work being done at the higher tier, in terms of the folks who produce the statistical programming engines, where so much intelligence is being embedded in those tools that in many ways, a few Jeffs go a long way. And a few of those guys go a long way. A few of those PhDs go a long way.
What we need is battalions of people who are educated, like we do, I believe, at a master's level, who have the basic programming skills, who have on top of that an understanding of the math and statistics, but are much more like, you know, a lot of people don't realize this, but universities produce more MBAs each year than anything. Okay, some 170,000, that's an incredible number. We need those kinds of numbers graduating with deeper knowledge of how to deal with data. And we can do it at a master's level in fairly large numbers. And I think within about five years we'll see quite a large number of universities across the country churning out, whether it's 50 or 100 or more, starting to really increase those numbers, where people go into the organizations, they interact with the client, they interact with the business, they understand the business problem. They take on what is probably the most difficult thing. Given the data that you have, how do you frame things in such a way that you really can address the business problem, draw insights, and then communicate it in a believable way to management? Because management doesn't believe you. Well, to your point, nothing will go anywhere unless management believes that you've done what you can do. Absolutely, and to your point about the whole data science graduates versus the MBAs, the financial payoff's certainly there. In California, kids coming out of college who are so-called data scientists earn anywhere from $90,000 to $120,000 right out of the master's programs, and it's upside from there. But my question is, do you see competition? I mean, certainly there's this meme in the industry that the CMO is gonna spend more than the CIO on technology.
You actually do see that competition in certain fields, and my friend Peter Burris, who's the head of the CIO practice at Forrester, is of course debating that Gartner stat, but nonetheless, you're seeing large marketing organizations compete for those individuals, throwing money at them. Is there a brain drain right now that's going to marketing? Like Jeff Hammerbacher says, the best minds are. Well, so I think this notion- Clicking on ads. The focus on what is now more the conventional definition of a data scientist, a PhD, has the programming skills, has the- They've been around for a while. They've been around for a while. Nothing new here. And they are being cherry-picked out of universities. I think that's a bit of a problem for universities, but I think the goal is to push it down to the master's level. You can produce many more master's students, and we turn out now 80, 85 a year, and they're getting those 100K and up salary ranges with an intensive 10-month education. A PhD can last you five, six, seven years in the process, and I think going forward we're going to see a different kind of data scientist emerge, and maybe the parallel is with the MBA but with a very strong data focus. I tell my kids, data is sexy. Maybe data scientists aren't sexy, but data is sexy. Data is sexy. I'm with you on that one, so I'm definitely there. And I think that in a matter of a few years we're going to see that everyone is going to look at their organizations through a window of data. They're going to make decisions based on a much better understanding of where they've been, where they are and where they're going, because they have a data lens on everything. At my institute, we measure everything. We make decisions in real time with data in hand, and so I live it as well as teach it, and I can tell you I'm not going back to that other world. So we talk a lot. It's a much better reality to work with data. There's a lot of talk about data-driven organizations.
Do you see the emergence of data-quality-driven organizations, or is that just too far afield right now? Well, again, for certain sectors, in banking, in areas like health care, it's always going to be a core issue. I think more what's going to come up is, again, going back out to the West Coast with the big web plays. At what point do they attempt to go beyond click streams and things happening there to tie that data to individuals, to other socioeconomic behavior, to other sorts of things, so that then all of a sudden maybe data quality starts to rise in the hierarchy of importance with the... That's a good point. Clicking on ads could be a big predictor of other types of activities. So if I was younger and coming out of school, I'd see enormous, fascinating opportunities in places like Facebook and Twitter and others. Not just centered on the straightforward question of how we can understand whether this person's gonna click through something. The fact of the matter is that now Facebook understands more about human behavior than we've ever had in the history of civilization. I mean, to go in there and really start to analyze some of the data they have and then start to wrap advertising around deeper understandings of that behavior. It's where the opportunity is going forward. And so just looking at Facebook, it's incredible the kind of research that you could be doing there. Now, we mine Twitter pretty extensively as well. We haven't really hit Facebook. And Twitter is also an amazing and fantastic thing that... And these things haven't even been around five years. I mean, I feel really old at this game here. We'll wait another five years and we'll see what you can predict with some of the information coming out of social media. And that's where the speed of this is astonishing. How rapidly data is consuming us. And we have to start producing people in larger numbers who can deal with this.
Our students are literally the most sought-after and highest-paid graduates of the university, and really some of the most sought-after and highest-paid graduates of any university in the country. And I've said publicly, and I've gotten some looks, if we can do this with 80 students, we could be doing it with 800. I mean, the demand, we recently doubled from 40 to 80 and demand went up. Just deepened. Yeah, so tell us a little bit more about the program itself. You and I spoke several years ago. I know you started the program in 2008. You mentioned now you're up to about 80 students. Tell us a little bit about how the program has evolved over the years and what it looks like today. It's been a great experience. So I think what's unique about what we've done at NC State is that we really started from scratch and really tried to understand what employers were looking for when they hired this talent, and literally built the degree from the ground up, lecture by lecture, not just in terms of deciding, like we do more or less conventionally, what content, what math, statistics, computer science, business elements you want to integrate into it, but also through the structure of the program. So it's a very intensive 10-month experience. Students become completely immersed in the subject matter. Everything is structured around teamwork, because teamwork is an absolutely essential skill set for people working in this space, and they come out of it in a very short period of time with fairly solid skill sets and the capability to do things for the organization right out of the box. And so it's been a huge success for us. We've had six straight classes graduating now to full employment at, as I said, some of the highest salaries. I mean, it's almost, you know, it really even makes me blink sometimes to see the demand for the skill set. But it's because they can do things. You know, they can join organizations.
Managers can put them on teams and they can start doing things from day one. And that's what's needed. You know, you can't wait another two years to get your data science effort going or start making sense of your data. You've got to move and move fast with it. So the program's been great. We have great partners in companies like SAS, and we work very closely with them to really bring industry-standard tools into the curriculum. But we really look at the employer as the customer, you know, and that sounds normal to you. Right? It's a typical thing for you guys and for industry, but for universities, it's an alien concept, right? So how do you do that? How do you keep the pulse on what the enterprise is looking for? How do you keep that communication going? Because we're advancing so rapidly in the types of data and the ways we want to use it. We interact, you know, consistently throughout the year with industry, with employers, both public and private sector employers, but we also have at the core of our curriculum, in lieu of a master's thesis, a practicum, which is a team-based exercise that stretches over eight of the 10 months. And those teams will receive real data from a sponsoring organization, in large quantity, tens to hundreds of gigabytes, typically, and they'll march through a very structured process of understanding the business problem, cleaning the data, doing the data modeling, and preparing a report and presentation to their sponsor. And that keeps us very grounded. These are real problems. And if you visited our website, you'd see we do 17 projects a year now with a wide range of organizations, from some of the world's largest brands like Procter & Gamble and GE, down to the Houston Astros, as well as various government agencies who have more data than... Where do people go to get more information? What's the website? So analytics.ncsu.edu. You can just Google us as well and find us very quickly.
And you'll see the Institute for Advanced Analytics website there. We provide enormous amounts of information over the website about the program. We're very data-driven, and we share all that data openly about what we do. So please visit if you want to learn more, certainly. You'll find it's more than a brochure. All right, Michael, well, listen, thanks very much for stopping by theCUBE. It was a real pleasure talking to you, and great to meet you. It's an absolute pleasure. Yeah, it's absolutely great to meet you, and it's great as well. All right, everybody, keep it right there. Go to siliconangle.com. Check out the blogs associated with these videos. Go to youtube.com/siliconangle to find these videos on demand. And of course, go to wikibon.org for all the research. Keep it right there. I'm Dave Vellante with Jeff Kelly. We're live from MIT. We'll be right back after this. Doing great stuff with this. I really enjoy it.