Live from Cambridge, Massachusetts, it's theCUBE at the MIT Chief Data Officer and Information Quality Symposium with hosts Dave Vellante and Paul Gillan.

Welcome back to Cambridge, Massachusetts, everybody. This is Dave Vellante. I'm here with Paul Gillan. This is theCUBE, SiliconANGLE's live mobile studio. We go out to the events and extract the signal from the noise. This is the second year we've done the MIT Information Quality Symposium and the Chief Data Officer Forum. Joe McGuire is here. He's a consultant with Data Quality Strategies, a former Gartner and Burton Group analyst, and a former software guy who worked for large companies. Joe, welcome to theCUBE.

It's great to be here. Thanks for having me.

So is this your first year at the MIT event, or were you here last year?

No, no. Someone was saying earlier, in one of the introductory remarks, that this conference is in its eighth year, which was a surprise to me, because I've been here for nine.

I've been here for what seems like 20 years.

You stepped on the punchline. I've been involved with this conference for six years. I started just as a presenter of some ideas, and after a couple of years of that I got roped into intensified activity, and now I'm on the committee that organizes the program.

Oh, excellent. We got a chance to meet last year, so I'm glad we could do this, and thanks for coming on theCUBE on short notice. So what is the state of data quality? You've been in this business for a long time. We were talking earlier, showing our age, about the old days of CASE, and you've seen the cycle. Where have we come from, where are we, and what does all this big data stuff mean?

Well, there's no simple answer for the state of data quality, no one answer, because there are different kinds of data, and some forms of data have really enjoyed a lot of attention. The data quality specialists, the data quality researchers, and the data quality vendors have lavished attention on certain forms of data, most notably the kinds of data that might exist in a relational database, and consequently the current state of the art in data quality is skewed toward ensuring the quality of relational data. A lot of the techniques involve confirming that referential integrity constraints are met, and there are certain relational, query-based techniques for scanning through data and making sure that it's high quality.

For some other forms of data, which is a lot of what big data is about, it's much harder to even define what constitutes quality. When you have free-flowing text and paragraphs, it's hard to even recognize whether a piece of data is high quality or not. One of the ways folks cope with that, and try to bring some of the data quality expertise that applies to structured data onto unstructured data, is to superimpose structured metadata on the unstructured data and apply data quality techniques to that metadata. Tagging is one way to decorate data. Even in the JPEG format there's a header, and the header in a JPEG file has some metadata in it, which is structured. That kind of data can be extracted out and put into a relational database, and so it becomes amenable to the data quality techniques that were designed for structured data.
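To make that last point concrete, here is a minimal sketch of pulling the structured header metadata out of a JPEG and landing it in a relational table, where ordinary query-based quality checks then apply. It assumes the third-party Pillow library and a local file named photo.jpg; the table layout and the completeness rule are illustrative assumptions, not anything prescribed in the conversation:

```python
import sqlite3
from PIL import Image, ExifTags  # assumes Pillow is installed

def jpeg_header_to_row(path):
    """Extract structured metadata from a JPEG header."""
    img = Image.open(path)
    exif = img.getexif()
    named = {ExifTags.TAGS.get(tag, str(tag)): val for tag, val in exif.items()}
    return (path, img.format, img.width, img.height, named.get("DateTime"))

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE photos
                (path TEXT, format TEXT, width INT, height INT, taken_at TEXT)""")
conn.execute("INSERT INTO photos VALUES (?, ?, ?, ?, ?)",
             jpeg_header_to_row("photo.jpg"))  # hypothetical input file

# Once the metadata is relational, structured quality techniques apply,
# for example a simple completeness and sanity scan:
bad = conn.execute(
    "SELECT path FROM photos WHERE taken_at IS NULL OR width <= 0").fetchall()
print("rows failing the quality rule:", bad)
```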
Certainly over the last few years, and MIT has been advocating for information quality through a formal program for over 20 years, we've seen the way data is collected change. Much more of it is collected by machine now; it's collected automatically from devices rather than entered by humans. How has that changed the data quality issues that the customers you work with face?

Well, there are a couple of issues there. One is data that's collected by machines versus data that's collected by humans, and the other is data that's collected by some organization or some bot that is not under the purview of your particular enterprise, data from Twitter, say. I'll tackle those questions in order.

The issue with data collected by humans rather than machines is that humans quite naturally make mistakes, and we can have mechanisms to cope with the mistakes they make. Our software systems are sometimes dismissive of the essential nature of data whose ultimate source is human beings. We ought to be designing systems, and designing policies and processes, that expect to encounter bad data. One way to cope with the human realities of poor data is to make sure we build systems that aren't so brittle that they yield dreadful results when the data is bad, because we know the data is going to be bad.

There's a little story I'll tell you about a credit card company that uses the presence of pristine, perfect, 100% defect-free data in certain contexts as a sign that something is wrong, because data shouldn't be perfect and 100% pristine if it's coming from humans. Humans are going to make mistakes. Sometimes they're not even mistakes, just inconsistencies: one day I'll spell out my address, S-T-R-E-E-T, and the next day I'll abbreviate it, S-T. If all the data coming in from a particular point of service for a credit card company has every address perfectly normalized, exactly as the post office would have it, that's probably fraudulent; it's a sign of fraud. What an organization ought to do is develop an appreciation for just how dirty the data ought to be, given that there's a human source for it, and be prepared to respond accordingly if the data deviates from that level of dirtiness in either direction, because if it gets too clean, you should smell a rat.
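A minimal sketch of that "too clean to be true" heuristic might look like the following. The normalization test, the address format, and the expected dirtiness band are all illustrative assumptions; the point is only that a batch gets flagged when it deviates from the expected level of dirtiness in either direction:

```python
import re

# Assumed convention: an address is "perfectly normalized" when it ends in
# the canonical suffix "ST" rather than a free-form spelling like "Street".
NORMALIZED = re.compile(r"\b\d+ [A-Z ]+ ST\b")

def normalized_fraction(addresses):
    """Fraction of addresses in a batch that are perfectly normalized."""
    hits = sum(1 for a in addresses if NORMALIZED.search(a.upper()))
    return hits / len(addresses)

def flag_batch(addresses, expected_low=0.40, expected_high=0.85):
    """Flag a batch whose cleanliness falls outside the expected band.

    Human-entered data should be somewhat dirty: a batch that is nearly
    100% pristine is as suspicious as one that is unusually sloppy.
    The band's bounds here are arbitrary placeholders.
    """
    frac = normalized_fraction(addresses)
    if frac > expected_high:
        return f"suspiciously clean ({frac:.0%} normalized): possible fraud"
    if frac < expected_low:
        return f"unusually dirty ({frac:.0%} normalized): check the feed"
    return f"within the expected dirtiness band ({frac:.0%} normalized)"

batch = ["12 MAIN ST", "99 Elm Street", "7 OAK ST", "31 PINE ST"]
print(flag_batch(batch))  # within the expected dirtiness band (75% normalized)
```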
So, Joe, I want to clarify something you said earlier. As we move from systems of record to systems of engagement, you're saying a lot of organizations are creating a metadata layer and applying it to that unstructured data. Are you saying that's a viable approach, that it's a best practice?

It's certainly a viable approach, and it is a way to harvest what we know about data quality techniques for structured data and apply it in some way to the problem of unstructured data. There are all kinds of other techniques that can make sense for unstructured data. The payload of an unstructured data file is the unstructured data; it's not the metadata. The metadata is interesting, but the payload is the paragraph or the tweet, and companies are looking to find out new things, like sentiment analysis, on the payload. They may be looking for words of disenchantment, you know, "stinks," "dislike," "sucks," very near their brand or their company name, and to do that kind of analysis you need to look at the data payload itself; you need to look at the unstructured data. The techniques for making sure that data is high quality, or detecting when it isn't, are less mature than the techniques for ensuring quality in a structured environment. In particular, there are difficulties with detecting sarcasm and detecting irony. It's a hard problem.

I'm sure the NSA is trying to solve that problem.

Right. The major bit of advice I would have for organizations that are buying a piece of text analytics software is to reach out to a linguist. These text analytics tools are very powerful, and you can do all kinds of cool things, but you may not know what you're doing. You may not know how language works well enough to really understand the results these tools are giving you. An analogy from 20 or 25 years ago: when products like SPSS, the Statistical Package for the Social Sciences, came out, it was sort of statistics for the masses, and the initial response was, oh, this is great, now I can do statistics. People were using those products although they didn't know statistics, and they figured out pretty fast: I don't really understand what a p-value is, and I guess I need to hire a statistician. Well, I see some of the same stuff happening now. You need to hire a linguist, because you may think that frequent use of personal pronouns is a sign of narcissism, so you can analyze the speeches of presidents and decide who's a narcissist and who's not. But you really need to understand how pronouns work, you need to understand whether there's a link between pronouns and narcissism, and then you probably need to understand whether there's a link between narcissism and effectiveness as a president, too. There are a lot of people who think they understand language and are very eager to use these text analytics products, and we need expertise, not technological expertise about bits and bytes, but technical expertise about language, to really get our money's worth out of these products.
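As a rough illustration of the payload analysis Joe describes, here is a minimal keyword-proximity sketch. The word list and window size are assumed for illustration, and it has none of the sarcasm or irony handling he warns is the hard part:

```python
import re

DISENCHANTMENT = {"stinks", "sucks", "dislike", "hate"}

def disenchanted_near_brand(text, brand, window=4):
    """True if a disenchantment word appears within `window` tokens of the brand.

    This is the naive version of the analysis: it will happily misread
    sarcasm and irony, which is exactly why a linguist earns their keep.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    brand_positions = [i for i, t in enumerate(tokens) if t == brand.lower()]
    return any(
        abs(i - j) <= window
        for i, t in enumerate(tokens) if t in DISENCHANTMENT
        for j in brand_positions
    )

print(disenchanted_near_brand("The new Acme app totally sucks", "Acme"))      # True
print(disenchanted_near_brand("Acme is great, unlike this traffic", "Acme"))  # False
```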
I think what you're talking about is good data but bad analysis, where the data itself may be true but it's badly collected, badly interpreted, massaged, presented. I think of all the research that's available on the internet now. Any idiot with a SurveyMonkey account can generate a research report, so you can find research that will justify almost anything you want to believe. In your work, do you see this as an increasing part of the data quality problem, where the data itself may be accurate but it's not true?

Well, I think that is a malign effect, or a potentially malign effect, of the big data world, in that people can basically say, I've got my conclusions, now I'm going to go find the data to support them, and there's enough data out there in the blogosphere, in the Twitter feed, in whatever text data you may be collecting, if you're the NSA or whoever, to find the result that you want. So yes, I think that's a problem. I think big data also presents some unique problems for data quality that don't necessarily have to do with structured versus unstructured data, one of the biggest of which is that a lot of the big data you're analyzing you don't own, so it doesn't fall under the purview of your data quality initiatives, of your data quality programs. You have to live with the quality of the data you get.

So Nate Silver was on theCUBE last year at the Tableau user conference in DC, and Jeff Kelly and I were probing him about the data quality issue on Twitter in particular, and social media generally. Jeff just walked in; you might remember this. Nate essentially put forth the premise that the data is not there, it doesn't exist, it's crap, at least as it is today; it may evolve to get better. Do you buy that, or do you believe it's a fundamental data quality issue? Nate Silver was saying that the data is not there to be analyzed, that the base data is not good enough to draw inferences from yet, that it's not solvable with better data quality practices or better engines or whatever; the data just sucks. We debated that, and I'm not sure I buy it.

I'm not sure I would go that far, but I would say that the best data quality practices we have for unstructured data aren't good enough to solve the data quality problems that do exist.

Okay, but that's a different conclusion, right? It's a much harder problem than data quality for structured data, because by its very nature, with structured data you struggle to understand what the data means beforehand; you do heavy lifting on your data model before you ever create a single row, versus a schemaless environment.

But it's also a function, I presume, of what kind of question you're trying to answer, like who's ready to buy.

Yes, yes, that's true.

Do you see any technologies that are promising in this area, any potential great leaps forward in being able to understand unstructured data better?

I think what Data Tamer, well, I guess they've changed their name to Tamr, what they're up to is a big help, but I'm not following that space as closely as I should. In fact, I had to leave the presentation that Tamr was giving here in order to come speak with you, so I only know half a presentation's worth about it, but it's a tool for curation of structured and unstructured data at big data scale. One of the big problems, I think, is that we have simpler problems we haven't solved yet, and one of them has to do with this word schema. A large part of big data technology is about schemaless implementations, and you have things that are based on Google's
BigTable, which is a very loose sort of schema. Folks are using schemaless implementations because there are technological reasons why it might be advantageous in their organization to do that, but a schemaless implementation does not necessarily mean that the phenomenon you are automating is a schemaless phenomenon. There's a serious risk here: if the phenomenon itself, from the user's perspective, from your clients' or your data constituents' perspective, is a schema-rich phenomenon, then you should model it, and the fact that you are producing a technological solution that doesn't require much of a model does not obviate the need for you to model the phenomenon so that you understand it. One of the ugly truths about how folks have been doing system design for the past 20 years is that they do all of their modeling in a relational context; conceptual modeling, business-level modeling, they do all of that while they're doing logical relational modeling. Now that you have a schemaless context, you say, we're not using a relational technique, so we don't have to do that relational modeling anymore, and you're effectively throwing out the baby with the bathwater and not doing any business modeling.

I think that's really insightful, Joe. I'll tell you a quick little tangent here. Sqrrl, one of the companies here in Cambridge, basically spun out of the NSA, and their CTO, Adam Fuchs, has talked about how the NSA built Accumulo, a BigTable mimic, and how essentially they took the schemaless environment and put structure to it. Being able to layer a model on top of that schema was critical to actually getting the data out of it that they wanted. So in practice, I mean, the NSA is doing it. It's causing a lot of fervor, but clearly it's been effective in some way, shape, or form. So that's an example, I guess, to answer one of Paul's questions. The other thing I'd observe, in this world of data governance: you mentioned tagging before, Paul, and I don't know if you meant by that human tagging or some kind of automated tagging.

Well, it could be either, but the idea of structuring unstructured data through metadata, I would think of tagging as an example of how you do that, and of the different efforts to classify, to auto-classify, data. It seems like the industry has just defaulted to using search as a brute-force instrument. Google+, for example, auto-tags comments and posts, so they're applying some level of automation to unstructured data, which you have to do at scale, presumably.

Right, to scale. Tagging and entity extraction are legitimate ways to decorate, excuse me, to decorate unstructured data with some semblance of structure, so you can tease things out: any time you see Mister or Doctor or Mrs., you can say, okay, this is the name of a person; any time you see Street, you can format things into an address, and then you can link up those addresses with addresses that are articulated elsewhere, perhaps in a structured context. And there are well-known mathematical concepts for classifying data, certainly support vector machines, probabilistic latent semantic indexing, and things like that, that people have used for years with varying degrees of effectiveness.
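As a rough sketch of that kind of rule-based tagging and entity extraction, the patterns below are deliberately naive, illustrative assumptions; real extractors, or the support vector machine and latent semantic methods Joe mentions, are far more sophisticated:

```python
import re

# Naive illustrative rules: an honorific followed by a capitalized word
# suggests a person; a number followed by capitalized words and "St" or
# "Street" suggests an address.
PERSON = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+")
ADDRESS = re.compile(r"\b\d+\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+(?:St|Street)\b")

def decorate(text):
    """Tease structured tags out of an unstructured payload."""
    return {
        "persons": PERSON.findall(text),
        "addresses": ADDRESS.findall(text),
    }

tweet = "Met Dr. Chen at 42 Elm Street; Mr. Lopez says the new branch stinks."
print(decorate(tweet))
# {'persons': ['Dr. Chen', 'Mr. Lopez'], 'addresses': ['42 Elm Street']}
```

The extracted addresses could then be linked up against addresses held in a structured context, as Joe describes.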
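Circling back to the schemaless point: here is a minimal sketch of modeling the phenomenon explicitly even when the store itself is schemaless. The Customer model and its rules are hypothetical examples, and the plain list stands in for a document or key-value store; the point is that the model, not the storage technology, captures the business understanding:

```python
from dataclasses import dataclass, asdict

@dataclass
class Customer:
    """An explicit business model for a schema-rich phenomenon."""
    customer_id: int
    name: str
    email: str

    def __post_init__(self):
        # The model, not the store, enforces what a valid customer is.
        if not self.name:
            raise ValueError("customer name is required")
        if "@" not in self.email:
            raise ValueError(f"malformed email: {self.email!r}")

schemaless_store = []  # stand-in for a document or key-value store

def save(record: dict):
    """Validate a record against the business model before writing it."""
    customer = Customer(**record)  # raises if the record violates the model
    schemaless_store.append(asdict(customer))

save({"customer_id": 1, "name": "Acme", "email": "ops@acme.example"})
# save({"customer_id": 2, "name": "", "email": "nope"})  # would raise ValueError
```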
By the way, when people talk about information classification, one of the typical applications is this: if you're a company and you are involved in some kind of lawsuit about a particular product of yours, you have to go through all of your documents and classify them: yes, this document is germane to our research and development about this product, and this one isn't. Information classification is thought of as something you do in response to a particular situation, a lawsuit, but information classification can work across all your documents, and you need to classify things: this is data that falls under the purview of HIPAA, this is personally identifying information, this is financial information, this is company-confidential information, this is intellectual property. That kind of information classification ought to happen on both your unstructured data and your structured data, because you can certainly look at a database column and say, this is HIPAA, or, this is personally identifying information.

And from 2006, with the Federal Rules of Civil Procedure, until, let's say, 2009, prior to the big data meme really taking off, that was a general-counsel-driven initiative. My question is, has the bit flipped, where information quality is now focused on the business opportunity, as opposed to mitigating risk?

Well, how do you measure how much information quality there is? There are some nitty-gritty technical ways of measuring information quality in a highly structured environment, but generally, data quality and information quality have evolved so that we recognize the way to measure information quality is by measuring the outcomes it yields, so it's more indirect.

We're out of time, and we didn't talk about half the stuff we wanted to, because we had an interesting guest and went in all kinds of directions. So we'll give you the last word: give us your summary thoughts on the conference, or any other activities you're working on.

This is a great conference, and I like being here, but I will try to squeeze in one of these bullet items. Chief data officers, who are one of the target audiences of this conference, should recognize that big data is not revolutionary. It's important, and they should pay attention to it, but think about what a revolution is: a revolution is something whose results cannot coexist with whatever came before, and that's just not the case here. We don't want to allow big data to induce us to forget all the hard lessons we've learned since the 1970s about data quality, about modeling, about doing careful requirements analysis, about the fact that a single data source ought to be able to serve many different applications. You don't want to return to the 1970s mentality of applications having their own files, and many of the big data best practices seem reminiscent of those bad ideas that we learned our lessons about over the past 20 years. So the chief data officer's job with respect to big data is to recognize that big data doesn't change everything; it changes some things and not others, and the chief data officer is responsible for differentiating which is which and establishing policies accordingly. That's my last piece.

Good rap, Joe. Thanks very much for coming on theCUBE.

It was a pleasure. Great to be here.

Glad we could squeeze you in. Thank you very much. All right, keep it right there, everybody. We'll be back with our next guest. This is theCUBE. We're live
from MIT in Cambridge, Massachusetts, and we'll be right back.