Good morning everybody. This is Dave Vellante of wikibon.org. I'm here with Paul Gillan, who's my co-host today. We're here live at the MIT Information Quality Conference. We've been covering this event for the last two days. This is day two for us, but on day one, the folks at MIT had a chief data officer event, a gathering of CDOs. Willa Pickering is here. She's a Senior Fellow and Information and Data Architect at Lockheed Martin, and she also chaired the CDO Forum. Willa, welcome to theCUBE.

Thank you, it's my pleasure.

So, let's see, first talk about your role at Lockheed Martin. What do you do there, and what's your involvement in data?

Yes, I've been with Lockheed for over 30 years, and my specialty is data. I started out on the space shuttle, and back then we had a general database; that was records and files and so forth, before the relational database. Over the years, I've used many technologies as the field has evolved. We began with the relational database for transactional or operational purposes, went into data warehousing, went into XML, and so as the technology changed, my position changed, and how I was handling data changed. Most recently, of course, we're challenged with big data, so I'm spending most of my effort now on unstructured data, big data, and how we're going to address that.

So the space shuttle must have been a bittersweet moment, the last flight, but congratulations.

Yes, it was. It was a great experience, but it was sad. I went out to see it at the Smithsonian when they brought it in. It was rather sad, and it was rather shocking, because I was involved with it when we were building it and it was all clean and pretty and pristine, and when they brought it in, it was used, and it just shocked me.

Now, you were talking about the evolution of data over the years, and you've seen the database ebb and flow.
Database, you know, 10 years ago was kind of a boring topic, and now it's exploded again with all these startups, and everybody's trying to figure out which one to use, and mine's better than yours, and so that's kind of interesting. So what else has changed about data over the last couple of decades? You mentioned XML, and we were all enamored by XML when it started; now it looks sort of arcane when you see all this big data stuff going on. So talk about some of the changes that you see, and you mentioned big data, what you see as some of the trends going on.

Yes, it has changed, and of course that's what really makes it interesting and not a boring field. Even though for about 30 years we've been involved with relational database systems, and some with object database systems, now it's how do we handle all of this unstructured data, and the volume of the data, the big data? We're talking about zettabytes of data, which is 10 to the 21st bytes. That's an unbelievable amount of data. When I started, and we've mentioned where I started, one of my jobs was to monitor the storage, the hard disk, because we were very, very limited. So of course the storage and the capacity have changed tremendously to handle all of the data that we handle. And then there's the volume that is coming in. You look at your iPhone and you can just see it; I think that is a good example of how we're getting so much data. And it's the variety of the data: there's video on there, there's GPS, there's all of this different kind of data. So we're going into unstructured data. We know pretty well how to handle structured data. That doesn't mean there aren't still problems with it, but we really know how to handle the quality of it, and we have processes and so forth.
So we know that, but now we're getting into this unstructured data, and we're getting into semantics and context and meaning, and so it's a different world, and we have a lot more to learn now about how we're going to handle it. So there's the variety of this data; the other aspect of it is the velocity of the data. Again, when you turn on your iPhone, you expect to get that data immediately. I have an app on mine that gives me the flight information, so I have it right there; I don't have to wait until they change the gates or something. We're used to real-time data. Well, when you think about how you're controlling that quality, if you want it right now, you're not going to do it as we did in the past, where maybe we had weeks to go through the data and cleanse the data. So that makes it a new world. And then Professor Stuart Madnick, who's also involved with the CDO Forum and the Symposium, has added veracity, a fourth V, and that is the quality of the data. Are we going to get into more fuzzy statistics, since we have more data and it's different from the structured data we had? If you look at the finance world, for instance, you expect your data to be absolutely correct, but maybe that's not necessary. In the forum, somebody mentioned that it only needs to be believable. If you don't believe it, you're not going to use the data to make any decisions. So how accurate does that data have to be, and how accurate can you even make it, if somebody wants it real-time and wants it immediately to make their decisions?

You're opening up a huge topic, but it's a fascinating one. We hear a lot about unstructured data and how to tame that problem. In the example of getting your flight information, that's a highly structured form of data, and the airlines are very good at capturing that.
But increasingly, the decisions we have to make revolve around unstructured data, and as we see now with social media, we see huge amounts of information accompanying any event, of which some is true and some isn't; there's a lot of misinformation now. But it's out there, so somebody may believe it. And when you're making business decisions as an organization or a government entity, you're trying to make decisions based upon what people are saying, not knowing whether it's true or not. It seems to me like we're heading into a very risky area. Do you think technology is going to help solve that problem?

Well, I think it is a problem that we have to address, and it's certainly true that it's changing and that you don't know what data's out there. There's noise in addition to the signal, but there's also malicious alteration of the data. So there is a lot of risk in the data that you use, and that sort of gets back to the fuzzy statistics and the believability, and what kind of risk do you want to take in using real-time data?

Well, right. The other big change is that, let's say even a decade ago, all the really important data we put into a box, so we had it centralized. But now data by its very nature is distributed. You talked about all the variety, and it's everywhere. Mobile has really changed that, the cloud as well. And then you mentioned real-time. Now you have this thing, the industrial internet, and you have machine data. So, you heard Peter Aiken's discussion: is this the role of a CDO? Can somebody actually have a single role to coordinate a strategy around that data? Is that the right approach, or, as Paul was asking and sort of pushing Peter a little bit, do we need to make individuals responsible, more the IDG model, Paul, the distributed nature? What do you think about that?

Well, I think that what we have defined has changed a lot.
We're still working on it. I think there was a question about this: no one that we've heard from on theCUBE has agreed on really the role of the CDO. You're hearing a lot of different roles. But it is evolving. This was the third year that we had the CDO Forum, where we get the senior executives together to discuss it. The outcome of this was that the role of the CDO is to look at the relevance to the business. It is a business case. So you look at your data and you say, if I'm in marketing, for instance, and I think just this week there was an article saying they were going to monitor how long women stayed at the perfume counter, so if you're in marketing, you're going to have one perspective. If you're a government contractor, as I am, your perspective might be different. But the role of the CDO is to look at what is relevant to the business, and then how do I get that data? What is the strategy for that data? How do I decide? Is it more important that I get that data out there real-time? Or is this one of those cases where you really do still want to keep that data controlled, because it's very critical data?

So as a proxy for the CDO Forum that you had earlier: one of the things that strikes me is this audience is pretty pedantic when it comes to data. They're very focused on that notion of data quality. But you mentioned big data, where sometimes it's okay to be fuzzy.

That's true.

I can infer. So what can we learn from, let's say, the traditional data quality folks that we can leverage, and what do we have to be careful about in terms of creating that business relevance?

Well, of course, that is the major question, and that's where we have to go next. We really have to say, how can you add value, and how can you leverage what we already know and the technologies that we have? Which ones are relevant? And part of the role of the CDO is data analysis. You hear a lot of talk now about data analytics and data scientists.
That's a very important role, because that is where the value is added, and that's where these decisions have to be made about how you're going to get that value out of the data.

Derek Strauss actually had an interesting concept. He said he was sort of a proponent of this notion of a data marshaling yard, and building processes around that. So maybe that accommodates some of the fuzziness, and then you pick off the golden nuggets, if you will, and put them in the God box.

Yes, I think that has evolved from Bill Inmon. Bill Inmon was the father of data warehousing. In the early 90s, when I built my first data warehouse, his was the only book that was available. As time has changed and as more data has become available, he has continued to write and to evolve how we use the technologies that we have, and yet how we have to expand them to meet this new world. Part of it is that he went from the traditional data warehouse with structured data, then looked at unstructured data, and now he's looking at this marshaling yard approach. And he and others point to the other technology that we hear a lot about, data warehousing appliances. What they do is combine the techniques that we have for structured data with unstructured data. For example, if you just think about a library, your index is very structured, so you can have that in your relational database. But if you look at the individual books, they're very unstructured data. So you handle those differently, but you combine the two, so that the structured data is like metadata, a pointer to the unstructured content. And then with the unstructured data, you have to look at techniques for context, meaning, and semantics, and how you're going to go through that data and pick out the information that you need.

Metadata is going mainstream. Even the New York Times writes about metadata now, Paul.

Exactly, yeah, it's cool.
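The library analogy above, a structured index pointing at unstructured documents, can be sketched roughly like this. This is a minimal illustration of the general pattern, not Lockheed's actual approach; all names and data here are invented:

```python
# A minimal sketch of the "library" pattern described above:
# a structured index (metadata) pointing at unstructured text.
# All record names and contents are hypothetical.

# Structured side: metadata records, as they might live in a relational table.
catalog = [
    {"doc_id": 1, "title": "Shuttle Telemetry Notes", "year": 1984},
    {"doc_id": 2, "title": "Data Warehouse Design", "year": 1992},
]

# Unstructured side: raw document text, keyed by the same doc_id.
documents = {
    1: "Telemetry streams were recorded on tape and reviewed weekly.",
    2: "The warehouse combined operational extracts into one model.",
}

def search(keyword):
    """Scan the unstructured text; return the matching structured metadata."""
    return [
        rec for rec in catalog
        if keyword.lower() in documents[rec["doc_id"]].lower()
    ]

# The unstructured scan yields structured pointers back into the catalog.
hits = search("warehouse")
print([rec["title"] for rec in hits])
```

In a real appliance the unstructured side would be handled by full-text or semantic indexing rather than a linear scan, but the shape is the same: structured metadata acts as the pointer into the unstructured content.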
One technology: I asked you earlier about technology to solve this problem. The IBM Watson technology raised quite a bit of interest a couple of years ago now. That's essentially a machine that takes large amounts of unstructured data and makes sense out of it. Do you see technologies like that coming along that you think are going to significantly improve our ability to understand unstructured data?

Oh, absolutely. I think there's more and more research on technologies to handle unstructured data, and we are going to see that evolving. In the very far future, we're probably going to see quantum computing, which is different, of course, from our analog and our digital computing.

How does that relate to the data problem?

I think we don't fully know the answers to that. But what it does is look at data more as an object, so you have object views of the data, and you have the different dimensions of the data. And I know that some will say, well, it's not coming along very fast. But I've been around for a long time; my background was at Los Alamos, and when they had the first digital computers, they were very primitive. So with the quantum computer, we can't really say all the capabilities it's going to have; we have to give it time to evolve. But that's just one example of where we might have a technology that would help us more in looking at the context of all this large data and saying, where is the meaning in that data? How can I find what I need to find in that data? I don't think right now we could predict where all those technologies are going to crop up, but I definitely think they're coming in the future.
But in that example, you've got a blob, the data's an object, and if I understand it correctly, you've got metadata, maybe embedded or accessible, about that object. That's obviously a different concept, and it changes the way applications interact with that data. But it's also a much simpler way to manage data, isn't it, conceptually?

It is, and it certainly is a new way to manage data, and it gives us the ability to handle all of this textual data that we now have and that we have to work on. And so that's the future. I think we see more and more startups, more people looking into: what can we do with this data? How can we get value out of the data? And that really is the key. You can have all the data in the world, but if it isn't relevant, then it has no value. So what is the value-add proposition, and then how do I make that occur?

So I want to ask you another question. A lot of things are going around in my mind here. One is open source. We haven't talked at all about open source at this event, and when you think about big data, it's all built on open source. What are the implications for information quality of the propensity of applications being developed around open source? Is there a connection there?

Well, certainly, if it's open source data, then it is questionable. What is the noise in it? It's what we mentioned before: has it been maliciously changed, and so forth? You are taking a risk when you use open source data, yet on the other hand, it opens new worlds for us and new abilities. For instance, the prediction, which may be a little extreme, is that every object will have a sensor. Well, think about something like your car: driving down the highway, if you have sensors, you can know what the traffic is ahead, it can monitor how fast you go up the ramp. There's a lot there.
If the sensor's right there in your car, it also can affect how you get your insurance, because what happens now when we apply for insurance is: what age are you? Have you had any tickets? And so forth. Based on that, you fall into a certain category. But in the future, because of all these sensors, can it be more personalized? If you have sensors in your cars, then you know how that person is driving. Are they accelerating hard and slamming on their brakes? So it's changing our world.

Doesn't one of the insurance companies actually have a product where you can install a sensor?

You can do that now. And there's real-time traffic. You can get real-time traffic monitoring using smartphone apps right now, with the positioning capability of the cellular network.

And I used it to save 15 minutes this morning by being rerouted.

Sounds like Waze, right out of a keynote.

There are privacy issues there. There are surveillance issues. There are a lot of thorny cultural problems.

And privacy is exactly right. Do you really want that sensor? You're getting it now where it's publicly available, but do you want the sensor where it's very personalized, in your car?

Do I want my insurance company actually monitoring my speed?

Exactly, exactly. So many issues.

I want to ask you about governance, because obviously Lockheed is a company that has had to be at the forefront of really understanding data and having rules around data. What are the characteristics of organizations that you think really do understand the data governance problem and have internalized the need for quality?

Yes, it certainly improves the quality of their data, and it makes them more reliable in the decisions that are made. That's very critical, because if you make a decision, and the data person is not providing good data...
Then if a bad decision is made, they're going to turn off your project right away. They're going to say, that's just no good, I don't trust it. So we have very strict rules, but they're very established rules on data governance and how you set it up. The important thing is to have all the stakeholders involved, all the people who are processing and building that data. And then you have profiling that you do to monitor the data, and auditing. So we have a very established way of handling our data governance. And it is very critical, any time you have a lot of data, that you handle your data and get it as correct as you can.

Now of course Lockheed Martin is a huge government contractor, and some of these issues, I imagine, are mandated by the government, by the conditions of your contracts. What about organizations that don't have rules like that? Do you see market forces actually enforcing the need for data quality?

I see that, and I particularly saw it in the CDO Forum. As we talked about the evolving role of the CDO, one of the main things was data management and data governance. That was right at the top: we have to govern our data. We have to do the best we can to have quality data. So I'm sure that many non-government contractors already have established data governance, and it will continue to grow, depending on the value that you want out of that data.

So we heard from Peter Aiken that only one in 10 organizations has a board-approved data strategy. And he said the number of organizations that have a CDO is significantly smaller than that. How do you see that changing over the next decade?

Oh, we've already seen it change. When we started three years ago, that's when we were beginning to see the trend: more and more organizations were adding the CDO function. And so they were coming to Professor Wang and saying, how do I do that?
What are the best practices? What should I do? These are the problems that I have. And now, I think Peter mentioned that 70% of the CDOs have been hired in just the last year. And at our forum, we had actual CDOs attending. Before, it was CIOs and CEOs. We still have a lot of CEOs and VPs, and we do limit it to senior executives, but now we're getting actual CDOs.

So if people want to get involved, where do they go?

The best place is to contact MIT. Professor Richard Wang and Professor Stuart Madnick are the founders of these events.

All right, Willa Pickering, thanks very much for coming on theCUBE. It was a great segment. You have tons of experience and great perspectives, you've seen the industry evolve, and I really appreciate you coming on. Paul and I enjoyed the conversation.

Thank you, it's a pleasure.

All right, keep it right there, everybody. We'll be right back with our next guest. Dave Vellante and Paul Gillan, this is theCUBE. We're live from MIT in Cambridge, Massachusetts. We'll be right back.