From Cambridge, Massachusetts, it's theCUBE, covering MIT Chief Data Officer and Information Quality Symposium 2019, brought to you by SiliconANGLE Media.

Welcome back to Cambridge, Massachusetts, everybody. We're here at MIT in sweltering Cambridge, Massachusetts. You're watching theCUBE, the leader in live tech coverage. My name is Dave Vellante. I'm here with my co-host, Paul Gillin. This is special coverage of MIT CDOIQ, the Chief Data Officer event. This is the 13th year of the event. We started covering it seven years ago. Mark Ramsey is here. He's the chief data and analytics officer advisor at Ramsey International, LLC, and former chief data officer of GlaxoSmithKline, big pharma. Mark, thanks for coming on theCUBE.

Thanks for having me.

You're very welcome. Fresh off the keynote, a fascinating keynote this morning. A lot of interest here, tons of questions, and we have some as well. But let's start with your history in data. I sat down after 10 years. I could have stretched it to 20, but I figured I'd sit down with the young guns. There were some folks with 30-plus-year careers. How about you? What does your data journey look like?

Well, my data journey, of course, I was able to stand up the whole time because I was at the front, but I actually started a little over 32 years ago. What I always tell folks is that data and analytics has been a long journey, and the name has changed over the years, but we've really been trying to tackle the same problem of using data as a strategic asset. When I started, I was with an insurance and financial services company, building one of the first data warehouse environments in the insurance industry, and that was in the '87, '88 range. Once I was able to deliver that, I transitioned into consulting for IBM, and basically spent 18 years with IBM in consulting and services.
When I joined, the name had evolved from data warehousing to business intelligence, and then over the years it was master data management, customer 360, analytics and optimization, big data. Then in 2013, I joined Samsung Mobile as their first chief data officer. Moving out of consulting, I really wanted to own the end-to-end delivery of advanced solutions in the data analytics space, and so that made the transition to Samsung quite interesting, very much in consumer electronics: mobile phones, tablets and things of that nature. And then in 2015, I joined GSK as their first chief data officer to deliver a data analytics solution.

A long data history. And Paul, Mark took us through it, and you're right, Mark. It's a lot of the same narrative, kind of same wine, new bottle, but the technologies have obviously changed and the opportunities are greater today. You took us through enterprise data warehouse, which was kind of ETL and then map; then master data management, which is kind of this mapping and abstraction layer; then enterprise data model, top down. Then that all failed, so we turned to governance, which has been very, very difficult. And then you came up with another solution that we're going to dig into. But is it sort of same wine, new bottle from the industry?

I think it has been over the last 20 or 30 years, which is why I did the experiment at the beginning on how long folks have been in the industry.
I think that certainly the technology has advanced, moving to a reduction in the amount of schema that's required to move data, so you can move away from the map-and-move type of approach of a data warehouse, but it is tackling the same types of problems. Like I said in the session, it's a little bit like Einstein's phrase: doing the same thing over and over again and expecting a different answer is certainly the definition of insanity. What I really proposed at the session was, let's come at this from a very different perspective. Let's actually use data analytics on the data to make it available for these purposes. I do think it's a different wine now, and so I think it's just a matter of whether folks can really take off and head in that direction.

What struck me: you were ticking off some of the initiatives that have failed, like data warehouses. I was surprised to hear you say data governance really hasn't worked, because there's a lot of talk around that right now. But all of those are top-down initiatives, and what you did at GSK was really invert that model and go from the bottom up. What were some of the barriers you had to face organizationally to get the cooperation of all these people in this different approach?

Yeah, I think it's still key that it's not completely bottom-up, because then you end up really just doing data for the sake of data, which is also something that's been tried and does not work. I think it has to be a balance: really tackling the data from a full perspective, but also making sure that you have very definitive use cases to deliver value for the organization, and then striking the balance of how you do that.
I think one of the things that becomes a struggle is that you're talking about very large breadth, and any time you're covering multiple functions within a business, it's about getting the support of those different business functions. Part of that is really around executive support. I did mention it in the session: executive support, to me, is really stepping up and saying that the data across the organization is the organization's data. It isn't owned by a particular person or a particular scientist. In a lot of organizations, that gatekeeper mentality really does put up barriers to tackling the full breadth of the data.

So I have a question around digital initiatives. Everywhere you go, every C-level executive is trying to get digital right, and a lot of this is top-down, a lot of it is big ideas, kind of a North Star. Do you think that's the wrong approach, that maybe there should be a more tactical sort of line-of-business alignment with that threaded leader, as opposed to this sort of big picture, "we're going to change and transform our company"? What are your thoughts?

Yeah, I think one of the struggles is that I'm not sure organizations really have a good appreciation of what they mean when they talk about digital transformation.
In most industries, it is an initiative that's getting a lot of press within organizations, and folks want to go through digital transformation. In some cases that means having a more interactive experience with consumers, maybe through sensors or different ways to capture data. But if they haven't solved the data problem, it just becomes another source of data that we're going to mismanage. So I do think there's a risk that we're going to see the same outcome from digital that we had when folks tried other approaches to integrate information. If you don't solve the basic blocking and tackling, then having data with higher velocity and more granularity, when you haven't tackled the bigger problem, I'm not sure it's going to have the impact that folks really expect.

You mentioned that at GSK, you found you had collected 15 petabytes of data, of which only one petabyte was structured, so you had to make sense of all that unstructured data. What did you learn from that process about how to unlock value from unstructured data?

Yeah, I think it's extremely important with unstructured data to apply advanced analytics against the data, to go through a process of making sense of that information. A lot of folks have talked historically about text mining, trying to extract an entity out of unstructured data and using that for the value.
There are a few steps before you even get to that point. First of all, it's classifying the information to understand which documents you care about and which you don't. I always use the story that in this vast amount of documents, somebody has probably uploaded the cafeteria menu from 10 years ago. That has no scientific value, whereas a protocol document for a clinical trial has significant value. You don't want to look manually through a billion documents to separate those, so you have to apply the technology even in that first step of classification. Then there are a number of other steps that ultimately lead you to understanding the relationships in the knowledge that's in the documents.

A side question on that. You had discussed, okay, if it's a menu, you get rid of it, but there are certain restrictions where you've got to keep data for decades. It struck me, what about work in process, especially in the pharmaceutical industry? I mean, post-Federal Rules of Civil Procedure, it was everybody looking for a smoking gun. So how are organizations dealing with what to keep and what to get rid of?

Yeah, certainly the thinking has been to remove the excess, and to your point, how do you draw the line as to what is excess, right? You don't want to just keep every document, because if an organization is involved in any type of litigation and there are disclosure requirements, you don't want to have thousands of documents. At the same time, there are requirements, so it's like a lot of things: it's figuring out how you abide by the requirements. That is not an easy thing to do, and it really is another driver. Certainly, document retention has been a big thing over a number of years, but I think people have not applied advanced analytics to the level they can to really help support that.

Another Einstein-ism: keep everything you must, but no more.
So you put forth a proposal where you basically combined three approaches. The crawlers, the spiders, go out and do the discovery, and I presume that's where the classification is done.

That's really the identification of all of the source information, so that's kind of step one.

Find out what you have. Step two was the data repository, putting that in. When I heard you, I said, okay, it must be a logical data repository. But you basically said to the CIO, we're copying all the data and putting it into essentially one place.

A physical location, yes.

And then, I've got another question on that. Then you use bots and the pipeline to move the data, and then you drew the diagram of the back end, all the databases, unstructured and structured, and then all the fun stuff up front. The visualization and all the things you use.

Which people love to focus on, the fun stuff, right? I can't tell you how many articles there are saying you've got to apply deep learning and machine learning, and that's where the answers are. We have to have the data, and that's the piece that people are missing.

So my question there is, you had this tactical mindset. It seems like you picked a good workload, the clinical trials, and you had, at least conceptually, a good chance of success. Is that a fair statement?

Well, the clinical trials were one aspect. Again, we tackled the entire data landscape, so it was all of the data across all of R&D. It wasn't limited to just that. That's the top down and bottom up: the bottom up is tackle everything in the landscape, and the top down is what's important to the organization for decision making.

So it was essentially the entire R&D application portfolio.

Both internal and external.

Okay, so my follow-up question there is, that was largely an inside-the-four-walls-of-GSK workload, or not necessarily?
My question is, you hear about these emerging edge applications, and that's got to be a nightmare for what you described, in other words, putting all the data into one physical place. It must be like a snake swallowing a basketball. Thoughts on that?

I think some of it really does depend. You're always going to have these; IoT is another example, where it's a large amount of streaming information. So I'm not proposing that all data in every format and every location needs to be centralized and homogenized. I think you have to add some intelligence on top of that. Certainly from an edge perspective or an IoT or sensor perspective, it's about the data that you want to make decisions around. You're probably going to have a filter level that will impact those things coming in, then you filter it down to what you're really going to want to make decisions on, and then that comes together with the other data.

So it's a prioritization exercise, and that presumably can be automated.

Right, but I think we always have these cases where we can say, well, what about this case? I guess what I'm saying is, I've not seen organizations tackle their own data landscape challenges and really do it in an aggressive way to get value out of the data that's within their four walls. Like I mentioned in the keynote, it's always, let's do a very small proof of concept, let's take a very narrow chunk. What ultimately ends up happening is that becomes the only solution they build, and then they go to another area and build another solution, and that's why we end up with 15 or 20 fragments.

You had a great point about that. The conventional wisdom is you start small and grow it from there, you fail, and that's not how you get big things done.

Well, that's not how you support analytic algorithms like machine learning and deep learning.
You can't feed those just fragmented data from one aspect of your business and expect them to learn intelligent things and then make recommendations. You've got to have a much broader perspective.

I want to ask you about one statistic you shared. You found 26,000 relational database schemas for capturing experimental data, and you standardized those into one. How?

Yeah, I mean, we took advantage of the Tamr technology that Michael Stonebraker created here at MIT a number of years ago. Again, it's applying advanced analytics to the data, using the content of the data and the characteristics of the data to go from dispersed schemas into a unified schema. If you look across 26,000 schemas using machine learning, you can understand the consolidated view that gives you one perspective across all of those different schemas. Ultimately, when you give people flexibility, they love to take advantage of it, but it doesn't mean they're actually doing things in an extremely different way. They're capturing the same kind of data; they're just calling things by different names, and they might be using different formats. In that particular case, we used Tamr very heavily, and that, again, is back to my example of using advanced analytics on the data to make it available to do the fun stuff, the visualization and the advanced analytics.

So, Mark, last question. You well know that the CDO role emerged in these highly regulated industries, and I guess in the case of pharma, quasi-regulated industries, but now it seems to be permeating all industries. We have Gokula from McDonald's, and virtually every industry is at least thinking about this role or has some kind of de facto CDO. So, if you were slotted into a CDO role, let's make it generic, I know it depends on the industry, but where do you start as a CDO?
For an organization, a large company that doesn't have a CDO, or even a mid-sized organization, where do you start?

Yeah, I mean, my approach is that a true CDO is maximizing the strategic value of data within the organization. It isn't a regulatory requirement. I know a lot of the banks started there because they needed someone to be responsible for data quality and data privacy, but for me, the most critical thing is understanding the strategic objectives of the organization and how data will be used differently in the future to drive decisions and actions and the effectiveness of the business. In some cases, I mean, there was a lot of discussion around monetizing the value of data. People immediately took that to mean, can we sell our data and make money as a different revenue stream? I'm not a proponent of that. It's internally monetizing your data. How do you triple the size of the business by using data as a strategic advantage? And how do you change the executives so that what is good enough today is not good enough tomorrow, because they are really focused on using data as their decision-making tool? That, to me, is the difference a CDO needs to make: really using data to drive those strategic decision points.

And that nuance you mentioned, I think, is really important. Inderpal Bhandari, who's the chief data officer of IBM, often says, you know, start with monetization. How can you monetize the data? And you're right, I don't think he means selling the data. How does data contribute, if I could rephrase what you said, to the value of the organization? That could be cutting costs. That could be driving new revenue streams. That could be saving lives if you're a hospital, improving productivity.

Yeah, and what I've typically shared with executives when I've been in the CDO role is that they need to change their behavior, right?
If a CDO comes into an organization and a year later the executives are still making decisions on the same data, you know, PowerPoints with spinning logos, and they say, "Ooh, we've got to have...", if they're still making decisions that way, then the CDO has not been successful. The executives have to change their level of expectation for what it takes to make a decision.

Change agents, top down, bottom up. Last question.

Going back to GSK: now that they've completed this massive data consolidation project, how are things different for that business?

Yeah, I mean, look, Hal Barron joined as the president of R&D about a year and a half ago, and his primary focus is using data and analytics and machine learning to drive the decision making in the discovery of new medicines. The environment that has been created is a key component of that strategic initiative. So they are actually completely changing the way they select new targets for new medicines, based on data and analytics.

Mark, thanks so much for coming on theCUBE.

Thanks for having me.

Great keynote this morning.

You're welcome.

All right, keep it right there. We'll be back with our next guest. This is theCUBE, Dave Vellante with Paul Gillin. We'll be right back from MIT.