Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor of DATAVERSITY. I'd like to thank you for joining the current installment of the monthly DATAVERSITY webinar series, Real World Data Governance with Bob Seiner. Today, Bob will be joined by long-time DATAVERSITY friend, Tony Sarris, to discuss a data governance framework for smart data. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the upper right-hand corner for that feature. For questions, we will be collecting them via the Q&A panel in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag RWDG. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and additional information requested throughout the webinar. Now, let me introduce our speaker for today, Bob Seiner. Bob is the president and principal of KIK Consulting and Educational Services and the publisher of The Data Administration Newsletter, TDAN.com. Bob has been the recipient of the DAMA Professional Award for significant and demonstrable contributions to the data management industry. Bob specializes in non-invasive data governance, data stewardship, and metadata management solutions. And with that, I will give the floor to Bob to introduce his guest speaker and get today's webinar started. Bob, hello and welcome.

Hi, Shannon. Hi, Tony. Hi, everybody. Thanks for attending this month's installment of the Real World Data Governance series. As Shannon mentioned, the name of this installment is A Data Governance Framework for Smart Data, and smart data seems to be one of those hot new topics that everybody is talking about.
I'm getting a lot of questions about it. Whenever the term smart data is used, people have questions as to what makes data smart. We decided when we put the series together that we would have special guests from time to time, and as Shannon mentioned, a good friend of DATAVERSITY and a new good friend of mine, Tony Sarris from n2Semantics, is a smart data specialist, somebody who's going to help us really understand what smart data is, how it relates to data governance, and what is going to be necessary in order to put a data governance framework in place for smart data. So thank you very much for attending the session today. Glad to have you with us as always. Before I get started and before I introduce Tony, I just want to share a couple of items of note with you. The webinar that will be part of this series next month is Do-It-Yourself and Purchased Data Governance Tools. If you've been attending my webinars in the past, you know that I spend a lot of time sharing tools and templates and things that I've used over the years with practitioners and clients to help them establish their data governance programs. I look forward to sharing with you, in a lot of detail, more information about do-it-yourself data governance tools and also some tools that you could purchase on the market. So please register online for next month's webinar; I'll be looking forward to having you there. A couple of other quick items of note. If you're not familiar yet with the fact that I put together a book called Non-Invasive Data Governance, there's information on the screen that will help you locate that book if you are interested. Also, kikconsulting.com is my website, and that is the home of Non-Invasive Data Governance.
I wanted to share with you a DATAVERSITY event that I will be speaking at shortly, and actually there's another one that's not even listed on there: the Data Governance and Information Quality Conference taking place at the end of June and the beginning of July. I will be giving two presentations. One is a tutorial on assessing your existing data governance program, and then we're also going to talk about data governance, privacy, and the Internet of Things. Later in the year, this fall, there's a Data Governance Finance conference as well. You can learn more about that on the DATAVERSITY site, and I will also be giving a presentation at that conference in Jersey City. Last but not least, if you're not familiar with The Data Administration Newsletter, TDAN.com, as Shannon mentioned, it's a publication that I've had since 1997. It touches on as many issues related to data and data management in industry as I can cover. The monthly issue, the May issue, is already available, and there just so happens to be an article from my special guest today. Tony Sarris provided an article called Bots in a Smart Data World. So if you're interested in learning about bots in a smart data world, please take a look at the newsletter and register online for newsletter updates, which come monthly as well. So now that we've gotten that out of the way, I want to talk to you a little bit about the abstract that we used for today's session. The first question that I have for you is: does your organization have something that it calls smart data? How is your organization going about defining smart data? Tony, in a couple of minutes here, is going to share with us an excellent definition of what smart data is, why you should be paying attention to smart data, and why that really is the wave of the future. Smart data is data used in non-traditional ways, as mentioned on the screen here. Businesses are embracing big data.
They're talking about putting big data governance in place. Now that they're starting to extend that to include smart data, we're expecting that there will be needs for smart data governance as well. So we're going to talk about that today. We're going to talk about putting a framework in place, or at least pieces of a framework, to govern your smart data. Some of the things we're going to cover today: Tony is going to provide, here in a second, an easy-to-understand definition of smart data, why you should provide a framework to govern it, how smart data differs from traditional sources of data, and how smart data can and will be used in the future. And last but not least, I'm going to share with you a couple of the items that I believe go into creating a framework around governing your smart data. So without further ado, I'm going to introduce my special guest for today. Hi, Tony.

Hi, Bob. Thanks for having me.

It's good to have you here. Tony works very closely with DATAVERSITY, and in our conversations prior to this webinar, he has really provided me with a lot of insight into what smart data is, and I know that I'm getting questions about it, if not daily, then certainly weekly. Tony is the founder and principal of n2Semantics, a consultancy based out of Orange County, California, and he specializes in semantic technologies, artificial intelligence, and cognitive computing. Tony, is there anything else that you want people to know about you before we get started with the webinar?

Well, you were kind enough, or crazy enough, to put my profile picture out there, so as you can see, my hair is kind of salt and pepper, a little more on the salt side these days. I've been in the field of data management, data modeling, and database administration for a number of years.
People always ask me how I got into artificial intelligence since I started off in more traditional data management, so I thought I'd mention that I was working in the early and mid-1990s in database integration. That was the time everybody started connecting their databases, and of course they found out immediately that the conceptual definitions they had for a lot of the entities and data elements in those various databases, when they were talking about the same or similar entities or data elements, turned out to be different, sometimes in big ways, sometimes in subtle ways. So we started exploring the use of metadata and conceptual schemas to be able to define what those data entities and data elements really meant in a formalized way. That got me into the field of ontology, or knowledge representation. I spent a number of years working with metadata for development repositories, so component-based development and web services. I eventually got into things around the semantic web, so things involving RDF and the OWL ontology language for expressing semantics in an RDF context. More recently I've done work in the machine learning aspects of data analytics and also cognitive computing, so exploring things like intelligent assistant technologies and other smart applications that use the kinds of smart data we're talking about today. I think this is great timing for talking about a data governance framework for smart data. Smart data and intelligent systems in general are exploding on the scene right now, and oftentimes the people working in that field may not have a traditional data governance background, so the notion of how we manage and govern the data associated with these systems hasn't been explored as fully as it should be. I think it's time we really tackle those questions.

That's great.
I think you're the ideal person to have as my special guest on this webinar because you've got a lot to say, a lot of experience, and it's great that you have that traditional data management background to start from; certainly metadata is embedded in all aspects of data governance and all aspects of data management. So in a second, I'm going to ask you to provide a definition of what exactly we mean by smart data. Before we do that, though, I'm going to start by sharing the few definitions that I typically share at the beginning of this webinar series: my definitions of data governance, data stewardship, and non-invasive data governance. The definition that I use for data governance is that it is the execution and enforcement of authority over the management of data and data-related resources. I've heard lots of definitions for data governance, but I like my definition to have some teeth behind it, because the truth is, no matter what approach you take to governing data, whether it's smart data, big data, master data, metadata, or just your traditional business data, there is a need at the end of the day to be able to execute and enforce authority over the management of that data. We're certainly going to talk about that in a little more detail as we get into precisely what we mean by smart data. Data stewardship, on the other hand, is what I consider to be the formalization of accountability of people over the management of data and data-related resources. There are people within organizations that have levels of responsibility for the data, including smart data, the up-and-coming field, and big data, which has been with us for several years. Really what we're talking about is formalizing accountability rather than handing accountability to people as something that's brand new to them.
So typically when I talk about non-invasive data governance, non-invasive really describes how governance is applied, so that we can govern data in a non-threatening manner, in ways that are going to fit well into the culture of the organization. The goal of being non-invasive in the approach to governance is to be transparent, supportive, and collaborative. So with that, I'm going to turn it over to you, Tony, for a couple of minutes to provide us a good definition of what the heck smart data is and what makes data smart. Please take the floor and explain to the listeners what smart data is.

Well, I think probably one of the first questions people have is: is smart data just another name for big data? I think the two are closely related, but they're not the same thing. I view big data, in a way, as smart data in raw form. It's the data that gets analyzed, and semantics get extracted from that data: meaning, relationships, those sorts of things. That turns it into smart data. And as big data gets used by smart systems, those systems add additional metadata or other aspects of meaning around that big data that, again, pulls it into this realm of smart data. One thing certainly true about smart data is that it's self-describing through metadata. That's partly where it gets its smarts. Certainly it has the sort of traditional administrative metadata that you would expect any data to have, but it also has other metadata that provides the meaning associated with it. That meaning often comes from understanding the context of the data. Some of the aspects of context are things that we've tracked with metadata for a while: the source of the data, maybe who uses it, the location, the time. But we're going much deeper into that, and we're collecting large sets of metadata about the data and then doing deeper analysis of it.
So we may look at the location in comparison with the time, in comparison with the usage by particular apps, one app and then another app that might pick up the data and follow on and extend it and expand it. We're getting much deeper and richer and more complex views of how that data is being used, and new data that's being created by intelligent systems. The data can be about a domain itself, and it can be about the users that are actually consuming that data. If you think about smart data in the context of, say, a retail system, we collect a lot of data about people: their historical buying patterns and their interests, the sites they go to, how often they clicked on something to buy, and what was the call to action that drove them to click to buy. So we're collecting all this analysis about user behavior in addition to just the pure data about the activity that the user is doing. And all of that is now being analyzed to produce much more of a bigger picture, a view of the world that we live in, and to provide a lot more meaning about that world in various domains, and I'll talk about those domains in a minute. The data gets additional meaning. Some of it is explicitly described. Some of it is annotated by humans, so you can pull in, for example, an ontology, maybe a linked open data model from the semantic web world, or some other domain model that can augment or provide meaning to big data. You could do something like tag it, as is often done in the semantic web world. You could introduce some structure with something like a schema.org annotation.
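[Editor's note: for readers following along, a schema.org annotation like the one just mentioned is a small piece of structured markup attached to a page or record. The sketch below, in Python, builds a minimal JSON-LD annotation. The product name and values are hypothetical illustrations, not content from the webinar.]

```python
import json

# A minimal schema.org Product annotation expressed as JSON-LD.
# @context, @type, name, and brand are standard schema.org/JSON-LD
# keywords; the product itself ("Acme Smart Thermostat") is made up.
annotation = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Acme Smart Thermostat",
    "brand": {"@type": "Brand", "name": "Acme"},
    "description": "Learns your schedule and adjusts the temperature.",
}

# Serialized, this is what would be embedded in a web page inside a
# <script type="application/ld+json"> tag so crawlers and smart
# applications can read the page's meaning, not just its text.
markup = json.dumps(annotation, indent=2)
print(markup)
```

Adding this kind of explicit structure is one of the simplest ways "dumb" content becomes self-describing smart data.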
There's a way that humans do that very deliberately, but I think probably more frequently today what we're seeing is actually inducing the meaning related to the data, or patterns and relationships about the data, by running statistical and other data analytics processes against it. The whole set of things that we call machine learning, and all the algorithms that we're applying to that big data now, really provides the smarts behind it. In addition to that, we've got data that's actually generated by the machine. In a lot of cases, we're reporting on the data, we're doing prescriptive things with the data, maybe question and answer, but oftentimes now with smart systems we're moving beyond that into things like predictive analytics. This is data that was actually created by the system, maybe based on data that it already had about trends, but now it's going to step beyond that and really give us an idea of what might be, or simulate certain things. So, machine-generated data. Cognitive computing types of systems inherently feed on themselves in a positive way. There's a lot of feedback and learning. They look at which things are successful; maybe they make a prediction, and based on the success or failure of that prediction in the real world, they learn from that. That layering they do about which things are working and not working, which things people like and dislike, all goes into building up this additional set of metadata that provides smarts around the data.

You know, one of the things that I noticed in your definition of what smart data is, is that you used a lot of the terms that people from traditional data management backgrounds are very well familiar with.
When you talk about metadata and analytics and patterns and things like that, it's good to have a data management background in order to get a better understanding of what smart data is. Are you finding that a large percentage of organizations, or a small percentage, are embracing smart data and calling it smart data? What are you seeing in the industry as it pertains to smart data?

I think initially, particularly as we moved into data analytics and machine learning, a lot of the data scientists in that field weren't necessarily familiar with traditional data management. Oftentimes they would run these algorithms and identify patterns, and they would use the patterns subsequently in smart applications like prediction and that sort of thing, but they wouldn't necessarily think of persisting that data, creating a model of it, managing it, and doing all the things that those of us who come from a more traditional database management background would naturally think of doing. So I am seeing now, even as recently as the last year or so, people in the data science community beginning to get a lot more sensitive to and rigorous about the whole aspect of data management, looking at the terms and practices that have been used in more traditional data management contexts, whether that's databases or other ways of managing data, and beginning to pull some of that into their work as well.

Yeah, and you know what, it's interesting, because those of us that are data governance practitioners are always looking for new and innovative ways to apply governance to the data that our organizations use.
So it makes sense, and it's great to have people that are focused on smart data learning more about data management skills, and data management people learning more about what smart data is. When we talk about data governance, there are so many different types, and people give different labels to data governance. You hear about things called big data governance and master data governance, and now smart data governance. Well, there are a lot of differences in the types of data that are being used. So are there things that we can learn from what we're doing around governing big data and unstructured data and those types of things, from a metadata perspective, from how the data is defined, as you mentioned earlier, and produced and used in organizations? I would assume that there are ways we can use what we've learned from governing, I won't call it dumb data, but data that's not yet considered smart data. Do you see that there's room for governance of smart data as organizations start to embrace the idea?

Yeah, and I agree with you. I don't think of it as dumb data. I think it's data that has semantics inherent in it that sometimes haven't been teased out. What we're doing now is releasing that, getting at that additional information. So I do think there are a lot of tie-ins between the two. I think a big data governance framework can be extended into smart data based on the sorts of processes that you're running against that data. And the same thing applies, as you mentioned, with concepts of metadata and using metadata to manage it; I think that is directly germane to the world of smart data as well. It may be different metadata. We may be adding additional metadata we didn't have before, and some of it may be reapplying existing metadata in a somewhat different context.

I'm really glad to hear you say that.
I mean, there are a lot of organizations that think that they need to treat big data governance as something different, or master data governance as something different. And really, what I'm finding now is that the organizations I have had contact with that are embracing smart data and smart data applications are looking to apply the existing levels of governance that are taking place within their organization. One of the next questions I'd like to ask you is this: you gave us a really solid definition of what smart data is. Please share with the listeners of the webinar, and with me and with Shannon, how organizations are applying smart data for use within their organization, and then maybe share a little bit about how these uses might have a need for that data to be governed just the same way that we govern other data in the organization. I've listed several applications that you've shared with me. If you could walk through those and share a little more detail about them, I'd appreciate it.

I think you're right that the usage of the data in particular ways and particular applications does inform or drive the requirements for data governance. The ones that you see there, that we spoke about when we talked before, are ordered from maybe least intelligent to most intelligent, starting with adding semantics to the search and filtering capabilities that we've had with things like Google or Bing and other enterprise search tools. So now we're searching by concepts. We've introduced a little bit of smartness by using concepts. The Google Knowledge Graph that's underneath their Hummingbird algorithm is an example of that.
And many enterprise content management systems and search tools now are beginning to introduce a semantic graph or semantic model in the background, underneath their search engines, that helps to do that intelligent search and filtering. And so we need to govern the concepts that exist in the models they're using underneath, and understand how those are being applied in search and filtering. The next level, building on that, is what I've characterized as intelligent content discovery. These are cases where maybe you give a set of criteria to an intelligent system and it goes off and does automated research on your behalf, or, what's probably a more common experience for people, a personalized news feed. You tell an intelligent content assistant things that you're interested in, or point it to a particular article that you like, and it gets a sense of what your likes are and presents you with news items that correspond to those likes or interests. A company I've worked with for quite a while, Criminal, is very big in that arena and is a good example of using intelligence about both the domain that you're interested in and your interests, combining those to do things like personalized news feeds. Question and answer systems are another big one: being able to respond to questions that people have, particularly in a domain like customer support. That's probably one of the big application areas of Q&A systems, to basically take the knowledge that we have about products and use it in a customer support context. But health advice is another big Q&A type application.
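[Editor's note: to make the earlier point about searching by concepts rather than keywords concrete, here is a toy sketch in Python. The concept map and documents are invented for illustration; real systems like the ones Tony describes back this with a full ontology or knowledge graph, not a hand-written dictionary.]

```python
# Toy concept index: each concept groups the surface terms that express it.
# A real semantic search engine would draw these from a knowledge graph.
CONCEPTS = {
    "automobile": {"car", "auto", "automobile", "sedan"},
    "physician": {"doctor", "physician", "gp"},
}

def concept_search(query_term, documents):
    """Return documents mentioning ANY term for the query's concept,
    not just the literal query string."""
    q = query_term.lower()
    # Find the concept the query belongs to; fall back to literal match.
    terms = next((t for t in CONCEPTS.values() if q in t), {q})
    return [d for d in documents if terms & set(d.lower().split())]

docs = [
    "the sedan sped away",
    "please see your physician",
    "buy fresh fruit",
]
# A plain keyword search for "car" matches nothing here; concept search
# finds the first document because "sedan" expresses the same concept.
hits = concept_search("car", docs)
```

The governance angle Tony raises is exactly about that `CONCEPTS` layer: someone has to own, define, and maintain the concept model underneath the search engine.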
Image and object identification systems are very big right now in a lot of different markets, and I'll talk about markets in a minute: knowing who's in the image, what's in the image, what they're doing, that kind of stuff. You can imagine tagging the images with that kind of information and being able to make use of it for search and other purposes. Then we begin to get into a different aspect of intelligence, around emotions. Sentiment analysis is very big right now for understanding how your customers view your product or your brand, the perceptions they may be expressing about it in social media. So for all the data associated with that: how do we collect it, manage it, and use it? Trend analysis takes that to the next level, too, looking at maybe product purchases, historical sorts of things, political issues, wanting to understand what current trends are. And then we move into the world of recommendations and predictions, becoming active with that data. Recommendation systems in their simplest forms are things we use every day, in a product like Amazon, for example, with their product recommendations, or Netflix with movie recommendations. But I'm seeing a lot of work in that area around jobs. If you're job hunting, looking at your skills and your resume and being able to match that to open jobs. Or, the other thing I find interesting, employers who are looking for candidates who are passive, who aren't necessarily out looking for jobs right now. They're using tools to analyze the web for the artifacts you've produced. Maybe you've checked in code on GitHub. Maybe you've written a blog.
Maybe you've done things like that that create a presence, and they're using that to analyze and match candidates to the open job reqs that they have. Decision support systems become even smarter. Things like evaluating customers for loans or looking at the risk around insurance. Healthcare certainly is very big for decision support when we're talking about diagnosing conditions or coming up with mitigations for things like chronic care plans. Decision support tools are used heavily there; IBM, famously with their work on Watson, is very heavy into that market. Moving even further: simulations. Chatbots that seem like you're interacting with a human, in a context like customer support or even just in a fun, entertainment context. I have a teenage son who's a big gamer, and the non-player characters in games are getting very sophisticated. They learn based on playing with humans, and they become smarter and more challenging over time, picking up the strategies that humans use when playing these games. And then, last, full predictions, whether you're talking about elections or stocks or people's purchases: using all of that information we've collected to intelligently make predictions about where things are headed, based on existing data, but really showing some intelligence themselves in terms of how they might move beyond obvious predictions.

And so I can see what you meant: as you go through the list of bullets of the applications of smart data, at the bottom of the list are things that we're more traditionally aware of, from decision support and data warehousing and business intelligence and those types of things, like simulations and predictions.
Those are things that organizations have been looking at and have been doing longer than some of the things that are at the top of the list. It seems to make sense that these data sources, the data coming from different places, are being used in different and innovative ways to really assist in managing customers. Are you seeing that the organizations applying smart data are focusing on the governance of that smart data, or is that something where there's still a significant amount of room for improvement?

Well, I think, as you said, for the applications that are more incremental or evolutionary steps from what we've been doing, they are. For the ones that are newer, a lot of what's going on in the enterprise right now is very much sticking a toe in the water: pilot applications, maybe even just exploring these technologies and trying to get a sense of where they might have applicability for the business. So I'm not sure in those cases that they've really explored yet what the ramifications might be for data governance. And I do think it varies by market; some markets are further along than others. I mentioned customer care before. Customer support is an area that is fairly far along, in large part because you have a lot of product information, a lot of facts, a lot of information about how to care for and maintain a product. It's been flipped into this notion now of responding to questions that users may have, so knowledge-centered support. We've restructured that data, and so we're able to answer questions that people have in an automated way, by chatting with a bot or by them just querying knowledge-centered databases more directly. So it's a very good fit for this sort of thing.
And there's a learning and feedback loop in there, too, based on how people interact with those systems. They become smarter. They understand the kinds of questions that people ask them and the ways that they ask them. All of that metadata that gets built up about the interaction of users with those systems becomes very valuable for improving those systems. Healthcare is another area that I mentioned. Was that a question, Bob?

No, just go ahead. Continue through the applications. I think that's great.

Okay. So healthcare is another one. There's even more data than in customer support; I think it's exploding. Personalized healthcare, everything from Fitbit or wearable kinds of things to people sharing much more about their health information in a formal way. Obviously there are privacy, security, and other concerns in that domain, probably to a much higher degree than in almost any other domain that we're working with. So the data governance aspects there have a pretty high threshold, whether you're talking about individual data, so personally identifiable information, or aggregate data. One of the issues that I see a lot of discussion about in the smart data community, particularly in healthcare, is this notion that aggregate data is very valuable, but can we share and analyze aggregate data without risking exposing people's individual data? Famously, I think it was even in the late 90s, when Massachusetts introduced their equivalent of Obamacare or the Affordable Care Act, when they introduced their state health plan, William Weld, who was governor at the time, famously said that we could share all this data and analyze it and protect people's privacy.
And of course a grad student was easily able to take the publicly shared data from the health plan, correlate it to other publicly available data about people, and get to William Weld's individual healthcare data, his prescriptions and whatever, proving that the bar for protecting that data is fairly high. And if you think that was the late 90s, with where we are today, the algorithms we have to process those data sets, and the explosion in healthcare-related data sets, the bar is even higher for what we need to do to protect that data when we anonymize it. There's a lot of work going on in the field called differential privacy that data scientists are focused on, where they very scientifically, algorithmically introduce noise into data sets. They do that in such a way that it doesn't statistically alter the value of the data set, but it makes it extremely difficult, if not impossible, for people to use sophisticated algorithms to reverse engineer their way to individual data. So you see people cognizant of that in the smart data community, trying to apply advanced tools like differential privacy for that purpose. Education is another big area, very personalized. Again, particularly in K through 12, there's a lot of sensitivity around the data that's collected about individual learners. But there's, again, a lot of value in aggregate analysis of that data; understanding how to tune curricula and feedback is very key. Collecting lots of data about how people are learning is very valuable in that particular market. In retail, everybody knows the analytics on the retail side and the issues around privacy. In that context, payments and electronic payment systems certainly raise lots of data governance issues.
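[Editor's note: for readers who want a feel for the differential privacy idea described above, here is a minimal sketch of the Laplace mechanism in Python, the most common way of "scientifically, algorithmically" adding noise to a released statistic. The epsilon value and the count are made-up illustrations; production systems use vetted libraries rather than hand-rolled samplers.]

```python
import math
import random

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity/epsilon.

    One person can change a count by at most `sensitivity`, so this
    noise level is what makes the released value differentially
    private: no individual's presence shifts the output distribution
    by more than a factor of e**epsilon.
    """
    scale = sensitivity / epsilon
    # Draw Laplace noise via inverse-CDF sampling from a uniform draw.
    u = random.random() - 0.5                     # uniform in (-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Hypothetical example: release how many records in a cohort match a query.
random.seed(7)  # fixed seed so the sketch is repeatable
released = dp_count(true_count=128, epsilon=0.5)
```

Smaller epsilon means more noise and therefore stronger privacy; the statistical value of the aggregate survives while individual records become much harder to reverse engineer.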
I think as we move into the unexplored part, where people now begin to bridge between the virtual retail world and the physical brick-and-mortar world, it's going to open up even additional privacy concerns and concerns about the kinds of data that we're collecting about people. Travel, leisure, and entertainment systems are mostly kind of fun things, and I think people have less concern there. If you're collecting data about people's movie habits or whatever, and you're making predictions or recommendations about that, you can maybe afford to be wrong a little bit more often; maybe the quality thresholds are a little bit lower. But it's so important to give people a good experience that you collect the right data, you understand what they're interested in, and you offer them the content that they're interested in. And of course, if they move to a buy scenario, we bring into play again all the things around security. Intelligent virtual personal assistants and bots. You mentioned the article that I wrote, which you were kind enough to publish in your newsletter, and that's a hot topic right now. I think bots are very specialized sorts of assistants, and people are maybe less concerned about a single-point bot. But as you think of these intelligent virtual personal assistants, like Siri, Cortana, Google Now, or the new Google Assistant they just announced, you're going much more horizontally and you're crossing a lot of different domains. These are things that people could basically use in almost every aspect of their personal life as they're out and about in the real world, and so they're going to have access to just incredible amounts of data about people and their habits.
And so I think there's a whole world of data governance and data management issues that need to be sorted out there. And similarly with the smart home, we're moving from the more sensor-based things, your Nest thermostat or your lighting system or maybe even your home security system, and certainly there are issues with not wanting to have those things hacked. But as we tie the smart home to things like Amazon Echo, and the competitor product I think Google just announced yesterday, you're going to again tie in entertainment, retail, healthcare. These smart homes will have just incredible amounts of data about families and the life they're living in the home, moving into their car as they go out into the world. And there are all sorts of issues in terms of hacking into cars and getting control of autonomous vehicles or driver-assisted vehicles. So you can imagine the space is just exploding, and there's definitely a lot of demand for data governance to help in the security, privacy, and just control of all the data that's being created, making sure that it's accessible, that people know what the data is, and that they know how and where to use it. Yeah, and I would guess that most of the people attending this webinar can find themselves in one of the industries that you just mentioned. We're moving beyond traditional uses of data and traditional ways of collecting data, and it's a scary world out there for a lot of people. The data that's coming from sensors, from devices, from your smart home and all those types of things, we need to prevent that from being hackable. We protect that data because it really tells people a lot of information about us, about our businesses, and those types of things.
So typically I talk about data governance in terms of making sure that we define data well, and it sounds as though in the smart data space the data is being defined pretty well. But the way that it's being produced, and who has access to it, and who can use it, certainly calls for us to take our traditional ways of governing data and extend them into some of these new markets and these new applications. Do you want to talk a little bit about some of the opportunity areas for smart data as we move forward? And then I threw a couple of questions down on the bottom right-hand side of the screen: What is the need for governance of these types of data, and do they need to be specially pulled out or set apart from the other types of data that we're governing? And how will data governance assist the organization to address some of these opportunities? So if you could first speak to the opportunities, and then why there is a need to govern that data just like we govern any other data in our organization. Yeah, I think people probably got a sense, as I was talking about the applications and the markets, of the kinds of issues that are implicit in introducing those applications in those particular markets, and it does vary by application and by market. But I think we're certainly seeing user profiles getting much richer: much more advanced demographic data, psychometric data, behavioral data, all the contextual stuff about where people are, what times they're at particular places, the flow that they go through, their daily habits, where they drive, what they eat. All this kind of information is being collected, or could be collected, and we need to get a sense of how to manage it, share it where it makes sense and where people are comfortable, and get their authorization for it to be shared, to make sure that we're able to use that data in ways that add value to people. So that's the obvious thing.
I think probably less obvious to people outside of the data science community is this notion of bias in the data sets, particularly if we're looking at machine learning algorithms, which seem predominant today in artificial intelligence. You've got the data set that is used to train models. So a data scientist picks various data sets that they want to train the models with. Those data sets may have been curated by them, or they may have been produced by a third party, in which case that third party may have had a particular view when they were creating the data set. And so the data scientist using it for the model training needs to understand that and have the same view about that data, or at least an understanding of it. So they're selecting the data sets to train the model, and that selection is based on their viewpoint again, and they're putting in the potential for bias in the resulting model that's produced by that analytics process or that machine learning process. And that bias resides in the model and may not be obvious to people who are using the model. And several companies that offer access to various algorithms via APIs and from a software-as-a-service perspective don't necessarily publish a lot of details about the data they used to train the model. So you have to go with their reputation: do you believe that they did a thorough job of training the model, and is that model applicable to the particular analyses of targeted data that you want to run against it? So then you're choosing, as the user of the model, your target data, and you want to make sure that that target data matches the model well. Are you applying the model to a set of data that you should be applying it to? With a mismatch, you could get misleading results. So you need to understand all of that.
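The point about matching target data to the model can be made concrete with a simple check run before a model is applied. The sketch below uses entirely hypothetical feature data; a real pipeline would use proper two-sample statistical tests, but the governance idea, recording and reviewing the comparison before trusting the model, is the same:

```python
import statistics

def drift_report(train_col, target_col, threshold=0.25):
    """Flag a numeric feature whose target distribution has drifted
    away from the distribution the model was trained on.

    Compares the shift in means, scaled by the training standard
    deviation (a crude effect-size check). Anything above the
    threshold deserves a human review before the model is trusted.
    """
    mu_train = statistics.fmean(train_col)
    mu_target = statistics.fmean(target_col)
    sd_train = statistics.stdev(train_col) or 1.0  # guard constant columns
    shift = abs(mu_target - mu_train) / sd_train
    return {"train_mean": mu_train, "target_mean": mu_target,
            "scaled_shift": shift, "drifted": shift > threshold}

# Hypothetical mismatch: a model trained on ages 20-40, then applied
# to a population aged 60-80, exactly the misleading-results scenario.
train_ages = [22, 25, 31, 36, 40, 28, 33]
target_ages = [61, 67, 72, 78, 65, 70]
report = drift_report(train_ages, target_ages)  # report["drifted"] is True
```

A report like this is also a natural artifact for a governance process: it documents, at the point of use, that someone checked whether the target data matches what the model was trained on.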
So I think provenance, if I want to use a keyword there, of the training sets, of the models themselves, of the target data; understanding that, having it be transparent, and making informed decisions about it is extremely critical as we move into smart data. And I think one of the problems is that the biases can be hidden. Sometimes they only come out in the results. And some of the biases may have been introduced completely unintentionally. Probably the most famous example of that is from last summer, when Google made available their image recognition algorithm. An African American couple, or friends, I guess they were, male and female, had a selfie they'd taken in a field somewhere, or a park or whatever. They put it out there and used Google's tool to analyze it, and Google tagged the photo as being a picture of gorillas: an extremely offensive, embarrassing incident that hit social media and that Google was immediately apologetic for. But it exposed bias that's in models, bias that I'm sure was unintentional on Google's part. And Google didn't necessarily acknowledge the details of what went horribly wrong in that case. But I think a lot of people who were analyzing it suspected at least that Google had trained the model with various settings, outdoors, indoors, various objects, animals, people, but within the people they trained it with, they may well have only used a small subset of white people. And so here you've got African American people in an outdoor setting, and the dark skin alone may have been enough in the pattern recognition to somehow or other match that to gorillas as opposed to humans with darker skin tone. So there are going to be a lot of cases like this unless people are very cognizant of the data sets, the algorithms, and the target data they're applying with those models. So I think, if I look at it... Oh, go ahead.
You have a question about that? How will the governance of the data help to prevent or expose some of those biases? Because that's one of the things I find most fascinating about this slide right here, when you talk about biases in data sets and how biases may be hidden. Will data governance, and the role that data governance plays within the organization, help to resolve some of these issues? Because people are looking at the results, but they don't necessarily understand the biases. Will it become governance's responsibility to make sure that, as these results are being made public, organizations are transparent about the biases that went into making these decisions? Will governance play a role in that? Well, I think you're famous for saying that accountability for the data is based on your relationship with the data. So if you're going to use one of these data models, I think the onus is on you to work with the vendor, or if you're doing it yourself with source data, to really do that analysis of the data sets: how well regarded are they? What were they trained with? Where have they been used previously? Are you safe in using them in a different area? Should you trial them in some case? So I think there's onus on both sides, on the producers of the APIs and the data sets and on the consumers of them. But ultimately, if you're the one using them and basing your system on them, I think it's up to you to know what's in those models and to ask questions of the providers of the tools or the models, to make sure that you're comfortable using them for the application you want to use them for. And I think that goes from... Go ahead. I was going to say, I think that's...
It really comes down to that question of what data to use where and for what, and so it's all about knowing how close it is to a fact versus derived data. We oftentimes have the raw facts that are captured with big data, and then we do analysis, statistics and trend analysis. There's that whole thing about how statistics don't lie, but the people doing the statistical modeling can lie, not really lie, but they can skew the statistics by the kinds of analyses and algorithms they're running against the data. So I think you need to be aware of what kinds of processes have been run and what sort of preparation of the data has happened. And then certainly, if it's a prediction that a system made on its own, you want to understand the probability associated with that prediction, and where that prediction should be used, based on how solid it is. And so I have this thing, particularly about machine-generated data: I personally want to have traceability back to both the supporting data that was used and the rationale. I think a great example of this is IBM Watson, which had something they called the physician's assistant product. I think it's still out there; it was one of the first Watson products. When it made a diagnosis, a physician could work back to which data led the system to that diagnosis and what rationale, what sort of pattern or logic, led it there, and then they could decide whether they were comfortable with that diagnosis or whether they needed to do more research. So it's a case where the smart system augments a human, and you've got that combination effect, the sanity check that a human working with that data provides. And so that's the kind of thing I like. I feel concerned when I see deep learning systems that are recurrent-neural-network based, where you've got these layers and layers of data that have been analyzed to create patterns. The patterns may not be very obvious.
The rationale may not be explicit, and so you have to ask: do you feel comfortable trusting that system? For certain uses that may be okay; for other uses, you may want the system to be smart enough to explain to you why it came to the conclusion it came to before you're willing to use that data. Yep. And I see, on the last slide, the last bullet: what data to use, where, and for what reason. The whole idea of augmenting machine-generated data by getting people involved before results are shown, and before biases might come out, I think that's a really important aspect of this. This is an extension beyond the traditional uses of data, to do things that for a long time we haven't really even been thinking about. But now there are people thinking about it. Data science is something that's real and that organizations are starting to embrace. Data scientist is a very popular title, a popular role that organizations are embracing as they move toward chief data officers and things like that. The ways that we use data, and the ways that we extend our uses of the data, require that we have that traditional aspect of making sure that the data is well defined, and that it's being produced in such a way that we can track back to where the data is coming from and how we're using it, making sure that we're not stepping on our own toes as we start to take some of this data and make it available. It's a fascinating world that we're living in, and the extended uses of data in organizations, and the different ways they're looking to use it, are certainly calling for governance around that data.
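The traceability described in the Watson example, working back from a prediction to its supporting data and rationale, suggests a simple data structure that a governance process could require of machine-generated results. This is only an illustrative sketch; the field names, threshold, and record identifiers are assumptions, not anything from IBM's product:

```python
from dataclasses import dataclass, field

@dataclass
class Prediction:
    """A machine-generated result that carries its own provenance:
    its confidence, the records that supported it, and the pattern
    or rule that produced it."""
    label: str
    confidence: float                                 # 0.0 .. 1.0
    supporting_records: list = field(default_factory=list)
    rationale: str = ""

    def trustworthy(self, min_confidence=0.8):
        """A governance gate: only surface predictions that are both
        confident and traceable to supporting data."""
        return self.confidence >= min_confidence and bool(self.supporting_records)

p = Prediction(label="diagnosis: condition-X",
               confidence=0.72,
               supporting_records=["lab-2041", "note-7713"],
               rationale="matched symptom pattern P14 from training set TS-9")
# p.trustworthy() is False: below the confidence threshold, so a human
# reviewer works back through supporting_records and rationale first.
```

The design choice here is the governance point: the prediction object refuses to stand alone, so the human sanity check always has something concrete to work back through.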
I want to spend a couple of minutes here talking about creating a framework around data governance, and then I'll check with Shannon in a minute to see if we have any questions from our participants. Typically, when I'm looking at putting a framework in place, and framework is a very popular term in a lot of industries, a lot of organizations are first focusing on developing best practices around governing their data. That now needs to extend into all uses of the data, beyond business intelligence and infrastructure data, into smart data and the smart data applications, because I think we're really just scratching the tip of the iceberg when it comes to how data can be used in organizations. So certainly, everything that I've talked about in the past in regards to putting best practices in place around the governance of data extends to smart data itself: certainly formalizing accountability for the people in the organization who are defining and producing that data, and making sure that roles are well defined. Again, so we don't step on our own toes or get in our own way of how to best utilize this type of data for the benefit of the organization. Certainly, we've got processes in place that we need to govern, and I always talk about the need to apply governance to the processes. So all the processes that you talked about in regards to the collecting of the data, the understanding of the data, and the metadata associated with that data, we need to apply governance in the same way there, and we certainly need to execute and enforce authority, because if we don't do that, the management of smart data can certainly become unruly in organizations.
Would you agree that at least these are some of the items we need to be considering, and maybe add other items we should be considering, when we're talking about putting a data governance framework in place to support smart data initiatives within organizations? I think that's definitely a very good list. As I said early on, I think a lot of people in the data science community may be more familiar with algorithms and statistics and that aspect, and not necessarily with the management of data in more traditional enterprise settings, so I think it's an opportunity for those two camps to work together. I think most people want to do the right thing here, and I'm an optimist. I want to have access to all of the smart data for smart applications; I think there's a lot of benefit that can come from smart applications. But I do think we have to be smart about the data itself and how we're using it, making sure that we're managing and respecting people's privacy and security, and that we know what data we have, what it can be used for, and what we're doing with it throughout its life as well. So I think all the traditional things of data governance apply in this context, and if anything, we're just expanding the set of things that now need to be included. What I would take from that is that we really need to be smart about putting data governance into place, especially for these new uses and these new applications of data in our organizations. We've got a couple of minutes left, so I'm going to turn it over to Shannon here in a second. I really appreciate your taking the time, Tony, to join us and to share information about smart data for people that may not have as much of a background in smart data as we wish, as we're focusing on traditional manners of data governance in organizations. You provided us with the definition of smart data and why it's necessary to put governance in place around it.
The last thing that I want to mention is just to remind people about the webinar for next month, on do-it-yourself and purchased data governance tools. With that, I would like to turn it over to Shannon. Shannon, do we have any questions for Tony about the information we talked about today? Absolutely, and thank you, Bob and Tony, for this great presentation. Of course, one of the most common questions we always get is people inquiring about the slides and the recording. Just a reminder, I will send a follow-up email within two business days, so by end of day Monday for this webinar, with links to the slides, the recording, and anything else requested throughout the webinar. And let's get into the questions. I'll let you decide who wants to answer this first. Do you think data governance should extend to checking for bias, or is there a need for governance around the analysis and modeling process? Tony, grab that one first. I was going to say, since I raised the issue, I guess, I think it's both. The earlier you can introduce data governance in the process, the better off you are. So as people are creating data sets that they believe will be candidates for use in machine learning algorithms, and certainly there are cases where people pick up existing data sets that weren't intended for that use, then at the point when you're using something differently, you need to understand the ramifications of that. The earlier, the better. Certainly, if we do see biases in models, I think there's a feedback loop we can introduce that says this model is good for this purpose but not for that purpose, or be cautious when you're using it outside of the limitation it was originally set for.
That can be things like the Google example that I mentioned, or things that are more benign. If you produced a model that analyzes objects based on their shape, because you want to inventory them efficiently in a warehouse somewhere, that's different from analyzing objects based on their function. By shape, you might put a mattress and a table on the same inventory shelf; but if you're looking at use in something like a purchasing or recommendation system, one that tells people something they might want to buy for their home, then function becomes important, so you want to use data sets that are more about the use of the object as opposed to its shape. So you really need to know what information you're working with and what you intend to use it for, and make sure that you're matching things properly. When we talk about a framework for data governance associated with smart data, certainly the roles and responsibilities, certainly all that I just went through, are things that we need to put in place; the roles and responsibilities become very important. I would say that I always focus on the definition, the production, and the usage of data. When we talk about making sure that people are being made aware of biases that are being injected into the results shared with people in their organization, there needs to be governance specifically around the usage of that data. Certainly the definition and the production of that data are very important, but we also need to put governance in place around how smart data is being used, so that we use it smartly, we take care of protecting the data that needs to be protected, and we use data in the most appropriate way in the organization. That's a great question. Thank you for asking that. We've got lots of good questions coming in.
Can you repeat the name of the technique that data scientists use to keep data sources from being identifiable, something like non-referential or non-identifiable? It's differential privacy. There's a good paper, extremely technical but very interesting, by Zachary Chase Lipton at the University of California, San Diego, on that subject, if you're into deep mathematics and understanding that sort of thing. But it's also accessible even to a layperson, in terms of just getting the sense that they are taking an algorithmic approach to introducing that noise, anonymizing data in ways that, as I said, preserve the real, valid statistics in the data but make it much more difficult for people to reverse engineer their way to individualized data. All right. And we've got another question coming in. Well, actually, we are right at the top of the hour, so we're out of time. But one of the great things about the follow-up email is that we'll include answers to any questions we didn't get the chance to get to. So go ahead and keep submitting those questions, and we'll make sure to get answers to you. Again, just a reminder, I will send a follow-up email within two business days, so by end of day Monday, with the slides and the recording of this session, along with anything else requested. As always, it's such a pleasure, and thanks to all of our attendees, who are so engaged in everything we do. We really appreciate the involvement, the questions that come in, and your attendance. So I hope everyone has a great day. Thank you very much. Thank you, Tony. Thank you, everybody. Look forward to seeing you again. Thank you. Bye-bye.