Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager for DATAVERSITY. We'd like to thank you for joining the latest in the monthly webinar series, Data Architecture Strategies with Donna Burbank. Today, Donna will talk about data catalogs: architecting for collaboration and self-service.

Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. We very much encourage you to chat with us and with each other throughout the webinar. To do so, just click the chat icon in the bottom middle of your screen to activate that feature. For questions, we'll be collecting them via the Q&A, or, if you'd like, we encourage you to share highlights or questions via Twitter using the hashtag #DAStrategies. If you'd like to continue the networking and conversation after the webinar, and to learn more about Donna, just go to community.dataversity.net. As always, we will send a follow-up email within two business days containing links to the recording of the session and additional information requested throughout the webinar.

Now, let me introduce to you our speaker for today, Donna Burbank. Donna is a recognized industry expert in information management with over 20 years of experience in helping organizations enrich their business opportunities through data and information. She currently is the managing director of Global Data Strategy Limited, where she assists organizations around the globe in driving value from their data. She has worked with dozens of Fortune 500 companies worldwide in the Americas, Europe, Asia, and Africa, and speaks regularly at industry conferences. And with that, let me give the floor to Donna to get today's webinar started. Hello and welcome.

Hello, thank you. It's always a pleasure to see some familiar names on the call. Thanks to those who join fairly regularly.
So as Shannon mentioned, today's topic is data catalogs: what they are, how they're used, and particularly architecting for collaboration and self-service in this new world. Again, thanks to those who do join regularly. Those who do know that all of the previous webinars are on demand on DATAVERSITY. I'm pretty sure they keep them indefinitely, and it's a great resource; not only my webinars, but a lot of other topics are out there as well. So it's a good resource for you to have afterward. And I know one of the questions is often, is this available? So yes, the slides and the recording itself will be available after the call.

So as I mentioned, today we're talking about data catalogs, what they are, and how that fits specifically into some ideas of self-service, because I know that data is hot, and more and more different types of roles looking at data are also looking for this idea of data catalogs. We'll talk a little more about that.

So since we are data management professionals, I like to start with definitions, or the metadata around what a data catalog is, right? I took this definition from Gartner; I thought it was a good one. A data catalog maintains an inventory of data assets through the discovery, description and organization of these distributed data sets across the organization. Really, the goal is to help you find and understand these data sets and extract business value from them. So it can seem like a technical thing, and there are tools out there in the market that are very technical, but at its core, I think we all understand that. If you think of a library, we all used, well, not all of us, the old people on the call, and I'm counting myself, actually used these card catalogs in libraries, right? If I want to find a book, I need some sort of catalog or index to find that information. So data cataloging, in some sense, has been around for centuries. We've always had to find information, and we'll talk a bit about that.
What's different about a data catalog compared to some of the older technologies that have been around longer, like the card catalog? I think the general concept is pretty straightforward. I have data, I need to find it. Where do you find things? You find things in a catalog. The other piece of the Gartner definition that I like, and that we'll talk about on this webinar, is that the technology has evolved. Yes, we all used card catalogs, but we use them less and less because you have things like Google, right? So how can we use things like modern machine learning algorithms and really get the best of both worlds? Take the organization side, you know, the library. I love librarians; they really were pioneers in data management, in how they organized, you know, the Dewey Decimal System, and how they organized vast amounts of information. So it's kind of riding on their shoulders and using some of that same idea of a data catalog, but also using the newer technologies and not relying on humans. If you ever had to scan through these manual cards, you know, that took a long time. So how can we get the best of both worlds?

So we want to talk about terminology, but if you've heard me speak before, you know I have little patience for getting too academic and getting ourselves tied in knots. Because of the nature of our business, we focus on definitions, which is great, but at some point we can get ourselves turned around and talk too much about, you know, meta and definitions. So the question I often get is, is it a data catalog or a metadata catalog? And, you know, we come back to meta levels, and I start to feel that we're in an Escher diagram, right? We're sort of hands drawing hands, or we're getting into little layers of complexity. Are we getting too academic about that? So metadata is, you know, I think, a classic.
There was a webinar a few months ago, if you're interested in more on what metadata is, but I showed you there that data about the data, the who, what, where, why, and how, the context around the data, is a good way to think of it. So yes, we're looking for data through a catalog, and metadata helps you find that information, and some data catalogs actually have some data in them. More likely, you're pointing to where that data is. So I'm not going to go much deeper on this, or I'll start to do what I just said I wouldn't do, getting too turned around in the weeds. But I wanted to call that out, because I've had those types of arguments, you know, in discussions with metadata professionals. There's some overlap, and we'll talk about that, but I wouldn't get too crazy about it. You're looking for data through a data catalog, which could be a metadata catalog, and there I go again. I'm going to stop. Right, there you go.

I do think a helpful discussion might be: what's a data catalog and what's a metadata repository? Someone asked me that the other day, and I was sort of snarky and said a data catalog is a sexy name for a metadata repository. We've had that forever. But there are some real distinctions, and I think it's worth talking about them. I've had several clients that have looked at these different technologies and have picked the wrong one based on some of these differences, so I think it's worth talking through. The blue line at the bottom is sort of the obvious statement that I wanted to call out: this is a spectrum, and tools and vendors have overlap in functionality. A metadata repository might have some data cataloging, some data catalogs might have some metadata, some master data management tools might have some. I don't want to get too into specific tools, and don't ask me in the questions, because I never like to talk about specific tools, but I will talk about functionality. I do think at a general level you should think of a continuum of what you're looking for.
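As a rough illustration of the earlier point, that a catalog mostly holds metadata and points to where the data lives, here is a minimal Python sketch. The `CatalogEntry` class, the `search` function, and the sample entries and locations are all hypothetical, made up for this example; they are not taken from any particular catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one data asset; the data itself stays in its source system."""
    name: str          # what the asset is called
    description: str   # what it means in business terms
    owner: str         # who to ask about it
    location: str      # where the data lives (a pointer, not a copy)
    tags: list[str] = field(default_factory=list)


def search(entries: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Return entries whose name, description, or tags mention the term."""
    needle = term.lower()
    return [
        e for e in entries
        if needle in e.name.lower()
        or needle in e.description.lower()
        or any(needle in t.lower() for t in e.tags)
    ]


# Hypothetical entries, invented purely for illustration.
catalog = [
    CatalogEntry("customer", "Individuals who have purchased a product",
                 "sales data steward", "warehouse.sales.customer", ["crm"]),
    CatalogEntry("product_orders", "Order lines by product",
                 "retail data steward", "warehouse.sales.orders", ["retail"]),
]

for hit in search(catalog, "customer"):
    print(hit.name, "->", hit.location)
```

The thing to notice is that `location` is only a pointer: the entry describes the customer table without holding any customer rows.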
You may have heard me use this analogy before, but think of, on the left, almost the encyclopedia approach. A metadata repository tends to be more strict, more business-rule driven. Think of the old days with an encyclopedia: someone vetted this information, we published it out, we truly understand the lineage of this information. Whereas Wikipedia is more of a crowdsourcing approach. I was a skeptic at the beginning with Wikipedia: yeah, that's going to work, just a bunch of people adding stuff in. We kind of have academics for a reason, and that's why we have encyclopedias. But I find I go to Wikipedia all the time. So you need to understand the use case: do I need eventual consistency of a definition, or the vetted approach? I do think that's something to think about.

So again, as a generalization, if we just take the category of a metadata repository, they have been around for a while. I think there's some overlap. Both metadata repositories and data catalogs have some sort of automated metadata discovery, right? If you're doing this by hand, I wouldn't say you have either a data catalog or a metadata repository; you're back to the card catalog. Both probably have some sort of search capability to find things. I will say that a lot of these new tools that sell themselves as data catalogs tend to be much more intuitive, and we'll talk more about that; think more of a Google search approach than looking through an encyclopedia. But beauty is in the eye of the beholder. What is intuitive to one person may not be intuitive to another. Metadata repositories often, again a generalization, can be more on the technical side. With that, though, you get some robust capabilities like data lineage. So think of the classic: I have a KPI on a dashboard and I want to see the source-to-target mappings, the lineage of how that was defined. What are the business rules, what are the ETL rules, what are the
security lineage of PII, as in where is that found. Impact analysis of change: I have a field and I'm going to change that field, what else is impacted, right?

So the other big difference is more in that Wikipedia versus encyclopedia sense. On the metadata side, you may really want to enforce a standard, right? And there's a bit of a difference here. I have, I don't know, a last name field that needs to be character 50, and it must be character 50. We're going to have the definition of a customer as X, and these fields are PII, and we want to cascade that through the systems and enforce my software development on that. I want to have standards and rules alignment. That tends to be more on the metadata repository side.

On the data catalog side, again, the title of this was more around collaboration and user self-service. A lot of these data catalogs are more of a lighter touch on purpose, because the purpose is more collaboration, light-touch enforcement. Think of a catalog of clothing, right? I'm just finding the clothing. I'm not enforcing the sizes of the clothing or the prices of the clothing. I'm just trying to find a pair of shoes online. That's more of the data catalog approach. And again, it's not that no metadata repositories have these, but that idea of collaboration and like buttons and user ranking tends to be more on the data catalog side. Some of these data catalog tools actually sell themselves as sort of analytic team collaboration tools. You don't hear the words metadata repository as much, partly because it's not as sexy and partly because it is a different thing. I think it's good when the vendors are clear: is it a data catalog, where I'm just trying to get an inventory of my information, or am I really trying to do more about data lineage and enforcement? And again, neither is right or wrong. You don't even have to have one tool over the other; maybe some
tools have both of these functionalities that you're looking for, and you can get both. Just as an example, I had one client that chose one of these data catalogs because the user interface was really nice, and I agree with them, it was easy to use. The good side is they got all of their analysts bought in to using it. Anyone here who's been trying to get a metadata initiative up can realize that can be a cultural challenge. So having the easy-to-use user interface, getting information out there, the collaboration, was great. But then they were trying to use it for really strict governance: PII enforcement, rules enforcement, data lineage. And the tool broke down. They'd already bought the tool, and the problem is they'd gotten the buy-in, and then it couldn't do what they needed. So that was one risk.

The other risk, and especially for the technical folks on this call, and I'm one of them: it can be really sexy to see full data lineage and impact analysis, if that's what you're doing on the tech team. But maybe your business users want more of just a glossary approach or an easy-to-use user interface, and some of these metadata repository tools are powerful but not as intuitive. You shouldn't have to choose between one or the other, but the reality in the market is that that can often be the case. So as you're looking for a tool, think of your audience, think of which one you want: is it a catalog and collaboration, or is it a metadata repository and rules and enforcement? And then realize that it's a spectrum between all of these.

So analogies are great with this. We talk about data catalogs, and whenever you put data in front, it seems maybe complicated and hard to understand, especially for non-technical users. But we use catalogs all the time, and just like the card catalog, I mean, there have been Sears Roebuck catalogs mailed to people's houses, right? That idea of a catalog isn't complicated. So we use product catalogs all the time. I'm a big skier from
Colorado and it's almost ski season, so I thought this would be a fun analogy. I'm a telemark skier and I was looking for some telemark ski boots. So if we think of what is involved in a product catalog, there are some analogies with data that I'll go into a little later. Why do I go to, this was Amazon, probably pretty obvious, I think that's fair to say? I want to just find my ski boots. So the ease of use, that I can just go to a box and type telemark ski boots and get all of the major vendors and the current list of ski boots that I could purchase right now, is huge. That's why it's one of the leaders in the market. It's easy to get that. I can also, though, look into the product details. If I want to see what kind of plastic it's made out of, and who the vendor was, and the price, and the size, and all of that, I can drill down into those details. But it's really easy to find. Once I do find it, I can obtain it or purchase it, and that's one of the beauties of these product catalogs: I see it, I buy it. Maybe some of you on the call kind of wish it weren't so easy, because your credit card bills would be lower, but that is one of the beauties of this. It makes complex information really simple. I type in ski boots and I get everything back.

There is a bit of data management, though, in any of these product catalogs. You'll see that it's organized by subject area. Is this part of outdoor goods, is it part of recreational goods, is it part of women's fashion? Well, in Colorado it could be baby goods; people start pretty early with kids skiing. But how you group things is a nice feature of the product catalog: I'm looking for women's outdoor wear for summer. Some of the other things that we're all familiar with are things like people who bought this also bought this, or this idea of viewing related items. So clearly someone knows that if I want ski boots, I probably need a strap to carry these ski boots, right? So can I buy that as well? The other nice thing we're all familiar
with, from any social media, is this idea of collaboration, and similarly in the data catalog world. One part of that is relevance and ranking: who liked this, is this relevant to me, can I collaborate with other users? If I were to drill down into this, I could see reviews. Maybe it got, you know, a one-star ranking from people because they said, you know, the size on the package is too tight, order a size higher. Super helpful, because what the product detail says may not be the same as when you actually use the product and have that kind of contextual information. So maybe that was all obvious, and you might be wondering why I explained an Amazon search page, but hopefully it'll be relevant as we go ahead.

One of the things to consider: this was the flashy front end of a product catalog. Those of you who are in retail, or doing any data design for retail, or anyone who does any studies of these, realize there's a lot of work to get this data right. So for example, this ski boot: the fact that you have the right SKU, the right product name, the right price and description, that's master data. There was a whole lot of work to make sure that I got the right brand of the ski boot, that two brands didn't show up, that it had the right product description. There might be a document management system to get the pictures right, and/or a product information management (PIM) system, right? A lot of information there. There's operational data: once I purchase this, there's going to be a purchase history of what I bought. There's reference data to say what are the common departments, reasons, brands. Who created these categories that said it was baby goods versus outdoor goods, right? And there's the whole semantic layer: who created the data model of all of this? Is it brand, is it price? What are the fields we even show? What's the organizational structure of this? How do we do either data science or NoSQL or, you know, recommendation engines? There's a lot of stuff behind this, right? So to bring this back
to the data world, my analogy is: this is the data catalog. It's a fake one, I made it up; this is not a particular tool. But you should see some similarities in the different areas. So you have a search, almost like the Google search. I just type in customer, pretty simple. I get back all of the table information around customer. I get the definitions, all of the columns and data types. As I said, this isn't a specific tool, but it's fairly common information that you would see in any data catalog, right? And to my earlier point, this is metadata about the data, right? But you can also see what the other related information is. There are certain views that people have written, maybe a view to see just the customer demographics or the customer address.

What a lot of these data catalogs do now, since we've all kind of embedded Amazon-type catalogs in our brains, is these rankings of usability or likeability. It's just like that example when I bought the ski boots and they seemed great in the picture, but I bought them and the sizes run small. Maybe that was helpful for me to know, or maybe people said, I bought it and it broke two months later, right? So yes, maybe there are two queries about customer demographics, but this one's really helpful, this is the one I really use in the field, right? Or this one, not so much; you can see that people aren't really using it. So getting that idea of collaboration around data can be super helpful, and that's what a lot of these data catalogs can offer.

You can also see some of the related dashboards themselves. A lot of these tools can link things: this is a table, what dashboards use this table? We can see there's a customer demographics query, there's a top customers by region. In many of these tools you can just click on that dashboard and bring it up in Tableau or Qlik or whatever. You'll also see, just like with the product catalog, you can group it by different subject areas or
business areas, and that's generally something you can customize within the tool. You want to make sure that the tool you pick can be easily customized that way. And then, to the earlier point of is it data or metadata: generally there are data assets or trusted data sets. You can ask, what are all of the customer tables that I can access, or tables related to customer, or views about customer, or dashboards written for customer? That's some of the metadata that links you to these trusted data sets.

So this idea of a trusted data set, and we'll talk more about governance, is a very important part of the catalog, because, as you know, garbage in, garbage out, right? So be careful when you're designing your data catalog. Is this truly a collaboration source, where I have my data scientists writing views and saying, hey, this sort of worked, great idea, why don't you try this query? Oh, it didn't work for me, I'm going to give it a two-star rating. That can be a very helpful use case for your analysts, and I've had several customers use them in this way. Productivity went up, reuse went up, because people are actually talking with other people writing these queries and getting some buy-in. But just think through that carefully. Or, and/or, are there certain trusted data sets where you want to be very specific and say, no, no, no, this is the customer demographics query, this has been vetted, there is no other query that's going to be valid when we're talking about customer segmentation. This is the approved definition and this is the query we'll be using. Both are valid. Just make sure that in the tool you pick, or the implementation of your tool, you've given that thought, so that you don't create more chaos and so that the collaboration is actually adding value and not noise. That should be obvious.

The other nice thing a lot of these tools have, whether it's a metadata repository or a data catalog, is this idea of a discussion forum. Can people give feedback? So some of it, you know, and we've all probably done this
who have been in the business for a while, goes like this: we just spent months creating this perfect description of what a customer is. It's de-duplicated individuals who've purchased one or more products within the past 18 months. And then someone who's actually using the data, Joe, says, but what about lapsed customers, where do I find that? Great, that's excellent feedback. Maybe Joe isn't on the governance committee, maybe we don't even know who Joe is, but he's clearly doing some work in the field that we wouldn't have gotten before. That's a great usage of this collaboration, and getting that mix right is excellent, because what you don't want to do is create this encyclopedia that's old and dry and dusty and out of date. So having this idea of the best of both worlds and adding some of that Wikipedia-type collaboration can be great. It could just be, in some of these data catalogs that are more on the analyst productivity side of things: hey, anyone have a good query for net promoter score, or NPS? People can publish it out there. It almost becomes more of a GitHub type of thing, where you're actually sharing some of these data-related assets. Here, use my query, I've tried this out. Huge productivity gain.

So hopefully that makes sense with that analogy. I've kind of spelled it out on the slide, but these are some of the same things that you might find on Amazon, right? I can easily search and discover what I want. You know, I'm not looking at ski boots here, but show me everything about customer: here's all the tables, the columns, the views about customer. I can look at the detailed product specifications, right? What are the columns, what are the data types? And this can be huge. I mean, again, think of your maturity as an organization and how strict you need to be. Maybe you don't need a full metadata repository. Maybe just getting the definitions for the core customer tables that people need can be huge, just even publishing that out. We've all
worked in organizations where many people are not malicious about going against standards, but people are busy, right? So if you can just say, here's the vetted description of customer data, and here are the columns, and here's where you can get it, people will be more than happy to increase their productivity and use it. So often just publicizing it can do a lot for consistency and data quality.

That said, there's this idea of ranking and who's using it, and this can work several ways. It can be sort of the like button: yes, I found this helpful. Some other tools will actually do usage ranking: how many people have actually downloaded and used this? So you can see people speaking with their actions. There's this kind of you like this, you may also like this type of functionality: you searched for customer, you may want to see the dashboards related to customer, the views related to customer, etc.

Again, there's this idea of organizing by business area. Just like with the ski boots, is it part of health and beauty or is it part of outdoor goods? You define that yourself. This is a whole part of implementing your data catalog that most products support, but you need to define it in the most intuitive way. Is this by organizational area, marketing, sales, development, as in this case, or is it by customer, product, more domain areas? You can define that, but just give that some thought before you roll it out, because how you define these subject areas can really help with usability. What data assets do we want to have in our catalog? Is it the actual tables themselves, is it the views on those tables, is it the dashboards? And then again, the collaboration is a huge part of the value of these: can we actually see what people are using in the field?

So contrast that with the types of features that are in a metadata repository. The metadata repository obviously has some similar features. You're going to have the web portal, you can generally do some sort of search and reporting. I
think some of the difference, again, it being a continuum, is this idea of some of the matching and the reuse logic, some of the automated interfaces to other tools, and having more of a robust meta model, so that you can understand some of the rules and create them, so that you are a little bit more prescriptive: this is the definition of customer, and these are the five fields, and I've locked it down, and I've published out the business rules.

One of the benefits of doing that is that a lot of these more metadata repository type tools can do some sort of impact analysis or data lineage, generally automated through scanners or whatever they call their interfaces. So maybe the classic is, you know, I have a report, total sales by region; how was that calculated? What a lot of these tools can do, assuming you use SQL and industry standards, is show: here's the field, it was in the warehouse, it went through the staging area, this is the ETL that was done on it, these are the source systems. You can really see a nice view of that. It doesn't have to be your classic data warehouse. It could be, I had it in a file, I moved it up to AWS in the cloud, and then we moved it on, depending on your use case.

Another important thing to remember, where I've seen some customers get into trouble, is they like the tool, they like the UI. But really do a robust inventory ahead of time of what your environments are. Maybe it works great for this data warehousing use case, but maybe that's not your use case. Maybe it is S3 buckets on AWS; does the tool show that kind of lineage? You're using, you know, Google Sheets; can it help you there? That's something to really consider.

The other thing a lot of these tools can do, especially when you have some of this impact analysis, is answer: what's the effect of a change? I have a field, you know, first name, character 20. I'm going to change that to character 25. Before I do that, what am I going to break? And yes, I've still seen that happen, even this year. I
can't name the client, but that still happens. Did anyone think that through, that we changed a product code length and we broke the system, or there's downstream impact? So having these tools to be able to show that up front and do that analysis before you make the change can be really helpful.

The other piece of these metadata repositories or data catalogs, and a lot of the tools of both kinds have this now, is this idea of machine learning or metadata discovery. I cringe at saying AI, but they'll call it AI. It's this automation. Any of you who have been in the field know, well, even now a lot of people are doing this through spreadsheets. So I want to see everywhere that social security number is used in the systems. Do I have some poor data analyst going through and manually building a spreadsheet of everywhere that social security number is used? It's probably not the best use of that person. So what a lot of these machine learning tools can do is pattern matching: I know that in the US a social security number is three numbers, then two numbers, then four more numbers. So it can start to see those patterns and self-discover, so you can say, these are all the places where social security number is used. A lot of the security tools can do this as well, and there's some overlap there too, between metadata and some of those security type tools. But this is a great way to get started and to do a lot of that menial work that you can self-discover.

Although, think back: sometimes you want to do specific mapping rules. In some cases you want to just have it say, show me everything that looks like an email address, and I can classify that. Sometimes you definitely do want your own matching rules, and that I would classify more into the metadata world. I want to say, when it says SSN_NUM, that is going to match to social security number. Or, maybe a business example, I want to say anything that is client
should be mapped to customer, because that's a business rule. You know, AI can't necessarily do that. It can probably say that this looks like a person because it has a name, but there are certain business rules that you might want to code and match for. Just make sure that the tool you purchase can do that, if that's a need for you. I think maybe the analogy is: everyone loves Google search, because I can quickly search for ski boots, right? But sometimes I want to say, no, I only want blue ski boots in women's size 10 from this one vendor, more of that SQL-based search. Sometimes it's harder to do that if the tool doesn't support it. So make sure that you understand your use case on this matching. Yes, automated machine learning is great, it saves a lot of time and can find things that you may not have discovered. But also, if you do want your predefined rules, either name matching rules or pattern matching rules, make sure there's some customization there as well.

So, the use case around this. I did want to make some distinctions about what is which, you know, is it metadata, is it data catalog. But again, I like to look holistically, because if this lady in the middle is your typical self-service business analyst, generally that's a person who wants to see a lot of things. I want to know, are there glossaries, are there data models, are there business rules around this? What is the definition of a customer versus a client? That can be a glossary, that can be a data model. And more on the data catalog side: if there is a master data set, I would love to use that. Tell me, what is the vetted list of valid customers, or valid product codes, or valid state codes? Is that in the warehouse, is it in the master data, is it in the reference data? I think often there seems to be this false dichotomy, where in the self-service, quick analytics, data science world, we supposedly don't need things like master data. I would think most data scientists I've worked with would say: if you can just have a clean list of country codes, could
I just use it? So I don't have my PhD or master's in statistics and find myself spending time cleaning up a list of country codes. Could we get it in one place? So I think that's a nice balance to remember, that there is that encyclopedia and Wikipedia. Make sure to get it in the right spot. So yes, if it's our product SKUs and our product codes and names, it's probably a good thing to standardize them and publish them out in the data catalog as a trusted, vetted list, and be very clear about that. But there is also the self-service approach, where maybe that's more of your crowdsourced data cataloging: I'm just spitballing here, I'm just trying some queries, I've done some analysis, hey, I think that's cool, you might want to try it too. That's where you want to give a little thought in your tools to being very clear: this is a trusted data set, and this is something we are just testing out. Just be very clear in your definitions. I think the average self-service user probably wants a combination of all of that, as long as we're very clear.

So I would be remiss if I did not mention data governance, because there are great tools around data cataloging, and picking the right one obviously can be very important, but none of that is going to be good without good data in it, and good data requires good data governance. So give this some thought before you publish. Are you going to create a monster? Who holds the data stewardship roles for these curated data sets? I didn't show that in my example because it was just a fictitious one, but often there's a place for it: for that table, who's the data steward or data owner? Who's responsible if I think this field has a wrong definition or wrong data type? If I have a question about it, who can I go to?

A lot of my customers who have implemented data catalogs say, ironically, we're using technology, but the biggest value was getting people together. One of my favorite stories was two analysts at the same massive oil
company, a multinational with thousands of employees. They used one of these data catalogs and found out they were both doing some queries on some location data. Through this data catalog they started collaborating. It was one of those, oh great, let's talk, where do you sit, what time zone are you in? They realized they were on the same floor. After all their years at the company, they didn't realize that they were several cubes down from each other, and they'd never collaborated on this stuff. So that was a case where technology really helped; the value of it was, yes, the data they saw, but more the connection between people. So I would be very conscious of how we can have this sort of "the buck stops here" owner when there's a problem, but also how we have these collaborative feedback loops, so that anyone who's working on this data has ways to work together. If we're going to have published data sets on this catalog, how do we say that's published? What is the governance rule to say this is a standard, approved data set, or this is a standard, approved report, this is the analytical model we're using, versus here's something cool you might want to play around with?
And then that ties into how you publish and distribute these different data sets. Again, not to rehash my own diagram of is it data or metadata, but some of these tools can very easily store things like your reference data: here is your list of country codes, download it. But I wouldn't think of your data catalog as the place to store all of your data. I would think a massive organization is going to have things like warehouses and MDM hubs, better places to store the actual data. Don't think of it as moving it all there. It's more like a catalog: my product doesn't live on Amazon, right? I find my product on Amazon. That said, there are some products that live on Amazon: I can download an ebook, so it sort of lives there, but in
general it's a pointer to something else. That gets you thinking about life cycle and workflow. Maybe I have my trusted data sets, and then somebody, as in the example that I had, says, that's a great definition of customer, but I think you missed a piece. How does that piece of feedback then go to the right data governance committee, get vetted, get approved, and get fed back into maybe a change in the definition? How does a discovery data set that your data scientists found then get promoted to a trusted data set: this is a new piece of data we found, I'd like to share it? And all of the pieces in between. Again, the tools can be slick, but the tools only work if the data is slick, and the data is only slick if you have great governance behind it and the right roles.
I've shared this slide before, but I think it can be really helpful in getting some context around what goes where. It can also help, when you're deciding whether you want the stricter, lineage-and-metadata, rules-driven model or the more collaborative, analytic, discovery-and-sharing model, to think that there are different types of data and different types of governance. If I'm up at my master data and reference data level, this is the list of all of our customers or all of our vendors, all of our patients or our doctors, right? You want to make sure that's right. If this is your list of physicians who have been validated to do surgery at a certain hospital, I would certainly want that to be vetted really carefully through master data. And yes, maybe that list is out on your data catalog, as in where I can find the master list of physicians, but it should only be published out there when it's vetted and very carefully approved, with a defined data steward and a process for promotion. Similarly for something like core enterprise data. I might put your data warehouse in that category, and some of the financial metrics you're publishing. If we are going to say this is the
definition of total sales, and that is being fed back to the Street, I would think that's a vetted KPI that everybody would want to have approved before you put it out in your catalog. But then there are some grayer areas, or light blue areas I guess, more literally. There's some data that's still structured and should still be vetted, but maybe each department has its own operational reporting that doesn't have to go up to the whole steering committee before you publish it to the catalog. Maybe they can just say, this is our own internal data, I'd like to share it, but it's been semi-vetted; again, use your own terminology there. And then one of the benefits of these catalogs is this idea of exploratory data and sharing and ad hoc: hey, I tried this query, what do you think? Does anyone have a good model for XYZ? That's where you get some of the discovery. So I want to be clear on where some of these tools can help: some of it's process, some of it's tool, some of it's publishing. I have reference data; I'm sure analysts want to see that, everyone across the company wants to see that. But maybe something was found that started as exploratory data, and then maybe we want to store that in our master data or in our warehouse. There should be some sort of promotion, sorry about that, up to being a trusted data set. How do you get to be a trusted data set? Be really clear about where you have that sort of verified button, and make sure you understand the process behind it.
So, in summary, and I know there are a lot of questions, so I wanted to give some time for that: data catalogs are great. They're growing in popularity, partly because of this rise in the self-service user. It's a really intuitive way to get access to all of these enterprise data sets. The idea of collaboration and feedback, and the intuitive nature of it with things like "like" buttons and discussion threads, can be great. But just be careful: don't go too much into
the flash and forget things like your data governance; maintain that, and understand your use case before choosing the tool. If I were to give one piece of advice as we leave this call and you're looking to pick a tool for your catalog, it's to give that some thought: are you going for the rigorous, standards-based approach, or the looser, collaborative approach, or somewhere in the middle, and do I want the best of both worlds? Because I will guarantee you these will be popular once you put them out there. So make sure that you've got the right data and the right process behind it before you go too far.
Our obligatory plug: join us next month. We'll be going more into data modeling, both from the business side and the technical. And yes, some of the data modeling tools can almost be seen as a data catalog, and some of these tools have front ends that could serve as such. I didn't cover that in this session, but on that idea of overlap between tools, a lot of data modeling tools are saying, well, there's a lot of metadata already in the data model, can I publish it out? So if you're interested in data modeling, that'll be next month. And again, anything in the past has been on demand and will be on demand. We do this for a living, if you need any help. And I will now open it up to Shannon, if you want to ask some questions.
Thank you, Donna, for another fantastic presentation, great as always. We've got a lot of questions coming in, and just to answer the most commonly asked question, I will be sending a follow-up email to all registrants by end of day Monday with links to the slides and links to the recording of this presentation, along with anything else requested throughout. Now, Donna, you went into this a little bit, but what's your opinion on the difference between a data catalog and a data dictionary, and do you see a data dictionary having multiple distinctly different uses?
Good question, and I probably could have, yep, there's a lot of
different flavors of governance and dictionary, so which one do I want to go back to? We kind of talked about data catalog and metadata repository, but you're right, and I want to find one of my slides. I would say a data dictionary, and everyone has different definitions, is more on the technical side. On the slide that I showed, where it basically just had your tables and columns and what those tables and columns mean, to me that's a data dictionary: I want to understand the fields and columns in my database, if we're using databases, what the structure is and what they mean. To me that's almost your classic data dictionary. I think these data catalogs have an aspect of that, but they go a lot further. Sorry, I'll mark up here, but if I have this section here, this table, column, data type, description, and here I'm going to make it even uglier, this to me is sort of the data dictionary portion of that. I think the catalogs are kind of a superset of that. Probably a follow-on question, and I saw one of the questions come through about what about business metadata: I think a glossary is another piece of this, a business glossary. Here I used the table example, what do we mean by customer, but this could very well be a glossary of all our terms: what do we mean by customer, what do we mean by client? All of that is another subset of these tools.
And is there experience with combined metadata repositories and data catalogs, if available?
Yes, so that gets back to the slide that I just moved off of. I think most of these tools are on a continuum. A lot of the more traditional metadata repositories are adding more of this catalog-type functionality, and I think some of the catalogs are adding the more robust things like data lineage. So definitely it's a spectrum, as I mentioned on the bottom, but just give that some thought. Again, these may be some false distinctions, but I thought it was a helpful way to just kind of think of
the end of the spectrum you're on. If you're really going into the technical lineage, there are some good tools that do that, and if you really want to go into the user front end and collaboration, there are some excellent tools for that, and then there are some that do both sort of well. Just think about your use case. So yeah, definitely there's overlap between these.
And Donna, will a catalog usually contain the permitted values within a data element? For example, if gender contains M, F, X, D, are they listed with definitions?
They certainly can be. And that, again, is where I start to feel like, is this a metadata catalog, a data catalog, a reference data set, your master data set, valid values? But yes, I think most of these tools have some of that functionality. I would just be careful about when it's more reference data. M/F is sort of a valid-values list, and there are a lot of ways you can do that. Once you're getting into maybe country codes or more reference data, just make sure you're labeling it as such and putting it in the right spot. But many of these catalogs have aspects of that as well. So yes, that's a great place to say these are the valid values for these fields. It's often a really great part of the definition, and sometimes it's the best definition. You know, I don't know what gender is; well, it's male, female. I can kind of understand by the list what that is: it's not self-identified, it's not transgender, right? That would be different. So just by giving the example, sometimes it's the best definition. So yeah, huge fan of putting those in.
A lot of great questions coming in here. So how does one select between CHAR and VARCHAR in this scenario? What if there are more than 20 characters in the given customer's name? Does it just cut off the rest?
I would just say that's a design decision. I just gave it as an example; what the data catalogs can do is publish what is out there. I think a good point
that was brought up is that often that's not the reality of what's in the field. With some of these, and I would put this more on the metadata repository end of the spectrum, it can say the standard is CHAR(10). And I guess to answer that direct question, if the field is CHAR(10) and your name is 12 characters, it's going to get cut off, and so that's why you want to give some thought to the definitions. What these metadata repositories can often do with lineage is say the defined standard is CHAR(10), but we can see that in your environment there are 16 different variations of that one: your Oracle system is CHAR(12) and your Sybase system (Sybase dates me) is CHAR(25) or whatever. And as any DBAs on the call know, you can't just randomly change fields without some impact analysis, so it can do that. But VARCHAR versus CHAR, that design decision has to do with space, and how much of it you use if you're not going to fill those fields. I think with most modern technologies that becomes a little less important. But that's a great place for this collaboration. Someone could say, you're saying that the length of your last name is 20 characters; well, we have a lot of Indian clients, and a lot of Indian last names are really, really long, and that's going to be too short. That's a great case where you think you have a standard, but until you actually have these discussion forums, maybe you never thought of that, because you were only doing business in the UK or something like that. So anyway, a lot to think about there.
Definitely. So when building a glossary, should I start with a scan for data lineage or a business tour to define a term that is later aligned with the system scan?
I'm not sure I understood the question. Could you read that again? I'm sorry.
Absolutely. So in building a glossary, should I start with a scan for data lineage or a business tour to define a term that is later aligned with the
system scan?
Oh, if I understand that correctly, I think you can do both in parallel. What these repositories can do, if I'm understanding it correctly, is this: there is a term, and I'm going to say it's social security number, and there's somebody who defines what that is, and then maybe someone says, well no, we're in Canada, it's social insurance number, whatever. So we have the right definition of what that term is, what it means, how it's used. That should be done by your business data steward; it's, you know, one of the most important critical data elements, and that can be defined. In parallel you can do a search, or scan, or whatever you want to call that reverse engineering, of what systems in the organization have social insurance number, and that's where some of this machine learning can help discover some of that. And what some of these tools can also do, and this might be a good link to this one, is some of that automated matching, where again you can define that when it says customer I want to link that to client, because that's the approved term now: we used to call it customer and now it's called client. You may want to do that manually, or maybe you want the system to help you auto-match that. But what these tools can often do is say the business term is social insurance number, and here are the 17,000 fields, and one of them, called field one, actually has social insurance number in it. So I think you can do those both in parallel, but I think it should be driven by the business priority, because you don't want to do that for everything. That's another good piece of advice for these: base it off the business value. A lot of these tools can do a full inventory: you point it against your systems and it'll scan back a full inventory of all your systems. That can be helpful, but is it overwhelming? So you might want to use your governance business data stewards to help prioritize which are the buckets we really want to publish, because I
want to search for ski boots and get the five ski boots I want. If I saw a thousand ski boots, I'd get overwhelmed and leave and not buy any. Same thing with data: I want to make it approachable.
And Donna, you talked a lot about data at rest, but I also see a need for data in motion, for example in queues or web services and other APIs. Do today's data catalog products handle these use cases as well?
I have generally seen that, I mean, it'll be the specification for the API for that data, but it's not really tracking the data. It'll track, here I go, the metadata about that data in motion: that I have either an ETL script or an API that passes that data across, and what the format of that data is as it's moving. So in that sense, where is my messy example? I shouldn't use these drawings. You know, here ETL is data in motion, but it's not saying which data is moving; it's more that this is the ETL script that I use to define that data, or that type of thing. But generally it is, in my experience, a data-at-rest kind of thing.
Can metadata tools feed data catalog tools?
Yes, theoretically yes, or you pick a tool that does sort of both. I mean, it goes back to this picture: it's often a multi-faceted approach. I have worked with some customers where, and again, when you're looking at these tools, think of your audiences. Some of these metadata tools, I can nerd out all day on them because they're fascinating: I can do a full lineage of all the system scripts and all the data types. That's what your data warehouse team is going to salivate over. Your business team, the self-service users, are probably going to run away from that. So in that sense, you've got that tool that's doing the full-on lineage, and then some of these self-service data tools can either go against the same data set, just at a higher level, or theoretically there can be some integration between, you know,
your more technical tool and that higher-level one, but to me it's probably a different tool with a different use case. I mean, I'm just thinking aloud, some of them have partnerships with some of these more technical scanners, or whatever they call it. You know, it maybe grew up as a data catalog, and then they partner with some of the technical tools to do that full lineage. That's probably the answer to that. Sorry, go ahead.
Oh no, it's good. So can data lineage tools allow you to better define data, or would the script be the primary resource used when cataloging data?
Yeah, I think that's a great question that ties into my other messy picture here. Again, there can be different environments, or just be clear what you're doing. You can use these metadata tools or catalogs as discovery: I want to see what's out there, what is the lineage? And you could find it's a mess, and then you can use that to say, but the approved lineage is X. So a lot of these tools, you know, I sort of just contradicted myself in a way. I said a bad use case is just to scan everything, and that's a bad use case if you're trying to publish it out to end users. But if you're the technical team and you really have no idea what's in these systems, that can be a huge benefit. I see that more on the metadata repository side, but that can be a huge benefit, to use it as a discovery tool.
And a big question here for you, Donna. We've got six minutes, so I think we can do this. How do you populate the catalog the very first time?
That sort of ties into that last question. On the technical side, most of these tools, and don't buy one if it doesn't have it, I know that's a strong statement, have these automated scanners or population interfaces. So the benefit of buying this tool is, I have what could even be something like a COBOL copybook that someone wrote
and retired, and I have no idea what this is, or this Oracle database, or my SAP system with a gazillion tables. They can point against these and reverse engineer, or whatever you want to call it, and create that initial technical dictionary, or, there, I used it, dictionary, metadata repository, inventory of things. That's sort of what you're paying for with these. I would warn that some of these tools have such a nice, slick end-user interface, but when you kick the tires a bit, some of that automation isn't as strong as maybe you'd want. Because the other thing is, how do you then do it the next time? That's where those metadata repository reuse rules come in: do you truncate and reload, do you update? You want to give that some thought, because really metadata, here I go, is just data, right? So, almost like populating a warehouse, you have some of those same decisions.
On the business side, I think how you start to populate it is tied into your governance: what's going to have the best business value? Do I start with the sales data and get a great glossary around that? Often a glossary is a great way to start, because business users can really understand that. If you're doing analytics and that self-service discovery, maybe you let people populate it themselves: just say, it's out here, we're trying to collect all the coolest dashboards around customer, what do you got? So give that some thought. It was a great question. I think on the business side you want to prioritize, as with the tech. But as for the way you do it: from the business point of view, I would type it in, or maybe scan in a spreadsheet that might exist; from the technical point of view, it should be automated; and from the collaboration point of view, clearly that's people collaborating, and you just kind of let that happen. That's the Wikipedia. Maybe you don't want to populate it; maybe you want them to populate it. Give that some thought: how are you going to populate it over time? Great question.
And I think
we have time to slip in one or two more here, Donna. Is data lineage of each attribute done at the data catalog level or the data dictionary stage?
The data lineage at the element level, I kind of see that at the metadata repository, probably not even the data dictionary. Here we're getting all nerdy about these terms. I see data dictionaries as almost static. That piece I showed, with here's your tables and columns and what they mean, to me that's a data dictionary. I think when it gets to a metadata repository, that's where you're adding lineage and impact analysis between columns and ETL and all of that. Whether that's a catalog or a metadata repository, it feels more like metadata repository to me, but some of these catalogs are really good at doing both and just, what do you call it, abstracting it for the business user. So the business user sees, here's your report, it comes from Oracle and your ERP system, and that's all you need to know. The technical team can drill into that same piece and see all the ETL scripts, but it's hidden from the business user. So it depends on the tool, but I see that as a metadata repository kind of thing.
All right, that does bring us to the top of the hour here. Thank you so much for everything, and, oh, it's always a great presentation. And thanks to our attendees for being so engaged in everything we do. A lot of great questions coming in today. Great topic, a very popular topic. And just a reminder, I will send a follow-up email by end of day Monday for this webinar with links to the slides and links to the recording. Donna, again, thank you so much. Really appreciate it. Thanks, everybody.
Thank you.