Hello, everyone, and welcome to our next EDW session, called Rebuilding Enterprise Information Management from the Metadata Ground Up, which will be presented by Peggy Zai, the Vice President of Data Solutions at BigID. Just to note, due to a conflict, this session has been prerecorded. All audience members are muted during these sessions. Though the speaker is unable to attend, please do feel free to submit questions in the Q&A window on the right of the screen to be reviewed at a later time. Please also note that there is a linked form at the bottom of the page titled EDW Conference Session Survey. This is where you can submit session feedback, and we encourage you to do so. So let's begin our presentation now. Thank you and welcome. Good morning. My name is Peggy Zai from BigID. Today, I will be talking about rebuilding enterprise information programs from the metadata ground up. Like many of you in the audience today, I also spent many years in the financial services industry working in enterprise data governance programs. My role was predominantly around building out the strategy and helping to operationalize programs so that they fit the regulatory requirements of BCBS 239, Solvency II, CCAR, and GDPR. I'm really excited to be here today because I'll be talking about some of the data management practices that I've seen personally, and how there are new AI and machine learning technologies that can really help with rebuilding, and really reimagining, the way that enterprise information management programs can work today. Recently, at BigID, we had a data governance summit where Juan Riojas, chief data officer at Rackspace, gave this quote: "data rich, insight poor." And I think we've seen that with a lot of organizations that collect, create, and consume a lot of data. But the challenge they face is the fact that they're not able to get actionable insights from that data.
So while they want to treat data as a sacred and strategic asset, they're not able to do so. Now, the data challenges that we certainly all see are especially apt when you look at an iceberg. Above the waterline of the iceberg are a lot of the strategic efforts that many data executives and corporate executives are looking at today, such as supporting analytics, building out data science teams, and a lot of the AI and machine learning projects going on today. But below the waterline of the iceberg is where most of the foundational data capabilities are still being built out. What I'm talking about here are the efforts around building out a data foundation: inventorying, cataloging, data mapping, master data management. These are all very difficult and time-consuming activities that require not only your data teams, but also your business stakeholders and your technology teams, to work together. Now, new challenges that have certainly come about for data organizations in the last couple of years really revolve around growing data environments. Not only are there structured databases that data teams need to govern, there's also data that's in the cloud, in the lake, streaming data, IoT data. There's just a large amount of data that was never really considered before in traditional data management programs, and if it was, it was simply out of scope. There are data centers being created in new locations, and new data formats, especially when it comes to unstructured data: data that's in a file, in a PDF, in email. Many organizations are using Google or AWS, and these are new types of environments where data is being sourced. Now, growing regulations play a big part in this, because new data in all of these formats and locations must be considered in scope.
And I think the role of a chief data officer now has to consider data alongside the chief security officer and chief privacy officer, who have very similar demands and requirements when it comes to data and its consistency, accuracy, and use across the organization. They may also have requirements when it comes to data usage, such as access controls. So when it comes to the different business stakeholders in your organization that have a stake in data, where do you start? Where do organizations start today in terms of documenting where data is, what exactly it is, and ultimately who owns the data in the organization? Now, one of the priorities that comes into play when it comes to data management is first and foremost quality. I think that's one of the first questions that many business stakeholders ask: what's the quality of my data? And there are different perspectives when it comes to data quality. Depending on the creator of the data, there are specific data quality standards and rules in place, but you also have to think about it from the perspective of the data consumers, who have certain expectations when it comes to data quality. Another priority when it comes to data management programs is the concept of enablement. And this is really important, because we're talking about enabling business stakeholders and data users across your organization to have the proper access and to make the right usage of the data. Again, this matters for many of the new privacy regulations, especially GDPR, which really set the standard for many organizations to think strategically about the data that's already being used in their data governance and management programs, but now with a privacy mindset applied.
Data lifecycle management and data access are both important components of a security management program, to make sure that the data that's used from a governance perspective is also adhering to the compliance and risk standards set by your chief compliance officer and chief risk officer. So there are multiple players now involved in helping to influence and shape a data management program today. Now, the key part of it, especially when it comes down to the chief data officer and his or her data management program, is really looking at the fundamentals. And this is what I spent many years working on: building out these fundamental capabilities that can then be leveraged across the organization and used for multiple purposes, ultimately for analytics or data science, but also now increasingly for data privacy and compliance adherence. So one major component of a data foundation program is having a data catalog, or an inventory of your data assets. So what does that look like fundamentally? Now, one of the key challenges that many organizations face is populating this catalog. Predominantly it used to be a very manual effort, in terms of having to collect this information from your data stewards and from your business owners, and that was obviously limited by the time and the resources available to actually collect it into a shared tool or spreadsheet. And what about the other information that was needed to understand the data in your catalog, especially in terms of describing the data, putting a definition around it, and standardizing and reconciling the business terms inside your catalog? These are additional aspects of the data that needed to be populated, and another big component is unstructured data, which now needs to be part of your foundation.
So the question that I like to ask many people today is: how do you keep your data catalog up to date, and how do you actually take action on this data after it's been collected and put into your catalog? The challenge, again, that many organizations face is that there's a manual effort, and even if it is somewhat automated, the coverage of the catalog is limited by the technology itself. So you want to make sure that you are selecting a technology solution that encompasses and is representative of your entire data landscape. You don't want to just be governing your structured data, certainly not. And this is where automation is really essential for success, and, again, one of the key things that I would have liked to have had as a data steward in my former roles at other organizations. The reason why automation is going to be a key differentiator for organizations that embark on this journey is the fact that, again, the scale of data is growing at a very rapid pace. We're talking about data centers and petabytes of data spread across global regions, and the fact that it's practically impossible for any one person to really have knowledge and understanding of the business context and usage of all their data. Data is constantly being changed and updated, and that needs to be reflected in an automated manner as well. So in the data steward role that I used to play, it was really difficult to keep up with the pace, and with the accuracy, of the information that needed to be shared across the entire organization. This is where we look to automation, in terms of being able to scale at fast speeds and accurately reflect the data landscape that's in your organization.
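As a rough illustration of what automated catalog population means in practice, here is a minimal sketch, assuming a SQLite source for simplicity; the table and column names are hypothetical, and a real platform like the one described would connect to many source types, not just relational databases. The idea is simply that the catalog entries come from scanning the source's own metadata rather than from manual data-entry:

```python
import sqlite3

def build_catalog(conn):
    """Scan a database's system metadata and return a simple catalog:
    one entry per column, with its table and declared type."""
    catalog = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        for _, col, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
            catalog.append({"table": table, "column": col, "type": col_type})
    return catalog

# Build a toy source and scan it (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, created TEXT)")
entries = build_catalog(conn)
for e in entries:
    print(e["table"], e["column"], e["type"])
```

Because the scan reads the source's system metadata, rerunning it after a schema change keeps the catalog current without anyone filling in a spreadsheet.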
And the usage of AI and machine learning is really crucial here, because it helps to identify patterns and outliers in your data, and these are really important insights that can help a human data steward who's looking at the data be enabled by a lot of these automated insights that are available today, and as a result take faster action in the data governance program. So some of the governance capabilities that can be automated with metadata are the ones on the screen today. Let's start with data discovery. First and foremost, one of the challenges that all organizations face is not just finding the data; they need to actually know exactly where it is and be able to connect and link it to their known data sources. Being able to get to that information easily using automation is really critical here, because you do not want to rely on crowdsourced teams to collect and curate that information together. Other capabilities involve data quality: being able to leverage automation to identify and profile outliers in your data, and to get really quick insights into the completeness and timeliness of that data. That's not something you want to spend time writing standard operational data quality rules for; ideally you want governance capabilities that will give you immediate insight into the quality of your data and your data sets. When it comes to data lifecycle management, these are capabilities around retention policies and how long data should be kept in your organization. Again, this is a huge use case, especially when we talk to many groups in the records management office, and they are spending time finding the data, understanding what the data is, and then applying their data policies to it.
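To make the data quality idea concrete, here is a minimal sketch of the kind of automated profiling described: completeness as the share of non-null values, and outliers flagged by a simple z-score rule. The sample values and the 2-standard-deviation threshold are illustrative assumptions; production profilers use more robust statistics:

```python
from statistics import mean, stdev

def profile(values):
    """Profile a column: completeness (share of non-null values)
    and simple z-score outliers among the present values."""
    present = [v for v in values if v is not None]
    completeness = len(present) / len(values)
    mu, sigma = mean(present), stdev(present)
    # Flag values more than 2 standard deviations from the mean.
    outliers = [v for v in present if sigma and abs(v - mu) / sigma > 2]
    return completeness, outliers

vals = [10, 12, 11, None, 13, 11, 250]  # one null, one suspicious entry
completeness, outliers = profile(vals)
print(completeness)  # 6 of 7 values present
print(outliers)      # → [250]
```

A steward reviewing this column sees instantly that it is mostly complete but contains one value worth investigating, without writing a bespoke quality rule first.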
We think there are better ways to automate this process, especially if there's already a foundation built on metadata and data that identifies all the data across your organization; then you simply apply the policy that's needed to adhere to those documents. A lot of the effort that goes into governance programs is mainly around stewardship. A lot of the issues around stewardship stem from the fact that, on top of the large scale of data, there's immediate action needed to identify duplications, similarities of data, and business terms across the domains, and these are things that can be supported by AI and machine learning. And certainly around reducing risk: understanding where there's open access to your files, and being able to bring those insights directly to your compliance and risk teams, are hugely critical activities. Now, when we take a bottom-up metadata approach, this really enables us to automate and build repeatable and scalable processes that can support your information management programs. So, as I said earlier, capabilities around discovery, and multiple ways to discover and link your data together to give users insights, allow data teams to scale and build out their programs really quickly. Combine that especially with deep machine learning insights, where not only are we identifying patterns, but there are machine learning technologies that can predict a lot of where sensitive and confidential data may be in your organization. So being able to connect and link all that data together allows a data team to take action on their data much faster. And of course, these types of solutions must work at scale and leverage automation so that stewards are not burdened by the heavy lifting of collecting that information. They can simply focus on validating and confirming a lot of the insights that come from these machine learning technologies.
And most importantly, the scope of data that we're looking at in any organization must span across all your data centers and all your cloud and lake environments. In machine learning and data discovery, there are many techniques that can be applied today, especially leveraging natural language processing classifications to automatically enrich your data sets and data attributes with tags and labels. An additional technique that I haven't often seen in the field, but which is available in BigID, is correlating the data; that means tying together all related attributes to either a single entity or a person. And this has been extremely critical, especially when it comes to GDPR privacy regulations and fulfilling a data subject access request report. So a lot of the classification comes from identifying this data using machine learning and regular expressions, but also other techniques around named entity recognition that find data at a document and file level. So we're able to really group together, or cluster, very similar data. Again, this helps data teams identify similar data that can be minimized. This matters for privacy regulations and data minimization, but also, from a governance and technology perspective, it lets you know where there are duplications, so you can identify either the single source of truth or the golden copy of the data. Now, the next two slides that I'm going to cover here are just a little bit more about the technology and the platform at BigID: again, data intelligence that can be supported from this foundation of data discovery, being able to connect to lots of sources, and being able to virtually provide this in a catalog. So as we automatically scan your data sources today, we're able to visually represent all this information in one single catalog.
So this catalog contains operational and business metadata that really helps your data teams and users understand the context of this data. The second piece here is about classifying your data. Classifying your data means being able to understand it based on the data values. So, more than just looking at metadata, looking at the data in conjunction with it allows you to understand exactly what differentiates a 16-digit number that's a credit card number from a random string of numbers. By applying techniques like regular expressions, natural language processing, and machine learning and deep learning, you can really classify and understand the data itself. So that's really the first and most fundamental step of building out your data foundation. The third piece here is around cluster analysis. As I mentioned earlier, especially when it comes to unstructured data, think about all the files being created in your SharePoint, your local drives, your shared drives, all this information, whether it's PowerPoint or Excel files or PDF files: how many are literally duplicates or variations of each other, and how many would you actually keep if you were to embark on a cloud journey, for example, or were migrating your data over to a data lake? You really want to think about minimizing the data that you'll be moving over. And then you want to think about which copies of data you actually want to eliminate from your organization, whether you're abiding by your data retention policies or whether you just want to have a cleanup exercise, right? So how do teams do this exercise today? It's literally opening up each file individually and then making that determination.
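The credit-card example above can be sketched with two of the techniques named: a regular expression to find 16-digit candidates, plus the standard Luhn checksum to separate likely card numbers from random digit strings. This is a minimal illustration, not BigID's actual classifier, and the sample strings are made up:

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right;
    a valid card number's digit sum is divisible by 10."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

CARD_RE = re.compile(r"\b\d{16}\b")

def classify(text: str):
    """Return 16-digit strings that also pass the Luhn check;
    the regex alone would over-match arbitrary digit runs."""
    return [m for m in CARD_RE.findall(text) if luhn_valid(m)]

# 4111111111111111 is a well-known Luhn-valid test number;
# 1234567812345678 is a random-looking string that fails the check.
print(classify("order 1234567812345678, card 4111111111111111"))
```

The point is exactly the one in the talk: metadata alone can't tell these two strings apart, but looking at the data values can.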
But leveraging a technique called cluster analysis allows you to see groupings based on the size of the similar files, and also shows you keywords or attributes that help describe the grouping, so that you understand the context of this information. Again, you're leveraging machine learning to get the insights on whether or not you want to take action, and what priority that action has against the other activities you have going on. The last one here is correlation. Correlation was predominantly created for privacy use cases: being able to tie together all related attributes to a single person. So, for a person, for privacy purposes, that means being able to find related healthcare records or cookie and IP settings; those are all related pieces of information that you may or may not know you're collecting on an individual, but when it comes to fulfilling a data subject access request, it's information that needs to be collected. Taking that example of a person, you can certainly do correlation for an account or any type of entity. We've seen it done on other types of information, and it helps to identify data that you may not know is related to that starter set. So, building upon this data discovery foundation, there are very specific applications that can be used to emphasize and take action on the insights derived from the foundation. When it comes to data privacy, there are apps we have around data subject access request fulfillment and data processing and sharing, and a privacy portal in place to manage a lot of the privacy requests that come in. Data protection apps are our security-focused applications, which cover data remediation, labeling of data, and understanding data breaches.
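To show the shape of the cluster analysis idea, here is a minimal sketch that groups documents by token overlap (Jaccard similarity) using a greedy single pass. The file names, sample texts, and 0.5 threshold are illustrative assumptions; real document clustering uses richer features and fuzzier matching, but the output is the same kind of artifact described above: groupings of near-duplicate files a steward can act on:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two token sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering: each document joins the first
    cluster whose seed shares enough tokens, else starts a new one."""
    clusters = []  # list of (seed_tokens, member_names)
    for name, text in docs:
        tokens = set(text.lower().split())
        for seed, members in clusters:
            if jaccard(tokens, seed) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((tokens, [name]))
    return [members for _, members in clusters]

docs = [
    ("note_v1.txt", "patient intake registration form"),
    ("note_v2.txt", "patient intake registration form updated"),
    ("rx.txt", "prescription refill request"),
]
print(cluster(docs))  # → [['note_v1.txt', 'note_v2.txt'], ['rx.txt']]
```

The two near-identical notes land in one group, so instead of opening each file individually, a team reviews one representative per cluster.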
And lastly, data perspective refers to our data governance applications, and these are very specific applications that can be used to monitor data quality and stewardship, certainly with an understanding of lineage and where exactly all your data is today. So the next portion of my presentation is really about some of the customer use cases that we've seen. Obviously, I'm not speaking for any specific organization, but talking in general about the types of organizations that we speak to, customers of ours, and some of the challenges they face. When it comes to financial services, these are predominantly global organizations that have enterprise-wide data programs, but when it comes to their data, they're bogged down by legacy systems. They really have challenges with multiple systems that may hold their customer data: how do they work on connecting and breaking down those data silos, and really build out a know-your-customer program, manage risk and compliance, and also customer 360? And as companies grow throughout the years, they may purchase or take over other companies, and through those mergers and acquisitions they may be building upon legacy data assets that they now need to include and govern as part of their organization. New banking and insurance regulations have also always played a part in how data programs are run in the financial services industry, and certainly these firms need to be quite conservative, but also very proactive in making sure they have all the controls in place for all the data that's in movement. So one of the challenges that we've seen is the fact that, with all these mergers and acquisitions, there's no one single group, whether on the technology or the business side, that has an understanding of all their data.
And you also have to consider the fact that, with employee turnover, that kind of documentation and knowledge about data and its usage can easily be erased once a person leaves an organization. So the concept here, again with the data catalog, is being able to bring together all the disparate data sources, whether they're spread across your lines of business or split between your legacy and current source systems, and being able to identify, in one singular virtual view, all the data you may have in unstructured formats, whether it's in your pipeline, cloud, or NoSQL data. This is information where you want to make sure the proper governance and controls are in place. And by building this data catalog, you're really enabling your data consumers to find the right data and understand the right data that they need to be using. Not just your business users; we're talking about a lot of the business stakeholders in privacy and security as well, really being able to centralize the tagging and enrichment of the data in one single place, so that if someone on the privacy team is looking at a column, they can understand the purpose of use and see whether or not it falls under a specific regulation, while a data governance team member can look at the same view and understand whether or not there's a definition applied and the right data owners assigned to review and govern the data. So being able to share this foundational knowledge with multiple groups is, again, sort of a revolutionary concept: we want to expand this type of enterprise data beyond the traditional business and data teams. This part is about machine learning classification for any data. This is really the second part, after building out a virtual catalog: being able to put labeling and tagging on your data, and extending that to your metadata, and perhaps even all your documents as well.
So what's really important here is that you want to make sure you're using the right technology, one that reduces false positives and accurately identifies your data in multiple ways. So you leverage natural language processing and named entity recognition to label your data, and that can then be applied downstream, for example, by your data science team for feature creation, or by your data visualization team for building out dashboards, and so on. Classification allows downstream data consumers in your organization to know whether it's the right context and the right data that they should be consuming. And being able to leverage a model that allows you to give feedback on the results is really quite powerful as well, because organizations may have very specific and nuanced information, and as a user of any technology you want to make sure that you have the ability to give insights and feedback to train the model and improve its accuracy. The retail space has certainly been one of the key drivers, especially in leveraging data. When it comes to customer segmentation, retailers really want to be able to improve the customer experience through personalized products, offerings, emails, and marketing messages. So the retail industry has certainly been one of the beneficiaries of good data management and good customer 360, while also balancing that against a lot of the privacy regulations, because they are collecting a lot of personal information about each customer. When you go shopping, you may sign up for their credit card or sign up for rewards on their website. All this information, and even your browsing on a retail company's website, means you're possibly being tracked on the web pages you go to and the products you click.
All this information is being collected by retail companies, and as a retailer, you want to make sure that you can find all this data that's being collected, and, if required by privacy regulations, be able to delete all that information. Data quality has certainly been a huge part of a retailer's data organization priorities, especially since they need to maintain accurate records of their customers so that they can communicate with them. And as many organizations are now moving to the cloud, it's imperative for retailers to ensure that they are not keeping duplicate copies of their data, and that they are able to prioritize the data with which they want to start their migration. So being able to have customer 360, being able to link all the customer data to all the related attributes, activities, and findings, and being able to create business policies that help monitor the movement, storage, and placement of the data, is really critical. When it comes to some of these concepts, correlation lets you look across your data landscape: again, not just the data that's saved in one particular database, but what if it's a file, or saved in your CRM tool, your marketing tool, et cetera? You want to make sure that you're able to look across these different siloed applications and understand how it all connects together. The illustration that you see here today is based on leveraging graph technology and building out what I like to call a spider web of information that ties back ultimately to a data element, or to a person or an entity like an account, for example. It helps an organization find hidden relationships in the data that you're never quite going to see otherwise. You're not going to know, or be able to infer, that there are dependencies or other relationships; they just pop up when this type of correlation is built.
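The spider-web idea above is, at its core, a graph whose connected components tie scattered attributes back to one entity. Here is a minimal sketch, assuming hypothetical identifiers for the people, CRM records, and cookies; it builds an undirected graph from observed links and walks it with a breadth-first search to find everything reachable from a person:

```python
from collections import defaultdict, deque

def correlate(edges):
    """Build an undirected graph of attribute links and return a
    lookup that finds everything connected to a given node."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    def component(start):
        # Breadth-first search over the link graph.
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    return component

# Links observed across siloed systems (all identifiers hypothetical).
edges = [
    ("person:ann", "email:ann@example.com"),
    ("email:ann@example.com", "crm:cust-42"),
    ("crm:cust-42", "cookie:abc123"),
    ("person:bob", "email:bob@example.com"),
]
reach = correlate(edges)
print(sorted(reach("person:ann")))
```

Note that the cookie was never linked to Ann directly, only through the CRM record: that indirect hop is exactly the "hidden relationship" a data subject access request needs to surface.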
And this is what I meant earlier when I said there's a lot of data, but having that insight, and then being able to take action on it, is really the challenge that data organizations face today. By virtue of correlation, being able to see those hidden relationships allows data teams to then take action. Another challenge that many data teams face today is being asked to map their logical data, their business terms and logical concepts, to the physical data layer. So you can have a concept like customer email address, but the actual data for customer email address could be saved in emails or other documents stored in your storage systems, and being able to leverage AI and machine learning technology to map that all together has been quite critical. Now, this third use case we've been seeing predominantly across the healthcare sector, and also amongst many pharmaceutical companies: there are new regulations popping up around healthcare that ask for transparency around patient data, really being able to publicly provide all that data. And a lot of healthcare providers have to adhere to HIPAA compliance, which is very similar to privacy regulations, where they have to keep and maintain consent for patient data, and that can actually vary even across states in the United States; you can have varying levels of consent that you need to maintain. So all these policies around information security and retention, how long you have to keep all this information, just add to the layers of complexity of maintaining this data, and the concept of remediating, or fixing, the data is a concept that we hear about quite often.
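The logical-to-physical mapping challenge can be sketched with simple fuzzy string matching: matching a business term like "customer email address" against abbreviated physical column names. The column names below are hypothetical, and this is only a baseline illustration; the ML-driven mapping described in the talk would also use data-level evidence, not just name similarity:

```python
from difflib import get_close_matches

def map_terms(terms, columns, cutoff=0.5):
    """Fuzzy-map each business term to its best-matching physical
    column name, comparing against normalized column names."""
    normalized = {c: c.replace("_", " ").lower() for c in columns}
    mapping = {}
    for term in terms:
        hits = get_close_matches(term.lower(), list(normalized.values()),
                                 n=1, cutoff=cutoff)
        if hits:
            # Recover the original column name from its normalized form.
            mapping[term] = next(c for c, n in normalized.items() if n == hits[0])
    return mapping

# Hypothetical physical columns with typical warehouse abbreviations.
columns = ["cust_email_addr", "acct_open_dt", "postal_cd"]
print(map_terms(["customer email address", "account open date"], columns))
```

Even this crude matcher pairs "customer email address" with `cust_email_addr`, which is the kind of candidate mapping a steward would then validate rather than construct by hand.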
Also, when it comes to CMS interoperability, this is where the exchange of information is going to be really important, especially for shared data exchanges or marketplaces: being able to build upon the data that's going to be available and making sure it can be consumed for better analysis. In healthcare, a lot of the data being collected is still in unstructured formats. So you can, again, leverage the concept of cluster analysis, with fuzzy matching on the data attributes found inside the contents of these documents or PDF files, and get a sense of the size of the information. In this example, the biggest circle is around medical registration, which shows that there are a lot of text files, unstructured data, relating to the topic of medical registration, and similarly around prescriptions, doctor notes, and patient records. This shows a data team and a technology team how much information is out there that needs to be consolidated or minimized, both to reduce your technology footprint and to minimize your risk by adhering to your data policies. So, hopefully, these use cases and approaches to managing data have emphasized that if you're going to reinvent, or improve and scale up, your enterprise information management program, you want to start with metadata and data, because that helps to automate and really scale up and build out a lot of your processes. First, by automating your discovery process and the ability to connect all your data together, you can eliminate your data silos, giving yourself a 360-degree view of visibility across your entire landscape, while at the same time adhering to privacy regulations or industry-specific regulations that really require you to have a full handle on your data.
And then once that foundation is starting to be built out, you can activate your entire information lifecycle when it comes to managing the data, making sure that you're deleting the data properly as well. Other use cases for looking at metadata and data from a ground-up perspective include cloud or data lake migration projects and consolidating data. There are so many prioritization efforts going on today for cloud projects, but without leveraging any of these automated techniques it becomes really difficult, because otherwise you'll just be looking at domains one by one, looking at files one at a time, and that just prolongs the whole process. Lastly, securing data access is a huge priority for many organizations, who are thinking not just about data access but about having the right controls in place and really reducing any potential data breaches. Fundamentally, this is first and foremost about knowing your data, and knowing your data from both a metadata and a data perspective allows you to take on a lot of automation. Ultimately, the business outcomes that organizations are really looking for today come down to reducing your risk, supporting your data lifecycle management, and allowing your business teams to build new products, find a competitive edge, and really innovate on top of data that's already being collected and consumed today. Building customer trust is really critical, because even as an internal data consumer, you want to be able to trust the data that you have access to and that you're looking at. Otherwise, if there are discrepancies or differences in the data, or different teams are using or collecting the same data differently, you're just going to be at a standstill and not know what to do.
Ultimately, as an organization, you want to achieve compliance, you want to pass all your audit tests, and you want to make sure that you're doing this consistently and in a shared process. So BigID really sits at the center: our technology of discovery and classification, really bringing this intelligence and actionable insights, supports privacy use cases. A lot of our stakeholders are in privacy. On security, our security teams look to us to help with a lot of their use cases. And then governance, really building out those foundational capabilities and insights around data quality and data lineage. So I hope what I shared today gives everyone insight into ways that a data program can be accelerated by focusing on metadata and data. A lot of new technologies focused on AI and machine learning, especially those at BigID, can really help accelerate and bring these business values, reducing risk and adhering to compliance, bringing that together faster and making it a winning combination for your organization. With that said, I thank you for your time, and I hope you enjoy the remaining sessions at Enterprise Data World. Thank you. All right. Thank you, Peggy, for this great presentation, and thanks to our attendees for tuning in. Please complete the EDW conference surveys located at the bottom of this page. The next sessions will start in a few minutes. If you have questions, make sure to add them in the Q&A on the right side, and we'll get back to you on those. Thank you.