My name is Josh Malewski, and I'm a senior solutions engineer here at Manta. David Bryan from our team was originally supposed to give today's presentation, but unfortunately he had some personal matters to take care of that prevented him from making the trip out here to beautiful San Diego. So instead I'll be talking today about how active metadata helps with innovation. As I mentioned, my name is Josh Malewski. I've been here at Manta for about a year and a half, and I've been in the IT space for over a decade now. Data lineage is what we specialize in, and it's been my forte of late. I'm originally from New Jersey, so if there are any folks here from the New York City area, you're very much near me. So let's talk a little bit about companies and how the world used to work. For a long time, it was widely accepted that the bigger the company, the more successful it would be: the larger the company, the more resources available to throw at solving complex problems. The world, though, has changed a bit. Nowadays the secret ingredient to success is the ability to innovate quickly, adapting to changes as they come, rather than simply being big. Fast to react to customer feedback, fast to react to customer requests, fast to introduce new products and services to the market, and fast to react to changes in the marketplace. These are all key characteristics of an agile and, generally speaking, smaller company. The reason is that smaller companies don't have the baggage of a long history: complex environments, lots of historical services, and bureaucratic processes impeding their responses to the market. So for today's discussion, I'm going to talk about how to use metadata to gain the flexibility to innovate even as smaller companies grow, and how to help larger organizations gain this flexibility back.
If data is the lifeblood of an organization, then metadata is the DNA. I personally like this quote, primarily because I came up with it, but it's also quite an accurate analogy. Just as your DNA shows the genetic origins of where you came from, metadata shows the information's origins. Your DNA defines your physical characteristics; metadata defines the physical shape of your data. Is this metric we're looking at a varchar, or an int, or a datetime field? Where was this information sourced from? When was it last updated? All of these details are defined in that meta-DNA. I know there are some other folks out there who might snipe this quote from me today, but I'm quite proud of it. "Companies are now reporting approximately 90% or more of their time is spent preparing data for advanced analytics, data science, and data engineering. A large part of that effort is spent addressing inadequate, missing, or erroneous metadata and discovering or inferring missing metadata." This quote I did not come up with; I borrowed it from a recent Gartner report. It shows that only 10% of highly skilled resources' time is spent on the actual work they were hired for, limiting their ability to innovate. The rest of their time is spent on supporting activities that are not directly related to the job they were brought on for, and a large part of those supporting and prep activities that take up 90% of their day is related to missing, incorrect, or unavailable metadata. What this means is that companies with more streamlined metadata management processes have a strategic advantage over those that don't. For today, we'll start with an overview of where we are, then move into what we can achieve by activating our metadata. We'll start by asking: do we currently have an environment with a black box? If so, how can we activate the metadata to shed a little more light into that black box?
Then we'll talk a little bit about data lineage and how it can be used going forward to enhance those metadata processes. And at the very end, I'm going to open it up for questions; I'll leave 10 to 15 minutes for you to give me feedback and ask anything you'd like about today's presentation. OK. Why is 90% of engineering's expensive time spent on data preparation and lookups? Well, it's because of this. This is a terrible slide. There's entirely too much going on here for anyone to digest what is actually happening, but that's really the intention of the slide: it's meant to represent the complexity of what data environments have become. And to be honest, this is a grossly oversimplified version of the truth, because in reality we wouldn't be able to represent the data flows of most organizations even with a slide that took up the entire size of this room. This isn't changing anytime soon. This is our proverbial black box. If we rewind about 40 years, we had just our primary system supporting our day-to-day operations. As time progressed, in the '90s we realized we could use data to drive the business and make better decisions, and we started to introduce data warehouses for analytics. Fast-forward to recent years, and the demand to better serve customers, along with data-fueled platforms such as web applications and social networks, requires more analytics to be available, introducing new platforms such as big data and new methods of analysis such as artificial intelligence and machine learning. And this is exactly what makes organizations slower as they grow: it is much harder to innovate. Our data engineers and scientists are often going into their jobs blind. In many cases the environments, or at least the parts they're not familiar with, are black boxes to them. They need a map of that environment, and in the world of data, that map is metadata.
So, quickly: what do we mean by metadata? This is a typical situation a data scientist runs into. She has access to records, but how does she interpret them? How does she know what the activation date is? How does she know what a "T" in the active column means: is it "true", or is it "terminated"? And there's much, much more here. Whose data is this? Who uses this data? Where does this data come from? How often is it refreshed? Is this raw data or clean data? Does it contain PI or PII information? As well as a slew of other questions. This is a good example: a data scientist could look at this dataset and say, OK, this could be all of my customers globally, or it could be just the customers with the initials "Ju" that started after 2010. Maybe they want to build a model off of just those customers, or maybe this dataset is completely useless to them. In either scenario, they're going to need to validate that this dataset is exactly what they're looking for before they get to actually building out their models, the innovating part of their job. So how do we activate the metadata, then?
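To make those interpretation questions concrete, here is a minimal sketch of how a machine-readable data dictionary resolves the "T" ambiguity. The field names, types, and value mappings below are illustrative assumptions, not the schema of any real catalog or of Manta's product.

```python
# A minimal sketch of metadata-driven interpretation. The field descriptions
# and value mappings are illustrative assumptions, not a real data dictionary.
data_dictionary = {
    "active": {
        "type": "char(1)",
        "description": "Account status flag",
        "values": {"A": "active", "T": "terminated"},  # 'T' is not 'true'!
    },
    "start_date": {
        "type": "date",
        "description": "Date the customer relationship began",
    },
}

records = [
    {"name": "Julia Smith", "active": "T", "start_date": "2012-06-01"},
    {"name": "Juan Perez", "active": "A", "start_date": "2009-04-15"},
]

def decode(record):
    """Translate the coded 'active' value into its documented meaning."""
    mapping = data_dictionary["active"]["values"]
    return {**record, "active": mapping[record["active"]]}

decoded = [decode(r) for r in records]
# With the dictionary, the scientist knows Julia's account is terminated,
# not 'true'; without it, a model could be built on the wrong population.
```

Without that mapping available as metadata, the only way to answer the question is to go find the engineer who loaded the table, which is exactly the lookup time we are trying to eliminate.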
The metadata we provide to our engineers serves as a map of our data environment and our data pipelines. Let's use an actual paper map as an example. You can use it to plan a trip from point A to point B, and our data scientists can do the same with their metadata, if it's available. Sure, you can hand your relational databases' DDL and stored procedures to your scientists. You can provide them with the ETL jobs and the XML files from those tools. You can even expose reports from your BI and analytics platforms and have them comb through to locate all the paths they're interested in. You'll probably also need to share the application code that's moving data around the organization, and there might be a number of different languages those applications are written in: Java, Scala, Groovy, Python, R, C#, COBOL, just to name a few. I had somebody come to me the other day and ask for TypeScript. That was a first, but they had data lineage being developed in their TypeScript application. As we get deeper and deeper down the rabbit hole, we can see that the time these engineers are spending is, in fact, more on researching the data than on actually producing results from the data. So why would anyone still use a paper map?
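As a taste of what automating that research looks like, here is a toy sketch of code-level lineage extraction from SQL. Real lineage tools parse the full SQL grammar (plus ETL configurations and application code); this regex handles only a simple `INSERT ... SELECT`, and the table names are made up for illustration.

```python
import re

def extract_lineage(sql: str):
    """Return (target_table, source_tables) for a simple INSERT ... SELECT.
    A toy sketch only: real tools parse the SQL grammar properly."""
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return target.group(1), sources

sql = """
INSERT INTO dw.customer_dim
SELECT c.id, c.name, a.mailing_address
FROM crm.customers c
JOIN crm.addresses a ON a.customer_id = c.id
"""
target, sources = extract_lineage(sql)
# target  -> 'dw.customer_dim'
# sources -> ['crm.customers', 'crm.addresses']
```

Multiply this by every stored procedure, ETL job, and application in the environment, and the value of doing the extraction automatically rather than by hand becomes obvious.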
None of us here in the room would really use a paper map, as far as I'm aware, because there's a much better way to navigate. Most of us are using a GPS navigation app, and by "most of us" I mean all of us. You can think of this as an "activated" map, to go along with the active metadata theme. There are lots of advantages to a navigation app. It gives you instructions: turn left, go straight, then turn right. These instructions are based on our current location. If we make a wrong turn, it automatically recalculates the route and gives us new instructions. It even factors in current traffic to choose the best route for us. And this is exactly what we need to do with our metadata. We need to produce something similar to GPS navigation: not just make metadata available to our engineers as a paper map, where they have to read through all of the actual code, but have it actively give them feedback. We need to activate the metadata so that it guides us, giving us suggestions, hints, and alerts the same way that GPS navigation app does while we're driving. To sum it up: active metadata is metadata available in a form that can be used in data-related processes to provide hints, alerts, and suggestions, or even make decisions for us. This is a huge change compared to how we understood the role of metadata up to now, as a mere catalog. For the actual deployment, this means we need to revise and digitize the processes we currently have, keeping in mind that in real life this means modifying, or completely revising, and digitizing the processes we currently use to work with data.
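Before moving to the example, here is a tiny sketch of that "GPS-style" behavior: a freshness rule that, instead of sitting in a catalog waiting to be read, evaluates metadata on its own and emits an alert, the way a navigation app recalculates a route. The table names, thresholds, and record layout are illustrative assumptions, not a real product's API.

```python
from datetime import datetime, timedelta

# Illustrative metadata records; in practice these would be harvested
# automatically from the source systems, not hand-written.
table_metadata = [
    {"table": "dw.customer_dim", "expected_refresh_hours": 24,
     "last_refreshed": datetime.now() - timedelta(hours=30)},
    {"table": "dw.order_fact", "expected_refresh_hours": 1,
     "last_refreshed": datetime.now() - timedelta(minutes=20)},
]

def freshness_alerts(metadata, now=None):
    """Actively flag tables whose data is staler than the refresh
    cadence the metadata promises."""
    now = now or datetime.now()
    alerts = []
    for m in metadata:
        age = now - m["last_refreshed"]
        if age > timedelta(hours=m["expected_refresh_hours"]):
            hours = age.total_seconds() / 3600
            alerts.append(f"{m['table']} is stale: last refreshed {hours:.0f}h ago")
    return alerts

for alert in freshness_alerts(table_metadata):
    print(alert)  # only dw.customer_dim should fire
```

The point is the direction of the interaction: the metadata comes to the engineer with a hint, rather than the engineer going to the metadata.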
Let's analyze such active metadata with a specific example: the data quality process. In the case that follows, I'm going to be using a specific type of metadata, data lineage. In a nutshell, data lineage gives information about your data flows and data pipelines. It shows you where the data comes from, where it's distributed to and used, and what changes are applied to it along the way. It's a critical capability that allows us to manage the dependencies in our environment. So, I had a problem for a long time: my new credit card was constantly sent to the original mailing address I provided when opening the account. What kind of metadata do we need to investigate to resolve this issue? We can see I have an incorrect mailing address right here. I need to understand where the mailing address is being sourced from, because it may not be a single source; there may be multiple. Each of the blue dots on this screen represents an object within the data lineage lifecycle. An object can be anything that stores or transforms data: a relational database table, an ETL job, a stored procedure, anything like that. It's going to take information, house it, and move it, and we're representing it here with a blue dot. These are all the sources and movements of the information before the data reaches its final resting place with the wrong value. What we need to know is whether, at any point, the data is being transformed or filtered.
On the far left-hand side, I have two different sources for my mailing address, and we can see they take two different paths. They converge on this center path, which is essentially a logical decision point for deciding which mailing address is the correct one. Maybe one location holds the originating account information with the original mailing address, and the second location is a modified table where all new addresses get placed, something of that sort. Either way, it's pulling information from both sources, making a decision, and then sending it on downstream. There are also additional details here: the data could be modified at specific points in time, and we could have data quality scores evaluating these data points to determine whether they are in fact correct. And there could be a number of other situations where the data is being used by other applications. So we need to understand: if we make changes to this flow, are we going to impact anything downstream? What other consumers of the data may have been affected by this poor quality? And does changing it actually increase or decrease the priority of the issue we're running into? We need to adjust the data quality process itself so that it automatically collects the necessary metadata, evaluates it, and generates adequate notifications based on the metadata, recommends where to focus to solve the problem, or, if we have enough information, makes the decision to correct it. This is all the information from the previous slide, reformatted into a data quality issue dashboard, and this is where the data lineage information is crucial. Because lineage shows us the path of the data, it allows us to automatically follow it and collect whatever additional details haven't already been provided.
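The investigation described above is, at its core, a graph walk. Here is a minimal sketch; the node names and flows are invented for illustration, not taken from a real environment, but the two traversals, upstream for root cause and downstream for impact, are exactly the questions we just asked about the mailing address.

```python
from collections import deque

# Edges point in the direction the data flows: source -> consumer.
# Node names are illustrative, not from a real environment.
flows = {
    "crm.accounts":        ["etl.merge_addresses"],
    "web.address_updates": ["etl.merge_addresses"],
    "etl.merge_addresses": ["dw.customer_dim"],
    "dw.customer_dim":     ["report.card_mailing", "report.marketing"],
}

def walk(start, edges):
    """Breadth-first walk over the lineage graph from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Downstream impact analysis: everything fed by the address-merge step.
downstream = walk("etl.merge_addresses", flows)

# Upstream root-cause analysis: invert the edges and walk backwards from
# the report that printed the wrong mailing address.
reverse = {}
for src, dsts in flows.items():
    for dst in dsts:
        reverse.setdefault(dst, []).append(src)
upstream = walk("report.card_mailing", reverse)
```

The upstream walk surfaces both candidate address sources and the merge step between them, which is precisely where a decision point like "which mailing address wins" would hide.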
Not only that: it can highlight the important points in the data pipeline that are most likely to have caused the issue, details such as transformation steps, recent changes to the data pipelines, recent failures, and so on. Instead of the weeks or months it would take through the traditional investigation process, all the metadata can be available to the data quality team instantly, allowing them to resolve an issue like our poor mailing address in minutes rather than weeks or months. To do this, you'll need a platform that captures, or even better, discovers metadata and dependencies in an automated way, and that provides a way to not just share and integrate that collected metadata with your existing or newly built systems but, more importantly, performs continuous analytics and generates alerts that allow the orchestration of other systems and processes through messaging, APIs, and embedded widgets. Here at Manta, we focus primarily on data lineage, so this is a Manta-specific take on how Manta and lineage fit together. We have your data catalog tools. We have your relational database source systems.
We have ETL tools. We have your business and reporting tools, and even coding languages, that we can plug into Manta. Manta will analyze that metadata and give us that roadmap, the historical lineage of where the data is coming from and where it's going to, as well as what transformations occurred along the way. Active metadata is the future of data management and key functionality for data scientists and engineers to be productive. At the same time, it is a way for companies to stay, and become, responsive. Especially today, when there is an enormous shortage of people in the IT sector and wages are going up, companies cannot afford to waste their time on routine activities that can be automated or that do not bring real value. Data lineage is a great first step toward achieving an active metadata state. We here at Manta would be happy to discuss more about what we do in the lineage space and how it can help you provide a metadata GPS to your team members. Thank you, everybody.