 Hello, everyone, and welcome to our next EDW session called Apply Data Lineage, Maximize the Value of Understanding Your Data Pipelines, which will be presented by Ernie Osteck, SVP of Products of Manta. All audience members are muted during these sessions. So please submit your questions in the Q&A window on the right of the screen. And our speaker will respond to as many questions as possible at the end of the talk. So let's begin our presentation now. Thank you and welcome, Ernie. Thank you, John. And thank you, everyone, for joining me to discuss the importance of data lineage for understanding your data pipelines. I spent my entire career in data integration, moving, transforming data across the enterprise, most often in support of analytics. Prior to joining Manta, I spent several decades in the extraction and transformation and loading space, the ETL space, if you will, and helping customers track how their data flows throughout the enterprise. And I learned a lot about data flows. When we look at the data flows, I think one of the first things I like to just kind of, it's obvious, but it's always good to outline it in scenarios like this, is that they are ubiquitous. They are everywhere. Big organizations, small organizations, everyone has data flows that are central to their operations, that they all touch, that they impact, or that they are impacted by in just everything that we do. And they're in a multitude of technologies, every possible thing you can imagine, and sometimes not even technology. I had a customer recently say, we've still got situations where people take a memory stick and move it from one machine to another. Is there a way I can record that lineage? Because that's important to me. We like to think that that stuff doesn't happen anymore, but it does, and that needs to be tracked as well. Another key thing about data pipelines, they are vulnerable. And there's lots of moving parts. 
And whether we like to admit it or not, data pipelines break. And when they do, whether it's because of human error, technology, malicious attack, or whatever, they are going to touch and affect everybody in the enterprise, and that's an important thing that we need to recognize. But data flows are also empowering. They're the data highway that gets you to new applications in the cloud. They open up new opportunities for insights, the exploration that many of you are doing into data science. And they're also the pathway to really improving and monitoring data quality throughout the organization by examining those particular paths. So what we want to do today is look at some of the key aspects of data pipelines and how lineage is going to help you understand and manage them better. So, whoops, agenda. Quickly, we'll talk about what lineage is, just to level-set everybody here in the audience and define it a little bit better. We'll talk about how hard it can be to achieve lineage. We'll look specifically at some data lineage use cases: where is data lineage most effective in looking at your pipelines? And then I'll close with a little overview of what we do at Manta to help you with your lineage challenges. So first of all, introduction to data lineage. Data lineage helps you understand how data flows across your enterprise through all of its systems. Where does data come from? Where does it go, and what happens to it along the way? And data lineage delivers immediate actionable intelligence and enables digital transformation. That's a mouthful of words. Let's tear it apart just a little bit. Actionable intelligence is being able to make a more effective decision more quickly. And that might be a decision maker who quickly has to make some decision about maybe moving inventory from one place to another and has a spreadsheet.
They're looking at results that maybe justify the decision, but they aren't sure when those results were derived or where they came from. Can they trust it? Having trust in data is such a significant use case, and we're gonna talk about it later. But that's an actionable decision that lineage can help deliver, and help make more quickly. And enabling digital transformation: digital transformation is a term that could apply to the paper-based systems we all converted to technology years and years and years ago. But in the more recent definitions that I've seen, I like the idea of digital transformation as something that helps encourage collaboration, collaboration between business and IT or between multiple business units, to better agree on things. And lineage can certainly assist with that process. So why is it so hard? There are a lot of reasons, and they can be complex, but let's talk about a couple of them. The first of these is rate of change. Everybody is doing agile nowadays. Technology is constantly moving along. There are changes being made to code, being returned to your production systems on a very regular and sometimes radically quick basis. And that's very important for development, but it can also potentially impact the reliability of your data pipeline. Then we get into levels of detail, which is probably best illustrated by the business versus technical view. We have different users that need lineage. There are those that wanna see it at the high level and just need to see that one application points to another application, or that my reporting system gets information from the East Coast Data Lake. Okay, I'm happy, right? But then others need to go much deeper and actually see how it flows through the different tables and into stored procedures, and maybe even need to get into the individual transformations. So lineage becomes difficult to achieve when you're trying to answer the needs of many different groups and personas within your organization.
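As an illustration of the business versus technical view described above, here is a minimal Python sketch, not taken from the talk. It assumes lineage is stored as column-level edges named `system.table.column`; the asset names and the `roll_up` helper are hypothetical. The same edges are collapsed to a table-level view and then to a system-level view.

```python
# Hypothetical column-level lineage edges, written as (source, target).
# Names follow an assumed "system.table.column" convention.
COLUMN_EDGES = [
    ("crm.customers.name", "lake.stg_customers.name"),
    ("crm.customers.state", "lake.stg_customers.state"),
    ("lake.stg_customers.name", "bi.sales_report.customer"),
    ("lake.stg_customers.state", "bi.sales_report.region"),
]


def roll_up(edges, depth):
    """Collapse edges to the first `depth` dot-separated parts of each name.

    depth=1 gives a system-to-system (business) view,
    depth=2 gives a table-to-table view.
    Edges that collapse onto a single node are dropped.
    """
    rolled = set()
    for source, target in edges:
        short_source = ".".join(source.split(".")[:depth])
        short_target = ".".join(target.split(".")[:depth])
        if short_source != short_target:
            rolled.add((short_source, short_target))
    return sorted(rolled)


business_view = roll_up(COLUMN_EDGES, 1)
table_view = roll_up(COLUMN_EDGES, 2)

for source, target in business_view:
    print(f"{source} -> {target}")
```

The point of the sketch is only that one detailed lineage graph can serve both audiences: the high-level view is derived from the detailed one rather than maintained separately.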
Companies evolve, mergers and acquisitions happen all the time. That's where we get so many different technologies. I had a customer just last week ask me about doing lineage on a legacy reporting system that I remember working with in the 1980s, and they still had 20 or 30 instances of this reporting tool that they needed to track lineage for. Who would think? And it seems like every day there's a new cloud technology that comes out in the ETL space or the reporting space or the migration space or the database space that becomes important and needs to be tracked for lineage. So many technologies, so many different skill sets, that it can be difficult to get your arms around all of it. So what are some key data lineage use cases? We'd like to look at data governance, DataOps, and cloud migrations, and migrations of any kind, but cloud seems to be one that's certainly pressing on everyone right now. And there are many integrations between them, but they have some characteristics in lineage that I want to outline for you in this short period. Let's take a look first at data governance, and specifically, with data governance, also compliance. So, meeting regulatory commitments faster. Customers come to us and talk about having many, many, many mandates to do manual lineage tracking that they need in order to hand over information to their regulators. And what the regulators want in many of these cases, especially for things like BCBS 239, for those of you in the banking risk area, is for the company to demonstrate that they are in control of the data that generates their reports and their justifications for risk exposures. And they need to show that the data is well understood and monitored and managed by people in the organization, and not with unknown sourcing. And lineage can help to lead to that in a much quicker fashion than trying to track it manually with manpower, which could take many weeks. Lineage also gives you an opportunity to maximize the value of your catalog solutions.
Many of you are exploring data catalogs where you're doing curation of data sources and putting in glossaries and definitions. That's fantastic. Doing data quality and semantic understanding is important, but it also needs, almost like a three-legged stool, data lineage, so you can also see how the data flows into those different places and where it comes from. So it completes the governance picture. Highlighting privacy in the context of data flows has become a very big one too. It's also kind of in the compliance space, but it's about wanting to connect the dots, whether you're employing a tool that pores over the data to understand and classify things that are sensitive, or whether you simply know that you have sensitive data that's sitting in San Francisco and you also have sensitive data that's sitting over in London. How do you connect the dots? How did it actually flow between San Francisco and London? And what are the different places where that data exists that you might need to know about? Another one for lineage, and that last bullet kind of fits in here: discovering previously unknown destinations for sensitive data. With lineage, we think of it as looking upstream and downstream for different use cases. A downstream situation that I've heard from many of our users is the ability to say, oh my gosh, look at all the places in this legacy system where we are splashing, and I'll use the word splashing here, sensitive data into flat files, into little data marts, maybe it ends up in people's spreadsheets, and we need to get a better grasp of that and control it, maybe shut down some of those pathways, because these are risky areas for us when it comes to GDPR and the new privacy laws that California and Virginia have pioneered and certainly other states are already looking at. Increasing trust in data, that's the one I started with in my first example. I'm looking at a report and I don't like what I'm looking at.
There's numbers in the red, I need to justify it, or you wanna make a decision based on a spreadsheet and it's not one you've ever looked at before. Understanding where that came from can help accelerate that decision making. And even defining the scope of your governance initiatives: if you're just exploring governance now, where do you start? Do you boil the ocean and grab every single asset that's in the organization? Or maybe, if you find that in fact the finance team is the most concerned and they're screaming the loudest, let's look at the reports that they're using and go backwards. Lineage can answer which databases, which sources, which ETL tools we should start with, and give you a better scope on that initiative. DataOps is the next area that's key, and this is a new phrase, at least in the last six to seven years, but it seems to be growing and gaining a lot of traction now. DataOps is really about trying to manage the reliability of those pipelines, and having lineage is going to help you do better debugging if something goes wrong, so you can actually see what the tracing is of that data and how it flows along that pipeline. It improves the communications between teams so there's less infighting on what the real truth is about how something flows. When you're looking at the code, there's no ambiguities. Impact analysis can help proactively as well. Let's prevent problems from occurring. Architects and DBAs who are about to make a change and put it into production could look ahead with lineage and say, you know what, let's not do this just yet because we're not exactly sure what it's going to do to that report, or maybe they actually find a problem up front that says stop, because it's gonna damage this particular ETL process that's critical and we're gonna need to make some additional changes.
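To make the impact-analysis idea concrete, here is a small Python sketch, offered as an illustration rather than as how any particular lineage tool works. It assumes lineage is available as a list of (source, target) edges; the asset names and the `downstream` helper are hypothetical. A breadth-first walk collects everything that could be affected by a change to one asset.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: each pair means "source feeds target".
EDGES = [
    ("orders_db.orders", "etl.load_orders"),
    ("etl.load_orders", "warehouse.fact_orders"),
    ("warehouse.fact_orders", "report.daily_revenue"),
    ("warehouse.fact_orders", "report.inventory"),
    ("hr_db.employees", "report.headcount"),
]


def downstream(edges, start):
    """Return every asset reachable from `start` by following edges forward."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Before changing the orders table, list what sits downstream of it.
impacted = downstream(EDGES, "orders_db.orders")
for asset in sorted(impacted):
    print("would be impacted:", asset)
```

In this toy graph the headcount report is not in the result, which is the kind of answer an architect or DBA would want before putting a change into production.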
One of the things that I'm really excited about in some of the discussions of DataOps is proactively putting checks along the pipeline, like checking to see that a lookup table in fact gets loaded, and if it gets zero rows because something was null and screwed up a Boolean test somewhere in the process, then immediately flag it and send out an email to the people that need to know. So you're constantly doing these checks. Well, deliver a lineage report also to that person in their mailbox so they can actually see what the flow is. You know, maybe they aren't aware of what the lineage is, and that could be given to them right away and shorten the resolution time. And in many cases, instant lineage validation is also important. If you're constructing a 100-line SQL statement, check the lineage of just that statement, just to see if you're fully grabbing the parts and pieces that you need and that it's actually delivering the answer set that you expect, even before it's part of the overall ecosystem, all helping to make the DataOps objectives achievable. Lineage for migrations. We have lots of customers who have looked out and said, we're migrating from an existing legacy system and going, very typically right now, to the cloud. Well, those legacy systems, if they're 12, 15 years old, or much older than that, are the subject matter experts still around? Do they know exactly what's in the inventory, if you will, of that particular set of assets? Lineage can help with understanding what those flows are, which are the most critical, which need to get migrated, perhaps which parts of the application aren't even used anymore, right? Track a report that no one cares about. Maybe you'll find that there's also a whole lot of ETL and processes that no one else cares about either.
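The idea of tracing back from a report that no one cares about can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any product: lineage is assumed to be a list of (source, target) edges, the set of reports still in use is assumed to be known from somewhere else (usage logs, a survey), and all asset names and helper functions are hypothetical. Anything that does not feed a report still in use becomes a candidate to leave behind in a migration.

```python
from collections import defaultdict

# Hypothetical lineage edges: each pair means "source feeds target".
EDGES = [
    ("src.sales", "etl.load_sales"),
    ("etl.load_sales", "dw.sales"),
    ("dw.sales", "report.revenue"),
    ("dw.sales", "report.old_kpi"),
    ("src.legacy", "etl.load_legacy"),
    ("etl.load_legacy", "dw.legacy"),
    ("dw.legacy", "report.old_kpi"),
]


def upstream(edges, targets):
    """Return `targets` plus every asset that feeds them, directly or not."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)

    needed = set(targets)
    stack = list(targets)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in needed:
                needed.add(parent)
                stack.append(parent)
    return needed


def unused_assets(edges, used_reports):
    """Assets that do not contribute to any report in `used_reports`."""
    all_assets = {asset for edge in edges for asset in edge}
    return all_assets - upstream(edges, used_reports)


# Assumption for the example: only the revenue report is still used.
candidates = unused_assets(EDGES, {"report.revenue"})
for asset in sorted(candidates):
    print("candidate to skip in migration:", asset)
```

A result like this is a starting point for a conversation with the owners of those assets, not an automatic delete list.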
Or in the situation where you have unused assets: many times in the past someone would start a large-scale database application by purchasing a model, a model that perhaps was a business intelligence model for your particular vertical, and it has 3,800 assets in it, views and tables and all kinds of things that were prepared so that your organization could accelerate the analytics process. But now you find out, with lineage, that in fact there are tables that have never been touched. No one's ever written a report for them, or there's a view that no one's ever bothered to use. And that's okay, because maybe your company didn't need that. But if you're gonna do a migration, do you wanna lift and shift everything, right? There are parts that you won't need. We talked a little bit about looking for previously unknown sources and targets as potential privacy exposure, and that's critical when you're doing a migration as well; you wanna be able to clean those up. But also, when you're doing migrations, look for redundancy. So you see two paths that are the same. Do we need them both? And perhaps some stored procedures could be eliminated. And we can also highlight critical transformations, right, that are important and that you need to make sure are actually there. And finally, lineage can help you measure your progress. You moved the reports; what's there? Next week, what else have you been able to achieve? You moved, got the views, and then the staging area is all done? Okay, we see that we have lineage to the staging areas. We haven't gotten the sources yet, so you keep going back and forth until you have the full path of your migration complete, all the way back to perhaps public cloud staging areas like S3, and now you can actually see that it goes to the reports and the same procedures that maybe used to come from mainframe flat files in your previous application. So, Manta, what do we do?
We help customers achieve regulatory compliance and governance and migration, and ultimately shorten the delivery cycles through IT. And we manage end-to-end lineage, and for us, end-to-end lineage means being able to look at all these different parts and pieces within the same lineage diagram, or the same lineage analysis. So you can actually see those mainframe systems and their flat files and the ETL work that you're doing to pour that into Hadoop, perhaps. And then all of the processing that goes on inside of Hadoop and your HiveQL, and eventually it ends up maybe in a cloud system, and the reporting that comes off of that. So being able to see that end-to-end is gonna help with all these different parts and pieces. How does it work? We actually crunch code. We pore through your SQL, we look through your ETL and your business intelligence code, we look at the columns, we look at the individual WHERE clauses and the transformations, and we trace, with our understanding of those technologies, how one thing flows into another. And we document that lineage along the way. And then we visualize that lineage with an interactive map, or we can optionally push it into your third-party governance solution or a homegrown application that you have. And we specifically focus on code. We don't look at data. We parse through the code in order to achieve that. In that quick picture right here, you can see that screenshot on the far left. We believe very heavily at Manta that making lineage easy to consume is super critical. You can see lots of different color coding in there. I'm not doing a full-blown demo, but if I was, the national ID that's in yellow just happens to be what the user currently clicked on. And we show everything downstream in one color, everything upstream in another, and we always aim to do column-level detail. Manta is quickly and easily installed. And once you hook it up to a particular database or resource, you're gonna quickly be able to get automated lineage.
We believe strongly in integrating lineage into your existing processes and practices, which means that everything you can do in our user interface is controllable through a script or an API, so you can have lights-out activity that just happens automatically on a Saturday night after you've returned all your code to production. We talked about consumption, and avoiding visualization overload is important. If you're looking at a completely confusing spaghetti diagram with no ability to tease it apart, then the lineage isn't particularly helpful to you. And we go even farther. We talk about preventing lineage mysteries. We expose things such as indirect lineage. For those of you that have a SQL background, imagine a SQL statement that says select name, address, phone where state code is California. Well, state code is not in the lineage, but that WHERE clause with state code definitely impacts the lineage. So even though state code is not in the flow (name, address, and phone are the critical columns of that flow), which rows are actually sent downstream is important. And Manta highlights that with a completely different visual color and indicator. And finally, being able to do time slicing, so that you can look at the lineage not just from today, but from the end of March or the end of February, when an analyst comes screaming into your office with their hair on fire and needs to know exactly how that report was generated at the end of December, at the end of Q4. With Manta, you just go back and take a look, right? The time slices are kept for when those scans of lineage were performed. Go back to December and see that lineage to satisfy that request, or compare that lineage to today's lineage and actually see where the differences are. So thank you for the opportunity to talk about using lineage to better understand your data pipelines. And if there's any questions, John, we can take up the remaining couple of minutes and talk about those.
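The indirect lineage example from the talk (select name, address, phone where state code is California) can be modeled with a short Python sketch. This is an illustration only and says nothing about how Manta represents lineage internally: the `Edge` class, the `lineage_for_select` helper, and the table names `customers` and `ca_contacts` are all hypothetical, and the statement is described by hand rather than parsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str  # "direct": values flow; "indirect": only affects which rows flow


def lineage_for_select(source_table, target_table, selected, filtered):
    """Build edges for a simple 'SELECT selected FROM source WHERE filtered ...'.

    Selected columns contribute values to the target (direct lineage).
    Columns used only in the WHERE clause decide which rows arrive, so they
    influence every target column without supplying values (indirect lineage).
    """
    edges = []
    for col in selected:
        edges.append(
            Edge(f"{source_table}.{col}", f"{target_table}.{col}", "direct")
        )
    for filter_col in filtered:
        for col in selected:
            edges.append(
                Edge(f"{source_table}.{filter_col}", f"{target_table}.{col}", "indirect")
            )
    return edges


# Hand-written description of:
#   SELECT name, address, phone FROM customers WHERE state_code = 'CA'
# loaded into a hypothetical target table called ca_contacts.
edges = lineage_for_select(
    "customers", "ca_contacts",
    selected=["name", "address", "phone"],
    filtered=["state_code"],
)

for edge in edges:
    print(f"{edge.kind:8} {edge.source} -> {edge.target}")
```

Keeping the two kinds of edge separate is what lets a diagram draw them differently: the state code never shows up as a value downstream, yet a change to it changes what every downstream column contains.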
Okay, you've got one question. Maybe you'll cover this, but Manta seems to focus on horizontal lineage, but what about vertical? Connecting those physical columns and fields up the abstraction models to semantic names and definitions to be able to tie this all together. Excellent question. And there's a couple of ways to look at that. Certainly our laser focus on lineage is about the technical lineage that, as you said, is somewhat horizontal, right? From one tool into another. The semantic lineage comes up in a couple of different areas. And I think first and foremost, I'd like to say that Manta is really focused on the technical data flow lineage and not trying to be a business glossary. We complement our partners who are into curation and offering data catalogs for doing those semantics. However, even in the domain that Manta does, primarily the horizontal kind of lineage, there's a need to be able to pull in logical models, like for instance from a modeling tool such as erwin, and use that as a starting point to get into this lineage, or to be able to represent the concepts that are in a catalog and have them appear in Manta so that they provide an entry point where people can actually get to that particular lineage. Or finally, and this gets used very extensively with our partners that have a catalog solution, we push a hyperlink directly into the curated source in a catalog tool so that while they're looking at stewardship and business glossary information and the workflow for that business glossary, but then need really deep lineage, they can click on a hyperlink and get down to the horizontal level. Next question, does Manta as a lineage tool integrate with Collibra as a data catalog tool? Can it make those connections to data catalog assets automatically? Or are those connections made manually? Oh, fantastic question.
Collibra is one of a variety of catalog partners that we have where we push the lineage information up into Collibra and make the matching over to the assets that are in Collibra. And we do the same thing for other catalog tools also, that either we integrate with directly or where our partners actually pick up metadata from Manta and bring it into their catalog solution. We've talked about Collibra; it's worth mentioning that there are integrations to Informatica EDC, there are integrations into IBM's Information Governance Catalog solution, also Alation, and our partners data.world and Ataccama also. All right, next question. Does it integrate with IGC from IBM as well? So yes, in fact, we definitely push the metadata up into IGC, and with each of these we're using the native APIs or published methods that are part of those solutions. And it looks like the last question: which products, databases, ETL tools, et cetera, can it scan to perform technical lineage? If you could share any links, or what is on the roadmap, that would be helpful. Wonderful. Certainly, if you go to www.getmanta.com there is a technology section that will show you all the different scanners. We call them scanners, and for us, there are scanners that fall into different categories, ETL tools being one, reporting tools being another, modeling tools, databases across the board, too many to list here. We are always aggressively adding new scanner types based on our examination of the market and requests that we get from our customers. We just very recently, as of last week, introduced a scanner for Google BigQuery. We do have some other ETL tools in the hopper, as well as other programming tools. I did forget to mention the other category, for programming tools such as COBOL, Java, C#. And we'd be excited to talk to you all further about other things that you have in your plans. All right, it looks like we have time for one more.
Do you need to build up a horizontal lineage both on the business semantic level and also the technical level or should they be derived from each other? Great question. And we could do either. Manta is very extensible. So you can actually create lineage for anything that you want. When I'm teaching a class and reviewing how to extend the metadata, often we'll talk about you want to create an asset for the chair that you're sitting on or the table that you're at and the monitor that you're using, you can. And you can connect the dots. So whether it's a conceptual piece or not, you can do anything that you need there. And being able to tie concepts together is extremely important to us so that not only do you want to get that lineage automatically analyzed, but you also want lineage to be inferred. So one of the things that Manta can do is take a concept that might be attached to one of your source areas. And another concept, whether it's a business term or not, is immaterial for this discussion. And it's connected to one of your target reporting systems. And there may be hundreds of assets in between, ETL processes, stored procedures, jobs, tables, et cetera. But because you've associated this concept with something in your source and this concept with something in your target, you can have a lineage that's inferred by Manta that just shows two objects, just those two. And of course, from there, you could springboard to see the detailed technical lineage if necessary. Great. Well, thank you, Ernie, for this great presentation. And thanks to our attendees for tuning in. Please complete your conference session survey on the page for this session. The next session will start in about 10 minutes. Thanks again. Have a great day. Thank you, John. Thank you.
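The inferred lineage described in the final answer can be sketched in Python. This is a minimal illustration under assumptions, not Manta's implementation: technical lineage is assumed to be a list of (source, target) edges, concepts are assumed to be attached to assets through a simple dictionary, and every name below (`reachable`, `inferred_lineage`, the assets, the concepts) is hypothetical. If an asset tagged with one concept can reach an asset tagged with another through any number of intermediate steps, the sketch reports a single concept-to-concept link.

```python
from collections import defaultdict, deque

# Hypothetical technical lineage with several steps between source and report.
EDGES = [
    ("src.customers", "etl.extract_customers"),
    ("etl.extract_customers", "stg.customers"),
    ("stg.customers", "proc.clean_customers"),
    ("proc.clean_customers", "dw.dim_customer"),
    ("dw.dim_customer", "report.churn"),
]

# Hypothetical concept tags attached to two assets only.
CONCEPT_OF = {
    "src.customers": "Customer Record",
    "report.churn": "Churn Report",
}


def reachable(edges, start, goal):
    """True if `goal` can be reached from `start` by following edges forward."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def inferred_lineage(edges, concept_of):
    """Concept-level links implied by technical paths between tagged assets."""
    links = set()
    for asset_a, concept_a in concept_of.items():
        for asset_b, concept_b in concept_of.items():
            if concept_a != concept_b and reachable(edges, asset_a, asset_b):
                links.add((concept_a, concept_b))
    return sorted(links)


inferred = inferred_lineage(EDGES, CONCEPT_OF)
for source_concept, target_concept in inferred:
    print(f"{source_concept} -> {target_concept}")
```

The diagram this produces has just two objects, while the detailed path through the intermediate jobs and tables stays available in `EDGES` for anyone who needs to drill down.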