Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Officer of Data Diversity. We would like to thank you for joining this Data Diversity webinar, The Benefits of a Data Catalog with Built-in Data Lineage, sponsored today by Octopi. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A panel at the bottom middle of your screen, or if you like, tweet us your questions using the hashtag #DataDiversity. And if you'd like to chat with us or with each other, we certainly encourage you to do so; just note that the Zoom chat defaults to sending only to the panelists, but you may absolutely change that to network with everyone. You can find the icons for the chat and Q&A panels at the bottom of your screen. And as always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and any additional information requested throughout the webinar.

Now let me introduce our speakers for today, Nesim Ohayan and John Fry. John is the CEO and Managing Partner at FinTegraD Consulting LLC, with over 30 years' experience across all asset types in both the functional and technical aspects of the treasury and capital markets industry. John has been responsible for over 15 major operating model transformations, in addition to 35 vendor system implementations and upgrades across over 100 client engagements. He has been the product manager for five world-class FTB (front-to-back), full-cycle capital markets trading systems. Nesim is the Director of Global Business Development at Octopi and a longtime professional in sales, business development, and marketing, with expertise in establishing the foundations of scalable sales models, supported by the implementation and execution of effective process management and its integration into the organization's end-to-end management of the customer lifecycle. And with that, I will give the floor to Nesim and John to get today's webinar started.

Hello and welcome. Okay. Well, thank you very much for that introduction. Wonderful. Let me go ahead and advance to our screen. So here we are. Thanks, Shannon. Let's get started. I'd like to start by making a statement: if we know that knowledge is power, and that data is a source of knowledge, then we should all really be superheroes with superpowers. Considering the amount of data we generate and consume each day, with all of our actions being tracked and digitized, it's really a wonder that there are places to keep all this data. And yet that data is indeed flowing, and quite often it flows very nicely into well-maintained structures from which it can be accessed for insights. So when BI Survey, the one that provided all the data points you're seeing on the screen, polled 710 companies, you would have expected those companies to report favorably when asked: what percentage of your decisions are data-driven?
Well, unfortunately, it seems that's not necessarily the case, because for half the companies, and in turn half the decisions, the gut-driven decision is king. Especially in companies with over 5,000 employees: there, they found only 40% of decisions were actually driven by data. Even the companies specifically considered to be the most mature in their data infrastructure and data intelligence systems reported that still only half of their decisions were driven by data.

So that is the gap we're looking to explore today. We want to know what practical steps organizations can take to increase the likelihood that their decisions will be driven by data. Why are decision makers not yet really trusting and using the data inventory that the IT teams have painstakingly put together? And is there any low-hanging fruit, in terms of methods, tools, or procedures, that can be leveraged to take your organization over the threshold and make it more data-driven?

To address this gap, let's sum up some of the challenges faced by the data professionals whose job it is to consume and present this data so the organization can become more data-driven. The BI teams are, I would say, drinking from a fire hose; they're constantly getting so much information. And to continue that analogy, if they're drinking from the fire hose, they're probably not thirsty. They're getting lots of data; they're not starving for data. The problem is that a lot of this data goes by the wayside. So is it really a question of "if you build it, they will come"? The reality is that just because the data is there, it doesn't mean it will be used, let alone used correctly. There are countless initiatives, regulations, and strategies that add to the demand for leveraging that data effectively. When you add to that the rapid increase in data sources, how can a single version of the truth ever be agreed upon? The workforce is also quite mobile: people come in and out of jobs, move from company to company, and move between roles inside the company. And now, in the post-COVID era (hopefully we're post-COVID, right?), everybody's working from home much more than they ever used to, and that also has an impact on the ability to leverage data assets that used to be discussed in person. More systems than ever before, new and emerging technologies, and obligations to comply with ever-increasing data and data privacy regulations are just a few examples of the load being hoisted onto the BI and data teams. And the data teams are especially the ones expected to fundamentally transform how decisions are made, through all kinds of initiatives. They're modernizing the data stores to ones with more control and better visibility.
They're the ones making it possible to migrate the data to new data intelligence systems that are more accessible and more scalable, using cloud architecture that can be a mixture of private and public cloud, while also supporting those legacy on-prem systems that are still keeping the lights on. They're busily cataloging all those data assets through all kinds of different means, even Excel, and making sure people have some kind of access to those definitions, in the hope that business users will be able to somehow self-serve and consume the data more effectively. They implement and support additional ways to consume the data; quite often these end up being data science projects looking to gain more insights or use the data through machine learning and AI initiatives. All the while, like I said, they have to keep the lights on and make sure the traditional reports, dashboards, and marketing use cases can still function, because that's exactly how the business operates.

With all those initiatives, and all the resources being poured into treating data as a resource, it really is a wonder that you hear examples of major disconnects between the providers and consumers of company data. I would add that the inability to use the data assets on hand can actually render them more of a liability than an asset. So what can be done to help business users make better use of the data that's already on hand? I'd like to see if we can illustrate this with a case study. So John, I'll give you a chance to take the mic here and share your thoughts on this work you've been doing for so long, even long before you ever heard of Octopi. Maybe you can share an example of one of the projects you were involved in where the lack of visibility and control over the data caused the customer to spend far more time, money, and energy than they would have if they had done things according to today's best practices.

Yeah, sure. Thanks, Nesim. Appreciate it. So let's go back to 2008. This is a project I ran for a South African clearing bank. They had implemented a very complicated system, or at least tried to: they used somebody else's database and essentially cobbled the thing together, did it in about two and a half years, and then went live. So you're talking about a clearing bank; four weeks after they went live, they couldn't reconcile anything. The reports going to the South African central bank were all wrong. And they had to turn it off, basically roll back four weeks, unsettle everything, unclear it, and then redo it with the old system, which, let's put it this way, wasn't the best idea at the time. So we came in in September, we looked at it, and we said: right, we're obviously missing a few things. Now, one thing you could do was look at everything coming into the system, 3,000 trades a day from a couple of the systems, track it all through, and find out where it was going. We said no, that's taking too long. We had five days to come up with a timeline; we said seven months. And as you can read on the slide much faster than I can talk to it, the strategy was "fix it". That's verbatim; that's exactly what I said to the CTO: we're just going to fix it.
So if you flip to the next slide, please, Nesim. The idea was: let's look at what they were missing, which was the South African Reserve Bank reports and all the audit, and then back out from there. What we had to do was essentially a data lineage and a data catalog, manually. We had four SMEs sitting there looking at the system, dredging data out. Now, this system is a hybrid database, object-relational, so it doesn't actually have a lot of keys in it; you have to know what you're doing, you have to know the API. And we went after the data, going back from the reports all the way up to the front and tracking it through. We didn't have to do much of a data catalog, because the system data was essentially very simple: it was entities and trades, and they were very simple trades, so we didn't have to worry about it. What was important was confidence in the data, because clearly the spotlight was very much on the project; the South African Reserve Bank was literally leaning over our shoulder every week, wanting to know what was going on. So we had to invent a data catalog and a couple of new project management tools. We used data extensively to track completion of the project; we didn't use any kind of normal Gantt charts or anything. We just said: right, the data's good, we're on it. And we did it, and we were two weeks early. We left with perfect South African Reserve Bank reporting and perfect audit. And when we flipped the switch, which took two and a half hours, it was checked through the rudimentary, and I mean rudimentary, Excel and Microsoft Access-type tracking that we had built. We wound up with zero issues for the next four months. Literally nothing after the switch; it just worked.

We did a postmortem after that: we looked at what we should have done and what we could have done better, and clearly having to build this thing ourselves was not great. We had to; I don't think there was anything else around at the time, at least nothing anybody had really heard of. But the best we could have done was two weeks sooner: we did it in six and a half months, and we could have done it in six if it had been perfect. That was the first project where it really hit me in the face that there were two or three things that stood out. One: you had to have a catalog linked to the lineage. Two: you had to get the lineage, because people mean different things when they talk about the same things, so you need to know exactly where the data comes from; otherwise you throw rubbish out, and it goes beyond the system. But this was essentially one system; we weren't joining up multiple systems in the sense of full-blown integration. And the other lesson, I guess, was that we found we were using data everywhere. Every single piece of the system implementation, the testing, everything was driven by data. We didn't look at functionality at all. We made the assumption, which was perhaps a bit wild, that the vendor had tested the functionality reasonably well; we needed to make sure the data was good. So we went after that, and it worked very well. At that time, which was 15 years into what I've been doing, it was the only project in this space that had been done under budget and under time, and we got it right, as a team. So that's the sort of summary of that one.
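What John's team built by hand in Excel and Access is, at its core, a lineage graph you can walk backwards from a report to its sources. As a minimal sketch, assuming lineage is kept as a simple source-to-target edge list (all system, table, and column names here are hypothetical, not the actual 2008 project's), the upstream trace is just a graph walk:

```python
from collections import defaultdict

# Hypothetical column-level lineage edges (source, target). In 2008 these
# lived in Excel and Access sheets; here they are just tuples.
EDGES = [
    ("trading_system.trade.notional", "staging.trades.notional"),
    ("staging.trades.notional", "reports.sarb_daily.exposure"),
    ("trading_system.trade.counterparty", "staging.trades.cpty_id"),
    ("staging.trades.cpty_id", "reports.sarb_daily.exposure"),
]

def upstream(asset: str, edges: list[tuple[str, str]]) -> set[str]:
    """Walk the lineage graph backwards from `asset` to every source."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
    seen: set[str] = set()
    stack = [asset]
    while stack:
        node = stack.pop()
        for parent in parents[node] - seen:
            seen.add(parent)
            stack.append(parent)
    return seen

# Everything feeding the (made-up) central-bank report column:
print(sorted(upstream("reports.sarb_daily.exposure", EDGES)))
```

The same walk in the other direction gives downstream impact, which is exactly the "back out from the reports" exercise the four SMEs did by hand.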
Brilliant. So you really did it based on making sure that the data was healthy, that you could visibly see all of the data assets connecting with each other, and that you understood how that flow was working. That was the primary thing, and you just trusted that the system could take the data the way it was designed to.

Yeah, basically. I knew the database; I mean, this was lucky, I designed that thing, so I knew it cold. Which is, I guess, another lesson: get somebody in who knows the system if you're dealing with a vendor system. But yeah, it took two months of raw SQL to line the data up and make sure there were no errors. It was intense. It was a great project, actually; we still talk about it as one of the best.

Great. Wonderful. So it really does sound like that was a successful implementation and an important way for the bank to pivot towards better management of their data, through visibility, cataloging things as tightly as possible, and connecting that to the lineage, which is really what we're talking about today. So why don't we unpack some of the key capabilities that we recommend focusing on. The first one, of course: you really do need to focus on automation. Nowadays there's so much more automation than in the days when you were working on this. Ideally, you want to stop doing this kind of thing manually. Make sure the systems can be continuously scanned for updates and that that information gets into the lineage. Just because you have a system that can map lineage, the question is: are you doing that mapping manually, or is it going to happen automatically? If you're doing it manually, you can pretty much assume the users are not going to trust it, because it typically isn't going to be kept up to date regularly. The goal, obviously, is to make sure the data will be used. So that's the first point.

The next one is to democratize, in the sense of letting everybody contribute to the enrichment of the metadata. You want the business users as well as the data engineers or architects, or whatever titles they're using nowadays, all the different stakeholders in the data landscape, to be able to add their thoughts. And not just be able to: you actually create an environment where it's natural for them to do so. Be sure that all the relevant data citizens have access to all the relevant data they need, and then encourage collaboration. Create centralized portals so that the discussion about the assets doesn't happen in another channel somewhere else, in Teams or Slack or email or the coffee room. You want to make sure that when somebody asks a question, hey, how is this calculation done? Does this actually have tax included? Whatever the questions might be around the data, those questions are placed right there where the consumers of the data are actually going to use it. That way, when somebody leaves the company, that information doesn't leave with them; it becomes part of the tribal knowledge of the organization.
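To make that "keep the conversation with the asset" idea concrete, here is a minimal sketch of a catalog entry that carries its own owner, approval flag, and question thread. The field names are illustrative assumptions, not any particular product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    author: str
    text: str
    answer: str | None = None  # filled in later by the owner or steward

@dataclass
class CatalogAsset:
    name: str
    description: str
    owner: str              # the accountable data owner
    approved: bool = False  # has an owner signed off on this asset?
    questions: list[Question] = field(default_factory=list)

# The Q&A lives on the asset itself rather than in Slack or email,
# so the knowledge survives staff turnover.
asset = CatalogAsset(
    name="Total Product Cost",
    description="Unit cost x quantity, converted to USD.",
    owner="sophia",
    approved=True,
)
asset.questions.append(
    Question(author="holly", text="Is the FX rate taken as of the order date?")
)
```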
And finally, of course, traceability, which ultimately really is the lineage. You want to make sure that BI teams are able to self-serve, but they're not going to unless they know for a fact that the data they're about to consume is data they're comfortable committing to. And the data teams on the other side, the more technical or data engineering side, if I can call it that, need to know what would be impacted if they make a change to something, because that's their job: they maintain these data pipelines. These flows have to connect with each other, and the teams have to understand exactly why they're connected. So making sure that actually works means the data teams making those changes need to know what's upstream of an asset (why am I working on this asset in the first place?) and, before making a change, where to look for any potential impact downstream.

And that brings us to our second case study with you, John. Keeping these capabilities in mind, let's hear about another one of your big projects. Specifically, I'd be pleased if you could share how this project presented a real, deep challenge and how you solved it with the creation of a data catalog, making sure everybody was talking the same language, or singing from the same hymn sheet, if you will; how you leveraged some form of automation, especially for metadata collection; and how you provided visibility from source to target.

Sure. Okay. This was an entirely different case altogether. In 2008, the issue we had was that the bank and the vendor said: panic, go, we don't care what you do, fix it. Which was great. This one, not at all: very political, lots of different managerial silos, which you run into in organizations big or small. The thing here, as you can see, was a big industry change that impacts a lot of financial instruments, a lot of the bank's working processes, everything else. So the bank turned around and said: well, we're going to change this, and we have these front-end systems, three massive systems, and we want to know what the impact is all the way back through the bank. A couple of things immediately jump out. One is that the new thing replacing the old thing was different only in the real data, not in the naming of the data or anything else. The other is that you were dealing with at least five different worldviews. When you hear "provenance", I was thinking about this, there are a couple of aspects. One is how the data moves through the organization right now, from one system into the other. The other is the meaning of the data moving across time as new people come in, and we're talking about IT staff turnover through the roof on some of these systems; it's like an instant master's degree: get in, get a raise once you've touched the system, and move on. So the terminology changes, it has changed, and some of these things were very old, 20 years old. We were trying to traverse all of that to find: okay, where is this thing used, how is it changing, what are we going to do with it? But again, in common with the other example: pick something that maximizes the value and hit it really hard. Don't try to do everything; you cannot boil the ocean, there's too much data. One of these systems has a trade report that, if you want, will give you 16,000 columns. You can't map that. Well, you could, but it's difficult.
They also had different database structures. Again, one was the same kind of system as before: an object-relational, hybrid database. The only way you get at that, legally according to the contracts, is via the API. So we had to make a way; we literally wrote code to translate the API that the customer used into views on the data. We flipped it that way, and then looked at what they had done and tracked it through using terms, literally. I say English language; it happened to be in America, so it was English. We had to follow that through. And then we did need a data catalog, an extensive one. I wrote 1,200 terms myself; it took a while. That brought out a lot of debates with a lot of the users about what things meant: constantly looking at the dictionary, constantly looking at the legal terms, because they're all based on legal documents. And it took a lot of time to persuade users that things they had referred to as "A" for years actually did not mean what they thought, or didn't mean what it literally said. So that was huge.

Then we wound up with a system that, to some extent, ran through their processes automatically, to the extent that we could: get out the code, get out the database, join up words and commonly used abbreviations, for dates things like DATE, DT, DAT, strip out underscores in terms with code, put it all in a database, and then say: right, we can now start to track things. We also found that a lot of this is very complicated. When you're talking about something like a rate, it's not just a piece of data in a record or a report; it's actually used for pricing the instruments, it's used for risk, it's used for everything: issuing bank loans, commercial loans. It goes all the way through. So when we did that, we thought: okay, how are we going to put some kind of boundary around whether a value is reasonable or not? We settled, number one, on units, and that is great: you can get to your units, you can get to your calculations, you can get the lineage for that. What is super interesting is when you try to put boundary points around reasonableness, because now you're roping in extra pieces of data: you need to know, okay, this is the number, but this is its boundary, and you have to bring in extra data to establish that. So how do you do that? We made some assumptions, and we used, again, our knowledge, a series of SMEs, about three of us on the team, to rip through the databases and say: we know what this should be, it should be within this range, and that's governed by these other pieces of data. Let's roll them in, and you come up with approximations, which is the best way I can put it. So we could go around and say: this is where you use this thing, these are the boundaries it should have; if you got it wrong, it looks like this; if you got it right, it looks like that. And we tried to track that through.

We did five systems, I would say pretty well, and they were different types. One was very, very mathematical; another was a very old, denormalized mainframe; and then you had the one in the middle, which was object-relational and very cutting-edge. So we were able to establish where those things were used, where LIBOR was used, where it was going to be replaced, and how. The bank could then go off to their legal department and say: these are all the contracts we need to remediate, these are the reports we need to change, this is the code we need to change. Well, guess what?
The engineers don't know what they're actually dealing with. People think they do; they don't. You're looking at sharp-end finance, and nobody's taught them that. They're sort of up on object modelling, but they don't know finance. So this was a way of really presenting that information: here's the finance, here's the lineage, here's the catalogue. Learn. That was a huge benefit for them.

Okay. Was there anything else from this, since I haven't advanced the slide yet, anything else you wanted to bring out here?

No, that's about it, I think; that's pretty much what you just mentioned. Yeah, pretty much. Agile, whoa, there's a big one. Everybody waves the flag on that; you've got to watch it. The point with Agile is that it's backwards-driven: you get user requirements and they push you forward. Perhaps they call it a product owner. You don't need a product owner with this; you need a product manager. Somebody has to have the vision to say: we are going there, and I will steer this to the point I already know. And you do know what the point is, because you have SMEs, experts in the domain and in the data, more in the domain than anything; that was a much bigger thing than just having an Agile project manager. Scrum was fine. You have to generate value immediately; otherwise people don't believe you can do it. But notice that both these cases were not decision points; they were correcting things that were wrong. And I've found that a lot: people talk about firms making decisions on data. They don't. Fundamentally, they don't. They make decisions on cheapness, on money, and on politics. And, you know, "can we build something rather than buy it?" That's the universal thing. They want to build all the time. They don't want to buy stuff, because they've got bunches of engineers hanging around who want to build things. Do not build this. Do not. This is very hard; conceptually it's hard. Curating is where you want to put the knowledge and the investment of time: curating the data catalog, looking after the lineage, and getting those two things working together, so you can establish confidence in the meaning of the data. Maybe the values too, if you can get that far; if you can, great. But the meaning of the data and the lineage are completely linked, and they feed off each other: you should be exactly as confident that the lineage works as you are in how accurately you've described the data. Joining the dots on that was the thing with this particular project; that was, I guess, the next layer for me.

That's great. I think what you're saying about the curation is that it's an ongoing process, right? You don't just come in and say: okay, here's what we're going to call everything. It's an interaction between those who are the purveyors of the data and those who are consumers of it.

Well, kind of, but it's also like a dictionary. You don't hand out the Oxford English Dictionary and say: here, amend that as you see fit. Give it five years, actually about a week, and it'll be junk. So you need a gateway. You need a workflow. You need people who are going to sit there. It's like having a super-rigid wiki where you say: yes, we will take your suggestions, we will consider them, but you are a user, you are not used to doing object modeling, you don't know how to express yourself. And they don't.
It's not their fault; they've not been taught. But you have to take it, and somebody who has thought about this completely objectively then puts it in the catalog and says: that is the same as something else, and we're going to write that accurately. We're going to look at the definitions, and we're going to take the time to do that. Rigid is not quite what you want, but it's got to be extremely sticky, because as soon as you tweak one of the meanings, your entire lineage changes around it, at the risk of breaking all of it.

Good point. All right, great. Well, that's excellent. So what I'd like to do here is summarize a little of what we've covered so far. There's a bunch of reasons why we believe it's absolutely critical that the lineage and catalog functions be tightly integrated, or indeed be part of the same solution. First, staying up to date. It's a major undertaking to build a data catalog; anybody on the call who's been involved in that is probably nodding their head. But it's going to be exponentially more of a challenge if you're building the catalog after finding out the lineage from another source. So providing them together, and keeping them always up to date, is a pretty central message we're making here. Doing this manually is truly an exercise in futility: you're constantly trying to keep the inventory up to date and map the lineage to different assets, and it's always going to be out of date by the time you finish updating it if it's done manually. For the data catalog and the data lineage mapping to be trusted, you really do want to make sure they're always kept current.

Another point is full visibility of the data journey. For that, you have to be able to show that each asset has its story, and non-technical users will need specific information about a data asset: where does it come from? Can I actually trust it? Is it up to date? What can I learn from it? Is it approved by a data owner for actual use in the business, or am I taking a chance by saying, oh yeah, this one looks like the right one, without being sure that somebody has actually investigated it? As a business user, I don't necessarily have access to that kind of information, or the training to go that far into the depths of the asset. And when planning a change or fixing a problem in the data flow, the data engineers and architects and the like need to know what's upstream and downstream of an asset and what the dependencies are, so they can make the change with confidence, knowing exactly where to check for potential impact, so that in fixing one thing they don't break ten other assets or reports down the road.

The lineage also needs to provide visibility across multiple platforms. You can't depend on each tool to provide its own lineage, because the data doesn't flow inside just one tool at a time; it flows across tools. We buy systems from different vendors and expect to be able to extract from one system and load into another.
And then you've got views and tables coming from different environments, whether on-prem or in the cloud; all of these things cross all kinds of infrastructures. So you need to be able to cross multiple boundaries: in terms of vendors, in terms of architecture, on-prem versus cloud, and private versus public cloud. And for the lineage, you really do need a single unified view of all of those data flows. And finally, like you were saying, making sure that the definitions are absolutely clear, that you can't fudge them. You need to get the conversation around these assets happening in a single place, and maintain that tribal knowledge over time, because there's lots of change going on in terms of the people and the roles. So we need to encourage the discussion alongside the very place where the assets are being consumed, and that's obviously the data catalog. And we need to simplify the enrichment of the metadata by creating a single place for technical and business users to collaborate effectively.

With that, I'd like to give you all a quick demonstration of what we're talking about. Let's go ahead and jump in. Okay. I'm going to do this real quick, just to give you context and show that, yes, it's possible. Let's illustrate it through a little scenario that shows how important, and how easy, it really is to have the data catalog and the lineage directly integrated with each other. We'll begin with a business user; her name is Holly in this example. She's researching a data asset, and she wants to understand a business term. So she decides to start in the data catalog. A data owner managing all of this would see a full cross-section of data assets, but as a business user, she might go in and say: okay, I'm looking for something relating to cost. Searching the data catalog, she first comes up with, in this case, 72 assets linked to the word "cost". But these look a lot like technical assets: columns, calculations, all kinds of different things. So she removes everything except the business glossary terms, because as a business user, that's really what interests her. Having filtered everything else out, we get our result: Total Product Cost, which also carries the "approved" flag. Clicking into that, as a business user I can already see the business glossary term and the description that's been given; we also have a calculation description, and we can see that the term is owned by Holly.

Now, in this example, Holly might want to know a little more. For example: what is this term linked to? What other assets are actually linked to this asset? We can scroll through and see there's a whole bunch of them here, but one of them also carries the "approved" flag, so let's take a look at that. And now we can see that we have a column in Power BI. It's got a description, a general description, a technical description, and even a calculation. And you can see that there are three columns being used in this calculation.
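As a rough sketch of the filtering Holly just did (the hit list and type names are invented for illustration; in the demo a real catalog returned 72 hits):

```python
# A handful of made-up search hits for the keyword "cost"; each hit has a
# type (column, calculation, glossary term, ...) and an approval flag.
hits = [
    {"name": "COST_CENTER_CD",     "type": "column",        "approved": False},
    {"name": "calc_unit_cost",     "type": "calculation",   "approved": False},
    {"name": "Total Product Cost", "type": "glossary_term", "approved": True},
]

# A business user narrows the technical hits down to business glossary
# terms, then looks for the one a data owner has signed off on.
glossary_terms = [h for h in hits if h["type"] == "glossary_term"]
approved_terms = [h for h in glossary_terms if h["approved"]]
print(approved_terms)  # [{'name': 'Total Product Cost', ...}]
```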
So Holly decides: okay, I want to do a little more investigation. Who owns this asset? It's Sophia. I'm going to ask Sophia a question: I want to understand, is this exchange rate used as of the order date? Before Sophia can answer, she's probably going to want to do a little digging herself. Sophia gets tagged, and that tag creates a notification. From that notification, she comes to this page, where she can start her investigation. She'll probably go straight to the lineage to understand: okay, what exactly is this asset she's being asked about? And we can see right away that this is the Total Product Cost we asked about. It comes from a report called Total Product Sales. And if we look at the entire landscape, from the point the data came in from the business application, we can see the three columns used in the calculation and how they're transformed inside the ETL: the business application was accessed through the ETL, and the data transforms along the way. You can follow it all the way through until you get down to the report.

Again, this is really high level; I wanted to give you an understanding that, first of all, we're looking at this at the column level, going across all systems. But I might want to dig into what the systems themselves are, without worrying so much about the actual columns. Now I'm zooming out a little and looking only at the systems that are involved, and here you can see a picture of the different systems that are providing data to this report. That's how quickly you can get to the lineage. You don't need to do long research, finding all this information through all kinds of queries and opening tickets with all the different data system owners. You get this information immediately, because it's all kept in one repository, which gives you the ability to get to the answer quickly without a lot of research. So we now have this information. By the way, anywhere you see a shaded element, that tells us we can continue to go upstream from these ETLs, and we can go downstream as well from these different assets; there are three more dependencies on top of that. But we already know all of the systems that are providing data to this report.

Now, if I want to go deeper into this report, I can go into the inner-system lineage. This is the third level of lineage; this is what we're talking about when we talk about three levels of lineage, a kind of three-dimensional view, if you will. And Total Product Cost is down here; it's one of the columns in this report, along with a bunch of other columns. Over here you can even see that if there's a calculation or some kind of code, we can double-click on it and go straight to the code. We don't need to go to the Power BI system to see that code; we can see it right here, to understand exactly what transformations are going on inside the report itself. At this point, let's say Sophia is satisfied: she now has the answer to how the calculation is done, and she wants to go back to the data catalog to give her answer.
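That "zooming out" step, going from column-level lineage to a systems-only view, is essentially an aggregation over the same edge list. A minimal sketch, again with hypothetical names rather than the demo's actual metadata:

```python
# Hypothetical column-level edges, qualified as system.table.column.
column_edges = [
    ("erp.orders.unit_cost",     "etl.stg_orders.unit_cost"),
    ("erp.orders.quantity",      "etl.stg_orders.quantity"),
    ("fx.rates.usd_rate",        "etl.stg_orders.usd_rate"),
    ("etl.stg_orders.unit_cost", "powerbi.total_product_sales.total_product_cost"),
    ("etl.stg_orders.quantity",  "powerbi.total_product_sales.total_product_cost"),
    ("etl.stg_orders.usd_rate",  "powerbi.total_product_sales.total_product_cost"),
]

def system(qualified: str) -> str:
    """The first segment of system.table.column is the system."""
    return qualified.split(".")[0]

# "Zooming out": collapse column-level lineage into system-level lineage.
system_edges = {
    (system(src), system(dst))
    for src, dst in column_edges
    if system(src) != system(dst)
}
print(sorted(system_edges))
# [('erp', 'etl'), ('etl', 'powerbi'), ('fx', 'etl')]
```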
Back in the catalog, she posts her answer: yes, the US dollar exchange rate in this calculation is as of the order date. So with that quick review, she's probably saved quite a bit of time in getting to the answer she was looking for. Hopefully that gives you a high-level understanding that, yes, these things are possible to do; it's quite easy once you've got the kind of tools that are available today. And at this point, I think we're probably ready for a summary of today's session and some questions, if you have any. I'd be happy to field some questions together with John here.

Thank you so much, and John, thank you so much. Just to answer the most commonly asked question: a reminder that I will send a follow-up email for this webinar by end of day Thursday, Pacific time, with links to the slides, links to the recording, and anything else requested throughout. We've got a couple of questions coming in in the Q&A section. Would you say data lineage can be used to incrementally build a data catalog, by letting users first link potential duplicates and subsequently eliminate them once confirmed with stakeholders?

Yeah, I think it's a good idea; I think that's what we did. The danger is to assume that the stakeholders, who may be the people who use the data the most, actually know what it means. Again, it comes back to the object models, because most people don't get into behavior, and a lot of the data naming describes behavior. But that's exactly what we did in the second project: really open the doors and say, hey, throw us what you think. We'll curate that and feed it back, and it allows people to pick up on their own misunderstandings, defend their point of view, debate, and you get a lot more confidence in the definitions and therefore in the lineage.

Yeah, I think that's definitely the case, and I'll just add a clarification from the perspective of using automation like we do at Octopi: you're able to do both essentially simultaneously. The way it works is that the metadata is scanned, collected, and uploaded to the central metadata store, which means you have it all in one place. From the moment you've implemented Octopi, we've already collected all the data assets into the data catalog, but there's not much information on them other than what's available in the metadata out of the system; sometimes that's good, sometimes it's not. Incrementally, you're right: over time, as you start to uncover the source-to-target implications of this asset versus that one, you document that as you go along. So the goal as you're working is always: document, document, document. Don't wait until much later; when you see something, document it right away. So yes, that's definitely the case in that sense.

How can Octopi auto-generate metadata on a system? Octopi's process is actually quite simple. We provide an agent, basically a Windows application. It sits inside the company's environment, on the VPN, and it's configured to go out to the metadata systems in the customer's environment. It will go out and collect from DataStage, SSIS, and so on; you can see the supported out-of-the-box technologies on our website. Basically, you just run that client, which we will configure for the customer.
You run that client, and the effort can take anywhere from an hour to half a day. It's really not a project per se; it's more like a task. Just configure the tool and run it. It produces a local set of files that you'll probably want InfoSec to take a look at before you upload them; most likely that's a process that needs approval. They'll look at it, see there's only metadata in there, and give you the green light. You upload it, and then, to answer the last part of your question, within 24 to 48 hours, that's all it takes. Basically it's an hour to half a day of effort at most on your end, and about 24 to 48 hours of processing on our end.

Nice. So are metadata scans based on connectors, or on the size of the data assets? We have quite a large list of connectors pre-configured inside the Octopi client, and these are what we call supported out of the box. That means we've done the homework, quite intensive work, to understand exactly how those metadata systems are structured, where we need to collect the metadata from, and what kind of user access we need to that metadata section in order to collect it. And once we get it, we need to know exactly how to parse that information and run it, in order to use that code to answer the questions: how does this connect to that? What happened in between? What is the code that converted these three columns into one column?

A lot of questions about the product here, which I love. So: can Octopi handle the creation of self-defined metadata during a scan? For example, if John is the steward of all the data in the sales schema, can he be tagged as the steward for those data assets in the metadata? I'm actually looking for the question in the chat, because I didn't quite understand it. Oh, there it is. Okay. So I'm going to say yes, but I presume there's probably more to peel on that onion. First of all, the metadata that we collect from the systems is collected automatically, and obviously we include that. But as you saw in the data catalog, there are a number of fields you can go into; there's a pencil icon, I didn't point it out, in some of these areas where you can go ahead and enhance the metadata. So it's a combination of the machine-collected metadata and the things you continue to enhance, and certainly, yes, you can tag users as the steward or the owner, and you can also tag the metadata with all kinds of descriptors in order to find data more quickly.

And was Octopi the solution used in both case studies that you presented, John? I can answer that: no. We only met recently, and when John was telling me about his previous projects, I said, wow, I want you to tell your story. Now, I've seen a whole bunch of questions coming in, so let me just run through them. Okay, number one: is it fun? Depends. I've been around for 46 years in finance.
For me, the challenge of defining things is sort of fun at first, and then it's teaching people, and that's fun, and then it gets to the point of just carrying on and doing it. Can you do it automatically? To some extent; if you're talking about derived terms, I do it somewhat automatically, with meanings, with what things are actually doing. But boy, it turns into building a dictionary of your world, and that's very hard to automate.

And as far as ROI goes, that's actually pretty easy, because there are a couple of ways of looking at it. One is: what can I do with the data? The other is: what happens to me if I get it wrong? What happens if your audit is wrong? What happens if your regulatory reporting is wrong? You're really looking at making sure it isn't wrong: that people actually do get the right data through, that you can identify when it's incorrect, when you've got out-of-boundary conditions, and that you can put a stop on it before you report it to somebody. Size of the projects compared to the cost of the tools? Again, that depends how big your projects are. I could do one system, and it would easily qualify just for this tool, for the system implementation alone, let alone anything else, because I've seen how horrible they get. Think about it this way: most financial organizations spend about 8% of their revenue on their IT budget. 50% of that is run-the-business, just BAU. Then another 25% is supposed to be on enhancements. And those projects go so far over budget, 60% over if you look at the standard industry reports, that the ROI goes clean out the window. 8% of a financial institution's revenue is a big number; that's a lot of dollars. So when you come down to it: what value do you need to show to hit an ROI of four, when most projects are negative? It's pretty easy, if you bring an implementation down from, say, 60% over budget to 20% over; that's a massive, massive improvement. And a lot of it is testing, a lot of it is data integrity, a lot of it is knowing that the report is right. As I said, I've got one report with 16,000 columns. Okay, go: find something that says all these different products, sharing the same column on this one line, add up. They're all different departments, all different people, with different histories in the market and different knowledge. There's no way they'll come up with the same pieces of data from the same products and wind up with a total; they couldn't even add them up. So yes, you get ROI pretty easily, as long as you keep it focused. If you try to boil the ocean to get it, you're dead. But if you go after the things you've identified beforehand as bang for the buck, then yeah, you'll get the value.

Absolutely. And Nesim, do you have anything to add? I get that question a lot: how do you present the value or ROI of a solution like this? How do you get the buy-in? So I think one of the main things we've experienced at Octopi is that typically these kinds of tools are purchased when there's a major event, like a migration or a merger or acquisition.
Anytime there's a large amount of change. Because frankly, if they've been doing it manually forever and nothing big is changing right away, they'll continue to do it manually, even if it's somewhat painful; they get used to it, and the expectation isn't that you'll turn around an immediate answer to a specific challenge. Somebody says, oh, I have a report that broke; that takes a week, a month, a certain amount of time to resolve. But once you get into a situation of, well, we plan to go live by a certain date and need to figure out a way to manage the impact, that is typically the scenario where they'll look at it and say: okay, we need some help figuring out where that impact is going to be, because we're going to be changing a ton of stuff, so how do we actually do that? Typically those are the event-driven scenarios where people think about a tool like this. That's not to say there's no value in a tool like this for everyday use in improving systems. But typically, as they're looking for an ROI, it's going to mean reducing the effort to get to the point where you can go live without the noise. You want a quiet go-live: you want to know exactly where to search for the potential impact, what to test and, just as importantly, what you don't need to test, and not waste time hunting for the impact.

Nesim, there are a lot of questions here on product comparisons, how you compare to other data catalogs. We are a vendor-neutral company, so of course we don't allow putting down other companies, just as we wouldn't allow anyone to put down Octopi, right? Of course. But, and I like the way this was phrased: what competitive advantages does Octopi have? What sets you apart from other data catalogs?

I would say the first is that the lineage is integrated, which is obviously the title of today's session, and at the end of the day it's one of the main differentiators. Octopi's data catalog is a great tool with all the important features in there, but the number one thing that sets it apart is that you can go straight to the lineage, and it's always available. Secondly, it's automated: the catalog is actually already created as a result of that one-hour-to-half-a-day effort. You basically upload the metadata to the Octopi cloud, and that's it; 48 hours later, you come back and it's ready for you. Now, obviously, there's additional work to be done in terms of adding your own descriptions, creating your glossary, and so on. But the effort of collecting the metadata into one place lets us not only create the catalog, but also create all those views, everything you saw in the mapping of the lineage, how the assets interact with each other, and the connectivity with all the different sources. That's all done automatically. This is not something where somebody went in and said, oh yes, I've got this asset and I think it connects to that one, let me draw that.
No, that's not the way it works; it's all fully automated. So those are the two main differentiators. I would also say that it's a low-hanging-fruit kind of tool: it's not an expensive tool, and it really does have the ability to be a very tactical tool when you're struggling with a lot of change, or when you're planning a change and want to be sure you can manage it very quickly.

All right, we've got just four minutes left, so I'm going to slip in as many questions as I can, at least one more. What are the key tools used to search the data catalog, such as tags, topics, keywords? Is there a combination of automatically generated tags and tagging rules defined by data stewards?

Well, the tags are user-generated, right? Users tag assets based on their use case, so those are individually added. You could set up a workflow for your business to decide exactly who gets to tag and what kinds of tags you approve, so it becomes an agreed-upon methodology. But searching through the metadata, you can search by column name, by report name, by calculation. So if you're making a change, as in John's example, and you know LIBOR is mentioned in the code, search for LIBOR and you'll find every asset that has the word LIBOR in a calculation, a name, a view, or a column; all of those will come up.

And is there a batch mechanism to keep data lineage up to date? Yeah. Generally the way it works is that once the InfoSec team has inspected that the process only collects metadata, they approve it, and the data team then just creates a scheduled job that automatically uploads the metadata. Typically they'll upload it on a Friday and come back on Monday to a freshly updated metadata connectivity graph, with all the updated metadata inside.

Okay, I think I can slip in one more here. After the automated metadata scan of a system, how accurate is the final result? Do we need human assistance to confirm it? We find that our results are pretty darn accurate. There are obviously some areas that customers will point out to us, and we've got teams that immediately jump on those. But customers tell us they're absolutely impressed with how much they can rely on the results; they're not questioning them, quite the contrary.

That is perfect. Well, thank you so much to both of you for this great presentation. If I can, I'm going to answer one more question; I just saw one that I think is important to answer. The question is: is there an integrated data quality component? First of all, once you know the lineage of a pipeline, you've already considerably improved the level of quality insight available. But if you're looking for data quality in the sense of sampling the data, it's important to note that Octopi only collects metadata; we don't see the data itself whatsoever. That matters because InfoSec will quite often really slow down an implementation if they see any kind of actual data going to the cloud. So we partner with companies that offer data quality solutions, but we don't do that ourselves. Perfect. Well, thank you so much.
That brings us right to the top of the hour. Thank you both so much for this great presentation, and thanks to all our attendees for being so engaged in everything we do. Again, a reminder: I will send a follow-up email by end of day Thursday, Pacific time, with links to the slides and the recording from today. Thank you all; I hope you all have a great day. Thanks to Octopi. Thanks, everybody. And thank you, John. Thank you, everyone. Great to have you.