Hello, everyone, and welcome to our next EDW session, Manage Data Smarter Using AI/ML-Powered Data Quality, presented by Raj Joseph, the CEO of DQLabs.ai. All audience members are muted during these sessions, so please submit your questions in the Q&A window on the right of the screen, and our speaker will respond to as many as possible at the end of the talk. Please note that there is a linked form at the bottom of the page titled EDW Conference Sessions Survey; this is where you can submit session feedback, and we encourage you to do so. So let's begin our presentation now. Thank you, and welcome, Raj.

Hey, thanks, Jim. Thanks for this opportunity. Let's explore how organizations can manage data smarter, primarily using data quality as the approach. First, let me give an industry view of what has happened over the last two decades of traditional data management. If you look at the landscape today, businesses are overwhelmed with a large number of tools: tools for data cataloging, governance, and more, some based on 1.0 frameworks and some on 2.0 frameworks. Overall, in any environment with lots of data, we end up with lots of different tools, practices, processes, and procedures, which, for me personally, is a little overwhelming. More importantly, the time and cost spent on these disciplines hasn't really scaled or performed well. In these traditional methodologies, you usually see data quality as an add-on to one of the other main components, such as a governance foundation or processes, procedures, and people, but not as a primary focus. That primary focus is a very critical component that has been missing over the last two decades and even before.
Further, if you look at some of the data management products out there, the lack of innovation is overwhelming; the market has not grown to where it should be, and we are left with unhappy customers and poor experiences. That is the landscape we are in. This Gartner slide pretty much tells the whole story in a nutshell. I like it because it divides the timeline into two decades: the 1990s to 2010, and 2010 to 2020. On the left you see how we were doing data warehousing, then MDM, then all the big data concepts coming in, but the fundamental challenge always revolved around data quality. From 2010 to 2020 it's the same thing: big data projects, AI/ML, mobility, and so on, and still the data quality issues are there. Today, if I pointed to anyone in the audience and asked what percentage of the data in your organization is good, it would be very hard to answer, because there is a lack of consensus among us on what data quality means. It is also hard to determine because there are no governing process standards, and business-process-based, outcome-based measurement of data quality is not enabled in organizations today. Because of all this, 88% of the data we have today is untouched, and we don't know how much of it is relevant; most organizations and users work within the remaining 12%, according to a Forrester research study. That points to how important it is for data quality to be actionable and outcome-focused. This next slide is a nice pyramid showing how an outcome-driven strategy needs lower-level data and metrics, more DQ metrics. Everything starts from a higher-level business strategy and outcome focus, which translates into performance metrics and KPIs.
But what has been missing over the years is connecting those bottom-level data quality metrics back to business outcomes. If you cannot do that, then the leadership and strategists in the organization cannot measure what percentage of improvement, or how much significance, any given initiative contributes. This is a fundamental challenge. So what I'm preaching here is that not having this outcome focus, not having a focus on data quality, has been the bigger problem in the past, and that is where we want to shift the focus toward a DQ-first approach. This also makes sense when you look at how data has been growing. Across the last decade and the decade ahead, the growth in data is enormous; these are numbers everyone knows. Today we live in an environment where data is generated faster than ever before; in fact, the amount of data we produce far exceeds the data we can consume. Further, whether an organization is in fintech or any other vertical domain, data is spread across on-premises and third-party systems, sometimes tied into APIs or legacy systems, and not just one cloud but multiple clouds. We use many different vendors for different purposes around compliance, privacy, preferences, and so on. So we are left in a place where data is growing faster, we don't know the relevance of the data we have today, and traditional data management practices have not worked well. With data spread across so many locations, we need something faster and more agile, something that evaluates your data quality on the spot and gives you information that can be tied back to outcomes.
If you look at this chart, from a 2019 Gartner study, it shows that more than half of organizations do not have any ability to measure DQ, and in some cases they have informal metrics but not a standardized process or a governance practice that starts with DQ. The way I see it, DQ cannot be a one-time approach where you measure it once and leave it. I strongly believe in a continuous process and cycle, and as you see here, I have outlined four steps that need to be part of any data quality process or toolset you may have. The first is the ability to connect to all those data sources, wherever they are. The second is measurement. Traditionally we have had rules-based approaches, but with the growth in data we cannot afford that, and nobody has time anymore to manage a large set of rules. You need an out-of-the-box, no-rules approach to monitoring that makes ongoing measurement easy. And we shouldn't stop there: measurement is the critical first step of understanding where we are, but it does not necessarily get us where we should be. So we definitely need to improve the bad records into good records, fixing whatever problems were identified from a data quality standpoint. That gets into the improve piece. Then, how do we sustain this over time and make it continuous and ongoing, so that it operates just like any other process, product, or operational aspect of the organization? This continuum is very critical, beyond the basic need for DQ. As I said on the previous slide, data quality measurement cannot be manual, which is what we have seen over the last two decades. So the way we do it at DQLabs is automate first.
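The connect, measure, improve, and sustain continuum described above can be sketched in a few lines. This is a minimal illustration, not the DQLabs implementation; every function name and the email check are assumptions for demonstration only.

```python
# A minimal sketch of the connect -> measure -> improve -> monitor
# continuum. All names here are illustrative, not a real product API.

def connect(source):
    """Stand-in for pulling records from a data source."""
    return list(source)

def measure(records, is_valid):
    """Return the fraction of records passing a validity check."""
    if not records:
        return 1.0
    good = sum(1 for r in records if is_valid(r))
    return good / len(records)

def improve(records, is_valid, fix):
    """Attempt to repair records that fail the validity check."""
    return [r if is_valid(r) else fix(r) for r in records]

def monitor(before, after):
    """Report the improvement (impact) between two measurements."""
    return after - before

# Toy example: emails must contain '@'
records = connect(["a@x.com", "bad-email", "b@y.org"])
is_valid = lambda r: "@" in r
score_before = measure(records, is_valid)
repaired = improve(records, is_valid, fix=lambda r: r + "@unknown.example")
score_after = measure(repaired, is_valid)
impact = monitor(score_before, score_after)
```

The point of the sketch is the cycle itself: measurement feeds improvement, and the delta between measurements is what gets reported back as impact.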
There are two principles we hold close to heart in product development and lifecycle. The first is automation first: anything that can be automated should be automated. It may be as simple as pattern-based logic or as complex as AI/ML algorithms, but automate-first is central to our approach. The second is relevance. Relevance is very subjective, but the way we handle it is by providing relevant metrics, whether outcome-based or for a tool you may be using from a catalog or governance standpoint that would consume them. If you are looking at lineage, you need to look at it not just for root cause analysis or impact analysis but in terms of quality attributes too: is this a relevant, high-quality attribute that we even need to run root cause or impact analysis on? The same goes for search and discovery: what is the point of searching for and discovering low-quality attributes? For semantic search or catalog search, if I am looking for emails, I need to find high-quality emails and the sources that contribute to them. So we take relevance and power metrics that enable many different business processes on top of it. One thing we do as part of this automation, beyond pulling data through native connectors, is classify attributes semantically; semantics here means business context. A number, in a business context, could be a phone number, a social security number, a passport number, or something else. How do we know what is relevant at the point of ingestion?
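The pattern-based side of that semantic classification, telling a phone number from an SSN by shape, can be sketched roughly as below. The patterns, labels, and the 80% threshold are assumptions for illustration, not DQLabs internals.

```python
import re

# Illustrative pattern-based semantic classifier. The patterns and
# labels here are assumptions for demonstration only.
SEMANTIC_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "us_phone": re.compile(r"^\(\d{3}\) \d{3}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def classify_value(value):
    """Label a single value by the first matching pattern."""
    for label, pattern in SEMANTIC_PATTERNS.items():
        if pattern.match(value):
            return label
    return "unknown"

def classify_column(values, threshold=0.8):
    """Assign a semantic type when most values share one label."""
    if not values:
        return "unknown"
    counts = {}
    for v in values:
        label = classify_value(v)
        counts[label] = counts.get(label, 0) + 1
    label, count = max(counts.items(), key=lambda kv: kv[1])
    return label if count / len(values) >= threshold else "unknown"
```

Classifying at the column level rather than per value is what allows a column named `username` that actually holds emails to still be tagged as an email attribute.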
And that is what we do in an automated way: help organizations understand the business context of each attribute, which lets us overlay whatever data quality automation we need and measure the metrics around it. This helps us go from metrics to business-outcome-based remediation. That is what we preach from a product standpoint, and we keep building more features toward those two areas, automation and a relevance-first approach, along with DQ. This slide pretty much sums up what DQLabs is. DQLabs is data-quality focused; that is why we emphasize the DQ. We are a data-quality-centric platform: we preach a data-quality-first approach over other methodologies or processes, and we do it using automation, AI/ML, and relevance metrics. The way to read this chart is by following the numbers one, two, three, starting from the bottom left. Of course, we have connectors, as I mentioned, to any data source, whether relational, streaming, or a big data environment. Then comes semantic discovery and classification, which lets us understand the business context of an attribute. That takes us to the next level, measurement of data quality: if an attribute is an SSN, we know the SSN format checks, the blocks of numbers issued, and so on, and similarly for other attribute types. We have also created a way for you to define your own semantic classifications and models on top of that, which lets the platform handle attributes very specific to your vertical and domain knowledge, such as loan numbers. All of this lets us measure the data quality of those attributes over a period of time or for a given snapshot.
So as I was pointing out, measurement is one component, but we don't stop there. We take an automated approach to improving the data, and by improving I mean things like cleansing. The difference from other tools you may be using is that the DQLabs platform automates all of it and makes it easy. As a business user, you can come in and just connect, and everything else happens automatically. We have made it seamless so that it is easy and useful, and you are not overwhelmed by the amount of technology or by managing those functions in silos. In this improve step we clean the data; I will show some examples to make sense of this. We don't just work on a single column; we work across multiple columns and across datasets, which makes it more powerful for data preparation, enrichment, or cleansing against a known reference dataset. As this process happens, we also look at data volatility: anomaly detection, outlier detection, and ongoing measurement of DQ as a continuum, which lets us monitor and calculate drift and deviations in the data. Any outliers or anomalies, anything that deviates from the established benchmark, is measured and alerted back to users, either as a notification or fed into some kind of management process around data quality measurement. Together, these steps make ongoing data quality measurement easy. We have also created a model where all the good records can be fed into consolidated, high-quality data models, which can later be used as reference data for cleaning.
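The benchmark-and-deviation monitoring just described, an expected range derived from history, with anything outside it flagged, can be sketched as below. This is a generic sketch using mean plus or minus k standard deviations, an assumption on my part; the actual product's drift model is not described in detail here.

```python
import statistics

# Sketch of benchmark-based deviation monitoring: a metric measured
# over time gets upper/lower bounds from its history, and new values
# outside those bounds are flagged. Names are illustrative.

def benchmark_bounds(history, k=3.0):
    """Mean +/- k population standard deviations of past measurements."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return mean - k * stdev, mean + k * stdev

def detect_deviation(history, new_value, k=3.0):
    """True when the new measurement falls outside the benchmark band."""
    lower, upper = benchmark_bounds(history, k)
    return not (lower <= new_value <= upper)

# e.g. daily completeness scores for one attribute
history = [0.95, 0.96, 0.94, 0.95, 0.96]
```

As the transcript notes later, a flagged deviation is not automatically a data quality defect; it is a signal that something changed and should be looked at.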
So this is kind of like an agile MDM of sorts, but with data quality primarily in mind: bringing those high-quality records together in a way that can also be applied to any new data coming in. That is the high-level platform overview from a data-quality-centric standpoint. On data quality metrics, we don't measure at just one level; we measure at three. The first is a data quality score based not only on the traditional objective dimensions but also on subjective dimensions, such as reliability, precision, and existence, which are usually collected through CSAT, surveys, or user feedback. We have integrated user collaboration and usage of the data, datasets, and attributes in a seamless way, so we can now measure on both the objective and the subjective dimensions. All of that rolls up into a DQ score, a data quality score, which can also be trended over time. The second score we provide is the impact score, which is primarily the number of bad records turned into good ones. This reflects the automated data cleansing and preparation, and it makes it easy for users to see, for example, that data quality started at a certain level and we made an impact of 5% or 10%. These can then be leveraged as KPIs and performance metrics tied back to business outcomes. The third score we provide is the volatility of the data. The best example: any metric you measure over time traditionally has an upper bound and a lower bound, and when it deviates from the benchmark, something is happening.
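The first two of those three scores can be sketched as follows. The 50/50 blend of objective and subjective dimensions is my assumption for illustration; the transcript does not state how DQLabs actually weights them.

```python
# Illustrative sketch of the DQ score and impact score described
# above. The 50/50 weighting is an assumption, not a product detail.

def objective_score(records, checks):
    """Fraction of records passing every objective dimension check."""
    if not records:
        return 1.0
    passed = sum(1 for r in records if all(c(r) for c in checks))
    return passed / len(records)

def dq_score(objective, subjective, weight=0.5):
    """Blend objective dimensions with subjective user-feedback scores."""
    return weight * objective + (1 - weight) * subjective

def impact_score(bad_before, bad_after, total):
    """Share of all records turned from bad into good."""
    return (bad_before - bad_after) / total
```

For example, if cleansing reduces bad records from 20 to 5 out of 100, the impact score is 0.15, the "we made an impact of 15%" number a steward could report upward.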
It could be a bad data quality issue that needs to be looked into, or it could be driven by some economic factor that is totally legitimate and fine. Either way, something is happening that we need to look at. So volatility is not necessarily an indicator of bad data quality by itself, but it may evolve into a data quality problem. The drift level tells you whether drift is high, medium, or low, which, in conjunction with the DQ score and impact score, gives you an overall understanding of your organization across every attribute, through automated measurement. You can then easily feed this into any other platforms or tools you use for managing data. Here are some examples of how we have used emerging technologies such as AI/ML in different areas. You read this pyramid from bottom to top. If you remember the previous slide showing the six functions and modules embedded in the product, this follows the same modules, starting from the ingestion layer: how we automatically ingest and automatically classify the semantic context of the data and its attributes. Then we profile automatically using frequency analysis, pattern analysis, and statistical methods, with continuous monitoring for data gaps, missing data, and drift. We also do smart curation, which helps the platform learn whatever is needed automatically, even if it was not defined the first time you used it. Those are the ways we have applied AI/ML across the modules, again with an automation-first approach. Now let me go real quick and share a couple of screens so you can see the product a little.
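Before the demo, the frequency and pattern analysis just mentioned can be sketched roughly as below: each value is reduced to a shape (digits become `9`, letters become `A`), and shapes that occur rarely stand out as candidates for review. The shape encoding and the 5% rarity cutoff are assumptions for illustration.

```python
from collections import Counter
import re

# Minimal sketch of profiling by frequency and pattern analysis.
# Digits collapse to '9' and letters to 'A', so values with the
# same structure share one "shape"; rare shapes stand out.

def value_pattern(value):
    """Reduce a value to its character-class shape."""
    pattern = re.sub(r"[0-9]", "9", value)
    pattern = re.sub(r"[A-Za-z]", "A", pattern)
    return pattern

def profile_column(values):
    """Frequency table of value shapes for a column."""
    return Counter(value_pattern(v) for v in values)

def rare_patterns(values, max_share=0.05):
    """Shapes whose share of the column falls at or below the cutoff."""
    counts = profile_column(values)
    total = len(values)
    return {p for p, c in counts.items() if c / total <= max_share}
```

On a column of SSNs, for instance, almost every value collapses to `999-99-9999`, and the handful of values that don't are exactly the suspect records a profiler would surface.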
If you look at this screen, this is a catalog of all the datasets you have connected. The main thing I want to highlight is that you can connect to pretty much any of the different connector types; for this conference I have connected Salesforce, Hadoop, Snowflake, Oracle, and REST API sources. For each data source you have its attributes, a data quality score, and an impact score. You also have the ability to search. For example, I can simply come here and search for "email". I can also filter by sensitivity level, by higher data quality scores, by source types, or even by particular users, whichever I want; these filters can be changed at any time and combined in any variation. Now, when I search for "email", you can see that email is not always named "email" by developers and engineers; sometimes the column is called "username" or something else with no obvious relation. But the platform is smart enough to identify all of these through semantic discovery, bring in the data, and show the data quality scores, which makes this easy.

Raj, you've got about four minutes before the Q&A starts.

Okay, good. Thank you, Jim. So as I was saying, you have native connectors; you simply pick a connector type and provide the configuration, and that is all that's needed. Once connected, it goes through a three-step process of profiling, curation, and learning.
What happens is that for every dataset or connection you make, you get a data-source-level score, a dataset-level score, and a score for every attribute within the platform. If there is catalog information, either already in our platform or in another platform, we can integrate with that platform and pull the information in as well, so you see both the catalog information and your data quality score. Another notable point: lineage is important, but lineage without quality does not add as much value. Here is an example where you can see the lineage color-coded and labeled with the data quality score for each attribute. We use three colors, green, yellow, and red, with red being the lowest. The profile shown here is a simple example of what we produce from a data quality standpoint; again, it is all automated, and the information is there in case you need to dig in. For example, you can see some bad records identified here. Most tools just show what percentage of records is bad and don't give a real-time view into your data. What we show here is that you can query any of this data in real time and see the actual bad records. In this case, the column is supposed to hold a consumer first name, but we see a business name coming in, which does not make sense. All of these bad records are flagged with red marks so you can go through them. We run different levels of checks.
If you want to add custom rules, either SQL-based or function-based, we support that, but you create a rule once and apply it across your organization, not per table or per column, which makes it easy. Here's an example of data preparation and impact, how we turn bad records into good records. You can see all these bad values; the city is not dependent on just one column, it depends on multiple columns: address, street, state, and country, plus the postal code and the latitude and longitude. We can run this cross-column analysis automatically, figure out the right value, change the records, calculate the impact percentage, and roll it back up to the higher level. In cases where your data is totally new and we know nothing about it, we use statistical or AI/ML algorithms based on distance or similarity, so that even without any prior understanding of the data we automatically clean it; and where confidence is low, we ask for user input, from a reinforcement learning standpoint, to confirm whether a fix is good, and that is how the platform learns as it goes. Last, we provide an organization-level dashboard from both a governance and a quality standpoint: overall quality for the organization, timelines, domain-level scoring if you need filtering, user usage information, and sensitivity analysis, all focused on data quality. Organizations that could not answer before what percentage of their data is good or bad can now, using DQLabs, answer easily and in an automated way. That is primarily what I had for this session.
Hopefully you got enough information. We have representatives in the booth who can show you a detailed demo; otherwise, reach out to me by email or LinkedIn or whatever your medium of choice is. That's primarily it. Let me go to the question session.

Okay, the first question I have is: is the semantic discovery/classification stage purely automated by the platform?

Yes. We have multiple levels of semantic discovery. The first level is simpler: we don't always have to use complex algorithms such as token-based methods; it can be as simple as pattern-based classification, for example a phone number with a particular format. So the first level of classification is done automatically using those simple patterns, but as it goes further we use more AI/ML-based methods and training data, and that all comes automated as part of the platform. We also provide a way for users to reclassify if we have misclassified something, and we treat that as a learning opportunity. For example, if for whatever reason we classified an SSN as a phone number and the user changes it from phone to SSN, a correction has been made, and that is included in our ongoing learning process. Coming out of the box with discovery and classification, then building on top of that and learning, helps the platform align to your data culture and automatically overlay whatever your level of classification is. The main focus of what we have tried to do is make it less technical and more business-user friendly, so you are not stuck managing data quality attribute by attribute but can work in terms of outcomes.
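That feedback loop, a user's correction overriding the automatic classification on subsequent passes, can be sketched minimally as below. The class and method names are illustrative assumptions; a production system would learn from corrections rather than simply store overrides.

```python
# Minimal sketch of the reclassification feedback loop described
# above: a user correction is recorded and wins on the next pass.
# All names here are illustrative, not a real product API.

class SemanticClassifier:
    def __init__(self, base_classify):
        self.base_classify = base_classify  # automatic classifier
        self.overrides = {}  # attribute name -> user-corrected label

    def classify(self, attribute, values):
        """Prefer a recorded user correction over the base classifier."""
        if attribute in self.overrides:
            return self.overrides[attribute]
        return self.base_classify(values)

    def record_feedback(self, attribute, corrected_label):
        """Store a user correction, e.g. 'phone' reclassified as 'ssn'."""
        self.overrides[attribute] = corrected_label
```

Storing the override keyed by attribute is the simplest possible version of "the platform learns"; the transcript describes something richer, where corrections also feed back into the trained models.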
To that point, we are also looking into providing business-outcome-specific dashboards, to make things much easier for your data stewards rather than putting everything in the hands of the engineers.

Can you provide someone I can reach out to for a discussion?

Oh, definitely. You can reach out to me, raj at dqlabs.ai, or to a couple of others on the call, such as sb at dqlabs.ai; any of those individuals would be fine. I think we are coming close to the hour. Jim, is there anything else you want us to answer?

No, it looks like you've answered all the questions the audience had. So you could probably let him leave a little early, and Luis has a little closing for us.

All right, thank you, Raj. Thank you for this great presentation, and thanks to our attendees for tuning in. Please complete the EDW conference session survey located at the bottom of this page. The next session will start in a few minutes. Thank you, Raj.

Thank you. Thank you, everyone. Bye-bye.