Good afternoon everyone. My name is Eiter. I'm a senior data engineer at Microsoft, working with different customers, and over the last year we have been building different data platforms and suffering a lot with data quality. So this talk is about data quality, and about the steps we can take to build an automated, AI-assisted data quality platform that runs on top of our data platform. Let's start.

Can anybody guess what this picture is? Okay, so we all agree that this is bad quality data, right? What about this one? Yes, good enough, right? Some people could say, oh, I can guess it, I'm the best one, and maybe you cannot say that. However, we all agree that this is good quality data, right? And this is what actually happens sometimes when we are dealing with data. The same happens when we look at a graph: when we see a huge spike, someone asks himself or herself, is this a data quality issue, or just the normal load or volume of data that I'm ingesting? How can we tell? How can we say whether this is a data quality issue or not?

This is a typical data architecture. We have different data sources, we collect the data in Kafka, for instance, and then we land it in blob storage, in a raw data store. Then we have some processing, some Spark, whatever, maybe some real-time processing too, then Cosmos DB, and then some reports and a web application to show the data in real time. So when we have this type of spike, first of all the front-end developer, or the user of the web app, notices the spike. They talk to the developers and ask: do we have a bug here in the front end, in the web app? Is this real? The front-end developers say no, everything is okay. Then we go to the back-end developers and ask: do we have any issue here? And the same goes for the data engineers, the people working on the data platform. And this is independent of the technologies or frameworks we are using in our data platforms.

The data quality problem actually has two different flavors, I would say, two separate aspects. First, it is really time-consuming. And second, it produces a lot of unreliable insights. So how can we solve this? But before trying to solve these kinds of issues, why do data quality problems happen in the first place? First, because we are assuming from day zero that we have good quality data; the time-consuming part is related to this. And second, because we are propagating low data quality to the different applications and consumers that we have: machine learning models, reporting, web apps, et cetera.

This is a typical data store, and the question is: when do we realize that we have a data quality issue? Maybe on day zero everything is blue, good data. Maybe after one month someone, just having a coffee, realizes: okay, I'm seeing some issues in the data. But, you know, this is not blocking me, I can still perform my tasks, let's go on.
The next month: oh, this is really time-consuming. We realize we are having more and more issues in our data platform; we should start doing something. And sometimes we realize really, really late, when the data platform is just useless and our data store is useless: a mess of bad quality data. Of course this has a lot of consequences, not only revenue, but productivity, credibility, et cetera. I guess you all know those.

Okay, so we are going to talk here about some of the foundations of data quality, about how we can face and deal with this, and maybe go beyond automated data quality and think about how we can apply some ML on top of it. First of all, the first challenge: we said we are assuming good quality data, and this is why we spend a lot of time performing queries and so on. What's data quality? I like to say that it is the fitness of data for its intended purpose. And data quality touches a lot of different aspects that you have already heard about here at the conference: data profiling, data management, data governance, data architecture, et cetera. In this talk I'm going to focus a little bit on profiling, not too much, and mostly on data quality, okay?

Has anybody heard about data quality dimensions before? Yeah? Okay. In publications, in research, and in the general literature you can find different dimensions of data quality, and this is the classification I really like to share. Depending on the complexity, we can define different data quality metrics or dimensions. The first one could be the threshold: it just defines a boundary, so if we see a value crossing a threshold in our web app, we can get a sense of whether this is a problem, or whether it could become a problem in the near future. The second one is timeliness: the difference between two timestamps. We can define a threshold and say, for instance, I cannot accept more than 20 seconds between the first timestamp and the second one. The same goes for linkage, integrity, accuracy, and then completeness, uniqueness, validity, conformity, et cetera. We can define a set of data quality dimensions and say: okay, we are going to start with the first two or three, and we are going to monitor them. I'll show a small sketch of a few of these in a moment.

This is just a logical view of a data pipeline. Data quality can be related to the data sources, but also to the job: to the dataset itself, to the input dataset, or to the job, the ETL. So the data quality dimensions we already mentioned can actually be classified into data source issues and ETL issues. On the data source side, for instance, we can talk about data inconsistencies, or hard deletes and bulk inserts. And the same goes for ingestion and ETL issues: some of the metrics I already mentioned, uniqueness, completeness, et cetera, can actually appear in the ETL, and we can detect them in the ETL process.

So how can we face this? This is a meta-model, a really high-level one. Normally we have a data pipeline, we already talked about data pipelines, for instance Spark, and then we have a Spark runner. I mean, I just develop a Spark job, in Structured Streaming for instance, and it runs on Spark.
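Before going on, to make a few of those dimensions concrete, here is a minimal sketch of computing completeness, uniqueness, and timeliness as plain Spark aggregations. The dataset path and the event_id, user_id, event_time, and ingested_at columns are hypothetical assumptions of mine, not something from the platform described in the talk; the 20-second limit is the timeliness threshold example from above.

```python
# Minimal sketch, assuming a non-empty dataset with hypothetical columns:
# event_id (unique key), user_id (required), event_time, ingested_at.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-metrics").getOrCreate()
df = spark.read.parquet("/mnt/raw/events")  # hypothetical raw dataset

total = df.count()

# Completeness: share of rows where a required field is present.
completeness = df.filter(F.col("user_id").isNotNull()).count() / total

# Uniqueness: share of distinct values of a key that should be unique.
uniqueness = df.select("event_id").distinct().count() / total

# Timeliness: average lag in seconds between event time and ingestion time,
# compared against the 20-second threshold mentioned above.
avg_lag_seconds = df.select(
    F.avg(F.col("ingested_at").cast("long") - F.col("event_time").cast("long"))
).first()[0]
timeliness_ok = avg_lag_seconds is not None and avg_lag_seconds <= 20

print(completeness, uniqueness, avg_lag_seconds, timeliness_ok)
```

Each of these numbers can then be pushed to a dashboard and, as we will see, checked by a circuit breaker.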
So on top of this normal data pipeline infrastructure we can talk about data profiling and data quality. First of all, data quality: we define a data quality dimension that applies to an entity, a dataset. The same goes for data profiling. The result of the profiling goes into a dashboard, and that is how we face the first challenge, which is that data quality is really time-consuming. If we are monitoring, and we have a real-time dashboard for data quality, we really know where the issue is. We are not going to ask the front-end developers, the back-end developers, and the data engineers: where is the issue, actually?

This is just an example. We talked about the different data quality dimensions; threshold is the first one. We have a spike, right? Is this real or not? It's a huge spike. Okay, let's move on to the other data quality metrics we have. This one is about duplicates and schema: it seems we don't have duplicates, so it could be a normal spike, right? What about timeliness? Do we have a huge timeliness problem that day? Maybe something is broken, or we are not ingesting as expected? It seems not. And what about validity and conformity? Is the data we are receiving and ingesting schema-compliant, or are we seeing different fields, different schema changes here? It seems not. So having a really simple monitoring tool can actually help you: instead of taking hours or days, maybe in minutes or in a few hours you can figure out what's going on with data quality.

So we have more or less solved the first challenge, the time-consuming one where we assumed good data quality, by applying monitoring and a data quality dashboard. What about the second one? We talked about propagating low quality data. Guys, we talked about this meta-model: can anybody guess what is missing here? We need to actually act on the data pipelines that we have. It's not enough to monitor and have a dashboard to see which data quality issues we are having; we also need to act on the data pipelines that are moving the data from the source to our data store, or to whatever we have. We can use circuit breakers for that. The circuit breaker is a well-known pattern from software development, and we can use it to act on the data pipelines. What does this mean? Based on the data quality thresholds that we define (for instance, I don't want uniqueness to drop below 98%, I don't want more duplicates than that, or whatever), we can act on the data pipeline as a result: we open the circuit or close the circuit depending on the threshold that we are monitoring. And we can change those thresholds and act completely differently on our data pipelines.
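Here is a minimal sketch of that circuit breaker idea, under my own assumptions: the metric names, the 98% uniqueness limit from the example above, and the three states are illustrative, not the actual implementation. The soft and hard behaviour it distinguishes is exactly what I describe next.

```python
# Minimal circuit breaker sketch: thresholds and metric names are hypothetical.
HARD_THRESHOLDS = {"uniqueness": 0.98, "completeness": 0.95}   # block the load
SOFT_THRESHOLDS = {"uniqueness": 0.995, "completeness": 0.99}  # alert only

def evaluate_circuit(metrics: dict) -> str:
    """Return 'open' (block), 'soft-open' (alert only) or 'closed' (data flows)."""
    for name, limit in HARD_THRESHOLDS.items():
        if metrics.get(name, 1.0) < limit:
            return "open"       # hard breach: open the circuit, stop moving data
    for name, limit in SOFT_THRESHOLDS.items():
        if metrics.get(name, 1.0) < limit:
            return "soft-open"  # soft breach: let the data through, but alert
    return "closed"

state = evaluate_circuit({"uniqueness": 0.97, "completeness": 0.99})
if state == "open":
    raise RuntimeError("circuit open: quality below hard threshold, load aborted")
if state == "soft-open":
    print("soft alert sent to the data quality dashboard")  # placeholder alert
else:
    print("circuit closed: moving data to the final data store")
```

Changing the threshold dictionaries is the "act completely differently" part: the same pipeline becomes stricter or more permissive without touching its code.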
By the way, this part is about profiling, not about data quality. To enable circuit breakers we need to go beyond the typical metadata, you know, ingestion, freshness, which datasets we have, et cetera. Maybe we also need profiling for the data pipelines we are running: is the pipeline running or not, how much time is it taking, et cetera.

So this is how it actually works. We can define soft alerts and hard alerts related to our thresholds. We can be really strict with our data pipeline and say: I don't want to move data from A to B if my threshold is badly breached. Or I can say: okay, this is good enough (do you remember the good enough?), so I'm fine with this data, I move it and just send an alert to the dashboard. As I said, the circuit can work in a soft or a hard way. In the soft way, since we have the monitoring tool, we send alerts to it. In the hard way, we act on the data pipeline and we do not move the data to our final data store. This is just an example of some of the alerts we can define in our data platforms: soft alerts, hard alerts, et cetera.

So this is the path we take when we apply the hard alerts: basically, we shift from an unbounded data stream to a bounded data stream. We act on our data stream and say: I don't want to pass along data that is merely good enough; I don't want to move bad quality data into my data store, right guys? What does that mean? It's a trade-off of availability versus quality. Maybe you prefer to take all the data into your data store; for you it's just good enough, and you are responsible and accountable for that. Or maybe you want to act on the data pipelines and say: no, I don't want this garbage in my data store.

So, we talked about data quality monitoring. Before, we were ingesting all the data and all the data had the same flavor; everything was black. With monitoring, I can say: this is flavor A, this is flavor B, this is flavor C. What about circuit breakers? Without them, all the data was available, even having different flavors. With circuit breakers, we can act and keep only good quality data in our data stores.

But we can go beyond this. We have applied all of this in different data platforms in the past, the data quality dashboards and the circuit breakers acting on data pipelines, but we can go further. Instead of fixed thresholds, we can use anomaly detection to check for these data quality issues. And this is really important: we started with a huge amount of manual work, then we moved to a dashboard, then to circuit breakers, and now maybe to a really tiny model that runs on top of our data quality platform.
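As a minimal sketch of what such a tiny model can look like, assuming nothing more than a history of one metric (here, hypothetical daily row counts), a simple z-score test already separates a suspicious spike from normal variation:

```python
# Minimal anomaly detection sketch over a data quality metric's history.
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, z_limit: float = 3.0) -> bool:
    """Flag `current` when it deviates strongly from the metric's history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_limit

# Hypothetical daily row counts for an ingested dataset.
row_counts = [10_120, 9_980, 10_250, 10_050, 10_180]
print(is_anomalous(row_counts, 25_400))  # True: the spike looks anomalous
print(is_anomalous(row_counts, 10_300))  # False: within normal variation
```

In a real platform you would replace the z-score with something seasonality-aware, but the wiring stays the same: the detector's verdict feeds the dashboard and the circuit breaker instead of a fixed threshold.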
Guys, I want to share some things that I think are quite important, and maybe you agree or not. If we don't monitor data quality and take care of data quality from day zero, we are going to end up with garbage in our data platform. We can spend a lot of money, a lot of time, and our best engineers on our data platform, and it will still be useless. The same goes for the technical perspective and for the schema: we should go beyond syntactic checking. This morning, I think it was Oscar who talked about ontologies, right? We should also use ontologies and semantic checking for the data we are ingesting. Also invest in good data quality monitoring, as we discussed here, and shift from an unbounded context to a bounded context with the circuit breaker pattern, if you agree with that. Maybe you would rather have all the data there in your data store, and you prefer availability over data quality; that is just your choice. And the same goes for metadata: metadata is important, not just having all the datasets.

Some takeaways. I would like to say that we need to move from GIGO to GINO: from garbage in, garbage out, to garbage in, no out. For me this is really important: we receive garbage and we say, okay, I won't put this garbage in my data store; I will decide later what I want to do with it. And the same goes for the rest of the things we already mentioned: monitor, control, and automate. You can provide visibility to the stakeholders, so anybody on your data team can be aware of what is going on and what could happen in the near future with the platform: if you ingest new data sources, whether you are going to see new data quality issues, whether you can run new data quality metrics on top of the ones you already defined, et cetera. But the important thing is that you need to take care of data quality from day zero. And this is not just me saying it: AI is making headlines, but data quality is making headlines too. I would like to finish here, because I don't have much time left. Thank you. I'm open to questions.

Thank you, Eiter. Any questions? We still have some time.

Which tools would you recommend for implementing all the things we are talking about, in terms of data quality and so on?

Yeah. In the market you can find different data quality SaaS products. In our case, everything was based on Spark; yesterday we also talked about Delta Lake, et cetera. So per metric we built a different notebook, a Spark job, and we ran those Spark jobs as we ran the different data pipelines. So we ended up with data quality pipelines running those different jobs, and the output of the jobs goes to the dashboard. The same goes for the circuit breakers: in that case we have different jobs, and using the API you can trigger the jobs you are running based on the results of the data quality jobs. There is a minimal sketch of this setup after the Q&A.

Thanks. Do you have support from the business side for the data quality implementation, or for the definition of the metrics?

In this case, no. In this case we assumed we were defining those data quality metrics as we expected them to be. But as I said, you can find a lot of literature out there: uniqueness, or duplicates, is actually well defined. In our team we agreed on what validity means, and we wrote down what it actually means, because in the literature you can find different definitions. So for me the important thing is: we did not count on the business for this, but we did agree as a team on what validity, completeness, and the other metrics mean.
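For reference, here is a minimal sketch of the setup described in the first answer: one Spark job per metric, appending its result to a table that backs the dashboard. The paths, table layout, and column names are my own assumptions for illustration, not the actual implementation.

```python
# Minimal per-metric job sketch: compute one metric, append it to the
# table that the data quality dashboard reads. Names are hypothetical.
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-uniqueness-job").getOrCreate()
df = spark.read.parquet("/mnt/raw/events")  # hypothetical dataset

# The single metric this notebook/job is responsible for.
total = df.count()
uniqueness = df.select("event_id").distinct().count() / total if total else 0.0

# Append the measurement to the dashboard's backing table.
result = spark.createDataFrame(
    [(datetime.now(timezone.utc).isoformat(), "events", "uniqueness", float(uniqueness))],
    ["measured_at", "dataset", "metric", "value"],
)
result.write.mode("append").parquet("/mnt/dq/metrics")
```

A circuit breaker job can then read the latest rows of the same table before deciding, through the orchestrator's API, whether the downstream pipeline run is triggered.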