Welcome everyone to "Data Quality: The Holy Grail for a Data-Fluent Organization" by Balwinder Khurana. So without any further delay, over to you, Balwinder.

Thanks, Vijay, and thank you everyone for joining. A very good morning, good evening, or good afternoon to all of you, wherever you are based, and a very warm welcome to Agile India 2022. Without wasting any time, I'll quickly start. Today I'll be talking about data quality, and it is indeed a holy grail for any organization that wants to be a data-fluent organization. Aside from big data platforms and big data ecosystems, I think all of us have worked with data in one form or another. So let's keep it conversational; do pitch in with your comments and any questions you have as we go.

The last decade was the decade when every organization wanted to become a tech organization. Now almost every organization aspires to become a data-driven organization. Enterprises have started collecting huge volumes of data from various sources, inside and outside the organization. They are even collecting social media feeds, and they are doing it in real time. They hope to get insights from this data to make smarter decisions for their business. One of the biggest challenges in leveraging this data, however, is that not all of it is useful. If I give a one-star review for a product with no text to support it, that data point is pretty much useless for anyone who wants to use it. The quality of the data is not up to the mark. And even though we have made a lot of progress in the ways of implementing data platforms — we have wonderful architectures and cutting-edge tech stacks to implement them — the fundamental issue of the quality of data flowing through these platforms is still there.

So today I'll be talking about some of these issues: what causes them, and how we can fix them. I will talk about a data quality framework that we created and have used with a number of our clients, and we've seen how this framework actually alleviates some of the challenges created by poor data quality. I will also briefly touch upon domain data boundaries and data platforms, and how we can shift the data quality aspect of any data platform to the left. And I'll do all of this using a case study, so you will get more examples and more perspectives.

Moving on, a little bit about me. My name is Balwinder Khurana. I have 15 years of experience and I work extensively with data-driven systems: creating solution and data architectures for clients, building data platforms, and working with them to create their data strategy. Currently I'm working as a data architect at ThoughtWorks, and I'm also leading the global data community in ThoughtWorks.

As I said, I've worked with multiple clients, mostly in the area of how data can accelerate or transform their business. But whenever we meet enterprise leaders and talk to them, we often hear that they do not know the complete picture of their business and do not know how to use data to join the dots. Some of them also say that they do not know if they can monetize their data; and even if they know they can, they do not know how, or what the hurdles in the way of data monetization are. Some of them fail to use data to differentiate themselves and stay ahead of the competition.
I think the worst part is that some of them have said their own teams do not trust the data coming out of the data platform, and so they use the age-old method of spreadsheets for finding insights or for decision-making. Some organizations, even though they have data available, find that it is not usable for data science algorithms. These are the things we often hear. But if you look closely, all of these are symptoms, not causes, of data platform failure.

Moving on: essentially, what happens is garbage in, garbage out. It is a very old and very clichéd way of saying this, but you can build a very modern data platform with all the right tools and technologies in place, and if the data quality is not up to the mark, the analysis on that data is not going to be up to the mark either. You will see all of those symptoms I just showed, and you will see your teams start losing trust in the data.

So now let's start peeling the layers. We know the symptoms; let's get to the causes. But before we do that, let me introduce the case study that I'm going to use. I'm going to use the case study of a retailer, because all of us can quickly connect to the retail world; we've been using retail for a long time. This retailer wants to price their articles. They have been collecting a lot of data for some time now, and they feel they can use this data to intelligently price the assortment of articles they have. Data points like: what is the demand for my article? How are competitors playing with the prices of these articles? Does my article show any seasonal behavior? Using these and many, many more such dimensions and data points, they want to dynamically price their articles.

However, as I said, the business has very low trust in the data in the data platform. Along with the solution, they also want answers to the critical questions you see on the screen. Essentially: how do you measure the correctness of the prices? These prices are recommendations, so in order to measure their correctness, you have to measure many more data points, compare them against the SLAs you initially agreed on, and then establish the correctness of your prices. Essentially, you are validating the accuracy of your analytics. And as I said, untrustworthiness in analytics and insights is not because of the system that generates them, but because of the quality of the data that feeds them.

In these big analytics systems — the data platforms and the data pipelines — data flows through multiple stages, and a lot of transformations are done on it. What we want to do is bring in transparency as data flows through these stages, and then see how this transparency can re-establish trust in the data and insights we are providing to our business teams. For the analysts and the development teams working with this data, we want to enable them to discover and use the data being collected in various systems — so we want the discoverability aspect of data quality as well. And we want to ensure all the compliances: any legal or regulatory compliance requirements we have.
And the most important part: we want to find out who is going to own this data quality, who is responsible for ensuring it. Even if you have answers to all of the questions above, if you don't know who is going to take care of data quality issues, who is going to fix them, and where they are going to be fixed, those answers are pretty much useless for you. So this is the use case we'll be talking about, and I'll use examples from it.

Going back to the causes: we have already seen the signs, or symptoms, of data platform failure. But if you look under the hood, insufficient data quality comes from certain systemic issues.

These systemic issues start here: addressing data quality late in the process. What happens is that data teams spend a disproportionate amount of time tinkering with data quality in the downstream systems rather than at the source. This wastes time and hurts overall data quality. In my retailer example, say the prices and sales that I feed into my retail pricing algorithm are incorrect to begin with. Fixing data quality issues in the price recommendations coming out of the algorithm is not going to improve the overall quality.

What also happens as a result is that, because I'm fixing my issues in the downstream system, I lose a lot of context. And when I fix issues without context, with a myopic view, I might create issues in other places. Take this retailer again: let's say I'm also generating some reports. I have price information and sales information, I'm doing some aggregations, and I'm generating reports. There could be aggregated values that feel incorrect. Because I do not have any context, I don't know whether they are correct or not. Say one particular store has run a short-term discount on prices: the aggregate price is going to seem incorrect, but if I attach the context of that short-term discount, then it is correct. So missing context is also a consequence of fixing issues downstream.

Most organizations work without any strategy. They do not have an integrated process or framework to fix data quality issues across the entire data platform or ecosystem. Now, having said this, we are not saying you should have one central place where all of your data quality issues get fixed. What we mean is that you should have a self-service framework that you provide to the teams working with data, so they can execute data quality themselves — still executing with the same strategy.

We've also seen that organizations do not invest enough time and effort in creating definitions for data. Each team working with the data has its own definition and fixes quality issues according to that definition. This leads to more trust issues in the data: because one team used one definition in one place, you will see something there and something else in another place. As I said, because you are not addressing data quality uniformly across the organization, you do not have uniform definitions.

And the solutions teams create for fixing data quality are point solutions, looked at only through the lens of the viewer — the person who is going to use that data, not the person who is going to provide it.
So again, the price recommendation example: let's say you got some price recommendations out of the algorithm. Because you are going to consume this data and you feel the recommended prices are not correct, or a little off, you start applying rules on top of the recommended prices. And now you don't know whether those prices were off because the input was incorrect, or because the engine generating the prices is itself incorrect. These point solutions do not fix data quality holistically.

And the most important systemic issue that causes data quality failures is underestimating the impact. You might feel that this one KPI is incorrect, and it appears in only one report consumed by only ten people, so it's only impacting ten people. But what actually happens is that teams start losing trust in the data platform, and in the longer run it becomes an impediment to change management. Your data initiatives and data platforms generally drive change management in the organization: they tell you whether your business processes are working correctly, and if not, that you should change them. But if you do not trust your data platform itself, you will not make those changes. So it's a huge impediment to change management within an organization.

So what we are trying to say is: treat your data as you treat your code. Apply the same rigor of quality to data that you apply to code. You would write an entire test pyramid for your code, testing everything from the smallest function to the integrated layer to the entire application. Use similar rigor on your data as well — something like the sketch below.
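To make this concrete, here is a minimal sketch of a "test pyramid for data" in Python. It is illustrative only: the function name, field names, and the non-negative price rule are assumptions for the example, not the retailer's actual checks.

```python
# Illustrative sketch: the same red/green rigour we apply to code,
# applied to a batch of records. All names here are hypothetical.

def transform_price(raw_price: str) -> float:
    """Toy transformation under test: parse a raw price field."""
    return round(float(raw_price.strip()), 2)

def test_transform_price_parses_valid_input():
    # Unit test at the bottom of the pyramid: the smallest function.
    assert transform_price(" 199.99 ") == 199.99

def test_no_negative_prices(records):
    # Data-level test higher up the pyramid: a whole batch of records.
    bad = [r for r in records if r["price"] < 0]
    assert not bad, f"{len(bad)} records violate the non-negative price rule"

if __name__ == "__main__":
    test_transform_price_parses_valid_input()
    test_no_negative_prices([{"sku": "A1", "price": 99.0},
                             {"sku": "A2", "price": 150.0}])
    print("all data checks passed")
```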
So now we've seen the symptoms and the signs, and we've seen the causes, and now we want to fix them. But before we do that, let's look at the definition of data quality, because that's how we know how to fix these issues. Data quality refers to the ability of a given set of data to fulfill an intended purpose. Because it is tied to a purpose, it is important to note that the definition of data quality will keep changing. What is good quality data today might not be tomorrow, because my requirement, and therefore my definition, has changed. The analogy I use is potato quality in a fast food chain. All of us have been to fast food restaurants, and they make French fries with potatoes. To make French fries, my quality requirement is big, round potatoes: the purpose is making French fries, the quality requirement is big potatoes. But if my purpose changes — say I want to make mashed potatoes — that quality requirement is no longer valid. So as your requirements change, your data quality definition keeps changing. It is important to always remember the purpose for which you are going to use your data.

Furthermore, if you double-click into this definition, there are multiple dimensions of data quality. If you work with data-driven systems or data platforms, you will have encountered some of these dimensions while discussing data quality issues. We can say these dimensions form the tactical definition of data quality, and this too is ever-changing: as your requirements change, the dimensions change. At one point in time you might be talking about how relevant your data is for your problem; at another, about the reliability of your data. If you're working with near-real-time or real-time reports, you might be talking about the availability or timeliness of data. And there are more such dimensions: how usable is your data? Does it follow standard rules? Is it trustworthy? Is it consistent over time? Is its integrity preserved? Is it unique? All of these dimensions form the tactical definition of your data quality.

These can be unpacked further, one level deeper: there are multiple tenets to each dimension. Take the example of availability. We say the data should be available, but what does that actually mean? For data to be available, it should be accessible — the user should be able to navigate to it. It should be available in time — there is no use for data if you wanted a report today and you are getting it tomorrow. And you should be authorized to see that data, authorized to access that report. There are multiple such tenets forming each dimension of data quality, and, going back to the purpose, you need to figure out which tenets and which dimensions are applicable to you.

Cool. So now we have a complete understanding of what we want to do and why. Someone might think that data has always been important, right? Even with transactional systems, you want to ensure your data is of high quality. Our relational databases used to have this as a selling point: they are ACID compliant and data quality is preserved — and some of those ACID properties show up among these dimensions as well. So why are we talking about data quality so much now, and why are we so deliberate and intentional about it? When it comes to big data ecosystems, the characteristics of big data themselves add to the complexity of data quality, and the impact of incorrect data and incorrect insights is also very large. As we said, data platforms are agents of change; if that agent itself is providing me incorrect insights, the impact is going to be really, really huge, and felt across the organization.

So let's look at some characteristics of big data ecosystems and how they make data quality complex. Most of you will be aware of these four Vs of big data — there are now seven, but let's talk about these four. The first one is volume. When you have, say, hundreds of records, or hundreds of thousands, it is easy to run comprehensive data quality checks on them: you can go to each record and validate whether it is correct. When you are dealing with petabytes of data for a giant e-commerce retailer — which is our use case, where the platform generates gigabytes of data per second — you cannot run comprehensive data quality checks on the entire data set. So enterprises start working with approximation: they take a batch of data, do some aggregations, and try to see whether the data quality is acceptable, using probability and confidence intervals. However, when they do this, they also start to approximate the metrics themselves. They'll say 90% is okay, or 80% is okay, and that compromises data quality. A sampled check along these lines is sketched below.
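Here is a minimal sketch, in Python, of the approximation idea just described: estimating a pass rate on a random sample rather than scanning everything. The record shape and the positive-price check are hypothetical examples.

```python
import random

def approximate_pass_rate(records, check, sample_size=10_000, seed=42):
    """Estimate the share of records passing a quality check by sampling,
    instead of scanning petabytes. The result is an approximation, so
    thresholds like "90% is okay" become explicit, visible decisions."""
    random.seed(seed)
    sample = random.sample(records, min(sample_size, len(records)))
    passed = sum(1 for r in sample if check(r))
    return passed / len(sample)

# Hypothetical usage: the price field must be a positive number.
records = [{"price": p} for p in range(1, 1001)] + [{"price": -5}] * 20
rate = approximate_pass_rate(records, lambda r: r["price"] > 0, sample_size=200)
print(f"estimated pass rate: {rate:.1%}")  # compare against an agreed threshold, e.g. 90%
```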
Then, as I said in the beginning, enterprises are collecting data from various sources, so you have variety in the data. An e-commerce retailer could have data about a product that includes product images; customer reviews that are text, images, or videos; and transaction records. So there are different types of records — it's no longer a structured RDBMS — and you need different ways to ascertain the quality of these different data types.

Then you have velocity. Every enterprise today wants to be real-time or near-real-time. So how do you take care of quality without compromising the timeliness of your insights? You might only be able to check the structural validity of your data; you cannot do semantic validation in that short amount of time.

And then there is veracity. I think veracity is the most complex facet of big data. Veracity refers to inherent impreciseness or noise in the data, and you will most often see it where there is a lot of human data entry. For example, in this retailer's stores there are a lot of point-of-sale terminals where you scan the bar code, and the entire information about the sale — price and discounts — is sent to the main system via the code scanner, and it is accurate. But a lot of the time manual entries are made, and these manually entered sales transactions can have errors. Veracity is complex because you have no direct way to identify such errors or inaccuracies. So, essentially: how do I validate petabytes of data for my retailer, who is sending me millions of records, text reviews, and images — per second — with a lot of veracity inside that data? That's what makes data quality complex in the big data world.

So now let's talk about the data quality framework. We've seen that data quality is an integral part of any data platform. With the right data quality baked into the system, we can answer a lot of the questions enterprises have been asking — we saw those questions — avoid a lot of the failure scenarios we saw earlier, and build the lost trust back into the data and the data platform. To do this, we need a comprehensive data quality framework, which I will talk about now.

Going back to the definition: we said that the data quality definition is attached to a purpose. We also saw, in the retailer use case, that most of those questions were attached to the purpose — the purpose of pricing. That's where your data quality framework begins, too. At the top of this framework you have your data quality definition based on the purpose; here the purpose is the pricing algorithm. You can have many more use cases, and for each use case you will have a different data quality definition. This is a hierarchical framework: at one end you have the data quality definition based on purpose — we call it fit for purpose — and at the other end you have your baseline data quality, or what we can call sensible defaults. The way to think about these rules is as a common denominator: irrespective of the purpose you are going to use your data for, these are the quality rules you always want in your data. Something like: a numeric field cannot hold alphanumeric data, or an email field must follow a certain standard. All such rules become your sensible defaults, your baseline data quality — for instance, the sketch below.
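A minimal sketch of such sensible defaults in Python, directly echoing the two examples from the talk (numeric field, well-formed email). Field names and the exact email pattern are illustrative assumptions.

```python
import re

# A deliberately simple email shape check -- illustrative, not RFC-complete.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Sensible defaults: the common-denominator rules applied to every record,
# whatever the downstream purpose. Field names are hypothetical.
BASELINE_RULES = {
    "quantity_is_numeric": lambda r: isinstance(r.get("quantity"), (int, float)),
    "email_is_well_formed": lambda r: bool(EMAIL_RE.match(r.get("email", ""))),
}

def run_baseline(record):
    """Return the names of baseline rules this record violates."""
    return [name for name, rule in BASELINE_RULES.items() if not rule(record)]

print(run_baseline({"quantity": "12abc", "email": "not-an-email"}))
# -> ['quantity_is_numeric', 'email_is_well_formed']
```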
And now you want to move from this baseline data quality to your fit-for-purpose data quality, and you do that using a critical path. What do we mean by critical path? We saw that data quality has multiple dimensions, and each dimension has many tenets. If you tried to fix data quality across every dimension, you would invest a lot of time and effort, and you might not even need all of those dimensions for your purpose. The idea is to do just enough — that is your critical path. And as you move from baseline to fit for purpose along the critical path, you generate intermediate layers of data quality. There could be purposes that do not need such high data quality standards — something like ad hoc analysis — and that sits in between, forming another layer of your data quality.

Now, each layer in this framework also has things going on inside it. For each layer, for the data quality dimensions you have chosen, how do you know where you want to be on those dimensions? Let's say timeliness is my dimension. How do I measure it? Getting data within one day, or within one hour — what is my timeliness requirement? You need thresholds, or metrics, for each of these dimensions, and to ascertain whether you meet those thresholds you have rules. You might say: I need data that is no older than one day, but at times I am ready to accept data that is two days old — that becomes your threshold. So you create your rules, you execute them, and after executing them you get reports. A report might say: most of my data arrived within one day, but these 10,000 records arrived later, and this was the reason. Or, if accuracy is one of your dimensions and you said you want to be 90% accurate, the report could say: most of my records are correct, some have erroneous values, and accuracy currently reaches 89%. From such reports you now know you have to go from 89% to 90%, or from a 1.5-day delay to a one-day delay. That feeds into your data quality improvement. This is how you improve data quality for each layer — those are the improvement loops you run for each layer, for each of its dimensions; a small sketch follows below. And this completes my conceptual data quality framework.
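Before moving on to the supporting pillars, here is a hedged sketch of the threshold-and-report loop just described. The dimension names, targets, and tolerated values are assumptions taken from the talk's examples (90% accuracy, one day versus two days of delay), not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class DimensionTarget:
    dimension: str    # e.g. "accuracy" or "timeliness"
    target: float     # where you want to be, e.g. 0.90 accuracy
    tolerated: float  # what you accept for now, e.g. 0.85

def quality_report(measured: dict, targets: list) -> list:
    """Compare measured metrics against agreed thresholds and emit the
    report that feeds the improvement loop for this layer."""
    lines = []
    for t in targets:
        value = measured[t.dimension]
        status = ("meets target" if value >= t.target
                  else "tolerated, improve" if value >= t.tolerated
                  else "BREACH")
        lines.append(f"{t.dimension}: measured {value:.2f} "
                     f"vs target {t.target:.2f} -> {status}")
    return lines

targets = [DimensionTarget("accuracy", 0.90, 0.85),
           DimensionTarget("timeliness", 1.00, 0.50)]  # share of records within one day
print("\n".join(quality_report({"accuracy": 0.89, "timeliness": 0.97}, targets)))
```

Run against the 89% accuracy figure from the talk, this would report "tolerated, improve" — exactly the signal the improvement loop needs.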
Definitely, it's a big data ecosystem, and this framework cannot exist in isolation; there are pillars that support it, and all of this falls under the bigger umbrella of data governance. Just to mention a few: you have data discovery. As I said, development teams will want to know where specific data is stored. If I'm using AWS S3 buckets, what is my bucket location for prices data? What is my bucket location for sales data? Can I navigate to that bucket? If new data is onboarded into my S3, what is its location? That is all your data discovery. You need repository and indexing services, which cover the metadata of your data: what is the schema? If I'm using price information, what does a record look like? What are the relations — how is price related to sales? What is the lineage? Then you need metadata services: the services, or API calls, that help you manage this metadata. With all of these data points you are storing, you also need semantics: how do I attach a business glossary to my data? How do I attach technical metadata? How do I attach lineage?

And most important of all is the ownership of data quality. If I find any issues in the data quality, who is going to fix them? Which team is going to fix them, and where? What are the SLOs and SLAs for each of these layers? And the ownership of data quality is not restricted to the data platform or this framework; it goes beyond them, to the business process as well. You might discover an issue that tells you that you need to fix your business process. That's where ownership comes in.

Cool. That was a little heavy, so let's try to understand the entire conceptual framework again using the retailer example. I will also talk very briefly about domain data products here. Traditionally in organizations, data users and data providers are different groups. There is one group that maintains your data platform, and there are other groups that maintain your operational, transactional systems. The transactional systems generate the data, and the data platform uses it. These groups could even be in different organizations; they have different goals and different operating procedures, and therefore their notion of data quality is also different. Data providers often do not know how data users are going to use their data or what the business use cases are — a lot of the time they don't even care. And this disconnect between data providers and data users is one of the prime reasons behind data quality issues.

The way we try to fix this is by defining domain data products. What we are saying is: make one team entirely responsible for the whole thing, providing the data as well as consuming it. Because this is a single team working on the entire domain, they also understand the data quality rules the domain should adhere to. This concept of domain data products, and of building a data platform out of them, is called data mesh. If any of you needs more information, you can reach out to me, but essentially the idea is to divide the ownership — and because you divide the ownership, it aids your data quality, as we will see.

For the retailer, here are some example domain data products I've picked. Articles: information about an article — its name, any code the retailer uses for it, and the hierarchical category it belongs to. Article prices: what is the price of an article in a store in Pune versus a store in Delhi, versus on the website? Sales: what are the sales of an article — this is invoices data. Competitor prices: how are competitors pricing the same article? All of these are your domain data products, and now you define your quality rules according to your domain. These, if you go back to my data quality framework, are the baseline data quality rules: you want to adhere to them irrespective of purpose. A baseline rule for articles could be that every article has a fixed category.
Say I have categories like edible and non-edible, and within edible I have fresh vegetables, packaged food, meat, and dairy. Every article should have a category from this predefined set; it cannot have some random category, because I cannot handle that. That's a baseline data quality rule. Then there are article prices. A rule can say: I don't want any negative prices — that's illogical — so, no negative prices. Or: there should be a unique price point — at a given point in time, an article cannot be sold at Rs 1000 and at Rs 2000 at the same time. For sales, now, consider return sales. As a retailer, a return sale is also a line item in my sales data, and for a return sale the total amount will legitimately be negative. You see the difference the domain makes: for article prices I was not allowing negative values, but for sales I can allow them. That's how your domains aid data quality — you become clearer and more intentional in your data quality rules.

Now you can combine some of these domains to create further domains that feed into your eventual use case. Going back to my use case, the dynamic pricing algorithm: I need article prices as well as sales data to come up with a price recommendation, because I want to see how sales change when I change my prices. At the bottom layer we were talking about baseline data quality rules, because we did not yet know the purpose. But now we know the purpose — the dynamic pricing algorithm — so we can talk about data quality rules based on purpose. My rule can say: there should be no outliers in the prices, no very high or very low prices, because they would skew my algorithm. But say I'm using the same data for another purpose: reporting. In that case the outlier rule is no longer valid, because in a report I definitely want to see the outliers and fix them. The purpose defines my data quality rule. And that's how you fit the entire data quality framework together using purpose and baselines — a small sketch of these domain and purpose rules follows.
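A minimal sketch in Python of how the same data gets different rules per domain and per purpose. The category set, field names, and the z-score outlier cut-off are illustrative assumptions, not the retailer's actual rules.

```python
from statistics import mean, stdev

# Domain (baseline) rules -- valid whatever the downstream purpose.
ARTICLE_CATEGORIES = {"fresh vegetables", "packaged food", "meat", "dairy", "non-edible"}

def check_article(article: dict) -> None:
    # Every article must carry a category from the predefined set.
    assert article["category"] in ARTICLE_CATEGORIES, "unknown article category"

def check_article_price(row: dict) -> None:
    # Prices can never be negative in the article-price domain.
    assert row["price"] >= 0, "negative article price"

def check_sale(row: dict) -> None:
    # Return sales legitimately carry negative amounts, so no sign rule here.
    assert "amount" in row, "sale line item missing amount"

# Purpose-specific rule: the pricing algorithm wants outliers removed,
# while a reporting purpose would deliberately keep them.
def drop_price_outliers(prices: list, z: float = 3.0) -> list:
    m, s = mean(prices), stdev(prices)
    return prices if s == 0 else [p for p in prices if abs(p - m) <= z * s]

check_article({"name": "milk 1L", "category": "dairy"})
check_article_price({"sku": "A1", "price": 45.0})
check_sale({"sku": "A1", "amount": -45.0})  # a return sale, and that's fine
```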
Now we have seen how data quality works through one lens, the hierarchical framework moving from baseline to purpose. When you actually implement the data platform, there is another lens: data movement, from source to insight. These are the technical boundaries your data moves through across the platform, and this is where we start talking about shifting data quality left. We saw, among the systemic issues, that failure happens because we fix issues downstream. If you had data quality issues while ingesting from the source — the source itself was giving you wrong data — and you try to fix them only when you extract information from that data, you have lost all that context. You are working with data you don't completely understand, and you might not be able to fix it correctly either. So we are saying: identify data quality issues as early in the process as possible. At each stage — source, ingestion, storage, information — you run your data quality rules. You use the same hierarchical framework, but you run the rules at each of these transformations, figure out the issues, and fix them right there.

Some examples of checks along such a data flow: when you have not even started your ETL and have just ingested the data, you check the format. If I'm expecting JSON from a source and I encounter binary data, that's a data quality check failure, and I want to fix it — I would ask the source why it is giving me binary data. Or I check the completeness of the data: if I'm running batch pipelines every 24 hours, I want to confirm that within those 24 hours, the number of records I consumed equals the number of records the source provided — you do that reconciliation. After you consume the data, you have ETLs and simulations, so you have checks there: you verify that your data transformations are working correctly, and you can write unit tests for your transformation functions. So now you start building a data quality test pyramid, just as you have a code quality test pyramid. You check for data completeness again; you check that correct metadata is generated. For your data science algorithms, you also validate your models; for aggregations, you validate all the aggregations you do. Even at the consumption layer, you have UI validation — validation of representation. If you're displaying dates and the SLA is to show them in DD-MM-YY format, you have that data quality check in place; you also look at how intuitive your reports are. All of these checks keep happening as your data flows through the pipeline — for instance, the sketch below.
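A minimal sketch of the two earliest, shift-left checks from the talk — format validation at ingestion and batch reconciliation — in Python. The payload shapes are hypothetical.

```python
import json

def check_format(raw_bytes: bytes) -> bool:
    """Ingestion-time check: we expect JSON from this source; binary or
    malformed payloads fail fast, before any ETL runs."""
    try:
        json.loads(raw_bytes.decode("utf-8"))
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False

def check_completeness(consumed_count: int, source_count: int) -> bool:
    """Reconciliation for a 24-hour batch: records consumed must equal
    the records the source says it produced."""
    return consumed_count == source_count

assert check_format(b'{"sku": "A1", "price": 99.0}')   # well-formed JSON
assert not check_format(b"\x00\x01\x02")                # binary -> fail fast
assert check_completeness(1_000_000, 1_000_000)         # batch reconciles
```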
And I think this is one crucial bit: we talked about the continuous data quality improvement loop in each layer of the framework, and this is essentially the double-click into that loop. It is very similar to how you operate in the software world: you plan, you assess and develop, you analyze whether the software conforms to the plan, and then you keep maintaining it. Similarly, to operationalize data quality, the first thing is to figure out your goals for collecting the data — remember that data quality is defined by a purpose, so we want to know the goal of collecting this data. Depending on the goal, I figure out which data quality dimensions apply, out of those ten, twenty, or more, and which dimension matters most for this goal. Then I figure out KPIs. Earlier I gave the example of accuracy: I could say that for this goal I want my data to be really accurate, and the KPI I currently want to operate on is 90% accuracy. Then you create your evaluation baseline. Sometimes you also run a quick pilot to see whether the data is way off — and if it is, there is no use in building a data platform; you'd want to fix that first. With that quick pilot you are basically doing a feasibility check for the data platform. Once the pilot is done, you start creating your data platform: you collect your data, you clean it, and you assess it against your evaluation baseline. That assessment tells you whether you are satisfying your goals.

If not, you go back a step: collect more data if required, or otherwise cleanse the data further. You iterate until you reach your evaluation baseline. Once you satisfy your goals, you are good to go ahead and use the data for its eventual purpose. The important thing to note is that you are always generating your data quality reports. Whether or not you satisfy your goals, you need those reports to give you the direction in which to move. And this data quality life cycle is not a do-once-and-done exercise; it is a continuous improvement loop. Once you complete one iteration, you might have new goals, or new baselines: you said you wanted 90% accuracy, but now you've made certain changes and want 95%, so you get into another cycle.

So, summarizing: you identify the data quality issues and rules, quantify them, prioritize which ones to fix first based on your goals, and then mitigate them. That's the summary of operationalizing data quality, and that's what I would like to leave you with. I think that's all I had on data quality. Any questions, suggestions, or thoughts?

Thanks, Balwinder. I don't see anything in the chat, comments, or Q&A section. Thanks a lot, everyone. Thank you.