Welcome everyone to this session on deciphering the way data is tested: automating the movement, transformation and visualization of data, by Rasini Singh. We are glad she could join us today. Without further delay, over to you, Rasini.

Thank you, and welcome everyone; thank you for your participation in this session. So why this topic? First of all, I am grateful to the APM conference for selecting my topic. This is one of the most challenging things we see today, and there are many different use cases: AI is there with data, IoT is there, so everything depends on data and on how that data can be tested. That is why we need to test data in a different way, whether it is normal data quality testing or automating the movement of data, not only ETL-style movement but the pipeline as a whole, and whether, in batch processes or real-time processes, there is any transformation or visualization involved. This topic will help you understand the different kinds of testing, the different ways of testing, and the different use cases where data is involved.

With respect to current technologies and the business requirements for software development and verification, the testing team often does not fully understand the implications of data on the design, the business, the configuration, or the operation of the system and the database. Sometimes we do not go beyond the front end, or some back-end activities, and we do not understand the business insight behind the data. Based on the stats, the big data analytics market is set to reach 103 billion dollars by 2023, which is not very far from now. Internet users generate about 2.5 billion bytes of data each day, and over 3.5 billion searches are made daily.

Sorry to interrupt. Would you mind moving the camera a little lower so that we can see? Okay. Yeah, much better. Thank you.

And then bad-quality data: what are the implications and how can it impact you? Based on a Harvard Business Review study, only about 3% of companies' data meets basic quality standards. And 80 to 90% of the data we generate today is unstructured, while 95% of businesses cite the need to manage unstructured data as a problem for their business. Some of the use cases where data is generated: in the Internet of Things there are sensors and actuators generating data, and developers and technology people are transforming, interpreting and processing that data, and then visualization is there. Big data applications have lots of data; there may be modelling to do, artificial intelligence or machine learning. It all comes down to the application's interpretation: the product output depends on what data goes in and what data comes out, plus the ETL processes, data migration and those kinds of testing. I will also cover what ETL testing is, why it is different from data pipeline testing, and what the differences are.

The first step, as a tester and for any data management plan, is to test the quality of the data and identify some of the core issues that lead to poor quality. The tester needs to understand this and have a clear plan to execute the tests, but there are many new, unknown data systems layered on top of your enterprise systems, and teams struggle with their quality. On top of those troubles there are challenges in updating and pushing that information into data analytics or predictive software.
Okay, so how will you measure the quality of data? What are the dimensions of data, how is this structured or unstructured data generated, what statistical processes apply, how can you confirm that a highly concurrent system does not have a deadlock? What tools should we use, what are the different processes and models, what all needs to be tested, which framework should you use? There are a lot of tools, libraries and frameworks available. So let us proceed with what data quality is and why it matters.

There is no single definition of data quality that limits its scope. There are, however, benchmarks: data should be error-free, complete, accurate, consistent, unique, and timely, meaning it should not be obsolete and should still hold today. These are the six dimensions you can take care of by testing the data. And why do these matter? Because if your data is not accurate, consistent or complete, you can lose business. Direct email or marketing campaigns are run on that data, so there would be unnecessary cost, and business decisions made on flawed data can give you incorrect insights.

To start with that understanding, there are prerequisites for data quality testing: first you need to know the purpose of your data, the data quality metrics you will follow, and the metadata of your data feeds. To proceed, understand the volume of data, the source of data, the frequency of data, the transitions it goes through and the types of data. Later, once the business rules are applied, we can use this for visualization. There are two key problem areas in this testing: establishing an efficient test data set, and the availability of an HDFS-centric testing tool. Earlier such tools were not available; now there are a lot of tools, whether open source or commercial, although not every business is willing to push its data through open-source tools, because the data should not become vulnerable.

These tests include data type validation, range and constraint validation, code and cross-reference validation, and structured validation. You can achieve this with an in-house coded solution, and I will showcase one we built using open-source tools like Kafka, Spark, Hadoop and Cassandra. Manual validation is always there, and you do have to do manual validation. There are also open-source libraries available that you can use in your Python scripts, for example in Visual Studio Code. And there are automated self-service tools as well, with rule-based and suggestion-based engines.
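To make those basic validation types concrete (data-type, range and constraint, and code/cross-reference checks), here is a minimal Python sketch; the column names, reference table and price limit are assumptions made up for the example, not part of the speaker's framework.

```python
import pandas as pd

# Illustrative order data; column names and values are invented for this example.
orders = pd.DataFrame({
    "order_id":      [101, 102, 102, 104],
    "customer_id":   ["C1", "C2", None, "C9"],
    "product_price": [250.0, 7200.0, -10.0, 499.0],
})
customers = pd.DataFrame({"customer_id": ["C1", "C2", "C3"]})

# Data type validation: every order_id should be an integer.
assert pd.api.types.is_integer_dtype(orders["order_id"])

# Completeness and uniqueness: no missing customer_id, no duplicate order_id.
print("missing customer_id:", orders["customer_id"].isna().sum())
print("duplicate order_id :", orders["order_id"].duplicated().sum())

# Range and constraint validation: prices must be positive and below a business limit.
print(orders[(orders["product_price"] <= 0) | (orders["product_price"] > 7000)])

# Code / cross-reference validation: every customer_id must exist in the reference table.
print(orders[~orders["customer_id"].isin(customers["customer_id"])])
```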
However, before going into how we can do it, let me mention why it can be difficult to test. There is not a lot of expertise; we can say the libraries are there, but you need to know how to use them, and you need to know the tool integration and the processes. Most of the time I face this issue: a developer or a manager says that this is already automated. If you have a pipeline, it is already automated, and if you have any ETL kind of process, it is done through the tool only. So what is the use of QA, where does QA step in, where does QA validate? It is all an automated process, so whatever QA would do is supposedly verified by the pipeline itself.

So you do not know how you will test, where you should go, which phase, which node, and what all you need to test. You need to identify what needs to be tested and which tools are available and integrated. For this there are five steps towards better data quality.

The first is data discovery and profiling. This is the first and foremost step: you need to understand what data discovery is, how you are getting the data, the collection and extraction of data, and whether you have found the source of truth. Then profiling: in your database or data source there are spaces, special characters, null values, and questions about how each item should be displayed. If you look at this small sample, there are tables with spaces, underscores, strings, special characters, hashes and numbers, lots of inconsistent structures. If you know your data, you can understand your testing much better and do the data testing more easily. Profiling your data will help you discover the data quality issues, risks and overall challenges, see what the trend of your data is, and understand how you can analyse and report on it.

Then comes standardizing and matching. Once discovery and profiling are done, you know the data: what the patterns are, which fields and columns, which types of strings, and whether the data is structured or unstructured. Now you standardize those values and match the incoming data against them, checking whether there are outliers or not. If I am writing an employee ID somewhere in a table and a character arrives that is not expected, you need to catch that when matching the new set of data.

After standardizing you have enrichment: the same data can be used and enhanced by other users, for example by taking the data in JSON format they can publish it or visualize it. In particular, we create services to share a specific piece of data, so others can connect through API services and work with the data.

Then monitoring, and here is how it differs from before: earlier, monitoring was not considered necessary in the initial data phase; now you need monitoring from the start of your requirement, from the moment data is extracted. Data quality is not a one-off operation; it needs to be a continuous, ongoing practice, because the data in your organization is constantly transforming, shifting and being processed, and those changes need to be monitored for anomalies and outlier data, to ensure the quality is maintained.

Finally, operationalizing: using this data, monitoring the enriched data, and making the data available for visualization, business insight and so on.
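To make the discovery-and-profiling step concrete, here is a minimal pandas sketch; the table, column names and values are hypothetical, and a real profiling pass would cover much more.

```python
import pandas as pd

# Hypothetical extract from a data source; in practice this would come from a file
# or database, e.g. pd.read_csv("employees.csv").
df = pd.DataFrame({
    "employee_id": ["E001", "E 002", "E#003", None],
    "salary":      ["50000", "61 000", "70000", "NA"],
})

# Basic profile: data types, null counts, distinct counts.
print(df.dtypes)
print(df.isna().sum())
print(df.nunique())

# Spot inconsistent structures: embedded spaces or special characters in IDs,
# and salary values that do not parse as numbers.
print(df["employee_id"].str.contains(r"[ #]", na=True))
print(pd.to_numeric(df["salary"].str.replace(" ", ""), errors="coerce").isna())
```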
So now we come to the data pipeline. In the earlier slides we talked only about the data itself, which can be tested with various manual and automation tools. A data pipeline is something like a water system. In the traditional way you had to go to a well or some other place to fetch water. Now your needs have increased: you need water at home, you need it more frequently, you need it 24/7. So the water system improved: a pipeline was created, there are nodes where the pipeline connects to the water source, extraction of water is done, processing and treatment of the water is done, and the pipelines are integrated so that one pipeline serves one region and another serves another region. That is the kind of data pipeline we are creating: extraction of data, then processing of that data, then integration and validation, and then visualization. After that you can use the data as per your requirement.

Each phase needs to be tested. For data ingestion you test the creation: how the data is created, what the source of the data is, the volume of data, the frequency or throughput, the variety of data and the formats involved. ETL testing is part of this; ETL is a subset of data pipeline testing, and you need to validate the data reporting as well.

So where does the QA role step in? There is the data, where you do duplicate, missing, outlier and format testing. There is the model: whether this data pipeline is being created for database testing or for application testing, there will be the normal application model testing, a DB model created from the schema and the fact and dimension tables, and possibly an AI model as well. So you need to test your algorithm: overfitting, underfitting, bias and fairness; the accuracy and confidence, how it responds to you, and whether there is any failover or not. Then there is the model pipeline, because a model, once created, is also a continuous process: with each new set of data your model is updated, so you need to update its version, which is the MLOps concept of model versioning. If we are enriching the data there are API calls being used, so you need to test those API calls as well. And if there is an interface or use case for visualization, whether you are adding a visualization tool or building your own visualization interface, you need to test those use cases too.

Here is an in-house framework built entirely with open-source tools. There are two different forms of incoming data: batch data and real-time data. In the case of batch data it is easier; you can do the extraction, transformation and load, and then the visualization testing as well. But in real time your processes need to run continuously: you need to continuously extract the data, continuously stream it, apply the business rules, transformations or processing, and then store the transformed data in some database, or push it to Logstash and Elasticsearch so that you can visualize it through Kibana or a Power BI kind of tool. Kibana is the open-source option here. This flow is created by the developers or your team members, so as a tester you need to know when to test, which phases and stages, and what you will test at each of them. The first thing to test is that the source data is coming in correctly to the data consumer.
Here we have taken Apache Kafka as that consumer: an open-source data processing engine that acts as the intermediary of a streaming data pipeline and collects the data, batch or real-time. Then there is Spark Streaming. We have used Spark Streaming, an extension of the core Spark API, which helps process real-time data from various sources, and Structured Streaming works with the Dataset and DataFrame APIs. The Spark engine, as you can see here, is used to perform the necessary data validation using configuration files: whatever data processing, formatting or business-rule additions are needed are maintained in the configuration file. The batch of processed data then goes from the Spark engine into Cassandra, a column-oriented NoSQL database used to handle large amounts of data, including real-time data, with high availability. One challenge with Cassandra is that it does not handle ad hoc requests well, so as an alternative you can add something like Apache Druid for ad hoc reporting. Logstash is a core component of the ELK stack which helps you gather and process the data, and by connecting it with Elasticsearch and Kibana you can explore the results in Kibana.

So this is the framework that was created, and it requires a lot of tools to be installed: you need Java 8, Spark, Kafka, Cassandra, ELK and Jupyter notebooks. That is why I created a runbook, so that we can share it with the team, stand up this framework and run the demo; I will show what all is required. This is the general design of the data pipeline: what is required, how it can be implemented, and the steps. First you start ZooKeeper, then you start Kafka, then you create a topic and send a message. Then there is Visual Studio Code. As I said, you need to create a data model; here we have used a Cassandra keyspace, and you can add data from a file or from a connected database. You can also add a clustering column, which is unique, for example a timestamp for real-time processing, and define the insert and update statements. Here we are using PySpark, but you can use other processors too: instead of Kafka you can use AWS streaming services, or ADF, Azure Data Factory, together with Synapse. You need to configure whichever instances you are using in the tool. And this is the validation configuration, where you specify the column name, class and params; it is not only passing the params to the classes, it also states what validation is required. Here we have specified a check on whether the product price is greater than 7000 or not. The buying-event CSV we have kept locally so that it executes easily; you can keep it anywhere and it will be located through a relative path, and it loads all the configuration instances and the configured data model. This is the in-house tool, and the reason we created it in-house is so that your data stays secure and safe and you can deploy it inside your own company or project.
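To give a flavour of what that configuration-driven check can look like in code, here is a minimal PySpark Structured Streaming sketch that reads a Kafka topic and flags records violating a price rule. The topic name, bootstrap server, schema and the 7000 threshold are illustrative assumptions, not the actual in-house framework.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Requires the spark-sql-kafka package on the Spark classpath.
spark = SparkSession.builder.appName("buying-event-validation").getOrCreate()

# Assumed schema of the buying-event messages.
schema = StructType([
    StructField("product_id", StringType()),
    StructField("product_price", DoubleType()),
])

# Consume the (hypothetical) buying_event topic from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "buying_event")
       .load())

events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Rule taken from the validation configuration: product_price must be greater than 7000,
# so anything at or below that limit is flagged.
violations = events.filter(col("product_price") <= 7000)

query = violations.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```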
It actually happened that last night this framework was not working. I thought I would rerun it, and it still was not working because of a ZooKeeper issue, which we face most of the time when starting the server. So I started thinking about an alternative.

That is how I got to another tool called Great Expectations, which you can also look at. It is not a pipeline testing tool, but it is a very good data testing tool; I will talk about it and show it. It was easy to pick up as well, and James, Ben and Kyle have a community channel on Slack you can connect to. It took me no more than 30 minutes to an hour to understand the tool, and it is ready to use. So if someone is not able to deploy the in-house framework, which needs a lot of installation work, this is an option; the benefit of the framework is simply that your data stays with you and remains secure, and you can add as many configurations and validations as you want to run.

So this is Visual Studio Code, and there is one Python script written as a Jupyter notebook which is the Kafka producer for the topic we created, and it runs. I ran it this morning, so all the data coming in as buying events is shown, in the format already defined. If I run the ELK stack, Spark, Kibana and Logstash, you can see all of those here. We could also package this up, but it is still at the development stage.

(Suggestion: maybe you can answer that question live, since other people have it as well. Okay, okay. Thank you so much.)

If you run this notebook, it will show you all the processing: how Kafka is producing, the application starting up; we can see all the data coming in, and it will save that data, which should then appear live in Kibana. Once the saving is completed, it moves on from this step.

So now to what I mentioned: Great Expectations. With the in-house framework it takes time to execute and a lot of setup is required, whereas Great Expectations gives you a package you can just run; there are only two or three steps, you just need to install and run it. I will showcase both. Poor data quality, as we said, is very difficult to test. Right, just install it and start it. So now the tool asks which file to use and how; it has got stuck for a moment. If you want, you can work in a notebook and check the expectations; you can add a checkpoint as well, there are plug-ins too, and it will save everything under the uncommitted folder. It is asking whether I would like to configure the data source: you can point it to SQL Server, or use either files or a relational database that is already there. If I choose the file option, I can pick Spark or pandas; PySpark is already there on my system, so I can add that too. Now it is asking for the path where my file is; I can point it to Downloads, where I have my file, and give the short name of the file we are going to test. This will give you an HTML report with all the details. You can see the result, the expectation validation result: it evaluated the expectations, the successful expectations on the eight columns that were in that file (I will show that file as well), and it counted 2,200 rows, what it observed, how it checked it, and whether any of the values were null or not. You can add many checkpoints and it will report on all of them; if you want, you can look at all of those, or any failed ones, though here nothing failed, everything succeeded. You can edit the expectations as well. So this is Great Expectations and how it works; you can walk through it and add more checkpoints. I have only done a basic test on one file, which compares, matches and tests the data.

And what it can test covers all the expectation categories: table shape, which is useful for ETL-style testing, distributional functions, multi-column expectations, things like "column max between", aggregate functions, date, time and JSON parsing, string matching, sets and ranges, missing values, duplicate values: essentially whatever validation points I would otherwise write in a configuration file.
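For anyone who wants to try the same checks outside the CLI flow, here is a minimal sketch using Great Expectations' classic pandas-style Python API (the pre-1.0 `ge.read_csv` interface); the file name, column names and limits are assumptions for illustration, not the speaker's actual buying-event data.

```python
import great_expectations as ge

# Load a CSV as a Great Expectations dataset (file and column names are illustrative).
df = ge.read_csv("buying_event.csv")

# A few expectations spanning the categories mentioned above.
df.expect_table_row_count_to_be_between(min_value=1, max_value=100000)   # table shape
df.expect_column_values_to_not_be_null("product_id")                     # missing values
df.expect_column_values_to_be_unique("order_id")                         # duplicates
df.expect_column_values_to_be_between("product_price", 0, 7000)          # sets and ranges
df.expect_column_values_to_match_regex("order_id", r"^[A-Z0-9-]+$")      # string matching

# Validate all registered expectations in one go; the CLI / Data Docs render the same
# results as the HTML report shown in the demo.
result = df.validate()
print(result)
```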
So why would this session be useful to anyone? Because it gives you the idea that you need to create some data validation layer: add a validation function or module, and then compare the incoming data source, any processing of that data, and the target data, along with any specific use cases. This gives you a quick solution, and it should also give you ways to understand and replicate the testing of whatever data you have with you.

Okay, so far we have talked about data that is already sitting somewhere, static. Now let's talk about integrating data validation into an ML pipeline. In machine learning there are also pipelines, and they carry a lot of data. Here you define the metrics you will follow and the framework or model you are taking, and then you add all that data, transform it and build a model. Now, once your model is deployed, there is a very contentious, ongoing debate: how can Scrum, or any Agile methodology, or the traditional software development lifecycle support machine learning development 100%, when only about 50% of it is developed by developers? The other 50% is trained by data scientists, after getting responses from real users or after getting it tested by a crowd of real users. Then you assess those models against real-time data, continuously monitor all that data, continuously deploy the model, and assess it: what is the accuracy, what is the confidence level, what is the precision, and is there any fallback? That is what you need to test.
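As a tiny illustration of that assessment step, here is a sketch computing a few of those metrics with scikit-learn; the labels and predictions are made-up values, purely for the example.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground-truth labels and model predictions on a recent batch of data.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # overall fraction predicted correctly
print("precision:", precision_score(y_true, y_pred))  # how trustworthy the positive predictions are
print("recall   :", recall_score(y_true, y_pred))     # how many real positives were caught
```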
If someone says that machine learning is very difficult, and asks why data is so important in machine learning, there is Google's "Hidden Technical Debt in Machine Learning Systems" paper, a white paper that has already been written on this. In a machine learning system there is only a small machine learning code box; apart from that, everything around it is data ingestion, feature extraction from data, data verification, monitoring, analysis of data, configuration of the services and tools you are using, and serving infrastructure. The ML part itself is very small, so how do we test the rest? There is a library available from Google called TensorFlow Data Validation (TFDV) which you can use for this. It gives you scalable calculation of statistics, it integrates with the rest of the TensorFlow ecosystem, and that brings the usual performance, scalability and security benefits. This is a big topic which we could cover in another session; in the AI testing session I have given, we covered TensorFlow Data Validation and how to test AI applications. It gives you automated data schema inference, a schema viewer and anomaly detection; you can write Python code for anomaly detection, or you can use a cloud service where you upload or connect your data and it gives you anomaly and outlier detection, and it can be done within five minutes. A minimal TFDV sketch is shown at the end of this part.

So what are the chances and sources of error for a model, whether it is a gaming model or any ML model? Sometimes, while developing, you do not have sufficient data, or your data is inaccurate or invalid, so how will the model represent that data? Sometimes, if you try to create the data yourself, bias creeps in: if I think I can create the prerequisite data, it may reflect only my own perspective, so bias will be there. Apart from that, security is very important here, integration of machine interfaces is important, and there are technology problems: a lot of tool integration is required, and sometimes we do not have the expertise with those tools; it takes time to understand them. We also faced this earlier: we deployed a model, shared and published it, but with the next set of data our model deteriorated. How can you identify all of this? Through continuous monitoring, identifying performance issues, checking the scalability of the data, and maintaining model versions. There is also an important concept here: you create a shadow deployment environment. It is not staging but a mirror image of production, and it is not visible to the front-end users; it just extracts the real-time user data, does the analysis, runs all the performance checks and model versioning, and then the validated, updated model can be published through staging to production. So there is a shadow environment.

There are also some Python libraries already available; we also took reference from existing validation libraries, and with these you can do comparison of data sets, import them and validate them. The benefit you can see is that whatever losses we saw earlier can be avoided by testing, and monitoring is the most important part, because it is continuous.
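Here is the minimal TensorFlow Data Validation sketch referred to above: inferring a schema from training statistics and then checking a new batch of data against it. The CSV paths are placeholders, not files from the talk.

```python
import tensorflow_data_validation as tfdv

# Generate summary statistics from a training CSV (path is illustrative).
train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")

# Infer a schema from those statistics and review it.
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# Validate a newer batch of data against that schema and surface anomalies
# (missing columns, out-of-domain values, type drift, and so on).
serving_stats = tfdv.generate_statistics_from_csv(data_location="serving.csv")
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```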
So what do you need to monitor? The processes and the visualizations, and you need to add a lot of notifications for your application or model: data changes, model performance, CPU, model input checks, and whether the visualizations are executed or not; it will come out like this.

Do we have some time for questions, Rajeev? Yes, yes, we have. So anyone can ask questions, because these are tools that are already available in the market and anyone can use them: Talend Open Studio for Data Quality, Great Expectations and so on. Great, I encourage participants to post their questions in the Q&A section so that we can answer them straight away.

The reports will come up in Kibana, and for Great Expectations they will be in the HTML file. These are the folders that get created: the checkpoints we added, the expectations, meaning the validations you have added, any customized code you can add here, and the plug-ins, because it supports Airflow and others; and the validation result for the tested file is saved here, in JSON format and also as HTML.

I have a question, Rajeev. This is a great talk, by the way. I just wanted to understand: what would be your recommendation for a team that has zero testing for their data? What would be the first initial step they should take towards testing their data? The first step: profiling is very important. If you understand your data, you understand what quality validation needs to be done and what business insight can be derived from that data. Any other questions? I am available on LinkedIn and Twitter, so you can connect with me.