Thank you so much, everyone, for coming along to this session. My name is Mohawad Ali and I'm a research data specialist in data architecture at the ARDC. Today I will be presenting the data architecture checklist that we have developed recently, focusing in particular on the role of data quality within it. I will start with an acknowledgement of country: we acknowledge and celebrate the First Australians on whose traditional lands we meet, and I pay my respects to their Elders past, present and emerging. Here is the content for today's presentation. I will start with some of the concepts of data architecture and the engineering behind it, then walk through how data quality can improve existing data architectures. I have included a practical scenario that brings together both concepts, data quality and data architecture. Following that, I will walk through the data architecture checklist and discuss some of the common challenges and obstacles in this domain. For those challenges we have proposed solutions for improving existing architectures and workflows, and since there is a lot of discussion around AI and ML nowadays, I have also included how AI and ML can play a role in enhancing data quality within the data architecture. I will then discuss the future perspective, what comes next from here, and finally summarize the key takeaways and conclusions. This first slide shows a high-level architecture diagram. Data architecture defines the data strategy and data flows within an organization. It basically consists of five layers: it starts with data collection, then data processing and transformation, and then how the data is stored.
At the end, we usually discuss how the data is used and analyzed. In this figure, the first step is data collection and acquisition, followed by processing, then transformations, then storage in different databases, and lastly analytics in the form of visualizations and data-science-driven decisions. I have also included MLOps in this diagram, because nowadays there is a lot of discussion around moving from conceptualization to deployment; MLOps basically helps us convert concepts into real-time deployment and production. Following the architecture diagram, these are some of the data quality metrics that are usually embedded in data architecture. Some of the terms are quite common: completeness, accuracy, consistency, timeliness, validity, uniqueness and relevancy. These terms appear across many domains, but today we describe them specifically in terms of data architecture. Starting with completeness, we usually talk about data pipelines: in a data architecture, pipelines ensure that all the data is captured, in the required formats, into the required systems. For accuracy, we check that all the datasets in the architecture are in the relevant, consistent formats, and we usually use data profiling to identify missing points and ensure that accuracy is in place. For consistency, the third one in this list, we talk about data models: data architects work on data models to keep the data consistent across different systems.
Consistency is also relevant to data integration, which is a pain point for many domains and systems nowadays. In terms of timeliness, this is relevant to pipelines, the ones we call ELT or ETL pipelines. We use these pipelines in data architecture to reduce manual processes, because many processes today are still manual; by using ELT or ETL we can automate them and save time. For validity, in data quality we usually define rules and constraints to ensure that the data is maintained, properly validated and updated accordingly. On uniqueness: when we work on data models we define different kinds of relationships, and for each component or record of a dataset we prefer to have unique identifiers. The purpose of unique identifiers is to avoid duplication. We deal with a lot of datasets nowadays, and duplication is one of the key problems, so to avoid it we assign unique identifiers across the different datasets within the architecture. How is this relevant to organizations? From a data architecture perspective, we work with different organizations and data consumers, firstly to understand what they need. Today we are focusing specifically on data quality, so the first step is to understand the data quality needs, and then to ensure that the relevant steps in the data architecture meet the specific use cases.
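The completeness and uniqueness metrics described above can be sketched as simple rule-based functions. Here is a minimal Python illustration; the records, field names and values are hypothetical, not taken from any specific ARDC system:

```python
# Hypothetical records, standing in for rows arriving from a pipeline stage.
records = [
    {"id": "r1", "name": "Alice", "updated": "2024-01-10"},
    {"id": "r2", "name": None,    "updated": "2024-01-11"},
    {"id": "r2", "name": "Bob",   "updated": "2024-01-12"},  # duplicate id
]

def completeness(records, field):
    """Share of records where the field is present and non-null."""
    return sum(1 for r in records if r.get(field) is not None) / len(records)

def uniqueness(records, field):
    """Share of distinct values for a field meant to be a unique identifier."""
    values = [r[field] for r in records]
    return len(set(values)) / len(values)

name_completeness = completeness(records, "name")  # 2 of 3 names present
id_uniqueness = uniqueness(records, "id")          # duplicate id lowers the score
```

In a real architecture these functions would run inside the pipeline stages rather than ad hoc, but the metric definitions stay the same.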
These are some of the metrics; their names are quite common, but in data architecture we take these kinds of concrete steps around them. Having covered the metrics, here is a practical implementation from one of the research groups, known as Infotech. This diagram shows five key components of the data architecture, starting with data creation, then data ingestion, and ending with reporting and analytics. In terms of data quality, there are different checks across these layers. For example, at the first one we have the databases: when we work on databases, we ensure that the database is free from errors. At data ingestion, we check the syntax and formats so that the data is ingested in an appropriate manner. The third step is the data warehouses, and nowadays also data lakehouses, where we actually capture and store the data; warehouses and lakehouses maintain data quality themselves through the different data pipelines they have built in. At the end, we can see how this data architecture embeds the different data quality checks to ensure that the data is properly collected, ingested, stored and delivered, and then consumed as the last step. As shown in the legend, labels one and two show how each component in the diagram attempts to fix data quality issues, especially at the root cause level. When I mention the root cause level, it starts from ingestion from the databases; that is the first step.
Throughout this data flow, we usually assign different kinds of checkpoints to ensure that we have a good-quality data output at the end. These are some of the common challenges and obstacles that organizations face repeatedly nowadays. In terms of data quality, unreliable data leads to unreliable outputs. There are inefficiencies in the datasets, and fixing those inefficiencies carries different kinds of costs; cost itself is another challenge for fixing data quality issues in these architectures. What is the impact of poor data quality? It has a domino effect: if the first step, data ingestion, is not handled properly, it leads to wrong outcomes or wrong decision-making at the end of the process. Another obstacle I have noticed is a failure to realize the importance of data quality. We often do not give much weight to the idea that data quality is one of the important parts of these data architectures; it is often ignored. We are also frequently unsure about where to start with data quality. This uncertainty poses a challenge to data architects: at which step do we begin, given that the architecture has different layers, each layer involves different kinds of datasets, and working out where to start in terms of lineage is itself a problem. We also come across the term data silos, probably the most repeated word in the data domain. The challenge for data quality and data architecture is that the data is stored in different isolated systems.
These silos sit in different departments, which makes it difficult to ensure that the data is consistent and of good quality, or to maintain quality across the organization. Data silos also lead to a lack of data integration between systems, and that lack of integration results in inconsistent data definitions and, of course, inconsistent quality standards. Those are some of the challenges and common obstacles listed here. As potential solutions, I have listed a table here. These are fairly generic, basic solutions that can address the existing problems. Within these data architectures, as I showed in the previous diagram, we usually implement data quality checks at every stage of the architecture. Across the different layers there are different stages of pipelines: pipelines that handle how we ingest the data, how we transform it, how it is integrated and how it is delivered. By positioning data quality checks at each step of each layer, we can control and improve the data quality aspects of the architecture. Apart from that, we can also improve data quality by making better use of the results we collect from the quality checks. For example, we usually set thresholds or rules that produce pass or fail statuses; these messages and quality scores help us analyze the state and performance of data quality in the architecture. There are some other aspects listed here as well; one of them concerns visualizations.
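The threshold-based pass/fail statuses and quality scores just mentioned can be sketched like this; the metric scores and the 0.90 threshold are illustrative assumptions, not recommended values:

```python
# Hypothetical per-metric scores collected from checks earlier in the pipeline.
scores = {"completeness": 0.98, "accuracy": 0.95, "uniqueness": 0.85}
THRESHOLD = 0.90  # illustrative cut-off; real thresholds depend on the use case

def quality_report(scores, threshold):
    """Turn raw quality scores into per-metric pass/fail statuses
    plus an overall verdict for the pipeline stage."""
    statuses = {m: ("pass" if s >= threshold else "fail") for m, s in scores.items()}
    overall = "pass" if all(v == "pass" for v in statuses.values()) else "fail"
    return statuses, overall

statuses, overall = quality_report(scores, THRESHOLD)
# uniqueness falls below the threshold, so the overall status is "fail"
```

Statuses like these are exactly what a monitoring dashboard or an alerting rule would consume downstream.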
Instead of manually going through individual components of the data, nowadays we work with dashboards, tools and charts that show forecasting and data quality trends, so we can see where the anomalies and data issues actually exist. Data visualizations therefore play an important role in maintaining good data quality in these architectures. The last one is collaboration between different teams: data architects usually work with data engineers, data analysts and data scientists within the organization so that they can communicate effectively about data quality issues and work on the existing challenges and workflows. To support the previous statements, I have added an example of an organization and how data quality affected its data architecture. Say an organization is trying to migrate its data from one platform to another. After the migration, they realize that it has produced inconsistent results, and those inconsistent results stem from data quality issues that were ignored at the first step. What went wrong in this scenario is that, when migrating from one platform or domain to another, time should be spent identifying the data quality issues up front. The other problem we commonly face is that there are a lot of manual processes for fixing data quality issues, and these manual processes lead to long extensions of delivery time.
To address these issues, the organization typically goes back and brings in additional resources, and those additional resources increase the financial burden and the associated costs. In terms of data quality, if the results are inconsistent from one platform to the other, it also creates mistrust in the new system when the results are not up to the mark, not what was supposed to appear on the new platform. If that is the issue, there are solutions. Here is another, simpler diagram showing the data sources and the staging areas that can address the data quality issues, both in the previous migration example and within the complete data architecture. At step one, starting from the data sources, and step two, moving the data from the sources into the staging area, we usually maintain different kinds of checklists to ensure that the data coming from the original source is free from data quality issues. In the staging area of a data architecture we usually have data pipelines, and within those pipelines we inject different kinds of checklists. This one, for example, is an operational checklist covering how the data is operating: five steps that ensure the raw incoming data is free from errors.
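The flow described above, with checks as data moves from source to staging and then from staging to the warehouse, can be sketched as a small pipeline with a checkpoint at each hop. The rules and row shapes here are hypothetical, a sketch of the pattern rather than any real checklist:

```python
def check_no_nulls(rows):
    """Checkpoint 1 (source -> staging): keep only rows free of missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def check_business_rules(rows):
    """Checkpoint 2 (staging -> warehouse): illustrative business rule,
    amounts must be non-negative."""
    return [r for r in rows if r["amount"] >= 0]

def run_pipeline(source_rows):
    """Source -> staging (operational checks) -> warehouse (business checks)."""
    staging = check_no_nulls(source_rows)
    warehouse = check_business_rules(staging)
    return warehouse

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},   # dropped at the staging checkpoint
    {"id": 3, "amount": -5.0},   # dropped before reaching the warehouse
]
clean = run_pipeline(rows)  # only the first row survives both checkpoints
```

In practice each checkpoint would log or quarantine the rejected rows rather than silently dropping them, but the staged structure is the point.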
When the data moves from the staging area to the data warehouse, what we usually check is whether it meets the business requirements, whether it is consistent with the business needs. These are the further checks we implement within these architectures to ensure that integrity and data quality are maintained. I have listed this next slide because there is a lot of discussion around the data within the data, which is metadata. What could be the role of metadata quality in the data architecture, and how does metadata quality impact the real-world implementation of data architectures? I have listed five roles, but they are not limited to these; there could be many more. We talk a lot nowadays about FAIR, and one component of FAIR is discoverability: high-quality metadata ensures that relevant data can easily be found within the data architecture. In terms of lineage tracking, metadata ensures that data can be properly tracked from its origin through steps such as transformations and movement throughout the organization. In terms of integration, good metadata quality is important for successful data integration, because when systems and applications use consistent and accurate metadata, it becomes easier to integrate data from various sources. In terms of data consistency, consistent and standardized metadata across the data architecture ensures that different platforms and organizations are using the same data language, which maintains consistency across systems. It also supports data governance and compliance, because metadata is one of the foundations for effective governance.
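A minimal sketch of a metadata quality check, assuming a hypothetical set of required elements; real schemas such as Dublin Core define their own element sets and rules:

```python
# Illustrative required metadata elements; any real schema would define its own.
REQUIRED = {"title", "creator", "source", "created", "licence"}

def metadata_completeness(record):
    """Fraction of required metadata elements that are present and non-empty,
    plus the set of elements still missing."""
    present = {k for k in REQUIRED if record.get(k)}
    return len(present) / len(REQUIRED), REQUIRED - present

record = {"title": "Survey data", "creator": "ARDC", "source": "instrument-42"}
score, missing = metadata_completeness(record)  # 3 of 5 elements present
```

Checks like this one underpin the discoverability point above: a record missing its title or creator is much harder to find, track or integrate.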
It basically helps to set policies and to enforce the different data management policies within the architecture. Having given an overview of why data quality is important for data architecture, in this part I will highlight how we can ensure good practices in both, and for that we have developed a checklist which is currently available on the ARDC website. In this checklist we compiled the different components of data architecture, focusing on 17 areas, which I will highlight on the coming slide. The benefit for end users is that, after using the checklist, organizations have a better understanding of where their data architecture is at and which areas they need to follow up on. One thing I should mention is that this checklist is intended as a starting point for teams and organizations to customize. Not all the items will necessarily be applicable to every organization, and it is likely that organizations will need to add, revise or remove items based on their requirements. But the checklist aims to ensure that certain elements are embedded in data architecture practices, and it should initiate further discussion and reflection on how we can improve existing data architectures. Here is a snapshot from the data architecture checklist: the 17 areas shown as bullets. We started with questions around the existing architecture, then went down into data sources, data formats and data types, data integration and transformation techniques, how data pipelines and workflows can reduce existing manual work, and then a long list of further components.
The checklist is currently available on Zenodo; I have added the URL in this presentation. One aspect we have emphasized is that the first bullet shown on this slide is the core list for data architecture, but within it there are three cross-cutting capabilities. In our team we also cover data governance, sensitive data and the FAIR principles, and these three areas cut across the different components of the data architecture. I will walk through each component of the checklist on the coming slides. This is the first one. As I mentioned, these are the different components, starting with the existing architecture: do you have an existing data architecture in your organization? If you do, what are the problems? Is the architecture outdated or redundant, is it not properly aligned with the business requirements, and what are the current needs? So we start by examining the existing architecture and then go into more detail on data sources: where the data comes from, how it is collected, whether any devices were used during collection, whether social media is a source. There are different sources we cover here. For data formats, we ask what kinds of formats you are using in your data architecture, and we have also highlighted some metadata schemas within the checklist. Because we are now in the age of big data, data is usually classified into three types, structured, semi-structured and unstructured, so we also included this component to capture what kinds of data types your organization is using.
In terms of transformation techniques: how is the data moved from one place to another to make it analysis-ready? Have you used existing techniques such as ETL or ELT processes to clean and organize your data? We also included the effect of pipelines and workflows in reducing the manual steps we usually perform in data transformations. For data ingestion, there are two kinds of data coming into systems: batch and real-time streaming. It is important to design the architecture based on how the data arrives, whether you are using batch or real-time streaming for your objectives. If you are using both, there is the Lambda architecture, which is well suited to such data architectures. We also included a component on data storage. Although the ARDC is not providing storage at this time, we focused on repositories and data warehouses: whether your data is stored on-premise, in the cloud, or in a hybrid of both. One of the key aspects of data architectures is databases. The choice depends on the data types, whether the data is table-based or document-based, which decides whether the database is relational or non-relational. In the third section we focused on the data quality aspect, which aligns closely with this presentation: what key steps have been taken to ensure that data quality is maintained across the data architecture?
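The Lambda architecture mentioned above serves batch and real-time data together: a batch layer precomputes views over historical data, a speed layer holds recent increments, and a serving layer merges the two. A toy sketch, with all names and numbers purely illustrative:

```python
# Batch layer: a precomputed view over historical data (recomputed periodically).
batch_view = {"sensor-a": 100, "sensor-b": 40}

# Speed layer: increments from events that arrived after the last batch run.
realtime_increments = [("sensor-a", 3), ("sensor-b", 1), ("sensor-a", 2)]

def serving_layer(batch_view, increments):
    """Merge the batch view with real-time increments into one current answer."""
    merged = dict(batch_view)
    for key, delta in increments:
        merged[key] = merged.get(key, 0) + delta
    return merged

totals = serving_layer(batch_view, realtime_increments)
# the query answer reflects both the batch history and the latest events
```

Real Lambda deployments use dedicated batch and stream engines rather than dictionaries, but the merge-at-serving-time idea is the same.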
One such technique is data profiling, and there are other techniques discussed on the previous slides, such as automating the data pipelines to ensure that the data is free from errors. In terms of data validation, when the data reaches the last stage it is important to ensure that it is free from inconsistencies, so that the data used for decision-making is clean and appropriate. And since we talk a lot about ML and AI: even those are heavily dependent on the incoming data, so maintaining data quality also matters for ML solutions. Lastly, we have the visualization part: organizations nowadays use business intelligence tools such as Power BI to visualize big data and get real-time alerts and predictions from their datasets. Then there are the cross-cutting capabilities where data architects work with other members of the team. For data governance: is governance in place in the architecture? We have more details on data governance on our website, and our data governance specialists have designed checklists to ensure that governance policies are in place. We also have teams working on sensitive data, because in a data architecture we come across sensitive data as well, and at the ARDC we have developed guidelines on how to work with sensitive data and how that can be useful for data architectures. And when we work on data quality in the data architecture, we also ensure that the data follows the FAIR principles.
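The data profiling technique mentioned above can be sketched as a small function that summarizes one field; the records here are hypothetical:

```python
from collections import Counter

def profile(records, field):
    """Simple profile of one field: null count, distinct count, top value."""
    values = [r.get(field) for r in records]
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    top = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": top,
    }

records = [{"state": "VIC"}, {"state": "NSW"}, {"state": "VIC"}, {"state": None}]
p = profile(records, "state")
# p == {"nulls": 1, "distinct": 2, "most_common": ("VIC", 2)}
```

Profiles like this are what surface the missing points mentioned earlier under the accuracy metric, before the data reaches the decision-making stage.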
To understand how the FAIR principles are relevant to this component, we again have a FAIR principles document on the ARDC website, which can be easily accessed by anyone; I have also included the hyperlinks on this slide. So that was the checklist. After developing it, what we realized is that when we talk about architecture, we mix different kinds of technologies, because there are different kinds of architectures. This figure shows seven or eight of them, and each has a different role and responsibility. I have listed what data architecture usually looks after, what systems architecture looks after, the role of infrastructure architecture and enterprise architecture, and the role of ML and AI architecture. I have included this figure to clarify how we can select the right architecture based on the needs; more details can be found on the ARDC website. Now, in terms of solutions, how can we improve these data architectures, data quality and data workflows? We have listed a couple of steps here, because data architecture needs to keep pace with technological change. These are some recommendations for improving data architecture along with data quality. The first step is to assess the current state of the architecture. In terms of data quality, we need to understand the data quality requirements and design the architecture accordingly. For an existing architecture, we need to understand which areas need further improvement and how we can design strategies and workflows so that the architecture can be improved against the current requirements.
There is also the concept of a gap analysis, where we draw a comparison between the current state of the architecture and its future state. Within the gap analysis, we also focus on the data quality aspect, which is an important element: what is the existing state of data quality, and how can it be improved in the future state of the architecture? Proper planning and a roadmap are also important for optimizing and improving existing data architectures, and a proper plan and workflow are important for keeping the architecture up to date. Another thing that has come up in discussions is that we need to promote collaboration between the different members of the data teams. If we foster a data-driven culture, we can handle existing issues better, because as I mentioned on the previous slides, most organizations today are working in silos. If we promote collaboration between different data teams, we can handle these data issues within the architecture much more effectively. There is a lot of discussion around how we can use AI and ML to improve data quality in the data architecture, and these are some of the components I have listed. AI can help us automate data cleansing and standardization, for example by automatically imputing missing values, rectifying inconsistencies, and applying standardized formats across the architecture. For anomaly detection, in machine learning and AI we use supervised and unsupervised learning. This matters because if an organization is working with big data, say petabytes, manual processes will simply not work.
By bringing in AI and ML solutions and doing some feature engineering, these data quality errors can be identified and rectified far more easily. The remaining items cover how we can improve data quality monitoring, for example through predictive models and through data lineage tracking. Lineage tracking nowadays often draws on the historical activity of the data; AI, data science and ML apply different kinds of feature engineering to detect historical trends and track those changes into the future. Another important aspect is democratizing data quality: AI can help here by powering data quality dashboards that are visible to different members of the team. Now, what would a modern data architecture look like, and what key features should we focus on? Here are some of them. Not all will be applicable to every organization, but these are the key features we have identified for modern data architectures. One is scalability, and another is reusability, because nowadays there are a lot of data pipelines but they are not reusable. By introducing scalable, reusable data pipelines we can combine different kinds of intelligent workflows, and those workflows can help us address data quality issues. The second is data integration: as you can see here, the data comes from various sources, so it is important to have seamless data integration, for example by using ELT processes or considering other approaches such as data pipelines and event-driven architectures for efficient integration.
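The automated cleansing and unsupervised anomaly detection discussed above can be sketched, at their simplest, as mean imputation plus a z-score rule. Real deployments would use richer models, such as isolation forests; the sensor readings and thresholds here are made up for illustration:

```python
import statistics

def impute_missing(values):
    """Automated cleansing step: replace missing readings with the mean
    of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def zscore_anomalies(values, threshold=3.0):
    """Unsupervised anomaly detection: flag indices of points more than
    `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [i for i, v in enumerate(values) if sd and abs(v - mean) / sd > threshold]

readings = [10.1, 9.9, 10.0, None, 10.2, 55.0, 9.8]
cleaned = impute_missing(readings)
outliers = zscore_anomalies(cleaned, threshold=2.0)  # flags the 55.0 reading
```

Flagged points would then feed the monitoring dashboards mentioned above, rather than being deleted automatically.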
We discussed on the previous slides how important it is to convert manual data flows into automated ones, to reduce manual data processing and analysis, and machine learning can help automate these processes. One example is machine learning ops, MLOps, where we use different kinds of workflows and pipelines to automate processing and reduce the human labor involved. It is also important to consider batch and real-time processing requirements, because some data arrives in batches rather than on a continuous basis, while other data is real-time; most IoT devices, for instance, release their data in real time. So consider the requirements for both batch and real-time processing. One enterprise architect, Charlie Raviri, has listed six essential steps for modernizing current data architectures with data analytics and AI, which I have included here. Another aspect I have listed is that our data architecture should support both on-premise and cloud data needs, so it should be hybrid. The cost factor has been there for a long time, so we have to balance cost and simplicity within these architectures. And in terms of data storage, because scalability is a problem for most organizations, we can adopt newer technologies such as data lakes or data lakehouses to address scalability needs.
Where to from here? The guides and the checklist we have developed are starting points, and we welcome our community to look at them for continuous improvement. We will regularly review and update the checklist so that we stay current with the evolving data landscape and its challenges. If I summarize the conclusions in a few key points: first, we should prioritize data quality in data architectures, because by doing so we can avoid the negative domino effects within the architecture, so that data-driven solutions lead to success, not disaster. It is not just about fancy algorithms and impressive reports; it is also about building trust through reliable and accurate insights, and for that, data quality is one of the important factors. As for the checklist, it can definitely improve data quality, but it is not a magic bullet. A checklist ensures that the key aspects and ingredients of a data architecture are in place, considered and addressed in a systematic manner, and by using checklists like this, organizations can improve and maintain their existing architectures, including key components such as data quality. The last point, which I think is an important one, is that every organization has a unique architecture that reflects its infrastructure, its culture and its business requirements. However, every architecture has key components and key elements, and those are what we have tried to capture in the developed checklist. Those are the highlights from my side. Thank you once again for listening, and that's it from me.