Greetings all, this is Jnana Bharati. I am pleased to meet you with this hurriedly put together filler talk about data monetization, or should I say creating value from data, since in the research field we like to pretend that we don't care about money. But before I get into this, let me acknowledge country: we acknowledge and celebrate the first Australians on whose traditional lands we meet, and we pay respects to elders past, present and emerging. I will be speaking about data monetization in the context of data quality and the other key data management principles to which data monetization, or value creation, is inextricably linked. Data monetization, as it applies to the research sector, is about how to derive value from data assets. So the first question that often comes to mind is whether to think of data as an asset at all: does it have the characteristics of an asset? A number of economists have argued that data and information do possess those characteristics, and, as economists typically do, they have described models to manage them. They talk about factors affecting data quality. The measurable ones, on the left, are accuracy, integrity, consistency, completeness, accessibility, precision, timeliness, and so on. The ones on the right are a little more intangible: relevance, usability, believability, clarity, objectivity, scarcity, and so on. In valuing data like any other asset, they consider its quality, which covers completeness, accuracy and so on; its relevance, bearing on its alignment with objectives and processes; and its timeliness, that is, how soon the information is available. There are also costs at the various stages of acquiring, processing and applying the data in different contexts, and of course benefits to be gained from it.
The benefits, they argue, show up in performance gains and in decision making. In particular, they propose six models of information valuation, or data valuation, and we will look at what each one is. The first is the intrinsic value of information; the next two are the business value of information and the performance value of information. All three are based on the characteristics of the data and the domain. They also propose three other valuation models based on economics and finance: the cost value of information, the market value of information, and the economic value of information. The intrinsic value asks how good and easy to use this data is versus how likely others are to have it. For data to have value, it should have some level of validity, that is, the percentage of records deemed correct; it needs to be complete, with fewer missing values and fewer errors; and scarcity, how rare the data is to find, is what gives it monetary value, as economists would put it. Then there is the life cycle, which is how long the data will actually remain available. I think this is probably the most important concept to translate into research. The formula proposed is validity times completeness times one minus scarcity times life cycle, which at least to me seems to give mixed interpretations, since scarcity should increase value, not decrease it. So I took the liberty to modify this slightly: accuracy, completeness, accessibility (how easy the data is to access), scarcity, and life cycle length all contribute in some way, not necessarily as a product but proportionally, to the intrinsic value.
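To make the two variants of the intrinsic value concrete, here is a minimal sketch. Both functions, their factor names and the example numbers are illustrative assumptions, not a formula from any standard; every factor is taken as a score normalized to the range 0 to 1.

```python
# Hypothetical sketch of the intrinsic value of information (IVI).
# All factors are assumed to be normalized scores in [0, 1]; the names
# follow the talk, not any particular library or standard.

def ivi_multiplicative(validity, completeness, scarcity, lifecycle):
    """Literature-style version: validity x completeness x (1 - scarcity) x lifecycle."""
    return validity * completeness * (1 - scarcity) * lifecycle

def ivi_proportional(accuracy, completeness, accessibility, scarcity, lifecycle):
    """Modified version: each factor contributes proportionally (equal weights assumed)."""
    factors = [accuracy, completeness, accessibility, scarcity, lifecycle]
    return sum(factors) / len(factors)

# A rare, fairly clean dataset scores low under the multiplicative variant
# (scarcity subtracts value there) but high under the proportional one.
print(ivi_multiplicative(0.9, 0.8, 0.7, 0.5))     # 0.108
print(ivi_proportional(0.9, 0.8, 0.6, 0.7, 0.5))  # 0.7
```

Note how a scarcity of 0.7 drags the multiplicative score down, which is exactly the mixed interpretation mentioned above; the proportional variant lets scarcity add value instead.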
The business value, on the other hand, is all about how quickly the data can be applied to a particular process. It is still about relevance to a business application, so it considers relevance, validity, completeness and timeliness: at any given time, how soon you can get the data to a point where you can use it. Then there is the performance value of information, which is how much having a unit of information contributes to moving closer towards a target. It is an incremental improvement, a delta, over a lifespan: the fractional improvement against a control group that does not have the data available. There are also financial measures. The cost value of information is based on what it would actually cost to replace the data, including the capturing and processing attributable to it, as well as the opportunity cost, or loss, if the data were not available. The market value is how much someone is willing to pay for the data. Finally, the economic value is essentially the performance value of information minus all the costs involved: the cost of acquiring, processing and applying the data. That gives an overview of what the literature widely talks about in terms of value. Of course, these are pseudo-equations, so you may want to take them with a pinch of salt. But how do you actually apply this, or derive value from data? There are three ways. The lowest option, option three, is to simply sell the data.
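The performance and economic values described above can be sketched as two small functions. The KPI figures, cost categories and numbers below are entirely illustrative assumptions, not from any real valuation exercise.

```python
# Hypothetical sketch of the performance and economic value of information.
# All names, units and numbers are illustrative.

def performance_value(kpi_with_data, kpi_control):
    """Fractional improvement of a KPI against a control group without the data."""
    return (kpi_with_data - kpi_control) / kpi_control

def economic_value(benefit, acquisition_cost, processing_cost, application_cost):
    """Benefit attributable to the data minus all lifecycle costs."""
    return benefit - (acquisition_cost + processing_cost + application_cost)

# E.g. a conversion rate rising from 4% to 5% once the data is used:
print(performance_value(0.05, 0.04))                   # 0.25, a 25% relative lift
print(economic_value(100_000, 20_000, 15_000, 5_000))  # 60000
```

The point of the second function is simply that a dataset with a strong performance value can still have a negative economic value once acquisition, processing and application costs are netted out.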
Option two is to use the data to enrich products and services by wrapping information around them: making them more relevant, more personalized, and providing additional information. And option one is to use the data directly in decision making, or in improving existing processes. So there are three ways to use it. But in all three cases, and this is what is most relevant for research, what we are doing is raising the data into a more and more organized form to answer certain questions. Organized data becomes information, linked pieces of information become knowledge, and distilled knowledge becomes wisdom. So says Russell Ackoff in his argument about the data hierarchy. This is particularly valuable to us because, as data professionals, this is where we want to get to: how do we raise data to wisdom? There are different processes we employ. I have tried to summarize both artificial intelligence based processes and simpler insight extraction, such as business intelligence processes, in one nutshell: the data, the tools and infrastructure, the methods, and the frameworks come together, and through a scientific process, with an interpretive mindset, we integrate and interpret them and come up with an output. If we were to describe it as a pseudo-equation, data usage depends on quality, integrity, fairness, security, privacy, and so on. And the process of raising data towards wisdom, in the case of AI, is all about minimizing error using data, algorithms and compute.
In the case of insight generation directly from the data, it is all about applying rules and compute to the data to come up with insights. That I see as the primary difference between AI and BI. But that is just the model. The performance depends not only on that first part, the first equation, but also on the engineering, compute and processes involved. And if you want impact, performance alone is not sufficient: it has to be aligned to the actual problem, and the risks must be controlled as well. Sustained performance should have impact, but it also needs an organization around it, including culture, governance, and so on. To achieve this, we use a lifecycle process: starting from problem formulation, to acquiring and processing the data, to developing a model. That model at one stage becomes your proof of concept, and after a few iterations it becomes the product you take to productization. This process often drives the lifecycle of the data. But we know that data quality itself has been a serious impediment, in both industry and academia, to actually realizing value, with estimated losses running into the billions. The opportunity to address data quality is in the early stages: most data quality management occurs in problem formulation and in data processing, though it also spans the entire lifecycle. What is even more important is the downstream implications. The value of the data, and here I want to step away from just monetary value, it can be any value, increases as it goes through more and more processing.
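The AI-versus-BI distinction above, applying fixed rules versus minimizing error, can be caricatured in a few lines. The data, the rule, and the tiny least-squares fit are all made up for illustration; a real pipeline would of course use proper tooling.

```python
# A deliberately tiny caricature of the AI-vs-BI distinction from the talk:
# BI applies hand-written rules to data; AI fits parameters by minimizing error.
# The data and the rule threshold are illustrative.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # (x, y) pairs

# BI-style insight: apply a fixed rule and report what it finds.
high_values = [y for _, y in data if y > 5.0]  # rule: "y above 5 is high"

# AI-style model: fit the slope w of y ~ w*x by minimizing squared error
# (closed-form least squares, no intercept, for brevity).
w = sum(x * y for x, y in data) / sum(x * x for x, y in data)

print(len(high_values))  # 2 records flagged by the rule
print(round(w, 2))       # fitted slope, close to 2
```

Both paths consume the same data and compute; the difference is whether the "knowledge" is encoded up front as a rule or learned by driving an error term down.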
But the ability to influence the data, and with it the value, the cost and the outcome, decreases very rapidly over time. It is in the very early stages, problem formulation and initial data acquisition and pre-processing, that most of the opportunity for working with the data is either captured or lost. Beyond that come model development, whatever the algorithm is, and then translation and adoption. Of course these still depend on data quality, but the ability to influence it decreases over time. It is also important to note not just that influence decreases, but that there is a serial dependency as well: it is a fragile path to value, or impact. The whole analytics lifecycle can be thought of as a chain of linked lifecycles: information and data management, modeling and data science, and deployment. If any of this, end to end, from problem formulation through to deriving value, is broken, the ability to extract value from the data is broken. So you need to look at the whole lifecycle. While your ability to influence is higher in the earliest stages, the risks remain the same throughout, meaning a break anywhere will destroy value realization. And the traditional view of research, as we see it today, ends primarily in reports, papers and publications. You define various measurements and constructs, and gather data just to satisfy them; you may address missing data and so on. But while the validity of the construct is explored, the validity and provenance of the data, and the data quality, are hardly addressed.
In fact, anything beyond achieving the papers and publications is seriously jeopardized. You can fix a data quality issue at the end, by adding some rules or doing some post-processing, but those fixes can be very symptomatic. The majority of the processing done in, say, the pre-processing stage of AI or machine learning is quite symptomatic: we use statistics to manage missing data, to try to identify errors, and so on, or we cleanse and impute the data. So data quality is handled at a very symptomatic level. Root cause resolution, on the other hand, going back and collecting better data, is seldom done; the incentives are often against us. But in reality, the fixes should occur throughout, and that includes definitions, metadata, traceability, data integration, inconsistent or erroneous lifecycle processes and management, poor data migration, and lack of active maintenance, which is another factor: even good data can be destroyed by neglect. In other words, the weight, both in industry and in academia, is towards the production side rather than towards monitoring, control and stepping back. Data quality is one area that, along with governance and risk management, does look at these. In industry it is all about delivery, and in academia it is all about publication; the urgent outweighs the important, that is, having governance and risk management processes. Deep innovation in industry can be quite lacking, especially in the Australian context, and similarly, translation in research is a serious issue. So both are struggling, with different problems.
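To show what "symptomatic" handling looks like in practice, here is a minimal sketch of the imputation and error-flagging described above, using only the standard library. The dataset, the median imputation, and the three-MAD outlier threshold are illustrative choices, not a recommended recipe; note that the flagged reading is only detected, its root cause stays unknown.

```python
# A minimal sketch of "symptomatic" data-quality handling: impute missing
# values, then flag statistical outliers. Data and thresholds are illustrative.

import statistics

raw = [12.1, 11.8, None, 12.4, 55.0, 11.9, None, 12.2]

# 1) Impute missing values with the median of the observed values.
observed = [v for v in raw if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in raw]

# 2) Flag values more than 3 median-absolute-deviations from the median.
mad = statistics.median(abs(v - median) for v in observed)
outliers = [v for v in imputed if abs(v - median) > 3 * mad]

print(outliers)  # the 55.0 reading is flagged, but its cause is not explained
```

This is exactly the point made above: the pipeline now runs, but whether 55.0 was a sensor fault, a unit mix-up or a genuine event would only be answered by going back to the source.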
The data quality that is conventionally measured looks at completeness, integrity, timeliness and validity; these are what one considers in the case of data monetization or value creation, applied along with data curation and improving FAIRness. Both the CARE and FAIR principles that are coming into place could play a supportive role here. In the case of value creation, there is one definition that actually looks at the whole lifecycle process and at the end-to-end treatment of quality. I wonder if Leslie is here, or any of the authors of this paper; if they are happy to comment, I am most interested to listen. This paper by Peng et al. does capture the importance of data quality and encapsulates it in this bigger picture, so I thought it quite relevant. But as I mentioned, data value realization goes beyond any one stage. For example, even after you build a good model, only about 10 to 12 percent of models actually see the light of day and end up getting deployed; that is what the statistics say, and these involve big figures. That is the case even in industry. There are many reasons, but the low success rate of deployment is of course tied to data quality, and also to a number of other factors: even if you deploy a model, the model drifts, or the data quality drifts, over time, so retraining and maintenance are quite necessary. Value realization therefore involves understanding the drift in the data, identifying it in time, and then retraining so that model performance is retained.
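A hedged sketch of the drift monitoring just mentioned: compare a recent window of a feature against a training-time baseline and trigger retraining when the shift exceeds a threshold. The mean-shift criterion, the threshold of two standard deviations, and the numbers are illustrative assumptions; production systems would typically use proper statistical tests (PSI, Kolmogorov-Smirnov and the like) rather than this simplification.

```python
# Hypothetical drift check: flag drift when the recent mean moves more than
# `threshold` baseline standard deviations away from the baseline mean.
# Threshold and data are illustrative, not a recommended default.

import statistics

def drift_detected(baseline, recent, threshold=2.0):
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(recent) - mu) > threshold * sigma

baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]   # feature at training time
stable   = [10.1, 9.9, 10.0, 10.2]              # recent window, no shift
shifted  = [12.5, 12.8, 12.6, 12.7]             # recent window, clear shift

print(drift_detected(baseline, stable))   # False: keep serving the model
print(drift_detected(baseline, shifted))  # True: schedule retraining
```

The design point is simply that the check runs continuously after deployment; the model's training-time performance says nothing about either window above.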
This means continued monitoring is necessary. So beyond the algorithm, one also needs to look at data related issues, model related issues, and a number of non-technical issues if you really want to see value realization. And if you want the model to actually get used, there needs to be trust in the model and the data: the model should not be a black box, it should be able to explain its findings. Similarly, data provenance, scope, quality and so on come into the picture, and a number of non-quantitative, socio-technical factors come into play as well. That is why, in a machine learning situation for example, the ML code is only one small piece, and in reality you have a number of other boxes to worry about even just from the technological perspective. So, putting it all together: there are systems-level challenges to value creation, starting from the stages of the linked lifecycles, and oftentimes a lack of stakeholder participation, serial dependencies in the path, and a number of interactions, dependencies and points of failure. The path to value creation, or impact, is a fragile path. At the same time, if you ask practitioners which issues are important, what I have found is that the answers diverge. Younger professionals, HDR students, and even early career researchers focus quite a bit on the technical aspects, such as model development. More senior professionals, once they have become familiar with the full picture, will concede importance to the data pre-processing stages.
But it takes many more years of training, actual practice and failures to understand how much more value also lies in other areas, including problem formulation as well as productionizing. So there are a number of issues, domain inefficiencies, peculiarities, missed opportunities and so on, that are already plaguing the research sector, and the industry sector is not immune either; there have been legal battles and so on. At the same time, these result in poor decision making, propagation of biases, organizational politics, and the broken window syndrome that often happens in research labs: "okay, it's been this way, let's keep going". The same happens in industry, and it can often result in mistrust and distrust of data. I was going to give you some case studies, but in the interest of time I am cutting this short. It is important to solve the right problem; to have high participation of stakeholders, not just the technical stakeholders across the value chain but also the customers and other stakeholders; to understand that the path to value is long and fragile; to allocate appropriate resources without cannibalizing; and to understand the biases, risks and ethical issues in both the data and the processing pipeline, including machine learning or other modeling processes. Tech compatibility is also quite important. One needs to take a more systems view; frameworks, governance and management are important processes; metadata and meaning are quite important; and human factors need to be addressed, as do risk, ethics, and the FAIR and CARE ideas. I was also thinking about what happens in slightly less data-intensive areas, in the case of translation, so I looked at some of the systematic reviews in this area.
Surprisingly, the number of issues highlighted in translating research evidence is quite interesting, and worth learning about, partly because there are very few studies within our own profession about translation. They talk about time constraints, lack of workforce, and so on; about identifying the right stakeholders, budgeting the right level of activities, the motivation of professionals, and so on. Barriers versus facilitators have been nicely laid out, as have micro-level versus systems-level factors. So I thought it worth looking at how some of the other domains have actually done some of this translation and value creation. There are a number of theories, models and frameworks in certain sectors, particularly in healthcare or clinical translation, that are worth learning about. With that, I will stop here.