Welcome everyone, and thanks for giving me the opportunity to talk about our experience with the data standardization workflow and data vocabulary of our ACARCINOM project, which was launched in November last year. It's a repository of cancer data from dogs and cats across Australia. Now, I didn't know if you were familiar with the concept of a cancer registry, so the first slides will be just some introduction to the project so you can understand where the issues were in terms of data standards and data dictionary.
So a cancer registry is actually an information system designed to collect, store and manage data on patients with cancer. In many countries there are regulations that actually mandate that any human healthcare provider who treats patients with cancer collects and reports that type of information. And there are actually 700 human cancer registries. One of the questions that I usually get when I'm talking about cancer registries in animals is: why do we need them? And in particular, why did we need our project, ACARCINOM? So you need to think that in Australia there are nine million dogs and cats, but we don't have any reliable and accurate information about cancer cases in our pets. So how can we get that information, how can we answer all those questions on the slide? Only through a systematic and standardized collection of cancer cases, which is actually the foundation of a cancer registration system.
The other issue that we have in the veterinary world is that cancer is not reportable. And there are no databases of cancer cases covering a large enough population in enough detail to make reliable statements about overall cancer rates. There have been a few studies attempting to assess the rates of cancer in dogs and cats, but they've been using different methodologies and different populations, and they've been conducted at different times, so we can't really make any comparison among them.
And the second question that you may have is: can we get this information from other cancer databases or cancer registries? Not really. In this slide I've included all the animal cancer registries that have been reported so far. Many of them are under this big red cross because they've been discontinued. But as you can see, in the last few years there has been interest in setting up cancer registries at local, national, state or multi-state level, with a few initiatives in Italy, Brazil, Portugal and the US. But again, cancer diagnosis in animals may change according to the country. The most common breeds may have different cancer predispositions, or there are specific environmental factors in each country leading to different cancers. So we can't really translate information from other countries or other cancer registries to the Australian scenario. And now that we are in the AI era, I've tried to interrogate ChatGPT, and it's not helpful either. So if you ask AI what's the most common tumor in dogs and cats in Australia, you're not going to get very far.
So this is where ACARCINOM came in. The ACARCINOM platform is collecting data from veterinary pathology labs within four academic institutions, Murdoch, UQ, Adelaide and Sydney, and from two commercial veterinary pathology labs. Basically, data are transferred into the ACARCINOM data warehouse, where they can be used to analyze trends of cancers, to look at geographic distribution and identify potential environmental risk factors of cancer, or to conduct comparative studies and see if there are shared risk factors between human and animal cancers. But when we started this project, we had to look at the main features and requirements of a cancer registry. Now, the value of a cancer registry and its ability to carry out different activities rely heavily on the quality of the data and the quality of the control procedures in place.
And there are four main indicators to measure data quality in cancer registries. The first one is called completeness. It is the extent to which all cancer cases within a population are captured and included in the registry database. And that's important if you want to calculate incidence rates or survival proportions. Then you have validity, or accuracy, that is the proportion of cases in the registry with a given characteristic, for example age or breed, that truly have the attribute. This measure depends on the quality of the data source and even the skills of whoever is registering or reporting those cases. Then we have comparability, that is the extent to which all the procedures or the definitions of recording adhere to standard guidelines. And the last one is timeliness, which is the least important because, you know, there is in a way a trade-off between being quick and having accurate data. So it is the rapidity with which the registry can collect, process and report reliable cancer data.
So, keep in mind that, as a cancer registry, our project had to comply with those requirements. At the beginning of the project we discovered many issues in terms of cancer data standardization and data quality that human cancer registries do not have. The first big issue is that we don't have international guidelines for cancer registration in animals, and there are differences between how cancers are reported in humans and animals. So there is this international association called the International Association of Cancer Registries that was founded in the 60s. It's fostering exchange of information between human cancer registries worldwide, so that they can improve data quality and comparability. We don't have any similar association in veterinary oncology. And we don't even have many veterinary cancer registries to compare with.
Another difference is that the pathology reports may be different, so they may have different types of information. For example, in the veterinary pathology report, we are not talking about staging of the tumor or the TNM staging system. Or, you know, in humans there is also information about particular molecular markers that may be useful for follow-up information. And we don't have that either.
The second problem in terms of data standardization and data quality is lack of uniformity in data reporting and extraction. So remember that we are getting data from different data providers. The ACARCINOM project was faced with importing a large number of records in different data formats from several organizations. Each organization uses its own export format for patient and case information, including diagnostic results, which are held in the form of a text-based report. And these reports hold the core data the ACARCINOM project is interested in. While the report is often semi-structured, it's a big challenge to extract the data required by the project. I just want to show you a few examples of what I'm talking about. This is one example of the original data from one data provider. As you can see, it's a big free-text format in which you have information about the animal at the top, then the description of the sample, and then the name of the tumor and the site of the tumor. But then you look at another data provider, and those are extracted data from another data source in which you have less free text, but there is still a different type of organization of the data. For example, it is in columns, and you have the name of the tumor and the site of the tumor in this particular column.
The third very important issue that we had is lack of uniformity in terminology and lack of a standardized cancer vocabulary.
That means that, in human medicine, they refer to the WHO classification of human cancers, where every tumor has a specific name and it's coded, so there is a specific number, and that's the standard. We don't have an analogous classification for animal cancers. There is not even consistency across textbooks or publications regarding the names of specific entities, or you may have pathology reports talking about the same tumor with different names. The classic example is lymphoma. So you can have lymphoma reported as lymphosarcoma, or non-Hodgkin lymphoma, or malignant lymphoma, or malignant lymphosarcoma. So different names for the same entity. And this may seem pretty pedantic. When I'm talking about the importance of a cancer vocabulary, many people say, oh, what's the point, you know, it's pretty paranoid. But really, having a standardized vocabulary is extremely important for data extraction and transformation in a cancer registry.
So we had those three big issues. And we took a multi-level approach to tackle all these issues in terms of data standardization and data quality. The first approach is determining our data requirements. Human cancer registries record data with a set of variables for each cancer case. The International Association of Cancer Registries set out 10 or 11 basic variables, and no cancer registry could function with less than that. So that's the minimum data set. Of course, well-developed human cancer registries may include a different or an expanded set of variables. But you need to remember that the data are being collected from secondary sources, so pathology reports and clinical records, not from the patient. So those pieces of information might not be routinely available. So what we did was keep in mind that we need to have at least a minimum set of requirements to be an accurate cancer registry. And we also looked at small data sets from different data providers so we could have an idea of what level of data we could get.
We set up a list of minimum requirements for our database that is, in a way, mirroring the minimum set of requirements for human cancer registries. And then we decided to have strict inclusion and exclusion criteria: for example, cancer diagnoses confirmed by histopathology only, so cancer cases without histopathology are removed. And then only certain diagnoses, so we want to avoid uncertainty, that is, for example, having a pathology report saying it could be a lymphoma versus another type of tumor. So in a way this type of approach led us to tick the boxes of completeness, comparability and validity of our data.
Then the second type of approach is analyzing the data format. We asked each data provider to give us a small data set of raw data, like 100 cases. We analyzed those small data sets in order to define each step of the data extraction and transformation. Just to give you an example of what we did, this is basically an extract from the analysis of one particular data set. The first column contains all the variables of the original data set. The second column shows what each variable means. And this is important because then we could decide what to keep and what to take out. And then we transformed those variables into our ACARCINOM standard information. So, for example, from the gender reported in the original raw data set, from this particular variable, we could get information like the sex and castration status of our animals.
The main issue was actually trying to find information about the tumor within the free text, because again, the extracted data might be different and they contain lots of free text. But we realized that the reports from each data source in most cases follow the same level of standardization. For example, reports might contain headings such as "diagnosis" or "morphological diagnosis". And we knew that under that particular heading we had the name of the tumor or the site of the tumor.
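As an illustration only, heading-based extraction of this kind could be sketched in Python as follows. The heading names, the sample report and the function `extract_section` are hypothetical and are not taken from the project's parser; each data provider's reports would need their own heading list.

```python
import re
from typing import Optional

# Hypothetical headings; a real provider's reports may use different ones.
HEADINGS = ["HISTORY", "GROSS DESCRIPTION", "MORPHOLOGICAL DIAGNOSIS",
            "DIAGNOSIS", "COMMENTS"]

def extract_section(report: str, heading: str) -> Optional[str]:
    """Return the text under `heading`, up to the next known heading or the end."""
    any_heading = "|".join(re.escape(h) for h in HEADINGS)
    pattern = re.compile(
        rf"^\s*{re.escape(heading)}\s*:\s*(.*?)(?=^\s*(?:{any_heading})\s*:|\Z)",
        re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(report)
    return match.group(1).strip() if match else None

# Invented example report, for demonstration only.
report = """Species: Canine
Breed: Labrador
HISTORY: Mass on left forelimb.
MORPHOLOGICAL DIAGNOSIS: Mast cell tumour, grade II
Skin, left forelimb
COMMENTS: Margins are narrow.
"""

diagnosis_block = extract_section(report, "MORPHOLOGICAL DIAGNOSIS")
```

Here `diagnosis_block` holds the two lines under the heading, which is the text block a later step would search for the tumor name and site.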
Using these headings we constructed regular expressions to find the text blocks of interest, in order to have an accurate parsing system in place for each data provider.
Third, we wanted to overcome the issue of lack of uniformity in terminology. So the third task was defining standards. It's not just the site of the tumor. It's not just the name of the tumor, but we wanted to define even our standard for the name of the breed in dogs and cats, because it may be written in different ways. And also the level of certainty of the diagnosis. So we created a set of ACARCINOM dictionaries, or data vocabularies, by referring to the International Classification of Diseases for Oncology in humans. This ICD-O is a multi-axial classification of the site, morphology, behavior and grading of human cancers. It also assigns specific numbers, or codes, to the site of the tumor and the name of the tumor. ACARCINOM is partnering with the Global Initiative for Veterinary Cancer Surveillance in trying to define standards for cancer registration and coding. And we actually published the corresponding veterinary coding system. So we now have a standardized system in place for classification and coding.
At the beginning, I just had my gigantic Excel file. For example, this is an extract from the ACARCINOM data dictionary for the name of the tumor, in which we had to define our standard name and all the possible synonyms, to allow for the highest level of accuracy in the data extraction process. A synonym can be a common misspelling, a name change or a common name. This is how the dictionary looks in the ACARCINOM database. Each data dictionary entry holds the preferred name of the tumor together with additional information like the coding or the display color, and all the synonyms used by the parser to extract data. So if a synonym term is found within the report text, the parser will map it to the preferred term.
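A minimal sketch of that synonym-to-preferred-term mapping, assuming a simple in-memory dictionary. The entries, the structure and the function names below are hypothetical examples, not the actual ACARCINOM dictionary, and the coding and display color fields are left out.

```python
import re

# Hypothetical entries: preferred term -> synonyms (spelling variants, older names).
TUMOR_DICTIONARY = {
    "lymphoma": ["lymphosarcoma", "malignant lymphoma", "non-Hodgkin lymphoma"],
    "mast cell tumor": ["mast cell tumour", "mastocytoma"],
}

def build_lookup(dictionary):
    """Map every lowercased term (preferred or synonym) to its preferred term."""
    lookup = {}
    for preferred, synonyms in dictionary.items():
        for term in [preferred, *synonyms]:
            lookup[term.lower()] = preferred
    return lookup

def find_preferred_terms(text, dictionary):
    """Return the preferred terms whose name or synonym appears in `text`."""
    lookup = build_lookup(dictionary)
    # Try longer terms first so "malignant lymphoma" is matched before "lymphoma".
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    found = []
    for match in pattern.finditer(text):
        preferred = lookup[match.group(1).lower()]
        if preferred not in found:
            found.append(preferred)
    return found

terms = find_preferred_terms("Diagnosis: Lymphosarcoma, multicentric", TUMOR_DICTIONARY)
```

In this sketch a report mentioning "lymphosarcoma" is recorded under the preferred term "lymphoma", so different names for the same entity end up in one category.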
So we had to have a vocabulary in place. And we are also working on translating all that information into Research Vocabularies Australia, so that all those names, synonyms and codes are shareable and accessible. This is just to show you how building the dictionary was very important for extracting data from the different pathology reports from the different data sources, and how effective the parsing system is in doing that. So there is a color-coded display. The parsing system is able to capture the synonym and the preferred term, and then code them according to the veterinary coding system.
There is another important aspect when registering cancer cases. And it's the definition of what we need to record and report in specific situations. That's well standardized in human cancer registries, but not much in veterinary cancer registries. So, when you have a cancer case, it's pretty straightforward to do registration and coding of one particular tumor in one patient. But imagine having one patient with multiple tumors, or one patient with tumors in different organs, or a patient with multiple tumors in one location. Or tumors like lymphoma that have a very systemic distribution, so you really don't even know where it's coming from. So what is the standard for reporting and registering those particular cases? We had to work on that too. We built guidelines for registration of specific tumors. And we constructed an algorithm for coding and registration in particular situations. So we now know that, for example, if you have multiple tumors in a paired organ with similar morphology, they need to be registered as one tumor, according to the human cancer registry standards. So using this diagnostic algorithm, we are pretty sure that we are registering our information and our data in the most accurate way and in the most standardized format.
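To make the paired-organ example concrete, here is a small Python sketch of that single rule. The list of paired organs, the record fields and the function `register_tumors` are assumptions made for illustration. The sketch does not reproduce the project's registration algorithm, and it deliberately leaves every other situation (multiple tumors in one location, systemic tumors such as lymphoma) as separate registrations.

```python
# Hypothetical list of paired organs; the real guidelines would define this.
PAIRED_ORGANS = {"kidney", "testis", "ovary", "adrenal gland"}

def register_tumors(findings):
    """Merge findings in a paired organ with the same morphology into one registration.

    Each finding is a dict with the keys patient_id, site, side and morphology.
    Findings not covered by the paired-organ rule are kept as separate registrations.
    """
    registrations = []
    merged = {}  # (patient_id, site, morphology) -> registration already created
    for finding in findings:
        key = (finding["patient_id"], finding["site"], finding["morphology"])
        paired = finding["site"] in PAIRED_ORGANS
        if paired and key in merged:
            # Same patient, same paired organ, same morphology: one tumor.
            merged[key]["sides"].append(finding["side"])
            continue
        registration = {
            "patient_id": finding["patient_id"],
            "site": finding["site"],
            "morphology": finding["morphology"],
            "sides": [finding["side"]],
        }
        registrations.append(registration)
        if paired:
            merged[key] = registration
    return registrations

findings = [
    {"patient_id": 1, "site": "testis", "side": "left", "morphology": "seminoma"},
    {"patient_id": 1, "site": "testis", "side": "right", "morphology": "seminoma"},
    {"patient_id": 1, "site": "skin", "side": None, "morphology": "mast cell tumor"},
]
registered = register_tumors(findings)
```

With these invented findings, the two testicular seminomas become one registration recording both sides, and the skin tumor is registered separately.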
Another approach that we used in terms of double-checking the quality of our data is doing a check after the first step of data extraction. Now, one of the methodologies used by human cancer registries for checking the quality of data is called re-abstracting. It's a process of re-abstracting records from a given source, coding the data, and comparing the abstracted and coded data with the information recorded within the registry database. The objective is to characterize the level of agreement between the data in the registry and the data re-abstracted and recoded from the source records. So we've been testing our parsing system using random sets of parsed, already imported data and data re-abstracted and recoded from the source records. And then we compare them to identify discrepancies, try to work around those discrepancies, and have the highest level of accuracy in our data.
Now, taking a look at the issues that we had before and going through the multi-level approach in terms of data quality and data standardization, we made sure that the data imported into the system are standardized, accurate and complete, and comply with the main quality indicators of human cancer registries. In this way, we are also ticking the box of comparability with human cancer registries, so we can do some comparative studies on the same type of cancer. Now, if you want to play with the dashboard and take a look at our standardized data, you're welcome to visit our website and our dashboard.
And finally, I just want to thank all the research partners and the ACARCINOM team from the different institutions, and our software engineers, who have been great in understanding the process, because sometimes it's not very easy to deal with animal health data. And thanks to ARDC for supporting this project emotionally and financially. Thank you.
And if you do have any questions about all the different aspects of the project, I'm very happy to answer all the questions that you have.