Okay, great. So I'll be talking about the Ontario Neurodegenerative Disease Research Initiative, specifically the neuroinformatics and biostatistics team and the pipelines, partnerships and products we have in place. I'm Derek. I was a postdoc on this project for a number of years until about a week ago, and I'm now director of advanced analytics in the Data Science and Advanced Analytics division at Unity Health here in Toronto.

So ONDRI, for short, is what we call the project, and it's fundamentally about people. We have a lot of participants, along with their family members, friends and care partners, who have volunteered their time in order for this project to exist. We also have a lot of clinicians, clinical coordinators, researchers, scientists, analysts, trainees and scholars, and management and administration from all sorts of domains and disciplines who make up ONDRI. So we can think of this as a diverse group of individuals trying to achieve the same goals, missions and outcomes.

But really, what matters for today is that ONDRI is data. When we think about ONDRI as a data problem, we usually present it as this hypercube, where the colourful rows represent the different cohorts that individuals can be recruited into, one for each neurodegenerative disorder, and the columns are the assessment platforms, that is, the types of data we have, where every individual has a massive number of measurements for each assessment type. The depth of the hypercube is the number of visits, so how often a participant has their data collected over the project. To make sense of that complexity, and of all the different data types that exist within and across these cohorts, we need a foundation of informatics and statistics to bind it all together, make it easier to work with and make it more informative.

So I'm going to talk about three key things: the pipelines, partnerships and products. But there's going to be a very big emphasis on data standards and data packages in the project, and on how we landed where we landed for our standards in ONDRI.

Briefly, our pipeline starts with what's called Brain-CODE by Indoc Research, a system of systems that houses different types of data: REDCap handles surveys, and SPReD handles imaging and other types of data. From there the data go down to the different platforms, specifically to experts within those platforms, to be processed, curated and prepared. The data then go through a standards check with the neuroinformatics team; if a package fails, it goes back, and if it passes, it goes on to an outlier analysis. Whatever passes the outlier analysis heads out for release.

However, I want to go back in time a little, to when things were not as standardized. We're all probably very familiar with how different data can be, even when it's all from the same project or the same initiative. Here's an example of one data set where we can see some blank cells, and sex is coded as one and two, so we aren't entirely sure what that means. We have something here called redcap_event_name that seems to indicate visits of some sort. Another data set, from a different platform, has N/As, and a visit column full of 02s. Then we get another data set where, looking at the commonalities across all of these, we have some column labels that are a little ambiguous if you don't know what they are.
And we have these very large numeric values that I can tell you are supposed to be missing codes. All of these things reflect the institutional inertia within different fields, disciplines and domains. Typically, this is how different teams would deal with their data if they were not part of a larger-scale effort. So we needed to do something to bring it all together.

And this is where our standards developed from: the basics. We fundamentally required that they be simple, that they be comprehensive, and that they be generally FAIR, which we heard a little bit about from Donnie. And we used some very key resources to determine how we were going to build up our standards. A lot of these are about data sharing, crossing between disciplines, data organization, which we heard a little bit about earlier today from Carl, and quality, curation and the FAIR principles.

So I'll be showing you a few details from ONDRI's standards documentation and from what we call the toy data set, a synthetic data set bundled up to look like a real ONDRI data package, with all the bells and whistles required so that people can follow along and develop these data packages on their own, within the project, or outside it if they ever adopt these practices.

Here's an example snapshot of what we call a tabular data package, with all the different files that are essential and required. We also require a very specific naming convention in the project, which allows us to quickly and automatically detect when something is missing or misnamed, to make sure all the pieces are in place.

The data package is fundamentally several files that are required and some that are optional. To give a quick overview before going into detail: we have a README file, a structured tabular file in CSV format, that tells you what is in the package. We have a methods document that gives you all the nitty-gritty details about how the data were produced. And we also have a missing file. This is a key level of detail that tells you when certain participants' observations are missing on the whole from a data package, even though we may see data from those individuals in other data packages or on other platforms. So this is missingness on the whole.

I'll go into more detail on our dictionary and data files specifically. Our dictionary file is comprised of four columns: COLUMN_LABEL, DESCRIPTION, TYPE and VALUES. The first of these, COLUMN_LABEL, lists the variables you will find in the data file, and DESCRIPTION is a short, approximately 200-character description of what each variable means, in relatively plain language. If we now look at the data set itself, we can see that the column labels appear in the same order in the data file, and underneath those headers are the data themselves.

So, I promise not to take us through all 43 pages of documentation, but I will take us through some of the key pieces that we established as part of our standards that are a little different from most other standards, with some reasons why.
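To make the dictionary-to-data relationship and those automated checks concrete, here's a minimal R sketch, with hypothetical file names standing in for a real package (this isn't our actual tooling, just the shape of the check):

```r
# Read a (hypothetical) dictionary file and its matching data file.
dict <- read.csv("SYNTH_EXAMPLE_DICT.csv", stringsAsFactors = FALSE)
dat  <- read.csv("SYNTH_EXAMPLE_DATA.csv", stringsAsFactors = FALSE)

# The dictionary's COLUMN_LABEL entries must match the data file's
# headers exactly and in the same order; fail loudly otherwise.
stopifnot(identical(dict$COLUMN_LABEL, names(dat)))
```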
The key indicators are absolutely imperative across these packages, and they are found in every dictionary and data file. These are SUBJECT, VISIT, SITE and DATE, and they must be in the first four positions of all dictionary and data files. This ensures that we can combine and merge data almost arbitrarily across the entire project.

We also require data types as part of the data dictionary, which helps data consumers understand the intent of a measurement. As curators, we usually know what the values are supposed to be and what they're supposed to measure, but this isn't always familiar to every type of data consumer once they get access to the data. We have a small list of types defined for the project, including text, categorical, numeric and even mixed data, for when something changes from, say, decimals to greater-than or less-than values, which we needed to accommodate.

One in particular that I'd like to point out, which may be controversial, is our date type, and in particular our date format. We've elected to use a non-standard date format. ISO 8601 is regarded as the standard and as unambiguous; however, in our experience it's only unambiguous if everybody knows it and everybody uses it, which is not the case. This is especially problematic when you're trying to determine whether a date is January 12 or December 1. And this is a big deal in aging and neurodegenerative research projects, because knowing whether data points are very far apart or very close together matters, especially as individuals with neurodegenerative disorders are going to decline. So having absolutely firm, unambiguous dates is key. We elected to go with four digits for the year, three characters for the month and two digits for the day. Effectively, this is a contract between data curators and data consumers: we are putting a stamp on this to say we have verified, unambiguously or at least to the best of our ability, that this is the date we believe data collection or some occurrence happened. We were also motivated by the fact that this format is easily readable and convertible for both humans and machines.

Missing codes are another element we brought into our standards. What we did was survey the project, and lots of individuals across all the different platforms, and ask: what kinds of data are going to be missing in the Ontario Neurodegenerative Disease Research Initiative, now and in future projects? And we came up with what we believe is a fairly comprehensive list of the types of missingness that could occur. So instead of something just being blank, or instead of arbitrary numeric codes, these are predefined, harmonized ways of indicating in the data that a value is missing for a very specific reason. We can see a bit of this in the toy data set under NAWM_PERCENT, the normal-appearing white matter percentage, for a specific participant. There we have M_ART, which is a much better signal of missingness: it tells us the value is missing due to an artifact in the neuroimaging collection or processing pipeline, as opposed to, say, an administrative error, or an inability, due to cognitive or behavioural deficits, to administer some assessment.
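As a quick illustration of how machine-friendly these choices are, here's a small R sketch, patterned on the toy data set but with hypothetical file names, that parses our date format and separates missing codes from real values (I'm assuming here that all missing codes share an M_ prefix):

```r
dat <- read.csv("SYNTH_EXAMPLE_DATA.csv", stringsAsFactors = FALSE)

# YYYYMMMDD (e.g., "2018JAN12") parses unambiguously:
# %Y = 4-digit year, %b = 3-letter month, %d = 2-digit day
# (assuming an English locale for the month abbreviations).
dat$DATE <- as.Date(dat$DATE, format = "%Y%b%d")

# Missing codes such as M_ART live inside otherwise numeric columns,
# so tally the reasons first, then convert them to NA before any math.
is_code <- grepl("^M_", dat$NAWM_PERCENT)
print(table(dat$NAWM_PERCENT[is_code]))
dat$NAWM_PERCENT <- as.numeric(replace(dat$NAWM_PERCENT, is_code, NA))
```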
This brings me to the next stage. I'm going to shift out of our standards and into another piece of our pipeline: the outliers process. It's pretty important for us to establish whether anomalies are errors or not. We have a lot of data on a lot of individuals, and they're very heterogeneous. So we want to know: are these very strange values real and reflective of something specific, or are they errors in collection or errors in processing?

The neuroinformatics and biostatistics team performs what we call the outliers pipeline for most of the data releases, nearly all of them. When data come in in the standardized format, we run a series of analyses. The first step is a partial least squares regression, because these are effectively multivariate, multi-response data; we use either a generalized or a specific form of partial least squares regression to handle, say, age and sex as covariates, for virtually all data sets. Then the first stage of the outlier analysis proper is a principal components analysis or a correspondence analysis, depending on the data types: PCA is generally for numeric data, while correspondence analysis can be used for categorical, ordinal or mixtures of different data types, including continuous. The next stage is the minimum covariance determinant, or a generalized version that we developed in the project to handle mixtures of data types, which identifies which participants may be outliers in a specific data package. And finally, to find out which items are driving these anomalous observations, there's something called the CorrMax procedure, where again we developed a new variant, a generalized CorrMax, so that we can handle different data types.

Once all of that is said and done, we don't just hand back those pictures and lots of stats and multivariate things, because that makes most people upset. We bundle them up into fairly straightforward, harmonized reports and give them back to the different curation teams and curators, so they can review them, look in more detail at specific individuals or groups of individuals, and say: yes, this is an error, I need to go fix this in our pipeline, or that was not collected correctly; or: yes, this is real, it's just very strange.

All of this is supported by a lot of software that we've developed, some of it before this work began and some of it as we actually went through the project. These are the standards and outliers applications, as well as an R Markdown template, and they're in place so that we can effectively guarantee harmonization of the process, regardless of what data are coming in.
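To give a flavour of the participant-level outlier stage without the full pipeline, here's a minimal R sketch using off-the-shelf pieces, robustbase::covMcd on principal component scores with simulated data; our generalized MCD and CorrMax variants for mixed data types are a different, more involved implementation:

```r
library(robustbase)  # provides covMcd()

set.seed(1)
X <- matrix(rnorm(200 * 6), nrow = 200)  # stand-in for a numeric data package

# Stage one: principal components of the (scaled) data.
scores <- prcomp(X, scale. = TRUE)$x[, 1:3]

# Stage two: minimum covariance determinant for robust distances.
mcd <- covMcd(scores)
rd  <- mahalanobis(scores, center = mcd$center, cov = mcd$cov)

# Flag candidates against a chi-squared cutoff; these go into a report
# for curators to review, not an automatic rejection.
which(rd > qchisq(0.975, df = ncol(scores)))
```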
So this leads me to partnerships; I'm going to go a little faster through these next sections. There are two major components to the partnerships between all of the different platforms in the project, and more specifically between the neuroinformatics and biostatistics team and the rest of the platforms. One way we form pretty strong partnerships is by working directly with the teams. Here you're seeing snapshots of some GitLab repositories where we've worked with different platforms to get pipelines set up so that their data are formatted and tested, with checks along the way, and come out of that code in a standards-ready format, so that people don't have to do anything manually: they don't have to check things, they don't have to compute things. We're working with a lot of teams to do that. The other side of this is really critical: because we have a lot of technical skill, we need to provide our time and training to a lot of the other teams. That includes phone calls, site visits, one-on-ones, and a variety of different beverages after those meetings, as well as lots of reference material that we make available, usually through Google Docs or on our GitHub.

We've also run many workshops, to build closer ties and give deeper explanations of how, what and why we do the things that we do. Another major component of this is the formalization of teamwork: what does it really mean to do each of the different pieces? One way we've pushed for this is through something called the CRediT taxonomy. A lot of people do a lot of things, and a lot of it can be hidden in this process, so we're trying to uncover that through the formalization of teamwork.

Next I'm going to talk about some of the products. Products are vital because they support both the partnerships and the pipelines; we can't do any of this other work without the things that make it all run. You've seen five of these already, out of the many that we have: the documentation, the toy data packages that serve as very concrete examples, and some of the tools. I want to highlight that we need these to make working with our data much easier, and we needed to build them around the standards and the goals and missions of the project. I'm going to highlight just two, on kind of opposite ends, that are some of our favourites and, I think, the most important ones, one well established and one up-and-coming.

One of the products is actually a clone of the wesanderson package by Karthik, one of the organizers of this csv,conf conference. It is frankly our most popular R package: it harmonizes the colour palette across the project, so that whenever someone makes a new graph, we all get the same colours across all of our papers.

One of the next most important products is something that's at the prototype stage, and it ties right back to the standards. We've developed something called the ONDRI data frame, or ONDRI DF for short. The goal is to scoop in the data and dictionary files, preserve all the missingness while mapping it out, give you a description of how many things are missing and why, and carry that preserved information about the variables from the dictionary along with the data frame.
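As a rough sketch of the idea, not the package itself (the function name ondri_df and its behaviour here are simplified stand-ins for the prototype), it's something like:

```r
# Read a data file with its dictionary, map the M_-prefixed missing
# codes to NA while keeping a per-column tally of the reasons, and
# carry the dictionary descriptions along with the data frame.
ondri_df <- function(data_file, dict_file) {
  dat  <- read.csv(data_file, stringsAsFactors = FALSE)
  dict <- read.csv(dict_file, stringsAsFactors = FALSE)

  missing_summary <- lapply(dat, function(x) table(x[grepl("^M_", x)]))
  dat[] <- lapply(dat, function(x) replace(x, grepl("^M_", x), NA))

  attr(dat, "descriptions")    <- setNames(dict$DESCRIPTION, dict$COLUMN_LABEL)
  attr(dat, "missing_summary") <- missing_summary
  dat
}
```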
So this brings me to the conclusion. A lot of this is not possible without a strong culture: seeing the same missions and trying to achieve the same goals. It can be very hard to sell a lot of technical things to large-scale projects, but with close bonds and time spent with people, the value in all of these things starts to become very apparent, and everybody gets on board eventually.

With that, I'd like to acknowledge a lot of different individuals and organizations. This work was primarily done at Baycrest, through funding available from the Ontario Brain Institute and the Ontario Neurodegenerative Disease Research Initiative, with a variety of individual platforms who were subjected to many early versions of our standards and software; we apologize to, and are appreciative of, the collaborators, friends and colleagues who worked with us while we built these tools up from the ground. Those include the core group listed here, who really pushed for a lot of the standardization and helped establish a lot of these pipelines; a lot of the students, RAs and researchers who have worked with the neuroinformatics team over the years and who are really running all these pipelines and making sure everything works; and a number of individuals across the entire project, in management, leadership and the different platforms, who have been vital in both supportive and collaborative ways. I have a million references, so I'm just going to flip through these quickly and leave us with some resources and, finally, a thank you.