Our next speaker is Nolan Nichols. It's a pleasure to introduce him. Nolan and I got to know each other through the INCF data sharing task force many moons back. Nolan has done an incredible set of work over the years on representing ontologies in various biomedical spaces. And then he's gone over to the other side, and I have no idea what he does. So today we are going to hear and find out what he does on the other side. I hear it's something related to metadata. Yes, we'll get into metadata.

Hopefully I'm in the right session. So hello everybody, and thank you for inviting me to present today. It's quite exciting to be here; I see a lot of familiar faces. I haven't been to this conference in a couple of years, so it's really great to be invited back to share what I'm up to. As was alluded to in the introduction, I've now moved to the other side and am working in industry at Genentech, which is a member of the broader Roche organization. So when I talk about meaningful metadata at scale, this is really at the enterprise level: there are close to 100,000 employees and a lot going on, and the real challenge is all the socio-technical work of getting everybody to work together to move forward with science and precision medicine in the clinical trial space.

I thought I'd motivate our discussion today by talking a little about one of our trial programs in Alzheimer's disease. As many of you know, Alzheimer's disease is a progressive disease of the brain that causes problems with memory, thinking, and behavior, and it's the most common form of dementia. It's also a large and growing unmet medical need: over 45 million people worldwide are affected by Alzheimer's, and that number is growing substantially. We desperately need some form of medication to treat these patients, and not only the patients but also the caregivers who are spending money out of their own pockets and spending their time to take care of their loved ones. To address this, one piece we'll talk about is the biology behind the biomarkers and how Alzheimer's progresses. There is a popular framework, or model, of how AD biomarkers can be used for in vivo staging of the disease. The model focuses on five of the most well-known biomarkers, which can be divided into measures of brain amyloid-beta deposition and measures of neurodegeneration. The major AD biomarkers become abnormal in a temporally ordered manner: CSF Aβ and amyloid PET are dynamic earliest in the disease, followed by CSF tau and FDG PET, then structural MRI, and finally clinical symptoms. The biomarkers earliest in this cascade are denoted upstream, whereas those closest to clinical symptoms are considered downstream. Within this framework, most current treatments for Alzheimer's disease focus on treating the symptoms, but the general consensus is that earlier treatment is key to preventing or delaying the onset of neurodegeneration.
When developing therapeutics for intervening in this disease, one thing to be aware of is that amyloid plaques are considered a key feature of AD pathology. The amyloid precursor protein is cleaved, releasing a variety of isoforms of the amyloid-beta peptide. In particular, aggregation of the Aβ42 isoform leads to toxic oligomers, as well as downstream amyloid plaques and tau pathology, with beta-amyloid aggregation and other processes leading to neurodegeneration and neuronal loss. Our current portfolio of medications under evaluation includes molecules that target tau, but the ones furthest along the pipeline are gantenerumab, which targets both aggregated forms of Aβ, the oligomers and the amyloid plaques, and crenezumab, which is more specific to the oligomeric form of the toxic Aβ42.

Let me introduce a little background on our crenezumab program, in particular the phase two program, which I'll discuss in the context of metadata to some extent. There were two phase two studies: a larger cognition study called ABBY, and a smaller biomarker study called BLAZE. The enrollment criteria for both trials focused on mild-to-moderate AD, with enrolled patients between 50 and 80 years old. There were two arms in each trial: one used a subcutaneous dose of 300 milligrams every two weeks, and the other a higher dose administered intravenously at 15 milligrams per kilogram every month, about a two-and-a-half-fold higher exposure. The primary readouts were done after 72 weeks. The primary endpoint for the ABBY cognition study was change in the ADAS-Cog as well as the CDR Sum of Boxes; for BLAZE it was medical-imaging related, a PET readout.

I'll focus here on the ABBY study and its cognition piece. In the overall population, the primary endpoint was not met for this readout. But if you compare the low dose to the high dose, there was a trend that the high dose may have some treatment effect, suggesting that a higher dose may be better. And if you look at a pre-specified sub-cohort within the trial, at progressively milder subsets of patients, you can see that the drug might be working, but that patients would need to be treated earlier with a higher dose. Across the whole spectrum of patients with MMSE scores from 18 to 26 we saw a 16.8% effect, which grows to 23.8% and finally to a 35% effect in the mildest subset. So while the primary endpoints were not met, this still suggested that crenezumab may have some effect, consistent with what's being seen in other anti-amyloid studies reported by other pharmaceutical companies. From this, the decision was made to continue moving forward with the phase three program in a more prodromal-to-mild population of patients, and also to evaluate what higher doses were safe in a smaller study.
As part of this trial we did a post-hoc analysis, looking for trends within the data that would give us some degree of evidence for moving the program forward. However, a trial also creates a lot of auxiliary data: all the patient visits, vital signs, biosamples, and imaging data that weren't included in that analysis and weren't needed just to move the trial forward. So there's a question left hanging, and not just for Alzheimer's but for all the clinical studies and trials being conducted across therapeutic areas: how can we reuse these data to find actionable insights that let us identify new targets or biomarkers, or simply better understand the core biology of these diseases?

If we look at precision medicine, the traditional approach within clinical trials is that early research and development identifies a target and a molecule, which then goes through a series of preclinical and clinical trials, and if the medication makes it through, it ends up in clinical practice. That process produces a very rich source of information that traditionally has not been reused. Hence the concept of repurposing the data, referred to as reverse translation, where you take the results of the clinical trial, along with all the samples and other information, and feed them back into research and development. While this may sound like something that's been done frequently in academia, in the clinical trials world these two parts of the organization have really been separated, so it's still a newish idea, over the past few years, to operationalize reusing clinical trial data and putting it into the hands of bioinformaticians and data scientists.

However, if you take the data from a clinical trial and hand it to the data scientists and bioinformatics folks sitting in research, who have never seen this type of data before and don't really understand how clinical trials are designed, they're going to have a hard time actually using it. There's a figure from a Forbes article that came out a few years ago, based on a survey asking data scientists where they spend their time: about 20% of their time collecting data sets and another 60% just cleaning them up and making them usable, all before they get to the fun part of actually doing their analysis and training their models. I'd argue that within the biomedical sciences it might be even more than 80% of your time spent getting your data, wrangling it, and making it usable. So we have this question of how we pay that price only once, pushing all the data curation and integration up front, so that the people trained to do analysis can get directly to work. That's where the concept of FAIR comes in. As we heard in Michelle's talk, FAIR provides the principles around which we can make optimal reuse of data. And interestingly, the FAIR principles have really made their way into industry.
Roche-wide, the FAIR principles have been a rallying cry. I see them in our internal presentations discussing the need to make data FAIR, and they've become the banner under which we try to be better stewards of our data, to the extent that there are internal projects where we've hired data curators and integrators and created a variety of new positions focused just on making data FAIR.

Briefly, let me go over where some of these challenges arise and where the FAIR principles are actually being applied. As you probably already know, clinical trials are generally multi-site. The sites collect a variety of types of data that go to different vendors, which provide certain analyses or run assays whose results are fed back to us through vendor services. These feed either into our sample biobanks, where we keep track of the physical samples, or, for assay results and imaging analysis results, into a clinical data warehouse in a more tabular, CSV-like form. When these data come in, they're in a very raw form and not ready for analysis. Making them analysis-ready means harmonizing them and bringing them under certain metadata standards. So one of the first stages after the data come in is to apply standards, which can then be used to derive different measures downstream that feed the actual deliverables for regulatory approval and so on.

The standards used are called CDISC, and as of December 2016 the FDA requires that any regulatory filing follow the CDISC standards. This is very much the stick approach, "thou shalt follow the standards," and it has helped force adoption of metadata standards within the pharmaceutical industry. The idea, then, is that for clinical scientists working with data that are part of a trial, the data are more or less FAIR and follow these standards. But you might ask what CDISC is and what this framework covers. CDISC stands for the Clinical Data Interchange Standards Consortium, and it's been around since the late 1990s. It's a framework covering a broad spectrum of standards: everything from planning a trial, so the protocol and study design, through the actual data collection sent in by different vendors, to data tabulations. The tabulations are the core workhorse of what we actually end up using; this is the SDTM standard. There are also data analysis standards, where the data are organized in ways specific to the different deliverables and figures that are part of clinical trial milestones. Looking a little closer at SDTM, it's broken down into a number of implementation guides: the core guide for human clinical trials, one for medical devices, and one for associated persons, people who are not themselves participating in the trial but surround it, such as family members or physicians. Then there are more domain-specific standards describing how to collect data around questionnaires or pharmacogenomics, and therapeutic-area extensions that go into areas like Alzheimer's or asthma.
In addition, there is a set of controlled terminologies that are allowed to be used. Looking at the core SDTM implementation guide, we have a variety of classes, and each class has what are called domains that model specific types of data sets. The classes range from special purpose, which covers things like demographics, to the three general observation classes, interventions, events, and findings, each with its own description of what's important to capture, as well as other auxiliary pieces like trial design and relationships between data sets.

Looking at an example of one of these: the data that come into our organization from the different sites may be organized like the lab test results shown at the bottom here. We may have a study number that maps over directly. We may have a patient identifier where we need to modify both the header, the column name, and the value itself; here it's just removing a dash, but it could be some other transformation. The other fields also map over. But in addition to those mappings, we also include additional codes: key-value pairs giving the lab test name, what category it's in, and the units it was collected with, as well as knowledge brought in from outside, for example from the World Health Organization, stating what the low and high reference ranges for that value are. This really augments the data you receive with additional information that can be used to automate downstream analyses and speed up our ability to generate deliverables and results.

However, across a large organization like this, how are you able to pull all this information together and follow the same common guidelines? What this comes down to is data governance and having a metadata repository: a core location that everybody references as the gold standard for how to represent information. Internally we have something called the Global Data Standards Repository. Here we model each of the SDTM domains, so it has all the different classes we can access, what the applicable or valid values are for each field, and a breakdown, for each column, of exactly the schema you're allowed to use. Furthermore, we create releases of these standards, so a specific trial has to use specific versions of the standards. Interestingly, this whole metadata repository is itself modeled in a FAIR way: it uses linked data, RDF, in the background and is natively what's called a triple store, so all the terms, IDs, and versions have URIs, unique identifiers that can be referenced. So that's the actual clinical trial process. But for discovery-based research we need to take not just these clinical measures and pull those over; we also want to take the biological samples from these studies and feed those back into research for other exploratory assays.
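To make that lab-results mapping concrete, here is a minimal sketch in R. The raw column names, the dash-stripping rule, and the reference-range values are hypothetical illustrations; only the target variable names (STUDYID, USUBJID, LBTEST, and so on) come from the SDTM LB domain itself.

```r
# Minimal sketch of the raw-to-SDTM mapping described above. The raw
# column names, dash-stripping rule, and reference ranges are
# hypothetical; the target variable names come from the SDTM LB domain.
library(dplyr)

raw_lab <- tibble::tibble(
  STUDY  = "ABC123",
  PAT_ID = "1234-001",     # vendor-style identifier with a dash
  TEST   = "Glucose",
  RESULT = 5.4,
  UNITS  = "mmol/L")

sdtm_lb <- raw_lab %>%
  transmute(
    STUDYID  = STUDY,                  # maps over directly
    USUBJID  = gsub("-", "", PAT_ID),  # rename the column, transform the value
    LBTEST   = TEST,
    LBCAT    = "CHEMISTRY",            # added controlled-terminology code
    LBORRES  = RESULT,                 # result as originally collected
    LBORRESU = UNITS,
    LBORNRLO = 3.9,                    # low/high reference ranges brought in
    LBORNRHI = 5.6)                    #   from an external source
```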
And we want to provide those to the bioinformatics scientists in data structures they're familiar with and able to use. So we can pull the clinical data, which has already been harmonized and follows these nice standards, into a data structure that we can then provide to bioinformatics scientists in a FAIR way. The data structures that many of these folks are used to, at least within our organization, come from Bioconductor. The idea is that we can eliminate that 80% of time spent cleaning data sets by using curators and integrators who pull the data sets together into concise data objects, so analysts can start their analysis right away. We may have a variety of assay types, like RNA-seq, NanoString, or Fluidigm, that can be integrated into MultiAssayExperiment objects. These are data structures that provide all the methods for manipulating and integrating multiple assays for, in this case, a given clinical trial, so that you have an efficient way to construct and subset them, along with a variety of other analysis tools, like survival analysis, that can operate on top of these objects, as well as dashboards and interactive apps built with tools like Shiny; a sketch follows below.

But doing this is hard. It's not something that can be automated; it's highly custom, and people have to go in and do manual quality control, make sure the pipelines for the assays all follow the same versions, and map all the different identifiers, which is extremely challenging. What we want is for the consumers of these data to be able to trust them, reuse them, and discover what data are available and what results have been produced. Along these lines, we've developed a system internally that we call G-Rex. What is it? It's a system for recording, tracking, storing, finding, and retrieving computational results, including data and these MultiAssayExperiment objects, in a FAIR manner. The key goal is to apply the FAIR principles to generated exploratory and one-off analyses, not just things that can be incorporated into a standard pipeline. There's our logo: it has a dinosaur, because dinosaurs are cool, but also because we want to leave behind an archaeological record of all the information that went into constructing both data and results, which people can unearth later and use.

G-Rex is based on four pillars, and here I have a simple mapping to FAIR (though I suppose it could go into more detail now that I know about the FAIR metrics). The core piece is an archival system where we store a standard directory structure of the data being produced, along with metadata; a provenance pillar that creates relationships in a graph describing how the different results and data are connected; a discoverability portal for finding data; and a reproducibility component, where we can take the code, data, and environment that were submitted and reproduce, or provide certain guarantees, that the object you're consuming was actually constructed by the system and could be reproduced.
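As a rough illustration of the kind of object involved, here is a minimal sketch of constructing and subsetting a MultiAssayExperiment; the gene counts, patient identifiers, and treatment arms are invented for illustration, and a real trial object would carry many assays plus an explicit sample map.

```r
# Minimal sketch of packaging trial assays into a MultiAssayExperiment.
# The counts, patient IDs, and arms are invented for illustration.
library(MultiAssayExperiment)
library(SummarizedExperiment)

rnaseq <- SummarizedExperiment(
  assays = list(counts = matrix(
    rpois(6, lambda = 100), nrow = 2,
    dimnames = list(c("GENE1", "GENE2"), c("PT1", "PT2", "PT3")))))

patients <- DataFrame(arm = c("placebo", "low-dose", "high-dose"),
                      row.names = c("PT1", "PT2", "PT3"))

mae <- MultiAssayExperiment(
  experiments = ExperimentList(list(rnaseq = rnaseq)),
  colData = patients)

# Subsetting by a clinical covariate propagates across every assay:
mae[, mae$arm == "high-dose", ]
```

The payoff of the shared container is exactly that last line: a subset expressed once, against the harmonized clinical covariates, applies consistently to every assay in the object.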
And on that last piece, my next point is that the R in FAIR is not for reproducibility. So G-Rex provides some extensions and additions to FAIR: we want to intrinsically link code and results, and include all the versions that went into producing a result, so that you'd be able to reconstruct the environment later for reproducibility. Furthermore, the client we're using can do some smart things in terms of introspecting the objects being created and generating metadata that describes exactly what took place during the construction of a dataset. The reproducibility part is something we're still developing.

To give you an idea of how information flows through the system: there's an R client that right now works primarily with R Markdown. You can take an R Markdown file, a description of both the code and the narrative around how a piece of data or a result is being integrated or created, and submit it to the system, which constructs something we call a client bundle. The bundle has all the metadata for the versions of R, all of the code, who's doing the submission, and all the other information about the actual data objects; for an Rmd it will also include the HTML file, so you'll be able to see the report. This is handled by a controller that stores the bundle in the archive; the metadata is constructed using JSON-LD and loaded into a provenance triple store, where it's represented as a graph. This lets us link objects submitted to the system across different R sessions. Then there's a publication step where all this information is made available, and the discoverability portal can query the provenance and download those results.

I have a couple of screenshots of that. This is one of the early interfaces of what the system looks like. It provides the simple facets you're used to seeing, free-text search, and pretty much any search the Solr search engine supports. All of the results that are submitted get a unique identifier based on a hash, along with metadata extracted from the R objects themselves. If you submit a plot, it can grab the title, axes, and other related metadata. For experiment objects, it can also read out the actual variables consistent with the CDISC standard, so we have the IDs of what information is available within the data sets, searchable and retrievable from the object. For an individual result there are links to download it; eventually we'd like to include links to other data sets that may be related to the one submitted. And there's the exact code that was used to submit the object: we traverse the execution path of all the top-level functions and provide a description of what called what, so you know exactly what code was executed to produce that object.

So the G-Rex platform implements some of the FAIR principles. We store results and associated metadata with persistent IDs, we make them findable and accessible, and we're partially interoperable using this JSON-LD standard.
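To give a flavor of what such bundle metadata might look like, here is a hedged sketch of a JSON-LD document generated from R. The field names and hash values are hypothetical; the vocabularies, Schema.org and the PROV ontology, are the ones the platform maps to, as described next.

```r
# Hedged sketch of the JSON-LD metadata a client bundle might carry.
# Field names and hash values are hypothetical; the vocabularies
# (schema.org and PROV-O) are the ones the platform maps to.
library(jsonlite)

bundle_meta <- list(
  "@context" = list(
    schema = "http://schema.org/",
    prov   = "http://www.w3.org/ns/prov#"),
  "@id"   = "urn:hash:2fd4e1c67a2d28fc",          # hash-based persistent ID
  "@type" = c("schema:Dataset", "prov:Entity"),
  "schema:name" = "Survival analysis results",    # extracted from the R object
  "prov:wasGeneratedBy" = list(
    "@type"          = "prov:Activity",
    "prov:used"      = "urn:hash:de9f2c7fd25e1b3a",  # upstream dataset
    "schema:creator" = "analyst@example.org"))

cat(toJSON(bundle_meta, auto_unbox = TRUE, pretty = TRUE))
```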
So we map to public vocabularies such as Schema.org and DCAT for describing the data sets themselves, as well as the PROV ontology. And we make these data sets reusable through two R packages, histry and trackr. These two tools are the smarts behind the client that does all the metadata extraction; they're used to describe exactly what took place during a session. So that's a summary of where G-Rex is at. I'll add that both the histry and trackr packages are shared out of the bioinformatics group as open source tools, and along with those there are probably 36 or so other open source packages from the group that have made their way into Bioconductor.

With that, I wanted to segue to external sharing. I've talked about how we're processing data and making it available internally, but there are also some external sharing efforts I wanted to make you aware of. One is the Clinical Study Data Request website, a collaboration across several clinical trial sponsors to make actual patient-level data available to external researchers. If you browse the site, you can view over 3,800 trials from across these organizations and submit requests to download patient-level data and use them in your research. From Roche alone there are a little over 200 trials, and that number is going to continue to grow; there's a lot of internal effort going into making these data available to the outside community, because it's seen as something valuable to contribute back.

Along the same lines, and to link back to the Alzheimer's story we started with: the Alzheimer's Prevention Initiative, headed by the Banner Alzheimer's Institute, runs the Colombia trial, on which we collaborate. This is a study of close to 250 enrolled patients who have familial early-onset AD, meaning essentially everyone in this trial will get Alzheimer's disease at some point in their life. We're enrolling people as early as 30 years of age, they're taking crenezumab starting early in life, and we're looking at primary endpoints in cognition and imaging to understand whether we can delay the onset of Alzheimer's disease for these folks.

One final point: I think data standards, these notions of harmonization, and the open source tools, software, and platforms being constructed by the community are a great place where people in industry and people in academia can collaborate to move the field forward. The actual development of molecules may be competitive space, but collaborating on projects like CDISC, to which Roche is an active contributor, is something where we can contribute and move the whole field forward. With that, I'll acknowledge the G-Rex team and folks within bioinformatics, as well as our clinical data standards team and the folks I worked with on AD and the crenezumab program. And with that, I can take any questions.

While we take a few questions for Nolan, maybe I can ask the other panelists to come sit on the chairs. Nolan, that was a great talk.
In terms of the R tools that you've developed, for the bioinformatics people using the data that comes out, how does their workflow change when their analysis is being tracked by the G-Rex system? Is anything actually different? Do they write their code differently? Can you talk about that a bit?

Yeah, sure. For the trackr package, you do need a library(trackr) call at the beginning, and essentially what that means is it will keep track of all the different evaluations of functions as you go along. But other than that, they don't have to change any part of their workflow. When submitting an R Markdown file, there's a separate function called knit_and_record, which stays within the literate programming model: you feed in the R Markdown file and it evaluates it, automatically loading the trackr package and keeping track of everything that's going on. So you really don't have to modify the workflow of the analyst working with the data, which has ended up being a critical piece: if you want adoption, don't change somebody's workflow.

Another question in the back? Thanks again for the talk. Two very related questions. One: there are clearly a lot of moving parts in this whole well-engineered system; how do you keep its sheer complexity from overwhelming users? And second: you talked a lot about what you're trying to deliver. What evidence or feedback do you have that these mechanisms are useful beyond compliance, beyond tracking things? Are people able to do things they couldn't do otherwise? What are your success stories?

Yeah, on the first part, it is very complex, and mainly the way we've handled that is by scoping down to specific use cases. So we'll take, say, cancer immunotherapy and focus G-Rex just on the use cases around that specific project. In terms of success stories, this was released internally fairly recently, but we've gotten a lot of good feedback from folks. They're really used to working on file systems and directory structures, so just migrating to web applications that actually hold the data, and providing tools where they can computationally access these objects from a session, search for certain studies, and see what data are available, is a real step forward from where they're coming from. From what we're hearing back, yes, they support it, but there's a whole set of other issues around data versioning: people will identify an issue with a data set and we have to go back and redo it. So there are unsolved challenges there, and I've been learning from other projects similar to this, even here at this conference, trying to understand the best path forward for data versioning and other things like that.
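For readers who want a feel for the workflow described in this answer, here is a minimal sketch. The calls follow the talk's description of the trackr client, a library(trackr) call plus a knit_and_record function for R Markdown, but the exact signatures should be treated as assumptions rather than a definitive API, and the report filename is hypothetical.

```r
# Sketch of the analyst workflow described above, under the assumption
# that the client exposes the calls named in the talk: loading trackr
# enables tracking, and knit_and_record() submits an R Markdown report.
library(trackr)   # from here on, function evaluations are tracked

# Ordinary analysis code; nothing changes for the analyst.
fit <- lm(mpg ~ wt, data = mtcars)
plot(mtcars$wt, mtcars$mpg, main = "MPG vs weight")

# Literate-programming route: knit the report, evaluate its code with
# tracking enabled, and record the results and their provenance.
knit_and_record("abby_survival_report.Rmd")  # hypothetical filename
```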