Okay, so I'm going to start this presentation. Before I dive into the live demo and give an overview of the PatientLevelPrediction package, I just want to give a quick introduction. My name is Jenna Reps. I'm a researcher at Janssen R&D, and I also work part-time as a researcher at Erasmus Medical Center. My co-presenter Ross Williams is a researcher at Erasmus Medical Center. Both of us work a lot in the OHDSI collaboration, which you'll hear about in the next slide, and we co-lead the Patient-Level Prediction workgroup. So if you're interested in patient-level prediction, feel free to reach out to Ross and myself and we can give you information about that. We have monthly meetings where we discuss lots of different topics: it can be R programming, it can be methods research, anything to do with patient-level prediction. So feel free, like I said, to reach out if you're interested in being involved. Now, some of you may know OHDSI and some of you may not have heard of it. OHDSI is a collaboration of researchers who are all working together to come up with best practices for analyzing big observational healthcare databases. By big observational healthcare data I mean things like insurance claims data sets, electronic health records, or even survey data, which some researchers in the network have. All of us in the network are working together on best practices to extract useful information from these databases. You can see a worldwide map here; each blue dot is a collaborator, so we have collaborators all over the globe. It's an open collaboration, so if you're interested in OHDSI, I'll be sharing some links you can check out, and you can introduce yourself and join if you're interested.
The network spans the whole world, and the key thing with OHDSI is the data network. I'll touch on that a little, and Ross will show you how it's useful for doing external validation of prediction models. By the way, if you have any questions during this presentation, feel free to ask as I go. The way OHDSI works, and the key to it, is a lot of standardization, and the main standardization is of the data. If you have experience with big insurance claims data sets or electronic health records, you'll know that all of these data sets in their source form have a different structure. They record information in different tables with different names, each table has its own set of column names, and their vocabularies can differ: one data set will use ICD-9 or ICD-10 codes, another may use Read codes, another may use SNOMED codes for conditions. They all have their own vocabulary and structure, and this makes collaboration difficult, because you have to customize the extraction code for each format. OHDSI is about collaborating and running studies across networks of databases, and that's only feasible because we standardize the databases. We have the OMOP common data model, a structure which is shown here in this figure: each rectangle is a table in the OMOP CDM, within each table there's a set of columns, and there's a standard vocabulary. Everyone who has data in OHDSI maps their source data into the OMOP CDM format.
This means everyone has the same format, the same tables, the same column names, and the same vocabulary, so we can write code that can be applied to every database that has been mapped to the OMOP CDM. The package I'm about to show you, the PatientLevelPrediction package, works because of the OMOP CDM. We go end to end: we extract the data, as you'll see, we train models, and we're able to explore the models, and this only works because of the standardized data. If the data weren't standardized, we couldn't do this, because we couldn't have one piece of code that does the extraction for every single database. If you're interested in learning more about OHDSI, the main website is www.ohdsi.org. There's also a forum, and if you're interested in joining, the forum is a great way to introduce yourself and start chatting to people. And there are courses. I'm going to have to give quite a brief overview of PatientLevelPrediction and the OHDSI tools in general, because we only have an hour and there's a lot that goes into it. But if you want to learn more, you can go to the ohdsi.org website, or there's the EHDEN Academy, where we have free courses that teach you all about the OMOP CDM and about creating cohorts and phenotypes with the OHDSI tools, and there's even a prediction tutorial in there, with a quiz and everything. So those are some good resources for learning more about OHDSI. Before I jump in and start talking about prediction, I always like to have this slide to show how prediction differs from other types of analysis, because often when people want to do prediction they actually have a different question, and what they really want is causal inference. Prediction in general, and especially our package, does not do any causal inference.
If your question is a causal question, you can still use OHDSI and its network, but you'd use the population-level effect estimation tools rather than PatientLevelPrediction. OHDSI also has characterization packages and tools. Characterization answers "what happens to them": what are the comorbidities for a set of people, what age and gender distribution do they have? For example, you can look at people who have diabetes in the data and produce descriptions of them. Population-level effect estimation looks at causal effects. Prediction, on the other hand, asks: at this point in time, what is my risk of having some future event? If that's the sort of question you have, then prediction is what you want, and that's why I'm going to talk about how you can do that with the package OHDSI has developed. We have a framework for doing prediction, published in JAMIA in 2017, if you're interested in reading more; I'll be going through most of the parts of this framework in the next few slides. So if this presentation gets you excited and you want to read more, that paper is a great resource for the original framework that we based the package on. More recently, we've published a paper on the whole standardized process. The first publication covered going from the OMOP CDM, specifying the prediction task you're interested in, through the whole process to get a model at the end. This newer paper covers how you actually use the OHDSI network to do a network study.
It explains how you map your data to the OMOP CDM and do quality control, how you initiate a network study by creating a protocol and collaborating with people, how you use the package I'm going to show you to fit a model and do external validation, and then how you put it all together at the end in an interactive Shiny app to explore the results, which you can also share with everyone else. So that paper gives you the whole process; the previous paper focused more on the model development aspect. The PatientLevelPrediction code is all open source and online on GitHub. At the bottom of this slide there's a link to the OHDSI GitHub organization; go to the PatientLevelPrediction repo and you can see all the code. Everything I'm showing you is fully viewable on GitHub. And if you're someone who likes to program in R, we would love people to collaborate and add to this. You're going to see that the whole process is very modular, and we've made it so you can plug in custom code. If you plug in custom code and it works well for you, say you add a new classifier or a new feature engineering process, then we encourage you to make a pull request and add it to the package. We do have some standards for that: we require unit tests, and you can see here that we currently have 90% code coverage from unit tests; that's what the code coverage percentage is. So when you make a pull request we do require that you meet some standards like the unit testing, but we'll work with you on that. If you're interested, let us know, or just fork the code and start working on it.
We always hope people will expand the package, because the more collaborators we have, the better it will get. So now I'm going to dive into the framework, and after that I'll show you the R code and run it live in an R session. What we realized when we originally developed the framework is that a lot of prediction work was being done and published where it was unclear exactly what was being done. A paper may state the target population but not clearly define how that population was identified in the data, or when the prediction is usable. It may have an outcome, but again not clearly define how the outcome was identified in the data. And sometimes the time at risk isn't specified clearly: you can see that they're predicting an outcome for a target population, but you don't know when they're predicting it. So we realized that to make a prediction transparent and clear, we wanted to decompose the prediction task into three components: your target population, who you're doing the prediction for and when; your outcome, what you're predicting; and your time at risk, when you're predicting the outcome relative to the target population start. These three components are key and you're going to see them coming up a lot; this is our prediction task, and these are the three things you need when you want to do patient-level prediction. Our package, PatientLevelPrediction (PLP), follows a cohort design that uses these three components, and T zero is the index date for your target cohort.
The date a patient satisfies the criteria to enter the target population is their T zero. Then you look into the future during some time-at-risk period, which you need to specify, to see whether they had the outcome or not. This part is basically going to label people: because we're using data retrospectively, we have follow-up, so we can look at each person over a year, or whatever period we're interested in, and this is used to label people as having the outcome during the time at risk or not. PatientLevelPrediction does binary classification, to discriminate between whether you're going to have the outcome or not, and it uses features that occur prior to your index. So it uses data after the index to see whether you had the outcome and to label you, but data prior to the index to create the features. It can use things like conditions recorded before index, drugs given to you before index, procedures, measurements; all of this can be used to create features. You'll see that we have a library of features already available, but you can customize and create your own. You can also use demographics such as age and sex at index. So this is the design, this is how the three components fit in, and this is how we create the labeled data we're going to learn the model from. Now I'll give you an example. Your target population could be, say, new users of lisinopril. What you do is find all the people in your database who have lisinopril, and look at the first time they have lisinopril: that's your T equals zero.
Then you can use anything recorded for the patient prior to that first lisinopril to create features describing that patient, and you can look at their age and sex at the time of that first lisinopril. Then you say, okay, now I'm going to look at what happened, in this example, one day to 365 days after index: our time at risk is one day to 365 days. It's very important to specify the start. Often people say a one-year follow-up or a one-year time at risk, but they don't specify whether time zero is included or not. So this is something we stress: you need to specify the start and the end of your time at risk so it's clear. Then you look from one day after index up to 365 days after T equals zero: did you have angioedema, yes or no? That's our outcome, and we use this to create the labeled data: features from anything prior to index, and a label for whether you had the outcome after index. I'm going to use this example for the rest of the presentation, so you'll see us actually fit models for this task. PatientLevelPrediction has a standardized process: you specify your task, that is your target cohort, your outcome, and your time at risk, and you need data mapped to the OMOP CDM; it can be any observational dataset mapped to the CDM. The whole process is, firstly, to extract the data to give us labeled data to learn from; then we split the data into train and test sets; then we do some preprocessing of the data; then we fit the model; and finally we apply the model to the test data to see how well it does internally and evaluate it. That's the general framework we have for developing a model.
Ross is then going to show you that we can take this model and apply it to new data in the OMOP CDM pretty readily, and do external validation at scale, on datasets all over the world, because the OHDSI network is pretty expansive and spans the globe. So now I'm going to show you this step by step. I'm going to look at new users of lisinopril, with an outcome of angioedema, and a time at risk of one day after the start of lisinopril up to 365 days after the start of lisinopril, to predict the angioedema. We're going to use the cohort design I showed you earlier, and this is what the PLP package does. You do need some setup. If you're going to use PatientLevelPrediction, you need Java installed. The reason is that within R we communicate with your OMOP common data model database to extract the data, and we use a JDBC connection, so Java is needed to connect to that database. You obviously also need R, because it's an R package; we recommend RStudio, and you'll see that I use RStudio when I run this code later. Then some parts of the package require Python, so depending on what you want to do, you may need Python installed. The reason we have Python is that some of the classifiers, like the random forest, AdaBoost, and the neural network, were faster in Python when I was testing classifiers many years ago, so we use a Python backend for some of them just because they were more efficient, and we use reticulate to communicate with Python. If you install the basic PatientLevelPrediction, you'll have lasso logistic regression.
That is, logistic regression with lasso regularization; you'll also have the gradient boosting machine and KNN. These three have R backends. You can use the remotes package and run remotes::install_github("OHDSI/PatientLevelPrediction"), which installs the package into your library; then library(PatientLevelPrediction) loads the package. If you want the optional Python extras, you can use reticulate to install Miniconda if you don't have any Python installed, and then PatientLevelPrediction has a function called configurePython that will create an environment for you. In this example it installs into r-reticulate, which is the default environment for reticulate, but it will install all of the Python dependencies into your conda environment, or into your Python environment if you're using plain Python. So we have functions that make it easy to configure Python if you want to use it, but it's optional. And if you want to contribute, say you want to add new classifiers that use Python code, you can follow a similar process and just add the Python dependencies you need into this function in your pull request. The first thing you need to do is specify your target population and the index, that is, when people in the database match the criteria and enter the target cohort. I'm going to show you an example using ATLAS. ATLAS is a website; you can see the link here for the public ATLAS. You can install it locally on your own computer, but that takes some technical work, so if you're just starting to play around with patient-level prediction it may be better to use the public ATLAS; you just go to this link. ATLAS is quite extensive.
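As a sketch, the install steps just described look like this in R; the repository path and the configurePython helper are the ones mentioned above, but check the package website for the current instructions, since argument names can change between versions.

```r
# Basic install from GitHub (lasso logistic regression, GBM, and KNN
# work with just this R setup).
install.packages("remotes")
remotes::install_github("OHDSI/PatientLevelPrediction")
library(PatientLevelPrediction)

# Optional Python extras (random forest, AdaBoost, neural network, SVM).
# If you have no Python at all, reticulate can install Miniconda first:
# reticulate::install_miniconda()
# Then let the package set up an environment with its Python dependencies
# (here the default r-reticulate conda environment):
# PatientLevelPrediction::configurePython(envname = "r-reticulate",
#                                         envtype = "conda")
```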
There's a lot in there. You can create phenotypes and definitions, and ATLAS will create the SQL to execute them in your OMOP CDM database for you. So it's a way of identifying the people in your target population, or the people who have the outcome, without writing SQL yourself; ATLAS writes it for you, which is why I'm showing this example using ATLAS, although you can write your own SQL if you know how. There's a lot to ATLAS and I can't really go through it all, since that would take more than an hour. If you want to learn more about ATLAS, I recommend the EHDEN Academy I mentioned earlier, where you'll find courses on how to use it. But effectively, you can use ATLAS to create cohorts. There's a cohort definition option, and in there you can specify the logic for how you're going to identify the target population in your data. Here I'm saying I want a drug exposure of lisinopril, and it has to be the first one: I look at every time lisinopril is recorded and then restrict to the first date for each person. I'm also going to require 365 days of prior observation, meaning the person has to be in the database for at least 365 days before index, because I'm going to use that time prior to index to create the features. And I'm going to restrict to people who have hypertension, because I want people who have lisinopril for hypertension. I can create all this in ATLAS, it generates the SQL, and I'll use it in R in a bit, when you see me actually running the code. We do the same for the outcome: you need to specify it. Here, for example, I'm looking for a condition occurrence of angioedema, and I'm going to look at every single event.
So I extract every time someone has that recorded in the data, except if they had it in the prior 180 days, because that would potentially be the same angioedema. We want that washout period to make sure we're looking at new angioedema dates, and not just a second record due to some continuation of care; we don't want continuation-of-care dates in there. So here we just have some logic for how we identify angioedema. One thing you need to know about ATLAS is that each cohort has an ID, the cohort ID; here it's 1782710. This is a unique identifier that you can use to extract this cohort in R, and I'll use it later. You'll see 1782708 for the target population of lisinopril users, and this next one for the outcome; I'll be using these two cohorts later on when I run this. The next thing for the task is the time at risk. We have a function in PatientLevelPrediction called createStudyPopulationSettings, and there are four main inputs that define the time at risk. You have your start anchor: "cohort start" means you start from the start of lisinopril, the first time you took lisinopril, and then you add the number of days specified as your risk window start. So one day after my lisinopril start is the time-at-risk start date. The end date uses two similar inputs, except for the end rather than the start: cohort start plus 365 days. Together these four inputs specify that my time at risk starts one day after and ends 365 days after the patient started lisinopril for the first time. So that's the time at risk. The study population also has additional inclusion criteria you can specify.
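As a sketch, the four time-at-risk inputs just described would look something like this; the argument names follow the PatientLevelPrediction API as I know it, but may differ slightly between package versions.

```r
# Time at risk: 1 to 365 days after the cohort (index) start date.
populationSettings <- PatientLevelPrediction::createStudyPopulationSettings(
  startAnchor = "cohort start",  # anchor the TAR start to the index date
  riskWindowStart = 1,           # TAR starts 1 day after index
  endAnchor = "cohort start",    # anchor the TAR end to the index date too
  riskWindowEnd = 365            # TAR ends 365 days after index
)
```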
For example, you can say you want to remove people if they had the outcome prior to index, or restrict to the first occurrence. In the lisinopril example we're looking at the first occurrence of lisinopril, but you could have cohorts where people enter multiple times, and here you can override that and say you only want the first time they're in the cohort. There are lots of other settings here that I won't run through in detail, but if you look at the PLP help files you can see all the inputs and all the options for inclusion and exclusion criteria for your cohort. We also have settings like the minimum time at risk, which determines what happens with people lost to follow-up during the time at risk. Because it's observational data, it's possible that someone drops out of the database during the time at risk: someone could start lisinopril and then drop out of the database 60 days later when we're doing a one-year follow-up. So what do you do with the person who drops out? We've done quite a bit of research to look at the impact of this and to guide these choices, so if you go to the PatientLevelPrediction website on GitHub, you'll see some of the resources we have for guiding these choices. So that's the task: the target cohort, the outcome, and the time at risk specified. The next thing you need to do is specify how you actually want to fit the model. With your target cohort, outcome, and time at risk specified, and your data mapped to the CDM, you need to extract that data into labeled data to learn from, and that requires specifying what features you want to use.
There's an OHDSI package called FeatureExtraction, and it has a function, createCovariateSettings, that you can use to select default covariates. These come from our covariate library, a library you can select from, and I've picked sex, age groups in five-year bins, and conditions and drugs that occur within 365 days prior to index. So these are just standard features I've selected, and if you look at the createCovariateSettings options you'll see there's a large number of pre-created features you can pick from; the SQL to create these features from the OMOP CDM data has already been written. But if you want a custom feature, you can write that too. PLP has been built to be very flexible: you can pick from the library if you don't want to do extra coding, but if you want, for example, a specific measurement that you're converting to a certain unit and manipulating in some way, you can write the SQL to do that and plug it in. There are instructions on the FeatureExtraction website for writing custom covariates. So that's what you need to define for the data extraction, and then PatientLevelPrediction will use the cohort design, with your target and outcome cohorts and your time at risk, to create the labels for whether someone had the outcome during the time at risk, and it will use the settings you pick here to create the features using records prior to your index date. It will go to the database and extract this labeled data into R for you. The next thing is defining how you want to split the data; we have lots of different ways of splitting in the package.
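The covariate selection described above (sex, age in five-year bins, and conditions and drugs in the 365 days before index) might be specified roughly like this; the flag names come from the FeatureExtraction documentation as I recall it, so treat this as a sketch.

```r
covariateSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAgeGroup = TRUE,        # age in 5-year bins
  useConditionOccurrenceLongTerm = TRUE, # conditions in the long-term window
  useDrugExposureLongTerm = TRUE,        # drugs in the long-term window
  longTermStartDays = -365,              # window starts 365 days before index
  endDays = 0                            # window ends at the index date
)
```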
You can specify how much of the labeled data you want in a test set; by default it puts 25% into the test set and 75% into the train set. Then we do cross-validation, and this input is the number of folds you want. This is used for picking hyperparameters: a lot of the models have many hyperparameters, we use a grid search, and this tells it how many folds to use. You also have an option for how to do the splitting. Stratified means the split is stratified on the outcome, so the train and test sets have the same outcome rate. You can also do a subject split: if the same person is in the labeled data multiple times, a subject split makes sure that person's records, their five occurrences, for example, are either all in the train data or all in the test data, so the train and test sets are disjoint on subjects. And a time split puts the older data in the train set and the newer data in the test set; this mimics fitting a model now and applying it in the future, and sees whether the model holds up over time. Those are the options for splitting, and you can write your own custom splitter as well; for example, if you wanted to split by location, you could write the code to do that with the OMOP CDM. Then we have feature engineering options. After you split the data, you can do feature engineering to process it. We don't really have an extensive library of feature engineering right now, but if you have ideas for feature engineering, you could plug them in and add them to our library.
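A minimal sketch of those split settings; createDefaultSplitSetting and its arguments follow the package API as I know it, but may vary by version.

```r
splitSettings <- PatientLevelPrediction::createDefaultSplitSetting(
  testFraction = 0.25,  # 25% test / 75% train, the default mentioned above
  nfold = 3,            # folds for the hyperparameter grid search
  type = "stratified",  # alternatives: "subject" or "time"
  splitSeed = 42        # for reproducibility
)
```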
As I mentioned, it's open source, but right now the default is to do nothing: it just takes the data and spits out the same data. There are some people using PLP who were able to plug in their own feature engineering, and they will hopefully contribute it back. For example, there's a researcher doing sequential pattern mining feature engineering, and depending on whether it has value, that could go back into the package, so in six months' time it may be available. But as I mentioned, everything's customizable, so you can add custom feature engineering if you wish. The same goes for sampling: we have under- and over-sampling available for class imbalance. But research, both from OHDSI and from researchers external to OHDSI, has shown that class imbalance methods like under- and over-sampling are generally not a good idea for observational data. In other settings, people balance the classes because the outcome is merely hard to observe while in reality the situation is balanced, so they over-sample or under-sample to get to 50-50. In healthcare, however, having the outcome is often much rarer than not having it, so the imbalance is real. If you force a balance when you fit a model but reality is imbalanced, you end up with a calibration issue. So the research shows that addressing class imbalance with under- and over-sampling isn't a great thing at the moment, but if you have a method that you think could work, you could plug it in and do some research on the OHDSI network to investigate it. The default setting right now is that we don't do any class imbalance correction.
We keep the imbalance as it is, because with the cohort design we've used and the way we've built the dataset, the imbalance we see is a true imbalance. Then we have preprocessing: you can remove features that are extremely rare, you can do normalization, which is required for things like logistic regression (it's less important for tree-based models), and you can remove redundant features. That's the final option, just some preprocessing of the data. Then the fun part is picking the classifier. We have a library of classifiers available. The main one that comes with the package is the lasso logistic regression, but we also have the gradient boosting machine; both are available with just the R setup. If you have the Python setup, you can add things like the random forest, AdaBoost, the neural network, and the support vector machine. So the standard classifiers are built in, and they all start with a set function: "set" followed by the name of the classifier. Some have inputs for the grid search; for the gradient boosting machine setting here, I specified that the grid search for max depth should investigate 2, 4, 10, and 17. If you don't specify, it just uses the default grid search built into the package. Often this is based on whatever underlying package is used; for the random forest, which uses scikit-learn's random forest in Python, we copied scikit-learn's default hyperparameters, but you can override them by specifying the hyperparameter values you want in the grid search. The lasso logistic regression is the one classifier that doesn't use a grid search: it automatically searches for its hyperparameter, and since it only has one hyperparameter it's an easier problem. The others have lots of hyperparameters.
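The "set" functions mentioned above can be sketched like this, including the max-depth grid from the example; the function and argument names are from the package API as I know it, so check the reference documentation for your version.

```r
# Built-in R classifiers:
lrSettings  <- PatientLevelPrediction::setLassoLogisticRegression()
gbmSettings <- PatientLevelPrediction::setGradientBoostingMachine(
  maxDepth = c(2, 4, 10, 17)  # grid-search values from the slide
)

# Python-backed classifiers (require the optional Python setup):
# rfSettings  <- PatientLevelPrediction::setRandomForest()
# adaSettings <- PatientLevelPrediction::setAdaBoost()
```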
So like I said, we use cross-validation on the training data to do the grid search for the hyperparameters, so we end up with the best settings for the model for our prediction task. You can also write your own classifier: if you need to test out a new classifier you've been developing, you can plug that into the framework, but most of the time the library we have will probably be adequate. Then you create a model design. The model design has to have your target ID, which is the Atlas ID for the target cohort; your outcome ID, which is the Atlas ID for the outcome cohort (I mentioned these IDs previously, and this is where they plug in); your population settings, which are the time at risk and some inclusion criteria; your covariate settings, i.e. what features you want to use; your preprocessing settings, e.g. whether you want normalization; how you want to split into train, test, and validation; and what model you want to use. You create a design for every model you want to fit; you can specify hundreds of designs, and PatientLevelPrediction will fit every single model you specify for every design. Here I'm just showing you one design, but I'm now going to jump over to an R session. Let me find the correct R, because I always have lots of Rs open. So here I've got the code I just showed, plus a little bit extra I'm going to run through. There are a few libraries that I added; these are all OHDSI libraries, on the same OHDSI repository that PatientLevelPrediction is on. CohortGenerator will take the SQL from Atlas and generate it into a table for you. ROhdsiWebApi is a way of connecting to Atlas; it's going to be used to extract the Atlas cohort definition as SQL into my R session.
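Putting the pieces together, a model design along the lines described might look like this. The Atlas cohort IDs (1772, 1773) are hypothetical placeholders, and the argument names are as I believe they appear in recent PatientLevelPrediction and FeatureExtraction versions:

```r
library(PatientLevelPrediction)

# Time at risk: day 1 to day 365 after cohort start.
populationSettings <- createStudyPopulationSettings(
  riskWindowStart = 1,
  riskWindowEnd = 365,
  requireTimeAtRisk = TRUE,
  removeSubjectsWithPriorOutcome = TRUE
)

# Candidate features: sex, age groups, condition and drug history.
covariateSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAgeGroup = TRUE,
  useConditionGroupEraLongTerm = TRUE,
  useDrugGroupEraLongTerm = TRUE
)

# One design = one model to fit; make one of these per model.
modelDesign <- createModelDesign(
  targetId = 1772,   # Atlas ID of the target cohort (placeholder)
  outcomeId = 1773,  # Atlas ID of the outcome cohort (placeholder)
  populationSettings = populationSettings,
  covariateSettings = covariateSettings,
  preprocessSettings = createPreprocessSettings(),
  splitSettings = createDefaultSplitSetting(testFraction = 0.25, nfold = 3),
  modelSettings = setLassoLogisticRegression()
)
```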
CohortGenerator is then going to use that SQL to generate a table that I'll be using for PLP, PatientLevelPrediction. And DatabaseConnector is the Java connection to the database. These are all needed to run PatientLevelPrediction; DatabaseConnector is actually a dependency, and the other two are just used because I'm using Atlas to create the cohorts. If I run these, I can fetch the cohort definitions: it has now gone to the public Atlas and extracted the two cohorts I specified using their IDs. I could have put in any ID for any cohort in the public Atlas and it would download it into my R session. If I look at the cohort definitions now and pick the first one, you'll see it has the Atlas ID, the cohort ID, the cohort name, which is "new users of lisinopril with prior hypertension", and it also has the SQL. This is all the SQL I need to execute on an OMOP common data model database to find the people who had lisinopril and when they had it for the first time. So this has downloaded my cohort definitions; next I need to specify information about the schema where the OMOP CDM tables of my database are. I'm using keyring, so unfortunately I can't share this; my database details are private, but I've saved them in the keyring under the name "Medicaid", so I'm going to be using the Medicaid database to run this. I'm setting my CDM database schema, which is just the schema that my OMOP CDM tables are in. I also need to specify a cohort table and a cohort database schema. This has to be a schema in a database that I have read and write access to, because I'm going to be creating a table with this name in that schema.
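The Atlas extraction step can be sketched as below. The WebAPI URL and cohort IDs are placeholders, and I'm assuming the `ROhdsiWebApi::exportCohortDefinitionSet` interface, which may differ between package versions:

```r
library(ROhdsiWebApi)

# Pull the target and outcome cohort definitions (including their SQL)
# from a public Atlas instance; the IDs are hypothetical placeholders.
baseUrl <- "https://atlas-demo.ohdsi.org/WebAPI"
cohortDefinitions <- exportCohortDefinitionSet(
  baseUrl = baseUrl,
  cohortIds = c(1772, 1773)
)

# Private details stay out of the script: fetch the CDM schema name
# from the system keyring, under whatever name you saved it as.
cdmDatabaseSchema <- keyring::key_get("medicaid")
```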
So if you're going to run this, you need access to a database where you have write access, because you're going to be creating this table. The table is going to contain my lisinopril and angioedema patients: the IDs of those patients, and the dates when they had lisinopril for the first time or had the angioedema events. I then have to create connection details, using the DatabaseConnector package, for my OMOP CDM database. Here I'm connecting to the Medicaid server. Again, I can't share this because it's private, but you would get this from your administrator: if you have OMOP CDM data, you'd speak to the database administrator and figure out what your server, port, and username are. There's help in OHDSI; if you ever get stuck on this, check the OHDSI forums and the OHDSI pages and you'll find guidance. Then I create a set of cohort table names; I specified "PLP demo table" as my cohort table, and this function creates a bunch of table names based on it. You can see it says the cohort table is the PLP demo table, but there's also, for example, an inclusion table. This is because when you generate the cohort with CohortGenerator, it doesn't just generate the patients who are in the cohort; it does additional things as well, like recording inclusion criteria and statistics around the cohorts. The cohort table is the main table we need, but those extra tables hold interesting statistics you may want to look at.
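In code, the connection details and cohort table names might be set up like this (placeholder values throughout; your DBMS, server, and credentials come from your database administrator):

```r
library(DatabaseConnector)

# Connection details for the server hosting the OMOP CDM; credentials
# are read from the keyring rather than hard-coded in the script.
connectionDetails <- createConnectionDetails(
  dbms = "postgresql",                             # placeholder DBMS
  server = keyring::key_get("medicaidServer"),
  user = keyring::key_get("medicaidUser"),
  password = keyring::key_get("medicaidPassword"),
  port = 5432
)

# One main cohort table plus the companion tables (inclusion rules,
# statistics, ...) that CohortGenerator maintains alongside it.
cohortTableNames <- CohortGenerator::getCohortTableNames(
  cohortTable = "plp_demo_table"
)
```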
Then if I run this (I'm not going to run it again now, because I already generated the cohorts and they can take a little time to create), it would create the cohort table as a blank table: a new table called "PLP demo table" in the structure of a cohort table, plus all those other tables, in the schema I specified. Then the generate step takes the cohort definitions I downloaded, the lisinopril and angioedema cohorts, executes the SQL, and inserts the results into the cohort table. So all of this earlier code is really just creating a table that says: here are my users of lisinopril and the date they started it, and here are all the people who had angioedema and the dates they had it, because that's what the PLP package is going to use. Now I'm going to install, well, load PatientLevelPrediction into my R session. I'm going to create database details; this is required for PatientLevelPrediction. It needs the connection details to the OMOP CDM database, the schema for the database, an ID or reference for that database, and it needs to know where the cohort table with the target population and the outcome is. Here that's the cohort table we previously specified, the PLP demo table, and the schema, which I can't share because it's private; it would have to be a schema that you have read and write access to, where you've created this table and inserted the cohorts. So you create the database details here. This beginning part is really just configuration, saying: here's where the data are, and here's where my cohorts of patients with the target and outcome are. Then I'm going to run this and create the settings.
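The generation and configuration steps just described, as a sketch (schema names are placeholders, and the `createDatabaseDetails` arguments may vary between PatientLevelPrediction versions):

```r
# Create empty cohort tables in a schema you can write to, then execute
# each downloaded definition's SQL and insert the resulting cohorts.
CohortGenerator::createCohortTables(
  connectionDetails = connectionDetails,
  cohortDatabaseSchema = "scratch_schema",         # placeholder
  cohortTableNames = cohortTableNames
)
CohortGenerator::generateCohortSet(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = "scratch_schema",
  cohortTableNames = cohortTableNames,
  cohortDefinitionSet = cohortDefinitions
)

# Tell PatientLevelPrediction where the CDM and the cohorts live.
databaseDetails <- PatientLevelPrediction::createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cdmDatabaseName = "Medicaid",
  cohortDatabaseSchema = "scratch_schema",
  cohortTable = "plp_demo_table",
  outcomeDatabaseSchema = "scratch_schema",
  outcomeTable = "plp_demo_table"
)
```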
So I previously went through and showed you the population settings, the covariate settings, the split settings, and the feature engineering and sample settings; these are the same settings I went through. The model for the logistic regression was the set lasso logistic regression function, and I'll run the gradient boosting machine model as well. Then this is the create model design step, where, as I showed before, you put in your target, your outcome, and all your settings. I added one extra thing: I've sampled 100,000 patients, just to make things run a little quicker. 100,000 is still a pretty big number, but I know I can run this locally on my basic 16 GB RAM MacBook laptop: with a sample of 100,000 lisinopril patients, I was able to run a logistic regression and a gradient boosting machine within, I think, an hour and a half last night. So if I run this, it's only going to take a sample of 100,000 lisinopril users. Then I do the same thing, create a model design, but this time, as you'll see, using a gradient boosting machine instead of a logistic regression; that's the only difference in that design. And then the last thing is to run PLP, or rather run multiple PLP. You put in your database details (this is where the cohorts are, where your OMOP CDM is, and the connection information to access that), then your list of model designs. I put two in, a logistic regression and a gradient boosting machine; all the settings were the same except for the classifier.
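The full run can then be sketched as follows. The cohort IDs are placeholders, and for brevity I'm leaving the population, covariate, and split settings at their defaults rather than repeating the earlier objects:

```r
library(PatientLevelPrediction)

# Restrict extraction to a random sample of 100,000 target patients so
# the demo fits comfortably on a laptop.
restrictSettings <- createRestrictPlpDataSettings(sampleSize = 100000)

# Two designs, identical except for the classifier.
lrDesign <- createModelDesign(
  targetId = 1772, outcomeId = 1773,               # placeholders
  restrictPlpDataSettings = restrictSettings,
  modelSettings = setLassoLogisticRegression()
)
gbmDesign <- createModelDesign(
  targetId = 1772, outcomeId = 1773,
  restrictPlpDataSettings = restrictSettings,
  modelSettings = setGradientBoostingMachine(maxDepth = c(2, 4, 10, 17))
)

# Extract the data once and fit every design, saving all results.
runMultiplePlp(
  databaseDetails = databaseDetails,
  modelDesignList = list(lrDesign, gbmDesign),
  cohortDefinitions = cohortDefinitions,
  saveDirectory = "./plpDemoLive"
)
```

Afterwards the results folder can be opened in the Shiny viewer demonstrated later in the talk, with something like `viewMultiplePlp("./plpDemoLive")`.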
Then I'm going to put in the cohort definitions that I extracted from Atlas, and I'm calling this "PLP demo live". If I run this, you'll see it activate and start sampling. So it has gone to my OMOP CDM database and sampled 100,000 patients who have lisinopril with hypertension in the prior year, and it's now creating the features I specified, the drugs, the conditions, the age in five-year buckets, and the sex, on the server. Then it will download this into my R session, so I'll have the data for the lisinopril users, I'll know who has the angioedema, and I'll have their features, and then it will start doing the model development as I specified. You can see it's already created those features, and because I looked at basically all the drugs and conditions, it's probably going to have somewhere between 10,000 and 20,000 features for each patient, so quite a lot of features. While that runs, let me show you the PLP demo I ran last night. This was analysis one: it downloaded the data for the task, then it ran the logistic regression and the gradient boosting machine. If I look at the log, you can see how long it took. Here are the settings: analysis one, the target ID we specified, a sample of 100,000 patients, almost 20,000 covariates created and downloaded; then it did some preprocessing and fit a logistic regression on data of that size in, I think, less than a minute. It starts running Cyclops at 11:33 and is finished by 11:34, so it fit a model for 100,000 patients with 20,000 features in under a minute for the logistic regression. The gradient boosting took a bit longer because it has to do a more extensive grid search. As this is chugging along, it will soon get to the point where it downloads the data; if I go here, we'll see the settings, and it will end up in the same structure as what we see here, with this file, once everything is downloaded. But while this is running, I'm going to pass you over to Ross, who will show you the results from the earlier run I did yesterday. Oh, and here's the data; it's already downloaded the data.

Yes, so I will take you through. Thanks, Jenna, that was really informative for everyone, I think, and there's been lots of discussion in the chat. Generally, after running the models, Jenna shared the results and the data that the run created. This is quite hot off the presses: she sent me an SQLite database that I can load into a Shiny viewer, which comes with the PatientLevelPrediction package. So I'll just source this, and I'm going to open it in the browser, because then I can zoom in (we were having some issues with the size of the text earlier), but you can either run it directly from your R session or run it in a browser. When you open it up, you get some information about what the viewer is and how to use it, but let's dive into the predictions. I'm going to zoom out for the first section. What we see here is the design ID: this is our analysis one and analysis two. Analysis one is a logistic regression model on new users of lisinopril, predicting angioedema events. Jenna talked about this idea of time at risk; here we have a one-year time at risk, and you can clearly see it's cohort start date plus one day. So the patient starts their drug, the next day is their first day at risk, and the window runs until 365 days after that. Then we see some performance statistics, the mean and max AUROC; I think because of the size of the data we didn't calculate the confidence intervals. What we also have in the Shiny app is a way to explore the results further. In PLP we've recently set up some diagnostics, which basically ask whether you should look at this analysis at all, for example whether an appropriate data source was used. Here we've passed that diagnostic, and there are a couple more, such as whether the outcome definition identifies events in a similar way for all participants, and we're hitting those too. Just because you pass these checks doesn't mean you've got a great study, but it means you're at least meeting some minimum standards that we'd like to enforce.

What's really nice is exploring the results. Here you can see a bit more information: the development database and the validation database. When these two match up, what you're looking at is just the development, and you can look at either the cross-validation (train) or the test performance. What we also have in this app is that once someone shares a model (here Jenna has shared the results, and the model is actually contained within the results), I could in theory take the model she sent me and run it on my own data. Jenna works mostly with US data, and she ran this on Medicaid data. I work at Erasmus in the Netherlands, where we have the IPCI database, a general practice database. Because both are mapped to the OMOP CDM, and Jenna has shared her cohorts and her model with me, I can run the model directly against my database, and in theory we're extracting the same patients. Obviously the populations are different, but other than that everything should be the same, so we get a really direct comparison. This comes back to the idea of external validation, which I'll talk a little more about in a couple of minutes. Here again you see this information, your target population, your outcome, your time at risk, and you see your AUROC and your AUPRC. Here is the size: Jenna said she sampled 100,000 patients, and you can see we've actually lost 273, or 173, of them somewhere. You might be interested in why that is; it can be things like patients not having a time at risk, or having already had the outcome before entering the cohort. You see we had 759 patients with the outcome, and some other things like the incidence rates. The timestamps are typed as "development", which makes sense because the development and validation databases match up here.

What we can do is explore the result, and what's really nice is that you immediately get the model. This is a plot where one axis shows a covariate's prevalence in persons with the outcome and the other its prevalence in persons without the outcome. What you'd like to see is lots of dots quite far off this diagonal, because then it would probably be quite easy to separate these patients; but if you're asking the question, you probably don't already have the answer, so you likely won't see that here. If the dots were all far off the diagonal and it was really easy to separate the classes, maybe we wouldn't need a prediction model at all. What's nice is that you can go through and explore: we're seeing that the beta blocking agents and the thiazides are getting picked up as predictive, and these come from the 10,000 to 20,000 candidate covariates; these are the ones selected by the model. The logistic regression implementation we have is a lasso, an L1 penalization, so we basically get automatic feature selection within it. We also have binary measurements here. Jenna was using a claims database, and in claims data you tend not to get that many measurement values: insurance companies are interested in the fact that a measurement was taken, but not necessarily in what the value was. In the general practice setting that I work with, by contrast, we have a lot more of these measurement values, because they tend to be recorded by your general practitioner. That's actually a really nice experiment, something we like to play around with: what's the impact of these different modalities of data? Are measurement values really helpful, or is it enough to know that a measurement happened? Do you lose performance when you move between these different databases? These are all questions that are really nice to ask, and that you can answer quite well with the interoperability the common data model provides. You also have the probability threshold plot; let me see if I can show some of the plots in more detail. Here you can move your threshold, so you can say, I want to...
I want to decrease it, or I want to increase it, and you can see the impact that has on your various statistics: the PPV, the sensitivity, the negative predictive value. We can dive into a bit more detail on discrimination. I'm just going to look at the test set, since that's generally what you're really interested in. You can see the ROC curve (it's okay, not great, but I think we knew that from the 0.67), and then the precision-recall curve, similarly the F1 score, and a couple of views of the predicted probabilities in the two classes: the distribution of your probabilities, and the overlap of the prediction score and the preference score distributions. What we also like to look at is the calibration; again I'm going to take the test set (these plots can take a little more time to load). What we've got here is a smoothed loess curve, and I think the calibration here is probably quite reasonable. You can see that because of the class imbalance we're getting far more non-outcomes than outcomes. We like these flexible curves a little more than a rigid line, because they let you see that down in this region we seem really well calibrated, whereas over here it gets a little worse. Similarly we have this stratified by age and sex. We also have various calibration metrics, and this is an area of research we're really interested in at the moment: it's nice to look at these plots, but when you're running 20, 30, 40 of these studies it's difficult to look at every plot, so a single metric that summarizes calibration over the entire region would be really helpful. Something like the calibration-in-the-large intercept isn't so helpful for that, as it tells you about a baseline level, and the calibration slope and the mean tell you roughly what's happening around the middle. These are all somewhat helpful, but they don't give you the full calibration picture, which is why I really like these smooth, more flexible curves. Similarly you can see the net benefit here, and I think this one's looking all right. Then there's validation: selecting this gives you the ROC and the calibration plot, and once you start to add validations of these models (if you take this model and run it on another database, you can put the result back into the Shiny application), you get the information here for a direct comparison: it will plot the ROC curves over each other and overlay the loess calibration curves. So that's what we give you to explore your results, and as I showed, to run this you basically just need two lines of R code (well, you could do it in one): load the library, then call the viewer with a pointer to wherever you saved the results of your study.

What's also nice is that we provide functionality to make your results shareable. We provide a function that removes any low counts: often your IRB approval won't allow you to share anything with counts below five or ten, depending on what it is, and we let you specify that and make sure you don't share what's known as patient-level data. For instance, we remove the predicted risk for each individual patient, because that's considered patient-level data, and instead we do an aggregation over the different percentiles of risk, so that you can still recreate things like the calibration plots after the fact.

The last thing I want to show you, quickly, is the DELPHI library. This is a new initiative we're starting to push towards. Let me log in; this is delphi.ohdsi.org, and this is where we're hoping to push all of the patient-level prediction models developed in the OHDSI and OMOP CDM framework. You can see that at the moment we've got 54 models in here, covering 2.5 million patients in total, from 19 different researchers and eight different databases. If we go into the library to explore, let me select one with a nice performance. Let's take this one, a gastrointestinal outcome in a target of MDD treatments. This was from a paper Jenna published a couple of years ago, which I think she referenced in the talk, predicting various outcomes in patients who start a first treatment for major depressive disorder. This is all online; it's going to be a public repository where everyone can push their models and their validations. You'll see the same ideas come back from the Shiny app: a few more patients, the covariates that were selected (again, because this was developed using claims data, we're not seeing any measurement values even though there's a place for them), the development settings, the model settings, and some of the attrition, the patients you're losing; here a couple of thousand patients had a prior outcome, so we got rid of those patients. Then again we have the threshold, and you can explore the discrimination, the calibration, the net benefit, and the validation. What's really nice is that we have an upload section: once you've got the results of your study and you say, I'm ready to share this, you can take those results and upload them to this library. They get pushed to a database where we store all of the performance and metadata around the model, and then we have another backend area (we're using GitHub for it) where we store the model itself. So you can load your model up, publish it for other researchers, and say: hey, I've got this new model, I think it works really well, I'd love people to come and validate it. What we'll be adding over the next few weeks is the ability to download models: that will give you a script with the cohorts and the model, and all you'll have to do is add your connection details, and then you'll be able to run it and do a validation of that model. Or you'll be able to say: this is a really interesting model, but the performance I see here isn't good enough for my situation, so I want to train a new model; you can still select that problem setting, and you'll get the cohorts and the setup the researcher who developed the original model used, so you can try to beat them.

And this is our idea: we would love to set up, and are in the process of setting up, a set of patient-level prediction benchmark clinical problems. This will be a set of clinical questions; at the moment we're looking at a couple of COVID predictions, one in major surgery, and one in patients with diabetes. We'd love to encourage an ecosystem of researchers: you'll say, hey, I've got a new method, I think it's going to be world-beating, much better than lasso, and we'll say, okay, here's a set of clinical problems; if you have OMOP CDM data and you have implemented your algorithm in the PLP pipeline, you can get all the artifacts you need, run it against our set of benchmark problems, and see how you compare to all the other models that have been developed. What we then hope is that people will take your model, start externally validating it, and start using it. We're going to make sure everything is open source, so you should be able to access all the models, all the code, and the cohorts; the only thing you won't be able to get is the patient-level data, because we can't share that for privacy reasons. What we'd love to see is the DELPHI library being used more, with people interacting with it and pushing information to it. The idea is that instead of developing a model, publishing it, and letting it become a static artifact, the model stays dynamic: you keep encouraging people to validate it, we make those validations easier to do, and we make sure the validations are all linked to the model, so you get the credit that's deserved for doing this work.

That was everything I had to show you. One thing I'll mention at the end: Jenna and I will be opening up a couple of PhD positions here at Erasmus in the next few weeks or months, so if you're interested, you can email either of us or keep an eye on the Erasmus websites and various channels; we'll also announce it through the OHDSI community. Jenna, I don't know if you had anything to add before we close out the session?

Thanks for the great demo, Ross. I added my email address in case you want to email me with any questions, because we don't really have time for questions; I guess we have time for maybe one. If anyone has any other questions, feel free to email or post on the OHDSI forums, and I'll keep an eye out on that. If anyone has a question now, we have one minute until the break for lunch. I see Ross added his email as well, so feel free to reach out to Erasmus with any questions, or if you need any resources; we went through quite a lot today, and we're happy to share more resources explaining the tools we showed in more detail. Thanks for listening, everyone. Thanks a lot. Bye bye.