Hi, my name is Sean Meyer. I'm a recent doctoral graduate of the Design Science program at the University of Michigan College of Engineering. My research focuses on clinical predictive models intended for use in early warning systems. Today I'll be presenting gpmodels, a grammar of prediction models. Its main purpose is to summarize unevenly spaced, repeated observations from a database into the discrete time periods used for predictive modeling. This presentation, the R package being presented, and the concepts it is based on are co-authored with Karandeep Singh, assistant professor of learning health sciences at the University of Michigan Medical School. Within the scope of this project, predictive modeling means using available data to estimate the risk of a future event occurring. This diagram depicts a simple view of predictive modeling: data is acquired from a database or other source, prepared in some way depending on how the underlying data is structured and on the modeling approach, and then an algorithm is applied. That algorithm can be simple linear or logistic regression, or a more complex but flexible one like random forests, gradient boosting machines, or even neural networks. The output is a model that can be used to calculate predictions on new observations. However, inpatient care is not delivered at a single point in time: a patient's state of health can change over time, and vital signs can change over time. So it may make sense that a time series approach is more appropriate, to capture potential changes in risk as new data is collected. The focus of this package is to incorporate domain expertise and reduce the manual work of preparing and transforming time series data sourced from an electronic health record. In this presentation, I'm going to show some key challenges of transforming time series data for predictive modeling, and how we built the package to alleviate some of those issues.
So why do you need to prepare your data? Preparing data is time consuming, but it's often needed in predictive modeling, and better prepared data can lead to better results. There is a wide range of preparation methods, and the right choice depends on clinical relevance, knowledge of the underlying data, and the modeling approach you're taking. Another challenge is that preparation methods are not often shared within the medical community. Even when written descriptions exist, it can be challenging to reproduce results that another author has obtained. And while some data pipeline tools exist, they still require knowledge of, or guidance on, how the underlying data is structured. What is often overlooked is incorporating domain expertise into building predictors. In a conversation between a domain expert in clinical care and an analyst, the predictors underlying the data can be vague: variable descriptions often have abstract or implicit meanings. For example, "baseline creatinine" or "highest creatinine in the last 24 hours" may not exist directly in the health record and must be defined and calculated using programmatic techniques. Given these challenges, we decided to build a package to overcome them. The objectives for gpmodels were to reduce mental effort, to provide predictor transparency, which in turn makes predictors reproducible and shareable, and to incorporate domain expertise into the variables we create. Now I'm going to give a brief overview of exactly how this package works. At the highest level, there are global parameters and there are predictor-level parameters; the global parameters are applied to all of the predictors you create.
The global parameters are set through a time frame object that you create. There is fixed data, which is non-temporal: it can be gender, an address, a location, or religion. And there is temporal data, which has some sort of timestamp associated with it. There are IDs, which could be a patient or an encounter, or anything you want to associate with all of the data across all of the predictions. You have a fixed start time and a fixed end time. The fixed start time is when you want to make the first prediction, and it is considered the start of the at-risk period for that patient; the fixed end time is when you want to stop making predictions. If you only want one prediction, you can set the start and end times to be the same, and it will make a single prediction at that time. The step is the interval at which you want to make predictions: it can be six hours, it can be 20 minutes. But keep in mind that a smaller step means more prediction times, which increases the size of the resulting output and the processing time you'll need. Finally, max length creates an artificial cutoff. If you have hospitalized patients, you can set the cutoff to seven days to say, "I don't want to make any more predictions beyond those initial seven days." Additionally, when using timestamp columns for the start and stop times, the package will build predictors in an autoregressive fashion that includes data from the previous predictions.
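To make the global parameters concrete, the setup just described might look something like this in R. This is a hedged sketch: the function name `time_frame()`, the argument names, and the data frames (`encounters`, `labs_vitals`) are my paraphrase of the concepts above, not a verbatim copy of the package's API.

```r
# Hypothetical sketch of the global (time frame) setup described above.
# Function and argument names are assumptions based on the talk, not the
# package's confirmed API.
library(gpmodels)
library(lubridate)

tf <- time_frame(
  fixed_data    = encounters,       # non-temporal data: one row per encounter
  temporal_data = labs_vitals,      # timestamped, repeated observations
  fixed_id      = "encounter_id",   # ID linking fixed and temporal data
  fixed_start   = "admit_time",     # first prediction: start of at-risk period
  fixed_end     = "discharge_time", # when to stop making predictions
  step          = hours(6),         # make a prediction every 6 hours
  max_length    = days(7)           # no predictions beyond 7 days in
)
```

Setting `fixed_start` and `fixed_end` to the same column would correspond to the single-prediction case mentioned above.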
The next few slides will cover the underlying predictor-level functions. To start, we have baseline predictors. On the slide you can see the code that was previously used to calculate pre-pregnancy and trimester weights, looking at median weight over the past nine months in three-month intervals, and in the box on the right you can see how we simplified that process using the new package's functionality. Another type of predictor function we can invoke is the rolling predictor. Here you essentially have a fixed window that moves forward with every new prediction, considering T0 as the first prediction and T1, T2, and so on as subsequent predictions. In this example we want to calculate the minimum, maximum, and last systolic blood pressure at each prediction, so we look at the variable SBP (systolic blood pressure) with a lookback of six hours, meaning we look six hours into the past, with the window moving forward at every new prediction. Finally, we have growing predictors, which start at the point of the first prediction and accumulate over the course of, for example, a hospitalization, with every new prediction. Here we're looking at cumulative blood loss since admission: considering T0 as admission, we take the variable blood loss with sum as the summary statistic. Just to recap and show them side by side: T0 is the time of the first prediction; baseline predictors look back before that point, rolling predictors use a fixed window that moves with each prediction, and growing predictors accumulate from T0 onward. As I mentioned previously, my work is related to early warning systems.
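Continuing the sketch, the three predictor types might be declared like this. Again, the function names (`add_baseline_predictors()` and friends) and their arguments are assumptions based on the talk's description, not confirmed API, and `tf` is the hypothetical time frame object from the setup step.

```r
# Hypothetical sketch of the three predictor types described above.
# Function names and arguments are assumptions based on the talk.
library(lubridate)

tf <- tf |>
  # Baseline: median weight over the 9 months before T0, in 3-month intervals
  add_baseline_predictors(
    variables = "weight",
    lookback  = months(9),
    window    = months(3),
    stats     = c(median = median)
  ) |>
  # Rolling: min, max, and last SBP in a 6-hour window before each prediction
  add_rolling_predictors(
    variables = "sbp",
    lookback  = hours(6),
    stats     = c(min = min, max = max, last = dplyr::last)
  ) |>
  # Growing: cumulative blood loss from admission (T0) through each prediction
  add_growing_predictors(
    variables = "blood_loss",
    stats     = c(sum = sum)
  )
```

The named-function style for `stats` mirrors the talk's description of attaching summary statistics (minimum, maximum, last, sum) to each variable.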
So essentially what this package does is take the data and create an index, an anchor point, that transforms those timestamps into indices. It summarizes the data at the index level, and then pivots it from long data into one row per index per person, which I'll show you in just a moment. It also applies autoregression, where all of the previous data used in each prediction is concatenated to the current data, and I'll show you that as well. Consider a patient who is hospitalized starting, let's say, on June 2. The little zigzag line marks the at-risk period, because they're in the hospital, and the different colors represent different variables. Regardless of the actual values, we're really just looking at how many values there are and how we want to summarize them. The dashed lines indicate where we want to make the cutoffs: prior to the admission we look at baseline variables, splitting the time into two-month intervals, but during the admission, because data arrives at a higher frequency, we may want to look at it differently and cut it into shorter spans, which in this example are 24 hours. So here we're looking at a single subject, a single patient or single encounter, however you'd like to consider it, with summaries at the indices time zero, time one, time two, and so on. Normally these would be timestamps, and the package can interpret either: you can use indices if you've already got them, or you can use actual timestamps. The data is then changed from long form into wide form, with one row per person-period. But the problem with that is that previous data is lost.
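The long-to-wide pivot and the autoregressive carry-forward can be illustrated with plain tidyverse code on a toy dataset. The gpmodels internals presumably differ, but the transformation is the same idea: summarize at the index level, pivot to one row per person-period, then lag earlier summaries onto each row so no previous data is lost.

```r
library(dplyr)
library(tidyr)

# Toy long-form data: repeated SBP measurements per subject and time index
long <- tibble::tribble(
  ~id, ~t, ~variable, ~value,
  "A",  0, "sbp",     120,
  "A",  0, "sbp",     114,
  "A",  1, "sbp",     122,
  "A",  2, "sbp",     110
)

# Summarize at the index level, then pivot to one row per (id, t)
wide <- long |>
  summarize(min = min(value), max = max(value),
            .by = c(id, t, variable)) |>
  pivot_wider(names_from = variable, values_from = c(min, max),
              names_glue = "{variable}_{.value}")

# Autoregressive step: carry summaries from earlier indices onto each row,
# so a model trained on the row for time t also sees data from t - 1
ar <- wide |>
  arrange(id, t) |>
  mutate(sbp_min_prev = lag(sbp_min), .by = id)
```

Here `wide` has one row per person-period with `sbp_min` and `sbp_max` columns, and `ar` adds a lagged column per prior index; the package generalizes this to arbitrarily many variables, statistics, and lags.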
So, for example, for each of those observations, subject A at time zero through subject A at time three, you have this issue where you don't have the previous data. To fix that, we create autoregressive predictors, where the data is repeated with every new observation, so you can train your model using the previous data it has. In addition, and I won't show this here, the package will also summarize each of those time periods: if you have more than one observation, you can summarize it to the minimum, maximum, last, median, and statistics of that sort. This package is still in development, and we are still looking at developing it further. We'd like to add more transformation techniques to suit more use cases, and we'd also like to add visualization tools that help users see what their data looks like, so they can make design decisions about the data they're using. We'd also like to automate the process when domain expertise isn't available. Thanks for attending my talk, and thank you to R/Medicine for allowing me the opportunity to present my work. If you have questions or comments, feel free to reach out on Twitter, or check out the GitHub page to learn more. Thank you.