Is there a staff member here? Hello. Hi, my name's Celia. Hi Celia, thanks for being on time. It's okay. So we'll start after 12, right? Correct. Yep. Okay, thank you. And just a reminder, I'll be here to keep an eye on the chat, so if you have any questions just let me know; I'll be listening in. Okay. Generally speaking, toward the end of each hour I will take about a 10-minute break, and then we'll start again in the next hour. Okay, sounds good. Thank you.

Hello. We are going to wait a minute or two for others to join before we start. I'm pasting into the chat box the website on which this workshop is based; all of the workshop materials are on that website, so feel free to check it out. We'll wait another minute or two. Thank you.

Hi, everyone. I am the presenter of this workshop today. To introduce myself, I am an assistant professor at the School of Population and Public Health at the University of British Columbia. In terms of my research, by training I am a statistician, and I have done some research on the high-dimensional propensity score, which is basically the origin of today's workshop. I see we already have more than 30 participants. In the chat box I am again giving you the link to the website on which I will be presenting the materials for this workshop; you should be able to see the materials there. Before I begin, can anyone confirm whether they can see the screen where I'm showing this website? You can respond in the chat box. Okay, thank you for confirming.

So in this workshop we will be talking about the high-dimensional propensity score. We are going to cover some of the implementation guidelines for the high-dimensional propensity score, and we will be using an open data source so that others can replicate some of the implementations I'm going to show here. This workshop is not going to explain the R coding in all of its tiny details; it is designed more towards explaining the logic that is necessary to understand why the high-dimensional propensity score algorithm is useful and how to implement it using existing software packages. The general idea is that once you go through this workshop, you will understand the rationale for using the high-dimensional propensity score, and you will also have some understanding of how a regular propensity score differs from the high-dimensional propensity score. Then we will move on to some of the exciting extensions of the high-dimensional propensity score towards machine learning as well as doubly robust estimators, which can enhance a high-dimensional propensity score analysis to some extent. Of course, we will also discuss the advantages, limitations, and controversies around the high-dimensional propensity score, and we will talk about some of the reporting guidelines: when you are using the high-dimensional propensity score in an analysis and you are planning to present it in a manuscript, what are the things you are supposed to report? In terms of the workshop prerequisites, we are expecting that you have a basic understanding of R and of basic
regression. In terms of the code, this website was created using Quarto, so you should be able to see the code as well as the outputs, and then you can follow the analysis. Okay, let me try to understand what is happening. So in the chat you are telling me you are only seeing the outline. Okay, I understand what is happening. Let me stop sharing and share again, this time the whole screen. On this screen, can anyone confirm they can see the website? Okay, thank you for letting me know; otherwise I would have just shown the desktop instead of this website. Thank you. So this is the website that I have been showing, and for those who have joined a bit late, I am pasting the website's link one more time in the chat box so that you can also follow along with the website I am showing here.

All right, so I was describing the purpose of this workshop. I am going to explain the implementation guidelines as well as the difference between the high-dimensional propensity score and the regular propensity score, the rationale for some of the extensions into machine learning and doubly robust versions, and then I will talk about some of the advantages and limitations. In terms of citation, if you think this is something useful and you want to cite it somewhere, there is a citation available here. And if you have any questions about this workshop, or need some additional clarification, feel free to communicate with me after the workshop; I will be happy to discuss and receive comments from you.

All right, so with that in mind, let us begin. In terms of the logistics of this workshop, you do not need to run any code: all of the code as well as the outputs are available within this website, and in this website you can also see some of the citations and some of the text that might be helpful for you even after this workshop. For this particular workshop, just so that we can understand the concepts, I am using one particular research question that is going to be our motivating example, and that question is: does obesity increase the risk of developing diabetes? This is the question that we will try to address using one of the open data sources, and it is based on literature that is already out there. This is not necessarily a new research question, and answering it is not the objective of this workshop; I am basically using it as a motivating example so that I can show you how to implement the high-dimensional propensity score algorithm around this question. One other thing about this particular question is that obesity is not necessarily the best exposure variable, because it is sometimes very hard to define obesity, not in terms of measurement but in terms of for how long you have had it, and this is true for many other exposures like this. Again, I want to remind you that defining a well-defined research question is not the objective of this workshop; we are using this question as a motivating example so that we can show the implementation. Drawing causal inference about this question is not our objective here.
So, to answer this particular question, say we want to understand whether obesity increases the risk of developing diabetes, we need some data source that we can analyze to get some answers. Within the US context, we are using an open data source that anybody can download from the web, and that data source is NHANES, the National Health and Nutrition Examination Survey. This is a data source collected by the US government, by the US CDC, and they have made it available on the internet: anybody can go to the CDC website and download it, and there is even an R package that can help download the data if you know exactly which component and which variables you want for your analysis. You just need a very good understanding of the data dictionary.

In terms of this particular data, if you go to the CDC website, you can see there is data for 2013-14, and similarly they also have data for 2015-16 as well as 2017-18. Anybody can go to the demographic data for a cycle and look at the data dictionary, for example the DEMO_H documentation, and the suffix letter is associated with the year. For example, for 2015-16, when you go to the demographic data you see DEMO_I, and when you go to the 2017-18 data and look at the demographics you see DEMO_J. So these suffixes are associated with the years for which the data were collected. You can simply go to the data dictionary and see all of the different variables collected within the data source, and if you want to know more about a particular variable, for example country of birth, you can just click on it and see the categories based on which the information was collected. There is a lot of information available out there that you can utilize in your analysis.

One particular point I want to emphasize is that this is a complex survey; it is not simple random sampling. Because it is a complex survey, they also give you information about the sampling or interview weights, as well as the variance-estimation variables such as the strata and cluster variables that are necessary to conduct the analysis properly. But for the purposes of our analysis, so that we are not over-complicating the concepts, we are going to use the data as is, so that we can show, if you have access to a data source, how to use it for a high-dimensional propensity score analysis.

The other aspect of this particular data source is that it is observational, which means we obviously need to think about confounders in this analysis. If you look at the literature, and you are interested in the exposure of being obese and the outcome of developing diabetes, you can find many, many covariates and confounders that need to be addressed, and here you can see education, family history of diabetes, smoking status, and so on. So there are many confounders
you can identify in the literature that you need to adjust for in order to get a proper estimate of the effect of the exposure on the outcome. The way NHANES works is that, if we go back to the NHANES website one more time, this is the demographic component you were looking at, and if you go back you can see there are many other components available as well: not only demographics but also dietary information, examination data, laboratory data, and there are some additional data sources, such as the limited-access data, that you need to understand in order to obtain all of the different variables that live in different parts of the data.

Just to give you a bit more understanding of NHANES: for NHANES 2013-14, there is a demographic component, and from that component you can get information about age, sex, education, race or ethnicity, marital status, and so on. If you want information about obesity, that information is not available in the demographic component; you need to go to a different component for that, called the body measures component, and from there you get the information for the exposure. There is another component about diabetes, and from that component you get the information about whether the person is diabetic as well as the family history of diabetes. Then, for other covariates such as smoking, you need to go to yet another component on smoking and cigarette use, and similarly you get the diet information from the diet behavior and nutrition component, and physical activity from the physical activity component. Because the data source is so large, they have split it into many different pieces, or components.

If you look at each of these components, there is a variable called SEQN; I am going back to the demographic data one more time to show it. This is the ID information that is available in each of these components, so even though the data are stored in different files, because you have that unique ID number you can merge all of these files into one data source and create your analytic data that way, right? So remember, we were talking about three different data cycles: 2013-14, 2015-16, and 2017-18. Here we are only talking about 2013-14, but all of the other cycles have the same pattern: you go to a particular component to get the information about a particular variable, and once you have all of the information for all of these years, you can merge the information from the different years as well. So there are multiple layers involved in assembling this data, and if you want to know more about how to do that, there is an appendix at the very end of the workshop material where I talk about the design, its usefulness, and the different cycles. If you want reproducible code for downloading this data, there is a package in R called nhanesA, and in that package you can basically specify which data you want from which years. Remember, H was associated with the 2013-14 cycle.
So you can specify from which year you want the data, and then you can specify which particular variables you want to download. And how would you know the names of the variables? You go to the data dictionary and find the names of all the variables you need for your analysis, right? Say, for example, from the demographic component we are getting information about age, gender, education level, race/ethnicity, and so on; from the body measures component we are getting the BMI; from the diabetes component we are getting the information about diabetes; and then smoking, diet, physical activity, and so on. There are other variables such as sleep, and there are variables coming from the lab components, which we are also including in our data. Once we have all of these different components, we can merge them based on the ID variable. Remember, we talked about the ID variable SEQN that is associated with each particular participant, and it is a unique ID, so even if the information is spread across different components, you can merge them together using this unique ID.

You can do the same thing for the second cycle, 2015-16, and the structure is very similar: all I needed to do was change the suffix associated with the 2015-16 cycle. The variable names are usually the same, but you need to check one more time in the data dictionary whether the names remain the same, and then you can reproduce all of the code to get the 2015-16 data. The same thing happens for 2017-18: I am changing the suffix to specify which cycle's data I want, and most of the variables remain the same, although sometimes they change a variable name; for example, in the sleep component of this cycle they changed the variable names from the previous cycle, so I needed to go back and check what new name they were using. After collecting all of this, I merged it to create the 2017-18 data.

After merging, one of the things you need to do is recode the data, because sometimes a variable comes with categories you do not need. For example, age comes as a continuous variable; if you want to convert it to a categorical variable, you need to specify which categories you want. For the sex variable and the education variable, what are the categories you need? If a variable has eight different categories, do you really need all eight, or should you collapse some of them so that you do not have a problem with small cell sizes? These are considerations you need to take into account so that you categorize the data properly. You also need to think about the eligibility criteria you want to impose on your study. In our study, we imposed the criteria that age has to be at least 20 and that we only work with people who were not pregnant at the time the data were collected. We imposed all of these criteria; this happens for the 2013-14 data, and the same recoding happens for the 2015-16 cycle. You just do the same thing three times, because you are dealing with three different cycles.
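To make this concrete, here is a minimal sketch of the idea in R, assuming the nhanesA package and the 2013-14 table names DEMO_H, BMX_H, and DIQ_H; the variable names are taken from the NHANES data dictionary, but you should verify them, and the value codings (which nhanesA may translate to labels), against the documentation for the cycle you use. This is an illustration, not the exact workshop code.

  library(nhanesA)
  library(dplyr)

  # Download three 2013-14 components and keep only the variables we need.
  demo <- nhanes("DEMO_H")   # demographics: age, sex, education, pregnancy status
  bmx  <- nhanes("BMX_H")    # body measures: BMI (source of the exposure)
  diq  <- nhanes("DIQ_H")    # diabetes questionnaire (source of the outcome)

  dat_1314 <- demo %>%
    select(SEQN, RIDAGEYR, RIAGENDR, DMDEDUC2, RIDEXPRG) %>%
    inner_join(select(bmx, SEQN, BMXBMI), by = "SEQN") %>%   # merge on the unique ID
    inner_join(select(diq, SEQN, DIQ010), by = "SEQN")

  # Eligibility and recoding: adults aged 20+, not pregnant at the exam;
  # dichotomize BMI into obese vs. not, and define self-reported diabetes.
  # (Numeric codes assume untranslated values; check the data dictionary.)
  analytic_1314 <- dat_1314 %>%
    filter(RIDAGEYR >= 20,
           is.na(RIDEXPRG) | RIDEXPRG != 1) %>%
    mutate(obese    = as.integer(BMXBMI >= 30),
           diabetes = as.integer(DIQ010 == 1))

  # The same steps would be repeated for the _I tables (2015-16) and the
  # _J tables (2017-18), and the three cycles stacked together.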
After you deal with all of this recoding, you merge these data sources to get the entire population you want to analyze. Of course, there are additional considerations you need to take into account, such as how many values are missing. Specifically, in our data we were dealing with a number of lab variables, and the lab variables sometimes have missing values. In a realistic analysis, where we wanted to do causal inference and were interested in the clinical implications, we would probably need to think about these missing values a bit more, but for our purposes we basically just needed data to analyze, which is why we did not over-complicate things with missing-data methods. What we did is consider a complete case analysis, and based on that complete case analysis we created the data. The reason I spent some time explaining the NHANES data is that I am going to use this data to show you the implementation of the high-dimensional propensity score, and if we understand the data, that will help us understand some of the implementation details.

I have a question in the chat box. The question is: for any SEQN, is the information in the related data structures a single row per person, or are there multiple rows per person, for example for laboratory measures? Generally speaking, for most of the components this is single-row information. There are, however, some situations where we might have multi-row observations, such as the proxy information that I am going to introduce a bit later; for that, we have multiple rows. But other than the proxy information, the ICD-10 codes, all of the other components that you have seen are basically single-row information. There are some exceptions, however. For example, for systolic or diastolic blood pressure, a single measurement raises some measurement-error concerns, so what they do is take multiple measurements of diastolic and systolic blood pressure, and in the literature you will often see people take an average to get one value out of the different measurements. So, to answer the question directly: it depends on which type of variable you are talking about. For some lab variables there might be multiple observations, and for the proxy information, the ICD-10 codes, there might be multiple observations. But in this particular analysis, we are processing all of this information down to a single row per person; how we do that for the proxy information is something I will explain later. All right? Okay.

So now that we understand a bit more about the data, this structure will make sense: we are getting these variables from this component, there are different components, and we are merging all of these components, right? This table represents what I have already shown in the picture: the DIQ component gives you the information about diabetes, the BMX component gives you the BMI, there are some demographic components, and there are some behavioral variables that also come from different components.
There is some health history and healthcare access information that also comes from different components, and there is some lab information coming from at least three different components. Yes, sometimes it is possible that you have multiple measurements, but you need to figure out how you want to use that information: do you want to combine it into one value, or do you want to use the repeated information? It depends on the aim of the study. In this particular study, I have used one value rather than repeated information for these variables.

So in this variable list, what you are seeing is that I have 14 demographic, behavioral, and health-history-related variables, and most of them are categorical, either binary or some other kind of categorical variable. And there are 11 lab measurements that I have included in the analysis, and these are mostly continuous; we will see later how we deal with them. So we only have 25 variables here, right? To do any kind of causal inference, what we generally do is go to the literature and try to identify the confounders so that we can adjust for them in the analysis, right? But sometimes, especially when you are using secondary data, as an analyst you do not have control over which information was collected, because this information was collected by the US government to inform various policies they were interested in, whereas you are interested in the association between obesity and diabetes and want to answer that particular question. For this question, you went to the literature and found some confounders, but it is quite possible that there are confounders you cannot directly measure in the data source you are using. For example, here I am showing a variable, comorbidity status, that is identified as a confounder in the literature, but if you look at the NHANES data there is no direct way of measuring it. If you are familiar with comorbidity indices such as the Charlson comorbidity index, the Elixhauser comorbidity index, or the chronic disease score, the variables necessary to construct them are not available in NHANES, and that is why it is not possible for us to create this comorbidity variable; it will remain an unmeasured confounder in this analysis.

One question I received in the chat box was: how can we know the temporality in NHANES, given that it is cross-sectional in nature? Would that affect the analysis? This is a very interesting question, in the sense that when we are doing an analysis, particularly with cross-sectional data, we need to have a very good understanding of how and when the variables were measured, right? For example, when they ask a question about a particular disease, say, did you ever have cancer, sometimes they also ask: at what age did your doctor tell you that you have cancer, right?
If you are lucky, you might be able to get those questions about the time component, like when you learned that you have cancer or some other disease, and based on that time component you might set a restriction in your data analysis to address temporality. For example, if your exposure is one variable and your outcome another, you can use those time variables to restrict to subjects whose exposure happened before the outcome, and delete from your data the ones where the exposure happened after the outcome, right? For our particular situation, obesity and whether a person is diabetic, we cannot establish that type of temporality, and that is basically what the question in the chat is asking: if we cannot determine temporality, how can we make causal inference? But remember, at the very beginning of the workshop I said this is basically an example study, where I am showing you: if we had this data, how would we implement the high-dimensional propensity score? Since in this data source we are not able to determine temporality, obviously we cannot make a strong causal claim, and that is why we are not going to draw any clinical conclusion from this. So let's just move on. That was an interesting question, by the way, and it is something you also need to keep in mind whenever you read any paper based on NHANES in the literature: the causal inference part cannot really be established unless the study has some specific handle on temporality.

Okay, so we have talked about some confounders, and then there are other variables that are probably not in NHANES or whatever data source you are analyzing; those are unmeasured confounders, right? And if you have an unmeasured confounder, what are the epidemiologic ways to deal with it? One practical technique that people generally use is proxies. This is a figure from a VanderWeele paper in the European Journal of Epidemiology from 2019, and in that paper they talk about the modified disjunctive cause criterion. One of the things the modified disjunctive cause criterion tells you about which variables to adjust for is this: if A is the exposure variable and Y is the outcome variable, a variable that is associated with the exposure or the outcome should be adjusted for; if you know a variable is an instrumental variable, it should not be adjusted for; and if there are good proxies of unmeasured confounders, the unmeasured common causes, we should also try to adjust for them. For example, in this particular DAG you can see that U is an unmeasured confounder, but C1 is something that is measured, right? So according to the modified disjunctive cause criterion, we should be adjusting for this C1 variable when we are trying to understand the effect of A on Y, all right?
So in this situation, if we are talking about adjusting for proxy variables, what proxy variables are available in the NHANES data that would help us capture comorbidity status? That is where the additional information, the ICD-10 codes, becomes relevant, and let me explain a bit how that information is collected. During the interview, the interviewer asks the participant: in the last 30 days, what medications have you taken? The participant lists the medications they were taking in the last 30 days, and that information is then converted into ICD-9/ICD-10 codes. For those who are not familiar with them, ICD codes are an international system for classifying and recording diseases, and those codes are what get recorded in health administrative data sources. Say, for example, if you find a code of A49.9, that basically means some sort of bacterial infection, right? If there is a code of C61, that means the person is taking some prostate-cancer-related medication. These ICD-10 codes are in the NHANES data as well: if you go to the NHANES data you can see a lot of ICD-10 codes, and if you want to know more about them you can see a long list of ICD-10 codes in the RxQ (prescription medications) component of NHANES. That gives you a very good understanding of what types of medications these participants were taking in the last 30 days before the survey, right? And from that, you get some understanding of what other diseases these people have, which could help us understand their comorbidity status, right?

So if you are thinking of addressing your unmeasured confounding using some of the proxy information available in the NHANES data, you might be able to use some of these ICD-10 codes to get some information about your unmeasured confounder. One thing I should note here is that even though adjusting for a proxy of an unmeasured confounder should reduce your bias, the problem with proxy information is that the direction of the adjustment is not necessarily obvious: it could move your estimate to the right of the null or to the left of the null, and that is something you can only judge with subject-area knowledge; just from the analysis it is very hard to know in which direction it helps. Why it is still helpful is that, even though you do not know the exact direction of the correction, overall, proxy adjustment tends to reduce bias. And that is where epidemiologists and statisticians differ a bit: statisticians are more interested in unbiased estimates, whereas epidemiologists are more interested in reducing the amount of bias, so they see adjusting for such proxies of confounders as reducing the amount of bias, and they are happy with that kind of reasoning. All right. So I have explained the data source, the NHANES data, and I have explained why somebody would be interested in adjusting for proxy information.
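To make the proxy extraction concrete, here is a rough sketch in R, continuing the earlier nhanesA example. I am assuming the 2013-14 prescription medication table is RXQ_RX_H and that the ICD-10 "reason for use" variables are named RXDRSC1 to RXDRSC3; verify both against the RxQ data dictionary before running anything like this.

  library(nhanesA)
  library(dplyr)
  library(tidyr)

  # Pull the prescription medication component and reshape the ICD-10
  # reason-for-use codes into one long (SEQN, code) table.
  rxq <- nhanes("RXQ_RX_H")

  proxy_long <- rxq %>%
    select(SEQN, RXDRSC1, RXDRSC2, RXDRSC3) %>%
    mutate(across(starts_with("RXDRSC"), as.character)) %>%
    pivot_longer(-SEQN, values_to = "icd10", values_drop_na = TRUE) %>%
    filter(icd10 != "") %>%
    transmute(SEQN,
              code3 = substr(icd10, 1, 3))   # 3-character stem, e.g. "I10", "F33"

  # Each SEQN can contribute several rows, one per reported medication reason.
  head(proxy_long)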
Historically, what happened is that even when we have massive amounts of information in the data, and NHANES, for example, actually includes a lot of information about each participant, epidemiologists and statisticians generally want to restrict the analysis to the variables identified in the literature, or identified by some sort of DAG, or by some variable selection criterion. We are usually not comfortable adjusting for everything regardless of whether we can interpret it or not; that is not something we generally do. We only want to adjust for variables that we understand are affecting our analysis in some way, for example a risk factor of the outcome, or a confounder, and we also stay away from certain variables such as instrumental variables or mediators, right? That is where the high-dimensional propensity score algorithm is slightly different from our general epidemiologic practice of implementing methods in different data sources.

To give you some background on the high-dimensional propensity score: this is a paper that was published in 2009 in the journal Epidemiology, and over the years it has gained more and more popularity; even in 2023 you can see people still citing it. The key idea that this paper, the Schneeweiss paper, put forward is that we have access to proxy variables, to a lot of these ICD codes. Their implementation was aimed at administrative data, and because administrative data are usually not open, that is why we are using the NHANES data; but in both types of data sources we have this additional proxy information available, right? One of the arguments of the Schneeweiss paper was that even though we cannot interpret a lot of these ICD codes or this proxy information, they might still be helpful in our analysis. What was their logic? Their logic was that some of the codes, or the comorbidities that we see in patients, are actually informative about the health status of that person. For example, here you can see a CPT-4 code; this is not ICD-9 or ICD-10 but a different classification system, a CPT code, usually collected in hospitals. And this code says that someone was using an oxygen canister, which means that person is in very frail health; a regular, healthy person does not take an oxygen canister in the hospital, right? So what Schneeweiss was suggesting is that when a patient has a code like this, you know that patient is in frail health. Or if you see codes associated with regular check-up visits to the family doctor, you know that this person is compliant: if a person's codes show that every year they went to their family doctor to discuss their health, you know this is a compliant person who is very conscious about their health, and that also gives us some information about the state of their health.
So what Schneeweiss was suggesting in that paper is that we should not throw away information that can be useful for understanding a person's health status, right? Even though, in our analysis of the relationship between obesity and the risk of developing diabetes, whether somebody used an oxygen canister has nothing to do with the question directly, the use of an oxygen canister, or regularly going to the doctor for an annual check-up and so on, is still helpful for understanding the general status of their health, and that information can be useful. But the problem is that we have a huge amount of such information in the data source, and if you use thousands of these codes that you cannot really interpret, they can overpower the variables you have already identified as confounders, and you might get some bizarre results, right? So one of the things the high-dimensional propensity score algorithm does is try to identify a selected number of proxy variables that are going to be useful for the particular analysis; it does not use the full set of information, only the part that is empirically identified as useful. How it selects them is something I am going to explain when I go through the high-dimensional propensity score steps. You can see there are different steps involved in the high-dimensional propensity score, and by following those steps the algorithm tries to identify which of these codes are going to be helpful for the analysis.

One thing to notice here is that there are two types of variables we are talking about. One type, if we go back to the previous slide, are the variables already identified by subject-area experts as helpful for our analysis; these are called investigator-specified covariates. Then we also have the additional information, the proxy information; these are not investigator-specified covariates, but they are something we are also going to use in the high-dimensional propensity score, and we will see how to select a useful number of covariates from this huge amount of information using the algorithm. So again, there are two types of variables: the investigator-specified covariates, and the list of covariates commonly known as the empirical covariates, or recurrence covariates, in the high-dimensional propensity score literature.

Okay, that brings us to our data analysis. In our data analysis, we implemented all of our restrictions, such as age 20 or older, not pregnant, and we also restricted the analysis to those who had some proxy information, that is, use of medications in the last 30 days, and so on. So this is the data source we are going to use, and after merging all of the components and cycles and keeping the subjects who meet these restrictions, we obtained more than 7,000 participants, and we are going to use these 7,000 participants to show you the implementation of the high-dimensional propensity score. Okay.
So at this point let us take about a 10-minute break, and after the break we will talk about the high-dimensional propensity score itself. In the chat box I have a question about the website for this material; you can see the materials on the website that I am pasting in the chat box. Feel free to ask any questions in the next 10 minutes or so, or take a break, and after that we can come back and discuss the steps of the high-dimensional propensity score. All right, thank you.

Hello, everyone, we can start now. I see there is a question: are the participants in different years all different, or might the same participants be included in different NHANES cycles? Okay, so the way NHANES works is that they take a complex survey in different years, and the participants are usually not the same, so we can consider all of these participants as independent subjects when we are collecting information from different cycles. That is why the concern about independence of observations is not usually something we worry about here; this is not a longitudinal study following the same people over the years. I hope that makes sense. Okay, thank you.

So in this study, in the steps of the propensity score analysis, according to our eligibility criteria we obtained 7,585 subjects, and obviously we could just run a regular propensity score model on these participants. What would be the confounders we would select in a regular propensity score? In a regular propensity score we would use the investigator-specified covariates: remember, we talked about 25 investigator-specified covariates, where 14 of them came from demographic, behavioral, and access information and 11 of them came from lab information. In a regular propensity score analysis we would use only that information in the propensity score model. But in this workshop we are going to talk about how to also incorporate the proxy information, so that we can build a high-dimensional propensity score.

In terms of the proxy information, the way we obtained it is this: remember, at the very beginning of the workshop we talked about merging the three cycles, and when we merged the three cycles we created a complete case dataset for the investigator-specified covariates. We also talked about the proxy information, the ICD-10 codes, available in the three different cycles, and we merged those together to create a dataset of proxy information. So there are two different parts of the data: part one is the complete case data with all of the investigator-specified covariates, and part two is the information that we got from the proxies. The investigator-specified covariate part of the data is the data built from the complete case analysis, and the proxy information is something we created from the medications reported for the last 30 days.
And this is something very important to understand about the proxy information. When you are dealing with health administrative data sources, for example data recording hospitalizations, physician visits, or emergency visits, the information is collected on an ongoing, long-term basis: if a patient enrolls in a study, that does not mean the patient stops going to the doctor; they still go. So codes keep accruing, but in the high-dimensional propensity score algorithm what we generally do is restrict this proxy information in terms of the timeline: we do not take any information that was collected after the exposure. All of the proxy information we collect comes from before the exposure. In our NHANES analysis this was almost automatic, because the proxy information came from the last 30 days, so we know it was collected before the BMI measurement; in that way we have made sure the proxy information does not come after the exposure. This window is known as the covariate assessment period in the high-dimensional propensity score terminology. When you are using health administrative data, this covariate assessment period usually goes back six months, one year, or two years; in most published analyses you will see six months. You can play with it if you think that, for the type of exposure or outcome you are dealing with, only six months of proxy information before the exposure is not enough; you can obviously extend it to one year or two years. This also makes sure that none of the information you are collecting is post-baseline, which is very important, because if you are dealing with post-baseline information you may not be able to tell whether a variable is a mediator or a confounder, right? Okay.

For part two of the data, the proxy information, there is something else you need to consider: you need to be a bit careful about which variables you let into the analysis. If you include proxy information that is essentially a re-measurement of your own exposure or your own outcome, it will obviously appear highly associated with the outcome, right? And that is something you do not want. Also, if you know some drugs or codes are going to act as instrumental variables, those are variables you do not want to include as proxy information either. Similarly, if there are proxies of an investigator-specified covariate that you have already included in your investigator-specified covariate list, do not include those in your proxy list; otherwise you will have duplicated information, a covariate plus additional proxy information directly related to that covariate, and that will create problems in terms of multicollinearity and so on. The same goes for the exposure and the outcome.
So in this case, what you are seeing is that there are some ICD codes that are directly related to diabetes, our outcome, and there is one ICD code that is directly related to obesity or overweight, our exposure, right? These are codes we do not want to include in our analysis, right?

I have a question in the chat box, and it says that we initially had around 30,000 participants, but we are only working with a bit more than 7,000 here. That is because we restricted our analysis to those who had proxy information: not all participants reported that they were using a medication in the last 30 days, and for the sake of our high-dimensional propensity score analysis we did not include people without that information in our data. That restricted our analysis to about 7,000 people, not the roughly 30,000 participants we originally had in the entire population. Does that clarify? That was one of the questions from the chat box.

Anyway, so we try to be a bit careful. We want to include a lot of the available proxy information, but it needs to be sorted a bit, in terms of whether a code is a proxy of the outcome, a proxy of the exposure, or a proxy of one of the covariates already included in the analysis, right? Once that is done, we can include the variables we do not think are problematic and see what they look like. For example, for your understanding, I have printed the information from one particular participant, the one with SEQN 100001, and looked at the ICD-10 codes available for that person. I see there is a three-digit code F33, which means a major depressive disorder; there is another code, I10, which is the ICD code for hypertension; and so on. This is all coming from one participant. Remember, at the very beginning there was a question about whether we have multi-row information coming from these participants. For most of the variables we dealt with before, no; for some of the lab variables we had multi-row information, but we condensed it into one row, by taking an average or some other summary. But this proxy information, as you can see, is multi-row: for the same person you have multiple records, because this person is taking medication for major depressive disorder, and also medication for hypertension and heartburn and so on, right? So these are multi-row records coming from the same participant. What we then do is merge this proxy information with the original complete case data, and that is where we reduce the number of subjects from about 30,000 to about 7,000, right? That creates the data source on which we are going to run our analysis, right? And this is step one: in this step, what we have basically done is identify the proxy data source.
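As a small illustration of this filtering and merging, continuing the earlier sketches, one might drop proxy codes that are essentially re-measurements of the exposure or the outcome and then keep only participants present in the complete case analytic data. The ICD-10 stems E10, E11, and E66 below are just examples of diabetes and overweight/obesity codes, and `analytic` is a hypothetical name standing in for the merged complete case dataset of investigator-specified covariates.

  # Drop proxy codes that are direct proxies of the outcome (diabetes) or the
  # exposure (overweight/obesity) before building hdPS covariates.
  exclude_stems <- c("E10", "E11",   # diabetes mellitus codes, for example
                     "E66")          # overweight and obesity

  proxy_clean <- proxy_long %>%
    filter(!code3 %in% exclude_stems)

  # Keep only subjects that are also in the complete case analytic data, which
  # is what reduces the sample from roughly 30,000 to roughly 7,000 people.
  proxy_clean <- proxy_clean %>%
    semi_join(analytic, by = "SEQN")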
In our case, the proxy data dimension was the medication data available in NHANES. If you are using some sort of administrative data, it is possible that you have one dimension coming from hospitalizations, another dimension coming from emergency visits, another coming from diagnostic procedures, say whether somebody had an X-ray, and so on. So you may have multiple dimensions, but in our NHANES data we have only one dimension, the medication data, and we have merged that information with our complete case analysis data to create our analytic data.

Okay, now that we have merged the data and we know the data source we are going to use, we can create a frequency count; it is no longer patient-specific, but across the analytic data we want to know, say, how many participants are associated with the I10 code. Remember what the I10 code was? It was associated with hypertension, and you can see that a lot of the records in this data source are associated with hypertension. Then there is another code with the second-highest frequency, and so on. One important point to understand here is that some of these ICD codes appear very often, so many participants will have this hypertension medication in this population we are interested in, people who are overweight and people at risk of diabetes. But you may also have other ICD codes that are not that frequent: some codes may have counts of less than 20, or even less than 10. The rareness in itself is not really a problem; it is certainly possible that a rare code is a real confounder. The problem begins when you are dealing with ICD codes that are very rare, say with only one or two occurrences, because that creates numerical problems, and you can imagine that if many of these variables have very low cell counts, that will create numerical instability in your analysis. To avoid that problem, we may want to impose some restrictions so that we do not include too many such variables. That restriction can be imposed in different ways. The original proposal in the 2009 Schneeweiss paper was to take, from each dimension, the top 200 codes with the highest prevalence. But later papers, for example this one, showed that while that restriction brings numerical stability, restricting the analysis to only the top 200 covariates might mean that you fail to capture some confounders that are rare, and that can create additional problems. If you really think about it, there are two different problems associated with low counts. One is that a low cell count can create instability because of the rareness of that particular covariate.
The other problem that can happen is a code that is essentially non-existent: maybe in the original data you had that code, but in the subset of the data there is no person associated with it, and then you have all zeros for that variable. That is a case of zero variance, and it would obviously create a problem in fitting your regression, so you do not want that. To restrict that, if you look at one of the packages that implements high-dimensional propensity score covariate selection in R, known as autoCovariateSelection, one of the options it provides is to choose the minimum number of patients you want associated with any particular code. If you set that, you do not necessarily need to think about the other option, how many of the most prevalent codes to keep, say the top 200; you can choose a large number there so that the option becomes redundant. For example, instead of 200, if I just put 7,000 or something like that, that option will no longer affect the analysis. But if you do not restrict the minimum number of patients you need for each code, you might face numerical instability. I think this is a very clever option: you can set it to 10 or 20, and that removes much of the numerical instability that is usually associated with the high-dimensional nature of the analysis.

All right, so in the long-format data we have this one data dimension, and I named it DX because it came from the diseases that were recorded based on the medications people were taking, and these are the three-character ICD codes that were collected. In terms of the updated frequencies after imposing the minimum of 20, you can see the top and the tail of the list, and there are no codes with a frequency of less than 20 anymore. That is a clever way of dealing with the numerical instability problem. In total, 126 codes were retained for the subsequent analysis once we restricted to codes associated with at least 20 patients. That is basically our step two: in this step, we have chosen the codes based on their frequency; if the frequency is too low, we do not carry those codes forward, and that is primarily to avoid numerical instability.
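Rather than reproducing the autoCovariateSelection function calls here (check that package's documentation for its exact arguments), here is a plain dplyr sketch of the same idea, continuing from the `proxy_clean` object above: keep only the codes carried by at least 20 distinct participants.

  # Step 2 sketch: count distinct participants per 3-character code and keep
  # only codes that appear in at least `min_patients` participants.
  min_patients <- 20

  code_counts <- proxy_clean %>%
    distinct(SEQN, code3) %>%
    count(code3, name = "n_patients") %>%
    arrange(desc(n_patients))            # I10 (hypertension) sits near the top

  kept_codes <- code_counts %>%
    filter(n_patients >= min_patients) %>%
    pull(code3)                          # 126 codes survive in the workshop data

  proxy_kept <- proxy_clean %>%
    filter(code3 %in% kept_codes)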
So in the third step, what we do is create binary variables out of those ICD codes. The way we do it is that, say for example, D63.9 was one of the original ICD codes, a code for anemia, right? We convert that code into three different covariates, determined by whether that code occurred only once for a patient, whether it occurs sporadically for that patient, or whether it occurs frequently for that patient. Try to understand this part: for each code we are creating three different binary variables based on the recurrence of the code under these conditions. "Once" means it occurred at least once; "sporadic" can be interpreted as occurring at least as often as the median count; and "frequent" means the patient's count is at or above the 75th percentile. So for each of these codes we are creating three different binary variables, and these binary variables are known as the recurrence covariates. But there is a catch: it is not exactly three times the number of codes, because sometimes the "once" and the "sporadic" or the "frequent" versions are exactly identical, given how the code occurs, and then we do not keep both columns. Think about it: in a regression, when you include two covariates that are identical, that creates a multicollinearity problem and you cannot fit the regression, which is why you cannot keep all of these recurrence covariates; you only keep the ones that are distinct. When we calculated all of these recurrence covariates and kept only the distinct ones, we were left with 143 distinct recurrence covariates. And these covariates all come from the proxies; they have nothing to do with the investigator-specified covariates. If you look at the list, you can see all 143 of these codes, and most of them are "once" or "frequent" versions; there are no "sporadic" ones here, right? In total we have 143 distinct covariates, and whether to include all of them or not will be determined in the next step. So in step 3 we created three versions of each code based on whether it occurred only once, sporadically, or frequently, and then we kept only the distinct ones, which gave us 143.

In step 4, we will select which of these 143 we want to include in our analysis. Okay, and this particular step, step 4, is the most important step of the high-dimensional propensity score: this is where you determine how many of these 143 you are going to use in your analysis. There is a logic to it, and the logic comes from a very old paper; I am talking about a 1966 paper written by Bross. He posed this problem in a context where there is an unmeasured confounder: even though you do not have the unmeasured confounder measured in your data, you have some understanding of that confounder, and based on that you try to adjust your analysis. That was the premise of the original paper and the formula. In that formula, he suggested that you need to know three things to get a general understanding of how biased your estimate is: the prevalence of the binary unmeasured confounder among the exposed, the prevalence of the binary unmeasured confounder among the unexposed, and the association between the binary unmeasured confounder and the outcome. So the first two are about the exposed and unexposed groups, and the third one is about the outcome. If you can make a reasonable, educated guess about these three components, you can plug the numbers into the formula: one piece comes from the exposed group, one from the unexposed group, and one is the crude risk ratio for the association between the unmeasured confounder and the outcome. If you knew these three things, or could reasonably assume them, you could calculate the amount of bias you would have in your analysis if you did not adjust for this unmeasured confounder.
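For concreteness, here is a rough end-to-end sketch in R of steps 3 and 4 under the assumptions above, continuing from the `proxy_kept` and `analytic` objects in the earlier sketches; the column names `obese` and `diabetes` are the hypothetical exposure and outcome defined earlier. This is not the workshop's exact code nor the autoCovariateSelection implementation; the bias term is the multiplicative bias commonly written as (p1(RR - 1) + 1) / (p0(RR - 1) + 1), with p1 and p0 the prevalences of the covariate among the exposed and unexposed and RR its crude risk ratio with the outcome.

  library(dplyr)
  library(tidyr)

  # Step 3 sketch: count occurrences of each kept code per participant, then
  # build once / sporadic / frequent indicators for each code.
  counts <- proxy_kept %>%
    count(SEQN, code3, name = "n")

  recur_long <- counts %>%
    group_by(code3) %>%
    mutate(once     = as.integer(n >= 1),
           sporadic = as.integer(n >= median(n)),
           frequent = as.integer(n >= quantile(n, 0.75))) %>%
    ungroup() %>%
    select(-n)

  recurrence <- recur_long %>%
    pivot_longer(c(once, sporadic, frequent),
                 names_to = "type", values_to = "value") %>%
    unite("covariate", code3, type) %>%                       # e.g. "I10_once"
    pivot_wider(names_from = covariate, values_from = value, values_fill = 0)

  # Drop exact-duplicate columns (identical recurrence versions of a code);
  # in the workshop data 143 distinct recurrence covariates remain.
  recurrence <- recurrence[, !duplicated(as.list(recurrence))]

  # Step 4 sketch: Bross (1966) multiplicative bias for one binary covariate.
  bross_bias <- function(covariate, exposure, outcome) {
    p1 <- mean(covariate[exposure == 1])            # prevalence among exposed
    p0 <- mean(covariate[exposure == 0])            # prevalence among unexposed
    rr <- mean(outcome[covariate == 1]) /
          mean(outcome[covariate == 0])             # crude risk ratio with outcome
    (p1 * (rr - 1) + 1) / (p0 * (rr - 1) + 1)
  }

  # Attach exposure and outcome, compute |log(bias)| for each recurrence
  # covariate, and rank: values near zero suggest little confounding.
  dat <- recurrence %>%
    inner_join(select(analytic, SEQN, obese, diabetes), by = "SEQN")

  cov_names <- setdiff(names(recurrence), "SEQN")
  abs_log_bias <- sapply(cov_names, function(v)
    abs(log(bross_bias(dat[[v]], dat$obese, dat$diabetes))))

  ranked <- sort(abs_log_bias, decreasing = TRUE)
  top_covariates <- names(head(ranked, 100))   # e.g. keep the top 100 prioritized covariates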
Alright, so that was a very simple formula, but it required some strong assumptions: you need some understanding of the confounder's prevalence among the exposed and the unexposed, and of its association with the outcome. Now think about our current context. Our problem is actually much simpler, because we do not have to assume anything about an unmeasured confounder: we already have 143 covariates in our data, identified in step 3. Our job now is simply to check whether the bias amount attached to each of these 143 covariates is high or not. If the bias amount for a covariate is high, we treat it as a variable we need to adjust for; if its bias amount is very close to zero, that is, very close to a null effect, we do not need to adjust for it. That is really the main idea behind the high-dimensional propensity score: this is how it selects which covariates to include in your analysis. So in our analysis, in place of the unmeasured covariate U, we plug in each of the recurrence covariates identified in step 3, all 143 of them. We apply the formula 143 times to calculate the amount of bias associated with each recurrence covariate, say the "once" version of D63, the "sporadic" version of D75, and so on, and for each one we get a bias amount. To do this with the autoCovariateSelection package, you use its get_prioritised_covariates() function and tell it how many covariates you want, and it will return, for example, the top 100. One other thing commonly done in the literature is that, instead of the original bias multiplier, you take its absolute value on the log scale, so that the null value is zero; you are then simply checking whether the absolute log of the multiplicative bias is close to zero or not. Here is a ranked list of the bias amounts on that log scale, and you can see some are very close to zero while others clearly deviate from zero. If you translate some of these codes, I10, if you remember, corresponds to hypertension, and R73 corresponds to elevated blood glucose. Hypertension, elevated blood glucose, and the like are strongly associated with our outcome, which was the risk of diabetes. So these are not random covariates we are picking: we are choosing variables that are going to be useful for the analysis, even though we never listed them among the investigator-specified covariates in the primary analysis.
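To make the mechanics concrete, here is a minimal R sketch of that calculation for a single recurrence covariate. This is not the workshop's own code: the data frame `dat` and the column names `exposure`, `outcome`, and `d63_once` are hypothetical placeholders, and the formula is just the Bross bias multiplier written out by hand.

```r
# Hypothetical sketch: Bross bias multiplier for one binary recurrence covariate.
# `dat` is assumed to hold a 0/1 exposure, a 0/1 outcome, and the 0/1 covariate.
bias_multiplier <- function(covariate, exposure, outcome) {
  p_c1  <- mean(covariate[exposure == 1])            # prevalence among the exposed
  p_c0  <- mean(covariate[exposure == 0])            # prevalence among the unexposed
  rr_cd <- mean(outcome[covariate == 1]) /
           mean(outcome[covariate == 0])             # crude covariate-outcome risk ratio
  (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)
}

abs(log(bias_multiplier(dat$d63_once, dat$exposure, dat$outcome)))
# values near 0 suggest that leaving this proxy unadjusted would add little bias
```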
One other small point I want to make here: in this analysis I am using the Bross formula to decide which variables are useful for the next stage, and that is somewhat different from what we do in a regular propensity score analysis. In a regular propensity score analysis we usually check the standardized mean difference: we make a table stratified by exposure status, list all of the covariates, and calculate the standardized mean difference of each covariate between the two exposure groups. We then judge, based on that standardized mean difference, whether it exceeds some specified cut point, such as 0.1 or 0.2, whatever cut point makes sense for the research question at hand, and in that way we see which covariates are imbalanced. But there is a key difference between the standardized mean difference and the Bross formula. If you look at the formula one more time, it uses the association between the unmeasured confounder and the outcome. The standardized mean difference uses no outcome information at all: we only use the exposure, with the covariate stratified by exposed and unexposed, and we judge balance from that. Here, by contrast, we are directly using outcome information to decide whether a variable is useful in our analysis. That is the main difference, and perhaps the most controversial aspect of the high-dimensional propensity score: you are using the association with the outcome to choose covariates. These are proxy variables, of course, but in mainstream propensity score analysis this is strongly discouraged; the usual advice is that you should not look at the outcome when choosing variables. One other thing to keep in mind, though, is that we are not talking about the investigator-specified covariates here; we are talking only about the proxy information. So we are dealing with two different kinds of covariates: in a regular propensity score analysis we generally do not deal with proxy information, only with investigator-specified covariates, whereas in the high-dimensional propensity score the data-driven covariate selection happens only among the proxies, not among the investigator-specified covariates. The context is slightly different. Okay. So once we have all of the bias information, the bias multipliers on the absolute log scale, we can draw a density plot of these values, and you can see that many of them are close to zero but some are as high as about 0.12. In the literature, however, there is no established cut point for the absolute log bias. For the standardized mean difference there is a conventional cut point of 0.1 to call a variable balanced or imbalanced, but for the absolute log bias in the high-dimensional propensity score algorithm no such cut point has been proposed. What is usually suggested instead is to select, say, the top 100 variables; in larger data sources people sometimes keep as many as 500 proxy variables selected by the Bross formula. The top-ranked proxies, the ones associated with the largest absolute log bias, are the ones you carry forward.
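Continuing the same hypothetical sketch, ranking all of the candidate recurrence covariates and keeping a top-k set might look like the following, where `recurrent` stands in for a data frame whose columns are the 143 binary recurrence covariates and `bias_multiplier()` is the helper defined above.

```r
# Hypothetical sketch: absolute log bias multiplier for every recurrence covariate
abs_log_bias <- sapply(recurrent, function(x)
  abs(log(bias_multiplier(x, dat$exposure, dat$outcome))))

# density of the absolute log bias values (most near zero, a few clearly away from it)
plot(density(abs_log_bias), main = "Absolute log bias multipliers")

# keep the k top-ranked proxies; k = 100 here, as in the workshop example
k <- 100
top_proxies <- names(sort(abs_log_bias, decreasing = TRUE))[seq_len(k)]
```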
Okay, so once you have selected those proxies, you are dealing with two lists of covariates: one is the investigator-specified covariates and the other is the list of proxies you have just chosen, and in this particular analysis we selected 100 proxies to include. So think about it: we are now talking about 125 covariates in total, the 25 investigator-specified covariates we already had plus the 100 proxies we are adding. Now that we have identified our proxy information, remember what the high-dimensional propensity score algorithm says: we include the covariates selected by the investigators together with the empirical, or recurrence, covariates selected by the Bross formula. So in the high-dimensional propensity score, the only real difference is that we add these 100 proxies; the rest of the analysis proceeds very much like a regular propensity score analysis. In this analysis you can see we are including the investigator-specified covariates we saw before, plus the 100 proxies we just selected with the Bross formula, and once that model is fitted we can compute a propensity score. Looking at the propensity score distributions for the exposed and unexposed groups, there is sufficient overlap between the two groups for us to make a reasonable comparison. One other important point: in this particular analysis I am using the propensity scores as inverse probability weights, but if you prefer you can use them for matching or stratification and so on; in this workshop I am focusing on inverse probability weighting. When you use inverse probability weights, it is always a good idea to check the summary statistics of the weights. Here the weights run from 1 to about 53, and whether 53 is too large depends on your sample size. Remember our sample size was more than 7,000, so at worst one person is being represented 53 times relative to a person representing only themselves; that degree of weight inflation is not extreme in my opinion, so we would not treat 53 as an extreme value, and if you look at the distribution there are not many patients who are heavily over-represented, so I would not worry too much about it. What I would worry about is checking balance. For example, the red dots you were seeing are the standardized mean differences of all the covariates before weighting, and after weighting you see the blue markers, all of which lie within 0.1 in absolute value, so after weighting we have a much better pseudo-population in which the covariates are adequately balanced.
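As a rough illustration of this stage, a hand-rolled version of the propensity score fit, the inverse probability weights, and a crude balance check could look like this; `investigator_vars` is a hypothetical character vector of the investigator-specified column names, and `age` is just an example covariate.

```r
# Hypothetical sketch: hdPS model = investigator-specified covariates + selected proxies
dat_full <- cbind(dat, recurrent)                    # proxies live in `recurrent`
ps_fit   <- glm(reformulate(c(investigator_vars, top_proxies), response = "exposure"),
                family = binomial(), data = dat_full)
ps <- predict(ps_fit, type = "response")

# ATE-type inverse probability weights, plus a check for extreme weights
dat$ipw <- ifelse(dat$exposure == 1, 1 / ps, 1 / (1 - ps))
summary(dat$ipw)                                     # here the maximum was about 53

# crude weighted standardized mean difference for a single covariate
smd <- function(x, a, w) {
  m1 <- weighted.mean(x[a == 1], w[a == 1])
  m0 <- weighted.mean(x[a == 0], w[a == 0])
  (m1 - m0) / sqrt((var(x[a == 1]) + var(x[a == 0])) / 2)
}
smd(dat$age, dat$exposure, dat$ipw)                  # ideally within 0.1 in absolute value
```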
Once you have the inverse probability weights, you can estimate the odds ratio. How do you do that? You use the weights in a GLM to reweight your population: I take the weights calculated from the propensity scores in the previous step and estimate the log odds ratio, and here the log odds ratio is 0.42. If you are interested in other association measures, such as the risk difference, you can also use a GLM, but in that case you change the family from binomial to Gaussian and the link from logit to identity to obtain a risk difference estimate, and you can see the estimated risk difference in that scenario. So that is the log odds ratio you get from the high-dimensional propensity score algorithm. One very important point: even when you use the high-dimensional propensity score algorithm, remember that you are dealing with a lot of proxy information, and as I said earlier, adjusting for proxies is helpful because it reduces bias, but it is often unclear whether you are over-adjusting or under-adjusting, because the direction of the bias reduction is often not clear with proxies. That is why it is always important to also run a regular propensity score analysis and compare its result with the high-dimensional propensity score result. Now, one of the questions I received in the chat box: we are using NHANES, and NHANES is a complex survey, not a simple random sample. With a complex survey we need to use the survey weights so that the estimates target the original population, and we also need to include the strata and cluster information so that the variance estimates are correct. In this analysis, since I am simply demonstrating the high-dimensional propensity score, I did not want to over-complicate things, so I did not use the sampling weights. If you want to do the analysis properly, you multiply the inverse probability weights by the sampling weights so that you target the original population. What I did is not wrong either: in this situation the target population is the sample, not the US population; if you wanted to target the US population, you would multiply the propensity score weights by the sampling weights. For those who joined late and do not have the repository, here is the link one more time. Another question: is it correct that the regular propensity score analysis simply excludes the data-driven selected covariates? That is absolutely right. The main difference between the regular propensity score and the high-dimensional propensity score is the inclusion of the proxy information, only that part; if you leave out that part, what you have is exactly the regular propensity score analysis.
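Pulling those last few points together, a hedged sketch of the weighted outcome models might look like this. The quasibinomial family is used only to suppress glm()'s non-integer-weights warning; in practice you would pair these point estimates with robust (sandwich or bootstrap) standard errors, and `survey_weight` is a hypothetical column standing in for the NHANES design weight.

```r
# Hypothetical sketch: weighted outcome models using the IPW weights from above
or_fit <- glm(outcome ~ exposure, family = quasibinomial(),
              data = dat, weights = ipw)
coef(or_fit)["exposure"]              # log odds ratio (about 0.42 in the workshop example)

# risk difference: switch to a Gaussian family with an identity link
rd_fit <- glm(outcome ~ exposure, family = gaussian(link = "identity"),
              data = dat, weights = ipw)
coef(rd_fit)["exposure"]              # risk difference

# to target the original survey population, combine the design and IPW weights
dat$combined_w <- dat$ipw * dat$survey_weight
```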
When you do the regular propensity score analysis, you do not have the proxy information: you build the propensity score from the investigator-specified covariates alone, you again check the overlap, you check the weights, you check the balance, and then you use those weights, based only on the investigator-specified covariates, to estimate the odds ratio. The log odds ratio now comes out slightly higher, 0.68, where the high-dimensional log odds ratio was 0.42. So there is some difference, and comparing the two gives you a sense of what the additional adjustment is doing and in which direction it is moving the estimate. You can also run a crude analysis, with no adjustment at all, neither proxy information nor investigator-specified covariates, and that gives a crude log odds ratio of 0.73. Okay. Remember that in one of the steps of building the high-dimensional propensity score, step 5, we discussed how many proxy variables to select, and we chose 100. That was not based on any theory; we chose 100 simply because we had 143 covariates in total. To get a sense of how much the proxies are contributing to the change in the odds ratio estimate, you can repeat the analysis with different numbers of proxies. In this plot I am showing the odds ratios and their confidence intervals: this is the estimate when I adjust for only 10 proxies, this one for 20 proxies, this one for, say, 45 proxies, this one for 115 proxies, and so on. Repeating the analysis over different numbers of proxies gives you some interesting information. At the very beginning, as you add proxies, the estimate is somewhat volatile, but at some point it stabilizes: from about 50 adjusted covariates up to about 115 you see more or less a flat line, meaning the additional covariates are not changing the estimate much, but once you go past 115 there are some drastic changes in the estimated odds ratios. That should give you some sense of where your cut-off, the number of proxies, should sit; if I were doing this analysis, I would probably choose somewhere in that stable range.
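The kind of repeated analysis shown in that plot could be sketched along these lines, reusing the hypothetical objects from the earlier snippets (`dat`, `dat_full`, `investigator_vars`, `abs_log_bias`); the exact grid of values is arbitrary.

```r
# Hypothetical sketch: log odds ratio as a function of the number of top-ranked proxies
ks   <- seq(10, 140, by = 5)
ests <- sapply(ks, function(k) {
  proxies <- names(sort(abs_log_bias, decreasing = TRUE))[seq_len(k)]
  ps_fit  <- glm(reformulate(c(investigator_vars, proxies), response = "exposure"),
                 family = binomial(), data = dat_full)
  ps <- predict(ps_fit, type = "response")
  w  <- ifelse(dat$exposure == 1, 1 / ps, 1 / (1 - ps))
  coef(glm(outcome ~ exposure, family = quasibinomial(),
           data = dat, weights = w))["exposure"]
})
plot(ks, ests, type = "b",
     xlab = "Number of proxies adjusted for", ylab = "Log odds ratio")
```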
Okay, I have received some questions in the chat box, and before the next break I want to answer them. The first question: is this using IPW? Yes, this is an inverse probability weighted analysis all the way through; I am not using any matching in this workshop's implementation, I am showing the analysis with inverse probability weights. Second question: would a higher number of covariates induce an overfitting issue in the propensity score analysis? That is exactly right; when the estimates start shifting the way we saw, that is something I would be very cautious about, because you may be adjusting for information that is no longer useful and that is pushing your odds ratio around. Next question: does the order of the top 10 covariates matter? It actually does. These covariates were ordered by the Bross formula, by the absolute log bias multiplier we were using, so "top 10", "top 15", "top 20", and so on refer to that ranking, and we add them in that order. If we did not order them and simply adjusted for random subsets of covariates, we would not get a clear picture of how the estimate changes as we adjust for more and more of them. One other restriction originally proposed in the literature concerns the prevalence of the codes: remember, in step 2 we talked about the prevalence filter because we were worried about zero cells and low cell frequencies, and we chose n = 200 there. That, too, is something you can examine with a sensitivity analysis: you can change how many of the most prevalent codes you keep in step 2, repeat the whole analysis, and see whether anything changes. These sensitivity analyses are genuinely useful, because as I said before, proxy information helps reduce bias, but it often does not tell us in which direction the adjustment is moving, so it is always a good idea to compare, first against the regular propensity score analysis itself, and then through this kind of sensitivity analysis, to understand where the estimate is heading as you add more and more covariates. Okay, so let us take a ten-minute break, and when we come back we will talk about some of the challenges of the high-dimensional propensity score algorithm as well as its machine learning and doubly robust extensions. In the meanwhile, if you have any questions, feel free to post them in the chat; see you in ten minutes, thank you. Hello everybody. For those who may have joined a bit late, these are the workshop materials; I have pasted the link in the chat box so you can take a look. Okay, I have a question in the chat box about overfitting and multicollinearity in propensity score estimation. There is a common belief about propensity score analysis that the propensity score model is not meant to generalize: you are only dealing with the data at hand, you just check the standardized mean differences and balance, and you do not need to think beyond that. I think that is something we have to examine a bit more carefully.
For example, if we go back to the propensity score analysis, this is something I describe in the materials: if you look at some of the early guidelines for the propensity score, they say exactly this, that you should take into account all of the covariates you know of, and also model specifications beyond main effects, meaning interaction terms, polynomial terms, and whatever other specification makes sense, and that it is acceptable to slightly over-parameterize the propensity score model if that helps improve balance. This is not new; it has been in the literature for a long time. But obviously there is a limit: you cannot simply over-parameterize your propensity score model and assume there will never be any consequences. For example, there is a paper, I believe from 2016, showing that overfitting can lead to inflation of the variance, which is problematic when you are trying to estimate the treatment effect from a propensity score analysis. It is also true that in propensity score analysis we are usually less concerned with the standard errors of the coefficients in the propensity score model itself, but there is further literature showing that if you do not properly diagnose your propensity score model you can end up with unstable estimates. And there is recent work, particularly from the double machine learning side, on double cross-fitting, which is essentially an extension of the cross-validation procedure we use in prediction problems, tailored to causal inference settings where we are interested not only in predicting the outcome but also in the association of that outcome with the exposure. I think we now have enough understanding to move away from the idea that the propensity score is only about the data at hand and that overfitting is not a concern: we cannot just over-parameterize the propensity score model and assume there are no consequences. If you are interested, have a look at that literature, and at some recent work of mine showing the consequences of over-parameterization when you use very flexible machine learning algorithms, specifically those that do not satisfy the usual regularity conditions, and how some of those consequences can be remedied with cross-fitting-type procedures. Hopefully that gives you some insight; thank you. Okay, let us move on. We have already covered the high-dimensional propensity score algorithm itself, so let us talk about some of the challenges of using it. Remember that the core of the algorithm is the Bross formula, and in our analysis we had 143 covariates and used the Bross formula 143 times to obtain all of those absolute log bias estimates. That means we are selecting each of these proxies univariately, or at best bivariately, on its own.
We are not using a multivariable model, and the problem that arises is this: in the presence of an already selected proxy, another proxy might not be useful anymore; it might no longer act as a confounder. In a multivariable structure the set of useful confounders can change, because once some of them are in the model the rest may contribute nothing further. That is why it is important to think in terms of multivariable modeling, so that this kind of problem does not occur. There are also machine learning methods, such as the lasso and the elastic net, that are very helpful in dealing with multicollinearity-related problems, especially when several variables carry essentially the same information and are highly correlated. For that type of problem we probably need to think about a multivariable structure and lasso-type models, which may handle the proxies better than the univariate selection proposed in the high-dimensional propensity score algorithm. Also, as I discussed in response to the previous question, sample splitting is a useful strategy in prediction modeling, but in the causal inference literature we need more investigation into how useful sample splitting, or the double cross-fitting procedure I just mentioned, really is. In terms of controversy, I described earlier that the Bross formula uses information about the association with the outcome, and that sometimes does not sit well with people who are steeped in the propensity score framework and do not want to look at the outcome when selecting covariates; for them this is clearly a deviation from the original principles of the propensity score literature, because you are using the relationship with the outcome to decide whether a variable enters the analysis. That is a fair criticism, and it is a deviation from the original approach, but you also have to remember that one of the assumptions of propensity score analysis is that all confounders are measured. The whole premise of the high-dimensional propensity score is that some variables are not measured, and we are trying to remedy that situation and arrive at a reasonable solution: we are trying to empirically extract information from the large data sources we have access to, leveraging big data to reduce confounding. And of course, as I said earlier, sensitivity analyses and model diagnostics are always important in any analysis of this kind, because you are dealing with a large number of covariates, so a lot of diagnostics and care should be built into the analysis. Okay. Briefly looking at the literature, I ran a PubMed search to understand what people generally do to enhance the high-dimensional propensity score algorithm, and I found seven papers. One paper from 2018 used the lasso, and I also used a lasso extension of the high-dimensional propensity score in my own 2018 work.
There is also a Franklin paper from 2018 that used the lasso, and both Franklin and I used some form of hybrid algorithm as well. There are other algorithms too: in the machine learning literature there are methods known as ensemble learning, which collect information from different types of learners and combine it, and these ensemble learners have also been used to enhance the high-dimensional propensity score algorithm. Among the remaining papers, one was a review and not so relevant here, one considered time-varying interventions, which is not our setting, and one worked with a low-dimensional data source where they simply added ten extra covariates, which is not what we are doing here; we are trying to make the best use of the large data sources available to us. Beyond what that PubMed search returned, there are other studies that do not come up under those keywords. Some of them again use lasso-type methods, and the Weberpals paper, for example, also used a deep learning method called an autoencoder. Other papers have used the collaborative TMLE (CTMLE) algorithm; I have not included CTMLE-type algorithms in this discussion because it is a somewhat more complex model with some strong assumptions that I have reservations about. There are also TMLE and super learner (ensemble learner) methods that have been used in the literature as alternatives to the high-dimensional propensity score, but I felt they were not using the full potential of those tools: when people used TMLE or the doubly robust approaches, they sometimes did not use super learners or non-parametric models, relying instead on parametric models, which does not fully unleash the power of TMLE; and when they used super learners or non-parametric learners, they sometimes did not use TMLE. Either way that struck me as an incomplete literature, which is why I thought it would be worthwhile to use TMLE both with and without super learner or ensemble learner methods and see whether anything interesting emerges. Anyway, what I took from this literature is that people have used machine learning methods that account for the multivariable structure of the proxies, instead of the original univariate selection suggested for the high-dimensional propensity score. In the next few pages I am going to explain the logic behind this, that is, how to move from the univariate approach to a multivariable one. If we want to keep the Bross formula as the core of the high-dimensional propensity score, there is no extension of the Bross formula available that accounts for a multivariable structure; so if we want an extension of the high-dimensional propensity score that accounts for the multivariable structure of the proxy information, the Bross formula cannot accommodate it in any straightforward way.
So we probably need to go back to the drawing board and think about how we then select our confounders, or our proxy variables. One interesting feature of the Bross formula is that it selects covariates based on their strength of association with the outcome, and that is something we can reproduce easily. Think about which variables the propensity score literature tells us to include in the propensity score model. Noise variables should not be used. Variables that are related to the exposure but not to the outcome are discouraged, because they could be instrumental variables, and instruments can amplify bias. Confounders must be included. And variables that are unrelated to the treatment but related to the outcome, the precision variables, are encouraged. Let me explain why. Confounders are the variables you absolutely need to control, because after adjusting for them you obtain an unbiased association. Instrumental variables, which are associated only with the exposure and not with the outcome, should be avoided, because they can amplify bias as well as increase your standard errors. A precision variable that is strongly associated with the outcome but has nothing to do with the treatment should be included, because including such variables helps reduce the standard error of the treatment effect estimate. Noise variables should not be included, because they only inflate the standard error. So, in the overall picture of what belongs in a propensity score model, the general advice is: no instrumental variables, no common effects (colliders), no effects of the outcome, no mediators, and of course you cannot include the unmeasured confounder itself, although you should try to include proxies of unmeasured confounders, which is exactly what we are doing. The variables you do need to adjust for in a propensity score model are the confounders and the risk factors for the outcome, the precision variables. And there is one thing those two groups have in common: both are strongly associated with the outcome. So when we are dealing with proxy information, we can simply try to identify the proxies that are strongly associated with the outcome and choose those as our proxy variables; that bypasses the need for the Bross formula altogether, and that is essentially what is done in the lasso implementation. In the lasso implementation, you build a model of the outcome based on the confounder variables together with all of the proxy variables; note that this is a model of the outcome, not of the exposure. You then run a lasso on it and identify which proxies survive, and you keep only the proxies that the lasso retains.
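A minimal sketch of that outcome-lasso selection, using the glmnet package and the hypothetical objects from the earlier snippets (`dat`, `recurrent`, `investigator_vars`), might look like this; note that it penalizes the investigator-specified covariates as well, and glmnet's penalty.factor argument could be set to 0 for them if you want them always retained.

```r
library(glmnet)

# Hypothetical sketch: penalized OUTCOME model on investigator covariates + all proxies
x      <- as.matrix(cbind(dat[investigator_vars], recurrent))      # assumes numeric columns
cv_fit <- cv.glmnet(x, dat$outcome, family = "binomial", alpha = 1) # alpha = 1: lasso

coefs         <- as.matrix(coef(cv_fit, s = "lambda.min"))
retained      <- rownames(coefs)[coefs[, 1] != 0]
lasso_proxies <- intersect(retained, colnames(recurrent))          # proxies kept by the lasso
```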
You then use those retained proxies in the propensity score analysis. So here again you are selecting some variables and adding that proxy information to the investigator-specified covariates, very much like the Bross-formula selection where you kept the top 100, except that the lasso decides for itself how many covariates it retains; there is no way to force it to return exactly 100 or 50. The lasso gives you a set of proxies, and you use that set. To implement it, you run an outcome model on the investigator-specified covariates together with the proxy information, keep the variables that come out of the lasso, rebuild your propensity score model from the investigator-specified covariates plus the lasso-selected proxies, compute the inverse probability weights from that model, and use those weights in the outcome model, which gives you an estimate of the log odds ratio. Remember that the log odds ratio from the high-dimensional propensity score algorithm was 0.42; here we get 0.41, a very similar estimate. In terms of statistical properties, in one of my 2018 papers I showed that a hybrid method, which draws on both the high-dimensional propensity score and the lasso, can have better statistical properties when you use this kind of multivariable structure for the proxy information. The hybrid approach works like this: you first select the proxy variables with the Bross formula, remember we kept the top 100, and then you apply the outcome lasso only to those selected covariates, which in our case brings us from 100 down to 52 proxy variables to include in the analysis. You repeat the whole analysis with those and get a log odds ratio of 0.44, again very close, and that is another estimate you can use; as I said, in my 2018 paper this hybrid estimator performed better than the regular high-dimensional propensity score or a pure machine-learning approach.
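Under the same assumptions, the hybrid variant only changes which candidate proxies enter the outcome lasso, so the sketch is short:

```r
# Hypothetical sketch: hybrid selection = Bross top-100 proxies, then an outcome lasso
x_hyb  <- as.matrix(cbind(dat[investigator_vars], recurrent[top_proxies]))
cv_hyb <- cv.glmnet(x_hyb, dat$outcome, family = "binomial", alpha = 1)

coefs_hyb      <- as.matrix(coef(cv_hyb, s = "lambda.min"))
hybrid_proxies <- intersect(rownames(coefs_hyb)[coefs_hyb[, 1] != 0], top_proxies)
length(hybrid_proxies)   # came down to roughly 52 in the workshop example
```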
Remember that the literature also describes another approach: using a super learner, which could mean combining, say, a logistic regression, a lasso, and some kind of spline method, and estimating the propensity scores from that ensemble. There is an important distinction between the methods I described before and this one. Before, I was selecting covariates based on their association with the outcome; here, with the ensemble method, I am modelling the exposure, so the proxies are effectively being used according to their association with the exposure variable. That is a somewhat dangerous proposition, because if you select variables for their strong association with the exposure, some of them may have nothing to do with the outcome: some will indeed be confounders, but you may also end up including instrumental variables. So this is a very different process for estimating the treatment effect, via ensemble learning or super learning, and when you estimate it this way the result comes out to about 0.47 on the log odds ratio scale. The last method I am going to show is a doubly robust method. Doubly robust means you only need one of the two models to be correct, either the treatment model or the outcome model, and on that basis you can estimate the treatment effect using TMLE, a doubly robust method with attractive statistical properties, usually more attractive than those of the ordinary propensity score estimators. And when you use the propensity scores you created with the ensemble learner, you do not have to repeat that estimation: you simply pass the estimated propensity scores to the tmle function of the TMLE package through its g1W argument, and it returns the estimates from the outcome model; the log odds ratio we get is 0.43.
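A rough sketch of that last pair of steps, using the SuperLearner and tmle R packages, is below. The learner library, the covariate set `W`, and the variable names are assumptions carried over from the earlier hypothetical snippets; the g1W argument is simply the externally estimated propensity score being handed to the TMLE fit.

```r
library(SuperLearner)
library(tmle)

# Hypothetical sketch: ensemble (super learner) propensity score, then TMLE
W <- cbind(dat[investigator_vars], recurrent)        # investigator covariates + proxies

sl_fit <- SuperLearner(Y = dat$exposure, X = W, family = binomial(),
                       SL.library = c("SL.mean", "SL.glm", "SL.glmnet"))
ps_sl  <- as.numeric(sl_fit$SL.predict)

tmle_fit <- tmle(Y = dat$outcome, A = dat$exposure, W = W,
                 family = "binomial", g1W = ps_sl)
tmle_fit$estimates$OR                                # marginal odds ratio and its CI
```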
So, out of all the estimates we have compared so far, here is the overall picture. The crude analysis gives an odds ratio of about 2, that is, a log odds ratio of 0.73, and the regular propensity score gives about 1.98, so the crude and regular propensity score estimates are fairly close to each other. But whenever I used the high-dimensional propensity score information, that is, any analysis in which the proxies were used, the estimates differ from the regular propensity score estimate, even though the ways I used the proxies were quite different. For example, the regular high-dimensional propensity score gives an estimate of 1.52; the pure lasso, machine-learning version gives a very similar estimate; the hybrid method, where I first used the Bross formula to select 100 covariates and then ran a lasso to further sub-select the useful proxies, also gives a very close estimate; TMLE with logistic regression as the only learner, and TMLE with three different learners in the super learner, give very similar estimates as well; and the super learner propensity score on its own gives a slightly different but still comparable estimate. What you see in this picture, then, is that whenever proxy information is used the estimates are broadly comparable, somewhere between 1.52 and 1.6, whereas without the proxy information, or with the crude analysis, the estimates are around 2. Are there any questions so far about these machine learning extensions or the other methods we have discussed? One question: do we believe the estimates that use the proxy variables are less biased? That is an interesting question, because from the analysis alone we do not know: we do not know the true parameter, and without the truth there is no benchmark to compare against. That said, there have been a number of simulation studies and applied analyses comparing the regular propensity score and the high-dimensional propensity score, where the high-dimensional version included the proxy information, and generally the results have been favourable, suggesting that using proxies helped in those analyses. But this is hard to generalize; it depends on the context. For example, with unmeasured confounding in health administrative data, in the Canadian health administrative data sources there is often no directly measured body mass index, so you have to think about proxy information, and sometimes it is simply not possible to find a suitable proxy; that is where these additional ICD-9/ICD-10 codes can come into play, helping to reduce the bias from not adjusting for useful confounders that are absent from the data. So again, this is a context-dependent question: if your data source contains good proxies that stand in for the unmeasured confounders, then yes, they will reduce your bias, but if the proxies available are poor stand-ins for the unmeasured confounders you are worried about, the results may not be as good. I see someone has raised their hand.
Do you have a question? You can turn on your mic and ask. "Yes, that's me, I just had a quick question. Are you aware of attempts in the literature, and I am drawing a parallel with propensity score calibration here, where, taking your body mass index example, you get a cohort from the administrative claims data, prospectively collect the true variable, for example body mass index, generate a propensity score based on that true variable, and compare it to the proxy-based score that the high-dimensional algorithm used? If they agree, we could say with some confidence that the proxies help even though we do not have body mass index, but if they differ, maybe the proxies changed the effect estimate yet made it more biased." Yes, that is one way to check whether the large amount of proxy information you are incorporating is actually helping, but I have not seen such a direct comparison in the literature, where the true variable was measured and compared against the proxies in that way. What I have seen is that people run the high-dimensional propensity score and the regular propensity score and compare the results against a separate clinical trial result, and on that basis they have sometimes claimed that the proxy information, where the proxies were useful supplements to the original analysis, helped. But you raise an interesting point: it may well be possible to prospectively collect information that can be included and checked; we just have not seen that analysis in the literature. The other point I have already mentioned is that we are dealing with proxies here, we often do not know how to interpret them directly, and when we say that the analysis is adjusting for an unmeasured confounder, we are operating under an assumption: the assumption that the information collected from the proxies is standing in for the unmeasured confounders. How well that works depends on how good the surrogates are, how good the proxies are as stand-ins for the original unmeasured confounding information. That is an untestable assumption, and we probably need to repeat the analysis in multiple data sources before coming to a reasonable conclusion; this is one of the reasons I suggest using the high-dimensional propensity score algorithm as a sensitivity analysis rather than the main analysis. I hope that makes sense. Alright, the next question was what other machine learning algorithms could be used. If you look at the literature, a few different algorithms appear: the most popular are the lasso and elastic-net approaches; some papers used tree-based methods, I believe random forests; and other papers used a deep learning method known as the autoencoder, essentially a deep-learning extension of principal component analysis. In general, any variable selection method that can identify variables based on their association with the outcome can be used.
For example, the lasso is directly a variable selection method and can be used as-is. Another option is to use bagging, boosting, or random forests, all of which give you variable importance measures indicating which variables matter more or less for predicting the outcome; you can then select, say, the top 100 variables by importance and do the proxy selection that way. Alright, that brings us to the last point I want to cover, which is reporting of the high-dimensional propensity score algorithm. If you look at the literature, two reporting guidelines appeared last year in the same journal; they are very similar and were published in the same year, along with two other reviews, one targeted more at machine learning and one more at the high-dimensional propensity score algorithm. Unfortunately, many of these reviews underplay the usefulness of machine learning methods, even though it has been shown repeatedly that machine learning methods have attractive statistical properties; perhaps a new review highlighting that is needed. In terms of reporting, the guiding question is what makes our description reproducible, so that anyone else with access to the same data could reproduce the analysis just by reading it. The most important items are these. First, the number of data dimensions used: in our analysis we used only the medication data dimension, but with administrative data sources you could use hospital data, emergency visit data, physician billing data, and so on. Second, what was done to remove problematic proxies: remember, we removed the proxies directly related to obesity and diabetes, and in the same spirit you should think about removing instrumental variables as well as variables that are close proxies of covariates already included as investigator-specified covariates. Third, the parameters chosen: for granularity we used 3-digit codes, though depending on the data source you might move from 3-digit to 5-digit codes; for the prevalence filter we chose n = 200, and whether that filter is useful is debatable, but you can always run a sensitivity analysis varying it and look for a point where the results stabilize; the same goes for the minimum number of patients, where we required at least 20 patients to have a code, and you could lower that to 10 or otherwise vary the number in several sensitivity analyses. The important point is that there is no theoretical guideline saying which of these numbers is best; it can depend on the data structure you are working with.
In my opinion it is therefore always a good idea to run sensitivity analyses, repeating the analysis several times with several values of these numbers, and to identify what best stabilizes the results for your analysis. For the recurrence covariates, the usual practice is the three binary covariates based on recurrence, that is, whether the code occurred only once, sporadically, or frequently. The covariate assessment period in our analysis was 30 days, but in the literature it ranges from 6 months to 2 years, with 6 months and 1 year being the most common choices. For prioritization, you can use the Bross formula if you are implementing the original high-dimensional propensity score algorithm, but if you use a machine learning version you must also state which algorithm you used, and for the more complicated algorithms you should explain which tuning parameters you chose to select the proxy variables with those learners. For the selected proxies, we used 100; again there is no theoretical justification for that number, and you should run a sensitivity analysis to see at what point including more covariates starts changing your odds ratio estimate, which will suggest how many proxies are suitable. This is a very data-dependent decision, and without a proper sensitivity analysis it is hard to say how many covariates will be useful. You should also report which software you used, and with which options: for the original high-dimensional propensity score algorithm I used the autoCovariateSelection package, which is available on CRAN, and for the machine learning methods you can look at the code I have provided here, which will guide you through implementing those pieces. For diagnostics, again, diagnostics and sensitivity analyses are very important in this kind of work. A standardized mean difference within 0.1 is the commonly used criterion, and you should also report the weight summaries: in our case the maximum weight was around 53, which is not a lot, and you can judge that against how many data points you have; we had more than 7,000 observations, so 53 was not a big number, but if you were working with only 100 data points, a maximum weight of 53 would be large, so this too is a context-dependent judgement. You should also compare the propensity score distributions between the exposure groups: overlap, or common support, was not an issue in our case, but you can imagine that with a highly predictive machine learning algorithm the estimated scores can pile up close to 1 for one group and close to 0 for the other, creating a non-overlap problem, and you have to check for that carefully before making any comparison. Finally, you should assess the distribution of the absolute log bias; I mentioned that one way to choose K, the number of proxies, is the kind of sensitivity analysis I described.
Another way is to look at the distribution of the log bias multipliers, see how many values sit very close to zero and how many deviate from zero, and use that to settle on a reasonable K for your analysis. It is also always a good idea to compare your high-dimensional propensity score analysis with your regular propensity score analysis. The regular propensity score has theoretical justification; the high-dimensional propensity score algorithm is essentially an ad hoc method, so if you want to present it as your main analysis, you have to state clearly why you think that is appropriate: do you believe your data source contains good proxies that properly stand in for your unmeasured confounders, and which confounders are those? Explain how you think the high-dimensional propensity score is helping to reduce bias; in my opinion that justification should be made very clear in the manuscript. For sensitivity analyses, as I have said, you can vary K, the number of proxies, or n, the prevalence filter, and use those runs to understand which values are suitable for the analysis. In terms of references for this workshop, the full list is available here and should help you become more familiar with the literature. At the end of the workshop materials you will also find a description of NHANES; for those who are not familiar with NHANES, it is worth taking a look at its design, because these are publicly available data sources with information on a wide range of topics, and it may be a good idea to explore whether you can use this openly available data to conduct analyses in your own context. That brings us to the end of my workshop. I will stay for a few more minutes, so if you have any questions, I am happy to answer anything you post in the chat box; otherwise, have a great rest of your day, and thank you.