All right, welcome, everybody. Our next session is a keynote by Dr. Bhramar Mukherjee, who is the chair of the Department of Biostatistics at the University of Michigan. I first met Dr. Mukherjee at a JSM conference many years ago, when she was focused primarily on statistical genetics. She has since moved on to modeling infectious disease epidemiology, such as the spread of COVID in India, and she will be speaking to us today about some of her more recent work. Please, everybody, extend a warm welcome to Dr. Mukherjee.

Thank you so much. Can you hear me? Can you see me? Can you see my slides? All yes? Check, check, and check. Thank you so much. I was actually very worried after Sandrine was not able to share her slides. I want to thank the organizers of this conference for not only creating a platform like Bioconductor, but also for bringing together data scientists with different perspectives and different viewpoints: statisticians, biostatisticians, computer scientists, bioinformaticians, epidemiologists, all together. This is a wonderful platform.

So I am going to talk a little bit about electronic health record research. I am not going to talk about my modeling work in India, because I think the electronic health record, the general premise of how to harmonize data, how to analyze data, how to think about selection bias, will be of more interest to this broad group of people. And I was finally able to combine my methodological research in electronic health records with some of the very tricky problems in COVID that we are all grappling with.

You might wonder why I am not in Seattle today; I would rather be in Seattle than in this rainy morning in Michigan. But I run a summer program. I always dreamt of a data revolution, and this is my army. Over the last seven years, we have trained more than 280 undergraduates, and this is their last week, so I cannot really leave the nest as the mama data scientist here. As soon as I finish the talk, I am going to go listen to their rehearsals for their final symposium. We have 39 undergraduates from all over the world learning about the things that we in this audience are all interested in.

I have divided my talk into three acts. Personally, I find giving virtual seminars extremely frustrating and intimidating. So I talked to my father, who is an actor, and he said, maybe you should try dividing it up into acts like a play: each part makes a point, and you should pause and give the audience a chance to reflect on what you have said. So I am going to take my father's advice and divide my talk into three acts.

The first act is really unusual for a statistician, because most of the time we are analyzing data that has been collected by other people. But I was really fortunate to be part of this vision and mission of collecting EHR data and connecting it to other auxiliary data sources in Michigan, and I want to share with this audience what being an integral part of data making taught me as a statistician. Act two is the statistical meat of my talk: how do I, as an old-hat statistician with a PhD from 2001, becoming completely antediluvian, think about sampling, selection bias, and misclassification in this era of big data? Do my skill sets matter? And then in the third act, I am going to talk about some of the COVID-19 problems.
But throughout the lecture, I am going to try to integrate some of the things that we saw with COVID-19 and with expanding our data repository during the time of a pandemic. It takes a village to do this kind of work, but I would like to thank three particular individuals. Lauren Beesley is the statistical architect; she is a Feynman postdoctoral fellow at Los Alamos right now. Lars Fritsche is a human geneticist, and Max Salvatore is an epidemiology graduate student. We needed genetics, we needed epidemiology, we needed biostatistics, we needed programming, we needed bioinformatics and medical informatics to really pull off this project, and you will see a snapshot of all of those skill sets coming together.

So, Michigan Precision Health: this initiative was supported by the provost's office and started about 10 years ago; we are actually celebrating 10 years of Precision Health. The idea was to create a data infrastructure, an ecosystem, that can serve as an accelerator and incubator for interdisciplinary health research. It can be used in a hypothesis-driven way, so that you can validate your findings, or in an agnostic way, so that you are interrogating this massive database for leads and clues, which is what happened in the early days of COVID when we did not know much.

I really like the Michigan Genomics Initiative. That is our cornerstone study, we call it MGI, and it started in 2012 as part of building a biorepository at the University of Michigan. We do not have the Women's Health Initiative. We do not have the Nurses' Health Study. What can we build upon? So it started with gathering a patient blood sample prior to surgery. I highlighted that sentence in red because the selection mechanism, how people were recruited into MGI, this participant recruitment mechanism, is extremely key to what we do in the follow-up statistical analysis. Participants were recruited in the anesthesiology clinic just prior to undergoing surgery, so this is a perioperative population. And this was a very bold, one-page consent, which allows us to connect two core data elements, electronic health records and genetic data, but also anything else that is linkable through your social security number. It also allows us to re-contact the participants with any findings or for future studies and data collection. It is a one-page, quite forward-looking consent that was rolled out in 2012.

We have seen biobanks all over the world. In particular, UK Biobank has been game-changing: genetics, imaging, all kinds of omics data are shared with the whole world in such a democratic way that it has been transformative for health research. But I really like the Michigan Genomics Initiative. It does not compare; you will see that it has about 100,000 participants right now, not half a million, and it does not have all the data domains. But it is very close to my heart because I see the potential to translate my work to the patients on the other side of the campus: I know this patient and I know this physician.

So I said that the two core elements of MGI are genetics and the electronic health record. The electronic health record will be characterized as the medical phenotype; to go from the electronic health record to a medical phenotype is a huge stretch, and I am going to talk about that as well. And for the medical phenotype, we have longitudinal data.
So we have more than 16 million medical encounters, and we have genetic data, after imputation, on about 20 million genetic variants across the human genome. We have 92,000 consented samples so far, as of June 2022. But we all know that an individual's existence cannot be codified through just genetics and EHR, so we collected other forms of data. We rolled out an epidemiology questionnaire on behavior and lifestyle; on a subset of the population, in a collaboration with Apple, we collected smartwatch data. We have an environmental exposure questionnaire. We have social data, we have family history.

As we started, we also realized the potential of linking auxiliary data sources. This has been really game-changing, and for those of you who are thinking about data linkage projects in your medical center, I strongly encourage you to think about a consent which allows you to link publicly available data. So we have the core elements, which I mentioned are the genomics and the epidemiology questionnaires and some surveys that we rolled out. But then gradually we built a database where we connected to patient medical insurance claims and prescription data. We connected to dental records, because those stay in another electronic health record system in Michigan. We connected to the National Death Index. We connected to the state death index. We connected to the cancer registry. And then, from geocoded longitudinal residential information, we connected to many neighborhood and socioeconomic variables. This took 10 years to build, but we really chewed on it and nibbled at it bite by bite to get this data ecosystem.

So when COVID happened, the advantage of having this was that we already had an engaged patient participant community, and we were able to build cohorts within this digital ecosystem, a COVID vaccination cohort, a COVID outcomes cohort, now a long COVID cohort, as well as rolling out surveys. For example, in the early days of COVID, May to June of 2020, more than two years back, we did not know much about how our participants were feeling about COVID, what their sense of community exposure was, what precautions they were taking. So my colleague, Kristen Weller, rolled out a survey where we could really study our participants, communicate with them, and build a connection with our catchment area. And just yesterday we published a paper asking whether, when you do these surveys and you have the survey linked to the EHR, self-reported COVID symptom or outcome data add anything to the electronic health record, or vice versa: do you need both sources of data?

So once you have created this mechanism, it really becomes easy to think about big projects that no investigator would be able to pull off alone. If the institution provides you with such a system, then maybe you are interested in a particular question and you can deep dive into a particular questionnaire or a particular form of omics data, because you already have a lot of data available on a lot of people.

So what is our summary of this process, and the vision moving forward? We envision a rich data set, obviously with large N and large P, and this data set can be exploited and used by various investigators. It could be a social scientist, it could be a political science fellow, it could be somebody in pharmacy, and of course the School of Public Health and the College of Medicine.
It also has questionnaires which make us compatible to contribute to biobank meta-analyses, and we have been able to participate in many consortia related to COVID and many other disease outcomes because we have questionnaires and some phenotypes in common with UK Biobank. It is a shared resource for the institution, not just for one investigator, and it is very empowering and gratifying that you are building a legacy and creating something for the future. And, as I told you, it can really be leveraged when some emerging health crisis or emerging challenge takes place.

So based on this, I was brave enough, with two other fantastic women in our school, to write a cohort development grant for recruiting a population-based study, because the electronic health record is only one part of the story. I have now sold the data set to you, and in the next act I am going to completely debunk the data set and describe the different problems with this kind of data. But I was fortunate enough and bold enough and courageous enough to put forward this cohort development grant, called MyCaS, or Michigan CaS, to study environmental risk factors and cancer. Over the next five years, we are going to recruit a 100,000-person population-based sample, targeting underrepresented groups, and collect data on environmental exposure. This will be a healthy cohort of 25- to 44-year-olds, and we are going to collect all kinds of omics and exposure data. These are cancer-free individuals. It is a proper cohort study.

So as a statistician, without being involved in this process of precision health, doing recruitment, doing community engagement studios, connecting with the participants, seeing the joy of that, I would probably just analyze someone else's collected data for the rest of my life. So I wanted to share that excitement with you. Once you have control over the data collection, you can share it in different forms and use it for other people who are going to study other endpoints in addition to cancer. So this has been really game-changing.

With that, I am going to take a pause, because this was my journey of being part of this data-making process, of creating the data quilt of MGI and Michigan Precision Health. I would like to pause for a minute for us to think about what the database looks like, and then I am going to talk about the statistical issues with electronic health records. If there are any burning questions, I am happy to take them.

All right. So now that we have thought about what the data architecture looks like, let us think about the downsides of this data. The pro of having a database like this is that it is easy: with the effort that we put in, it would be very difficult to recruit 100,000 participants from scratch, but these participants are already coming to Michigan Medicine for their healthcare. It has a large sample size. It has a lot of variables. It also automatically has longitudinal data; the median follow-up time in our cohort is about five years, and about 70% of the patients are actually seeking their primary care at the University of Michigan as well. And the piece that excited me the most is the callback potential: when a crisis happens, we can reach out to this engaged participant community. But the cons are that we really do not know why we are seeing an observation.
First of all, this is a hospital-based cohort, so it is not representative of the general population, unlike UK Biobank, which has the other bias, a healthy-participant bias. And we do not know why we see an observation in an electronic health record, because that depends on who you are. What kind of access to healthcare do you have? Do you have insurance? Are you a hypochondriac? Do you have a history of disease? It depends on so many things that we do not know the probability of a sample being included in the analysis. Also, some of the phenotypes and the data we extract are incredibly poor. So we have to go back to our classic training in epidemiology and sampling and think about who is in my study. We have to think about generalizability. We have to think about representativeness. We have to think about selection bias. Otherwise, we are going to produce an incredibly rich, wrong analysis and database.

I see our curricula in statistics, biostatistics, and bioinformatics departments becoming very heavy on machine learning and computation, and that is absolutely necessary. But I do think it is also important to think about the Stat 101 questions for big data: who is in my study, and what is the target population of inference? And my pet peeve is that I urge every computer science department to reciprocate: we are trying to learn so much of your tools, please institute a course on sampling and study design. Those of you from computer science departments, please relay this plea.

So now I want to talk about what is going wrong. Before I go to electronic health records, I want to talk about a fundamental concept, the big data paradox, which Xiao-Li Meng has beautifully explained in much of his work, including a recent Nature paper on COVID: big data can often actually hurt you. For the longest time, statisticians complained about not having enough sample size. But now that we are gathering data from multiple sites, we really have to think about bias, bias that does not go away with sample size, more than about precision. So we have to reorient our toolkit to think about these different sources of bias: sampling bias, selection bias, information bias.

I borrowed this slide from my colleague Rod Little. Suppose you have a true parameter that you are trying to estimate; suppose, in a simulation setting, that you know the truth is T = 0.4. We have spent a lot of time in classical statistics learning to define inclusion and exclusion criteria and to think about a very well-designed study where we are not so concerned about bias. We expect T to be inside the credible or confidence interval, but the interval is wide because it is such a carefully curated, and therefore smaller, study. Now we have this big, big non-probability sample. We can deal with a probability sample: if it oversamples certain groups, we know how to deal with that as long as we know the sampling probability; the whole field of sample surveys has been built to deal with those situations. But a non-probability sample is one where we do not know the probability of an observation being in my analytic dataset. So here we are dealing with a situation where I have tremendous N, so my variance is shrinking very fast, the credible interval is becoming very, very narrow, and I am losing the ability to overcome even a tiny bias.
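To make that picture concrete, here is a minimal simulation sketch with purely made-up numbers (a hypothetical binary trait with true prevalence 0.4, not any MGI quantity): a small random sample gives a wide interval that covers the truth, while a huge sample whose inclusion depends even weakly on the trait gives a very tight interval around the wrong value.

```python
import numpy as np

rng = np.random.default_rng(2022)

# Hypothetical finite population of 1 million people with a binary trait
# whose true prevalence is 0.4 (illustrative numbers only).
N = 1_000_000
true_p = 0.4
population = rng.binomial(1, true_p, size=N)

# (a) Small, well-designed probability sample: n = 400 drawn at random.
srs = rng.choice(population, size=400, replace=False)

# (b) Huge non-probability sample: inclusion probability depends weakly on
# the trait itself (people with the trait are a bit more likely to appear).
incl_prob = np.where(population == 1, 0.55, 0.45)
big = population[rng.random(N) < incl_prob]

for label, sample in [("SRS, n = 400", srs), ("biased, n ~ 500,000", big)]:
    est = sample.mean()
    se = np.sqrt(est * (1 - est) / sample.size)
    lo, hi = est - 1.96 * se, est + 1.96 * se
    print(f"{label}: estimate = {est:.3f}, "
          f"95% CI = ({lo:.3f}, {hi:.3f}), covers 0.4? {lo <= true_p <= hi}")
```

The naive interval from the big sample is extremely narrow, but it is centered near 0.45 rather than 0.4, which is exactly the paradox being described.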
This is incredibly important: the bias is not going to go away with sample size, and your tiny variance is actually even hurting you; it prevents you from overcoming that bias and doing plausible, valid, principled inference. This is the big data paradox: your standard error is of order 0.0001 and your bias is of order 0.01 or 0.02, and the interval cannot overcome it. So we probably want to design statistical methods which borrow from both of these tools and both of these principles: you have the well-designed study and you have the non-probability sample, and you ask how to combine them. Exciting research is really coming out, and statisticians know how to do bias-variance trade-offs and to combine these two kinds of inference and two kinds of studies through shrinkage and other principles.

So, coming back to MGI, I am going to show you why we started thinking about this. When I first looked at MGI, roughly 50% of people had a prior cancer diagnosis. So obviously this is not a population-based sample; in the Michigan population we do not have 50% cancer incidence in this age group. What is shocking and surprising and heartwarming to me is that 75 to 80% of the people we approached were actually willing to share all of this data with University of Michigan researchers so that we can help people and help human health.

So when we saw, in the first period of this data gathering, that we had so many people with genetics data and electronic health record data, we were excited about what we could do in terms of medical informatics and genetics. The first five or six years we actually spent on calculating polygenic risk scores for many, many diseases, in particular cancer in this database, and on creating repositories where people can evaluate and also download different polygenic risk score constructs. But at the time we did not really pay much attention to the selection bias issue. Gonçalo Abecasis, who started the Michigan Genomics Initiative, and I had really feisty conversations: I argued this is garbage in, garbage out, this is not going to give meaningful interpretation or even genetic odds ratios. So, lo and behold, we did look at the data, and I want to show you something really striking which led to the problem that I am going to study mathematically.

What we saw, in phenotype after phenotype, is this: on the X axis of the left side of the plot, I have plotted log odds ratios from genome-wide association study catalogs. Germline genetics is a field where we are very fortunate to have lots and lots of large studies, so this is probably the gold standard, meta-analyses of consortium studies. And on the Y axis, I have plotted the much smaller MGI, with its selection bias. What we see is a striking reproducibility, not just in the P values or the ranking, but in the point estimates. This really made me feel like a failure as a statistician, because I did not understand it. Fifty percent of these people have cancer, this is a biased sample; how can these be so concordant? The Lin's concordance coefficient was around 0.9, and we saw that for many phenotypes. So it was very clear that in some cases it works and you can ignore the selection bias, but when? Is it always? It is not enough to know it works for this dataset.
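For reference, the Lin's concordance coefficient used here has a simple closed form; below is a minimal Python sketch, where the effect estimates in the toy example are made up, not the actual MGI or GWAS values.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two vectors of
    estimates, e.g. consortium GWAS log odds ratios versus MGI ones."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()          # 1/n variances, as in Lin (1989)
    cov = np.mean((x - mx) * (y - my))
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Toy example with hypothetical log odds ratios:
gwas_beta = [0.10, -0.25, 0.40, 0.05, -0.10]
mgi_beta = [0.12, -0.20, 0.35, 0.08, -0.05]
print(round(lins_ccc(gwas_beta, mgi_beta), 3))
```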
We have to know, as statisticians; it is our responsibility to figure out when it will work and when it will not. That genetics work, I would say, did not pay much attention to these issues of selection bias and outcome misclassification. So the journey we embarked on was: why? Why is this biased but rich sample giving similar answers to population-based, million-dollar studies?

The other thing which really bothers me is that Michigan is one institution. We do not have, like the NHS in the UK, a system where all the electronic health records, all the medical encounters of a participant, are recorded in the same place. So the record has lots of gaps; we do not really know what is happening when we are not seeing a participant. So I wanted to think, with my postdoc Dr. Beesley, about what happens when selection and misclassification are at play together, and I want to show you some of the progress that we made.

The two issues we are trying to understand are these. We are not seeing the complete data: diseases may be happening in these question-mark areas, but we are not capturing them. In misclassification language, there is truly a disease, but we are not capturing it, so we have less than perfect sensitivity. It can also happen that a disease is reported but is not truly there, so there can be less than perfect specificity as well. Phenotyping, coming up with who has a particular disease, is a very complex process in EHR, and this talk is not about that. We use multiple auxiliary sources, structured and unstructured content of the EHR, in order to determine who has a certain disease. Many times just the ICD codes are used as the disease phenotypes, but most people use very complex and sophisticated algorithms, using all parts of the EHR, including treatment data and prescription data, to come up with a phenotype assignment which is much more precise than just using the ICD codes. Still, these phenotypes are often noisy, because you are not getting to see certain variables and certain reports, because we do not have an integrated EHR.

And then, thinking about selection bias, you really have to think about the question I posed: what is your target population of inference? Am I trying to generalize my results to the Michigan population, or to the US population, or just to Michigan Medicine? We have data on the people we approach for consent, but beyond that, what kind of data do we have? Do we have individual-level data? Do we have summary statistics? Because there is no free lunch: we have to have some data on the external population in order to map my internal analytic sample to the external population.

Here I want to convince you that when misclassification and selection are at play together, all our classical intuition, even about measurement error, is off the statistical ballpark. What I am showing you is a very well-known association between sex and cancer: women are less likely to get cancer, with cancer as an overall phenotype. But when you actually select your sample based on disease and sex, then you can end up with a completely different association. From the left-hand side in blue you can go to the right-hand side in red, where you are saying that women actually have higher odds of having cancer.
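Here is a minimal simulation sketch of that kind of sign flip, with purely synthetic numbers rather than the MGI data: the true log odds ratio for sex is negative, but because inclusion in the analytic sample depends on both disease and sex, the naive estimate in the selected sample comes out positive.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200_000

# Synthetic population: Z = 1 for women, D = 1 for cancer.
# Women are truly at LOWER odds of cancer (log OR = -0.4).
z = rng.binomial(1, 0.5, size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 - 0.4 * z))))

# Selection into the analytic (e.g. perioperative) sample depends on BOTH
# disease and sex: cases are much more likely to be included overall, and
# the case/non-case inclusion ratio is even larger for women.
logit_s = -1.0 + 2.5 * d - 1.5 * z + 1.0 * d * z
s = rng.random(n) < 1 / (1 + np.exp(-logit_s))

def log_or(dd, zz):
    """Log odds ratio for Z from a simple logistic regression of D on Z."""
    return sm.Logit(dd, sm.add_constant(zz)).fit(disp=0).params[1]

print("population log OR:     ", round(log_or(d, z), 2))        # about -0.4
print("selected-sample log OR:", round(log_or(d[s], z[s]), 2))  # about +0.75
```

Because the case versus non-case inclusion ratio is larger for women than for men in this toy setup, selection inflates the apparent disease odds more for women, which is enough to reverse the sign of the association.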
So it is not always attenuation towards the null: if selection starts to depend on the variable you care about and also on the disease, then you can end up with massive bias and incorrect conclusions. So how do we think about that? As an old-hat statistician, as I mentioned, interested in epidemiology and in misclassification and sampling, of course you put together models, models that relate the selection mechanism and the contamination mechanism to what you really care about.

So our framework is this. D denotes the true disease status, which we do not observe. D* is the contaminated, EHR-derived disease status that comes out of your phenotyping pipeline. S is the sampling or selection indicator of being in your study. And then there are different sets of covariates. Some covariates, particularly W, are informative about selection; we need things which are going to explain selection. For the sensitivity and specificity mechanisms we also need informative covariates, X and Y, remembering that we do not have gold standards here. And what we are really interested in is the relationship between D and Z, which could be the genetic odds ratio that I showed you and that motivated this.

So now what we have are models, and you might question models, and we will come back and criticize this as well. But we need some framework for studying mathematically how selection and misclassification affect our inference, so that I can go back to the question of why those biased-sample estimates were concordant with the GWAS population-based estimates. We need to know when that is going to be true. What we see here is that different covariates inform different parts of the misclassification model and the selection model, and there can be overlap between these covariate sets; I do not want to put too many equations here. Our ultimate mathematical goal is to connect the model that you are after, the one with the true disease variable D and the covariate set Z, with the model that we actually fit. The model we fitted in all of those genetics papers used the contaminated outcome D* and only the sampled data. We just did this and, with blind faith, assumed these theta-tildes were going to be close to the true thetas, without thinking about the selection, without thinking about the misclassification. Now I want to relate the two through these models of X and Y.

And this, if you are distracted, because it is very hard to pay attention when you are a virtual attendee at a hybrid conference, this is the key slide. It was beautiful to me, as a student of epidemiology who has studied case-control studies for a long time, that we could actually establish a relationship which ties the model that we fit, with D* and S, to the model that we are after. And that equation does not look too complicated; it starts making sense where the sensitivity and specificity enter the equation. And there is an offset term. If you are doing case-control sampling, outcome-dependent sampling, you will recognize that we often have such an offset term, but there it does not involve the covariates, it gets absorbed in the intercept, and that is why we can do estimation in case-control studies.
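A back-of-the-envelope version of the kind of identity being described, written here under much stronger assumptions than in the actual papers (perfect specificity, a single sensitivity c = P(D* = 1 | D = 1) that does not depend on covariates, D* independent of S given D, and selection depending only on D and Z), looks like this:

```latex
% Selection enters exactly as an offset on the logit scale, with
%   r(Z) = P(S = 1 \mid D = 1, Z) / P(S = 1 \mid D = 0, Z).
\[
  \operatorname{logit} P(D = 1 \mid Z, S = 1)
    = \operatorname{logit} P(D = 1 \mid Z) + \log r(Z).
\]
% With perfect specificity and sensitivity c, the observed outcome satisfies
\[
  P(D^{*} = 1 \mid Z, S = 1) = c \, P(D = 1 \mid Z, S = 1),
\]
% so if the true model is logit P(D = 1 | Z) = \theta_0 + \theta_Z Z and the
% disease is rare (or c is close to 1), the model actually fitted is roughly
\[
  \operatorname{logit} P(D^{*} = 1 \mid Z, S = 1)
    \approx \log c + \log r(Z) + \theta_0 + \theta_Z Z .
\]
```

Under these simplifications, log r(Z) plays exactly the role of the offset just mentioned: if it does not depend on Z (selection unrelated to the marker of interest), it is absorbed into the intercept and the naive slope still targets the true coefficient. The general case treated in the papers allows imperfect specificity and covariate-dependent sensitivity and is correspondingly more involved.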
What we see is really a transformed logistic-type model with the sampling variable folded in: a mathematical framework for studying the effect of sampling, which comes out as an offset term, and the two functions of the misclassification also appear in this equation. Now that we have this, it is still incredibly challenging to estimate those functions c(Z), b(Z), and r(Z); it involves a lot of thinking and careful mathematics to go from these models, and many times it is conditional probability and Bayes' theorem. What we were able to do is come up with algorithms which tell us how to decouple the problem, because we have lots and lots of unknowns: maybe we can fix the specificity parameter and take care of sensitivity first, and then we can ignore selection bias in one part and come back to it later.

The good thing is that if there were no misclassification, this result makes sense: the specificity would be one and the sensitivity would be one, and this would look like the ordinary logistic regression. Similarly, if the sampling did not depend on covariates, it would look like an offset term which is free of Z. So it matches and parallels what we believe is the true relationship. That was kind of interesting, and once we had this relationship, and you can read our papers, I will share the references in a moment, you can come up with a transformed logistic function through which you can start estimating the corrected theta from the naive theta. There are different ways of doing that in the absence of gold standard data, and there is a lot of literature on that. But then you put this misclassification literature together with the selection literature. At the end of the day, when we found this relationship, we did a lot of simulation studies, and we were able to see that if you use the transformed logistic link, you are actually able to reduce the bias in the log odds ratio and odds ratio that I showed you for cancer and sex.

But what about selection? To deal with the selection bias you have to think very, very carefully about the selection mechanism, how someone ends up in your analytic sample. There is now a lot of literature on combining a probability sample with a non-probability sample. If we have some individual-level data on the target population, then we did inverse probability weighting. But if you are not so fortunate and do not have individual-level data on the target population, then we had to resort to suboptimal methods like post-stratification and raking, which only use marginal summary statistics, not individual-level data. At the end of the day it was still a very complex estimation algorithm, because all three of these pieces are unknown, and many times they are not identifiable because you do not have enough data. So we had an iterative process of getting the weights with a fixed misclassification probability, and along the way a lot of mathematics and a lot of computational tools helped us come up with a solution.
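As one concrete illustration of the weighting step, here is a sketch of a common propensity-based recipe for building selection weights when individual-level data on a reference probability sample from the target population are available. The column names and the simple pooled-logistic trick are assumptions made for this toy example (and the reference sample is treated as if it were a simple random sample); this is not the exact estimator used in the papers or the package.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def selection_weights(internal: pd.DataFrame,
                      external: pd.DataFrame,
                      covars: list) -> np.ndarray:
    """Estimate the propensity of belonging to the internal (non-probability,
    EHR-based) sample by pooling it with an external reference sample, and
    return inverse-propensity-style weights for the internal rows."""
    pooled = pd.concat(
        [internal[covars].assign(in_internal=1),
         external[covars].assign(in_internal=0)],
        ignore_index=True,
    )
    fit = sm.GLM(pooled["in_internal"],
                 sm.add_constant(pooled[covars]),
                 family=sm.families.Binomial()).fit()
    p = fit.predict(sm.add_constant(internal[covars]))
    # Weight proportional to the odds of NOT being in the internal sample:
    # observations over-represented internally get down-weighted.
    return np.asarray((1 - p) / p)

# The resulting weights could then be passed to a weighted outcome model,
# e.g. via the freq_weights argument of sm.GLM, or combined with a
# misclassification correction of the kind described above.
```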
I also wanted to show you that not all GWAS are free of this; MGI is not great for every GWAS. When we looked at AMD, age-related macular degeneration, as an outcome rather than cancer, we saw that the MGI log odds ratios are far from the consortium odds ratios, and the Lin's concordance was 0.55. So the question was, can we get closer to what we had in the gold standard? As I said, there is no free lunch, but there is reduced-price lunch: after applying our methods we could improve the Lin's concordance with the gold standard from 0.55 to 0.7, but it is not completely corrected at all. So I just want to mention that in this world we have to accept that we are probably never going to get to perfect inference, so we should promote the concept of less imperfect inference, because less imperfect is all you can do with this data, but you should definitely try.

We developed a software package, SAMBA. We have heard many cool package names since this morning, and this is a cool one: SAMBA, Selection And Misclassification Bias Adjustment. It has complete inference, including standard errors and P values; please use it and share your feedback with us.

So at the end of act two, what is the takeaway? In the misclassification literature we did a lot of work that is novel because we did not have gold standard data for sensitivity and specificity, and then we tagged the weighting piece onto the misclassification piece. The reason the GWAS were largely okay, and I showed you the AMD counterexample, is that when we did the math we realized that the selection mechanism is often not related to the value of a single SNP or a particular marker or a piece of omics data. You break that association arrow, and that is why selection often does not influence the relationship between the particular SNP and the disease outcome. Misclassification could also be a bigger issue for cancers than for AMD. So we have to pay attention to these things, and in this case I will argue that one plus one is larger than two: the two literatures existed, but we really needed to study them together. These are some of our statistical publications on this issue, and with that I am going to move on. In the last few minutes of my talk I want to show some of the COVID examples where we benefited from having this work done.

So, the last part: COVID-19 and the electronic health record. I have been extremely disappointed by the lack of integrated electronic health record data in the United States. If you look at the impactful studies that came out in real time about vaccines and about variants, they have been from the UK, Israel, Denmark, and all of them have great integrated data systems where the vaccination data, the testing data, the clinical outcomes data, and the infection data are all linked and you can get to them in real time. In the US it is pockets of places and systems, but not really at a national level. So it has been really very difficult to understand what is truth and what is a generalizable finding. And all of these studies also show you that basic statistical training, in terms of study design, matching, addressing confounding and selection bias, meta-inference, is all very, very important.
So, because we had done all of this work with cancer and other phenotypes, when COVID happened we were again looking at a disease risk model: our outcome D now becomes COVID, and Z kept changing. Sometimes we are interested in genetics, sometimes in race and ethnicity, sometimes in socioeconomic status or pre-existing comorbidities. The outcome also kept changing, whether it is testing positive, or infection, or hospitalization, or mortality. And selection was very prominent: who is getting tested, particularly in the early days of COVID? Whose data are you seeing? So we built a lot of models of who is getting tested and repeatedly tested, so that the probability of selection into the sample was being modeled.

With that, in the early days of COVID, we were able to study differences in COVID outcomes across race and ethnicity. This was a paper in 2020, still early days, and what we saw, of course, is that there was a disparity in COVID outcomes. This is very well known, but you have to do proper adjustment, and many people could not do it; they were just publishing the unadjusted odds ratios, and we can debate what the proper adjustment is. What we noticed is that even one year later, when COVID outcomes, you can see the heights of these bars, had gone down dramatically thanks to vaccines and treatments, the reduction had still not been equitable across all communities of color. So I think this mechanism of having the methods set up and the database set up helped us delve a little deeper into some of these profound questions.

Here is another example. We did not know who was at high risk of COVID-related severe outcomes, so we searched agnostically throughout the phenome, looking in the past EHR at what contrasts those who got hospitalized with those who did not. There have been many, many models for risk stratification for COVID outcomes since then, but this was very early days, and we could stratify by race and ethnicity and ask whether particular comorbidities in the past electronic health record show up across ethnicities. So I think just getting data together, harmonizing, and trying to get better, it will never be at a population-based level, I have accepted that, I have reconciled with that truth, but still trying our best to correct all of these errors is incredibly important.

There are lots and lots of projects ongoing in my lab, particularly on long COVID, which is the most poorly measured phenotype that you can think of, and lots of work going on with repeated testing data: how do you think about booster effectiveness, as we are getting first and second boosters and probably an Omicron booster in the fall? How do we know who needs it and whether it is giving us additional benefit or not? So I think this has been a fantastic time to work on this data.

As a takeaway, I want to mention that having the right tools and the right team and all of this infrastructure set up was very crucial to doing this work. I see a lot of things on Twitter criticizing academia, saying that industry offers a lot more energy, a lot more money, and so on. But I have to defend academia: I do not know of any other profession where I had no funding, I had no resources, but overnight I could just decide that COVID is very fascinating to me and that I want to work on it, study it,
and analyze COVID. Working with EHR and such massive data sets, you make a lot of mistakes, so it is very frustrating, but it is also fascinating. I do recognize that I am a data dreamer, but I am not the only one: you have been listening to me for the last 40 minutes, so you are part of that dream. I do think that statisticians have a lot to offer to major public health and healthcare issues, and I am so fortunate to be a part of this community. With that, I want to thank you all, and also thank my funders, as well as every participant of the Michigan Genomics Initiative and the Precision Health Initiative, and the organizers of Bioconductor. Thank you so much.

Thank you for a wonderful talk, Dr. Mukherjee. Do we have in-person questions for Dr. Mukherjee? The microphone, we have several in the room. We have one question online from Gabriel Odom: can we take a stratified random sample from the EHR, or use a virtual twins approach? You're muted, Dr. Mukherjee.

Sorry. That is exactly what I mentioned: there is so much work that can be done with digital twins and matching approaches. Just think about your sample; sometimes not using all of it is actually better. That is why I was really proposing the re-institution of study design and sample survey courses in our programs: not just going overboard with machine learning, we need that too, but we need to think about how the data are made. Because if we don't do that, the bias is going to be perpetual, and study design needs to be taught in computer science departments as well.

Absolutely right. So survey sampling has always been kind of a dark art, even within stats departments. How do you propose to address this, especially at places that might not have the depth of resources that somewhere like the Michigan School of Public Health does?

So I think I am very lucky to be in Michigan, because I took a course from Rod Little, like two years ago, where he made us read Neyman's original paper with the finite-population calculations. It is incredibly hard; sample survey is very hard for me, but we persevered. And also we have the Institute for Social Research, which is a leading sample survey organization, so I will admit that I can actually knock on the doors of Rod Little and Mike Elliott and ask, what am I going to do? But I do think that thinking about selection bias with big data is an incredibly interesting and important area, and there are some things which are informative of why you see an observation: for example, the number of visits the patient has historically, whether they have access to insurance. There are predictive covariates. So I do think we need to modernize our sampling and design courses and make them a requirement. At Michigan, in our health data science program, sampling is a required course and we teach it jointly, and I do think we need to recruit faculty in those domains.

So, not entirely unrelated to the previous question: the Michigan Department of Health, and particularly its nexus with the Michigan School of Public Health, is, if not always cozy, at least fairly friendly. We have a question from, I believe, Sarah Stankiewicz, who says: I'm in Florida; how helpful or supportive was the Michigan Department of Health, or I guess the DHHS? I understand this might be a little bit more difficult to pull off somewhere like Florida, but can you comment on how generalizable this is at the state level or the province level?
Yes. So we have a paper coming out exactly on this, because you can prove all of this theory, you can even have the software, but how do you get the external population data? That is really key to being able to do this mapping. In many cases we have used BRFSS data, we have used NHANES, and we have also worked with the Michigan Department of Health; Michigan Medicine has a very good connection with the Michigan Department of Health. But even with that, we cannot get the individual-level data; we can get margins and totals. So we have to be prepared to use methods which can only use marginal summary distributions, not individual-level data. That is the tricky part. That is why I said we can put forward all of these methods and beautiful identities, and that is definitely a first step towards understanding, but estimation is still ten steps away, and you may not have enough data to correct for all of these biases.

A similar question, unsigned if I am not mistaken, noting that compared to the NHS, or a typical European or even Canadian health system, there is a lot more fragmentation in the US. So there is potentially ascertainment bias at the level of counties or individual systems, and on top of that we have large systems, like, say, Spectrum in Michigan, that have their own interests in how the data are used and what they are used for. Do you see this as being insurmountable relative to a national health care system? I note that, for example, ascertainment bias in the UK Biobank appears to be more severe than in All of Us, so maybe it is not always bad that they were behind the curve in terms of data unification. Do you see this as pros and cons, or just pure cons?

So I really think that some of the studies in COVID particularly have come out of the VA, out of Kaiser Permanente, out of large data systems where they have harmonized data. I do like the fact that the UK Biobank is a special example of half a million people, but if you think about Public Health England and NHS data, they were able to do this. Even Israel: Israel's data actually come through Clalit Health Services, which is a large health care provider with very good coverage, and they are extremely favorable to statistical research and scientific research and producing things in real time. What I saw in the US: yes, right now there are a lot of efforts, for example the recovery trial and also different consortia studying long COVID, but it seems to me we can never get our act together in real time. Even with the variants: with the Delta variant, I am amazed that although it originated in India, the sequencing data and the vaccination data and the clinical outcome data were not linked, so the initial papers all came out of the UK. And this mapping, and also the seroprevalence work, where they did random sampling of the whole country, is possible partly because of the geography of the UK as well, and because of the NHS. I do think that I would rather have a national integrated data system where people are willing to partner with scientists and the data are not just protected data for the government.

You mentioned the VA. Do you see that as the closest we are ever going to have to a national health system in this country?
But it is a very selective population, right? They have done a lot of work; the VA studies on COVID vaccine effectiveness have been very carefully done, with lots of controls and careful selection of contemporaneous controls as well as historical controls, test-negative designs, various variants of that. They could do it because they have the system, but it is a selective population. So we need better data on people who are invisible to us. If we train our algorithms on data from visible people, we are always going to cater to visible people.

Can you hear me? Yes, I can hear you. Very interesting topic in general, a very fascinating and pertinent one that I have been interested in for a number of years. So, a two-part question. When you were drawing out those key slides, what about the causal mechanisms? A lot has been written about taking a biased sample and trying to blend these together. Have you looked at it from a causal perspective, in terms of some things you can get an estimate of from the data and some things where you really need that knowledge, writing out the graph of the causes?

So that is what we are working on right now. This work has nothing to do with causality; it is an observational association model, we did not do a causal treatment effect. But what we are trying to do now, particularly with vaccine effectiveness studies, is to emphasize causal inference. We have done propensity scores, and there is a model-based approach, but design is also very, very important, as we have seen in many works, so we are trying to bring that design-based causal inference flavor into this work. I think you are absolutely right; that is the next step. There will be lots of other things in the DAG, in terms of collider bias, the selection variables and the treatment variables, and we have to think about that. This framework sort of points to that, because Z is our variable of interest, and what you can do honestly depends on how Z is correlated with W or X, the misclassification covariates and the selection covariates, and what kind of conditional independence you can assume. You can decouple some of the biases under those conditional independence assumptions, and in some cases they will still be entangled, and then you cannot identify and decouple them. This points exactly to the missing data and causal inference literature; the map is exactly parallel.

Yes, and a follow-up on that. It is interesting that you picked genetics as an example. Does that give you any special ability, with genetics as an instrumental variable, since nothing causes the genetics, although we could talk about population biases and sampling biases being present in different countries? I am curious: does that open up any doors that you are going to leverage, that you would not be able to leverage otherwise?

To be honest, the genetics example started because Gonçalo believed that we would be able to learn the odds ratios and I believed that we could not. So it started almost as an intellectual bet: no, this cannot be mimicking population-based U.S.
studies. And I was proved wrong in some cases, but it is not always a blanket truth; that was my point. We have to learn under which assumptions you can recover these estimates. So that is how it started, because the core elements of MGI were electronic health records and genetics. But now we see the advantage of having that, right? Because of the germline susceptibility: we can use a polygenic risk score as a confounder or covariate because we have automatically stored it in our database. Normally, if we did not have that as a covariate, we would not use it, but because we have done all of this work, germline susceptibility can figure in our models. But you are very right that there are not too many confounders of that genetic odds ratio, and that is why you can retrieve it. Similarly, when I decide to join a study, or have surgery, that is usually not related to the value of a single nucleotide polymorphism on my genome; that is why the association between Z and W is very weak, and that is why you can retrieve it. But if the associations were stronger, if we were looking at, say, a gene-environment association and so on, then you are not going to be able to retrieve it. It depends on which polymorphism or which polygenic risk score you are looking at and what your selection mechanism is. In this case they were very weakly correlated, and that is why you could retrieve it. Thank you.

Thank you for that question. Do we have time for one last question, from Ian Smith? We are running short here, but it is a reasonably good one, about balancing the cost of high-quality data and sampling: essentially, where is the exploration-exploitation tradeoff, and how do you determine it? We all know that garbage data produces garbage output, but we also know that no data produces no output, so somewhere in between there is a tradeoff. Do you have any work upcoming on how to determine that tradeoff and the optimal balance point for it in these types of studies?

Yeah, so this is a great question, and I will point you to Xiao-Li Meng's paper where he actually tries to compare the information content of a random sample of size one hundred versus a biased sample of size one million. How do you really compare them? He has a very nice paradigm for studying it, and it all depends, if you have a self-selected sample, on the contamination factor between what you are trying to study and your selection mechanism. But I believe, as in the figure that I showed you with the narrow credible or confidence interval, that if the bias is not that much, obviously there is information; I do want to use the EHR data. I learned a lot during COVID; there was method in this madness. When you look at PheWAS plots, when you are able to design studies carefully, you are actually able to learn something. And one thing I learned is that maybe the absolute numbers, the prevalences and things like that, are absolutely wrong, but there is something relative that you can still learn. So this is something we are writing a piece on, really building on Xiao-Li's work, because his work is not in a regression-based model: when has the information been so distorted, when is the selection bias so severe, that the data are useless? That paradigm of comparing information content I think is very useful. And he actually shows, in a recent Nature paper, that the Facebook
Delphi survey gave such an overestimate of vaccination in the United States, and how, when you look at a carefully done sample survey of much smaller size, you can retrieve that information better. So I think we need good-quality data in order to even understand our mistakes, even if it is not at that large scale.

Thank you. So we are out of time. Thanks again for a wonderful and detailed talk and some stimulating discussion. I guess everybody give a warm adios to Dr. Mukherjee, because it is time for lunch. Thank you. Thank you so much. Thank you.