Hello, I'm Joseph Rickert, a director at the R Consortium. Welcome to the COVID-19 Data Forum and today's webinar, Beyond Case Counts: Making COVID-19 Clinical Data Available and Useful. The COVID-19 Data Forum, a collaboration between the R Consortium and the Stanford Data Science Institute, is an ongoing series of webinars and discussions that bring together experts from multiple disciplines who are working to collect, curate, and share the data needed to drive scientific research and formulate an effective public health response to the pandemic. Today's webinar will focus on the high-dimensional, patient-level data required to understand the causes of the disease, its transmission pathways, and patient treatment. Our speakers are a diverse group from both academia and industry who we believe have direct knowledge of the relevant issues. We will hear presentations from Dr. Jenna Reps of Janssen Pharmaceuticals and OHDSI, the Observational Health Data Sciences and Informatics program; Dr. Andrea Ganna of the COVID-19 Host Genetics Initiative; Dr. Ken Massey of Saama Technologies and the EndPandemic National Data Consortium; and Dr. Roni Rosenfeld of Carnegie Mellon University and the Delphi COVID-19 Response Team. Each speaker will have approximately 15 minutes for their presentation. Immediately after the last presentation, our speakers will participate in an open discussion, during which they will take questions from our virtual audience. Please use the Zoom Q&A tab to submit a question. Our moderator today will be Dr. Sherri Rose, Associate Professor of Health Policy at Stanford University and the Freeman Spogli Institute for International Studies. Dr. Rose is a statistician who works in public health, epidemiology, computational statistics, causal inference, and machine learning. She is well known as a co-founder of the Health Policy Data Science Lab at Harvard Medical School. 
She is also known for the two books she co-authored on targeted learning, for her publications and short courses, and for her active participation in the R community. Please welcome Dr. Rose. Thank you, Joe. I'm delighted to be here today to moderate and introduce our first speaker. Jenna Reps is Manager of Epidemiology Analytics at Janssen Research and Development. She is a senior epidemiology informaticist at Janssen, where she is focusing on developing novel solutions for personalized risk prediction. Jenna's areas of expertise include applying machine learning and data mining techniques to develop solutions for various healthcare problems. She is currently working within the OHDSI Patient-Level Prediction work group, with the aim of developing open-source, user-friendly software for risk models. Prior to joining Janssen Research and Development, Jenna was a senior research fellow at the University of Nottingham, where she developed supervised learning techniques to signal adverse drug reactions using UK primary care data and acted as a data consultant to other researchers within the university. Jenna received her BSc in Mathematics and MSc in Mathematical Biology at the University of Bath and her PhD in Computer Science at the University of Nottingham. Welcome, Jenna. Hi. Hi there. So I'm going to be talking about OHDSI and the approach we have: a collaboration with lots of data across a network, and how we've been using it to gain access to data for COVID-19 and answer lots of research questions. So firstly, what is OHDSI? I know some of you won't be aware of it. OHDSI, the Observational Health Data Sciences and Informatics collaborative, is an interdisciplinary collaboration: it's got a range of researchers with lots of different expertise and lots of different backgrounds, all working together to come up with best practices for extracting information from the big observational healthcare data sets that we have. 
It is a large collaboration that spans the world, and you'll see later on a map where you can see its scale. The central coordinating organization is Columbia University, but as you'll see, anyone in the collaboration can ask questions and get them answered through the network. OHDSI's mission is to improve health by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care. What this is saying is that we're effectively using the expertise that everyone in the collaboration has to come up with best practices and answer important clinical questions that can help us gain insights to improve healthcare. Everything in OHDSI works because we have lots of standardizations, and the main standardization that enables this large collaboration is the OMOP common data model. This is a data structure that any observational healthcare data set can be mapped to: whether it's electronic health records, claims data, or clinical data, it can all be mapped to this format. It's evolving over time because there's a work group advancing it. We're on version 5.3 at present, but it has evolved and it's still improving, so if there are things that people want added, they can get involved and get them added into the data model. The reason this is a big advantage: imagine there are three collaborators. One collaborator has electronic health records, and that has a bunch of tables. Another collaborator may have administrative claims records, and that will have a very different structure from the electronic health records. And a third collaborator may have clinical data, and again, this is going to have a very different structure from the other two databases. So if we were to do an analysis and collaborate, we'd have to write custom code for each of these data sources. 
But if we add the standardization of mapping to the common data model, we now have three different data sets that all have the same structure. This means you only have to write the analysis code once, and you can share it across the whole network and get multiple answers. We tend to use R as our front end: we do open-source analysis using R, write an R function or a script or a package, and get people to run it on their data set. Then, rather than getting one answer from one data set, we get multiple answers, and we can look for things like consistency. We can do external validation at scale, so we can really scale up the evidence. The OHDSI way: I mentioned it's open source and we have a lot of collaborators. Just to put this into scale, this is a map of the world, and each of these red dots is a collaborator in the network. We're missing part of Asia, but you'll see that on the next slide. There are red dots everywhere, and all of these people are either experts in methods development, people who are writing R code, people with data, or clinicians who have the important questions that we need to answer. And they're all working together to solve healthcare problems. Now let's get back to COVID-19. Why is this useful? Well, because we have this standardized data structure, we have lots of data sets that have COVID-19 data. So rather than just having one site with a small amount of data, we have 16 databases, and they span the US, Europe, and Asia. We have worldwide data: 4.5 million patients that have been tested for COVID-19; 1.2 million patients that have been diagnosed or tested positive; 380,000 patients with confirmed positive laboratory tests; 249,000 patients that have been hospitalized with a COVID diagnosis or test. So across the network, we actually have very large data. The OHDSI way: we obviously have the large numbers you saw on the last slide, and we have lots of diversity. 
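The write-once, run-everywhere idea can be sketched in miniature. This is a hypothetical illustration, not OHDSI's actual tooling (the real stack is the HADES R packages): two toy "sites" hold invented rows in an OMOP-style `condition_occurrence` table, and the identical query runs against both because the schema is shared. The concept ID and all rows below are assumptions for illustration only.

```python
import sqlite3

# Two toy "sites" whose invented rows live in an OMOP CDM-style
# condition_occurrence table (table and column names follow the CDM;
# the rows and the concept ID below are illustrative assumptions).
def make_site(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE condition_occurrence "
                 "(person_id INTEGER, condition_concept_id INTEGER)")
    conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?)", rows)
    return conn

COVID_CONCEPT_ID = 37311061  # assumed concept ID, for illustration

site_a = make_site([(1, COVID_CONCEPT_ID), (2, COVID_CONCEPT_ID), (3, 201826)])
site_b = make_site([(10, COVID_CONCEPT_ID), (11, 201826)])

# The identical analysis query runs unchanged at every site because
# the schema is shared; only an aggregate count comes back.
QUERY = ("SELECT COUNT(DISTINCT person_id) FROM condition_occurrence "
         "WHERE condition_concept_id = ?")

counts = {name: conn.execute(QUERY, (COVID_CONCEPT_ID,)).fetchone()[0]
          for name, conn in [("site_a", site_a), ("site_b", site_b)]}
print(counts)
```

The point of the sketch is that the analysis code contains no site-specific logic at all; once a data set is mapped to the common model, the same script works everywhere.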
We don't just have one single site; we have sites all over the world, so we can look at insights from around the world rather than just one hospital in one country. But we can't share patient-level data: you can't just ask someone to give you their patient-level data, because of privacy concerns. And this is where the standardizations come into effect, because what we do is visit the data. I can write analysis code knowing that everyone has their data in the OMOP CDM. I can write the script that will extract the data, do whatever analysis I want, and then return some aggregate information, whether it's a characterization of what the COVID patients look like, or developing a prediction model: can we predict severe outcomes? I can develop the model on their data, and then all I ask them to return is the model; I never touch their data. They just run the code that I've written, and anyone can write the code and get their studies done. So effectively we share population-level data by just visiting the data sites, and this is how it works. We have a hub-and-spoke network approach: one person leads the study from some center, and they're the one who comes up with a research question and gets the code written, whether by collaborating with people or writing it themselves, and then they ask people in the network to run their study and return that population-level aggregate data. As an example, you might have a network that looks something like this: some sites have multiple databases, and some sites don't have databases but will still be involved in various parts of the collaboration, and anyone in this network can lead a study. So Columbia may say, I want to see what the COVID-19 patients look like; they may write a study that answers that question, and anyone with suitable data will run that study. But anyone can lead a study in the network. So there may be another question. 
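The aggregate-only exchange described here can be mimicked in a few lines. This is a simplified sketch with invented numbers, not output from any real OHDSI study: each site returns only counts, and the coordinating center pools them and compares sites for consistency.

```python
# Hypothetical aggregate results returned by three sites (invented
# numbers): only counts leave each site, never patient-level rows.
site_results = {
    "site_a": {"cases": 120, "total": 10_000},
    "site_b": {"cases": 45, "total": 3_000},
    "site_c": {"cases": 900, "total": 60_000},
}

# The study lead can compare sites for consistency...
per_site_prevalence = {site: r["cases"] / r["total"]
                       for site, r in site_results.items()}

# ...and pool the aggregates into a network-wide estimate.
pooled_prevalence = (sum(r["cases"] for r in site_results.values())
                     / sum(r["total"] for r in site_results.values()))
```

Nothing identifiable crosses a site boundary; the coordinator only ever sees the numbers in `site_results`.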
I want to develop a prediction model for COVID-19; the researchers will develop the study, and then anyone who has suitable data can run it. So anyone can lead a study. If you want to lead a study through OHDSI and get access to this data, we encourage people to use the OHDSI studies repository. I've got a screenshot here: it's just a GitHub repo where you put the R package that will run your study. We have a summary Shiny app on a website, data.ohdsi.org, that gives you information about all the studies people are currently doing, and you can click through and it'll tell you things like the protocol, how they're progressing, and what they're actually trying to do. And then you can create your own studies and add to that. If you want to do a study, the first thing is to write a protocol. We believe that you should always write your protocol up front and get comments, so we recommend that you get it reviewed by the community. The community is full of experts from different backgrounds, so it's always great to get their perspective on what you've written and get some feedback. Once you've done that, you can post the final protocol onto GitHub; we have the GitHub repo that anyone can upload to, and we can give you permission for that. And the good thing about the final protocol is that it can be used for IRB approval. Although you're not gaining access to the patient-level data, you're still using it in a study, and it still needs IRB approval; but if you've written a good protocol, people can use that as a starting point. Then you develop the study code. This would generally be an R package, and you put that onto GitHub as well with your protocol; it should do exactly what you specified in your protocol. And then it's good to get it tested on some sites. We have standardizations, but everyone has different computational environments. 
People might have different database management systems; they might have different operating systems. You want to make sure that your code works everywhere, not just in your environment. We have R packages that help you run on different environments, which I'll summarize a little bit later. So we have support for this, but you still want to test it and make sure it will work across everyone's environments. And then you just invite sites to join: they'll run your study on their data and give you the aggregate results. We have three main areas of interest. Clinical characterization: this is the descriptive observations. Patient-level prediction, which is what I focus on a lot: developing prediction models, for example, of one's risk of severe COVID-19 outcomes. And then population-level effect estimation: this is causal inference, for example, what is the safety of the various drugs being used for COVID-19? We have R libraries that support all of this. The main ones are for population-level estimation and patient-level prediction; these are packages that have been developed to answer those questions, and they go end to end, from extracting the data to giving you your answers or your trained model and its evaluation. But we also have a lot of supporting packages: things that enable you to connect to the database, or translate the SQL between database management systems, or a logger to tell you how you've progressed. This is all at ohdsi.github.io/Hades; HADES is what we've named this collection of R packages. So just to give you an example now of the power of OHDSI and how it's been used: at the end of March, so pretty early in the pandemic, we got together as a collaboration and decided we wanted to answer some research questions. 
We asked people to suggest questions and we'd start focusing on them. The first one that people, unsurprisingly, came up with is: what do severe COVID-19 patients look like? What common comorbidities do they have? And how do they compare to severe influenza patients? So we did a study where we looked at patients hospitalized with influenza and patients hospitalized with COVID-19, and how they compare. This was written up, and I believe it has now been accepted in a pretty good journal. And this has led us to scale up this characterization: now we're actually looking at lots of subpopulations of COVID-19 to see how they look. If you want more information, that's at ohdsi.org/covid-19-updates; you should be able to get these links when the slides are shared afterwards. Then causal inference, which covers drug safety. There was big hype about hydroxychloroquine and whether it is good for COVID-19 or not. We decided to look at the safety of this drug, so we looked at rheumatoid arthritis populations in our data, and we found that alone it was generally okay, but combined with another drug it actually doubled the risk of 30-day cardiovascular mortality. And we were able to do this study across lots of different data sets and look for consistency. This wasn't just one site with a small amount of data that lacked power; we had this across the OHDSI network, with lots of different data sets, and we were able to show that. This is the power of OHDSI. We then got asked to investigate the psychiatric safety after we wrote the original paper, which led to another study; we used OHDSI, answered that question, and that paper has also been written and published. But we're not stopping there: we're looking at lots of different drugs for COVID-19, and we're able to do that with OHDSI because we have data sets all over the world, very diverse. 
And although some single sites might be small, collectively we have pretty big data. Prediction is the area I tend to focus on more. We had a Korean doctor who was finding, earlier in the year, that people were dying at home with COVID-19. He wanted to come up with a model that could identify the people at high risk of death, so that they could be prioritized and gotten into hospital. But in March, we didn't have enough COVID-19-specific data to do good machine learning. However, we did have large amounts of influenza data, and this doctor hypothesized that the people he saw having severe outcomes and dying from COVID-19 might be similar to the people who have severe outcomes and die from influenza. So he said: can we use the large amounts of influenza data we have in the OHDSI network to develop a model, and then validate it on the COVID-19 patients in the network, to see whether it transports? The COVID-19 data sets we had tended to be too small to develop a model, but they were actually pretty big for validation. We validated across the network, and the model actually performed incredibly well in terms of discrimination when we transported it to COVID patients. And the calibration was surprisingly reasonable as well; we were expecting calibration to be a lot worse, but it was actually pretty reasonable. There's a risk calculator online; it's a Shiny app. We tend to put our results out as Shiny apps, and we've also got this under review. So if you're interested in contributing, if you have data, if you have methods, if you have research questions, we're very open; we're always happy for other people to join, and I've got some information about how you can contribute in a bit. So in conclusion: COVID-19 is novel, which means there wasn't much data around because it was so new, and a lot of the time one site would have only a small amount of data. 
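As a toy illustration of the two checks mentioned here, discrimination and calibration, this is a from-scratch sketch on a small invented validation set; it is not the actual OHDSI patient-level prediction pipeline, which lives in the HADES R packages. Discrimination is summarized by the AUC (the probability that a severe case is ranked above a non-severe one), and calibration-in-the-large compares mean predicted risk to the observed event rate.

```python
# Invented validation set: risk scores from a model trained elsewhere,
# and observed labels (1 = severe outcome). Not real patient data.
preds = [0.9, 0.8, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

def auc(preds, labels):
    """AUC as the rank statistic: fraction of (case, non-case) pairs
    where the case gets the higher score (ties count half)."""
    pos = [p for p, y in zip(preds, labels) if y == 1]
    neg = [p for p, y in zip(preds, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Calibration-in-the-large: mean predicted risk vs observed event rate.
mean_pred = sum(preds) / len(preds)
obs_rate = sum(labels) / len(labels)
```

A model can transport with good discrimination (ranking) but poor calibration (absolute risk), which is why the talk treats the two as separate findings.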
That limited the studies you could do. But because OHDSI has the standardizations, we were able to run studies across the whole network, spanning the world, and get new insights a lot earlier than people who just had one small data set. We're an open collaboration, and if you want to join, there are plenty of ways to get involved. We have a forum at forums.ohdsi.org. This is just a place where you can chat to people: discuss methods, discuss research questions, discuss how you can map your data; whatever your question, you can go on the forum and chat to people. There's a section where you can introduce yourself and make a little emoji of yourself. GitHub: if you want to be involved in the software development, the tools, and the best practices, the OHDSI GitHub organization has all the packages and all the libraries that you saw. We also have a weekly call; the details are in the general forum, which will give you a link to it. It's always a fun call to join because we have such a diverse set of collaborators and the topics range very widely. And then the main thing is The Book of OHDSI. This came out of a collaboration that took place a year ago: we got all the people together at a face-to-face meeting to start writing this book, and it has information about everything in OHDSI. It will tell you about the common data model; how to run a network study; all of our methods development and the best practices we've been learning through the network; how to do the mapping, if you have data that you want to map; and how to assess data quality. Anything you can think of is in there. 
So it's a great starting point if you want to get involved in the collaboration. Thank you for listening. Thank you, Jenna. I wanted to ask one brief question before we move on to our next speaker. There's been a lot of interest, for example from students, in getting involved and contributing to COVID-19 research. Could you expand on your comments regarding anyone getting involved in proposing a project? Do researchers need to have a particular type of affiliation or title, or be at an institution with an IRB? So OHDSI is very open. If you go on the OHDSI forum and post that you've got a research question, people will start responding to you, and the first thing will be to start writing the protocol. Once you've written a protocol, if you have the experience to write the code, you can write the code; if you don't, you can try to get someone else in the collaboration to help you write it. But I don't think anyone looks at affiliations: if you have a good question, we're happy to work on it. Fantastic. Thank you again. I'd like to introduce our next speaker. Andrea Ganna is an EMBL group leader at FIMM, the Institute for Molecular Medicine Finland, and an instructor at Harvard Medical School and Massachusetts General Hospital. Previously, he did his postdoc at the Analytic and Translational Genetics Unit at Massachusetts General Hospital, Harvard Medical School, and the Broad Institute, and his PhD at the Karolinska Institutet. His research interests lie at the intersection of epidemiology, genetics, and statistics. Andrea has authored and co-authored both methodological and applied papers focused on leveraging large-scale epidemiological data sets to identify novel sociodemographic, metabolic, and genetic markers of common complex diseases. He has been working with large-scale exome and genome sequencing data, focusing on ultra-rare variants in coding and non-coding regions. 
His research vision is to integrate genetic data with information from electronic health records and national health registries to enhance early detection of common diseases and public health interventions. He is a coordinator of the COVID-19 Host Genetics Initiative phenotype steering committee. Welcome, Andrea. Thank you for a very extensive introduction. So I'm going to present a little bit about the COVID-19 Host Genetics Initiative. There is a nice website where you can go and check many things about us, and there are a couple of Twitter accounts that you can also use to follow our work. We started this initiative in March 2020, and when we started, it was clear from the beginning that we had to create an environment where people could come together and study host genetics. We call it host genetics because it is the study of the genome of the patient, the people affected with COVID-19, not of the virus. And when we started this initiative, the idea was, rather than doing the typical consortium focused just on publication, to create something that was more an environment for people to share resources. So the main goal was to create this environment, to organize analytical activities, and to create a platform to make the results of these activities immediately available to the broader scientific community. Now, when we started this initiative, we immediately thought about what potential principles of collaboration could make scientists from different backgrounds work together. For us, a very important point was the idea of promoting early-career researchers, but also not creating a monolithic initiative with too many rules: really leaving people the ability to interact with each other, come up with new research ideas, and move outside the initiative if that was required. We really didn't want to inhibit the work of any of those studies. 
The other important point is that in the human genetics community, we have a lot of close connections because of the way we work: we have been doing genome-wide association studies in large consortia for the past 10 to 15 years, so there is quite a strong network of human geneticists around the world. But with COVID-19, we noticed there were new people coming in, especially from underrepresented countries. And I think the underrepresentation of these countries in human genetics is a large problem; it's a very Eurocentric type of science that we're doing right now. There were many new studies coming up, especially led by clinicians from different institutions. So we structured this initiative so that everyone who registers online reports information about their study: what they were planning to study, how they were planning to do it, and how many participants they are bringing together. That has allowed us to create a database of studies that are interested in studying the host genetics of COVID-19, and currently we have more than 200 studies across the world; I will describe this later. We have a nice browser on the website where you can navigate these different studies, look at the research questions, and look at the investigators. You will notice that some dots have a different color: the green dots are the ones where the study is already sharing data. But you can also directly contact each of the different studies that registered with the initiative and propose a potential collaboration. Now, that is one way; the other way is a Slack channel we created, where we invite everyone to work together. Now, the problem, and we'd like to discuss it if you have better ideas on how to deal with this, is that we had a main channel with 1,200 people, which is not very practical for a good discussion. 
So we have been slowly redirecting people to a lot of sub-channels, each of them trying to study specific aspects of the host genetics of COVID-19. To give a little overview of the numbers behind this initiative: we currently have around 1,200 members, and the website is really the important part, because that is where we put all our results. We have had more than 80,000 unique users and almost 300,000 page views, and we have studies from 51 countries, which is very important: we really aim to cover studies across the world, not just Europe and the US, which have traditionally been the places where most human genetics research has been done. A minority of these studies have actually contributed data so far: around 19 studies. This number is clearly growing. You will see that the United States is actually quite underrepresented right now, and that's because it clearly takes time to extract DNA, genotype samples, and so on; we're seeing a delayed wave in this direction following the pandemic. I'm pleased to see that it's not only European countries: we already have data from Qatar, from Korea, and from Brazil. Now, what we actually share are the results of the genome-wide association studies we run, but we also make metadata available. These are resources we collected in the beginning, and they were extremely important for helping other studies gear up, get IRB approval, and so on. So we have the questionnaires used across different studies; we have consents and protocols available; we have created data dictionaries to help the different studies decide which variables to collect and how to report them. And importantly, we have this whole catalog of studies with different research questions, which is really important for understanding how scientists are thinking about the host genetics of COVID-19 and the different aspects they are studying. Now, we wanted people to share the data immediately. 
That was the goal from the beginning; we didn't want to be a closed consortium. So we have two options. The first option is to share individual-level data. The second option is to share summary statistics, something similar to what OHDSI was mentioning, which is sharing aggregated data. Now, we really wanted everyone to share individual-level data, but we have actually not been very successful in this part, and we can discuss later why; we've been much more successful in sharing summary statistics. For the individual-level data, we have been partnering with two institutions: in Europe, the EBI's European Genome-phenome Archive, and in the US, the NIH/NHGRI AnVIL portal. For the European part, we decided that data access decisions remain with the study PI; for the AnVIL part, data access is more similar to the dbGaP model of data access. But as I mentioned, right now most of the studies feel more comfortable sharing summary statistics or aggregated data, though we are working to motivate the rest. The other important part was the centralized phenotype definition activity. Clearly there's a lot of discussion on how you might measure severity of COVID-19, and so through a working group we have been creating different definitions, looking, for example, at severe confirmed cases versus hospitalized cases. I think the take-home message here is that we really wanted to be pragmatic and create definitions that were applicable across different studies. And similar to the OHDSI collaboration, we had to deal with many different data sources: some are electronic health records, some are data collected directly from the patients. So we clearly need to be very flexible in the phenotype definitions. Now let me talk a little bit about the science. 
We ran a genome-wide association study. I don't know how familiar you are with this; this is a Manhattan plot. Each of these dots is a variant in the genome, ordered by chromosome, from chromosome 1 to chromosome 23, which is the X chromosome, and the y-axis is the minus log10 p-value; so this peak is a 10 to the power of minus 14 p-value. As you can see, there is right now just one very strong peak, on chromosome 3, and this is the only validated signal in host genetics for COVID-19. I will show you later that this is really about COVID-19 severity, not susceptibility. This region spans several different genes; it's quite a difficult region, and you will see later why. There are some genes that we believe might drive the signal, CCR9 and CXCR6, which are chemokine receptor genes and may have a role in inflammation; but this is something that we are studying right now. Now, this is some data from Iceland, from deCODE, where they've done a great job of testing almost the entire population. They have defined different classes of severity for COVID-19, from class one, which is inpatient hospitalization, to class three, which is basically people who have been diagnosed with COVID-19 but have no symptoms or very mild symptoms. When you look at our variant on chromosome 3, you can see that if you compare people with very few symptoms to the entire population, there is no association between our variant and COVID-19, while the strongest effect, with an odds ratio around two, is seen when we compare very severe COVID-19 cases against cases with moderate or no symptoms. So this really indicates that this signal has to do with COVID-19 severity. The take-home message for interpreting these results is that people carrying this variant have around twice the risk of ending up in hospital, given that they have contracted COVID-19. 
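To make the "odds ratio around two" concrete, here is the standard 2x2 calculation on invented counts (chosen only so the ratio lands near two; these are not the study's real numbers). The odds ratio compares the odds of carrying the risk variant between severe and non-severe cases, and the standard error of its log uses the usual Woolf formula.

```python
import math

# Invented 2x2 table (NOT real study data): rows = severe vs non-severe
# COVID-19 cases, columns = carriers vs non-carriers of the chromosome 3
# risk variant. Counts chosen so the odds ratio comes out near two.
severe_carrier, severe_noncarrier = 300, 700
nonsevere_carrier, nonsevere_noncarrier = 180, 820

# Cross-product odds ratio.
odds_ratio = (severe_carrier * nonsevere_noncarrier) / \
             (severe_noncarrier * nonsevere_carrier)

# Woolf standard error of log(OR), giving an approximate 95% CI.
se = math.sqrt(1 / severe_carrier + 1 / severe_noncarrier +
               1 / nonsevere_carrier + 1 / nonsevere_noncarrier)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se)
```

Note that, as the talk emphasizes, an odds ratio like this is conditional on the comparison groups: the same variant shows no association when mild cases are compared against the general population.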
Now, something quite interesting about this chromosome 3 signal is that it was actually inherited through introgression from Neanderthals. You can see its frequency distribution across the globe very clearly: it's not seen in African populations, it's a little more common in Europe, and it's actually quite common in India, Bangladesh, and Pakistan. Our collaborator David van Heel has quite a large cohort of Bangladeshis from the UK, and you can see that the frequency of this variant is around 38% in hospitalized COVID-19 cases versus only 27% in the non-hospitalized cases, so there is a 10 percentage point frequency difference when we compare these two groups. Why it has this specific distribution across the globe remains a mystery; there is a sign of positive selection, so we can speculate that in some past infection this haplotype was advantageous and so was selected for. The thing I find quite amazing is that we have been working with experts in different areas of human genetics, and we told them: prepare pipelines to do your favorite in silico analysis; you have 48 hours. So we released the data, asked everyone to run their analyses, and in 48 hours we were basically able to create potentially six papers' worth of results: heritability analyses and so on. And I want to stress that all these results are made immediately available on the website. We're actually not seeking publication at this point; we're not doing any paper out of it. We're just taking the results and putting them immediately on the website. So which direction are we going? The first question is how we motivate people to share individual-level data; we have found several challenges, and we need to think very carefully about the legal framework for this. How do we sustain this beyond the emergency? And how do we foster collaborations that are not just the central analyses but also side projects? 
We're also thinking about new ways to report the results. We just don't want to do a paper every time we run a new iteration of these analyses. So how do we make these results accessible to a lay audience? We don't just want to target scientists, but also make results available to scientists in the form of a living paper. And we're really thinking in that direction. This is a huge effort, and we actually have a nice acknowledgement page that auto-generates according to who contributed to what, which is quite nice. And I want to especially thank Mark Daly, who has been co-leading the initiative with me, but this has clearly been an amazing journey with many different people around the globe. So thank you very much, everyone, for listening. Thank you, Andrea. You discussed your efforts for this initiative to be worldwide and not centered on the US and Europe. Could you say more about how you intentionally create a consortium that has leadership and participation from a diverse group of countries? Yes. In terms of leadership, we have been very flat, so there's not really been anyone stepping up and taking strong roles, and that has allowed people from underrepresented groups, and also underrepresented universities, to step up. And in terms of working with minorities, one way we have done that is to offer free genotyping to countries or studies that cannot afford it. So, for example, here in Finland we have genotyped samples from Africa and samples from South America for free, in order to motivate those investigators to be able to participate in this initiative. Thank you. Our next speaker is Ken Massey. He has had a 25-year career in the pharmaceutical industry, successfully leading and growing teams in clinical operations and global medical affairs.
As vice president of US medical affairs at Merck, Ken led overall strategic planning, governance, and management of a multidisciplinary organization consisting of field medical professionals, professional society and patient advocacy groups, continuing medical education providers, key scientific leaders, and market decision makers. He is currently chief life sciences officer at Saama Technologies and an affiliate of the End Pandemic National Data Consortium. Ken received his BS and PharmD degrees from the University of Florida in Gainesville. He completed a two-year postdoctoral fellowship in pediatric clinical pharmacology prior to joining the faculty at the University of Tennessee. Welcome, Ken. Thank you, Sherry. Appreciate it very much. And thank you to Jenna and Andrea for fantastic presentations as well. I want to make sure you can see my screen. Okay, everyone. And thank you for joining us today. Good afternoon, good morning, good evening to those of you joining us around the globe. Thank you so much for being with us. I am pleased to present our thoughts, ideas, and efforts around the End Pandemic National Data Consortium, which we formed back in March, at the beginning of the pandemic. I'm very pleased to be able to present on behalf of the team. And I'll show you in this slide the range of partners that have been working with Saama Technologies to create the range of capabilities that are necessary for us to conduct the analyses and generate the insights that I'll describe here shortly. So we're working very closely with organizations like Index AI, which is really a state-of-the-art organization in terms of the ability to aggregate and provide deep analytics around genomics, biomarkers, flow cytometry, et cetera, so obviously a very, very critical capability in the pandemic. We're working with specialty laboratory services like Caprion and others as well, who are able to bring in patient-level data and insights, along with Andaman and Clinerion as well.
So it's a large initiative, with a number of partners working together to address this challenge. And I think we would all agree that a race is on. There's an immense amount of data being generated. Last time I looked, there were north of 1,500 studies going on around this particular disease, creating these large data sets. But I think one of the challenges that we all collectively face, some of which was addressed by Jenna and Andrea, is that that data ends up siloed, and therefore insights cannot be mined and gleaned across the multiple studies. So as you can see, there are multiple effects of these silos that certainly limit the ability to rapidly advance our understanding of the disease state. You get fragmented data. Oftentimes each study will have only small numbers in particular subsets of patients, which leads to a lack of statistical power in any individual study. There's no real mechanism, outside of some of the fantastic consortiums we just heard about, for collaborative research. And certainly, as we look at the variety of studies that are ongoing, there's no question there's a tremendous amount of duplication, not only of time and energy but of funding, often looking at the same question. The real takeaway, and the real key challenge here, is that we don't have a mechanism whereby investigators can quickly leverage the insights and the scientific momentum being generated by others so that they can inform their follow-on trials. And all of that collectively, of course, has a negative impact on efficiency, time to completion, and overall risk management. So with that backdrop, the group that I mentioned earlier, largely driven by Saama Technologies and some of our advisors, set out to think about how we could collectively contribute to the variety and the barrage of activities going on here in the US, but also globally.
And so we proposed what we called the End Pandemic National Data Consortium, the two goals of which were, first, to dramatically accelerate data analysis. We wanted to accelerate analytics by integrating those disparate data sources and providing the tools and the capabilities for the community, broadly defined, to drive the analytics engine, both on the disease itself, that is, the clinical manifestations, COVID-19, and on the underlying virus, SARS-CoV-2. And, as we just saw from Andrea's work, how is that being experienced differentially across different phenotypic and genotypic signals? Our goal was to have a dramatic effect, as I said: we wanted to reduce the time it takes to find both treatments and vaccines by up to 50%. So we wanted this dramatic acceleration by aggregating and accumulating this data and making it digestible and insightful. And then, second, as a member of the ecosystem, to help advance the entire body of scientific knowledge around the virus itself, its pathology, and the unique clinical courses and presentations that exist. How could we contribute our collective capabilities to do this? That's what we call the National Data Consortium. The underlying engine of this is Saama Technologies' Clinical Data Analytics Hub. Saama is a 23-year-old company based in Silicon Valley here in the United States, with 1,000 or so colleagues scattered around the globe who are collectively focused on this challenge of taking data from disparate data sources, as you see on the bottom part of the slide here. How do you take that heterogeneous data and leverage technology stacks that can integrate it, cleanse it, and harmonize it using a variety of data rules, and then govern, standardize, and secure that data? Obviously data security is a very, very critical issue.
So as you'll see, we have created at Saama an IP-protected, very proprietary technology stack that can take data from the different types of sources used largely, of course, by the pharmaceutical industry and the CROs we work with, but can also integrate other disparate sources of unstructured data, as Jenna was addressing earlier, to include not only all the operational data around how you actually execute these programs, but the clinical data, importantly the real-world data, and then other multi-omics, et cetera. So we've created the ability, using our machine learning and artificial intelligence capabilities, to very rapidly take those disparate sources, integrate and harmonize them, and make them analytics-ready. We create, if you will, one source of truth from all of those disparate data sources. Now, we work with about 50 or so organizations: biotechs, pharmaceuticals, CROs, patient advocacy groups, et cetera. And so we have created a host of applications whereby that data is consumed, whether it's operational data, clinical patient data, risk management data, et cetera. We also have a marketplace where other organizations, like the ones I mentioned on the previous slides, can build on that platform and create an integrated capability. And the one I really want to focus on today, of course, is the work that we've done with Index AI and the other partners to bring in all of those multi-omics insights and integrate them, as I'll show you in a moment, with the various other data sources that we have, to create what we ended up calling the COVID-19 Command Center. The idea behind the Command Center was a deeply verticalized capability, leveraging these types of capabilities but focused very specifically on the COVID-19 situation itself.
So the basic framework and concept was to provide our technology stack, along with our partners', to take all of these disparate data sources, whether from sponsors doing work, the many academic centers and others, or, of course, healthcare organizations and big integrated delivery systems involved in a variety of research-related elements around the disease. There are master trials going on; Jenna and Andrea beautifully outlined some of the big consortiums underway. And then there are lots of other sources of real-world data. So we have made available our technology capabilities, this Command Center, in order to do exactly what I mentioned: take all of those unstructured and differently structured data sources, use this technology stack to normalize and organize them, and then visualize and present them back to the medical community, as well as the regulatory community, so that collectively we can build off of each other's momentum. And, as I said, the goal is simply to rapidly speed the engine of analytics by leveraging all of these data sources. This has been a very common discussion since the beginning of the disease. Certainly at the governmental level, many agencies, ex-FDA commissioners, and others have all talked about the power and the importance of being able to bring together all of these data sources. So that's really the framework and the concept behind the consortium. Now, this work is done. What I'm showing you on this page, and I know it's difficult to read, is actually a screenshot of the COVID-19 Command Center; this is the landing page, if you will. And what you see is that we've been able to take data from multiple sources, up to and including genetics-level data on about 3,000 patients now, largely from Asia.
We've been able to integrate all of that data into this single source and then parse out these different views of the data. So you can see in one snapshot all of the data around the patients themselves: their backgrounds, and the diagnostic markers that change over time for individual patients. Through the partners I mentioned earlier, we actually have eCOA and ePRO data that we've been able to integrate, and certainly all the clinical data: laboratories, summaries, advanced inflammation markers, and the actual clinical outcomes of these patients. And then, largely through the work of our partner Index AI, which, as I mentioned, has a state-of-the-art ability to integrate genomics data, flow cytometry data, proteomics data, cytokines, et cetera, it all comes into a single workbench where you can do extremely advanced analytics. What you're looking at on this one particular screen, across the bottom, which of course you can't see, is a number of the different biomarkers that I mentioned. And when you look at those at the individual patient level, what you see is two clearly disparate groups. Much as Andrea just mentioned, these turn out to be patients with severe and non-severe disease. So there's clearly a variety of markers that can be combined in order to identify, a priori, patients who are more likely to have an advanced case. They also have the ability to digitize radiology scans, and so they can bring those in as well, so that you can now look at a single view of all the patients: their clinical markers, what's happening at the genetics level, and all of the diagnostic and radiology work, all in a single view. Now, you can obviously drill down deeply into these data. So what we have is an information structure that we've created for the Command Center, and it runs across three of what we call flow patterns.
The first of these is all of the development operations work: the patients, the sites, the investigators, all of the work around how you actually execute these programs and quickly advance the ability to do so. In the middle is all of our genetics, radiology, and flow cytometry data. Here's where you see the integration of the genomics, the diagnostic markers, the radiology scans, the inflammation markers, et cetera. And then, lastly, flow three is the clinical data. This is the actual patient data, everything from laboratory data and adverse events to other clinical insights, up to and including how patients experience things from an eCOA and ePRO perspective as well. That data is then made available through three different view levels. You can drill down to an individual patient and study very deeply all of those integrated data for that specific patient. If you need and would like to do so, we certainly offer, from an analytics overview perspective, the ability to see cohorts or the overall study level. And then, really importantly, and one of the primary drivers behind the pandemic consortium, is the idea of cross-study views, where you can look at a specific type of... Two minute time check. For example, across a number of different studies. So we believe, at the end of the day, that by building this very purpose-built, deeply verticalized set of capabilities, we're able to support everything from translational sciences all the way through late-stage clinical development. And we're really agnostic to whatever sources of systems or data an individual organization, a consortium, or an academic center might have. We can help drive clinical programs themselves via operational and clinical data and visualization of that data.
We can integrate proprietary information that one organization might have with externally sourced data for advanced analytics, real-world insights, et cetera. And then, really importantly, the ability to do deep, multi-factorial analytics simultaneously across a whole host of variables, as listed here, is critical. The last element of this is that we have some really interesting capabilities around natural language processing and understanding, whereby we can sit on top of the COVID-19 Open Research Dataset and draw insights from the now north of 100,000 scholarly articles sitting in there. We've done that, for example, recently to help organizations find a cohort of investigators for some follow-on work. So I'll pause there; I appreciate the time and the opportunity, and I will answer any questions you might have. Thank you, Ken. There has been a lot of attention in both the research community and the general public regarding the interface between treatment and vaccine development and the regulatory component that you mentioned briefly. Could you speak more to that part of your pipeline from the perspective of your role? And Sherry, just to make sure I'm clear, you're talking about how we can help accelerate either treatment or vaccine work, is that what you mean? Yeah. I mean, that's really, I would say, the fundamental business model of all of the partners that I mentioned: we address various elements of how you stand up and execute trials, close them more quickly, and ultimately get them through the regulatory process to make therapies available for patients, right? At the end of the day, it is about getting new, novel, and important therapies to patients. And we actually work both in the therapeutic space, with organizations doing development work for therapeutics on our platform today, as well as in the vaccine space.
So many of those elements are similar, of course, in terms of how you stand up and execute; the outcomes are very different, right? The variables you're monitoring are quite different. But at the end of the day, the collective goal here is how we really dramatically accelerate that process to get new, novel, important therapies to patients more quickly. Thank you. Thank you. Our last speaker is Ronnie Rosenfeld. He's a professor and head of the machine learning department in computer science at Carnegie Mellon University. He also holds a courtesy appointment at the Heinz School of Public Policy at CMU and an adjunct appointment at the University of Pittsburgh School of Medicine. Rosenfeld has been teaching machine learning and statistical language modeling since 1997. His research areas include statistical language modeling, machine learning, speech recognition, and viral evolution. Ronnie's current research interests include tracking and forecasting epidemics, using speech and language technologies to aid international development, using machine learning for social good, and advancing data numeracy for all. He's a co-leader of the Delphi COVID-19 Response Team at CMU, which produces real-time COVID-19 indicators. Ronnie obtained his BSc in mathematics and physics from Tel Aviv University and his PhD in 1994 in computer science from Carnegie Mellon University. Welcome, Ronnie. Thank you very much. Can you hear me? Yes. Okay, let me share my screen. Can you see a lot of faces smiling at you? Yes, I can. Before you start, I wanted to just interject quickly. Since this is our last speaker, I want to encourage attendees to put their questions in the Q&A box for the panel discussion after Ronnie's presentation. Thank you. Thank you, Sherry. So the work, or the activities, I'll be describing are the result of a large number of people working together.
This is the Delphi research group, and this montage of pictures is already woefully out of date. It was from, I think, April, and by now we're over 40 people. So some people have not been added to the montage yet, and some of them, I believe, are actually on this call. I was asked to talk about the promise of clinical data for the fight against COVID-19, and I'll give you a little bit of our experience with it. First, a little bit about our group, the Delphi research group. We came into being in 2012, well before COVID, with the mission to develop the theory and practice of epidemic forecasting and its role in decision-making. We were a small, typical research group, maybe with an applied bent: two faculty, that's Ryan Tibshirani and myself, and somewhere between two to six students, starting in 2012 and going until 2019. The main things we did in that time were to pioneer data-driven nowcasting and forecasting techniques for epidemics. We participated in almost all of the forecasting challenges put out by different government organizations in the US, including the CDC and DARPA; in fact, we were perennial winners of the CDC's annual flu forecasting competitions. Everything we do, we put in open source, and we put out all the data that we have or that we generate, as much as is allowed by our DUAs, under very permissive terms, and this continues today. During the COVID period, we pivoted from our work on a variety of other diseases, such as flu, dengue, and norovirus, and focused exclusively on COVID. We set ourselves the goal of developing, validating, and sharing geographically detailed real-time indicators and short-term forecasts. I will focus on real-time indicators for the rest of the talk, but I just want to mention that we also work in parallel on forecasts. Our emphasis is on the geographically detailed part: we go down to the level of county and lower, and in the forecasts we go out to about four weeks.
Further, our mission for this wartime period of COVID is to support public health decision-making at all levels of government, federal, state, county, city, and even school districts, as well as industry, NGOs, fellow researchers, the press, and the public. Our primary priority is government. We've grown very, very fast; I would say we're over 50 people now and growing almost daily. Many of the people are volunteers or part-timers, and many of them are now outside CMU, including several at Stanford. This is our main site for COVID work, if you're interested in looking at it. Most of you are familiar with the severity pyramid, but bear with me; I want to superimpose our work on indicators onto the severity pyramid. So the first thing to know about this pyramid is that it is not to scale. The proportions are not what they appear here; there can be a very, very large dynamic range of proportions. How many people in the population are infected? How many of them are symptomatic? How many of those go to a doctor, attend a visit, or reach out to a medical professional, and so forth. Not only is it not to scale, but the proportions are also vastly different, as you probably know, in different age groups and with different comorbidities, and different races and ethnicities may have an impact as well. Let me just point out, I assume you can see my cursor, that this part of the pyramid is what we call the medically attended part. This is the part that comes to the attention of the healthcare community in one way or another. And within this, I'm going to focus on electronic medical records as they pertain to tracking outpatient visits, hospitalizations, and to some extent the ICU. What you see in green are the different indicators that we've developed and where they fit in this pyramid. So we have a survey that we put out with the help of Facebook: Facebook has lent us their platform.
So people who are on Facebook may see, at random in their regular feed, an invitation to take the survey, and if they take it, they go off to our site, fill out our survey, and we get the results. Facebook does not get the results of the survey, although we do share aggregate statistics with them as well as with the public. This survey is used to track behavior and attitudes as well as symptoms. Right now we're primarily looking at the symptoms, but there's a lot of work left to be done to look at the rest. In addition, at the level of the population, we track mobility, courtesy of SafeGraph and Google. Courtesy of Google, we also have online search signals that track the percentage of symptomaticity in the population. Another important distinction we should make is the syndromic distinction. Everything from this level up can be based on something exact, either lab tests or diagnosis codes that follow a significant amount of interaction; but at the level of outpatient visits, and for people even before they go to their visit, we are working from symptoms. We have to deal with a syndromic picture, namely a constellation of symptoms. Now, this is where things get a little bit messy, separating the COVID positives from the rest, and this is what this arrow is about: the specificity of the signal goes up as you move up the pyramid. The other important thing to note is that you can think of these indicators as either leading or lagging for the purpose of forecasting; it's very important to draw this distinction. Clearly, symptomaticity is a leading indicator relative to hospitalization and death. But even before you get to symptomaticity, the public's behavior, such as mobility, mask wearing, and so forth, can be thought of as a leading indicator of what may come ahead. Having said that, let me now focus on the squares.
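Whether an indicator leads or lags a target series can be checked by correlating it against the target at different time shifts. A minimal sketch with synthetic data (the series here are invented for illustration, not Delphi signals):

```python
def lagged_corr(leading, target, lag):
    """Pearson correlation between leading[t] and target[t + lag]."""
    x = leading[: len(leading) - lag]
    y = target[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Synthetic example: hospitalizations exactly mirror symptoms two steps later,
# so symptoms are a leading indicator with a lead of two time steps.
symptoms = [1, 3, 7, 12, 9, 5, 3, 2]
hospital = [0, 0, 1, 3, 7, 12, 9, 5]

print(round(lagged_corr(symptoms, hospital, 2), 3))  # 1.0
```

In practice the correlation peaks at some lag rather than reaching 1.0, and the lag at the peak estimates how far ahead the indicator looks.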
Within these specific data sources, and in general this list applies to all indicators, not only clinical data, you want to look at different dimensions of the data. Some of them are well known. Geographic scope and resolution: what part of the world are you covering, and at what level of country, state, city, county, and so forth? Temporal scope and resolution are also pretty clear; temporal scope means how far back you have data for building statistical models, which is crucial. Demographic scope and resolution has to do with whether you are covering only adults or also children, whether you are covering specific ethnic groups, and of course comorbidities as well; and if you cover everybody, do you have a breakdown by demographics and comorbidities? When you collect data, it is important to make distinctions between incidence, prevalence, and attack rate. One of the most frustrating things for us now, having to deal with hospitalization data coming out of official reports, is that they report prevalence, namely the number of people who are hospitalized in a particular location on a particular day, whereas what interests us for forecasting is incidence, namely how many people were admitted that day in that location. And the two are not convertible to one another if you don't know the number of people who were discharged or died that day. So you can have prevalence data without being able to derive reliable incidence data, and vice versa. This is an important distinction to make. You also need to think about coverage: whether your data is available on a per-population basis, a per-timeline basis, and so forth. What I want to focus on are the last three dimensions. The first is specificity, which I already touched on: is the data at the level of cleanliness or clarity implied by a lab-confirmed diagnosis, or are you basing it on a syndromic picture? And if you're basing it on a syndromic picture, how strong is your syndromic signature, and how specific and selective is it?
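The incidence-versus-prevalence point follows from a simple census accounting identity; a sketch with hypothetical numbers shows why prevalence alone underdetermines incidence:

```python
# Daily hospital census (prevalence) evolves as:
#   prevalence[t] = prevalence[t-1] + admissions[t] - discharges_or_deaths[t]
# Without the discharge/death term, admissions (incidence) cannot be recovered.

admissions = [10, 12, 8, 15]   # incidence: new patients admitted each day
discharges = [5, 9, 11, 7]     # discharges plus deaths each day (often unreported)

prevalence = []
census = 100                   # patients hospitalized at the start
for adm, dis in zip(admissions, discharges):
    census += adm - dis
    prevalence.append(census)

print(prevalence)  # [105, 108, 105, 113]

# The same prevalence series also arises from admissions [20, 12, 8, 15] with
# discharges [15, 9, 11, 7]: two different incidence curves, one census curve.
```

This is exactly the reporting gap described in the talk: official feeds publish the census line, while forecasting models need the admissions line.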
One of the biggest issues in using a syndromic picture is reporting heterogeneity: the fact that different doctors, different hospitals, different parts of the country, and different times during the pandemic may use the codes differently, and then you need to work your way through that. I'll show you some examples. And then perhaps the least noticed and maybe the most crucial element of an epidemiological data stream is its latency and its data revisions, which go by the colloquial term backfill. So I'm going to talk about and give examples of these characteristics of clinical databases and insurance claims, which form the basis for outpatient, inpatient, and lab test data. So what is a COVID-19 patient? Well, it turns out there is no clear, uniform, agreed-upon operational definition. There is a consortium, the National COVID Cohort Collaborative, that tries to standardize this by creating a phenotype definition, and we can go there if desired during the Q&A and look at their current definition. It's an attempt to create a unified definition of COVID positive, COVID negative, COVID presumed, and COVID probable, but it relies heavily on lab test results. Lab test results are typically available in clinical databases, but often only after some delay after the test was performed. So for real-time systems that need or want to have data up to the day, it's very frustrating. You get a patient in, you get some preliminary ICD codes, you might even get an indication that a test was ordered, but you don't have the test result yet. That could be delayed any number of days. If you base your signal on claims data, which is what we have access to right now, the situation is even worse, because claims data does not contain, and will never contain, test results, since they are not necessary for processing the claim. You may have indicators that a test was performed, but not the result of the test.
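When claims carry no lab results, a case definition reduces to a rule over diagnosis codes. A minimal sketch of such a rule; the code list and the two-code threshold are illustrative choices, not the N3C phenotype or Delphi's actual definition:

```python
# Illustrative ICD-10-CM codes seen in COVID-19 syndromic definitions.
# U07.1 is the confirmed-COVID-19 code; the others are supporting
# respiratory/symptom codes whose usage varies by site and over time.
CONFIRMED = {"U07.1"}
SUGGESTIVE = {"J12.89", "J22", "R05", "R06.02", "B97.29"}

def classify(claim_codes):
    """Classify a claim as 'confirmed', 'suspected', or 'other' from its ICD codes."""
    codes = set(claim_codes)
    if codes & CONFIRMED:
        return "confirmed"
    if len(codes & SUGGESTIVE) >= 2:  # require two supporting codes for specificity
        return "suspected"
    return "other"

print(classify(["U07.1", "R05"]))   # confirmed
print(classify(["R05", "R06.02"]))  # suspected
print(classify(["J22"]))            # other
```

The threshold in the `suspected` branch is exactly where the coding-heterogeneity problem bites: a rule tuned on one county's coding habits can behave very differently in another.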
So, absent the test result, you need to base your definition of what it means to be a COVID patient on diagnostic codes, which are coded in the ICD-10 system, or perhaps on CPT codes, which are procedure codes. Here's an example of some of the ICD codes that are relevant to a syndromic definition of COVID-19. Thank you, let me move on. One of the problems with these is that different ICD codes are used differently in different parts of the country. These are two counties, one in Arizona and one in New York, over the same period of time, from January, pre-COVID but already with awareness of COVID coming, to April, and you can see that the decisions on how to code patients are very different between the two, including some ICD codes that are used heavily in one but not in the other. Another important issue is latency. This is what we call the data revision triangle, where epidemiological time is along the x-axis and reporting time is along the y-axis. When you first report on a particular date, you have a preliminary report; a day or a week later, you revise that report and provide a report for the following day; and maybe a week later you revise again, and so forth. So the most up-to-date reports are here, but their quality is not the same: a report issued just yesterday is not as good as a report issued a week before. Here's a demonstration of how important that effect is. This is data going back to January, but with reports from different data drops in April. You can see that the data drop from April 8th contains almost no reports from the last four or five days; what comes two days later fills in some of it, and what comes much later fills in more of it. So backfill is an important problem to deal with. We try to deal with it statistically, but there's still a lot to be done.
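The revision triangle can be represented as an "as-of" structure: each issue (row) holds what was known about each epidemiological date (column) at the time of reporting. A sketch with hypothetical counts, showing how incomplete the most recent dates are at first issue:

```python
# Rows: successive issues (reporting dates); columns: epidemiological dates.
# Later issues revise earlier columns upward as late reports arrive (backfill).
triangle = [
    [50, 30, 10],         # issued day 3: the last two days are badly under-reported
    [55, 45, 25, 12],     # issued day 4: earlier days revised upward
    [56, 48, 38, 30, 9],  # issued day 5: treated here as the "final" values
]

def completeness(day):
    """Fraction of the latest-known count for `day` visible in each earlier issue."""
    final = triangle[-1][day]
    return [round(row[day] / final, 2) for row in triangle if day < len(row)]

print(completeness(0))  # [0.89, 0.98, 1.0]  old dates are nearly complete at issue
print(completeness(2))  # [0.26, 0.66, 1.0]  recent dates are far from complete
```

A real-time system that trains or forecasts on the latest issue without modeling this completeness curve will systematically underestimate the most recent, most decision-relevant days.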
Here's an example of the effect of backfill on repeatedly revised estimates. I'll jump to my conclusions. EMR and claims data are very valuable but require careful attention. The biggest challenges are latency, revisions, and coding heterogeneity and drift. Thank you. Thank you, Ronnie. I wanted to follow up on this last slide and your discussion of the dimensions of epidemiological data and healthcare insurance claims. One of the other issues we face with this type of patient-level data is who isn't in the sample. Frequently this disproportionately impacts marginalized groups, including those from low-income populations, individuals with less access to healthcare, such as in rural communities, and racial and ethnic minorities. The pandemic is also disproportionately impacting these communities. So how can we make sure our studies don't prioritize majority and privileged populations? By trying very hard to get line-list data, and if we can't get line-list data, at least to get the breakdown of aggregated counts by the categories that matter to us. If you can get a breakdown by, say, ethnicity and race, then you can build separate models conditioned on them and then reconstruct estimates based on what we know about the parts of the population. But in the absence of line-list data, and in the absence of a breakdown by the important categories, we're very much flying in the dark. Thank you. I'd like to welcome all of the panelists back for the last 16 minutes of our event, where we will have a panel discussion. I want to start off with a broad question for the entire panel; anyone who wants to can chime in and answer. There was an emphasis on the panel on how many groups are rapidly building bespoke infrastructure to address critical needs during the pandemic, while we're also trying to remove redundancy in these processes. Where do you see other areas to remove redundancy? I can maybe go quickly on this.
For us, it was the phenotype definitions, the specification of what information to collect. Along the way we discovered there were other efforts, but in the beginning we noticed we were basically recreating a data dictionary that someone else had already created. So we slowly moved to existing resources, like the WHO forms. The other thing we really wish for is a legal framework: we don't know exactly, especially in Europe, how to deal with sharing the data. We wish there were some organization taking over the legal aspect and creating a framework that indicates how a researcher can share data. Because having lawyers on board is difficult, and everyone has an opinion, so it's difficult to point to what the right legal framework is here. Thank you. Ken, did you want to chime in? Yeah, I did, great question. We were actually very early in announcing the end pandemic consortium, back in very, very early March, and subsequently it felt like an avalanche of very, very similar activities. So at the end of the day, I think we had a number of very duplicative efforts and focuses of energy that really led to a sort of shattering, if you will, of any single holistic approach to leveraging our collective efforts, resources, time and energy around wrestling this down. In many ways, the idea behind creating these national consortiums was to minimize duplication and inefficiencies. But because everybody, with all the right reasons, is trying to get after this in uniquely different ways, what we ended up with was a pretty broad-ranging collection of overlapping consortiums, which led to a lot of confusion in terms of which one of these should I participate in, and can I participate in more than one? In some ways we've heard directly that the sheer range of opportunities to be involved was a little bit paralyzing.
So I do think that there was a lot of redundancy in that particular space, and I think that has pretty profoundly slowed our ability to do this in a much more holistic way. Thanks. Another area of emphasis in the presentations was trying to bring data to a wide group of researchers, and I wanted to ask any of the panelists who would like to speak to this: what challenges have you faced in making sure protected patient information stays protected, and in general, in balancing the trust of the individuals in the databases against advancing the science? Well, I can suggest at least one example where we had an interesting solution. We have a survey, which I mentioned, that we have done with Facebook. The survey is not protected medical information, but it is still subject to significant privacy concerns, especially because the platform is a Facebook platform, and people may feel that Facebook knows a lot about them and could combine their answers with what it knows. So we've worked out a firewall of sorts: we have the results of the survey but we don't know who the people are, while Facebook knows who the people are but doesn't have the results of the survey. We send them a scrambled ID that allows them to identify who the person is. They use that to look up the demographics of the person; this also addresses your previous question about demographic balance. Using their own algorithms, they calculate a weight for how much that person's response should be up-weighted so that the overall responses are indicative of the population as a whole. They send us back the weight, we apply it to all our respondents, and we weigh the responses appropriately. So this is an example of a very creative solution that works quite well. The aggregated statistics from these surveys are available publicly; the underlying data is available to individual researchers under a data use agreement. So I think that's just one example. Thank you.
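The two-party firewall described above can be sketched roughly as follows. This is a hypothetical toy, not the actual system: the research team holds responses keyed only by scrambled IDs, the platform maps each scrambled ID to a demographic weight without ever seeing the responses, and only then is a population-adjusted estimate computed:

```python
# Hypothetical sketch of the survey "firewall": neither party holds both the
# identities and the responses. IDs, responses, and weights are all invented.

# Research team's side: responses keyed by scrambled ID (no identities known).
responses = {"a9f3": 1, "k2d8": 0, "m7q1": 1, "z4w6": 0}  # 1 = reported symptoms

# Platform's side: scrambled ID -> weight computed from the demographics the
# platform knows, so the weighted sample resembles the population as a whole.
weights = {"a9f3": 0.8, "k2d8": 1.5, "m7q1": 0.9, "z4w6": 1.2}

def weighted_rate(responses, weights):
    """Population-adjusted estimate; neither side needed the other's secrets."""
    total = sum(weights[i] for i in responses)
    return sum(weights[i] * r for i, r in responses.items()) / total

est = weighted_rate(responses, weights)  # weighted symptom rate
```

The unweighted rate here would be 0.5; the weights pull the estimate toward what the population-level breakdown implies.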
Yeah, I think in Odyssey, because we never actually share the patient-level data, the only real worry is that aggregate data could sometimes be used to identify someone if you've got something rare. So we do have filters in Odyssey, like a minimum cell count that the user can set depending on their data. If their data agreement says they have to have a minimum cell count of 10, they put that; if someone else's is five, they put that. So you can make the aggregates as restrictive as you want as well. Jenna, a related question for you. Potential participants may be unfamiliar with some of your software tools at various levels: the package mechanisms, specific R tools, or R itself. What help can you provide, or what additional contributions would be useful in this area? Yes, that's a great question. So we've tried to make our tools as friendly as possible for people of different backgrounds, because there's expertise in different fields: some people can code and some people can't. We actually have a web interface where you can design studies, and it will create an R package for you, so you don't have to know any R coding. Then you can simply point it at a database, and we have tutorials for that. We have online tutorials for everything, so there's only a minimal amount of R coding that has to be done. But if someone is able to code in R, we also have tutorials on how you can customize and actually contribute. So depending on your ability, you can just use what we already have as a web interface, you can write R code, and you can use any back end you want; if you want Python, you could use that. And we have all these videos, tutorials, vignettes, lots of resources. If people are still stuck, post an issue on our forum saying you still need help, and we're happy to help. If there's anything wrong in the R code, post an issue on GitHub; we're always available to chat and discuss things. Wonderful.
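The minimum-cell-count filter just described amounts to suppressing any aggregate count below a site-configured threshold before results leave the site. A minimal illustrative sketch (the cell labels and counts are invented, and this is not the actual Odyssey implementation):

```python
# Hypothetical small-cell suppression: counts below a data partner's configured
# threshold are withheld so rare combinations can't identify individuals.

def suppress_small_cells(counts, min_cell_count):
    """Replace any count below the threshold with None (suppressed)."""
    return {
        cell: (n if n >= min_cell_count else None)
        for cell, n in counts.items()
    }

counts = {"age 40-49, diagnosed": 230, "age 90+, diagnosed": 3}
safe = suppress_small_cells(counts, min_cell_count=10)
# the rare "age 90+" cell is suppressed; the large cell passes through
```

Each site can set its own threshold, so the shared aggregates are only ever as permissive as the strictest local requirement.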
Andrea, is there a process for data validation for contributed data sources for the COVID-19 HGI? Are there other quality checks to ensure data fidelity? Yes. So we have a series of QC checks where we compare. On the phenotypic side, no: there we trust the different studies to do a good job. But we have such easy phenotype definitions, like hospitalized yes or no, or on ventilation yes or no, that it's quite standard, although some of these can vary across countries. For the genetic data, yes, we have several quality control checks imposed, where we compare, for example, the frequency of the variants with a standard reference. That's very important, especially for studies that are coming up for the first time and don't have expertise in bioinformatics. Related to that, we also offer a service: there were many more bioinformatics specialists interested in participating than actual studies, so we have tons of PhD students and postdocs who were interested, and we have this kind of job market where a student can be matched to a study if the study doesn't have the expertise in human genetics and bioinformatics to do the analysis. Thank you. Ken, I had a follow-up question here related to access, which we talked about earlier. How does Sama make data available to researchers? Do they need a specific type of affiliation or title? Yeah, so that's honestly a part that we never fully got to, frankly because of our inability to get a sufficient number of participants to contribute to the consortium itself, and so to all the downstream implications around governance and data sharing and how you access that data, et cetera. Our goal at the beginning of this was to secure a sponsor, ideally a governmental sponsor like NIAID or BARDA or the CDC, to really provide that infrastructure on the governance side for the data share.
And it's related somewhat to what I mentioned a moment ago: because of this rapid proliferation of similar consortiums, building a sufficient quantity of data that would allow it to contribute back to the medical community is something we're still in the process of today. So it's largely to be determined, and really limited on the governance side by the lack of participants across the full range. We worked with a lot of individual sponsors, Sherry, as I mentioned, 50-plus, but, as I'm sure you can appreciate, the intellectual property issues, the privacy issues, the many things you talked about make aggregating that data a bit of a challenge in this space and time. Everyone is racing as hard as they can to get the right answer most appropriately, and thankfully for all of us. But the idea of aggregating, sharing and learning collectively is going to be a midterm solution, I think, not a short-term one. Thanks for that. I have a question that I think spans a number of the talks here. One issue with large real-world databases is harmonizing variables between the data sources: how do you have databases follow a common schema, and how do you handle harmonization and things like missing data? Would one or more panelists like to talk about that? Yeah, so I think this is something that Odyssey deals with a lot, because although we have a common data model, people can have things recorded very differently. There's been a study that looked at just the individual concepts, so we use SNOMED, but you can think of ICD-9 and ICD-10, looking at their use across the network, and they found that it's actually very different: things are recorded very differently.
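The core mechanic of this kind of harmonization is mapping heterogeneous source codes to a single standard vocabulary so one query runs unchanged across data partners. A minimal sketch under stated assumptions: U07.1 is the real ICD-10 code for COVID-19, but the mapping table, concept labels, and helper function here are purely illustrative, not the actual vocabulary tables of any common data model:

```python
# Hypothetical vocabulary mapping: (source vocabulary, source code) pairs are
# translated to one standard concept; codes with no mapping are flagged.

concept_map = {
    ("ICD10", "U07.1"): "concept:covid19",          # real ICD-10 COVID-19 code
    ("ICD9",  "480.3"): "concept:viral_pneumonia",  # illustrative mapping
}

def standardize(records, concept_map):
    """Attach a standard concept to each raw record; unmapped codes flagged."""
    return [
        {**r, "concept": concept_map.get((r["vocab"], r["code"]), "unmapped")}
        for r in records
    ]

rows = standardize(
    [{"vocab": "ICD10", "code": "U07.1"}, {"vocab": "ICD9", "code": "999.9"}],
    concept_map,
)
```

The "unmapped" flag matters in practice: the coding-usage differences described above mean each site must check how much of its source data actually lands on standard concepts.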
So we're actually doing a lot of work on trying to come up with consistent phenotypes that will work across the whole network, and we've realized from studies that things are defined very differently in the different datasets, whether it's US claims or an EHR in Europe. So we actually have to do preliminary studies to check how we're defining things and see whether we're getting the same sort of people, and that's why we're doing these characterizations: to see whether the people we're identifying for a given variable are the same or differ across the datasets. That's the approach we've started to take. Yes, Sherry, great question. I would say, from the SAMA perspective specifically, that is the business model. That is the core contribution, I think, of SAMA to the drug development and healthcare space: as I mentioned, the company is 23 years old and has been doing big data integration, advanced analytics and harmonization of data for 23 years, since before it was called big data. So as you might guess, there's been a tremendous amount of energy, time and effort to create the technology capabilities, as well as all the data standardization rules you can apply, in order to take those data sources in an automated way, map them to standards and ultimately harmonize them. That really is the core business model of SAMA: a technology stack that's able to do that with a host of business rules and standardization capabilities. For example, pick just one sponsor that we might work with. Oftentimes they're working with four, five, six different CROs, on top of all the internal data sources they have, and what we've found is that for any individual clinical trial there are somewhere between seven and nine disparate data sources being utilized.
And then of course, these big companies don't run one trial, they run hundreds of trials, which multiplies those numbers, with multiple CROs who have structured the data differently as well. So really what SAMA has created is the ability to take all of that heterogeneous, unstructured, source-agnostic data and harmonize it in days and hours, not the more traditional months and years. Can I ask a question, actually? Absolutely. I just wonder if any of you has been working on generating synthetic patient data as a way to allow data sharing. Can you repeat that, Andrea? I didn't fully catch it. Yeah, has anyone been generating synthetic data, that is, data that maintain the same statistical properties as the original data but come with some kind of differential privacy guarantee of not being re-identifiable? There's some work in the machine learning community thinking about that, and I don't know if you have thought of this as a way of actually being able to share the individual-level data. So honestly, I don't know of anyone who's done it for their own database, but I think SimPath might be a dataset that's been developed that way. I know that Odyssey was definitely coming up with some simulated data just to test things on. But I don't know of anyone who's done it with their own data, taking the kind of information about the data and replicating it in a simulated way; I don't think anyone's done that in Odyssey. Yeah, this is a great question, Andrea, because this is the panel talking about the next phase of COVID research, which is focused on the patient level, and creating synthetic data is something that's becoming more common in health care in general.
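One simple version of the idea Andrea raises is to release synthetic records drawn from differentially private marginal counts rather than the real patient rows. The toy sketch below is purely illustrative (invented cells and counts, and a real release would need a full privacy analysis, not just this one mechanism): Laplace noise is added to each cell count, then synthetic records are sampled in proportion to the noisy counts.

```python
# Hypothetical sketch: DP-noised marginals -> synthetic records. Toy only.
import random

def dp_counts(counts, epsilon):
    """Add Laplace(1/epsilon) noise to each count (sensitivity 1 per patient).

    A Laplace variate is generated as the difference of two exponentials.
    """
    scale = 1.0 / epsilon
    return {
        k: max(0.0, v + random.expovariate(1 / scale) - random.expovariate(1 / scale))
        for k, v in counts.items()
    }

def synthesize(noisy_counts, n, rng=random):
    """Draw n synthetic records in proportion to the noisy cell counts."""
    cells = list(noisy_counts)
    weights = [noisy_counts[c] for c in cells]
    return rng.choices(cells, weights=weights, k=n)

real = {("F", "hospitalized"): 120, ("M", "hospitalized"): 150,
        ("F", "outpatient"): 400, ("M", "outpatient"): 380}
synthetic = synthesize(dp_counts(real, epsilon=1.0), n=1000)
```

The synthetic rows preserve the approximate joint distribution of the cells while no individual real record is ever released.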
But it's something that all of us, attendees and panelists, should be thinking about when we're sharing our research and our work: if the patient-level data can't be shared, there may be opportunities to create synthetic data. It's also an opportunity, especially for students who may be on the call who are less familiar with creating synthetic data; there are a lot of resources, including existing packages, that handle these types of methodologies, and the topic may be of interest to dive into. But Andrea, I wanted to raise another question that came in through the Q&A, where somebody asked about your collaboration with commercial repositories like 23andMe. Could you expand on how you responded to that question in the Q&A? Yeah, yeah, we certainly work with them. So AncestryDNA has been sharing results with us, and they will be part of the meta-analysis in the next round. 23andMe have promised that; we haven't seen the results yet, but they've been very collaborative in the past, so I'm sure as soon as they feel ready, they will share the data. And it's actually very interesting to see how in the US most of the data can come from companies, while in Europe most of the data are coming from research studies: an interesting difference in approaches. One question, Ronnie, that I was hoping to get your insights on: could you expand on your work and the studies being conducted with a view towards real-time policy impact, things going on now and things you see in the future? So I mentioned in my talk that we focus on forecasting four weeks ahead. The reason we do that is that, after talking with quite a few public health officials, of course what politicians and public health officials want is for you to forecast the rest of this nightmare for as long as it lasts, right? They want you to tell them what will happen four months from now in each county, and so forth.
We resist that very, very strongly. The reason is that the expertise we've developed is expertise in how the virus behaves, not in how people behave, and our feeling is that what happens more than four weeks from now depends as much, if not more, on the behavior of people and governments than it does on the behavior of the virus itself. And we're no experts in forecasting people's behavior. So this is what puts the upper limit of four weeks on what we think we can do, and also on what we can validate: we didn't start putting forecasts out until we had several weeks of prospective validation, and if you're putting out forecasts for two or three months from now, then you have to wait two or three months, and then some more, to have prospective validation. We didn't want to do that. On the other end, why did we go all the way out to four weeks? The answer is that for a variety of decisions that are based on forecasts, four weeks seems to be sufficient. If you're concerned about overloading your hospitals, there's not much you can do about the coming two weeks: that's pretty much set in motion. But you can still do something about weeks three and four by imposing very strong restrictions: you'll affect week three a little bit, and week four a lot more. So this is what led us to this timescale. And in talking to local public health officials, we found that they actually have a lot of use for one-week and two-week forecasts. One of the crunches they encounter is recruiting and training enough contact tracers, and for that, even one week would be quite helpful to them. So what we found out is that there is a huge demand for forecasts from different stakeholders, including some we never thought about before, and it spans the range of things to forecast and of time spans, but mostly it's the short term. The long term is outside our expertise.
So given those considerations, just a quick follow-up: is there additional data we could be collecting in order to have more relevance to other types of projections? It sounds like your scoping is really geared towards making those recommendations, but would additional data availability change that? I spend most of my personal time chasing new data sources. I've done that for the last eight years, since 2012. The difference is that before COVID I had to beg, and it was happening at a rate of one or two new sources a year, and I had to deal with lawyers over months and sometimes years. In the COVID period, I no longer need to beg. In fact, there's so much goodwill and so much motivation to help that we have many data providers coming to us saying they would like to give us their data, and the lawyers work amazingly fast now. But it's still a very big open question what other data sources could shed light on the current situation. The current pandemic situation is really multifaceted, not only in terms of different geographies, different countries, different regions within a country, but in the different aspects of the disease: cases, hospitalizations, deaths, demographics. There are so many dimensions there, and some of them, as you pointed out in your question, are not captured well by current surveillance. So you always have to be creative in looking for new sources. If there's anybody on this call who has data to contribute, I'm listening. Thank you, Ronnie. And this might be a good question to wrap up the panel with: to have the other three speakers also comment on the roles of their initiatives in shaping public health policy. Jenna, did you want to go next? Yeah, so I think Odyssey can have power in answering certain questions, because there was a question earlier about the kinds of bias we can get.
And I think that Odyssey can actually have a lot of value in answering questions about minority groups, because we have the big data. So I think that could be where Odyssey really helps shape decision-making: a randomized clinical trial is obviously going to have small amounts of data for those people, but Odyssey is going to have big data. At the moment we're looking at lots of different research questions and predictions that could help. The prediction model we built, for example, was there to help strategize who to shield. We've also got the drug safety work: we can often look at the safety of lots of different drugs much more quickly than a single study could, so I feel we can help there with regulatory questions. And I think one of our studies was actually requested by a regulator, because we have the data for that. Thank you. Ken? Great wrap-up question, Sherry, thank you. Yeah, one thing that we clearly know now is that different factors, whether genetic or phenotypic, health disparities, or access to healthcare, et cetera, are all critically important variables from the overarching policy perspective, and how do you deal with that? Our thought is that by aggregating all of these data sources and being able to pull in things like genetic markers, which can help individuals, institutions and certainly policymakers understand who is most vulnerable, marrying it perhaps with Ronnie's data around where the disease is evolving over time, for example, we can help to really unpack not only the phenotypic factors but the other genetic and bio-response markers that can help us wrap our heads around who really is the most vulnerable here.
Because we know some studies say 40 to 50% of patients are asymptomatic, and yet others who appear to be completely healthy obviously have a completely different course within a matter of a few weeks. And so I think that the more we can aggregate data and apply advanced analytics tools and visualization to draw those types of conclusions, which we can then aim our policymaking initiatives around, focusing them on the right areas, whether it be regional, or particular types of patients, subgroups, ethnicities, whatever it might be, will allow us to be more sharply focused and more thoughtful in where we direct our spend and time. That's the concept, and I think that's how we can contribute. Thank you, Ken. Andrea? Yeah, so from the genetic side, I don't want to oversell the promises. I don't think genetics will have much value from a public health perspective. It's going to help with discovering new biology. It might have some value in the clinic, on top of existing prediction models for COVID severity, but I don't expect it to have a major role there. Revealing new biology for drug discovery is probably where genetics is going to be useful. Thank you. We appreciate a cautious, nuanced pitch about the role of the work, rather than overselling. So that will wrap up my role as moderator; I'm going to send it back to Joe. Well, let's thank all of our speakers, and I'd particularly like to thank our audience, who have hung with us to the end. Please watch the COVID-19 Data Forum website for information about upcoming events. And thank you all. With that, we conclude our broadcast. Thank you. Thank you, everyone. Be well. Thank you. Bye-bye.