Good morning, everyone. Good morning from Washington, D.C. We have a full room of scientists and speakers ready to go for this fantastic workshop. My name is Andrea Baccarelli. I'm a professor of environmental health sciences at Columbia University, at the Mailman School of Public Health, and I have the great pleasure of serving as one of the two co-chairs, with Kristen Malecki, of the Standing Committee on the Use of Emerging Science for Environmental Health Decisions. So I would like to first introduce what the committee does and then give you an overview of our two days together for this workshop. The National Academies Standing Committee on the Use of Emerging Science for Environmental Health Decisions, ESEHD for short, as I will say from now on, examines and discusses issues in the use of new science, tools, and research for environmental health decisions. We convene workshops that provide a public venue for communication, discussion, and the brainstorming of new ideas among government, industry, scientists, environmental groups, and the academic community. We propose topics around scientific discoveries and advances in methods and approaches that can be used in the identification, quantification, and control of environmental impacts on human health. The committee is organized under the auspices of the Board on Life Sciences and the Board on Environmental Studies and Toxicology of the National Academies of Sciences, Engineering, and Medicine, and is sponsored by the National Institute of Environmental Health Sciences. And of course, this is all about the people. We have an amazing committee, which you see on screen, co-chaired, as I mentioned, by Kristen Malecki and me. I would love for all of you to be in the room when we meet, to hear all the amazing ideas and proposals you can listen to by being there.
It's really a very proactive, very diverse, very enthusiastic group, and of course a very qualified one. We also work with a multi-agency group of federal government liaisons who, along with the standing committee members, help shape and guide our activities. On this slide, you see some of the topics we have covered in the past few years. Since 2019, we have hosted numerous workshops on a wide range of topics. Depending on the maturity of the field, the topics can be broadly categorized as emerging research strategies and analytic methodologies, emerging areas of convergence, and emerging advances in science and technology. And the good news is that if you missed these workshops in the past, we have proceedings online, notes, and some videos as well. So please look us up on Google and you will find that the committee's workshop proceedings form a very rich repository. As I said, this is really about the people, and that of course includes you. Sorry, this stopped working. Okay, we're back online. If you have ideas about workshops we should consider, please email us; there is an email address on the screen, so send us any idea about workshops we should do in the future. We are organizing new ones and we are looking for new ideas. And just a few housekeeping rules. I encourage all of us to be active participants. We are in a brave new world where some of us are in the room and most of you are online, so we want everyone to be active, to participate, and to contribute. Please use the chat box or comment box online to submit your comments and questions. For those of us here in person, we have microphones; remember to bring them up, put them down, and switch them off when you are done. And we lost the video here. Fortunately, I'm almost done; I don't really need the slides.
What I would like to say is that thoughts and ideas shared during this workshop, considering that it is public, are attributed to the individuals and not to the National Academies of Sciences, Engineering, and Medicine. And lastly, recordings of the workshop will be available, so if you miss today or any part of the workshop, please come back to the website; you will be able to stream them at your convenience. At this point, I would like to start introducing our two days together. The title of the workshop is Advances in Multimodal Artificial Intelligence to Enhance Environmental and Biomedical Data Integration. Of course, we are all here, in person or online, because we are excited, so I really don't need to say much about why this is particularly important. Everyone who knows me knows that I'm over-enthusiastic about this, so someone will have to slow me down during these two days. Really, everything from online searches to ChatGPT seems to be powered by AI today. It is ubiquitous; it has become very much part of our lives. This workshop will consider what artificial intelligence and related techniques mean for environmental health, for biomedicine, and for health. And of course, we would like to thank the planning committee for organizing this workshop; please give a big round of applause for the hard work that made this workshop possible. The planning committee includes Carmen Marsit, as chair, from Emory, who is here in the room; Yao-Yi Chiang from the University of Minnesota; Christopher Duncan from the National Institute of Environmental Health Sciences; Anindita Dutta from the University of Chicago; Megan Latshaw from Johns Hopkins; and Gwen Ottinger from Drexel University. Although that was the smallest slide font ever seen by humanity, I was able to read the names, so I don't need an eye test, not yet.
So I'll give you just a quick snapshot of the next two days. As I mentioned, we will hear from expert speakers and panelists who are trailblazers in the field, who have made new advances and have substantial experience in this area. We want to hear what they're learning, what they're doing, and where the field still needs to go. We will wrap up the first day with a keynote address by Eric Topol and a chat with both Eric Topol and Rick Woychik. Then please join us again tomorrow at 10 a.m., same as today, for an equally exciting lineup of speakers. The agenda is online; you can access it and download it at your convenience. I'm sure it's going to be innovative, exciting, and helpful. Our workshops, and this is no exception, are intended to be stimulating and novel. We want this to be a venue in which new ideas can be generated, new collaborations can be made, and new areas can be brought forward. With that, I would like to turn it over for session one to Megan Latshaw. Thank you, Megan.

All right, welcome everybody. We're excited to kick this off with the first session, which is meant to be a level-setting session. And I should start by introducing myself. As Andrea said, I'm Megan Latshaw. I work at Johns Hopkins University in the Department of Environmental Health and Engineering. As you'll probably see, one of the longest stints in my career was here in D.C., when I worked for the Association of Public Health Laboratories and the Association of State and Territorial Health Officials, so I have a very practice-oriented approach to environmental health. I also served as chair of the APHA Environment Section. So that's a general summary of who I am. In this session, we're going to focus on the foundations of, and challenges to, using AI to integrate environmental health and biomedical data.
We're going to lay out the current state of research when it comes to environmental health, to biomedical science, and to AI, so we'll all have a sense of where things stand in these three fields. Most importantly, we want you to be thinking, throughout this session and throughout the entire workshop, about the opportunities for AI to advance these fields and bring the two areas together. Our first speaker is Dr. Patrick Breysse, a professor of environmental health sciences and medicine at the Johns Hopkins Bloomberg School of Public Health. Dr. Breysse recently concluded an eight-year tenure as director of CDC's National Center for Environmental Health and the Agency for Toxic Substances and Disease Registry, and he earned his MPH and PhD from Johns Hopkins. I'll turn it over to you, Dr. Breysse.

I have a single slide; I'm not sure whether we're going to get it up or not, and I don't really need it. I took a minimalist approach to this talk. Since I'm really just trying to level set, I want to start by humbly stating that I certainly don't have my finger on the pulse of all the environmental health research that might be relevant here, but I'm going to talk from one perspective, about something I've been thinking about for a while. We've heard many statements, things like "place matters." We know there are many challenges to understanding what we need to know to identify, quantify, and control environmental factors, as we just heard from Andrea in his introduction, but I would humbly add one thing: in addition to identifying, quantifying, and controlling, anticipating. I think anticipation is an important component, especially when it comes to AI. Can I have the next slide? I think the biggest challenges we have, and the biggest opportunities, revolve around access to new data that allow us to look at environmental factors.
I'm going to speak on the environmental side; other panel members will talk about the biomedical side. We need to understand factors related to the temporal and spatial resolution of hazards and risk factors in our environment. I think there's an unprecedented opportunity here, and along with it come big challenges. I put a number of things on this slide; I could have put dozens, but I didn't want to make this a slide-heavy talk. We certainly know that, in terms of the Internet of Things and smart cities, there are things happening right now: cities are collecting data for all sorts of purposes, usually to manage city functions, city assets, resources, and services, and to efficiently improve the operation of cities. The environment is usually not part of those discussions at all, but it absolutely should be, so we need to make sure we're integrated into all of that going forward. Here are some examples of things happening now that we can easily build on. In Baltimore City, I read recently that they're putting noise monitors at certain places in the city to detect gunshots and target police operations against gun violence. Obviously, noise monitors can be used for a lot more than gun violence going forward, and I was wondering whether anybody in the city of Baltimore is talking about where those monitors go and what else that data could be used for. I'm sure other cities are doing the exact same thing. Many cities collect electronic water-use data to manage water use; the city of Baltimore, for example, is having a drought right now, and there's a water source that has to be tapped when the reservoirs get too low. So they need to figure out where the water is going and who is using it, in addition to just how to bill people for their water.
They need to manage that anyway, and it's very easy to see that we could put water quality parameters into that data collection as well. We know, for example, that during the Flint water crisis people started developing home lead-in-water sampling detectors that work in real time. If we're building these systems to manage city functions, there's an opportunity to monitor things we probably need to start thinking about; let's see how we can use them for environmental health data. Temperature is a big deal. We all know that heat stress and the health effects of extreme heat events are a growing concern in environmental health, and many cities are measuring temperature around the city, but they're doing it for environmental assessment; they're not thinking about how it can be used for environmental health assessment. So I'm talking about how we integrate data streams just on the physical side of things; I haven't even touched on the challenges on the biomedical side. Traffic monitoring is another one. Many cities monitor traffic; we all know traffic is carefully monitored, because we can access it on our phones every day, and traffic data can also be used for air quality management going forward. So how do we think about using that data more efficiently? In addition, there's an explosion of opportunities in satellite remote sensing. You see a picture on this slide of wildfires in Canada affecting the East Coast, and you might think this was last week, but it was 2012, from a paper we published at Johns Hopkins looking at the health impacts of those wildfires on the East Coast. So this capability has been around for a long time, and that's just one example; satellite data is being used right now for far more than tracking weather and wildfires.
Satellite data is being used to assess water quality: you can look at the color of the water to assess algal blooms and their growth. There are all sorts of land use and vector ecology issues that can be assessed with satellite data, which might help us understand changes in infectious disease parameters. There are measurements of soil moisture and drought that can be used, all of which have health consequences as well. And obviously there are big uses around climate change and extreme weather events right now; as we listened to the news this morning, the extreme weather about to hit the Southeast today is a good example. We're already using data like this in environmental health for emergency management and emergency response.

Now let me turn to sensor technology, something I think is near and dear to David's heart. There's been an explosion in our capability to measure things using small, portable sensors. That has huge implications for data management and data quality, and big opportunities for citizen science. We need to think about how to use citizen science better, and how to make better use of the measurements people take on themselves. We're now getting into the realm of what's been called precision environmental health: how do we take measurements that people can take on themselves and use them on an individual basis, as well as a population basis, to assess health impacts? A good example is heat stress. My colleagues at CDC and I published a paper a few years ago showing that in 2020, I think it was, there was a heat wave in the Northwest, and over a four- or five-day period there was a 70-fold increase in ED visits. Nothing startling about that. But those ED visits don't have to happen. Why do they happen? Because these are vulnerable people, and we can use data sets to identify vulnerabilities. These are people who don't have access to transportation, and we can use data to assess that. When we looked at why cooling centers weren't being used in another community, in the Southwest, we saw that the people who need the cooling centers don't know where they are and don't have transportation to get there. These are all things we can fix using data, once we understand them, so that people don't have to go to emergency departments during extreme heat events. Now, there are many challenges to using these data. They include how we integrate the data in a meaningful way, which is part of what we're going to talk about today; how we use these data to anticipate when problems are going to occur, not just to quantify and address them; and how we use them to evaluate the effectiveness of whatever control procedures we put in place. That last one is a perfect example: we spend a lot of time doing things we think are improving people's health and reducing exposures, but we don't always evaluate how effective they are, so there is important effectiveness research to do here as well. How do we identify where the risks are greatest, and how do we prioritize our efforts? That leads us inevitably to questions about environmental justice.
Often we are looking at disadvantaged communities, because that's where the exposures and health risks are greatest, so the challenges of using these data for environmental health inevitably lead us to environmental justice issues as well. I'll end right there and say these are exciting times. There are many challenges; we need to start training people to use data like this, and the training of the environmental scientists we produce needs to include more data science. We need to think about how we use these data, how we take advantage of them, how we address the quality issues associated with the data we collect, and how we make decisions with them. But I'm very excited about what these opportunities hold. Thank you.

Great, thank you so much, Dr. Breysse. Next we're going to hear from Lucila Ohno-Machado, who is the deputy dean for biomedical informatics and the chair of biomedical informatics and data science at the Yale School of Medicine. Her research focuses on privacy for healthcare and the biomedical sciences. Before that, she was associate dean for informatics and technology and the founding chair of the UC San Diego Health Department of Biomedical Informatics. She is also the PI for the California Precision Medicine Consortium for the NIH All of Us Research Program, and she received her medical degree from the University of Sao Paulo and her doctorate in medical information sciences and computer science from Stanford. I'll turn it over to you, Dr. Ohno-Machado.

Hi, thank you so much to the organizers and to the facilitator here. I would like to share my slides now; I think I need to be granted permission by the host. Sorry, I still cannot. Okay, you have permission. Somehow it says the host disabled participant screen sharing. Oh, now I've got it. Thank you. And please let me know whether I have the correct mode or need to swap displays. We see the notes.
I think you might want to swap displays. Yeah. Thank you so much.

I'll briefly comment on precision medicine and the role of AI in environmental health and biomedicine. This will be a very high-level overview of some things that we and colleagues are doing in various projects. To enhance health and healthcare using data, we definitely need new algorithms, tools, and systems that use informatics and data science, AI, statistical learning, and so on to enable personalized health and personalized medicine, by harmonizing data from several modalities: electronic health records (EHRs), human genomes, microbiomes, and of course the environmental data we're talking about today, surveys, and so on. From these we build predictive models that can be used at the individual level. Most importantly, we not only characterize inequities in health and healthcare but also use that information to help mitigate them. This circle here just illustrates some of the subspecialties in data science involved in improving human health. We obviously won't talk about them all, but you can see that several types of data and specialties revolve around those data types. Of particular interest today are sensor data and environmental data, and their integration with electronic health records, omics data, and other modalities. I'll talk primarily about models, but keep in mind that the data and the cohort studies are super important; I'll speak a little bit about that. And the implementation aspect is something we really need to work more on, the science of implementation, to get our discoveries and findings disseminated into practice. So, precision care in health: we have genomic data for variant-based therapy, in cancer and HIV treatments and so on; phenotypic data, which is my area and which I'm very familiar with; but also environmental and social determinants.
Air quality, exposures, diet, access to care, and so many other items play a significant, if not larger, role in someone's health than their genetic makeup or the particular diseases the person has at the moment. Predictive models in medicine require a lot of data to be reliable: AI requires large and representative data sets. So we need to build access to large data repositories to improve research, but we need to do it in a way that protects the privacy of the individuals and of the institutions involved. We need to aggregate data from different countries if we are to make discoveries at a faster pace than we do today. And there's a dilemma that gets to all of us. Electronic health records, for example: is it right to share them, since people have not been explicitly asked? Then again, is it right not to share, when the numbers, and the acceleration of discovery, depend on that sharing? Can we share without moving data around? Can we share with the explicit permission of the individuals? That's one aspect of it. I will briefly say that the All of Us Research Program we participate in, a large NIH initiative, is doing exactly that: building a large and diverse data set by asking people to actively contribute their data from electronic health records, as well as physical measurements and other items that are part of the protocol for the development of this cohort. Right now, many institutions are already making use of the data available through this program, which is open to researchers at large, currently in the U.S. but soon in other countries as well. There is a treasure trove of data there, and importantly it includes several environmental items, though perhaps not as many as we would want.
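The question of sharing without moving data around is often answered in practice with federated analysis: each site fits a model on its own records and only the model parameters travel, never patient-level data. Here is a minimal sketch of that idea; the two sites, their data, and the simple linear model are hypothetical illustrations, not the All of Us architecture.

```python
# Federated-averaging sketch: each site trains locally and shares only
# model coefficients, never patient-level records. Sites and data are
# hypothetical illustrations.

def local_fit(xs, ys):
    """Ordinary least-squares slope and intercept on one site's data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def federated_average(site_models, site_sizes):
    """Combine per-site models, weighting by the number of local records."""
    total = sum(site_sizes)
    slope = sum(m[0] * n for m, n in zip(site_models, site_sizes)) / total
    intercept = sum(m[1] * n for m, n in zip(site_models, site_sizes)) / total
    return slope, intercept

# Two hypothetical sites whose data follow the same trend y = 2x + 1.
site_a = ([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
site_b = ([4.0, 5.0], [9.0, 11.0])
models = [local_fit(*site_a), local_fit(*site_b)]
global_model = federated_average(models, [3, 2])
print(global_model)  # both sites agree, so the pooled model is (2.0, 1.0)
```

Real deployments add secure aggregation and many training rounds, but the privacy property is the same: the coordinating server only ever sees coefficients.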
Another aspect is the diversity of the people involved, because traditionally many segments of the population were left out of research studies. One segment in particular that we're studying is admixed populations; this is part of another NHGRI initiative, because we know admixed populations improve the power of variant discovery and the portability of genome-wide association studies, as well as of predictive models in other areas. One predictive model that is coming along is the polygenic risk score: using genetic information, along with phenotypic information and hopefully social determinants and environmental data, to assess risk for particular individuals, and then to do something preventive or therapeutic about those diseases for those individuals. The problem is that, as these polygenic risk scores are produced, the methods need to be improved. For example, the scores may need continuous updating. Will all of this generate or introduce more disparities and more discrimination than some healthcare algorithms have already introduced? Can we also protect privacy, particularly for these forgotten populations, who are also the ones least likely to trust biomedical research? And what do we do for individuals of mixed ancestry? So this is a center on admixture science and technology: genomics for everyone. And I'll tell you how that relates to environmental data, which is the topic of today. Think of a trait or disease as a function of genetic determinants of health, as well as environmental and social determinants. If we are to develop an individualized risk, we need to know all of those items, and we're working hard on getting sequences from individuals, for example, in order to assess variant effects and also ancestry and heritability. But we also need to collect exposures, diet, education, and other social determinants in order to determine the risk for a particular trait.
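At its core, a polygenic risk score is a weighted sum over a person's risk-allele counts (0, 1, or 2 copies per variant), which can later be combined with environmental and social terms in a joint risk model. A minimal sketch follows; the variant IDs and effect weights are invented for illustration, whereas real scores use thousands to millions of variants with weights estimated from genome-wide association studies.

```python
# Polygenic risk score sketch: sum over variants of (allele dosage x effect
# weight). Variant IDs and weights below are hypothetical.

def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele counts for one person."""
    return sum(weights[v] * d for v, d in dosages.items())

weights = {"rsA": 0.12, "rsB": -0.05, "rsC": 0.30}   # hypothetical GWAS effect sizes
person = {"rsA": 2, "rsB": 1, "rsC": 0}              # allele counts for one person

print(round(polygenic_risk_score(person, weights), 3))  # 0.19
```

The methodological worries raised above live in the `weights`: if they were estimated in a non-diverse cohort, the score is less portable to admixed individuals, which is why continuous re-estimation and diverse discovery cohorts matter.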
We need to do this in a way that the algorithms travel to the data, and the data stay where they are. In short, our vision is that no one will be left behind, and that we will increasingly replace the concepts of race and ethnicity being used today with a combination of genetic, environmental, and social determinants of health, because each individual is different, not only in their genetics but in the way their course of life has proceeded. We will develop new methods and tools that allow model findings to be applicable to all, and not, as they are today, mostly reflective of the majority populations that do participate in clinical trials and observational studies. With that, I want to thank everyone, thank NIH for sponsoring several projects related to this, and again thank the organizers for the opportunity to talk to you.

Thank you so much, Dr. Ohno-Machado, very interesting. I've been taking notes all morning, with lots of good ideas for how we can advance this field, so I'm getting really excited for the conversation. Next we're going to hear from Dr. Marzyeh Ghassemi. She's an assistant professor at MIT in Electrical Engineering and Computer Science and at the Institute for Medical Engineering and Science. Professor Ghassemi focuses on creating and applying machine learning to understand and improve health in ways that are robust, private, and fair. Previously she was a visiting researcher with Alphabet's Verily and an assistant professor at the University of Toronto. Before her PhD in computer science at MIT, she received a master of science in biomedical engineering from Oxford University as a Marshall Scholar, and a BS in computer science and electrical engineering as a Goldwater Scholar at New Mexico State University. I'll turn it over to you, Dr. Ghassemi.

Thank you. Today I'm going to talk to you a little bit about designing machine learning processes for equitable health systems.
The thing I want you to keep in mind is that we all share a goal, I think, of creating actionable insights in human health. But to get there we really need to understand how we build models that perform well, how we decide which data are most appropriate for model training, what kind of health care we want to represent, and what kind of behaviors we want to encourage in deployments where machine learning models are coupled with human usage. So let's say we have a patient, Simone, who's having trouble breathing, and she goes to the emergency department. It's very late, and she's told: hey, we took a chest X-ray, you might have pneumonia, but we need somebody to look at it. We're swamped, so it's going to be three hours. You can wait here three hours for a doctor to look at your chest X-ray, or we have a machine learning model that can look at it right now, and this model performs at or above humans at the triage task of saying you're healthy, you can go home, you don't need to be seen, or you should wait around in the hospital. So which would you choose? You can have this AI screen you now, or you can wait three hours for a human. This isn't a hypothetical situation. This is actually the situation we're in, with several clinical AI systems that perform at or above humans in a range of tasks across the human lifespan. And this is a paper from our soon-to-be keynote speaker, Eric Topol, that I really enjoyed a few years ago, demonstrating that across these different settings we have AI that is no longer a calculator or a simple toy; it's now performing in ways that are really impressive. But the issue is that even once a model is regulated, and the FDA does regulate software as a medical device, we have some questions about exactly the performance we might get in different settings.
The reason we have this is that AI learns from humans, both through the data that we generate and feed into the model, and through the design decisions we make when we optimize these models toward some training objective. And we have some issues with the medical data that currently exists. If we look at the data in randomized controlled trials, or in top-tier medical journals, that data is often very sparse, because RCTs are very hard to run, and very narrowly scoped: most RCT populations are not diverse, so the findings may apply in varying degrees to a more diverse population. And medical reversals happen more commonly than you might imagine: we run an RCT, we get a result, it becomes standard of care, and then there's a reversal years later saying that's actually not the way we should do things. If you look at patient records in large databases, say 250 million patients, you might imagine we could just mimic what happens in patient care, looking up your nearest neighbor in treatment space. But even for extremely common conditions like hypertension, depression, and diabetes, it's been found that large proportions of patients, almost a quarter of hypertension patients, follow completely unique treatment pathways. That means they would have zero nearest neighbors: there's no RCT for you, and no way to just look up what worked for a patient like you. And unfortunately, the obvious solution, better and larger data, doesn't simply work: that 250-million-patient population is not large enough, and health data really lags other machine learning subfields. When we compare machine learning for health to natural language processing, computer vision, or general machine learning:
Health papers tend not to release code as often, not to release data as often, and not to leverage multiple data sets as often, which leads to a standard of reproducibility below what is common in the machine learning space. So let's walk through what I just talked about. Let's say you've engaged me as the machine learning researcher to train a model; here's a triage example. I take the three largest chest X-ray data sets that exist, over 700,000 chest X-ray images from the United States, and I train a DenseNet, a kind of convolutional neural network, to predict "no finding," meaning the patient is healthy and you can send them home. And I get the best possible, state-of-the-art performance, just like the papers. One thing I would want to compare is the false positive rate in different subpopulations, which I could call an underdiagnosis rate: a false positive means I say you're healthy when you actually have pneumonia or another condition. A higher false positive rate in one subpopulation, say female patients, means the deployed model would produce a higher rate of no treatment for patients in that subpopulation who actually need it. We do this, and we find that the state-of-the-art model has the largest underdiagnosis rates in female patients, young patients, Black patients, and patients on Medicaid insurance. And intersectional identities have it worse than aggregated groups: if you're a Black or Hispanic female patient, you are underdiagnosed more than white female patients, or than female patients generally. You might think this makes a very simple point, that we should just audit models; maybe we should ask the FDA to create specific categories of insurance type, sex at birth, and self-reported ethnicity, and require every approved software-as-a-medical-device to hit certain bars within those categories. That would be a great start. But this can get really complicated.
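The underdiagnosis audit described above reduces to a small computation: the false positive rate of a "no finding" prediction among truly sick patients, broken out per subgroup and per intersectional group. A minimal sketch, on invented records rather than any real chest X-ray data set:

```python
# Subgroup audit sketch: "underdiagnosis rate" = false positive rate of a
# "predicted healthy" label among truly sick patients, per group. The
# records below are hypothetical.

def underdiagnosis_rate(records, group_fn):
    """FPR of 'no finding' within each group: predicted healthy / truly sick."""
    fp, sick = {}, {}
    for r in records:
        g = group_fn(r)
        if r["truly_sick"]:
            sick[g] = sick.get(g, 0) + 1
            if r["predicted_healthy"]:
                fp[g] = fp.get(g, 0) + 1
    return {g: fp.get(g, 0) / n for g, n in sick.items()}

records = [
    {"sex": "F", "race": "Black", "truly_sick": True, "predicted_healthy": True},
    {"sex": "F", "race": "Black", "truly_sick": True, "predicted_healthy": True},
    {"sex": "F", "race": "White", "truly_sick": True, "predicted_healthy": False},
    {"sex": "M", "race": "Black", "truly_sick": True, "predicted_healthy": False},
    {"sex": "M", "race": "White", "truly_sick": True, "predicted_healthy": False},
    {"sex": "F", "race": "White", "truly_sick": True, "predicted_healthy": True},
]

print(underdiagnosis_rate(records, lambda r: r["sex"]))
print(underdiagnosis_rate(records, lambda r: (r["sex"], r["race"])))
```

In this toy data the aggregate rate for female patients is 0.75, but the intersectional (F, Black) group sits at 1.0, illustrating why auditing only aggregated categories can hide the worst-off groups.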
So everybody loves language models, right? And note completion is actually one of the tasks that GPT is being used for right now; in fact, Epic has an add-on with OpenAI where, if you start the patient's note, it will autocomplete it for you. So we took SciBERT, a transformer model, a contextual language model, and we took a real sentence from a real patient's note in the Boston area. We filled in the first word, the patient's race, so the sentence read "[race] patient became violent, sent to ___," and we asked the model to fill in the blank. If we say that the patient was Caucasian or white, the model fills in the rest of the note with "they were sent to the hospital." If we say the patient was African American or Black, the model fills in the rest of the note with "was sent to prison." This is not a simple audit category that you could just check for every model. These kinds of associations that models learn are going to be very deeply proxied within the data that we use, especially as we get into large language models and other high-capacity models that use lots and lots of human-generated data with human biases. And biases are a really strong part of the clinical landscape; they're not something you can simply escape. How strong are they, you ask? Pop quiz number one. This is a real medical note as well, but we've redacted the patient's self-reported race. Can you tell from this nursing progress note what the patient's self-reported race is? The clinicians that we surveyed could not. This is not something humans are good at at all; they just guess. But machine learning models can, and some of the clues they're using are maybe fair game, like the fact that in the Northeast, which is where we did the study, there are more African American patients on dialysis due to hypertension. However, we tested this in two data sets, in Boston and New York.
So that's over 4 million notes that we evaluated on, and we found that a lot of the statistical power these language models gain from running over all these notes comes from things that maybe we wouldn't want a model to use, but it does use, because that's what's actually in the data we generate. For example, if you talk about a patient's skin at all, that patient is probably white, because there are very, very few references to skin-related findings of disease in darker-skinned patients. Or if you use the word "difficult" to describe a patient or their family, that patient is probably African American. And maybe you can see how that would happen: the tiny ways that I describe a patient or their family add up over time in the notes, and a large language model can read between the lines and see what kind of patient you're talking about. What about chest X-rays? Surely there is nothing proxied in here. Can you tell this patient's self-reported race? Not genetic ancestry, just self-reported race, which I will say has a very heterogeneous genetic ancestry associated with it for most of the categories we're talking about. So, is this patient Black? Radiologists cannot tell. We surveyed them, we tested them, and they can't. But when we look at several different data sets, from both of the major chest X-ray releases and also from a data set that was more race-balanced, from Emory University, we find that neural models can tell the self-reported race of a patient with extremely high performance. And then we look for the proxies. We ask the radiologists: tell us every cheat code you think a machine learning model could be using here, just like in the note example, where it was looking at whether you talk about the patient's skin or call a patient difficult. It's not body mass index, breast density, bone density, or disease distribution.
In fact, it is information in the frequency domain, which we suspect, although we cannot prove, because we do not have photos of the patients' skin, is related to the melanin level of the skin: darker skin absorbs a larger fraction of X-rays, and of most radiation. These tiny, tiny differences in the frequency spectrum are never perceptible to human doctors, but they are trivially obvious to a convolutional neural network. So much so that even when you band-pass filter these images to the point where they do not really look like chest X-rays anymore, you can still tell a patient's self-reported race. So what are some ways we can try to improve these models, given these deeply proxied associations? One thing we can do is explicitly include fairness constraints and design things for what they're intended for. For example, if you need a parsimonious model like a decision-support checklist, don't build a complex model and distill it down; just build the thing that you actually want. Now, when humans come together to create decision-support checklists or risk scores, it's often very difficult; they have to come to consensus. And then years later we'll figure out that the risk scores we created are actually really biased, and they over- or under-estimate risk, for example, as one paper showed for African American patients across many different clinical subspecialties. If instead we learn an optimally predictive checklist from ICU data, as a mixed integer programming problem that directly minimizes error subject to fairness constraints, we can do things like predict mortality after continuous renal replacement therapy in the ICU while ensuring fairness across intersectional groups. And if you put in no fairness constraints, you might get a checklist like this: predict mortality after this treatment if three or more items are checked. It doesn't reference your gender or your self-reported ethnicity.
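For readers unfamiliar with what band-pass filtering an image means, here is a toy one-dimensional version using only the standard library (the study itself filtered 2D X-ray images with 2D transforms; this sketch only illustrates the idea of zeroing out everything outside a chosen frequency band):

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for tiny toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def band_pass(x, keep):
    """Zero every DFT coefficient whose frequency index (or its alias n-k)
    is not in `keep`, then transform back."""
    n = len(x)
    X = dft(x)
    X = [X[k] if (k in keep or (n - k) % n in keep) else 0 for k in range(n)]
    return idft(X)

n = 8
# Two superimposed cosines: frequency 1 (low band) plus frequency 3 (high band).
x = [math.cos(2 * math.pi * 1 * t / n) + math.cos(2 * math.pi * 3 * t / n)
     for t in range(n)]
y = band_pass(x, keep={1})  # keep only the low-frequency component
# y is now approximately cos(2*pi*t/8); the frequency-3 content is gone.
```

The speaker's point is the converse of this demo: even after aggressively discarding frequency bands this way, enough race-correlated signal survives in chest X-rays for a network to pick up.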
It's also not using any obvious proxy; an obvious proxy for sex at birth would be height or weight, and those aren't in here, and there's nothing obvious for self-reported race. But this checklist with no fairness constraints is no better than a human-made checklist: it has a max false-positive-rate gap between Black women and white men of over 50%. When we include the constraints from the get-go, we get a model that performs much more fairly and could potentially be used in a deployment. The last thing I'll talk about in technical problems is subpopulation shift, which is a huge area in machine learning generally. There are many categories of it: you could have spurious correlations, attribute imbalances, class imbalances, or just attribute generalization. And we have many data sets that fall into each of these categories. Some of them are medical, such as MIMIC notes or chest X-ray data sets; some of them are not medical at all, just large machine learning data sets that people use as standard benchmarks when they make new algorithms. Well, we benchmarked all of the state-of-the-art algorithms we could find that perform really well on these tasks in published papers. And we found that these existing algorithms really do improve on spurious correlation and class imbalance, but they really don't improve on the other shifts, because the shifts often get lumped together when people evaluate their models. This is a problem, because we have to improve both representations and classifiers to get to the better attribute-generalization results that we don't currently have with state-of-the-art models. A final note if you use some of these subpopulation shift papers: they often optimize for worst-group accuracy, but that is often inversely correlated with worst-group precision, which is maybe the thing that matters more in many healthcare applications. I'm going to skip the content for the last part because I don't want to cheat the panel.
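The tension between worst-group accuracy and worst-group precision can be seen in a small made-up example (the labels and groups below are invented to show that the two metrics can pick different worst groups, which is all the point requires):

```python
def group_metrics(y_true, y_pred, groups):
    """Per-group accuracy and precision (precision of the positive class)."""
    out = {}
    for g in set(groups):
        idx = [i for i, gr in enumerate(groups) if gr == g]
        correct = sum(1 for i in idx if y_true[i] == y_pred[i])
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        pp = sum(1 for i in idx if y_pred[i] == 1)
        out[g] = {"accuracy": correct / len(idx),
                  "precision": tp / pp if pp else float("nan")}
    return out

# Group A: high accuracy (0.8) but half its positive calls are wrong (precision 0.5).
# Group B: lower accuracy (0.6) but every positive call is right (precision 1.0).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1] + [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0] + [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
groups = ["A"] * 10 + ["B"] * 10

m = group_metrics(y_true, y_pred, groups)
worst_acc = min(m, key=lambda g: m[g]["accuracy"])    # 'B'
worst_prec = min(m, key=lambda g: m[g]["precision"])  # 'A'
print(worst_acc, worst_prec)
```

An algorithm that spends all its effort raising group B's accuracy can leave group A's precision, the reliability of its positive calls, untouched or worse, which matters when a positive call triggers treatment.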
The last thing that I'll say is that we've done some evaluations where we've deployed models and asked specific loaded questions in high-stakes decision-making settings. We've given biased GPT advice to people in different ways, and we found that whether you use a biased or an unbiased AI sometimes matters less than the exact way you give the advice to people, whether with descriptive or prescriptive advice. Here we found that clinicians listen to biased prescriptive advice, but not to biased descriptive advice. And if we want to get to safe integration, we should take lessons from places where there are safe technology deployments, like aviation. There are many federal agencies that could have a role in regulation, advice for technology, and recommendations for developers on how to safely integrate AI into healthcare spaces. But there are some things that are going to be unique to health that we need to figure out as a community, in panels like this, namely the inequities in underlying data processes that will be learned and automated. Gender concordance increases a patient's probability of heart attack survival, and that effect is driven by increased mortality when male physicians treat female patients; that would be like if male pilots crashed more when they were flying with female passengers. Or patient-physician race match improves medication adherence, among many other things; that would be like if Black patients only got in-flight safety announcements when a Black pilot was flying. Or a majority of Muslim women experience poor-quality maternal care services with indications of stereotypical or discriminatory behavior; that would be like if most of those passengers were always randomly checked by the TSA, which is true. That one's in there for me; that's actually true, just FYI. There are no simple fixes here. This is going to be an ongoing process.
We should consider sources of bias in our data, evaluate models comprehensively, and recognize that not all gaps can be fixed. This is work from a fantastic team. Thank you so much. Thank you so much, Dr. Ghassemi. All right, so now it's time for discussion. This is my favorite part. Please put questions in the chat if you're online; here in the room, if you have any immediate reactions or things that you're excited about, feel free to jump right in. Well, I'll ask a question. This one is probably for Dr. Ohno-Machado. I was really excited to see you thinking about the incorporation of other types of data into those PRS models that you talked about, and I'd love to hear if you have good examples of that working. Where do you see that happening? Where is it actually being used, or is it being used, and where do you think it might have the best implications for use? Yeah, I think currently it's more the use of social determinants than environmental data, to be completely clear. But the concept is the same, right? If we want to individualize, we should get as many specifics as we can about the person, which of course, in order to build models, would require even more and more data; every variable we add implies a lot of observations to be added as well. But I think with healthcare algorithms as well, AI was very instrumental in bringing the issue to attention, even though it has been there all along with simple statistical models too. There is currently a lot of interest in understanding more about the biases created by algorithms and how to regulate them, and how to mitigate or eliminate those biases. So it's an area of active research right now. I would say algorithms that have been historically used in healthcare are being rethought as to whether they are fair and whether they are actually improving care rather than creating more disparities. Great question. Oh, yeah. Fantastic presentation.
I have a question for the last speaker; sorry, I don't know how to pronounce the name correctly. It's very interesting that you showed the example of how the language model generates different things for different races, and I've read some papers in the same direction. But the issue is that debiasing machine learning models becomes a little bit difficult when large language models come along, because, as you also noted, when this kind of large language model can be used on a wide range of tasks, the bias will be propagated to a lot of different downstream tasks, and the debiasing becomes more difficult. Do you have any suggestions in that direction? Sure. So I think what you're asking is this: in machine learning you can decouple training a representation, which could apply to many different downstream tasks and which large language models are an example of, from training, say, a classifier that you apply on top of the representation. And when we train representations right now, like large language models or other large, high-capacity models that contain biases, it may be very difficult to remove those biases, because the representations are going to be applied to many, many downstream tasks, so it's difficult to do a one-spot fix. Is that accurate? Yes, yes. Okay, great. So there are a few papers; we've looked into this a little bit, but there are also some topology papers, actually, at machine learning conferences, that focus on the case where I know there is some statistical bias in my data that is not desirable, or in a part of this larger ball that I'm learning. Say my representation space is a big sphere, and I know some place in the sphere is undersampled, or is shaped weird, or is just kind of spiky because it's not smooth.
I haven't seen tons and tons of examples there, and so I don't have a smooth space that I can generalize on. Maybe there are ways for us to use augmentation: take an example from a region where there are very few examples and make small perturbations of it, so that we fill in that part of the space and get fewer anomalies. There's also other work that, instead of trying to generate data that addresses this, either by actually sampling it from real people or by augmenting the data you have, focuses on just knowing when your model is out of distribution. Somebody hands you a trained model; how do you know when it is, and the popular term for this is hallucination, in an extrapolation regime? How do you know whether it's interpolating between answers it knows or is out beyond anything it has seen? There are some works focusing on that too, so you could think about it from either side. Thanks. And we have a question from online, from Claudio Sorrentino, who asks: how would you suggest applying the development of AI models to evaluate cumulative impacts of pollution on communities? I might throw that one to you, Dr. Bryce. So, you need to do two things. First of all, you need to be able to understand what those cumulative impacts are, so you need data to understand the breadth of the environmental risk factors and how to quantify those. And then you need access to the health data to look at the relationships there. Those are the challenges I think we're talking about today: how do you integrate across those two dimensions? I think it's hard to do right now, but I think it's where we're trying to get to.
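The perturbation-based augmentation idea mentioned above can be sketched in a few lines (toy two-dimensional features, an assumed Gaussian noise model, and an invented "rare" label; real work would tune the noise to the data and validate that the synthetic points stay plausible):

```python
import random

def augment_rare(examples, labels, rare_label, factor, noise=0.05, seed=0):
    """Oversample an undersampled class by adding small random
    perturbations of its examples, filling in that region of feature space."""
    rng = random.Random(seed)
    new_x, new_y = list(examples), list(labels)
    rare = [x for x, y in zip(examples, labels) if y == rare_label]
    for _ in range(factor):
        for x in rare:
            new_x.append([v + rng.gauss(0, noise) for v in x])
            new_y.append(rare_label)
    return new_x, new_y

X = [[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.2, 0.1]]
y = ["common", "common", "rare", "common"]
X2, y2 = augment_rare(X, y, "rare", factor=3)
print(y2.count("rare"))  # 1 original + 3 perturbed copies = 4
```

Each synthetic point sits in a small neighborhood of the original rare example, which is exactly the "fill in the sparse part of the sphere" intuition from the answer.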
Going forward, that means having access to environmental data in a comprehensive way, as well as the health data that we've just heard about. So how do you predict somebody's risk of heat stress, for example, and then how do you use that in a clinical setting to minimize their risk? If you know that somebody doesn't have air conditioning, and you know somebody lives in an inner city where it's hotter, and you know somebody's physical condition puts them at risk, and you know something about the temperature they're being exposed to over a period of time: how do you collect all those data, how do you integrate all those data, and then how do you do something to minimize any heat stress or hospitalization or ED visit that might be a consequence of that? And to go back to a point from earlier, once you figure something out, how do you go back and double-check and evaluate whether it's working or not? That's just one example, but I think you're asking a good question, and that's part of the challenge before us today. I was thinking about that too. One of the things that can be the focus is this individualized approach, but the word "communities" in that question was really interesting. I always come back to the idea of Flint, Michigan: if we had AI looking for trends in blood lead levels, or in lead levels in the water, could we have prevented something like that, and what can we be doing to prevent something like that in communities or populations? I don't know if you want to comment. Just a quick example; I think that's a great idea. We collect a lot of environmental surveillance data, and I'll put blood lead in there as environmental surveillance data, even though it's a human-based measurement.
And then we don't do a good job of mining it, and we don't do a good job of doing exactly what you said: using it to anticipate when a risk is about to happen, rather than waiting a year or so after the switch to show that the blood lead levels, oh yeah, they went up. We need to be able to use the data to anticipate that problem, as I said before, and that's a great example. And then also, Claudio, thank you for your question. The next question you posed is: can AI help identify critical parameters to evaluate and predict cumulative impacts? This term "cumulative impacts" might not be well known by everybody, so I'll take a step back and explain my perspective on what cumulative impacts are. It's the idea that communities that may be unfairly impacted by one environmental issue are often unfairly impacted by a lot of environmental issues. If you're exposed to high levels of lead in your drinking water, there's a good likelihood that you're also exposed to high levels of crime in your neighborhood, or to higher levels of unemployment. All of these different factors become cumulative, and they may add up in ways that are not additive but more like exponential: the impacts they can have are more than you would expect if you simply added those risks together. What I think Claudio is trying to get at is that we haven't been good at this in risk assessment; in environmental health risk assessment we tend to look at one chemical or one exposure at a time and estimate risk. So can we use AI to look at all these risks together? It's sort of like that exposome type of approach. I'll just throw that out to anybody on the committee who wants to answer, and Claudio, I hope I didn't take your question in the wrong direction. Yes, Darrell.
Yes, Megan, I wanted to second that motion. We do have frameworks that we're building, with the support of NIEHS of course, but more often than not these chemical and nonchemical stressors, as Lucila I think talked about as well, exacerbate one another. And so there's the polygenic approach in terms of the risk trajectories that we should be looking at and turning into a risk model, but on the cumulative side, we've been talking about that since Charles Lee in the early 2000s, and I'm happy that we are finally getting there. Yes, we're definitely moving in the right direction, but I see a lot of opportunity for AI to help with it; I don't know if that's happening yet. Well, no, and I worry, from the standpoint of underrepresented communities, about the medical records and the inherent bias. I mean, we're covering it, we're getting there, but we aren't moving fast enough. The speakers did a wonderful job, though; I'm really hopeful now that we can close the gap. That's great; hope is a good thing. There's a great Langston Hughes quote, or I think there's a Maya Angelou one about the caged bird. Is that right? My poor English teacher. So I think the question hits at the heart of what we're trying to talk about, which is data integration: how do you combine all these data to identify all these cumulative stressors? They're sitting in different places, and they're accessible in different ways. They have different data structures, different time frames, different geographic frames. So this is a perfect place to think about how we integrate that. Yeah, and I'd like to add to that. When we're thinking about these cumulative exposures, there are the action items, and thinking through very carefully our methods and our techniques: which levers do you pull if you're measuring everything from employment opportunities, commuting outcomes, on-road emissions, water quality,
pollution, crime, stress, indoor air pollution, outdoor air pollution, where your schools are, and what your schools are exposed to when children are spending eight hours a day at a school built right on top of a 10-lane highway. So AI, or just big data, allows us this data integration. But as policymakers, as clinicians, as local community organizers, advocates, and so forth: which levers have the greatest pull, and how do we integrate these and accurately tease out where we can make the biggest impacts, where our biggest hotspots are? And that comes back to the methods and to finding these various correlations. So I just wanted to add that fine point about what our end goal is once we put these things together. I agree, yeah: how do we actually impact health, how do we make a difference with limited resources, where are we going to drive those resources? Absolutely. Great point. Thank you. What I might say on the health side is that sometimes people try to propose technology as a solution where it's not a good solution. Even if we look at the lead levels in Flint, Michigan, right: that was the result of state emergency managers changing the city's water supply. Maybe we could have predicted that this would happen, but I don't know if an AI telling them, "Hey, do you realize that these old lead service lines are dangerous?" would have changed their action, because this was a cost-cutting measure. So I think one of the things we need to understand is that AI is this really great, really powerful tool for data analysis, for giving us better information; we're not always great at using that information when we get it. And there's a big push in the machine learning community right now. Some of you may have heard there are sort of two camps: this alignment camp that says we need to align AI with human objectives, and then there's
this other camp, called the safety community, which says, well, we shouldn't just try to align it; we should use it as a tool and use it safely. I think these are two different mindsets, but they're both hinting at the same thing: unless we know what we want from AI as a society, people will use it for whatever they want, and that might individually create really poor effects. I would agree, and in the medical area there is also this: you need to know why you're creating a model, and whether you're listening to it or not, and if not, again, why. So it's not simply saying, well, we could have detected this and the fact is we didn't; the fact is also that in many cases we didn't act, not in this specific one, but in many cases, because there are so many false positives among other signals that we could not have acted on each one of them. So I think the whole area of post-marketing surveillance of devices and medications, or safety surveillance in general, has been evolving, even before AI became more popular. But it's always caught in this dilemma: do we have the ability to respond to signals, and do we have enough specificity in these alerts and alarms that we can act upon them? It's a very tricky area. So we have some more questions from online. Olivia Harris asks: have there been AI analyses of situations where most medical errors occur, or would fatigued-healthcare-provider factors overwhelm the analysis? I can say that there have been several studies of medical errors in many areas prior to AI. Again, AI analysis would be possible as long as there is enough labeled data, I would say, because just counting on outcomes would not necessarily indicate the errors, right? Sometimes you make the right decision but the outcome is not favorable. So I would say there have been analyses, and many of them indicated fatigue being a factor, or poor user interfaces, and things like that.
And I think that's exactly right. We know this is a problem without AI. And so one of the things that we have to think about is that when we train models, we're training them on the data that exists. And I love her point that it would need to be labeled. I can imagine asking a hospital to go back through their records and label every single decision made by a clinical staff member that could amount to malpractice; I don't think that would happen. And so one of the things we have to consider is that machine learning either runs on millions or billions of examples with no labels and learns, "this is the way you act; you haven't told me how to act, but I see how you do it and I'll just do it that way," and we accept that it's going to sometimes act like a fatigued doctor, or a doctor who used a poor user interface, or a doctor who didn't have what they needed, whatever those situations are, and make a mistake; or we have to decide that we want to go back and somehow create an accounting of the things that should not happen, and if they ever happen in the data, not use that data. But that accounting is not in the data that gets sold to private companies for the models they create. And even without large language models or any high-capacity models, even if you're training just a regression or an XGBoost model, you're training with error in there, and maybe with error that affects different people differently. David, did you have your hand up? So I wanted to dig a little bit deeper into Laura's question, because I was actually formulating a very similar question. When we are trying to use AI to integrate all of these extremely highly correlated factors, and Dr. Ghassemi, you touched a little bit on this in your presentation, how much can we use this to tease out what is just a spurious correlation? People of an underrepresented minority are also living in low-SES communities, in high proximity to traffic because of different policies; all of these are going to show up as drivers, so how do we identify which one we want to target for intervention? So, I think there are maybe two parts to this. One is: what data do we even have? For a long time we didn't have any social determinants data, and so you could not use it. Now we do in many data sets, thankfully, thanks to efforts like the All of Us data set, which I'm a huge fan of; there are social determinants questions in that data. So I think we're starting to look at whether we can integrate that data into an analysis. We have a paper coming out in a month at the AIES conference, one of the AI ethics and society conferences that the community runs, demonstrating that in hospital tasks specifically, acute care tasks, social determinants data that we found at the state level in Massachusetts did not improve prediction in the general population. It did improve the specificity of machine learning models on specific tasks for minority groups; for example, when looking at data and trying to predict outcomes for diabetic patients when they're hospitalized, having social determinants data did help a lot with specificity. But as a general thing you throw into a model, it's not a simple solution, because you have to remember that things like self-reported race and social determinants are deeply proxied in data, and they are deep proxies for other things in data. So in that CRRT example where we made that checklist, you can't see what the proxies are or what they could be.
It was still an incredibly biased checklist, and until we included explicit fairness constraints, that was not addressed. So this is an ongoing area of research that we need to work on. We have tried, as researchers, to decouple information about self-reported race or sex at birth or social determinants from clinical data, and then take this decoupled, sort of washed data and use it for clinical tasks. And the models don't perform as well in all cases, because for some data there's information that's being deeply proxied in different parts. So we have a couple of papers we're working on now, on the purely machine learning side, just trying to understand: how do you remove enough information about somebody's self-reported race, for example, or sex at birth, if that's information you don't want used, or about their social determinants, if you don't want stratified care and you don't want to learn these weird social patterns we have, or our systemic injustices, while retaining enough information that you can actually perform the clinical task that you care about? It's a very hard question, but it's one that more diverse, open healthcare data, like the All of Us data set and others, would be extremely helpful with. I'll give you an example, too. In the COVID crisis, we analyzed data from 14 health systems, large ones, including the VA, and we were showing that among hospitalized patients, the mortality of patients identified as Hispanic was lower than the mortality of Caucasian patients, for example, which went against everything else we were hearing.
So I want to emphasize the importance of having experts who know the data and who can explain certain things, because once we did the multivariate analysis, it was clear that age was the factor that was really leading to higher mortality, and it happened that the white population was older than the Hispanic population in that case. With age being the major driver, if the Hispanic patients who were hospitalized tended to be younger, you would see lower mortality in that group anyway. So the interpretation of all these models, of what AI or some modern predictive model is telling you, is something we need to be extremely careful about. This whole movement of liberating data is interesting and important, but misinformation is everywhere, and misinterpretation of models would be even worse, because now they are data-driven and someone is using the data and just interpreting it in the wrong manner. Lucila, with that data you just referred to: is that at the county level, or census tract, or block group? Those were hospitalized patients with electronic health records. Okay, I was just trying to figure out your source. Yeah, and for those patients, access to care was no longer an issue, because they were already hospitalized. Right. They received care uniformly; that's what it seemed to indicate to us, that adjusted for other factors there were no mortality differences. So I think we have time for maybe one last quick question, and I'm going to change it a little bit. Dipon John Amalek asked whether historical databases on stressors and biomarkers should be developed using routine health checkup data and environmental data. And that was one of the questions I've had as we've been talking: everybody says All of Us is great, but there was also a comment that All of Us doesn't really have a lot of environmental data in it. So, yeah.
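The age-confounding effect described above can be reproduced with hypothetical numbers (these counts are invented for illustration, not the actual study's data): the age-specific mortality rates are identical across groups, but because the age mixes differ, the crude rates point in opposite directions.

```python
# Hypothetical (deaths, patients) by group and age stratum.
# Same mortality within each age stratum; different age mixes.
cohort = {
    "white":    {"young": (5, 100),  "old": (180, 900)},
    "hispanic": {"young": (45, 900), "old": (20, 100)},
}

def crude_rate(group):
    """Overall mortality rate, ignoring age."""
    deaths = sum(d for d, n in cohort[group].values())
    total = sum(n for d, n in cohort[group].values())
    return deaths / total

def stratum_rate(group, stratum):
    """Mortality rate within a single age stratum."""
    d, n = cohort[group][stratum]
    return d / n

print(crude_rate("white"), crude_rate("hispanic"))  # 0.185 vs 0.065
print(stratum_rate("white", "young"), stratum_rate("hispanic", "young"))  # both 0.05
print(stratum_rate("white", "old"), stratum_rate("hispanic", "old"))      # both 0.2
```

The crude comparison makes the Hispanic group look far better off, yet within every age stratum the groups are identical; stratifying or adjusting for age, as the multivariate analysis did, is what reveals this.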
So maybe that's thinking about next steps, right? That's the whole point of this workshop: how can we advance it? Yeah, I think that's exactly right, and that's the challenge before us. How do we get better environmental data, and then how do we get it into a system that can interact with health records? Take childhood asthma, for example. If you talk to clinicians, they'll say the kids who have uncontrolled asthma are the ones who don't have access to a physician and don't have access to medications. And then of course you talk to environmental health scientists, and they'll say it's because they've got cockroaches in their house, and bad air quality in their house, and they're exposed to violence in the community, and huge stressors, and all these environmental determinants of asthma. And we don't really sort this out very well, because those two data sets don't coexist in a way that allows us to look at the whole constellation of drivers. It's obviously some combination. And speaking back to David's point before, we could use that to determine what's the best bang for the buck. There are studies showing that putting an air cleaner in someone's home is as effective as putting them on controller medications, and an air cleaner costs a fraction of what controller medications do. But the health system won't pay for an air cleaner in someone's home. So how do you build the system that allows us to ask questions like that and answer questions like that? That, I think, is why we're at the table.
That was great; I feel like that was the perfect summary. We have one minute left, so I might just draw us to a close, unless either of the other two panelists wanted to throw something else in quickly. I just want to agree with Allison Motsinger-Reif's comment that the All of Us program is working hard to incorporate environmental data. Yes, and David said the same thing in the room. Yes. Thank you. Well, perfect. So thank you so much to our panelists and to the folks online and in the room for a great conversation. I think this is really setting us up for diving deeper into a lot of these issues and coming up with some plans for what we can actually do to better integrate all of these different fields. So I think we have a five-minute break. Enjoy. I'd like to welcome everybody back to session two. The title of the session is Leveraging AI/ML for Environmental Health and Biomedical Data Integration. Let me introduce myself first, before the speakers. My name is David Reif. I'm chief of the Predictive Toxicology Branch at NIEHS, and what we do is very relevant to this workshop. Our branch's expertise is in data science, toxicogenomics, spatial-temporal exposures, computational methods, and new approach methodologies to advance predictive toxicology applications. Prior to joining NIEHS last fall, I was a professor of bioinformatics at North Carolina State University, where my lab was focused on data integration for environmental health sciences and toxicology. So I'm delighted to be part of this workshop and really look forward to this engaging session. Our objective with session two is to provide research studies and use cases of, first, how integrating environmental, biomedical, and health data is important for understanding human health and disease, and second, how ML can be used in the pipeline for these data. For this session we're going to have three presentations, followed by about 30 minutes of Q&A panel discussion, similar to the last session.
So we'll try to keep each speaker to about 10 to 12 minutes for the presentations, but we do have buffer time built in at the end, so I hope we have time for a good, robust discussion. I'll start by introducing our first speaker. We're very lucky to have Dr. Chirag Patel, who's an associate professor in the Department of Biomedical Informatics at Harvard Medical School. His research interests include developing multi-scale computational and data science methods to dissect the role of environmental exposures and genetic factors in complex traits and disease, with an emphasis on the trajectory from obesity to diabetes and its complications. Dr. Patel is recognized as a leader in exposomic science, developing methods to map systems of dietary and environmental exposure factors with disease. So I'd like to turn it over to Dr. Patel. Thank you. Thank you, David, and thank you all for having me. It's an incredible honor to be here to talk about multimodal AI, an extremely timely topic given the emergence of some of the approaches that we've seen earlier today. You'll find that I'm reiterating a lot of the themes that we've talked about already, so hopefully towards the end I'll point out some things that we can add to the field here, and complement those awesome talks we heard before. As you know, for us in informatics it's been a deluge of data coming from biobank-scale resources, things like UK Biobank, now All of Us, and existing resources such as NHANES and the CMS Medicare data. There are a lot of opportunities that we should discuss on how to make them useful, in particular with multimodal data coming from the exposome, genome, and phenome. We've seen great examples in the literature from genomics and phenomics, so can we leverage these data sets, along with multimodal AI techniques, to make them useful?
We've talked already about a lot of approaches in clinical AI, with some fantastic talks earlier. So the key questions that clinical AI people are asking, just to reiterate, are how to integrate data that may come from diverse places, such as one's healthcare institution, like an acute care setting, and how to mash up things that are high-frequency time series or low-frequency time series, when you think of time to death; things like tabular data, spectral data, 1D spectra, 2D spectra, free text. So you can see the diversity of data being tackled in the current approaches for looking at multimodal data. And the current approaches might use some cool new ideas in neural networks to create a new data structure called an embedding. Essentially it's a new way to represent data that you can then input into your favorite algorithms of choice to make some sort of learning algorithm work or make some sort of decision. So a question that emerges is, first of all, when you add new modalities of data, do they actually help with prediction? There's evidence coming from our colleagues in clinical AI. Here's a fantastic paper from Soenksen and colleagues, and I think we'll hear more about this when Eric Topol gives his keynote, showing how predictive capability, here you're seeing AUC curves, in fact increases for several outcomes captured in the MIMIC data set used for machine learning research, looking at high-frequency data coming from different modalities. And you can see from the AUC curves that when you increase the number of modalities for these different outcomes, you in fact get a bump up on average. But it also shows that the benefit is diverse; it depends on the type of outcome you may be looking at. So can we tease apart the outcome from the modality in dissecting when our predictive capability increases?
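The "more modalities, higher AUC" pattern can be sketched with simulated data (hypothetical effect sizes, not the paper's models): two modality-level risk scores each carry part of the outcome signal, and even naive late fusion beats either one alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Two hypothetical modality-level risk scores (e.g., an embedding summary
# from clinical notes and one from labs), each weakly related to outcome.
m1 = rng.normal(0, 1, n)
m2 = rng.normal(0, 1, n)
logit = 0.8 * m1 + 0.8 * m2
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

def auc(score, y):
    # Rank-based AUC: probability a random positive outranks a random negative.
    order = np.argsort(score)
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    n_pos, n_neg = y.sum(), n - y.sum()
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

auc_one = auc(m1, y)          # single modality
auc_both = auc(m1 + m2, y)    # naive late fusion of both modalities
print(auc_one, auc_both)
```

The bump depends on how much independent signal the second modality carries about this particular outcome, which is exactly the outcome-versus-modality question raised above.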
For us in environmental health studies, we're dealing with this complex exposome type of data set. The data might come from targeted mass spec, which may be tabular, or we may be looking at the spectra, like biomonitoring of lead. We're looking at geospatial and digital markers at the area level; we're seeing 2D spectra coming from images, which many of the talks later tomorrow will cover, aggregated to the zip code or even the personal level. We have self-reported questionnaire information, very important, that is also tabular, giving us behavioral information; new modes of getting behavioral information from sensors; and of course untargeted mass spec. So there are diverse types of data, diverse examples here, that we need to integrate across to achieve this idea of the exposome, of analyzing things in the multitude to in fact increase predictive power. Here's a great example of looking at the blood exposome of small molecules that might be indicative of environmental exposure. Complementing this are geospatial markers and other analytes that may soon be commoditized in addition to those markers. And of course there's geospatial information emerging from satellites. Again, diverse multimodal data sets; how do we stitch these things together? So first, I'll show a couple of examples of how people are doing this. Here's one from our friends in nutritional epidemiology and AI. Here, Ravi Shah and colleagues were able to look at what we call the internal exposome, the metabolomic signatures that are related to dietary measures, and then connect those signatures to cardiovascular-related traits like incident CVD.
They also used these metabolomics measures, the internal exposome, how the exposures elicit a detectable biological response, to derive a pattern of dietary exposures using this idea of canonical correlation analysis, and then connected those patterns to incident CVD and CHD. When I saw this I thought, wow, you could actually get something that's trained on dietary behavior from biological measures of the internal exposome, or metabolome, and once trained, it can actually do better at predicting the outcome. There is some biological signal that emerges from essentially predicting these behaviors. That's what this figure from Ravi Shah and colleagues is trying to show: when they predict dietary behavior based on these metabolomic measurements, and then go forward to predict disease, they actually do better than the self-reported recall measures themselves. They used this idea of dietary patterns via canonical correlation analysis, a very old-school approach, but one that has found new uses given these new sets of data. They used this way of finding a pattern, looked at how the patterns relate to the Healthy Eating Index, and found that in certain circumstances they might also be more predictive of diabetes and CVD. We, and committee members like Allison, have developed the polyexposure score, a supervised approach to developing a score that summarizes exposures across a diversity of domains. Here we used one to look at undiagnosed diabetes in the UK Biobank, and our colleagues used the polyexposure score to conceptually validate what we had found. And what we found, and I think this is key for us looking at multimodal AI techniques, is how to compare it to the state of the art. So one thing we did was look at polygenic risk scores, and we found the polyexposure score did better with our measures.
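Backing up a step, the canonical correlation analysis mentioned above can be sketched in a few lines with synthetic data, where a shared latent factor plays the role of a dietary pattern driving both the recall items and the metabolites (everything below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical shared "dietary pattern" latent factor driving both views.
latent = rng.normal(0, 1, n)
X = np.column_stack([latent + rng.normal(0, 1, n) for _ in range(4)])  # recall items
Y = np.column_stack([latent + rng.normal(0, 1, n) for _ in range(3)])  # metabolites

def first_canonical_corr(X, Y):
    # Classical CCA via QR: the canonical correlations are the singular
    # values of Qx.T @ Qy for orthonormal bases of the centered views.
    Qx, _ = np.linalg.qr(X - X.mean(0))
    Qy, _ = np.linalg.qr(Y - Y.mean(0))
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0]

rho = first_canonical_corr(X, Y)
print(rho)  # first canonical correlation, driven by the shared latent factor
```

The first canonical pair recovers the shared latent pattern from both views at once, which is why this "old-school" technique pairs naturally with metabolomics-plus-questionnaire data.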
But of course you also want to compare to how people may be screening populations in the clinic, using blood pressure, family history, glucose. And we're not as good as the clinical risk scores. But you could ask how well you can reclassify individuals who already have some latent risk for diabetes. Maybe their blood pressure is very high, their glucose is in the prediabetes range. You can ask: given the clinical risk data you might have, how much does adding the exposures actually add to reclassification? We have some evidence that it's doing much better than genetics, and it might even be helpful for identifying individuals who have not yet been screened for diabetes with the gold standard measures, that is, undiagnosed diabetes. We're building on these approaches by looking at risk factors that are already strong for particular outcomes, like COPD, and looking at how the other exposures might push people over the edge, for example if you are already smoking and already at high risk. If you start incrementally adding this information, how much better can you do in predicting things like lung function? And we'd say that in the UK Biobank we're comparable to using a polygenic risk score across smoking status for COPD. I think the use of deep learning technologies is also going to be very useful in the integration of exposome information. So, a totally different approach, looking at new phenotypes for aging. Here what we did was ask: can we predict abdominal age, or biological age, given magnetic resonance images from individuals in the UK Biobank? Here are examples of images, and we applied deep learning convolutional neural networks to these data to ask how old these people are, and then computed the residual, given the prediction and their actual chronological age.
Is there any biological signal in that residual? In fact we did find some biological significance. We ran a GWAS against those residuals, but we also ran an ExWAS against that residual, and found that lifestyle factors such as pack-years of smoking and alcohol all connected with this accelerated age as measured by the deep learning algorithm, and exercise behaviors and accelerometer-based physical activity all connected with decelerated abdominal age. That unpacked new ways of looking at the data. For example, does this exposome that intersects with pancreatic or abdominal aging actually come first, before diabetes, or is it etiologically independent? These same tools can be used to look at accelerometer data, as we've done here; I show this as an example where we looked at biological aging with respect to how people exercise. I think one word of caution when looking at these very complex multimodal AI approaches: it's unclear how we even do the studies to begin with. The studies have a lot of things under the hood that we haven't really figured out. Here's an example from one-at-a-time simple epidemiology, where we looked at ingredients from the Boston Cooking School cookbook, and we see there's a lot of, if you will, vibration around these studies that leads to different relative risk estimates, from simple regressions. What happens when we start adding parameters to these models while trying to understand risk? Here, Jonathan Schoenfeld and John Ioannidis argued that there is weak statistical evidence behind many of these non-replicated, inconsistent effects. So the question is: do we exacerbate the problem by looking at multiple modalities and multiple parameters? This leads into meta-science, to elucidate the role of study design.
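This vibration-of-effects concern can be sketched as a multiverse analysis: refit the same exposure-outcome regression under every combination of candidate adjustment variables and watch how much the exposure estimate moves (all data and coefficients below are hypothetical, with one true confounder planted):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n = 3000

# Hypothetical exposure plus three candidate adjustment variables;
# z1 is a true confounder, z2 and z3 are irrelevant.
z1, z2, z3 = rng.normal(0, 1, (3, n))
exposure = 0.6 * z1 + rng.normal(0, 1, n)
outcome = 0.3 * exposure + 0.8 * z1 + rng.normal(0, 1, n)

adjusters = {"z1": z1, "z2": z2, "z3": z3}
estimates = {}
for k in range(len(adjusters) + 1):
    for subset in combinations(sorted(adjusters), k):
        cols = [np.ones(n), exposure] + [adjusters[v] for v in subset]
        beta = np.linalg.lstsq(np.column_stack(cols), outcome, rcond=None)[0]
        estimates[subset] = beta[1]   # exposure coefficient in this model

spread = max(estimates.values()) - min(estimates.values())
print(len(estimates), spread)  # 8 models; models omitting z1 are inflated
```

Estimates that swing widely (or flip sign) across reasonable adjustment sets are exactly the non-robust associations the screen described next is meant to flag.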
One approach is to throw the kitchen sink of compute against all these different factors and see which ones stick and which ones don't; the most robust findings should be very robust to the data assumptions we might make. So we asked the question for all-cause mortality, for example: across all these study design parameters, which could be different variables you're adjusting for in your model, do any of these affect the estimates, or the hazard ratios, you might get? In fact, for many environmental factors, if you're looking at all-cause mortality, some of them are very consistent, like cotinine, but some of them waver around the estimate. And this is generalizable across data sets; here's an example using UK Biobank. So the idea is that you could use this type of approach to screen for factors that are not robust to the assumptions you make in the modeling. For example, here's one that has a risk profile under some of your models and a protective profile under other models; you may want to say that this is not robust to the assumptions you might make, and filter it out from your future studies. So the technical opportunities and challenges for environmental health research are: how do we integrate across these diverse scales, such as geography, tissue, time, and model systems? How do we integrate findings from new approaches and compare to the state of the art? And what data sets and approaches are needed to establish benchmarks, which are very important to help us interpret findings for the field? I thank all of my collaborators and look forward to the discussion. Thanks very much for having me. Excellent, yeah, thank you so much, and especially because the theme of the session is having examples and use cases, I think we'll have a lot for our discussion later. I appreciate that very much. Thank you, Dr. Patel. And our next speaker is Dr.
Aidong Zhang, who is a full professor of computer science in the School of Engineering and Applied Sciences at the University of Virginia. She also holds joint appointments with the Department of Biomedical Engineering and the School of Data Science at the University of Virginia, and her research interests include machine learning, data science, bioinformatics, and health informatics. I look forward to the talk. Thank you, Dr. Zhang. Yeah, thank you for the introduction. Should I share my slides, or... Yes, I think what we're seeing up on the screen is your main slides; that looks correct. Yep. Okay, but should I share mine so that I can control it, or is this from the audience side? Lily, are we running the slides from the room, or should Dr. Zhang share? You can share your own if you'd like. Okay, someone did start sharing them for you, but it looks like you're pulling them up now. Yeah. Okay. I think you just want to go to display mode on the slides. If you go to the bottom right-hand corner, there's a little icon that looks like a screen; if you click on that, that's a way to get to presenter mode. Okay, I cannot get to that. We can share them if you'd rather, that's fine too, and then you can just say "next slide." Oh, well, let me see. Okay. I think it works now. That looks correct to me. Okay, sorry about that. So, thank you for the invitation to give this talk. I've been listening, and I think it's a great workshop to organize on this topic. I'm going to focus more on the fundamental part of how, as people have mentioned, machine learning can help with multimodal data. I will talk about the specific approaches that we have developed for multimodal learning. Basically, as all of us have seen in the previous talks, people have been integrating all different kinds of modality data.
These come from different sources, different sensors; currently there are all different kinds of sensors measuring all different kinds of data, and it's becoming a very critical issue how to integrate them. The previous talks showed that integrating multimodal data helps. We also have experiments showing, on multi-omic data, that when we integrate different modalities together, prediction performance increases significantly. But the issue is how to integrate these different modalities. The most important challenge, I would say, is missing data: within one modality you don't see that much of it, but when you combine different modalities, as you can see here, it is very common to have missing data across modalities. So how are you going to deal with this if you want to fuse the data for prediction or any other follow-up research? In the past, people have mostly used imputation. But now, with advanced machine learning models, we really should look beyond imputation at how we are going to fuse different modalities and do more effective follow-up research. That's what I am going to focus on. This is part of an NSF-funded project on multimodal learning for multiple-domain data with incomplete modalities. Our approach is, first of all, given input data with different modalities and incompleteness, to construct and represent the data in a way that avoids the missing part, and at the same time handles each sample in a unique representation so we can use machine learning to process it. People have used deep learning or other kinds of learning.
If you have missing data, you cannot feed it into a deep learning model, so you have to represent it in a way that can fit into the model. A graph model is a very popular strategy for that. So we propose to model the multimodality data as a graph; then we can use a graph neural network to map the data into an embedding representation for follow-up research. To be more specific, how do we handle different modalities with a lot of incompleteness in a graph model? That's the central part I want to talk about. For example, here we have three modalities, and with three modalities, across all the different combinations, you have seven different patterns of availability. With more modalities, and you may have hundreds, the missing-data structure becomes more complex; you will have more combinations of patterns. That's what I hope to show with this slide. When we deal with the missing data, we can treat each missingness combination as a different pattern. So each sample, for example this patient one, has only one modality and the two other modalities are missing; this is one kind of node, which we call a hyper node, representing one kind of pattern. Patient two has different missingness and thus a different pattern; this is another hyper node. Each node is represented as a hyper node according to which modalities it contains. I hope that's understandable. With that, you can represent each sample as a unique hyper node, and you don't have to impute the missing data. So now we have the nodes, and then we have to establish the links between the nodes. The hyperedges are established by calculating the similarity between the different hyper nodes using each modality's similarity.
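The hyper-node construction just described can be sketched as a simple grouping of samples by their modality-availability pattern, with no imputation anywhere (the data and modality names below are made up):

```python
# Minimal sketch (not the speaker's implementation) of grouping samples
# by which modalities they actually have, so no imputation is needed.
samples = {
    "patient1": {"mri": [0.2, 0.9]},                       # two modalities missing
    "patient2": {"labs": [1.1], "notes": [0.4, 0.1]},
    "patient3": {"mri": [0.5, 0.3], "labs": [0.9]},
    "patient4": {"labs": [0.2], "notes": [0.7, 0.8]},
}

# Each distinct availability pattern becomes one kind of "hyper node".
patterns = {}
for pid, modalities in samples.items():
    patterns.setdefault(frozenset(modalities), []).append(pid)

for pattern, pids in patterns.items():
    print(sorted(pattern), pids)
```

With three modalities there can be up to seven such patterns (every non-empty subset); here the four patients fall into three of them.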
So, for example, this node has a link to this one because this node has an MRI image and that node has an MRI image; you calculate the similarity, and if it's above a threshold, meaning they are very similar, you establish a link. That is how the hyperedges are formed: you link different nodes through shared modalities. For example, this one has the second modality and the third modality, so it links to this one and to this one. I hope it's understandable that in this way we can convert the multimodal data with incompleteness into a heterogeneous hyper node graph, and we don't have to deal with imputation of missing data. After we represent this as a graph, the issue becomes how to handle it in machine learning algorithms. It becomes a graph model as input, and in machine learning you normally map the original data into an embedding space, which is just a vector space, and then you do follow-up research, for example prediction or other decision making. So how do we do this? Basically, this is purely machine learning, nothing to do with the domain anymore. Each modality can be converted into a vector space: for images you can use a CNN, a convolutional neural network; you may use other kinds of neural networks if it's text data. So each modality maps into its own space. But each hyper node, remember, contains interactions between the different modalities inside it. We can use embeddings to capture that, using a neural network to generate embeddings for the interactions between the features. In the end, for each hyper node, you are able to produce a representation of the node itself, plus the interactions between the modalities inside the node. So, for example, this node has three different modalities.
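Stepping back, the hyperedge construction described a moment ago can be sketched concretely: compute a per-modality similarity between every pair of nodes, and add an edge whenever a shared modality's similarity clears a threshold. The vectors and the 0.9 cutoff below are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical per-modality feature vectors for three samples (None = missing).
samples = {
    "p1": {"mri": np.array([0.2, 0.9]),   "labs": None},
    "p2": {"mri": np.array([0.25, 0.85]), "labs": np.array([1.0, 0.1])},
    "p3": {"mri": np.array([-0.9, 0.1]),  "labs": np.array([0.95, 0.15])},
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Link two nodes when a modality they share is similar enough.
threshold = 0.9
edges = set()
ids = sorted(samples)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        for mod in samples[a]:
            va, vb = samples[a][mod], samples[b][mod]
            if va is not None and vb is not None and cosine(va, vb) >= threshold:
                edges.add((a, b, mod))

print(sorted(edges))
```

Note that p1 and p3 end up unlinked even though both have MRI, because their MRI vectors point in different directions; missing modalities are simply skipped rather than imputed.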
So not only does it have an embedding for the node, it also has the interactions between the modalities, which are captured by the embeddings inside. This is where AI and machine learning are very advanced now; we can do all of this automatically in the system. So after that, you have the hyper nodes and you have the hypergraph, and then it becomes a very technical neural network issue of how to convert this graph into an embedding space. I will not go into much detail, but at a high level, we can convert this heterogeneous hypergraph into a homogeneous graph, where each node is an embedding and the connections are also interpreted in the embedding space as near or far. Okay, so that's how you can represent your multimodal data in an embedding space, capturing all the relationships between modalities, and do follow-up research. In that way you don't have to do imputation. So does this work? The experiments here are on object recognition data sets rather than multimodal biomedical data, because in the published paper we needed to compare with the state of the art and baselines, and they all use these data sets. Basically, these are data sets with a large number of classes for recognizing objects and faces, and they all have different modalities; in this case we use two modalities, capturing each object's image in different ways. Combining the different modalities together, we show here the performance with only the complete data, and then with randomly dropping 30%, 45%, 60%, and 75% of the data, and how each model performs. You can see our model works very well even when dropping 75% of the data in the experiments.
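That dropping experiment can be imitated with a toy mask-aware classifier that averages only the observed features, with no imputation, and measures how accuracy degrades as more data is removed. Everything below is synthetic, far simpler than the graph model in the talk:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 10

# Synthetic data: each of the 10 features carries the same weak class signal.
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, d)) + (2 * y[:, None] - 1) * 0.5

def accuracy_with_drop(drop_frac):
    mask = rng.uniform(size=X.shape) >= drop_frac   # True = observed
    # Mask-aware score: average only the observed features -- no imputation.
    score = np.where(mask, X, 0.0).sum(1) / np.maximum(mask.sum(1), 1)
    return ((score > 0).astype(int) == y).mean()

results = {frac: accuracy_with_drop(frac) for frac in (0.0, 0.3, 0.6, 0.75)}
print(results)
```

Accuracy falls gracefully rather than collapsing, because the model only ever reasons over what is observed; zero-imputation baselines, by contrast, dilute the score with fabricated values.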
And you can see, as the proportion of missing data gradually increases, the accuracy drops; that's understandable for how the different approaches work. You can also see that all the other approaches actually impute zeros into the missing data, and that comparison shows how this approach performs. That is basically what I hoped to introduce: using advanced machine learning, we can fuse different modalities without imputation, and we can actually construct a new representation for a complex data set. I will not go into more detail, but you can see the paper we presented at the top data mining conference, KDD 2020. We also have a project on multimodal machine learning for data with incomplete modalities. This has been applied to different domains to show how effective the approach can be. So I will end here. In summary, multimodal fusion can be achieved in a graph structure, and as I've shown, it is effective. By modeling data in a graph, we can deal with heterogeneous modalities with incompleteness, and this shows the effectiveness. We have also worked out the distributed case, where you have different missing data, and the approach also applies there. So I will stop there. Wonderful. Yeah, thank you so much for the presentation. I look forward to the discussion about not just fusion of data but also fusion of machine learning and AI elements; there's a lot going on there that I'm looking forward to hearing more about, so thank you again. Before the discussion, I want to introduce our third speaker, Dr. Heidi Hanson. Dr. Hanson is group lead of the Biostatistics and Multiscale Systems Modeling Group in the Computing and Computational Sciences Directorate at Oak Ridge National Laboratory. Her research is focused on disentangling the interactions of genetic and environmental influences on disease risk throughout the life course.
She aims to link her findings to clinical measures that can be used to improve precision strategies for screening and treatment. So I'd like to welcome Dr. Hanson. Thank you. All right, thank you so much, and I'm honored to be here to present to you today. I'm going to start off my slideshow with this phrase, the power of large multimodal biomedical data, and I have a question mark there for a reason: a lot of the algorithms we're talking about are super data hungry, and they don't always transport well into other scenarios. This is the vision that we'd really like to have, where we can follow an individual over their life course and collect as much data as possible, so that we can identify who's at risk for disease early on and treat them, and then not stop following them at that point in time but really follow them through the course of treatment. But to do this well takes a lot of data, and to do this well across many different diverse communities takes thinking about things from more than just the perspective of a single place. This is just a look into some of the projects that we have going on at Oak Ridge National Laboratory. One of them is a partnership with Cincinnati Children's, where we are working with them to build a model of anxiety trajectories for children. The goal is for that model to be something rolled out into the clinic that can identify children at an early stage of anxiety diagnosis, so that the doctors can jump in and help treat them. We also have similar projects ongoing with the Veterans Administration, and one with the National Cancer Institute that I'm going to talk about in just a little bit. But you can see there's lots of data available, and when we start to combine it with the environmental data sets we get even more data, which is exciting, and a little bit overwhelming when thinking about how we deal with it.
That is still a question that we are dealing with. But that study is designed just for Cincinnati Children's, using data just from Cincinnati Children's. Today I'm going to walk you through something more on the infrastructure side that I'm very excited about, not that I'm not excited about that study. How can we start to think forward, so that as we're designing these things they have the ability to scale up? Can we start thinking in a way that allows us to design early on so that studies can talk to each other, and really start to build full population health models, rather than just models that are specific to different clinics? I know that's a hard question, and we're not going to be able to solve it in a day, a week, or a couple of years, but I really have been thinking about how we can do this in a broader, population-level way. So what's required to do this well? I'm not going to talk a lot about this. We have a lot of compute at Oak Ridge, but probably our biggest advantage is that we have a place with strong interdisciplinary team science. When we talk about multimodal data, we're talking about omics, imaging, the environmental exposure part, electronic health data. The person who knows all of those things really well is brilliant, and most of us maybe don't quite hit that level; we're still capable, but not quite at that level. So in order to do this very well, we need folks who are experts not only on the compute side but who have domain expertise, and expertise in their own specific problem set, to come together and talk about what it means to really do multimodal data analytics, and what kinds of things we should be thinking about so that the algorithms we come up with aren't biased and don't have issues when we actually roll them out into the clinic.
So I'm going to walk you through a use case. This is a collaboration we have with the National Cancer Institute called the MOSSAIC project: Modeling Outcomes using Surveillance data and Scalable Artificial Intelligence for Cancer. The reason I picked this project to highlight is that it is one of our biggest projects that is truly at a population level. The data for this project comes from the Surveillance, Epidemiology, and End Results (SEER) registries. If you don't know what those are, SEER registries are spread across the US; they cover approximately 47 or 48 percent of the population, and they collect all cancer data within their catchment areas, so anything that has to do with cancer incidence is being collected by a SEER registry. They're also involved in the National Childhood Cancer Registry, so through that partnership we get information on imaging, omics, and electronic health records. We're collecting a lot of data on cancer, and by "we" I mean the partnership; NCI is actually doing the data collection. The data we're training on includes data from Louisiana, Kentucky, Utah, Seattle, New Jersey, New Mexico, and California. We train our algorithms using those data, and then we test them using the full SEER data. Basically, what we're doing right now, and I'm going to walk you through some of the architecture because these problems are so big, is extracting information from unstructured data using large language models. We take pathology reports, which is what you see on the right-hand side, and we extract the information that SEER needs to do its cancer reporting: we categorize site, subsite, histology, laterality, and behavior. The reason I'm pointing this out is that this is now rolled out in production at the national level, so these algorithms are used on a daily basis in SEER and SEER*DMS.
We have built in uncertainty quantification. We're working in the health space, which means the stakes are high, and our algorithms will always give us an answer; whether or not we can trust that answer is a question we should be asking if we're talking about rolling things out into the clinic. If we're talking about causation, we can deal with that in a different way in epidemiological studies, but the science we're rolling out truly needs to incorporate uncertainty quantification, so we built it in. We don't predict on reports that we don't trust; we set the bar very high, at 98 percent accuracy across all data elements, and we predict only on the reports we trust. Right now, across the US, that means we can automatically code somewhere around 23 to 27 percent of pathology reports, so we can take that information, rapidly phenotype those cases, and then use that information for downstream tasks. That's the part that's really important here: we can extract information from one modality and then actually use it for downstream tasks. If we need interpretability, as was discussed in some of the other talks, this gives us that intermediate step; we can also use the embeddings directly. I'm not going to get deep into the details, but beyond the large language model we have in production, we are working on foundation models, and the reason that is exciting is that we can pretrain on all the data we have and then predict on many different types of tasks.
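The "don't predict on reports we don't trust" idea can be sketched as confidence-thresholded abstention: emit a label only when the top-class probability clears a high bar, and route everything else to a human coder. This is a minimal illustration, not the MOSSAIC implementation; the function names, threshold handling, and the toy histology scores below are all fabricated for the example.

```python
# Hedged sketch of confidence-based abstention. A real system would
# calibrate the threshold on a validation set to hit a target accuracy
# (e.g. 98% across data elements); everything here is illustrative.

def predict_or_abstain(class_probs, threshold=0.98):
    """Return (label, prob) if the top class clears the threshold, else None."""
    label = max(class_probs, key=class_probs.get)
    prob = class_probs[label]
    return (label, prob) if prob >= threshold else None

def coverage(batch, threshold=0.98):
    """Fraction of reports the model is willing to auto-code."""
    kept = [p for p in batch if predict_or_abstain(p, threshold) is not None]
    return len(kept) / len(batch)

# Toy batch: softmax outputs for a histology code, one dict per report.
reports = [
    {"8140/3": 0.99, "8070/3": 0.01},    # confident -> auto-coded
    {"8140/3": 0.60, "8070/3": 0.40},    # uncertain -> routed to a human
    {"8500/3": 0.985, "8140/3": 0.015},  # confident -> auto-coded
    {"8070/3": 0.55, "8500/3": 0.45},    # uncertain
]
print(coverage(reports))  # 0.5
```

Raising the threshold trades coverage for trust, which is exactly the auto-coding-rate trade-off described in the talk.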
So basically we come up with one foundation model specific to pathology reports, and we can start to predict on many different tasks. That is a shift from what was done before, where I would train one model for every single task. It saves on compute, it makes us more flexible, and it also enables what we're trying to do right now. We are partway through building this infrastructure; this is the vision for it. Essentially, we are trying to build a library of foundation models. What I mean by that is we have all these different modalities, so can we build foundation models for imaging data, pathology data, clinical text data, survey data, and social network data, have those models trained, and then predict different tasks and combine them in ways that are specific to the study or prediction model at hand? Right now we're in the stage of building some of those models, and we hope this will allow us to scale to different types of questions across the US and make something more reproducible and replicable. Let me talk a little more about what we're doing on the information extraction side from electronic health data. SEER is also linked to residential history data, which is super exciting to me. They have linkages from LexisNexis across 11 of their SEER registries right now, and there should be 15 by the end of the year. This means that for those same pathology reports, we have 3.2 million records with residential history. That residential history goes back to the 1980s, but it is high quality from 1995 to 2020, and 83 percent of those records are geocoded at the point location.
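The "one pretrained encoder, many task heads" pattern can be made concrete with a runnable toy. The hash-bucket encoder below is only a stand-in for a real pretrained pathology-report foundation model, and the training snippets and labels are fabricated; the point is the structure, where two tasks (a hypothetical "site" and "laterality" element) share one frozen encoder instead of each getting an end-to-end model.

```python
# Toy sketch of a shared encoder with per-task heads. A real system would
# use a transformer encoder; here a deterministic bag-of-words hash stands in.

def encode(text, dim=16):
    """Stand-in for a frozen, pretrained text encoder."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[sum(map(ord, tok)) % dim] += 1.0
    return vec

class TaskHead:
    """A lightweight nearest-centroid classifier trained on top of the
    shared encoder -- one head per data element."""
    def __init__(self):
        self.centroids = {}

    def fit(self, pairs):
        buckets = {}
        for text, label in pairs:
            buckets.setdefault(label, []).append(encode(text))
        for label, vecs in buckets.items():
            self.centroids[label] = [sum(c) / len(vecs) for c in zip(*vecs)]

    def predict(self, text):
        emb = encode(text)
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(emb, c))
        return min(self.centroids, key=lambda l: dist(self.centroids[l]))

# Two task heads reuse the same encoder instead of two end-to-end models.
site_head, laterality_head = TaskHead(), TaskHead()
site_head.fit([("upper lobe lung tumor", "lung"), ("left breast mass", "breast")])
laterality_head.fit([("left breast mass", "left"), ("right lung nodule", "right")])
print(site_head.predict("upper lobe lung tumor"))  # lung
```

Adding a new task then means fitting one more small head, not retraining the encoder, which is where the compute savings described above come from.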
We're working with NCI to start linking these to three data sets first: indoor radon exposure, RSEI microdata, and air pollution data, so that other people can start to do research using this information. Let me walk you through how we're doing this, because it's applicable across the many different data sets we're bringing into our environmental health data repository. Essentially, we are using Uber H3 hexes. The reason we chose them is that we can take point information or polygon information, overlay it with a fishnet of hexes, and then extract the information we need. One example is the RSEI microdata: we took the RSEI microdata, overlaid H3 hexes on top of it, and then summarized the value within each hex. The beautiful thing about H3 hexes is that you can scale them based on population size. What you see on the right side is that the bigger hexes have smaller populations; we make the hexes large in places with very little population, so that we can scale to the level we need to protect privacy, and in urban areas they're really, really small, which allows us to be specific about the types of exposure areas we're looking at. That's an example with the RSEI data, but what we're actually doing is dropping those hexes over every single data set we have. We can then build external exposome measures, stack them on top of each other, measure as many different things as we can, and start to describe the whole milieu of what you're exposed to in the environment. The other thing we do is drop a population mask over that: we use the LandScan data, which identifies where populations reside.
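The population-adaptive aggregation step can be sketched in a few lines. In production this would use the `h3` package's hexagonal cells; the stdlib stand-in below uses square grid cells (with a made-up doubling resolution scheme) so the privacy logic is runnable anywhere. The points, resolutions, and minimum-population threshold are all fabricated for illustration.

```python
# Sketch: aggregate exposure points into cells, coarsening any cell whose
# population falls below a privacy threshold, in the spirit of the
# population-scaled H3 workflow described above.
from collections import defaultdict

def cell(x, y, res):
    """Grid-cell id at a given resolution (cells halve in size per level)."""
    s = 2 ** res
    return (int(x * s), int(y * s), res)

def parent(c):
    i, j, r = c
    return (i // 2, j // 2, r - 1)

def adaptive_cells(points, fine_res=3, min_pop=3):
    """Aggregate (x, y, value) points; merge any cell with < min_pop people upward."""
    counts = defaultdict(list)
    for x, y, v in points:
        counts[cell(x, y, fine_res)].append(v)
    out, pending = {}, dict(counts)
    while pending:
        nxt = defaultdict(list)
        for c, vals in pending.items():
            if len(vals) >= min_pop or c[2] == 0:
                out.setdefault(c, []).extend(vals)   # safe to publish
            else:
                nxt[parent(c)].extend(vals)          # too few people: coarsen
        pending = nxt
    return {c: sum(v) / len(v) for c, v in out.items()}

points = [(0.1, 0.1, 10.0), (0.2, 0.2, 20.0), (0.9, 0.9, 30.0)]
print(adaptive_cells(points, fine_res=1, min_pop=2))
# {(0, 0, 1): 15.0, (0, 0, 0): 30.0} -- the lone point was coarsened upward
```

The result mixes resolutions exactly as described: dense areas stay fine-grained, sparse areas are published only at a coarser, privacy-protecting level.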
That way we can throw out information where nobody is living and only store information we actually need based on where populations live. The beautiful thing that allows us to do is assess missingness. What you're looking at on the right-hand side is data from individual home radon test kits, and there is selection bias in what we actually see: for us to get a measurement, someone has to test their own home, which means certain areas or certain populations may not have testing at all. We might assume they don't have high radon exposure when in fact we simply never get those measures. The population mask allows us to identify areas, in green here, where populations live but no testing is happening; those may be targets for outreach, or things we need to account for in our modeling. And again, similar to what we're doing with the electronic health data, we're building libraries of foundation models for environmental exposures that we can use for downstream tasks. This includes behavioral and social data and chemical exposure data that we can then pick and pull from to create our multimodal models, and combine to look at disease risk. Now, a question I get asked a lot is: that's great and all, and you're doing some of this for MOSSAIC, but how does it apply to the rest of us? This is a part I'm really, really passionate about: making sure that what we're designing is reproducible, replicable, and usable for real-world applications. At the project level, we make sure that all of our code, from the data processing pipelines through the actual model code, is public-facing and available, so that people can run things in exactly the same way we do. That's one way to get to the scalable population-level models I talked about.
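The population mask and the missingness check it enables reduce to simple set logic: keep only cells where people live, then flag cells that are populated but have no test-kit measurements. The cell ids, population counts, and test counts below are fabricated stand-ins for LandScan-style and radon-kit data.

```python
# Sketch of the population-mask idea: drop uninhabited cells and flag
# populated-but-untested cells as potential selection-bias gaps.
population = {"hex_a": 1200, "hex_b": 0, "hex_c": 450, "hex_d": 75}
radon_tests = {"hex_a": 31, "hex_c": 0, "hex_d": 4}   # home test-kit counts

populated = {h for h, p in population.items() if p > 0}
# Store data only where people actually live (nothing to model elsewhere).
stored = {h: radon_tests.get(h, 0) for h in populated}
# "Green" cells: people live there but nobody has tested their home.
untested_gaps = {h for h in populated if stored[h] == 0}
print(sorted(untested_gaps))  # ['hex_c']
```

Those gap cells are exactly the places where "no data" must not be read as "no exposure".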
Another way we're thinking about doing this is federated learning. Within the same project, we're setting up federated learning runs; everything on the electronic health data side is cross-silo, with lots of data at each center. We're exploring the privacy-accuracy trade-offs required in a federated learning setting, so that we can enable groups to participate in these runs who historically have not been able to, either because they don't have the compute or because of data use agreements. I'm not going to dig into everything involved; there's a lot that goes into it, but it's an exciting direction that, in my opinion, will allow us to scale up to the population level. And I will stop there. Thank you. So, you know, I admit to some jealousy over computational resources sometimes when I see Oak Ridge folks present. I know you all share a lot, and I think that last slide addressed my questions about how we get at this, because it's fascinating. I want to invite the three presenters to turn on their cameras for some panel discussion; for those in the actual room, that might not be doable. I feel the presentations we just saw did a fantastic job of meeting the first objective of this session, presenting people with use cases and studies where they can see this, so I think we've covered that really well, and people can follow up with the speakers if they have questions. I did want to ask about the second objective: some thoughts on how AI and ML can be used in a pipeline, and maybe even what a pipeline is. Perhaps our first panelist can comment.
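Cross-silo federated learning, where each registry trains locally and shares only model parameters, can be illustrated with a minimal federated averaging (FedAvg) loop. This is a sketch, not the project's implementation: the "model" is a single weight vector, the "training" is one least-squares gradient step per round, and the silo data are fabricated.

```python
# Minimal cross-silo FedAvg sketch: silos share weights, never raw records.

def local_step(weights, data, lr=0.1):
    """One gradient step of least-squares y ~ w*x on a silo's own data."""
    grad = [0.0] * len(weights)
    for x, y in data:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi / len(data)
    return [w - lr * g for w, g in zip(weights, grad)]

def fed_avg(global_w, silos, rounds=50):
    for _ in range(rounds):
        updates, sizes = [], []
        for data in silos:                       # runs inside each silo
            updates.append(local_step(global_w, data))
            sizes.append(len(data))
        n = sum(sizes)                           # server: size-weighted average
        global_w = [sum(u[i] * s / n for u, s in zip(updates, sizes))
                    for i in range(len(global_w))]
    return global_w

silo_a = [((1.0,), 2.0), ((2.0,), 4.0)]          # both silos follow y = 2x
silo_b = [((3.0,), 6.0)]
w = fed_avg([0.0], [silo_a, silo_b])
print(round(w[0], 2))  # 2.0
```

Privacy mechanisms such as secure aggregation or differential privacy would sit on top of the weight exchange; that is where the privacy-accuracy trade-off mentioned above comes in.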
Because I think clearly the MOSSAIC project is what I would call a pipeline, right? Things are being watched, there's automated learning. How do we turn some of these other things into what we might consider more of a pipeline? That's a very good question. In the application you saw in the presentation I just gave, there are a lot of different modalities and sensors, and you have to integrate them. I think integration is the most important part, and I do believe advanced machine learning approaches can help do that better than other approaches. In doing that integration, a lot of issues need to be resolved: as I said, how do we integrate data when they have missingness, and also heterogeneity? I did not mention heterogeneity earlier; the different data types come with different models, so how do you integrate them? I think that really needs machine learning approaches to do the job. If I may add to that, David, I think there are a couple of things that need to be put in place, or that we are able to do right now, in terms of building pipelines. The first is access to clinical data that are linked to some of the biobanks we're already collecting. You can imagine a scenario in which you run some multimodal AI process to develop your algorithm or do some sort of discovery on data linked to electronic health records, where you can then prospectively evaluate those methods on real patient data, or even call individuals back to develop some sort of AI-based intervention, all within an electronic health records setting. The second thing, and our speakers touched on this, is that we need libraries of models. We have libraries of data; one task is to make the models just as accessible.
As we heard in the fantastic last talk by Dr. Hansen, make them available so that others can quickly evaluate them. That would require agreeing on data standards so that we can implement them, for example in All of Us, UK Biobank, or electronic health record settings, and evaluate all these models' performance across different cohorts in real time. Yeah, there's a good chat question I want to get to, but first, one more thing on pipelines. Something I saw in your presentations is that some of these studies are also life course studies; we saw in an earlier presentation that we're predicting things for different stages and modalities of people's lives. In a pipeline sense, what does that mean? You'll have the model for today, but that same corpus of data, that same cohort of people, we'll need to ask questions about them tomorrow. In terms of a pipeline, does that mean we are constantly updating outputs, or offering new interpretations? How does that work for actually communicating these things? That's actually a very good question. If people have heard of domain shift: over time, the domain changes constantly, so how are you going to build a model that will generalize across those shifts? That is a major topic in the machine learning community, and I think it's still an open problem: how do you deal with domain shift? Yeah, I would agree; that's a great answer, and it's still very much an open problem.
Some of the things we've started to do are building in temporal checks. Basically, we're assessing temporal drift as we go forward in time, and that's part of our data profiling. Not only do we have data profiling, we also have model profiling, where we assess different types of bias or temporal shift in the data. That can then inform how we make decisions about how frequently we update the model. It's really about treating this not as a single endpoint but as a full, iterative pipeline where you're constantly reassessing and updating things. Sorry, David, I just wanted to add to that. Many of these modalities may be redundant, so one experiment one needs to do is understand the correlations between them; an earlier speaker mentioned the high correlation between exposures. It could be that some modalities simply repeat information, so you don't need to measure them and can instead use a sparser set of measures that captures the same amount of variance explained. And as a segue into the next question: we're all doing this in the face of climate change, so clearly the environment is not static; we've been talking about pollution, but all of it gets affected. On that note, the chat question: do we have suggestions for dealing with the need for privacy when spatially heterogeneous environmental contamination data can be associated with localized exposure patterns? I think Dr. Hansen's presentation said something about spatial joining and how to scale up and down, but do folks have thoughts on that?
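A temporal-drift check of the kind described here can be as simple as comparing a feature's recent distribution to a reference window and flagging large shifts. This is a toy version; the z-score rule, threshold, and data below are illustrative, and a production profiler would track many features and more robust statistics.

```python
# Toy temporal-drift flag: is the current window's mean far (in standard
# errors) from the reference window's mean?
import statistics

def drift_flag(reference, current, z_threshold=3.0):
    """True when the current mean drifts > z_threshold standard errors
    from the reference mean."""
    mu, sd = statistics.mean(reference), statistics.stdev(reference)
    se = sd / len(current) ** 0.5
    z = abs(statistics.mean(current) - mu) / se
    return z > z_threshold

ref_2015 = [0.30, 0.32, 0.29, 0.31, 0.30, 0.33, 0.28, 0.31]
cur_2020 = [0.45, 0.47, 0.44, 0.46]   # e.g. coding practice changed
print(drift_flag(ref_2015, cur_2020))  # True -> consider retraining
```

A flag like this feeding into the model-profiling step is one concrete way the "how often do we update?" decision can be automated.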
That's a very interesting question and a very good one, and one that folks don't always catch on to: when you start to combine all these modalities, you really do run into privacy issues. I may de-identify my health data, but then I join it to some information about environmental exposure at a small spatial extent, maybe one kilometer by one kilometer, and those patterns become very unique to that space in some cases. So I went through all this work to de-identify, I joined in an environmental exposure, and now I have identifiable information in a way I hadn't really thought about. I would say there are ways we can start to make sure these data are privacy protected, and this question is a very good first step, because a lot of things in the privacy space haven't been well quantified. At what level do I need to think about this? What happens when I get to certain patterns? I think we need more research into that sort of question, and more critical thought. I also think we need to start considering other methods, which is partly why I'm interested in federated learning. At some point, when we're talking about multimodal data, it's always easy to triangulate a person, so we need other ways to deal with this: can it be privacy-preserving federated learning? There will always have to be IRBs and data use agreements, but can we get innovative in the ways we process these data? That would be my answer, or my question. Maybe I can add on the other aspect of the question, about contaminated data. Machine learning has advanced a great deal in adversarial learning, if you have heard of it; contaminated or fake data are similar to adversarial examples, and generative AI has also advanced data augmentation to try to reduce such problems.
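The re-identification risk described here, where a join to fine-grained exposure data makes record combinations unique, can be screened with a back-of-the-envelope k-anonymity check: count how many records share each quasi-identifier combination after the join. The records, field names, and threshold below are fabricated for illustration.

```python
# k-anonymity screen: which quasi-identifier combinations are shared by
# fewer than k records after joining health data to an exposure cell?
from collections import Counter

def small_groups(records, quasi_ids, k=5):
    """Return quasi-identifier combinations held by fewer than k records."""
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return {combo: n for combo, n in combos.items() if n < k}

records = [
    {"dx": "HF", "age_band": "60-69", "hex": "A"},
    {"dx": "HF", "age_band": "60-69", "hex": "A"},
    {"dx": "HF", "age_band": "60-69", "hex": "A"},
    {"dx": "HF", "age_band": "60-69", "hex": "A"},
    {"dx": "HF", "age_band": "60-69", "hex": "A"},
    {"dx": "HF", "age_band": "70-79", "hex": "B"},  # unique after the join
]
print(small_groups(records, ["dx", "age_band", "hex"], k=5))
# {('HF', '70-79', 'B'): 1} -> coarsen 'hex' or suppress before release
```

Combinations that fall below k are exactly the cases where one would coarsen the spatial cell (as in the population-scaled hexes) or suppress the record before release.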
So I think we do have a future on that. And going back to some of the questions from the first session about the dangers of identifying people and having biases built into our models: the map Dr. Hansen showed covered 48 percent of the country, but the registries were geographically contiguous, like tranches across the country. A lot of cohorts end up looking like that, so if you're not building things out from the beginning, it's going to get big, and if we don't think about de-identification, you're going to end up with some of those undesirable outcomes where the learners bias themselves. Thank you very much. Yes, David, I just wanted to comment on Heidi's talk. Everyone did a great job; thank you very much, I learned a lot. And Heidi, that was a great trick, or rather methodology, using LandScan to derive, without crosswalks, aggregation at basically a one to almost two kilometer level. Without a gridded data set you would generally link, for example, MODIS-derived PM2.5 to a public health exposome to get down to about that level, and thank you for bringing those issues up; the IRB really doesn't like to see that. But I want to compliment you on that. The reason this is very, very important is because of the imputation issue in Columbus. Smart and sustainable cities have to have monitoring throughout the city. Generally these are PurpleAir monitors, and generally these monitors aren't placed in vulnerable communities.
Columbus didn't have them, so we got a grant to put monitors in those census tracts where they weren't, and found out that we were almost an order of magnitude off in the imputation. That's what I mean about real measurements versus imputation, so I want to caution us to be careful in that regard, though that was without AI or any supervised clustering methodology. Were those monitors placed with the benefit of some learning models that guided placement, or were they placed based simply on where you had missing data? Where we had missing data, which is always in vulnerable, low socioeconomic status census tracts. Right, thanks. As a follow-up, perhaps for the panelist who presented the graph embedding work: as you're doing these multimodal analyses and finding holes in the data, the missingness of the data, it seemed the methods didn't really care whether you had prior knowledge about a particular data set. Did I misinterpret that? Or is it beneficial, if you have exposomics data on one particular kind of contaminant whose spatial pattern you know, for the graph embedding method to use that information, or can it be, what's the right word, agnostic to any prior idea about the nature of that missingness? That's a very good question, because in the past people have used prior knowledge to impute data, and sometimes they simply use the data itself to impute. In the approach I presented, we try not to use priors; instead, we utilize the interactions between the modalities. That's the key part: why would you want to use a different modality? Because the modalities interact, and they have common features you want to extract. Machine learning can extract that information automatically from the data set.
Yeah, David, to add to this: for certain critical questions, I think we need to get better at measuring. Several topics that emerged from the last session were about non-additive relationships between cumulative exposures, additive versus multiplicative effects, and while I'm very optimistic about imputation, eventually we need to measure everything so that nothing is missing, so we can really ask those questions about non-additive versus additive relationships between exposures and health status. I wonder if you have thoughts about, I think you presented it as an older or traditional method, canonical correlation analysis. This is an AI and ML workshop and we're all pushing the methods throttle as fast as we can, but I wonder about the benefits, for interpretation, of methods that have been around for a hundred years. I don't want to speak for the authors of that paper, but I would say that analysis was enabled because they had complete cases on both the metabolomics and the dietary behavior. And you're right, they were able to get something that was easy to interpret and relate to health outcomes, so I think that matters and adds to the impact of what we can do. I would add that when you do integration, or you look at correlation, especially using machine learning models, you always face the interpretation problem. That is one of the problems the machine learning community is trying to solve, and I'm optimistic we will have good solutions. Maybe I'll just ask everyone, as we're about at time, for a closing thought from the panelists, especially on that key piece going into the next sessions.
Are there things we can do now to help interpretation? Are there quick things that are reachable right now that we can start doing? Yeah. As was mentioned, when comparing simple methods to more advanced machine learning methods, there's an ongoing debate about whether the more advanced method actually helps, and what it costs in interpretability. I think striking that balance is a very important topic. Yeah, I would agree; I like that answer. It takes a little more time, and we're all methods geeks to some extent, but I think it's a very important way to go forward. I also think building uncertainty quantification into what we're building is extremely important. Again, it's a hard space and there are a lot of open questions, but rather than just accepting what we're modeling, can we build in uncertainty quantification and then incorporate it into some of our pipelines? I also like those answers. I would add enhancing our existing cohort data or existing surveillance data, like the National Health and Nutrition Examination Survey, so that they do not have missing information and we can build baselines for when we do implement these multimodal measurements, for example in All of Us biobank samples. So I'd say baseline metrics to assess how we're doing. Thank you all, and thank you again to the speakers; I'll let you all have the last word, you've earned it. We'll now move into a lunch break and reconvene at two. We'll want to start exactly at two because right after that we have our keynotes and a visit from Rick Woychik, so we want to make sure we start the session on time. Unless the organizers have any more comments, we're going to take a pause now and reconvene at two.
Yeah, thank you all again, to all three of you. Thank you. Hello everyone, welcome back. We're going to start session three. The goal of session three is to provide an overview of emerging AI and ML methods and approaches for data integration and applications, and I have three wonderful speakers today. My name is Yao-Yi Chiang; I'm an associate professor at the University of Minnesota in the Computer Science and Engineering department. My background is a mix of computer science and spatial sciences, and I'm also a GIS professional. My area of expertise is spatial AI, so we work on applied machine learning methods, data mining methods, and systems for problems involving spatial data. Without further ado, our first speaker today is Dr. Joyce Ho, an associate professor in the Computer Science department at Emory University. Her research focuses on the development of novel data mining and machine learning algorithms for problems in health care, and she also co-founded a successful healthcare analytics company. Today we're going to hear from her about her work. The floor is yours. Thank you so much. Today I'm going to talk about the integration of novel data streams to capture neighborhood-level measures, and to start off I want to motivate this from the perspective of health equity, taking heart failure as a case study. It's well known that social determinants of health and neighborhood have an effect, but it's unclear exactly why. If you look at the existing literature for heart failure and for peripheral artery disease, you'll see conflicting evidence on whether a neighborhood deprivation index actually plays a role. Given this, you might say, well, maybe there is no neighborhood effect.
One of our main premises is that deprivation indices fail to fully capture neighborhood or community factors. So the real question we want to think about is: how can you get at these factors without asking the patient and without burdening healthcare providers? What we thought is that you can in fact tap into novel data streams that are already publicly accessible, data that people create. I'm going to cover four different ones that we're quite interested in. The first you can think of as market segmentation data: things like Instagram, Facebook, and Twitter. Another is geodata services: Google Places, Foursquare, Yelp, and OpenStreetMap. Another is mobility data; for Atlanta we have MARTA and the Department of Transportation, and Apple Maps and Google Maps as proxies for mobility. And then there are community-based forums: things like PatientsLikeMe, Nextdoor, and even Reddit. These are all already publicly available to some extent; they all have APIs through which you can get at the data users have already posted, so it's not something we necessarily need to collect. So the question is: does it do better? We went about answering this by prototyping new neighborhood measures. In particular, we focused on three sources of data: Twitter (you can see an example tweet in the middle), Foursquare (for example, groceries in New York), and OpenStreetMap, which is essentially curated by the public. For Foursquare and OpenStreetMap, we look at the points of interest within the census tract, and we focus on census tracts because you have to validate against something.
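Turning point-of-interest (POI) feeds into tract-level features amounts to a count per tract per category. The sketch below shows that step with fabricated POIs and tract ids; a real pipeline would pull POIs from the Foursquare or OpenStreetMap APIs and spatially join them against census tract boundaries before counting.

```python
# Sketch: count POIs per census tract in each study category.
from collections import defaultdict

CATEGORIES = {"park", "pharmacy", "grocery", "restaurant", "sports", "health"}

def tract_features(pois):
    """pois: iterable of (tract_id, category) -> {tract: {category: count}}."""
    feats = defaultdict(lambda: {c: 0 for c in CATEGORIES})
    for tract, cat in pois:
        if cat in CATEGORIES:            # ignore POI types outside the study
            feats[tract][cat] += 1
    return dict(feats)

pois = [("13121-001", "pharmacy"), ("13121-001", "restaurant"),
        ("13121-001", "restaurant"), ("13121-002", "park")]
f = tract_features(pois)
print(f["13121-001"]["restaurant"])  # 2
```

These per-tract category counts are the "very simplistic approximation of neighborhood" that the readmission models described next are built on.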
And the natural measures typically used when thinking about health equity are at the census tract level. For Twitter, we also look at how many tweets are geotagged with specific keywords. We focus on six categories of interest, which we curated because there is some literature supporting their effects: parks, so recreation, how many parks do you live near; pharmacies, so how readily available is medication; groceries, so do you live in a food desert, with restaurants as a related proxy; restaurants; sports, so sports teams, sporting events, and things like that; and health, looking at healthcare access. We compared against three measures: our own, the Area Deprivation Index (ADI), which is commonly used by CMS today, and the Social Deprivation Index (SDI), which has also been used widely across the literature. One of the findings is that from the six categories of this very simplistic approximation of neighborhood, we can in fact improve prediction of whether a patient will come back within 30 days, across more than 31,000 heart failure patients admitted to the Emory healthcare system, by almost 10 percent, which I think is really exciting. This suggests that ADI and SDI may not capture neighborhood or community factors well, and that we really want to think about how to get better data. Sorry, my slides dropped; the internet here seems unstable, so I tried tethering to my hotspot. If you don't mind, I'll use the clicker instead; that works as well, I'm pretty agnostic. So, thinking about this, we actually delved into the six categories a bit more.
And you'll see it along the six dimensions. Interestingly enough, even if we adjust for patient characteristics, so looking at age, gender, and comorbidities (and if you can advance one), we found that pharmacies and groceries actually have protective associations. That means that if you live near more pharmacies and more grocery stores in your census tract, you are less likely to be readmitted within 30 days. Whereas, surprisingly, at least for me, restaurants have a hazardous association with 30-day readmission: somehow, if you live near more restaurants, you were actually more likely to come back, which was surprising. What can we do with this information? We've been thinking about how you can pilot pharmacy interventions. The idea is that if you don't live near pharmacies in your census tract, maybe what we can think about is a meds-to-beds program where we send medications to patients, or we think about sending someone out there to make sure they're taking their medications. Now, thinking about this a little bit further, what we also want to consider is how we can actually do better, because these are very coarse measurements. Surprisingly, restaurants did not look better, and we really couldn't understand why, because our intuition was that if you lived in a food desert, you wouldn't be anywhere near restaurants. The opposite was true, so we dug into some of the census tracts, and what we discovered is that fast food restaurants are what is prevalent. And that motivated us toward this next idea: how can we actually get at more fine-grained measures? As an example, we might think about healthcare facilities, and we might talk about them differently. So if you don't mind advancing to the next slide.
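The direction of a "protective" association can be illustrated with a crude, unadjusted rate ratio (the talk's analysis used adjusted hazard models over real Emory data; everything below is invented and far simpler):

```python
# Sketch: a crude 30-day readmission rate ratio by pharmacy access.
# A ratio below 1 points in the "protective" direction described in the
# talk. This is an unadjusted toy calculation on made-up patients, not
# the speakers' adjusted survival analysis.

def readmission_rate(patients):
    """patients: list of dicts with a boolean 'readmitted_30d' flag."""
    return sum(p["readmitted_30d"] for p in patients) / len(patients)

near = [{"readmitted_30d": r} for r in [True, False, False, False, False]]
far  = [{"readmitted_30d": r} for r in [True, True, False, False, False]]

rate_ratio = readmission_rate(near) / readmission_rate(far)  # < 1: protective
```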
What we're thinking about is how you can actually do human-guided category refinement, and what I'm going to show you is a use case on one of the six categories, in the context of health. We were looking at Twitter in particular because Twitter tends to be unstructured data. So we wanted to ask: how do you preprocess the data, and how do you think about the topics emerging in the tweets as the things we want to consider? And in thinking about that, we really want the domain expert to come help us guide these clusters or refinements, because for every category we're going to think about, you really want it to reflect some meaningful notion, or at least potentially improve it. On the next one: what we've done under the hood in this interactive process is develop an unsupervised machine learning model that can take these tweets, extract keywords, and get at the representations, or embeddings. So we're agnostic to any big changes in the natural language processing world; thinking about large language models, they're essentially plug-and-play. What we've developed is a mechanism to learn word representations where we push similar words together in the space and push dissimilar words apart, and what this allows us to do is identify topics that might make more sense. So if you click on the next: what I'm now going to show you is a comparison against SDI. If we looked at just the health score in terms of Twitter, you'll notice that it does a little bit better. Now, if we just do this topic refinement without human-guided annotations, you'll notice that we are in fact not any better than SDI. But if we go through this iterative process of refining and pushing together keywords that make sense, we in fact can do a little bit better. Right.
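The idea of grouping keywords whose representations sit close together, before a human merges or splits the groups, can be sketched like this (the two-dimensional vectors are tiny invented stand-ins for real language-model embeddings; this is not the speakers' model):

```python
# Sketch: grouping keywords by cosine similarity of their embedding
# vectors, then handing the groups to a domain expert for refinement.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def greedy_clusters(embeddings, threshold=0.9):
    """Assign each word to the first cluster whose seed it is close to."""
    clusters = []  # list of (seed_vector, [words])
    for word, vec in embeddings.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(word)
                break
        else:
            clusters.append((vec, [word]))
    return [members for _, members in clusters]

# Invented 2-d "embeddings" for four health-related keywords
emb = {"clinic": (1.0, 0.1), "hospital": (0.9, 0.2),
       "gym": (0.1, 1.0), "fitness": (0.2, 0.9)}
groups = greedy_clusters(emb)
```

In the human-guided step described in the talk, an expert would then accept, merge, or split groups like these before they are used to score a category.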
And so this is an example of how I think human domain knowledge can help guide these refinements. But also, in the process of thinking about this, we've been treating each data stream as separate, when in fact every data stream is likely to be correlated with the others. So for the pharmacy one, what we did is pull the Georgia Board of Pharmacy list. And what we found is that there was very little congruence between our data and the pharmacy data. And so if you advance to the next slide: what we really want to think about is how we can match any two objects between different data sources. Now, traditionally in the computer science domain, this is known as entity matching. So if you look at these two entities, one is from Zomato, which was around years ago, and one is from Yelp: would you say that these two are the same? And, yes, it's clear to us that these are the same. This is an integral part of databases, and it's very well studied in the database community. Now, if you click one more: the biggest problem with current machine learning models trained on this task is that these models assume the two data sources share exactly the same columns. That is, someone has gone ahead and aligned the databases, and then someone has gone ahead and cleaned the values as well, so you can imagine this is actually a lot of work. And so if you click on the next one, I'm going to show you what it looks like originally, in the pure original data, because real data is messy, real data is ugly: things are not going to align, and we cannot expect to pay the price of manually annotating thousands and thousands of these to determine which pharmacies are really the same and which are not. So if you click to the next slide, please. What I'm showing you now is four common benchmark data sets.
Version one is the original benchmark data set that has been cleaned and standardized; the other version is the original data without any cleaning. And this is an existing state-of-the-art model using large language models. What you'll notice is that there is in fact a significant degradation in accuracy; if you look at the last two lines, it's really terrible. So if you click on the next one: what we've developed recently is a new deep learning model that really learns nonlinear relationships between column names and values. The idea is that we can learn which columns we should really focus on and which values are the ones that really matter. And if you click one more time, what I'm going to show you are the results from our model. What you'll notice is that, because we're thinking about this from the perspective of real-world data, we're really modeling interactions across the columns and the attributes, and you'll see that we actually have relatively stable performance. There is obviously going to be a performance drop, as you could tell it is a much harder problem, but in fact it's not terrible; it's actually fairly usable. So, last set of slides. Hopefully what I convinced you of today is that we can really think about these new data streams and about integrating them, beyond traditional environmental measures. We really want to think about how you incorporate human interactions to get at these better, fine-grained metrics. And when we're thinking about these data integration models, this entity matching that we want to do, we really should train them on real-world, messy data sets, not on the clean ones that we assume will do well. And with that, if you click two more times, I'd like to thank my collaborators and my students who have made this all possible, and obviously NSF, NIH, and J&J. Thank you, Dr. Ho. Our next speaker is Dr. Thomas Hartung. So, Dr.
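A simple way to see what "schema-agnostic" means here is a baseline that ignores column names entirely and compares records by the overlap of their value tokens (this is only an illustrative baseline with invented records, not the deep model from the talk):

```python
# Sketch: a schema-agnostic entity-matching baseline. Instead of assuming
# the two sources share aligned, cleaned columns, we bag all values into
# tokens and compare records by Jaccard overlap.

def tokens(record):
    """Lower-cased word bag over all values, ignoring column names."""
    bag = set()
    for value in record.values():
        bag.update(str(value).lower().split())
    return bag

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Two descriptions of the same restaurant with different schemas
zomato = {"name": "Joe's Pizza", "addr": "7 Carmine St", "city": "New York"}
yelp   = {"title": "Joe's Pizza", "location": "7 Carmine St New York"}

is_match = jaccard(zomato, yelp) > 0.8
```

A learned model, like the one described in the talk, goes further by weighting which columns and values matter, but the key design choice is the same: never assume the two schemas line up.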
Hartung is the Chair for Evidence-based Toxicology in the Department of Environmental Health and Engineering at the Johns Hopkins Bloomberg School of Public Health. He also has a joint appointment at the Whiting School of Engineering, and he has an affiliation with Georgetown University. He is also a consulting vice president of XSIM. So, without further ado: Dr. Hartung. Thank you. I have just come from the University of Konstanz, where I gave a lecture. Okay. My slides I've made available; I will show the QR code again on the last slide. Important for this discussion: I am also the Field Chief Editor of Frontiers in Artificial Intelligence, so I have a very general interest in AI. Next slide, please. At the moment we are seeing the synergy of three elements: the increase in data, the increase in computing power, and the increase in our AI algorithms. Data around the world are increasing by more than 60% every year, which means more than 90% of all data in the world have been produced in the last two to three years. Moore's law gave us, over the last 60 years, a doubling in computer capacity every second year; but AI, for the last decade, has been doubling in capability every three months. And altogether (next) this has led to about a billion-fold increase in the computational power of these systems over the 60 years of my lifetime. Next. And this is leaving nothing unaffected; it's no longer the world as we know it. This includes environmental health and toxicology, as I want to show you in the next slides, please. We have seen that on Pi Day, the 14th of March, GPT-4 was released, and it is now outperforming humans on many tasks; it has passed simulated exams such as the bar exam and the SAT. These are astonishingly powerful models, as everybody is witnessing at the moment. Next. What we see over time is that with the move to deep learning, AI really became powerful.
We are moving at the moment to deep reinforcement learning, which has made these tremendous possibilities happen, and now we are increasingly also working with distributed agents, so not just one central computer doing the entire job. Next. A very strong feature of this is the so-called foundation models. These are now the big models, like ChatGPT. These models are no longer trained for one specific task; they allow us to fine-tune them, to use them as a foundation and add our problems to the body of work which has been done in the past. And they especially are being trained on what is called multimodal data: pictures, videos, text, speech, and all of these together. This is the basis of the current foundation models. Next. To quote Lewis Carroll: "Sometimes I've believed as many as six impossible things before breakfast." And in fact, several things I would have called impossible a few years ago have happened in recent times. For example, number one: AlphaZero showed us that, without ever studying a human game, within eight hours it was playing chess so far better than any human, and so differently from any human, that human players now study how chess could be played more effectively. And there are several others. Just to mention one: while in 2020 there was no AI-designed drug in clinical trials, by 2022 there were already 18 AI-first drugs in clinical trials, and you can marvel at some of the others independently. Next. In toxicology, the safety sciences, and environmental health, there is, from my point of view, a very strong need for this type of technology to revamp what astonishingly has not changed for a long time. Most of the methods we are using in regulatory toxicology were introduced before I was born or while I was in kindergarten. I had the privilege to lead, for the Department of Defense, a future-directions workshop.
I had two co-chairs, Ana Navas-Acien from Columbia and Weihsueh Chiu from Texas A&M, and thirty avant-garde toxicologists and 15 agency observers. Next. Two weeks ago we published this workshop's document, which is in essence a call for a Human Exposome Project. Next. It has three pillars, and I already described them in this paper earlier this year. Next, please. These three pillars are to make toxicology more exposure-driven; technology-enabled, and here AI is one of the core technologies we refer to; and, important for today's discussion, evidence-integrated. Again, AI is the methodology here. Next, please. That's exactly what this says: AI as a different way of evidence integration in toxicology. Next. The starting point of machine learning and AI is big data. Actually, I like to define AI as making big sense of big data. And the important part is that AI really shines not just when the volume is big: it needs variety, the multimodal type of data, and it needs velocity, the speed at which this information is accumulating. Next. We actually do have a lot of sources feeding our field. We see that more and more legacy data of the past are being curated in databases. We have a lot of scientific literature accumulating; PubMed alone gets about 1 million entries per year, and 80% of them show some chemical effects, not as drugs and toxicants but often as inhibitors or effectors in our systems. The internet is full of data, also in the safety sciences: 100,000 safety data sheets. Sensor technologies, robotized testing, the omics technologies, and high-content imaging are new sources of data-rich technologies. Next. Interestingly, at the moment, an enormous part of the work of any data scientist is getting access to data and cleaning them; about 45% of a data scientist's time is used for this purpose. Next.
Here I have to praise Tom Luechtefeld of ToxTrack, who works 50% of his time in my team and is running two companies in the AI space. He has been programming so-called BioBricks, a one-line command to import an entire database, and he has done 50 of these, so that essentially all relevant databases can be loaded within minutes to an hour, to give, in total, the largest database of safety science data in the world. Next. What you can see here is one example: the first five databases he combined in ChemHarmony. Together they give us 200 million cases where we have a chemical, a property, and a result. This is a treasure trove for training models; it is definitely, for the first time, a really big database of chemical effects. And we are going to make these 50 BioBricks publicly available soon: with really a one-line command, without knowledge of the structure of the databases or the programming code being used, anybody can import the relevant information. Next. A big part of the progress at the moment is natural language processing, and the big advance right now is that this too is becoming multimodal. Already today, scientific literature is essentially read by a computer almost as well as by a PhD student, but not one paper per week: millions a day, and never forgetting anything. The big challenge is actually multimodality; the reading of tables and figures is the challenge we are seeing at the moment. Next. And you can see the number of scientific articles, plotted on a logarithmic scale, increasing at the moment, and this is really of key importance. Interestingly (next), if you ask where the information comes from which makes ChatGPT and others intelligent at the moment: 9% of all the data which went into training these are in the science and health area. And you can see here the top sites; very clearly, the open-access journals of PLOS and Frontiers represent a very important source.
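The "one line to import a database" idea can be illustrated with a registry pattern (this is NOT the real BioBricks API; the dataset name, loader, and rows below are invented purely to show how a named loader can hide all download and parsing logic behind a single call):

```python
# Sketch of a one-line dataset import via a loader registry. Each "brick"
# registers itself under a name; users only ever call load(name).

REGISTRY = {}

def brick(name):
    """Decorator registering a loader function under a dataset name."""
    def register(loader):
        REGISTRY[name] = loader
        return loader
    return register

@brick("chemharmony-demo")
def load_chemharmony_demo():
    # A real brick would download and normalize the source database;
    # here we return a tiny inline stand-in row.
    return [{"chemical": "CHEM-1", "property": "LD50", "result": 320}]

def load(name):
    """The one line a user types, e.g. load("chemharmony-demo")."""
    return REGISTRY[name]()

rows = load("chemharmony-demo")
```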
So this is really a reason why we can expect these models also to help us with some of the tasks at hand. Next. Here you see, on a logarithmic scale, the increase in these large language models, and already the contribution of the bidirectional encoders which Google introduced around 2018: the BERT system led to (next) specifically pre-trained systems such as ChemBERT and SciBERT. Next, please. These are systems where the context of scientific articles is considered, so the system understands that it is now reading a scientific article. And then, next: in February of this year, Microsoft released BioGPT, which is specifically trained on scientific literature. And next, you see its effectiveness: BioGPT performs better than a human annotator in annotating scientific articles to retrieve the respective information. This shows you the enormous progress in this field of integrating information. Next. Our own interest is very much in systematic reviews. One of the products ToxTrack has developed is called Sysrev. It is a publicly, freely available software which has given rise to more than 10,000 systematic review projects already registered there, and it supports semi-automated systematic reviews using machine learning. It trains on the inclusion criteria, that is, on the actual choices of abstracts, and after 100 to 200 abstracts screened by a human assessor it has learned, as well as a second human would, to find which papers should be included, and can then assess hundreds or thousands of these. Next, please. What Sysrev is doing next is auto-extraction and annotation. We are boosting this at the moment to recognize named entities such as genes and enzymes, and the causal relationships between them, and our goal is really to fine-tune this in order to import tox-relevant information out of the literature. Next. In the context of the... oops, this went too far forward. Please go back one click. And again. My mistake.
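The screening loop described here, where a model learns from a human's include/exclude decisions and then ranks the remaining abstracts, can be sketched in miniature (a toy word-weight scorer with invented abstracts; real systems like the one described use proper classifiers):

```python
# Sketch: semi-automated abstract screening. A toy scorer accumulates
# word weights from human include/exclude labels, then ranks unseen
# abstracts so likely includes surface first.
from collections import defaultdict

class ScreeningModel:
    def __init__(self):
        self.weights = defaultdict(int)

    def learn(self, abstract, include):
        """Update word weights from one human screening decision."""
        for word in abstract.lower().split():
            self.weights[word] += 1 if include else -1

    def score(self, abstract):
        """Higher score means more likely to be included."""
        return sum(self.weights[w] for w in abstract.lower().split())

model = ScreeningModel()
model.learn("liver toxicity in rats", include=True)
model.learn("stock market forecasting", include=False)

ranked = sorted(["novel liver toxicity assay", "market trends"],
                key=model.score, reverse=True)
```

After a few hundred such labels, ranking lets the human concentrate on the abstracts the model is least sure about.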
Okay, so what we have been doing ourselves is developing predictive algorithms. Already in 2018 we built a large model of 10 million structures, for 900,000 of which we had some type of information across 74 properties. And we showed, for 190,000 chemicals where classifications had been done based on OECD animal tests, that we were 87% correct in predicting these, while the reproducibility of the animal tests themselves was only 81%. In the meantime (next), we have shown that also with respect to human data this model is superior to the best animal test, in predicting human skin sensitization, for example. Next. In November of last year we showed we can run this on thousands of chemicals; in this case we chose 4,700 food-relevant substances and carried out, in less than an hour, the equivalent of 38,000 animal studies, which would cost more than $250 million. And we were again about 83% correct on a small validation set we analyzed, so at least as good as the animal tests. Next. And at the Society of Toxicology meeting in March we showed that the more complex toxicities, such as cancer and reproductive toxicity, can also be predicted, with reasonable accuracies of 75% and 82%. Next. In the context of the European ONTOX project, a $20 million project with 18 partners, with us and ToxTrack as the US partners, we are at the moment applying this methodology to the liver, the kidney, and the developing brain. I have the privilege to lead the AI part (next), which is essentially trying to use Sysrev to extract data from the literature; next, using databases through these BioBrick tools which I described; and third (next), based on another product from one of Tom Luechtefeld's companies, ChemChart, crawling the internet for such data. All of this together (next) gives us the data we then use in the second phase (next) in order to train a variety of models.
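The structure-based prediction idea rests on read-across: infer a chemical's hazard from its most similar, already-tested neighbors. Here is a deliberately tiny sketch (invented fingerprints and labels; the actual models described operate on millions of structures with learned similarity, not this toy vote):

```python
# Sketch: read-across in miniature. A query chemical's hazard is
# predicted by majority vote of its k most similar labelled chemicals,
# with similarity as Jaccard overlap of structural-fingerprint bits.

def similarity(fp_a, fp_b):
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def read_across(query_fp, knowns, k=3):
    """Majority vote of the k most similar labelled chemicals."""
    ranked = sorted(knowns, key=lambda c: similarity(query_fp, c["fp"]),
                    reverse=True)[:k]
    votes = sum(c["toxic"] for c in ranked)
    return votes > k / 2

# Hypothetical fingerprint bit-sets with known animal-test labels
knowns = [
    {"fp": {1, 2, 3, 4}, "toxic": True},
    {"fp": {1, 2, 3, 5}, "toxic": True},
    {"fp": {7, 8, 9}, "toxic": False},
    {"fp": {1, 2, 4, 6}, "toxic": True},
]
prediction = read_across({1, 2, 3, 6}, knowns)
```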
The read-across structure-activity relationship approach is the method I described for chemical-structure-based predictions, which we have been developing. But we also feed automatically into physiological maps and AOP networks to understand the causal relationships. And this (next) is being used, from understanding the perturbation of physiology and the chemical structure properties, to derive a probability of hazard: we can predict, ultimately, how probable it is that a certain chemical of interest has a certain property. Next. This leads us to a key problem, from my point of view, in environmental health, which is that we like to work in a black-and-white environment of a substance being toxic or non-toxic, carcinogenic or non-carcinogenic. In reality, probability comes in a lot of shades of gray; there are a lot of uncertainties. Next. And so we have made AI-based probabilistic risk assessment the core of evidence integration in our project, and next month we are going to have the second workshop on probabilistic risk assessment. This is the paper from last year, where we are looking into perturbations of biology to train for probability of hazard. Next. I think this is essentially coming to an end here. We are at the moment witnessing that our traditional system of hypothesis-driven research, experimental tests of one aspect after the other, is being enhanced by two things. Next. One is machine learning and AI: through big data we are synthesizing evidence and combining the various pieces of information. The other is the evidence-based methodologies; you can say these are the best that humans can do: systematic reviews, quality scoring, and similar. I have the privilege to hold the first Chair for Evidence-based Toxicology, and we are hosting the Evidence-based Toxicology Collaboration. We are trying at the moment to bring these two together, so that we can use the evidence-based methodologies for reinforcement learning and help mine the big data even more effectively.
Next. With this, I would like to close with John Maynard Keynes, an economist who very rightly said: "The difficulty lies not in the new ideas but in escaping from the old ones." I hope I've shown you some of the new ideas brewing at the moment, with toxicology written with an AI, or, as I like to call it at the moment, a science in transition. Thanks a lot. Thank you, Dr. Hartung, this was a wonderful talk. We don't have our third speaker today, so we are going to go directly into a discussion with our panel. That was short. I can ask just a little question, and I think it's been coming up, not just in the context of this talk. I really appreciate the great overview of all the increasingly available resources with respect to AI, and the increasing ways that we can use it for, well, lots of things. The question it raises for me is around public trust in these tools, recognizing that because they're so new, people might be aware of sources of bias that already exist in various systems and technologies. How are you thinking about, in your work, how we can provide sufficient information for the public to understand the way these tools work, in a way that also reassures them that they can be trusted? Yeah, you nailed it. This is really the problem. But what I'm not arguing for is to give these tools any autonomy and see them as the ultimate decision-makers. This is, at the moment, a way of integrating data which a human brain cannot handle anymore, and of giving, on a silver platter, some suggestions for how to interpret these. But we need the human in the loop, not only in training these models but also in evaluating them. We have to ask: does what the system is suggesting make sense?
But I think it is, at the moment, a prime tool to enhance our decision-making: to look at what smells like a problem, what can perhaps be put on a back burner, and to increasingly learn, from the very beginning, what the quality of these predictions is. These systems have been around for four or five years now, and they get better with a velocity which is unheard of. We must use this; it would be stupid not to take advantage of it. So I'll go with the computer science way of saying it: I think trust is potentially overrated. There's all this discussion about whether we care about privacy and security, but if you look at what everybody posts in a public forum, I do wonder whether a lot of it is that we're just hiding behind the need to convince the public that this makes sense. I think what we can think about, from some perspective, is how we can utilize it to help better decision-making. In the end it should still come back to the human, and that's the best we can do for now. In the first set of sessions there was a discussion that if we keep training these AIs on human decision-making, maybe that's going to exacerbate biases. But I think a lot of it is: what if we integrate more data, would that help mitigate some of that? And I think those are questions that we don't know the answer to. Part of it is convincing the public that maybe that is the perspective we should be taking. I mean, when humans can find the bias, the machine can also find the bias; we only have to train the machine to learn that bias is a problem. And this is why I think that the machines have to learn from our evidence-based approaches, which are the most systematic way of objectively and transparently analyzing information. And there's also a big move to explainable AI, so that the AI explains why it comes to a certain conclusion.
This is a game changer if you want to take decisions on the basis of predictions. For those who are online and can't see in the room, I'm [name inaudible]. I'm almost afraid to say this, because I feel like it's anathema to Americans, but I feel like with the games I play on my phone, I'm giving away more information than I give away through my medical records, and trying to get one doctor to share my medical records with another doctor is ridiculous. So are there folks talking about this in the US? It just seems like we should be doing a better job to advance the science; I think somebody had a slide on that in an earlier session. But maybe that's a recommendation that can come out of this workshop: rethinking what we as Americans share about our health data. It's private, but does it really need to be? I feel like people are going to really push on me, so go ahead, David. Yeah, so I'll push back a little bit, and then I'll also give you words of encouragement. I think you have to look at the reasons why we have the protections that we have in place, because there is a history of abusing the power that we had. So I think we need to be very careful in stepping back from some things that we put in place to protect the most vulnerable. That said, the All of Us program has an extraordinarily aggressive data sharing policy: all of the data they are collecting, on everybody they are collecting from, is made available in a pretty transparent way. We are going through an effort now, and I don't know if Allison is going to talk about this, to add location information there. And so, as we were discussing earlier today, there is going to have to be some level of protection; that is not going to be as transparent as the EHR data is going to be.
I think there is reason for optimism that we can work in other ways within the existing systems that we have, so we don't have to walk back the requirement for IRBs to look at the things we're doing. I would also say there's been a lot of research done in the federated learning and privacy-preserving world as well. I think what's holding it back is that databases aren't standardized: even within Epic, any two healthcare institutions have different ways of doing it, and that's really unfortunate because you can't even share within the Epic ecosystem. I think part of this is pushing those problems to the forefront; a lot of it we've been sort of hiding behind. Maybe we should develop the FHIR standards further, or maybe we should think about OHDSI, or OMOP, or other standards. There probably needs to be some push toward just saying: this is the one standard we're going to go with, and this will enable more sharing across institutions. And in that perspective, the sharing will be more along the lines of what we've been doing with our games, or posting on social media about ourselves. And the good thing is we're not alone. The same problems hold for every aspect of life, and these technologies pose new challenges, but they also give some answers; we heard some of them already. You can train systems behind firewalls and never hand the data out: that's the federated approach. You have blockchain to exchange information in a secure way. You can produce synthetic data sets which still contain the information, but where nothing is identifiable anymore. There is really a number of tools coming up, because we have to solve this. But as a society we should be interested in encouraging data sharing, by telling people what we do, what we will do, and why we want the data. And then there is an astonishing willingness to share, if they have trusted players.
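The "train behind firewalls, never hand the data out" idea can be made concrete with one round of federated averaging: each site trains locally and shares only model weights, which a coordinator averages (the weight numbers below are invented; real federated learning repeats this over many rounds with proper local training):

```python
# Sketch: one round of federated averaging (FedAvg). Sites never share
# patient records, only their locally trained model weights.

def federated_average(site_weights):
    """Average each weight position across sites (one FedAvg round)."""
    n_sites = len(site_weights)
    return [sum(ws) / n_sites for ws in zip(*site_weights)]

# Weight vectors as they might arrive from three hospitals' local models
site_a = [0.2, 0.4, 0.6]
site_b = [0.4, 0.2, 0.6]
site_c = [0.6, 0.6, 0.6]

global_weights = federated_average([site_a, site_b, site_c])
```

The averaged model is then sent back to every site for the next round of local training, so raw data never leaves any firewall.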
And this is something which is really important: if you tell them what they get in return for making data available, people are astonishingly willing to share. I would like to ask Dr. Ho a question, if you don't mind, as important as the privacy conversation is. Going back to the social determinants of health point you were making: I love all the new and novel data streams that you are using and exploring. I guess my question is, when we talk about things like the Area Deprivation Index or the SDI, these are constructs about latent processes that we can't really measure or ground-truth in a specific way, but we think they capture a larger context around deprivation, and I completely agree with you that there are a lot of ways they're misused or not used to their full potential. But based on your talk, are you saying that these novel data streams are doing a better job at capturing factors related to social determinants of health, or are they just seeing something very different, perhaps closer to the human experience, related to what we classically think of as exposure measurement error? I think the novelty is that they can really get at what people are experiencing on the ground. But also layered onto that: do you worry about biases in who contributes the data, who geotags their tweets versus not, or who edits OpenStreetMap? I know that's a huge topic, but I guess I just want more of your thoughts on what they are really capturing. Thanks. So, we've had this conversation a lot between my collaborators and myself. And part of it is gaining acceptance in a field. I feel I'm an outsider, and people really don't like me, maybe because, you know, a computer scientist is stomping over other fields.
But I think, in terms of what we're capturing: a lot of the ADI and SDI are very census-based measures, captured every 10 years or so, and you really don't see changes in them fast enough; if you look at how populations have been shifting towards cities, there's a lot of that at play. I do agree that a lot of these streams will have biases, but the hope is in the integration of more of them, and in looking at how you distill information from them. So let's take the pharmacy one, for instance. What we found is that when we asked the Georgia Board of Pharmacy for the list, they sent us everything, including the small mom-and-pop shops that patients typically won't go to; most of us will use CVS, Walgreens, and the like. And so it's unlikely that someone will go to those places, because that's just not what they're used to. So from that perspective, in some ways you're capturing more of what is likely to be used, and it is a different construct than what ADI and SDI were set up to do. And we've been having this discussion about how you can showcase that it works, because, well, I think Thomas said it best: it's really hard to get rid of the old ideas and replace them with new ones. Everybody thinks deprivation is probably the best indicator of neighborhood measures, or at least it's very commonly used and very widely accepted, and to propel new measures is actually hard unless you validate against existing ones, if that makes sense. So I think what we're seeing is a mismatch of information across the two sources. I'm not really sure which one you should truly trust, and there really needs to be more analysis to determine what makes sense across all the different categories that we presented.
They'll all have different impacts; healthcare facilities might look quite different from what we might see in pharmacies. So I think we've started to tease at it and unravel it, but I don't know that I know the answer yet. And all of this work that I presented is under review at the moment, so it's all very new, even from my group. If I may quickly comment on this: I think these types of indexes are a twentieth-century kind of science, what humans could do to handle complex data. But we are also losing a lot of information by squeezing everything into an index, and the opportunity of machine learning is to leave the variables as they are and find a way through all of this, and in the end identify which parameters really informed our result or our needs best. I think we should not sort too much; humans can only handle about seven variables, but these systems can handle billions. There's a hand up. Thank you very much, Thomas. I agree with you to a certain extent, but I think, as Karen was saying, there is the trust issue. As a public health practitioner, I tend to be contrite with respect to the history, and that trust thing. Just look at the national public health syphilis study, right? That trust thing is real. But in terms of the deprivation index, for example with COVID, the CDC is now considering, and has put into motion, taking away the SVI and replacing it with the HOI. So, you know, next time I see y'all I would like an answer: run the ADI versus the HOI, right, and EJ and environmental burden, and let's see what happens in terms of the models. Yeah, but yes, it's a difficult question; I'll make my graduate student do that. So we also have a question from Kristen.
Yep, so I'm sorry I am not able to be there with you all today, and you don't really want to see me on camera, so I apologize. But going back to this question of these deprivation indices versus the newer approaches, the opportunity for using natural language processing and AI: I think we as scientists really need to put everything in context and think about the questions we're trying to ask with the data that we have, and the mechanisms by which these processes work. If we're trying to understand the long-term impacts of deprivation over a life course on an individual in accelerating disparities and disease, the ADI might be appropriate. But if we're trying to predict hospitalizations, or better care, or exacerbations of disease in a more acute framework, then these other models and AI really do make a big difference. So it's like anything we've always done in epidemiology: we really have to think about the context and the questions we're trying to answer, and the mechanisms, to tease out the validity, capability, and value of these different metrics in the different analyses we're trying to accomplish. And I think that's going to become really critical as we start to think about all of these different opportunities for data linkage that we have available. I think the basic question always is: can the answer be in the data? AI will always find an answer, but common sense should tell you whether the answer can be inside. If you train a prediction of stock markets on weather data, you will get a result, but not something I would invest in, and this holds for everything you're looking at.
Yeah, and I guess it goes back to that classic example from back in the day: does the number of flamingos predict births in Florida? I mean, that was a critical question 25 years ago. So how do we avoid those false discoveries, given that we know AI will always find something? What do we as humans need to be training our students to think about and avoid as we start to make use of this technology in these innovative ways? Sorry, you go for it, please. Alright, so I think education is a crucial thing, and one of the things that we have sort of masked under the hood is critical thinking. I see this as a faculty member these days, especially thinking about the impact of COVID. I even see it in my own kids: they don't have much attention to focus on any particular thing. So part of this is figuring out what we can engage them in, and how. I think the biggest problem is that we teach them black and white, but nothing in life is black and white, and so how can we think about ethics and legality, and can we teach that at a younger age? I think a lot of this can help address it. So when should we think of a false discovery as a false discovery, and when might it be something correlated under the hood? I think that might be the bigger direction to push as well. So, an amazing series of presentations, really. One thing that I've been trying to do while everyone was presenting is to go online and see whether I could find the methods you showed, for instance your GPT models. Really a lot of excitement; I really thought perhaps one day I will have a helmet, so that instead of typing in the question, someone can just read my mind and write for me. I'm joking about this, but part of my question is: what of what has been presented today is ready for prime time and available to everyone,
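The flamingo example is the classic spurious-correlation trap: scan enough unrelated trending series and some pair will correlate strongly by pure chance. A toy demonstration with entirely synthetic data (the "variables" are just independent random walks; nothing here comes from the talk):

```python
import random

def pearson(x, y):
    # Plain Pearson correlation coefficient.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def random_walk(steps, rng):
    # A drifting series, like many real-world counts over time.
    pos, path = 0.0, []
    for _ in range(steps):
        pos += rng.gauss(0, 1)
        path.append(pos)
    return path

rng = random.Random(42)
# 40 mutually independent "variables" (flamingo counts, birth rates, ...).
walks = [random_walk(100, rng) for _ in range(40)]
best = max(
    abs(pearson(walks[i], walks[j]))
    for i in range(len(walks))
    for j in range(i + 1, len(walks))
)
print(f"strongest pairwise |r| among independent series: {best:.2f}")
```

The strongest pairwise correlation comes out large even though every series is independent by construction, which is exactly why "AI will always find something" and why a discovered association needs a plausible mechanism before it is believed.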
as against what is still being developed? It's a general question to the speakers, but I'm also thinking this could be one of the outputs of the workshop: we could create brackets of different products, those that scientists or practitioners can already use, available online, hopefully for free, some perhaps behind a paywall, versus projects that we might be on the lookout for, that in one year or ten years might be available to us. What type of reaction would you have to this idea? Well, at the moment we are very impressed by the large language models and generative AI, but that's only one of the many, many flavors AI comes in. And these systems, because of the nature of how they are trained and how we are prompting them to work, are very much hallucinating. But already now they're extremely good. I was asked to write a comment on an article by a colleague, and for the fun of it I put it into GPT-4, said summarize it, praise it, criticize it, and attached the result as a supplementary document to my own comments. It was pretty good; I was very impressed. I could have just submitted it and everybody would have been happy. These systems get better and better. They were not made for scientific reference texts; they were made to make things up, to fill gaps, like we do when we are chatting. They are chatbots, trying to get through with whatever is close to reality. But as soon as we start really putting in more science as input data, and more scientific processes as the way we reinforce, they will do scientific jobs. I just read an analysis by a venture capital company: they say that already now a draft of a scientific paper is reasonably good, and in five years they expect it to be of the quality of the most skilled scientific writers. We might be the last generation who learned to write papers on our own.
The next ones will just know how to correct and polish them. Do you have a question? I saw the hand raised. So I think it depends on what you're thinking of as available. Marzia sort of alluded to this: a lot of this stuff is not reproducible. Even thinking about ChatGPT, you prompt it with the same question and it regenerates different answers for you. And that really harms reproducibility, because, for any two patients, do we really want different answers depending on slight language tweaks? Part of it is stepping back and figuring out what we want the tool to be able to do. Is it to write papers in a deterministic way, or in a less constrained, generative way, where there can be a lot of craziness? Do we need to put it behind a paywall? Who's going to use it? If we think along those dimensions, there will be a bunch of tools that are available. But in the end, reproducibility is the biggest question mark in my mind. There's also been a push towards the FAIR principles, and if I use those as the ethos, then I'd say a lot of these tools are not there yet. But for discovery, for thinking things through, I think a lot of it will be available. Here we have Genshin Nix. Very nice discussion. I want to add on to the discussion about reproducibility. I'm from a geography background, and there is a very good paper in PNAS by Professor Michael Goodchild about how to achieve reproducibility across space, especially for these larger models. Since our topic is environmental health, we care about how a model predicts across space. Often a foundation model's predictions perform very well overall but not uniformly across geographic space: it can achieve very good results in the US, but not in Africa or other places.
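The reproducibility complaint above comes largely from sampling during decoding: the same prompt yields different completions because the next token is drawn from a distribution rather than chosen deterministically. A minimal sketch of that mechanism (the token logits are invented numbers, and real systems decode far more elaborately than this):

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature).

    Temperature near 0 approaches greedy (argmax) decoding, which is
    deterministic; higher temperatures reintroduce run-to-run variation.
    """
    if temperature <= 1e-6:  # treat as greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

# Invented logits for four candidate next tokens (illustrative only).
logits = [2.0, 1.5, 0.5, 0.1]
rng = random.Random(7)
greedy = {sample_token(logits, 0.0, rng) for _ in range(100)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}
print(greedy)            # greedy decoding always picks the same token
print(len(sampled) > 1)  # temperature 1.0 yields multiple outcomes
```

This is why "deterministic mode" for clinical use is not a small tweak: pinning the temperature to zero trades away the generative variety that makes these models useful for drafting and brainstorming.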
So I wonder, since these models are very hard to modify, is there any suggestion, or what would the solution be for us? So I think, oh, sorry, go ahead, Thomas. I mean, the first thing is, I think actually the opposite is the case: this is one of the most democratizing technologies we have ever seen. If you see that ChatGPT found a hundred million users in two months, there has never been access to a technology so fast anywhere. And the prices are extremely affordable for what it delivers. I don't want to advertise for them, but I'm just saying it is already delivering astonishingly good products. And if you now unleash all of these clones, all of these open-source activities, we will see tremendous improvements of these models. But we certainly also have to manage expectations about what we can expect from such a system. We have to put things in place to control that you cannot assume another identity, and that you should be transparent about where you use these tools, because it is plagiarism if you just squeeze out something from the others in the world who have done something like you and present it as yours. So these are challenges for society, but the access is, I think, the aspect I see most positively of all. So I think there's been a big movement towards making all of the data these models train on open. There's been a lot of discussion, especially around GPT models and ChatGPT: a lot of it is reinforcement learning based, with a human in the loop, and those prompts are not actually available for anybody to replicate. So until we get to the point where we can get all the data that went in, and all the human annotations that came along with it, it will be very hard. But that's exactly where they're making the money: they're not going to release this data. And so I think part of it is maybe thinking about open-sourcing some of this, so that the community creates it and helps curate it.
We should think about sharing prompts, how we're using the models, and the exact samples we're using. And clearly, if we're thinking about health care or health diagnosis, there will be HIPAA issues, but synthetic data might help train it. So there are options that are feasible for these large language models. Yeah. I just want to come back to Andrea's point. One of the things that I would love to see come out of this workshop is some tools, a really short toolkit for people who are interested in integrating these areas, and that raises the question of ethical guardrails. Thomas, could you say something about that? We'd be remiss if we didn't discuss it a bit with regard to the language models, and overall, actually. The first thing is, I think that ethics should be the guidance for regulation. You cannot regulate on the basis of technological developments: if this is doubling in capacity every three months, no legislative process is fast enough. So we need to understand what the principles are, and they're not very different from the principles behind current regulations and laws: you cannot cheat, you need to be clear about who is talking, and you cannot take what other people produced and sell it as your own. I think these are important things. Then there are specific areas where we have to find ways of validating these tools. We need people we can trust who give them a check and tell us how well they work. I have headed a validation body for cell culture models for the European Commission; it is a big task, but it is exactly what we need. If you want regulatory decisions to be taken on the basis of AI, then somebody has to validate it and really explore it extensively to understand the risk in using such a tool. I have a different type of question for the panel, which is honestly inspired by a comment.
One of the comments that Thomas made was that one of the tools you presented can now be better than, or similar to, a PhD student. And by the way, my PhD students are better than me, so I'm really worried I'm going to be phased out soon. But there are two issues I would like to bring up. One is that I read at some point a comment by Noam Chomsky about ChatGPT: that the difference between machine learning or artificial intelligence and humans is that artificial intelligence will give you answers that are probable based on the data, but only humans will give you answers that are improbable. I want to know whether that is true, given your knowledge; it seems possible to me, but I'm not an expert. And I have to say, it is very possible that we overestimate the creativity we have in our jobs. It's very possible that most of us, and certainly this is true for me, come up with probable hypotheses and solutions rather than improbable ones. The other question is more related to another issue, inspired by the personal experience I have had with ChatGPT around hallucination: ChatGPT seems to lie flat out sometimes. I tried it; I asked it to write a biography of Andrea Baccarelli, and I look much better than I really am. I have so many degrees that some prestigious schools never even thought about offering. And I have to say, when I work with my team, I know the level of trust I can put in people. I have learned that, and I know how to gauge it: someone can be great at epidemiology, but I cannot trust them in toxicology, or vice versa. Sometimes someone is 99.9% honest and transparent, other people less so; I work with every type of person.
So I'm wondering whether there is a way to gauge the confidence we have in these systems. Of course, no system will be perfect, but I myself have started to learn what I can trust in ChatGPT and what I can't; clearly I cannot get ChatGPT to write an article about myself without checking it. I mean, I did the same; I tried it, because I asked who is Thomas Hartung, and I was also ascribed an invention which I have never made. But I can check the plausibility, and most of what was written was a pretty good short summary, which would have been difficult for me to produce so concisely. We also have to train these models for their purposes. If you want to write a scientific text and, say, restrict yourself to actual sources found on the internet, that's a completely different beast than a chatbot trying to give any answer, trained on all of the nonsense on the internet. So we have to keep that in mind. But humans also chat this way. I mean, I would be happy if they would give me the most probable answer they know; very often they give me the most provocative answer, the one they just came up with in the moment. So I think we really need to define not what ChatGPT is doing for science, but what a science GPT would do for us, if we now set engineering criteria for what we need to make it really useful for our purposes. That reminds me of Isaac Asimov's Foundation series. Do you remember? Basically everything is predicted based on statistics, but they can't predict the one rogue actor. Asimov was a biochemist, if I recall correctly. I think so. We're almost out of time. I want to thank our panelists again for the wonderful presentations and the discussion.
I think one thing we can all go back home and think about is, with all these AI tools that we talked about today, how would we as experts in a certain area work with them, how do we trust them, and how do we help other people trust them as well? They are going to be there, so we need to be ready. So, thank you. Thank you, Thomas. We are going to have a ten-minute break, and we will come back at 3:20. Welcome back from the break. We will get started with the rest of today's session with brief remarks by Rick Woychik, followed by Eric Topol's keynote, and finally a fireside chat with the two of them. So first, I'll introduce Rick Woychik, who became the director of the National Institute of Environmental Health Sciences, one of the National Institutes of Health, and of the National Toxicology Program on June 7, 2020. In this role he oversees federal funding for biomedical research to discover how the environment influences human health and disease. He and his staff receive input from several advisory boards and councils to accomplish this significant task. Prior to becoming director, and since 2011, Dr. Woychik served as deputy director of NIEHS. In this role he assisted the former director, Linda Birnbaum, in the formulation and implementation of plans and policies necessary to carry out NIEHS's mission and in the administrative management of the institute. So please welcome Dr. Rick Woychik. Terrific. Just doing a sound check; can you hear me okay? Yes. Okay, good. Lucy, great to see you, by the way. It's great to be here this afternoon, and thanks for this opportunity to provide just a few remarks. Looking at the program, I'm sorry that I haven't been able to join for the rest of the day, but I just want to reinforce some of the things you have probably already heard.
So I'll just start off by saying that I think we're working in some very exciting times; to be honest, we're now doing things that I've been waiting for my entire scientific career. We can now imagine employing a fundamentally transformative approach to understanding health and human disease. This involves integration of multiple different types of data. It could involve genomics experiments, where we go beyond one gene at a time to look at whole genomes and whole transcriptomes. We can incorporate data from the environmental sciences, and we can also incorporate, very importantly, data from research that evaluates the social determinants of health. So now we're taking a more holistic approach: it's not just about the transcriptome, or the genome, or one environmental exposure, but an approach that recognizes the interconnectedness of all these different factors and how they influence human health and well-being. A key driver of this approach, of course, is what you've all been talking about: having available, and being able to generate, omics data. This is genomics data, epigenomics data, proteomics data, transcriptomics, and more recently this whole notion of exposomics data, which, as Christopher Wild defined it back in 2005, is the totality of exposures over the life course. And recent developments in AI have the potential to provide us with the tools to integrate across these various omics data sets and the other types of data we're collecting. We can do this now in a way that will help us to understand human biology and the etiology of human disease. The expectation is that by leveraging AI's analytical capabilities, the biomedical and environmental sciences can join forces.
And they can begin to elucidate the relationships between environmental exposures and human health with a degree of resolution that, I predict, we haven't seen up to this point. Hopefully you've been participating in the meeting today. The purpose of this workshop is really to gather experts from diverse disciplines and from various sectors across the biomedical enterprise, to delve into some of the latest research findings and to explore opportunities for integrating environmental and biomedical data. We hope that by leveraging new, cutting-edge developments in multimodal AI, we can begin to unravel the complexities of an integrative health model. And I'm a big believer in the types of collaborative platforms that will foster discussions on innovative approaches, methodologies, and technologies that can enhance our understanding of the interplay between environmental exposures and human health. At the National Institute of Environmental Health Sciences, NIEHS, we recognize the significance of AI and data science in driving innovation and advancing our mission. NIEHS has already initiated several AI and data science initiatives, and we are actively involved in developing and supporting policies for data integration and data management, ensuring that NIEHS data adhere to the FAIR principles, making them findable, accessible, interoperable, and reusable, along with some new efforts around developing standardized vocabularies and ontologies for how we handle environmental data. The convergence of AI and big data, including that from environmental health scientists, represents a powerful opportunity to advance scientific knowledge and improve human health outcomes. So NIEHS will continue to support and develop resources that promote the responsible and equitable generation, integration, and utilization of environmental health data.
We are committed to collaborating with other NIH institutes and centers to integrate our data resources across the broader biomedical enterprise; I know you've been talking about this today. And we feel that by leveraging the synergy of AI, big data, and environmental health research data, we can make significant strides in unraveling the complexities of human health, and ultimately pave the way for innovative approaches to promote healthier and more sustainable lives for all. So that's it for me; I will turn the virtual podium back over to, I don't know, is it Lucila? Yeah, thank you so much, Rick. Everyone will hear from Dr. Woychik again at 3:45 Eastern, when we start the fireside chat. Now we have a short keynote address, and I will introduce someone who needs no introduction. Dr. Eric Topol is Professor of Molecular Medicine and Executive Vice President of Scripps Research, and the founder and director of the Scripps Research Translational Institute. He has published over 1,300 peer-reviewed articles with more than 300,000 citations. He is an elected member of the National Academy of Medicine and one of the top 10 most cited researchers in medicine. His scientific focus has been on the use of genomic and digital data, along with AI, to individualize medicine. He is also a practicing cardiologist. In 2016, Eric Topol was awarded a significant grant from the NIH to lead a part of the precision medicine initiative, now called the All of Us program. Prior to coming to Scripps in 2007, he led the Cleveland Clinic to become the number one center for heart care, and was the founder of a new medical school there. He was commissioned by the UK in 2018-2019 to lead planning for the National Health Service's integration of AI and new technologies, and has published three bestselling books on the future of medicine. So please welcome Dr. Eric Topol. Well, thank you very much.
I'm glad to join, and of course I'll mainly be getting into AI in the medical sphere. So let me share my slides so I can get going here; looking forward to our discussion. Hopefully you can see my slides, is that right? Yes. Okay, great. So I'm going to be talking about multimodal AI, which is a very recent and exciting opportunity we're just starting to get into. The first thing we come to when we start to see AI's impact is reducing diagnostic errors. There are over 12 million serious diagnostic errors a year in the US, and there's a classic NAM report estimating that each American will experience at least one of these diagnostic errors in their lifetime. When the diagnosis is arrived at in the first five minutes, the accuracy is very high, but after that it drops precipitously. And we have a big problem with physician overconfidence: when a patient dies, the autopsy demonstrates that the diagnosis was wrong 40% of the time. The term precision medicine is a real problem, because if you keep making the same mistakes over and over again, you are being very precise but inaccurate, and we need accuracy in medicine. What has happened is that we have moved from deep neural network architectures, like convolutional neural networks and recurrent neural networks, to a whole new architecture called transformers, which allows multiple attention-based inputs; this is how we get to multimodal AI. Just with deep learning we've already had an enormous impact, before this unimodal, focused on medical images, with what I like to call machine eyes: obviously machines don't have eyes, but their ability to interpret medical scans is remarkable and far beyond human capability. Trained on millions of chest x-rays, a model will catch what experienced radiologists miss, such as the presence of a nodule in this chest x-ray, which turned out to be cancer.
And this is across all types of medical scans; in mammography, the largest study was conducted by NYU. But it's much bigger than that. Machine eyes, trained on large numbers of annotated scans with ground truths, can look through the eye, for example the retina, and track kidney disease, blood pressure, and glucose and diabetes control; the retina is a window into the likelihood of developing Alzheimer's disease, predicting heart attacks and stroke, gallbladder and liver disease, high lipids in the blood, and the coronary calcium score. It's remarkable: these are things that we can't see as humans. And then there's the electrocardiogram. As a cardiologist, I could never tell you the age and sex, or the hemoglobin, or the ejection fraction from an electrocardiogram, or make difficult diagnoses from it: the presence of valve disease and its severity, whether the person is likely to develop an arrhythmia such as atrial fibrillation, or develop a stroke, diabetes and prediabetes, the filling pressure of the left ventricle, kidney disease, hyperthyroidism. This is the ability of machine eyes, which is quite incredible, and it's already being implemented, particularly in Japan and other Asian countries, where machine vision is used during colonoscopy and endoscopy, picking up polyps that would otherwise be missed, and also, in real time, reporting the likelihood of whether they need a biopsy and whether they are potentially cancer. So that was the unimodal world of AI, until March, when we had the first GPT-4 multimodal large language model. And that has led to something remarkable, not just with respect to the parameters, the interactions of the neurons as shown here on this graph, exceeding a trillion, but in bringing together all the different modalities, whether it's speech and voice and video and images and structured and unstructured text. This is not something that happened overnight.
This has been building up for decades, getting to these extraordinary levels of computing power, petaflops. And when we look at the parameters trained: GPT-4, as mentioned, was about a trillion, and the tokens trained on have gone up considerably; the Meta LLaMA model was trained on well over 1.4 trillion tokens. So basically, to simplify, these are the building blocks (they should be connected, but I haven't done that yet), these Lego blocks: transformer models were needed, requiring massive amounts of GPUs, but with that came the ability to reach computing power, in flops, that was unforeseen. It required not annotating the input but self-supervised learning, and that got us to this level of multimodal AI. Now, for the ability to understand each individual's uniqueness, all these layers of data can now be captured. That includes the environment, of course, the exposome, but also all the other biologic layers beyond just DNA and RNA: the microbiome, the physiology through sensors, the anatomy through scans, and of course the electronic health record and the immunome. We had a recent review on this topic of multimodal AI in medicine, and it leads to all sorts of remarkable possibilities. I'll touch on at least one of these, but the applications are very broad once you can bring together all these different domains of data. A couple of days ago there was a publication in Nature Biomedical Engineering which, interestingly, showed that a multimodal AI, specifically the model IRENE, was far better than previous models at making the diagnosis of pulmonary disease, by integrating chest x-rays, all the lab tests, and the electronic health record's unstructured text, and also at predicting adverse outcomes of COVID-19. So this is one of the first validations of the remarkable increase in accuracy from multimodal AI bringing together these varied inputs.
The hospital is a dangerous place, as it turns out. This is a study from last year of 11 Massachusetts hospitals, including some of the leading hospitals in this country: the adverse event rate was almost 35 per 100 admissions, and you can see the breakdown across adverse drug events, surgical procedures, and infections. That really sets up the hospital-at-home of the future, which of course would rely on multimodal AI. We've done several reviews on this topic, beginning a couple of years ago with an update on AI in health and medicine; then on self-supervised learning, and how big a step that has been, because we didn't have massive data sets in this sphere that could be annotated; the multimodal one I just mentioned; and the one we published in April on foundation models, or generative AI, the large language models we're discussing here. In that Nature review, we basically predicted what GPT-4 would enable; it wasn't yet out, that was the middle of March this year. The idea is that by taking everything known in medicine through publications, along with the different inputs of images, electronic health records, sensors, and biologic data, all those layers of data, you could take medicine to a new plateau. Now, importantly, this slide summarizes the different phases of large language model development. The biggest part is the pre-training, which for GPT-4 used approximately 30,000 GPUs; only a few places in the world have access to that many GPUs. After that, the fine-tuning that we're now seeing in medicine for various functions requires much less computing, as do the reward modeling and reinforcement learning phases. Just to show you fine-tuning: this was for medical imaging, and it trained very quickly, in 15 hours.
And it could do things that are really remarkable and challenging, like looking at a chest x-ray and naming all the different devices that are present, which would be hard even for radiologists to do very accurately. And this has already exceeded the performance of GPT-4. But there are lots of problems here; it's never as straightforward as you'd like. So we can have great summarization of lots of data on a particular patient. We can improve and even promote empathy. We can do all sorts of administrative tasks, like getting rid of, or reducing, the need for data clerk functions and keyboards. But there are lots of concerns about hallucinations and data security and bias that are very well founded, not to mention, as I alluded to, carbon emissions. And we need, of course, a lot more work to validate their performance, utility, and safety in healthcare. So this is a good summary from just a couple of weeks ago in Nature Medicine on what I call the chasm of AI in healthcare: the fact that we have lots of myths, and we have these realities that people need to be familiar with. So I'll just stop here, with the point that there's this amazing two-edged sword with AI in medicine as it gets applied in many different directions. And the one, of course, that I'm most excited about, that I'm not going to get into, is the gift of time that will hopefully be afforded and propelled by AI in medicine to bring back humanity, which has suffered so greatly over several decades. So with that, let me just acknowledge the many colleagues I get the privilege of working with, and our funding support, and open it up for our conversation and discussion. Thank you. Thank you so much. And I think we have about four minutes before we start the fireside chat. So why don't I abuse my privilege as facilitator to ask a question of Dr. Topol, and then I'm asking if people who are present in the room can also help identify who has questions.
My question is precisely regarding the large amount of resources needed to train the initial models. Do you think that will widen disparities between nations that will be able to do it and others that will not? Yeah, this is really troubling, because only a few tech titans like Microsoft and Google and limited others can do this. So we have dominance, hyper-dominance, and not even just countries; it's companies, and globally. Some people think that the requirement to do this pre-training will be compressed substantially in the years ahead, but we haven't seen that yet. So this is setting up a very awkward situation where the role of academics and so many others working in the space is to piggyback on the fine-tuning, not the actual development. That's okay because, in effect, at the moment, models like GPT-4 and Bard and Med-PaLM were not really medically trained at all, so they do need fine-tuning. But ultimately, we can't go on being fully dependent on just a couple or a few companies in the world that can string together 30,000 GPUs. It's preposterous; we're lucky to get a handful of them in our efforts. A lot of people are not aware of this, and not only that, but the energy requirement, the cooling requirement; this is not helping our environmental crisis. I had a recent discussion, a podcast that will be posted soon, with Al Gore, who of course has been so deep in warning about the climate crisis, and I asked him, what about these large language models, what is that going to do? And he thinks it can help come up with new ideas to solve the problem, but it's also going to engender more problems. Yeah, I wonder if Rick would comment on that environmental effect as well. I'm not sure there's much else I can add; I think Eric very accurately pointed out the environmental parameter here, and we just have to be conscious of that. It's there.
So, yeah. So I think we can officially start the fireside chat, but I see a hand raised from Darryl. Thank you very much; you can unmute yourself and ask your question. Dr. Topol, that was just great. You know, I thought that those of us who work on the public health exposome side, the ecto-exposome, had a great catch-all phrase in "from the cradle to the grave," but you just topped us with "from the womb to the tomb." Wow, thank you for that. But what I'm more concerned about, thinking about this: should we be concerned, for example with IRENE, about the inputs? Are the five inputs you showed there equally weighted, or, because this is iterative machine learning, is that something to be concerned about? Right. Well, thanks, Darryl. It's funny you point that out. One of my colleagues and mentors, Bill Kelley, once said I didn't have the right title for that review paper; he said it should have been "from lust to dust." So, yeah, a lot of different ways to express that. The inputs are the whole story, and obviously that's where we see the potential biases emerge. It's not so much the transformer architecture of the deep neural networks; it's our culture. It's putting in the whole internet and Wikipedia and the literature, so the inputs are big. And I think the biggest thing here, also, is a study I didn't mention from last week from NYU: they took 330,000 patients from the NYU health system and basically used all the unstructured and structured text to come up with predictions of readmission, morbidity, insurance denials, and survival prognosis. Extraordinary. So what we're learning is this machine ability to classify and process massive amounts of data. And at this point, if the inputs are good, and obviously large enough, which I think we can say here, we're starting to see things that I don't think we expected to come up with.
This is of course the whole debate in the AI community: is this "sparks of AGI," as the preprint Microsoft put out suggested? Is this a level of understanding that we did not expect to see, at least not this early? Of course, there are many who still think this is a stochastic parrot, just a matter of statistics and word prediction. It doesn't look that way to me, but it's somewhere in the middle, and it's a source of big debate. So I think the inputs are a big deal, as I mentioned, particularly for the potential of biases, but the bigger question is what we are seeing here: is this understanding, what's being called a world model, an understanding of the world? You have to decide for yourself on that one. Yeah, maybe I can chime in here too. When I look at some of the key challenges and opportunities, especially as they relate to environmental data, one of the concerns I have is just the broad heterogeneity and diversity of different data. It's very easy to say the exposome is the totality of exposures over the life course, lust to dust, whatever, but what are we actually measuring? And I think that's a real challenge. And the different data sets: take, say, transcriptomics, where you have the chip-based transcriptomics and then you have RNA-seq, and there's just the heterogeneity of the data. Whatever you end up with from AI is going to be a function of how good your data sets are coming into this. The other question, too, is: is it useful to do transcriptomics on a whole kidney, or do we ultimately have to get to the point where we're looking at which genes or proteins are being expressed, and their post-translational modifications, in specific cell types, say within the collecting duct of the kidney?
So I think there are some challenges. I just worry a lot that what we're going to get out is going to be a function of the quality of the data that we put in. The last thing I'll throw on the table here is the overall consistency and sustainability of our data repository system. In my mind it has been very ad hoc: someone has some data, they develop their own database, and now we're hemorrhaging databases, with no good standardization or consistency in how the databases are constructed. I mean, in the environmental health sciences we haven't agreed on a standardized set of vocabularies and ontologies for how we collect the data. And then we're also spending huge amounts of money maintaining these data repositories. What do we do on a global scale to eliminate some of the redundancies that occur? Everyone establishes their own database because they know they can manage it. But I worry that if there's no thoughtful integration across different databases, whether that's in exposomics or genomics or other things, we're spending a lot of money to support them. And the other thing is that I've been doing a lot of exploration of sustainability frameworks. For many people, sustainability means getting your grant renewed. Well, that's not terribly sustainable. The question is, what are the elements of a database that make it worthy of continued funding?
Or can we imagine that there is a life cycle for data repositories, where they have usefulness for some period of time, but then there is a framework by which we can measure whether they're losing their usefulness, so we can take the resources out and redeploy them to another data repository? So I just raise a number of different issues to put this in context. Yeah, the one that you raise that really resonates with me, and I should have emphasized it as well, Rick, is the incompleteness of data. As you aptly pointed out with the biologic layers: even though you might have a person's DNA sequence, the other layers are cell-specific, whether it's RNA or epigenomics, so you can't capture all the different tissues and cell types of the body. And the same goes for the environment; we're just scratching the surface there. We might get air quality or something like that, but what about all the other things? So it's terribly incomplete, and that's why it depends on the task of interest. If you're trying to do what Darryl asked about with IRENE, the model that was used for better pulmonary diagnoses, you have to understand the limits of those inputs. Even if we theoretically had all the layers, each of those layers has an incompleteness that's noteworthy. Right. I mean, just the physical and chemical complexity of the exposures we face is complicated enough, but then, as I'm sure Darryl would accurately point out, psychosocial stress is probably one of the biggest factors.
So how do we measure psychosocial stress in a way that we can plug into these multimodal AI systems, where you could be factoring in, say, epigenetic modifications of key genes that may be downstream of psychosocial stress, or sleep behaviors, or your blue and green spaces? It's very complicated, but I think these are things that we, especially the environmental health sciences community, need to pay attention to so we can develop better data sets. It would be great to get to the point where we could do the types of things you were pointing out, Eric, with the imaging capabilities that we have in hospitals: having AI capabilities to actually read the images and potentially make more accurate diagnoses, and eliminate some of those problems that we see in hospitals and with physicians. Yeah, one of the biggest challenges in the multimodal space is the high-frequency or continuous sampling through sensors. We can get at stress these days with wrist sensors, through heart rate and heart rate variability. We can get sleep data through sensors. But when you have immense data sets for an individual, with that type of continuous sampling, no one has yet worked out how to process that: taking multiple sensors and integrating them with all the other data. At the moment this is one of the analytical challenges. You can take an electronic health record and an image, okay; but when you start throwing in a bunch of sensors with weeks or months worth of data, trying to understand a person's stress level and sleep health and all sorts of other things, it hasn't been done yet. You know, Eric, I totally agree.
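The wrist-sensor stress signals mentioned above typically start from heart-rate-variability summaries such as RMSSD, computed over successive beat-to-beat (RR) intervals. A minimal sketch; the interval values below are made up for illustration:

```python
import math

def rmssd(rr_intervals_ms):
    """Root mean square of successive differences between RR intervals,
    a standard time-domain heart-rate-variability measure."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Illustrative beat-to-beat intervals in milliseconds (not real data).
rr = [812, 845, 790, 860, 825, 800, 870]
print(f"RMSSD: {rmssd(rr):.1f} ms")  # → RMSSD: 51.3 ms
```

A real pipeline would compute this over sliding windows of weeks of data and align it with the other modalities, which is exactly the unsolved integration problem described above.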
And that's where, quite frankly, I think AI and ML can come into play, because any given data set is just very complex; take transcriptomics, which is probably the simplest, or genomics. I mean, we're also still not factoring into the equation the somatic mutations that may arise over the life course and may be influencing, in very serious ways, the development of cancer or other health outcomes. And that's where we have these capabilities to let some of these powerful tools integrate across these different data sets and hopefully give us greater insights into human biology. I have a question from the chat, though I don't know if Rima Habre has a microphone, so I'll just repeat it here. It's a question for Dr. Topol: you mentioned it's not more quantity of data but rather higher quality and more complete, diverse labeling of data that is needed. Should we be investing in creating systems to generate better, higher-quality labels that are unbiased, to better serve discovery, as much as we are investing in the predictive power of these AI models? What would we need to get there, to incentivize this? Yeah, well, I think it's fair to say that labeling is going out the window here in healthcare. There were only a limited number of well-done annotated data sets; we saw them for chest x-rays and skin lesions and retinal photos. And then basically the medical community abandoned this, because it's so resource-consumptive and you can't get experts to spend their time doing it. Fortunately, around the time that realization was being made, there was also the whole self-supervised learning approach: letting the data, if you will, label itself.
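The "data labeling itself" idea can be shown with the simplest possible self-supervised objective: take raw, unlabeled text, and treat each word as the training target for the word before it. A toy sketch using co-occurrence counts, nothing like a real transformer, with made-up example sentences:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Build next-word counts from raw, unlabeled text; the 'label'
    for each position is just the word that actually follows it."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            follows[a][b] += 1
    return follows

def predict_masked(follows, previous_word):
    """Fill in a masked word given the word before it."""
    options = follows.get(previous_word)
    return options.most_common(1)[0][0] if options else None

corpus = [
    "the chest x-ray shows a pacemaker",
    "the chest x-ray shows pneumonia",
    "the chest wall is normal",
]
model = train_bigram(corpus)
print(predict_masked(model, "chest"))  # → x-ray
```

No human ever annotated anything here; the supervision signal came entirely from the structure of the data, which is the core trick that let models scale past hand-labeled medical data sets.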
We're unlikely to go back now. It would be great to have cross-validation between data sets carefully annotated by experts and self-supervised or unsupervised data sets, but it doesn't look likely to happen. We just can't get that many expert physicians willing to put in the time, at scale, to do this with hundreds of thousands of images or whatever inputs you're talking about. So where things are headed now is basically abandoning the idea Rima brought up. On the other hand, we're learning that you can get a lot out of unlabeled data, like the NYU study I mentioned, or the study with the IRENE model this week for pulmonary diagnosis; these are completely self-supervised model examples. So we'll be seeing a lot more of that. But that goes back to what Rick emphasized, the incompleteness and accuracy of the data, as well as the bias of the data that goes in; especially when you're not even trying to label anything, it just compounds the problem, and with large language models we could make these problems worse. Hey, Eric, I'd like to get your point of view on the issue I raised around heterogeneity of data repositories. We have these new efforts happening, I think primarily in Europe, around the Global Biodata Coalition, where maybe we could be a little more proactive in planning: what are the data repositories that the world community needs? And if they don't exist, how do we create them, and how do we fund them under sustainability frameworks? What are your thoughts on this whole concept of data repositories? Well, the one that's had the biggest impact in medicine has actually been the UK Biobank, where just about every week there are new insights and papers being published. So I think it has proven now that these repositories are extraordinarily important, with open access to the research community.
And All of Us, of course, aspires to that when it gets fully loaded with its million participants and all the deep data for each. But we don't have, as you nicely point out, the kind of collaborations and standardization we need, and we haven't seen many data repositories that have made contributions at the level of the UK Biobank. It would be great if we could do that now. What's interesting is that in Israel they formed what's called the Human Phenotype Project, and they've had 14,000 people, about 5,000 of whom have already been back for a second visit, where they collect every layer of data you could possibly collect: cognitive testing, retinal photos, everything in medicine you can imagine, the microbiome, and of course sequencing and immunologic studies. Now they're going to other countries, Japan, countries in the Middle East, to try to come up with, as you're getting at, a standard way of collecting the data, organizing the data, and making it accessible, all open access. They're getting a little traction on this, but they, at the Weizmann Institute, are one of the first groups I know of that's tried to do it. We haven't really had the kind of global collaboration that we need, and essentially we're relying on just a very limited number of sources, even though there's so much activity in this space. I think that's where, being an IC director, we've been talking a lot about the role of the NIH in helping to nucleate some of that centralized governance, and we're trying to get some things happening along those lines. We have to get away from this model where everyone does it on their own. And it's laudable that the UK has done this with the UK Biobank; it's laudable that Israel is doing this through the Weizmann Institute.
Yeah, but we have to get away from everyone individually realizing something needs to be done and deciding to pick up the mantle and run with it themselves. There needs to be more coalition. Are you familiar with the Global Biodata Coalition? I've heard of it, but I don't know any of the details. That's something maybe we can all pay a little more attention to. One of the things we're struggling with at the NIH is that maybe we need a similar model for the different databases that are funded across the 27 institutes and centers at the NIH. So that's a challenge as well. Yeah, it would be great to pull that all together. With the limited time I have, I also wanted to throw in, and Eric kind of touched on this, the issue of wearable biosensors and how we collect data. I think that's one of the areas that will advance in the coming years: getting to the point where we have individual wearable technologies that can dump data in some structured way into these data repositories that we can take advantage of. I think that's going to be really important. Yeah, and that's one of the things that came out of the UK Biobank, because they had tens of thousands of people, over 50,000, with Fitbits. That was exemplary of what you can learn from that sort of thing. And the other thing you're touching on, which is a theme that can't be emphasized enough, is diversity of inputs. The UK Biobank is largely, almost exclusively, European ancestry, and it's only by combining forces, as you've touched on, that we can get the diversity of race, ethnicity, and every other aspect that we want to have as inputs for these models. The one thing I appreciate is that almost anything you look at is going to be a function of genetic background. Exactly, genetic slash epigenetic background. And now you've got to factor that into the equation.
It's really laudable that the leadership of the All of Us program is making a real effort to make sure that that million-person cohort represents the diversity, at least across the United States. Of course, the global community faces the challenge of capturing the full diversity that exists in the human population. Yeah, just about half of the 600,000 people enrolled so far are underrepresented minorities. That's a feat that has never even been approximated in the past, so it's a real standout aspect. The other thing, of course, is returning the data to participants, which is what All of Us aspires to do, and that's something we've got to do better as well. We can gather all these large language models and build all these publications, but when are we going to start helping patients? Getting this into things like a virtual health coach in the future is an exciting opportunity. I could foresee, say, where someone's at risk for asthma, giving them all their inputs, like we've seen with the Louisville study on hotspots for asthma and reducing the toll of asthma. It's just one example of people getting that feedback continuously for themselves. That's another thing we haven't done yet; it's nice to publish all these papers, but let's help patients. You got it. I totally endorse that; it's really putting our science to work for the benefit of public health. Exactly. Excellent. I see that Kristen Malecki has a question; maybe she can unmute. Oh, sure. Sure. I mean, I think this is all really helpful, and that last answer, or that last example, really starts to get at my question.
You know, the goal of these workshops is really to think about: are we ready now to integrate environmental data into these large predictive health AI mechanisms? And if so, where do we start, how do we begin, and what are the current barriers? Are we at a point where chemical structures could be integrated here, and what do we do with that information? And if not, what are the barriers, and how do we get healthcare to really start thinking about these environmental factors as critical to patient vulnerability and susceptibility and response? Rick may be able to answer that better than me. I guess we need an operational definition of how we collect the environmental data. Do we have chemical structures? But it may not be just one chemical. Eric, you may have heard this, but the environmental health sciences community is absolutely embracing the notion that if we want to understand the effect of the environment on human health, we have to embrace this totality of exposures, because it's not just air pollution. And it's not just all the different components of air pollution; air pollution is a catch-all term for PM2.5, for ozone, for a whole variety of different chemicals. But it's also going to be the flame retardants you're probably breathing in, coming from, I don't know, a leather chair maybe. It's this totality of exposure we've got to figure out, and how to put it together in some way where we can ultimately use it to better our understanding. And there's the other approach, too, in a lot of the exposomics work that we're not talking about as much: can we measure some of the biomarkers and biological effects of exposures?
So, you know, my mother was probably exposed to things that she wasn't even aware of, but my epigenome is probably reflecting that and influencing how my genes are being expressed. So it's a complicated issue, but it's one that we need to be focusing our planning activities around. I couldn't agree more. I mean, there are a lot of things out there that are very disturbing, like seeing the rates of cancer increasing substantially in younger people, which brings to mind: what is it about their environmental exposures that's doing that? Because we're seeing people in their 20s with colon cancer more and more now. What is going on here? And there's nothing to indicate that genetics is in play. So I hope we can make some headway here. This is something we barely have our arms around in the medical community, because the only things we can tap into are nominal, marginal; there's no depth there, as you're getting at. I hope we'll see a real improvement. The other thing I just want to throw on the table is that in the environmental community we focus so much on all the bad things in the environment. Well, the fact is that there are actually a lot of good things in the environment. I don't think we have any clue yet why exercise is so beneficial. And there's plenty of data now emerging that the omega-3 fatty acids from fish, even though the fish may carry methylmercury, have health benefits associated with them. And the polyphenols in blueberries and blackberries: is this for real? These could be positively impacting our health, but figuring out how we integrate that into an exposomics framework is also part of our challenge. Yeah, the nutritional side of this is ginormous.
And just another part: in our studies we try to get two weeks of everything a person eats, taking pictures of it so that we can use AI to analyze it, and that's also just scratching the surface on nutrition. And I really applaud the All of Us program; you probably know that there's a small segment of the million-person cohort where they're doing high-resolution mass spec, actually doing metabolic profiling, specifically for this purpose of precision nutrition. You're not relying just on what people tell you they ate, which may not be accurate; you're actually looking at the metabolic profile as a way of objectively assessing what actually made it into the system, and then also factoring in individual biological variability. There are people who can eat Western McDonald's diets and stay rail-thin their entire lives. So it's good; I think we're finally beginning to really embrace that we can do something about this. Yeah, I do think, as you're getting at, that the metabolomic layer of data will be helpful; it's been difficult to get at because of the expense of doing it at scale and doing it properly, but it may give us a lot of important insight. Well, along those lines, Eric, the other thing is, with ARPA-H coming on the scene now, I know that Renee Wegrzyn, the director of ARPA-H, is very interested in whether there are new technologies we can imagine for capturing some of that high-resolution mass spectrometry data. Are we doing it at the scale we need, and with the affordability that needs to happen? I mean, we've seen this in the genomics community. Some of the primitive technologies we started off with, you know, those two-color microarrays, spotting cDNAs on microscope slides. It was a start.
But look where we are now, because we focused on that technology development. What are the things we actually need to get done? We need to do the same thing in the environmental health sciences. Yeah, a really recent and I think notable example is wastewater surveillance. We didn't used to do that, and here now we can look at every pathogen, whether it's SARS-CoV-2 or polio or mpox and the whole list. We had never done that; we ranked terribly among the countries of the world, at the bottom of the rich countries for doing it, and so we've learned how important this is. And that's just one example; it took a pandemic to figure this out, right? So I'm pretty familiar with a lot of those wastewater experiments; part of my evenings and weekends I spend on the RADx project, and Lucila knows that well, managing all that data is important. But it's interesting: some of those digital PCR strategies that are being used as part of wastewater surveillance actually came out of DARPA projects. So that's why I'm thinking, and we also talked about frameworks and approaches: we need more collaborations, so that the people developing new technologies, the engineers, know what to develop and have a really solid framework of what the physicians and biomedical scientists actually need, rather than engineers just building the things they think we need, which may or may not overlap with where the actual needs arise. Sure. And I hate to interrupt such a great conversation; I think we're out of time, but I love to see how positive both of you are about what is coming and what impact all this will have on health.
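The digital PCR strategies mentioned for wastewater work by partitioning a sample into thousands of droplets and counting how many light up; target concentration is then recovered from the positive fraction via Poisson statistics. A minimal sketch; the 0.85 nL partition volume is an assumed typical droplet size, and the counts are illustrative, not real wastewater data:

```python
import math

def dpcr_concentration(positive, total, partition_volume_ul=0.00085):
    """Estimate target copies per microliter from digital PCR counts.
    Poisson correction: mean copies per partition = -ln(1 - p),
    where p is the fraction of partitions that tested positive."""
    p = positive / total
    mean_copies_per_partition = -math.log(1.0 - p)
    return mean_copies_per_partition / partition_volume_ul

# Illustrative counts: 4,000 positive droplets out of 20,000.
print(f"~{dpcr_concentration(4000, 20000):.0f} copies/uL")
```

The Poisson correction matters because a positive droplet may contain more than one copy; simply dividing positives by volume would undercount.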
So my function is to thank you both for taking the time to speak to the panel here, and to pass the baton to the organizing committee for closing remarks. So thank you very much. Thank you. Bye-bye now. Great. Thank you very much for leading that session, and to Dr. Woychik and Dr. Topol for really ending us on a very positive note. And thank you to everyone who's joined us all day and stayed with us for this really compelling discussion. I'm Carmen Marsit from Emory University, and I've had the great pleasure, with a group of hardworking colleagues and National Academies staff, of developing the program for this workshop. I'm going to take a little time now to try to summarize some of the themes we've heard across this tremendous lineup of speakers and panelists. We started the day with a first session designed as a level-setting opportunity, to bring everyone to the table to understand the foundations and challenges of bringing AI to environmental data to address health questions, by talking about the current state of that environmental data, AI methods, and their application to biomedical research. We started with Dr. Bryce, who highlighted some of the biggest opportunities and challenges in environmental data. In particular, he noted the explosion of data with increased temporal and spatial resolution, and opportunities to think about how data being collected for one purpose could be repurposed to consider health-related consequences, with examples including pathways, water, heat, satellite data, and traffic. In addition, he highlighted opportunities that are developing because of the growth in personal measurements, which we saw highlighted a few weeks ago in another of these workshops. And this is really allowing a movement toward the concept of what we talked about as precision environmental health.
There are also big challenges that come with those opportunities: integrating this data, making use of it to predict risk and prioritize efforts, and being sure to assess the effectiveness of this kind of data. He also highlighted the importance of training researchers and our public health workforce to use this kind of data moving forward. Dr. Grotto talked about opportunities to use data and AI to mitigate inequities in healthcare by developing new algorithms and tools in informatics and data science to bring together individual health data, like clinical data from electronic health records, with biological data like genetics, genomics, or the microbiome. Building predictive models, she highlighted, really requires large data repositories to train the models, and this needs to move toward a more global effort, with privacy protections. She highlighted the NIH All of Us program, which is enrolling a cohort of diverse individuals who are actively giving permission to share their health data, and which is growing its body of environmental health data as well. She also highlighted considerations of genetic risk in admixed populations, and the need to begin to include other types of data, such as social determinants and environmental data, in building risk models, and to do this in diverse populations so that, as she said, no one is left out. Dr. Ghassemi reminded us that our goal should be identifying actionable insights for human health, with the ultimate goal of identifying models that can perform better than humans in making decisions about a person's health. At the same time, she noted that AI learns from humans, and so biases are being introduced, given known biases in existing biomedical and clinical research and implicit biases in the clinical landscape. To help address these shortcomings, Dr.
Gosemi talked about the need to consider the effectiveness of these models across different groups of people to see if the models are being developed perform equitably across different types of individuals, including those at different intersections. She also noted the importance of building in fairness constraints to try and avoid inclusion of these biases and to learn from other regulatory agencies how best to regulate the use of AI tools. In the second session we dove into more concrete examples and use cases of how AI and machine learning have been used to bring together environmental health data with biomedical data to address important health questions. The session started with Dr. Patel who noted a number of large cohort resources that are starting to do these types of data integration. This includes examples like using metabolomic data of diet to understand internal exposomes and predict outcomes, polyexposure risk scores, which bring together multiple layers of exposure data, and considers those along with polygenic risk scores, and novel measures of aging based on MRI phenotyping. Given that there are all these new possibilities to integrate multimodal data, there are needs to think about the ways the models are developed to ensure that what we're doing is really robust. Dr. Zhang focused on how to bring various modalities of biological data from often large multiomic data sets together. One of the biggest challenges that we see when doing this is that there's often missing data, missing modalities, and that was traditionally addressed with the imputation. With AI, he introduced different approaches that could be used by looking at similarities between individuals and using different methods to improve the models and address the missing data problems. These methods may be more powerful as they don't need prior knowledge and they may produce less bias in their results. And then Dr. 
Hansen brought up the promise of being able to follow an individual throughout their life, to provide early identification of risks so that intervention and treatment can happen, and then to continue to monitor those individuals to understand how the environment may be impacting their treatment as well. Importantly, this needs to happen in a scalable framework so it can be done at a population level. She noted that addressing this challenge requires interdisciplinary team science, and she used as an example a collaboration between her group at Oak Ridge National Laboratory and the SEER program at the National Cancer Institute, where they've developed machine learning models to auto-code information on tumors from pathology reports, with built-in accuracy checks, and have developed methods to integrate information pulled from residential address histories to build external exposomes. The goal is to build foundational models from such data streams that could then provide the flexibility to address various downstream questions. And foundational models are something we've heard as a theme across a lot of what was talked about today. She also emphasized that this work needs to be done in an open-science framework to assure reproducibility, replicability, and usability for real-world data applications, and also to assure appropriate protections for individual privacy.

In session three, we focused on new methods that can be used to aid in data integration. The session started with Dr. Ho, who motivated her work on the basis of health equity. As example data streams, she suggested using common social media, mapping apps, mobility data, and community-based forums to aid in developing potentially different measures of social determinants of health on shorter time scales that could then be related to health measures.
This could allow for more fine-grained examination of information about neighborhoods and their characteristics, particularly if you bring in human domain knowledge to guide refinements. The challenge is that these types of data streams are messy, but new models need to be trained using this type of real-world, messy data.

Dr. Hartung pointed out the convergence that is really bringing us here today: the huge increase in data, the increase in computing power, and the development of new AI methods. He particularly highlighted the growing importance of foundation models, which underlie GPT and can be applied to a variety of tasks. He called for rethinking and advancing regulatory toxicology, and for a Human Exposome Project that incorporates new technology to drive this new toxicology, with the incorporation of AI to help extract data from the literature and synthesize that information to inform policy. He highlighted a number of existing tools to demonstrate the growing potential of this approach, including the utility of AI tools to replace expensive and time-consuming animal testing protocols for various endpoints. He challenged us all to think about how we can take advantage of this new science and be ready to leave behind old ways of thinking.

These presentations led to a discussion on public trust and privacy. The speakers suggested that trust may be supported by continued reliance on and understanding of evidence-based solutions, and by the use of approaches like explainable AI, which describes how these tools arrive at their answers. There may also be a need to rethink what privacy is and how willing people are to share their information, particularly if they understand what its potential importance could be.
There was also some discussion about thinking through the questions being asked, the ability of the data to answer them, the ethical guardrails, and the continued incorporation of human thought when applying these tools and utilizing their outputs.

Finally, in our last session, we welcomed Dr. Rick Woychik, Director of the National Institute of Environmental Health Sciences, and our keynote speaker, Dr. Eric Topol. Dr. Woychik set the stage by describing the incredible wealth of data that is now available to help understand the environment's impacts on human health and the real potential we have to bring this data together using new AI-based approaches. He highlighted the importance of collaboration and how he is strongly supportive of identifying synergies between NIEHS and its researchers and researchers across the NIH.

Dr. Topol talked about the potential for AI to help reduce diagnostic errors. He provided a number of examples where machine eyes, based on unimodal AI, have already been successful at identifying information about patients and their diagnoses beyond what a physician would be able to see. Now we're moving into multimodal approaches, which can take the exposome along with multiple biological layers to improve medicine, and he showed examples of how integrating various layers can improve the performance of diagnostic testing and, potentially more importantly, prevent errors. He also cautioned about challenges, including bias, which we heard about a number of times today; carbon emissions from the computing needs; and the need for validation of the utility and safety of these models. He ended by saying he is most excited that there's an opportunity to bring humanity back to medicine by using some of these approaches.

I want to thank all of you for joining us today, and we hope that you will join us tomorrow starting at 10 a.m.
We will talk about governance and infrastructure for AI, technologies and tools to advance environmental health and biomedical research, and provide an opportunity for everyone to get involved in the conversation. See you then.