Okay, well, thank you everybody, and especially thank you to Leah and to all of the speakers this morning for a really lively and terrific session. The morning session was intended to be about open access for research articles, but as you saw, the article and data questions are frankly really intertwined, and so we did have some really good discussion about data and data considerations this morning. We're going to follow up on that in this session, and we're going to talk about how the open science mandate creates opportunities for better data archiving and for better science stemming from better data archiving. And we're very fortunate to be kicking off with a talk from Bob Hanisch, who is the director of the Office of Data and Informatics in the Material Measurement Laboratory at the National Institute of Standards and Technology, NIST, in Gaithersburg, Maryland. This next sentence says he is responsible for improving data management and analysis practices and helping to assure compliance with national directives on open data access. And you notice it doesn't say "at NIST." That's because if you operate in the data space, you know that he's responsible for all that in the world at large. Bob is just ubiquitous all over the data space, and, as I said, he's an evangelist and an organizer, and he is really just vital to this whole endeavor. Prior to coming to NIST in 2014, he was the senior scientist at the Space Telescope Science Institute in Baltimore and was the director of the US Virtual Astronomical Observatory. For more than 25 years, Dr. Hanisch led efforts in the astronomy community to improve the accessibility and interoperability of data archives and catalogs. So please join me in welcoming him.

Well, Jake, I don't know if I can match that introduction, but I really appreciate it. I'm going to talk about FAIR data repositories: expectations, obligations, and expenses. And as I said earlier in some of my questions, my colleague Brian from DOE is always extremely careful in the way he words things. I'm going to be deliberately uncareful, because I'm close to retirement, and if I get in trouble, I don't care. You all know about FAIR; I don't need to explain that acronym in this room, I don't think. Expectations we've already heard about today. Mariam set up the scenario this morning with the Holdren memo from 2013. I also interjected here a sort of chronology: the FAIR principles came out in 2016; in 2022 the Subcommittee on Open Science published this white paper on desirable characteristics of data repositories for federally funded research; and then of course there is the Nelson memo that we heard about earlier today, also from 2022. Another important expectation, which flies below the radar, is the annotation of measurements with their units of measurement. This is an area that is ripe for improvement. If you don't do this right, you will never have machine-actionable data. There are numerous accounts of huge mistakes that have been made in the aerospace industry, for example, because units were not properly annotated and assumptions were made about units that were incorrect. And, you know, a mission to Mars failed because units were doubly converted when they didn't have to be. So I led a paper a year and a half ago or so, "Stop squandering data: make units of measurement machine readable." It was an opinion piece published in Nature.
And it really is something that all of us who are involved in taking data and annotating data need to be aware of: if we don't get the units into the measurements, you'll never be able to have machines understand how to compare those data sets correctly.

Obligations. Again, we've heard about the requirements on data management plans, data management and sharing plans. The obligation I see, as somebody who works at the infrastructure development level, is on assuring that data are born FAIR. You don't want scientists to bear the burden of annotating their data by hand. For one, if they do it, which they usually don't, they will do it poorly. They're not interested in this; it's a waste of their time. So we need to make sure that data are born FAIR, and are born FAIR through automation. Laboratory information management systems, electronic laboratory notebooks, any of our technologies that can extract data automatically from an instrument or from a computer simulation are what is really key to doing this right. We need to have data models and metadata standards. Increasingly we run into problems with vendors who sell you an instrument where the data is in a proprietary binary format. This is sacrilege to me, because in terms of the transparency and reliability of research data, we as researchers need to understand what's going on inside that machine. Of course, vendors want to sell you not just a piece of hardware; they want to sell you a whole software stack. But this is counter, in my mind, to the good practice of science, which relies on transparency in the whole process.

FAIR digital objects are the emerging technology that is being talked about a lot now; there's a conference coming up next month in Berlin about FDOs, and I'm on the steering committee for this activity. This is the idea that you take any digital information and wrap it in metadata with appropriate PIDs, persistent identifiers, such that it becomes a machine-actionable piece of information. And you can build FDOs of FDOs of FDOs, so you can have a sort of hierarchical construct, all of which becomes not only machine readable but machine actionable. So this is a really important piece of, again, leveraging us as infrastructure providers to make sure that we make our data as widely accessible and reusable as possible through technologies like this. I mentioned the Units of Measurement Interoperability Service that a colleague of mine, Stuart Chalk, has been developing with support from us at NIST. This is a toolkit that allows researchers to encode the units with their data properly, using an established metadata encoding scheme. And if you have data that is represented in one encoding scheme, you can translate it to another without any loss of information and without any mistakes in scaling. It links automatically to the fundamental constants; NIST is the custodian of the CODATA fundamental constants. These are things like the Planck constant and the electron mass and so forth. All of that information underpins the SI, the international system of units of measurement. A big thing for us in metrology is traceability: I make a measurement, it is calibrated against something else, which is calibrated against something else again. That whole traceability chain is fundamental in measurement science, and services like this are key to making that work in a dynamic, computer-oriented way.
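As a minimal sketch of what machine-readable units buy you in practice, here is an illustration using the open-source Python library pint; to be clear, this is not the NIST/Chalk units service itself, and the quantities are made up for the example:

```python
# Sketch only: the open-source "pint" library standing in for machine-readable
# units; this is NOT the NIST Units of Measurement Interoperability Service.
import pint

ureg = pint.UnitRegistry()

# A measurement annotated with its unit at the moment it is recorded
thrust = 2.5 * ureg.lbf                  # hypothetical instrument reading in pound-force

# Lossless, programmatic conversion to SI -- no human guessing about units
print(thrust.to(ureg.newton))            # ~11.12 newton

# Mixing incompatible units raises an error instead of silently corrupting data
try:
    thrust + 3.0 * ureg.second
except pint.DimensionalityError as err:
    print("caught inconsistent units:", err)
```

The point of the sketch is that once the unit travels with the number, conversion and consistency checking become machine operations rather than assumptions a reader has to make.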
And of course it would be remiss of me not to mention the Research Data Framework; this came up this morning in Mariam's talk. This is a document and a website that we have been developing now for over four years at NIST that is basically a guide to the research data ecosystem. There are six major data life cycle stages, and there are roles and responsibilities for different individuals in the research data ecosystem: whether you're a bench scientist, a librarian, a funder, or a dean of faculty, all of these people have roles and responsibilities in research data management. It is not a standard. It is not even a guideline. It's a tool to help people assess their capabilities, to improve, and to decide where the most important areas are to invest their resources.

On to my second obligations slide. Federal agencies in the US supported $54 billion in university-based research in 2022. That's a lot of money, $54 billion; I could have a nice vacation home in Spain for that. I estimate that about 10% of that gross budget supports research publication costs: article processing charges, subscriptions. I based this on the fact that Elsevier's income in 2022 was 3 billion euros, and the STM Association estimates the journal market value at $10 billion. So this is an astronomer's estimate, an order-of-magnitude sort of thing, but roughly 10% of that research budget goes into assuring that the research results are published. Publications are an essential component of the research record, but we also need the data to assure real reproducibility and reliability, the R words, and also the T words, transparency and trust. And of course, open source software is like data; it's not exactly the same as data, but it's part of the process by which people reach scientific conclusions based on the data they are looking at. So all of this has to be visible as part of the research record.

Quality data are also the fodder for artificial intelligence. Garbage in, garbage out has been part of the mantra of computing for the past 50 years. We see this now in large language models that are trained on garbage information: they give you garbage conclusions, right? They come up with nonsense; they come up with hallucinations. So, again, the onus is on us, on people like me who help to build infrastructure to aid scientific research, to assure that we have the best quality information feeding into the system, and that it's done as automatically as possible, taking the human propensity for error out of the equation. And this, in terms of obligation, is the thing I want to drive home: funders should be compelled to set aside long-term support for data repositories. I see it as nothing short of a moral obligation. With $54 billion going into basic research at universities, the fact that there's no assurance that the outcome of that research is preserved in perpetuity is to me simply, what's the right word, it's just reprehensible. I think we have to change our thinking about this. The onus is indeed, in my view, on the funders to make sure not only that the interpretation of the data gets out in papers, but that the data themselves and the tools that were used to reach those conclusions also get out and are preserved for posterity.

Expenses. How many times have I heard that data curation and preservation is too expensive? I don't know; every month I hear it. It's not true. It is not true.
I worked for 35 years in astronomy, where data curation and preservation has been part of the fabric of the field now for three decades. I've done a pretty thorough survey of the expenses of data management in astronomy, and it varies between 1 and 10% of the annual operating budget of a major telescope. Ten percent gets you highly calibrated, highly curated data; 1% saves the data off the instrument. Somewhere in there is a sweet spot, right? At NIST, we live on a shoestring, and we have implemented our public data repository for 0.1% of our annual operating budget. Now, we do not do a terribly thorough job on all of our research data in terms of curation, but at least we're saving it. And we're saving it sometimes even in those vendor-proprietary formats, because we don't have an open alternative, but at least we're saving it. And we did that for about 0.1% of our annual research budget. So my proposal is that we should set aside, in our major public funders, 2 to 3% of the federal research budget, and we would solve this problem. This is not a technical problem. We know how to store data. We know how to annotate data. We know how to transfer data from one medium to another. We know how to figure out the cost trade-offs of cloud versus on-premise. We know all this. This is not a technical problem; it's a social engineering problem. We need the commitment to do it. And the community needs to start knocking on the doors at NSF and NIH and DOE and say, we demand this. Because if we don't do this, we're squandering our resources for future research. This data can be reused and repurposed for a fraction of the cost that it took to acquire it in the first place. In astronomy, I worked on the Hubble Space Telescope data archive for many years. That data is used three times more frequently by researchers who had nothing to do with acquiring it in the first place. This is a huge return on investment. The Sloan Digital Sky Survey team wrote, I don't know, a few hundred research papers; the community wrote 6,000 research papers using that data. When you have data that is well curated, well annotated, well characterized with metadata, it will be reused and it will be recombined in ways that were not imagined by the people who took the data in the first place. So for 2 or 3% of this gross budget, you will get a return on investment which I predict will be 50%, 100%, 200% of the cost that you put in. This can be done by setting up a network of domain-specific research repositories. They can be re-competed on a three- to five-year basis so that you're getting the best return, the best quality, in that data curation. We don't want to see 5,000 data repositories with non-interoperable metadata standards; that would be a disaster. But we do have to recognize that data curation does have specificities that depend on the research domain. Even in astronomy, maintaining x-ray data is different from maintaining radio data: the techniques of observing are different, the noise characteristics are different, and so we divide the data up in terms of the wavelength spectrum. Different research domains will divide things up in different ways. But this can be done, and the fact is that I've been arguing for this for 20, 30 years; my head is pretty sore from banging it against the wall, but I keep doing it because I think it is so important. And there's no reason that we can't do it.
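To make the back-of-envelope numbers in this argument easy to check, here is the arithmetic spelled out with the round figures quoted in the talk; no new data, just the speaker's own order-of-magnitude estimates:

```python
# Back-of-envelope arithmetic using the round numbers quoted in the talk.
federal_university_research = 54e9      # dollars per year (2022 figure cited above)

publication_share = 0.10                # speaker's ~10% estimate for publishing costs
publishing_cost = publication_share * federal_university_research
print(f"~${publishing_cost/1e9:.1f}B/yr estimated to flow into publishing")

repository_share = (0.02, 0.03)         # proposed 2-3% set-aside for data repositories
low, high = (s * federal_university_research for s in repository_share)
print(f"proposed repository set-aside: ${low/1e9:.1f}B to ${high/1e9:.1f}B per year")

# For scale: astronomy archives run on roughly 1-10% of a telescope's operating
# budget, and NIST's public data repository runs on ~0.1% of its research budget.
```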
So, end of my evangelizing, as I said I would do. That's my story, and I hope that you as a community in chemistry will also help drive this message through: we need to do this, and it's shameful if we don't. Thank you.

The QR code should take you there. Oh, yeah. And if it doesn't work, just Google it. I think it's a terrific resource. Okay, great. Perfect. Yeah.

Okay, next up we have Professor Olaf Wiest, who is the Grace-Rupley Professor of Chemistry and Biochemistry at the University of Notre Dame, where he has spent the lion's share of his career after training at the University of Bonn and then at UCLA. He is also a Camille Dreyfus Teacher-Scholar, a fellow of the AAAS, and has held visiting appointments at UCSF, DFCI, and PKU Shenzhen, among others. In addition, he directs the NSF Center for Computer Assisted Synthesis (C-CAS) and was an associate editor for the Journal of Organic Chemistry until 2021. His research focuses on the elucidation of mechanisms using electronic structure theory, data chemistry, and machine learning, and on the development of synergistic predictive methods in synthetic organic chemistry that combine the Q2MM method he co-developed with experimental studies. The mechanistic insights and design principles derived from these studies are then applied to projects in computational biophysics and drug design for the treatment of infectious and rare diseases, specifically Niemann-Pick type C, where his work formed the basis of a clinical trial. And I just want to emphasize that he is a particular go-to if you need somebody to look at a study that combines computational and experimental chemistry; he just has a keen eye and a deeply analytical perspective on those questions. And that's what he's going to tell us about today in his intriguingly titled talk, "FAIRies, Ghosts, and Trolls: Data Challenges in the Age of Artificial Intelligence."

Thanks, Jake. We're moving on from, you know, people in the federal government to tenured faculty members, so it's going to get a little wilder; I can probably be even more opinionated, which I probably would be anyway. So this is a bit of a view from the trenches, if you will, as a researcher working in this area of computer-assisted synthesis, in a center where, when we started back in 2019, a lot of the stuff that we're currently confronted with didn't really exist in that form. And so that got kind of dropped into our lap, and we're trying to figure it out. So, let me see if this works. Again, the idea of FAIR data has been around in that form since about 2016; actually the original ideas go much further back, to about 2007 I think it was. And then two or three years ago AI happened, or so we are told. One of the issues that came to the forefront as we started up C-CAS, even though that wasn't really planned, is how do we deal with the data, and I do want to share a few things about how we addressed that. I don't claim to have all the answers, but I also think we have a lot of new tools that might help us with some of these problems. So I'm really looking forward, actually, to the next speaker, because when it comes to the sustainability issues, I think the PDB has taught us a lot about how to do this, and the key point in my opinion is value: if you can demonstrate value, people will figure out how to fund it. So I'm really looking forward to your presentation. In the context of AI, one of the things that has been mentioned a few times now is the question of data quality.
And I think that becomes a lot more important. At least in my definition, quality means trustworthy and transparent: where does the data come from, getting back to the original data, the curation of this data, putting it into context, taking out some of the noise and things that are frankly wrong. Getting a concept around how uncertain things are: a number or data point is not just a number, there is a range around it, and that matters. And then, how complete is it, what are we missing, how consistent is it internally? Because one of the beautiful things about looking at original data in particular is that if there's a problem and you start thinking in the context of consistency, you will figure it out. So how do we get this data? I think the concept that is important here, that we at least talk a lot about within C-CAS, is the question of pre-competitive data, in other words things that are of value to everybody but don't necessarily rise to questions of IP and such. In the context of value, we need to make sure that people see value in depositing the data and using the data, and our job is partially making that as easy and painless as possible. And that goes back to some of the things that were said just a second ago about how to automate that.

Getting the right data: this is my bit of evangelizing right now; coming from the University of Notre Dame, you do a bit of that. We haven't, I think, thought enough about what the right data is that we really should collect. Here's an example of that; in case you care, this is the Buchwald-Hartwig amination reaction. If the blue points from high-throughput experimentation are the data set you have, but you want to get the products that are in yellow, your predictive model is going to be awful, right? So the idea of getting the data set design right, something we also talked about at lunch, getting the right data with the right coverage, needs to be part of that conversation.

All right, so what are really the challenges that I see in this area? We've got the FAIRies, and one of the big challenges is the ghosts. What do I mean by this? There's a definition of ghosts from the wiki: an invisible presence, sometimes translucent, sometimes real. And that's the way I look at data: there is real, explicit data, but very often it's implicit, or even inferred. And then how do you deal with that? How do you draw out this data? This is true across all the major data sources that I think about: all the legacy data, the publications, all the databases that are built on these publications, all the electronic lab notebooks that are out there with the original data, and then, in the last decade or so, high-throughput experimentation data. All of those have inferred, implicit data in them; how do we make it explicit? Because that's what you need. One of the questions that came up earlier today was the question of negative data; this is an example of that. What is negative data? Your experiment didn't give you the product; well, what did happen? Every one of us who works in academia, the first thing we would ask a student who says a reaction didn't happen is: did you drop the flask, did you recover starting material, did something else happen? You have to infer this data, and you need to get it out.
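As a purely illustrative sketch of what making an implicit negative result explicit could look like, here is a hypothetical record; the field names are invented for this example and are not a C-CAS or Open Reaction Database schema:

```python
# Hypothetical illustration: an implicit "it didn't work" entry versus an
# explicit, machine-actionable negative result. Field names are invented.
implicit_entry = "Rxn 14: no product observed."   # a typical notebook one-liner

explicit_entry = {
    "reaction_id": "rxn-014",
    "outcome": "no_desired_product",
    "yield_percent": 0.0,
    "observations": ["starting material recovered (~90%)"],
    "suspected_cause": "catalyst decomposition",   # inferred by the chemist
    "conditions_complete": True,                   # reagents, solvent, T, t all recorded
}

# Only the second form is useful downstream: the zero yield, the recovered
# starting material, and the suspected cause are all explicit and queryable.
for key, value in explicit_entry.items():
    print(f"{key}: {value}")
```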
The second big challenge that I see is what I call trolls. Again, the wiki is helpful there: basically slow-witted, rarely helpful, even dangerous to human beings. That is about as good a definition as I can think of for an electronic lab notebook. A wonderful study from Nature Protocols, a year and a half or two years ago, looked at 172 electronic lab notebooks and basically pointed out all of these problems: a lot of them have been discontinued, they're proprietary, that was an issue, they're not interoperable, they go away. At the same time, this is really where the data is, in industry more than in academia, because in industry it's there and there's no discussion about it. But can we get the data out? That is going to be one of the big problems. And again, this data is inconsistent, incomplete, contradictory. It's a hot mess; I've worked enough in industry to actually see some of this data, and it's horrible.

This is where I want to talk a little bit about the work that we do at C-CAS. In case you're not familiar with the program, C-CAS is a Phase II NSF Center for Chemical Innovation, the flagship program of the chemistry division. We started out with a smaller Phase I back in 2019; since September 2022 we're in Phase II, which is a five-year, $20 million project, renewable once, so we're probably looking at about a 10-year time frame. And we have been very fortunate in getting quite a bit of buy-in from industry, because the idea is to really change the way chemistry is done: how do you discover chemistry, optimize it, figure things out? This whole idea of what we call data chemistry, I'm going to talk about that more in a minute. And this is something where, for industry, this is basically a done deal as far as they're concerned; this is the future. So we work with a lot of the companies here, we work with the community, on how we implement data chemistry. If you want to know more about this, there are websites and things like that.

So what do I mean by data chemistry? It's the idea that data is really the foundation of what we do: data streams, combined with the representation of molecules and the right algorithms, are really a new way of doing chemistry, but the data is the foundation. And that's why we care about it so much. Once you have this, then you can do things with the databases and molecular representations, and I'm not going to talk about the algorithms today, but there are new machine learning algorithms that you have to develop. Those will be the things that allow you to address the things that we care about as chemists, you know, optimization and synthesis and reaction prediction, in a novel and much more efficient way. We have put out a few of the results already: the Bayesian optimizer, retrosynthesis programs, the co-scientist. And that's one of the things that C-CAS has strongly committed to: everything goes out, everything is free. Nothing is proprietary; source code, everything goes out. One of these things that is pertinent to the discussion today is the Open Reaction Database, where the data for the things that we care about in organic synthesis goes out in a form that's freely available to everybody and easily readable, the idea again being that it feeds into machine learning, interacts with people, controls robots.
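To make the "freely available and easily readable" point concrete, here is a minimal sketch of how one might flatten open reaction records into a machine-learning-ready table; the file name and field names are hypothetical stand-ins, not the actual Open Reaction Database schema or client API:

```python
# Sketch only: flatten open reaction records into an ML-ready table.
# The file name and field names below are hypothetical, not the real ORD schema.
import json

with open("reactions_export.json") as fh:       # assumed local export of records
    records = json.load(fh)

rows = []
for rec in records:
    rows.append({
        "reaction_smiles": rec.get("reaction_smiles"),
        "catalyst":        rec.get("catalyst", "none"),
        "temperature_C":   rec.get("temperature_C"),
        "yield_percent":   rec.get("yield_percent", 0.0),  # keep zero-yield (negative) results
    })

# Failed reactions stay in the table: a predictive model needs to see where
# the chemistry does not work, not just the successes.
print(f"{len(rows)} reactions ready for featurization")
```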
And so, just as an example of what's out there right now, there are approximately three and a half million data points in this database right now. We also were fortunate enough to get some funding from Schmidt Futures on this; Connor Coley has been leading the effort on that one. And if you want to know more about it, here is the data set itself and, of course, the publication that goes with it. This is roughly what it looks like. A lot of thought went into the design of that, in terms of prioritizing the data itself over the format. Then there's the problem of making it easy to bring data in, and this is something that we're currently working on, automated curation; I'm going to show you something about that in a second. And it's easy to get data out: if you feel so inclined and you want to download the whole thing, be our guest. It's all out there.

So how do we get this automated curation? That is actually, I think as was said earlier, really a game changer, something that has changed in the last two years, and that is large language models. You can have either proprietary things like OpenAI or open-source things like Llama or Mistral, or whatever your favorite one is, where you can take information in from a variety of sources, like the publicly available USPTO data. In principle, if you feel so inclined, you can analyze things that are proprietary, like CAS or Reaxys, but maybe more importantly, all this other free-text information we talked about, the digitized information out there: electronic lab notebooks, publications, PhD theses. All of that can go into this. And this is actually something we're actively working on, where we've developed the tools and the prompts, using something in machine learning called in-context learning, so that we really can figure out this inferred and implicit data and make it explicit. We can cross-check the data: is the molarity that you give really consistent with the yield in grams, or whatever? It can actually figure this out, so you can do all this cross-referencing that I'm showing up there. And it turns out, if you do the statistics, if you use something like GPT-4, it's 100% right based on what we could find; there are cases where it says, okay, I can't do it, about 3.7% of the cases, but it is pretty reliable. And then you can put this into these kinds of open reaction databases, UDM is a standard that is used in Europe, and combine it, as I mentioned earlier, with feature databases. And with that, I think this is really where the future is going, where some of these problems that we're dealing with here can be addressed with modern technology. Thank you very much.

Okay, you got a preview in that talk of what's coming up. And I probably don't have to tell this audience that when you plan a workshop like this, the Protein Data Bank is the absolute gold standard for the value-add that a well-curated repository brings. So we're very pleased to welcome Dr. Stephen Burley, who's going to talk to us about the Protein Data Bank and all of the value derived from it, and how that story might inspire other data repositories that are coming up in the community. He's an expert in data science and bioinformatics, structural biology, and structure-guided drug discovery for oncology.
He's the director of the RCSB Protein Data Bank, and within Rutgers, The State University of New Jersey, he serves as University Professor and Henry Rutgers Chair, founding director of the Institute for Quantitative Biomedicine, and cancer pharmacology research program co-leader within the Rutgers Cancer Institute of New Jersey. Professor Burley's previous roles include Distinguished Lilly Research Scholar at Eli Lilly and Company; chief scientific officer and senior vice president for research at SGX Pharmaceuticals; Richard M. and Isabelle P. Furlaud Professor at The Rockefeller University; and Howard Hughes Medical Institute investigator. His degrees include an MD from Harvard and a doctor of philosophy from Oxford, as well as a bachelor of science in physics and an honorary doctorate from Western University. He's published extensively in data science and bioinformatics, AI and machine learning, structural biology, and clinical oncology. Please join me in welcoming him.

Thank you. That's an introduction only a mother would believe. It's an honor for me to be here; I'm very grateful for the opportunity to be representing the RCSB Protein Data Bank here today, and my colleagues who are part of the Worldwide Protein Data Bank partnership. I'm going to talk to you about the value proposition of getting it right. Fifty-three years ago, as protein crystallography was being established as a field, the community, a minority at the time to be sure, got together and made a compelling case, initially to the Department of Energy, that the results of protein crystallography experiments should be preserved. There was a lot of opposition in the community initially; people were unwilling to share. But here we are 53 years later, and everybody has bought in. So the title of my talk is the Protein Data Bank: from two epidemics to the global pandemic, to mRNA vaccines and Paxlovid. My plan is to tell you about the critical role that structural biologists and the Protein Data Bank, the PDB, together have played in fighting the COVID-19 pandemic. The punchline of my talk was put very succinctly by Dr. Anthony Fauci, who requires no introduction in this audience. In the New York Times Magazine in 2023 he cited the importance of 3D biostructure information and said: show me a person who's vaccinated, got infected, took Paxlovid, and died; I can't find anybody. I'll put what Fauci said into context and explain how the global efforts of the structural biology community and the PDB made all this possible.

As I said, the organization was established as the first open access digital data resource in all of biology and a vanguard in the open access data movement. It was established in 1971 with just seven x-ray crystal structures of proteins, and it's been continuously funded by the United States government ever since. In biology, function follows form. This variation on a famous phrase coined by the architect Louis Sullivan succinctly explains that the function of a biological macromolecule is determined by its form: its shape, its 3D structure. Since its inception, the PDB has grown more than 30,000-fold to become the single global resource for experimentally determined, atomic-level 3D biostructure information. At present we provide FAIR- and FACT-compliant access to more than 215,000 structures of proteins, nucleic acids, viruses, and macromolecular machines.
Reflecting the global importance of the PDB, it's been managed jointly since 2003 by the Worldwide Protein Data Bank partnership, which includes regional data centers in the US, Europe, Japan, and the People's Republic of China, plus two specialist data repositories for electron microscopy and nuclear magnetic resonance spectroscopy. PDB data are essential for responding to emerging viruses. Since the early 2000s the world has faced down three major outbreaks of coronavirus infections that jumped the species barrier to humans. First detected in 2002, the SARS coronavirus infected about 8,500 individuals worldwide, killing just over 800 people between 2002 and 2004. The second coronavirus epidemic, caused by the Middle East respiratory syndrome or MERS coronavirus, struck in 2012. To date it's killed more than 900 individuals and remains a public health threat in the Middle East and parts of Asia, where it's endemic in camel populations. Despite these two warning signs, very few countries, including the United States, prepared for the possibility of a much more serious coronavirus epidemic. Fortunately for us and the global community, structural biologists and the PDB actually laid the groundwork for a successful response to the inevitable third coronavirus wave. Effective mRNA vaccines were designed and antiviral agents were discovered with the benefit of open access to PDB structures of SARS-CoV, MERS-CoV, and SARS-CoV-2 proteins. The first PDB structure of a SARS-CoV-2 spike protein, appropriately carrying the persistent identifier 6VSB (six Victor Sierra Bravo), was released in February 2020, less than a month after the viral genome sequence became available. Today there are nearly 1,600 SARS-CoV-2 spike-protein-related structures in the PDB. Like the earlier coronavirus spike protein structures that are archived in the PDB, they provided important insights into receptor binding, fusion of the viral lipid bilayer with the plasma membrane, and vaccine and antibody design. Specifically, the very first mRNA vaccine designs, generated in January 2020, were actually based on PDB structures of the SARS-CoV and MERS-CoV spike proteins, which are very similar in amino acid sequence and 3D structure to their SARS-CoV-2 counterpart. When designing Moderna's inaugural mRNA-1273 vaccine encoding the SARS-CoV-2 spike protein, the company relied on PDB structures of double-proline mutant forms of the SARS-CoV and MERS-CoV spike proteins that stabilized the spike in a highly immunogenic pre-fusion conformation. I believe the vaccine designers at BioNTech-Pfizer used the same information, because their inaugural mRNA vaccine encoding the same spike protein included the same mutations. The rest of the story is well known to you. Two essentially identical, highly effective mRNA vaccines against SARS-CoV-2, against COVID-19, and their successors have been administered now to more than 5.5 billion individuals worldwide with impressive results. Designed with the benefit of open access to PDB structures, they're credited with saving tens of millions of lives worldwide and preventing severe forms of infection in hundreds of millions, if not more than a billion, individuals around the world. You've got to love the PDB. The organization of the SARS-CoV-2 genome is depicted schematically in this slide.
All coronavirus genomes are very long, single-stranded, positive-sense, 5'-capped, 3'-polyadenylated messenger RNAs that are ready for translation by the host cell ribosome. Most of the non-structural proteins of the virus are expressed within a pair of related polyproteins. The individual non-structural proteins are then excised from the polyprotein by two SARS-CoV-2 proteases that are themselves part of the polyprotein. The papain-like protease cuts at three sites, here denoted with dark blue down arrows, and the even more important main protease, the focus of the rest of my talk, cuts at 10 sites, denoted with light blue inverted triangles.

So let me turn now to structure-based drug discovery. The PDB currently houses more than 750 crystal structures of the SARS-CoV-2 main protease, or Mpro, the first of which came out of Shanghai: PDB ID 6LU7 (six Lima Uniform seven). This protein is the Achilles' heel of the virus: without the main protease, the polyprotein cannot be cleaved into its constituent non-structural proteins, thereby stopping an infection in its tracks. It's the target of Pfizer's highly effective drug known as Paxlovid, which is a fixed-dose combination of nirmatrelvir, the active ingredient, and ritonavir. The nirmatrelvir origin story actually begins back in the early 2000s with the discovery of PF-00835231, shown here in the right panel inhibiting the SARS-CoV main protease, in PDB ID 6XHM (six X-ray Hotel Mike). This drug was intended by Pfizer for use as an injectable, in-hospital antiviral agent for acutely ill patients. It was the product of a structure-based drug discovery campaign at the company that began in 2003 and was facilitated by open access to my structure of the SARS-CoV main protease, PDB ID 1Q2W (one Quebec Two Whiskey). This very first structure of the main protease was determined at SGX Pharmaceuticals, where I led R&D as the chief scientific officer. Altruistically, my company deposited the structure to the PDB so that work on countermeasures could begin without delay by any company that wished to do so. By the mid-2000s SARS-CoV had disappeared from the scene, the virus had simply vanished, and Pfizer halted the project. In early 2020 Pfizer reactivated the project with the goal of discovering and developing a SARS-CoV-2 antiviral for outpatient use, starting with PDB ID 6LU7, the first structure of the SARS-CoV-2 main protease that I talked about earlier. The result at Pfizer was nirmatrelvir, which is shown here in the right panel inhibiting the SARS-CoV-2 main protease, in another PDB structure, ID 7RFW (seven Romeo Foxtrot Whiskey). Both nirmatrelvir and its predecessor PF-00835231, which I showed on the previous slide, are covalently acting inhibitors of the protease that bind to the active-site cysteine residue that's visible within the red circle. When the enzyme encounters the drug, it catalyzes a single-turnover reaction that forms that covalent bond and irreversibly inactivates the enzyme. And that prevents the polyprotein from being processed and stops an infection in its tracks. Nirmatrelvir is structurally similar to the first-generation Pfizer compound, with key chemical differences that allow it to be administered as a pill as opposed to an injectable. There is one major drawback to nirmatrelvir: it's very rapidly metabolized by one of the cytochrome P450s. So it's coadministered with a drug called ritonavir, which is a potent inhibitor of cytochrome P450 3A4.
And that coadministration prolongs the half-life of nirmatrelvir and prevents its degradation when the drug passes through the liver. So Paxlovid, a fixed-dose combination of the two drugs, received emergency use authorization from the US FDA in December 2021, less than two years after public release of the viral genome sequence. The speed with which Pfizer was able to move was unprecedented; it often takes a decade or more, as many of you know, to go from identifying a drug target to approval of a new drug. At the end of my talk, I'm going to come back to the fact that Pfizer reported that nirmatrelvir is also active against the main proteases of both SARS-CoV and MERS-CoV. And this important finding will unfortunately lead me to a sobering postscript.

Working with Rutgers Institute for Quantitative Biomedicine colleagues, including Sagar Khare, we've been collaborating on a study of the diversity of both Mpro active-site structures and the polyprotein cut sites of all coronaviruses. There's very high conservation of the amino acids that line the active site across all of the known coronavirus main proteases. Each column in the right panel shows the frequency of each of the 20 possible amino acids at each of the positions of residues that line the active site; dark blue identifies amino acids that do not occur at that position. This tells us, when we look at the diagram, that approximately 12 of the amino acid residues that line the active site of the enzyme don't change throughout the evolution of all known coronaviruses. At position number 27, for example, the leucine residue occurs 99% of the time. There's similarly high conservation of the main protease polyprotein cut-site sequences across all known coronaviruses. Each column in the right panel shows the frequency of each of the 20 possible amino acids occurring at each position, P5 through P1 prime, in the polyprotein cut sites across all of the known coronaviruses. Again, dark blue identifies amino acids that don't occur at that position. You can see that at P2, leucine and valine together account for 94% of all amino acid occurrences. At P1, glutamine accounts for 99% of the occurrences. And at P1 prime, it's alanine, asparagine, and serine that together account for 96% of the amino acid occurrences. So what these data are telling us is that there is in fact double evolutionary selection pressure at work, both on the 3D structure of the viral main protease active site and on the various cut sites for the main protease across the polyprotein. During evolution, the catalytic activity of the enzyme has got to be preserved and all of the cut-site sequences have got to be preserved. So you've got conservation of both the enzyme active-site structure and conservation of the substrates across all of the known coronaviruses. Going beyond Pfizer's work, we're currently testing in silico predictions that nirmatrelvir is actually going to be a very broad-spectrum inhibitor of coronavirus main proteases. So this is what leads me to the sobering postscript. Understanding the evolution of coronavirus main protease active sites in 3D and their polyprotein cut-site sequences tells us that the free market for emergency pharmaceuticals actually failed us. Failure of free markets is defined as inefficient distribution of goods and services within the economy.
These inefficient distributions occur when the incentives for rational economic behavior are good for the individual but not good for society. Simply put, capitalism is not always the answer to get the right outcome when it comes to the public good. So for better pandemic preparedness benefiting the entire world, Pfizer could have been incentivized with public money to continue working on structure-based drug discovery of a broad-spectrum coronavirus Mpro inhibitor. With the benefit of hindsight, we now know that an expenditure of approximately $250 million would have saved millions of lives worldwide while we were waiting for the Moderna and BioNTech vaccines to come online, and prevented economic losses to the tune of trillions of dollars. When the MERS-CoV epidemic struck, I argue that there should have been a scientifically informed discussion regarding the very real technical possibility, and the desirability, of public investment in pharmacologic countermeasures targeting new coronaviruses crossing the species barrier to humans. Experts knew after MERS that it was simply a matter of when, not if, there would be another jump to humans, with the possibility of a global pandemic. We were very lucky the first two times; we were extremely unlucky the third time. Fortunately for us, though, Pfizer made a very substantial investment back in the early 2000s during the SARS-CoV epidemic, which set the stage for very rapid discovery, development, and emergency use authorization of Paxlovid. Similarly, the NIH made a huge investment in the development of mRNA vaccine technology. The recent successes that we've seen with SARS-CoV-2 have revolutionized how vaccines are designed, manufactured, tested in the clinic, and undergo regulatory approval. To recap, I've explained how open access to research data generated with both public and private funds, particularly 3D structures of coronavirus proteins archived in the PDB, enabled basic and applied researchers to make the difference during the pandemic that the world desperately needed them to make. To quote Dr. Fauci once again: show me a person who's vaccinated, got infected, took Paxlovid, and died; I can't find anybody. You've got to love the PDB. So, thank you. I'm very grateful to the NSF, NIH, and DOE, which collectively fund the RCSB PDB; to our hosts at Rutgers, UCSD, and UCSF; to all of our partners in the Worldwide Protein Data Bank; and, shown on this slide, to the very large team of structural biologists, data scientists, software developers, and IT professionals who, working together with me, deliver RCSB PDB data and services to many millions of users around the world at no charge and with no limitations on data usage. Thank you very much.

All right, thank you so much, really inspiring talk. We're going to shift gears now from the interface of chemistry and biology to the interface of chemistry and materials science, and we're going to hear about a younger repository that's nonetheless off to a smashing start. Alejandro Strachan is the Reilly Professor of Materials Engineering at Purdue University and the co-director of NSF's nanoHUB; that's lowercase nano, capital HUB. Before joining Purdue, he was a staff member in the theoretical division of Los Alamos National Laboratory and worked as a postdoc and scientist at Caltech. He received a PhD in physics from the University of Buenos Aires in Argentina.
His research focuses on the development of predictive atomistic and multiscale models to describe materials from first principles and their combination with data science to address problems of technological or scientific importance. Areas of interest include high-energy-density and active materials, metallic alloys for high-temperature applications, materials and devices for nanoelectronics and energy, as well as polymers and their composites. In addition, his scholarly work includes cyberinfrastructure to make simulations, models, and data widely accessible and useful for research and education. He's been recognized by several awards, including the Early Career Faculty Fellow Award from TMS in 2009, an R&D 100 Award in the category of software and services for nanoHUB, and the Reilly chair professorship in 2023. Please join me in welcoming Ale.

Thank you. So I'd like to tell you about this infrastructure, nanoHUB, for open simulation and data. What we seek to do is connect research-grade software and data infrastructure with their end users, who are domain experts but not computational experts; they could be students, instructors, experimentalists. If we do that, we allow software developers to publish their codes in ways that are accessible. We turn research codes into apps that you can run in your web browser, and into fully self-contained, end-to-end workflows for data that are presented to end users. All of these products are actual publications: they have DOIs, they're indexed by Web of Science, and, as I will discuss, they're FAIR. What our team at nanoHUB does is develop the infrastructure that makes this possible. And as you can guess, with all of this simulation infrastructure and these tools, we have quite a bit of educational content.

So let me show you an example of what I mean by making tools available. If you want to run a molecular dynamics simulation, there are open-source codes; we use one called LAMMPS from Sandia National Labs. You have to get access to computational resources, you have to install LAMMPS, and then you have to learn to speak LAMMPS. This is a relatively simple LAMMPS script; it's an ad hoc language, which has nothing to do with knowing about molecular dynamics, it's just how LAMMPS wants you to talk to it. And we take months training students to do that. So on nanoHUB, we can wrap this tool in an app that's delivered in your web browser, and with a few clicks we have undergrad students running molecular dynamics and learning about materials science without worrying about the computational intricacies of running the simulation and providing hardware. This is one example out of many that I'll discuss. We also have end-to-end data science workflows where all you need is an internet connection and a standard web browser; everything is containerized, everything will run the same for everyone. And nanoHUB is truly a community platform. We have about 800 published apps and tools, 170 courses, and thousands of educational resources, and these are contributed by 1,800 or so folks from all over the world. About 250,000 individuals have run simulations on nanoHUB from all over the world. We run about a million simulations every year for our users, we have 1.5 million visitors every year to the site, and nanoHUB has gathered about 2,500 citations in the open literature. So how can a simulation infrastructure help with FAIR data and address the challenges that we're discussing in this meeting?
I know this is preaching to the choir; we've discussed this before, and we all know we have a data problem, otherwise we wouldn't be here. But we generate data in our field, in materials science and chemistry, at quite a high cost, and the majority of the data ends up languishing on local computational resources. We talked about this: the data we share ends up being biased. I have a colleague from the aerospace industry who told me, look, Ale, you publish on this side of the distribution; we design planes on the opposite side of the distribution of properties. And we just don't share, except for selected results. When I was walking into the building today, I saw this quote from Einstein that I think is appropriate: we have the responsibility as scientists not to conceal part of the information, and biased publishing can get very close to concealing in some cases. And then, we talked about this also, when we publish these results, they are not what I would call machine learning ready, meaning they're not accessible by machines, the metadata is not there, we don't really have traceability, it's not actionable.

One thing that I think is important to understand is that this doesn't apply just to publications or data. It applies, and I think particularly importantly, to workflows. It's not just the data but the workflows that generate the data: whether you run experiments and have raw data and analyze them in certain ways to produce your outcomes, or you have a simulation workflow that invariably involves multiple steps, preprocessing, running your actual simulation maybe on high-performance computing, and post-processing those results before you publish, all of that needs to be made FAIR and reproducible. So what we do in nanoHUB is ask the tool developers, the workflow developers, to write this in Jupyter and formally declare the inputs and outputs of that Jupyter workflow. And when you do that, we do automatic unit conversion, we were talking about this earlier, and the simulation software and libraries check the consistency of those inputs. Then we take this workflow and we publish it on nanoHUB, so we're a publisher of FAIR workflows. Each one of these simulation tools, hundreds of them, has a DOI. The inputs and outputs of the workflows are queryable through an API, so you can ask, hey, do you have a workflow that would do this, or what type of inputs do you need to run workflow X? Equally importantly, the workflows are containerized. So they're not just FAIR, they're reproducible: these workflows will run on nanoHUB the exact same way today, a year from today, for as long as we can run these containers. We have tools on nanoHUB that are 20 years old and they still run the same way they used to run. In my opinion, making code available on Git doesn't guarantee that the person who downloads the code will get the same result that you obtained when you ran your code originally on your hardware with your libraries. All of this is containerized on nanoHUB, so it works the same way for everyone. And then these workflows, I like to think of them as a quantum of compute, can be accessed and launched in multiple ways. They can be accessed via apps for undergrad students where you click a couple of buttons.
You press simulate and you run the workflow, or, like in this example, they are launched from a machine learning workflow, an optimization algorithm that seeks to optimize a property. The workflow that I was discussing a minute ago uses molecular dynamics, this code LAMMPS, to calculate the melting temperature of an alloy. It was developed by PhD-level students: multiple steps, is the simulation converged, make a decision about the melting temperature, or is it a negative result? All of that is done at the PhD level. An undergrad student in my lab wanted to learn about machine learning, so he consumed the simulations as a service, and he worked on the machine learning loop that called for the simulations and got the results, and he ended up writing a paper, consuming these quanta of compute that we call Sim2Ls. Importantly, every time anyone around the world runs one of these Sim2Ls, we store the inputs and the outputs, because you had to formally declare the inputs and the outputs, so we know what they are. They're indexed in a database that's globally accessible; we call it the results database. So by the time we published the paper, the data was already FAIR. You can explore the data; you can see here melting temperatures as a function of composition. If you look at the top right, you see every single simulation we performed, including all the simulations that did not converge. So you can learn what types of input parameters lead to convergence and which lead to a result from which you cannot make a decision; negative results are automatically stored. This really accelerates innovation, and I think there's a strong driving force toward sharing these types of workflows.

I understand I'm a little short on time, but these types of tools can be used by experimentalists too. We have tools that ingest raw experimental data; we spent four months with undergrad students collecting oxidation data from the literature. Then the simulation tool takes these raw data from experiments and automatically analyzes them against 42 possible models for oxidation. It does a statistically rigorous calibration and model selection, so it tells you what the mechanism is, what the models are that describe your data, and it ranks all these models (a minimal sketch of this kind of fit-and-rank analysis appears after this passage). And then we compare that with what the original author had said. Guess what: how many of you think that authors, when they collect experimental data, fit 42 models and do model selection? None of them. So for about 30% of the published data, the model selected was not in the top five of our analysis. We have inconsistent data in experiments, and that's a problem when you want to do machine learning.

We use these for education quite a bit. I'm not going to discuss the details, but in undergrad core classes at Purdue, we use these data infrastructures to teach fundamental materials science. I think this is particularly important because of two things. First of all, workforce development: the future engineers need to be familiar with what we're discussing here in this meeting. They need to be users; they need to be used to making data born ML-ready and AI-ready. But at the same time, we were talking about our responsibility to society and educating citizens who can actually consume data and make their own decisions about the data that's available. Well, they need to have access to these modern tools and know how to use them, and I think we have to do that at the undergraduate level.
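Returning to the oxidation analysis mentioned above, here is a minimal sketch of that kind of calibrate-and-rank model selection; it uses three classic oxidation rate laws and made-up data, not the actual nanoHUB tool or its 42 models:

```python
# Sketch: fit a few candidate oxidation-kinetics models to mass-gain data and
# rank them by AIC. Illustrative only, with invented data.
import numpy as np
from inspect import signature
from scipy.optimize import curve_fit

t = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])   # time (h), made-up data
w = np.array([0.9, 1.3, 1.9, 2.6, 3.8, 5.2])      # mass gain (mg/cm^2), made-up data

candidates = {
    "linear":      lambda t, k: k * t,
    "parabolic":   lambda t, k: np.sqrt(k * t),
    "logarithmic": lambda t, k, a: k * np.log(a * t + 1.0),
}

def aic(residuals, n_params):
    """Akaike information criterion from a least-squares fit."""
    n = len(residuals)
    rss = float(np.sum(residuals ** 2))
    return n * np.log(rss / n) + 2 * n_params

ranking = []
for name, model in candidates.items():
    n_params = len(signature(model).parameters) - 1        # subtract the t argument
    popt, _ = curve_fit(model, t, w, p0=np.ones(n_params),
                        bounds=(1e-9, np.inf))              # keep rate constants positive
    ranking.append((aic(w - model(t, *popt), n_params), name, popt))

for score, name, params in sorted(ranking, key=lambda r: r[0]):
    print(f"{name:12s} AIC = {score:7.2f}  params = {np.round(params, 3)}")
```

The lowest AIC wins; in the real tool the same idea is applied across many more candidate mechanisms, which is what makes the comparison with the original authors' chosen model possible.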
And this is being done today. The last thing I want to discuss is making data findable. The way we geeks talk about it, the metadata and doing a query and writing a Python query, is one thing; I think we have to be smarter. Globally we publish about the same number of papers as there are new songs coming out every year. I bet I can ask any middle schooler in the country to find a song, and they will find it better and faster than my PhD students can find the paper. And so we are actually using LLMs to train private models that we're using as a chatbot over private data from nanoHUB, so the model can recommend what you should do, how to consume the data, what data is available, and make personalized recommendations based on what you've done: you're a materials scientist, you've used this tool and that tool, here are other things you can do next. So I think, when we talk about making things findable, making things accessible, we have to think a little bit outside the box and learn from other fields in society.

Finally, I want to tell you a little bit about the impact. This is 250,000 simulation users served; nanoHUB is used in 77% of all minority-serving institutions in the US. We've done outreach efforts, we lower the barriers, we make things easy and truly usable. What you see on the right are institutions ranked in terms of the number of simulation users on nanoHUB. Number one is Purdue; that's not surprising, since we're headquartered at Purdue. And then you see Arizona State, UIC, Northwestern, MIT, Harvard, Stanford, top R1 universities. But within this list we have Florida Agricultural and Mechanical University; Ball State, an R2 university in Muncie, Indiana; the University of Texas at El Paso; and Chicago State University, an HBCU. And we're very proud of this list. Alongside the top engineering schools in the country, we have the ability to reach a really broad population. And I'll stop with this. We have a grassroots organization in the materials community called MaRDA, the Materials Research Data Alliance, and coincidentally, yesterday, today, and tomorrow is our annual meeting. We're pushing for the same types of goals, and I would encourage you to visit our website and check out our meeting. We have one of the FAIROS NSF awards to bring FAIR data to the materials science community, and this is a short paper that we put together to push our community towards FAIR data, with baby steps that materials researchers can take and also collective actions, and it would be good to share notes and collaborate here. I have some thoughts here, but maybe we can leave that for the panel. Thank you very much.

I'm going to kick things off with a question I didn't anticipate asking, but, you know, I just want to show everybody that I was a good moderator, because I listened to all the talks, and after I heard them all it brought to mind a somewhat different question than I originally intended. So here's my question to all four of you: it's very clear to me from listening to all these talks that all of you are very much, frankly, data scientists, and when I was in graduate school data science wasn't really a big part of the curriculum. So what I'm wondering is if each of you could talk a little bit about your journey to becoming a data scientist in the context of what else you had done.
And then, what do you think we can do to improve data science education and, more generally, motivate departments and students to get more serious about teaching people in the various scientific disciplines to be data scientists in addition?

Yeah, there's a picture. So I can start. I'm not a data scientist, but I can play one on TV. It seems to me that in engineering, or in chemistry, we don't necessarily want to turn our students into data scientists, because if we do, first of all, our companies and national labs are not going to be able to hire them, because they won't be able to pay the salaries data scientists command. I think we need to teach them and educate them to be expert users of modern tools. I see these tools as what Excel or a word processor was 30 years ago: you have to understand how they work. You can drive a car and be a competent driver without knowing how an internal combustion engine works, and when you switch to an electric vehicle, you don't have to relearn how to drive. So to me, what we need to do is teach them the fundamentals so they understand the limits of what they're doing, and teach them to be competent users of that technology.

I think I would disagree. Jake, you were a data scientist, because, well, what is a data scientist? A data scientist is a scientist who works with data. You didn't call yourself that, but you clearly were doing it. Now, I do think there is an educational effort needed here, and C-CAS and some of the things we're involved with are actually doing this. But, very similar to what was just said, it's not that you're a data scientist; you pick these things up along the way just as you would a new NMR software package or something, because it's not that hard. And in the context of, say, a standard computational chemistry course, you can easily add modules that would give the students the kind of knowledge I was talking about.

I guess I'm more of a data evangelist, as has become clear today. How did I get there, though? I started as a radio astronomer, and we used to publish our data as contour plots. I remember as a graduate student wondering: do I draw this contour at two sigma, or 2.1 sigma, or 1.9 sigma? Because where I drew that contour would show or not show something that I believed was in the data. I thought, this isn't very honest; I should share the data and let people draw their own contours, or whatever, and decide whether my interpretation is believable to them. So it was that concern about transparency that drove me from doing my own research to working at the boundary between research and the technology that enables better research through better data management. That said, I think the skills associated with data science, understanding statistics, understanding uncertainty characterization, are incredibly important, and if we're not teaching those skills along with the skills of the basic science, we're not doing our duty as educators.

I'd like to start by thanking Jake for calling me a data scientist, because I have been struggling for 10 years to reinvent myself as a data scientist.
I am on my third career: the first was traditional academic structural biology research, the second was R&D in the pharmaceutical industry, and now, thanks to you, I'm a data scientist running the RCSB Protein Data Bank. We are tackling exactly these issues at Rutgers right now with the genesis of a big data science initiative, and the question that keeps coming up is: what should we be teaching the biology grad students to make sure they're actually going to be competitive in the future, knowing that biology is increasingly a data game? Things like Python programming are probably very important to expose the molecular biology and cell biology grad students to, because they've probably never done it before. They weren't the nerdy kids who taught themselves Python in high school; they were probably not very quantitative when they got started. The other thing that's going to be part of the mandatory curriculum, just as you said, is statistics: they've got to understand uncertainties in data, error bars, and so on. We don't expect these individuals to be developing new AI methods or new machine learning tools, but we know they're going to be using them, and they need to be using them with eyes open. We need to prepare them for that brave new world where many of them won't do experiments; many of them will be looking at data sets, whether in the Protein Data Bank or some genome sequence data bank. I think it's incredibly exciting for the future of biology, but daunting in terms of what we need to do to prepare the students, recognizing that most of them came to biology having fled calculus. It's almost a given that the individuals, the women and the men, who choose to do PhDs in molecular and cell biology didn't do well in math in the early parts of college, and they probably didn't do well in organic chemistry, which explains why they didn't go to med school. These are the people we've got to reach, because, like it or not, they're the ones who are going to do the research.

All right, let's jump to an online question. I think this may be a quick one; it's for Bob. It says: OSTP pegged the cost of federal open access publishing in 2023 at less than 400 million dollars a year in their November 2023 report to Congress, which was at least partially authored by Dr. Zaring Halam. Can you square that with your estimate that the federal government is covering ten times that amount?

No, I just did a seat-of-the-pants estimate, as we do in astronomy, and I don't vouch for that factor of ten with any serious analysis; I just looked at how much money goes into the major publishers and did a simple division. That 400-million-dollar number does seem a bit low to me, and I don't know how the math was done there, but a lot of money goes into publishing, and you wouldn't see certain publishers with 30 and 40% profit margins if there were not a lot of money flowing into that system. I think the more important part of the point was that, for a comparable or lower amount, we could sustain the repositories.

I want to jump to that next. When I talked to Stephen beforehand,
he mentioned to me that his talk would focus on the community benefits, and that during the panel he would hold forth on how the PDB has managed to sustain itself. So I'm going to invite you to do that, and then we'll go down the row and talk a little bit about how other people are sustaining the repositories they discussed.

What I said at the outset, that the PDB was the first open access digital data resource in all of biology, is not an exaggeration. It's also not an exaggeration to say that we've been living at least the spirit of the FAIR and FACT principles, not necessarily the letter but the spirit, for the last 50-plus years. The question of sustainability continues to be a challenging one, even for an institution like the RCSB Protein Data Bank, an organization funded to the tune of 10 million dollars a year. That's my current funding from NIH, NSF, and DOE combined: 50% from NSF, 40% from NIH, and 10% from DOE. What transpired this cycle for the funding renewal, and we're just in the final stages of that, was a huge surprise to me. In 2018, when I went through the previous renewal, I was put through the wringer to prove value. I was told in no uncertain terms: we do not want to see letters from Nobel Prize winners telling us that the PDB is important; we want you to show us the data. So we published a series of peer-reviewed papers that documented the economic impact, the impact on the literature, and the impact on the patent literature of PDB data, based on hard numbers. That got me a modest increase in the funding, from about six to seven million a year. What got me from seven million a year to ten million a year was the pandemic. They finally got it, because they saw a real-world example of how structural biologists and the PDB played a critical role in the pandemic. I think, as was said in one of the presentations, it's ultimately about the value that you can document the resource is providing. Could I use more money? Of course I could. I wish I had lobbied harder and gotten 11 million a year instead of 10, but I am deeply grateful for, and humbled by, the responsibility I've had for the last 10 years.

I should say, for those online as well as in the room, that we are in the process at Rutgers of identifying and recruiting a successor to run the Protein Data Bank here at the US Worldwide Protein Data Bank data center, the RCSB PDB. There's an ad in Science magazine trying to recruit my successor; if anybody is interested in the job and feels they're qualified, please apply. I need to get this monkey off my back. I want to go back to doing full-time research, and I look forward to ushering in a successor who is going to take the PDB to even greater heights. The focus is on somebody who combines strengths in both structural biology and data science, because I'm convinced that the future of the PDB rests very substantially on the use of data and on ensuring that the data stored in the PDB are interoperable with all sorts of different types of biodata, because we've got a huge ecosystem that we have to interoperate with.

It looks like there's a quick follow-up, which is: when companies make products or money using information from the PDB, do they pay royalties to the PDB and to the researchers who generated the data that they use?
Whether or not they pay royalties to the people who generated the data depends on what patent or other intellectual property protection the university might have sought on behalf of those investigators. All of the data in the PDB are distributed under the most permissive Creative Commons license, CC0, and that means anybody in the world can download the data at no charge, with no limitations on usage. It's my understanding, based on a very detailed analysis of drug approvals by the US FDA, that every one of the anti-cancer drug approvals in the last 10 years or so has used PDB data at some point to facilitate the discovery and development of the new anti-cancer agent. And if you look across all therapeutic areas, again over the last decade or so, the US FDA drug approvals for all these different types of diseases were about 90% facilitated by PDB data.

Tell us about the sustainability model for nanoHUB.

Yeah, so nanoHUB has been supported for two decades by the National Science Foundation; it's been about a 65-million-dollar investment by NSF. If you think about the number of simulation users, that's about $250 per simulation user that NSF has paid. Yet we're in our last year of NSF support, so sustainability is a challenge. We're working with NSF to see if they can continue supporting nanoHUB in another form. We have lots of partners, so nanoHUB is obviously not going anywhere, but the funding model in this country is not conducive to supporting infrastructure that benefits our community, be it research or education. So we have partners now, workforce development partners, and as I said we're working with NSF. But something that needs to change in the landscape, something that can be done by the funding agencies and, to be quite frank, by our colleagues, is to resist the idea of reinventing the wheel every time someone writes a grant, and instead decide: I'm going to reuse what already exists and build my research on top of that. NSF supported lots of nano efforts, materials-discovery, MGI-type work, and they never suggested to the authors that they make their data and their tools available on nanoHUB, which is their own product. So these data mandates could have more teeth, and funding agencies and academic institutions need to resist the idea of everyone building their own thing. People think, well, nanoHUB is a website. We've invested $100 million, including Purdue's support, in that infrastructure; it's not just a website. And people think, well, I can create a hub on my own with my grad student. Guess what: that's not sustainable, that's not going to work. We need to sustain the efforts that are actually useful, by the value they provide.

Absolutely.

Well, the first thing I take from listening to the two of you is that I really need to ask for a raise. But no, I think the difference here is that we're at a very different part of the life cycle of the data infrastructure. The PDB has been around for 50 years, your system for 20, and we've been at it about a year and a half, so it's a very different stage. Having said that, as I mentioned already in the presentation, I think the key is demonstrating value and documenting value. If we can do that, then I'm quite optimistic about getting this funded in a sustainable way. We'll still have to ask, you know, every
five years for renewal, and I'm fine with that, because every now and then you need to justify what you're doing. But as long as we can do this, I'm pretty optimistic about this resource being sustainable. And if not, then it probably should go away; in that case you make sure that what is there stays there, and you move on.

All right, are there folks in the room who have questions? Yeah, please.

Luis Sanchez, associate professor at Niagara University. I have a question about the Open Reaction Database. I've been using SciFinder for about 20 years or so, and I was thinking about the FAIR data practices. I would say that in my field findability can be a little tricky: you need to use something like SciFinder to get information about organic reactions and all that. Today I just learned that the Open Reaction Database also gets data from SciFinder, and I just started thinking: SciFinder has maintained this database for a long time, so I wonder whether this is fair to SciFinder, or to the ACS, that suddenly they're going to lose subscribers because people can find the same information for free.

Sarah, I'm sure this question has appeared before, right? Do you want to weigh in on that before we go to the panel? Can we give her a mic?

Maybe to preempt some misunderstanding: we do not currently read in data from those sources. All I'm saying is that it is possible. The idea is that the infrastructure is there to take data from many different sources, some of which are proprietary and protected, like SciFinder or Reaxys, and others that are open, like USPTO, which we actually do use. But I think the real challenge, and also the mother lode in terms of data, is actually the theses and the electronic lab notebooks, if, and that's a big if, we can clean those up. There's a lot of lost information out there that we could then bring in, and there are possibilities for linking directly to resources like that, but no, we do not read in data from proprietary sources.

But let me ask you to comment on the nexus of FAIR repositories and SciFinder, and how you see those fitting together.

Yeah, isn't that an interesting question. I'm not here speaking on behalf of my colleagues at SciFinder; they run a somewhat separate business from ours, and they spend a lot of effort and resources to license in data to use for their products. As they consider what their future opportunities look like, they see lots of opportunities in data around the life sciences and lots of opportunities around data in materials science. So I think for SciFinder it's about thinking more broadly about data across the sciences, and perhaps about what kinds of cross-disciplinary things you can figure out when you can collect all of the data, as long as it is FAIR.

Right, let's go over to that side.

I'm going to ask a somewhat self-serving question here, and it regards data and metadata standards. Given that data and metadata standards are what enable interoperability and reuse, whose job is it to catalyze and sustain those standards? For example, is it the community? Is it the device manufacturers? Is it the repositories themselves? And in addition, are there good frameworks or exemplars of when this process has been done well?

You weren't listening: it's Bob's job. Okay.
Universally. Yeah, it's all on me. I'll start with this: there are good examples. Again, going back to my days in astronomy: in astronomy we developed an international data standard called FITS, which has a core metadata model that is very simple, and because it was simple it was adopted globally. It is the standard that NASA and all major astronomical facilities use for data storage and transport. That led to the Virtual Observatory, which is a suite of metadata standards for data discovery and data access. I think what it takes in other communities is a collective will. Coming to NIST as I did 10 years ago, I saw a lack of metadata standards in materials science for discovery, and so we constituted a working group under the auspices of the Research Data Alliance to develop a vocabulary and metadata dictionary for describing data holdings in materials science. The way we did it in both cases is you get knowledgeable people in a room, you put in a bottle of scotch, and you lock the door and don't let them come out until they've agreed. No, it takes time; it takes at least a couple of years, sometimes, to develop a community standard. But you have to do it with the community, so that you have community buy-in, community participation, and ownership. You can't unilaterally declare, here's the standard you're going to use, because people will say, well, I had nothing to do with that, you can't force that on me. So again, it's a social engineering challenge, but one that can succeed when you demonstrate the value and the interoperability you get as a result.

The story of the Protein Data Bank is very similar. Community engagement, initially in the small-molecule crystallographic community, led to the creation of the CIF, the Crystallographic Information File. That was then adapted by the Protein Data Bank, in consultation with the structural biology community, to create PDBx/mmCIF, the macromolecular CIF. Subsequently we worked with the protein structure prediction community, again a different working group, to create ModelCIF, which underpins the AlphaFold predictions of protein structure and the predicted structures stored in the Model Archive. And we just had accepted, it hasn't actually appeared in print yet, a paper describing the creation of the integrative/hybrid methods CIF, which is a sort of superset for structural biology involving multiple tools being brought to bear to determine the structures of very large, very complex systems that won't succumb to a single measurement technique. So the key has consistently been community engagement and community buy-in. The same thing took place with respect to the standards for validating the experimental information stored in the PDB, the Electron Microscopy Data Bank, and the Biological Magnetic Resonance Data Bank, and for validating the atomic coordinates stored in the Protein Data Bank against both known chemistry and the experimental data. None of this could have been done without the buy-in of the community and, in fact, the use of some of the community's own software tools that underpin the validation calculations.
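[Editor's illustration] To give a flavor of what a dictionary-driven data standard like CIF or PDBx/mmCIF buys you, here is a toy sketch, not the official wwPDB or PDBx/mmCIF tooling, that parses a few key-value items from an mmCIF-style block and checks them against a minimal stand-in for a metadata dictionary. The specific items and the parser are illustrative assumptions; real deposits are validated against the full community dictionaries and validation software described above.

```python
"""Toy, dictionary-driven metadata check in the spirit of CIF/PDBx-mmCIF.

Illustrative only: not the official wwPDB tooling, and it handles only
simple key-value items (no loop_ constructs).
"""

SAMPLE = """
data_demo
_entry.id                demo
_exptl.method            'X-RAY DIFFRACTION'
_refine.ls_d_res_high    1.90
"""

# Minimal stand-in for a metadata dictionary: required item -> expected type.
DICTIONARY = {
    "_entry.id": str,
    "_exptl.method": str,
    "_refine.ls_d_res_high": float,
}

def parse_items(text):
    """Parse simple `_category.item value` lines into a dict (toy parser)."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            key, _, value = line.partition(" ")
            items[key] = value.strip().strip("'\"")
    return items

def validate(items, dictionary):
    """Report items that are missing or fail a basic type check."""
    problems = []
    for key, expected in dictionary.items():
        if key not in items:
            problems.append(f"missing required item {key}")
        elif expected is float:
            try:
                float(items[key])
            except ValueError:
                problems.append(f"{key} should be numeric, got {items[key]!r}")
    return problems

if __name__ == "__main__":
    record = parse_items(SAMPLE)
    issues = validate(record, DICTIONARY)
    print("OK" if not issues else "\n".join(issues))
```

The point of the community process described in the discussion is exactly to agree on what goes in that dictionary, so that every repository and tool can run the same kind of check.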
I would add, briefly, that I don't think you want to let the excellent be the enemy of the good. Sometimes you can't even get a whole community to agree; you need a sub-community to agree on the metadata, especially in a field like materials. We call it the Materials Genome Initiative, but we don't have a genome in materials; every lab measures things differently. So it's hard, except within sub-communities, to agree on a general set of metadata to describe materials: you have soft materials and ceramics and metals and microstructure, and everyone measures microstructure differently. The example I was showing is a relatively simple one where, from very simple data, different labs measure completely different things. So, as long as we share the raw data, the analysis tools, and the decisions, the workflow we used to make decisions, then someone else can come along, decide that this is not the appropriate way of analyzing the data, and you can evolve your metadata with it.

And this is actually a very important point. When you do this, I think a decent amount of humility is called for, because the next great tool comes along that you hadn't thought of when you designed the standards. If you have the original data available, you can reanalyze it; that's what I meant earlier in my talk when I said data over format. Go to the original data, because you have no idea what's going to happen with it five years from now, nor does the rest of the community. Having that respect for the data, and for what might be hidden in there that you currently don't see, should inform these kinds of standards.

To add to that: we had to add a whole raft of new metadata items to the PDBx/mmCIF data standard to accommodate the data coming from the free-electron laser crystallography experiments, which are fundamentally different in the way they're done from traditional synchrotron crystallography experiments. So we had a working group, we brought in the right experts to help us, and we now have a set of data standards that fully supports archiving the extended metadata coming from the free-electron laser at Stanford and the one in Hamburg, the European XFEL, and so on.

Thanks, that's a really good question. Let me build on it a little bit, specifically going back to Alejandro and Olaf. How do you publicize to the community at large that your repository is open to them and provides them with a value add? We have this problem at the journal, because people will just send us scans in a PDF that are not FAIR, and when we say, can you do something more, the answer is, okay, fine, we'll just dump it all in Zenodo, which is not really tangibly different from putting it in the PDF. So there's a small but surmountable barrier to getting people to adopt better metadata standards and help facilitate reuse of their data, but it's still a barrier. What can you do, as repository directors, to help people be aware of how much value they can derive by surmounting that barrier?

Even though we are fairly early in the game, I think we've been very lucky in that we have a major bullhorn in some of the leaders in these communities who use machine learning methods for synthesis and can say, look,
this is what we're using, and that demonstrates value. On the other side, in industry, where quite a few of these methods are more applied, things do tend to be a bit more top down; in other words, some industry leaders simply decided, this is how we're going to do things, and that's the end of that. Having that level of a bullhorn has been very, very useful. And of course the write-ups in things like Chemical & Engineering News also help.

Yeah, it's similar for us. We organize workshops, we organize teach-the-teacher events where we bring in instructors and tell them: here's a lecture you can give, here's the homework assignment, and this is how you introduce these to your students. We do a lot of that. And it would be really interesting to partner with publishers, so that nanoHUB is one of the options offered to their authors for depositing their material.

There are carrots and there are sticks. In the case of the Protein Data Bank, the key to the success was the stick provided by the protein crystallography community. Twenty-plus years ago, thought leaders in the protein crystallography community came together and made a judgment that every publication should be accompanied by a PDB ID, and they made deposition mandatory. They went to the journals; Science was one of the early adopters, thank you. They also went to the funders and said: we advise you to stop funding people who will not put their data into the PDB. And that actually happened in some cases. So in a sense we have a captive audience with our depositors, structural biologists, but it's a captive audience we try to treat extremely well, because we want them to be happy. We want them to deposit complete, high-quality data that will then be promulgated to the rest of the community for their free use. But it did ultimately take the community coming together and saying: no more, we must all share our data. Various holdouts, including Nobel laureates, were forced to come in line. But it did happen.

Bob, anything to add? All right, Marty.

Okay, so: incredibly inspiring session, looking at how thoughtfully designed, democratized data sets can lead to real advances. Of course, we can look backwards at historical data, and we've now seen great examples of where that can be very empowering. To be provocative, looking forward: some of the most powerful discoveries we're seeing can be generated through iterative, on-demand generation of new small data sets through active learning, say under a Bayesian-optimization-based algorithm. If we can democratize the data graveyard, if you will, how do we democratize the closed-loop, forward-generating, automated discovery? Because the power of the latter approach may far outweigh what we're seeing in the looking-backwards approach.

I can chime in quickly. If the acquisition function that you're using is computational, I think we have lots of ways of democratizing it. I showed an example of using active learning on nanoHUB; one of the participants of a workshop, from India, learned about active learning there, and he wrote a paper using nanoHUB resources entirely on his own, following what we had done.
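[Editor's illustration] For readers unfamiliar with the closed-loop pattern being discussed, here is a minimal sketch of an active-learning / Bayesian-optimization loop. It is not nanoHUB's actual infrastructure: the synthetic objective function, the scikit-learn Gaussian-process surrogate, and the expected-improvement acquisition are illustrative choices. The structure is the point: a surrogate model proposes the next "simulation," the result is appended to a results log along with its inputs, and nothing is discarded.

```python
"""Minimal sketch of a closed-loop active-learning (Bayesian optimization) workflow.

Illustrative only. The 'simulation' is a cheap synthetic function standing in
for an expensive simulation or experiment; every input/output pair is logged.
"""
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def simulate(x):
    """Stand-in for an expensive run (e.g. a melting-temperature calculation)."""
    return float(-(x - 0.3) ** 2 + 0.05 * np.sin(20 * x))

def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected-improvement acquisition function for maximization."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def active_learning_loop(n_init=4, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    X = list(rng.uniform(0, 1, n_init))          # initial design points
    y = [simulate(x) for x in X]
    results_log = [{"input": x, "output": v} for x, v in zip(X, y)]

    kernel = ConstantKernel(1.0) * RBF(length_scale=0.2)
    grid = np.linspace(0, 1, 201)
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(np.array(X).reshape(-1, 1), np.array(y))
        mu, sigma = gp.predict(grid.reshape(-1, 1), return_std=True)
        x_next = float(grid[np.argmax(expected_improvement(mu, sigma, max(y)))])
        y_next = simulate(x_next)
        X.append(x_next)
        y.append(y_next)
        results_log.append({"input": x_next, "output": y_next})  # keep everything

    best = max(results_log, key=lambda r: r["output"])
    return best, results_log

if __name__ == "__main__":
    best, log = active_learning_loop()
    print(f"best input={best['input']:.3f}  best output={best['output']:.3f}")
    print(f"{len(log)} evaluations stored (inputs and outputs, nothing discarded)")
```

Swapping the synthetic `simulate` for a call to a real simulation service, or to an automated lab, is what turns this sketch into the kind of forward-generating discovery loop the question raises.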
The key challenge, in my opinion, moving forward is connecting these data infrastructures and AI systems to labs, whether completely autonomously or with manual steps; we can move in that direction. There are lots of efforts around even low-cost labs, little printers that you can buy and put in a high school lab, that can be driven by an AI system, and I think that's a way of encouraging large numbers of citizen scientists.

To some degree this is already happening. Coscientist, a system built at Carnegie Mellon as part of C-CAS, does some of that. I think the question becomes: when you have the smaller data sets that you mention and you do learning on them, how do you identify them? This idea of being intentional about what the right data is. For that, I think we're still learning how to identify the data spaces that you then explore. Once you've done that, you can move fairly quickly, and all the machinery to do it already exists. The initial part, I think, is a bit of an unsolved problem, in my opinion. If you can define the question, then you can start thinking: okay, I need this kind of data to do this, and it would take me this long to do that. But that initial step is where people are currently working.

A question from Zoom that I confess I don't know the answer to: what is the end game if a repository runs out of money and has to shut down? Have there been high-profile examples? Has someone swooped in to archive the data? And how do we think about that in a cost-benefit sense as repositories proliferate and that risk increases?

There's a company in our field, in materials science, called Citrine Informatics. The company is still there, and they supported an open repository for the academic community where they provided services for free; they are still very much involved with the community. We had data sets on their repository, and the company all of a sudden decided, well, this is not in our interest anymore, and they shut it down. At first the data was still available, people could download entire data sets, but we had to find other homes for the data we had deposited there. So this is always a concern, and as a community we need to take this very important question seriously, because there are examples, and this will continue to happen, and we need to think about sustainability. These are active resources: you need to maintain them and keep the lights on if people want to access them.

I think the highest risk here is not for reasonably well established repositories but for the long-tail data. Somebody has a three-year grant, they produce a nice data set, the grant expires, and the university says, well, that's not our problem. There are probably uncountable data sets like that that have been lost completely. In other areas, some NSF-funded repositories have tried to go to subscription models and pay-to-deposit models. I co-wrote a paper about 10 years ago on sustaining domain repositories, and that's a model, but a lot of people don't want to pay for stuff, especially when it's been free before.

Absolutely. And that's why I argue so forcefully for what I think is the right approach, which is that the funders have to accept the responsibility for long-term support of the research outputs that they create through their funding. And it's not that expensive.
They just should do it. Unfortunately, I'm aware of a 300-million-dollar data abandonment by the National Institute of General Medical Sciences: they spent 300 million dollars over 15 years supporting the Protein Structure Initiative, and at the end of it they killed it. When asked, well, what should we do with the data, they said: throw it away, we're not going to pay you to sustain it. And all of those data were lost, and there was a lot. Just going back to comments made earlier, there were a lot of negative results there that were actually very, very valuable, and all those data were abandoned.

Should they have tapped HHMI? Was that a possibility, and did they reach out to them? I would have thought that was their responsibility somehow.

This was a structural genomics initiative: solving large numbers of protein structures and putting them in the public domain to inform biology. In fact, had that not happened, we probably would not have had, as quickly as we did, the enormous innovation that took place with the application of AI and machine learning to the protein structure prediction challenge by various teams, but ultimately by DeepMind, Google DeepMind, with AlphaFold2; we would have gotten there eventually, but not as quickly. They chose to compete in that space because the data in the PDB were so clean. The data were all organized with a proper data model, etc., so it was very easy for them to move into that data set and apply their tools, their superior access to computing, their project management, their talented programmers, and jump ahead. Other groups had shown modicums of success using AI and machine learning methods, but it took that sustained effort at DeepMind, with the PDB data and the deep genome sequence data that was also available, to make AlphaFold2 a reality. Of course that was very rapidly followed by RoseTTAFold and OpenFold and all these different competitors, which are now actually doing a better job than AlphaFold. DeepMind has moved on; that was a publicity play for them, and they're now doing AI structure-guided drug discovery.

All right, we have a follow-up question on this general topic, which is: what are the pros and cons of seeking industrial partners, and, as a related question, how do you deal with proprietary data in that context?

Maybe I can start out on that, because this has been, from the very beginning, very much part of the model of C-CAS. I personally believe that it's extremely important to seek industrial partners, less because of funding, which for the most part won't come unless you demonstrate value, but because it helps you, at least in our space, to identify the important questions. That is really the initial step. Then you build on that, and you can start thinking about, well, how about contributing data: what formats do we have, what don't we have available, and how do you think about data intake? Talk with them, because they have dealt with it. The information they can provide is more important than the few hundred thousand dollars of funding you might get later on, once you demonstrate value.
I still think that interactions with industrial partners are very valuable, even though I agree with the person asking the question that the discussions around intellectual property, about what is proprietary, are difficult. We were joking about that over lunch: all of your data is non-proprietary and all of mine is proprietary. These are difficult discussions, but there are a lot of very smart people there with a lot to contribute, and if you want to have an impact I consider this essential.

For us on nanoHUB, the partnerships that have made sense, and this is very recent, are with commercial software providers, deploying their tools for educational use on nanoHUB, because a lot of our students are going to go out and work in industry and those are the tools they're going to need to know. For the longest time the software providers were completely against sharing any of their tools online through the cloud on nanoHUB, but over the last three years or so there has been a big change in their attitude toward the cloud and toward enabling student access to their tools, and I think that really benefits the community. Everyone can go to nanoHUB now and run MATLAB or a bunch of other commercial engineering and TCAD codes for free for education; professors can use them without having to download or install the software, which is quite difficult.

Probably part of that is also that the industry as a whole went from you buy the software to you rent the software. Yeah.

I would just add, Jake, that at NIST we have a lot of cooperative research and development agreements, CRADAs, with private industry, and after what is typically a five-year period, any data collected under a CRADA is obligated to become public. What that brings to light is that this move toward FAIR data management is just as important in industry as it is in the public sector and the university research sector. I was talking recently with a colleague from ExxonMobil who described challenges they have within their company in managing data: different divisions do things in different ways, they acquire another company that has a different data management philosophy and different data formats, none of them are interoperable, nobody knows what the other group has, and now they're all expected to work together and build an integrated system. So these FAIR data principles, and everything we're talking about here in terms of improving data management, are just as critical, if not more so, for the profit-making private sector.

There are as many protein structures in total distributed among the top 20 pharmaceutical companies as there are structures in the PDB. I would love it if the pharmaceutical companies would dump 200,000 structures on me tomorrow. I don't know how I would manage it, but it would be a huge step forward, because we would have 3D structures of many examples of the same protein with different small molecules bound, and that would accelerate computational chemistry in ways that are simply not possible at the moment. We've had small projects with industry to do that, and I would be very receptive to a much larger project, but that hasn't happened yet.

Can we bring the mic back there? Yeah, no problem. Danny Schultz, I'm at Merck. This has been really enlightening, and the question you just asked, Jake, is close to the one I was going to lead with.
I think the industrial sector as a whole is really interested in collaborating and innovating beyond our walls, because we know the power it can have, and the point brought up earlier about problem selection is really critical there, because it leads to really great applied science. So my question is: are there specific strategies or initiatives that industry can take on, or contribute to, to advance FAIR data practices, besides dumping 200,000 protein structures into a database?

Yeah, so in the last 12 months, roughly, we initiated a number of discussions, both in Europe and the US, with various roundtables, about what the right questions are: what are the questions that a company will never allow beyond their own walls, versus things that are best done cooperatively? To summarize in a grossly simplified fashion: tools, methods, and validation are things that are probably done better beyond your walls, particularly the validation part, because the question is, if you go outside your own ecosystem, does it still work? Organizations like C-CAS or nanoHUB can play the role of an honest broker, if you will, in that context, providing a neutral marketplace to discuss these different approaches. Those are areas that work exceedingly well. Once you get down to specific problems, I need an inhibitor for that protein, well, you don't even want anybody else to know that you're working on that target, right? So how you apply these tools to a particular question is probably not as well suited to that kind of cooperation.

I would second that. I think it's about identifying pre-competitive research areas where the data can be shared and the work is co-funded. The semiconductor industry does this very well; there's an organization called SRC, they fund a lot of academic work at a pre-competitive level, and lots of companies chip in. They work together and establish as a community where they want to go and what their priorities are, and it helps to have stronger ties between academia and industry. The second thing I would say is workforce development. The main thing we generate is not the research; it's the students who actually do the research and then go out. So appreciating and hiring students who are expected to be knowledgeable about data, knowledgeable about FAIR data, is something that can be fed back to our institutions and push the need to change how we educate. There has to be a clear need from companies so that universities can update the way we teach and close the gap with the education that is needed for modern chemistry and materials.

Regarding the data in patents, which are not peer reviewed: how reliable and trustworthy are those data for prediction, for use in AI/ML algorithms?

As usual, the answer is a strong "it depends." Yes, they are not peer reviewed, but they are reviewed, because there is a big stick there: your patent can potentially be invalidated if what's in there is not correct. Potentially; that is a big caveat, and there is a lot of inconsistency. What we find in analyzing the USPTO data is that you can filter a lot of that out by looking for internal inconsistencies. Once you've gone through that, I would say probably upwards of 80% is what I would consider reasonably reliable, and the other 20% you have to filter out.
I'm glad you talked about reproducibility. The dirty little secret of NIH-funded biological research in this country is that 50% of publications cannot be reproduced. The patent literature, I believe, as you do, is much more reliable, because in companies results have to be consistent. The assay can't be run three times so you can choose the data point you like, move on, and try to get a paper; that assay has to work every time it's conducted in the company. I'm not saying that 50% of the papers published with NIH money are the result of fraud. I don't believe they are; there is some, to be sure, but I think it's a very small number. I think the problem of trying to do research in biology is best exemplified by thinking about what happens to a cell in a tissue culture plate. The genome of that cell today will be different tomorrow, and different again the next day, because cells living in tissue culture are constantly evolving under the selection pressure of having to survive in that environment. So doing controlled experiments in biology is just really hard, and it's not surprising that there's this variation in NIH-funded research. I think protein crystallography, mass spectrometry, and electron microscopy are among the most reproducible of the biological sciences today, and part of that is the PDB and the data validation, the data consistency, and so on.

There was a question on the chat asking about the distinction between what's reported in the paper and what's actually stored in the repositories. In the PDB there is typically much more data concerning the experimental system than there is in the publication, so if you want to be able to reproduce something, you're frequently better off going to the archive and not to the literature.

Years ago, my colleague Anne Plant at NIST and I wrote a paper on reproducibility challenges. As you say, it's very rarely fraud, very rarely intentional corruption of data; especially in biology, it's the unknown unknowns. There are experimental parameters in laboratories that can change from time to time, the temperature, the humidity; the reagent maybe is not what you ordered from the supplier; the mouse line is not what you were told it was; and so forth. There are just many things affecting your experiment that you're not aware of. My colleague Barend Mons in the Netherlands talked about visiting a lab, I think it was in Argentina, I don't know, where they were having great trouble reproducing some experiment based on cells. They couldn't figure out what was going on, so he went down there and looked over their shoulders, and they were smoking in the laboratory and blowing their cigarette smoke onto their samples. Well, you've got to stop smoking.

All right, we're nearing the end of our time. I want to ask what's hopefully a quick question and then we'll go on to the wrap-up; any of you can weigh in on this. There was a general question on the chat about where GitHub fits into the repository ecosystem, and I'm curious what your thoughts are on that.

Yeah, well, I think it, or any type of version control system, has a huge role to play.
But what I think is that putting your stuff in a Git repository is not the end of reproducibility and sharing and FAIR, for the reasons I mentioned. When you deploy a tool on nanoHUB, the code is in a Git repository and we use version control, but what's published is a containerized version of that piece of code that has all the environment and settings, all the packages and libraries that are needed, so it can actually run reproducibly; it will always run the same way. So these are complementary infrastructures. Of course, putting things on Git is better than not sharing, or putting them on your website as a tarball, but it's not going to be as discoverable, it's not going to be queryable to the same degree, and it's not necessarily going to be reproducible without the right environment if it's code.

For the Protein Data Bank we maintain GitHub repositories for both the code and the data model, the data dictionary, so all of that is publicly available.

One thing I'd add, to address some of the problems I was mentioning: I think it's not enough to deposit the code. You also have to provide a test suite, so that you know that if you run this, you're supposed to get these answers, because the worst thing that can happen is that you run it on a different machine, in a different environment, and you're not aware that something changed.

All right, so we will just about wrap up the panel. I think this was a terrific discussion, and I just want to end by reflecting on the fact that we're in Washington, DC; this is where people set the budgets. I'd like to be optimistic, as I said multiple times when we were planning this workshop, that open data in particular is not going to be an unfunded mandate. I think most if not all of us here are US citizens, so I'm curious: what can we do, how can we advocate for this, as citizens, as scientists, and as interested parties? How can we help ensure that open data is actually achieved and sustained, from the perspective of talking to policymakers and helping them understand the urgency?

The community has to speak. As I said, I've been trying to encourage these changes for decades with very little success, I'm afraid. When I talk to my colleagues at the National Science Foundation, they say: well, we need to hear this from the community; we won't just impose this, and we won't dock their budgets two or three percent to do this, unless they demand it. So the community has to understand the value proposition, which I think is very clear to the people in this room, but the agencies have to hear it as a groundswell. My experience in astronomy is that if NASA stopped funding the discipline data centers, NASA headquarters would be stormed; there would be a riot. That's the kind of groundswell I think is needed. Until then, we'll keep talking the right talk and hoping for change, but it will happen a lot faster if the research community demands it.

Maybe as a first disclosure, I'm actually not a US citizen. But anyway, I think Stephen made a very important point earlier: there are times in history when you can demonstrate the necessity of this in a unique fashion, and as you said, the pandemic was one of those.
And I think part of why C-CAS is doing really well right now is simply that everybody is talking about machine learning, or AI if you prefer that term, and the community needs to be prepared for the moment when it is asked. Some of the members of C-CAS have gone back and forth between their universities and Washington many, many times in the last six months, because there is a major AI freakout going on. Those are the points where you really can say: look, in order to do this right and not get wiped out, here are the steps we need to take.

I think we need to continue to talk to our representatives and advocate for science. I think we also need to learn how to talk to the general public, and I think we do a pretty bad job at that. We are a land-grant institution, right? So we serve the people of Indiana and the nation, and I don't think we have good avenues to communicate and tell our story; and more than the avenues, we don't have the right language to communicate with the general public. We do a poor job of talking to folks who are not geeks like us and of seeing a little bit outside of our community. And I think that's important, to have educated citizens who take this into consideration when they go to vote. So I think we can learn from other fields. We have had amazing communicators, Carl Sagan, Dick Feynman, who actually went outside; they were not just talking to their colleagues, they were talking to people. I think we need more advocates like that for science.

In all fairness, there are some great people in chemistry. There's somebody here in DC, actually, Raychelle Burks, who is just fabulous at this, and André Isaacs up at Holy Cross. We have people like this, and whatever we can do to give them a platform and get the word out will achieve exactly that.

Stephen, last word.

I agree with Alejandro: science communication is going to be the key. Training students to talk about science in ways that people find accessible is essential for the long-term viability of the scientific enterprise in America. And I agree, we do a terrible job right now.

All right, that was a slightly pessimistic last word.

Let me just say that, even if we have been doing a terrible job, I think the problem is surmountable. But we have to make the decision that we're going to reward students, reward assistant professors, reward professors for going the extra mile to do the communication. Elaine, you talked about the need to completely revamp the value system within universities: what it takes to get tenure, what it takes to get promoted. This is all part and parcel of the same issue. It's not about burying your head in the papers and publishing as many minimal publishable units as you possibly can to meet some weight-of-paper test to get promoted. It's the impact that the work has, and the impact the individuals have, that should be rewarded. And we can do it. We can do it.

All right, thank you so much. We're going to take a quick 10-minute break, and then we'll come back and have a wrap-up and give you a preview of what's coming tomorrow morning.