All right, so I think I want to open with a very broad question, which I think can trigger a lot of discussion. I think we're becoming really good at making data accessible. All of us are really rigorous about that when we publish papers, and we have our data sharing plans when we write our grants. But are these data reusable? Do we share them in a way that really enables integration? It looks like there are a few people who are thinking about integration, but I don't know if the average scientist, when submitting their data, is really thinking about how other people are going to use and integrate those data. How transparent are we when we submit data and make it open for other people to use? The first two talks really highlighted how you need to think about the language and the characteristics of the data. So I want to open this for discussion. The microphone.

Fundamentally, the I in the FAIR data principles is the hardest. The reason is really that we are releasing data and making data sets available, but oftentimes those data are submitted in a rushed format by an undergrad or grad student who may have never navigated the NCBI SRA, for example. And there is not a lot of good infrastructure in place to help curate the metadata associated with submissions of genomic or metagenomic data sets. So I think that having better infrastructure in place to help with that curation would be important, and, importantly, that we should start to use our own analytics to do it. Why do we all have to start from square one with understanding GSC and all of the standards, et cetera? Is there a way that we can suggest metadata or ontologies to people to help guide them along?

So I agree with everything she said. But just to follow up on some of those things: one of the things people assume about the I is that if we have a common identifier, or even a common label, we can just hook those data sets together. Harmonizing data is a lot more complex than that. It's all about the context, the models, the context in which the data were captured. Seriously thinking about how to harmonize data actually takes a lot more ingenuity and technology than people think. There are lots of great tools that allow you to navigate linked information, but that's not the same as having integrated data that we can do mechanistic kinds of things with. A case in point: we've been doing a lot of work recently on harmonizing disease definitions, because the way you define rare Mendelian diseases differs across places and countries. It turns out that you might get a different diagnosis depending on which terminology you're using, because of the context of those terms and terminologies. That's one example. The other example is how we go about moving data in an interoperable fashion across these chasms of semantic despair, as we like to call them. Say, for example, I want to do a study of environmental health at the ocean's edge, looking at the microbiome in the ocean and in the seafood that people eat. How can I get my data to talk to your data? It's really, really hard to cross some of those big chasms that we need to cross in order to really understand the phenotypic effects of changes in the genome.
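To illustrate the kind of lightweight curation infrastructure described above, here is a minimal sketch in Python. The checklist is hypothetical, loosely inspired by MIxS-style minimum-information standards; the field names are invented for illustration, not the actual GSC standard.

```python
# Minimal sketch of a submission-time metadata checker. The checklist
# is hypothetical, loosely inspired by MIxS-style minimum-information
# standards; it is not the actual GSC field list.

REQUIRED_FIELDS = {
    "sample_id": str,
    "collection_date": str,  # ISO 8601 expected
    "lat_lon": str,          # e.g. "44.06 N 121.31 W"
    "env_medium": str,       # ideally an ontology term label, e.g. from ENVO
    "seq_platform": str,
}

def check_metadata(record: dict) -> list[str]:
    """Return human-readable suggestions for an incomplete record."""
    suggestions = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            suggestions.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected_type):
            suggestions.append(f"{field} should be a {expected_type.__name__}")
    return suggestions

if __name__ == "__main__":
    draft = {"sample_id": "S001", "seq_platform": "Illumina NovaSeq"}
    for suggestion in check_metadata(draft):
        print(suggestion)
```

A repository could run a check like this at submission time and suggest missing fields or ontology terms back to the submitter, instead of leaving a first-time SRA user to reverse-engineer the standards.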
All right, so we're over there, Ashley. And remember to introduce yourself and use the microphone.

Hi, Ashah Zender. One thing that came up, I think in the symposium that you put on last year, was that a lot of this is driven by journal requirements. We've been looking at publicly available data as well, and there's a big difference between some of the journals, like JEB and BMC Genomics, in terms of what they expect their authors to deposit, in terms of the format and the structure, and what authors can get away with. So I agree: academics are driven by journal requirements. If the journals have a strict requirement for how data is deposited and in what format, then authors must comply with that. So I think a lot of it can be driven by journal requirements.

I actually wanted to bring that up about the journals, because a lot of our data submission really is driven by journals. So, Sophie?

Okay, I have several thoughts on this. Somebody brought up, I can't remember which one of you, how making your data usable by others is something that should be rewarded and expected by granting agencies. One of the other hats I wear is working on pediatric cancer genomics, where this is a huge problem, and some of the biggest funders, like ALSF and St. Baldrick's, have actually made that a requirement, a category of their evaluation of grants. And it seems like NIH and NSF and USDA could do the same. But do they agree on what should be shared? The way data is shared should also be homogeneous. You have to prove that your data is being shared, that other people are utilizing your data, and that's a factor in the evaluation. But the other thing, getting back to the journal point, is that there are different ways you can deposit your data, and there's an incentive, because you don't get rewarded for it, to do it the easiest way possible. For example, EBI makes it very easy to deposit your data, and that's great, but then it makes it harder for other people to use it, as opposed to depositing your data in GEO, which has a few more metadata requirements.

That's true. Just to insert, I think it's the privilege of a moderator. The journals are great, but they're not about interoperability; they're about format matching. I'd be interested in what you guys think about the requirement being that data are deposited in a place that checks not just your syntax but whether it makes sense compared to other things already deposited. So there's already a unit test of interoperability that occurs in a national repository someplace. What do you think of this?

I think that's fundamental to actually getting the I, interoperability, moving forward. And I think one of the major issues is that there are so many different kinds of ontologies, and we can't seem to agree as a community on what to use. But if we had some sort of mechanism, some sort of harmonization score, something that gives us a measure of how closely you're matching other data sets based on the semantics you've used, that would be very helpful.
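A toy illustration of the harmonization-score idea just raised: a minimal sketch assuming each data set's metadata has already been mapped to ontology term IDs. The term sets and the simple Jaccard scoring are hypothetical; a real score would also weigh term specificity and cross-ontology mappings.

```python
# Toy "harmonization score": Jaccard overlap between the ontology
# terms two data sets use in their metadata annotations.

def harmonization_score(terms_a: set[str], terms_b: set[str]) -> float:
    """Jaccard similarity of the semantic vocabularies of two data sets."""
    if not terms_a and not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)

# Hypothetical submissions annotated with ontology term IDs.
submission = {"ENVO:00002006", "OBI:0000070", "ENVO:01000048"}
repository = {"ENVO:00002006", "OBI:0000070", "UBERON:0002107"}

print(f"harmonization score: {harmonization_score(submission, repository):.2f}")
```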
Okay, there. Yeah, here.

I would be very interested in seeing some support for a trait database for all species as a way of integrating things, because that then advances the science a lot. And the second point: I'm just curious about your thoughts, like in your presentation, on the quality of the reads and so forth. If you have a database where people can deposit anything and everything, it doesn't become usable; it sinks to the lowest common denominator of the data. Like that example of someone depositing sound recordings from the environment. So what I'm wondering is your thoughts on a mechanism for setting up quality control as data is put in, or at least annotating different quality levels.

So I want to say that quality is of the utmost importance, because as we all know with any kind of computational analysis, garbage in, garbage out. And I know that the National Microbiome Data Collaborative has made that one of the fundamental pieces of putting together this project that makes data interoperable: we don't just let anything through the gates, but we actually have some metrics to determine the quality of the data. And I think that's fundamental for comparative genomics, along with all of the issues that Melissa discussed, and also Beth. I'll let them talk about that, because I think those are fundamental.

So there are kind of two points there, and also going back to the journal problem. At least for genomic data, the field is mature enough that a genomicist knows what quality data looks like when they see it. We're not there yet with phenomics data. We're just scratching around the edges; we have some algorithms and some tools to help us assess the quality of phenotyping data, but it's actually really challenging. So that's one problem: it's just so early that we can't yet define what quality even is. The second problem is with the journals, especially for Mendelian genetics. Right now we often end up, and this is true for a lot of the model organism data as well, with summary tables in journals that say this many organisms have these features. And you don't know which features went with each other, right? You don't have any individual-level information, which maybe is okay given all the other issues I talked about with how we associate that with the actual whole mouse genome, really speaking to Beth's talk. But it's really not okay when we have literally three patients in the entire literature and they're all lumped together and we don't know which ones had which features. So we're trying to work on standards that can be kind of like a FASTA for phenotypes, usable in these different contexts, through the Global Alliance for Genomics and Health; they're called Phenopackets, and they try to address that. But again, we don't have a FASTA for phenotypes yet. This is our big challenge in making these things standardized and available alongside standardized genomic data.
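To make the Phenopackets idea concrete, here is a minimal sketch of an individual-level phenotype record as a plain Python dict, loosely modeled on the shape of the GA4GH Phenopacket schema (check the actual specification before relying on these field names; the two HPO terms are real, the case is invented).

```python
import json

# An individual-level phenotype record, loosely modeled on the GA4GH
# Phenopacket schema. Each feature stays tied to the subject it was
# observed in, unlike a lumped summary table.
phenopacket = {
    "id": "case-001",
    "subject": {"id": "patient-A", "sex": "FEMALE"},
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        {"type": {"id": "HP:0001263", "label": "Global developmental delay"}},
    ],
}

print(json.dumps(phenopacket, indent=2))
```

The point of the structure is exactly what the panelist describes: each feature stays attached to the individual it was observed in, rather than being lumped into a summary table.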
Yeah, and I just want to interject one point, to layer one more piece on top of that: I think we do it well now for raw data sets in genomics, but for post-processed data sets, things that have some sort of analysis on top of them, there is really no metadata or information provided about how to reuse them. There's been a huge investment of computational time in producing them, and we need to get better at that too.

Okay, we have a question there, yes.

So in terms of interoperability, I really liked, I think it was Bonnie's point, the point about touch points between data sets. I think this is really the way to have deep integration between very disparate things. And it would be very useful for people to compile a list of these touch points, the types of things you should be looking for when you want to bring very disparate things together, and make a sort of meta-list of them.

So I did indeed bring that up, because one of the things we found in trying to bring together all of these ocean science data sets is that you can't do it all. The data sets are so ugly and so monstrously horrible that half the time you really need to find some small semblance of pieces that will actually fit together. In the ocean science world, that's latitude, longitude, depth, and time. These are things that everyone has to have, even though they all record them in a different way. And I think that needs to happen in many other domains too. Let's not try to conquer it all. I think you made a great point about doing things prospectively, and that's going to be important. But that was Melissa, that was Melissa. Anyway, I think that's important. Let's build the architecture to do that and to do it better; let's do this better as a community. But at the same time, we want to reuse all the wonderful data that's out there. Let's find those touch points; let's figure out how to connect it.
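A minimal sketch of the touch-point approach just described, assuming two invented ocean data sets that both carry latitude, longitude, depth, and time under different column names:

```python
import pandas as pd

# Two disparate (invented) ocean data sets that share only the
# touch-point fields: latitude, longitude, depth, and time.
microbiome = pd.DataFrame({
    "lat": [44.6, 36.8],
    "lon": [-124.1, -121.9],
    "depth_m": [10.0, 5.0],
    "time": pd.to_datetime(["2019-07-01", "2019-07-02"]),
    "otu_richness": [812, 1045],
})
chemistry = pd.DataFrame({
    "latitude": [44.6, 36.8],
    "longitude": [-124.1, -121.9],
    "depth_meters": [10.0, 5.0],
    "sampled_at": pd.to_datetime(["2019-07-01", "2019-07-02"]),
    "nitrate_umol": [3.2, 0.8],
})

# Normalize both onto the shared touch-point schema, then join on it.
chemistry = chemistry.rename(columns={
    "latitude": "lat",
    "longitude": "lon",
    "depth_meters": "depth_m",
    "sampled_at": "time",
})
merged = microbiome.merge(chemistry, on=["lat", "lon", "depth_m", "time"])
print(merged)
```

The join works only because both sets, however ugly otherwise, carry the same four touch points; everything else stays domain-specific.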
Can I just extend that for a minute? I want to ask you a question about this. I think, Melissa, you mentioned this, but of course this is an image of your work, Bonnie: some of the touch points we all use are things like homology between genes, phylogeny, taxonomy of some sort or other, sometimes chemical similarity. But those are all conditional analyses. That is, they are things of the moment, and people have different theories for why you should use one or another. How do you encode, when you are looking for touch points between data, the fact that those touch points are conditional on a particular analysis outside the data set itself?

A quick note there. I think one of the things that's really important is developing those common pipelines and protocols as a community, deciding this is the pipeline we're all going to use, so that those comparative analyses are interoperable and we can understand them together as a unit. And then, on top of that, making those analyses and pipelines available in containers that can run anywhere, because not everyone has all the compute in the world, as much as we all want it. So I think that kind of democratization of code that can run anywhere, and harmony of those analyses, will help us.

Okay, I have a question over there.

I wanted to talk about some issues regarding data interoperability and sharing that weren't brought up in the session. One is standards for sharing genomes. One of the things that, for instance, DNA Zoo does is share all the genomes without restriction, even ahead of publication, as soon as they're available. There are a lot of genomes sitting in these two-year, three-year, four-year waiting processes until they finally publish, et cetera, and I think that's slowing down the field as a whole. On the other end of things, I think it's really important to increase transparency throughout the process. That means sharing not just the raw data but the various intermediate processed data types, so that it's very, very easy for people to reproduce the whole pipeline. And indeed, as much as possible, to encourage people to make protocols publicly available. It's very frequently the case, for instance, that people have the option to use a publicly available protocol or a proprietary kit where it's never going to be known exactly what the experiment was. Encouraging people to make their experiments as reproducible as possible also means using public protocols and sharing the detailed protocol that you generated. That may have been more of an opinion than a question.

I think everybody agrees that this is encouraged, but it's not enforced. And I agree with you about protocols. If you look at ChIP-seq protocols, nobody specifies the amount of antibody or the number of cells. It's really hard to reproduce some of these protocols.

I just want to put in a plug for protocols.io, which is sort of the GitHub of protocols, where you can fork and create new protocols from existing ones. They're dynamic: you can ask the original author about different points in the protocol. A number of PLOS journals and other journals have adopted protocols.io as a mechanism, and I think that's important. We link to protocols in protocols.io for our project to make it as transparent as possible.

I just want to say another quick word about that. I also love protocols.io and would highly recommend it. One of the nice things about it is that you can also link specific things, these ontologies, these kinds of things, within it. The other thing, and maybe this is also something to talk about tomorrow, is the literature itself. I actually had the honor of reading the first Philosophical Transactions, the science journal that turned 350 years old a few years ago. The current way we write an article in a science journal doesn't look very different than it did 350 years ago. It's a place where we tell stories about our observations. It's not where the data is; it's not where the experimental details go. It's really the ecosystem of the protocol, the data sets, our data visualizations, our resources, the physical resources as well as the computational ones, like the software packages, that together constitutes the science. And I think we have to move away from that journal-centric view of the world in order to solve many of the problems that we're talking about.

Hear, hear. There's a question over there.

I just want Adam to comment on KBase. You have the tool where you capture the protocols and experimental procedures; you implemented that as part of the platform. Maybe you want to shine some light on that.

Well, first I want to third that: protocols.io is awesome. What KBase is supposed to do is capture your data, your analyses, and your thoughts simultaneously, in these things called narratives. Those are shareable and transportable, in a sense, within KBase. Unlike some tools that are much more distributed, KBase is more centralized, and the fundamental reason for that is the interoperability issue: being able to check, when your data comes in, whether it is truly meaningful in this sense. But one of our big goals is to allow easy push-in and push-out of KBase, to do that sort of interoperability checking, and to propagate that knowledge quickly.
But I think this is one of our big challenges: the things that Bonnie and Melissa brought up in particular play out a little differently for us than they do in mammals. We do lots of different organisms beyond mammals, and it becomes very complicated.

One of my questions for you, and I'm not trying to use this as a bully pulpit right now (by the way, everyone should sign up for KBase; it's very, very useful): take something like a plant genome, which takes forever to assemble and forever to annotate. There's a general agreement for how that's done. Do you think that's the wrong way of going about things?

Well, I mean, look, I'll give you an example. We have at least 20 unpublished plant genomes right now with outstanding collaborators, but getting everyone in those groups to agree to share the data ahead of publication is an issue. We're talking about nine de novo genomes of chickpea, different accessions. If this kind of stuff were publicly available, it would greatly accelerate research in chickpea, for instance. But it's difficult. We just don't have a culture where it's expected that you share all of this and that people won't do certain antisocial things. If it were understood that there were strong social norms, say that if someone does nine chickpea genomes and you're going to do the dead-obvious things one would do with nine de novo chickpea genomes, you reach out to them and do it collaboratively, or something like that; if there were a few social norms like that, and something like a Nagoya Protocol-type statement, I think people would be more comfortable doing it. I just think everything would speed up a lot.

Titus?

Hi. That's good. OK. Titus Brown, UC Davis. So I think one of the things that comes up a lot is how do you do enforcement? Sorry, is this on? Say again, sorry? Can you repeat? Sorry, I'm not close enough. I'm rarely accused of being not loud enough, so I'm unused to this. OK. So whether it's the granting agencies requiring it or the journals requiring it, the enforcement is done by the reviewers, because the program officers and journal editors generally have less of a clue, at any rate, than the reviewers do. And so the social aspect that's come up a couple of times now really seems to be key. There's this requirement of some sort of community of practice, and there may be many different communities, and many different communities of practice, given the variety of people represented in this room. Basically, we need communities to develop, disseminate, and enforce whatever it is that is good practice. I've been thinking about this a lot, and I don't have any solutions; if I did, I would be saying things much more loudly. But there's a really strong role, in bioinformatics at any rate, for both open online communities and global training in moving bioinformatics to the point where, at least when I review papers, I can now say: here are the best practices for making your data available, your code available, and your analysis reproducible. And that's the sort of thing I would like to figure out how to support at the funding agency level, in a way that right now is very challenging.

Yeah, Claudio?

It's Claudio Mello from OHSU. I want to bring up a point that I was not aware of until quite recently, which refers to the quality and certification of the databases that are deposited.
There's something called the NIH Data Science Strategic Plan, and there's a webinar in mid-April. I knew about this because we have a small database of gene expression in the brain of a songbird, and we were invited to a funded workshop. It turns out there is a whole set of criteria and guidelines for getting an approval, something called the CoreTrustSeal. That's one example of a certification you can get: you submit your database and you meet a certain number of criteria. So it's an interesting starting point to discuss, and perhaps to apply to genomic and expression databases, transcriptomics, and so forth. For us, though, it turned out not to be very practical, because of getting the certification and then maintaining it. One issue is: you deposit your data, and then how do you curate it, how do you maintain it, when you update and upgrade? We don't have a team to do that; remember, we're a small lab. But it's an interesting concept, whether genomic databases should have certificates, because it would set a high standard of quality that you could trust. Trustworthiness is one of the criteria. So there's a webinar, and it's perhaps an interesting question to bring up and consider in terms of the quality of databases. I wonder what the panel has to say to that.

Yeah, so. Oh, sorry. So this kind of speaks to both of your points. I think one of the challenges is that generic programs like the CoreTrustSeal or FAIR are helpful in the sense that they create the recognition that we need to do more to make our work reusable, but they're not specific enough to actually do science with any of that data. So at the end of the day, it comes back to Titus's point: the reviewers are really the ones that need to know. For example, both the FAIR and the CoreTrustSeal requirements don't delve deep enough into licensing for those of us who integrate data and want to redistribute it. Currently I'm illegally redistributing very many NIH-funded resources because of a lack of licensing interoperability. But nobody but me, or a few other people in this room like Titus, would know that, right? So it's...

You know that you're being video recorded, right?

Yeah, we would never do that. But they're all going to change their licenses after this meeting, right? The point being that when it comes down to actual science, these are helpful resources that push us to do better, but they don't actually get us to the interoperability that we need for the science. And then when it comes to the journal reviewers, the problem is that there are so many standards to choose from, right? Journals are somewhat randomly picking among them, so when you go from one journal to another, you end up with different requirements and different answers. The communities of practice are not adequately driving the actual standardization of best practices. So how do we coordinate these things, I think, is a big question. Education is a really big part, but it's the whole ecosystem that's kind of broken in that sense. Sorry.

I just wanted to add one extra point on top of those great points. Oftentimes as an integrator, and I don't know if you've had this same experience, I have been called a data pirate, and I thought, wow, my goodness, I really have no interest in your data set in particular. It's really about the amalgamation of putting all of these data sets together.
So I think there's this fear of data piracy, of having your work stolen from you, and also a fear of some of these new incoming technologies that can amass a lot of data from a lot of places; having things integrated actually opens the door to that. So I just bring that up as another thing I have seen.

I just want to add one really quick point. There's an award called the Research Parasite Award that you can actually win. You should apply; I'm on the panel. But I also want to make the point that we get that same impression. Here we are, diagnosing patients like Jessica, cases that make me cry every time, and the sources of the data are not happy that we're taking their data and doing this. This is why it's so important that we have these use cases that demonstrate the scientifically positive outcomes for society of doing that reuse. And let's not just do it for patients, but for all the different cool things. And I think, for a data resource that's trying to get its funding renewed, being able to say somebody used my data and solved a disease with it should be a feather in their cap. But right now it's actually the opposite: it reads as something they didn't do themselves, and so they don't get funding. That balance is wrong, right?

Well, we go back to the social behavior that was mentioned before. Yeah.

I think there could be models of credit. One of the great things about having these data systems is being able to have a different kind of citation. It's not just the journal; it's who actually used your data, and being able to say my data was very influential.

Saved a life, yeah. Yes. And then, yes.

I have kind of a question and a comment. The question: a number of folks have started talking about making all data sets have a DOI. There was an article in Nature in June about this, and I'd be curious to hear whether the panel thinks that will help in terms of sharing data. The comment I wanted to make quickly was about the quality of some of this public sequence data. I've worked with some of it, and I've definitely seen that it spans a large range. Some of the poorer-quality data has value, I think, because it's a teaching opportunity for students, to show what that data looks like; like negative controls, bad controls are sometimes good. But I do wonder if there's either a way we can share software to help folks define what's good data, or how we help with that in a way to say maybe what's... Screening. Not "bad data," but in a way that doesn't offend people whose data is maybe not so good.

And then we have another question, but you can take this, yeah.

Yeah, so I'll take this one. I'm obviously not somebody who integrates data at all; I'm just a naive generator of data. But to your point about variable quality of data: this is something I've grappled with as well, and I don't know what the solution is. I feel like Bonnie and Melissa might be better suited to address that question, but I think we probably do need rigorous rules in place about what meets certain cutoffs. And I don't know if there needs to be a data police structure that just patrols the SRA archives. But I'll turn the floor back to you guys, who are actually the ones integrating this and maybe more on the ground floor of what it's like to deal with data sets of variable quality.

So with respect to the DOI: there's a certain feeling that the DOI is special, but really what's special about it is that it's persistent, and you can persistently retrieve something. There are many systems that do that. DOIs might be suitable for some circumstances, but they can sometimes cost more money, depending on who's managing them. So the answer really is finding the right host that can guarantee the persistence and the resolution of the information about your data set, and making sure that that's part of your data sharing plan and your regular scientific processes.
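A minimal sketch of the persistence point just made: a DOI is a name resolved through the doi.org proxy, which redirects to wherever the data set lives today. The DOI below is a placeholder, not a real identifier.

```python
import urllib.error
import urllib.request

# A DOI is a persistent name; the doi.org proxy maps it to the current
# landing page wherever the data set lives today. The DOI below is a
# placeholder; substitute a real one to see it resolve.
doi = "10.12345/example-dataset"
url = f"https://doi.org/{doi}"

try:
    request = urllib.request.Request(url, headers={"User-Agent": "doi-demo"})
    with urllib.request.urlopen(request) as response:
        # geturl() is the final URL after redirects; the persistence
        # lives in the proxy mapping, not in the landing URL itself.
        print("resolves to:", response.geturl())
except urllib.error.HTTPError as err:
    print(f"{doi} did not resolve (HTTP {err.code}); placeholders won't.")
```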
If I can just make a quick comment on this, because of the CoreTrustSeal: I agree with your comment about its lack of applicability to the science, but these criteria are part of it. So at least I'd recommend going through those criteria for high quality; it addresses some of these things, like longevity and trustworthiness. So, just to make a comment: I agree with your points, but there are some criteria there that it would be useful for people to go through and think about.

That would be good. We've actually sent them a number of recommendations for how to improve their criteria for data integration and data reusability. So I actually think that if they worked more with the data integration communities, their criteria would improve to reflect a bigger landscape than they've had in hand in the past.

All right, Paul.

How exciting. I'm Paul Thomas, University of Southern California. I work on gene phylogenies and the evolution of gene function, and I'm also involved in a couple of consortia, like the Quest for Orthologs consortium and the Gene Ontology consortium. One of the things that strikes me about this session, and I'd love to hear some comments about this, is really the extent of the infrastructure you're starting to talk about. We talked in the morning session about the importance of having reference genomes across the tree of life; I think that's an obvious starting point. What other sorts of infrastructure do we need to build on top of that? Bonnie talked a lot about ontologies and standards. Melissa talked about standards as well as interoperability through evolutionary relationships, which is really an analysis on top of the genomes, and that's really where to leverage and utilize the comparative genomics. So what are we thinking about in terms of infrastructure that's reusable, that's a community effort, at the level of comparative genomics? As we know, predicting orthologs, for example, relies on good gene predictions, which in turn rely on good genomes, and there's a virtuous cycle where the comparative genomics feeds back into improving each individual genome. So the ecosystem starts to get pretty big, and I was wondering if you could comment a little bit on the size of the necessary infrastructure.

So, harkening back to our earlier talks this morning about how data is now more precious than oil: I think we need to start investing in our Amazon recommendation systems of the future, so that Beth and others with really deep insights into the data can actually make recommendations, or suggest improvements to a data set to increase its reusability. What do you think of that? Volley back to Beth.

I mean, a question I have for you two, as somebody who generates data and wants other people to ultimately use it, is: what should I do to maximize that possibility?
Because frankly, I'm a little bit clueless. I follow the default path of uploading my data to the SRA, but what more can I do?

Well, unfortunately, the answer is: not enough, because we don't have enough tools to actually help you. But one of the gaps that I think Paul is maybe getting at, and that Bonnie spoke to as well, is how we share back the improvements we're making. For many resources and data sources, whether it's an individual lab or a database, we don't have a good way of sharing back the improvements, the enhancements, and the conclusions drawn from them. There are some data sets we know of that probably have had 10,000 people write the same script to clean them up in the same way, and there's no way to feed that back to the source so it can be more readily disseminated; the next person just writes the same script again. So I think this is one of the infrastructure challenges that we have, and it's not just about portability, because we can still hand it back to you; it's about best practice as a social problem, and how we share that back. And then you, as a data provider: if somebody took your data and did something useful with it, how would you like to receive it? Tell people how that might work, and maybe you'll start to get some of it. And that's very unique to you and your data, even though it may use some of the standards and these kinds of things.

I should also just add one comment: there is a black market among postdocs and graduate students for this that we need to learn from.

So, Mark. Microphone.

So you were mentioning this data economy thing, which I think Carlos alluded to earlier. This is obviously a very popular word now. One of the problems we have is that in this data economy we have no currency. In academia, the only currency really is publishing papers and so forth. There's very little way to compensate people for sharing data or for sharing improvements and so forth. And I think we have to develop some sort of currency for doing that, to encourage people. I don't really know how to do that, but it would be really good to talk about.

In part, that's what KBase is supposed to do within its system. That is, you propagate your data, you share it, people use it, and you know that; and once they've shared it back to you, you can see it. The idea is that this credit score you get, how much you donate versus how much you influence, seems important.

So, just to speak to that a little bit. A few years ago I actually worked with NIH; if you remember that painful change when we changed the biosketch, I was involved in those discussions. Part of the reason was so that we could start to put things besides publications on our biosketches. So when you hire somebody, or when somebody's going up for tenure and promotion, you should be valuing the things on those biosketches and CVs that are not publications. And how do we do that? Especially for the more senior people in the room: you put them on your CV, you put them on your biosketch, so that when reviewers review your grant or your tenure and promotion case, they see that those are things that are valued in the community. So all of us here today can start to help with that problem.

So we're just going to take questions and comments from the audience now, because we're finishing up. Carlos?

Fantastic discussion.
I just wanted to add, and maybe this is in part to your question: if we think about the rise of preprints and just how transformative bioRxiv has been, how do we produce the bioRxiv for data, with the right incentives around badges and the kinds of things we need to do? One of the projects I was involved with is ClinGen, which had this kind of problem in some regards, because you're taking data by eminent domain. You're saying to the labs: you'd better deposit with us, or we're going to make it really difficult for you down the road to get reimbursement. So share first is sort of the carrot, and the stick is we're going to go talk to CMS. There's nothing quite equivalent here, but we did create that kind of incentive, particularly around curators and giving them credit, which is part of what we had to do. I'm curious about thoughts on the data side of how to make that open. Where's the bioRxiv for Docker containers that hold data and methods for reproducible research? Because I think that's the other huge part of this, right? We ran an analysis, it worked once on one very, very spit-and-glue kind of system; how do you make that really scalable?

Titus?

I wanted to mention a word that hasn't appeared so much, which is software. Bonnie mentioned it during her talk, which was wonderful, but a lot of this ecosystem we talk about depends on, and in fact consists of, software. And it's this odd fact that the deeper and more broadly useful the software is, the less it's been supported by federal agencies. For example, the Python scientific computing ecosystem, which underlies about half of data science, is being supported almost entirely by philanthropic foundations: Sloan, Moore, and now the Chan Zuckerberg Initiative. Basically, federal agencies have been really good at funding specific methods, but not deep infrastructure building. And I would love to see that fixed.

Okay, well, you pick. Sorry, you're first; I was going to start with Melissa.

Hi, I'm Melissa Wilson at Arizona State University. One thing: we talked about all the genome generation, but not about the alignments, who's going to produce those, and what the reference should be when we're doing them. Maybe even more important is to have a discussion among data generators, software developers, and analysts about batch effects. SRA now strips the read groups; that's one thing. The European Nucleotide Archive does not do that. And I feel like there's not been enough outrage in our community that we lose the batch information. So, thinking about data quality, there are certain things we won't be able to disentangle if we don't know how the data were generated, and that's just in the FASTQ files. Some of the things we've seen in the lab: if I align with STAR versus HISAT, the first dimension of variability is the software I used to align. And I see collaborators in clinical genetics just taking read counts from project A, merging them with read counts from project B, and finding significant differences. And I said, oh, that's great; yeah, it's because the software finds different things. So I don't know how we get better communication between the people who are generating the data and the people who are analyzing it, and then all of us parasites out there: how do we combine things so that we're not making biological inferences based on statistical variation?
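A minimal sketch of the merging problem just described: when combining counts from two projects, carry batch and aligner provenance along as columns so a downstream model can adjust for them rather than read them as biology. All values here are invented.

```python
import pandas as pd

# Invented read counts for the same genes from two projects that used
# different aligners. Merging values alone would conflate aligner and
# batch differences with biology, so keep provenance as columns.
proj_a = pd.DataFrame({"gene": ["BRCA1", "TP53"], "count": [120, 340]})
proj_a["batch"] = "projectA"
proj_a["aligner"] = "STAR"

proj_b = pd.DataFrame({"gene": ["BRCA1", "TP53"], "count": [98, 310]})
proj_b["batch"] = "projectB"
proj_b["aligner"] = "HISAT2"

combined = pd.concat([proj_a, proj_b], ignore_index=True)
# A downstream model can now include batch/aligner as covariates
# (e.g., expression ~ condition + batch) instead of reading the
# software-driven shift as a biological difference.
print(combined)
```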
Very quick.

I wanted to advocate for a reference person. A lot of people generate assemblies using different strategies and then argue that their assembly is great and has this N50 and that accuracy and this, that, and the other. But the reality is that there's no reliable gold standard you can use. Genome in a Bottle? Try it with GM12878, but GM12878 has different SNPs and heterozygosity levels and all kinds of variation from lab to lab to lab. There are lots of disagreements even at the basic level of what the SNPs are within Genome in a Bottle. What we need is a person who exists and walks among us and contributes blood samples and such, and is the reference person. Their genome is completely known and completely public; they have no privacy. Then, when you assert that your genome assembly method works really, really, really well, we can figure that out by seeing how well you do on the reference person.
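Since N50 is the metric being argued over, a minimal sketch of how it is computed from contig lengths (values invented): N50 is the largest contig length such that contigs of at least that length cover at least half of the total assembly.

```python
# N50: the contig length at which contigs of that length or longer
# cover at least half of the total assembly length.

def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of an assembly given its contig lengths."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

assembly = [5000, 4000, 3000, 2000, 1000]  # total length 15,000
print("N50:", n50(assembly))  # 4000, since 5000 + 4000 >= 7500
```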
Samantha?

Hi, Samy Friedrich; I'm at Oregon Health and Science University, and I just wanted to comment on what Beth said earlier. I'd like to believe that most scientists who have data sets want to put them out in the world in a way that's responsible and makes them accessible and transparent, but they just don't know what they don't know. So I want to give a shout-out to the library, which is an oft-forgotten resource at most academic institutions, unless you're studying for exams as an undergrad. Our library has started doing a lot of open data programs to try to help researchers navigate this, because it is a big world with a lot to learn. So don't forget your library.

I do want to mention the importance of those data management plans and their peer review, and that the agencies are working with the community to get those up to snuff with all the discussion that's happening here. One of the ones I think is particularly good is the one for the Plant Genome Research Program, which thinks about FAIR, best practices, large-scale genome-wide data, small-scale data, making software available, and, importantly, the other digital products that might be coming down the pipeline. These are things such as 3D models or 3D printing models and circuit board designs, and increasingly I think we need to be thinking about machine learning models and the way those are going to be made available, tested against reference data sets, and, of course, able to be used by others in the future.

I think this is a great discussion, and as somebody whose lab is generating a lot of this data, I think the other thing we haven't really talked about is the time and the tools required to know how to do these things. It's all changing so rapidly that asking a postdoc or graduate student to spend the time to do this accurately is a real limitation. It has nothing to do with wanting to keep their data private; it's that this just doesn't have the same kind of rewards as writing a paper or finishing your thesis. So, back to the psychology of it, that's a big limitation I see. And I second the idea of the libraries helping with this role, because it's almost as if we need another expert involved in the process, rather than relying on ourselves to keep relearning the next new digital way to do these things. So it's a good idea to have a data uploader.

Jenny?

So I'm going to take the prerogative and make the last statement for this session. First of all, thank you, everybody; wonderful discussion. And thanks to our moderators and our speakers; they've been fabulous. I want to remind people that one of the things the funders are looking for here is recommendations. So think about two things. There are tensions that we've heard in the room. One is time versus quality, right? How fast do you release something versus meeting all of those standards? That's a tension we're always going to hold as a community. The other is community-driven, bottom-up methods versus the carrot and the stick, the top-down methods: the funders, the journals, what is being demanded versus what you want to do. And we want to align those things, right? So help in doing that would be good. And I will say, for the next session: if your talk is not loaded (we're missing two talks), please come load it, because we're going to try to keep things on time, and right now we are on time. So: back here in 15 minutes. Thank you, everybody.