Carolyn Hutter, the NHGRI moderator. Karen, do you want to do a quick introduction of yourself? Sure. Hi. For those of you who don't know, I'm Karen Davis. I work at RTI International. I am the co-moderator for this session. I'm also co-chair for the ECC. Great. And so, similar to what you experienced in session one, we'll have a brief presentation from people from the AnVIL, go into discussion, and have a little bit of time at the end where hopefully Karen and I will be able to share what we've captured for the breakout report. I'm going to mostly be behind the scenes, sort of capturing things as notes, but I may chime in some during the discussion. So, sorry, let me move my slides forward. Same guidelines as before: you are automatically muted, so make sure you come off mute if you're talking. We request that people raise your hand and/or make active use of the chat. I will, in a minute, put in the direct link to the notes page for this session, so you can use that in addition to the chat. And again, be candid, be heard, and we want everyone to have a chance. So keep in mind, and I say this as one of the talkers myself, to step back for a minute and then to take advantage of openings. And you can reach out to AnVIL NHGRI staff, including me, if you have any concerns. We have a great set of discussants for this session listed here. We certainly want to hear from all of you whom we specifically asked to focus on this area, but really everyone in this breakout group should participate. I'll just leave this up for a second for people to look at the names; I'm not going to read through them all. And again, just like last time, we'll be going through a sort of modified SWOT approach and asking people to also consider the cross-cutting themes around cloud use, clinical genomics, and interoperability. 
And I think that's really challenged us today to also think of a fourth cross-cutting theme of how to make sure that we're reaching a broad and diverse community with what we do with the AnVIL. So on that note, I'm going to stop sharing my screen and turn it back over to Karen. Karen? Thanks. I just want to point out a couple of logistical things. We're going to first hear some background on the infrastructure section. And then we have help from a timekeeper, so we're going to try to do 10 minutes on each of strengths, weaknesses, opportunities, and threats. And then at the closure of that, Carolyn, I'm hoping you can bring up the blank slides and we can work together to fill in the themes we've heard across all those sections, so we capture everything really succinctly to take back to the main room. I might pre-populate them a little bit, but I will bring the slides. Yes. Yes, I like that. Especially for the discussants, feel free to use the raise-your-hand feature if you've got something you really want to say, and I'll try to keep an eye on everybody. With so many people, it can be hard to tell when somebody just unmutes if you're looking at the wrong screen, and I want to make sure that all of our discussants have an opportunity to be heard. But I would like to start off by introducing the two members of the AnVIL team. I hope I pronounce the names right; forgive me if I don't do well. We have Dr. Jeremy Goecks and Dr. Benedict Paten. And they're going to give us the lead-in, to build on the materials that we were already provided. So over to you. Sure. Jeremy, do you want me to share my screen? Sure, that would be fantastic. Yeah, let me do that. OK, just one second. I also just want to apologize to people. I have been in a study section all morning, so I'm coming to this a little fresh. 
And also, for people who are involved in the AnVIL and the infrastructure side, if I misspeak or misstate anything, because this is a big project with many different pieces, please jump in and correct me. OK, so Jeremy, I'm going to start and do the first few slides, and then just stop me. I think I know where I am. OK, so yeah, we're going to talk broadly about infrastructure here. I'm going to very briefly mention some of the key aspects of the infrastructure that we put together for the AnVIL, including the portal, security aspects, and the DUOS system, and then we're going to mention some of the other aspects. Jeremy's going to talk about APIs and some of the interoperability infrastructure. So in terms of the portal, if you come to the front end of the project, you can see that we have a large number of pieces that are represented on this, by the way, nicely redesigned front page, which I think is really nice now. And just for people who don't know, directly from this portal, you can actually gain access to the AnVIL data set catalog that represents all the data. And this is a live view of the catalog. I'm sorry, it's probably a bit small on your screen; I can make it bigger. But what it shows you here is that we have many consortia projects represented, nearly four petabytes of data. And directly from this view, you can select and find data sets using faceted search and click on any one data set. I won't belabor the point. And you can directly request access, in this case through dbGaP, using the dbGaP APIs, to then gain access to data and bring it into the AnVIL. But switching back, here is a summary of the data growth that we have seen over the course of the project, which I think is quite remarkable. Obviously, there are some stepwise jumps as some of the large projects have been ingested, but you can see that big upward trend. 
It's notable that, yes, of course, we want people to analyze data sets in situ within the cloud, because that is both efficient and accessible. But we do allow people to egress data, that is, to download data to their host institution. And we can now do that for free for certain data sets using the Gen3 Cleversafe infrastructure, which is really nice. So you don't have to pay cloud egress charges to download that data to your host institution, which has, by the way, been a significant stumbling block for some of these very large data sets that can cost hundreds to thousands of dollars to download. There is a strong emphasis in the infrastructure that we built around the AnVIL on security. I won't belabor this point. But the AnVIL is FedRAMP compliant and operates as a FISMA Moderate environment that is compliant with the NIST 800-53 standard. And importantly, when one is operating within a workspace, the study registration consent group mapping is part of the identity of that workspace. And when you clone that workspace and work within it, that authorization is essentially carried along with it. So there is a model, somewhat analogous to Google Docs, the way I think about it, that allows us to think about and manage sharing in a safe and compliant manner that is relatively user-friendly, I think. There's also the DUOS system, short for Data Use Oversight System, which is essentially an automation of the process of data access requests, in which a matching algorithm is used to match data requesters to the requirements around data depositors' data, allowing data access committees essentially to churn through data access requests much more efficiently, by employing a matching algorithm that works on top of a data use ontology. And if you're interested in the details here, and I'm sorry I'm going so fast, please go visit the DUOS website at duos.org. 
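To make the matching idea concrete, here is a minimal sketch in Python. It is not the real DUOS algorithm: the ontology, code names, and logic below are invented for illustration and are far simpler than matching against the actual GA4GH Data Use Ontology. The core idea is that a request is approved only when every proposed use falls under some term the dataset's consent permits.

```python
# Toy matcher in the spirit of DUOS. The "ontology" below is invented for
# this example: each code maps to its parent (more general) term.
PARENT = {
    "cancer-research": "disease-specific-research",
    "disease-specific-research": "health-medical-biomedical-research",
    "health-medical-biomedical-research": "general-research-use",
}

def is_a(use, permitted):
    """True if `use` equals `permitted` or is a descendant of it."""
    while use is not None:
        if use == permitted:
            return True
        use = PARENT.get(use)  # walk up toward the root
    return False

def match_request(dataset_permits, proposed_uses):
    """Approve only if every proposed use falls under a permitted term."""
    return all(any(is_a(u, p) for p in dataset_permits) for u in proposed_uses)

# A dataset consented for general research admits a cancer-research request:
print(match_request({"general-research-use"}, {"cancer-research"}))          # True
# A disease-specific dataset does not admit a general-research request:
print(match_request({"disease-specific-research"}, {"general-research-use"}))  # False
```

The real system layers considerably more nuance on top of this (consent restrictions, disease hierarchies, committee review), but the automation win is the same: most requests can be adjudicated mechanically against the ontology.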
So Jeremy, I think you're going to jump in here. But underneath all of the different components of the AnVIL are APIs, application programming interfaces, which standardize the way in which the different aspects of the system work together, with the intent of making it possible to essentially connect all the different AnVIL software and components together in an interoperable fashion. And Jeremy, are you jumping in from here? I'm not actually totally sure; these slides changed a little tiny bit this morning. I'm happy to jump in from here if you'd like me to. Sure, go for it. Do you want to tell me to skip slides, or do you want to share? If you're willing to advance slides, that would be fantastic. No worries. OK, so let's stay here for just a minute. I want to make the comment that I've been really impressed throughout the AnVIL project with the embrace of APIs, because in some sense APIs give you a standard way to talk amongst components. And when you standardize this, you open up the possibility to not only make our lives easier as AnVIL developers, but also to start to open things up to the community. And I know that there was some discussion at the end of our last session in particular about how we can empower non-AnVIL developers and non-AnVIL scientists to start using the AnVIL in ways that they see fit, rather than through our individual tools, for instance. And so I'm going to try to keep that theme and that thread going here: the goal of our infrastructure is really to empower scientists and developers across a diverse set of problems. So we're starting here. But these APIs allow lots of room for growth and reuse and adaptation that wouldn't otherwise be possible. Because we've embraced APIs, it becomes possible. Next slide, please. 
OK, so one example here is that we have a Python library now that wraps much of the AnVIL functionality and provides ways that, if you're scripting in Python, you can talk to the AnVIL components. You can do things such as single sign-on, or query AnVIL components for data availability. This works inside and outside of the AnVIL, and we use it internally to power some of the infrastructure that I'm going to talk about in just a minute. But my point is that anybody who uses Python can pick up this library and start using it for their own applications. Next slide, please. So as you probably saw in the background materials, and as I've seen in a couple of presentations today, we have this fantastic integration between Galaxy and Terra at this point, where you can launch the Galaxy computational workbench all inside Terra. This is powered through a bunch of different APIs. And so, next slide, please. As you can imagine, one of the key challenges here is how you get data from Terra workspaces, which are the canonical way that data is organized, into Galaxy workspaces or Galaxy histories. And so we've done a fair amount of work here using DRS APIs in particular. These DRS APIs are from GA4GH, the Global Alliance for Genomics and Health, and they make it possible to actually move data out of Terra and into Galaxy. And actually, this arrow should be a two-way arrow at this point: we've now enabled it so that when you're done with your analysis in Galaxy, you can push those results back into Terra as well. So we have this really nice synergy going on here through the use of APIs that makes it easy for us to develop, but provides a roadmap for other applications to come on board, too, using this pyAnVIL library. Next slide, please. Another example of integration through our APIs is Galaxy and Dockstore. This is through TRS, the Tool Registry Service, which is another GA4GH API. We have Galaxy workflows now that are stored out on Dockstore. 
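To give a flavor of what the Dockstore integration looks like at the API level: listing workflows from a GA4GH TRS v2 endpoint is just a parameterized HTTP GET. Here is a small sketch that only builds the request URL; the base URL shown is an assumption for illustration, so consult the Dockstore documentation for the actual endpoint.

```python
# Sketch: building a GA4GH TRS v2 "list tools" request, as a Galaxy server
# might do when browsing Dockstore workflows. No request is actually sent.
from urllib.parse import urlencode

TRS_BASE = "https://dockstore.org/api/ga4gh/trs/v2"  # assumed base URL

def trs_list_tools_url(base, tool_class="Workflow", descriptor_type=None, limit=20):
    """Build a TRS /tools listing URL with standard query parameters."""
    params = {"toolClass": tool_class, "limit": limit}
    if descriptor_type:
        # Descriptor language per the TRS spec, e.g. "GALAXY", "WDL", "CWL".
        params["descriptorType"] = descriptor_type
    return f"{base}/tools?{urlencode(params)}"

print(trs_list_tools_url(TRS_BASE, descriptor_type="GALAXY"))
# → https://dockstore.org/api/ga4gh/trs/v2/tools?toolClass=Workflow&limit=20&descriptorType=GALAXY
```

Because the endpoint is a published GA4GH standard rather than something Dockstore-specific, the same client code works against any TRS-compliant registry, which is exactly the interoperability point being made here.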
When you launch your Galaxy instance, you can then go back to Dockstore and say, I want to use this workflow from Dockstore in my Galaxy server, and it works. One of the really interesting advantages here, for instance, is that this will work on any Galaxy server, not just those on the AnVIL. And so we've done some empowering through our development in the AnVIL that helps the whole NHGRI and genomics community by connecting Galaxy and Dockstore. Next slide, please. APIs also power the work that we're doing in the NIH Cloud Platform Interoperability (NCPI) working groups that we're a part of. This is the idea that we'll connect the AnVIL with other data commons across NIH as well. All of this is fed by the ability, and by the agreement, to use APIs to exchange data and to talk about analysis tools. Next slide, please. Some of the technologies that we're using in NCPI, as well as in the AnVIL: we're using RAS, which is an authentication service that NIH has pioneered. We're using DRS, as I talked about before; this comes from GA4GH and lets us talk about resources, meaning files, in a cloud-agnostic way. So rather than having to talk in particular about S3 buckets or GS buckets, for instance, we talk about DRS URIs, and those are resolved under the covers, along with the necessary security that comes into play. And then finally, in our push towards clinical research, this notion of FHIR, or Fast Healthcare Interoperability Resources, the idea that there's an API around electronic health records, is increasingly important to us. Next slide, please. And so here's a simple example of RAS and DRS in production in the AnVIL. When you go to the AnVIL, you can log in via RAS. You then can do a cohort search, starting to look for data. When you've found a cohort of interest that you want to analyze, you can then export that data to Terra using DRS, and it's automatically resolved at that point. 
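The "resolved under the covers" step can be sketched concretely. For hostname-based DRS URIs, the GA4GH DRS v1 spec maps drs://<host>/<id> onto an HTTPS object-info endpoint; the URI below is invented for illustration.

```python
# Sketch: translating a hostname-based DRS URI into the HTTPS request defined
# by the GA4GH DRS v1 spec. The example URI is invented for illustration.
from urllib.parse import urlparse

def drs_to_https(drs_uri):
    """Map drs://<host>/<object_id> to the DRS v1 object-info endpoint."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs":
        raise ValueError(f"not a DRS URI: {drs_uri}")
    object_id = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"

# A GET on this URL returns object metadata (checksums, size, access_methods);
# a follow-up GET on .../access/{access_id} yields a fetchable, often signed,
# URL for the underlying bytes, wherever they live (S3, GS, and so on).
print(drs_to_https("drs://example.org/8bf71b8e-example"))
# → https://example.org/ga4gh/drs/v1/objects/8bf71b8e-example
```

The client never needs to know which cloud or bucket holds the file; the DRS server applies the security checks and hands back an access URL, which is what makes the identifier cloud-agnostic.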
We'll not talk about this much today, but the idea of doing data-local computing is really important in the context of the AnVIL and other cloud computing projects. It reduces costs, such as egress fees, and things like DRS make that possible. Next slide, please. Here's our first really interesting success that we've had with FHIR. So what we've done here, if we move from left to right very quickly, is we start off with our raw data files from our various consortia partners, and we go through a reconciliation process where we take that metadata, those records that are associated with the genomic data, and we reconcile them together. We eventually then, if we follow the bottom path, go to a FHIR server. And this is where some of the magic starts to happen, where you have this FHIR data in place. And once you have this FHIR database, you can then go off and share that FHIR database with, in our example, the Kids First dev server. And from there, it's possible to query AnVIL data from within Kids First using the standard FHIR interface. And we think this is a really nice example of how you can empower and enhance the use of data that's in the AnVIL, using these various APIs to make that data more freely available in a variety of different circumstances. So we can obviously do this in pyAnVIL as well, but we can also do this through the Kids First portal. Next slide, please. So, clinical infrastructure, and there is a missing word here: we should really be talking about clinical research infrastructure right now, I think. What we're trying to do, looking forward for the AnVIL, and here I'm starting to turn a little bit from telling you about what we've done to what we'd like to do, is to think more carefully about how we can empower translational research. And so you've heard a little bit about this throughout the day. 
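To give a flavor of what querying through the standard FHIR interface looks like, here is a minimal sketch of composing a FHIR REST search URL. The server address and the example code are hypothetical; only the search syntax follows the FHIR specification.

```python
# Sketch: composing a FHIR REST search, the kind of query a portal or script
# could issue against a FHIR server holding reconciled metadata. The server
# URL and the example phenotype code are invented for illustration.
from urllib.parse import urlencode

FHIR_BASE = "https://fhir.example.org/r4"  # hypothetical server

def fhir_search_url(base, resource, **params):
    """Build a standard FHIR search URL for a given resource type."""
    return f"{base}/{resource}?{urlencode(params)}"

# For example, Condition resources coded with a phenotype term of interest,
# 50 per page. The response is a searchset Bundle whose `entry` list holds
# the matching resources.
print(fhir_search_url(FHIR_BASE, "Condition", code="HP:0001250", _count="50"))
# → https://fhir.example.org/r4/Condition?code=HP%3A0001250&_count=50
```

Because the search grammar is the same on every conformant server, the identical query works against any FHIR endpoint that holds the data, which is what lets one platform's portal query another's holdings.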
The idea is that some of our components have already been used in translational research, things like seqr and Bioconductor and R, for instance, and FHIR provides this opportunity as well. When you release these data sets from flat files and you put them into a FHIR server, they're much easier to work with. They can be used interactively in ways that flat files can't. And so by embracing a lot of these different technologies, we think we can begin to build up a clinical research infrastructure within the AnVIL. Next slide, please. Okay, and so, future directions. These are just to seed the discussion; we're clearly very interested in your feedback. But three things that we think are especially important from an infrastructure perspective are, number one, to continue to build out our APIs. I've tried to argue that these are the foundations upon which these different components can interact. And they also provide ways for other applications and other developers to plug into the AnVIL to make the best use of those different components if they see opportunities. And so we wanna provide unified, stable API endpoints for the AnVIL. We wanna provide API wrappers in common programming languages such as Python and R, so that developers can take advantage of them. And we wanna continue to incorporate these community APIs and features. If we can adopt APIs that are being developed by the broader community, whether it's GA4GH or the FHIR community, this is gonna benefit everyone: we don't have to develop APIs on our own, and they can quickly be adopted throughout the commons space, and in our NCPI work, for instance. We're also thinking about how we can make the data in the AnVIL more useful from an infrastructure perspective. And this might take the form of more automated data set ingestion and moving more quickly to processed results. Processed results may not be flat files in the future. 
They may include aggregated results such as cohorts, such as FHIR databases, and ultimately providing catalogs of curated data sets with genotype and phenotype data that's ready to use for all sorts of applications. If we think not about data, though, but about how we can empower the applications that sit on top of the AnVIL, that's a slightly different question. Here are some ideas about what we might do going forward in the AnVIL. We can make it easier to navigate between these AnVIL applications. So it's already possible, as Benedict nicely showed, to look at these different applications and to get to them. But it can still be a challenge to move data and results and workflows across these different applications and platforms. And so we wanna try to make that a little bit easier for users, so they can move to the application that's most useful to them. Machine learning is a hot topic. The idea of being able to provide ready-to-use models is important and potentially a role of the infrastructure: to say, we have a library of models you can use; here, go and pull one off the shelf and apply it to either data in the AnVIL or to your own data. And then finally, we wanna improve our clinical data ingestion and applications by increasing the use of FHIR. We really do believe that FHIR is a very powerful API, a powerful standard, that we can leverage to take the clinical data that's in the AnVIL and make it more accessible to the community and to applications to process along the way. And I think with that, we can move into the discussion phase. Excellent. Thank you so much for the overview. There's a lot of work, clearly, underneath the things that you touched on here. So what we're going to do now is start off with the discussion of strengths. I think one of the things that works well in this area is that we were told about three themes, and two of them, both cloud and interoperability, are directly addressed in the infrastructure. 
So I'm really interested in hearing from all of you where you think the strongest aspects are. Shannon. So I've been taking notes as we go. I think there are a number of strengths here that are actually pretty exciting, and Jeremy did a great job of summarizing them, so I don't wanna reiterate them. The data growth is incredibly impressive. I think DUOS is a game changer, although I've been having a sidebar note with Valentina and Caroline about whether it can really be expanded to other institutes and what some of the barriers are for that. Because what we don't wanna see is it siloed, even if people are using the DUOS standard. And then the DRS API, that's fantastic, because one of the challenges with something like Terra would be users feeling that it wouldn't be possible to utilize Galaxy, or being intimidated. So I think those are all huge strengths. That's already in the notes, but that's definitely in the strengths category for me. Excellent, thank you. And I do see, I don't know if you're all watching the chat or the notes, there are lots of different things to watch here, but there is kind of a side conversation going on in the chat. Others? Vivian. Hey there, can you hear me? Yeah. So I'm gonna follow on, because the chat thread there about DUOS is something I wanna follow up on. I totally agree with what was just said. A couple of other things. I think it's really strong, because if it really helps automate some of the work that gets done, that will be really important. I'd like to see the outcome of the matches that are being done between those that generate data and those that are gonna use it, and how conservative that is. It's most likely gonna be conservative, but it would be good to see it. And if that could be used across NIH in a consistent way, that would also be really good, and it touches on the interoperability piece. Totally agree about the DUOS APIs. 
I'd also like to add in the FHIR APIs there as well. So I think the APIs are a strength in terms of interoperability, but like all of the SWOT stuff, there's also the other side of it. So maybe when we get to that, I've got a few comments around those APIs. I think it's great, but I have some questions around those as well. A couple of other strengths. I think the portal is trying to move in the direction of both bioinformatics and non-bioinformatics users. Again, I'll save some of the weaknesses for a little bit later. I think the Dockstore integration is a good idea as well, with the potential there for other third parties to wrap their tools and make them available in that environment, and workflows in particular. I really liked the idea of the machine learning piece here, on the models, that was mentioned by Jeremy at the end. And I think the idea of models, and having the outcomes of some of those results available as well for other people to reuse, would be really good. This is actually maybe a threat, but I wanna make sure that we say it. I've been doing a lot of work in this area, and you can easily bias your models. And when we're working with data that is predominantly of European ancestry, you can accidentally bias these models. So ensuring trustworthy models that are not biased, I think, is really critical. I think everyone wants to do that; I'm just bringing it up here because I think it's pretty critical. And then the other piece, which is kind of not spoken here because it's kind of just assumed, is the fact that we're leveraging a commercial cloud provider. And this is Google and GCP. There are other providers, so I'll lump them into one, but we're using GCP essentially. And the fact that we're able to use that is gonna give us flexibility and scalability, and potentially also offers us the opportunity to be able to reuse some of their APIs. Google has been building healthcare APIs. 
I don't know how well they intersect here; that's maybe a challenge or a threat through other parts of the SWOT analysis. But since they're already doing that, and I assume Terra is working very closely with Google, understanding what they're doing in that space and being able to leverage it would be extremely helpful for this project. It avoids scientists having to essentially dig into the super minutiae of cloud, and it best positions NHGRI to leverage a resource, both in money and time, in the right spaces: Anthony's team and everyone else, Benedict, Jeremy, are focusing on the science and leveraging that infrastructure. And maybe we can hear a little bit about that a little further on. I'll stop there; there are plenty of other ones, but I don't wanna hog the space too much. Hope that's helpful. Thank you, Vivian. Wow, that was quite a lot. I saw that Luke had his hand up next. Thank you. Yeah, I agree with a lot of the stuff that's been said. So one thing that I would just like to call out in particular is the dedication to security, and I just got some additional clarification. I really want to applaud everyone involved on this. Not only do I find the overall architecture and engineering of the platform beautiful, but the fact that you've been able to manage utility and security while being innovative is really tremendous. It sounds like there's a level of security focus that is beyond a lot of healthcare systems. And similarly, this will resurface as a threat, but I think that overall, the AnVIL has really paid appropriate attention to security, because we know that healthcare has a target on it. Yeah, I mean, I could just type this in, but I'll say it out loud. We just finished our annual FedRAMP audit. During this audit, they have third-party pentesters come in, and they found nothing. They found no highs, no moderates. But they don't have context, and they are limited to like a month's worth of pentesting. So we do our own internal stuff. 
We have an internal red team that is constantly pentesting it. And we actually find highs and criticals, not frequently, but enough times a year that it makes us say, this is why we have our own internal red team and don't just trust third-party pentesters, because we take this stuff really seriously. FedRAMP is a hard bar, and then we go way beyond what FedRAMP requires. So yeah, we take it pretty seriously. Security is something that has been paid attention to throughout this program. I do wonder, one of the things that was touched on is being able to pull out data. I mean, I think it's great that you can do it for free, but when somebody pulls out data, you lose the security control. That is fundamentally true of any system on planet Earth. Data exfiltration is a real, I'm not gonna say it's a concern for this specific project, but certainly Broad and Terra at large deal with some projects where it is a real concern, and this is something we think about. So I'll give a bit of a two-part answer here. One is, we do monitor, within the AnVIL, stuff that egresses the AnVIL, and if there are real concerns, we can address them. I know Candace certainly knows this and can speak to it, but we do reporting all the time to the AnVIL program managers, and we say, hey, this is what this month looks like, and this is what this week looks like, and this is what's going on, and let us know if you've got a real concern with any of these egresses that we're seeing. And we have a record of every single data access. So we know who it is and who's been authenticated to it and so on. Now, once it is in fact egressed, yes, we lose control over it. That's life for now. But Broad and Terra are working on less draconian ways of doing data exfiltration prevention. Right now the method that we have is pretty hardcore, really a you-can't-get-your-results-out kind of thing, which is required by one of our projects, and no one wants that, right? 
So what is the balance? How can we contextualize when it's results that you need to get out? How do we know that your results don't just have the entire data set in them? That's the big fear. And we're working on it, and working on it actively, which I don't think a lot of other institutions can say about this kind of contextual data egress challenge. But it is a challenge, and we're constantly finding balance. And specifically within the AnVIL, we do monitor egress, but we don't prevent it. And that's kind of how we balance it for now. Thank you, David. So we have two to three more minutes for the strengths portion. I'm gonna call on people in the order in which I saw the hands pop up. So Brandy was next. Yeah, hi. So I'll be very brief. I think from a strengths perspective, the AnVIL should be commended for the work as it relates to NCPI. I think that's been challenging but has made an enormous amount of progress, and it really speaks to the democratization as well as the interoperability objectives. Great, thanks. And then I saw Lucila. Yeah, I just wanted to say that DUOS is a great asset to this. And I would just not use the DACs as a gold standard, because I think DUOS can be better than the DACs. Well, that certainly sounds like it could also go into the opportunities section as well. Thank you. And Vivian, you have your hand up again. I did. Maybe you didn't put it down. Sorry about that. No problem. Any others who want to add in comments on the strengths section? Karen, there's been some discussion of FHIR as a strength, but without... I was just wondering if people want to expand on how they see that as a strength. Anybody want to comment on that? I don't know if you can see our hands up. Sorry. Sorry. So let's do George and then Shannon, and then we'll move to the next section. I'll say a comment that bridges strengths and weaknesses about FHIR. 
So number one, it's good that you're doing FHIR. It's a data exchange standard, and it's working with other initiatives; Vulcan is just one, but there are like a zillion initiatives that HL7 and FHIR are working with. We're working with them closely, OHDSI and HL7, with OMOP. So that's the strength, and that is the direction, and there's so much work going into it; it's a good thing to join on to. On the other hand, the weakness is that if you really want to be supporting clinical research, you've got to get people who do clinical research really steering the infrastructure, because FHIR is a data exchange standard, and that's good, but that's like 5% of the problem. Remember, FHIR, even with the profiles, is not quite fully specified, as, you know, Robert knows, because we do this ourselves. So then you get the data, and then you have to adjudicate it, because everyone does it a little bit differently. It's better than the old days, but we're still not there. You need vocabulary support, and then you've got to build phenotypes, which are at a higher level than the data exchange standard, and that has implications for the infrastructure, not just the analysis tools, because you need to support all the vocabularies that all that data comes from. And then even security comes in, because you may be thinking that, well, we have FHIR, and we can have a fire hose of data coming from EHRs as well. But one in 200 notes is on the wrong patient, and a lot of notes have names in them. So now you've got patients who haven't consented to be in the AnVIL, and their notes are sitting there, their name is sitting in the database. So you've got to figure out: am I not going to do notes, or will I do notes but preprocess them first? So I need an NLP resource. So there are a lot of resources needed, and it's not just people who build clinical databases. 
You need the people who do the actual clinical research advising you, because you need the whole stack to advise the infrastructure, not just the lower level. Thank you. Shannon. Can you hear me? Yes, we can now. Okay, sorry, sorry, my speaker was acting weird. I think I'm just going to echo everything that he just summarized. I think it was really in line, because I was in kind of a couple of camps; maybe I'll add a third camp with the opportunity. I did mention Vulcan, and I saw there's a note in the chat too that I think also goes to it. As I said in one of my earlier chat comments, this is probably going to be very simple at this stage. And FHIR is evolving very rapidly, so the question is: how do we stay in touch with that and also keep it relevant? And then all of the things that were just covered, which was much more exhaustive than what I was thinking. So I do think with groups like Vulcan, for example, there's a tremendous opportunity there, especially because you could see an AnVIL-related use case actually becoming one that they might adopt, right? So there's actually potential to have people who are actually working on the standards thinking about some of our use cases. So I just wanted to bring that up, but that was it. Thanks. Great points. Hi. Sorry, Karen. Yeah, basically we need use cases for using FHIR. Just exchanging is not enough. And there are problems with the current format for many reasons that were pointed out. So we're going to have to figure that out. Yeah, that sounds like it could also overlap with the opportunity area: is it a theoretical thing, or do you have the people who really would be making use of the results providing input? That's what I heard George focusing on, and I think that's a really good point. I think we can transition. I'm looking at the clock. I was, oh, there it is. Let's move into the weaknesses area. 
Specific suggestions of things that are important to be addressed — things that could be improved, could be better. And I appreciate that the presenters gave us a slide on things they think need to be worked on for the future, but I think we'd like to hear from the group. And it looks like Sandy has a hand up. Yeah, so this is more an opportunity than a weakness, and also one that was informed by a conversation I had with Anthony a little ways back — so maybe a poor lens into some ideas that came from that conversation. I look at this and I would love to apply this infrastructure in my work. But relative to the conversation in the last breakouts, just about all of the work that I do now involves clinical process flows. So you need to be able to integrate this into clinical process flows in order to really be able to use it. And the FedRAMP-plus security is awesome in terms of preparing us for that. I do think FHIR is a really good basis — as George said, that 5% that is the format for the interchange. But in order to really deploy this, it would have to be in a clinical realm. There would have to be some notion of security context, such that we could actually keep your data private and then determine who it's shared with and when it's shared. And it would be both necessary that that's implemented and also necessary that the perception is there. So it's the capability plus the perception of the capability — which could be harder — that is necessary. But I'd point out that it is interesting that Databricks and Snowflake and places like that are getting to the point where they're winning this argument: even though they're using centralized cloud infrastructure, people perceive the context that they're creating to be theirs.
And if AnVIL could get to the point where it was perceived as being able to create similar contexts, then I think the whole clinical world opens up to it. Sandy, that's really, really helpful. And you know, you and I have talked about this before. It's certainly the case that you or another clinical center can put your data in AnVIL and maintain control over who accesses it. What more do you want? I get that a lot of this is about perception. What else would help you get that perception? Yeah, I do think part of it is the cloud foundation. In our world we're in Azure. So having the notion that this is an extension of your Azure environment, where there are just these components that are managed centrally. And ideally the notion that the data can be managed centrally, but you really have full control — and Anthony, I get that this may already be the case, and it may literally just be the quote-unquote marketing materials that need to highlight it. But the notion that: okay, this is your context, you're in complete control of it, here's how it integrates with the rest of the cloud infrastructure that you're in, here's how you can integrate it with everything else, but these are your decision points. Got it. Thank you, Sandy. Great points. Others? Other comments? So, as Adam says, you count to seven before you start to speak again — it's hard for me. Vivian, did you have something, or is that a hand-up left over? Nope, it's not left over. Okay, great. Let me lower it. So I'll pick two, since I have quite a few here. I really do like the portal. I didn't see it before it turned into the new interface, but it's going to get busy really quickly, particularly for non-informatics users.
So some sort of UI/UX work — which is often hard for academics to do, because it's not the sort of thing that happens frequently — would help just make it cleaner, particularly as people are using it. I'm going to bet there's a lot of work that's already been done in the backend for user workflows and all that kind of stuff, but making that cleaner and easier would help. I'm not talking about slick marketing; I'm talking about just easier to use. I'll give you an example: if you go into Dockstore, it's really hard to search it, even just to browse it. It's a lot of listing, which is completely understandable, but as you add each component to the front page, it's going to get too busy. I think a weakness is search. I think I saw somewhere — and maybe the folks on AnVIL can confirm this — some sort of graph search. I think that's great for a lot of different reasons, but you need some real, active approach to search, because people are going to come into this looking for workflows, data sets, components, APIs. Being able to do that without having to rely on a brittle data model, and some way to search in a fast, active way as this thing grows, is going to be pretty key. I've got other ones, but I'll stop there because I'm sure other people have comments. Great, thank you. Benedict wrote a note that on Dockstore search, they're extremely aware of the limitations. Other comments? Wow, this is a quiet group. Luke, go ahead, and then Matthew. Luke: Thank you, I'll be brief. This is a question — I'm not actually sure what category to put it in. How does provenance fit into all of this within the infrastructure? I'm just curious, and I apologize if I missed this in the briefing. You're asking about data provenance?
Holistic provenance: provenance of the data, provenance of the workflows and the code that's running, version control as it applies to reproducibility, which I know is very clearly thought of. So I assume it's addressed; I was just curious about the details. So maybe one of the team members can quickly respond to that, or we can put it in there as an open question. We have an opportunity to address provenance. Brian? Oh, hi, Karen. Yeah, I'm just jumping back in here. I think from a provenance perspective, there are a lot of different aspects to look at, but with things like Dockstore — having provenance of where the tool came from, the versions of the tool, what dependencies are embedded in that workflow — I think we're really super strong there. And also the workspace environment: we were talking in the submission and engagement breakout about the process of not just submitting to AnVIL, but actually using the platform to perform the analysis and then do the submission effectively in place. And when you work with workflows and tools and do execution in a Terra workspace, you're actually recording what has been done to that data, right? How has that data been transformed? So I think there's a lot of provenance baked into the system. I think over time you'll see more and more of that exposed through things like the portal, which right now presents faceted search, but I know is going to be expanded over time to richer forms of search, exposing some of that data. But the bones are there, I think. Thank you, Brian. Brandy, do you want to add something in the weaknesses area?
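As a rough illustration of the kind of workflow-provenance record described above — what tool ran, at what version, on which inputs — consider this sketch. The field names are hypothetical, not Terra's or Dockstore's actual schema; the idea is simply that pinned versions plus content hashes of inputs and outputs make a run reproducible and any change detectable.

```python
import hashlib
from dataclasses import dataclass

def digest(data: bytes) -> str:
    """Content-address a file's bytes so any modification is detectable."""
    return hashlib.sha256(data).hexdigest()

@dataclass(frozen=True)
class RunRecord:
    """Hypothetical provenance record for one workflow execution."""
    workflow: str            # e.g. a Dockstore-style path (illustrative)
    workflow_version: str    # pinned version/tag of the workflow
    input_digests: tuple     # content hashes of the inputs
    output_digests: tuple    # content hashes of the outputs

# A toy run record; the identifiers are made up for illustration.
run = RunRecord(
    workflow="github.com/example/align-wdl",
    workflow_version="1.2.0",
    input_digests=(digest(b"sample-1 reads"),),
    output_digests=(digest(b"sample-1 alignments"),),
)
```

Recording this at execution time, rather than reconstructing it later, is what makes "analysis in place" valuable for provenance.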
Yeah — and I'm not sure that infrastructure is the perfect place to discuss this, but one of the things I noticed in the readout, and maybe this was also discussed in the community engagement piece, is that the amount of effort, or really resources, being dedicated to end-user cloud support seems very low compared to the potential, especially when you're looking towards reaching a bigger range of community members: both clinical scientists and researchers from diverse institutions that may have much less experience working with cloud environments. And my understanding is that researchers need their own cloud accounts to begin with. There are lots of great benefits to doing that, but especially for users coming from different institutions, that may be a challenge to even get started. Thank you, excellent points. All right — is anybody dying to add another weakness? We can continue on that, but we're just about into the next section. And weaknesses, and also strengths, always feed into opportunities. I did hear a number of opportunities already — things like doing some UI/UX work on the portal, because as components get added, the portal gets very busy, hard to maneuver, et cetera. So what other opportunities should we make sure we include here? Again, I see Brandy's hand is up, and then Vivian's. Oh, not intentionally — I didn't lower it. Vivian, is yours left over, or is that new? No, it's new. Okay, good. We're going from weaknesses to opportunities, is that right? That's correct. So one that I see is on the compute side: AnVIL is huge amounts of data, huge amounts of compute. And as we move with genomic data, we'll get into long-read sequencing, so I can see CPU constraints coming up very fast. And if we're going to start looking at machine learning and at models, there's going to be a huge cost on cloud.
So you might be transparent, but then you'll transparently know you owe $2 million instead of a hundred, and that probably isn't what you want, right? So one of the opportunities here is: what is the plan, if any, for using GPUs? Because that would accelerate things, potentially reduce cost in terms of usage, and obviously speed up results. So there's that infrastructure piece. Okay, Anthony says there is GPU support. Great — that wasn't clear, and it would be good to see it. And it would be nice to know how that gets deployed, for all sorts of reasons, because I'll just bring this up: we've been threading a lot of new machine learning models through GPUs recently, and they don't always work well on GPUs. So optimization for that is really key; it's not just a case of having them. So that's number one. Flipping to a slightly different perspective on opportunities: it came up in the strengths that this is on GCP. So how is the group leveraging the strength of what Google already does in this area for healthcare APIs? Are they using those? Are they aligned? What's the right way to leverage the native resources existing within GCP, so there's not such a heavy lift for AnVIL? And I mentioned this in the chat: Brandy brought up a good point that a lot of users don't know how to use this. So is this a place where NHGRI could leverage what STRIDES professional services is doing, which is helping people get on board with whatever cloud service they want? Moving to a slightly different perspective on infrastructure opportunities — this also comes from the provenance piece — could you also look at publications? There's going to be a whole workflow that happens from this. Is there a way to capture the workflows in such a way that it can help with the publication of that information, and the citation of it?
And is this a place where you can play? Because a lot of people put a lot of effort into data and tools, but the paper, the journal article, is still king or queen, and therefore all the people who do that kind of hard work lose out. And on top of that, it gives you a good provenance trail, it helps with understanding what's in your system, and it maybe incentivizes use of the system. I'll stop there. Thank you. Yeah, interesting — connecting to publications as a citation. I'm going to put Elena on the spot. Elena, do you want to comment on that a little, in relation to some of the discussions we've been having with journals? Is Elena on mute, or not in this room? I thought she was here, but I'm not seeing her right now. Okay. So, just so people know: last year we actually started a conversation with journal editors, specifically around enhancing metadata, data sharing, and transparency. And I think a key thing that came out of that relates to what you're talking about, Vivian: the transparency of what is actually in publications — when data is shared, what's actually been shared and what was actually used. And I do think there's a really good opportunity through AnVIL to help make that connection clear, not just for the journals and the journal editors, which is a little bit of what we were talking about, but also for people who want to look at the reproducibility of publications. Yeah, that's exactly where I was getting at — I realize my video was off, sorry about that; not that you necessarily need to see me. Yes, I was getting at exactly those publication pieces. I think if you can do that, it's going to touch on the whole FAIR piece, and it will tie very nicely into the whole trail of data: data citation, publication of assets, and people writing their scientific journal articles. Right.
And of course I called on Elena, who is doing a lot in the background to support this meeting, right at a bad time — so apologies for that, Elena. She's back. I think she's back, but I think we answered the question. Good job, Carolyn. Casey Taylor, your hand is up, and then Lou Silva. I just had one comment, somewhat related to the FHIR discussion. It seems like there could be an opportunity to make the connection between study participants, AnVIL, and the EHR ecosystem by focusing a bit on study participant preferences. For projects like eMERGE, where you have a data coordinating center that's facilitating the communication between the institutions and the data owners or data holders, if there were a way to capture participant preferences during the consent process — preferences for sharing, or for being followed up on — having those might allow for some connection between EHR data and the AnVIL data. Thank you. Lou Silva. Yeah, my comment was related to Vivian's remarks on transparency: not only in publications, but whole access to what came out of it and so on. It's all part of the same ecosystem, and there are definitely ways of doing this, as well as giving credit to who produced the data, who put the data there, and so on. The other aspect is cost, and also when you have to compute with data sets that are in other enclaves that you will not have locally — thinking about the near future where distributed analytics will take place, and then you can have this all together. So, very exciting times. Thank you. And the materials talked about data ingest — a number of them focused on that — where all the data is essentially in the same place for the long term. And we've already talked about it with FHIR and EHR data, but not all data is going to get ingested. And so there's this ability to control the reach: who gets what?
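A minimal sketch of the participant-preference idea raised above: record sharing preferences at consent, then filter on them before connecting any EHR data. The preference labels and structure here are hypothetical — they are not DUOS's actual data-use ontology or any real consent schema.

```python
# Toy model of participant sharing preferences captured at consent.
participants = [
    {"id": "P1", "preferences": {"share_ehr": True,  "recontact": True}},
    {"id": "P2", "preferences": {"share_ehr": False, "recontact": True}},
    {"id": "P3", "preferences": {"share_ehr": True,  "recontact": False}},
]

def eligible(participants, **required):
    """Return IDs of participants whose recorded preferences allow a use.

    A missing preference counts as not granted, so the filter fails closed.
    """
    return [
        p["id"]
        for p in participants
        if all(p["preferences"].get(k) == v for k, v in required.items())
    ]

print(eligible(participants, share_ehr=True))                  # ['P1', 'P3']
print(eligible(participants, share_ehr=True, recontact=True))  # ['P1']
```

Failing closed on missing preferences is the important design choice: a participant who never answered a question is treated as not having consented to that use.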
Again, you're talking about access authorization — knowing what something should be used for — but also not having to drag the data around and move it someplace else, and still being able to combine it. So interoperability comes in in a lot of different areas: not just interoperable tools, but federated data, if you want to talk about it that way. And this is something that's talked about a lot now around NIH, and I think it's going to keep growing, because with big, big data sets you're not going to move them all around and keep multiple copies of them. And I think that's what some of you were referring to in some of the earlier discussion. Anything else on opportunities? And certainly I've picked up some opportunities when we were talking about other sections — they tend to overlap. We can move into the threats area now. I did have one more thing on opportunities. I was just thinking about the question Anthony asked me earlier — what more could be done relative to the challenge I raised — and particularly thinking about this in terms of federation. Hospital systems are going to be approving some of these other infrastructures: other clouds, but also Databricks, Delta Lake, Snowflake, all of that kind of stuff. And if, from my perspective, the pitch to the leadership of a healthcare system was, "okay, you've already approved this infrastructure; what we're now asking you to do is put essentially an instance of AnVIL for your own data on top of that infrastructure that you're controlling, which is already approved, and that infrastructure is capable of federating with everything else" — that may be an easier path to get things in. It may not be worth the effort required to do it, I totally recognize, but it may help in some ways. That's really interesting — it's kind of a way of turning it on its side.
It's kind of like: can you do AnVIL in a box, so it can integrate with hospital systems and they trust that integration? Yeah, and it would probably raise the overall cost to the hospital system of doing it, but it may still be easier to get it in. Also, in terms of prioritization — there are always lots of things you can work on — this goes back to some of the comments I heard about clinical use cases, or clinical research use cases: making sure those drive where one might prioritize that trusted integration with some of the hospital systems. And Anthony was going to say something as well. Oh — sorry, Sandy, I don't mean to talk too much, but I really care about your opinion on this. Are you saying make it so that a hospital could deploy their own instance of AnVIL? Or make it so that — and I'm not intending to be specific to a vendor — if there's a Snowflake or Databricks Delta Lake pre-established capability, that would be an instance you could control? And then it's an instance on top of a vendor that is used for a bunch of other things within the healthcare system, so there's integration there. I think that could be an attractive, lower lift in terms of selling internally. So were you referring to some kind of API with Databricks Delta Lake, or with Snowflake applications?
No, what I'd be referring to — and again, this may be too big a lift, or not worth the investment — is that you could create an instance of AnVIL where the data was actually housed in those technologies, the same way the other hospital-system data would be housed in those technologies, but you're controlling the format and the way it's configured in that context, so that it integrates with other AnVIL datasets on other technologies, including the existing ones. Then I think that would be a different conversation, an easier conversation, within the hospital systems. I guess the thing I struggle with, Sandy, is: well, first, you're kind of rewarding bad behavior, because Databricks doesn't let data out — doesn't expose their data with APIs so that other people can cross-analyze it. So you're basically saying, let's push everything into the commercial players that have a profit motive and out of the hands of the nonprofit one. So I don't like that. But the second thing is that a big part of what we take on in running AnVIL is operating a security perimeter: we have a security perimeter that protects the data, and we know where it's going and who's touching it. And this kind of breaks that, which is a big part of the value we're adding. Yeah, I think that's fair, and I get the trade-offs associated with what I was suggesting. I think the benefit is it could lower the barrier a bit, but it may not be worth it. I do think one thing to consider, relative to the security perimeter, is the desire for interoperability within a healthcare system with other, non-AnVIL-centered data resources — and the question of how that happens. Maybe the best way to do it is to maintain the data perimeter and just rely on the APIs. We have a third-party guide for people who want to run their own infrastructure but still integrate with the AnVIL APIs.
And there's a base level of security they need to hit and be able to prove, which is not any different from what Google does: if you want access to sensitive scopes and sensitive APIs, they make you go through a verification process. If you want some very basic APIs, anyone can do that. But the moment you want to start authenticating Google users and touching Google Drive APIs and things like that, they make you go through a verification process — it's a sliding scale based on how bad a breach could be, but it goes all the way up to essentially doing FedRAMP, depending on the sensitivity. So, we're handling sensitive data, and we know this data is sensitive. So we take the approach that anyone who wants to integrate with this system needs to at least be able to display that they have some sort of security posture, some sort of attestation. It doesn't necessarily have to be as hardcore as ours, but it's got to be something. We've seen too many GUIs come up that are fly-by-night, easily hackable things that leak tokens all over the place, to just say, "yeah, here's our API, go at it." I don't think that's a good posture. Yep. Okay, so we've got about four minutes on the threats category. Are there other items you all want to bring up? All right, let's see, I've got a few hands here. Let me start on the far left with Vivian. Quick comment on Sandy's point: I have seen the same thing with Databricks and Snowflake being used both in clinical settings and in other government health agencies. So I totally agree with the approach — I think I understood what you meant. My question is: does AnVIL stay as AnVIL within NIH, or is it going outside of that? If it's staying within NIH, that's one thing; if you go the Snowflake/Databricks route, that's a different one. And I agree with Anthony, and I think David, that the security perimeters are a problem. However, I am seeing penetration into those organizations.
I am also seeing a lot of push from government healthcare agencies wanting open APIs, far more. So that conversation may arise in the future, but it is a lot of overhead to do that — I think you pointed that out — and there are a lot of places to go. So I definitely see that coming in the industry; it's just a question of where NIH wants it to be. Yeah, good point. Shannon. So this is already in the doc, and I've mentioned it in the chat, but just for summary here: the one thing I'm worried about, which echoes earlier comments, is the potential siloing across NIH for DUOS, because one of the keys is that we have this kind of seamless interoperability. And I've already been talking offline about ways we can help — there's a really interesting group on this call, and we each have contacts — so if there are ways we can help, especially with specific use cases or pilots, that might start to build some trust around that. And even if groups are not using DUOS itself but using the DUOS standard, we still need to make sure this will work, because I think that siloing would be the undoing of a lot of this, right? This has been one of the issues: that movement. Right. Okay, Luke. I'll be very brief, because after hearing David talk, I know that he knows this and he's all over it from the security side, but just so it's on the slide, hopefully: for threats, security. It's always an ongoing thing. Again, David, this team is on top of it. The fact that you have a formal red team is above and beyond what, as I said in the strengths, many healthcare institutions are doing. So I think we need to list it as a threat, but absolutely put an asterisk and say that there's sufficient mitigation in place already by this team — and well done, once again. That's a good point. And on security — there are the pen tests, those kinds of things, to see if people can infiltrate, but then there's the appropriate use of potentially sensitive data.
Sorry — that's what I was going to say: the infrastructure and services themselves are rather secure, as you pointed out, but if you have authorized access to the data, you can currently do whatever the heck you want with it, and that's a thing. We've accepted that as a risk currently, and the open question is: should we continue to, or should we go back to the drawing board — and what would that mean? Right. And then you start getting into insider threat, misuse. We say "honest but curious" in the business. We expect most of our quote-unquote threats to be honest but curious. Correct. Because they're already authorized users, right? It's going to be very unlikely for an outsider to get an authorized account. It'll be an honest-but-curious person, or a person looking to exceed their research mandate. There are those, and then there are the people who don't realize they've just given their credentials to somebody — which is a whole other category. But security itself, as you all know — all of this stuff is evolving, and every time we think we have it solved, we find out, oh yeah, there's another hurdle to jump over. So I agree. I think you all have been really diligent about this, but I also think it's an evolving threat, so I think we'll put it in there. It is actually difficult, with AnVIL and Google Cloud in general, to give your credentials to someone, because of how two-factor authentication and token expiry work there. It is really difficult to do. Not that it can't happen, but it's a low-likelihood event. It's not like reusing a password; it's a lot more than that. And the Chinese got into SolarWinds too — you know what I'm talking about. They got in through the development chain, which is a different animal. All right, we're going to run out of time. Thank you, David, I appreciate you sharing the information. I see two hands up.
So hopefully we can do these quickly, because then I want Carolyn to show what she's been recording — I have notes, and we'll make sure we're all aligned on the summary we're going to present. So I see Brandy's hand is up; I thought Shannon's was, but... Yeah, I'm happy to jump in. My understanding is that AnVIL relies largely or solely on NIH RAS and other such identity services. And I think there is a threat in that those reliances can be challenging — the ball may be moving — and since that's the front door into the system, it might be worth calling out that as those evolve, there's continual effort needed to maintain compliance with those authentication and authorization schemes, which we've all seen evolve a bunch of times over the last several years. Excellent, thank you. Other threats we haven't touched on? Other comments on this section? All right. Carolyn, I know you've been keeping notes — do you want to share? Yeah, let me share what I captured. Okay, so... oh, wrong one — I always do that. So under strengths, I've got data growth. You can read them — or do you want me to read them, Karen? Why don't we all take a minute to read this, and then let me know of things you think could be added or worded differently. So Karen, the way I'm sharing my screen, I'm not able to see chat or hands, so I'm going to rely on you. I've got it — I've got Brandy's hand up. I also want to mention, I had written down that somebody made a point about the machine learning models being a really nice addition. Okay. Also the leveraging of the GCP tools and APIs. Oh — I'm in the weaknesses line by mistake here. Okay, sorry. I can add that in. That was me; I'll put that into the notes if you want. If you put it in the notes, I'll transfer it over. Thanks. Perfect. Any other... Okay, go ahead. Go ahead, Karen.
So I'll move on to the next, which is weaknesses. Hang on a second, Karen — can you go back to the previous one? So, data growth is impressive, true. But I think this is even more impressive: AnVIL reduces the time to access data, and that's impressive because of what it does with DUOS and all of the munging that it does. Let's not forget about that. It could take you months to get access to data if you just went through a DAC or a coordination center. So it's not just the growth; it's what it's actually doing to enable that. And it gives you the opportunity to combine data sets, presuming the consent allows you to do that. So there are things you can do with the data now that you couldn't before — the size doesn't matter so much. Yes, it's impressive, but it's what you can do with it that I think is important. Thank you. Can I just tack on to that? Also the cost reduction, because the egress charges have largely been eliminated — that's another huge one that really should be on there. I apologize — I'm sorry, Karen, I just thought we were moving on. No problem. I'm glad to hear you say the same thing; I was just about to ask, are we missing anything gigantic here? Can we go to the weaknesses? All right, Carolyn, next slide. And I have to say, as someone with a known spelling disability, doing things like this in real time is one of my least favorite things — so please point out words that are misspelled, because I don't see them. You're doing a great job. One of the things I wanted to add: I love the fact that there is the ability to download the data, and download it free, but there is always that concern that you lose security control when the data leaves. And I heard a number of people really focusing on the security thing, so we probably just want to mention it. Other — hand up, or is that a holdover? I think she's right: more resources for end-user support, for those with less experience. That's good.
I heard a lot of discussion, a lot of it in the chat, about connection with publications. I think that was an opportunity area. I have that under opportunities. Good. Great. Other weakness items? I mean, it's kind of a testament to all the work here that the strengths list is much longer — as well as the opportunities. Okay, it looks like we only have a couple more minutes, so I'm going to move this along. Work with Vulcan. Improve the perception that AnVIL can support the clinical community. So what I heard on the STRIDES thing was: leverage the STRIDES services. I can add to that. So — and Brandy, correct me if I'm wrong — Brandy was talking about how you've got a bunch of people who want to use this but are not so good on cloud. STRIDES professional services has a connection point to help people get on board with whatever your favorite flavor of cloud provider is. That's a way to help people do that. And they've been primarily looking at onboarding on the IT side of things, but they're moving towards more of the bioscience piece. Yeah. And I think STRIDES is one piece of it. I think there's also opportunity in the program budget itself to directly support cloud credits or funding in a bigger way. Yeah. Okay. So I think we have: GPUs, the cost controls, alignment with GCP and other resources to ease use — there's quite a bit of ease-of-use and cloud use showing up here. Transparency in analysis, forward-looking ways to support publications, connecting participants, the ability to incorporate preferences at consent. Enhanced interoperability with federated data and data ecosystems. So, there was a whole lot of discussion that I'm not sure how to capture. I think it's a great candidate for having AnVIL in a box. That is not what Sandy was actually saying, I don't think. Yeah, I will admit, that part of the conversation I had a really hard time turning into a bullet. Or: other models for hospitals. Okay, that's fine. Any other comments there? We've got two minutes. Okay. I think — I know I'm jumping here, but we're getting tight — it was mentioned, and it's something I was going to bring up, but I think Vivian covered it: the potential for cost issues. You know, running things and finding, oh my god, you got a big bill. The other thing that didn't really come up, but I think should, is that with any government-funded thing you come up against sustainability questions. Does it become something that gets perpetual NIH funding, or is there a way it becomes self-sustaining? Those are just typical systems threats. Agreed. And one question I had here — it's a threat for sustainability, but it could also be an opportunity — particularly in a public-private partnership: is this where they fit? All right, and if this becomes bigger than Ben-Hur, then DevOps is going to be a big problem. Maybe that can be handled by Broad, but maybe it can't, I don't know. So there are things that probably could be done in different places, in different ways. So, we're at 11 seconds. I just want to thank all of you — great discussion, great comments. Really appreciate you all being part of this today.