Hello, everyone, welcome to theCUBE's coverage of the AWS (Amazon Web Services) Global Public Sector Partner Awards Program. I'm John Furrier, host of theCUBE. Here, we're going to talk about the best COVID solution with two great guests: Ben Amor, healthcare and life sciences lead at Palantir. Ben, welcome to theCUBE. And Sam Michaels, Director of Automation and Compound Management at NCATS, the National Center for Advancing Translational Sciences, part of the NIH, the National Institutes of Health. Gentlemen, thank you for coming on, and congratulations on the best COVID solution. Thank you so much, John. So I've got to ask you, the best COVID solution is: when can I get the vaccine? How fast? How long is it going to last? But I really appreciate you guys coming on. I hope you're vaccinated. I would say, John, that's outside of our hands. I would say if you've not gotten vaccinated, go get vaccinated right now. Have someone stab you in the arm. You know, do not wait, and go for it. That's not on us, but you have that opportunity. Now that we have that done, I've got to get on a plane, and there are all kinds of hoops I've got to jump through. We need better solutions anyway. You guys have a great technical solution, so I want to dig in. All kidding aside, you guys have put together a killer solution that really requires a lot of data. Let's step back and talk about, first, what was the solution that won the award? Can you guys take a quick second to set the table for what we're talking about? Ben, we'll start with you. So the National COVID Cohort Collaborative is a secure data enclave putting together EHR records from more than 60 different academic medical centers across the country, and they're making it available to researchers to ask many and varied questions to try and understand this disease better. Sam, take us through the challenges here. What was going on? What was the hard problem?
Obviously everyone had a situation with COVID where people broke through, and cloud has driven it. Amazon is part of this awards program, but you guys were solving something. What was the problem statement that you guys were going after, and what happened? I think the problem statement is essentially that the nation has the electronic health records, but it's very fragmented, right? As Ben has highlighted, there are multiple systems around the country, thousands of folks that have EHRs, but there's no way from a research perspective to actually have access in any unified location. And so really what we were looking for is, how can we essentially provide a centralized location to study electronic health records, but in a federated sense, because we recognize that the data existed in other locations? And so we had to figure out, for a vast quantity of data, how can we get data from those 60 sites, the 60-plus that Ben is referencing, from their respective locations and then into one central repository, but also in a common format? Because that's another huge aspect of the technical challenge: there are multiple formats for electronic health records, there are different standards, there are different versions, and how do you actually have all of this data harmonized into something which is usable, again, for research? I mean, there are so many things that are jumping in my head right now. I want to unpack one: at the time COVID hit, the scramble and the imperative for getting answers quickly was huge. So it's a data problem at a massive scale, with public health impact. Again, we were talking before we came on camera: public health records are dirty, they're not clean. A lot of things are weird. I mean, just a massive amount of weird problems. How did you guys pull together? Take me through how this gets done. What happened? Take us through the steps. You just got together and said, let's do this? How does it all happen?
Yeah, it's a great question. And so, John, I would say part of this started actually several years ago. I explain this when people talk about N3C: NCATS supports a program which is called the Clinical and Translational Science Awards Program. It's the largest single grant program in all of NIH, and it constitutes the bulk of the NCATS budget. So these are extramural grants which go all over the country. And we wanted this group to essentially have a common research environment. So we try to create what we call secure scientific collaborative platforms. Another example of this is what we call the Rare Diseases Clinical Research Network, which again is a consortium of 20 different sites around the nation. And so really we had started working on this several years ago: if we want to build an environment that's collaborative for researchers around the country, around the world, the natural place to do that is really with a cloud-first strategy. And we recognize this at NCATS. We're about 600 people now. But if you look at the size of our actual research community with our grantees, we're in the thousands. And so the perspective that we took several years ago was that we have to really take a step back, and if we want to have a comprehensive and cohesive package or solution, we have to treat this as really a mid-sized business. And so that means we have to treat this as a cloud-based enterprise. And so NCATS several years ago had really adopted this strategy to bring in different commercial partners, of which one is Palantir. It actually started with our Intramural Research Program, and obviously there's very heavy cloud use: with AWS, we use Azure, we use Google Workspace. We essentially use different cloud tools to enable our collaborative researchers. The next step is we also had a project. If we want to have an environment, we have to have access.
And this is something that we took early steps on years prior: there's no good in building an environment if people can't get in the front door. So we invested heavily and created an application which we call our federated authentication system. We call it Unified NCATS Auth, or UNA for short. And this is an open-source, in-house project that we built at NCATS, and we wanted to actually use this for all sorts of implementations, acting as the front door to this collaborative environment being one of them. And then also, really, this interest in electronic health records had existed prior to the COVID pandemic. And so we had done some prior work, via a mixture of internal investments and grants with collaborative partners, to really look at what it would take to harmonize this data at scale. And so, like you mentioned, COVID hit, and it hit really hard. Everyone was scrambling for answers. And I think we had a bit of these pieces in play. And then that's, I think, when we turned to Ben and the team at Palantir, and we said, we have these components, we have these pieces, but we really need something independent that we can stand up quickly to really address some of these problems, one of the biggest ones being that data ingestion and harmonization step. And so I can let Ben really speak to that one. Yeah, Ben, elaborate please, because you're solving a lot of collaboration problems too, not just the technical problem, but ingestion and harmonization. Ingestion most people can understand; anyone in data warehousing or in data knows what that means. Take us through harmonization, because, not to put a little bit of shade on this, but most people think about these kinds of research organizations or nonprofits as slow moving. Standing stuff up, as Sam was saying, takes time; you break it down, and by the time you get in, things are over. This was agile. So take us through what made it agile, because that's not normal. I mean, that's not what you see normally.
It's like, hey, we'll see you next year. We'll stand that up. Yeah, did the data, Sam? Yeah. Yeah, I mean, so as Sam described, this question of data interoperability is a really essential problem for working with this kind of data. And I think we have data coming from more than 60 different sites. And one of the reasons we were able to move quickly was because, rather than saying, oh, well, you have to provide the data in a certain format, a certain standard, N3C was able to say, actually, just give us the data how you have it, in whatever format is easiest for you, and we will take care of that process of actually transforming it into a single standard data model, converting all of the medical vocabularies, and doing all of the data quality assessment that's needed to ensure that data is actually ready for research. And that was very much a collaborative endeavor. It was run out of a team based at Johns Hopkins University, but in collaboration with a broad range of researchers who were all adding their expertise. And what we were able to do was to provide the technical infrastructure for taking the transformation pipelines that were being developed, the actual logic and the code, and developing these very robust, centralized templates for them that could be deployed just like software is deployed: have change management, have upgrades and downgrades and version control and change logs, so that we can roll that out across a large number of sites in a very robust way, very quickly. So that's one aspect of it. And then there were a bunch of really interesting challenges along the way that, again, a very broad collaborative team of researchers worked on. And an example of that would be unit harmonization and inference. So really simple things: when a lab result arrives, and we talked about data quality, you would expect it to have a unit, right?
Like if you're reporting somebody's weight, you probably want to know if it's in kilograms or if it's in pounds. But we found that a very significant proportion of the time, the unit was actually missing in the EHR record. And so unless you can actually get that back, that record becomes useless. And so an approach was developed: because we had data across 60 or more different sites, you have a large number of lab tests that do have the correct units. And you can look at the data distributions and decide how likely it is that this missing unit is actually kilograms or pounds, and save a huge portion of these labs. So that's just an example of something that has enabled research to happen that would not otherwise have been able to happen. Not to dig in and rat-hole on that one point, but how much time do you think that saves? I mean, I can imagine just on the data cleaning side, that's a massive time saving, just to infer, okay, based on the data sampling, this is kilograms or pounds. Exactly. I mean, there are more than three and a half billion lab records in this database now. So if you were trying to do this manually, it would take you thousands of years, you know, so it just wouldn't be possible. It would be a black hole in the data set, essentially, because there's no way it would get done. Okay, Sam, take me through, from a research standpoint, this normalization, harmonization, the process: what does that enable for the research, and who decides what the standard format is? Because again, I'm just in my mind thinking how hard this is. And then what was decided? Was it just on the base records? What standards were happening? What's the impact on researchers? No, it's a great question.
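Editor's illustration: the unit inference idea Ben describes above (use records that do carry a unit to decide which unit an unlabeled batch most plausibly has) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the N3C pipeline's actual method: it assumes each unit's labeled values are roughly Gaussian, and the function names, the sample weights, and the `min_margin` threshold are all hypothetical.

```python
import math
from statistics import mean, stdev


def fit_reference(labeled):
    """Summarize labeled values per unit as (mean, standard deviation).

    `labeled` maps a unit name to a list of values that arrived WITH that unit.
    """
    return {unit: (mean(vals), stdev(vals)) for unit, vals in labeled.items()}


def avg_log_likelihood(values, mu, sigma):
    """Average Gaussian log-likelihood of `values` under N(mu, sigma^2)."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (v - mu) ** 2 / (2 * sigma ** 2) for v in values) / len(values)


def infer_unit(values, reference, min_margin=1.0):
    """Return the most plausible unit for a batch of unlabeled values.

    Returns None when the best unit does not beat the runner-up by at least
    `min_margin` (a hypothetical threshold), so ambiguous batches stay unlabeled.
    """
    scores = {
        unit: avg_log_likelihood(values, mu, sigma)
        for unit, (mu, sigma) in reference.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if scores[best] - scores[runner_up] < min_margin:
        return None
    return best


# Made-up body weights that arrived with a unit attached.
reference = fit_reference({
    "kg": [60, 72, 80, 95, 68, 77],
    "lb": [132, 158, 176, 210, 150, 170],
})

# A batch from one site whose unit field was empty.
guess = infer_unit([140, 165, 190, 155], reference)
```

Scoring a whole batch (for example, all unlabeled results for one lab test from one site) rather than a single value is a design choice in this sketch: one reading of 100 could plausibly be either unit, while a batch usually separates cleanly. Batches that do not separate are left unlabeled rather than guessed.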
Well, a couple of things I'll say, and Ben has touched on this: the other real core piece of N3C is the community, right? And so I think there are a couple of things you mentioned with this, John. The way we executed this, it was very nimble, it was very agile, and there's something to be said on that piece from a procurement perspective. The government had many COVID authorities that were granted to make very fast decisions to get things secured quickly, and we were able to turn this around with our acquisition shop. Otherwise we would have been dead in the water, like you said, waiting a year to go through a normal acquisition process, which can take time. But that's only one half. The other half, and really you're touching on this, and Ben is touching on this when he mentions the research, is that we have this entire research community numbering in the thousands from a volunteer perspective, and I think it's really fascinating. This is a really great example to me of this public-private partnership between the companies we use, but also the academic participants that actually make up the community. Again, the amount of time they have dedicated to this is just incredible. So really what's also been established with this is co-governance. And so as you think from a systems perspective, the Palantir environment, the N3C environment, belongs to the government, but N3C, the entire program, I would say belongs to the community, and we have co-governance on this. So who decides is really a mixture: there are folks at NCATS, but not just NCATS; there are folks at NIH proper, but also folks at other government agencies, and also the academic communities, and these entire mixed governance teams that actually set the stage for all of this. And again, who's going to decide the standard?
We decided we're going to do this in OMOP 5.3.1; that is the standard we're going to utilize. And then once the data is there, this is what gets exciting: then they have the different domain teams, where they can ask different research questions depending upon what has interest scientifically to them. And so really, we viewed this from the government's perspective as: how do we build, again, the secure platform where we can enable the research, but we don't really want to dictate the research? I mean, the one criterion we did put is your research has to be COVID focused, right? Because this was very clearly in response to COVID, you have to have a COVID focus. And then we have data use agreements, data use requests, and we have entire governance committees that decide whether this research is in scope, but we don't want to dictate the research types that the domain teams are bringing to the table. Yeah, I mean, I think the National Institutes of Health, when you think about just their mission, it is to serve the public health, and I think this is a great example of how, when you enable data to be surfaced and available, you can really allow people to be empowered. And not to use the cliche "citizen analyst," but in a way, this is what the community is doing: you're doing research and allowing people, from volunteers to academics to students, to just be part of it. That is citizen analysis. You've got citizen journalism, you've got citizen research. You've got a lot of democratization happening here. Is that part of it, or is that a result of this? It's both; it's a great question. I think it's both, and it's really by design, because again, we want to enable, and there are a couple of things that we really clamor for at NCATS, and I think NIH is going in this direction too: we believe firmly in open science, we believe firmly in open standards, and in how we can actually enable these standards to promote this open science, because it's actually non-trivial.
Having citizen scientists is actually a tricky problem from a governance perspective. We had the case where we actually had students who wanted access to the environment. Well, they have to have an institution that they come in with, but we've actually crossed some of those bridges to get students and researchers into this environment. It's very much by design, but also in the spirit that was enabled by the community, so I think they go hand in hand. We planned for it. Yeah, open science is a huge wave. I'm a big fan; I think that's got a lot of headroom, because if you look at open source and what that's done to the software industry, it's amazing. And I think your federated idea comes in here. And Ben, if you guys can just talk through the federated idea, because I think that might enable and remove some of those structural blockers that might be out there in terms of, oh, you've got to be affiliated with this or that, or a friend's got to invite you, but then you've got privacy, access, and this federated idea. Not an easy thing; it's easy to say, but how do you tie that together? Because you want to enable a frictionless ability to come in and contribute, and at the same time, you want to have some policies around who's in and who's not. Yes, totally. I mean, Sam already described the UNA system, which is the authentication system that NCATS has developed. And obviously, from our perspective, we integrate with that. It's using all of the standard authentication protocols, and it's very easy to integrate that into the Foundry platform and make it so that we can authenticate people correctly. But then if you go beyond authentication, you also need to have the access controls in place to say, yes, I know who this person is, but now what should they actually be able to see?
And I think one of the really great things N3C has done is to be very rigorous about that. They have their governance rules that say you should be using the data for a certain purpose. You must go through a procedure so that the access committee approves that purpose. And then we need to make sure that you're actually doing the work that you said you were going to. And so before you can get your data back out of the system, or your results out, you actually have to prove that those results are in line with the original stated purpose. And the infrastructure around that, having the access controls and the governance processes all working together in a seamless way so that it doesn't, as you say, increase the friction on the researcher, and they can get access to the data for that appropriate purpose: that was a big component of what we've been building out with N3C. Absolutely, and it's really in line, John, with what NIH is doing with the Researcher Auth Service; they call this RAS. And these are things that we believe in, and there are standards that we're starting to follow, and we work with them closely: multi-factor authentication, because of the point Ben is making, and you raised as well. One, you need to authenticate: okay, you are who you say you are, and we're recognizing that, and that's the authN piece. But then there's the authZ: what are you authorized to see? What do you have authorization to? And they go hand in hand. And again, these are non-trivial problems. Typically, a lot of what we're using is direct integrations with our UNA package. We use InCommon for federated access. We're also even using login.gov, again, because we need to make sure that people have a means, and login.gov is essentially a fallback, right? If they don't have an organization which is in InCommon or has federated access, they can generate a login.gov account, but they are still beholden to the multi-factor authentication step.
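Editor's illustration: the three checks described above (authenticate the person, authorize them for an approved purpose, then review results against that purpose before anything leaves the enclave) can be sketched as below. This is a hedged sketch only, not how UNA, RAS, or Foundry implement it; the `User` type, the `TRUSTED_PROVIDERS` set, the function names, and the purpose strings are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical set of identity providers the enclave trusts.
TRUSTED_PROVIDERS = {"InCommon", "login.gov"}


@dataclass
class User:
    username: str
    identity_provider: str
    mfa_verified: bool
    # Purposes an access committee has approved for this user.
    approved_purposes: set = field(default_factory=set)


def authenticate(user):
    """AuthN: the identity comes from a trusted provider and passed MFA."""
    return user.identity_provider in TRUSTED_PROVIDERS and user.mfa_verified


def authorize(user, purpose):
    """AuthZ: an authenticated user may only work under an approved purpose."""
    return authenticate(user) and purpose in user.approved_purposes


def can_export(user, result_purpose):
    """Results may leave only if they match a purpose the user was approved for."""
    return authorize(user, result_purpose)


researcher = User(
    username="alice",
    identity_provider="login.gov",
    mfa_verified=True,
    approved_purposes={"covid-lab-outcomes"},
)
```

Keeping authentication and authorization as separate functions mirrors the point made in the conversation: knowing who someone is says nothing yet about what they may see or take out.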
And then they still have to get the same authorizations, because we really do believe seamless access to these environments is absolutely critical. Know who our users are, but again, not make it restrictive and not make it this friction-filled process that's very difficult. Yeah, I mean, you think about non-trivial, and I totally agree with you. And I think about it like, if you were in a classic enterprise, think about an IT problem, you know, like bring your own device to work. And that's basically what the whole world does these days. So when you're thinking about access, you don't know who's coming in, you don't know where they're coming in from, and the churn is so high. I mean, all this is happening, right? So you have to be prepared to provision and provide resources to a very lightweight access edge. That's right. And that gets back to what we mentioned: we were taking a step back and thinking about this problem. You know, when N3C became the use case, it was: this is an enterprise IT problem, right? We have users from around the world that want to access this environment. And again, we try to hit a really difficult mark, which is secure but collaborative, right? That's not easy. But again, the only place this environment could take place is in a cloud-based environment, right? Let's be real: 10 years ago, forget it. You know, again, it would have been difficult, but now it's just incredible how much they've advanced, so that these real virtual research organizations can start to exist, and they've become these real partnerships. Well, that's a great point. I want to highlight it and call it out, because I've done a lot of these interviews with awards programs over the years, and certainly in public sector and open source over many, many years. One of the things open source allows is code reuse.
And also, when you start getting into these situations where, okay, you have a crisis, COVID, other things happen. Nonprofits go through the same thing: they lose their funding and all the code disappears. Same with this: when COVID is over, you don't want to lose the momentum. So this whole idea of reuse, this platformization, the replatforming and refactoring, if you will, these are two concepts that cloud enables. Sam, I'd love to get your thoughts on this, because it doesn't go away when COVID's over. Research still continues. So this whole idea of replatforming and then refactoring is very much a new concept, versus the old days of, okay, the project's over, move on to the next one. No, you're absolutely right. And I think what first drove us is we were taking a step back at NCATS: how do we ensure that sustainability, right? Because my background's actually engineering. So I think about it as, you want to build things to last. And what you just described, John, is that funding peaks, it goes up, and then it wanes away and it goes. And what you're left with essentially is nothing, right? You know, it's, okay, you invested in a body of work, and it goes away. And really, I think what we're building are these sustainable platforms that will actually grow and evolve based upon the research needs over time. And I think that was really a huge investment that, again, NCATS has made, but NIH is going in a very similar direction. You know, there's a substantial investment made in these really impressive environments. How do we make sure they're sustainable for the long term? You know, again, we just went through this with COVID, but what's going to come next? What are the research questions that we need to answer? But also, open source is an incredibly important piece of this. I think Ben can speak to this in a second.
All the harmonization work, all that effort, essentially this massive, complex ETL process, is in the N3C GitHub. So we believe completely in the open source model, with a little bit of a flavor on it too, though, because, again, back to the sustainability, John, I believe there's room for this marriage between commercial platforms and open source software. And we need both. We're strong proponents at NCATS of both, but especially with sustainability, especially when you think enterprise IT, you have to have professional-grade products. That was part of, I would say, an experiment we ran at NCATS. You know, our thought was, we can fund academic groups, and we can have them do open source projects, and you'll get some decent results. But I think the nature of IT and the nature of these environments have become so complex that the experiment we're running is: we're going to provide commercial-grade tools to the academic community and the researchers, and let them use them, and see how they can be enabled and actually focus on research questions. And I think N3C would show we've been very successful with that model, while still really adhering to the open source spirit and principles. Well, an amazing story, congratulations. And you know what, that's so awesome, because that's the future. And I think you're onto something huge; a great point. Ben, do you want to chime in on this whole sustainability piece? Because the public-private partnership idea is now the new model. The innovation formula is about open and collaborative. What are your thoughts? Absolutely. And I mean, we at Palantir have been huge proponents of reproducibility and openness in analyses and in science. And so everything done within the Foundry platform is done in open source languages like Python and R and SQL, and is exposed via open APIs and through Git repositories.
So, as Sam says, we've pushed all of that ETL code that was developed within the platform out to the NCATS GitHub, and the analysis code itself, being written in those various languages, can also easily be pulled out and made available for other researchers in the future. And I think what we've also seen is that within the data enclave, there's been an enormous amount of reuse across the different research projects. And so actually having that security in place, so that people can start to share with each other securely and be very clear that, although I'm sharing this, it's still within the range of the government's requirements, has meant that the research has really been accelerated, because people have been able to build on and stand on the shoulders of what earlier projects have done. Okay, Ben, great stuff. 1000 researchers, open source code on GitHub. Where do I sign up? I want to get involved. This is amazing. It sounds like a great party. We'll send you a link. Yeah, if you do a search on N3C, it'll come up with a website hosted by the academic side, and it'll show you all the information on how you can actually connect. And John, you're welcome to come in, by all means. Billions of rows of data being solved. Great tech you're working on. Again, this is a great example of large scale. The modern era of solving problems is here. It's out in the open: open science. Sam, congratulations on your great success. Ben, award winners. You guys are doing a great job. Great story. Thanks for sharing here with us on theCUBE. Appreciate it. Thank you, John. Thanks for having us. This is the Global Public Sector Partner Awards. Best COVID solution: Palantir and NCATS. Great solution, great story. I'm John Furrier with theCUBE. Thanks for watching.