in the room already, so I'm gonna start slowly. People are always welcome to join. Hi everyone, good morning. My name is Sarah Ben Maamar, and I'm the Associate Director for Research Services at Weill Cornell Medicine, which is the medical campus of Cornell University. We are located in New York City, while our mother university is located in Ithaca, in upstate New York. Today I'm going to talk about a service we call DataCore, which is our secure computational enclave for hosting and analyzing sensitive data at our medical campus.

Before I start, I just want to highlight one thing. This service would not be possible without the contribution of many people who unfortunately could not join us today. One person who is here, though, and whom I want to highlight, is Terrie Wheeler, our Library Director at Weill Cornell Medicine. She's the lady with the blue jacket and the blue dress, the blue lady, you cannot miss her. She's also very knowledgeable about DataCore, so don't hesitate to reach out to her if you have any questions.

All right, so let's talk about DataCore. Before diving into the technical details, I want to give a little bit of context about DataCore and how it fits into the whole picture at Weill Cornell. The framework we use at Weill Cornell is the research data lifecycle. You're probably all very familiar with it, but in a nutshell: you start with planning, designing, and running experiments; then comes the time to collect the data and analyze it; then to manage, store, and preserve it, and to share and publish it; and it ends with the discovery, reuse, and citation of those datasets. DataCore intervenes in the collaboration and data analysis piece.

But I want to highlight that DataCore is a service that connects with multiple other services we offer at Weill Cornell. These include, for example, the Scientific Software Hub, the platform we use to provide scientific software to our faculty, researchers, and students. DataCore also connects closely with our data retention tool, which we designed recently to help our researchers comply with the latest NIH regulations on data management and sharing, and with the data catalog we built to foster discovery and reuse of datasets. All these services aim at three main goals. The first is to build a research data management program that helps our researchers throughout the research data lifecycle and facilitates their work. The second is to engage our stakeholders and foster the FAIR principles, which I'm sure you're all familiar with. The third is to provide storage solutions for our researchers, especially when they reach the end of the research data lifecycle.

I also want to take a few seconds to share what the library does at Weill Cornell. Librarians don't just manage DataCore: they curate data and they administer and run the DataCore service. They also run the Scientific Software Hub. We manage, develop, and run our data catalog, and the same goes for our data retention tool, which we started in July 2022. We also, of course, support our faculty and researchers with their data management and sharing plans and with anything related to data repositories, like picking a good repo for their dataset.
And we also work very closely with our Research Integrity Office to help with any matters related to data integrity. Here on the right side, you can see a diagram that shows how all our services connect with each other. DataCore is here, and it connects very much with our data catalog, our long-term repository, and our Scientific Software Hub.

So now, some history about DataCore. How did it all start? DataCore was born from an institutional demand for a secure enclave. That demand came from all our researchers, but particularly from one of our biggest and most important departments at Weill Cornell, the PHS department, which stands for Population Health Sciences. Those people do a lot of computational work. They deal with massive datasets, and exclusively sensitive datasets, most of which they get from the Centers for Medicare & Medicaid Services (CMS). So they really needed computational power and storage, but also a secure space to host and analyze their datasets, and we started building what we now call DataCore. At the beginning it served a few users per project; now projects range from a single user up to 31 users, and we even host entire classrooms. It is really built to favor collaboration, so people inside our institution, but also outside, can perfectly well use DataCore; there is absolutely no issue. And it all started with a departmental subsidy: the PHS department subsidized the service at the beginning, but since it has been blooming and is now used widely across departments, we are moving to a chargeback model based on researchers' projects, so we no longer receive that subsidy.

What we want DataCore to be is a secure space, a secure environment, and this diagram illustrates the different layers of security within DataCore. DataCore is really an enclave: when researchers connect to DataCore, all they see is their data and the scientific software they need to do their analysis. If you are an internal user, you have to be part of the WCM network. In both cases, whether you are inside or outside the institution, you have to use our remote apps to connect to the DataCore enclave. These remote apps make sure that you use Duo authentication with WCM credentials (if you don't have any, we provide them) and that you are connected to the VPN to access the enclave. Another important thing is that the DataCore curation team acts as the intermediary and the data custodian for users of DataCore. Users, even PIs, never handle the source dataset themselves: the data provider gives the data directly to us, we make sure it is placed in DataCore, and any import or export of datasets goes through the data curation team, which is in fact made up mostly of librarians. Any transfer of datasets between the PI and the DataCore curation team happens through secure SFTP.

Let me also give you some current numbers about DataCore. We have a lot of projects: currently 83 active projects, about 300 users, and 62 PIs using the service. Over time, almost 200 projects, over 600 users, and 82 PIs as of now have used DataCore. So it is really a trusted institutional service now.
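To make that transfer step concrete, here is a minimal sketch of what a scripted upload over SFTP could look like on the PI's side. This is my illustration, not our actual tooling: the hostname, paths, and key file are hypothetical, and in practice the curation team manages the drop zone.

```python
import paramiko

# Hypothetical endpoint for illustration only; the real drop zone
# is managed by the DataCore curation team.
SFTP_HOST = "sftp.datacore.example.edu"
SFTP_PORT = 22

def upload_dataset(local_path: str, remote_path: str,
                   username: str, key_file: str) -> None:
    """Upload one dataset file to the curation team's SFTP drop zone."""
    key = paramiko.RSAKey.from_private_key_file(key_file)  # key-based auth
    transport = paramiko.Transport((SFTP_HOST, SFTP_PORT))
    transport.connect(username=username, pkey=key)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        sftp.put(local_path, remote_path)  # encrypted in transit
    finally:
        sftp.close()
        transport.close()
```

The key point the sketch illustrates is that the PI only ever touches the drop zone; placement of the data inside the enclave is done by the curation team.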
So much so that our IRB and NewYork-Presbyterian Hospital, which is part of the Weill Cornell Medicine family, encourage people to use DataCore to host electronic PHI datasets, or pretty much any medical records. This was published a while ago already, in 2018. It is an old paper, but check it out, because it really shows the principles governing the operation and management of DataCore, especially regarding data governance. Oh, and I forgot to mention: every project in DataCore has to have a data governance document associated with it, either a DUA (data use agreement), an IRB protocol, or both.

We also want DataCore to be a collaborative tool, and this has been illustrated by three big projects hosted through DataCore. One of them is the New York INSIGHT Clinical Research Network, which ran during COVID. This project was used by multiple New York City medical institutions to gather COVID data. At the end of the pandemic we had so much data we didn't know what to do with it, but we knew it had to be available to people. And we realized that not everyone at their institution necessarily has a restricted environment to host this sensitive data, because it was identifiable. So we decided to allow people to use DataCore to work with a piece of those datasets: whenever they need it for their study, they can get the necessary WCM credentials and use DataCore.

Right now we have a lot of artificial intelligence-related projects, for example using LLaMA in a recent project on automating clinical note summarization. We also have an interesting project on a pediatric epilepsy learning health system. This project is very interesting because it gathers data from children's hospitals at multiple institutions across the nation: we have children's hospitals from Ohio, Colorado, Texas, Illinois, Boston, and Seattle participating in the study, and they are all using DataCore to host and analyze the dataset.

I want to emphasize that the DataCore team is crucial in this whole process, because they act as the intermediary between the PI, the external data provider, and the research team, and they handle all the administrative burden so the PI can really focus on analyzing their data. For example, we take care of anything that relates to the IRB and make sure that the data governance and the technical environment are in agreement: DataCore user access has to match the IRB protocol, whatever the PI wants it to be, and of course the data provider's requirements as well. We really try to lift as much of the burden as possible from the PIs.

So now, what does it look like when a PI connects to DataCore? It really looks like they are connecting to their own Windows or Linux machine. What they have on that machine is basically one folder with their primary data. This is a read-only folder: they can read it but cannot touch it, so there is no risk of overwriting the source data. They also have a shared workspace where they can collaborate and share results or analyses with the other users, and each user has their own workspace environment with read and write permission, just as in the shared workspace. By default, we provide SAS, RStudio, and SPSS, which are statistical tools, as well as the MS Office suite, at no charge.
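As a rough sketch of that layout on the Linux side, provisioning a project might look like the snippet below. The paths, names, and POSIX permission bits are my assumptions for illustration; the real enclave uses managed Windows and Linux images, and on Windows the equivalent would be ACLs.

```python
import os
from pathlib import Path

def provision_project(project_root: str, users: list[str]) -> None:
    """Create a DataCore-style project layout as described in the talk."""
    base = Path(project_root)

    primary = base / "primary_data"      # source data: read-only for everyone
    shared = base / "shared_workspace"   # read/write for all project members
    primary.mkdir(parents=True, exist_ok=True)
    shared.mkdir(exist_ok=True)
    os.chmod(primary, 0o550)             # read + traverse only; no writes to source data
    os.chmod(shared, 0o770)              # group read/write for collaboration

    for user in users:
        ws = base / "workspaces" / user  # private read/write workspace per user
        ws.mkdir(parents=True, exist_ok=True)
        os.chmod(ws, 0o700)

# Hypothetical project and users, purely for illustration.
provision_project("./demo_project", ["pi_smith", "analyst_lee"])
```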
And the DataCore curation team really takes care of importing or exporting any file into or out of DataCore. We make sure data is fully de-identified when it is exported out of DataCore, and users connected to DataCore cannot copy-paste anything out of it. They have to use WCM central authentication and our remote apps to connect, and they are completely isolated from the internet and from any other network devices or services.

I also, obviously, want to acknowledge that it takes a village to run DataCore. Here you have, in parentheses, the number of people involved in each of the different teams that help run DataCore. These are not people dedicating 100% of their time to DataCore; they each give a percentage of their time. But the library staff is really, really involved: we have four librarians involved in running DataCore. Speaking of the DataCore curation team, here it is. Terrie is at the top; she oversees the whole service, and I help manage DataCore along with John Ruffin, who advises on the architecture and design of the service. But Alice Chin is the one doing most of the work, along with her two reports, Patrick Chen and Eric Lohano, who are data management specialists. Alice is certified in de-identification, meaning she ensures that anything taken out of DataCore is completely de-identified, and she has been certified for that. And obviously we work very closely with the technical teams I mentioned just before.

So what does it look like concretely in our daily operations? The DataCore curation team takes care of anything related to onboarding projects into, or offboarding them out of, DataCore. We also make sure operations run smoothly, and we review each project on a regular basis to make sure everything is up to date and current. When we onboard a project, we usually check, of course, the data governance documents, and we make sure we set up the environment in exact agreement with whatever is stated in the DUAs or IRBs. We also check the timeline of the project with the PI, and we make sure they specify all their project requirements: RAM, storage space, software applications, whatever they need. And we make sure we provide an operational environment that runs smoothly but stays in agreement with security regulations. If they want to make any change to their environment, they just put in a request. That could be a data import or export; it could also be a change in user access. If they want to add another user to the project, they reach out to us, we make sure it is done, and of course that the user is also added to the IRB protocol. And we make sure that whatever records we have on the project are up to date: that the project is still running and still using the amount of resources it asked for. Then comes the end of the project. We make sure it is offboarded: we close the project, and when the project ends, we make sure the environment is deleted, but we also take care of the data. If it needs to be destroyed, we take care of that and provide them a certificate of destruction; if the data needs to be archived, we take care of that for them too. Now, there are different features and options available in DataCore.
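Before getting to those, here is a minimal sketch of the kind of per-project record that this onboarding, review, and offboarding workflow implies. The field names and defaults are my illustration, not the actual schema of our tooling (the baseline figures come from the pricing I'll mention in a moment):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DataCoreProject:
    """Illustrative record of what the curation team tracks per project."""
    pi: str
    users: list[str]
    governance_docs: list[str]   # a DUA and/or IRB protocol ID: required for every project
    start: date
    end: date
    cpus: int = 32               # baseline hosting from the talk
    ram_gb: int = 128
    storage_gb: int = 100
    software: list[str] = field(default_factory=lambda: ["SAS", "RStudio", "SPSS"])

    def may_add_user(self, user: str, irb_roster: set[str]) -> bool:
        # Access changes must stay in agreement with the IRB protocol roster.
        return user in irb_roster and user not in self.users
```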
I want to emphasize that we use our ticketing system, called ServiceNow, to manage all the requests we get for DataCore, and that we have built our own web application, called Marigold, to manage DataCore projects. It's homegrown. And we recover all our costs: we charge according to the number of users per project, the software they use, and the computational power they use. Researchers trust DataCore so much that they involve us as early as the grant application process. They ask us how much it is going to cost to host their project in DataCore, they ask for quotes, and we're obviously happy to provide those. They always have the option, at the beginning of the project or later on, to add more memory, CPUs, or GPUs if they want. We can connect a database to their environment, increase the storage, and add any licensed software they want. It is really important to note that the DataCore environment is completely connected to our data catalog and our retention tool, which facilitates things when they reach the end of the research data lifecycle: archiving the data, for example, and publishing and advertising it through the institution.

Now, I talked about a charging model, and I just want to give you some figures here so you get an idea of how much a DataCore project costs. Overall it is between $500 and $610, depending on the options you pick. There is up to an $85 charge for software, depending on which software they pick. Storage is between $2.50 and $25 per 100 GB. Most of the cost comes from hosting, that is, running the server; the baseline we offer is 32 CPUs and 128 GB of RAM. And we also charge a staffing cost of $250 per user.

In terms of status right now, we have been certified by the Centers for Medicare & Medicaid Services for our on-premises environment and also for our AWS cloud environment. We are working on cloud initiatives, so we already have an AWS environment that people can use to analyze their datasets. We provide Windows and Linux operating systems, as well as GPUs on demand. Right now we are also working on offering Azure to our researchers. We are still in a testing phase, and we're working on getting our Azure environment CMS-certified so it can host sensitive datasets. We also have package repositories available for researchers, so even though the enclave is disconnected from the internet, they can install their R packages directly themselves without depending on the admin team. The CRAN repository for R packages is already available, and we are working on a Python repository.

And I'm going to finish with this slide about the challenges and successes we've encountered setting up DataCore. Of course, DataCore is set up to reflect the current regulations but also the current computational demand, and both are very much moving targets. So it has been a little challenging to keep up with whatever researchers ask of us in terms of computational power while staying in agreement with the latest regulations. And every time those two things change, we have to train people, and train ourselves, to use new resources or new tools. So it is sometimes a bit challenging to keep people engaged, because having to change their habits every now and then can be discouraging for some of them. We try to facilitate that as much as possible and make it as seamless as possible.
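Going back to the charging model for a second, a back-of-the-envelope quote using the figures above might look like the sketch below. The additive formula and the flat hosting fee are my assumptions for illustration, not the official rate card, and the talk does not specify the billing period:

```python
def project_quote(users: int, storage_gb: int, software_fee: float = 0.0,
                  hosting_fee: float = 250.0) -> float:
    """Back-of-the-envelope DataCore quote.

    Figures from the talk: software up to $85, storage $2.50-$25 per 100 GB,
    staffing $250 per user. The flat hosting_fee and the additive formula
    are my assumptions, purely for illustration.
    """
    STAFFING_PER_USER = 250.0
    STORAGE_PER_100GB = 2.50            # low end of the quoted range
    storage_cost = (storage_gb / 100.0) * STORAGE_PER_100GB
    return hosting_fee + STAFFING_PER_USER * users + storage_cost + software_fee

# One user, 100 GB, one $85 software license:
# 250 + 250 + 2.50 + 85 = $587.50, inside the $500-$610 range quoted above.
print(project_quote(users=1, storage_gb=100, software_fee=85.0))
```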
And of course there is a cost to building such services. It is not impossible, but there is some level of investment in staff, and a little bit on the financial side too. But I think we have really been successful with these services because our library and our information technology services are very, very close: we are part of the same department, and we both report to our CIO, our chief information officer, which really facilitates the work we've been doing. We have very motivated staff, especially at the library, and leadership is very supportive of our initiatives. And we get a lot of help from our researchers: feedback on the new services, what is working, what is not working. And with that, I think I'm going to finish here, and I'm happy to take any questions.

Yeah, Jon Petters from Virginia Tech. This is really cool; I wish we had something like this at Virginia Tech. That would be great. Thank you. I was curious what kind of questions you're asking on your onboarding form. I was just looking at it online. Are you asking for detailed data management plans in that form, or are you asking more about what quotas they need? What level of content are you asking for before people get accounts and start working?

Yeah, that's a great question, thanks for asking. So, on campus we have DataCore, but we also have another cluster that is dedicated to non-sensitive data. Obviously, it is very tempting for researchers to go directly to that cluster, which they are very familiar with, to analyze their sensitive data, except that it is not compliant with the regulations. So the first thing we have in the form, which is common to both the non-sensitive data cluster and DataCore, is a set of questions about the type of data they are dealing with, the amount of resources they need (RAM, storage, applications, and so on), and how long their project is expected to last. Depending on this first triage form, somebody orients them toward the non-sensitive data cluster or toward DataCore, and usually they follow the path pretty diligently. When they get to us at DataCore, we ask them very technical questions. The very first thing we ask for is any data use agreement or IRB protocol. Our data curation team then reads those documents very carefully, tries to understand all the requirements and restrictions needed to host the datasets, and engages in a discussion with the researchers to make sure that whatever they expect from their environment is okay with the regulations and with whatever is stated in their data governance documents. And of course we talk about technical aspects. For example, we recently had a researcher who was really fond of ChatGPT, as a lot of researchers are now, and who wanted to apply that kind of large language model to facilitate reading the notes taken by physicians, or to facilitate cohort selection, you know, easily finding the right candidates for their study. They were very excited about using ChatGPT for that, and we told them: absolutely not, please do not do that. So there is some education involved in that.
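As an aside, the first-pass triage I just described could be as simple as the sketch below; the category names and return strings are mine, purely to illustrate the routing logic, not our actual form:

```python
def triage(data_types: set[str]) -> str:
    """Route an intake-form request to the right environment (illustrative logic)."""
    SENSITIVE = {"PHI", "identifiable records", "CMS claims", "medical records"}
    if data_types & SENSITIVE:
        return "DataCore (secure enclave): send DUA/IRB to the curation team"
    return "non-sensitive compute cluster"

print(triage({"CMS claims"}))        # -> DataCore path
print(triage({"public genomics"}))   # -> non-sensitive cluster
```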
In any case, we managed to handle that very easily, because DataCore users trust DataCore to really facilitate their work, reduce their burden, and take care of anything that relates to security. So we really try to let them focus on what they need to do, keep them aware of what they shouldn't be doing, and guide them through the whole process. I hope that answers the question. Any other questions?

This is very interesting. At our university, we're navigating a lot of different regulated-research needs in our community, and I'm wondering, from what you've learned here, how extendable this type of approach is to other security requirements, like NIST 800-171. Thanks.

That's a very interesting question. There are actually two ways to answer it. First of all, I think building something like DataCore is definitely reachable; you can do it. It is really teamwork: you just need to have the right people involved, not 100% of their time, but spending the necessary amount of time so you have the technical skills to run things smoothly, especially on the security side. Now, because we worked quite hard on getting the CMS certifications, that helps: once you have one certification, it paves the way for a lot of other certifications. It requires a lot of work on the technical side, meaning reporting exactly on all the measures you are taking to keep your data safe, but once you have that, you can expand very easily to the other types of certifications needed to host other types of datasets. So it is really worth the investment. It is a lot of work, I'm not hiding it: we spent, I think, a full year to a year and a half on the CMS certification for the cloud, even though the service was already mature and we had known what we were doing for a while. But it is worth the investment, because it really facilitates all the rest of the work you'll be doing. I don't know if that answers your question; I hope so, but let me know. Yeah, yeah, exactly.

I wanted to go back to this slide, because it shows the amount of technical skill needed to run DataCore, but I also think no one should hesitate to involve librarians in this kind of endeavor, because they are very good at organizing data, administering things, running things, and managing teams; they are used to working in teams. And I think that is a big win in running this kind of environment and removing a lot of the admin burden from both the researchers and the technical team, because they don't have to do all the back and forth; we take care of that. You're welcome.

All right, we're almost at time, so if there are no more questions: Terrie is there to answer any question, don't forget, and I'm happy to chat over a break if you need. Thanks very much.