So we have quite a few people here, so we'll go ahead and get started. Hi, everyone. My name is Melanie Ganey. I'm a librarian at CMU Library, supporting some of our science and engineering departments and a lot of our open science and data initiatives as well. And I'm here with Luling Wong, who is our postdoctoral fellow for data curation at CMU Libraries. Together we are helping to organize this Redivis trial. We're trialing the Redivis platform for the next year and a half, and it can be used for research and instruction purposes. We'll have a demo today to go over all that functionality and help you decide whether you want to try out the platform. So I'm going to turn it over to Ian Matthews, the co-founder and CEO of Redivis, who will do the demo, and at the end we'll talk about next steps for the trial. I'll be monitoring the chat; if you have any questions, feel free to put those in the chat. And we are recording this session, so you can send it along afterward to colleagues who may have missed it, and we'll email it out as well. Wonderful. Thank you for having me here today. With that, I will dive into it. So yeah, Redivis for research, and we'll also talk a little bit today about its utility as an instructional tool. My name's Ian Matthews. I'm the co-founder and CEO of Redivis. As a very brief background, our mission at Redivis is to enable research groups to distribute rich datasets and to provide scientists with the tools to understand them. We strive to do this in a way that reduces barriers to working with data, and we develop intuitive tools that make data science as accessible and reproducible as possible. Today we work with several thousand researchers scattered across the country as well as internationally. We're used institution-wide at several major universities, as well as at various labs and research centers across the country.
We're very specifically focused on this use case of working with research data in an academic context, and to that end, our product development is continuously informed by our research community. Our hope is that over the course of this pilot, your feedback, which is the most valuable thing to us, lets us work together to continue to evolve the platform and make it meet the specific needs of your work. So with that, much of what I'll do today is just walk through the platform, and then we can get into Q&A and do some deeper dives. To ground us, you can think of Redivis as a comprehensive data platform. It's a place where you can upload or discover datasets, where you can configure and apply for access to restricted datasets, and where you can ultimately perform your analysis. That said, it is not a closed platform. We have an open API, and any step of this process could be replaced with alternate tooling or workflows. We never wanted to be a closed platform, but we also wanted to be something comprehensive, a place where you can do your entire data research work. So with that, I'm going to step through these in tandem, and to begin, we can talk a little bit about dataset discovery and dataset management on Redivis. Datasets are created by individual researchers and uploaded to their personal workspace or within what we call an organization. The organization could be a research lab, it could be a large research center at Carnegie Mellon, and everything in between. Redivis supports tabular, geospatial, and unstructured datasets of effectively arbitrary size. We have automatic issuance of things like DOIs built in, version control in support of reproducibility, as well as affordances for rich metadata documentation and auto-generated statistics. So with that, we'll do a walkthrough of the Stanford institution on Redivis. Carnegie Mellon also has its own landing page at redivis.com/CMU.
It's a bit of a placeholder right now, but I'm excited for some datasets and groups to get in here, and then we can start demoing with CMU data. So backing up here, this is the Stanford University institution on Redivis. This is a landing page; they've branded it the Data Farm. We can see various organizations that are hosting data at the institution, and we can also explore datasets that are hosted on Redivis across Stanford. So if we want to look for demographic data, we can pull this up and see some specific datasets being hosted here. If we know the particular research groups that we want to work with, we can go to their data portal. For example, this is the landing page for the Stanford Center for Population Health Sciences. I imagine some people on this call represent particular research labs or groups that would want to have their own landing page, their own branded data portal. This is something you could configure within the Carnegie Mellon institution. Here we can see some top-level information about this organization and the sorts of data they host, and we can also browse across their various datasets. And I'm wondering if my internet's conking out. But I still see, okay, it's back. Sorry, I've been having some internet issues today. Fingers crossed. Okay, so we can browse across the datasets in this organization. For example, there's a lot of medical data in here, which is obviously fairly high risk, but they also have some environmental datasets. So let's say that we want to look for precipitation trends in the U.S. We can search across all these datasets for any of them that mention precipitation. This is a fairly deep search; it's not just looking at top-level documentation, but really going into the datasets' metadata and value labels. This term matched specifically on the element variable within this daily observations table.
And if we look at the bottom right here, you can see that we matched the PRCP value for precipitation, and we have close to a billion records measuring precipitation in this dataset. On this table, we can learn a bit more about the data. If I really just want to touch the data, I can go into the cells view here and view the content that's making up this table. It's obviously quite large, close to three billion records. This particular dataset goes back to the 1760s, over 250 years. We can explore these univariate summary statistics to get a sense of the data distribution, and realize that the data is more concentrated in the more recent era. When we want to dive further into a dataset, we can navigate to the dataset page. This is a place where a researcher can learn more about a particular dataset, or, if you're the data owner, where you can add all sorts of documentation and metadata. As I mentioned, Redivis automatically issues DOIs; this is through DataCite. We populate the DOI metadata with all sorts of information about contributors to that dataset, related identifiers, and so on, so that people ultimately can get credit for their work. We can see that this dataset is populated with various contextual information here. You can also explore the other tables in this dataset. Ultimately we'll make a little geospatial map of precipitation trends with these datasets, so we want to validate that we do have geospatial dispersion data. This exists in a stations table, where we can see we have the latitude and longitude encoded. Redivis does things like auto-generate certain summary statistics based on geospatial information, so we can see the dispersion of this dataset: largely concentrated in the United States, but it is a global dataset. The last thing I wanted to highlight on this page is that all datasets on Redivis are version controlled.
You can see that we're looking at version 1.2 of this dataset right now, but I can go back over time and see what's changed between versions. We can see that in version 1.1, a new year's worth of data was uploaded here. Importantly, historic versions never change. So if you've been working on version 1.1 of this dataset, you can continue running your analysis in perpetuity and get the same set of results, but you also have an easy upgrade path to begin working with newer versions of the data. Additionally, Redivis stores an efficient row-level diff across versions. We're not duplicating the data with every version, but really just keeping track of what's changed, to minimize storage costs across versions. So this was a brief tour of a tabular dataset with some geospatial elements as well. We do support a diversity of data types, so I'm going to pop over to our demo organization just to showcase a little more of the different datasets we support. To start, we have full support for geospatial data. This is a forest fires dataset where we've uploaded shapefiles of US fire perimeters that were pulled down from the Forest Service. We support shapefiles, GeoJSON, KML, and what happens is every feature in the geography file gets encoded as a single row in a table. Importantly, we retain all of the polygon information here, which you can pull up to inspect further. And then we have all sorts of tooling that supports things like geospatial joins as well. Finally, we support basically any other file type, your non-tabular, unstructured data. You can upload any file that wouldn't, say, fit into a table into this files tab on the dataset page. We have built-in previews for a wide array of different file types. So obviously for things like images and TIFF files, we can show inline previews to the end user, but we also support more obscure file formats, like CIF and PDB for proteins and molecular structures.
So we have an open source viewer that we've been able to work in here for exploring this particular file format, or things like HDF5 files, which are fairly common from instrument recordings and the like. We'll see if this is going to load here, but if not, we'll move on; this one's a bit bigger. So that's a brief tour of datasets. Melanie, does it make sense to take questions as we go, or do we think it's better to hold off until the end? I guess we could just pause for a second and see if anybody has any questions so far. I haven't seen anything in the chat. Okay, I guess we're good to keep going. Cool. All right, so those are datasets. I wanted to take a brief aside here to talk about data ingest, since I think this is going to be relevant for a lot of people on the call. It's a little less exciting, but it's a crucial component of all of this. So Redivis supports a diversity of file types that can be uploaded either through the interface or through the API. In terms of tabular files, a wide variety of formats: obviously things like CSVs and TSVs, and Excel files you can throw in. We also support some more obscure formats like the SAS, SPSS, and Stata data formats, and allow ingesting those with their full metadata, value labels, and so on. With respect to geospatial data, you can ingest shapefiles, GeoJSON, and KML, and then, as we saw with unstructured data, you can basically upload any file type. The system is designed to ingest from wherever the data reside. Obviously, if that's on your local computer or a server, you can push it directly to Redivis, but we also have built-in connectors to things like Box, Amazon S3, GCS, and Google Drive to pull in the files.
And then we have robust tooling that allows you to transform your data, to ETL your data, to stack multiple files into one table. Something that we see fairly commonly is that you'll have one CSV per year and you want to concatenate those into a single table, complete with robust type inference and versioning. And then again, there are tools that allow you to manage metadata, documentation, identifiers, and the like on the dataset. So to walk through the creation of a dataset, we're going to go to this demo organization and create a new dataset. When we initially create a dataset, we can choose its access classification. I'll get more into this in a moment. Notably, when the dataset first gets created, it's unpublished, so this access classification doesn't really apply until the dataset is released. But what this will do is give anybody in the world access to the metadata for this dataset, while they will have to apply for access to get to the data itself. So we're dropped into this dataset editing interface, and this is where we can upload tabular data, geospatial data, as well as those non-tabular files. To start, maybe I'll create a tabular data table here. And again, we can import from a bunch of different sources. I'll actually just pull some files in from Google Drive, in part so you can see it while I screen share, and also because my internet's a little off. So in here we can pull in, this is kind of like the classic iris dataset, but with penguins. Redivis automatically determines the file type based on the file extension, though you can specify any file type here, and there are advanced settings for weird things that can happen with loaded files. But generally you'll find that our auto-detection is quite robust.
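The "one CSV per year" stacking just described can be sketched locally with pandas. This is a minimal illustration of the idea, not Redivis's ingest pipeline; the file contents here are hypothetical stand-ins for per-year extracts.

```python
import io

import pandas as pd

# Hypothetical per-year CSV extracts -- in practice these would be files
# like weather_2021.csv, weather_2022.csv on disk or in cloud storage.
yearly_csvs = {
    2021: "station,prcp\nUS001,120\nUS002,80\n",
    2022: "station,prcp\nUS001,95\nUS002,110\n",
}

# Read each file, tag its rows with the source year, and stack
# everything into one combined table.
frames = []
for year, text in yearly_csvs.items():
    df = pd.read_csv(io.StringIO(text))
    df["year"] = year
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
```

The key design point is carrying a provenance column (`year`) through the concatenation, so the stacked table still records which source file each row came from.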
So we can import this tabular file, and while that's loading, I could leave the page now; there's nothing happening on my computer, it's all happening in the cloud. But I can also go ahead and upload the geospatial table. And again, we will pull in a GeoJSON file from Google Drive. Once that import has completed, we can inspect our table, validate that things look as we expect, and click on different variables to look at the univariate summary statistics and make sure that things are generally passing the sniff test, aligned with our expectations. And the same for geospatial data. This is a bunch of U.S. states, and we can see the geometries that have been pulled in here. Then finally, we can also upload non-tabular files. So I'm going to go ahead and pull in some pictures of cats and dogs here. I'm importing a full folder, and we can import this; let's call it cats and dogs. Since I am using Google Drive to ingest data here, I'll mention that Google Drive is actually not a great tool for this. Their API is pretty limited and they'll throttle things, and generally, if you're transferring terabytes of data, I would not recommend doing it through Google Drive. Well, while that's uploading: we do have a question in the chat. Do you want me to read it, or are you able to see it? I have it up now, but... I can read it. Yeah, yeah. Is the geospatial map feature something that is part of Redivis? If there are other data visualization types or features we would like, is there a space for plugins, or would that be something the Redivis team would develop? Yeah, so going back to these example data files, these are all designed to be pluggable right now, all of the different file viewers here. So let's see if the HDF5 file will load here.
So this is all designed to be pluggable, and right now these are all managed by the Redivis team, but if there's a new file format that you would like to develop a previewer for, we'd love to work with you on that. And in the future, we definitely imagine a world where maybe we can make the system allow for third-party applications. There we go. And then, as I'll get into in a little bit, there's also a notebooks interface where you have the full affordances of the Python and R toolkits and could build visualizations there. All right, I think our cats and dogs have finished here. No, I'm sorry, everyone. I'm just going to hop over to my cell phone for internet, because this is not working. No worries. Melanie, am I in the call here? Yes. Yes, okay. And that's coming through. I'm really sorry about that, but this just started happening this morning. So yeah, here we have our non-tabular data that had been pulled in, a bunch of images of cats and dogs. Well, we have a backup here that I put together. So we have a bunch of images of cats and dogs here that have been loaded as non-tabular files. What's nice about this is that the files can also be indexed within a table. In a case where you have millions of files, each file contains a unique file identifier, and you can basically reference these images, these non-tabular files, as if they existed within a table, and run all sorts of queries and analyses against them. All right, so that was the data ingest interface, and now I wanted to hop over to quickly talk about access management. You can definitely host fully public datasets on Redivis, but it is also a HIPAA-compliant, SOC 2 certified platform that is designed to store PHI. Data owners can configure tiered access to datasets that are hosted on Redivis.
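The file-indexing idea mentioned a moment ago, treating unstructured files as rows in a table keyed by a unique identifier, can be sketched like this. The IDs, names, and labels below are made up for illustration; on the platform the index table is generated for you.

```python
import pandas as pd

# Hypothetical file index: each uploaded file gets a unique identifier,
# and file-level metadata lives alongside it as ordinary columns.
file_index = pd.DataFrame({
    "file_id": ["f001", "f002", "f003", "f004"],
    "file_name": ["cat_01.jpg", "cat_02.jpg", "dog_01.jpg", "dog_02.jpg"],
    "label": ["cat", "cat", "dog", "dog"],
})

# Because files are referenced through a table, selecting a subset of
# files is just an ordinary tabular query on the index.
cat_ids = file_index.loc[file_index["label"] == "cat", "file_id"].tolist()
```

This is what makes "millions of files" tractable: you query the small index table first, then fetch only the files whose identifiers survive the filter.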
One of the biggest pain points we heard early on with working with restricted data is that researchers often couldn't see anything about a dataset until they had jumped through all these hoops to gain access. So the idea behind the system is that you can make the metadata, or at least some of the documentation about a dataset, broadly available while restricting access to the underlying data and protecting patient privacy or confidentiality. Data owners in the system define requirements that must be fulfilled for different tiers of access, and you can use these requirements to collect and store documentation, things like signed DUAs and so on. And of course, this is all tracked in comprehensive, searchable audit logs. To pull up a quick example here: this is a very complex example for a fairly high-risk dataset that is managed by the Stanford Center for Population Health Sciences, but I think it's illustrative of some of the flexibility of the system. What you'll see here is that overview access to this dataset is public. Anybody can go to this page, see that the dataset exists, and see some of its documentation. Metadata access for this particular dataset is somewhat locked down: people have to be a member of the Stanford community, and they have to fill out a conflict of interest attestation and various other forms. Again, this is highly customizable; these are really just forms that are defined by the data owner. And then finally, in order to get data access, there is a slew of additional requirements that have been defined here. So the whole idea is that as a researcher, it's very easy for me to see what I need to do to gain access to this dataset and whether I will be able to. And as the data owner, you can define a process-driven system for people to apply for access to a dataset and collect the relevant documentation as you go.
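To make the tiered-access idea concrete, here is a small sketch of how such requirements might be modeled. The tier names loosely mirror the overview/metadata/data tiers from the example, but the field names and requirement strings are illustrative only, not Redivis's actual configuration schema.

```python
# Hypothetical tiered-access model: each tier lists the requirements a
# researcher must satisfy before reaching it. Requirement names below
# are invented for illustration.
access_tiers = {
    "overview": {"requirements": []},  # public: anyone can see the page
    "metadata": {"requirements": [
        "member_of_institution",
        "conflict_of_interest_attestation",
    ]},
    "data": {"requirements": [
        "member_of_institution",
        "conflict_of_interest_attestation",
        "signed_data_use_agreement",
        "irb_approval",
    ]},
}


def unmet_requirements(tier, completed):
    """Return the requirements still blocking a researcher from a tier."""
    return [r for r in access_tiers[tier]["requirements"] if r not in completed]


# A researcher who has only joined the institution can already see the
# overview and knows exactly which steps remain for full data access.
remaining = unmet_requirements("data", {"member_of_institution"})
```

The point of the structure is exactly what the demo emphasizes: the requirements are visible up front, so a researcher can tell at a glance whether access is attainable before starting the application process.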
One other thing to note here, which is relevant for some of the high-risk datasets, is that once somebody does have access to the data, you can define additional data export rules. Very often it's the case that we don't want people to be able to download a dataset to their computer once they have access, because the cat's out of the bag with respect to security at that point. So you could, say, define that exports are only allowed on admin approval, so you can validate that the table being exported doesn't contain any PHI. Or, in this case, we see that they do allow export to certain on-premise systems at Stanford, but a researcher couldn't just download a table to their computer. Are there any questions on access management here? So the final piece of this is analysis, and that's where we'll spend the rest of our time. Redivis offers a scalable compute platform for tabular, geospatial, and unstructured data. It's a place where researchers can go to filter, aggregate, and join data, either through a graphical interface or through SQL, or by mixing and matching the two. And then you can analyze and visualize the data in our Python, R, Stata, and SAS notebooks. This all happens within a collaborative environment where you can share your project with any of your colleagues, or in a classroom setting with anybody else in the class, including from external institutions. And this environment has reproducibility built in as a natural byproduct of the research process on Redivis. And again, there's a robust API that's really designed for interoperability. So with that, we can go back to our ecological dataset that we were looking at before and do a quick analysis of precipitation trends in the United States. To do that, we will create a new project. Oops. And we can see here, this is the project interface. This is where researchers can perform analysis. We're looking at the same dataset that we were before.
Now, by default, we bring in a 1% sample of the dataset. To highlight the performance characteristics of the system, we will work with the full dataset today. This is, I think, a few hundred gigabytes, three billion records in this main table. Our first step in our analysis is going to be to reduce this daily observations table to the observations that we actually care about. If we look here, we can see that what they've basically done is stack every weather observation into a single table: we have precipitation, we have min and max temperature, we have snowfall, and so on. So let's begin by doing something simple and pulling out the precipitation observations from this table. To do that, we'll create a transform. And in this fairly trivial transform, all that we need to do is filter rows, keeping those where the element equals PRCP. Then finally, we can choose which variables we want to propagate from the input table to the output table. In this case, we choose the station identifier, the date of the observation, and the observed value of precipitation. Let's go ahead and run this transform. This is operating on three billion records; the output table will be close to a billion. Usually it's about 10 to 20 seconds to execute, which is quite a bit faster than something like Stata or Python. I should mention that all of this is being transpiled to SQL behind the scenes, which is obviously very important for reproducibility purposes. If you are somebody who knows SQL or prefers to work in SQL, you can also just go in here and start writing SQL right alongside the graphical interface. So there we have it. We have our output table with the weather station ID, the calendar date, the observed value, and the 960 million observations that we expected. Our next step is going to be to annualize the data, since precipitation cycles, at least in a back-of-the-envelope way, are annual.
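The filter transform just described has a straightforward equivalent you can sketch in pandas on a toy stand-in for the stacked observations table. This mirrors the logic of that step (filter on the element column, propagate selected variables); it is not the SQL Redivis generates, and the sample values are invented.

```python
import pandas as pd

# Toy stand-in for the GHCN-style "daily observations" table, where
# every observation type is stacked into one long table and the
# element column says what each row measures.
daily = pd.DataFrame({
    "station": ["US001", "US001", "US002", "US002"],
    "date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01"],
    "element": ["PRCP", "TMAX", "PRCP", "SNOW"],
    "value": [25, 180, 0, 40],
})

# Keep only precipitation rows, and propagate only the variables we
# care about -- the same shape as the transform in the demo.
prcp = daily.loc[daily["element"] == "PRCP", ["station", "date", "value"]]
```

At Redivis scale this filter runs as SQL against the full three-billion-row table; the pandas version just makes the row/column logic of the step explicit.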
So to do that, we will create another transform, and in this transform we will start by creating a new variable where we extract the year from the calendar date. Then, to annualize the data, to aggregate it, we're going to create what we call a partitioned variable. Let's call this annual precipitation, where we compute a sum partitioned by station and year: for each station, for each year, the sum of precipitation. Then we can drop duplicates to collapse our results. And finally we'll choose our output table. So again, we're going to want the station identifier, the calendar year, and the annual precipitation for that year. And I can go ahead and run this transform. Again, we can see the code that's being generated behind the scenes, and it'll just run that SQL; give it a quick second. There we go. And so there we have our output table. We have the station ID, we have the year, we have the annual precipitation. All these zeros coming up at the top don't really reflect the real data. We can compute summary statistics as we go, and this is really helpful for evaluating our work, making sure that our data pipelines are doing what we expect them to, and that our output data at least passes a basic sniff test. What we see here is that our average rainfall per station is 7,600. This is in tenths of a millimeter, so we're looking at 76 centimeters per year per weather station. That at least sounds vaguely right, so we have a reasonable degree of confidence that things are working. We can also see that the data aren't perfect. We have negative values for precipitation in here, which doesn't make a whole lot of sense. It's not very common; our median is definitely a positive number. But as we all know when working with data, things are rarely perfect.
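The annualization step, extracting the year and summing precipitation per station per year, reduces to a groupby aggregation. A minimal pandas sketch with invented values (tenths of a millimeter, as in GHCN):

```python
import pandas as pd

# Toy precipitation observations after the PRCP filter step.
prcp = pd.DataFrame({
    "station": ["US001"] * 4 + ["US002"] * 2,
    "date": pd.to_datetime([
        "2020-01-01", "2020-06-01", "2021-01-01", "2021-06-01",
        "2020-01-01", "2021-01-01",
    ]),
    "value": [100, 200, 150, 250, 50, 75],
})

# Extract the year from the date, then sum precipitation per station
# per year -- the same aggregation the partitioned-variable transform
# performs before collapsing duplicates.
prcp["year"] = prcp["date"].dt.year
annual = (
    prcp.groupby(["station", "year"], as_index=False)["value"]
        .sum()
        .rename(columns={"value": "annual_precipitation"})
)
```

A plain `groupby` already collapses each partition to one row; the partitioned-variable-plus-deduplicate workflow in the interface arrives at the same result via a windowed sum.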
And really, this interface provides mechanisms for you to understand what's going on, understand the quality of the data, and actually go back upstream and change your analysis. But for the sake of this demo today, we'll keep pressing forward. So we have our annual rainfall per weather station per year, and the final step in this data transformation process is going to be to bring in the latitude and longitude information. To do that, we're going to join in another table, the stations table from the GHCN Daily Weather dataset, and we will join these two tables on the unique identifier for each weather station. And now, in addition to all of the variables from our source table, we can also propagate variables from the stations table, like latitude and longitude. I should mention that here we're just joining two tables within the same dataset, but I can add any number of datasets that I have access to on Redivis to this project, and I can perform joins across those tables. These could be datasets hosted by Carnegie Mellon, or datasets hosted elsewhere; as long as you have access, you can bring them in here and combine the different datasets together. So there we have it. We have each station by year, the precipitation, and its latitude and longitude. And the final piece of this pipeline is going to be to visualize some of this table in a notebook. So we're going to go ahead and spin up a quick Python notebook to work with this table. Notably, we started with billions of records in the source, and bringing billions of records into a Python notebook, bringing them into memory, is generally not the best idea on most computational systems. But we were able to use this data transformation interface to define inclusion and exclusion criteria, and to aggregate and join the data down to a more reasonable size that we can now pull in for further analysis. By default, this is just bringing in the top thousand rows to a data frame.
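The station join just described is an ordinary equi-join on the station identifier. A small pandas sketch with invented coordinates shows the shape of the step:

```python
import pandas as pd

# Annualized precipitation per station (toy values, tenths of a mm).
annual = pd.DataFrame({
    "station": ["US001", "US001", "US002"],
    "year": [2020, 2021, 2020],
    "annual_precipitation": [300, 400, 50],
})

# Toy stations table carrying the geographic coordinates.
stations = pd.DataFrame({
    "station": ["US001", "US002"],
    "latitude": [40.44, 34.05],
    "longitude": [-79.99, -118.24],
})

# Join on the station identifier so each annual observation also
# carries its latitude and longitude, mirroring the join transform.
located = annual.merge(stations, on="station", how="inner")
```

An inner join is the natural choice here: a station-year with no matching coordinates can't be placed on the map anyway, so dropping it is the desired behavior.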
Let's go ahead and pull this all together. The nice thing about this all happening in the cloud is that things are quite fast in terms of networking, so those three million records flowed right into the notebook. I am not willing to attempt to code live here, but we can do a little bit of pandas data frame evaluation just to get a 30-year deviation in precipitation from our source data, and then we'll use Plotly to draw this on a map. And there we have it. We can see some of the precipitation trend in the United States: the Southwest is getting drier, and parts of the Midwest are actually getting a bit wetter. And of course, at this point this is just a Python workspace. There are R notebooks, there are Stata and SAS notebooks, and you can install any number of dependencies that you would like to work with for the particular notebook environment and use the full data science toolkit that's available there. To give you a sense of how notebooks might be used in a slightly more computationally robust way, I did have this one last example. This is an unstructured, or semi-structured, dataset that contains a bunch of chest x-ray data. This is from a paper published, I think, around 10 years ago. The original data for this paper is a bunch of chest x-rays that have been labeled with demographic characteristics of the patients, and the original imaging data was basically just distributed through this Box folder. There's a readme, there's a bunch of zipped image files; kudos to the authors for making this available, but obviously this is a little bit tricky to work with. So we have brought this particular dataset into Redivis, where we see that we have all of the imaging data available as files in here. But notably, we can also treat them in a semi-structured way, in that we have these images but we also have metadata about the images.
So what we see here is that for a given image, we have particular patient demographic characteristics as well as the finding labels for that radiology scan: a negative finding, or one of these various pulmonary conditions. And so what we can do with this dataset is bring it into a project. This is a featured project on the demo organization that looks at cardiomegaly, an oversized heart, in these radiology scans, and ultimately trains a convolutional neural network using TensorFlow to detect cardiomegaly in these images. There's some various work happening upstream here, where we define our training and validation sets, but ultimately these files all come down to this image classifier notebook, where we're able to load in the data, define various model parameters, and ultimately train a model that can be used to classify cardiomegaly in images and evaluate the results. And you can see here, again, this is very back-of-the-envelope, but the model performs reasonably well at identifying this condition. So in summary, Redivis makes it easy for researchers to upload, find, access, and dive into their data. It has built-in compliance and auditability for high-risk datasets. It provides interoperability via an API and integration with open source tools, and it really allows reproducible research to be a natural output of the process. All datasets are version controlled, every project defines the series of steps that were taken to create data derivatives, output tables, and final analyses, and we're really trying to push things toward a more FAIR and open data science ecosystem. So with that, we just have this last slide. Melanie, I don't know if you want to present this about the CMU Redivis trial. Sure, thank you. So as I mentioned earlier, we have a trial for Redivis that goes now through December 2024.
And during this trial, you're welcome to put up to one terabyte of data onto the platform and try it out. Anything larger than one terabyte, we just want to have a conversation with you first and determine on a case-by-case basis what we can actually accommodate in terms of these very large datasets. But please reach out if your dataset is larger than one terabyte, because we might be able to accommodate that. And so to get started with the trial, please feel free to send me a message in the chat right now, or you can contact Luling or me via email at any point. And we can, oh, sorry, there is a typo in my email address there, but it's andrew.cmu.edu. And we can give you access to the platform and set up an onboarding meeting with you if you'd like to go over how to get started. There's also this website that has some resources for getting started; there are some tutorials and some videos from Redivis, so you can remind yourself about some of these features as you play around with it. And we will also send this recording out afterwards. So with that, we do have time to take more questions if anyone has any. Okay, great. Melanie, I have more of a comment than a question. I just want to make sure that any researchers that are going to be using the system to share any of your data are making sure that you're only sharing data that you have permission to share, whether that's a contractual agreement that's been signed, or if it's data that's been collected under an approved IRB where the consent form that the subjects signed gave you permission to share the data and make it publicly available. Just make sure that you have any necessary permissions in place before you do share any of the data. Yes, thank you so much for mentioning that. And Luling, is there anything you'd like to add? Oh, no, not at this point. And I'll just say, as you try it out, we really do welcome any feedback you might have about the platform.
So please feel free to get in touch with us and let us know what you think. This feedback is really valuable as we determine how to proceed with the license at the end of the trial period. So we really appreciate any feedback you're willing to give us. I have a question, is that okay? I think it's for Ian. My question is about the likelihood, and the sort of track record, of Redivis in reducing contracting frictions with third parties. So here's the kind of problem that we face quite a bit. I'm Chris Telmer. I'm a faculty computing representative from the Tepper School, the business school here at CMU. So we'll get some research funding or something like that, and it'll come along with some sensitive data. And then we'll have to sign contracts. Our contracting people here will contract with whoever is providing the data. And those sorts of contracting negotiations inevitably come down to, okay, how are you going to store the data? And then a bunch of technical discussion in terms of security and so on and so forth. And they can take an awfully long time, and it can be pretty torturous. So I guess my question is about Redivis becoming a norm in this kind of context, or is it likely to become a norm, such that we can just say to the third party, we're going to put the data on Redivis, and that's going to reduce the complexity of the sort of situation-by-situation contracting in terms of the data storage. Yeah, that's a great question. That would definitely make my life a lot easier too. I would love to see a world where this becomes more normative. I don't think we're there yet. That said, our team has a lot of experience in working through these security questionnaires that inevitably arise as part of the contracting process. And I do think, on your end, it can make things quite a bit easier in ultimately deferring some of those questions to us. So I think we're getting there. There is a lot that you can point to.
Redivis, I think, to date might still be the only cloud solution for CMS, the Centers for Medicare and Medicaid Services. We're hosting their 20% data cut through Stanford. That was a big thing. Beforehand, you basically had to go to Maryland to work with the data, or VPN into Maryland. So there are a lot of examples where very high-risk data are being hosted on Redivis. I have not seen particular data vendors be willing to take that prior art as an example, as in, okay, now you don't have to fill out the form, but hopefully the world can evolve in that direction. Thank you. Can I ask a question to build on that? And I apologize if this is super obvious and I missed it, but would Carnegie Mellon be hosting its own data on Redivis, or would we be able to use it to host third-party data? So if we did acquire data from some company or a hospital system or another researcher, would we put that data on Redivis, or would it just be our own data that we've collected? Or is that a determination that we make? Yeah, I guess I don't want to answer for Carnegie Mellon policy; that said, we do have a lot of, those are some of our earliest examples. So that particular group at Stanford, the Center for Population Health Sciences, most if not all of their datasets. So they have this MarketScan dataset, it's an insurance claims dataset that they have procured and are hosting through Redivis. So it's not their data that they've collected, it's data that they have purchased or acquired from someone else. It's somebody else's data, somebody else owns that data. Exactly, exactly. So there's nothing in, like, oh, I'm sorry. No, no, please. There's nothing in the Redivis terms of use or the Redivis contract that Carnegie Mellon would enter into that would preclude us from putting someone else's data on the Redivis site, because sometimes you see situations where we would only be allowed to put something we owned up there, but that's not the case here.
Okay, I just wanted to clarify that. And then do you have examples of how Stanford, or maybe other academic institutions or other research institutions, might control access internally? So in the case that Jen mentioned, where somebody maybe has an IRB protocol that limits the sharing, the further sharing or use of that dataset, how do institutions control who can access the data? Yeah, so I do think this particular group, and this is something that you can look at yourself, all of the access information here is public. So redivis.com slash Stanford PHS, and you can include it in the follow-up email. But if you go to this page, it will say apply for access, kind of the big red button. Let me just pull that up here. So you can see how they're managing access here. And so they have IRB approval, what have you. The system, again, is very flexible, but the particular need that PHS had was, they have, I think, close to 2,000 researchers now who are working with their data, and these datasets come with all sorts of contractual requirements, right? And the data manager doesn't have a personal relationship with everybody that's working with the data. So what this system allowed them to do was define the rules that were required for different tiers of access based on their contractual obligations, and basically ensure that everybody that can touch the data is fully compliant with those terms, because by definition they have completed all these requirements, each requirement has been approved by an administrator, and that requirement has not since expired. So there is an administrator on the backend reviewing who's requesting access to each of these datasets? Yeah, yeah. I mean, you can configure requirements so they auto-approve on submission. So if you just want to collect some demographic information or whatnot, you could do that.
But of course, for a high-risk dataset, you would want somebody on the backend who's validating that the submissions are correct or appropriate. There are other workflows that we've seen. I think the business school at Stanford is doing this a little bit, and I think some groups at Columbia, where if you have an external system, some sort of work-group access directory manager, you could use our API to assign access based on that system. So that's a little bit more high-touch, but there are different workflows that you could use there as well. Thank you. Any other questions? I have another one, but I don't want to step on anybody else's toes who hasn't had a chance to go yet. I'd like to try my question again, and I'm going to just be more specific. And I think I've pretty much guessed the answer. It kind of builds on what Julia just said. Here's a specific situation. We contract with a third-party data provider to license the data that they're going to give to us. So who owns it? They own it, we license it, we have the rights to do certain things with it. They make us get our computing group to essentially build a server to satisfy certain security protocols. And the contracting of that, and then the labor required to build the server to satisfy these security protocols, was kind of big. So here's my question. If we're a Redivis subscriber, can I just punt that to you guys and say, I want to put the data on Redivis, and call up Ian, and he's going to tell you what security protocols are there, because I don't know this stuff, and he's going to work with you to find a solution, or maybe there's no solution. But at any rate, is this the kind of service that you would provide if we were a... Yeah, absolutely. So that's the thing that we have experience with and where we can take that load from you. And I should have said, in a lot of cases we can expedite things; like, we have a SOC 2 report for Redivis.
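The directory-sync workflow described above, assigning access tiers from an external work-group directory via the API, could be sketched as a small mapping layer. Everything here is hypothetical: the group names, the tier names, and the `grant` callback stand in for whatever directory export and access-assignment API call an institution would actually use; this is not the Redivis API itself.

```python
# Hypothetical sketch: map external directory groups to dataset access tiers.

TIER_BY_GROUP = {            # institutional work groups -> assumed access tiers
    "phs-analysts": "data",
    "phs-students": "metadata",
}

def sync_access(directory_members, grant):
    """For each (user, group) pair from the directory, grant the mapped tier.

    `grant` is a callback standing in for the real access-assignment API call.
    Users in unrecognized groups are skipped (no access granted).
    """
    granted = []
    for user, group in directory_members:
        tier = TIER_BY_GROUP.get(group)
        if tier is not None:
            grant(user, tier)
            granted.append((user, tier))
    return granted

# Example: only the recognized group gets access.
grants = sync_access(
    [("ada", "phs-analysts"), ("bob", "visitors")],
    lambda user, tier: None,
)
# grants == [("ada", "data")]
```

A scheduled job running something like this would keep platform access in step with the institution's own group membership, which is the "high-touch" alternative Ian contrasts with the built-in requirements workflow.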
We have penetration testing reports. We have all sorts of documentation that's ready to go on basic things like encryption in transit and at rest, full documentation, and disaster recovery. So yeah, because we've done this many times, you absolutely can and should just punt those questions to us that are about the technical capabilities and security framework of the Redivis system, and then we can provide all the documentation around that to the vendor. Well, that would be a value add there. Thank you. That's helpful to know. Does anybody have specific questions about, I don't know, the datasets that they have for their research and different data types that would be helpful to dig into? It is right after lunch, let's see. Oh yeah, so Brian brings up a great point. Is there anything else to add about using it for teaching? And I think there was one person on the call that was maybe interested in using it for instructional purposes. So not sure if there are any questions related to that, but I can imagine that might be a case where you're using a dataset that you didn't generate yourself. Right, right. Yeah, sorry, that's a great question. And I guess we didn't drill too much into those use cases. So yeah, I mean, this is designed to be a collaborative environment, which is obviously important for research collaborators, but it also really opens up a lot of possibilities for instruction. So in a classroom setting, you could host the datasets that are associated with a particular problem set or unit or the class as a whole. That could all be hosted on Redivis with the documentation, with everything that we've seen here, configuring access as needed. So maybe for a classroom setting it's just public, or maybe you just restrict it to members. And you could use it simply as a distribution tool, so people can download the dataset from here; any table on Redivis can just be downloaded.
But I think where the real power lies is in allowing students to spin up projects where they can explore the data. So let me just tab over to the project that we created earlier. You can provide template projects as well for a classroom. So you could feature certain projects that do some basic data cleaning and analysis, and then anybody with read access to that project can go in and fork it, so kind of just create their own branch of the project as it was at a given point in time. You can share this project with your collaborators. You can make it completely public. Anybody with edit access can come in here. It's kind of like a Google Doc; you can work in real time, we can leave comments to each other as we go. And then I think one of the really nice things for an introductory data science framework is these computational notebooks. There's a lot of pain that can happen in getting the Python environment, for example, set up on everybody in a classroom's computers, and a lot of "it works on my computer, not on yours," where dependencies aren't installed, stuff like that. And this kind of interface really allows students to quickly get into analytical workflows. So these come pre-installed with the data science toolkit in Python, and you can install any number of dependencies to go alongside the notebook as well. And then again, it's reproducible, so a student can run a particular analysis and somebody else can come in there, build on that, and get the same results in Python. Great, thank you. So with that, we have a few more minutes if anybody has any last questions. I should mention, I think one example of this is a group out of Georgetown. They spun up this green space challenge. You can maybe include this in the slide deck. But this was kind of like a data science hackathon that they ran on the platform. They had an organization that they created.
So they created this organization where they were hosting a bunch of datasets around various environmental indicators that they gave people access to. And members of this challenge would publish their projects, which would then go up for review. And there was actually a cash prize for some of the top projects and analyses produced as part of this challenge. So I think this is maybe a really good effort to lean on in terms of thinking about Redivis for instructional purposes. Cool, thank you. Yeah, so with that, I've already heard from some folks, but if you think about it and you want to try out the platform, just, again, get in touch with Luling or me and we can help you get set up with an account. Luling also just shared our LibGuide in the chat. This has all of our resources that might be useful for you as you start to try it out. And with that, thank you so much for attending. And again, always happy to get feedback, so please reach out. Hey, Melanie, can you point us to where a copy of this video is going to be, so that we can pass it along to our colleagues who weren't able to come today? Oh, yeah, I'll send it out in an email. Yeah, great. Thanks very much. Yep, of course. Thank you, everyone. Thanks, Ian. Thank you, Melanie and Luling, for organizing. Thank you. Yeah, thanks everyone for being here.