I think it's time to get started. Welcome, everybody. I'm Cliff Lynch, the director of CNI, and you've joined us for one of the project briefing sessions for our Spring 2020 virtual meeting. I'm delighted you're with us. Our speaker is Jamie Wittenberg from Indiana University, and I'll say a word about her topic and turn it over to her in a moment. You can enter questions at any point through the Q&A tool down at the bottom of your screen, so feel free to queue up questions as they occur to you during the presentation, and we'll deal with all of them at the end. When Jamie's finished, Diane Goldenberg-Hart from CNI will appear and moderate the Q&A. With that, let me just say this is, I think, a very timely presentation. The importance of research data management has been underscored throughout the current emergency, and indeed throughout the last few years of developments. Two of the biggest challenges are understanding what the right infrastructure to support it looks like, and how to deal with it at scale given the enormous demand among our researchers. So I think this will be a very timely and helpful presentation. Thank you for agreeing to do it, Jamie, and over to you. Thank you, Cliff, and thanks to everyone for being here during this difficult time. I think Cliff's right that this is in many ways timely, especially as we start thinking about how we can consolidate resources and provide access in a more robust way to some of the resources that we have in our libraries. I'm going to talk today about a project called the CADRE Project, the Collaborative Archive and Data Research Environment. And I'm going to start by just outlining the problem for you. Oh, if I can advance this slide. Here we go.
And the real dilemma here, and I think this is a problem that's familiar to many of us in research libraries, is that we aren't able, or historically haven't been able, to provide sustainable, standardized access to licensed data sets, or often to large open data sets, for text and data mining. And there are a few reasons for this. A lot of us are actually doing the purchasing, so we're acquiring large licensed data sets for our research communities, but few of us have adequate infrastructure to support them. And it's really not just about storage, about where we put the data sets; it's also about access. How do we get the researchers to the data in a way that facilitates exploration and advances their research while also implementing the requirements in our licenses and our data use agreements? And many of our researchers have text and data mining needs that aren't being met because we haven't been able to provide consistent and standardized access to these resources. One of the big reasons for this is that for a lot of licensed data sets in high demand, we really can't just purchase them and put them on a shelf or make them accessible. We need to run annual updates. We have to curate the data. We need to find a way to implement authentication to ensure that only the patrons who should have access to it do have access to it. Sometimes we do education and instruction around what's allowable under the licenses, and in some cases we even review derivatives that researchers are creating. And actually implementing a model for this at a broad scale, even at a university scale, is really cost prohibitive for most individual libraries. It requires expertise and technology, and neither of those things is cheap.
And it's simply not financially feasible for most research libraries to build and staff an enclave, hire a data steward, run these updates regularly, and work with researchers to analyze these data sets without violating the terms of the license. This is true of licensed data sets and of large open data sets that might be too big to download on an average laptop or computer. And a lot of researchers who could benefit from text and data mining on library-acquired and open resources aren't able to do so unless they have a graphical user interface. Often, when researchers are working on these data sets, the only way to really do it is to code or build their own databases. So these aren't accessible resources for researchers who prefer, or are only able, to do research using a GUI, or who don't have the ability to analyze a data set in Python or R. And this means that the majority of our users can't use the resources even if it would benefit them. So our solution, CADRE, is a cloud-based platform that provides secure access to library-licensed data sets, essentially a science gateway. IU decided to address this problem by writing a proposal to IMLS and asking for funding to build a shared solution, which we call CADRE. While it's often not feasible for individual libraries to build out large data support, we think that as a community it is possible. So what we set out to do is build a science gateway in the cloud that can support a range of licensed and open data sets. And these are two of the data sets that the platform currently supports, Web of Science and Microsoft Academic Graph. We are seeding it with bibliometric data sets because they're important to researchers in the information science and science of science communities. The Clarivate (now Web of Science Group) Web of Science data set enables us to pilot the platform with a large licensed data set, and Microsoft Academic Graph allows us to do it with an open data set.
We also added the USPTO (U.S. Patent and Trademark Office) data set to our server in October, and all three data sets are currently available to CADRE users. By sharing the cost of this solution across a lot of academic libraries, we've been able to build and provide it at a lower cost to members and to make the platform free for non-members by building a free tier. And the cost is really a fraction of what it would be to provide this kind of service alone. Because we have support from the research libraries with the most active users, we're able to provide that free tier of service. Right now we're working with libraries in the Big Ten Academic Alliance, as you can see here in this project partner slide, because we are seeding the platform with a data set that has a consortially negotiated license: the BTAA negotiated a license for all BTAA members with the Web of Science Group. The project is being led by the IU Libraries, the IU Network Science Institute, and the BTAA, with support from the libraries at Iowa, Michigan, Michigan State, Minnesota, Ohio State, Penn State, Purdue, and Rutgers, along with three of the four Big Data Hubs, Microsoft Research, and the Web of Science Group. So we do have a graphical user interface, as I mentioned before, custom computational resources, and a marketplace to share queries, algorithms, derived data results, that kind of thing. I think it's important to note that anyone is able to use the platform, so not just researchers who are expert in navigating big data sets, and actually right now, not just researchers at all; anyone can use this platform. Users can query the data sets and do their analysis within the platform in the cloud. There's a suite of tools for analysis that are preloaded. And then when they're done, they can save their queries and push their derivatives, ideally to a library repository or another kind of repository for dissemination. And they can run those queries across collections of data sets.
So we've already aggregated Microsoft Academic Graph data and Web of Science data, so bibliometrics researchers, for example, have a larger source of citations to work from that are already joined. This is our project leadership group. The project is led by myself and my co-PIs, Patty Mabry at HealthPartners Institute, Valentin Pentchev and Xiaoran Yan at the IU Network Science Institute, and Rob Van Rennes at the Big Ten Academic Alliance. And the platform has a series of characteristics that I would say define it and make it unique. It's cloud-based and it provides secure access. It's specifically designed for library-licensed data sets and open non-consumptive data sets. Before we started development, we collected a series of use cases from potential users and from users of an early prototype that was built by the IU Network Science Institute, to ensure that the product we did end up developing really met the needs of university libraries in particular and the researchers who would use these resources. Because this is a shared service, we are able to begin rolling out tiered memberships, which means that smaller institutions who cannot afford to participate in membership are still able to access the platform and use the open resources on it, in a way that supports the kind of data analysis that their institutions can't provide for them individually. And the project is really committed to making sure that our workflows support both the library-licensed data sets and analysis as well as the open data sets and analysis. Access to our data sets is standardized, and we do our best to work from shared standards to ensure reproducibility. This is something that the project team has been investigating throughout the course of the grant: how can we enable researchers to share their output and their queries in a research asset commons, a sort of medium-term storage space? I wouldn't call it a repository, but it's not quite a scratch space either.
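The "already joined" citation pool described here can be sketched with a toy example. This is purely illustrative: the column names, DOIs, and counts below are invented for the sketch, and the real CADRE schemas are far richer, but the idea of pooling records from two bibliometric sources by a shared identifier looks roughly like this:

```python
import pandas as pd

# Hypothetical, simplified records from two bibliometric sources.
# Field names and values are invented for illustration only.
wos = pd.DataFrame({
    "doi": ["10.1/a", "10.1/b"],
    "wos_citations": [12, 3],
})
mag = pd.DataFrame({
    "doi": ["10.1/a", "10.1/c"],
    "mag_citations": [15, 7],
})

# Outer-join on DOI so a paper found in either source is kept,
# giving researchers a larger pooled citation record to work from.
joined = pd.merge(wos, mag, on="doi", how="outer")
print(joined)
```

The outer join keeps papers that appear in only one source (with missing values for the other), which is why the pooled table is a strictly larger source of citations than either data set alone.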
It's a marketplace for sharing and collaborating on research output. How do we ensure that, when Microsoft Academic Graph is being updated every two weeks, researchers who are interested in reproducing some of these queries are still able to do that in the future? There's some exciting research happening there to support reproducibility. And finally, a user-friendly graphical user interface and other tools enable our users to query data sets in a way that's really accessible for long-tail researchers or researchers who don't have the capability to code. We've received some great feedback on that feature, especially from researchers who might not traditionally be doing text or data mining as part of their research, but who can really benefit from it. So this is how it works. You log into the platform from any institution; you can use your in-house credentials. So, excuse me, I have a cat interrupting me here. You can then access the data sets, all three of our major data sets now: Microsoft Academic Graph, Web of Science, and the U.S. Patent and Trademark Office data. You can use the GUI query builder to query big data sets, and that GUI looks a lot like what a user would expect when doing something like running an advanced search in Web of Science. And you can build your own data analysis and visualization tools, so you can analyze the results with shared tools and with open tools, and then you can reproduce your work and share it with other researchers. I won't go into this too much. This is a model of our product, but you can see that these key features of authentication and authorization make it possible for us to seed the platform with library-licensed resources, and that we are working to integrate computation and storage infrastructure that meets our users' needs.
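The kind of GUI query builder described here typically compiles a user's form selections into a parameterized database query behind the scenes. The sketch below is not CADRE's actual implementation; the table, columns, and `build_query` helper are invented to show the general pattern of turning GUI filters into safe SQL:

```python
import sqlite3

# In-memory stand-in for a bibliometric table; names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publications (title TEXT, year INTEGER, venue TEXT)")
conn.executemany(
    "INSERT INTO publications VALUES (?, ?, ?)",
    [("Citation dynamics", 2016, "Scientometrics"),
     ("Network breadth", 2019, "JASIST"),
     ("Open data reuse", 2015, "Scientometrics")],
)

def build_query(filters):
    """Compile {column: value} pairs from a GUI form into parameterized SQL."""
    clauses = " AND ".join(f"{col} = ?" for col in filters)
    sql = f"SELECT title FROM publications WHERE {clauses}"
    return sql, list(filters.values())

# A user picking venue and year in the form becomes one parameterized query.
sql, params = build_query({"venue": "Scientometrics", "year": 2016})
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Parameterizing the values (rather than splicing them into the SQL string) is what lets a point-and-click interface safely accept arbitrary user input, which matters for the long-tail researchers the platform targets.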
We don't have a full index of all of the tools that will be available in the product yet, but we are working with our early adopters and alpha users to determine what will be made available, and that will be an iterative process as well. Right now, I want to draw your attention to a new cohort that we have developed, a research cohort for the study of coronaviruses. In response to the global pandemic, we've opened up a special fellowship program for those who are working on coronavirus-related research. We're calling it the RCSC program, and we're currently accepting proposals to join. This program provides access not only to the data sets that I already discussed, but also to the COVID-19 Open Research Dataset, which is a National Library of Medicine data set. If applicants are approved to join this cohort, they will have full access to all of our tools, as well as free computing resources, technical support, and the ability to present their research. We have an existing fellowship program that was closed to applications, and we reopened it specifically for this special fellowship program, but I want to take a minute to talk about some of our other research fellows. In early 2019, we opened up this research fellowship, and we had eight fellowship teams across disciplines with really compelling research proposals incorporating big data and bibliometrics. They traveled with our project team to the International Conference on Scientometrics and Informetrics in Rome in 2019 and are presenting their work in a series of webinars throughout 2020. These research fellows have enabled us to collect really detailed user stories for our product and are doing some really interesting work using the platform and the data sets that it makes available.
This first team is from Purdue University, and they're looking at citation of data in STEM education research, assessing rates of citation for data sets and analyzing a sample of data citations in publications to figure out whether they can help evaluate the relevance of these data sets for reuse, and they're using both Microsoft Academic Graph and Web of Science for this. The next research team, Understanding Citation Impact of Scientific Publications Through Ego-centered Citation Networks, is researching network-based citation measurements in the sciences. They're looking at the citation impact of scientific publications using an ego-centered citation network, one that contains the citing relationships between a publication and its citing publications, and they're using Web of Science and Microsoft Academic Graph data for this research as well. They're giving a presentation on it as part of our fellows series on June 17th, if you're interested in that. This project team from Michigan State University is building upon a report, Navigating the Structure of Research on Sustainable Development Goals, and they're looking at patterns of global collaboration and support for the UN's Sustainable Development Goals call for action. Essentially, they're designing a prototype that will analyze and visualize partnerships over time in SDG-supportive research, and they are also using Web of Science and Microsoft Academic Graph data. They'll be presenting their research on April 29th at 3 p.m. if you're interested in that fellowship group.
We also have this highly collaborative fellowship group, Measuring and Modeling the Dynamics of Science Using the CADRE Platform, and they're trying to better characterize the influence of scientific papers, which of course is usually measured by citation rate. They're interested in distinguishing between papers that destabilize existing knowledge with new or novel concepts and papers that consolidate existing knowledge, and they're using citation data from Web of Science and Microsoft Academic Graph for that. These IU Bloomington and University of Warsaw researchers are trying to determine the impact of the introduction of long-distance flights on international scientific collaboration, which I think is particularly interesting and relevant right now. They're measuring collaboration through co-authorship and co-affiliation, and they're using Web of Science and Microsoft Academic Graph data from 1998 through 2017 for this. We have this cohort from the University of Michigan and the University of Michigan Medical School doing a comparative analysis of papers published in math and biology legacy journals and in newer journals with different publication models. They're trying to use our CADRE data sets to develop methodologies for comparative bibliometrics and content analyses, and to provide insight into publication trends in theoretical and applied domains. We also have a research group of one, Samuel Hansen from the University of Michigan, who is using reference and citation aging, bibliographic coupling, and network breadth and depth to try to find similarities and differences between research fields in the maths and the sciences, and he's using Web of Science data from 1900 to 2017 for this analysis, so quite a broad range of data there. We also have a final fellowship group that joined us a bit late in our fellowship program, and we were really excited to bring them on.
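Measuring collaboration through co-authorship, as the IU Bloomington and Warsaw team does, usually means building a network where authors are nodes and each co-authored paper strengthens the link between its author pairs. A minimal sketch of that counting step, with invented author names and papers standing in for the real bibliographic records:

```python
from itertools import combinations
from collections import Counter

# Toy corpus: each entry is one paper's author list (names are invented).
papers = [
    ["Alice", "Bob"],
    ["Alice", "Bob", "Carol"],
    ["Carol", "Dan"],
]

# Each paper contributes one link per author pair; the edge weight is
# the number of papers that pair has co-authored.
edge_weights = Counter()
for authors in papers:
    for pair in combinations(sorted(authors), 2):
        edge_weights[pair] += 1

print(edge_weights.most_common(1))  # strongest collaboration tie
```

Sorting each author list before pairing makes ("Alice", "Bob") and ("Bob", "Alice") count as the same undirected edge; the same pattern extends to co-affiliation by swapping author names for institution names.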
They're from Ohio State, and their topic is Assessing the Rise of China as a Scientific Nation. Their project is providing an assessment of the nature of China's publications in science and engineering over the past two decades, and they're examining the published scholarship from China, including case studies and macro overviews, looking at the impact and collaborations involved in China's rise, and they're using Web of Science and Microsoft Academic Graph data here. They have already presented as part of our webinar series on April 6th, and a recording of that presentation is available on the resources page of our website. These are the upcoming presentations that I mentioned earlier. So before I wrap up and address some of your questions, I wanted to just touch on what's next. We're working on a sustainability and governance model for how this platform will continue. We have commitment from our existing members for the next few years, and we're interested in expanding our CADRE membership beyond the Big Ten Academic Alliance and determining what new data sets we can add to the platform that would support big data research at large academic libraries. You can contact us by email, find us on Twitter, or access our website to find instructions for logging in to the alpha implementation of our platform, or to look through our many resources, which include documentation for all the data sets I talked about. Thank you, Jamie. That was really interesting, and not surprisingly our first question has to do with access to the platform. Emily Gore is asking: is the CADRE platform itself open source? Are any other libraries or consortia using or planning to adopt the platform? If so… I'm so glad you asked that, Emily. I should have mentioned it. Yes, the platform itself is open source. Everything is available on GitHub, and you can access all of our materials on GitHub by going to our website, cadre.iu.edu, and you'll find the link there.
As far as I know, other libraries or consortia are not planning on adopting the platform in the sense of downloading our open source software from GitHub and standing it up themselves, but we have gotten a lot of interest from other universities in joining our membership model and using our hosted version of the platform. Great. Thanks, Jamie. And thank you for that question. We do have another question that's come in from Monica McCormick, and Monica asks: can the platform be used to conduct institutional research? I can think of lots of ways my university wants to mine that data, but not necessarily for big data scholarship as much as institutional activities. Is that kosher? Thank you for that question, Monica. A great question. I don't think I am in a position to say whether or not that's kosher from an institutional perspective, but certainly from the CADRE perspective that is acceptable and encouraged. I can give you an example. One of our early use cases was an institutional data use case: librarians in the Big Ten Academic Alliance were interested in determining what their overall shadow acquisition budget was, that is, what researchers on their campuses were paying in article processing charges to have open access publications made available in gold open access journals. And they used Web of Science data in CADRE to analyze who the corresponding authors were, what institutions they were a part of, what journals they were publishing in, and what the cost of that article processing charge was, to determine that figure, with the idea that that data might help inform transformative agreement negotiation. Interesting. Okay. Thanks. Thanks, Jamie. And Monica says excellent example. Thanks. And thanks for that excellent question, Monica. So the floor is still open for questions, and we've got time for a few more.
If you have a question for Jamie, please feel free to type it into the Q&A and we'll field that live, or you can type it into the chat box. We also have the ability to turn on your microphone, so if you would like to make a comment or engage directly with Jamie, just raise your virtual hand; I can see that, and I can turn on your microphone so we can hear your comment or question live. And while we're waiting to see if we have any other questions from our audience, I actually see that Clifford would like to ask a question. So I'm going to step back, and Cliff, ask your question. Thank you. I thought this one would be easier to ask than type. I'm wondering how you're handling the provisioning of the analysis from a computational point of view. It would seem to me that users could certainly come to this with some things they want to explore that would take quite a lot of computer cycles. Are you recharging them, or how are you managing that? So the technical details of that are a better question for my technical lead, Valentin Pentchev, than for me, and I'm happy to have him follow up with you. I will say that right now we are exploring what kind of usage our fellows and our platform users need in order to determine how we will continue offering service in our sustainability model. Cloud computing is expensive, and I think that going forward we are looking at models for enabling users to use the cloud computing credits or time that we have purchased as part of the project, or potentially connecting to local resources at their own institutions, depending on what kind of access they'll need. That's interesting, because I've seen a couple of other experiments trying to set up these kinds of reference data sets that can be used by a whole community, and often those are being put in the cloud, essentially with an interface, but then you use your own account and instantiate your own virtual machine to host the computation if it's significant.
I will say the major contribution, or I should say the bulk of the contribution, that we've received from our partner institutions in terms of financial support has been funding for cloud computation, so it is, besides staffing, one of the more expensive components of this project. We are looking at models for how we will continue to be able to sustain that going forward, but one thing that has been pretty clear is that the more members we have participating, the less expensive it becomes for everyone involved. Fascinating, thank you. Thank you. Very interesting. Thanks, Jamie. Well, while we're waiting to see if we've got any more questions coming in, we do have a minute or so left. I just want to remind everyone that this webinar is part of CNI's Spring 2020 virtual membership meeting. We're glad you could join us, and I'm sharing with you there in chat a direct link to the complete schedule for the rest of our offerings, which runs through May, so we hope that you'll join us. We have a couple more sessions coming up this week: two more tomorrow, on IIIF and on ARKs in the Open: 3.2 Billion Persistent Identifiers, and we'll have another one on Friday on Immersive Scholar: Development, Documentation, Display, and Dissemination of Experiential Research and Scholarship. So lots more to come. Well, with that, I think we're right at about time, and I don't see any more questions coming in, so Jamie, I just want to thank you again for sharing your really interesting and important work here at CNI. Thanks so much for being a part of the program; we very much appreciate it. Thanks so much for having me. Yes, many thanks, this is fascinating work, and thank you for sharing it with us. Indeed, and thanks so much to our attendees for making time to be with us here. We hope we'll see you back at another webinar soon. Have a great day and be well. Bye-bye.