All right. Well, thanks for rejoining us, and welcome back. We're in the final segment of the first day of the CNI Fall 2021 Member Meeting virtual event, and we'll conclude the day with two project briefings. For the first one, I am delighted to welcome Huajin Wang and Melanie Gainey, both from Carnegie Mellon University. They're going to talk to us about an end-to-end open science and data collaboration program. One of the things that really caught my eye about this, and that I think is worth considering carefully, is that, at least as I understand their work, and they will tell us about it a little more authoritatively in a moment, what they're really trying to do is engage the full research process, as distinct from simply picking up outputs at the end of it, which has been a more characteristic mode of engaging researchers at many institutions and one that has had very mixed success. So I'm really interested to see what they have learned through this work and how successful it's been. And with that, let me just thank you both for joining us, and I'll go away and turn it over to you.

Thank you for having us. I'm Melanie Gainey, and I'm going to start the presentation by talking about the Open Science and Data Collaborations program at CMU, and then Huajin Wang will do the second half of the presentation. In the first part of the talk, I'll be talking about the program, how we use it to support open science at Carnegie Mellon, and a little bit about how we adapted to the pandemic. Then Huajin is going to talk about how we are currently developing instruments to assess the impact of the program.

The mission of CMU Libraries is to create a 21st-century library, and in our minds a big part of that will be supporting the future of science, which will become increasingly open. To that end, we created the Open Science and Data Collaborations program back in 2018. Over the last few years we've been rapidly expanding the initiatives of the program, and we are currently turning to a phase of assessing the impact and success of those initiatives. When we talk about open science at Carnegie Mellon, we really try to frame it more as open research, since we hope that our tools and services will be useful for researchers across the disciplines, not just the sciences, in spite of our own backgrounds as research scientists. Open science is a commonly used term in the community that refers to a wide variety of practices, many of which we support with the program. When we talk to researchers, we also try to emphasize that the goal is really to make research products as FAIR as possible, meaning findable, accessible, interoperable, and reusable.

There are five main categories, or pillars, of support in the program. We license tools, and provide training for them, that support collaboration and sharing research products publicly. We foster collaboration opportunities, particularly those that bring researchers across disciplines together. We do assessment, including benchmarking against our peer institutions as well as research on the impact of open science. We provide training opportunities, both in the form of short workshops that cover open science practices and tools, and longer Carpentries workshops that focus on programming and open-source languages. And finally, we hold annual events; two examples are our symposia, the Open Science Symposium and the AIDR (Artificial Intelligence for Data Reuse) conference.
I'd also like to note that we work closely with specialists at CMU Libraries who focus on open access publishing, open educational resources, and research data management, which are all closely related to open science. When we talk about the program being end-to-end, we really mean that we have developed services that can be mapped onto all of the phases of the research lifecycle, everything from the beginning designing and planning stages to the end, where researchers are publishing. So, for example, some of our tools are really meant for the documentation or planning phase, such as LabArchives, which is an electronic research notebook. We then support R and Python with the Carpentries workshops, which really gets at the collecting and analyzing part of a research project. And then we support many platforms that can be used to publish research products, such as protocols.io, the Open Science Framework, and our institutional repository, KiltHub. Many of our events touch on all of these phases of the lifecycle. As I mentioned before, the services in the dashed boxes represent services that our colleagues focus on, but we work closely with those folks to ensure that users at CMU are receiving holistic support for open science.

I'd like to share an example from Carnegie Mellon that I think really highlights the power of data sharing and collaboration. This is the BOLD5000 dataset. This project was a collaboration between researchers in psychology and the Robotics Institute at Carnegie Mellon. They published this dataset back in 2019 on KiltHub, our institutional repository, and it was one of the first large datasets that we hosted there. Our former colleague Ana Van Gulick worked with these researchers to provide guidance on things like how to structure the data to optimize its reuse, and other considerations around versioning and licensing. What is really novel about this dataset is its sheer size. In this project, the researchers scanned the brains of people who were looking at 5,000 natural scenes; typically in these types of projects, a person might look at under 100 scenes. The very large size of the dataset makes it really useful for computer vision scientists, because their algorithms have typically been difficult to apply to the smaller datasets. This quote speaks to the power of that collaboration. It's from Nadine Chang, one of the researchers on the project in the Robotics Institute, who said that computer vision scientists and visual neuroscientists essentially have the same goal: to understand how to process and interpret visual information. This dataset has been really important in bridging the gap between these two research communities, since it was designed and implemented as a collaboration between researchers in both disciplines. As I said, the dataset was shared in KiltHub as well as in some discipline-specific repositories to help improve its discoverability. If you look at the record in KiltHub, you can see that since 2019 it has been downloaded almost 73,000 times, and there are already publications coming out that cite this dataset. So we're already seeing a large impact from this dataset, and I think it's a really great example that we use a lot at Carnegie Mellon to illustrate the benefits of data sharing.
When we talk about data sharing, as well as other open science practices, we really try to meet researchers where they are. Depending on their discipline, the types of data they work with, or even the culture of their work group, which is often determined by the PI or the professor, they might have different attitudes toward, or willingness for, sharing data and other research products. As we interact with people at Carnegie Mellon, there are many folks who like to keep their data private, because they have concerns about data sharing or are simply not interested in doing the work that it requires. Then we meet many people who are doing some sharing because of mandates from funders and publishers, and that's a great opportunity to talk to them about how they could share more if they wanted to. And then we also have a group of open science advocates and champions on campus who are really sharing all of their research products: the workflows, the code, the datasets, as well as the manuscripts. We work closely with those advocates to try to highlight the benefits of open science to the rest of the community. You could take that gradient of open science attitudes and map it onto most of our products or services, so we can really tailor the outreach to specific people on campus.

One example here is how we might do outreach for protocols.io, which we license. If you're not familiar with the platform, it's an open access repository for step-by-step research methods, or protocols. We have found it's useful for people no matter where they fall on that spectrum. For example, a person can choose to keep their protocols private there, or share them only with the members of their lab group, and many people like that option. Even though the protocols aren't publicly accessible at that point, this still really helps improve reproducibility, because they're version-controlled, so we do encourage researchers to do even that. But what we're really trying to do is move the needle to the right, where a researcher might then opt to make that protocol public, or link to the protocol in their manuscript. So, as I said, we're willing to meet researchers where they are, but we're always hoping to guide them toward the more public options where possible.

This is a service I like to highlight in terms of opportunities that the pandemic presented. It's called the Data CoLab, and basically it is a matchmaking service where we pair researchers who have rich, complex datasets with data scientists on campus who can analyze them. When you go to the website, you can see a list of the projects that have been supported by the Data CoLab. One thing that was really interesting was that during the pandemic, particularly the early part, this provided almost an internship-like opportunity for students who otherwise were having a difficult time finding internships, so it was incredibly useful in that context as well. These quotes are from a pair of researchers who were matched by the service. The top quote is from the person who collected the data: "This project started in a way because of COVID. Data CoLab gave me the confidence that this is doable and I don't have to do this all by myself," referring to the fact that they didn't have to do all of the analysis themselves.
And the second quote is from the data scientist, who said, "I certainly did learn some new skills and used some of the work I've done only in theory on real datasets." Another thing we noticed during the pandemic was an increased need for virtual collaboration tools. One example we're highlighting here is LabArchives, the electronic research notebook platform. In the sciences there are still many people using paper notebooks, and we saw increased adoption during the pandemic because, as scientists had to start working from home, it was so much easier for them to collaborate with all of their notebooks in the cloud. This is a quote from Sarah Werner, a PhD student in biological sciences, who says, "I began using LabArchives last fall, which has been intensely useful due to the COVID-19 pandemic. In March, when I prepared to work from home, I did not have to worry about taking home countless notebooks. I just brought my laptop home with me as usual." So we saw greatly increased adoption of many of our cloud-based collaboration tools during the pandemic. Okay, and with that I'm going to hand over to Huajin, who's going to talk about how we've recently been developing instruments for assessing the impact and success of the program.

Thank you, Melanie. Hi, everyone. After running the program for almost three years now, we've seen that people are getting engaged with us, and we've observed some behavioral changes starting to happen in research communities. So we really wondered exactly how much impact we're making, and in what ways. Next slide. We want to be able to answer questions like: who are the active users of our services, how do they benefit from our program offerings in their daily work, and how can we do more to support their research? So we started to look into several ways to assess our impact, both quantitatively and qualitatively. Next slide.

The first exercise we did was to develop a logic model. Well, honestly, this wasn't actually first, but we felt in the end that it should have been done as the first step, and we're very thankful that one of our associate deans, Brian Mathews, introduced us to this idea and helped develop it. For this exercise, we started by listing out our activities in a spreadsheet that includes, for each activity, what our investment was, what our outputs were, and what our goals were, and then we summarized all of this in an infographic like this one. There's a lot of information to unpack here, but let me walk you through it. First, look at the two blue boxes: these are our activities and outputs. In the first blue box we have five groups of activities; from top to bottom, they are the tools, workshops, events, collaborations, and outreach efforts. We summarize their corresponding outputs in the second box: how many activities we offer, how frequently, how many users participate, and, for some of them, how many products users produce with our services. We also outlined the expected outcomes of our collective activities in the pink box, which is basically our short-, medium-, and long-term goals. A simple sketch of how such a spreadsheet row can be captured follows below.
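To make the spreadsheet step concrete, here is a minimal sketch, in Python, of how one row of such a logic model could be captured as structured data before being summarized in an infographic. The fields mirror the columns described above; the example activity and values are hypothetical placeholders, not the program's actual entries.

```python
from dataclasses import dataclass

@dataclass
class LogicModelRow:
    """One activity in the logic model spreadsheet: what we put in,
    what comes out, and which program goals it serves."""
    activity: str
    investment: str        # e.g., staff time, licenses, funding
    outputs: list[str]     # countable things the activity produces
    outcomes: list[str]    # short/medium/long-term goals it supports

# Hypothetical example row, loosely based on the activities named in the talk.
rows = [
    LogicModelRow(
        activity="Carpentries workshops (R/Python)",
        investment="2 instructors, 2 days per workshop",
        outputs=["4 workshops/year", "~30 learners each"],
        outcomes=["researchers learn open, reproducible workflows"],
    ),
]

for row in rows:
    print(f"{row.activity}: {len(row.outputs)} tracked outputs")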
It turned out, after we did this exercise, that the model aligned very well with our initial intention in establishing this program in the first place: to first help researchers learn about open and reproducible workflows, then help them adopt such practices in their daily research, and then, eventually, hopefully help create a cultural shift in their communities and across disciplines. Working through such a logic model really helped us think carefully about how each of our activities aligns with our overall goals, and to plan strategically, because it's sometimes so easy to get excited about something new, take on too many things, and end up compromising quality for quantity. In our version of the logic model, you may notice that we also have another measure of impact: the partnerships we've formed along the way, in the last box here. Here we want to especially emphasize that we have a close relationship with the Mellon College of Science. We started our partnership in 2018 by co-hosting the first Open Science Symposium together, and most recently we established the Emerald Cloud Lab partnership. For those who are not familiar, the Cloud Lab is an AI-driven laboratory where all the equipment is remotely controlled, so there are no grad students in the lab. Our open science program has been involved from the very beginning to support open sharing of the results produced in the Cloud Lab, and you will hear more about this partnership on December 14, when there will be a closing keynote at the in-person CNI meeting. Next slide.

In addition to the logic model, of course, we also wanted to find some quantitative ways of measuring our program's impact. So we collected data across all our services, integrated the data together, the data we were able to integrate, that is, and asked some questions. There are some simple questions, like who our users are, who our top users are, and which disciplines are the most engaged with us. And there are some that are more in-depth and more difficult to answer, like how people use our tools or activities and why they do so. Ultimately, we want to answer what impact we're making on people's research processes, and maybe on the whole research ecosystem. Next slide.

We've been working on developing an assessment metrics framework to answer these questions. What you're seeing here is still a work in progress, but basically, we looked at metrics and variables in the data we already have. Some come from dashboards or directly from the vendors, and others come from, for example, registration records. We have some derived metrics and variables based on the questions we would like to answer, and we grouped all these metrics into what we call a five W's and one H framework. For example, for the question "who are our users?", we have each user's affiliation, collected directly. And for the "how" question, how do people use our services, there's a red circle on the right here: we have activities or outputs for each user. This can be the number of projects or the number of events participated in, and by combining this information we can come up with a derived metric, which is the super users. As I mentioned, this framework is still a work in progress and is very limited by what data is available to us, but from here you can already extract some interesting patterns, and I'm going to show you just a couple of them. Next slide.
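To make the integration step concrete, here is a minimal sketch, in Python with pandas, of how per-service records might be combined into a single user-level table with true/false usage flags and a derived activity count of the kind just described. All of the service names, columns, and values here are hypothetical placeholders, not the program's actual schema.

```python
import pandas as pd

# Hypothetical per-service usage records; the real program draws these
# from vendor dashboards and registration logs.
kilthub = pd.DataFrame({"user": ["a", "b", "c"], "items": [12, 2, 1]})
labarchives = pd.DataFrame({"user": ["b", "d"], "notebooks": [3, 1]})
workshops = pd.DataFrame({"user": ["a", "b", "d"], "events": [5, 1, 2]})
services = {"kilthub": kilthub, "labarchives": labarchives, "workshops": workshops}

# Integrate: one row per user, True/False for whether they used each service.
all_users = sorted(set().union(*(df["user"] for df in services.values())))
users = pd.DataFrame({"user": all_users})
for name, df in services.items():
    users[name] = users["user"].isin(df["user"])  # usage flag per service

# Anyone with a True in any service column counts as a user of the program.
users["is_user"] = users[list(services)].any(axis=1)

# A derived "how" metric: total activities/outputs per user across services.
totals = (
    pd.concat([df.set_index("user").sum(axis=1) for df in services.values()])
    .groupby(level=0).sum().rename("activity_count")
)
print(users.merge(totals, left_on="user", right_index=True, how="left"))
```

Thresholding a derived count like `activity_count` is one simple way a "super user" metric could be defined on top of such a table.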
The plot you're looking at provides disciplinary information about who our users are. It's a summary of how many users we have from each department, based on their primary affiliation, and the data comes from the integrated dataset, where the usage of a given service or activity is represented by a true or false value; anyone with a true for any of the services is counted as a user. You can see that the top user group is from the Heinz College of Information Systems and Public Policy, followed by biology, the University Libraries, and psychology. I think these results can partially be explained by disciplinary culture. For example, we know from our personal interactions that Heinz College has some of our biggest fans, and libraries and psychology have been leading forces in the open science movement around the globe. It's also really nice to see that some of the engineering and computer science departments are emerging toward the top. Next slide.

But we also know that having the most users doesn't necessarily reflect user activity, so we also looked at how active users actually are, for example by looking at the number of projects per user. We did this at the tool or activity level, instead of using the overall integrated dataset, because each platform has its own measurements. This box plot is from the KiltHub data, and it's a departmental breakdown of the number of public items owned by each user. You're looking at only the top 10 departments, ranked by the median. You can see that some new departments emerge at the top, including the Tepper School of Business, music, and physics. What you might also have noticed is that there's a wide range of distribution patterns, but overall the medians are actually pretty low; they're less than five items per user. There are some users, though, who have shared a lot of items. For example, in psychology there are many outliers. These are who we call our super users. Next slide.

So we were able to find out who these super users are and their departmental affiliations. Of course, the information shown here is de-identified, but internally, knowing who the super users are has been really helpful and valuable for making targeted outreach. One of the important things for us to know about our impact is how much we have changed researchers' behavior over the years. From here, you can see that there has been an increase in the number of depositors in KiltHub, and in the number of users with accounts on LabArchives and protocols.io, every year since we started the program in 2018. We would like to attribute this increase to the outreach efforts of the Open Science and Data Collaborations program and to claim the impact, but of course we want to be really cautious about not jumping to conclusions, because many factors might be at play here, including the influence of the pandemic. So we want to keep looking at this down the road, for the long term. Next slide.

One of the most difficult things in the metrics framework, we think, is answering the "why" question. Why are people using our activities? What are their motivations? Is it to meet mandates, to get credit, or for some other reason? For these types of questions we'd really like to get direct feedback from users through surveys and interviews.
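The "super user" cut described above can be reproduced with the standard box-plot outlier rule. Here is a minimal sketch, assuming a hypothetical table of public item counts per user with a department column; it computes per-department medians and flags users above Q3 + 1.5 × IQR within their own department, the same convention most box plots use to draw outliers. The data and the exact outlier rule are illustrative assumptions, not the program's actual figures or method.

```python
import pandas as pd

# Hypothetical KiltHub-style data: public items owned by each user.
items = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "department": ["Psychology"] * 4 + ["Physics"] * 2,
    "public_items": [2, 3, 4, 40, 1, 5],
})

# Median items per user, by department (the box-plot centerline).
medians = items.groupby("department")["public_items"].median()
print(medians.sort_values(ascending=False).head(10))  # top departments

def is_outlier(s: pd.Series) -> pd.Series:
    """Flag values above Q3 + 1.5*IQR, the usual box-plot outlier rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return s > q3 + 1.5 * (q3 - q1)

# Flag "super users" relative to their own department's distribution.
items["super_user"] = items.groupby("department")["public_items"].transform(is_outlier)
print(items[items["super_user"]])
```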
The data that we collected, as I mentioned before, helped us to identify the super users, and early this year we reached out to them and assembled an open science advisory group of students, faculty, and research staff. We've been asking them questions about their motivations for using our services, how their experience was, and what else they would like us to do to support them. Based on these initial inputs, we hope to design surveys and interviews that are targeted to certain communities or user groups and ask questions like the ones listed here: whether or how much our services saved them time and money; whether and how much we have helped them bring in grants, publications, and collaborations, or get better career options; and also whether we have helped their specific community adopt open science and create a culture shift within that community. Next slide.

Eventually, we'd like to be able to answer the question of what impact we are making in the research landscape at large. We realize this is not an easy thing to assess, so we really welcome collaborations and partnerships with other institutions that are also interested in this topic. Last but not least, I'd really like to thank my devoted colleagues who have been working on this project with us. We're a team of library faculty and staff, and most of us don't have open science in our job descriptions; we just volunteer our time together with the common goal of using data for social good and making research more open and reproducible. So I'm really grateful to everybody involved. And of course, thank you for listening. Next slide, please. We're eager to hear your feedback and input, and we're happy to answer questions.

Well, thank you so much for that. The measurement and impact assessment framework is really extensive; that's really interesting data. We probably have time for one quick question before we move on. If somebody wants to drop it in the chat or raise their hand, the floor is open.

Thanks, Cliff, for the plug about AIDR. That's a really good conference; I hope it's going to start happening again.

Yeah, we hope so. It got a little interrupted by the pandemic, like so many things. Okay, well, I think we'll go ahead and move on. I do also want to echo the point that Huajin made about the Cloud Lab closing plenary for the in-person meeting, which we will also be videotaping for those who won't be there, and we will subsequently make that available. It's an absolutely fascinating complementary piece to this, and I think you'll find it very interesting. All right, we will pause for just a moment now. Thanks again, Melanie. Thank you. Thanks for having us. Our pleasure.