Can you see my screen okay? Okay, I'll go ahead and get started. My name is Melanie Gainey, and I'm here with Huajin Wang. Together we co-direct the Open Science and Data Collaborations program at CMU Libraries, and we are also both librarians for the biological sciences department as well as some of the computer science and engineering departments on campus. Today we're going to talk about what open science is and how we support it through the program.

As you use the Emerald Cloud Lab, you will be generating huge amounts of data very quickly, and one question that will come up is how you will share that data. The goal of open science is to share as much as possible and to make the data as open as possible. We'll start with a couple of examples of the power of data sharing: one happening on a global scale, and one right here at Carnegie Mellon.

As we all know, data sharing has been hugely important during the COVID-19 pandemic, from its very early days back in January 2020, when the SARS-CoV-2 genome was shared by researchers in China and tracking dashboards appeared across Asia. Since then we've seen a tremendous surge in preprints, open access publications, data sharing, and open source projects, all helping us better understand COVID-19 and, importantly, speeding up the development of the vaccines. An important caveat is that this rapid pace of publishing and data sharing has come with some retractions, which have been in the news, so I think this will continue to be an interesting example to watch as the pandemic evolves, and it will help inform data sharing practices in the future.

When we talk about open science, particularly here at Carnegie Mellon, we're really referring more broadly to open research.
So open science is a term widely used in the science community, but all of the practices and tools we'll refer to can be used by any discipline: they work well in engineering and computer science as well as in the social sciences and humanities. It's really an umbrella term covering a large variety of practices, many of which we support through our program, and we'll talk about that in more detail later in the talk. The ultimate goal of all of these practices is to make research products FAIR, meaning findable, accessible, interoperable, and reusable.

There are many benefits to practicing open science, and I'll briefly mention a few of them here. One is that it allows you to get credit for all of your scholarly output, not just the peer-reviewed publications that come at the end of a project but all of the interim products you generate along the way, such as data, code, workflows, protocols, and so on. It increases the impact of your research and helps it reach broader communities, which in turn enables interdisciplinary collaborations. It improves the reproducibility and reusability of the data. It is important for democratizing knowledge and research: by getting work out from behind the paywall, you allow a much wider audience for it. And importantly, it complies with recommendations at the university level as well as mandates from many funders and publishers.
The second example I'm going to talk about is from here at Carnegie Mellon: the BOLD5000 dataset. This was a large dataset that came out of the Psychology Department and the Robotics Institute at Carnegie Mellon in 2019, and what was really novel about it was its sheer size. In this project, researchers scanned the brains of people while they looked at 5,000 natural scenes; typically in this type of study a person might look at 50 to 100 natural scenes. It's the sheer size of this dataset that makes it so powerful, because it allows computer vision scientists to apply their algorithms to it. That power is emphasized in a quote from Nadine Chang, a researcher at the Robotics Institute who worked on the project, who said that computer vision scientists and visual neuroscientists essentially have the same end goal, to understand how to process and interpret visual information, and that this project really helps bridge the gap between those two research communities.

This was actually one of the first very large datasets we hosted on our institutional repository, KiltHub. Our former colleague Ana Van Gulick helped these researchers think about things such as how to format the data for reuse and how to version and license it. It was also hosted on some other repositories to improve its discoverability to both cognitive neuroscientists and computer vision scientists. These are all the kinds of things we can help researchers think about. If you look at KiltHub, you can see that this dataset has been downloaded close to 73,000 times since 2019, so it's already having a large impact. There are already publications citing the dataset, so we like to share it as an example of how data sharing can really improve the impact of research and foster interdisciplinary collaborations.
I mentioned earlier that it's now possible to get credit for what we call interim research products, and I want to point out that the NIH allows researchers to cite these interim research products in grant applications and progress reports as products of the funding, so you can now get credit from the NIH for datasets, preprint publications, code, and so on.

I also mentioned briefly that many publishers and funders now have mandates. This is a very small list of some you might be familiar with if you work in the biological sciences or chemistry in particular, but the list continues to grow rapidly. What it means is that funders and publishers are asking researchers to share the data that underlie the figures in their publications or their funded research, and we anticipate that going forward this will increasingly become the norm. So as you start your research projects, it is really worthwhile to think about data sharing as you collect the data, since it's likely you will be required to share it at some point.

This is an article that came out in 2016 on the FAIR principles, and it's a really great read on how to practice open science and open research. One thing we like to highlight is that it's not enough simply to share the data; some considerations need to be taken into account to make it FAIR, that is, reusable by others. A lot of this comes down to having proper documentation practices and making sure that code is well annotated, and Huajin is going to talk more about that now.

Okay, so in the next part I'm going to talk about some practical considerations for how to share your research and make your data FAIR.
So first I want to point out that the FAIR principles apply to all research products: not only papers, but also your data, including raw and intermediate data, as well as software, code, protocols, and lab notebooks.

The key to making all your research data FAIR in practice is to use good data management throughout the entire research life cycle. The research process in general runs from designing and planning your experiments, through data collection and analysis, to publishing and archiving your research products and results, sharing them with the community, and hopefully optimizing them for reuse. It's good to use sound data management at every step of the way. I'm not going to go through every detail here, but the key is to think carefully at the very beginning of the research process. As researchers, we are often so focused on results, moving forward, and getting things published that we only start thinking about sharing when we actually write the paper and have to describe exactly what was done in our experiments, what data were collected and analyzed, and how we did that, often because the funders and publishers require it. By then it is often too late, because we can no longer remember how things were done in great detail. The lesson learned is: don't wait until the last minute to share your data, but plan it out at the very beginning. Keep your literature organized in a citation manager, choose hardware and software with eventual sharing in mind, and choose electronic notebooks that allow not only sharing but also collaboration, so it's easy to communicate with your collaborators. Another important
part of planning is to think about the statistics and study design you will eventually use, and perhaps pre-register your study. Another important ingredient at this stage is to write a data management plan.

Let me expand a little more on the data management plan. It's probably not new to those of you who have already written a grant application, but it can nonetheless be confusing, and it is required by many funders. Here's an example from the NIH: this is the recent version of their data sharing policy, which provides more stringent requirements for sharing data and research outputs. If you want to see exactly what a funder wants, you often have to dig a little deeper; for the NIH, for example, it is in the supplemental information that you can find the elements they're looking for in a data management plan.

Other funders may require different things, but they are more or less similar. In a good data management plan, you should generally consider these components: what data will be collected, including the types of data (images, genomics data, and so forth); the amount (are you going to produce megabytes or terabytes?); the variables to be collected in your dataset; and details of your instrumentation, reagents, protocols, and so forth. Another important aspect to consider is accessibility and how others may use your data. Related to that is whether your data will require specialized tools or software to access or manipulate; if so, you should share that software as well. You should also clarify the data formats and metadata standards.

Your data management plan should also include preservation considerations; think ahead of time about
what kind of repository you want to use. Would a generalist repository be sufficient, or would a specialized repository, like the SRA, be more suitable for your data? The goal is for the repository to issue identifiers that make your data findable and identifiable, so people are more likely to use it, and so it can be preserved for the long term. Another important element of the data management plan is to designate responsibility: who does what, and importantly, what happens when key personnel on the grant leave the lab, and in what way the research will continue.

Another important part of data sharing that is often neglected is day-to-day data management practice. First is file naming conventions. It is important to avoid special characters and spaces; this is especially important if you want to introduce programming and automation into your workflow and use the command line, so when you start using the cloud lab, this is going to be something to consider. It's often good to keep file names short, but the names should be meaningful and include the necessary details, and the most important thing is to be consistent. Here are some examples of file naming conventions. The top shows files named without a convention, so when you go back to a previous project it can take a long time to find the files you were working on. With a proper naming convention, as at the bottom, names start with the date, then the project name, then a description of what the file is, then the author and a version number, which makes everything much easier to interpret.
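As a minimal sketch of the date-first convention just described, a small helper can assemble such names consistently. (The project name, description, author initials, and version scheme below are illustrative placeholders, not from the talk.)

```python
from datetime import date

def make_filename(project, description, author, version, ext="csv"):
    """Assemble a name like YYYYMMDD_project_description_author_vNN.ext.

    Underscores instead of spaces and no special characters keep the
    name safe for command-line tools and scripted workflows, and the
    leading date makes files sort chronologically.
    """
    today = date.today().strftime("%Y%m%d")
    parts = [today, project, description, author, f"v{version:02d}"]
    return "_".join(parts) + f".{ext}"

# Hypothetical example: a versioned analysis table
name = make_filename("mousebrain", "cell-counts", "hw", 3)
# e.g. "20240115_mousebrain_cell-counts_hw_v03.csv"
```

Generating names from a function like this, rather than typing them by hand, is one easy way to enforce the consistency the convention depends on.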
Another important practice seems obvious, but it is a misstep we often encounter: researchers lose valuable data because storage and backup are not done properly. The general rule of thumb is the 3-2-1 rule. You should have at least three copies of your files, for example on hard drives or in the cloud; those three copies should be on at least two types of media; and at least one copy should be off site, so ideally one of the copies is in the cloud. One important thing to know: if you choose to keep two copies on external hard drives, make sure the drives are from different lots or batch numbers, because hard drives from the same batch tend to fail at the same time.

File formats also matter. It's important to use non-proprietary and lossless formats: for example, use TIFF instead of JPEG if space allows, because TIFF doesn't lose information to compression, and use CSV files instead of Excel.

I want to say one more word about CSV files. There have been a lot of problems associated with using Excel as a spreadsheet reader. As an example, gene names get mangled: when you enter a gene name like APR1 or MAR1, Excel automatically registers it as a date instead of a string, and when you save the file it is autocorrected to a date format instead of what you typed. This is something to keep in mind.
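A small sketch of the difference: a plain-text CSV written and read back entirely in code preserves gene names exactly as strings; the mangling happens only when the file is opened and re-saved in a spreadsheet program. (The gene list here is just illustrative.)

```python
import csv
import io

# Gene names that spreadsheet programs are known to reinterpret as dates
genes = ["MAR1", "APR1", "SEPT2"]

# Write a plain-text CSV entirely in code
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["gene", "count"])
for g in genes:
    writer.writerow([g, 0])

# Read it back: the names round-trip unchanged, as plain strings
buf.seek(0)
rows = list(csv.reader(buf))
recovered = [r[0] for r in rows[1:]]
```

Keeping the whole pipeline in code like this, or importing the column explicitly as text, avoids the silent autocorrection entirely.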
Next, I want to spend some time talking about reproducibility, because one of the main advantages of using the cloud lab is that it makes your workflow fully automated and your results more reproducible. The key point of making your results reproducible is not only to withstand the scrutiny of peer reviewers and other researchers; just as important, it lets your co-workers, the future grad students continuing your work, and your future self remember what was done and how it was done to generate the results. There are a lot of everyday things you can do to increase reproducibility. Some of these may look like extra work, but in the long run you will find they save you a lot of time.

If you work a lot in a wet lab, it's important to document your protocols; this goes without saying. Document every detail and variation of every step. Use reagents from reliable sources and document the details and origin of each resource. One thing many people overlook is the lot number of the reagents, because there can be a lot of variation between lots. Many retracted papers and irreproducible results actually originated from using the wrong cell line, not intentionally, but because someone walked down to the next-door lab and borrowed some cells or plasmids without verifying the identity of the cell line, which ended up contributing to a lot of wrong results. The last piece here is equipment: the calibration and metadata of your equipment are actually very important for the results you get. For this piece, the advantage of the cloud lab is that all the equipment parameters and metadata are documented and saved automatically.

The next part is about computational reproducibility. There are two very good articles on this, Ten Simple Rules for Reproducible Computational Research and Good Enough
Practices in Scientific Computing; I highly recommend reading these two papers before doing any computational work.

Computational reproducibility starts from something very basic: organizing your project. In the tree structure on the right, all your data lives under a data directory, documentation has its own directory, your results go to a directory separate from your raw data, and your scripts also go to an independent directory. The advantage of organizing your files this way is that it's easy to navigate the directories with computational methods.

That brings me to the next point: it's important to use relative paths in your scripts. In the first bullet here, if I want to find the bird_count_table.csv file, I just write a relative path in my script: the two dots mean go one step up to the parent directory (because the script lives in the src directory), and the rest says go to the data directory and find this file. But if I use an absolute path, like the second line here (/Users/huajin/project/...), the script can only operate at this one location on this one computer; it's not portable and not reproducible anywhere else.

It's also very important to document each step of your analysis, and to be generous with comment statements explaining what each argument means, why you chose a parameter, and what the expected outcome should look like. This is not only for others who may use your code; you're also doing a favor for your future self, and for me that future self sometimes arrives five minutes after I write the code.
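Going back to the path example, the relative-versus-absolute point can be sketched like this. (The helper function and the src/data/results layout are illustrative; only the bird_count_table.csv example and the ".." idea come from the talk.)

```python
from pathlib import Path

def project_paths(script_path):
    """Derive project directories from a script that lives in src/.

    Resolving everything relative to the project root keeps the
    analysis portable: it runs wherever the project is copied,
    unlike an absolute path such as /Users/huajin/project/... .
    """
    src = Path(script_path).resolve().parent
    root = src.parent  # the ".." step: one level up from src/
    return {"src": src, "data": root / "data", "results": root / "results"}

paths = project_paths("src/analysis.py")  # hypothetical script location
counts_file = paths["data"] / "bird_count_table.csv"
```

The same idea applies in R or shell scripts: compute paths from the script's own location instead of hard-coding where the project happens to sit on one machine.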
So it's a very good habit to write plenty of comments explaining what you have done in your code. Also, in terms of raw and intermediate data: sometimes the most useful data are not necessarily the raw data, because it might take a lot of computational time to arrive at an intermediate dataset. With the cloud lab, this might be the most useful form of data to share down the road, so be diligent about saving intermediate data at every step of the way.

For dependencies and the computational environment, the idea is the same as with relative versus absolute paths: you want your code to be able to run on a different computer, by somebody else. So it's important to document all the software packages and libraries you used, including version numbers. Preserving all the dependencies is easier said than done, and sometimes the best solution is really to use a container, such as Docker or Binder, to make your work more reproducible. When you do a lot of collaborative work, version control and tracking changes are also extremely important, and GitHub is one of the most popular platforms for that.

Some of you might be doing statistical modeling or machine learning, where it's inevitable that you will come across randomness. In this type of work, it's important to write down or take note of the random seeds you use to run your algorithm; otherwise, every time you run it, it will come up with different results, and then it's not reproducible. Of course you should test using different random seeds, but take note of which seeds you used.
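A minimal sketch of the seed point, using Python's standard library (the seed value and sample size are arbitrary):

```python
import random

SEED = 42  # record this value alongside your results

def simulate(seed, n=5):
    """Draw n pseudo-random numbers from a generator seeded explicitly."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

run1 = simulate(SEED)
run2 = simulate(SEED)
# Same seed, same draws: the "random" part of the analysis is reproducible
assert run1 == run2
# A different seed gives different draws, which is why the seed must be noted
assert simulate(SEED + 1) != run1
```

The same principle applies with `numpy`, R's `set.seed()`, or any machine learning framework: seed explicitly, and record the seed with the results.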
With all of this, since we're talking about the cloud lab and about running things from the command line, the goal is to avoid manual manipulation, so that every one of your procedures is captured in a script. For example, a step like "I changed all the NAs in my data to 0.1 and duplicated the first three cells in column A" is hard to describe, hard to reproduce, and very error-prone when done by hand, but if you write it all in a script, it is very clear what you have done.

To make all this documentation easier, there are literate programming platforms such as Jupyter Notebook and R Markdown. On these platforms you can not only write your code but interleave it with textual explanations, run the code, show the results and visualizations, and then add your interpretation as well; it's basically like writing a paper in a computational environment.

When you're done with all your work and it's about time to publish, make sure you share all your scripts, analysis notebooks, and results. Pick the appropriate license and optimize for maximum sharing, but at the same time take into consideration the timeline of sharing, whether you want an embargo, and who you want to give access to. And share in a repository that will issue you a DOI or another type of persistent identifier, so that not only the data but the research product itself can be cited. So with that, back to you, Melanie.

Okay, so finally we're going to talk a little more specifically about how we support these practices at CMU Libraries, and I'll just start by saying that the mission of CMU Libraries is to create a 21st-century library, a library that has moved
beyond simply lending books to really support the future of research, and in our minds the future of research is that it will be increasingly open. To help support open research, we created the Open Science and Data Collaborations program back in 2018, and there are five main pillars of support that we offer.

We license and support tools that facilitate data sharing and collaboration; I've already mentioned KiltHub and will talk a little more about it in a second. We foster collaboration opportunities, particularly those that bring researchers together across disciplines. We do assessment, both benchmarking against peer institutions and research into the benefits of open research. We provide training opportunities, from short workshops covering various open science topics to longer two-day workshops on coding in open source languages. And we host annual events, such as our two symposia, the Open Science Symposium and the AIDR (Artificial Intelligence for Data Reuse) conference, which discuss the benefits and challenges of practicing open science. You can read more about all of our services at this URL. I'll also note that we have colleagues in the library who focus on things such as open access publishing (we have an APC fund that can help support that), as well as colleagues who work specifically on research data management and open educational resources, and we are happy to put you in touch with those folks if you need help in those areas.

So first I'll start with the tools. I already mentioned KiltHub briefly: this is our institutional repository, which can be used to publish any type of research product, anything from datasets, software, and papers to code, posters, and more. We offer custom support for uploading projects, it is indexed on Google, and it satisfies the data sharing mandates coming from publishers and funders. At this URL you can read more
about how to get started with the service, or feel free to get in touch with us as well. One thing we can do on KiltHub is create a landing page for a specific lab group or program. This is an example of a page set up for Marlene Behrmann's lab in psychology; she has shared many of her projects there, and it provides a nice single URL you can point to when sharing your work with others.

We also have a license for a platform called protocols.io, which is essentially an open access repository for step-by-step methods, or protocols. It works really well for anything step-by-step, wet lab experiments as well as algorithms; this is just an example protocol. One nice feature is that it has very good versioning, which is incredibly important when you are troubleshooting protocols, and you can either keep protocols private, share them with just the members of your lab group, or share them publicly. One advantage of sharing publicly is that you can then simply link to the protocol from a finished manuscript, allowing a more in-depth methodology than the strict word limits from the journals might permit. protocols.io has also started integrating with various journals: with PLOS ONE, for example, there is now an article type that is a peer-reviewed protocol. This is an interesting thing to think about, getting back to the idea of getting credit for interim research products: you can now have a publication that is an actual protocol in this journal.

We also support and license LabArchives, which is an electronic lab notebook. If you are still using paper notebooks in the lab, this might be something to consider: moving all of your documentation onto the cloud, which has been really useful for people during the pandemic, especially as we keep pivoting to a more remote stance. It's also very useful because you can share all of your
documentation very easily with other lab members or your PI, and there is a lot of flexibility in how you organize the notebooks. We have custom support from LabArchives with our license, so if you email me I can set up a one-on-one consultation for you with them, and they will help you think about ideal ways to set up your notebook and how to get started. We also have an instructional license for LabArchives: if you are the instructor of a lab course in particular, it is being used by many of the lab courses at CMU now, it can be a really great way for students to document their experiments in lab courses, and it integrates with Canvas as well.

We offer a lot of workshops each semester on various topics related to open science, and we actually have a LabArchives workshop coming up on February 3rd, so if you are interested in learning more about how to get started with it, I encourage you to go to this workshop page and register. We offer workshops on everything from coding, to using the various tools I've talked about, to data visualization, publishing, and more. As I mentioned, we also run longer two-to-three-day workshops where we teach introductory coding in R and Python through an organization we have a membership with called The Carpentries, a nonprofit that teaches introductory coding skills to researchers. These workshops have been incredibly popular across campus; they're a really great way to start learning to code in a very supportive environment, and hopefully people leave with some ideas on how to continue that learning. It's not that you're going to leave an expert coder, but hopefully it gives you a little confidence to continue learning how to code.

Finally, another service I'd like to mention is support for grants: we can write letters of support for NIH grants that outline KiltHub as
an option for data sharing, along with the data curation support the KiltHub team can provide. We've now done this for a few of the faculty in psychology, so that's always an option as well.

And finally, I encourage you to get in touch with us if any questions arise during your research, or if down the road you think you want to get started with one of these platforms; please feel free to reach out. We offer consultations on any of these topics, so you can email us at this address or reach out to Huajin or me personally. We also have a newsletter that we send out roughly monthly with updates on new services, since we're continuously adding offerings to the program. And this is just a short survey that takes only a few minutes; it helps us understand what kinds of things you would like to learn about in the future.

So with that, that's the end of our presentation, and please feel free to ask any questions if any have come up. Thanks for attending, and like I said, just get in touch if anything comes up that we can help you with.