Okay, so we will get started with session four now. Our first speaker in session four is Dr. Kari Jordan. She is the Executive Director of The Carpentries, a nonprofit project that teaches foundational coding and data science skills. I'm thrilled to have Kari with us here today. CMU Libraries has been a member of The Carpentries for the last couple of years now, so we hold these workshops regularly and they're incredibly popular; we always have a very long waiting list. I'm just really thrilled to hear her talk about her work at The Carpentries. I will give it over to you, Kari.

Thanks, Melanie, and hello everyone. I'm Dr. Kari Jordan, Executive Director of The Carpentries, and I really am so happy to be here to address you today. The title of my talk is "I Want to Dance with Somebody: How Personal Values Drive Inclusion in Open Science Communities." Thank you so much for inviting me. It's my pleasure to share my thoughts and experiences with you in hopes that you'll leave empowered to advocate for inclusion. I want to begin with the end in mind and deliver a call to action to each and every one of you. Write this question down and consider it as you engage with this talk: how do your personal values align with your work in open science? Today I'm really going to open up and be transparent with you about a few things, including my thoughts on the open science space and feeling like an imposter, the role that research and data play in my life, and what I've learned about building personal values through my work at The Carpentries. We'll talk about meaningful equity and finally circle back to that call to action I gave you at the beginning. At some point I may burst into song, so just be prepared for that, and hopefully throughout the talk we'll have some fun as well.

So I thought about my life and where data played a significant role. I was born in Detroit, Michigan, in the US, what some would consider a big city, and I was born in the 1980s. During a time when Detroit was at or near the top in unemployment, poverty per capita, and infant mortality, my parents brought me into this unpredictable world. In the '80s, Detroit became notorious for crime and was repeatedly dubbed the arson capital of America, the murder capital of America, and the most dangerous city in America. But for me it was not all doom and gloom. I remember growing up in this house. I remember backyard barbecues. I remember slumber parties with my cousins. I remember all of my uncles living in our basement at one point or another. I remember making snow angels outside in the winter and jumping through the fire hydrant in the summer. I also remember learning that opening the fire hydrant was illegal, but I digress. I remember Christmas lights and Thanksgiving turkeys and the first time my mom added baby carrots to our spaghetti. That was a really interesting experiment. I remember loving this house. I remember loving my family. I remember loving living in Detroit. But I do remember some things on the not-so-positive side. I remember dropping to the floor as gunshots rang in the new year. I remember having our home broken into during Christmas and having all our Christmas gifts stolen. I remember my brother being arrested because he fit the description of an armed robber. I remember how difficult it was to plan celebrations with my parents because they divorced when I was three years old. All of these anecdotes are data points.
And if we were to trust this data, a Detroiter like me, born in the '80s, would presently live below the poverty line and work in either accommodation or food services. If we trust this data, a Detroiter like me, born in the '80s, would presently rely on federally funded programs to support themselves. But there was another plan for me, and that plan began with these two, Myling and Albert. My parents taught me to work diligently to get good grades. I played sports. I sang in the choir. I did everything that I needed to do to get into engineering school.

About a ten-hour drive north of Detroit, there's a small town in the Upper Peninsula called Houghton, Michigan, and that's where I went to school for my undergraduate and master's degrees. Michigan Technological University was a place I never thought I'd end up. I was one of the only women and people of color in most of my classes, but I was determined to do well, because I got accepted. I got accepted into engineering school. I was invited to the party. Now, after earning a bachelor's and master's degree in mechanical engineering, I decided to pursue my PhD. But I wanted to address the fact that there weren't enough people who looked like me in the field, so I switched my focus from mechanical engineering to engineering education. And that's how I learned about many of the techniques that The Carpentries uses to teach workshops and build community. I earned a master's in education and a PhD in engineering education from the Ohio State University.

But despite all of that, I still felt like an imposter. I remember the first time a colleague introduced me to someone looking to learn about open science, and in their introduction, they used the word "expert." I was totally thrown off by this because at the time I'd only been working in this space for about six months, and I could barely import a CSV into R. My research practices were horrible. I had no idea what a workflow was. I had never heard of the term reproducibility. And I was storing my data in multiple formats all over the place. And this person introduced me as an expert. What? No way. In my own words, imposter syndrome is the belief that your success is illegitimate and that at some point you'll be found out. I was working for an organization that develops and teaches lessons on the fundamental data skills needed to conduct research, but my own research practices had been horrible, and any moment now they were going to find out.

The day I realized that I wasn't an imposter in this space was the day that I received this note from a Carpentries community member. It reads: "Thank you for letting me help with the assessment process and driving our community forward with data." You see, what makes an expert isn't that the individual knows everything. Having comprehensive or authoritative knowledge is nothing if you aren't creating an environment where others feel comfortable contributing to the work. For this individual, I had done just that. I no longer feel like an imposter, and I have The Carpentries to thank for that.

The Carpentries' vision is to be the leading inclusive community teaching data and coding skills. Through our programs, we are working to dismantle the broken power structures and resource distribution that negatively impact marginalized communities around the world. We are empowering diverse groups of people to work with data and code, but these ideals didn't just fall out of the sky.
The Carpentries has built its foundation for global capacity building on values that have shaped, and will continue to shape, the way we grow inclusive computational communities. I'm passionate about this organization because those values align with my personal values. At The Carpentries, we build community through the lens of equity, inclusion, and accessibility. Those are terms that we throw around often, so I really want to dig into each one and share with you what I mean.

Equity is about creating opportunities for equal access to, and participation in, programs that are capable of closing participation gaps in your community. For example, this image illustrates the difference between equality and equity. Equality is about sameness. It promotes justice by giving everyone the same thing, but it can only work if everyone starts from the same place. In this example, equality only works if everyone is the same height. Equity is about fairness. It's about making sure people have access to the same opportunities, and sometimes our differences or our history can create barriers to participation. So we must first achieve equity before we can enjoy equality.

Now, inclusion is the active, intentional, and ongoing engagement of diverse people and communities. Advocating for inclusion increases awareness, content knowledge, and understanding of the ways that we interact within and change community. We've put so much attention on diversity in open science, but diversity does not equal inclusion, and let me share with you what I mean by that. One of my favorite quotes is by Verna Myers. She says, "Diversity is being invited to the party. Inclusion is being asked to dance." Now, how would you feel? You put your best foot forward. You worked so hard that there's no way you don't get invited to the party. You put your makeup on; guys, you got a nice haircut. You get invited to the party, but when you get there, you're standing in the corner the whole night because nobody asks you to dance. You know the song, right? "I wanna dance with somebody. I wanna feel the heat with somebody." I know you were singing along with me. Inclusion is more than inviting people who don't look like you to the conversation or to the workshop or to the conference. It's ensuring that when they get there, they're able to interact and contribute in ways that are meaningful to them. Diversity is situated around a deficit model: we need to get women, we need to get LGBTQ people, we need to include all of these people, right? But inclusion promotes an equity paradigm.

Now, accessibility refers to program and process design and implementation that offers multiple avenues for access and participation. In other words, accessibility is the usability of a product, service, environment, or facility by people with the widest range of abilities.

Now, let me take a pause here for a moment so that we can all do a pulse check. In the beginning, I asked a question: how do your personal values align with your work in open science? I shared my personal values through stories about my journey to data, and I shared anecdotally the values that The Carpentries was founded on and instills, in terms of equity, inclusion, and accessibility, and how they inform our decision making. Now it's time for you to reflect. Open science is better served by having diverse people with the skills to use data to address the questions that are important to them. In your role, your values inform how you do your work.
I encourage us all to work together to provide easily accessible resources for people who are unfamiliar with the tools and technology that you work with on a daily basis. How can we do that? What if there were greater diversity in the languages spoken where we teach and interact? How can we recognize and appreciate the different cultural norms that exist around data, programming, teaching, and volunteering in different regions? How can we recognize and value the various types of contributions that we see in open science? How can we work with existing organizations to reach broader communities rather than building or reinventing our own networks? How can we authentically work with broader communities rather than approach our work with a "we're doing it for them" mentality? We won't be able to answer these questions or solve these issues immediately, but I do want you to realize that your story, your values, and your contributions matter, and if we're going to drive inclusion in open science, we can only go further when we go together. Thank you so much, and I will answer any questions that you have.

Thank you, Kari. If anyone wants to throw up a hand in the participant panel, or you can send me a question in the chat, we'll just give it a minute to see what questions pop up. I think I missed my three-minute mark; I didn't give you a three-minute warning. Yeah, we've been doing okay on time today, so we're just being a little bit more informal about it than we thought we might have to be. Okay, I think I did see a hand, or I saw an applause symbol from Joe. Thanks, Joe. Okay, we do have a question from Carly. Carly, please feel free to unmute yourself and ask Kari directly.

Hi, Kari. I really enjoyed your talk, and I've been to Data Carpentry workshops before and have really enjoyed them. I was curious if you could discuss some of the barriers that are preventing people from getting into coding, especially ones that may differ across different populations.

Yes, absolutely. I think we're shifting away from a world where the scientists are over here doing science and the programmers are over here doing programming and eventually they'll come together. More and more we're seeing that you have to have at least some basic knowledge of programming in order to participate, whether it's in a postdoc or a job in industry. For novice learners coming in, the biggest barrier that we're seeing is language, and surprisingly that's true even in the States, where many people do speak English. The way that some of the documentation is written, or the textbooks around learning coding, doesn't really translate to how the individual has learned other subjects in other fields. There's a lot of jargon when you're first learning programming; it's as if you should already have prerequisite knowledge of all of this terminology, and that's what shies people away or turns them away from programming. They feel like they can't learn programming unless they already know all of this terminology, and that's just not true. One of the things that we try our best to do in The Carpentries is to teach the most useful thing first, so that you can build confidence.
If you've written a script to do one thing, you can write a script to do another thing. So we eliminate a lot of the terminology that kind of gets lost in the code and focus on "here is how you do the thing," without all of the complicated words that go along with it; of course, you can learn those things later. So I think one of the major barriers is language in general and the assumption that you need to know the terminology before even opening, say, a Jupyter notebook or the command line.

That's really interesting. So I work in a lab where there are a couple of people learning to code and English is not their native language. What could I do to help them in their learning journey?

Oh, I love that question. Yes. So actually, I have a resource that I want to share; when I finish talking, I'm going to put it in the chat for the moderators, and then you can share it. We actually just started a project called Glosario, and it is an open source repository of translations of terminology. The community has been contributing translations in German, French, Spanish, Korean, and a couple of other languages, so everyone can go into the Glosario repository, look at the definitions of the terms, and even suggest other translations of those definitions. We're using that in our workshops to reduce that barrier. So that's one resource that I definitely want to share with you, in case you want to pass it along or even contribute some language to it as well.

The next thing I would recommend, if there are language barriers, is practice problems. Having multiple practice problems, and contextualizing the problems for the individuals you're working with, helps a lot, as does having them speak back to you what they hear, because a lot of times when we teach things, we come from the perspective of the instructor. I talk a lot about this in training: we have teaching objectives, but we don't focus on learning objectives. A lot of times as instructors we say, "I need to cover all these things," but we don't consider what learners actually need to walk away with. If we think about it from that perspective, what do I want them to be able to walk away with, it should hopefully help us teach the most useful things first. And then have them repeat it back to you: "Okay, what I understand is that you would like for me to do this," or "this command does this." Have them talk it back to you to make sure that they have the understanding.

Thank you so much. Great, thank you so much for the question, Carly, and thank you, Kari, for the talk. I also just wanted to share another comment: Nia said that she doesn't have a question but loved the talk. I just want to pass that along. Thank you, thank you so much for having me. Yeah, I'm a huge fan of The Carpentries and really love their approach; I was very intimidated by programming for many years as a researcher, so it just really speaks to me. Okay, so we will move on, but we will have more time to ask Kari questions during the panel. So we'll queue up our next speaker. Our next speaker is Justin Kitzes. Hi, Justin. Hi. All right, great. Does that look good over there?
Yes, it does, and I'll do a brief intro for you and then I will let you take it over. Dr. Justin Kitzes is an assistant professor of biological sciences at the University of Pittsburgh. His research broadly examines how human alteration of natural habitat impacts species abundance and diversity at large spatial scales, and he also edited the book that it looks like he's going to talk about, The Practice of Reproducible Research. Thank you very much for joining us.

Sure. All right, thanks very much, Melanie. It's a pleasure to see you all today, virtually at least. As Melanie said, my name is Justin Kitzes; I'm just up the road here at the University of Pittsburgh. Our lab's research focuses on the intersection of ecology, conservation, and data science. We do a lot of work these days in particular with machine learning models for bioacoustic analysis in the field, and although that is our main area right now, I'm not going to talk about that today. Instead, I'm going to talk about a book called The Practice of Reproducible Research, which I edited with a few colleagues back when I was at the Berkeley Institute for Data Science, which Ciera Martinez, our next speaker, knows very well.

So my main goal today is to share some of the take-home messages from this book about the practical aspects of doing academic research in a reproducible manner, much of which, to be quite frank, was heavily inspired by the philosophies and work of The Carpentries. As I go, I'll also try to throw out some good quotes from some of the authors that we can argue about later on. This book was published by UC Press in 2017, and you can get a copy at all of the usual places, of course, but at the risk of losing your attention for the rest of the talk, there is a URL at which you can find the complete text of the book freely available online: that's "practice reproducible research," in the imperative command form, in the URL.

So what will you find in this book that we worked on? The initial idea was to collect a set of case studies in reproducible research, that is, very concrete examples of exactly how different researchers went about trying to conduct their research in a reproducible manner. We actually gave each contributor a template to use in writing these case studies so that they're very consistent across chapters, which helps with comparison. In each of these chapters, we asked each contributor to think about just the last project that they worked on, a paper, software product, or presentation, and describe the workflow that they used to do that work and the specific ways they tried to ensure that their work would be reproducible. And then, around the set of case studies, we wrote several chapters describing reproducibility in general, providing some basic practices, and summarizing the characteristics of the various case studies. As an aside, for those who might be interested, we wrote this entire book collaboratively using GitHub, and it was served using GitBook until very recently, when the interface changed.

But before I go any further into the book and the case studies, I want to step back and define exactly what we mean by reproducibility in the context of this particular book. Of course, there's no single definition of reproducibility, but we did have several contributors define the characteristics of reproducibility in an operational fashion for the purposes of the book.
For example, Philip Stark from UC Berkeley wrote the preface to the book, which he titled "Nullius in Verba," the motto of the Royal Society from 1660, which translates roughly to "take nobody's word for it." Phil drew a line from this basic principle of science, show me rather than trust me, to our sometimes failures to do this in the modern-day science enterprise. He said in his chapter, specifically: "I would argue that if a paper does not provide enough information to assess whether its results are correct, it is something other than science. Consequently, I think scientific journals and the peer review system must change radically: referees and editors should not bless work they cannot check because the authors did not provide enough information, and scientific journals should not publish such work. If researchers do not make their code available, there's little hope of ever knowing what was done to the data, much less assessing whether it was the right thing to do."

In the chapter after that, Rokem, Marwick, and Staneva follow up on this; they referenced another paper that defined reproducibility a little more technically, as the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. And then finally, in the introduction, which I wrote for the book, we used this very short and somewhat narrow operational definition with our contributors: a research project is computationally reproducible if a second investigator can recreate the final reported results of the project, including key quantitative findings, tables, and figures, given only a set of files and written instructions.

So from that basic idea, we went forward to collecting these case studies, and as I mentioned, the set of 31 case studies is at the heart of the book. The idea was to create a collection, a sort of library of examples, that readers could consult for examples and ideas that might be useful in their own work. And it turned out that there was a very general framework that crossed all of the submitted case studies, involving three stages: data input, data processing, and data analysis. Now, in retrospect this seems really obvious; I promise you it was not so easy to come up with inductively when we were looking at the set of case studies and trying to realize that they all shared this very similar structure.

As an example of what you might find in the book, I'll just describe the case study that I contributed, which was on analyzing the spatial distribution of bat species based on recordings of their calls, similar to research work that we still do today in our lab. As in all of the case studies, it starts with a figure and a description of the workflow that was used in the project: all of the pieces, basically, that have to be glued together in some particular way in order to make the research project happen. After that, every author gives a description of pain points, that is, the most difficult aspects of making their research reproducible. For example, in my case study I focused on this particular project's unavoidable need for two graphical closed-source programs at different points in the pipeline, which had to be integrated with all of the rest of the work we were doing, as well as on issues with dependency management when we tried to share the software with other people.
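To make that operational definition and the three-stage framework concrete, here is a minimal sketch, not taken from the book itself, of what "a set of files and written instructions" can look like in practice: one Python script whose three functions mirror the input, processing, and analysis stages. All file names and the toy analysis are hypothetical placeholders.

```python
# run_all.py -- a minimal sketch (not from the book) of the three-stage
# structure the case studies share: data input -> data processing ->
# data analysis. File names and the toy analysis are hypothetical.
import csv
import statistics
from pathlib import Path

RAW = Path("data/raw/observations.csv")      # hypothetical raw data
CLEAN = Path("data/clean/observations.csv")  # derived, regenerated each run
RESULTS = Path("results/summary.txt")        # the reported result

def load_raw(path):
    """Data input: read the raw CSV exactly as collected."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def process(rows):
    """Data processing: drop rows with missing counts, coerce types."""
    kept = [r for r in rows if (r.get("count") or "").strip()]
    for r in kept:
        r["count"] = int(r["count"])
    return kept

def analyze(rows):
    """Data analysis: compute the one number the paper reports."""
    return statistics.mean(r["count"] for r in rows)

def main():
    rows = process(load_raw(RAW))
    CLEAN.parent.mkdir(parents=True, exist_ok=True)
    with open(CLEAN, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    RESULTS.parent.mkdir(parents=True, exist_ok=True)
    RESULTS.write_text(f"mean count: {analyze(rows)}\n")

if __name__ == "__main__":
    main()
```

Under a structure like this, a second investigator needs only the raw data directory, the script, and the one-line instruction "run python run_all.py" to recreate the reported numbers.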
I'm actually going to come back to these pain points in more detail in just a moment. And then finally, at the end, every author was asked to talk about the benefits that they saw, why they bothered to do their research in a reproducible way, and to answer some questions such as: what does reproducibility mean to you, what are the main benefits, what are the challenges, and so on.

So when it came time to think about synthesizing some common themes or takeaway points from across the case studies, we realized pretty quickly that the commonalities in what was successful, or what tools were used, for example, weren't nearly as interesting as the common reports about what was challenging, or what actually did not work, when people tried to make reproducible workflows. In terms of what stops reproducibility, we had two chapters, by Huff and by Marwick, that categorized the challenges to reproducibility into three main categories.

The first of these categories is people. It's always people. Many contributors specifically highlighted skill and knowledge gaps among collaborative teams with varying levels of expertise in particular tools. We often think about coding in this respect, people with varying abilities to program, but it wasn't just coding that was raised; it was also things like platforms for collaborative manuscript editing, particularly the divide over using TeX or not, ability or familiarity with open software, and even knowledge of open science principles more generally. In that respect, here's a good quote from the Huff chapter: "The scientist unwilling to disenfranchise their collaborators could certainly elect to use more widely used tools; however, the price is often paid in reproducibility when those widely used, lowest-common-denominator tools conflict with reproducibility goals." That's especially the case with tools such as Microsoft Word, Excel, or MATLAB, and she goes on to describe other tools with only graphical interfaces as being a particular challenge.

That leads into the second main category of challenges that people raised, which is computers, seemingly the most obvious one. There were three specific areas that many people highlighted as problematic when trying to do reproducible research from this perspective. The first of these is dependency management, which is the art (I use that word purposefully) and science of getting openly shared code running successfully on another machine; the difficulty of this, of course, varies according to the complexity of the research pipeline and the particular platforms that are needed. The second is hardware availability, with resources like HPC specifically highlighted as being particularly hard to access outside of a sort of in-group of experienced users. And third, gaps in the pipeline between specific tools that are difficult to link together, particularly when some are proprietary or only available with graphical interfaces.
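On the dependency-management pain point, the most basic mitigation is simply recording the exact versions your pipeline ran against. As one hedged illustration, a few lines of Python can do this; the package list below is a hypothetical placeholder, and in practice many people instead run pip freeze or export a conda environment file.

```python
# record_env.py -- a sketch of capturing the exact package versions a
# pipeline ran against, so someone else can rebuild the environment.
# The package list is a hypothetical placeholder for whatever you import.
import platform
from importlib.metadata import version, PackageNotFoundError

PACKAGES = ["numpy", "pandas", "matplotlib"]

def main():
    lines = [f"# recorded under Python {platform.python_version()}"]
    for name in PACKAGES:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(f"# {name}: not installed here")
    with open("requirements.txt", "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    main()
```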
And then finally, the third major category of challenges to reproducibility came from institutions. The first of these is access restrictions on data: as we're all very aware, there is some data that, for legal or ethical reasons, simply cannot be openly shared in the original form that was used in the analysis. But the authors actually went beyond that challenge to recommend better standards for scrubbed and representational data to help work around it, so that even if you can't reproduce the exact results shown in a publication, you can at least examine the pipeline and see how the results were generated, even if you can't get exactly the same ones.

And then second, the topic that was the perennial discussion back when I was at the institute for data science was incentives. With some debate, to be fair, it was generally agreed by many of our contributors, and many of the collaborators on the book even more broadly, that reproducibility tends to be relatively mildly rewarded and irreproducibility tends to be relatively mildly punished. The trouble, of course, is that doing irreproducible science is easier, at least in the short term, in that moment; it's kind of the lazy way out. It requires less of you than going through the extra steps required to make your work reproducible, and it can be much faster, at least in the short term, to work that way. In a world that prioritizes fast and novel results that are rarely confirmed, this is a recipe for seeing a lot of irreproducible research. As Rokem and Marwick put it: "Traditional incentives in science prioritize highly cited publications of positive, novel, tidy results. The practice of enabling the reproducibility of those results to be assessed, by making the code and data publicly available, is not part of the traditional incentives of science." I think many of us recognize that.

The question then becomes what needs to happen, and I'll leave you with a thought, possibly for discussion, about the difference between carrots and sticks. Carrots in reproducibility are things like the idea that if you are open with your data or code, you'll have more citations: people might use your data, they might cite your papers more often. This is a good thing. Giving out badges, for example, this is a good thing; this is a carrot. This is in contrast to sticks. What are the sticks in reproducibility? Journal paper rejections, proposal rejections, and tenure rejections. And I will say that I believe personally, as I know many of my colleagues do, that we probably don't talk about sticks quite as much as maybe we should. I'll be quite honest, this has changed a fair amount since we started working on this book, probably six or seven years ago; I think this conversation has advanced quite far. But I would just put out there that it is a lot easier to talk about carrots than it is to talk about sticks, and it's worth discussing the necessity, to some degree, of having sticks to push people to work in this kind of way.

I'll just conclude here with one final mention. As I said, we wrote this book while at the Berkeley Institute for Data Science, which was part of a network of three data science institutes supported by the Moore and Sloan foundations a few years back. These institutes have given birth to a larger organization called the Academic Data Science Alliance, which is carrying on a lot of the work and ideas that came out of that early work at those three institutes. We just had an annual meeting last week, so you missed that, unfortunately, but I thought people in this community might be interested in keeping an eye on this website for special interest groups, phone calls, and other events related to the themes of today's conference. So I'll just say thanks, and I'm looking forward to any questions or discussion.

Great, thank you, Justin. We've had the topic of the lack of incentivization for reproducibility and data sharing come up a few times over the last couple of days; I think the sticks aspect of it is perhaps new to that conversation for this event.
Does anybody have any questions for Justin? You can send me a message in the chat or you can just raise your hand. I'm just scanning the participant window right now; we'll give this a minute to let people type stuff out. Oh, we have a question from Alex. You can go ahead and unmute yourself, Alex.

Yeah, as long as we're talking about sticks, I am curious to know where you think you would put the stick first, since, I mean, one of the difficulties with introducing disincentives, or sort of enforcement, into the system is how much technical difficulty there still is around reproducibility. You don't want to enforce on people something that they're really still struggling to do because the tools to do it aren't available. So where do you think would be the first place to put enforcement: journals, funding?

That's a great question, and I'm going to give you my answer almost by process of elimination: where is it not, where is it very hard to do? One of the hardest places to do it is at the journals, because it requires extra effort and knowledge from the reviewers, which I think is honestly a lot to ask. It's already hard enough to find reviewers, and to ask that we now find people who are going to try to re-execute all of the code needed to run an analysis is, I think, a tall order. So at some level I think that's a very difficult one. I also think promotion and tenure review is a difficult place, for exactly the reasons you raised: it's a very high-stakes decision, and it's hard to criticize people for something they never learned how to do, even if the rest of their scholarship is very good. So, by process of elimination, I would suggest considering the proposal review stage, because there's already a lot of time spent on the process of reviewing proposals: they are relatively deeply read, and funding is given, in some sense, to the best projects, where what counts as the best projects is at the discretion of the program officers and the reviewers. So if there were offices, for example at NSF, that decided that having even stronger reproducibility components than what is currently included in, for example, data management plans made yours the best proposal, that's an area that I think is maybe a little more conceivable than the other ones in terms of sticks. And again, since we started working on this book, things have already moved in that direction to some extent. Cool, thanks.

Great, thank you. Thank you for the question, Alex, and thanks, Justin, for the talk. I think at this point we will save any remaining questions for Justin for the panel and cue up our final speaker of the symposium, our final invited speaker. Our last speaker is Ciera Martinez. Ciera, if you are here, you can get your slides up on the screen. Okay, do you guys see my screen? Yes, and I'll just do a brief introduction and then hand it over to you.
Dr. Ciera Martinez is currently the research lead for biodiversity and environmental sciences at the Berkeley Institute for Data Science. Her research focuses on data-intensive research projects that aim to understand how life on this planet evolves in reaction to the environment and climate. She is a longtime open science advocate and has been involved in training for open data, education, and software. Thank you for joining us.

Thank you for having me, and thank you for inviting me to speak today with so many other inspiring open source advocates. Today I'm going to talk about some current work that really is the culmination of a lot of conversations I've had over the years about how we as scientists perform data-intensive research. There's a deep disconnect between how scientists perform data-intensive research projects, how we talk about it, and, further, how we teach it. So while this sounds like a really dry topic, what I'm going to talk about is really rooted in something exciting, and that's a desire to create faster, bigger, and more inclusive community-based research in science.

A bit of background about myself: I got my PhD in molecular biology, where I studied evolution. For my PhD I studied how DNA regulatory mechanisms shape plant architecture, and then for my postdoc I moved to flies, where I was interested in genome evolution. A lot of my work was within a smaller research group, and my workflow was developed alone. While performing this work, and I think this echoes what a lot of people here have said, I fell into data science and programming, and I really fell in love with it. It's not because of the programming itself; it's what I perceived as a way to create better science, and it improved my research questions. Data science tools allow an unprecedented amount of reproducibility, and when combined with open source and open science practices, like open data, open code, and open publishing, I saw a clear future of a scientific community where researchers could expand and combine their research like at no other time, and a future where we work in large teams to tackle the largest, most pressing research questions in science.

So that is what I'm doing now. As I was just introduced, I'm the research lead for biodiversity and environmental sciences at the Berkeley Institute for Data Science, or what we call BIDS. In this role, I pursue answering really big questions that require multiple teams of researchers, like: how does climate change affect life on this planet? These types of ambitious questions really rely on a bunch of multidisciplinary scientists getting all of their data together and integrating it across many, many scales.

So this brings me to the question: how do scientists work together? And the short answer is computational reproducibility. I have a deep fascination with this topic and an extensive background trying to teach computational reproducibility, including a Data Carpentry workshop that I helped create, and I really approach this with empathy in mind for the beginners who are trying to learn it. Again, I have to give credit to The Carpentries for that empathetic approach; I really borrowed it from learning and working with The Carpentries, so a nod to Kari, and it's really appropriate that we're in the same session. So essentially, computational reproducibility is a really dense term for just being super organized.
It's being organized enough that others can replicate your work; that's really how big team science can be performed. So why is teaching computational reproducibility so hard? I find it really difficult to create actual teaching material for computational reproducibility, and over the years I've learned a few lessons, which have kind of been repeated over these last two talks. One: working openly and with reproducibility in mind, and here I echo Justin, takes time, and there's little incentive within academia. Two: every research project is extremely unique, including the skills and tools that you'll use for it, but also the people you work with and their skills and tools, and the community standards are not well defined. Furthermore, which I didn't list here, a project evolves over time, so the standards evolve with the project. The third is that data analysis workflows get confounded with software development workflows, and these are very different workflows. Yes, there's lots of overlap between them, and we can borrow a lot from what's known about software development workflows, but in the terminology and jargon used to describe them we're using synonymous words, we're using analogous words, and it just gets really confusing really, really fast. So here I also echo what Kari was talking about.

In light of that, let me define how I'm going to use the words "workflow" and "pipeline," since the two are often used interchangeably. A pipeline is a series of processes in a data analysis; it's usually linear in nature and can be programmatically defined, and the key thing here is that it's automated by a computer, so steps can usually be described in relation to inputs and outputs. A workflow is a series of processes involving how humans navigate through this system of data analysis; it's a mixture of code, machine automation, documentation, and human intervention, and a lot of the time it is not linear in nature.
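A tiny, hypothetical Python sketch of that distinction: the pipeline below is the part a computer can run unattended, each step defined purely by its inputs and outputs, while the workflow is everything around it.

```python
# A toy, hypothetical pipeline: linear, automated, each step defined
# purely by its inputs and outputs, so a machine can run it unattended.
def ingest(path):
    """File path -> raw records (list of rows)."""
    with open(path) as f:
        return [line.strip().split(",") for line in f if line.strip()]

def transform(records):
    """Raw records -> numeric values (skip the header row; assumes the
    measurement sits in the second column)."""
    return [float(row[1]) for row in records[1:]]

def summarize(values):
    """Numeric values -> one result."""
    return sum(values) / len(values)

def run_pipeline(path):
    # The pipeline is this chain. The *workflow* is everything around it:
    # deciding which steps to write, documenting dead ends, reviewing
    # results with your team -- the human, nonlinear part that no single
    # function captures.
    return summarize(transform(ingest(path)))
```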
"So we do stuff, we learn stuff, but then we don't teach that stuff." This is a profound quote from Sara Stoudt, a colleague of mine at the Berkeley Institute for Data Science (well, she just moved to Smith College), and it's really at the heart of the problem. We all develop our workflows and analyze our data pretty much siloed in our own work; many times it's even siloed within the team that we work in. We'll talk to our team about the results of our research, but we really don't talk to them about the decisions we make when designing reproducible workflows. If we're not articulating what a workflow is and how to design workflows, to ourselves or to our small team, how are we going to teach the next generation of researchers?

These concerns were at the forefront of my mind when I started at BIDS as a postdoctoral fellow. Justin and I actually didn't overlap, but we've gotten to know each other over the years, and this is again a nod to Justin, because the book he described was written before my time but really informed the culture of BIDS. I had these thoughts in mind, and we had a lot of conversations about them at the Berkeley Institute for Data Science, so myself and two other researchers, Sara Stoudt and Váleri Vásquez, a statistician and an environmental biologist, decided to get together and go beyond defining a reproducible workflow, to creating a strategy for designing a workflow. We just released this as a preprint, "Principles for Data Analysis Workflows," and we just got really positive reviews back, so hopefully it's going to be published soon.

The goals of the paper are to describe how data analysis workflows are really performed; provide principles for designing your own workflow; tease apart the difference between software development and data analysis workflows, highlighting the useful principles that can be borrowed from software workflows; be tool- and language-agnostic; and define and put into context all this terminology we keep throwing around. I apologize that I won't go into definitions of all the terminology I'm using; please go to the paper, where we bold everything and have a whole terminology list to make it a little easier for beginners to approach this type of work.

So we developed the ERP data analysis workflow system. ERP stands for explore, refine, and polish, named after the three defined phases of a research project that we see during a workflow. This is the main figure of the paper, and I'm going to tease it apart a little bit so I can explain the research products. In the ERP system, we propose that a workflow can be defined and informed by a series of decisions, represented by this tree here. The standards of reproducibility, that's documentation, organization, and code structure, form a spectrum and become more stringent as the project evolves, and each decision can result in several outcomes throughout the whole lifetime of the project, including dead ends, interesting avenues that might be beyond the scope of your project, and the research products that I'm focusing on today. The key to the ERP workflow is that these design decisions, and the movement through the phases, are largely determined by two things: your immediate audience, who you're communicating your research to, and the research products that you are creating.

I'm going to quickly go through the phases here; again, go to the paper, where we describe them in more detail and provide questions to ask yourself while designing a workflow. The exploratory phase is the phase where everything's really messy: you're trying one data tool, then another, seeing if your data fits into this tool or that tool, and just being really creative in the process. This is also where you hit a lot of dead ends. The immediate audience here is yourself, or your future self, so you may not need to carefully document and make everything reproducible, because a lot of this material you're going to leave behind in the end. I'm not saying those pieces aren't important, but you don't have to be perfect in your organization and reproducibility that early on, when you're exploring. In the refinement phase, your audience is your small team; in academia this is usually your lab, or maybe another student you're working on the project with. You share data, the standards are what you agree on with your team, and this is where data management plans come into play. The last is the polishing phase, where your audience is your community. The research products of the polishing phase happen throughout the entirety of the project, in this L shape. Each community has different standards and norms, and you may be part of a lot of communities, but these communities are on your mind when you're polishing up research products for consumption by your wider community beyond your small team.
So the key point of the polishing phase is that research products can emerge at any time during the life cycle of a research project. We have to stop thinking that at the end we're going to produce one traditional research paper and that's when everything comes out; things that help others can emerge at any point in the life cycle of a research project. The decision to put effort into these research products can be driven by two motivations: getting credit for your work, and gaining skills for your next career stage. A third one that I didn't list here is supporting reproducible research, and that's this pink line that runs throughout; maybe you have a research product that supports reproducibility, and that makes it important. But really, as you're going through your project phases, ask yourself: what do I need to get to my next career stage? If you don't want to be a professor, maybe creating a package or a library is more important than putting it all into the research paper, or maybe creating a tutorial or building up your GitHub profile is going to be more valuable if you want to go to industry.

So I'm just going to quickly go through a few of the research products that can emerge. In the exploratory phase, the saying "one person's trash is another person's treasure" is really valuable. Don't just abandon things: if you've taken the time to make notes, wrap them up and put them into a blog post or a white paper. Dead ends, communicated correctly, can warn others about the perils of certain analyses, what tools worked and what tools didn't. Maybe you spent the time to learn a tool and you won't use it in the final research product, but get that knowledge out there; make it part of the documentation for that open source tool. Off-scope research results may be beyond the scope of the project you're working on, but save them and document them, and they can be used for grant proposals. And of course, clean data is valuable: publish it. What you learn can be so much help for others, so create tutorials. Anything you put work into throughout that whole life cycle, get credit for it, put it out there. You spent the time to learn it, you spent the time to make notes on it: get it out there and let it help you.

The refinement phase is with your team. Data management plans are now publishable products and can be used in grant proposals, and they can be reused in your team so that everyone has a tight data management plan that works for your small group. There are also analysis notebooks, and then small tools in the form of scripts. Say you create a really small tool that just converts one data type to another: package it up correctly, get it out there, get it on GitHub, and get credit for it even before the paper is published. Just do it, all the time.
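As a hedged illustration of that last point, here is what such a "small tool in the form of a script" might look like in Python: a hypothetical one-job converter with a real command-line interface, so it can be shared and reused rather than living as a loose snippet in a notebook.

```python
# csv2json.py -- a hypothetical example of the kind of one-job script
# worth packaging: one data format in, another out, with a real
# command-line interface so others can reuse and cite it.
import argparse
import csv
import json

def csv_to_json(in_path, out_path):
    """Convert a CSV file into a JSON list of records."""
    with open(in_path, newline="") as f:
        records = list(csv.DictReader(f))
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)

def main():
    parser = argparse.ArgumentParser(description="Convert CSV to JSON.")
    parser.add_argument("infile", help="input CSV file")
    parser.add_argument("outfile", help="output JSON file")
    args = parser.parse_args()
    csv_to_json(args.infile, args.outfile)

if __name__ == "__main__":
    main()
```

Usage would be something like "python csv2json.py samples.csv samples.json"; the argparse wrapper and docstring are most of what turns a throwaway snippet into a reusable, creditable product.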
So I just have a few main takeaways. A research project's success is more than the traditional peer-reviewed paper, and we have to get this message out; I'm sure it has been repeated throughout all of these sessions. Standards are defined by your community, so it's up to you and your community to define those standards. We need to be having a lot more conversations about what the standards are for our community, even what the standards are for our team. We need to stop just talking about results and talk about standards for reproducibility, standards for data, standards for sharing the data; we just need to be talking about it more. And then, alternative research products are more than just the products themselves: they act as a guide for your data analysis workflow, so they help you design how to be reproducible, and they allow you to get credit and gain the new skills that will help you in the next step of your career. So again, the most important research questions in modern science, I believe, currently lie with large research teams, and reproducible workflows are really the backbone of how researchers learn to work together. Therefore we need to better define and teach how to design them, to ourselves but also to our next generation of researchers.

I just want to acknowledge Sara Stoudt and Váleri Vásquez, who are the co-authors of this paper; Stuart Geiger, who led the Best Practices in Data Science working group at BIDS, which is a fantastic group where a lot of these conversations happened; Rebecca Barter, for her very thoughtful feedback and edits on the paper; and of course BIDS, the Berkeley Institute for Data Science, which brought all of us together. Feel free to contact me about any of this, I love talking about it, and we're currently revising the paper, so if anything in here was confusing, let me know. And that's it.

Great, thank you so much, Ciera, that was a very interesting talk. As a former researcher who switched careers to become a librarian, I just love the idea of getting credit for your work as you go through your research career and using that to better position yourself for potential career changes in the future. Does anybody have any questions for Ciera? You can raise your hand or throw them in the chat; we'll just give it a minute so people can type. Okay, we have a question from Brian; go ahead and unmute yourself.

Hi, Ciera, thanks for that talk, it was really helpful, and I felt like I could really relate to it. I kind of went through that process in grad school of taking time to document some of the specialized hardware that I had developed, and in some ways it paid off in totally unpredictable fashion. But I would say that it's sometimes really difficult to explain to faculty, especially established faculty, the reason for spending so much time on these things. So I wonder if you have any advice on how you best have conversations with mentors or advisors about making the time, and being encouraging of the time, for working on some of these documentation processes.

Yeah, that's a really good point, and it's what I hope changes the most, and kind of why we wrote this paper: mentors have to realize that these alternative research products are important to career development beyond someone wanting to be a professor. But yeah, they're paying you; they want you to get research done. So the main thing I can say is, if they really are opposed to you spending time on it, meet them halfway and take advantage of all the new open source community infrastructure that's building up behind this: publish the data management plan as a paper, publish the tool as a paper, get their name on these things that are now very easily publishable as papers. And then it's a back and forth. Your career is at the top of what needs to be done, whether you're going to be a professor or you're a grad student, and as for taking the extra time to document a tool well, some people won't want to spend the time doing that, and that's okay; they just want to get to the paper.
I mean, it's not okay for the reproducibility aspects, but if you want to do that and you want to gain those skills, you've got to fight to spend that time. I think research incentives and career options within research have now become such a big topic that hopefully you can point the mentor to it, or ask those questions as you're entering graduate school, while you're picking a PI. They need to value it; you have to be on the same page in your value systems. But yeah, it's a really tricky topic, and the main thing I hope changes is how students are mentored.

Great, thank you, Ciera. We had a comment from Alex that I just want to share: "The graphic is so great, I can't wait to show it to my department," about one of the graphics in the talk. And there's another question that I'm going to save for the panel, because I think it's something that Kari and Justin would want to weigh in on as well. So I will just ask Kari and Justin to rejoin us for our panel discussion. Normally, when this happens in the Mellon Library in Pittsburgh, we would have you come to the front and sit in the bar stools we have for our panel discussion, but here we're just on our virtual stage today. Okay, so I am going to let Ali unmute themselves and pose the question to the group.

Not going to get away from it, am I? Hi, Ciera. So my question was originally directed at Ciera, but it's actually a good question for the whole panel. For data professionals, for those of us who are in a research support role for a variety of different kinds of research, how would you suggest we go about helping the researchers who take one look at this stuff and freak out and go, "I have no idea where to even start"? In other words, the bigness of it is overwhelming to them, and if we could maybe give them a little bite-size piece, maybe it would help. But I'd love to hear your thoughts on where the entry points might be. Go, Justin.

I was going to say, in case Kari is too modest to say it: send them to a Carpentries workshop. That would be my first recommendation. It's a great on-ramp for people just at the beginning of the journey; I've seen that many times. From the reproducibility side, I would also just mention that there happens to be a chapter in the book which goes through the simplest possible reproducible workflow we could think of, so you could also have a glance at that. In fact, it was a lesson I wrote for a Carpentries workshop at one point, which later became a chapter in the book. So there you have it.

Thank you, Justin. It's true. One of the things that we teach really is to teach the most useful thing first, and I think it can be difficult to eat the entire elephant in one bite. We want to show people that if you learn how to do one thing, you can learn how to do the next. So I would try to identify a problem that they're trying to solve with their research and show them how to do one thing. For example, we even do this with our staff. We send lots of emails, and someone on our team showed us all how to write a very short script that would send a hundred emails at once, so that we wouldn't have to do the whole copy-paste, copy-paste. For a few individuals on the team who had never done programming before, that literally just opened their minds: wow, I can automate this, and I don't have to sit here for hours and work through everything. So I would say, ask them where their pain points are. What is the most irritating or most frustrating part of the work that you do? And see if there is a way you can show them how to do one thing, whether it's using programming or whatever that looks like, to show them that it's going to be useful. I think that's one of the things that frustrates people about programming, just like calculus used to frustrate people: when am I ever going to use this? But if you can show them something extremely useful, that could save them time, that could save them pain, I think that's where they would get excited about learning programming.
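For readers curious what that email anecdote might look like in practice, here is a minimal, hypothetical sketch in Python; the mail server, sender address, and recipients file are all placeholders, not details from the talk.

```python
# send_batch.py -- a minimal sketch of the kind of script described above:
# automating a hundred copy-paste emails. The mail server, sender address,
# and recipients file are hypothetical placeholders; a real server would
# likely also require authentication via server.login().
import csv
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.example.org"   # your institution's mail server
SENDER = "team@example.org"

def main():
    # recipients.csv has columns: email, name (hypothetical file)
    with open("recipients.csv", newline="") as f:
        recipients = list(csv.DictReader(f))
    with smtplib.SMTP(SMTP_HOST) as server:
        for person in recipients:
            msg = EmailMessage()
            msg["From"] = SENDER
            msg["To"] = person["email"]
            msg["Subject"] = "Workshop reminder"
            msg.set_content(f"Hi {person['name']},\n\nSee you at the workshop!")
            server.send_message(msg)

if __name__ == "__main__":
    main()
```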
Yeah, and to add on to that, I think it's all about pieces. Your goal should never be a one-hundred-percent-reproducible project; your goal should be to gain those reproducibility skills. Every time you learn a new strategy, you're getting ten percent closer, five percent closer. We need to rebrand what reproducibility is: it's never one hundred percent, no one ever achieves one hundred percent. You do something, and you should get the satisfaction that you are doing better every time you approach it.

Great, thank you. I love that point about it not being one hundred percent; any movement in that direction is really beneficial. We have a question from Alex that is mostly for Kari, so, Alex, if you'd like to go ahead and unmute yourself, I'll just let you ask that question yourself.

Sure; sorry, it took me a second to unmute. Yesterday, at the sister symposium on AI for data reuse, Ken Koedinger from the Open Learning Initiative gave this great talk on how they're using data to improve instructional design, and it struck me that The Carpentries have arrived at many insights similar to what emerged from their research, and that you're a data-friendly project. I wondered if you'd ever considered a collaboration with them, which could then benefit all the rest of us.

That would be amazing. We had not spoken with them about a collaboration, but as executive director, one thing that I try my best to do is not reinvent things or start things from scratch when I know that there are people out there who are doing it successfully already, or when we can come together and build something even stronger. So I would love it if you could give me the contact information, or even a link to their website, so that I could reach out, because we're often digging into the research to make sure that our instructor training curriculum is up to date and figuring out ways to improve our curriculum and our programming. Data-driven research is what we're all about, and I would love to reach out to them.

Yeah, and I think that they might also benefit from some of the more social insights that you're providing, and ways to think about incorporating that into their model as well. I'm not involved with their project; I'm just seeing the synergy between the two things and thinking that could be a really powerful collaboration. Thank you. Thank you; it sounds like we're meant to be. Yes, thank you for bringing that up, Alex. I've used OLI in my teaching at CMU, and like I said, I'm a huge fan of The Carpentries and host Carpentries workshops all the time, and there is indeed, I think, some good synergy there.

Do we have any other questions? Am I missing any? We have a few more minutes, so just throw a hand up. Did anybody else have anything else to add? Okay, it takes me a while to sift through all of this. Well, while we're waiting, I'm also going to put in... oh, do you have a question, Wajan?
Oh no, no, I got disconnected for a moment, so I wasn't sure if my message came through. I was just saying that I'm happy to connect you and Ken; I'm sure he'll be happy to collaborate. That's a great point from the audience.

Again, we have a few more minutes for questions, so while we're waiting here, I'm also just going to put in another plug for The Carpentries; it's come up a lot in this session. If you are in Pittsburgh and you are interested in getting involved with The Carpentries, we are always looking for helpers at our workshops, and that could be a great way for you to come check out a workshop and see what it's about. We also often have seats available for instructor training. So again, we always welcome people in the Pittsburgh community to join our Carpentries community; just get in touch if that sounds like it's of interest to you. We'll be holding them virtually this year, so that will be interesting. Okay, I am not seeing any more questions. If you do think of questions, you can feel free to put them in the Slack.

I just want to thank our speakers again from session four. Your talks went really well together, better than I could have predicted, even, so it was really great to have the three of you join us. Thank you. Thank you for the opportunity, and I agree, our talks flowed very, very well; we should create a series and go on the road. Yeah, thank you for having me, and great organizing to bring us all together.

Okay, so we are going to have our closing remarks start in just a few minutes, but before that, I just have a couple more announcements. I will direct you all to the Slack channel again; we will be putting a survey in there, and we really appreciate any feedback people have on either this event or the AIDR event yesterday. Also, earlier in the day, Sarah Keiden got a lot of questions that we didn't have time for; she was talking about openness in internet policy and building equity. She has gone into the Slack and answered all the questions we didn't get to, so thank you so much to her for doing that. You just have to scroll up in the OSS channel a bit; that thread was started around 1 p.m., but she has answers in there for those questions that we didn't get a chance to get to.